The decision tree algorithm can be said to be close to our lives. Through the following cases, you will find that we are using the decision tree algorithm intentionally or unintentionally every day (it looks amazing).

Xiao Ming has to go to school every morning. You can walk, take the bus, and take the car of Uncle Wang next door. At this time, Xiao Ming began to make a decision: first look at the weather and choose to walk to school when it is not raining; when it rains, look at whether the next door Uncle Wang is free, and take Pharaoh’s car to school when he is free Choose to go to school by bus. as the picture shows.

Through the above case, the decision tree can be defined: the above picture is the decision tree (um... a bit perfunctory). The decision tree is composed of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent a feature or attribute (weather, availability), and the leaf nodes represent a class (walking, taking the bus, and taking the car of the next door Uncle Wang).

So how to construct this tree through the decision tree algorithm? (Is it the hand of God?) The above case is relatively simple. We now propose a very classic case. As shown in the figure, do we first make decisions based on weather, humidity or wind level? Entropy and information gain are to be proposed here.

1. let's look at what entropy is. Simply put, entropy describes the degree of confusion (or uncertainty) of things. For example: Chinese football enters the World Cup, the uncertainty may be 0, so the entropy may be 0; the uncertainty of the 6-sided dice is lower than that of the 12-sided dice, so the entropy will be lower. Now let’s look at the formula of entropy: H = -∑<sup>n</sup><sub>i=1</sub>p(xi)log<sub>2</sub>p(xi)

The entropy of the six dice: 1/6*log<sub>2</sub> 6 times of 1/6, which is 2.585

By analogy, the entropy of the 12 faces is: 3.585

Finally, we calculate the information entropy of this case: not playing ball is 5, playing ball is 9, so the entropy is calculated as:

-(5/14 * log(5/14,2) + 9/14 * log(9/14,2)) 0.940

According to which feature should we divide the data set first? We have a principle, which is to make disordered data into order, in other words, to make the entropy smaller, the smaller the better. The information gain is the change in entropy before and after dividing the data set. Here, the larger the information gain, the better.

Let's take the weather as an example to calculate the information gain after division:

- On sunny days: 2/5 to play, 3/5 not to play, the entropy is: -(2/5 * log(2/5,2) + 3/5 * log(3/5,2)) 0.971def createBranch() : Check whether the classification labels of all the data in the data set are the same: If so return class label Else: Find the best feature of the divided data set (after division, the information entropy is the smallest, which is the feature with the largest information gain). Each divided subset calls the function createBranch (function to create a branch) and adds the return result to the branch node. return branch node
- Entropy in cloudy sky: 0
- Entropy in rainy days: 0.971
The probability that the weather is sunny, cloudy and rainy is 5/14, 4/14 and 5/14, so the entropy after division is: 5/14
the weather *0.971+4/14*0+5/14 * 0.971 to get 0.693, information gain It is 0.940-0.693 to 0.247. Similarly, the information gain of other features can be obtained. The weather information gain here is the largest, so it is chosen as the initial division basis. After selecting the weather as the first division basis, the division can be terminated if it can be classified correctly, and the information gain of the remaining features will continue to be calculated if it cannot be classified correctly, and the previous operation will continue, and the result is shown in the figure.Pseudo code So the decision tree is a recursive algorithm, the pseudo code is as follows:result

The data is for judging whether it is a fish, there are two characteristics:

- Whether to survive in the water def creatDataSet(): dataSet = [[1, 1,'yes'], [1, 1,'yes'], [1, 0,'no'], [0, 1,'no' ], [0, 1,'no']] labels = ['no surfacing','flippers'] return dataSet, labels where dataSet is data, and labels are the names of two features. Calculating the entropy Here we define a function to calculate the entropy of the data set: from math import log def calcshannon(dataSet): num = len(dataSet) labelCounts = {} for featVec in dataSet: currentLabel = featVec[-1] if currentLabel not in labelCounts .keys(): labelCounts[currentLabel] = 0 labelCounts[currentLabel] += 1 shannon = 0.0 for key in labelCounts: prob = float(labelCounts[key])/num shannon -= prob * log(prob, 2) The code of return shannon is relatively simple, that is, the last column (that is, the classification label) is used to obtain the entropy of the incoming data. To divide the data set, first set up a function to divide the data set. The parameters are: the data to be divided, the divided feature and the returned feature value. This function will be called in the choose function to calculate the best divided feature. def splitDataSet(dataSet, axis, value): retDataSet = [] for featVec in dataSet: if featVec[axis] == value: reduce = featVec[:axis] reduce.extend(featVec[axis+1:]) retDataSet.append (reduce) return retDataSet
- Whether there are flippers
Here we need to manually construct the data ourselves:
Case study

def choose(dataSet): numfeature = len(dataSet[0])-1 baseEntropy = calcshannon(dataSet) bestinfogain = 0.0 bestfeature = -1 for i in range(numfeature): featlist = [example[i] for example in dataSet] vals = set(featlist) newEntropy = 0.0 for value in vals: subDataSet = splitDataSet(dataSet, i, value) prob = len(subDataSet)/float(len(dataSet)) newEntropy += prob*calcshannon(subDataSet) infoGain = baseEntropy-newEntropy if (infoGain> bestinfogain): bestinfogain = infoGain bestfeature = i return bestfeature

When all the features are used and the data cannot be divided thoroughly, majority voting is needed to determine the classification of the leaf nodes. The code is as follows, similar to the sorting in the KNN in the previous article.

import operator def majority(classList): classCount = {} for vote in classList: if vote not in classCount.keys(): classCount[vote] = 0 classCount[vote] += 1 sortedcount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) return sortedcount[0][0]

Finally, the code to create the tree:

def createTree(dataSet, labels): classList = [example[-1] for example in dataSet] if classList.count(classList[0]) == len(classList): return classList[0] if len(dataSet[0]) == 1: return majority(classList) bestFeat = choose(dataSet) bestFeatLabel = labels[bestFeat] myTree = {bestFeatLabel:{}} del (labels[bestFeat]) Vals = [example[bestFeat] for example in dataSet] uvals = set(Vals) for value in uvals: sublabels = labels[:] myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), sublabels) return myTree

There are two conditions for terminating recursion: one is that all categories can be divided correctly, and the other is that the feature is used.

- Advantages: easy to understand
- Disadvantages: easy to overfit

Reference: https://cloud.tencent.com/developer/article/1151558 Decision Tree for Machine Learning Practice-Cloud + Community-Tencent Cloud