The decision tree algorithm can be said to be close to our lives. Through the following case, you will find that we use the decision tree algorithm intentionally or unintentionally every day (it looks amazing). Xiao Ming has to go to school every morning. You can walk, take the bus, and take the car of Uncle Wang next door. At this time, Xiao Ming began to make a decision: first look at the weather and choose to walk to school when it is not raining; when it rains, look at the next door Uncle Wang if he is free, and take Pharaoh’s car to school when he is free. Choose to go to school by bus. as the picture shows.

Case study

Through the above case, the decision tree can be defined: the above picture is the decision tree (um... a bit perfunctory). The decision tree is composed of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. The internal nodes represent a feature or attribute (weather, whether they are available), and the leaf nodes represent a class (walking, taking a bus, and taking a car next door).

So how to construct this tree through the decision tree algorithm? (Is it the hand of God?) The above case is relatively simple. We now present a very classic case. As shown in the figure, do we first make decisions based on weather, humidity or wind level? Entropy and information gain are to be proposed here.

Case study

1. let's look at what entropy is. Simply put, entropy describes the degree of confusion (or uncertainty) of things. For example: Chinese football enters the World Cup, the uncertainty may be 0, so the entropy may be 0; the uncertainty of the 6-sided dice is lower than that of the 12-sided dice, so the entropy will be lower. Now let’s look at the formula of entropy: H = -∑ni=1p(xi)log2p(xi) The entropy of the six dice: 1/6*log21/6, which is 2.585 and so on, the 12 sides The entropy of is: 3.585 Finally, we calculate the information entropy of this case: not playing ball is 5, playing ball is 9, so the entropy is calculated as:

-(5/14 * log(5/14,2) + 9/14 * log(9/14,2)) 0.940

Which feature should we use to divide the data set first? We have a principle, which is to make disordered data orderly. In other words, to make the entropy smaller, the smaller the better. The information gain is the change in entropy before and after dividing the data set. Here, the larger the information gain, the better. Let's take weather as an example to calculate the information gain after division:

- On sunny days: 2/5 to play, 3/5 not to play, the entropy is:

-(2/5 * log(2/5,2) + 3/5 * log(3/5,2)) 0.971

- Entropy in cloudy sky: 0
- Entropy in rainy days: 0.971

The probability of weather being sunny, cloudy and rainy is 5/14, 4/14 and 5/14, so the entropy after division is: 5/14 * 0.971+4/14 * 0+5/14 * 0.971 to get 0.693 , The information gain is 0.940-0.693, which is 0.247. Similarly, the information gain of other characteristics can be obtained. The weather information gain here is the largest, so it is chosen as the initial division basis. After selecting the weather as the first division basis, the division can be terminated if it can be classified correctly, and the information gain of the remaining features will continue to be calculated for those that cannot be classified correctly, and the previous operation will continue. The result is shown in the figure.

result

So the decision tree is a recursive algorithm, the pseudo code is as follows:

def createBranch(): Check whether the classification labels of all the data in the data set are the same: If so return class label Else: Find the best feature of the divided data set (after the division, the information entropy is the smallest, that is, the feature with the largest information gain) Divide the data set Create branch node for each divided subset Call the function createBranch (function to create a branch) and add the return result to the branch node return branch node

The data is used to judge whether it is a fish. There are two characteristics:

- Whether to survive in the water
- Whether there are flippers

The case here requires us to manually construct the data ourselves:

def creatDataSet(): dataSet = [ [1, 1,'yes'], [1, 1,'yes'], [1, 0,'no'], [0, 1,'no'], [0, 1,'no'] ] labels = ['no surfacing','flippers'] return dataSet, labels

Here dataSet is the data, and labels are the names of the two features.

Here we define a function to calculate the entropy of the data set:

from math import log def calcshannon(dataSet): num = len(dataSet) labelCounts = {} for featVec in dataSet: currentLabel = featVec[-1] if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0 labelCounts[currentLabel] += 1 shannon = 0.0 for key in labelCounts: prob = float(labelCounts[key])/num shannon -= prob * log(prob, 2) return shannon

This code is relatively simple, that is, the last column (that is, the classification label) is used to obtain the entropy of the incoming data.

1. set a function to divide the data set. The parameters are: the data to be divided, the divided feature and the returned feature value. This function will be called in the choose function to calculate the best divided feature.

def splitDataSet(dataSet, axis, value): retDataSet = [] for featVec in dataSet: if featVec[axis] == value: reduce = featVec[:axis] reduce.extend(featVec[axis+1:]) retDataSet.append(reduce) return retDataSet

def choose(dataSet): numfeature = len(dataSet[0])-1 baseEntropy = calcshannon(dataSet) bestinfogain = 0.0 bestfeature = -1 for i in range(numfeature): featlist = [example[i] for example in dataSet] vals = set(featlist) newEntropy = 0.0 for value in vals: subDataSet = splitDataSet(dataSet, i, value) prob = len(subDataSet)/float(len(dataSet)) newEntropy += prob*calcshannon(subDataSet) infoGain = baseEntropy-newEntropy if (infoGain> bestinfogain): bestinfogain = infoGain bestfeature = i return bestfeature

When all the features are used and the data cannot be divided thoroughly, majority voting is needed to determine the classification of the leaf nodes. The code is as follows, similar to the sorting in the KNN in the previous article.

import operator def majority(classList): classCount = {} for vote in classList: if vote not in classCount.keys(): classCount[vote] = 0 classCount[vote] += 1 sortedcount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) return sortedcount[0][0]

Finally, the code to create the tree:

def createTree(dataSet, labels): classList = [example[-1] for example in dataSet] if classList.count(classList[0]) == len(classList): return classList[0] if len(dataSet[0]) == 1: return majority(classList) bestFeat = choose(dataSet) bestFeatLabel = labels[bestFeat] myTree = {bestFeatLabel:{}} del (labels[bestFeat]) Vals = [example[bestFeat] for example in dataSet] uvals = set(Vals) for value in uvals: sublabels = labels[:] myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), sublabels) return myTree

There are two conditions for terminating recursion: one is that all categories can be divided correctly, and the other is that the feature is used.

result

- Advantages: easy to understand
- Disadvantages: easy to overfit

Reference: https://cloud.tencent.com/developer/article/1155550 Decision Tree for Machine Learning Practice-Cloud + Community-Tencent Cloud