Implementing a random forest in Python to predict broadband customer churn (source data and code attached)

Preface

Ensemble methods, also called ensemble learning, combine multiple models into one. In the financial industry and other non-image-recognition fields, their results are sometimes even better than those of deep learning. Understanding the basic principles and applying the code to a real business case is the goal of this article. It explains in detail how to use Python to implement random forest, a classic ensemble learning method, to predict broadband customer churn, and it is divided into two parts:

  • A detailed introduction to the principles
  • Hands-on Python code

Ensemble learning

The protagonist of this article is the random forest, so we start from bagging, the branch of ensemble learning to which random forests belong, and walk through the principles and steps of this ensemble method in plain terms. The bagging workflow is as follows:

At first glance the steps in the figure may look a bit complicated, so let's break them down one by one. The word "bagging" itself captures the essence: as the name suggests, several models are put into the same bag, and the bag is then used as a new model to make predictions, nothing more. In other words, multiple models are combined into one larger model; the final prediction of this large model is determined jointly by the small models inside it, and the decision rule is that the minority obeys the majority.

Suppose there are 100,000 records of original data and we use them to build ten decision trees (other models would work too); these 10 trees are then packed into the same bag. Now take one record and put it into the bag, and you will get 10 predicted values (one per tree). If three of the trees predict 0 and the remaining seven predict 1, the bag's predicted probability that this record's label is 0 is 3/10.
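To make the voting concrete, here is a minimal sketch (not from the article's code) of the majority vote just described, assuming ten trees have each produced a 0/1 prediction for a single record:

import numpy as np

votes = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])    # 3 trees predict 0, 7 trees predict 1
print('P(label = 0) =', np.mean(votes == 0))         # 0.3
print('P(label = 1) =', np.mean(votes == 1))         # 0.7
print('majority vote:', np.bincount(votes).argmax())  # 1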

To build a deeper understanding of bagging, three common questions about it are answered below:

Q: What is an appropriate sample size for each model in the bag?

A: Continuing the example above, there are ten trees in the bag and 100,000 records in total. The minimum sample size per tree is 10,000 (100,000 / 10 trees = 10,000 per tree), because at the very least no sample should be wasted. What about the maximum? In practice, when the total sample size is known and there are n models in the bag, the sampling fraction per model is usually chosen somewhere between 1/n and 0.8. Giving every small model 100% of the samples is pointless: it is the same as not sampling at all, and bagging is no longer reflected. Only when each model is trained on different data does combining their votes (predictions) become meaningful.

Q: Does correlation between the models in the bag affect the final result?

A: Two points matter most in bagging: first, the models in the bag should be as uncorrelated as possible, and this lack of correlation mainly comes from training each model on different samples; second, the more accurate each individual model is, the better, so that its vote is more valuable.

PS: Training the models on different samples can be understood through a presidential-election analogy. Ten groups of voters are chosen to vote, and the more these ten groups differ from one another, the better: a candidate who stands out even among very different groups of voters has truly demonstrated strength. If the ten groups differ very little, for example they are all already biased toward the same candidate, the voting result is far less convincing.

Q: Does "high accuracy" above mean the model can be arbitrarily complex? What if each model is accurate but overfitted?

A: In bagging, the more accurate each model is, the better, even if it overfits. A model should be as accurate as possible on its training set, and accuracy is largely proportional to model complexity, so overfitting is normal and forgivable here. Complexity and overfitting apply only to the individual models in the bag; because their predictions are combined (voted on) at the end, the bag as a whole does not overfit.
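The following sketch (assumed toy data, not the article's data set) illustrates this claim: a single deep tree fits the training set almost perfectly but typically drops on the test set, while a bag of such trees usually holds up better.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# one intentionally overfitted tree vs. a bag of ten such trees
tree_clf = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr)
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=None),
                            n_estimators=10, random_state=0).fit(X_tr, y_tr)

print('single tree  train/test accuracy:', tree_clf.score(X_tr, y_tr), tree_clf.score(X_te, y_te))
print('bag of trees train/test accuracy:', bag_clf.score(X_tr, y_tr), bag_clf.score(X_te, y_te))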

Random forest

The implementation steps of random forest are as follows:

This article addresses the following questions about the random forest algorithm:

Q: Why do we also sample randomly over the columns?

A: Before introducing one of the author's favourite analogies, let's look at a real business scenario from a city commercial bank. We have a large spreadsheet of historical data with about 50 variables (more than 50 columns). The variables come from several different institutions, such as the People's Bank and telecom operators (the same customer appears across these institutions), and the goal is to predict whether the customer will default. The spreadsheet is laid out as follows:

Basic business knowledge tells us that bank-related data usually contains many missing values. Taking the figure above as an example, normally only the column of the variable to be predicted is complete (after all, whether a customer defaulted is easy to find in the historical records), while the parts marked by the blue box and the green box often have more missing values, and the missingness is quite random. Just how random is shown in the figure below:

The red boxes indicate missing data, and only part of the rows and columns is shown here. If the table is 40,000 rows by 50 columns, imagine how randomly the missing data is scattered. How to make full use of such incomplete data therefore becomes the key problem, and this is where the vivid "island-lake-coconut tree" analogy comes in:

  • The entire table is viewed as a huge island; the island's length and width correspond to the horizontal and vertical extent of the spreadsheet.
  • The missing segments of the table are viewed as small lakes scattered at random, and the cells that do contain data are viewed as land.
  • Enormous value (the information in the data) is buried beneath the island. We extract it by planting trees at random spots (random sampling over both rows and columns, as in bagging) to absorb nutrients from the ground; trees cannot be planted on lakes, so as long as the planting is random enough, the land will eventually be used in full.

Because both rows and columns are sampled at random, the whole data table can genuinely be split into many random pieces, one per model (a small sketch of this row-and-column sampling follows below). As long as there are enough models, some of them will always be able to extract the maximum value from the data set, which also helps when the categorical target variable is extremely unbalanced. As for how the information from the planted trees is collected: place a collector near the trees standing on land and aggregate the nutrients they draw from the ground into one layer; the collectors are then aggregated into a higher-level collector, and so on, until the information from the many islands across the whole data ocean has been summarised. (This layered aggregation is, for example, the idea behind the distributed deep random forest algorithm used in cooperation with Ant Financial to detect cash-out fraud.)
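As a hedged illustration of the "plant trees at random spots" idea, the sketch below (a hypothetical helper, not part of the article's code) draws the random patch of rows and columns that one tree would be trained on. Note that in scikit-learn the same effect is obtained through bootstrap sampling / max_samples for rows and max_features (applied at each split) for columns, not by manual slicing like this.

import pandas as pd

def random_patch(df: pd.DataFrame, row_frac: float = 0.6,
                 col_frac: float = 0.5, seed: int = 0) -> pd.DataFrame:
    """Return a random subset of rows and columns: one tree's patch of 'land'."""
    return (df.sample(frac=row_frac, random_state=seed)            # random rows
              .sample(frac=col_frac, axis=1, random_state=seed))   # random columns

# usage: patch = random_patch(df)  # the data one "tree" would see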

Everything after the first step of the random forest can follow exactly the steps described above in the ensemble learning / bagging section.

Q: Since the prediction results given by each model are combined at the end, what weight does each decision tree carry in a random forest?

A: Every decision tree in a random forest carries the same weight. If there are 10 decision trees (or other models) in the bag, each tree's prediction is weighted 1/10; this is a defining characteristic of random forests. When the weights are not equal, we are in the boosting branch of ensemble learning, such as AdaBoost, which will be covered in a later post.
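A small self-contained check of this equal-weight behaviour in scikit-learn (toy data assumed, not the article's): the forest's predicted probability is simply the unweighted average of the per-tree probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# average the 10 trees' probabilities with equal 1/10 weight each
per_tree = np.array([t.predict_proba(X) for t in rf.estimators_])  # shape (10, 500, 2)
print(np.allclose(per_tree.mean(axis=0), rf.predict_proba(X)))     # True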

Q: In bagging, is it always better to have more models in the bag? And should each model be trained on as large a fraction of the source data as possible?

A: More models in the bag generally help, and so does a larger training fraction per model, but neither rule holds without limit. The choice must also take into account the characteristics of the data set and some deeper knowledge of the model algorithms involved.

The advantages of the bagging method are as follows:

  • Accuracy is significantly higher than that of any single classifier in the ensemble
  • It still performs reasonably well on noisy data, i.e. it is robust
  • It is not prone to overfitting

Advantages of the random forest algorithm:

  • Accuracy can sometimes rival that of neural networks and is higher than that of logistic regression
  • It is more robust to errors and outliers
  • The tendency of a single decision tree to overfit is weakened as the forest grows
  • It is fast (parallelisable / distributed) and performs well on big data

Python in practice

Data exploration

The goal of this hands-on section is to demonstrate how to use and tune a random forest. Since ensemble learning, like a neural network, is a black-box model with poor interpretability, there is no need to dig into the specific meaning of every variable in the data set. We only need to focus on the last variable, broadband, and try to predict as accurately as possible whether a broadband customer will churn, using variables such as age, tenure, payment status, traffic volume and call behaviour.

import pandas as pd
import numpy as np

df = pd.read_csv('broadband.csv') # Broadband customer data
df.head(); df.info()

Variable description

This code file only demonstrates how to use and tune a random forest, so among the data columns we only need to care about the last one, broadband (0 = churned, 1 = retained); the meanings of the other independent variables are not explored, since the data sets in real work are different anyway. First, convert all column names to lowercase:

df.rename(str.lower, axis='columns', inplace=True)

Now check the distribution of the dependent variable broadband to see whether it is imbalanced:

from collections import Counter
print('Broadband:', Counter(df['broadband'])) 
## Broadband: Counter({0: 908, 1: 206}) - fairly imbalanced.
## As noted in the principles section, random forest is a strong tool for imbalanced data.

Then split the training and test sets; the customer id cust_id is useless, so drop it:

y = df['broadband'] 
X = df.iloc[:, 1:-1] 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                    test_size=0.4, random_state=12345)

Decision tree modeling

We first build a full decision tree model to compare against the random forest:

import sklearn.tree as tree

# Use cross-validated grid search to tune the decision tree model while training it
from sklearn.model_selection import GridSearchCV
# Grid search parameters: the usual decision tree hyperparameters -
# split criterion, tree depth, and minimum samples required to split a node
param_grid = {'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 4, 5, 6, 7, 8],
              'min_samples_split': [4, 8, 12, 16, 20, 24, 28]}
                # generally speaking, a tree more than a dozen levels deep is already quite deep

clf = tree.DecisionTreeClassifier()  # define a tree
clfcv = GridSearchCV(estimator=clf, param_grid=param_grid,
                     scoring='roc_auc', cv=4)
        # pass in the model, the grid search parameters, the evaluation metric and the number of CV folds
      ## this only defines the search; the model has not been trained yet

clfcv.fit(X=X_train, y=y_train)

# Use the model to make predictions on the test set
test_est = clfcv.predict(X_test)

# Model evaluation
import sklearn.metrics as metrics

print("Decision tree accuracy:")
print(metrics.classification_report(y_test,test_est)) 
        # the report table by itself is not very informative here
print("Decision Tree AUC:")
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est)
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))

An AUC above 0.5 is only the most basic requirement, and it is clear that this model's performance is still fairly poor. We will not expand much on decision tree tuning here; it is demonstrated in the random forest tuning section below.
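One optional refinement (not in the original code): computing the ROC from predicted probabilities rather than hard 0/1 labels usually gives a smoother and more informative curve. A minimal sketch, reusing the fitted clfcv from above:

test_proba = clfcv.predict_proba(X_test)[:, 1]      # probability of class 1
fpr_p, tpr_p, _ = metrics.roc_curve(y_test, test_proba)
print('AUC (from probabilities) = %.4f' % metrics.auc(fpr_p, tpr_p))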

Random forest modeling

Random forest modeling also uses grid search. For detailed explanations of the parameters used when building a random forest in Python, see the code comments.

param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [5, 6, 7, 8],         # depth of each decision tree in the forest
    'n_estimators': [11, 13, 15],      # number of decision trees - a random-forest-specific parameter
    'max_features': [0.3, 0.4, 0.5],
     # fraction of features each decision tree may use - a random-forest-specific parameter (see the principles section)
    'min_samples_split': [4, 8, 12, 16]  # minimum number of samples required to split a node
}

import sklearn.ensemble as ensemble  # ensemble learning module

rfc = ensemble.RandomForestClassifier()
rfc_cv = GridSearchCV(estimator=rfc, param_grid=param_grid,
                      scoring='roc_auc', cv=4)
rfc_cv.fit(X_train, y_train)

# Use the random forest to predict on the test set
test_est = rfc_cv.predict(X_test)
print('Random forest accuracy...')
print(metrics.classification_report(y_test, test_est))  # true labels first, then predictions
print('Random Forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est)
     # construct the ROC curve (true labels first, then predictions)
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))

It can be seen that the accuracy of the model is greatly improved

Why print the best parameters found by the grid search? The purpose is to check whether any parameter of the model sits on the boundary of its search range; in short, we do not want the search boundaries to limit the model's performance. (Model complexity can usually be set aside at this point.)
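A short sketch of this check (names follow the code above): print the best parameter combination found by the grid search and compare each value against the edges of the ranges defined in param_grid.

print(rfc_cv.best_params_)  # best parameter combination found by the grid search
print(rfc_cv.best_score_)   # best cross-validated roc_auc score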

It is not hard to see that the range settings of the three parameters max_depth, min_samples_split and n_estimators may be limiting the model's accuracy, so they need to be adjusted:

"""
{'criterion':'gini',
 'max_depth': 8, on the boundary of the maximum value, so the maximum value range of this parameter should be increased
 'max_features': 0.5, which is also on the boundary of the maximum value, indicating that the minimum range of this parameter should be increased
 'min_samples_split': 4, in the same way, on the minimum boundary, consider reducing the range
 'n_estimators': 15 In the same way, on the maximum boundary, the range can be adjusted appropriately
 """
# Adjusted search ranges
param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_depth': [7, 8, 10, 12],
    # the earlier 5 and 6 can be dropped, since they are no longer useful
    'n_estimators': [11, 13, 15, 17, 19],   # number of decision trees - random-forest-specific parameter
    'max_features': [0.4, 0.5, 0.6, 0.7],
    # fraction of features each decision tree may use - random-forest-specific parameter
    'min_samples_split': [2, 3, 4, 8, 12, 16]  # minimum number of samples required to split a node
}

Now let's look at the result of modeling again:
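A hedged sketch of this re-run, reusing the names from the earlier code: fit a new grid search over the widened param_grid and evaluate on the test set in the same way as before.

rfc_cv = GridSearchCV(estimator=ensemble.RandomForestClassifier(),
                      param_grid=param_grid, scoring='roc_auc', cv=4)
rfc_cv.fit(X_train, y_train)
print(rfc_cv.best_params_)   # check whether any value still sits on a range boundary

test_est = rfc_cv.predict(X_test)
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est)
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))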

This time all parameters fall inside their search boundaries. In practice, though, parameter tuning is a craft in itself and is not guided by the single criterion of the search-range boundary alone. Follow-up posts will cover it step by step.

Summary

Finally, to summarise: random forest is a very classic ensemble learning method, with simple underlying principles and an elegant implementation, and it can be applied as soon as you have learned it. Random forests are also widely used and by no means limited to the familiar financial field: whenever the data is imbalanced or has many randomly missing values, it is worth a try. If you are interested in the data and code used in this article, you can get them via private message. New posts go up at a fixed time every day; see you in the next case study.

Follow-up posts will continue to cover Python practice in common scenarios.

Reference: https://cloud.tencent.com/developer/article/1672591 (Python implements random forest to predict broadband customer churn, with source data and code, Tencent Cloud Developer Community)