❈ Author: Wang Yong. Current interests: project business analysis, Python, machine learning, Kaggle. 17 years in project management: 11 years in the communications industry as a project manager in charge of contract delivery, and 6 years in the manufacturing industry in project management (PMO, transformation, production transfer, liquidation and asset handling). MBA, PMI-PBA, PMP. ❈
I have participated in two Kaggle competitions: Titanic (classification) and House Price (regression). I placed in the top 7% (after about 3 months of spare time) and the top 13% (after about 2 months of spare time) respectively. Since I had only been learning machine learning for 5 months, I spent a lot of that time on repetitive and useless work.
The purpose of this article is mainly to share and discuss:
1. The building block method that I have worked out (that is, using Pandas' pipe and Sklearn's Pipeline).
2. My own understanding of common practices in feature engineering (for example, why to apply a log transform, normalization, etc.).
3. The problems I have encountered (overfitting caused by feature engineering), on which I hope experts can offer advice.
1. First of all, important things get said three times.
Feature Engineering, Feature Engineering, Feature Engineering
The purpose of machine learning is to use an algorithm to train a model on known data (X (features) and Y (labels)), and then to use this model to predict the result (label) for new data.
The features in both the known data and the new data must first be processed by feature engineering before they can be used to train the model or to make predictions.
Data processed with different feature engineering methods yields different models in training, different tuning results, and even more different predictions. This is why, in machine learning, feature engineering often takes about 80% of the time, while model training takes about 20%.
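To make this concrete, here is a minimal sketch on synthetic data (not competition data). The same straight-line fit is run twice, once on a skewed raw feature and once on its log1p transform. The helper name `linear_r2` and all numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, right-skewed feature (for example an area-like quantity).
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
# Assumed relationship: the target is linear in log1p(x), plus noise.
y = 3.0 * np.log1p(x) + rng.normal(scale=0.2, size=500)


def linear_r2(feature, target):
    """R2 of a straight-line least squares fit of target on feature."""
    slope, intercept = np.polyfit(feature, target, deg=1)
    fitted = slope * feature + intercept
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


r2_raw = linear_r2(x, y)            # same model, raw feature
r2_log = linear_r2(np.log1p(x), y)  # same model, log-transformed feature
print(f"raw feature R2: {r2_raw:.3f}, log1p feature R2: {r2_log:.3f}")
```

The model and the target are identical in both runs; only the feature engineering differs, and that alone changes how well the model fits.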
In my first competition, Titanic, I spent a lot of time learning and testing various tuning and ensemble methods. I tried the same strategy in the House Price competition, but the results were not very good. The different steps often influenced each other, and at times it felt as if machine learning were a kind of mysticism.
After re-examining my approach, I divided the entire House Price machine learning workflow into two major steps:
1. Feature engineering (using only libraries such as pandas, statsmodels, SciPy, NumPy and seaborn)
1.1 Input: the original Train and Test data sets, merged into one data set named combined
1.2 Processing: Pandas Pipe
Define a function for each feature engineering method and each variant worth trying (each takes combined as input and returns pre_combined).
Use Pandas' pipe to connect these functions together like building blocks, and use a list to store the functions in order.
For example: pipe_basic = [pipe_basic_fillna,pipe_fillna_ascat,pipe_bypass,pipe_bypass,pipe_log_getdummies,pipe_export,pipe_r2test]
The functions in this list do the following: 1. basic filling of missing values; 2. converting data types; 3. a blank function (it does nothing and is there only to keep pipes aligned and tidy); 4. log transform and dummy encoding of categorical data; 5. export to an HDF5 file; 6. check the R2 value.
By permuting and combining these functions, or by varying their parameters, a large number of pipes can be generated, and each pipe produces one preprocessed file.
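A minimal sketch of the building block idea, assuming each block is a function that takes a DataFrame and returns a DataFrame. The function names are taken from the example list above, but their bodies, the helper `run_pipe`, and the toy data are assumptions made for illustration, not the author's actual code. The export and R2 steps are left out to keep the sketch short.

```python
from functools import reduce

import numpy as np
import pandas as pd


def pipe_basic_fillna(df):
    """Fill missing numeric values with the column median (illustrative choice)."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out


def pipe_fillna_ascat(df):
    """Fill missing non-numeric values with "None" and convert to category."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].fillna("None").astype("category")
    return out


def pipe_bypass(df):
    """Blank block: returns the data unchanged, keeps pipes the same length."""
    return df


def pipe_log_getdummies(df):
    """log1p-transform numeric columns, then dummy-encode categorical ones."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = np.log1p(out[num_cols])
    return pd.get_dummies(out)


def run_pipe(df, steps):
    """Hypothetical helper: apply each block in order through DataFrame.pipe."""
    return reduce(lambda acc, step: acc.pipe(step), steps, df)


# Toy stand-in for the merged Train + Test data set.
combined = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "Street": ["Pave", None, "Grvl", "Pave"],
})

pipe_basic = [pipe_basic_fillna, pipe_fillna_ascat, pipe_bypass, pipe_log_getdummies]
pre_combined = run_pipe(combined, pipe_basic)
print(pre_combined.columns.tolist())
```

Because a pipe is just a list of functions, a new variant is a new list: swapping, removing, or replacing one block with `pipe_bypass` gives another preprocessing recipe without touching the other blocks.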
1.3 Output: N preprocessed HDF5 files in a folder, one for each permutation and combination of feature engineering steps, or for each novel feature engineering method found on Kaggle.
Once feature engineering is finished, a large number of preprocessed data sets have been generated, each with its own R2 value in the range [0, 1]. If the R2 value is too low, for example below 0.8, you can consider deleting that data set directly: in that case the X in the preprocessed data explains less than 80% of the variation in Y, and the data set is not worth processing further.
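The R2 screening can be sketched as follows. This version computes R2 from an ordinary least squares fit using NumPy only; the article does not say how the R2 check is implemented, so the function `r2_score_ols`, the synthetic data, and the way the 0.8 threshold is applied are all assumptions.

```python
import numpy as np


def r2_score_ols(X, y):
    """R2 of an ordinary least squares fit of y on X, with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for two preprocessed data sets.
rng = np.random.default_rng(0)
X_good = rng.normal(size=(200, 3))
y = X_good @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_bad = rng.normal(size=(200, 3))  # features unrelated to y

r2_by_name = {"good": r2_score_ols(X_good, y), "bad": r2_score_ols(X_bad, y)}

# Keep only the data sets whose features explain at least 80% of y.
R2_THRESHOLD = 0.8
kept = [name for name, r2 in r2_by_name.items() if r2 >= R2_THRESHOLD]
print(r2_by_name, kept)
```

This is a cheap filter: it needs no tuning and no cross-validation, so it can be run on every preprocessed file before any time is spent on model training.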
2. Machine learning stage (training and generating models; the goal is to obtain the lowest possible RMSE on the training data while still being able to generalize to the test data)
The first step is to establish a benchmark and filter out the best one (or several) of the preprocessed files (with the random seed set to a fixed value).
The second step is to tune parameters on the preprocessed files that passed the screening, and to find the most suitable algorithms (usually those with the lowest RMSE and with different kernels) (with the random seed set to a fixed value).
The third step is to use the tuned parameters to train on the training data in the preprocessed file, and then to average and stack the resulting models.
The fourth step is to generate a csv file and submit it to Kaggle to see what score it gets.
After adopting the above method, I got a relatively stable LB score, without the ups and downs I had before, and I avoided a lot of duplicated work.
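The first and third steps above can be sketched with scikit-learn as follows. The data is synthetic and stands in for one preprocessed file; the two models, their parameters, and the helper `cv_rmse` are assumptions chosen for illustration. Only simple averaging is shown; stacking and the csv export are left out.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # fixed so that every run uses the same folds and is comparable


def cv_rmse(model, X, y):
    """Mean cross-validated RMSE using a fixed, shuffled 5-fold split."""
    folds = KFold(n_splits=5, shuffle=True, random_state=SEED)
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=folds
    )
    return float(np.sqrt(-neg_mse).mean())


# Synthetic stand-in for one preprocessed file (not real competition data).
rng = np.random.default_rng(SEED)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.2, size=300)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Benchmark: score each candidate model on the same folds.
models = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.01)}
scores = {name: cv_rmse(model, X_train, y_train) for name, model in models.items()}
print(scores)

# Averaging: fit each model on the training data and average the predictions.
preds = [model.fit(X_train, y_train).predict(X_test) for model in models.values()]
avg_pred = np.mean(preds, axis=0)
avg_rmse = float(np.sqrt(np.mean((avg_pred - y_test) ** 2)))
print(avg_rmse)
```

With the seed fixed, a change in RMSE can be attributed to the preprocessed file or the parameters rather than to a different random split, which is what makes the benchmark usable for comparison.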