Linear Regression for the sklearn "Package Caller"

Principle of Linear Regression

As shown in the figure, we have a set of two-dimensional data points. Let us first think about how to fit these scattered points with a straight line. Put bluntly: we want the fitted line to pass as close to the scattered points as possible, so that every point lies near the line.

Objective function (cost function)

To make the points lie as close as possible to the fitted line, we need to express "closeness" with a mathematical formula:
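
A standard way to write this is the least-squares (mean squared error) cost: for a linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ over $m$ samples,

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Fitting the line then amounts to finding the $\theta$ that minimizes $J(\theta)$.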

Gradient descent

In earlier explanations of regression, the minimum was obtained analytically by taking derivatives (the normal equation), but that requires the matrix $X^\top X$ to be invertible. When it is not, the gradient descent method is usually used instead: repeatedly step in the direction opposite to the gradient. For details, please see this article (https://www.jianshu.com/p/96566542b07a). Tip: that article explains gradient ascent, which works analogously to gradient descent.
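
Concretely, gradient descent updates each parameter by a step proportional to the negative partial derivative of the cost (the standard update rule, stated here for completeness):

$$\theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}$$

where $\alpha$ is the learning rate: too large a value can overshoot the minimum, while too small a value converges slowly.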

Hands-on: housing price prediction

Data import

This example uses a data set that ships with sklearn: we import the Boston housing price data set through sklearn.datasets. (Note that load_boston was removed in scikit-learn 1.2, so the code below requires an older release.)

from sklearn.datasets import load_boston
boston = load_boston()

The details of the data set can be viewed through its DESCR attribute. The data set has 14 columns in total: the first 13 are feature columns, and the last one (the median house price) is the label.

print(boston.DESCR)

boston.data and boston.target store the features and the labels, respectively:
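
A quick check of their shapes (the Boston set has 506 samples and 13 feature columns):

print(boston.data.shape)    # (506, 13)
print(boston.target.shape)  # (506,)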

Split the data set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=2)
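
As a quick sanity check, 506 samples at an 80/20 ratio give 404 training rows and 102 test rows:

print(X_train.shape, X_test.shape)
# (404, 13) (102, 13)
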
Data preprocessing

The ordinary linear regression model is too simple and easily under-fits. We can add polynomial features so that the linear regression model fits the data better. In sklearn, polynomial features are generated by PolynomialFeatures in the preprocessing module. Its important parameters are:

  • degree: the degree of the polynomial features; the default is 2.
  • include_bias: the default is True, which adds a bias column of ones that serves as the intercept term in a linear model. It is set to False here, because LinearRegression lets you choose whether to fit an intercept itself.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)  # transform only, reusing the mapping fitted on the training data
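
With degree=2 and include_bias=False, the 13 original features expand to 104 columns (13 originals + 13 squares + 78 pairwise products); a quick check:

print(X_train_poly.shape)
# (404, 104)
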
Model training and evaluation

Linear regression uses the LinearRegression class in the sklearn.linear_model module. Its commonly used parameters are as follows:

  • fit_intercept: the default is True; whether to fit an intercept term.
  • normalize: the default is False; whether to normalize the data before fitting. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2; on recent releases, use a StandardScaler pipeline instead, as sketched after this list.)
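
On scikit-learn 1.2 and later, where the normalize parameter no longer exists, a comparable setup wraps the estimator in a pipeline with StandardScaler (a sketch, not an exact reproduction: normalize=True scaled columns by their L2 norm rather than the standard deviation, though plain least-squares scores are unaffected by feature scaling either way):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
model.score(X_test, y_test)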

Simple linear regression

from sklearn.linear_model import LinearRegression

model2 = LinearRegression(normalize=True)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)

# result
# 0.77872098747725804

Polynomial linear regression

model3 = LinearRegression(normalize=True)
model3.fit(X_train_poly, y_train)
model3.score(X_test_poly, y_test)

# result
# 0.895848854203947

Summary

Increasing the polynomial degree can keep improving the fit on the training set, but it easily causes over-fitting: the model no longer performs well on the test set. This is what is often described as poor generalization ability.
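
One way to observe this is to compare the training score with the held-out test score (a sketch reusing the variables above; a large gap between the two indicates over-fitting):

print(model3.score(X_train_poly, y_train))  # score on the training data
print(model3.score(X_test_poly, y_test))    # score on unseen test data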

Reference: Linear Regression for the sklearn "Package Caller", Cloud+ Community, Tencent Cloud: https://cloud.tencent.com/developer/article/1197123