As shown in the figure, this is a set of two-dimensional data points. Let us first think about how to fit these scattered points with a straight line; put plainly, we want the fitted line to pass as close as possible through the scattered points.
To make these points lie close to the fitted line, we need a mathematical formula that expresses that closeness.
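One standard way to write this criterion (the least-squares loss, assuming the usual setup of fitting a line y = wx + b to points (x_i, y_i)) is to minimize the sum of squared vertical distances between the points and the line:

$$\min_{w,b}\; J(w, b) = \sum_{i=1}^{n} \bigl(y_i - (w x_i + b)\bigr)^2$$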
When regression was explained earlier, the minimum was obtained analytically by setting the derivative to zero, but that approach requires the relevant matrix (X^T X in the normal equation) to be invertible. When it is not, the gradient descent method is usually used instead: take repeated steps in the direction opposite the gradient. For details, see this article (https://www.jianshu.com/p/96566542b07a). Tip: that article explains gradient ascent, which works analogously to gradient descent.
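As a sketch of how this looks in code (the learning rate and iteration count below are illustrative choices, not values from any particular implementation), gradient descent for a one-dimensional least-squares fit repeatedly updates the parameters against the gradient of the loss:

import numpy as np

def fit_line_gd(x, y, lr=0.05, n_iters=1000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        y_hat = w * x + b
        # Gradients of MSE = (1/n) * sum((y_hat - y)^2) with respect to w and b
        grad_w = (2.0 / n) * np.dot(y_hat - y, x)
        grad_b = (2.0 / n) * np.sum(y_hat - y)
        w -= lr * grad_w  # step against the gradient
        b -= lr * grad_b
    return w, b

# Usage: noisy points scattered around y = 2x + 1
rng = np.random.RandomState(0)
x = rng.uniform(0, 5, 100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)
print(fit_line_gd(x, y))  # should land close to (2.0, 1.0)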
This example uses a dataset that ships with sklearn: the Boston housing price dataset, loaded through sklearn.datasets.
from sklearn.datasets import load_boston

boston = load_boston()
The details of the dataset can be viewed through the DESCR attribute. The dataset has 14 attributes in total: the first 13 are features, and the last (MEDV, the median house price) is the label.
print(boston.DESCR)
boston.data and boston.target store the features and the labels, respectively. Split them into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=2)
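To sanity-check the split (the shapes below assume the standard Boston data with 506 samples and 13 features):

print(X_train.shape, X_test.shape)
# (404, 13) (102, 13)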
The ordinary linear regression model is too simple and easily leads to under-fitting. We can add polynomial features so that the linear regression model fits the data better. In sklearn, polynomial features are added with PolynomialFeatures from the preprocessing module. Its important parameters are degree (the degree of the polynomial) and include_bias (whether to add a constant bias column of ones):
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)  # reuse the transformer fitted on the training data
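To see what this transform actually produces, here is a tiny illustration on a single made-up sample with two features (the numbers are arbitrary):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[2.0, 3.0]])
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(demo))
# [[2. 3. 4. 6. 9.]]  i.e. x1, x2, x1^2, x1*x2, x2^2

Applied to the 13 Boston features, degree=2 expands the data from 13 to 104 columns.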
The linear model itself uses the LinearRegression class in the sklearn.linear_model module. Its commonly used parameters include fit_intercept (whether to fit an intercept term) and normalize (whether to normalize the features before fitting).
Simple linear regression
from sklearn.linear_model import LinearRegression

model2 = LinearRegression(normalize=True)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)  # score returns R^2 on the test set
# result
# 0.77872098747725804
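After fitting, the learned parameters can be read off the standard LinearRegression attributes:

print(model2.intercept_)  # the fitted intercept
print(model2.coef_)       # one coefficient per feature (13 here)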
Polynomial linear regression
model3 = LinearRegression(normalize=True)
model3.fit(X_train_poly, y_train)
model3.score(X_test_poly, y_test)
# result
# 0.895848854203947
Increasing the polynomial degree further can keep producing good results on the training set, but it easily causes over-fitting: the model no longer performs well on the test set. This is what is commonly called poor generalization ability.
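A quick way to see this effect is to sweep the polynomial degree and compare training and test scores (a sketch reusing the split from above; the degrees are illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_tr = poly.fit_transform(X_train)
    X_te = poly.transform(X_test)
    model = LinearRegression(normalize=True)
    model.fit(X_tr, y_train)
    print(degree, model.score(X_tr, y_train), model.score(X_te, y_test))
# The training score keeps climbing with the degree, while the test
# score peaks and then drops: the signature of over-fitting.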