PCA has three main uses: 1) it greatly reduces the time needed for subsequent machine learning; 2) data visualization; 3) noise reduction.
The following uses the handwritten digits dataset in sklearn to demonstrate each of these three uses.
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Before dimensionality reduction:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
Time required:
After dimensionality reduction:
# First reduce the dimensionality, then classify again
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)
%%time
knn = KNeighborsClassifier()
knn.fit(X_train_reduction, y_train)
print(knn.score(X_test_reduction, y_test))
As you can see, PCA greatly reduces the running time of the algorithm, but it also greatly reduces the accuracy. Reducing to two dimensions loses too much information, so we can use the explained_variance_ratio_ attribute in sklearn to see how much of the variance each principal component explains.
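For instance, inspecting this attribute on the two-component PCA fitted above (the exact numbers will depend on the train/test split) already shows how little variance two components retain:

# Fraction of the total variance explained by each of the two components;
# the two values sum to well under 1, so most information is discarded
print(pca.explained_variance_ratio_)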
Compute the explained variance ratio of every principal component and plot the cumulative curve:
# Keep all principal components
import numpy as np
import matplotlib.pyplot as plt

pca = PCA(n_components=X.shape[1])
pca.fit(X_train)
print(pca.explained_variance_ratio_)

# Cumulative explained variance as the number of components grows
all_var = []
for i in range(X.shape[1]):
    all_var.append(np.sum(pca.explained_variance_ratio_[:i+1]))
plt.plot(all_var, 'o-', color='g')
plt.show()
From this curve we can read off the cumulative explained variance for any number of dimensions. But sklearn provides a more convenient way: you can pass the desired percentage directly to PCA():
# Pass the fraction of variance we want to keep to PCA, e.g. 0.95;
# sklearn then picks the smallest number of components that reaches it
pca = PCA(0.95)
pca.fit(X_train)
pca.n_components_
This outputs 28; that is, the first 28 components explain 95% of the variance.
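As a quick sanity check (a small sketch, not from the original notes), the ratios of the retained components should sum to at least the 0.95 we asked for:

import numpy as np

# explained_variance_ratio_ only covers the retained components,
# so its sum is at least 0.95
print(np.sum(pca.explained_variance_ratio_))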
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)
%%time
knn = KNeighborsClassifier()
knn.fit(X_train_reduction, y_train)
print(knn.score(X_test_reduction, y_test))
This runs faster than training on the full feature set, and the score is far better than with only two components. When there are a large number of samples, sacrificing a small amount of accuracy for a much shorter running time is usually worthwhile.
Reducing the data to two dimensions also makes it possible to visualize it directly.
pca = PCA(n_components=2)
pca.fit(X)
X_reduction = pca.transform(X)

# Plot each digit class in a different color
for i in range(10):
    plt.scatter(X_reduction[y==i, 0], X_reduction[y==i, 1], alpha=0.7)
plt.show()
Visualization like this is a common use of PCA. Noise reduction is another: some of the information PCA discards may actually be noise, and dropping that noise can improve the accuracy of the model. For example, add some noise to the handwritten digits above and visualize them.
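A minimal sketch of this step (the noise scale and the plot_digits helper are my own assumptions, not from the original notes):

# Add Gaussian noise to the digit images (noise scale chosen arbitrarily)
noisy_digits = X + np.random.normal(0, 4, size=X.shape)

# Helper: plot the first 100 digits in a 10x10 grid of 8x8 images
def plot_digits(data):
    fig, axes = plt.subplots(10, 10, figsize=(10, 10),
                             subplot_kw={'xticks': [], 'yticks': []})
    for ax, img in zip(axes.flat, data):
        ax.imshow(img.reshape(8, 8), cmap='binary')
    plt.show()

plot_digits(noisy_digits[:100])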
But after PCA dimensionality reduction (keeping 50% of the explained variance) and reconstruction, the digits look much cleaner.
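A sketch of the denoising step, under the assumption that "50%" means PCA(0.5) and that inverse_transform maps the reduced data back to pixel space (reusing the plot_digits helper sketched above):

# Fit PCA keeping 50% of the explained variance on the noisy data
pca = PCA(0.5)
pca.fit(noisy_digits)

# Project into the low-dimensional space and back;
# the discarded components carry much of the noise
components = pca.transform(noisy_digits)
filtered_digits = pca.inverse_transform(components)

plot_digits(filtered_digits[:100])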
The above are notes taken while following the course at https://coding.imooc.com/learn/list/169.html (Python3 Introduction to Machine Learning).