The derivation and solution of PCA (3)-The role of PCA

PCA serves three main purposes: 1) it greatly reduces the time needed for subsequent machine learning; 2) it enables data visualization; 3) it can be used for noise reduction.

The handwritten digits dataset in sklearn is used below to illustrate each of these three roles.

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the 8x8 handwritten digits dataset (64 features per sample)
digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

1. Save time

Before dimensionality reduction:

%%time
# KNN on the full 64-dimensional data
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

Time required:

After dimensionality reduction:

# Reduce dimensionality first, then classify again
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train)
X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn = KNeighborsClassifier()
knn.fit(X_train_reduction, y_train)
print(knn.score(X_test_reduction, y_test))

It can be seen that PCA greatly reduces the running time of the algorithm, but the accuracy also drops sharply. Reducing to only two dimensions loses too much information, so we can use the explained_variance_ratio_ attribute in sklearn to see how much of the total variance each principal component explains.

Get the explained variance ratio of every principal component and plot the cumulative explained variance:

import numpy as np
import matplotlib.pyplot as plt

# Keep all principal components
pca = PCA(n_components=X.shape[1])
pca.fit(X_train)
print(pca.explained_variance_ratio_)

# Cumulative explained variance of the first i components
all_var = []
for i in range(1, X.shape[1] + 1):
    all_var.append(np.sum(pca.explained_variance_ratio_[:i]))

plt.plot(all_var, 'o-', color='g')
plt.show()

This curve shows the cumulative explained variance for any number of dimensions. But sklearn provides a more convenient way: you can pass the desired fraction of explained variance directly to PCA():

# Pass the fraction of variance to keep directly to PCA, e.g. 0.95;
# n_components_ then tells us how many components are needed
pca = PCA(0.95)
pca.fit(X_train)
pca.n_components_

This outputs 28, i.e. the first 28 components already explain 95% of the variance.

X_train_reduction = pca.transform(X_train)
X_test_reduction = pca.transform(X_test)

%%time
knn = KNeighborsClassifier()
knn.fit(X_train_reduction, y_train)
print(knn.score(X_test_reduction, y_test))

With these 28 components, training and scoring take less time than on the original 64-dimensional data, and the score is far higher than with only two components. When the sample size is large, sacrificing a little accuracy for a large saving in time is worth it.
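
As a minimal sketch of how such a comparison could be run outside a notebook (where the %%time magic is unavailable), the fit/score calls can be timed directly with time.perf_counter; the helper fit_and_score below is my own illustration, not code from the course:

import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

def fit_and_score(X_tr, X_te):
    # Time fitting and scoring a fresh KNN classifier
    knn = KNeighborsClassifier()
    start = time.perf_counter()
    knn.fit(X_tr, y_train)
    score = knn.score(X_te, y_test)
    return score, time.perf_counter() - start

# KNN on the full 64-dimensional data
score_full, t_full = fit_and_score(X_train, X_test)

# KNN on the components that keep 95% of the variance
pca = PCA(0.95).fit(X_train)
score_pca, t_pca = fit_and_score(pca.transform(X_train), pca.transform(X_test))

print(f"full 64 dims: score={score_full:.4f}, time={t_full:.3f}s")
print(f"PCA 95%     : score={score_pca:.4f}, time={t_pca:.3f}s")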

2. Visualization

Reducing the data to two dimensions makes it possible to visualize it directly:

pca = PCA(n_components=2)
pca.fit(X)
X_reduction = pca.transform(X)

# Plot each digit class (0-9) in a different color
for i in range(10):
    plt.scatter(X_reduction[y == i, 0], X_reduction[y == i, 1], alpha=0.7)
plt.show()

3. Noise reduction

This is a commonly used application. Some of the information discarded by PCA may actually be noise, and dropping that noise can improve the accuracy of the model. For example, add some noise to the handwritten digits above and visualize the result (a sketch of this step follows below):
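
The course illustrates this with images; as a rough sketch (the noise scale of 4 below is my own assumption, not necessarily the value used in the course), Gaussian noise can be added to the pixel values and the first few noisy digits displayed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
# Add Gaussian noise to the 8x8 pixel values (scale=4 is an arbitrary choice)
noisy_digits = digits.data + np.random.normal(0, 4, size=digits.data.shape)

# Show the first 10 noisy digits
fig, axes = plt.subplots(1, 10, figsize=(10, 1.5))
for ax, img in zip(axes, noisy_digits[:10]):
    ax.imshow(img.reshape(8, 8), cmap='binary')
    ax.axis('off')
plt.show()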

After reducing the noisy data with PCA (keeping 50% of the variance) and mapping it back, the digits look much cleaner (a sketch follows below):
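
A sketch of the denoising step, continuing from the noisy_digits array above: keep the components that explain 50% of the variance, then use inverse_transform to map back to the 64-dimensional pixel space.

from sklearn.decomposition import PCA

# Keep the components that explain 50% of the variance
pca = PCA(0.5)
pca.fit(noisy_digits)

# Project to the reduced space, then map back to 64 pixels
components = pca.transform(noisy_digits)
filtered_digits = pca.inverse_transform(components)

# Show the first 10 denoised digits
fig, axes = plt.subplots(1, 10, figsize=(10, 1.5))
for ax, img in zip(axes, filtered_digits[:10]):
    ax.imshow(img.reshape(8, 8), cmap='binary')
    ax.axis('off')
plt.show()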

The above are notes taken while following the course at https://coding.imooc.com/learn/list/169.html [Python3 introductory machine learning].

Reference: https://cloud.tencent.com/developer/article/1734931 (The derivation and solution of PCA (3): The role of PCA, Tencent Cloud Community)