This series of tutorials is my reading notes on "Machine Learning in Action". First, the reasons for writing the series: first, the code in "Machine Learning in Action" is written in Python 2, and some of it raises errors under Python 3, so this tutorial revises the code to run on Python 3; second, I read several machine learning books before without taking notes and quickly forgot them, so writing tutorials doubles as a review; third, machine learning is harder to learn than crawlers or data analysis, and I hope this series of tutorials helps readers avoid detours on the way to learning it.
Here Helen has collected 1,000 rows of data with three features: the number of frequent-flyer miles earned each year; the percentage of time spent playing video games; and the liters of ice cream consumed each week. Each row also carries the type label of the date, as shown in the figure.
```python
import numpy as np
import operator

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index = index + 1
    return returnMat, classLabelVector
```
Define the function that parses the data. Lines 4-9: open the file, read its lines, and get the line count; create an all-zero NumPy array with one row per file line (1,000 rows) and 3 columns; and create a classLabelVector list for storing the class labels.
Lines 10-17: loop over the file line by line, store the first three columns of each line in the returnMat array, and append the last column to the classLabelVector list. The result is shown in the figure.
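As a quick sanity check (my own illustration, not from the book), we can run the same parser on a small handmade file with the same tab-separated layout as datingTestSet2.txt; the rows below are made-up numbers:

```python
import numpy as np

def file2matrix(filename):
    # same parser as above, in compact form
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    returnMat = np.zeros((len(arrayOLines), 3))
    classLabelVector = []
    for index, line in enumerate(arrayOLines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

# write a tiny tab-separated file: 3 feature columns plus a label column
with open('tiny_dating.txt', 'w') as f:
    f.write('40920\t8.326976\t0.953952\t3\n')
    f.write('14488\t7.153469\t1.673904\t2\n')
    f.write('26052\t1.441871\t0.805124\t1\n')

X, y = file2matrix('tiny_dating.txt')
print(X.shape)  # (3, 3)
print(y)        # [3, 2, 1]
```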
The code above follows the book; it is even more convenient to read the data with pandas. The code is as follows:
```python
import numpy as np
import operator
import pandas as pd

def file2matrix(filename):
    data = pd.read_table(filename, sep='\t', header=None)
    returnMat = data[[0, 1, 2]].values
    classLabelVector = data[3].values
    return returnMat, classLabelVector
```
Since the features differ greatly in scale, the attribute with the largest values would dominate the distance calculation. The data therefore needs to be normalized: new = (old - min) / (max - min). The code is as follows:
```python
def autoNorm(dataSet):
    minval = dataSet.min(0)
    maxval = dataSet.max(0)
    ranges = maxval - minval
    normDataSet = np.zeros(np.shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minval, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minval
```
The parameter passed in is the dataset (returnMat). First compute min and max along axis 0 (that is, column by column), as in the simple example shown in the figure; then construct a zero matrix of the same size as the data (normDataSet).
Readers can look up the usage of the tile function themselves. Here its job is to repeat a one-dimensional array for m rows, as shown in the figure, so that the normalization can be computed element-wise.
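As an aside (my note, not the book's): NumPy broadcasting makes the tile calls unnecessary, because subtracting a 1-D array from a 2-D array already repeats it row-wise. A minimal sketch with made-up numbers, showing both routes agree:

```python
import numpy as np

dataSet = np.array([[1.0, 20.0, 300.0],
                    [2.0, 10.0, 100.0],
                    [3.0, 30.0, 200.0]])
minval = dataSet.min(0)            # column-wise minima: [1., 10., 100.]
ranges = dataSet.max(0) - minval   # column-wise ranges: [2., 20., 200.]

m = dataSet.shape[0]
# the book's way: explicitly tile the 1-D arrays to m rows
norm_tile = (dataSet - np.tile(minval, (m, 1))) / np.tile(ranges, (m, 1))
# broadcasting does the tiling implicitly
norm_bcast = (dataSet - minval) / ranges

print(np.allclose(norm_tile, norm_bcast))  # True
```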
The distance used here is the Euclidean distance, and the formula is: d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2).
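A quick worked example of the formula (my own illustration, with made-up points):

```python
import numpy as np

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])
# sqrt((0-4)^2 + (3-0)^2) = sqrt(16 + 9) = 5
dist = np.sqrt(((a - b) ** 2).sum())
print(dist)  # 5.0
```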
```python
def classify(inX, dataSet, labels, k):
    dataSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSize, 1)) - dataSet
    sqdiffMat = diffMat ** 2
    sqDistance = sqdiffMat.sum(axis=1)
    distances = sqDistance ** 0.5
    sortedDist = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDist[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
inX is the input vector to classify; dataSet is the training data; labels are the class labels of the training data; k is the number of neighbors that vote.
Lines 2-6: compute the Euclidean distances.
Line 7 to the end: sort the computed distances by index (argsort), tally the labels of the k nearest neighbors in a dictionary, sort the dictionary by count, and return the category with the most votes.
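To see the voting in action, here is a tiny demonstration on four made-up 2-D points (my own example, not from the book):

```python
import operator
import numpy as np

def classify(inX, dataSet, labels, k):
    # same kNN classifier as above, in compact form
    dataSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDist = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDist[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# two points near (1, 1) labelled 1, two points near (0, 0) labelled 2
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = [1, 1, 2, 2]
print(classify(np.array([0.9, 0.8]), group, labels, 3))  # 1
```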
Here we take the first 10% of the data as the test set to evaluate the classifier.
```python
def test():
    r = 0.1
    X, y = file2matrix('data/datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    m = new_X.shape[0]
    numTestVecs = int(m * r)
    error = 0.0
    for i in range(numTestVecs):
        result = classify(new_X[i, :], new_X[numTestVecs:m, :], y[numTestVecs:m], 3)
        print('Classification result: %d, real data: %d' % (result, y[i]))
        if result != y[i]:
            error = error + 1.0
    print('Error rate: %f' % (error / float(numTestVecs)))
```
Finally, write a simple system: the user manually enters the three attribute values, and the code predicts the classification label of the date.
```python
def system():
    style = ['do not like', 'general', 'like']
    ffmile = float(input('flight mileage'))
    game = float(input('game'))
    ice = float(input('ice cream'))
    X, y = file2matrix('data/datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    inArr = np.array([ffmile, game, ice])
    result = classify((inArr - minval) / ranges, new_X, y, 3)
    print('this person', style[result - 1])
```
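One detail worth noting in system(): the new sample must be scaled with the min and range learned from the training data, not recomputed from the sample itself. A minimal sketch of that scaling step, with made-up min/range values:

```python
import numpy as np

# hypothetical column-wise minima and ranges returned by autoNorm
minval = np.array([0.0, 0.0, 0.1])
ranges = np.array([90000.0, 20.0, 1.6])

# hypothetical new sample: miles, game percentage, liters of ice cream
inArr = np.array([45000.0, 10.0, 0.9])

# same transformation applied to every training row
scaled = (inArr - minval) / ranges
print(scaled)  # [0.5 0.5 0.5]
```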