KNN algorithm for machine learning

This series of tutorials consists of my reading notes for "Machine Learning in Action". First, the reasons for writing the series: one, the book's code is written in Python 2 and some of it raises errors under Python 3, so this series revises the code for Python 3; two, I have read machine-learning books before without taking notes and quickly forgot them, so writing tutorials is also a way of reviewing; three, machine learning is harder to learn than, say, web scraping or data analysis, and I hope this series helps readers avoid detours along the way.

Features of this series of tutorials:

  • Based on "Machine Learning in Action", with all code revised for Python 3
  • Avoids heavy mathematical formulas; each algorithm's principle is explained in plain, straightforward terms
  • The code implementing each algorithm is explained in detail

Readers who will get the most out of this series:

  • Understand the basic terminology of machine learning
  • Know the Python language
  • Can use the numpy and pandas libraries

Principle of the k-nearest neighbors (KNN) algorithm

KNN is a classification algorithm. An old saying describes it well: "he who stays near vermilion turns red, and he who stays near ink turns black". In other words, a sample is judged by the company it keeps. The principle: compute the distance between the test sample and every training sample (the distance formula is given below), take the k training samples with the smallest distances, and assign the test sample to the category that appears most often among those k samples. In the classic illustration, the green point is the test sample: when k is 3 it belongs to the red category, and when k is 5 to the blue category. The choice of k therefore strongly affects the result; in practice k is usually no greater than 20.

(Figure: KNN algorithm principle)

With the principle covered, the pseudo-code flow of KNN is: compute the distance between the test sample and all training samples; sort the distances in ascending order; take the top k; and return the class that appears most often among those k samples.

The dating-partner classification problem

Helen uses a dating website to find dates. After a while, she noticed that the people she had met fell into three types:

  • People she doesn't like
  • People with moderate charm
  • People with great charm

Helen collected 1,000 rows of data with three features: the number of frequent flyer miles earned per year; the percentage of time spent playing video games; and the liters of ice cream consumed per week, plus a type label for each person, as shown in the figure.
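For reference, a few rows in the expected tab-separated format (these particular values are illustrative; the real file ships with the book's source code as datingTestSet2.txt):

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1

Each row is one person: the three feature columns followed by the label 1, 2, or 3 for the three types above.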

Parsing the data
import numpy as np
import operator

def file2matrix(filename):
    # read all lines of the tab-separated data file
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # one row per sample, three feature columns
    returnMat = np.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # first three columns are features, last column is the class label
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

The parsing function works as follows: it reads the file and counts its lines (1,000 here), creates an all-zeros numpy array with that many rows and 3 columns, and creates a classLabelVector list for the class labels. It then loops over the lines of the file, storing the first three columns of each line in the returnMat array and the last column in the classLabelVector list. The result is shown in the figure.

The code above follows the book, but pandas makes reading this data much more convenient. The code is as follows:

import pandas as pd

def file2matrix(filename):
    # read the tab-separated file (pd.read_table is deprecated, so use read_csv)
    data = pd.read_csv(filename, sep='\t', header=None)
    returnMat = data[[0, 1, 2]].values   # first three columns: features
    classLabelVector = data[3].values    # last column: class labels
    return returnMat, classLabelVector
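Either version can be used the same way. A quick check, assuming the data file lives at data/datingTestSet2.txt (the path used later in this tutorial):

X, y = file2matrix('data/datingTestSet2.txt')
print(X.shape)  # (1000, 3): one row of three features per person
print(y[:5])    # the first few labels, each 1, 2, or 3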
Normalization

Because the features differ greatly in scale, an attribute with large values would dominate the distance calculation. The data therefore needs to be normalized: new = (old - min) / (max - min). For example, in a column with min 0 and max 91273, the value 40920 maps to 40920/91273, about 0.448. The code is as follows:

def autoNorm(dataSet):
    # per-column minimum, maximum, and range (axis 0 = by column)
    minval = dataSet.min(0)
    maxval = dataSet.max(0)
    ranges = maxval - minval
    m = dataSet.shape[0]
    # tile the per-column min/range into m rows, then apply (old - min) / (max - min)
    normDataSet = dataSet - np.tile(minval, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minval

The parameter is the feature matrix (returnMat). min and max are first computed along axis 0, that is, column by column (the figure shows a simple example); their difference gives each column's range, and the normalization formula is then applied row by row.

Readers unfamiliar with the tile function can look it up; its effect here is to repeat a one-dimensional array m times along the rows (a short demonstration follows), which lets the normalization be computed over the whole matrix at once, as shown in the figure.
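A quick demonstration of np.tile, here repeating a 1-D array of per-column minima across 3 rows:

import numpy as np

minval = np.array([0.0, 0.0, 0.001])
print(np.tile(minval, (3, 1)))
# [[0.    0.    0.001]
#  [0.    0.    0.001]
#  [0.    0.    0.001]]

Incidentally, numpy broadcasting makes the tile calls unnecessary: (dataSet - minval) / ranges computes the same normalized matrix directly.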

KNN algorithm

The distance used here is the Euclidean distance: d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2). For example, the distance between the points (0, 0) and (3, 4) is sqrt(3^2 + 4^2) = 5.

def classify(inX, dataSet, labels, k):
    dataSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = np.tile(inX, (dataSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistance = sqDiffMat.sum(axis=1)
    distances = sqDistance ** 0.5
    # indices of the training samples sorted by ascending distance
    sortedDist = distances.argsort()
    classCount = {}
    # vote: count the labels of the k nearest samples
    for i in range(k):
        voteLabel = labels[sortedDist[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # sort the vote counts in descending order and return the winning class
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

inX is the test sample to classify; dataSet is the training data; labels holds the class labels; k is the number of neighbors.

The function first computes the Euclidean distance from inX to every training sample. It then sorts the distances by index (argsort), counts the labels of the k nearest samples in a dictionary, and sorts the dictionary by count to return the category with the most votes.
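A quick sanity check on a toy dataset (the values below are illustrative, in the spirit of the book's small two-class example):

group = np.array([[1.0, 1.1],
                  [1.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify(np.array([0.1, 0.2]), group, labels, 3))  # prints 'B'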

Test the classifier

Here we take the first 10% of the rows as test samples and use the remaining 90% as training data to evaluate the classifier.

def test():
    r = 0.1  # hold out the first 10% as test samples
    X, y = file2matrix('data/datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    m = new_X.shape[0]
    numTestVecs = int(m * r)
    error = 0.0
    for i in range(numTestVecs):
        # classify each test sample against the remaining 90% of the data
        result = classify(new_X[i, :], new_X[numTestVecs:m, :], y[numTestVecs:m], 3)
        print('Classification result: %d, real data: %d' % (result, y[i]))
        if result != y[i]:
            error += 1.0
    print('Error rate: %f' % (error / float(numTestVecs)))
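To run the evaluation, a minimal sketch assuming the same data path as above:

if __name__ == '__main__':
    test()

For reference, the book reports an error rate of about 5% for this split with k = 3; the exact figure depends on the data file used.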
Prediction system

Finally, we write a simple interactive system: the user types in the three feature values by hand and the code prints the predicted label for the date.

def system():
    style = ['do not like', 'like in small doses', 'like in large doses']
    ffmile = float(input('frequent flyer miles per year: '))
    game = float(input('percentage of time playing video games: '))
    ice = float(input('liters of ice cream per week: '))
    X, y = file2matrix('data/datingTestSet2.txt')
    new_X, ranges, minval = autoNorm(X)
    inArr = np.array([ffmile, game, ice])
    # normalize the input with the same min/range as the training data
    result = classify((inArr - minval) / ranges, new_X, y, 3)
    print('You will probably:', style[result - 1])
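A hypothetical session (the input values are made up, and the predicted label depends on the data):

frequent flyer miles per year: 40000
percentage of time playing video games: 10
liters of ice cream per week: 0.5
You will probably: like in large doses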

Algorithm advantages and disadvantages

  • Advantages: high accuracy; insensitive to outliers
  • Disadvantages: computationally expensive (think about computing the distance between every test sample and every training sample)

A final word: if you are just starting out, some of this may feel uncomfortable; just type out the code a few times. Feel free to like and leave comments, and you can also interact with me on Weibo (Luo Luopan).
Reference: KNN algorithm for machine learning in action, Cloud+ Community, Tencent Cloud. https://cloud.tencent.com/developer/article/1151556