Machine Learning in Action: Naive Bayes

Before turning to the naive Bayes classifier, let's review the KNN and decision tree methods covered earlier. My own summary: different machine learning methods rest on different assumptions and theories, and these assumptions largely determine each algorithm's strengths and weaknesses.

KNN: In the sample space, samples of the same class cluster together, that is, they lie close to one another. Under this assumption we only need to compute the distance between the test sample and each training sample; the class of the nearest training sample is very likely the class of the test sample.
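
A minimal sketch of this nearest-neighbour idea, assuming Euclidean distance and a single neighbour (k = 1). The function name `nearest_neighbor_label` and the toy points are illustrative and not taken from the original.

```python
import math

def nearest_neighbor_label(train_points, train_labels, query):
    """Return the label of the training point closest to `query` (k = 1)."""
    distances = [math.dist(p, query) for p in train_points]
    closest = distances.index(min(distances))
    return train_labels[closest]

# Hypothetical toy data: two small clusters in the plane.
train_points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
train_labels = ['A', 'A', 'B', 'B']

print(nearest_neighbor_label(train_points, train_labels, (0.9, 0.8)))  # A
print(nearest_neighbor_label(train_points, train_labels, (0.1, 0.2)))  # B
```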

Decision tree: Based on information theory. The sample data starts out disordered (high entropy), but it comes with features. The job of the decision tree is to use those features effectively so that the samples become ordered and separable.
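
To make "high entropy" concrete, here is a small sketch that computes the Shannon entropy of a list of class labels. The helper name `shannon_entropy` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

mixed = ['yes', 'yes', 'no', 'no']    # evenly mixed classes: maximally disordered
pure = ['yes', 'yes', 'yes', 'yes']   # a single class: perfectly ordered

print(shannon_entropy(mixed))  # 1.0
print(shannon_entropy(pure))   # 0.0
```

A split on a good feature moves the subsets from the first situation toward the second.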

The naive Bayes method introduced today is based on Bayes' criterion. So why "naive"? Because modelling the full dependence between features is more than we can manage! Naive Bayes therefore assumes that the features are independent of one another and equally important.

Naive Bayes principle

The figure shows our sample data set.

Let p1(x,y) denote the probability that the data point (x,y) belongs to category 1 (the dots in the figure), and let p2(x,y) denote the probability that it belongs to category 2 (the triangles in the figure). For a new data point (x,y), the following rules determine its category:

  • If p1(x,y) > p2(x,y), then the category is 1
  • If p2(x,y) > p1(x,y), then the category is 2

In other words, we choose the category with the higher probability. This is the core idea of the algorithm. But how do we calculate p1(x,y) and p2(x,y)? For that we need conditional probability and Bayes' criterion.

Conditional probability and Bayes criterion

Conditional probability is the probability that A occurs given that B has occurred, written P(A|B). Let's look at the example from the book:

A jar contains 7 stones, 3 white and 4 black. The probability of drawing a white stone is 3/7, and the probability of drawing a black one is 4/7. So far this is simple.

If the stones are split between two buckets, as shown in the figure, what is the probability of drawing a white stone given that we draw from bucket B? It is clearly 1/3, as the figure shows, but in more complicated problems the answer is hard to read off directly. That is where the conditional probability formula comes in:

P(white | bucket B) = P(white and bucket B) / P(bucket B)
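
The bucket example can be checked with exact fractions using P(A|B) = P(A and B) / P(B). The sketch below assumes one particular split that is consistent with the numbers in the text (the figure itself is not reproduced here): bucket A holds 2 white and 2 black stones, and bucket B holds 1 white and 2 black.

```python
from fractions import Fraction

# Assumed bucket contents (7 stones in total: 3 white, 4 black).
buckets = {
    'A': {'white': 2, 'black': 2},
    'B': {'white': 1, 'black': 2},
}
total = sum(sum(colors.values()) for colors in buckets.values())

# Joint probability: the drawn stone is white AND sits in bucket B.
p_white_and_b = Fraction(buckets['B']['white'], total)
# Probability that a randomly drawn stone comes from bucket B.
p_b = Fraction(sum(buckets['B'].values()), total)

p_white_given_b = p_white_and_b / p_b
print(p_white_given_b)  # 1/3
```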

Another way to compute a conditional probability is Bayes' criterion, which swaps the condition and the outcome:

P(c|x) = P(x|c) P(c) / P(x)
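
As a sketch of Bayes' criterion on the same stone example, the snippet below swaps the condition: starting from P(white | bucket B), it recovers P(bucket B | white). The value P(bucket B) = 3/7 is an assumption (bucket B holding 3 of the 7 stones), chosen to be consistent with the 1/3 quoted in the text; the helper name `bayes` is also illustrative.

```python
from fractions import Fraction

def bayes(likelihood, prior, evidence):
    """Bayes' rule: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence

p_white_given_b = Fraction(1, 3)  # from the text
p_bucket_b = Fraction(3, 7)       # assumed: bucket B holds 3 of the 7 stones
p_white = Fraction(3, 7)          # from the text: 3 white stones out of 7

p_b_given_white = bayes(p_white_given_b, p_bucket_b, p_white)
print(p_b_given_white)  # 1/3
```

Under this assumed split, one of the three white stones sits in bucket B, which agrees with the result.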

Algorithm principle

Let's return to the earlier rule:

  • If p1(x,y) > p2(x,y), then the category is 1
  • If p2(x,y) > p1(x,y), then the category is 2

Here p1(x,y) and p2(x,y) are shorthand for the conditional probabilities p(c1|x,y) and p(c2|x,y). These are hard to compute directly in practice, so we apply Bayes' criterion:

p(ci|x,y) = p(x,y|ci) p(ci) / p(x,y)

This becomes:

  • If P(c1|x,y) > P(c2|x,y), then it belongs to category c1
  • If P(c2|x,y) > P(c1|x,y), then it belongs to category c2
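
Because both posteriors share the same denominator p(x,y), comparing them only requires the numerators p(x,y|ci) p(ci). A minimal sketch of that decision rule follows. The likelihood and prior values are made-up numbers for illustration, and `classify` is a hypothetical helper name.

```python
def classify(likelihoods, priors):
    """Pick the class whose Bayes numerator p(x,y|c) * p(c) is largest.

    The shared denominator p(x,y) is left out because it does not
    change which class wins the comparison.
    """
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# Made-up values for a single data point (x, y).
likelihoods = {'c1': 0.20, 'c2': 0.05}  # p(x,y | c)
priors = {'c1': 0.40, 'c2': 0.60}       # p(c)

print(classify(likelihoods, priors))  # c1
```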

Naive Bayesian text classification

Problem description and data

Take the message board of an online community as an example. Each post is labelled either as insulting (1) or as normal speech (0). The data set is constructed by hand.

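def loadDataSet():
    """
    Create a data set

    :return: word list postingList, category classVec
    """
    # Posts 2-6 are assumed to follow the example data set in the book
    # Machine Learning in Action; only the first post appears in this listing.
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],  # [0,0,1,1,1...]
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec

Construct word vector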

Computers cannot work with raw text directly, so the text must first be converted to numbers. How is the conversion done?

For example, here are two short pieces of text:

  • I love China
  • I don't like eating apples

First segment the text into words, then collect the words into a list without duplicates (the vocabulary):

I, love, China, don't like, eat apples

Each piece of text is then vectorized against the vocabulary: a vocabulary word that appears in the text is assigned 1, otherwise 0. The two pieces of text become the following two word vectors.

  • 1, 1, 1, 0, 0
  • 1, 0, 0, 1, 1
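
The same vectorization can be sketched in a few lines. The token lists below are the hand-segmented sentences from the example, and `to_word_vector` is an illustrative helper rather than part of the original code.

```python
vocabulary = ['I', 'love', 'China', "don't like", 'eat apples']

def to_word_vector(vocabulary, tokens):
    """Set-of-words vector: 1 if the vocabulary word occurs in tokens, else 0."""
    return [1 if word in tokens else 0 for word in vocabulary]

sentence1 = ['I', 'love', 'China']
sentence2 = ['I', "don't like", 'eat apples']

print(to_word_vector(vocabulary, sentence1))  # [1, 1, 1, 0, 0]
print(to_word_vector(vocabulary, sentence2))  # [1, 0, 0, 1, 1]
```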

The following code builds a word vector:

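def createVocabList(dataSet):
    """Return a list of the unique words across all documents."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with this document's words
    return list(vocabSet)

def set0fWords2Vec(vocabList, inputSet):
    """Convert a document into a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:  # skip words that are not in the vocabulary
            returnVec[vocabList.index(word)] = 1
    return returnVec

Training algorithm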

Because each document has many features (words), the formula is written in terms of a word vector w:

p(ci|w) = p(w|ci) p(ci) / p(w)
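
Under the naive independence assumption, p(w|ci) factors into a product of per-word probabilities p(w1|ci) p(w2|ci) ... p(wN|ci). Multiplying many small numbers can underflow to zero in floating point, which is why the code in this section works with sums of logarithms instead. The sketch below illustrates the effect with made-up probabilities.

```python
import math

# Made-up per-word probabilities: 500 words, each with p(w_j | c) = 0.001.
word_probs = [0.001] * 500

# The direct product underflows to 0.0 in double precision.
product = 1.0
for p in word_probs:
    product *= p

# The sum of logs stays finite, so two classes can still be compared.
log_sum = sum(math.log(p) for p in word_probs)

print(product)  # 0.0
print(log_sum)  # a large negative but finite number
```

Since the logarithm is monotonic, comparing log scores picks the same class as comparing the original products would.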

The code is as follows:

from numpy import array, log, ones

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # Prior P(c1): fraction of training documents labelled abusive
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Start counts at 1 and denominators at 2 so no word gets probability zero
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs so that later products of probabilities become sums
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

Test algorithm

Finally, test the algorithm:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log P(w|c1) + log P(c1): log of the numerator of the Bayes criterion
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    # log P(w|c0) + log P(c0): log of the numerator of the Bayes criterion
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    # Vectorize every training post against the vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(set0fWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(set0fWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(set0fWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

Algorithm advantages and disadvantages

  • Advantages: works with a small amount of data and can handle multi-class problems
  • Disadvantages: sensitive to how the input data is prepared

Reference: Naive Bayes of machine learning combat-Cloud + Community-Tencent Cloud