Before learning the naive Bayes classification model, let's review the KNN and decision tree algorithms covered earlier. My own summary: different machine learning methods rest on different assumptions and theories, and those assumptions and theories largely determine each algorithm's strengths and weaknesses.
KNN: assumes that samples of the same class cluster together in the sample space, i.e., their distances are small. Under this assumption, we only need to compute the distance between the test sample and the training samples; the class of the nearest training sample is, to a large extent, the class of the test sample.
Decision tree: based on information theory. The sample data starts out disordered (high entropy), but fortunately we have features. Using the features effectively so that the samples become orderly and separable is what the decision tree needs to accomplish.
The Naive Bayes algorithm introduced today is based on Bayes' rule. So why add "naive"? Because without a simplifying assumption the computation simply can't be done: Naive Bayes assumes that the features are independent of one another and equally important.
The figure shows our sample data set.
We now use p1(x,y) to denote the probability that the data point (x,y) belongs to category 1 (the category represented by the dots in the figure), and p2(x,y) to denote the probability that (x,y) belongs to category 2 (the category represented by the triangles). For a new data point (x,y), the following rule determines its category:
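Stated as a rule: if p1(x,y) > p2(x,y), the point (x,y) belongs to category 1; if p1(x,y) < p2(x,y), it belongs to category 2.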
In other words, we choose the category with the higher probability. That is the core idea of the algorithm. But how do we compute p1(x,y) and p2(x,y)? For that we need conditional probability and Bayes' rule.
Conditional probability is the probability that A occurs given that B has occurred, written P(A|B). Let's look at the example from the book:
A jar contains 7 stones, of which 3 are white and 4 are black. The probability of drawing a white stone is 3/7, and the probability of drawing a black stone is 4/7. So far, so simple.
If the stones are split between two buckets, as shown in the figure, what is the probability of drawing a white stone given that we draw from bucket B? Clearly it is 1/3, which can be read off the figure, but when the problem gets more complicated it is hard to work out directly. This is where the conditional probability formula comes in:
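In symbols, the standard definition is

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$

Applied to this example, assuming (as in the book's figure) that bucket B holds 3 of the 7 stones, 1 white and 2 black:

$$
P(\text{white} \mid \text{bucket } B) = \frac{P(\text{white} \cap \text{bucket } B)}{P(\text{bucket } B)} = \frac{1/7}{3/7} = \frac{1}{3}
$$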
Another way to compute a conditional probability is Bayes' rule:
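In its standard form:

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
$$

This lets us compute P(A|B) from P(B|A) when the latter is easier to estimate.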
We continue to look at the previous case:
Here we switch to conditional probability notation, p(c1|x,y) and p(c2|x,y). These are hard to estimate directly in practice, so we apply Bayes' rule:
This becomes:
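Applying Bayes' rule, each posterior probability is expressed through quantities we can estimate from the training data:

$$
p(c_i \mid x, y) = \frac{p(x, y \mid c_i)\,p(c_i)}{p(x, y)}
$$

and the decision rule becomes: if p(c1|x,y) > p(c2|x,y), assign the point to category c1; otherwise assign it to category c2.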
Take the message board of an online community as an example, with insulting posts labelled 1 and normal posts labelled 0. I created the data myself.
```python
def loadDataSet():
    """
    Create a data set
    :return: word list postingList, category labels classVec
    """
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],   # [0,0,1,1,1...]
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec
```
Computers can't work with raw text directly, so we need to convert the text into numbers. How is the conversion done?
For example, take two short pieces of text: "I love China" and "I don't like to eat apples".
First segment the text into words, then build a non-repeating list of them (the vocabulary):
I, love, China, don't like, eat apples
Then, against this vocabulary, each piece of text is vectorized: a word that appears gets the value 1, otherwise 0. The two texts are thus converted into the two word vectors shown below.
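A minimal sketch of the idea, assuming the token order listed above (the helper functions defined next do the same thing for the full data set):

```python
# Token order assumed to follow the vocabulary listed above.
vocab = ['I', 'love', 'China', "don't like", 'eat apples']
text1 = ['I', 'love', 'China']               # "I love China"
text2 = ['I', "don't like", 'eat apples']    # "I don't like to eat apples"

vec1 = [1 if word in text1 else 0 for word in vocab]  # [1, 1, 1, 0, 0]
vec2 = [1 if word in text2 else 0 for word in vocab]  # [1, 0, 0, 1, 1]
```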
The following code builds the vocabulary and converts a document into a word vector:
```python
def createVocabList(dataSet):
    """Build the vocabulary: the set of all unique words across the documents."""
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Convert a document into a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec
```
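A quick sanity check of these helpers on the data set from loadDataSet() (the vocabulary order comes from a set, so it varies from run to run):

```python
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(len(myVocabList))                            # 32 unique words in this data set
print(setOfWords2Vec(myVocabList, listOPosts[0]))  # a 0/1 vector of length 32
```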
Because there are multiple features, the formula is written in terms of a word vector w:
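With w denoting the whole word vector, Bayes' rule reads

$$
p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\,p(c_i)}{p(\mathbf{w})}
$$

and the "naive" independence assumption lets us factor the likelihood into per-word terms, which is exactly what the training code estimates:

$$
p(\mathbf{w} \mid c_i) = p(w_0 \mid c_i)\,p(w_1 \mid c_i)\cdots p(w_N \mid c_i)
$$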
The training code is shown below:
```python
from numpy import *

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # P(c1): fraction of abusive documents
    # Initialize counts to 1 and denominators to 2 (Laplace-style smoothing),
    # so a word never seen in one class does not force the whole product to 0.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs to avoid underflow when many small probabilities are multiplied.
    p1Vect = log(p1Num / p1Denom)  # log P(w_j | c1)
    p0Vect = log(p0Num / p0Denom)  # log P(w_j | c0)
    return p0Vect, p1Vect, pAbusive
```
Finally, test the algorithm:
```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log P(w|c1) + log P(c1): the (log of the) numerator of Bayes' rule for class 1
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    # log P(w|c0) + log P(c0): the (log of the) numerator of Bayes' rule for class 0
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```
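A minimal way to run the test, with the predictions this small data set should yield shown as comments:

```python
if __name__ == '__main__':
    testingNB()
    # ['love', 'my', 'dalmation'] classified as: 0
    # ['stupid', 'garbage'] classified as: 1
```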