Python Is Interesting | Chinese Text Sentiment Analysis

Preface

In the previous article I laid out a learning path for Python machine learning. But all talk and no practice gets you nowhere, so this time Luo Luopan will walk you through a complete machine learning project: Chinese text sentiment analysis. Here is what we will cover today.

Data situation and processing

Data situation

The data here is review data from Dianping (provided by Teacher Wang Shuyi), consisting mainly of the comment text and a star rating. First we read in the data and take a look at it:

import numpy as np
import pandas as pd
data = pd.read_csv('data1.csv')
data.head()
Sentiment labeling

Looking at the unique values of the star field, the ratings are 1, 2, 4, and 5.
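
For reference, a quick way to run this check in pandas (this line is not part of the original code) is:

# list the distinct ratings; per the text above they are 1, 2, 4 and 5
data.star.unique()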

Chinese text sentiment analysis is a classification problem (negative vs. positive). Since we already have the star ratings, we write code to label ratings below 3 as negative (0) and ratings above 3 as positive (1).

Define a function and create the new column with the apply method (a standard data analysis technique):

def make_label(star):
    if star > 3:
        return 1
    else:
        return 0

data['sentiment'] = data.star.apply(make_label)

Toolkit (snownlp)

First of all, instead of machine learning methods, we use a third-party library, snownlp, which can perform sentiment analysis on text directly (remember to install it first). It is very simple to use and returns the probability that the text is positive.

from snownlp import SnowNLP

text1 = 'This thing is good'
text2 = 'This thing is rubbish'

s1 = SnowNLP(text1)
s2 = SnowNLP(text2)

print(s1.sentiments, s2.sentiments)
# result 0.8623218777387431 0.21406279508712744

We then set a threshold: a sentiment score of at least 0.6 counts as positive (1), otherwise negative (0), and apply it to every comment in the same way:

def snow_result(comment):
    # sentiment score of at least 0.6 counts as positive
    s = SnowNLP(comment)
    if s.sentiments >= 0.6:
        return 1
    else:
        return 0

data['snlp_result'] = data.comment.apply(snow_result)

Looking at the first five rows, the results don't look great (only 2 out of 5 are correct). So how many are correct overall? We can compare snownlp's result with the sentiment field, count the rows where they match, and divide by the total number of samples to get the approximate accuracy.

counts = 0
for i in range(len(data)):
    # column 2 is our sentiment label, column 3 is snownlp's result
    if data.iloc[i, 2] == data.iloc[i, 3]:
        counts += 1

print(counts / len(data))
# result 0.763

Naive Bayes

The result from the third-party library is not particularly good (0.763), and this approach has a major drawback: it is not tailored to our scenario.

What does that mean? Language is used differently in different scenarios: an approach that works well for product reviews may not carry over to, say, blog comments.

Therefore, we need to train our own model for this scenario. This article uses sklearn to implement a naive Bayes model (the underlying principle will be explained in a later article).

The approximate process is:

  • Import Data
  • Split data
  • Data preprocessing
  • Training model
  • Test model
jieba word segmentation

First, we segment the comment data. Why do we need word segmentation? Chinese is different from English: an English sentence such as "i love python" already separates its words with spaces, whereas a Chinese sentence is one unbroken string of characters. A sentence meaning "I like programming" therefore has to be cut into I / like / programming (separated by spaces). This is mainly to prepare for the word vectors later.

import jieba

def chinese_word_cut(mytext):
    # join the segmented words with spaces so the vectorizer can split them later
    return " ".join(jieba.cut(mytext))

data['cut_comment'] = data.comment.apply(chinese_word_cut)
Divide the data set

A classification problem needs X (features) and y (labels). Here the segmented comments are X and the sentiment labels are y. We split them into a training set and a test set at a ratio of 8:2.

X = data['cut_comment']
y = data.sentiment

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
Word vector (data processing)

Computers cannot understand text, only numbers, so we have to turn the text into numbers. The simplest way is the word vector (bag-of-words) representation. What is a word vector? Let's illustrate with a small example. Here is our text:

I love the dog
I hate the dog 

The word vectors are processed like this (one column per word, one row per sentence):

                   I   love   hate   the   dog
I love the dog     1    1      0      1     1
I hate the dog     1    0      1      1     1

Simply put, we list every word that appears in the whole corpus as a column, then map each row of data onto these columns: a word that appears gets a 1, a word that does not gets a 0. The text is thus converted into a sparse 0/1 matrix. (This is also why we segmented the Chinese text above: each segmented word becomes one column.)
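
As a minimal sketch of this idea (the snippet below is illustrative and not part of the original article), sklearn's CountVectorizer, which we use in the next step, can build exactly this kind of matrix for the two English sentences above:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ['I love the dog', 'I hate the dog']

# keep one-letter words such as "I"; the default token pattern would drop them
demo_vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
matrix = demo_vect.fit_transform(corpus)

# one column per word, one row per sentence (0 = absent, 1 = present)
# on newer sklearn versions, use demo_vect.get_feature_names_out() instead
print(pd.DataFrame(matrix.toarray(), columns=demo_vect.get_feature_names()))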

Fortunately, sklearn provides exactly this for us. Commonly used parameters of CountVectorizer:

  • max_df: tokens that appear in more than this proportion of documents are removed (too common to be discriminative).
  • min_df: tokens that appear in fewer than this number of documents are removed (too rare).
  • token_pattern: a regular expression controlling which tokens are kept; here it is mainly used to filter out numbers and punctuation.
  • stop_words: a stop word list; these words (mostly function words and particles) are not counted. It must be a list-like structure, so a function is defined in the code to load the stop word file.
from sklearn.feature_extraction.text import CountVectorizer

def get_custom_stopwords(stop_words_file):
    # the stop word file is a Chinese word list, so read it as utf-8
    with open(stop_words_file, encoding='utf-8') as f:
        stopwords = f.read()
    stopwords_list = stopwords.split('\n')
    custom_stopwords_list = [i for i in stopwords_list]
    return custom_stopwords_list

stop_words_file = 'HIT stop word list.txt'
stopwords = get_custom_stopwords(stop_words_file)

vect = CountVectorizer(max_df=0.8,
                       min_df=3,
                       token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                       stop_words=frozenset(stopwords))

If you want to see what the transformed data actually looks like, you can check it with the following code:

test = pd.DataFrame(vect.fit_transform(X_train).toarray(), columns=vect.get_feature_names())
# note: on newer sklearn versions, use vect.get_feature_names_out() instead
test.head()
Training model

Training the model is very simple. Using the naive Bayes algorithm, the training accuracy is 0.899, which is much better than snownlp above.

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

X_train_vect = vect.fit_transform(X_train)
nb.fit(X_train_vect, y_train)
train_score = nb.score(X_train_vect, y_train)
print(train_score)

# result 0.899375
Test Data

Of course, we need the test set to verify the model. The test accuracy is 0.8275, which is still quite good.

X_test_vect = vect.transform(X_test)
print(nb.score(X_test_vect, y_test))

# result 0.8275

We can also write the model's predictions back into the data DataFrame:

X_vec = vect.transform(X)
nb_result = nb.predict(X_vec)
data['nb_result'] = nb_result
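
As a usage sketch that is not in the original article, the trained model can also score a brand-new comment; the review text below is made up for illustration:

# a hypothetical new review (not from the dataset)
new_comment = '味道不错，服务态度也很好'

# segment it exactly like the training data, then vectorize with the fitted vect
new_vec = vect.transform([chinese_word_cut(new_comment)])

# 1 = positive, 0 = negative
print(nb.predict(new_vec))
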
Discussion and deficiencies

  • Small sample size
  • The model is not tuned
  • No cross-validation (a minimal sketch of adding it is shown below)
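
As a minimal sketch of how cross-validation could be added (assuming the stopwords, X, and y defined above; this snippet is not part of the original article), a pipeline refits the vectorizer inside every fold so no information leaks from the validation folds:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# same vectorizer settings as above, wrapped in a pipeline with the classifier
pipe = make_pipeline(
    CountVectorizer(max_df=0.8,
                    min_df=3,
                    token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                    stop_words=frozenset(stopwords)),
    MultinomialNB())

# 5-fold cross-validated accuracy on the segmented comments and labels
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
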
Reference: https://cloud.tencent.com/developer/article/1411334 (Python Interesting | Chinese Text Sentiment Analysis, Tencent Cloud Community)