In the heat of summer, swimming and other water activities are what everyone wants to do to cool off. I was planning a rafting trip myself, but I had never been to the scenic spot and did not know whether it would be fun, so I crawled the comment data for this rafting spot on Meituan to see how other visitors found it, as a reference.
I use a crawler to collect the data from Meituan, explore the star ratings and comment times, clean the comment text, draw an overall word cloud plus positive and negative sentiment word clouds, and finally explore the comment topics with an LDA topic model.
Readers who are not interested in data acquisition can skip straight to the analysis part.
Open Meituan, search for rafting, and find the destination you want to visit. Mine is Gao Guohe, which shows 1681 comments:
Click through to the comments, open the browser's developer tools, and turn a couple of pages. The comments are loaded asynchronously, and the network panel shows which request carries the data:
Check the request headers and parameters of this request. It is a GET request, and the offset parameter in the URL controls paging with a step of 10, so the request links can be constructed from the total number of comments. I was logged in, so the request headers carry my cookie; there are no other parameters:
The data is returned as JSON, so it only needs to be parsed to extract the required fields. The code:
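As a rough illustration of the paging described above, the following sketch builds one URL per page by advancing offset in steps of 10. The endpoint shown is a placeholder, since the real request URL is not reproduced in this article, and the function name is made up for the example. No request is sent; a real run would also attach the logged-in cookie in the request headers.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the real comment URL is not given here.
BASE_URL = "https://example.com/comments"


def build_page_urls(total_comments, step=10, base_url=BASE_URL):
    """Return one URL per page, advancing `offset` by `step` each time."""
    urls = []
    for offset in range(0, total_comments, step):
        query = urlencode({"offset": offset})
        urls.append(f"{base_url}?{query}")
    return urls


urls = build_page_urls(1681)
print(len(urls))   # number of pages to request
print(urls[0])     # first page
print(urls[-1])    # last page
```

Deriving the page list from the total count keeps the crawl bounded, instead of requesting pages until an empty response comes back.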
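A minimal parsing sketch might look like the following. The key names (comments, comment, star, commentTime) are assumptions made for illustration; the actual structure of the response has to be read from the developer tools.

```python
import json


def extract_comments(raw_json):
    """Pull text, star rating and timestamp out of one JSON page.

    All key names below are hypothetical; adjust them to the real response.
    """
    payload = json.loads(raw_json)
    rows = []
    for item in payload.get("comments", []):
        rows.append({
            "comment": item.get("comment"),
            "star": item.get("star"),
            "comment_time": item.get("commentTime"),
        })
    return rows


# Toy response used only to exercise the function.
sample = json.dumps({
    "comments": [
        {"comment": "好玩", "star": 50, "commentTime": 1594771200000},
        {"comment": None, "star": 40, "commentTime": 1464739200000},
    ]
})
rows = extract_comments(sample)
print(rows)
```

Using .get() keeps the parser from failing on comments that lack a field, which matters here because many comments turn out to have no text.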
The result is shown in the figure:
Comment time and star rating distribution
The comment time is stored as a timestamp, which needs to be converted to year, month, and day. The number of comments per year is then plotted as a time series:
The figure above shows that this rafting attraction was listed on Meituan in 2016. The number of comments has grown year by year, which suggests that the number of visitors has grown as well. The count drops in 2020 only because the data runs through July of that year.
Next, look at the distribution of comments by month and by star rating:
Comments are concentrated in June through August, the hottest part of summer, and ratings are concentrated between 4 and 5 points.
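The conversion and the yearly count can be sketched as below, assuming the timestamps are in milliseconds and interpreting them in UTC; both are assumptions, and a local time zone would shift dates near midnight. The sample values are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def to_date(ts_ms):
    """Convert a millisecond timestamp (assumed unit) to a UTC calendar date."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()


def comments_per_year(timestamps_ms):
    """Count how many comments fall in each calendar year."""
    return Counter(to_date(ts).year for ts in timestamps_ms)


# Toy timestamps, not taken from the crawled data.
sample = [1464739200000, 1594771200000, 1594857600000]
print([to_date(ts).isoformat() for ts in sample])
print(sorted(comments_per_year(sample).items()))
```

The same per-date values can be grouped by month instead of year to get the monthly distribution.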
Some users may have spammed the comments maliciously, producing comments with almost identical content. Other comments are very similar but differ in wording, and removing those risks deleting genuine comments by mistake, so only exact duplicates are deleted:
Next, check for missing values. I have 1680 records, but only a little over 900 are non-empty. This turned out not to be a crawler problem: the original Meituan data simply has no text for the remaining comments. The attraction page lists 1680 comments, but only 952 actually have content, so the empty ones are deleted:
The comment text also contains characters that carry no meaning for the analysis, such as letters and digits, which need to be stripped. Because this is sentiment analysis, high-frequency words such as 'Meituan', 'rafting' and 'scenic spot' that are mixed into the data also need to be deleted:
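The two cleaning steps above, dropping exact duplicates and dropping empty comments, can be sketched in plain Python as follows. The function name and the sample list are illustrative only.

```python
def clean_comments(comments):
    """Drop missing or blank comments and exact duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for text in comments:
        if text is None or not text.strip():
            continue  # missing or blank comment
        if text in seen:
            continue  # exact duplicate of an earlier comment
        seen.add(text)
        kept.append(text)
    return kept


sample = ["好玩", None, "", "好玩", "好玩!", "  "]
print(clean_comments(sample))
```

Only byte-for-byte identical comments are treated as duplicates, so near-duplicates with different wording are kept, in line with the choice described above.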
Next come word segmentation, part-of-speech tagging, and stop-word removal. The stop words are provided in stoplist.txt. The part-of-speech tag x marks punctuation, and those tokens are deleted. The final result has four columns: the id of the comment each word belongs to, the word, its part of speech, and the position of the word within its comment:
Extract nouns and adjectives. The goal is to analyze the visitor experience, and a comment is only meaningful when it contains clear nouns and adjectives, which is why part-of-speech tagging was carried out; n stands for nouns and adj for adjectives. First select the rows that contain a noun or adjective and take their comment index, then use that index to select all the words of those comments from the merged result above:
Draw a word cloud to check the segmentation result:
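The four-column result can be sketched as follows. To stay self-contained, the example starts from comments that are already segmented into (word, tag) pairs, such as a tagger like jieba.posseg would yield; the tags in the sample are illustrative, and loading stoplist.txt is replaced by a small inline set. Counting positions over the kept words only is an assumption.

```python
def build_word_table(tagged_comments, stopwords):
    """Build rows of (comment_id, word, pos_tag, position_in_comment).

    Punctuation (tag "x") and stop words are dropped; positions are counted
    over the words that remain (an assumption made for this sketch).
    """
    rows = []
    for comment_id, tagged in enumerate(tagged_comments):
        position = 0
        for word, tag in tagged:
            if tag == "x" or word in stopwords:
                continue
            rows.append((comment_id, word, tag, position))
            position += 1
    return rows


# Toy pre-tagged comments; a real run would produce these with a segmenter.
tagged = [
    [("水", "n"), ("很", "d"), ("凉", "a"), ("，", "x")],
    [("的", "uj"), ("好玩", "a")],
]
stopwords = {"的", "很"}  # stands in for the contents of stoplist.txt
table = build_word_table(tagged, stopwords)
for row in table:
    print(row)
```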
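The selection can be sketched over rows of (comment_id, word, tag, position). The default tag prefixes are an assumption based on jieba-style tags, where noun tags start with n and adjective tags start with a; adapt them to the tag set actually in use.

```python
def keep_comments_with_pos(rows, prefixes=("n", "a")):
    """Keep every word of each comment that has at least one matching tag.

    `rows` holds (comment_id, word, pos_tag, position) tuples.
    """
    wanted_ids = {cid for cid, _word, tag, _pos in rows if tag.startswith(prefixes)}
    return [row for row in rows if row[0] in wanted_ids]


rows = [
    (0, "水", "n", 0),
    (0, "去", "v", 1),
    (1, "去", "v", 0),
]
print(keep_comments_with_pos(rows))
```

Note that the filter works per comment, not per word: once a comment qualifies, all of its words are kept, including the verbs.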
The figure shows that, after preprocessing, the segmentation result is broadly in line with expectations. Words such as "stimulating", "good" and "fun" appear frequently, which gives a first indication that visitors had a good experience.
Positive and negative sentiment
Sentiment analysis starts with matching sentiment words against a sentiment lexicon, which I also provide. The lexicon is the "word set for sentiment analysis" released by HowNet (知网) in 2007, mainly its 'Chinese positive evaluation', 'Chinese positive sentiment', 'Chinese negative sentiment' and 'Chinese negative evaluation' word lists. 'Chinese positive evaluation' and 'Chinese positive sentiment' are merged into the positive lexicon, with each word given an initial weight of 1. 'Chinese negative sentiment' and 'Chinese negative evaluation' are merged into the negative lexicon, with each word given an initial weight of -1. The lexicon is then tuned by adding some words on top of the provided lists. The code for matching sentiment words:
Chinese allows multiple negation: an odd number of negation words means negation, and an even number means affirmation. Following Chinese usage, check the two words before each sentiment word; if they contain an odd number of negation words, flip the word to the opposite sentiment:
After this correction, extract the positive and negative sentiment words:
Draw word clouds of the positive and negative sentiment words. The upper picture is positive and the lower one is negative:
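A minimal sketch of the merging and matching might look like this. The tiny word lists are stand-ins for the four lexicon files, not their actual contents, and the function names are made up for the example.

```python
def build_lexicon(positive_lists, negative_lists):
    """Merge word lists into one word -> weight mapping (+1 positive, -1 negative).

    If a word appears on both sides, the negative weight wins here simply
    because it is written last; that tie-break is an arbitrary choice.
    """
    lexicon = {}
    for words in positive_lists:
        for word in words:
            lexicon[word] = 1
    for words in negative_lists:
        for word in words:
            lexicon[word] = -1
    return lexicon


def match_sentiment(words, lexicon):
    """Attach a weight to every word; words not in the lexicon get 0."""
    return [(word, lexicon.get(word, 0)) for word in words]


# Toy lists standing in for the evaluation and sentiment word lists.
lexicon = build_lexicon([["好玩", "值得"], ["喜欢"]], [["贵"], ["坑"]])
print(match_sentiment(["门票", "贵", "好玩"], lexicon))
```

Extra domain words can be added by passing one more list on the positive or negative side.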
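The odd/even rule can be sketched as below. The set of negation words is an assumption, since the list used in this project is not shown, and the words and weights are passed in as parallel lists for simplicity.

```python
# Assumed negation words; the list used in the article is not shown.
NEGATIONS = {"不", "没", "没有", "别", "无"}


def adjust_for_negation(words, weights, window=2):
    """Flip a sentiment weight when an odd number of negation words
    appears among the `window` words immediately before it."""
    adjusted = list(weights)
    for i, weight in enumerate(weights):
        if weight == 0:
            continue  # not a sentiment word
        preceding = words[max(0, i - window):i]
        negation_count = sum(1 for word in preceding if word in NEGATIONS)
        if negation_count % 2 == 1:
            adjusted[i] = -weight
    return adjusted


print(adjust_for_negation(["不", "好玩"], [0, 1]))          # single negation
print(adjust_for_negation(["没有", "不", "好玩"], [0, 0, 1]))  # double negation
```

Because only the two preceding words are inspected, a negation further back in the sentence does not affect the result.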
In the positive word cloud, words such as "good", "like", "worth" and "stimulating" appear most frequently, and no negative sentiment words are mixed in.
In the negative word cloud, words such as "refueling", "not good", "expensive" and "pit" appear most frequently, so the negative sentiment is extracted well.
However, "stimulating" appears in both pictures, which indicates that visitors found the ride stimulating whether their overall comment was positive or negative.
LDA topic model
If a document covers multiple topics, certain words that characterize each topic will appear repeatedly. A topic model can uncover these patterns of word use and link texts that share them, which helps extract useful information from unstructured text collections.
With the LDA topic model, the latent topics in the data set can be mined, and the focus of the data set and its related feature words can then be analyzed.
First, build a dictionary and corpus, then search for the best number of topics by checking the average cosine similarity between topics for each candidate number. In this project, the average similarity reaches its minimum when the number of topics is 3:
In the final result, each list represents one topic and contains the 10 most probable words of that topic. The upper picture is positive and the lower one is negative:
Based on the topics above, the characteristics of this rafting spot can be summarized:
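The selection criterion itself can be sketched independently of any LDA library. The example assumes that each candidate topic count has already been fitted and that every topic is available as a vector of word weights over a shared vocabulary; the toy vectors below are made up, and the function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_cosine(topic_vectors):
    """Average cosine similarity over all pairs of topics."""
    pairs = list(combinations(topic_vectors, 2))
    if not pairs:
        return 1.0  # a single topic has nothing to be distinct from
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def best_topic_count(vectors_by_k):
    """Pick the topic count whose topics are least similar on average."""
    return min(vectors_by_k, key=lambda k: mean_pairwise_cosine(vectors_by_k[k]))


# Toy topic-word vectors keyed by candidate topic count (not real model output).
vectors_by_k = {
    2: [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]],
    3: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
for k, vectors in vectors_by_k.items():
    print(k, round(mean_pairwise_cosine(vectors), 3))
print(best_topic_count(vectors_by_k))
```

A lower average similarity means the topics overlap less, which is why the minimum is taken as the preferred number of topics.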