Python data analysis of Qiushibaike (the "Embarrassing Encyclopedia"), part two

Last time I analyzed the Qiushibaike joke table. Today I will analyze another table: the user information table.

Data preprocessing

  • Import Data
import pandas as pd
import pymongo
import matplotlib.pyplot as plt
# Jupyter magic: render figures inside the notebook
%matplotlib inline

# Connect to the local MongoDB instance and open the qiushi database
client = pymongo.MongoClient('localhost', port=27017)
qiushi = client['qiushi']

# data1: joke information
qiushi_info = qiushi['qiushi_info']
data1 = pd.DataFrame(list(qiushi_info.find()))

# data2: user information
user_info = qiushi['user_info']
data2 = pd.DataFrame(list(user_info.find()))

data1 holds the joke information and data2 holds the user information. Both tables have a user_url column, so we can merge them on it.
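
One detail worth knowing before merging: MongoDB adds an `_id` field to every document, so both frames carry an `_id` column, and a merge on user_url will rename them `_id_x` and `_id_y`. A minimal sketch of loading documents while dropping that field follows. The helper name `records_to_frame` and the sample documents are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def records_to_frame(records, drop=("_id",)):
    """Build a DataFrame from an iterable of dict-like documents.

    `records` can be a pymongo cursor or a plain list of dicts.
    Columns named in `drop` are removed if they are present.
    """
    frame = pd.DataFrame(list(records))
    return frame.drop(columns=[c for c in drop if c in frame.columns])

# Made-up documents standing in for the result of `collection.find()`
sample_docs = [
    {"_id": 1, "user_url": "/users/1/", "content": "joke A"},
    {"_id": 2, "user_url": "/users/2/", "content": "joke B"},
]
sample_frame = records_to_frame(sample_docs)
```

With a live connection, the same helper could presumably be called as `records_to_frame(qiushi_info.find())`.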

  • Merge
all_data = pd.merge(data1,data2,on='user_url')
all_data
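  • De-duplication
Some prolific users have posted several jokes, so they appear more than once after the merge. We de-duplicate on the user id to keep one row per user.
# .copy() returns an independent DataFrame, which avoids a possible SettingWithCopyWarning when a column is added later
data3 = all_data.drop_duplicates(['id']).copy()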
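
To see what the merge and de-duplication steps do, here is a small sketch on toy data. The column names user_url, id and constellation follow the article, while the rows are invented. The `validate="many_to_one"` argument is an optional safety check, not something the original code uses: it makes pandas raise an error if the user table contains the same user_url twice.

```python
import pandas as pd

# Toy joke table: user 1 has posted two jokes
jokes = pd.DataFrame({
    "user_url": ["/users/1/", "/users/1/", "/users/2/"],
    "content": ["joke A", "joke B", "joke C"],
})
# Toy user table: user 3 has no joke in the joke table
users = pd.DataFrame({
    "user_url": ["/users/1/", "/users/2/", "/users/3/"],
    "id": [101, 102, 103],
    "constellation": ["Libra", "Aries", "Leo"],
})

# Inner join (the default): only user_url values present in both tables survive
merged = pd.merge(jokes, users, on="user_url", validate="many_to_one")

# Keep the first row for each user id
unique_users = merged.drop_duplicates(subset="id").copy()
```

Here `merged` has one row per joke, so user 101 appears twice, and `unique_users` keeps a single row for each of the two users who posted.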

Zodiac sign distribution of joke writers

I covered the numeric fields last time. This time I am mainly interested in the zodiac signs and regions of the joke writers, so those are what I analyze today. You can analyze the other dimensions in the same way.

# Count users per zodiac sign
xingzuo = data3.groupby('constellation').size()

plt.figure(figsize=(10, 6), dpi=80)
labels = list(xingzuo.index)
sizes = list(xingzuo)
plt.xlabel('Constellation')
plt.ylabel('Number of users')
plt.title('Qiushibaike users by zodiac sign')
plt.bar(range(len(sizes)), sizes, tick_label=labels, color='#99CC01', alpha=0.7)  # alpha sets the transparency
plt.grid(color='#95a5a6', linestyle='--', linewidth=1, axis='y', alpha=0.6)
plt.legend(['Number of users'])
# Write each count above its bar
for x, y in zip(range(len(sizes)), sizes):
    plt.text(x, y, y, ha='center', va='bottom')

Leaving aside the users whose sign is unknown, Libra has the most users and Aries the fewest.

Libras are said to pursue peace and harmony. They are good conversationalists, and strong communication skills are their greatest strength. Their biggest weakness is often indecision, and they can be prone to imposing their own ideas on others, which is something Libras should watch out for.

Aries is like a child: straightforward, passionate and impulsive, but also very self-centered and childish.
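
The most and least common signs can also be read off programmatically instead of from the chart. The sketch below uses invented data and assumes that missing signs are stored under the label "unknown"; the actual label in the crawled data may differ.

```python
import pandas as pd

# Invented sample of the constellation column
signs = pd.Series(
    ["Libra", "Libra", "Libra", "Leo", "Leo", "Aries",
     "unknown", "unknown", "unknown", "unknown"],
    name="constellation",
)

# Count users per sign, sorted from most to least frequent
counts = signs.value_counts()

# Ignore the placeholder label (assumed name) before ranking
known = counts.drop("unknown", errors="ignore")

most_common = known.idxmax()
least_common = known.idxmin()
```

On the real data, `signs` would be replaced by `data3['constellation']`.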

Regional distribution of joke writers

The location field contains both the province and the city. We only need the province, so we extract it here. This step could also have been done while crawling.

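# Column -6 holds the location as "province·city"; keep only the province part
list_1 = []
for i in range(len(data3)):  # iterate over every row instead of a hard-coded 273
    list_1.append(data3.iat[i, -6].split('·')[0])
data3['province'] = list_1
data3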
# Count users per province
sheng = data3.groupby('province').size()
plt.figure(figsize=(20, 6), dpi=80)
labels = list(sheng.index)
sizes = list(sheng)
plt.xlabel('Province')
plt.ylabel('Number of users')
plt.title('Qiushibaike users by province')
plt.bar(range(len(sizes)), sizes, tick_label=labels, color='#99CC01', alpha=0.7)  # alpha sets the transparency
plt.grid(color='#95a5a6', linestyle='--', linewidth=1, axis='y', alpha=0.6)
plt.legend(['Number of users'])
# Write each count above its bar
for x, y in zip(range(len(sizes)), sizes):
    plt.text(x, y, y, ha='center', va='bottom')
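
The row-by-row loop used to build the province column can also be written with pandas string methods, which skip missing values instead of raising an error. This is only a sketch: the column name `location` is hypothetical (the article addresses the column by position), the rows are invented, and the separator is assumed to be a middle dot.

```python
import pandas as pd

# Invented user rows; "location" is a hypothetical column name
users = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "location": ["Guangdong · Shenzhen", "Beijing · Chaoyang",
                 "Guangdong · Guangzhou", None],
})

# Split on the middle dot, keep the first part, trim surrounding spaces.
# Rows with a missing location stay missing.
users["province"] = users["location"].str.split("·").str[0].str.strip()

# groupby drops missing keys by default, so only known provinces are counted
province_counts = users.groupby("province").size()
```

Because the column is addressed by name, this version does not break if the column order changes.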

The chart shows which provinces produce the most joke writers. We could also call the Baidu API to get the latitude and longitude of each province, and then draw the distribution on a map with BDP.

Summary

These two case studies walk through the basic process of data analysis in Python:

  • Data import
  • Data preprocessing
  • Data integration
  • Data visualization
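
For the visualization step, both charts in this article use the same bar-chart code with different labels. One way to avoid the repetition is a small helper like the sketch below. The function name `plot_counts` and the demo data are assumptions for illustration, and the Agg backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

def plot_counts(counts, xlabel, title, figsize=(10, 6)):
    """Draw a labelled bar chart from a Series of counts and return the Axes."""
    fig, ax = plt.subplots(figsize=figsize, dpi=80)
    positions = range(len(counts))
    ax.bar(positions, counts.values, tick_label=list(counts.index),
           color="#99CC01", alpha=0.7, label="Number of users")
    ax.grid(color="#95a5a6", linestyle="--", linewidth=1, axis="y", alpha=0.6)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of users")
    ax.set_title(title)
    ax.legend()
    # Write each count above its bar
    for x, y in zip(positions, counts.values):
        ax.text(x, y, str(y), ha="center", va="bottom")
    return ax

# Invented counts, standing in for a groupby(...).size() result
demo_counts = pd.Series({"Aries": 1, "Leo": 2, "Libra": 3})
demo_ax = plot_counts(demo_counts, "Constellation", "Demo chart")
```

With such a helper, each chart would reduce to a single call that passes the grouped counts, an axis label and a title.
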
Reference: https://cloud.tencent.com/developer/article/1394182 Python data analysis: the second part of the Encyclopedia of Embarrassment-Cloud + Community-Tencent Cloud