Jianshu Unofficial Big Data (3)

Recently I went back to my hometown, so the Jianshu crawler was paused. After returning to Changsha I started it up again and was quite happy to see it reach about 3 million records, but after exporting the data I found a lot of duplicates, even though I remembered setting up deduplication. Then I looked at the code and got dizzy:

The insert writes to the author_infos collection, but the duplicate check is done against author_url, so the two never match. My plan was to deduplicate first and then use the stored URLs to crawl each user's details, but I couldn't get MongoDB to deduplicate and didn't figure it out even after searching Baidu. On top of that, a senior told me I was crawling too few fields, so I have to modify the crawler yet again (I've been crying in the toilet).
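For reference, one common way to clean up an existing collection is to group documents by the field that should be unique and delete all but one per group. This is only a minimal pymongo sketch, assuming the author_infos collection and the url field used in the code below; it is not part of the original crawler:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
author_infos = client['jianshu']['author_infos']

# Group documents by url and collect the _ids that share the same url.
pipeline = [
    {'$group': {'_id': '$url', 'ids': {'$push': '$_id'}, 'count': {'$sum': 1}}},
    {'$match': {'count': {'$gt': 1}}}
]

for group in author_infos.aggregate(pipeline, allowDiskUse=True):
    # Keep the first document and delete the rest of the duplicates.
    author_infos.delete_many({'_id': {'$in': group['ids'][1:]}})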

Code

import requests
from lxml import etree
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
jianshu = client['jianshu']
author_infos = jianshu['author_infos']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Connection':'keep-alive'
}

def get_article_url(url, page):
    # Crawl one page of a collection, ordered by the time articles were added.
    link_view = '{}?order_by=added_at&page={}'.format(url, str(page))
    try:
        html = requests.get(link_view, headers=headers)
        selector = etree.HTML(html.text)
        infos = selector.xpath('//div[@class="name"]')
        for info in infos:
            author_name = info.xpath('a/text()')[0]
            authorurl = info.xpath('a/@href')[0]
            # Only crawl authors whose url is not already stored in author_infos.
            if 'http://www.jianshu.com' + authorurl not in [item['url'] for item in author_infos.find()]:
                # print('http://www.jianshu.com' + authorurl, author_name)
                # author_infos.insert_one({'author_name': author_name, 'author_url': 'http://www.jianshu.com' + authorurl})
                get_all_info('http://www.jianshu.com' + authorurl)
                get_reader_url(authorurl)
        time.sleep(2)
    except requests.exceptions.ConnectionError:
        pass

# get_article_url('http://www.jianshu.com/c/bDHhpK',2)
def get_reader_url(url):
    # Walk up to 99 pages of an author's followers and crawl each follower's profile.
    link_views = ['http://www.jianshu.com/users/{}/followers?page={}'.format(url.split('/')[-1], str(i)) for i in range(1, 100)]
    for link_view in link_views:
        try:
            html = requests.get(link_view, headers=headers)
            selector = etree.HTML(html.text)
            infos = selector.xpath('//li/div[@class="info"]')
            for info in infos:
                author_name = info.xpath('a/text()')[0]
                authorurl = info.xpath('a/@href')[0]
                # print(author_name, authorurl)
                # author_infos.insert_one({'author_name': author_name, 'author_url': 'http://www.jianshu.com' + authorurl})
                get_all_info('http://www.jianshu.com' + authorurl)
        except requests.exceptions.ConnectionError:
            pass
# get_reader_url('http://www.jianshu.com/u/7091a52ac9e5')

def get_all_info(url):
    # Crawl an author's profile page and store the detail fields in MongoDB.
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    try:
        author_name = selector.xpath('//a[@class="name"]/text()')[0]
        author_focus = selector.xpath('//div[@class="info"]/ul/li[1]/div/p/text()')[0]
        author_fans = selector.xpath('//div[@class="info"]/ul/li[2]/div/p/text()')[0]
        author_article = selector.xpath('//div[@class="info"]/ul/li[3]/div/p/text()')[0]
        author_write_amount = selector.xpath('//div[@class="info"]/ul/li[4]/div/p/text()')[0]
        author_get_like = selector.xpath('//div[@class="info"]/ul/li[5]/div/p/text()')[0]
        # The bio can be empty, so fall back to 'None' instead of indexing an empty list.
        author_intrus = selector.xpath('//div[1]/div/div[2]/div[2]/div/text()')
        author_intru = ''.join(author_intrus) if len(author_intrus) != 0 else 'None'
        # Signed authors carry an author-tag badge on their profile page.
        if selector.xpath('//span[@class="author-tag"]'):
            author_title = 'Signed author'
        else:
            author_title = 'Ordinary author'
        infos = {
            'url': url,
            'name': author_name,
            'focus': author_focus,
            'fans': author_fans,
            'article': author_article,
            'write_amount': author_write_amount,
            'get_like': author_get_like,
            'intru': author_intru,
            'title': author_title
        }
        author_infos.insert_one(infos)
    except IndexError:
        pass
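
As for avoiding duplicates in the next run, one option (my own sketch, not something the original code does) is to put a unique index on url and write through an upserting update_one instead of insert_one, so crawling the same author twice simply overwrites the old document:

# Run once: enforce uniqueness of the url field in author_infos.
author_infos.create_index('url', unique=True)

def save_author(infos):
    # Hypothetical helper: insert a new author document, or overwrite the
    # existing document with the same url, so duplicates can no longer pile up.
    author_infos.update_one({'url': infos['url']}, {'$set': infos}, upsert=True)

Inside get_all_info, the final author_infos.insert_one(infos) would then be replaced by save_author(infos).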

That's all for today; this post is mainly a record of my learning process. This article has been registered with Copyright Seal; if you need to reprint it, please go through Copyright Seal. 58708803

Reference: Jianshu Unofficial Big Data (3), Tencent Cloud Developer Community, https://cloud.tencent.com/developer/article/1394186