Jianshu Unofficial Big Data (2)

PS: This article is important. The "big data" in this series is not big data as a buzzword. An article I read a few days ago put it well: data that cannot be processed under your current conditions can be called big data; there is no fixed volume threshold. My crawler ran all night and has collected 1.7 million+ records so far, but this morning I decided the efficiency wasn't good enough. I don't know how to build a distributed crawler yet, so I had to stop and change the code. Careful readers will guess what I want to explain next: resuming a crawl from a breakpoint (do you really want to start over every time it breaks?). What I show today is only a pseudo-breakpoint resume, but it should give you a usable idea.

Crawl popular and city URLs

import requests
from lxml import etree
import pymongo

client = pymongo.MongoClient('localhost', 27017)
jianshu = client['jianshu']
topic_urls = jianshu['topic_urls']

host_url = 'http://www.jianshu.com'
# List pages of popular collections and city collections
hot_urls = ['http://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(1,40)]
city_urls = ['http://www.jianshu.com/recommendations/collections?page={}&order_by=city'.format(str(i)) for i in range(1,3)]

def get_channel_urls(url):
    # Parse one list page of collections and store each collection's URL,
    # article count, and follower count in the topic_urls table.
    html = requests.get(url)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="count"]')
    for info in infos:
        part_url = info.xpath('a/@href')[0]
        article_amounts = info.xpath('a/text()')[0]
        focus_amounts = info.xpath('text()')[0].split('·')[1]
        # print(part_url,article_amounts,focus_amounts)
        topic_urls.insert_one({'topicurl': host_url + part_url,
                               'article_amounts': article_amounts,
                               'focus_amounts': focus_amounts})

# for hot_url in hot_urls:
#     get_channel_urls(hot_url)

for city_url in city_urls:
    get_channel_urls(city_url)

This part of the code crawls the collection URLs and stores them in the topic_urls table. The rest of the crawling details are fairly simple, so I won't go into them.
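To sanity-check what ended up in MongoDB, a small query like the one below works; it assumes the same local MongoDB and the field names used above, and count_documents needs pymongo 3.7 or newer.

import pymongo

client = pymongo.MongoClient('localhost', 27017)
topic_urls = client['jianshu']['topic_urls']

# How many collection URLs have been stored so far (pymongo 3.7+)
print(topic_urls.count_documents({}))

# Peek at a few documents to confirm the field layout
for doc in topic_urls.find().limit(5):
    print(doc['topicurl'], doc['article_amounts'], doc['focus_amounts'])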

Crawl article authors and fans

import requests
from lxml import etree
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
jianshu = client['jianshu']
author_urls = jianshu['author_urls']
author_infos = jianshu['author_infos']

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Connection':'keep-alive'
}

def get_article_url(url, page):
    # Crawl one page of a collection and pick out each article's author.
    link_view = '{}?order_by=added_at&page={}'.format(url, str(page))
    try:
        html = requests.get(link_view, headers=headers)
        selector = etree.HTML(html.text)
        infos = selector.xpath('//div[@class="name"]')
        for info in infos:
            author_name = info.xpath('a/text()')[0]
            authorurl = info.xpath('a/@href')[0]
            # Only new authors are stored and have their followers crawled;
            # authors already in the user table are skipped.
            if 'http://www.jianshu.com' + authorurl in [item['author_url'] for item in author_urls.find()]:
                pass
            else:
                # print('http://www.jianshu.com'+authorurl,author_name)
                author_infos.insert_one({'author_name': author_name,
                                         'author_url': 'http://www.jianshu.com' + authorurl})
                get_reader_url(authorurl)
        time.sleep(2)
    except requests.exceptions.ConnectionError:
        pass

# get_article_url('http://www.jianshu.com/c/bDHhpK',2)
def get_reader_url(url):
    # Crawl up to 99 follower pages for one author and store each follower.
    link_views = ['http://www.jianshu.com/users/{}/followers?page={}'.format(url.split('/')[-1], str(i)) for i in range(1, 100)]
    for link_view in link_views:
        try:
            html = requests.get(link_view, headers=headers)
            selector = etree.HTML(html.text)
            infos = selector.xpath('//li/div[@class="info"]')
            for info in infos:
                author_name = info.xpath('a/text()')[0]
                authorurl = info.xpath('a/@href')[0]
                # print(author_name,authorurl)
                author_infos.insert_one({'author_name': author_name,
                                         'author_url': 'http://www.jianshu.com' + authorurl})
        except requests.exceptions.ConnectionError:
            pass
# get_reader_url('http://www.jianshu.com/u/7091a52ac9e5')

1. Jianshu is fairly crawler-friendly; adding a User-Agent header is enough (but please don't crawl maliciously, and help keep the network healthy).
2. Errors appeared partway through the run, so I added the two try blocks. I had already thought about whether this could happen: if you page past the last page of a Jianshu collection, it automatically jumps to the second page (try it manually), so I set a very large page threshold and didn't expect errors.
3. Also, I don't want to crawl duplicate data, and one user can publish many articles, so I added a check to get_article_url. Roughly: if a crawled URL is already in the user table, don't visit it, don't store it, and don't crawl its followers.
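As a side note, the duplicate check above re-reads the whole author_urls collection for every author, which gets slower as the table grows. A minimal alternative sketch (my own variation, not the original code; save_author is a hypothetical helper): let MongoDB enforce uniqueness with an index and treat a DuplicateKeyError as "already seen".

import pymongo

client = pymongo.MongoClient('localhost', 27017)
author_infos = client['jianshu']['author_infos']

# A unique index makes MongoDB reject repeated author_url values,
# so no full-collection scan is needed before each insert.
author_infos.create_index('author_url', unique=True)

def save_author(author_name, author_url):   # hypothetical helper
    try:
        author_infos.insert_one({'author_name': author_name, 'author_url': author_url})
        return True    # new author: worth crawling their followers
    except pymongo.errors.DuplicateKeyError:
        return False   # already stored: skip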

Run entry

import sys
sys.path.append("..")
from multiprocessing import Pool
from channel_extract import topic_urls
from page_spider import get_article_url

# All collection URLs stored earlier, minus the homepage collection,
# whose articles are mostly pushed there from other collections.
db_topic_urls = [item['topicurl'] for item in topic_urls.find()]
shouye_url = ['http://www.jianshu.com/c/bDHhpK']
x = set(db_topic_urls)
y = set(shouye_url)
rest_urls = x - y

def get_all_links_from(channel):
    # Crawl up to 5000 pages of one collection.
    for num in range(1, 5000):
        get_article_url(channel, num)

if __name__ == '__main__':

    pool = Pool(processes=4)
    pool.map(get_all_links_from, rest_urls)

1. I was still crawling the homepage collection today (I had previously taken num up to 17000, since the homepage has far too many articles). Then I realized that most homepage articles are pushed there from other collections, so there is no need to crawl it: take the set difference to remove the homepage link, then crawl the rest.
2. Why do I call this a pseudo-breakpoint resume? Because the next time the program errors out you still have to start over (unless you change the code). But it does give you an idea: by subtracting the set of things you have already handled, you can crawl only what remains.
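For a rough picture of what a real breakpoint resume could look like, here is a minimal sketch (my own, not the original code; the crawled_pages progress collection is hypothetical): record every finished channel/page pair, and skip it on the next run.

import pymongo
from page_spider import get_article_url   # the function defined above

client = pymongo.MongoClient('localhost', 27017)
crawled_pages = client['jianshu']['crawled_pages']   # hypothetical progress collection

def get_all_links_from(channel):
    for num in range(1, 5000):
        # Skip pages that a previous run already finished.
        if crawled_pages.find_one({'channel': channel, 'page': num}):
            continue
        get_article_url(channel, num)
        # Record progress only after the page has been crawled.
        crawled_pages.insert_one({'channel': channel, 'page': num})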

Reference: Jianshu Unofficial Big Data (2), Tencent Cloud Developer Community, https://cloud.tencent.com/developer/article/1394188