Crawler | A hundred lines of code to crawl 145K Douban book records


Results of the first crawling run:

Some screenshots in the database

Hands-on

Import the libraries

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from urllib.parse import urlencode
import pymongo
import numpy as np
import time
from faker import Faker

Analyze page requests

Analyze the target page

Open the browser's developer tools and inspect the request links.

Click any tag and compare the page requests for different tag pages. The request links follow this rule:

tag_url = '' + the `href` of each `<a>` tag on the tag index page

From this, we can build the following code to get all the tag links on the tag page:

# Parse the tag index page and splice together all tag page links
def splice_tags_indexhtml(html):
    url = ''  # base URL of the site (elided in the original)
    book_tags = []
    tags_url = []
    soup = BeautifulSoup(html, 'lxml')
    # CSS-select every <a> link inside the tag table
    tagurl_lists = soup.select('#content > div > div.article > div > div > table > tbody > tr > td > a')
    for tag_url in tagurl_lists:
        # Collect the href of every tag link
        book_tags.append(tag_url.attrs['href'])
    for book_tag in book_tags:
        # Each entry is wrapped in a list; get_tag_page later reads tag_url[0]
        tags_url.append([url + book_tag])
    return tags_url
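The splicing above simply concatenates a base URL with each `href`. A safer alternative, sketched here with the standard library, is `urljoin` (the `base` value and hrefs below are made-up illustrations, not from the original):

```python
from urllib.parse import urljoin

# Hypothetical base URL of a tag index page (illustrative value)
base = 'https://example.com/tag/'

# Relative hrefs as they might appear in the tag table's <a> tags
hrefs = ['/tag/fiction', '/tag/history']

# urljoin resolves each relative href against the base, avoiding the
# double or missing slashes that plain string concatenation can produce
tag_urls = [urljoin(base, h) for h in hrefs]
print(tag_urls)  # ['https://example.com/tag/fiction', 'https://example.com/tag/history']
```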

Next, enter a single tag page and examine the book list to decide which fields to store. With bs4 we extract fields such as publication date, author/translator, Douban rating, price, and number of reviews.

# Parse one listing page under a single tag page
def parse_tag_page(html):
    try:
        soup = BeautifulSoup(html, "lxml")
        tag_name = soup.select('title')[0].get_text().strip()
        list_soup = soup.find('ul', {'class': 'subject-list'})
        if list_soup is None:
            print('Failed to get the information list')
            return None
        for book_info in list_soup.findAll('div', {'class': 'info'}):
            # Title
            title = book_info.find('a').get('title').strip()
            # Number of reviews
            people_num = book_info.find('span', {'class': 'pl'}).get_text().strip()
            # Publication information: author / publisher / date / price
            pub = book_info.find('div', {'class': 'pub'}).get_text().strip()
            pub_list = pub.split('/')
            if len(pub_list) > 3:
                author_info = 'Author/Translator: ' + '/'.join(pub_list[0:-3])
            else:
                author_info = 'Author/Translator: None'
            if len(pub_list) >= 3:
                pub_info = 'Publishing information: ' + '/'.join(pub_list[-3:-1])
            else:
                pub_info = 'Publishing information: None'
            if pub_list:
                price_info = 'Price: ' + '/'.join(pub_list[-1:])
            else:
                price_info = 'Price: None'
            try:
                rating_num = book_info.find('span', {'class': 'rating_nums'}).get_text().strip()
            except AttributeError:
                # Books with too few ratings have no rating_nums span
                rating_num = '0.0'
            book_data = {
                'title': title,
                'people_num': people_num,
                'author_info': author_info,
                'pub_info': pub_info,
                'price_info': price_info,
                'rating_num': rating_num
            }
            # return book_data
            if book_data:
                # Store the record here; the original inserts it into
                # MongoDB (via pymongo), keyed by tag_name
                pass
    except Exception:
        print('parse error')
        return None
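The author/publisher/price slicing in `parse_tag_page` assumes the `pub` line has the shape `author / publisher / date / price`. A quick illustration on a made-up sample string (the book data below is invented for demonstration):

```python
# A made-up publication line in Douban's usual "a / b / c / d" layout
pub = 'J.K. Rowling / Su Nong (trans.) / People\'s Literature Publishing House / 2000-9 / 19.50'
pub_list = [p.strip() for p in pub.split('/')]

# Everything except the last three fields is author/translator
author_info = '/'.join(pub_list[0:-3])
# The two fields before the price are publisher and date
pub_info = '/'.join(pub_list[-3:-1])
# The final field is the price
price_info = '/'.join(pub_list[-1:])

print(author_info)  # J.K. Rowling/Su Nong (trans.)
print(pub_info)     # People's Literature Publishing House/2000-9
print(price_info)   # 19.50
```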

At this point we can fetch the book information of a single page under a single tag; adding pagination lets us crawl all the books under that tag.

Click the next page and inspect the request: the URL gains two parameters, start and type. The start parameter begins at 0 and increases in steps of 20, so we can build a generator that yields the start values. As the first screenshot shows, different tag pages contain different numbers of books, so the page counts differ. How do we build the generator then? It turns out that no tag returns results beyond page 50, so we only need to construct links for the first 50 pages. Page 51 renders as follows:

Results displayed on page 51
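The offset rule described above (0, 20, 40, …, capped at 50 pages) can be written as a small generator; this is a sketch, and the function and parameter names are my own:

```python
def gen_start(max_pages=50, page_size=20):
    """Yield the 'start' offsets for the first max_pages listing pages."""
    for page in range(max_pages):
        yield page * page_size

offsets = list(gen_start())
print(offsets[:4])  # [0, 20, 40, 60]
print(offsets[-1])  # 980 (start value of the 50th page)
```

Each yielded value can be fed straight into the `start` field of the request below.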

# Request each listing page under a tag
def get_tag_page(tag_url, page):
    formdata = {
        'start': page,
        'type': 'T'  # value assumed; the original code only shows 'start'
    }
    url = tag_url[0] + '?' + urlencode(formdata)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request list page error')
        return None

Anti-anti-crawling: Douban's anti-crawling is simple and crude: it blocks your IP outright. For the robustness of the crawler, use a proxy or random headers plus a random delay. A random delay of 30 to 40 seconds works, but it severely limits crawl speed; if you need to crawl quickly, combine a proxy, multithreading, random headers, and a random delay so you evade the blocking while keeping throughput.
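A minimal sketch of the random-delay idea (the 30–40 second range comes from the paragraph above; `polite_sleep` is a name invented here, and the injectable `sleep` argument just makes the function testable):

```python
import random
import time

def polite_sleep(low=30.0, high=40.0, sleep=time.sleep):
    """Sleep for a random duration in [low, high] seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Call `polite_sleep()` between successive page requests; for faster crawling, shrink the range and add proxies instead.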

# Use the Faker library to generate a random User-Agent header
from faker import Faker
fake = Faker()
headers = {'User-Agent': fake.user_agent()}
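If Faker is unavailable, rotating a fixed pool of User-Agent strings with the standard library achieves a similar effect (the UA strings below are ordinary example values, not taken from the original):

```python
import random

# A small pool of common desktop User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

Pass `random_headers()` fresh on each request so consecutive requests do not share the same fingerprint.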

Notes

  • I wrote this post for myself after finishing the code and reviewing it. The review turned up plenty of room for optimization: the exception handling, for example, often let the crawl abort on an error and forced a restart. Resumable crawling (checkpointing so the crawler can pick up where it left off) is also worth learning about.
Reference: Crawler | A hundred lines of code crawling 145K Douban book records (Cloud+ Community, Tencent Cloud)