Python crawler novice tutorial: Zhihu article image crawler

Yesterday I wrote part of the Zhihu article image crawler and scraped the answer JSON for a Zhihu question, but some values were hard-coded in that post. Today I cleaned up that part and added the image download step. On to the code.

First, pick any Zhihu question. Given only the question ID, we can fetch the question page and read the most important piece of information: the total number of answers.

The question ID is the number at the end of the question URL (highlighted in red in the original screenshot).
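For example, the ID can be pulled out of a full question URL with a one-line regex (a small helper of my own for illustration; the URL below is a made-up example in the standard Zhihu question-link format):

```python
import re

def extract_question_id(url):
    # Zhihu question URLs look like https://www.zhihu.com/question/26297181
    m = re.search(r'/question/(\d+)', url)
    return m.group(1) if m else None

print(extract_question_id("https://www.zhihu.com/question/26297181"))  # 26297181
```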

Now for the code. The snippet below checks that the user input is a numeric ID, then builds the question URL and extracts the total answer count from the page.

import re

import pymongo
import requests

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'

client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client[DATABASE_NAME]
db.authenticate("dba", "dba")  # MongoDB credentials
collection = db.zhihuone  # collection the scraped data will be inserted into

BASE_URL = "https://www.zhihu.com/question/{}"

def get_total_answers(article_id):
    headers = {
        # Fill in a complete browser User-Agent string here
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"
    }

    with requests.Session() as s:
        rep = s.get(BASE_URL.format(article_id), headers=headers, timeout=3)
        html = rep.text
        # The answer count sits in a <meta> tag on the question page
        pattern = re.compile(r'<meta itemProp="answerCount" content="(\d*?)"/>')
        match = pattern.search(html)
        print("Found {} answers".format(match.group(1)))
        return match.group(1)

if __name__ == '__main__':

    # Loop until the user enters a numeric question ID
    article_id = ""
    while not article_id.isdigit():
        article_id = input("Please enter the question ID: ")

    total = get_total_answers(article_id)
    if int(total) > 0:
        zhi = ZhihuOne(article_id, total)  # crawler class from the previous post
        zhi.run()
    else:
        print("No data!")
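With the total answer count in hand, the crawler can page through the answer JSON in fixed-size batches. Here is a minimal sketch of how the request URLs could be generated; `build_answer_urls` is a hypothetical helper, and the v4 API path is an assumption based on the JSON endpoint scraped in the previous post (a real request would also need extra query parameters and browser headers):

```python
def build_answer_urls(question_id, total, limit=20):
    # Assumed JSON endpoint for answers, paged with limit/offset
    api = "https://www.zhihu.com/api/v4/questions/{}/answers?limit={}&offset={}"
    return [api.format(question_id, limit, offset)
            for offset in range(0, int(total), limit)]

urls = build_answer_urls("26297181", 45)
print(len(urls))   # 3 requests: offsets 0, 20 and 40
print(urls[-1])
```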

Next, improve the image-download step. While inspecting the page I found the image URLs inside the `content` field of the answer JSON, so a simple regular expression is enough to match them (the details were shown in a screenshot in the original post).
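The matching works in two passes: first pull out every `<noscript>` block, then grab the `src` attribute of the `<img>` tag inside it. A quick demonstration on a trimmed, made-up sample of the `content` field:

```python
import re

# Hypothetical sample of an answer's "content" field, trimmed for illustration
content = ('<p>Some answer text.</p>'
           '<noscript><img src="https://pic2.zhimg.com/v2-abc.jpg" '
           'data-rawwidth="600"/></noscript>')

blocks = re.findall(r'<noscript>(.*?)</noscript>', content)
srcs = [re.search(r'<img src="(.*?)"', b).group(1) for b in blocks]
print(srcs)  # ['https://pic2.zhimg.com/v2-abc.jpg']
```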

Now the code. Please read the comments below carefully. There is one small bug in the middle: you have to manually replace pic3 with pic2 in the image host name. I am not sure why; it may be due to my local network. Also, create an imgs folder in the project root directory to store the downloaded pictures.

    def download_img(self, data):
        # Download every image found in the answer content
        for item in data["data"]:
            content = item["content"]
            pattern = re.compile(r'<noscript>(.*?)</noscript>')
            imgs = pattern.findall(content)
            for img in imgs:
                match = re.search(r'<img src="(.*?)"', img)
                if match is None:
                    continue
                download = match.group(1)
                # Small bug: images on the pic3 host cannot be downloaded, so switch to pic2
                download = download.replace("pic3", "pic2")

                print("Downloading {} ".format(download), end="")
                try:
                    with requests.Session() as s:
                        img_down = s.get(download)
                        # Use the last path segment of the URL as the file name
                        file = download[download.rindex("/") + 1:]

                        with open("imgs/{}".format(file), "wb+") as f:  # hard-coded folder
                            f.write(img_down.content)

                        print("Picture download complete", end="\n")

                except Exception as e:
                    print(e.args)
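A slightly more robust way to derive the file name than slicing at the last `/` is to take the basename of the URL path, which also ignores any query string (a small alternative sketch, not part of the original code):

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    # basename of the URL path; ignores any ?query suffix, unlike raw rindex slicing
    return os.path.basename(urlparse(url).path)

print(filename_from_url("https://pic2.zhimg.com/v2-abc123_r.jpg"))           # v2-abc123_r.jpg
print(filename_from_url("https://pic2.zhimg.com/v2-abc123_r.jpg?source=x"))  # v2-abc123_r.jpg
```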

Run the script and the downloaded pictures appear in the imgs folder.

------------------- End -------------------

Reference: https://cloud.tencent.com/developer/article/1673523 (Python crawler novice tutorial: Zhihu article image crawler, Tencent Cloud Developer Community)