Yesterday I wrote part of the Zhihu image crawler and captured the answer JSON for a Zhihu question. Some values in that post were hard-coded. Today I clean up that part and add the image download. On to the code.
First, pick any Zhihu question. You only need to enter the question's ID to fetch the relevant page information, the most important piece being the total number of answers to the question.
The question ID is the number highlighted in red below; it is the number that follows /question/ in the question URL.
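If you would rather paste the whole question URL than pick out the number by hand, the ID can be extracted with a small regular expression. This is only a sketch: `extract_question_id` is a hypothetical helper that does not appear in the crawler, and it assumes the URL has the `https://www.zhihu.com/question/<ID>` shape used by `BASE_URL` in the code.

```python
import re

# Hypothetical helper (not part of the original crawler): accept either a
# bare ID or a full question URL and return the numeric ID as a string.
QUESTION_URL = re.compile(r"zhihu\.com/question/(\d+)")


def extract_question_id(text):
    text = text.strip()
    if text.isdigit():
        # Already a bare numeric ID
        return text
    match = QUESTION_URL.search(text)
    # Return None when nothing that looks like a question ID is found
    return match.group(1) if match else None


# The ID below is made up for illustration
print(extract_question_id("https://www.zhihu.com/question/12345678"))
print(extract_question_id("12345678"))
```

Returning `None` for unrecognised input lets the caller keep asking until a usable ID is entered.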
Now write the code. The following code checks that the user input is a valid ID, then builds the question URL and reads the total number of answers from the page.
import requests
import re
import pymongo
import time

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client.sun
db.authenticate("dba", "dba")  # PyMongo 3.x API; removed in PyMongo 4, where credentials are passed to MongoClient
collection = db.zhihuone  # ready to insert data

BASE_URL = "https://www.zhihu.com/question/{}"


def get_totle_answers(article_id):
    headers = {
        # Replace this with your own complete user-agent string
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"
    }
    with requests.Session() as s:
        with s.get(BASE_URL.format(article_id), headers=headers, timeout=3) as rep:
            html = rep.text
            pattern = re.compile(r'<meta itemProp="answerCount" content="(\d+)"/>')
            match = pattern.search(html)
            if match is None:  # tag not found, e.g. wrong ID or blocked request
                return "0"
            print("Find {} data".format(match.groups()[0]))
            return match.groups()[0]


if __name__ == '__main__':
    # Keep asking until the user input is a number
    article_id = ""
    while not article_id.isdigit():
        article_id = input("Please enter the article ID:")

    totle = get_totle_answers(article_id)
    if int(totle) > 0:
        zhi = ZhihuOne(article_id, totle)  # ZhihuOne is the crawler class (not shown in this listing)
        zhi.run()
    else:
        print("No data!")
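The answer-count extraction can be tried without touching the network. The sketch below applies the same kind of pattern to a made-up HTML fragment; `parse_answer_count` and the sample string are assumptions for illustration, not part of the crawler.

```python
import re

# Pattern for the answerCount meta tag used in the post
ANSWER_COUNT = re.compile(r'<meta itemProp="answerCount" content="(\d+)"/>')


def parse_answer_count(html):
    """Return the answer count found in html, or 0 if the tag is missing."""
    match = ANSWER_COUNT.search(html)
    return int(match.group(1)) if match else 0


# Made-up fragment standing in for a real question page
sample = '<div><meta itemProp="answerCount" content="128"/></div>'
print(parse_answer_count(sample))
print(parse_answer_count("<html></html>"))
```

Falling back to 0 when the tag is absent means a wrong ID or a blocked request ends in the "No data!" branch instead of an exception.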
Next, complete the image download part. While inspecting the responses, I found the image download addresses inside the content field of the JSON. A simple regular expression is enough to match them. The details are shown in the figure below.
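To make the matching step concrete, here is a minimal sketch run against a made-up `content` string. `extract_image_urls`, the sample HTML and the `example.com` host are assumptions for illustration; the `re.S` flag is also an addition, so that a `<noscript>` block spanning several lines would still match.

```python
import re

# Each <noscript> block in the answer content wraps a plain <img> tag
NOSCRIPT = re.compile(r"<noscript>(.*?)</noscript>", re.S)
IMG_SRC = re.compile(r'<img src="(.*?)"')


def extract_image_urls(content):
    """Collect the src of the first <img> inside every <noscript> block."""
    urls = []
    for block in NOSCRIPT.findall(content):
        match = IMG_SRC.search(block)
        if match:  # skip <noscript> blocks that carry no image
            urls.append(match.group(1))
    return urls


# Made-up content field with one image and one plain paragraph
sample = (
    '<p>text</p><noscript><img src="https://pic3.example.com/v2-abc.jpg" '
    'width="600"/></noscript><p>no image here</p>'
)
print(extract_image_urls(sample))
```

Checking `match` before reading the group avoids an `AttributeError` when a `<noscript>` block contains no `<img>` tag.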
Write the code, and please read the code comments carefully. There is a small bug along the way: pic3 in the image URL has to be changed to pic2 by hand. I am not sure why; it may be down to my local network. Also, create an imgs folder in the project root directory to store the pictures.
def download_img(self, data):
    # Download every image referenced in the answers of one JSON page
    for item in data["data"]:
        content = item["content"]
        pattern = re.compile(r'<noscript>(.*?)</noscript>')
        imgs = pattern.findall(content)
        if len(imgs) > 0:
            for img in imgs:
                match = re.search(r'<img src="(.*?)"', img)
                if match is None:  # no <img> inside this <noscript> block
                    continue
                download = match.groups()[0]
                download = download.replace("pic3", "pic2")  # Small bug, pic3 cannot be downloaded
                print("Downloading {}".format(download), end="")
                try:
                    with requests.Session() as s:
                        with s.get(download) as img_down:
                            # Get file name
                            file = download[download.rindex("/") + 1:]
                            img_content = img_down.content
                            with open("imgs/{}".format(file), "wb+") as f:  # This place is hard-coded
                                f.write(img_content)
                            print("Picture download complete", end="\n")
                except Exception as e:
                    print(e.args)
        else:
            pass
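Instead of creating the imgs folder by hand, the script could create it on demand. The sketch below is one way to do that, not what the crawler does: `local_path_for` is a hypothetical helper and the URL is made up.

```python
import os

IMG_DIR = "imgs"  # same folder name the post uses


def local_path_for(url, img_dir=IMG_DIR):
    """Return the local path for an image URL, creating the folder if needed."""
    os.makedirs(img_dir, exist_ok=True)  # does nothing if the folder already exists
    file_name = url[url.rindex("/") + 1:]  # text after the last slash
    return os.path.join(img_dir, file_name)


# Made-up image URL for illustration
print(local_path_for("https://pic2.example.com/v2-abc.jpg"))
```

With a helper like this, the hard-coded `"imgs/{}".format(file)` path in `download_img` could be replaced by a single call.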
The result of running the code is: