Today we'll write a simple but complete crawler together: grab every picture on a site and save them to your computer!
Tools: Python 3.6, PyCharm
Libraries: requests, re, time, random, os
Target website: Meizitu (see the code for the actual URL...)
Before we start writing code, we first need to analyze the website, focusing on these points:
1. First, determine whether the page is static, because this decides which crawling approach we use.
Simply put, if the content you want can be found directly in the page's source code, the site is static; if it can't, you need to open the developer tools and check the network requests (packet capture), analyze the JavaScript, or find some other way in.
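A quick way to check this in code (a minimal sketch; the helper name and the sample HTML snippets below are mine, not from the real site): search the raw HTML for a piece of content you can see in the browser.

```python
def appears_in_source(html_text, fragment):
    """Return True if the visible content is present in the raw HTML,
    which suggests the page is static (server-rendered)."""
    return fragment in html_text

# Hypothetical snippets for illustration:
static_html = '<html><body><h3 class="tit">Album One</h3></body></html>'
js_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(appears_in_source(static_html, 'Album One'))  # content is in the source -> static
print(appears_in_source(js_html, 'Album One'))      # missing -> rendered by JS

# Against a live site you would fetch first, e.g.:
# import requests
# html = requests.get('http://www.meizitu.com/').text
# appears_in_source(html, 'some text you saw in the browser')
```

If the fragment is missing from the source, the developer tools' Network tab is usually the next stop.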
2. Look at the page structure: work out how many levels of loops it takes to reach the target data, how each loop iterates, and whether you can guarantee nothing gets missed.
3. Decide on a matching method based on the page's source code.
Generally speaking, regular expressions are the fastest way to process strings, but they are not always the best fit in crawlers, because they match against the entire HTML text. If the page source is fairly regular, bs4 or XPath, which parse the page's structure, work better.
Of course, today we are writing a basic crawler, so we'll use regular expressions. After all, regex is a skill you have to master!
So how do you actually write the crawler code? Let me walk through a simple example.
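As a warm-up (the HTML snippet here is made up for illustration, but the pattern is the same one the crawler below uses), `re.findall` with a non-greedy `.*?` pulls out every (url, title) pair at once:

```python
import re

# A made-up snippet in the same shape as the pages we'll scrape
html = '''
<a href="http://www.meizitu.com/a/1.html" target="_blank" title="First album">
<a href="http://www.meizitu.com/a/2.html" target="_blank" title="Second album">
'''

# The two parenthesized groups capture the URL and the title of each link
infos = re.findall(r'a href="(http://www.meizitu.com/.*?html)" target="_blank" title="(.*?)"', html)
print(infos)
```

`.*?` is non-greedy, so each match stops at the first `html` / closing quote instead of swallowing the rest of the line.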
If you were doing this by hand, the process would look roughly like this:
Open the homepage ==> pick a category ==> pick an album ==> click each picture in turn ==> right-click to save ==> repeat for the remaining pictures
Translating that process into code, the structure looks roughly like this:
Visit the homepage URL ==> find and loop over all categories ==> create a category folder ==> visit the category URL ==> find the page count and loop over all pages in the category ==> loop over all albums on each page ==> create an album folder ==> find all image URLs in the album ==> save each image to the corresponding folder
Okay, the idea is clear, so without further ado, let's write the code!
A pause from the time module is added between requests. If you skip it, the site may deny you access!
When you finally request the image URL, you need to add a User-Agent header to tell the server you are a browser and not a script. This is the most common anti-crawling countermeasure.
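Both tricks can be wrapped in a small helper (a sketch; the name `polite_get` and the pause range are my own choices, and the UA string is just one example of a browser User-Agent):

```python
import time
import random

# Example browser User-Agent; any common one works
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0',
}

def polite_get(url, min_pause=0.5, max_pause=1.5):
    """Sleep a random interval, then fetch the URL with a browser-like User-Agent."""
    import requests  # imported here so the helper is drop-in
    time.sleep(random.uniform(min_pause, max_pause))
    return requests.get(url, headers=BROWSER_HEADERS)
```

The random interval makes the request timing look less mechanical than a fixed `sleep(1)` would.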
```python
# author: 云飞  QQ群: 542110741
import requests
import time
import random
import re
import os

def new_title(title):
    # Replace characters that are illegal in Windows file names with underscores
    rstr = r"[\/\\\:\*\?\"\<\>\|]"  # / \ : * ? " < > |
    return re.sub(rstr, "_", title)

url = 'http://www.meizitu.com/'
html = requests.get(url)
html.encoding = 'gb2312'
infos = re.findall(r'a href="(http://www.meizitu.com/.*?html)" target="_blank" title="(.*?)"', html.text)
i = 1
for sor_url, sor in infos:
    sor = new_title(sor)
    path = 'E://python/mn/meizitu/%s/' % sor  # category folder
    if not os.path.exists(path):  # create the folder (and parents) if it doesn't exist
        os.makedirs(path)
    time.sleep(random.random())  # pause briefly so the site doesn't block us
    sor_html = requests.get(sor_url)
    sor_html.encoding = 'gb2312'
    atlas = set(re.findall(r"<li><a href='(.*?html)'>\d+</a></li>", sor_html.text))
    atlas_lis = [sor_url]  # page 1 is the category URL itself
    atlas_lis += [url + 'a/' + x for x in list(atlas)]
    for atla in atlas_lis:
        atla_html = requests.get(atla).text
        at_url_lis = re.findall(r'h3 class="tit"><a href="(http://www.meizitu.com/.*?html)" targe', atla_html)
        for at_url in at_url_lis:
            at_html = requests.get(at_url)
            at_html.encoding = 'gb2312'
            atlas_title = ''.join(re.findall(r'<title>(.*?)</title>', at_html.text))
            atlas_title = new_title(atlas_title)
            img_path = 'E://python/mn/meizitu/%s/%s/' % (sor, atlas_title)
            if not os.path.exists(img_path):  # create the album folder if it doesn't exist
                os.makedirs(img_path)
            img_urls = re.findall(r'src="(http://mm.chinasareview.com/.*?jpg)"/><br/>', at_html.text)
            k = 1
            for img_url in img_urls:
                header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
                data = requests.get(img_url, headers=header).content  # the image as bytes
                with open('%s%s' % (img_path, img_url.split('/')[-1]), 'wb') as f:
                    f.write(data)
                print("[Downloading] picture %d of {%s}, %d pictures downloaded in total" % (k, atlas_title, i))
                i += 1
                k += 1
```
(Screenshot: the downloaded folders after letting the crawler run for a while)
------------------- End -------------------