In software engineering there is a well-known phrase, "high cohesion, low coupling": break large modules into small ones so that each module is as independent as possible, and so that modifying one module has as little impact as possible on the other modules and on the project as a whole.
Let's take a crawler that downloads pictures as an example to make this clearer.
A bad example
import re
import requests
def Visit(url,regularity,regularity_1): # parameter 1 web page address, parameter 2 regular expression rule, parameter 3 secondary web page regular rule, parameter 4 page number, parameter 5 total page number
    r = requests.get(url)
    r.encoding = r.apparent_encoding
    web_source=r.text
    regular_ls=re.findall(regularity,web_source)
    for i in range(len(regular_ls)):
        yema=regular_ls[i]
        print(yema)
        url_1="https://www.236www.com/"+regular_ls[i]   # Extracted secondary webpage address
        print(url_1)
        html = requests.get(url_1)
        html.encoding = html.apparent_encoding
        web_source_html = html.text
        regular_ls_1 = re.findall(regularity_1, web_source_html)
        for n in range(len(regular_ls_1)):
            try:
                picture_url=regular_ls_1[n]
                picture_html=requests.get(picture_url)
                address = "D:\\picture\\"+str(yema[17:])+"--"+str(i)+"--"+str(n)+".jpg"   # Picture download local path
                with open(address, "wb") as f:
                    f.write(picture_html.content)
                    print(address,'download complete')
            except:
                print(str(i)+str(n),"Print failed")
def web_batch(The_number_of):
    regularity ='<a href="(.*?)" target="_blank'   # First-level webpage regex rule
    regularity_1 ='<img src="(.*?)"/>'   # Second-level webpage regex rule
    number=1
    for i in range(The_number_of):
        url = "https://www.0606rr.com/Html/63/index-2" + str(number) + ".html"   # Web address to visit
        Visit(url, regularity, regularity_1)
        number =number + 1
web_batch(5)
The above code is a crawler that downloads pictures. I'm not sure how it strikes you at first glance.
When I read it, the readability felt very poor: there were no key comments, and all the logic was stuffed together. Even reading it line by line, I couldn't work out what this .py file was doing.
In my view, the code above has several shortcomings:
1. There are either no blank lines or only one blank line between the functions; the Python coding standard (PEP 8) recommends two blank lines between top-level definitions.
2. The code is not robust: the requests have no exception handling, so a failed request can crash the program (see the sketch after this list).
3. Each function does too many things, which is the opposite of "high cohesion, low coupling" and makes later maintenance inconvenient.
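To make point 2 concrete, here is a minimal sketch of the kind of exception handling a request could be wrapped in. The helper name fetch and the 10-second timeout are my own illustration, not code from this article:

import requests


def fetch(url, timeout=10):
    # Return the response on success, or None if the request fails for any reason,
    # so that one bad request does not crash the whole crawl
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise for 4xx/5xx status codes
        return response
    except requests.RequestException as e:
        print('request failed:', url, e)
        return None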
After modification
I revised the code above following the idea of "high cohesion, low coupling" and an object-oriented design. The amount of code has grown a bit, but the crawler is more robust, the purpose of each module is clear at a glance, and the key comments are in place.
A crawler that needs to be robust and easy to maintain is generally written with this kind of structure, split into five modules; large crawler projects are organized the same way. The Scrapy framework, for example, is also built around this kind of structure:
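For reference, the package layout implied by the imports in the modules below is roughly the following; the exact directory names are my reconstruction from the import paths, so treat them as an assumption:

picture/
    serial_crawling/
        spiderMan.py       # main logic
        urlManager.py      # url scheduling
        htmlDownload.py    # web page download
        parseHtml.py       # parsing and data extraction
        dataOutput.py      # data output and storage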
spiderMan:
The main logic module; the business logic is implemented here.
from picture.serial_crawling.urlManager import urlManager
from picture.serial_crawling.htmlDownload import htmlDownload
from picture.serial_crawling.parseHtml import parseHtml
from picture.serial_crawling.dataOutput import dataOutput
import json
import time


class spiderMan():
    """
    Main logic
    """
    def __init__(self):
        """
        Initialize each module
        """
        self.manager = urlManager()
        self.download = htmlDownload()
        self.parse = parseHtml()
        self.output = dataOutput()

    def get_url(self):
        """
        Get the url of each page
        :return:
        """
        page_urls = self.manager.create_url()
        self.request_url(page_urls)

    def request_url(self, page_urls):
        """
        Request the url of each page
        :return:
        """
        for page_url in page_urls:
            response = self.download.request_url(page_url)
            # Check whether the request succeeded
            if response == None:
                print('Failed to access the webpage; the crawler may have been blocked or there is a network problem~')
                break
            # Check whether this is the last page
            html = json.loads(response.text)
            if html['end'] != False:
                print('There are no more pictures, the download is complete!')
                break
            else:
                self.get_img_url(html)

    def get_img_url(self, html):
        """
        Parse out the urls of all the pictures on each page
        :return:
        """
        img_urls = self.parse.get_this_page_img_urls(html)
        self.get_img(img_urls)

    def get_img(self, img_urls):
        """
        Download the images
        :return:
        """
        self.output.download_img(img_urls)


if __name__ == '__main__':
    # Run the main entry point and time the crawl
    start_time = time.time()
    spider = spiderMan()
    spider.get_url()
    end_time = time.time()
    print(end_time - start_time)
urlManager:
The module that controls url scheduling.
class urlManager():
    """
    Manage urls
    """
    def __init__(self):
        """
        Initialize the base url to be spliced
        """
        self.base_url = 'http://image.so.com/zj?ch=beauty&sn={}&listtype=new&temp=1'

    def create_url(self):
        """
        Construct the url of each page
        :return:
        """
        urls = [self.base_url.format(str(i)) for i in range(0, 1020, 30)]
        return urls
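As a quick sanity check (assuming the urlManager class above is in scope), this is roughly what create_url produces; the sn parameter steps through the pages 30 items at a time:

manager = urlManager()
urls = manager.create_url()
print(len(urls))  # 34 page urls, one for each step of range(0, 1020, 30)
print(urls[0])    # http://image.so.com/zj?ch=beauty&sn=0&listtype=new&temp=1
print(urls[1])    # http://image.so.com/zj?ch=beauty&sn=30&listtype=new&temp=1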
htmlDownload:
The module that requests and downloads web pages.
import requests
import random


class htmlDownload():
    """
    Web page download
    """
    def __init__(self):
        """
        Initialize the request header
        """
        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.204 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/66.0',
        ]
        self.headers = {'User-Agent': random.choice(USER_AGENTS)}

    def request_url(self, url):
        """
        Send the request
        :param url:
        :return:
        """
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response
        else:
            return None
parseHtml:
The module that parses responses and extracts the data.
class parseHtml():
    """
    Parse the web page and extract data
    """
    def __init__(self):
        self.img_urls = []  # List that stores (image title, url) pairs

    def get_this_page_img_urls(self, html):
        """
        Get the urls of the pictures on this page
        :param html:
        :return: the list that stores the image urls
        """
        # Print how many pictures have been downloaded so far;
        # first check that the request succeeded, then print
        if html['list']:
            img_count = html['lastid'] - 30
            print('Downloaded about {} pictures so far (the count is approximate; pictures with the same name are filtered out)~'.format(img_count))
        for item in html['list']:
            img_title = item['group_title']
            img_url = item['qhimg_url']
            self.img_urls.append((img_title, img_url))
        return self.img_urls
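For clarity, get_this_page_img_urls works on the JSON dictionary that spiderMan obtains with json.loads. The snippet below feeds it a hand-made dictionary: the key names come from the code above, but the values are invented purely for illustration:

fake_html = {
    'end': False,
    'lastid': 60,
    'list': [
        {'group_title': 'example title', 'qhimg_url': 'http://example.com/1.jpg'},
    ],
}

parser = parseHtml()
print(parser.get_this_page_img_urls(fake_html))
# [('example title', 'http://example.com/1.jpg')]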
dataOutput:
The data-handling module, for operations such as saving to storage or a database and cleaning the data.
import os
from picture.serial_crawling.htmlDownload import htmlDownload


class dataOutput():
    """
    Data output handling
    """
    def __init__(self):
        """
        Create the image save path
        """
        self.root_path = r'C:\Users\13479\Desktop\python project\my crawler\Is there any multi-process, thread image, video download comparison\pictures\serial crawling\images\\'
        # Create the directory if it does not exist
        if not os.path.exists(self.root_path):
            os.mkdir(self.root_path)
        self.download = htmlDownload()

    def download_img(self, img_urls):
        """
        Download the pictures
        :param img_urls: list of (image name, url) pairs
        :return:
        """
        for img_url in img_urls:
            # Build the full download path for the picture
            download_path = '{}{}.jpg'.format(self.root_path, img_url[0])
            if not os.path.exists(download_path):
                response = self.download.request_url(img_url[1])
                try:
                    with open(download_path, 'wb') as f:
                        f.write(response.content)
                except:
                    pass
            else:
                pass
The above is the crawler I wrote to download pictures from the 360 image website. Organized with this structure, it is much easier to use and maintain later.
"High cohesion and low coupling" is a kind of thinking, and there is no fixed coding structure, but if you write the code in this way, it is not only convenient for your own later maintenance, but also very readable for others.
Getting the source code
Follow the public account "Kinxia Learning Python" and reply "360 Image Crawler" to get the source code.
end
This time I wrote a serial crawler; it took about 6 minutes to download 1,000 pictures. Next time I will rework it with multiprocessing and multithreading and share the comparison with you.