What is the structure of a crawler?

In software engineering there is a phrase, "high cohesion and low coupling". It means breaking large modules into small modules so that each module is highly independent: modifying one module has little impact on the other modules or on the project as a whole.

Let's take a crawler that downloads pictures as an example to make this clearer.

A bad example

import re
import requests
def Visit(url,regularity,regularity_1): # parameter 1: page url; parameter 2: first-level regex rule; parameter 3: second-level regex rule
    r = requests.get(url)
    r.encoding = r.apparent_encoding
    web_source=r.text
    regular_ls=re.findall(regularity,web_source)
    for i in range(len(regular_ls)):
        yema=regular_ls[i]
        print(yema)
        url_1="https://www.236www.com/"+regular_ls[i] #Extracted secondary webpage address
        print(url_1)
        html = requests.get(url_1)
        html.encoding = html.apparent_encoding
        web_source_html = html.text
        regular_ls_1 = re.findall(regularity_1, web_source_html)
        for n in range(len(regular_ls_1)):
            try:
                picture_url=regular_ls_1[n]
                picture_html=requests.get(picture_url)
                address = "D:\\picture\\"+str(yema[17:])+"--"+str(i)+"--"+str(n)+".jpg" #Picture download local path
                with open(address, "wb") as f:
                    f.write(picture_html.content)
                    print(address,'download complete')
            except:
                print(str(i)+str(n),"Print failed")

def web_batch(The_number_of):
    regularity ='<a href="(.*?)" target="_blank' # First-level webpage regularity rules
    regularity_1 ='<img src="(.*?)"/>' # Second-level webpage regularity rules
    number=1
    for i in range(The_number_of):
        url = "https://www.0606rr.com/Html/63/index-2" + str(number) + ".html" # Visit web address
        Visit(url, regularity, regularity_1)
        number =number + 1
web_batch(5)

The above code is a crawler that downloads pictures. I'm not sure how it strikes you at first glance.

When I read it, I felt the readability was very low: not only were there no key comments, but everything was stuffed together. Reading it line by line, I really couldn't tell what this .py file was doing.

In my view, the code above has several shortcomings:

1. There is either no blank line or only one blank line between functions. The Python coding standard recommends two blank lines between top-level definitions.

2. The code is not robust: the requests have no exception handling, so if a request for an image fails, the program crashes (a minimal sketch of a guarded request follows this list).

3. Each function does too many things, which runs counter to "high cohesion and low coupling" and makes later maintenance inconvenient.
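
As a minimal sketch of what point 2 means (the helper name safe_get and the timeout value are my own illustration, not part of the original code), a page request can be guarded like this:

import requests


def safe_get(url, timeout=10):
    """Request a url and return the response, or None on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raise for 4xx/5xx status codes
        return r
    except requests.RequestException as e:
        print('Request failed:', url, e)
        return None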

After modification

I revised the code above following the idea of "high cohesion and low coupling" and object-oriented design. The amount of code has grown a bit, but the crawler is more robust, the purpose of each module is clear at a glance, and the key comments are in place.

If a crawler is to be robust and easy to maintain, it is generally written with this structure, divided into five modules; large crawler projects are organized the same way. The Scrapy framework, for example, is also based on this kind of structure:
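
As a rough sketch of that layout (I have rendered the original Chinese package and directory names as picture/serial_crawling; the file names simply follow the module names below):

picture/
    serial_crawling/
        spiderMan.py      # main logic / business flow
        urlManager.py     # url construction and scheduling
        htmlDownload.py   # web page requests
        parseHtml.py      # parsing and data extraction
        dataOutput.py     # data output (saving the images)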

spiderMan:

The main logic module; the business logic is implemented here.

# The original package/directory names are in Chinese; they are rendered here as picture.serial_crawling
from picture.serial_crawling.urlManager import urlManager
from picture.serial_crawling.htmlDownload import htmlDownload
from picture.serial_crawling.parseHtml import parseHtml
from picture.serial_crawling.dataOutput import dataOutput
import json
import time


class spiderMan():
    """
    Main logic
    """
    def __init__(self):
        """
        Initialize each module
        """
        self.manager = urlManager()
        self.download = htmlDownload()
        self.parse = parseHtml()
        self.output = dataOutput()

    def get_url(self):
        """
        Get the url of each page
        :return:
        """
        page_urls = self.manager.create_url()
        self.request_url(page_urls)

    def request_url(self,page_urls):
        """
        Request the url of each page
        :return:
        """
        for page_url in page_urls:
            response = self.download.request_url(page_url)
            # Determine whether the request is successful
            if response is None:
                print('Failed to access the webpage, it may be crawled back or there is a problem with the network~')
                break
            # Determine whether it is the last page
            html = json.loads(response.text)
            if html['end']:
                print('There are no more pictures, the download is complete!')
                break
            else:
                self.get_img_url(html)

    def get_img_url(self,html):
        """
        Parse to get the URL of all pictures on each page
        :return:
        """
        img_urls = self.parse.get_this_page_img_urls(html)
        self.get_img(img_urls)

    def get_img(self,img_urls):
        """
        Download image
        :return:
        """
        self.output.download_img(img_urls)


if __name__ == '__main__':
    # Run the main flow and time it
    start_time = time.time()
    spider = spiderMan()
    spider.get_url()
    end_time = time.time()
    print(end_time - start_time)

urlManager:

The module that controls url scheduling.

class urlManager():
    """
    Management url
    """
    def __init__(self):
        """
        Initialize the url to be spliced
        """
        self.base_url = 'http://image.so.com/zj?ch=beauty&sn={}&listtype=new&temp=1'

    def create_url(self):
        """
        Construct the url of each page
        :return:
        """
        urls = [self.base_url.format(str(i)) for i in range(0,1020,30)]
        return urls
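
A quick check of what create_url produces (just an illustration run; the sn parameter steps through the listing in batches of 30):

manager = urlManager()
urls = manager.create_url()
print(len(urls))  # 34 page urls (sn = 0, 30, 60, ..., 990)
print(urls[0])    # http://image.so.com/zj?ch=beauty&sn=0&listtype=new&temp=1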

htmlDownload:

The module that performs the web page requests (downloading).

import requests
import random


class htmlDownload():
    """
    Web download
    """
    def __init__(self):
        """
        Initialize the request header
        """
        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.204 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/66.0',       
        ]
        self.headers = {'User-Agent': random.choice(USER_AGENTS)}


    def request_url(self, url):
        """
        Request a url and return the response, or None on failure
        :param url:
        :return:
        """
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response
        else:
            return None
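
A minimal usage sketch wiring urlManager and htmlDownload together (the 'end' field is the same flag that spiderMan checks in the returned JSON):

manager = urlManager()
downloader = htmlDownload()
response = downloader.request_url(manager.create_url()[0])
if response is not None:
    print(response.json()['end'])  # False while there are still pages to fetch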

parseHtml:

The module that parses the response and extracts the data.

class parseHtml():
    """
    Analyze web pages and extract data
    """
    def __init__(self):
        self.img_urls = [] # Store the list of image titles and urls

    def get_this_page_img_urls(self,html):
        """
        Get the url of the picture on this page
        :param html:
        :return: Store the list of image urls
        """
        # Report progress: if this page returned a list, print roughly how many
        # images have been downloaded so far
        if html['list']:
            img_count = html['lastid'] - 30
            print('Downloaded about {} images so far (approximate; images with the same name are filtered)'.format(img_count))

        for item in html['list']:
            img_title = item['group_title']
            img_url = item['qhimg_url']
            self.img_urls.append((img_title,img_url))
        return self.img_urls
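
For reference, a hedged sketch of the JSON shape this parser expects; the field names ('end', 'lastid', 'list', 'group_title', 'qhimg_url') come from the code above, and the concrete values are made up:

sample = {
    'end': False,
    'lastid': 60,
    'list': [
        {'group_title': 'some title', 'qhimg_url': 'http://p0.qhimg.com/example.jpg'},
    ],
}
parser = parseHtml()
pairs = parser.get_this_page_img_urls(sample)
# pairs == [('some title', 'http://p0.qhimg.com/example.jpg')]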

dataOutput:

The data processing module, for operations such as storing to a database or cleaning the data.

import os
from picture.serial_crawling.htmlDownload import htmlDownload


class dataOutput():
    """
    Data output processing
    """
    def __init__(self):
        """
        Create image save path
        """
        self.root_path = r'C:\Users\13479\Desktop\python project\my crawler\Is there any multi-process, thread image, video download comparison\pictures\serial crawling\images\\'
        # Create if there is no file path
        if not os.path.exists(self.root_path):
            os.mkdir(self.root_path)

        self.download = htmlDownload()

    def download_img(self,img_urls):
        """
        Function to download pictures
        :param img_urls: image name, url corresponding list
        :return:
        """
        for img_url in img_urls:
            # Build the full local path for this picture
            download_path = '{}{}.jpg'.format(self.root_path, img_url[0])
            if not os.path.exists(download_path):
                response = self.download.request_url(img_url[1])
                # Skip this picture if the request failed or the file cannot be written
                if response is not None:
                    try:
                        with open(download_path, 'wb') as f:
                            f.write(response.content)
                    except OSError:
                        pass
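
A small usage sketch of this module on its own (the title/url pair is made up for illustration):

output = dataOutput()
output.download_img([('sample title', 'http://p0.qhimg.com/example.jpg')])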

The above is my crawler that downloads pictures from the 360 image website. Organized with this structure, it is much easier to use and maintain later.

"High cohesion and low coupling" is a kind of thinking, and there is no fixed coding structure, but if you write the code in this way, it is not only convenient for your own later maintenance, but also very readable for others.

Getting the source code

Follow the public account "Kinxia Learning Python" and reply "360 Image Crawler" to get the source code.

This time I wrote a serial crawler; it took 6 minutes to download 1,000 pictures. Next time I will modify it to use multiple processes and threads and share that with you.

Reference: https://cloud.tencent.com/developer/article/1555324 — What is the structure of a crawler? (Cloud+ Community, Tencent Cloud)