Manually build distributed crawlers


Qiye, Python Chinese community columnist and information security researcher, specializes in network security, reverse engineering, Python crawler development, and Python web development. Author of "Python Crawler Development and Project Practical Combat".

This article is taken from Chapter 7 of the Basics part of my new book, "Python Crawler Development and Project Practical Combat", and covers how to build a simple distributed crawler by hand. If you are interested in the book, you can take a look at the sample chapter. The content of the article follows.

This chapter is again a hands-on project: building a distributed crawler. That is no small challenge for beginners, but it is a worthwhile exercise. The distributed crawler built here uses a relatively simple master-slave model, is written entirely by hand, and does not rely on a mature framework. It covers most of the main knowledge points from the first six chapters; the distributed-computing knowledge involved is distributed processes and inter-process communication. The project therefore also serves as a summary of the Python crawler basics.

Large-scale crawler systems today use a distributed crawling structure. This hands-on project should give you a clearer understanding of distributed crawlers and lay the foundation for the systematic treatment of distributed crawlers later on. In practice, it is not that difficult.

Practical goal: crawl the titles, abstracts, and links of 2,000 Baidu Encyclopedia entries (the "web crawler" entry and its related entries), and rewrite the basic crawler from Chapter 6 with a distributed structure to make it more powerful. The crawled page is shown below.

7.1 Simple distributed crawler structure

This distributed crawler uses a master-slave model. In the master-slave model, one host acts as the control node and manages all the hosts that run the web crawler. A crawler only needs to receive tasks from the control node and submit newly generated tasks back to it; it never has to communicate with other crawlers. This approach is simple to implement and easy to manage. Its flaw is that the control node has to communicate with every crawler, so it becomes the bottleneck of the whole system and can easily drag down the performance of the entire distributed crawler. This project uses three hosts for distributed crawling: one host serves as the control node, and the other two serve as crawler nodes. The crawler structure is shown in Figure 7.1:

Figure 7.1 Master-slave crawler structure

7.2 Control Node

The control node (ControlNode) consists mainly of a URL manager, a data store, and a control scheduler. The control scheduler coordinates the URL manager and the data store through three processes. The first is the URL management process, which manages URLs and passes them to the crawler nodes. The second is the data extraction process, which reads the data returned by the crawler nodes, hands the URLs in that data to the URL management process, and hands the title and summary data to the data storage process. The third is the data storage process, which stores the data submitted by the data extraction process locally. The execution process is shown in Figure 7.2:

Figure 7.2 Control node execution process

7.2.1 URL manager

The URL manager builds on the code from Chapter 6 with some optimizations. Because deduplication uses an in-memory set, storing a large number of URLs directly, especially long ones, can easily exhaust memory. The crawled URLs are therefore passed through MD5 first: whatever the length of the input string, the MD5 digest is always 128 bits, so storing digests instead of full URLs in the set can reduce memory consumption considerably when URLs are long. In Python, the MD5 hex digest is a 32-character string. Since we crawl relatively few URLs, the risk of MD5 collisions is small, and it is enough to keep the middle 16 characters, the so-called 16-character MD5 digest. In addition, save_progress and load_progress methods are added for serialization: they save the set of uncrawled URLs and the set of crawled URLs to local files, so that the current progress is kept and the state can be restored on the next run. The URL manager code is as follows:
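As a minimal sketch rather than the book's exact listing, a URL manager along these lines could look like the code below. Only save_progress, load_progress, and the file names new_urls.txt and old_urls.txt come from the text; the class name and the other method names are assumptions, and pickle is assumed as the serialization format.

```python
import hashlib
import pickle


class UrlManager:
    """Keeps uncrawled URLs in full and crawled URLs as 16-character MD5 digests."""

    def __init__(self, new_path="new_urls.txt", old_path="old_urls.txt"):
        self.new_path = new_path
        self.old_path = old_path
        self.new_urls = self.load_progress(new_path)  # URLs waiting to be crawled
        self.old_urls = self.load_progress(old_path)  # digests of crawled URLs

    @staticmethod
    def digest(url):
        # hexdigest() is 32 hex characters; keep the middle 16.
        return hashlib.md5(url.encode("utf-8")).hexdigest()[8:-8]

    def has_new_url(self):
        return len(self.new_urls) != 0

    def add_new_url(self, url):
        # Skip empty URLs, queued URLs, and URLs whose digest was already crawled.
        if url and url not in self.new_urls and self.digest(url) not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def get_new_url(self):
        # Hand out one URL and remember only its digest.
        url = self.new_urls.pop()
        self.old_urls.add(self.digest(url))
        return url

    def save_progress(self, path, data):
        with open(path, "wb") as f:
            pickle.dump(data, f)

    def load_progress(self, path):
        # A missing or unreadable file simply means "start from scratch".
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except (OSError, EOFError, pickle.UnpicklingError):
            return set()
```

Calling save_progress(manager.new_path, manager.new_urls) and save_progress(manager.old_path, manager.old_urls) before shutdown is enough for the next UrlManager() to pick up where the previous run stopped.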

7.2.2 Data store

The data store is essentially the same as in Chapter 6, except that the generated file is named after the current time to avoid name clashes, and writes to the file are buffered. The code is as follows:
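The following is a sketch of a data store with these two properties, not the Chapter 6 code itself. The class name DataOutput, the file name pattern, the HTML table layout, the buffer size of 10, and the record keys url, title, and summary are all assumptions made for illustration.

```python
import html
import time


class DataOutput:
    """Buffers crawled records and appends them to a timestamped HTML file."""

    def __init__(self, filepath=None, buffer_size=10):
        # Assumed naming scheme: any time-based name avoids clashes between runs.
        self.filepath = filepath or time.strftime("baike_%Y_%m_%d_%H_%M_%S.html")
        self.buffer_size = buffer_size
        self.datas = []
        with open(self.filepath, "w", encoding="utf-8") as f:
            f.write("<html><head><meta charset='utf-8'/></head><body><table>\n")

    def store_data(self, data):
        # Collect records in memory and write only when the buffer is full.
        if data is None:
            return
        self.datas.append(data)
        if len(self.datas) >= self.buffer_size:
            self.flush()

    def flush(self):
        with open(self.filepath, "a", encoding="utf-8") as f:
            for data in self.datas:
                cells = "".join(
                    "<td>%s</td>" % html.escape(str(data[key]))
                    for key in ("url", "title", "summary")
                )
                f.write("<tr>%s</tr>\n" % cells)
        self.datas = []

    def close(self):
        # Write whatever is still buffered, then close the HTML document.
        self.flush()
        with open(self.filepath, "a", encoding="utf-8") as f:
            f.write("</table></body></html>\n")
```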

7.2.3 Control scheduler

The control scheduler creates and starts the URL management process, the data extraction process, and the data storage process, and maintains four queues for communication between processes: url_q, result_q, conn_q, and store_q. The url_q queue is the channel through which the URL management process passes URLs to the crawler nodes. The result_q queue is the channel through which the crawler nodes return data to the data extraction process. The conn_q queue is the channel through which the data extraction process submits new URLs to the URL management process. The store_q queue is the channel through which the data extraction process delivers the extracted data to the data storage process.

Because the control node has to communicate with the crawler nodes, a distributed process is essential. Refer to the service-process code of the distributed process (Linux version) in Section 1.4.4 to create a distributed manager, defined as the start_manager method. The method code is as follows:
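A minimal sketch of start_manager, built on the standard library's multiprocessing.managers.BaseManager, is shown below. The QueueManager subclass, the registered names get_task_queue and get_result_queue, port 8001, and the authkey value are placeholders chosen for this illustration, not values confirmed by the text.

```python
from multiprocessing.managers import BaseManager


class QueueManager(BaseManager):
    """Subclass so that registrations do not end up on BaseManager itself."""


def start_manager(url_q, result_q, address=("", 8001), authkey=b"baike"):
    # Expose the two queues that crawler nodes need over the network.
    # Crawler nodes must register the same two names before connecting.
    QueueManager.register("get_task_queue", callable=lambda: url_q)
    QueueManager.register("get_result_queue", callable=lambda: result_q)
    # The manager is only created here; the caller decides how to serve it.
    return QueueManager(address=address, authkey=authkey)
```

One way to serve it is start_manager(url_q, result_q).get_server().serve_forever(), which runs the server in the current process and therefore does not need the lambdas to be picklable.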

The data extraction process reads the returned data from the result_q queue, puts the URLs in that data into the conn_q queue for the URL management process, and puts the article title and abstract into the store_q queue for the data storage process. The code is as follows:
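The routing described above can be sketched as follows. The function name result_solve_proc, the message layout (a dict with new_urls and data keys), and the "end" sentinel used to shut the pipeline down are assumptions for this example.

```python
import queue

END = "end"  # assumed sentinel that tells each stage to stop


def result_solve_proc(result_q, conn_q, store_q):
    """Route crawler results: new URLs to conn_q, page data to store_q."""
    while True:
        try:
            content = result_q.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing returned yet, keep waiting
        if content["new_urls"] == END:
            # Pass the stop signal on to the data storage process and exit.
            store_q.put(END)
            return
        conn_q.put(content["new_urls"])  # set of newly discovered URLs
        store_q.put(content["data"])     # dict with url, title and summary
```

The function only relies on get and put, so it works with multiprocessing.Queue objects in the real control node and with plain queue.Queue objects in a quick local test.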

Finally, the four queues are initialized, and the distributed manager, the URL management process, the data extraction process, and the data storage process are started. The code is as follows:
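The wiring can be sketched roughly as below. The helper names build_queues and build_processes, the process function names, their argument order, and the root_url parameter are hypothetical; the three process functions are passed in rather than defined here.

```python
from multiprocessing import Process, Queue

QUEUE_NAMES = ("url_q", "result_q", "conn_q", "store_q")


def build_queues():
    # One multiprocessing.Queue per channel described in the text.
    return {name: Queue() for name in QUEUE_NAMES}


def build_processes(queues, url_manager_proc, result_solve_proc, store_proc, root_url):
    """Create (but do not start) the three control-node processes."""
    return [
        # URL management: sends URLs out on url_q, receives new ones on conn_q.
        Process(name="url_manager", target=url_manager_proc,
                args=(queues["url_q"], queues["conn_q"], root_url)),
        # Data extraction: splits results between conn_q and store_q.
        Process(name="data_extraction", target=result_solve_proc,
                args=(queues["result_q"], queues["conn_q"], queues["store_q"])),
        # Data storage: writes whatever arrives on store_q.
        Process(name="data_storage", target=store_proc,
                args=(queues["store_q"],)),
    ]
```

A control node would then call start() on each returned process and finally serve the distributed manager for url_q and result_q, so that crawler nodes can connect.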

7.3 Crawler Node

The crawler node (SpiderNode) is relatively simple and consists mainly of an HTML downloader, an HTML parser, and a crawler scheduler. The execution process is as follows: the crawler scheduler reads a URL from the url_q queue on the control node; the crawler scheduler calls the HTML downloader and the HTML parser to obtain the new URLs, the title, and the summary from the web page; finally, the crawler scheduler puts the new URLs, the title, and the summary into the result_q queue, which hands them over to the control node.

7.3.1 HTML downloader

The HTML downloader code is the same as in Chapter 6; the only thing to watch out for is the encoding of the web page. The code is as follows:
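The Chapter 6 listing is not repeated here; as a stand-in, the sketch below uses only the standard library's urllib.request and shows one way to handle the page encoding. The class name, the User-Agent string, the timeout, and the fallback to UTF-8 are assumptions.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; demo-crawler)"  # placeholder value


def decode_body(body, charset=None):
    """Decode raw bytes with the declared charset, falling back to UTF-8."""
    try:
        return body.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The server declared a charset name that Python does not know.
        return body.decode("utf-8", errors="replace")


class HtmlDownloader:
    def download(self, url, timeout=10):
        if url is None:
            return None
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                # Charset from the Content-Type header, if the server sent one.
                charset = response.headers.get_content_charset()
                return decode_body(response.read(), charset)
        except OSError:
            # URLError and HTTPError are OSError subclasses: treat as "no page".
            return None
```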

7.3.2 HTML parser

The HTML parser code is the same as in Chapter 6, where the web page analysis is explained in detail. The code is as follows:
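Since the page analysis itself lives in Chapter 6, the sketch below only illustrates the shape of such a parser with the standard library's html.parser. Taking the title from the title tag, the summary from the description meta tag, and filtering links with the pattern /item/ are placeholder choices for this example, not the selectors used in the book.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects link targets, the <title> text and the meta description."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title_parts = []
        self.summary = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.summary = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


class HtmlParser:
    def __init__(self, link_pattern=r"/item/"):
        # Placeholder pattern for "links that point at other entries".
        self.link_pattern = re.compile(link_pattern)

    def parse(self, page_url, html_text):
        """Return (set of absolute new URLs, dict with url/title/summary)."""
        if page_url is None or html_text is None:
            return None, None
        page = _PageParser()
        page.feed(html_text)
        new_urls = {urljoin(page_url, href) for href in page.links
                    if self.link_pattern.search(href)}
        data = {"url": page_url,
                "title": "".join(page.title_parts).strip(),
                "summary": page.summary.strip()}
        return new_urls, data
```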

7.3.3 Crawler Scheduler

The crawler scheduler uses the worker-process code of the distributed process; for details, see the section on distributed processes in Chapter 1. The crawler scheduler first connects to the control node, then takes URLs from the url_q queue, downloads and parses the web pages, and puts the resulting data into the result_q queue to return it to the control node. The code is as follows:
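A sketch of these two steps, connecting and then looping over tasks, is given below. The registered names, port 8001, the authkey, the "end" sentinel, and the message layout are assumptions and must match whatever the control node actually uses; the download and parse callables are passed in so that the loop stays independent of a particular downloader or parser.

```python
from multiprocessing.managers import BaseManager


class QueueManager(BaseManager):
    """Client-side manager; only the registered names matter here."""


def connect(server_addr, port=8001, authkey=b"baike"):
    # Register the names only; the queues themselves live on the control node.
    QueueManager.register("get_task_queue")
    QueueManager.register("get_result_queue")
    manager = QueueManager(address=(server_addr, port), authkey=authkey)
    manager.connect()
    return manager.get_task_queue(), manager.get_result_queue()


def crawl(task_q, result_q, download, parse, end="end"):
    """Take URLs from task_q until the end sentinel, push results to result_q."""
    while True:
        url = task_q.get()  # blocks until the control node sends a URL
        if url == end:
            # Forward the stop signal so the control node can shut down too.
            result_q.put({"new_urls": end, "data": end})
            return
        html_text = download(url)
        if html_text is None:
            continue  # download failed, skip this URL
        new_urls, data = parse(url, html_text)
        result_q.put({"new_urls": new_urls, "data": data})
```

On a crawler node this would be used as task_q, result_q = connect(control_node_ip) followed by crawl(task_q, result_q, downloader.download, parser.parse), where control_node_ip is the address of the control node host.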

A local IP is set in the crawler scheduler, so you can test the correctness of the code on a single machine. You can also use three VPS servers: two run the crawler node program, with the IP changed to the public IP of the control node host, and one runs the control node program. Distributed crawling in that setup is closer to a real crawling environment. Figure 7.3 shows the final crawled data, Figure 7.4 the content of new_urls.txt, and Figure 7.5 the content of old_urls.txt; you can compare them against your own test results. This simple distributed crawler still leaves a lot of room for improvement, and I hope you will use your own ingenuity to take it further.

Figure 7.4 new_urls.txt

Figure 7.5 old_urls.txt

Reference: Manually build distributed crawlers, Cloud+ Community, Tencent Cloud