Python | How to get twice the result with half the effort

When writing crawlers, you have probably run into disappointing crawl speeds once the amount of data grows large. This time I took the "360 Image Crawler" from the last article and rewrote it as multi-process, multi-thread, and multi-process + multi-thread versions for comparison. The link to the previous article is:

" What is the structure of a crawler? „Äč


The above shows the time the serial image crawler (before adding any multi-process or multi-thread code) needs to download 1,000 images.


Let's first take a look at how processes are created~

1. Create a child process from the parent process:

The multiprocessing module provides a Process class to describe a process object. To create a child process, you only need to pass in a target function and its arguments to construct a Process instance, then start it with the start() method and wait for it to finish with the join() method.
import os
from multiprocessing import Process

# The code to be executed by the child process
def run_proc(name):
    print('Child process %s (%s) Running ...' % (name, os.getpid()))  # os.getpid() gets the process id

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    processes = []
    for i in range(5):
        # The first argument is the task function, the second the arguments passed to it
        p = Process(target=run_proc, args=(str(i),))
        print('Process will start.')
        p.start()  # actually launch the child process
        processes.append(p)
    for p in processes:
        p.join()  # wait for every child process to finish
    print('Process end.')

2. Create a process pool; by default it holds as many worker processes as the machine has CPU cores:

Pool provides a specified number of processes for the user to call; the default size is the number of CPU cores. When a new request is submitted and the pool is not yet full, a new process is created to execute it; if the maximum number has already been reached, the request waits until one of the processes finishes and is then handled.
from multiprocessing import Pool
import os, time, random

def run_task(name):
    print('Task %s (pid = %s) is running ...' % (name, os.getpid()))
    time.sleep(random.random() * 3)
    print('Task %s end.' % name)

if __name__ == '__main__':
    print('Current process %s.' % os.getpid())
    p = Pool(processes=3)
    for i in range(5):
        p.apply_async(run_task, args=(i,))  # submit a task; i is the argument passed to it
    print('Waiting for all subprocesses done...')
    p.close()  # stop accepting new tasks
    p.join()   # wait for all worker processes to finish; on a Pool, close() must be called before join()
    print('All subprocesses done.')

That covers creating processes. After the crawler was rewritten on this basis, its running time was generally as follows:


Now let's take a look at how threads are created~

Method one:

Pass in a function to create a threading.Thread instance:
import threading

# Code executed by the new thread
def thread_run(urls):
    print('Current %s is running...' % threading.current_thread().name)
    for url in urls:
        print('%s --->>> %s' % (threading.current_thread().name, url))
    print('%s ended.' % threading.current_thread().name)

print('%s is running...' % threading.current_thread().name)
t1 = threading.Thread(target=thread_run, name='Thread_1', args=(['url_1', 'url_2', 'url_3'],))
t2 = threading.Thread(target=thread_run, name='Thread_2', args=(['url_4', 'url_5', 'url_6'],))
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)

Method two:

Create a thread class that inherits from threading.Thread:
import threading

class myThreading(threading.Thread):
    def __init__(self, name, urls):
        threading.Thread.__init__(self, name=name)  # initialize the thread
        self.urls = urls

    def run(self):
        print('Current %s is running...' % threading.current_thread().name)
        for url in self.urls:
            print('%s --->>> %s' % (threading.current_thread().name, url))
        print('%s ended.' % threading.current_thread().name)

print('%s is running...' % threading.current_thread().name)
t1 = myThreading(name='Thread_1', urls=['url_1', 'url_2', 'url_3'])
t2 = myThreading(name='Thread_2', urls=['url_4', 'url_5', 'url_6'])
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)

Those are the two ways to create threads. After the crawler was rewritten on this basis, a puzzle appeared: the speed hardly improved at all, and it was even slower than the serial version:

Multi-process + multi-thread

I then changed the serial crawler to multi-process + multi-thread and found it was 2 seconds faster than using multi-process alone, so the speed improved a little further.

Here, multiple processes are used to request the web pages, and multiple threads are used to download the images.
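As a rough illustration of that division of labor, here is a minimal, self-contained sketch. The page ids and image URLs are made-up placeholders, and download_image is a stub standing in for the real request-and-save code:

```python
import threading
from multiprocessing import Pool

def download_image(img_url):
    """Thread worker stub: in the real crawler this would fetch and save one image."""
    return 'saved:' + img_url

def crawl_page(page_id):
    """Process worker: 'request' one page, then download its images with threads."""
    # Placeholder for the image links parsed out of the requested page
    img_urls = ['page%d_img%d' % (page_id, i) for i in range(3)]
    saved = []
    lock = threading.Lock()

    def worker(url):
        result = download_image(url)
        with lock:  # guard the shared list against concurrent appends
            saved.append(result)

    threads = [threading.Thread(target=worker, args=(u,)) for u in img_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved

if __name__ == '__main__':
    # Each page is handled in its own worker process; its images download in threads
    with Pool(processes=2) as pool:
        pages = pool.map(crawl_page, range(4))
    print('%d images saved' % sum(len(p) for p in pages))
```

The processes sit on the outside because each page request is independent, while the threads sit on the inside where the work is mostly waiting on downloads.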



Multi-process brings a clear speedup. Multiprocessing is generally suited to CPU-bound work (heavy loops); in a crawler it is usually applied to the list of URLs to be requested.
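To make the CPU-bound case concrete, here is a small sketch where a pool of workers splits a heavy computation. The count_primes function is an invented stand-in for any loop-heavy task, not part of the original crawler:

```python
import os
from multiprocessing import Pool

def count_primes(limit):
    """CPU-bound task: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == '__main__':
    limits = [10_000, 20_000, 30_000, 40_000]
    # One worker per CPU core; map distributes the four tasks across processes
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(count_primes, limits)
    print(results)
```

Because each task pegs a CPU core, running them in separate processes sidesteps the GIL in a way that threads cannot.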


Using multithreading alone barely changes the crawl speed. Multithreading is generally suited to I/O-bound work (file reading and writing); in a crawler that means writing files and downloading images, video, and music.
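The I/O-bound case can be sketched the same way: each thread spends most of its time waiting, so ten "downloads" overlap instead of queuing up. The sleep stands in for network latency, and the URLs are placeholders:

```python
import threading
import time

def fake_download(url, results, lock):
    """Simulate an I/O-bound download: the thread sleeps while 'waiting on the network'."""
    time.sleep(0.2)  # stands in for network latency
    with lock:
        results.append(url)

urls = ['url_%d' % i for i in range(10)]
results, lock = [], threading.Lock()
start = time.time()
threads = [threading.Thread(target=fake_download, args=(u, results, lock)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# Ten 0.2 s waits overlap, so the total stays close to 0.2 s rather than 2 s
print('Downloaded %d urls in %.2f s' % (len(results), elapsed))
```

The GIL is released during sleeps and network waits, which is why threads help here but not with CPU-bound loops.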

Multi-process + multi-thread:

Combining the two and using each where it fits is what changes the speed. No single technique suits every program; you have to prescribe the right medicine. This crawler combines the two, and although the speed only improved by 2 seconds, it is still a change for the better.

Reference: Python | How to get twice the result with half the effort (Cloud+ Community, Tencent Cloud)