When writing crawlers, you have probably run into unsatisfactory crawling speed once the amount of data gets large. This time I took the "360 Image Crawler" from the last article, rewrote it as multi-process, multi-thread, and multi-thread + multi-process crawlers, and compared them. The link to the previous article is as follows:
"What is the structure of a crawler?"
The above is the time the serial image crawler, with no multi-threading or multi-processing added, needs to download 1000 images.
Let's first take a look at how to create multiple processes.
1. Create child processes from the parent process:
"""
multiprocessing provides a cross-platform Process class to describe process objects.
To create a child process, pass in the function to execute and its arguments to
build a Process instance, start it with start(), and use join() to synchronize
between processes.
"""
import os
from multiprocessing import Process


# Code to be executed by the child process
def run_proc(name):
    # os.getpid() returns the id of the current process
    print('Child process %s (%s) Running ...' % (name, os.getpid()))


if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    processes = []
    for i in range(5):
        # target is the task function; args is the tuple of arguments passed to it
        p = Process(target=run_proc, args=(str(i),))
        print('Process will start.')
        p.start()
        processes.append(p)
    # Wait for every child process, not just the last one started
    for p in processes:
        p.join()
    print('Process end.')
2. Create a process pool. By default, the pool has as many worker processes as the computer has CPU cores:
"""
Pool provides a specified number of worker processes for the caller to use; the
default size is the number of CPU cores. When a task is submitted and a worker
is free, the task runs right away. If all workers are busy, the task waits
until one of them finishes and then runs.
"""
import os
import random
import time
from multiprocessing import Pool


def run_task(name):
    print('Task %s (pid = %s) is running ...' % (name, os.getpid()))
    time.sleep(random.random() * 3)
    print('Task %s end.' % name)


if __name__ == '__main__':
    print('Current process %s.' % os.getpid())
    p = Pool(processes=3)
    for i in range(5):
        # Submit a task to the pool; i is the argument passed to run_task
        p.apply_async(run_task, args=(i,))
    print('Waiting for all subprocesses done...')
    p.close()  # No more tasks can be submitted to the pool
    p.join()   # Wait for all worker processes to finish; close() must be called first
    print('All subprocesses done.')
The above covers how to create multiple processes. After the crawler is modified on this basis, its running time is roughly as follows:
Next, let's take a look at how to create multiple threads.
"""
Pass in a function and create an instance of Thread
"""
import random
import threading
import time


# Code executed by the new thread
def thread_run(urls):
    print('Current %s is running...' % threading.current_thread().name)
    for url in urls:
        print('%s --->>> %s' % (threading.current_thread().name, url))
        time.sleep(random.random())
    print('%s ended.' % threading.current_thread().name)


print('%s is running...' % threading.current_thread().name)
t1 = threading.Thread(target=thread_run, name='Thread_1', args=(['url_1', 'url_2', 'url_3'],))
t2 = threading.Thread(target=thread_run, name='Thread_2', args=(['url_4', 'url_5', 'url_6'],))
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)

"""
Create a thread class that inherits from threading.Thread
"""
import random
import threading
import time


class myThreading(threading.Thread):
    def __init__(self, name, urls):
        threading.Thread.__init__(self, name=name)  # Initialize the thread
        self.urls = urls

    # run() is called in the new thread after start()
    def run(self):
        print('Current %s is running...' % threading.current_thread().name)
        for url in self.urls:
            print('%s --->>> %s' % (threading.current_thread().name, url))
            time.sleep(random.random())
        print('%s ended.' % threading.current_thread().name)


print('%s is running...' % threading.current_thread().name)
t1 = myThreading(name='Thread_1', urls=['url_1', 'url_2', 'url_3'])
t2 = myThreading(name='Thread_2', urls=['url_4', 'url_5', 'url_6'])
t1.start()
t2.start()
t1.join()
t2.join()
print('%s ended.' % threading.current_thread().name)
The above are the two ways to create threads. After I modified my crawler on this basis, I was puzzled to find that the speed had not improved much and was even slower than the serial version:
Multi-process + multi-thread
I then changed the serial crawler to multi-process + multi-thread and found that it was 2 seconds faster than using multi-process alone, so the speed increased a bit.
Here, multiple processes are used to request the web pages, and multiple threads are used to download the pictures.
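The division of labor described here can be sketched roughly as follows. This is only a minimal illustration and not the crawler's actual code: `crawl_page`, `download_image`, and the image file names are hypothetical, and `time.sleep` stands in for the real network and disk work.

```python
import threading
import time
from multiprocessing import Pool


def download_image(url, saved):
    """Hypothetical stand-in for downloading and saving one image."""
    time.sleep(0.05)   # simulate network and disk I/O
    saved.append(url)  # appending to a list is thread-safe in CPython


def crawl_page(page):
    """Runs in a worker process: 'request' one page, then download its images with threads."""
    # Assumption: every page yields three image URLs
    image_urls = ['page%d_img%d.jpg' % (page, i) for i in range(3)]
    saved = []
    threads = [threading.Thread(target=download_image, args=(url, saved))
               for url in image_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(saved)


if __name__ == '__main__':
    with Pool(processes=3) as pool:
        # Each page goes to a worker process; inside it, threads fetch the images
        for images in pool.map(crawl_page, range(4)):
            print(images)
```

Each worker process handles whole pages, and the threads inside it only overlap the waiting time of that page's downloads.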
With multi-process, the speed improved noticeably. Multiple processes are generally used for CPU-intensive work (for example, heavy for loops). In crawlers, they are generally applied to the list of URLs to request.
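As a rough sketch of applying a process pool to a URL list, `Pool.map` can hand the list out to the workers and collect the results. The function `fake_request` and the `example.com` URLs are made up for illustration, and the sleep only simulates waiting for a response.

```python
import os
import time
from multiprocessing import Pool


def fake_request(url):
    """Hypothetical stand-in for requesting one page; returns (url, worker pid)."""
    time.sleep(0.1)  # simulate waiting for the response
    return url, os.getpid()


if __name__ == '__main__':
    # Assumption: made-up listing-page URLs, not the crawler's real ones
    urls = ['https://example.com/list?page=%d' % n for n in range(6)]
    with Pool(processes=3) as pool:
        # map() splits the URL list across the workers and returns
        # the results in the same order as the input
        results = pool.map(fake_request, urls)
    for url, pid in results:
        print('%s handled by pid %s' % (url, pid))
```

The printed pids show which worker handled which URL. With three workers, the six requests should be spread over at most three different pids.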
Using multithreading alone did not change the crawling speed much. Multiple threads are generally used for I/O-intensive work (such as reading and writing files), because threads can overlap the time spent waiting on I/O. In crawlers, that generally means reading and writing files and downloading pictures, video, and music.
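To see the effect of threads on waiting-dominated work in isolation, here is a small timing sketch. It makes an idealized assumption: `fake_download` is a hypothetical function that only sleeps. A real crawler may not see the same gain, for instance when bandwidth or the remote server is the bottleneck.

```python
import threading
import time


def fake_download(url):
    """Hypothetical stand-in for a download that spends its time waiting."""
    time.sleep(0.1)


def run_serial(urls):
    """Download one URL after another and return the elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        fake_download(url)
    return time.perf_counter() - start


def run_threaded(urls):
    """Download every URL in its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    urls = ['url_%d' % i for i in range(5)]
    print('serial:   %.2f s' % run_serial(urls))
    print('threaded: %.2f s' % run_threaded(urls))
```

In this simulation the serial run waits for each download in turn, while the threaded run lets the waits overlap, so the threaded time should come out noticeably shorter.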
Multi-process + multi-thread:
Combining the two and using each where it fits will change the speed. Neither technique is mandatory; you have to prescribe the right medicine for the problem. My crawler combines the two, and although the speed only improved by 2 seconds, that is still a change.