How to build a crawler proxy service
Anyone who has done some crawling knows that websites and their data are endless. Crawl too fast and you will inevitably trigger the site's anti-crawling mechanism, and the near-universal countermeasure is to block your IP. There are two ways around this: slow the crawler down so it stays under the site's rate limit, or use a large pool of proxy IPs and rotate between them.
The first option sacrifices time and speed in exchange for data, but time is usually what we have least of; ideally we want the most data in the shortest time. So the second option is the one to recommend. But where do you find that many proxy IPs?
Finding proxies
When programmers don't know something, they search for it: open Google or Baidu and enter the keyword "free proxy IP". The first few pages of results are almost all websites offering proxy IPs. Open them one by one and you will find they are mostly list pages, displaying anywhere from tens to hundreds of IPs.
Look closer, though, and you will notice that each site only gives away a limited number of free IPs, and after trying a few you will find that some have already expired. Naturally, these sites would much rather sell you their paid proxies; that is how they make money.
A resourceful programmer won't give up at the first obstacle, of course. Think about it: if a search engine can surface this many proxy sites, and each site provides tens to hundreds of IPs, then 10 such sites add up to hundreds or even thousands of IPs.
So all you have to do is note down these websites and write a program to scrape the IPs from them. Easy enough, right? With that approach, you should now have hundreds or thousands of proxy IPs.
But wait: with so many IPs, are others really giving them to you for free? Of course not. As mentioned above, a large fraction of these proxies are already dead. So what do you do? How do you tell which proxies work and which don't?
It's simple: route a request through each proxy to some stable website and see whether the request succeeds. Proxies that respond normally are usable; the ones that fail are dead.
The quickest check is a curl command that tests whether a single proxy works (the IP below is a placeholder):

```shell
# Test the proxy 22.214.171.124:3128 by fetching the NetEase homepage through it
curl -x "22.214.171.124:3128" "http://www.163.com"
```
Of course, curl is just convenient for a demonstration. A better approach is to check proxies concurrently: spawn multiple threads, have each one visit a test website through a different proxy, and output the proxies that succeed. This is the fastest way to filter out the working ones.
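The multi-threaded check described above can be sketched in Python with only the standard library. This is a minimal illustration, not the author's actual code; the test URL, the helper names, and the worker count are all assumptions:

```python
import concurrent.futures
import urllib.request

TEST_URL = "http://www.163.com"   # any stable website works as the probe target


def check_proxy(proxy, timeout=5):
    """Return the proxy string if it can fetch TEST_URL, else None."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    try:
        opener.open(TEST_URL, timeout=timeout)
        return proxy
    except Exception:
        return None


def find_alive(proxies, workers=50):
    """Check all candidate proxies in parallel and keep the live ones."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(check_proxy, proxies)
    return [p for p in results if p is not None]
```

With 50 threads, checking a few hundred candidates takes seconds instead of minutes, since dead proxies time out in parallel rather than one after another.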
With the method above you can now produce a list of working proxies, and applying them in a program should need little explanation. For example, write the working proxies to a file, one proxy per line, and have the crawler read that file and pick a proxy for each request.
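A minimal sketch of that file-based usage, assuming a file named `proxies.txt` and using only the standard library (a real crawler might use a third-party HTTP client instead):

```python
import random
import urllib.request


def load_proxies(path="proxies.txt"):
    """Read the proxy list: one 'ip:port' entry per line, blank lines ignored."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def fetch(url, proxies, timeout=5):
    """Fetch url through a randomly chosen proxy from the pool."""
    proxy = random.choice(proxies)
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=timeout).read()
```

Picking a random proxy per request spreads the load across the pool, so no single IP hits the target site too often.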
With a few hundred proxies like this, you can basically keep crawling a given website for a while; grabbing tens of thousands of records is not a problem.
However, if you want to pull data from a site continuously, or crawl millions or even hundreds of millions of pages, this definitely won't be enough.
A continuous supply of proxies
The method so far scrapes a few proxy websites once, tests each proxy programmatically, and produces a list of working ones. But it is a one-shot process, the resulting pool is small, and it certainly can't sustain continuous crawling. So how do you keep finding usable proxies?
Extending the idea above, you can write a collector program that scrapes and tests proxies automatically on a schedule, and have the crawler periodically fetch the results from the file or database. One small problem remains, though: how do you know the quality of each proxy? In other words, how fast is each one?
What came before is just the basics. Recording quality metrics (for instance, the checker can time each test request to log a proxy's response speed and success rate) lets you further optimize the collector so that it outputs a prioritized proxy list. The crawler then selects proxies by weight and retires each one after a maximum number of uses. The benefits: high-quality proxies are used first, while no single proxy is used so frequently that it gets banned.
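The weighted selection with a per-proxy usage cap could be sketched like this. The class name, the weighting formula, and the default cap are all illustrative assumptions, not the author's design:

```python
import random


class ProxyPool:
    """Toy prioritized pool: faster proxies get higher selection weight,
    and each proxy is retired after max_uses picks to avoid bans."""

    def __init__(self, max_uses=20):
        self.max_uses = max_uses
        self.stats = {}   # proxy -> {"rt": response time in seconds, "uses": n}

    def report(self, proxy, response_time):
        """Record (or refresh) a proxy's measured response time."""
        uses = self.stats.get(proxy, {}).get("uses", 0)
        self.stats[proxy] = {"rt": response_time, "uses": uses}

    def pick(self):
        """Weighted-random pick among proxies still under the usage cap."""
        alive = {p: s for p, s in self.stats.items()
                 if s["uses"] < self.max_uses}
        if not alive:
            return None
        proxies = list(alive)
        # weight inversely proportional to response time: faster = likelier
        weights = [1.0 / (alive[p]["rt"] + 0.01) for p in proxies]
        choice = random.choices(proxies, weights=weights, k=1)[0]
        alive[choice]["uses"] += 1
        return choice
```

Inverse-response-time weighting is one simple heuristic; success rate or recency of the last check could be folded into the weight as well.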
After the series of improvements and optimizations above, you have built a usable proxy service, albeit one based only on the file system or a database.
For the crawler to use these proxies, it still has to read the file or query the database and then pick a proxy according to some rule, which is cumbersome. Can we make proxies even easier for the crawler to consume? Yes: turn proxy access into a service.
The well-known server software Squid does this perfectly with its cache_peer mechanism for neighbor (upstream) proxies.
Simply write each proxy from your list into Squid's configuration file as a cache_peer entry, in the format Squid expects.
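A sketch of what such a configuration fragment might look like; the IPs, ports, and weights are placeholders, and the exact set of cache_peer options you want will depend on your Squid version and needs:

```
# squid.conf fragment: one cache_peer line per upstream proxy from the tested list
cache_peer 1.2.3.4 parent 3128 0 no-query weighted-round-robin weight=10 connect-fail-limit=2
cache_peer 5.6.7.8 parent 8080 0 no-query weighted-round-robin weight=5 connect-fail-limit=2

# never fetch directly from the origin server; always go through a cache_peer
never_direct allow all
```

Here `weighted-round-robin` with `weight=` lets Squid favor the better proxies, `connect-fail-limit` takes a dead peer out of rotation after repeated connection failures, and `never_direct` forces all traffic through the upstream proxies.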
Squid is proxy server software, and here it is used like this: the crawler runs on machine A, Squid is installed on machine B, the target website to be crawled is machine C, and the proxy IPs are machines D, E, F, and so on.
The advantage is that the crawler no longer needs to worry about loading and selecting working proxies. It hands the proxy list to Squid, which, following the rules in its configuration file, manages and schedules proxy selection for you. Most importantly, the crawler only needs to talk to the Squid service port to use all of those proxies!
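From the crawler's side, this collapses to a single proxy address. A minimal sketch, assuming Squid runs on a machine at a placeholder address and listens on its default port 3128:

```python
import urllib.request


def squid_opener(squid_addr):
    """Build an opener that routes all HTTP traffic through the Squid box.
    The address is a placeholder; 3128 is Squid's default port."""
    handler = urllib.request.ProxyHandler({"http": f"http://{squid_addr}"})
    return urllib.request.build_opener(handler)


# The crawler now only knows one address: the Squid service.
opener = squid_opener("10.0.0.2:3128")
# html = opener.open("http://www.163.com", timeout=10).read()
```

Every request goes to the same endpoint, and Squid decides which upstream proxy actually carries it.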
Now that the service is set up, the only step left is to integrate the pieces: have the collector feed its latest tested proxies into Squid's configuration on a schedule.
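One way that integration could look: the collector periodically rewrites the cache_peer lines and asks Squid to reload with `squid -k reconfigure`. The config path, the template, and the tuple format are assumptions for illustration:

```python
import subprocess

# Template for one upstream proxy line (options are illustrative)
CONF_TEMPLATE = (
    "cache_peer {ip} parent {port} 0 "
    "no-query weighted-round-robin weight={weight} connect-fail-limit=2\n"
)


def render_conf(proxies):
    """proxies: list of (ip, port, weight) tuples from the checker."""
    lines = [CONF_TEMPLATE.format(ip=ip, port=port, weight=w)
             for ip, port, w in proxies]
    lines.append("never_direct allow all\n")
    return "".join(lines)


def refresh(proxies, conf_path="/etc/squid/conf.d/peers.conf"):
    """Rewrite the peer config and tell Squid to reload it gracefully."""
    with open(conf_path, "w") as f:
        f.write(render_conf(proxies))
    # 'squid -k reconfigure' reloads the config without restarting the daemon
    subprocess.run(["squid", "-k", "reconfigure"], check=True)
```

Keeping the generated peer lines in their own file (pulled in by the main squid.conf) avoids clobbering the rest of the configuration on each refresh.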
A complete proxy service built this way outputs high-quality proxies on a regular schedule. The crawler no longer needs to care about collecting or testing proxies at all; it just uses Squid's single service endpoint to crawl its data.
About the columnist
Kaito works on Python web development and crawlers, with 2 years of development experience in the crawler field, and has built a distributed vertical crawler platform. He can also do secondary development based on open-source frameworks. Blog: http://kaito-kidd.com
Python Chinese Community