How to build a crawler proxy service

Columnist: Kaito

Background

Anyone who has done any crawling knows the problem: there are plenty of websites and plenty of data, but if the crawler fetches too fast it will trigger the site's anti-crawling measures, and the almost universal countermeasure is to block your IP. There are two ways around this:

  • 1. Keep the same IP and slow down (reduce the crawling speed)
  • 2. Access the site through proxy IPs (recommended)

The first option trades time and speed for data, but under normal circumstances our time is precious; ideally we want to get the most data in the least time. So the second option is recommended. The question is: where do you find that many proxy IPs?

Finding proxies

When programmers don't know something, they search for it. Open Google or Baidu and enter the keyword "free proxy IP". The first few pages of results are almost all websites offering proxy IPs. Open them one by one and you will see that nearly every one is a list page showing anywhere from a few dozen to a few hundred IPs.

But look more closely and you will notice that each website only offers a limited number of free IPs, and after trying a few you will find that some have already expired. Naturally, these sites would much rather sell you their paid proxies; that is how they make money.

As a resourceful programmer, you obviously can't give up at this hurdle. Think about it: if a search engine can turn up this many proxy-listing websites, and each one offers dozens or hundreds of proxies, then with just 10 such sites that already adds up to several hundred or even a few thousand IPs.

So all you need to do is record these websites and write a program to scrape the IPs from them. Sounds easy enough, right?
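
For example, here is a minimal sketch in Python of pulling IP:port pairs out of a proxy-list page with a regular expression. The source URL is a placeholder and every proxy site lays out its list differently, so in practice you would write a small parser per source site.

    import re
    import requests

    PROXY_SITES = [
        "http://example-free-proxy-site.com/list",   # hypothetical source site
    ]

    IP_PORT_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})")

    def collect_proxies():
        """Scrape candidate proxies ("ip:port") from the source sites."""
        proxies = set()
        for url in PROXY_SITES:
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue                        # one source site being down is not fatal
            for ip, port in IP_PORT_RE.findall(html):
                proxies.add(f"{ip}:{port}")
        return proxies

    if __name__ == "__main__":
        for p in collect_proxies():
            print(p)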

Testing the proxies

Following the approach above, you should now have a few hundred or even a few thousand proxy IPs.

But wait: would other people really give you this many IPs for free? Of course not. As mentioned earlier, a large share of these proxies are already dead. So what do you do? How do you know which proxies work and which do not?

It's simple: send a request through each proxy to some stable website and see whether it succeeds. The proxies that respond normally are usable; the ones that fail are invalid.

The quickest check is a curl command to test whether a single proxy works:

    # Use proxy 48.139.133.93:3128 to visit the NetEase homepage
    curl -x "48.139.133.93:3128" "http://www.163.com"

Of course, this command is only convenient for a quick demonstration. The better approach is to test the proxies in multiple threads: request a fixed website through each proxy and output the ones that respond successfully. This is the fastest way to find the usable proxies.
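
A minimal sketch of that multi-threaded check, assuming HTTP proxies in "ip:port" form and reusing the NetEase homepage as the probe target; the worker count is an arbitrary choice.

    import requests
    from concurrent.futures import ThreadPoolExecutor

    TEST_URL = "http://www.163.com"     # any stable page works as the probe target

    def check(proxy, timeout=5):
        """Return the proxy if a request through it succeeds, else None."""
        try:
            resp = requests.get(
                TEST_URL,
                proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                timeout=timeout,
            )
            return proxy if resp.status_code == 200 else None
        except requests.RequestException:
            return None

    def find_alive(proxies, workers=50):
        """Check all proxies concurrently and keep the ones that answered."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return [p for p in pool.map(check, proxies) if p]

Running find_alive() over the scraped list leaves you with the proxies that actually work right now.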

Using the proxies

Now you can pick out the usable proxies with the method above. How to plug them into a program hardly needs explaining; most people have done it before. For example, write the usable proxies into a file, one proxy per line, and then:

  • 1. Read the proxy file
  • 2. Randomly pick a proxy IP and send the HTTP request through it

Done this way (a sketch follows below), a few hundred proxies can generally keep a crawler running against a site for a while, and grabbing tens of thousands of records is not a problem.
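
A minimal sketch of those two steps, assuming the checked proxies were written to a file named proxies.txt:

    import random
    import requests

    def load_proxies(path="proxies.txt"):
        """Read one "ip:port" proxy per line."""
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    def fetch(url, proxy_pool):
        """Send the request through a randomly chosen proxy."""
        proxy = random.choice(proxy_pool)
        return requests.get(
            url,
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=10,
        )

    # Example usage:
    # pool = load_proxies()
    # resp = fetch("http://www.163.com", pool)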

But if you want to pull data from a site continuously, or crawl millions or even hundreds of millions of pages, this is definitely not enough.

A continuous supply of proxies

The method so far is a one-shot process: scrape a few proxy websites, test each proxy with a program, and end up with a list of usable proxies. But it is only a one-off, the number of proxies is usually small, and it certainly cannot keep up with the needs of continuous crawling. So how do you keep finding usable proxies?

  • 1. Find more proxy websites (the data source)
  • 2. Monitor these proxy websites on a schedule and fetch their proxies (a skeleton of this loop is sketched after this list)
  • 3. After fetching the proxy IPs, have the program test them automatically and write the usable ones to a file or database
  • 4. Have the crawler load the file or database and pick a proxy IP at random for each HTTP request
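
A rough skeleton of that collection loop might look like the sketch below; collect_proxies() and find_alive() stand in for the earlier scraping and checking sketches, and the one-hour interval and output file are arbitrary assumptions.

    import time

    def refresh_loop(interval=3600, out_path="proxies.txt"):
        while True:
            candidates = collect_proxies()     # fetch raw proxies from the source sites
            alive = find_alive(candidates)     # keep only the proxies that still work
            with open(out_path, "w") as f:     # publish them for the crawler to load
                f.write("\n".join(alive))
            time.sleep(interval)               # run again on the next schedule tick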

Following the steps above, you can write a program that collects proxies automatically, and the crawler can then fetch them from the file or database on a schedule and use them. But there is one small problem: how do you know the quality of each proxy? In other words, how fast is each proxy?

  • 1. When testing a proxy, record its request response time
  • 2. Rank proxies by response time from short to long and weight them accordingly, so that proxies with shorter response times are used more often
  • 3. Cap the number of times any proxy may be used within a given period

The earlier points are just the basics; these three points further refine the proxy program and produce a prioritized proxy list. The crawler then chooses proxies according to their weight and maximum use count. The benefit is twofold: high-quality proxies are preferred, while no single proxy is used so frequently that it gets blocked.
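
A minimal sketch of such weighted selection, assuming each checked proxy is stored as a (proxy, response_time) pair; the cap of 100 uses is an arbitrary choice.

    import random
    from collections import defaultdict

    MAX_USES = 100                      # arbitrary cap per proxy within one refresh cycle
    use_count = defaultdict(int)

    def pick_proxy(checked):
        """checked: list of (proxy, response_time_in_seconds) tuples."""
        candidates = [(p, rt) for p, rt in checked if use_count[p] < MAX_USES]
        if not candidates:
            raise RuntimeError("no proxies left under the use limit")
        # Shorter response time -> larger weight -> picked more often.
        weights = [1.0 / rt for _, rt in candidates]
        proxy = random.choices([p for p, _ in candidates], weights=weights, k=1)[0]
        use_count[proxy] += 1
        return proxy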

Turning it into a service

After the improvements and optimizations above, you have a usable proxy service, but it is still based only on a file or a database.

If the crawler wants to use these proxies, it has to read the file or query the database and then pick a proxy according to some rule, which is cumbersome. Can we make it even easier for the crawler to use a proxy? Then we need to turn proxy access into a service.

A well-known piece of server software, Squid, does exactly this with its cache_peer (neighbor proxy) mechanism.

Simply write the proxies from your list into Squid's configuration file, in the format required by its cache_peer directive.

Squid is a proxy server, and it is typically used like this: the crawler runs on machine A, Squid is installed on machine B, the target website's server is machine C, and the proxy IPs are machines D/E/F and so on.

  • 1. No proxy: crawler machine A —> website machine C
  • 2. Plain proxy: crawler machine A —> proxy IP machine D/E/F/... —> website machine C
  • 3. With Squid: crawler machine A —> Squid (machine B, whose cache_peer mechanism manages and schedules proxies D/E/F) —> website machine C

The advantage is that the crawler no longer needs to think about how to load and choose among the usable proxies. You hand Squid the proxy list and, according to the rules in its configuration file, it manages and schedules proxy selection for you. Most importantly, the crawler only needs to hit Squid's service port to use the proxies!
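
To make the configuration concrete, here is a hedged sketch of generating those cache_peer lines from the checked proxy list. The options shown (no-query, weighted-round-robin, weight, max-conn, never_direct) are real Squid directives, but treat this exact combination and the weight formula as assumptions to adapt to your own Squid setup.

    def to_cache_peer_lines(checked):
        """checked: list of (proxy, response_time) tuples, proxy given as "ip:port"."""
        lines = ["never_direct allow all"]   # force all requests to go through the peers
        for proxy, rt in checked:
            ip, port = proxy.split(":")
            weight = max(1, int(10 / rt))    # faster proxies get a larger weight
            lines.append(
                f"cache_peer {ip} parent {port} 0 "
                f"no-query weighted-round-robin weight={weight} max-conn=5"
            )
        return "\n".join(lines)

    # Write the result into the Squid configuration (the path varies by system), e.g.:
    # open("/etc/squid/conf.d/peers.conf", "w").write(to_cache_peer_lines(checked))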

Further integration

Now that the service layer is in place, the only remaining step is to integrate everything:

  • 1. Monitor the proxy source websites on a schedule (every 30 minutes or every hour is fine), parse out all the proxy IPs, and store them in the database
  • 2. Take every proxy out of the database, request a fixed website through it, pick out the proxies that succeed, and update their availability flag and response time in the database
  • 3. Load all available proxies from the database and, with some algorithm, compute each proxy's usage weight and maximum use count from its response time
  • 4. Write the configuration file in Squid's cache_peer format
  • 5. Reload the Squid configuration file to refresh the proxy list Squid manages (see the sketch after this list)
  • 6. The crawler just points at Squid's service IP and port and does nothing but crawl
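
As a rough illustration of steps 5 and 6: "squid -k reconfigure" asks a running Squid to reload its configuration, and the crawler then sends every request through Squid alone. The Squid host, port and config path here are assumptions for illustration.

    import subprocess
    import requests

    SQUID = "http://10.0.0.2:3128"      # machine B: Squid's service IP and port (assumed)

    def reload_squid():
        """Step 5: ask the running Squid to re-read its configuration."""
        subprocess.run(["squid", "-k", "reconfigure"], check=True)

    def crawl(url):
        """Step 6: the crawler only talks to Squid; Squid picks the real proxy."""
        return requests.get(url, proxies={"http": SQUID, "https": SQUID}, timeout=15)

    # reload_squid()
    # resp = crawl("http://www.163.com")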

A complete proxy service built this way keeps producing high-quality proxies on a regular schedule, and the crawler no longer has to care about collecting or testing proxies; it simply crawls data through Squid's single service entry point.

About the columnist

Kaito works on Python web and crawler development, has 2 years of experience in the crawler field, and has built a distributed vertical crawler platform. He is able to do secondary development on top of open-source frameworks. Blog: http://kaito-kidd.com

Python Chinese Community

www.python-cn.com
