Three Python crawler versions to get you started with crawlers the easy way

If we compare the Internet to a big spider web, then data is stored at the nodes of the web, and a crawler is just a little spider that crawls along the web for its prey (the data). In other words, a crawler is a program that sends requests to a website, fetches resources, and then analyzes and extracts the useful data.

From a technical point of view, a crawler uses a program to simulate a browser requesting a site, downloads the HTML code, JSON data, or binary data (images, videos) the site returns, and then extracts the data it needs and stores it for later use.

Basic environment configuration

Version: Python3

System: Windows

IDE: PyCharm

Tools needed for crawlers:

Request libraries: requests; selenium (drives a real browser, so it can parse and render CSS and JS, but at a performance cost, since the browser loads both the useful and the useless parts of a page)

Parsing libraries: regular expressions, BeautifulSoup, pyquery (a minimal request-and-parse sketch follows this list)

Storage: files, MySQL, MongoDB, Redis
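To make the list above concrete, here is a minimal request-and-parse sketch using requests and BeautifulSoup. The target URL is just a placeholder, not a site from the original article:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page (https://example.com is a placeholder, not a real target)
response = requests.get("https://example.com")
response.raise_for_status()

# Parse the returned HTML and extract every link target on the page
soup = BeautifulSoup(response.text, "html.parser")
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```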

The basic process of a Python crawler: initiate a request, get the response content, parse it, and save the data. The three versions below all implement this same flow.

Basic version:
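A minimal sketch of what the basic, straight-line version might look like. The index URL and the link regex are placeholders, not a real site:

```python
import re
import requests

# Placeholder index page; substitute a real target site
INDEX_URL = "https://example.com/videos"

# 1. Initiate the request and get the response text
html = requests.get(INDEX_URL).text

# 2. Parse: pull detail-page paths out with a regex (pattern is a placeholder)
paths = re.findall(r'href="(/video/\d+)"', html)

# 3. Fetch each resource and store it locally as a binary file
for path in paths:
    data = requests.get("https://example.com" + path).content
    with open(path.split("/")[-1] + ".mp4", "wb") as f:
        f.write(data)
```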

Function-encapsulated version:
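The same flow, refactored so that each step lives in its own function. This is again a sketch with a placeholder URL and pattern; the helper names (get_page, parse_index, save_file) are illustrative, not from the original article:

```python
import re
import requests

BASE_URL = "https://example.com"  # placeholder site

def get_page(url):
    """Initiate a request and return the raw response body."""
    return requests.get(url).content

def parse_index(html):
    """Extract detail-page paths from the index HTML (placeholder pattern)."""
    return re.findall(r'href="(/video/\d+)"', html)

def save_file(path):
    """Download one resource and write it to disk."""
    data = get_page(BASE_URL + path)
    with open(path.split("/")[-1] + ".mp4", "wb") as f:
        f.write(data)

def main():
    html = get_page(BASE_URL + "/videos").decode()
    for path in parse_index(html):
        save_file(path)

if __name__ == "__main__":
    main()
```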

Concurrent version:

(If 30 videos need to be crawled in total, 30 threads are opened to handle them, and the total time taken is roughly the time of the slowest single download.)
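A sketch of the concurrent version using a thread pool from the standard library. As above, the URL, the pattern, and the helper name are placeholders:

```python
import re
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://example.com"  # placeholder site

def save_file(path):
    """Download one resource and write it to disk."""
    data = requests.get(BASE_URL + path).content
    with open(path.split("/")[-1] + ".mp4", "wb") as f:
        f.write(data)

def main():
    html = requests.get(BASE_URL + "/videos").text
    paths = re.findall(r'href="(/video/\d+)"', html)  # placeholder pattern
    # One worker per download, so the total time is about the slowest single file
    with ThreadPoolExecutor(max_workers=max(len(paths), 1)) as pool:
        for path in paths:
            pool.submit(save_file, path)

if __name__ == "__main__":
    main()
```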

Once you understand the basic process of a Python crawler and compare it against the code, doesn't a crawler look remarkably simple?

Reference: "3 Python crawler versions to get you started with crawlers easily", Cloud+ Community, Tencent Cloud. https://cloud.tencent.com/developer/article/1519136