If we compare the Internet to a big spider web, then data is stored at each node of the web, and a crawler is a small spider that moves along the web capturing its prey (the data). In other words, a crawler is a program that sends requests to a website, obtains its resources, and then analyzes them to extract useful data.
From a technical point of view, a crawler is a program that simulates a browser requesting a site, fetches whatever the site returns (HTML code, JSON data, or binary data such as images and videos) to the local machine, and then extracts the data you need and stores it for later use.
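The request-fetch-extract flow described above can be sketched in a few lines. This is a minimal example using only the standard library's urllib (the requests library mentioned below offers an analogous but more convenient API); the target URL and the User-Agent string are placeholders, not part of any real project.

```python
import re
import urllib.request

def fetch(url: str) -> str:
    # Simulate a browser by sending a User-Agent header; many sites
    # reject requests that carry no browser-like headers at all.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_title(html: str) -> str:
    # Extract the <title> text from the raw HTML with a regular expression.
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.S | re.I)
    return m.group(1).strip() if m else ""

# Usage (performs a real network request):
#   html = fetch("https://example.com")   # hypothetical target site
#   print(extract_title(html))
```

The same three steps (request, fetch, extract) appear in every crawler, no matter which libraries are used for each step.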
Request libraries: requests, selenium (selenium drives a real browser, so it can render CSS and execute JS, but at a performance cost: it loads everything on the page, useful or not)
Parsing libraries: regular expressions, BeautifulSoup, pyquery
Storage: files, MySQL, MongoDB, Redis
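To show how the parsing and storage layers fit together, here is a small sketch that pairs one option from each list above: regular expressions for parsing and a plain JSON file as the store. The HTML snippet and file name are made up for illustration; a BeautifulSoup parse or a MySQL/MongoDB/Redis write would slot into the same two functions.

```python
import json
import re
from pathlib import Path

def parse_links(html: str) -> list[dict]:
    # Regular-expression parsing: pull every (href, text) pair out of
    # the anchor tags in the page.
    pattern = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S | re.I)
    return [{"href": h, "text": t.strip()} for h, t in pattern.findall(html)]

def store(items: list[dict], path: str = "links.json") -> None:
    # File-based storage: dump the extracted records as JSON.
    Path(path).write_text(json.dumps(items, ensure_ascii=False, indent=2))

# Example input standing in for a fetched page:
html = '<p><a href="/a">first</a> and <a href="/b">second</a></p>'
records = parse_links(html)
store(records, "links.json")
```

Keeping parsing and storage in separate functions makes it easy to swap either layer later, e.g. replacing the file store with a database insert.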
(For example, if you need to crawl 30 videos in total, open 30 threads to fetch them; the total time taken is the time of the slowest single download, not the sum of all of them.)
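The parenthetical claim above can be demonstrated with the standard library's concurrent.futures. In this sketch a 0.1-second sleep stands in for the network I/O of one video download; the names and numbers are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def download(video_id: int) -> str:
    # Stand-in for a real video download; the sleep simulates network I/O.
    time.sleep(0.1)
    return f"video-{video_id}.mp4"

start = time.perf_counter()
# 30 workers for 30 "videos": all downloads run concurrently.
with ThreadPoolExecutor(max_workers=30) as pool:
    results = list(pool.map(download, range(30)))
elapsed = time.perf_counter() - start

# With 30 threads, 30 simulated 0.1 s downloads finish in roughly 0.1 s
# (the time of the slowest single task), not the 3 s a serial loop would take.
```

Threads help here because the work is I/O-bound: while one thread waits on the network, the others keep making progress.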
Once you understand the basic workflow of a Python crawler and compare it with the code, doesn't the crawler seem quite simple?