♚
Author: Ban Zai Liu Shang (半载流殇), Pythonista && Otaku, a surveying and mapping worker trying hard to switch careers desu
Homepage: zhihu.com/people/ban-zai-liu-shang
A crawler has three steps in total: initiating a request, parsing the data, and storing the data. That is enough to write the most basic crawler. Frameworks such as Scrapy integrate just about everything related to crawling, but newcomers may not find them easy to use, tutorials come with all kinds of pitfalls, and Scrapy itself is rather heavyweight. Therefore, I decided to hand-write a lightweight crawler framework, looter, which integrates two core features: debugging and crawler templates. With looter you can quickly write an efficient crawler. In addition, the project's function documentation is quite complete; if anything is unclear, you can read the source code yourself.
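To make the three steps concrete, here is a minimal sketch that does not use looter at all, only requests and lxml (with the cssselect package installed); the URL and selector are borrowed from the konachan example below and are for illustration only.

import json

import requests
from lxml import html

# 1. Initiate a request
res = requests.get('https://konachan.com/post')

# 2. Parse the data
tree = html.fromstring(res.text)
img_links = [a.get('href') for a in tree.cssselect('a.directlink')]

# 3. Store the data
with open('img_links.json', 'w') as f:
    json.dump(img_links, f)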
Installation
$ pip install looter
Only supports Python 3.6 and above.
Quick start
Let's start with a very simple image crawler: first, use the shell to fetch the target page
$ looter shell konachan.com/post
Then use 2 lines of code to save the images locally
>>> imgs = tree.cssselect('a.directlink')
>>> save_imgs(imgs)
Or do it in just 1 line:
>>> save_imgs(links(res, search='jpg'))
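Outside the shell, the same quick start can be written as a standalone script. This is only a sketch based on the snippets above: it assumes that the shell's tree is equivalent to lt.fetch(url) (as in the crawl template later in this article) and that save_imgs is also exposed as lt.save_imgs.

import looter as lt

# Fetch the page and parse it into an element tree
tree = lt.fetch('https://konachan.com/post')

# Grab the direct image links and save the images locally
imgs = tree.cssselect('a.directlink')
lt.save_imgs(imgs)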
Workflow
If you want to get a crawler up and running quickly, you can use the template provided by looter to generate one automatically
$ looter genspider <name> <tmpl> [--async]
In this command, tmpl is the template type, which is either a data template or an image template.
async is an optional flag that makes the generated crawler use asyncio as its core instead of a thread pool.
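For example, assuming the data template mentioned above, generating an asynchronous crawler named konachan (the name is just an illustration) would look like this:

$ looter genspider konachan data --async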
In the generated template, you can customize two variables: domain and tasklist.
What is tasklist? It is simply the list of URLs of all the pages you want to crawl.
Taking http://konachan.com as an example, you can use list comprehensions to create your own tasklist:
domain = 'https://konachan.com'
tasklist = [f'{domain}/post?page={i}' for i in range(1, 9777)]
Then you have to customize your crawl function, which is the core part of the crawler.
def crawl(url):
    tree = lt.fetch(url)
    items = tree.cssselect('ul li')
    for item in items:
        data = dict()
        # data[...] = item.cssselect(...)
        pprint(data)
In most cases, the content you want to grab is a list (that is, the ul or ol tags in HTML), and you can capture it in the items variable with a CSS selector.
Then you just need to iterate over items with a for loop, extract the data you want from each one, and store it in a dict.
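As an illustration, a filled-in crawl function for the konachan example might look like the sketch below; it assumes the generated template already imports looter as lt and pprint, and the a.directlink selector and its href/title attributes are assumptions for demonstration, not part of the template itself.

def crawl(url):
    tree = lt.fetch(url)
    items = tree.cssselect('ul li')  # the list you want to iterate over
    for item in items:
        data = dict()
        link = item.cssselect('a.directlink')  # hypothetical selector
        if link:
            data['url'] = link[0].get('href')
            data['title'] = link[0].get('title')
        pprint(data)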
However, before you finish writing this crawler, it is best to use the shell provided by looter to check whether your cssselect code is correct.
>>> items = tree.cssselect('ul li')
>>> item = items[0]
>>> item.cssselect(anything you want to crawl)  # Pay attention to whether the output is correct!
Once the debugging is done, your crawler is complete. See how simple that was :)
Of course, I have also compiled several crawler examples for reference.
Functions
Looter provides users with many useful functions.
view
Before crawling a page, you'd better check that it renders the way you expect
>>> view(url)
save_imgs
When you get a bunch of image links, you can use it to directly save them locally
>>> img_urls = [...]
>>> save_imgs(img_urls)
alexa_rank
You can get a website's reach and popularity indices; this function returns a tuple (url, reachrank, popularityrank)
>>> alexa_rank(url)
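Since the return value is a tuple, you can unpack it directly; a small usage sketch (the variable names and domain are just examples):

>>> url, reach_rank, popularity_rank = alexa_rank('konachan.com')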
links
Get all the links of the webpage
>>> links(res)                 # Get all links
>>> links(res, absolute=True)  # Get absolute links
>>> links(res, search='text')  # Find the specified links
Similarly, you can also use regular expressions to get matching links
>>> re_links(res, r'regex_pattern')
save_as_json
Save the results as a JSON file; sorting by key is supported
>>> total = [...]
>>> save_as_json(total, name='text', sort_by='key')
parse_robots
Used to fetch all the links listed in a website's robots.txt. This is quite useful when writing site-wide or recursive URL crawlers
>>> parse_robots(url)
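For example, a site-wide crawl could feed the links from robots.txt straight into the crawl function generated from the template; this is only a sketch and assumes that parse_robots returns an iterable of URLs as described above.

tasklist = parse_robots('https://konachan.com')
for url in tasklist:
    crawl(url)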
login
Some websites require you to log in before they can be crawled, hence the login function. Its essence is to open a session and send a POST request with data to the server. However, every website has its own login rules: finding the right postdata can take quite a bit of work, and in some cases you even have to construct params or headers as well. Fortunately, someone on GitHub has already collected simulated login recipes for major websites in the fuck-login project, which I admire. In short, it will test your packet-capturing skills. The following is a simulated login to the NetEase 126 mailbox (required parameters: postdata and params)
>>> params = {'df': 'mail126_letter', 'from': 'web', 'funcid': 'loginone', 'iframe': '1', 'language': '-1', 'passtype': '1', 'product': 'mail126', 'verifycookie': '-1', 'net': 'failed', 'style': '-1', 'race': '-2_-2_-2_db', 'uid': 'webscraping123@126.com', 'hid': '10010102'}
>>> postdata = {'username': your_username, 'savelogin': '1', 'url2': 'http://mail.126.com/errorpage/error126.htm', 'password': your_password}
>>> url = "https://mail.126.com/entry/cgi/ntesdoor?"
>>> res, ses = login(url, postdata, params=params)  # res is the page after the POST request, ses is the request session
>>> index_url = re.findall(r'href = "(.*?)"', res.text)[0]  # get the redirect homepage link from res
>>> index = ses.get(index_url)  # visit the redirect link with the ses session; print it to confirm the login succeeded