Ultra-lightweight crawler framework: looter

Author: Ban Zai Liu Shang, Pythonista && Otaku, a surveying and mapping worker trying to switch careers, desu

Homepage: zhihu.com/people/ban-zai-liu-shang

A crawler has three steps in total: issue a request, parse the data, store the data. That is enough to write the most basic crawler. Frameworks such as Scrapy integrate just about everything a crawler needs, but newcomers may not find them easy to use, may step into various pits while following tutorials, and Scrapy itself is a bit heavy. So I decided to hand-write a lightweight crawler framework, looter, which integrates the two core features of debugging and crawler templates. With looter you can quickly write an efficient crawler. In addition, the project's function documentation is quite complete; if anything is unclear, you can read the source code yourself.

Installation

$ pip install looter

Only supports Python 3.6 and above.

Quick start

Let's start with a very simple image crawler: first, fetch the page with the shell

$ looter shell konachan.com/post

Then grab the images and save them locally with 2 lines of code

>>> imgs = tree.cssselect('a.directlink')
>>> save_imgs(imgs)

Or just use 1 line :D

>>> save_imgs(links(res, search='jpg'))
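
The same quick start also works outside the shell. Below is a minimal sketch, assuming the shell helpers such as save_imgs are also exposed at module level (only fetch appears explicitly in the generated templates):

import looter as lt

tree = lt.fetch('https://konachan.com/post')   # fetch and parse the page, as in the generated template
imgs = tree.cssselect('a.directlink')          # same CSS selector as the shell session above
lt.save_imgs(imgs)                             # assumption: save_imgs is available on the looter module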

Workflow

If you want to launch a crawler quickly, you can use the templates provided by looter to generate one automatically

$ looter genspider <name> <tmpl> [--async]

In this command, tmpl is the template type: either the data template or the image template.

--async is an optional flag that makes the generated crawler use asyncio at its core instead of a thread pool.
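
For example (the spider name konachan here is just an illustration):

$ looter genspider konachan data           # data template, thread-pool core
$ looter genspider konachan image --async  # image template, asyncio core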

In the generated template, you can customize the two variables domain and tasklist.

What is a tasklist? It is simply the list of URLs of all the pages you want to crawl.

Taking http://konachan.com as an example, you can use list comprehensions to create your own tasklist:

domain = 'https://konachan.com'
tasklist = [f'{domain}/post?page={i}' for i in range(1, 9777)]

Then you have to customize your crawl function, which is the core part of the crawler.

import looter as lt
from pprint import pprint

def crawl(url):
    tree = lt.fetch(url)             # fetch and parse the page
    items = tree.cssselect('ul li')  # the list you want to grab
    for item in items:
        data = dict()
        # data[...] = item.cssselect(...)
        pprint(data)

In most cases, the content you want to grab is a list (that is, ul or ol tags in the HTML), and you can capture it into the items variable with a CSS selector.

Then you just iterate over them with a for loop, extract the data you want, and store it in a dict, as in the sketch below.
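
For instance, a filled-in version might look like this; the selectors and field names are hypothetical and must be adapted to the actual markup of the page you are crawling.

import looter as lt
from pprint import pprint

def crawl(url):
    tree = lt.fetch(url)
    items = tree.cssselect('ul li')
    for item in items:
        # hypothetical selectors -- replace them with ones that match your page
        data = dict()
        data['title'] = item.cssselect('a')[0].text
        data['href'] = item.cssselect('a')[0].get('href')
        pprint(data)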

However, before you finish writing this crawler, it is best to use the shell provided by looter to check whether your cssselect code is correct.

>>> items = tree.cssselect('ul li')
>>> item = items[0]
>>> item.cssselect('anything you want to crawl')
# Pay attention to whether the output of the code is correct!
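
For instance, reusing the konachan example from the quick start, a debugging session might look like this (the selector is only an illustration):

>>> items = tree.cssselect('ul li')
>>> item = items[0]
>>> item.cssselect('a.directlink')                 # does it return the element you expect?
>>> item.cssselect('a.directlink')[0].get('href')  # and the attribute you want?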

Once it passes debugging, your crawler is done. See, isn't that simple? :)

Of course, I have also compiled several crawler examples for reference.

Functions

Looter provides users with many useful functions.

view

Before crawling a page, you'd better confirm that it renders the way you expect

>>> view(url)

save_imgs

When you have a bunch of image links, you can use this to save them locally in one go

>>> img_urls = [...]
>>> save_imgs(img_urls)

alexa_rank

Gets a website's reach rank and popularity rank; this function returns a tuple (url, reachrank, popularityrank)

>>> alexa_rank(url)
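
Since the return value is a tuple, you can unpack it directly (the domain here is just an example):

>>> url, reach_rank, popularity_rank = alexa_rank('konachan.com')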

links

Get all the links of the webpage

>>> links(res) # Get all links
>>> links(res, absolute=True) # Get absolute links
>>> links(res, search='text') # Find the specified link

Similarly, you can also use regular expressions to get matching links

>>> re_links(res, r'regex_pattern')
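
For example, a hypothetical pattern that keeps only links ending in .jpg:

>>> re_links(res, r'.*\.jpg$')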

save_as_json

Saves the results as a JSON file, with support for sorting by key

>>> total = [...]
>>> save_as_json(total, name='text', sort_by='key')
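
A hypothetical example with data like that collected by the crawl function above (the field names and values are purely illustrative):

>>> total = [{'title': 'b', 'rank': 2}, {'title': 'a', 'rank': 1}]
>>> save_as_json(total, name='result', sort_by='rank')  # assumption: writes result.json sorted by the 'rank' key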

parse_robots

Used to crawl all the links listed in a website's robots.txt. This is quite effective when building site-wide or recursive URL crawlers

>>> parse_robots(url)
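
One way to use it, sketched under the assumption that parse_robots returns a list of URLs, is to feed the result straight into a template's tasklist:

domain = 'https://konachan.com'
tasklist = parse_robots(domain)  # assumption: returns the links found via robots.txt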

login

Some websites require you to log in before they can be crawled, hence the login function. In essence it establishes a session and sends a POST request with data to the server. However, every website's login rules are different: finding the right postdata takes quite a bit of work, and worse, some sites also require you to construct param or header parameters. Fortunately, someone on GitHub has already collected simulated-login code for major websites in the fuck-login project, which I sincerely admire. In short, it will test your packet-capturing skills. The following is a simulated login to the NetEase 126 mailbox (required parameters: postdata and params)

>>> params = {'df': 'mail126_letter', 'from': 'web', 'funcid': 'loginone', 'iframe': '1', 'language': '-1', 'passtype': '1', 'product': 'mail126',
 'verifycookie': '-1', 'net': 'failed', 'style': '-1', 'race': '-2_-2_-2_db', 'uid': 'webscraping123@126.com', 'hid': '10010102'}
>>> postdata = {'username': 'your username', 'savelogin': '1', 'url2': 'http://mail.126.com/errorpage/error126.htm', 'password': 'your password'}
>>> url = "https://mail.126.com/entry/cgi/ntesdoor?"
>>> res, ses = login(url, postdata, params=params) # res is the page after the post request, ses is the request session
>>> index_url = re.findall(r'href = "(.*?)"', res.text)[0] # Get the redirect homepage link in res
>>> index = ses.get(index_url) # Use the ses session to access the redirect link. If you want to confirm the success, just print it

The Python web crawler course has 9 lectures, with courseware and source code provided for every lecture. It is taught by Luo Pan, author of "Learning Python Web Crawlers from Scratch", a well-known Jianshu blogger and Python web crawler expert.

Lecture 1: Python Syntax Basics from Scratch

  1. Environment installation
  2. Variables and strings
  3. Flow control
  4. Data structures
  5. File operations

Lecture 2: Regular Expression Crawlers

  1. Network connections
  2. How crawlers work
  3. Chrome browser installation and use
  4. Using the Requests library
  5. Regular expressions
  6. CSV file storage

Lecture 3: The lxml Library and XPath Syntax

  1. Excel storage
  2. The lxml library
  3. XPath syntax

Lecture 4: API Crawlers

  1. API concepts
  2. Calling the Baidu Map API
  3. Parsing JSON data
  4. Image crawlers

Lecture 5: Asynchronous Loading

  1. MySQL database installation
  2. Basic MySQL usage
  3. Operating the database from Python
  4. Asynchronous loading
  5. Reverse engineering
  6. Comprehensive case study

Lecture 6: Form Interaction and Simulated Login

  1. POST requests
  2. Reverse engineering
  3. Submitting cookies
  4. Comprehensive case study

Lecture 7: Simulating a Browser with Selenium

  1. Selenium
  2. PhantomJS
  3. Handling asynchronous loading
  4. Handling web page interactions
  5. Comprehensive case study

Lecture 8: Getting Started with Scrapy

  1. Scrapy installation
  2. Creating a project
  3. Introduction to each component
  4. Comprehensive case study

Lecture 9: Advanced Scrapy

  1. Cross-page crawlers
  2. Storing to a database

Reference: https://cloud.tencent.com/developer/article/1180230