Web crawler framework Scrapy explained in detail: the Request class

Author: zarten, a front-line Internet engineer.

Address: zhihu.com/people/zarten

Introduction

The Request class represents an HTTP request and is one of the most important classes in a crawler. Requests are usually created in a Spider and executed by the Downloader. Request also has a subclass, FormRequest, which is used for POST requests.

Common usage in Spider:

yield scrapy.Request(url='http://zarten.com')

The class attributes and methods are:

url

method

headers

body

meta

copy()

replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
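
As a quick illustration of copy() and replace(), here is a minimal sketch (the URL and meta values are made-up placeholders):

import scrapy

request = scrapy.Request(url='http://www.example.com', meta={'name': 'Zarten'})

same_request = request.copy()                  # a new Request with the same url, headers and meta
post_request = request.replace(method='POST')  # a new Request that differs only in the given attributes

print(same_request.meta['name'])   # 'Zarten'
print(post_request.method)         # 'POST'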

Request

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

Parameter Description:

  • url: the URL to request
  • callback: the callback that receives the response to this request; if not specified, parse() is used by default
  • method: the HTTP method, GET by default; it usually does not need to be specified. For a POST request, simply use FormRequest
  • headers: the request headers; these are usually set in settings, but can also be set in middlewares
  • body: str, the request body; it usually does not need to be set (both GET and POST can in principle pass parameters through the body, but this is rarely done)
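
If a raw body is ever needed, a sketch like the following sends a JSON payload with a plain Request inside a spider (httpbin.org is used here purely as a test endpoint):

import json
import scrapy

def start_requests(self):
    payload = {'name': 'Zarten', 'age': 27}
    # POST a raw JSON body; the Content-Type header has to be set by hand
    yield scrapy.Request(
        url='http://www.httpbin.org/post',
        method='POST',
        body=json.dumps(payload),
        headers={'Content-Type': 'application/json'},
        callback=self.parse,
    )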
  • cookies: dict or list, the cookies for the request. Dict form (name/value pairs):
cookies = {'name1': 'value1', 'name2': 'value2'}

List form:

cookies = [
    {'name': 'Zarten', 'value': 'my name is Zarten', 'domain': 'example.com', 'path': '/currency'}
]
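
Either form is then passed through the cookies argument (example.com is a placeholder):

yield scrapy.Request(url='http://www.example.com', cookies=cookies)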
  • encoding: the encoding of the request, 'utf-8' by default
  • priority: int, the priority of this request; the larger the number, the higher the priority. It can be negative and defaults to 0
  • dont_filter: False by default. When set to True, the request is not filtered by the dupefilter (it is not added to the deduplication queue), so the same request can be issued multiple times
  • errback: the callback invoked when an error is raised, including 404s, timeouts, DNS errors and so on; its first argument is a Twisted Failure instance:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"

    start_urls = [
        "http://www.httpbin.org/", # HTTP 200 expected
        "http://www.httpbin.org/status/404", # Not found error
        "http://www.httpbin.org/status/500", # server issue
        "http://www.httpbin.org:12345/", # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/", # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.info(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.info('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.info('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.info('TimeoutError on %s', request.url)
  • flags: list, rarely used; flags attached to the request, mainly useful for logging
  • meta: user-defined data passed along from the Request to the Response; it can also be read and modified in middlewares
yield scrapy.Request(url='http://zarten.com', meta={'name': 'Zarten'})

In Response:

my_name = response.meta['name']
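
Put together inside a spider, the two sides look like this (a sketch; zarten.com is the author's placeholder domain):

def start_requests(self):
    yield scrapy.Request(url='http://zarten.com', meta={'name': 'Zarten'}, callback=self.parse)

def parse(self, response):
    my_name = response.meta['name']
    print('name passed through meta:', my_name)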

Scrapy also ships with a number of special built-in meta keys, which are very useful. They are as follows:

  • proxy: sets a proxy for the request; usually assigned in middlewares.

Either an HTTP or an HTTPS proxy can be used:

request.meta['proxy'] = 'https://ip:port'
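
A minimal sketch of assigning the proxy in a downloader middleware (the class name and proxy address are made up, and the middleware still has to be enabled in DOWNLOADER_MIDDLEWARES):

class ProxyMiddleware:
    # hypothetical downloader middleware that attaches a proxy to every request
    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://127.0.0.1:8888'
        return None  # let the request continue through the remaining middlewares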
  • download_timeout: the timeout for the request in seconds; usually set globally via DOWNLOAD_TIMEOUT in settings, 180 seconds (3 minutes) by default
  • max_retry_times: the maximum number of retries (not counting the first download attempt), 2 by default; usually set globally via RETRY_TIMES in settings
  • dont_redirect: when set to True, the request will not be redirected
  • dont_retry: when set to True, requests that fail with HTTP errors or timeouts will not be retried (a combined sketch of these four keys follows below)
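
The sketch below sets the four keys above on a single request; the values are arbitrary and example.com is a placeholder:

yield scrapy.Request(
    url='http://www.example.com',
    meta={
        'download_timeout': 10,   # per-request timeout in seconds
        'max_retry_times': 5,     # retries after the first download attempt
        'dont_redirect': True,    # do not follow 3xx redirects
        # 'dont_retry': True,     # uncomment to switch retrying off entirely
    },
)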
  • handle_httpstatus_list: HTTP status codes in the 200-299 range count as successful responses; anything outside that range is a failure. By default Scrapy filters such responses out and never passes them to your callbacks, but you can list the error codes you want to handle yourself:
yield scrapy.Request(url='https://httpbin.org/get/zarten', meta={'handle_httpstatus_list': [404]})

The 404 response can then be handled in the parse function:

def parse(self, response):
    print('The return information is:', response.text)
  • handle_httpstatus_all: when set to True, the Response is passed to your callbacks regardless of its status code
  • dont_merge_cookies: Scrapy automatically stores the cookies it receives and sends them with subsequent requests. When you specify custom cookies and do not want them merged with the stored ones, set this key to True
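
For example (the URL and cookie values here are placeholders):

yield scrapy.Request(
    url='http://www.example.com',
    cookies={'name1': 'value1'},
    meta={'dont_merge_cookies': True},  # send only the cookies given here
)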
  • cookiejar: lets a single spider keep track of several separate cookie sessions. It is not sticky, so it has to be passed along with every request:
def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/page/1',
        'http://quotes.toscrape.com/page/3',
        'http://quotes.toscrape.com/page/5',
    ]
    for i, url in enumerate(urls):
        yield scrapy.Request(url=url, meta={'cookiejar': i})

def parse(self, response):
    next_page_url = response.css("li.next > a::attr(href)").extract_first()
    if next_page_url is not None:
        yield scrapy.Request(response.urljoin(next_page_url),
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_next)

def parse_next(self, response):
    print('cookiejar:', response.meta['cookiejar'])
  • dont_cache: when set to True, the response will not be cached
  • redirect_urls: filled in by the redirect middleware; it records the URLs the request has been redirected through
  • bindaddress: the outgoing IP address to bind to when performing the request
  • dont_obey_robotstxt: when set to True, the robots.txt protocol is not obeyed for this request; this is usually configured globally in settings (ROBOTSTXT_OBEY)
  • download_maxsize: the maximum response size (in bytes) the downloader will accept; usually set globally via DOWNLOAD_MAXSIZE in settings, 1073741824 (1024 MB = 1 GB) by default. Set it to 0 to remove the size limit (a combined sketch of several of these keys follows below)
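
The sketch below sets dont_cache, dont_obey_robotstxt, download_maxsize and bindaddress on one request; all values are illustrative and example.com is a placeholder:

yield scrapy.Request(
    url='http://www.example.com',
    meta={
        'dont_cache': True,              # do not store this response in the HTTP cache
        'dont_obey_robotstxt': True,     # ignore robots.txt for this request only
        'download_maxsize': 10485760,    # cap this response at 10 MB; 0 removes the limit
        'bindaddress': '192.168.1.100',  # outgoing IP to bind to (made-up address)
    },
)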
  • download_latency: read-only; the time in seconds it took to fetch the response:
def start_requests(self):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    yield scrapy.Request(url='https://www.amazon.com', headers=headers)

def parse(self, response):
    print('The response time is:', response.meta['download_latency'])
  • download_fail_on_dataloss: rarely used; it controls whether a response with broken or partial content is treated as a failure (see the DOWNLOAD_FAIL_ON_DATALOSS setting in the Scrapy documentation)
  • referrer_policy: sets the Referrer Policy for this request
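
For example, to suppress the Referer header for a single request (assuming the same policy names as the REFERRER_POLICY setting are accepted, e.g. 'no-referrer'; example.com is a placeholder):

yield scrapy.Request(url='http://www.example.com', meta={'referrer_policy': 'no-referrer'})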

FormRequest

The FormRequest class is a subclass of Request that is used for POST requests.

It adds one new parameter, formdata; all other parameters are the same as for Request and are described above.

The general usage is:

yield scrapy.FormRequest(url="http://www.example.com/post/action",
                         formdata={'name': 'Zarten', 'age': '27'},
                         callback=self.after_post)
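
The after_post callback then handles the server's response like any other callback, for example:

def after_post(self, response):
    # inspect whatever the POST endpoint returned
    print('The return information is:', response.text)
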
Reference: https://cloud.tencent.com/developer/article/1180232 (Web crawler framework Scrapy explained in detail: Request, Tencent Cloud Community)