Author: zarten, a front-line Internet engineer.
Profile: zhihu.com/people/zarten
Introduction
The Request class represents an HTTP request and is one of the most important classes for a crawler. Requests are usually created in a Spider and executed by the Downloader. Request also has a subclass, FormRequest, which is used for POST requests.

Typical usage in a Spider:
yield scrapy.Request(url='http://zarten.com')
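For context, here is a minimal sketch of the spider that yield would live in (the spider name is made up for illustration; zarten.com is just the placeholder domain from the example above):

import scrapy

class ZartenSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only
    name = 'zarten'

    def start_requests(self):
        # Requests yielded here are scheduled and executed by the Downloader
        yield scrapy.Request(url='http://zarten.com', callback=self.parse)

    def parse(self, response):
        # The downloaded Response is delivered to the callback
        self.logger.info('Got response from %s', response.url)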
The class attributes and methods are:
url, method, headers, body, meta, copy(), replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
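Of these, copy() and replace() are worth a quick illustration: copy() duplicates a request, and replace() returns a new request with the listed attributes swapped out. A small sketch with placeholder URLs:

import scrapy

request = scrapy.Request(url='http://example.com', meta={'name': 'Zarten'})

# copy() duplicates the request, including its meta dict
request2 = request.copy()

# replace() returns a new Request with the given fields changed;
# unspecified fields keep their original values
request3 = request.replace(url='http://example.com/other', dont_filter=True)

print(request3.url)           # http://example.com/other
print(request3.meta['name'])  # Zarten (carried over from the original)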
Request
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Parameter description:

url: the address to request
callback: the function that handles the downloaded response; parse() is used if none is given
method: the HTTP method, 'GET' by default
headers: the request headers, as a dict
body: the request body
cookies: the cookies to send, as a dict or a list of dicts
meta: a dict of extra data attached to the request, readable from the response (see below)
encoding: the encoding used for the URL and body, 'utf-8' by default
priority: the scheduling priority; requests with higher values are scheduled earlier (default 0)
dont_filter: if True, the duplicate filter does not filter this request (default False)
errback: the function called when an error occurs while processing the request (see the example below)
flags: flags attached to the request, mainly used for logging

The cookies parameter accepts two forms.

Dict form:

cookies = {'name1': 'value1', 'name2': 'value2'}

List form:
cookies = [
    {'name': 'Zarten',
     'value': 'my name is Zarten',
     'domain': 'example.com',
     'path': '/currency'}
]
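Putting a few of these parameters together, here is a hedged sketch of one request setting cookies, priority, and dont_filter at once (the spider, URL, and values are all made up for illustration):

import scrapy

class CurrencySpider(scrapy.Spider):
    # Hypothetical spider, used only to show several parameters together
    name = 'currency'

    def start_requests(self):
        yield scrapy.Request(
            url='http://example.com/currency',
            cookies={'session': 'abc123'},  # dict form: cookie name -> value
            priority=10,                    # higher priority is scheduled earlier
            dont_filter=True,               # bypass the duplicate filter
            callback=self.parse_currency,
            errback=self.on_error,
        )

    def parse_currency(self, response):
        self.logger.info('Got %s', response.url)

    def on_error(self, failure):
        self.logger.info(repr(failure))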
errback: the errback function is called when a request fails, for example on a non-200 response, a DNS lookup error, or a timeout. A complete example:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    # start_urls = [
    #     'http://quotes.toscrape.com/',
    # ]
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.info(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.info('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.info('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.info('TimeoutError on %s', request.url)
meta: passes custom data from a request to its response:

yield scrapy.Request(url='http://zarten.com', meta={'name': 'Zarten'})

In the response callback:

my_name = response.meta['name']
Besides custom data, Scrapy has some built-in special meta keys that are also very useful. The ones covered here are proxy, handle_httpstatus_list, cookiejar, and download_latency.

proxy: sets an HTTP or HTTPS proxy for the request:

request.meta['proxy'] = 'https://ip:port'
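A minimal sketch of using the proxy key inside a spider (the proxy address is a placeholder; httpbin.org/ip simply echoes the caller's IP, which makes it handy for checking that the proxy took effect):

import scrapy

class ProxySpider(scrapy.Spider):
    # Hypothetical spider showing the 'proxy' meta key
    name = 'proxy_demo'

    def start_requests(self):
        yield scrapy.Request(
            url='https://httpbin.org/ip',
            # Placeholder address; substitute a real proxy here
            meta={'proxy': 'http://127.0.0.1:8888'},
            callback=self.parse,
        )

    def parse(self, response):
        # httpbin.org/ip echoes the IP the request came from
        self.logger.info(response.text)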
handle_httpstatus_list: by default the HttpError middleware filters out responses whose status codes fall outside the 200 range, so the spider never sees them; this key lists extra status codes the spider wants to handle itself:

yield scrapy.Request(url='https://httpbin.org/get/zarten',
                     meta={'handle_httpstatus_list': [404]})

The 404 response can then be handled in the parse function:

def parse(self, response):
    print('The returned information is:', response.text)
cookiejar: lets a single spider keep several independent cookie sessions. Tag each request chain with its own cookiejar value and pass it along in follow-up requests:

def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/page/1',
        'http://quotes.toscrape.com/page/3',
        'http://quotes.toscrape.com/page/5',
    ]
    for i, url in enumerate(urls):
        # Give each start URL its own cookie session
        yield scrapy.Request(url=url, meta={'cookiejar': i})

def parse(self, response):
    next_page_url = response.css("li.next > a::attr(href)").extract_first()
    if next_page_url is not None:
        # The cookiejar key is not sticky: pass it along explicitly
        yield scrapy.Request(response.urljoin(next_page_url),
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_next)

def parse_next(self, response):
    print('cookiejar:', response.meta['cookiejar'])

Note that the cookiejar key is not sticky: every follow-up request has to set it again, as parse() does above.
download_latency: the time spent downloading the response (in seconds), available from the response's meta once the download completes:

def start_requests(self):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    yield scrapy.Request(url='https://www.amazon.com', headers=headers)

def parse(self, response):
    print('The response time is:', response.meta['download_latency'])
FormRequest
The FormRequest class is a subclass of Request, used for POST requests.

It adds one new parameter, formdata; the other parameters are the same as for Request (see the description above).
The general usage is:
yield scrapy.FormRequest(url="http://www.example.com/post/action",
                         formdata={'name': 'Zarten', 'age': '27'},
                         callback=self.after_post)
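FormRequest also has a from_response() class method that builds the POST request from a form found in a downloaded page, pre-merging hidden fields such as CSRF tokens, which is handy for login pages. A sketch, assuming a login form with username and password fields (the spider name, credentials, and callback are placeholders):

import scrapy

class LoginSpider(scrapy.Spider):
    # Hypothetical login spider illustrating FormRequest.from_response
    name = 'login_demo'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # from_response() reads the <form> in the page and merges in formdata,
        # so hidden fields such as CSRF tokens are carried over automatically
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'Zarten', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Logged in, now at %s', response.url)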