This time I will bring you an example of using a Python crawler to grab the jokes on Qiushibaike (qiushibaike.com, sometimes translated as the "Embarrassments Encyclopedia").
You have probably heard of Qiushibaike: its users post plenty of funny jokes, and this time we will try to grab them with a crawler.
Goals of this article:
1. Grab the popular stories on Qiushibaike.
2. Filter out stories that contain pictures.
3. On each press of Enter, display one story's publication time, publisher, content, and number of likes.
Qiushibaike does not require login, so there is no need to handle cookies. Also, some stories come with pictures attached; since pictures are not easy to display on a console, we will try to filter out the stories that contain them.
OK, now let's try to grab the popular stories on Qiushibaike. Each time we press Enter, one story will be displayed.
1. Determine the URL and grab the page code
First, we determine that the page URL is http://www.qiushibaike.com/hot/page/1, where the trailing 1 is the page number; passing in different values fetches the stories of different pages.
To start, we build the most basic page-fetching code, print the page source, and see whether it succeeds:
# -*- coding:utf-8 -*-
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Run the program, and, oh no, it reports an error:
line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''
This is most likely a problem with request header validation: the server rejects requests that do not carry a browser-like User-Agent. Let's add one and try again. The modified code is as follows:
# -*- coding:utf-8 -*-
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
This time it runs normally and prints out the HTML of the first page. You can run the code yourself to try it; the output is too long to reproduce here.
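A side note: all the code in this article targets Python 2 (urllib2 and the print statement). If you are on Python 3, a minimal sketch of the same header-carrying request, not part of the original tutorial, would use urllib.request instead:

# Python 3 sketch of the same request (assumption: Python 3, where
# urllib2 was split into urllib.request and urllib.error)
import urllib.request
import urllib.error

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

try:
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)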
2. Extract all paragraphs of a page
OK, now that we have the HTML, let's analyze how to extract all the stories on a page.
First, press F12 in the browser to open the developer tools and inspect the page elements.
We can see that each story is contained in a <div class="article block untagged mb15" id="...">...</div> element.
We want to get the publisher, the publication time, the story content, and the number of likes. Note, however, that some stories carry pictures. Displaying pictures on a console is unrealistic, so we simply drop the stories that have pictures and keep only the text-only ones.
So we match with a regular expression, using re.findall, which returns every match of a pattern; for details of the method, see the earlier introduction to regular expressions.
Our matching pattern is written as follows. Add import re to the imports, then add this code to the previous example:
content = response.read().decode('utf-8')
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                     'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number"> (.*?)</i>', re.S)
items = re.findall(pattern, content)
for item in items:
    print item[0], item[1], item[2], item[3], item[4]
A few notes on this regular expression:
1) .*? is a common idiom: . matches any character, * repeats it any number of times, and the trailing ? switches to non-greedy mode, meaning the match is kept as short as possible. We will use .*? a lot from here on.
2) (.*?) denotes a capturing group. This pattern contains five groups, so when we iterate over items, item[0] is the text matched by the first (.*?), item[1] by the second, and so on.
3) The re.S flag makes the dot match any character, including the newline character, which . does not match by default.
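To make these two points concrete, here is a small toy example (my own illustration, not part of the crawler code) showing non-greedy groups and re.S on a string with newlines:

# -*- coding:utf-8 -*-
# Toy example: non-greedy capturing groups and the re.S flag
import re

html = '<div class="content">\nfirst joke\n</div><div class="content">\nsecond joke\n</div>'

# The trailing ? makes each group match as short as possible, so we get
# two separate matches instead of one huge greedy match spanning both divs.
pattern = re.compile('content">(.*?)</div>', re.S)
print re.findall(pattern, html)   # ['\nfirst joke\n', '\nsecond joke\n']

# Without re.S the dot cannot match the newline right after content">,
# so nothing matches at all.
print re.findall('content">(.*?)</div>', html)   # []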
In this way we obtain the publisher, the content, the publication time, any attached picture markup, and the number of likes.
Note again that stories with pictures are cumbersome to output directly, so we keep only the stories without pictures.
That means we need a way to filter out the stories that contain pictures.
Looking at the page source, stories with a picture contain markup like the following, while stories without one do not. The fourth group of our regex (item[3]) captures exactly this span, so if a story has no picture, item[3] contains no img tag.
<div class="thumb"> <a href="/article/112061287?list=hot&s=4794990" rel="external nofollow" target="_blank"> <img src="http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg" alt="but they are still optimistic"> </a> </div>
So we only need to check whether item[3] contains an img tag.
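As a quick sanity check (a standalone snippet of mine, using the thumb markup above as a stand-in for item[3]), re.search returns a match object, which is truthy, exactly when "img" occurs in the string:

# Quick check: re.search("img", s) is truthy iff an img tag appears in s
import re

with_img = '<div class="thumb"><a href="..."><img src="..." alt=""></a></div>'
without_img = ''   # e.g. empty or whitespace for a text-only story

print bool(re.search("img", with_img))     # True  -> story has a picture, skip it
print bool(re.search("img", without_img))  # False -> text-only story, keep it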
OK, let's change the for loop in the code above to the following:
for item in items:
    haveImg = re.search("img", item[3])
    if not haveImg:
        print item[0], item[1], item[2], item[4]
Now, the overall code is as follows:
# -*- coding:utf-8 -*-
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number"> (.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[3])
        if not haveImg:
            print item[0], item[1], item[2], item[4]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Run it and see the effect: the stories with pictures have been filtered out. Pretty neat, isn't it?
3. Improve interaction and design an object-oriented model
OK, the core part is done; what remains is polishing the rough edges. What we want to achieve is this:
press Enter to read one story, displaying its publisher, publication time, content, and number of likes.
In addition, we will restructure the code in an object-oriented way, introducing a class and methods to encapsulate everything. The final code is as follows:
# -*- coding:utf-8 -*-
__author__ = 'CQC'

import urllib2
import re

# Qiushibaike crawler
class QSBK:

    # Initialization method: define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # Stores the stories; each element is one page's worth of stories
        self.stories = []
        # Flag controlling whether the program keeps running
        self.enable = False

    # Pass in a page index and fetch that page's code
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Use urlopen to fetch the page code
            response = urllib2.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike, reason:", e.reason
                return None

    # Pass in a page index and return a list of that page's stories without pictures
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page failed to load...."
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                             'content">(.*?)<!--(.*?)-->.*?</div>(.*?)<div class="stats.*?class="number"> (.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Holds this page's stories
        pageStories = []
        # Walk the matches found by the regular expression
        for item in items:
            # Does it contain a picture?
            haveImg = re.search("img", item[3])
            # If there is no picture, add it to the list
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR, "\n", item[1])
                # item[0] is the publisher, item[1] the content, item[2] the publication time, item[4] the number of likes
                pageStories.append([item[0].strip(), text.strip(), item[2].strip(), item[4].strip()])
        return pageStories

    # Load and extract a page's content and append it to the list
    def loadPage(self):
        # If fewer than two unread pages remain, load a new page
        if self.enable == True:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store this page's stories in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # Bump the page index so the next load reads the next page
                    self.pageIndex += 1

    # Print one story each time the user hits Enter
    def getOneStory(self, pageStories, page):
        # Walk one page's stories
        for story in pageStories:
            # Wait for user input
            input = raw_input()
            # On every Enter, check whether a new page needs loading
            self.loadPage()
            # If the user enters Q, the program ends
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher:%s\tPublished time:%s\tLikes:%s\n%s" % (page, story[0], story[2], story[3], story[1])

    # Entry point
    def start(self):
        print u"Reading Qiushibaike; press Enter for a new story, Q to quit"
        # Set the flag so the program can run
        self.enable = True
        # Load the first page of content
        self.loadPage()
        # Local variable tracking how many pages have been read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of stories from the global list
                pageStories = self.stories[0]
                # Increment the count of pages read
                nowPage += 1
                # Delete the first element of the global list, since it has been taken out
                del self.stories[0]
                # Output this page's stories
                self.getOneStory(pageStories, nowPage)

spider = QSBK()
spider.start()
OK, let's test it. Press Enter and it outputs one story, including the publisher, publication time, content, and number of likes. Doesn't it feel great?
That wraps up our first hands-on crawler project. I hope you will keep following along.