Python crawler crawling embarrassment encyclopedia paragraph example sharing

This article presents a worked example of a Python crawler that scrapes jokes ("stories") from Qiushibaike, the Embarrassment Encyclopedia; readers who need such an example can refer to it.

This time we will build a Python crawler for the short stories on Qiushibaike.

First of all, everyone has heard of Qiushibaike, right? It is full of funny jokes posted by its users, and this time we will try to grab them with a crawler.

The goal of this article

1. Grab the popular stories from Qiushibaike;
2. Filter out the stories that contain pictures;
3. Each time Enter is pressed, display one story's publication time, publisher, content, and number of likes.

Qiushibaike does not require login, so there is no need to handle cookies. Also, some of its stories come with pictures, which are not easy to display in a console, so we will try to filter out the stories with pictures and keep only the text-only ones.

Okay, now let's try to grab some popular stories from Qiushibaike. Each time we press Enter, one story will be displayed.

1. Determine the URL and grab the page code

First, note that the URL of the page is http://www.qiushibaike.com/hot/page/1, where the trailing 1 is the page number; we can pass in different values to get the story content of different pages.

We start with the most basic page-fetching code and print the page source to see whether it succeeds:

# -*- coding:utf-8 -*-
# Python 2 code: urllib2 and the print statement do not exist in Python 3
import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason

Running the program, however, immediately raises an error:

line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

This is most likely because the server checks the request headers and rejects bare requests. Let's add a User-Agent header and try again. Modify the code as follows:

# -*- coding:utf-8 -*-
import urllib
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason

This time the program runs normally and prints the HTML of the first page. You can run the code yourself to try it; the output is too long to paste here.
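The code above targets Python 2, where urllib2 still exists. For readers on Python 3, here is a minimal sketch of the same request; urllib2 was split into urllib.request and urllib.error, and the URL is kept as in the article even though the site may no longer serve this markup:

```python
import urllib.request
import urllib.error

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    # Same request as the Python 2 version, using the merged urllib.request API
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request, timeout=10)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    # URLError covers both HTTP errors (which carry .code) and network failures
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
```

The structure mirrors the Python 2 script one-to-one; only the module names and the exception syntax change.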

2. Extract all paragraphs of a page

Okay, now that we have the HTML, let's work out how to extract all the stories on a page.

First, inspect the page elements by pressing F12 in the browser. The screenshot is as follows:

We can see that each story lives inside a <div class="article block untagged mb15" id="…">…</div> element.

We want the publisher, the publication time, the story text, and the number of likes. Note, though, that some stories contain pictures. Displaying pictures in a console is unrealistic, so we simply drop the stories that have pictures and keep only the text-only ones.

To match them we write a regular expression and use re.findall, which returns all non-overlapping matches. For the details of the method, see the earlier introduction to regular expressions.

Our matching pattern is written as follows; add this code (and an import re) to the previous script:

content = response.read().decode('utf-8')
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                     'content">(.*?)<!--(.*?)-->.*?</div>(.*?)' +
                     '<div class="stats.*?class="number">(.*?)</i>', re.S)
items = re.findall(pattern, content)
for item in items:
    print item[0], item[1], item[2], item[3], item[4]

Let me briefly explain the regular expression:

1) .*? is a fixed pairing: . matches any character, * means any number of repetitions, and the trailing ? switches to non-greedy mode, so the match is kept as short as possible. We will use .*? a lot from now on.

2) (.*?) denotes a capture group. This expression contains five groups, so when we iterate over the results, item[0] is the text captured by the first (.*?), item[1] by the second, and so on.

3) The re.S flag makes the dot match any character including the newline; without it, . stops at line breaks.
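These three points can be checked with a tiny standalone example (Python 3; the HTML snippet is made up to resemble the article's markup, it is not the site's real page):

```python
import re

# Two divs separated by a newline, shaped like the article's markup
html = '<div class="author">Tom</div>\n<div class="content">First joke</div>'

# Greedy .* swallows as much as possible; non-greedy .*? stops as early as it can
greedy = re.search('<div.*>', html, re.S).group()
lazy = re.search('<div.*?>', html, re.S).group()
print(lazy)  # <div class="author">

# (.*?) captures a group; re.S lets '.' also match the newline between the divs
pattern = re.compile('author">(.*?)</div>.*?content">(.*?)</div>', re.S)
m = re.search(pattern, html)
print(m.group(1), m.group(2))  # Tom First joke
```

Without re.S, the second pattern would fail to match, because the `.*?` between the two divs would be unable to cross the newline.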

In this way we obtain the publisher, the publication time, the story text, any attached picture markup, and the number of likes.

Note again that stories with pictures are cumbersome to print directly, so here we keep only the picture-free stories. That means we need to filter out the stories with pictures.

Stories with pictures contain markup similar to the following, while picture-free stories do not. The fourth group, item[3], captures exactly this part, so for a picture-free story item[3] is empty.

<div class="thumb">
<a href="/article/112061287?list=hot&s=4794990" rel="external nofollow" target="_blank">
<img src="http://pic.qiushibaike.com/system/pictures/11206/112061287/medium/app112061287.jpg" alt="but they are still optimistic">
</a>
</div>

So we only need to check whether item[3] contains an img tag. Let's change the for loop in the code above to the following:

for item in items:
    haveImg = re.search("img", item[3])
    if not haveImg:
        print item[0], item[1], item[2], item[4]
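The filter can be sanity-checked on its own with two made-up values for item[3] (Python 3; the strings are invented for illustration):

```python
import re

# One captured fragment with the thumb/img markup, one empty (picture-free story)
with_img = '<div class="thumb"><a href="/article/1"><img src="pic.jpg"></a></div>'
without_img = ''

kept = []
for extra in (with_img, without_img):
    haveImg = re.search("img", extra)  # same check the crawler uses
    if not haveImg:
        kept.append(extra)

print(len(kept))  # 1 -- only the picture-free story survives
```

A plain substring search would do the same job here, but re.search keeps the style consistent with the rest of the crawler.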

Now, the overall code is as follows:

# -*- coding:utf-8 -*-
import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                         'content">(.*?)<!--(.*?)-->.*?</div>(.*?)' +
                         '<div class="stats.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[3])
        if not haveImg:
            print item[0], item[1], item[2], item[4]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason

Run it and see the effect:

Good, the stories with pictures have been removed. Pretty neat, isn't it?

3. Improve interaction and design an object-oriented model

Okay, the most important part is done; what remains is polishing the details. What we want to achieve is: each press of Enter reads one story and displays its publisher, publication time, content, and number of likes.

In addition, we will organize the code in an object-oriented way, introducing a class and methods, to optimize and encapsulate it. The final code is as follows:

# -*- coding:utf-8 -*-
__author__ = 'CQC'

import urllib
import urllib2
import re
import thread
import time

# Qiushibaike (Embarrassment Encyclopedia) crawler
class QSBK:

    # Initialization method; define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # List of stories; each element holds the stories of one page
        self.stories = []
        # Flag controlling whether the program keeps running
        self.enable = False

    # Pass in a page index and return that page's source code
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Use urlopen to fetch the page
            response = urllib2.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike; reason:", e.reason
                return None

    # Pass in a page index and return a list of that page's picture-free stories
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page failed to load...."
            return None
        pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                             'content">(.*?)<!--(.*?)-->.*?</div>(.*?)' +
                             '<div class="stats.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Stores the stories of this page
        pageStories = []
        # Traverse the matches of the regular expression
        for item in items:
            # Does it contain a picture?
            haveImg = re.search("img", item[3])
            # If there is no picture, add it to the list
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR, "\n", item[1])
                # item[0] is the publisher, item[1] the content, item[2] the publication time, item[4] the number of likes
                pageStories.append([item[0].strip(), text.strip(), item[2].strip(), item[4].strip()])
        return pageStories

    # Load and extract a page's content and append it to the list
    def loadPage(self):
        # If fewer than 2 unviewed pages remain, load a new page
        if self.enable == True:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store this page's stories in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # Increment the page index so the next load reads the next page
                    self.pageIndex += 1

    # Call this method to print one story per press of Enter
    def getOneStory(self, pageStories, page):
        # Traverse one page of stories
        for story in pageStories:
            # Wait for user input
            input = raw_input()
            # On every Enter, check whether a new page needs to be loaded
            self.loadPage()
            # If the input is Q, the program ends
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher:%s\tPublished time:%s\tLikes:%s\n%s" % (page, story[0], story[2], story[3], story[1])

    # Start method
    def start(self):
        print u"Reading Qiushibaike; press Enter to view a new story, Q to exit"
        # Set the flag to True so the program can run
        self.enable = True
        # Load the first page of content
        self.loadPage()
        # Local variable tracking the number of pages read so far
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of stories from the global list
                pageStories = self.stories[0]
                # Increment the count of pages read
                nowPage += 1
                # Delete the first element of the global list since it has been taken out
                del self.stories[0]
                # Output this page's stories
                self.getOneStory(pageStories, nowPage)

spider = QSBK()
spider.start()

Okay, let's test it. Each press of Enter outputs one story with its publisher, publication time, content, and number of likes. Feels pretty cool, doesn't it?
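One small detail in getPageItems is worth isolating: the <br/> tags inside a story body are turned into real line breaks with re.sub before printing. A minimal sketch (Python 3; the story text is made up):

```python
import re

# Made-up story body containing the <br/> tags the site uses inside content
raw = 'line one<br/>line two<br/>line three'

# Same substitution the crawler performs before printing a story
replaceBR = re.compile('<br/>')
text = re.sub(replaceBR, "\n", raw)
print(text)
# line one
# line two
# line three
```

Pre-compiling the pattern, as the crawler does, pays off when the same substitution runs on every story of every page.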

This concludes our first hands-on crawler project. Welcome everyone to keep following along.

Reference: https://cloud.tencent.com/developer/article/1689793 (Python crawler crawling Qiushibaike story examples, Cloud+ Community, Tencent Cloud)