Crawling the Movie Heaven Website with Python

Let's get into the topic and walk through how the program is implemented.

First, we need to analyze the structure of the Movie Heaven home page.

The menu bar at the top of the page shows how the site's resources are categorized. We can take advantage of this and use each category's address as a starting point for the crawler.

① Analyze the home page and extract the category information

# Analyze the home page
import os
import re
import threading
from lxml import etree

def CrawIndexPage(starturl):
    print "Crawling the home page"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "The home page yields", len(Nodes), "category links"
    for node in Nodes:
        CrawledURLs = []            # each category thread keeps its own visited list
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index\.html', url):
            if not __isexit(host + url, CrawledURLs):
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/Movie Resources/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Created category directory ------" + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass

This function first downloads the page source, parses the menu categories with XPath, and creates a corresponding directory for each category. The one thing to watch out for is encoding, which I wrestled with for a long time: checking the page source shows that the site is encoded in GB2312, so the text must be decoded (here with the gbk codec, a superset of GB2312) into Unicode before the tree object is built for XPath. Otherwise the DOM tree is constructed incorrectly and later parsing breaks.
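The helpers __getpage and __isexit and the host variable are used throughout but never shown in the article. Below is a minimal sketch of what they might look like, assuming Python 2 with urllib2; the bodies and the value of host are my reconstruction, not the original code.

# Hypothetical reconstruction of the missing helpers (not from the original article)
import urllib2

host = "http://www.dytt8.net"   # assumed site root; the article never shows it

def __getpage(url):
    # Download a page's raw bytes; callers expect the string "error" on failure
    try:
        return urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return "error"

def __isexit(url, CrawledURLs):
    # Return True if this URL was already crawled (the name is likely a typo of "isexist")
    return url in CrawledURLs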

② Analyze each category's home page

# Parse a category's listing pages
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "Resolving category page resources"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # A non-paging address; the video resource address can be parsed from it
            if not __isexit(host + url, CrawledURLs):
                # These special characters are not allowed in file names
                filename = node.xpath("text()")[0].encode("utf-8").replace("/", " ")\
                                                                  .replace("\\", " ")\
                                                                  .replace(":", " ")\
                                                                  .replace("*", " ")\
                                                                  .replace("?", " ")\
                                                                  .replace("\"", " ")\
                                                                  .replace("<", "")\
                                                                  .replace(">", "")\
                                                                  .replace("|", "")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # A paging address; recurse into it
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if not __isexit(pageurl, CrawledURLs):
                print "Recursing into paging address", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)

Open the home page of any category and you will find they all share the same structure. First the nodes containing resource URLs are parsed, then the name and URL are extracted from each. Two things deserve attention here. First, since each resource is saved to a txt file, the name must be stripped of special characters that are not allowed in file names. Second, paging has to be handled: the site displays its data page by page, so recognizing and crawling the paging links matters. Observation shows that, unlike resource links, paging addresses do not start with "/", so a regular expression can tell the two apart, and a recursive call on each paging link solves the paging problem.
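The long chain of replace calls above can be written more compactly with regular expressions. A sketch of an equivalent helper (my own variant, not from the article):

# Equivalent filename sanitization in two re.sub calls (illustrative variant)
import re

def sanitize(name):
    # Replace the forbidden characters / \ : * ? " with spaces,
    # and drop < > | entirely, mirroring the replace chain above
    name = re.sub(r'[/\\:*?"]', " ", name)
    return re.sub(r'[<>|]', "", name)

Similarly, the manual rfind/slice used to join paging URLs could be replaced with the standard library's urlparse.urljoin(indexurl, url), which handles relative paths for free.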

③ Parse the resource address and save it to a file

# Crawl the resource address from a detail page and save it
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "Failed to save resources for", url

This part is relatively simple: it just writes the extracted download links to a file.
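In Python 2, the manual encode call on every write can be avoided with codecs.open, which accepts Unicode strings directly. A sketch of that alternative (not the article's code):

# Writing Unicode directly with codecs.open instead of manual encode calls
import codecs

def save_links(path, links):
    f = codecs.open(path, 'w', encoding='utf-8')
    for link in links:
        f.write(link + u"\n")   # links are Unicode strings as returned by lxml
    f.close()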

To improve efficiency, the crawler is multithreaded: one thread is started per category home page, which speeds up the crawl considerably. At first I ran it single-threaded, waited a whole afternoon, and then watched it die from an unhandled exception. Exhausting!

class myThread(threading.Thread):  # Inherits from threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs
    def run(self):  # The code to execute goes in run(); it runs once the thread is started
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
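The article never shows the entry point that ties these pieces together. A plausible main, assuming the host value sketched earlier (the starting URL is an assumption):

# Hypothetical entry point (not shown in the original article)
if __name__ == '__main__':
    starturl = host   # start from the site root, whose menu lists the categories
    CrawIndexPage(starturl)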

The final result of the crawl is one directory per category, each containing a .txt file of download links per movie.

Reference: "Python crawls the movie heaven website", Tencent Cloud Developer Community, https://cloud.tencent.com/developer/article/1534285