Okay, let's get into the topic and walk through how the program is implemented.
First, we need to analyze the structure of the Movie Paradise homepage.
The menu bar at the top shows the overall classification of the site's resources, and we can take advantage of this by using each category's address as a starting point for the crawler.
①Analyze the homepage address to extract the classification information
# Parse the homepage
def CrawIndexPage(starturl):
    print "Crawling the homepage"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "The homepage resolved", len(Nodes), "addresses"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/Movie Resources/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Created category directory------" + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass
In this function, we first download the page source, parse the menu classification through XPath, and create a corresponding directory for each category. One thing that needs attention is the encoding: I was tangled up with it for quite a while. Checking the page source shows that the site is encoded in GB2312, and since XPath needs text to construct the tree object, the page must first be decoded from GB2312 into Unicode. Only then is the DOM tree built correctly; otherwise the later parsing runs into problems.
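The helpers __getpage and __isexit are used throughout the code but are not shown in this post. Below is a minimal sketch of what they might look like, assuming __getpage simply fetches the raw page bytes with urllib2 and returns the string "error" on failure, and __isexit just checks whether a URL is already in the crawled list; the original implementations may differ.

# Hypothetical sketch of the two helpers (not from the original post).
import urllib2

def __getpage(url):
    # Download the raw page bytes; return "error" on any failure,
    # matching how the crawl functions check the return value.
    try:
        request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        return urllib2.urlopen(request, timeout=10).read()
    except Exception:
        return "error"

def __isexit(newurl, CrawledURLs):
    # Return True if this URL has already been crawled in the current category.
    return newurl in CrawledURLs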
②Analyze the homepage of each category
# Parse a category page
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "Resolving category homepage resources"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # A non-paging address: the video resource address can be parsed from it
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                # File names must not contain the following special characters
                filename = node.xpath("text()")[0].encode("utf-8").replace("/", " ") \
                    .replace("\\", " ") \
                    .replace(":", " ") \
                    .replace("*", " ") \
                    .replace("?", " ") \
                    .replace("\"", " ") \
                    .replace("<", " ") \
                    .replace(">", " ") \
                    .replace("|", " ")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # A paging address: parse it again with a nested call
            print "The paging address is parsed again from the nest", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if __isexit(pageurl, CrawledURLs):
                pass
            else:
                print "The paging address is parsed again from the nest", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)
Open the homepage of each category and you will find that they all share the same structure. First, the nodes containing the resource URLs are parsed, and then the name and URL are extracted. There are two things to pay attention to here. First, since the resources are to be saved in txt files, certain special characters are not allowed in file names and must be stripped out. Second, pagination must be handled: the site displays its data in pages, so recognizing and crawling the paging links is important. Observation shows that paging addresses do not start with "/", so it is enough to pick out the paging links with a regular expression and handle them with a nested (recursive) call, which solves the pagination problem (see the small illustration below).
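As an illustration of that logic, here is a small sketch using made-up URLs (the real hrefs on the site may look different) showing how a relative paging link is joined to the category index URL with the same rfind("/") trick used in CrawListPage:

# Illustration only: the URLs below are placeholders, not real site addresses.
indexurl = "http://www.example-movie-site.com/html/gndy/dyzz/index.html"
pagelink = "list_23_2.html"   # a paging href does not start with "/"

index = indexurl.rfind("/")
baseurl = indexurl[0:index + 1]   # everything up to and including the last "/"
print baseurl + pagelink          # .../html/gndy/dyzz/list_23_2.html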
③Analyze the resource address and save it in a file
# Process the resource page and crawl the resource addresses
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"
This part is straightforward: the extracted download addresses are simply written into a text file.
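As a side note, the writing step can also use a with block so the file is closed even if an exception occurs; a minimal variant of the loop above (assuming the same Nodes, filedir and filename as in CrawlSourcePage) might look like this:

# Alternative sketch of the file-writing step using a with block (assumption:
# Nodes, filedir and filename are the same as in CrawlSourcePage above).
source = filedir + "/" + filename + ".txt"
with open(source.decode("utf-8"), 'w') as f:
    for node in Nodes:
        f.write(node.xpath("text()")[0].encode("utf-8") + "\n")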
To improve efficiency, the program crawls with multiple threads. Here I start one thread for each category homepage, which speeds up the crawl considerably. At first I ran it with a single thread, but after waiting a whole afternoon it died on an unhandled exception, and the afternoon was wasted!!! Exhausting.
class myThread(threading.Thread):   # Inherit from the parent class threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # Put the code to execute in run(); the thread runs it after start() is called
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
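The post does not show the module-level setup or the entry point. A minimal sketch of how the pieces might be wired together, assuming all the functions above live in one script and using a placeholder for the site root, could look like this:

# Hypothetical wiring (not from the original post): module-level imports,
# the host global used by the crawl functions, and the entry point.
import os
import re
import threading
from lxml import etree

host = "http://www.example-movie-site.com"   # placeholder for the real site root

if __name__ == '__main__':
    # CrawIndexPage starts one myThread per category found on the homepage.
    CrawIndexPage(host)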
The final crawling result is as follows.