Preface
I have always had the habit of watching American TV shows, partly to practice my English listening and partly to pass the time. It used to be possible to watch them online on video sites, but since SARFT's restriction order, imported American and British shows no longer seem to update in sync with their original broadcasts. Still, as a confirmed homebody, how could I give up following my shows? A quick search online turned up a US-drama download site, [Tiantian Meiju], whose resources can be downloaded with Thunder; there are even high-definition documentaries where the scenery looks simply gorgeous.
Even with the download site, though, every time I wanted a show I had to open the browser, type in the URL, find the series, and click through to the download link. After a while the process felt very cumbersome, and sometimes the site's links would not open, which was an extra annoyance. Since I happened to be learning Python crawlers, I wrote one on a whim today: it grabs all the American-drama links on the site and saves them to a text file, so whenever I want a show I just open the file and copy its link into Thunder to download it.
At first I actually planned to write the kind of crawler that discovers a URL, opens it with requests, grabs the download links, and spreads out from the homepage until it has covered the whole site. But there were a lot of duplicate links, and the site's URLs were not as regular as I had expected. After fiddling with it for a long time I still had not produced the kind of fan-out crawler I wanted, probably because my skills are not there yet. I'll keep working at it...
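For reference, the duplicate-link problem in that kind of site-wide crawl is usually handled by keeping a set of URLs that have already been visited. The sketch below only illustrates that idea and is not the code I abandoned; the href pattern it follows is an assumption about the site's archive links.

    # Sketch only: breadth-first crawl with de-duplication via a visited set.
    import re
    import requests

    def crawl_site(start_url='http://cn163.net/', limit=100):
        seen = set()            # URLs already fetched, so duplicates are skipped
        queue = [start_url]     # URLs waiting to be fetched
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=3).text
            except Exception:
                continue
            # Assumed link format: only follow article pages on the same site
            for link in re.findall(r'href="(http://cn163\.net/archives/\d+/)"', html):
                if link not in seen:
                    queue.append(link)
        return seen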
Later I noticed that the links for every series live inside an article page, and the article URL ends in a number, like http://cn163.net/archives/24016/. So, drawing on my earlier crawler experience, the solution was to generate the URLs automatically: only the trailing number changes, and each show has its own unique one. I estimated roughly how many articles there were, then used the range function to generate the numbers in sequence and construct the URLs.
Many of those URLs simply do not exist, though, and requesting them would fail outright. No matter: requests exposes a status_code attribute for checking the response, so whenever the returned status code is 404 we skip that URL, and everything else gets crawled for links. That takes care of the URL problem.
The following code implements the steps above.
    def get_urls(self):
        try:
            for i in range(2015, 25000):
                base_url = 'http://cn163.net/archives/'
                url = base_url + str(i) + '/'
                # Skip article ids that the site answers with 404
                if requests.get(url).status_code == 404:
                    continue
                else:
                    self.save_links(url)
        except Exception as e:
            pass
The rest went smoothly. I found a similar crawler someone else had written online, but it only crawled a single article, so I borrowed its regular expressions. I had tried BeautifulSoup, but the results were not as good as the regex, so I dropped it decisively; there is no end to learning. The outcome is still not ideal, though: about half of the links cannot be extracted correctly, and further optimization is needed.
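For comparison, extracting the show title with BeautifulSoup would look roughly like the sketch below. This is not the code I actually wrote; it assumes bs4 is installed and simply mirrors the entry_title regex used in the full script that follows.

    # Sketch only: BeautifulSoup alternative for pulling the page title.
    import requests
    from bs4 import BeautifulSoup

    def get_title(url):
        html = requests.get(url, timeout=3).text
        soup = BeautifulSoup(html, 'html.parser')
        # Assumes the title sits in <h2 class="entry_title">, as the regex implies
        h2 = soup.find('h2', class_='entry_title')
        return h2.get_text(strip=True) if h2 else None

The full crawler, using the regex approach, follows.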
# -*- coding:utf-8 -*-
import requests
import re
import sys
import threading
import time

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2: avoid encoding errors when writing titles


class Archives(object):

    def save_links(self, url):
        try:
            data = requests.get(url, timeout=3)
            content = data.text
            # ed2k links for 1024-width releases, capturing the SxxExx parts
            link_pat = '"(ed2k://\|file\|[^"]+?\.(S\d+)(E\d+)[^"]+?1024X\d{3}[^"]+?)"'
            name_pat = re.compile(r'<h2 class="entry_title">(.*?)</h2>', re.S)
            links = set(re.findall(link_pat, content))
            name = re.findall(name_pat, content)
            links_dict = {}
            count = len(links)
        except Exception as e:
            pass
        for i in links:
            # Build a sort key from the season and episode numbers (SxxExx)
            links_dict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
        try:
            with open(name[0].replace('/', '') + '.txt', 'w') as f:
                print name[0]
                # Write the links in season/episode order
                for i in sorted(list(links_dict.keys())):
                    f.write(links_dict[i][0] + '\n')
            print "Get links ... ", name[0], count
        except Exception as e:
            pass

    def get_urls(self):
        try:
            for i in range(2015, 25000):
                base_url = 'http://cn163.net/archives/'
                url = base_url + str(i) + '/'
                # Skip article ids that the site answers with 404
                if requests.get(url).status_code == 404:
                    continue
                else:
                    self.save_links(url)
        except Exception as e:
            pass

    def main(self):
        # Pass the method itself (not its result) as the thread target
        thread1 = threading.Thread(target=self.get_urls)
        thread1.start()
        thread1.join()


if __name__ == '__main__':
    start = time.time()
    a = Archives()
    a.main()
    end = time.time()
    print end - start
That is the full version of the code. It also uses multi-threading, but that feels pointless because of Python's GIL. There seem to be more than 20,000 article ids, and I expected the crawl to take a long time, yet excluding the URLs that errored out or did not match, the whole crawl finished in under 20 minutes. I was tempted to set up distributed crawling with Redis across two Linux machines, but after a lot of fiddling it felt unnecessary, so I'll leave that for later, when I need more data.
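As an aside, main() above only ever starts a single worker thread, so there is no parallelism to begin with. If threading were worth revisiting, the id range could be split across several workers roughly as sketched below (the worker count and chunking are hypothetical; requests is I/O-bound, and the GIL is released while waiting on the network).

    # Sketch only: split the archive id range across a few worker threads.
    # Assumes the Archives class from the script above.
    import threading
    import requests

    def crawl_range(archives, start, end):
        for i in range(start, end):
            url = 'http://cn163.net/archives/%d/' % i
            if requests.get(url).status_code != 404:
                archives.save_links(url)

    def run_threaded(archives, start=2015, end=25000, workers=4):
        step = (end - start) // workers + 1
        threads = []
        for n in range(workers):
            lo = start + n * step
            hi = min(lo + step, end)
            t = threading.Thread(target=crawl_range, args=(archives, lo, hi))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()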
In addition, I ran into a really torturous problem while saving the files, which I have to complain about here: a file name may contain spaces, but not slashes, backslashes, brackets, and the like. That was the whole problem. I spent the entire morning on it; at first I thought the crawled data was wrong, and only after checking for a long time did I discover that the title of one show contained a slash. That made me miserable.
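To save anyone else that morning: strip the troublesome characters from the title before using it as a file name. A small sketch of such a sanitizer (the character set below is the one Windows forbids in file names, which also covers the slash case above):

    import re

    def safe_filename(title):
        # Remove characters that commonly break file names:
        # backslash, slash, colon, asterisk, question mark, quotes, angle brackets, pipe
        return re.sub(r'[\\/:*?"<>|]', '', title)

    # e.g. open(safe_filename(name[0]) + '.txt', 'w')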