A Python veteran takes you by the hand to write a crawler and download the girl pictures from an entire site. Satisfying in one go!

Today I'll walk everyone through writing a simple but complete crawler: let's grab the pictures from an entire site and save them to the computer!

Preparation

Tools: Python 3.6, PyCharm

Libraries: requests, re, time, random, os

Target website: meizitu (see the code for the specific url...)

Before writing code

Before we start writing code, we first need to analyze the website, focusing on these points:

1. First, determine whether the page is static, since this decides which crawling approach we use (a quick check is sketched after this list)!

Simply put, if the content you want can be found in the page's source code, you can conclude the site is static; if it can't, you need to dig into the browser's developer tools and either capture the network requests or analyze the JavaScript, among other approaches.

2. Study the structure of the site until it is roughly clear how many layers of loops it takes to reach the target data, how each loop iterates, and whether you can guarantee nothing is missed!

3. Choose the matching method based on the page's source code.

Generally speaking, regular expressions are the fastest way to process strings, but they are not especially efficient for crawling, because they have to scan the entire html to match the relevant content. If the page source is fairly regular, bs4 or xpath, which parse the page's structure, are the better choice!
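Back to point 1: a quick way to tell whether a page is static is to fetch the raw html and look for content you can see in the rendered page. A minimal sketch, assuming requests is installed; the marker string is a placeholder you would replace with real text from the target page:

import requests

# Quick static-page check: fetch the raw html and see whether the
# content you want already appears in it. The marker string below is
# a placeholder -- substitute text you can see on the rendered page.
url = 'http://www.meizitu.com/'
resp = requests.get(url)
resp.encoding = 'gb2312' # this site serves gb2312-encoded pages
marker = 'text visible on the rendered page'
if marker in resp.text:
    print('Found in the source: the page is static, plain requests will do.')
else:
    print('Not in the source: check developer tools for xhr requests or js.')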

Of course, today's is a basic crawler, so we will use regular expressions. After all, regex is something you have to master anyway!
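To make point 3 concrete, here is the same link extracted both ways. A minimal sketch with a made-up fragment of html in this site's style; the bs4 route needs an extra install (pip install beautifulsoup4):

import re
from bs4 import BeautifulSoup

html = '<h3 class="tit"><a href="http://www.meizitu.com/a/123.html">demo</a></h3>'

# Regex: match the href pattern directly against the raw string.
links_re = re.findall(r'h3 class="tit"><a href="(http://www.meizitu.com/.*?html)"', html)

# bs4: parse the structure, then select by tag and class.
soup = BeautifulSoup(html, 'html.parser')
links_bs = [a['href'] for h3 in soup.find_all('h3', class_='tit') for a in h3.find_all('a')]

print(links_re) # ['http://www.meizitu.com/a/123.html']
print(links_bs) # ['http://www.meizitu.com/a/123.html']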

So how exactly do you structure the crawler code~? Let me give a simple example:

If you were doing it by hand, the process would roughly be:

Open the homepage ==> Pick a category ==> Pick an album ==> Click through the pictures in turn ==> Right-click to save ==> Repeat the above for the other pictures

Translate that process into code, and its structure looks roughly like this:

Visit the homepage url ==> Find and loop over all categories ==> Create the category folder ==> Visit the category url ==> Find the page count and loop over all pages of the category ==> Loop over all albums on each page ==> Create the album folder ==> Find all the picture urls in the album ==> Save them to the corresponding folder
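As a skeleton, that pipeline is just four nested loops. A minimal sketch, where find_categories, find_page_urls, find_albums, find_image_urls and title_of are hypothetical stubs standing in for the regex matching done in the complete code below:

import os
import requests

# Hypothetical stubs -- in the real script each of these is a
# re.findall over the fetched html.
def find_categories(html): return [] # [(category_url, category_name), ...]
def find_page_urls(html): return [] # page urls within one category
def find_albums(html): return [] # album urls on one page
def find_image_urls(html): return [] # picture urls in one album
def title_of(html): return 'untitled' # album title

def crawl(home_url):
    html = requests.get(home_url).text
    for cat_url, cat_name in find_categories(html): # loop 1: categories
        os.makedirs(cat_name, exist_ok=True) # category folder
        for page_url in find_page_urls(requests.get(cat_url).text): # loop 2: pages
            page_html = requests.get(page_url).text
            for album_url in find_albums(page_html): # loop 3: albums
                album_html = requests.get(album_url).text
                album_dir = os.path.join(cat_name, title_of(album_html))
                os.makedirs(album_dir, exist_ok=True) # album folder
                for img_url in find_image_urls(album_html): # loop 4: pictures
                    data = requests.get(img_url).content
                    name = img_url.split('/')[-1]
                    with open(os.path.join(album_dir, name), 'wb') as f:
                        f.write(data)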

Okay, the idea is also there, so let's not talk too much nonsense, let's write the code~!

Complete code and running effect

A pause from the time module is added between requests. Without it, the site may refuse to serve you!

When you finally request the image url itself, you need to send a UA (User-Agent) header to tell the server you are a browser and not a script. This is the most common anti-crawling countermeasure.
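Here are those two precautions in isolation, before the full script. A minimal sketch, using the same User-Agent string the script below uses:

import time
import random
import requests

# Sleep a random fraction of a second between requests, and send a
# browser User-Agent so the server does not flag us as a script.
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
time.sleep(random.random()) # pause 0-1 s
resp = requests.get('http://www.meizitu.com/', headers=header)
print(resp.status_code)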

# Author: 云飞
# QQ group: 542110741
import requests
import time
import random
import re
import os

def new_title(title):
    rstr = r"[\/\\\:\*\?\"\<\>\|]" # characters illegal in Windows filenames: / \ : * ? " < > |
    new_title = re.sub(rstr, "_", title) # replace them with underscores
    return new_title

url = 'http://www.meizitu.com/'
html = requests.get(url)
html.encoding = 'gb2312'
infos = re.findall(r'a href="(http://www.meizitu.com/.*?html)" target="_blank" title="(.*?)"', html.text)
i = 1
for sor_url, sor in infos:
    sor = new_title(sor)
    path = 'E://python/mn/meizitu/%s/' % sor # category folder path
    if not os.path.exists(path): # create the folder (and parents) if it doesn't exist yet
        os.makedirs(path)
    time.sleep(random.random())
    sor_html = requests.get(sor_url)
    sor_html.encoding = 'gb2312'
    atlas = set(re.findall(r"<li><a href='(.*?html)'>\d+</a></li>", sor_html.text))
    atlas_lis = []
    atlas_lis.append(sor_url)
    atlas_lis += [url + 'a/' + x for x in list(atlas)]
    for atla in atlas_lis:
        atla_html = requests.get(atla).text
        at_url_lis = re.findall(r'h3 class="tit"><a href="(http://www.meizitu.com/.*?html)" targe', atla_html)
        for at_url in at_url_lis:
            at_html = requests.get(at_url)
            at_html.encoding = "gb2312"
            atlas_title = ''.join(re.findall(r'<title>(.*?)</title>', at_html.text))
            atlas_title = new_title(atlas_title)
            img_path = 'E://python/mn/meizitu/%s/%s/' % (sor, atlas_title)
            if not os.path.exists(img_path): # create the album folder if needed
                os.makedirs(img_path)
            img_urls = re.findall(r'src="(http://mm.chinasareview.com/.*?jpg)"/><br/>', at_html.text)
            k = 1
            for img_url in img_urls:
                header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}
                data = requests.get(img_url, headers=header).content # fetch the raw image bytes
                with open('%s%s' % (img_path, img_url.split('/')[-1]), 'wb') as f:
                    f.write(data)
                print("[Downloading] Picture %d of {%s}, %d pictures downloaded in total" % (k, atlas_title, i))
                i += 1
                k += 1

The result after letting it download for a while:

