Python crawler tutorial: crawl Zhihu.com

Preface

Python is very popular right now: the syntax is simple and the language is powerful, which is why so many people want to learn it.

Zhihu has become a common training ground for crawlers. This article uses Python's requests library to simulate a login to Zhihu, obtain the cookies, and save them locally. Those cookies are then used as login credentials to open the Zhihu home page and crawl it, extracting the questions shown there together with summaries of their corresponding answers.

For the verification code (captcha) that Zhihu may require at login, I use Pillow (PIL), the main image-processing library for Python, to display the captcha image. If that does not work, the image is saved locally so the code can be read and entered manually.

The key part of crawling Zhihu: simulated login

By capturing the packets of a Zhihu login, you can see that logging in requires POSTing three parameters: the account, the password, and _xsrf. The _xsrf value is hidden in the login form, and the server generates a new random string for it on every visit. So, to simulate a login, you must first obtain _xsrf.

The result of capturing the packets with Chrome (or Firefox with HttpFox):

Therefore, the value of _xsrf must be obtained first. Note that it is a dynamically changing parameter and is different every time.

Note the difference between re.findall and BeautifulSoup's find_all.
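To make the distinction concrete, here is a minimal sketch that extracts _xsrf from the page in both ways; it assumes the page still embeds the token in an input field named _xsrf, as it did when this article was written:

import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://www.zhihu.com', headers=headers).text

# re.findall: a regex search over the raw HTML, returns a list of captured strings
xsrf_list = re.findall(r'name="_xsrf" value="(.*?)"', html)

# BeautifulSoup's find_all: searches the parsed document, returns a list of Tag objects
soup = BeautifulSoup(html, 'html.parser')
xsrf_tags = soup.find_all('input', attrs={'name': '_xsrf'})

print(xsrf_list[0] if xsrf_list else 'not found')
print(xsrf_tags[0].get('value') if xsrf_tags else 'not found')

Both return lists, which is why get_xsrf() below takes the first element with _xsrf[0].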

After getting _xsrf, you can simulate the login as shown below. We use the Session object from the requests library; the advantage of a session is that different requests from the same user are linked together, and cookies are handled automatically until the session ends.

Note: cookies is a file in the current directory that stores the Zhihu cookies. On the first login there is of course no such file yet, so you cannot log in through the cookie file and must enter a password.
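A minimal sketch of that setup, the same pattern used in the full source further down:

import requests
try:
    import cookielib                     # Python 2
except ImportError:
    import http.cookiejar as cookielib   # Python 3

session = requests.session()
# Persist cookies to a file named 'cookies' in the current directory
session.cookies = cookielib.LWPCookieJar(filename='cookies')
try:
    session.cookies.load(ignore_discard=True)   # reuse cookies from a previous login
except Exception:
    print("Cookie failed to load")               # first run: the file does not exist yet

With the session prepared, the login function below posts the credentials through it: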

def login(secret, account):
    # Determine from the entered user name whether it is a mobile phone number
    if re.match(r"^1\d{10}$", account):
        print("Mobile phone number login\n")
        post_url = 'https://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': get_xsrf(),
            'password': secret,
            'remember_me': 'true',
            'phone_num': account,
        }
    else:
        if "@" in account:
            print("Email login\n")
        else:
            print("There is a problem with the account you entered, please log in again")
            return 0
        post_url = 'https://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': get_xsrf(),
            'password': secret,
            'remember_me': 'true',
            'email': account,
        }
    try:
        # Login succeeds without a verification code
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = login_page.text
        print(login_page.status_code)
        print(login_code)
    except:
        # A verification code is required to log in
        postdata["captcha"] = get_captcha()
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = login_page.json()   # parse the JSON response instead of eval()
        print(login_code['msg'])
    session.cookies.save()

# Python 2 compatibility: map input() to raw_input()
try:
    input = raw_input
except NameError:
    pass

This is the login function. It logs in by POSTing your account, password, and _xsrf to Zhihu's login endpoint, then obtains the cookies and saves them to the cookies file in the current directory. The next time you log in, this cookie file is read directly.
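A minimal usage sketch of that flow; the account and password here are placeholders, not real credentials:

# First run: no cookie file exists yet, so log in with the password;
# session.cookies.save() inside login() writes the 'cookies' file.
login("my_password", "user@example.com")

# Later runs: skip the password login and restore the saved session.
session.cookies.load(ignore_discard=True)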

#LWP-Cookies-2.0
Set-Cookie3: cap_id="\"YWJkNTkxYzhiMGYwNDU2OGI4NDUxN2FlNzBmY2NlMTY=|1487052577|4aacd7a27b11a852e637262bb251d79c6cf4c8dc\""; path="/"; domain=".zhihu.com"; path_spec; version=0
Set-Cookie3: l_cap_id="\"OGFmYTk3ZDA3YmJmNDQ4YThiNjFlZjU3NzQ5NjZjMTA=|1487052577|0f66a8f8d485bc85e500a121587780c7c8766faf\""; path="/"; domain=".zhihu.com"; path_spec; version=0
Set-Cookie3: login="\"NmYxMmU0NWJmN2JlNDY2NGFhYzZiYWIxMzE5ZTZiMzU=|1487052597|a57652ef6e0bbbc9c4df0a8a0a59b559d4e20456\""; path="/"; domain=".zhihu.com"; path_spec; version=0
Set-Cookie3: q_c1="ee29042649aa4f87969ed193acb6cb83|1487052577000|1487052577000"; path="/"; domain=".zhihu.com"; path_spec; expires="2020-02-14 06:09:37Z"; version=0
Set-Cookie3: z_c0="\"QUFCQTFCOGdBQUFYQUFBQVlRSlZUVFVzeWxoZzlNbTYtNkt0Qk1NV0JLUHZBV0N6NlNNQmZ3PT0=|1487052597|...\""; path="/"; domain=".zhihu.com"; path_spec; httponly=None; version=0

This is the content of the cookie file

The following is the source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
try:
    import cookielib                     # Python 2
except ImportError:
    import http.cookiejar as cookielib   # Python 3
import re
import time
import os.path
try:
    from PIL import Image    # Pillow is optional, used only to display the captcha
except ImportError:
    pass

from bs4 import BeautifulSoup


# Construct Request headers
agent = 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': agent
}

# Use login cookie information
session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename='cookies')
try:
    session.cookies.load(ignore_discard=True)
except Exception:
    print("Cookie failed to load")



def get_xsrf():
    '''_xsrf is a dynamically changing parameter'''
    index_url = 'https://www.zhihu.com'
    # Get the _xsrf that needs to be used when logging in
    index_page = session.get(index_url, headers=headers)
    html = index_page.text
    pattern = r'name="_xsrf" value="(.*?)"'
    # Here _xsrf returns a list
    _xsrf = re.findall(pattern, html)
    return _xsrf[0]





# get verification code
def get_captcha():
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)   # the with-block closes the file automatically
    # Use pillow's Image to display the verification code
    # If Pillow is not installed, go to the directory where the source code is located to find the verification code and enter it manually
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except:
        print(u'Please find captcha.jpg in the %s directory and enter it manually' % os.path.abspath('captcha.jpg'))
    captcha = input("please input the captcha\n>")
    return captcha





def isLogin():
    # Judge whether you have logged in by checking the user's personal information
    url = "https://www.zhihu.com/settings/profile"
    login_code = session.get(url, headers=headers, allow_redirects=False).status_code
    if login_code == 200:
        return True
    else:
        return False


def login(secret, account):
    # Determine from the entered user name whether it is a mobile phone number
    if re.match(r"^1\d{10}$", account):
        print("Mobile phone number login\n")
        post_url = 'https://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': get_xsrf(),
            'password': secret,
            'remember_me': 'true',
            'phone_num': account,
        }
    else:
        if "@" in account:
            print("Email login\n")
        else:
            print("There is a problem with the account you entered, please log in again")
            return 0
        post_url = 'https://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': get_xsrf(),
            'password': secret,
            'remember_me': 'true',
            'email': account,
        }
    try:
        # Login succeeds without a verification code
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = login_page.text
        print(login_page.status_code)
        print(login_code)
    except:
        # A verification code is required to log in
        postdata["captcha"] = get_captcha()
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = login_page.json()   # parse the JSON response instead of eval()
        print(login_code['msg'])
    session.cookies.save()

# Python 2 compatibility: map input() to raw_input()
try:
    input = raw_input
except NameError:
    pass



## Output the list of questions on the main page to the shell
def getPageQuestion(url2):
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    tags = soup.find_all("a", class_="question_link")
    # print(tags)

    for tag in tags:
        print(tag.string)

# Output the answer summaries for the questions on the main page to the shell
def getPageAnswerAbstract(url2):
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    tags = soup.find_all('div', class_='zh-summary summary clearfix')

    for tag in tags:
        # print(tag)
        print(tag.get_text())
        print('Detailed link:', tag.find('a').get('href'))


def getPageALL(url2):
    # An earlier attempt used soup.find_all('div', class_='feed-item-inner') instead
    mainpage = session.get(url2, headers=headers)
    soup = BeautifulSoup(mainpage.text, 'html.parser')
    tags = soup.find_all('div', class_='feed-content')
    for tag in tags:
        # print(tag)
        print(tag.find('a', class_='question_link').get_text())
        # TODO: extracting the answer summary from each feed item still needs work (not yet fluent with bs4)
        # print(tag.find('div', class_='zh-summary summary clearfix').get_text())


if __name__ == '__main__':
    if isLogin():
        print('You are already logged in')
        url2 = 'https://www.zhihu.com'
        # getPageQuestion(url2)
        # getPageAnswerAbstract(url2)
        getPageALL(url2)
    else:
        account = input('Please enter your username\n>')
        secret = input("Please enter your password\n> ")
        login(secret, account)

Run result:

Alright! That's all for this article; thanks for reading.

·END·

Reference: https://cloud.tencent.com/developer/article/1507658 (Python crawler tutorial: Crawling Zhihu.com, Tencent Cloud Community)