Preface
Python is very popular now, thanks to its simple syntax and powerful features.
What I want to introduce today is crawling and recognizing verification codes. Only the simplest graphic verification codes are covered, which are still a fairly common type.
Operating platform: Windows
Python version: Python3.6
IDE: Sublime Text
Other: Chrome browser
The process in brief:
Step 1: Briefly introduce the verification code
Step 2: Crawl a few pictures of the verification code
Step 3: Introduce Baidu text recognition OCR
Step 4: Recognize the crawled verification codes
Step 5: Simple image processing
Many websites now take measures against crawlers, and verification codes are one of them. For example, when a site detects that visits are too frequent, it pops up a verification code to confirm that the visitor is not a robot. As crawler technology has developed, verification codes have come in more and more varieties: from the original simple graphic verification codes made of a few digits or letters (the kind we deal with today), to ones that ask you to click upside-down characters or click the pictures that match a text prompt, slider verification codes where a piece must be dragged to the right position, and ones that pose arithmetic questions. In short, there are enough tricks to make you go bald. For more background on verification codes, see captcha.org.
Let’s briefly talk about the graphic verification code, like this one:
It consists of letters and digits plus some noise. To resist recognition, however, even simple graphic verification codes keep getting more complicated: some add interference lines, some add noise, some add backgrounds, and the characters may be distorted, stuck together, hollowed out or mixed. Sometimes even the human eye struggles, and all you can do is silently click "I can't see clearly, give me another one."
Harder verification codes mean a higher cost of recognition. In the recognition steps below, I will first use Baidu text recognition OCR directly to test its accuracy, and then decide whether image operations such as grayscale conversion, binarization and interference removal are needed to improve the recognition rate.
Next, we will crawl a small number of verification code pictures and save them to a folder.
Open the Chrome browser and visit the website introduced above. It links to a page of captcha image samples: https://captcha.com/captcha-examples.html?cst=corg
That page has 60 different types of graphic verification codes, which is enough for our recognition tests.
Let's look at the code directly:
    import os
    import time

    import requests
    from lxml import etree


    def get_Page(url, headers):
        """Fetch a page and return its HTML, or None if the request fails."""
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.text)
            return response.text
        return None


    def parse_Page(html, headers):
        html_lxml = etree.HTML(html)
        # The samples sit in two columns, each selected by its class
        datas = html_lxml.xpath('.//div[@class="captcha_images_left"]|.//div[@class="captcha_images_right"]')
        # Create the folder for the verification code images and switch to it
        file = 'D:/******'
        if not os.path.exists(file):
            os.mkdir(file)
        os.chdir(file)
        count = 0  # running total across both columns
        for data in datas:
            # Verification code names
            name = data.xpath('.//h3')
            # Verification code image links
            src = data.xpath('.//div/img/@src')
            for i in range(len(name)):
                # Verification code image file name
                filename = name[i].text + '.jpg'
                img_url = 'https://captcha.com/' + src[i]
                response = requests.get(img_url, headers=headers)
                if response.status_code == 200:
                    with open(filename, 'wb') as f:
                        f.write(response.content)
                    count += 1
                    print('Saved verification code image {}'.format(count))
                # Wait 1 second between image requests
                time.sleep(1)


    def main():
        url = 'https://captcha.com/captcha-examples.html?cst=corg'
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
        html = get_Page(url, headers)
        # Only parse when the page was actually fetched
        if html is not None:
            parse_Page(html, headers)


    if __name__ == '__main__':
        main()

The crawl again uses XPath. If you right-click a picture and choose Inspect, you can see that the page is laid out in two columns, distinguished by class (the red boxes in the figure below), and the verification codes sit inside those two columns.
datas = html_lxml.xpath('.//div[@class="captcha_images_left"]|.//div[@class="captcha_images_right"]')
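To see what this expression selects, here is a minimal sketch run against a tiny hand-written HTML fragment. The fragment, its headings and its image names are made up for illustration; only the two class names come from the page. It assumes lxml is installed.

```python
from lxml import etree

# Hypothetical miniature of the page: two columns plus one unrelated div
SAMPLE_HTML = """
<html><body>
  <div class="captcha_images_left"><h3>Sample A</h3><div><img src="a.jpg"/></div></div>
  <div class="captcha_images_right"><h3>Sample B</h3><div><img src="b.jpg"/></div></div>
  <div class="footer"><h3>Not a captcha</h3></div>
</body></html>
"""


def select_columns(html):
    """Return the div elements matched by either branch of the union."""
    tree = etree.HTML(html)
    return tree.xpath(
        './/div[@class="captcha_images_left"]'
        '|.//div[@class="captcha_images_right"]'
    )


columns = select_columns(SAMPLE_HTML)
# Same per-column queries as in the crawler
names = [h3.text for col in columns for h3 in col.xpath('.//h3')]
srcs = [s for col in columns for s in col.xpath('.//div/img/@src')]
print(names)
print(srcs)
```

The unrelated div is skipped because its class matches neither branch of the union.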
Here I used XPath's path selection. In a path expression, "|" means selecting several paths. In this case it selects the div blocks whose class is "captcha_images_left" or "captcha_images_right". Let's take a look at the results:
Because the code is forced to wait 1 second after every verification code image, the total running time is painfully long. Multi-threading seems necessary to speed things up; we will talk about multi-processing and multi-threading next time. For now, let's look at the verification code images that were crawled.
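As a small preview of the multi-threading idea, here is a minimal sketch using the standard library's concurrent.futures. The fake_fetch function is a stand-in that only sleeps and returns a label, and the names are invented; in the crawler it would wrap the requests.get call for one image. Whether the site tolerates parallel requests is a separate question.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(name):
    """Stand-in for downloading one image: waits briefly, returns a label."""
    time.sleep(0.1)  # simulates network latency
    return name + '.jpg'


def fetch_all(names, workers=4):
    """Run fake_fetch on every name in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, names))


sample_names = ['captcha{}'.format(n) for n in range(8)]  # hypothetical names
start = time.time()
saved = fetch_all(sample_names)
elapsed = time.time() - start
print(saved)
print('elapsed: {:.2f}s'.format(elapsed))
```

pool.map returns results in the order of the inputs, so the saved names still line up with the source list even though the waits overlap.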
Now that the pictures are in hand, the next step is to call Baidu text recognition OCR on them. Before that, a brief introduction to Baidu OCR. Many tutorials on recognizing verification codes use the tesserocr library, so I tried it first, but I ran into so many pitfalls during installation that I gave up and chose Baidu OCR instead. The Baidu OCR interface provides text detection, localization and recognition for images of natural scenes, and the recognized text can stand in for user input in scenarios such as translation, search and verification codes. Baidu also offers other recognition features in vision and speech technology. You can read the documentation for details: Baidu OCR API documentation, https://ai.baidu.com/docs#/OCR-API/top
To use Baidu OCR, first register an account, then download and install the interface module; you can simply run pip install baidu-aip in a terminal. Then create a text recognition application to obtain its App ID, API Key and Secret Key. Note that Baidu AI provides 50,000 free calls per day to the general text recognition interface, which is more than enough for us to splurge.
Then you can call the code directly.
    from aip import AipOcr

    # Your APP_ID, API Key (AK) and Secret Key (SK)
    APP_ID = 'Your APP_ID'
    API_KEY = 'Your API_KEY'
    SECRET_KEY = 'Your SECRET_KEY'

    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)


    # Read the picture as bytes
    def get_file_content(filePath):
        with open(filePath, 'rb') as fp:
            return fp.read()


    image = get_file_content('test.jpg')

    # Call general text recognition; the picture parameter is a local picture
    result = client.basicGeneral(image)

    # Define optional parameters
    options = {
        # Detect the image direction
        'detect_direction': 'true',
        # Language type; the default is 'CHN_ENG' (mixed Chinese and English)
        'language_type': 'CHN_ENG',
    }

    # Call the general text recognition interface again, now with the options
    result = client.basicGeneral(image, options)
    print(result)
    for word in result['words_result']:
        print(word['words'])
Here we identify this picture
You can look at the recognition results
The first output is the raw result returned by the recognition, and the second is the text extracted from it. As you can see, all the text is output correctly except the dash, which is missing. The picture used here is in jpg format. The text recognition interface accepts jpg, png and bmp, but the technical documentation mentions that uploading jpg pictures improves accuracy, which is why we saved the crawled verification codes as jpg.
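Since only jpg, png and bmp are named as supported, a folder listing can be filtered before anything is sent to OCR. This is a minimal sketch; the helper name and the sample file names are made up for illustration.

```python
import os

SUPPORTED_EXTENSIONS = {'.jpg', '.png', '.bmp'}  # formats named in the text


def supported_images(filenames):
    """Keep only file names whose extension is in the supported set."""
    keep = []
    for filename in filenames:
        # splitext returns (root, ext); compare the extension case-insensitively
        ext = os.path.splitext(filename)[1].lower()
        if ext in SUPPORTED_EXTENSIONS:
            keep.append(filename)
    return keep


listing = ['a.jpg', 'b.PNG', 'notes.txt', 'c.bmp', 'd.gif']  # hypothetical
filtered = supported_images(listing)
print(filtered)
```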
In the output result, each field represents:
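The result is a plain dictionary, so the fields can be read directly. As a minimal sketch, the sample below is hand-written to mirror the keys the code above relies on (words_result_num, words_result and words) plus a log_id; the values are invented, not real OCR output.

```python
# Hand-written sample shaped like the printed result; all values are invented
sample_result = {
    'log_id': 1234567890,
    'words_result_num': 2,          # number of recognized lines
    'words_result': [               # one entry per recognized line
        {'words': 'first line'},    # the text of that line
        {'words': 'second line'},
    ],
}


def extract_lines(result):
    """Return the recognized lines of text, or an empty list if none."""
    return [item['words'] for item in result.get('words_result', [])]


lines = extract_lines(sample_result)
print(sample_result['words_result_num'], lines)
```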
Next, what we have to do is to identify the verification code we crawled earlier with the OCR just introduced, and see if we can get the correct result.
    from aip import AipOcr
    import os

    i = 0  # pictures with no recognized text
    j = 0  # recognized lines of text (one picture may yield several)

    APP_ID = 'Your APP_ID'
    API_KEY = 'Your API_KEY'
    SECRET_KEY = 'Your SECRET_KEY'

    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    # Folder containing the verification code pictures
    # (raw string, so the backslashes are not treated as escapes)
    file_path = r'D:\******\Verification code picture'
    filenames = os.listdir(file_path)
    # print(filenames)

    # Define optional parameters
    options = {
        'detect_direction': 'true',
        'language_type': 'CHN_ENG',
    }

    for filename in filenames:
        # Join the folder path and the file name to get each file's full path
        info = os.path.join(file_path, filename)
        with open(info, 'rb') as fp:
            # Read the picture as bytes
            image = fp.read()
        # Call the general text recognition interface
        result = client.basicGeneral(image, options)
        # print(result)
        if result['words_result_num'] == 0:
            print(filename + ':' + '----')
            i += 1
        else:
            for word in result['words_result']:
                print(filename + ':' + word['words'])
                j += 1

    print('Total verification codes processed: {}'.format(i + j))
    print('No text recognized: {}'.format(i))
    print('Text recognized: {}'.format(j))

Just as with the single picture, here we read every picture in the verification code picture folder and let OCR recognize them in turn. The "words_result_num" field tells us whether any text was recognized: if so, the result is printed; if not, "----" is printed in its place. Each result is printed together with its file name so the two can be matched. Finally, we count the recognition results. Let's look at them.
Seeing the result, all I can say is: amazing! Sixty pictures somehow produced 65 results, 27 of them with no recognized text. This is not the result I wanted. Looking at the problem briefly, the file name "Vertigo Captcha Image.jpg" appears twice. Presumably interference in the picture caused the text to be output as two lines, which neatly explains the 5 extra results. But why is so much text unrecognized, and why are verification codes made of English letters and digits recognized as Chinese? It seems that relying on OCR alone to recognize verification code images does not work, so next we will process the images and try again.
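The total above 60 follows from how the loop tallies results: the counter j goes up once per recognized line, not once per picture. Below is a minimal sketch of a per-picture tally. The function name and the response dictionaries are invented stand-ins for real OCR calls; only the key names match the code above.

```python
def tally(results):
    """Count pictures with and without text, joining multi-line results.

    `results` maps a file name to a response dict shaped like the OCR output.
    """
    recognized = {}
    unrecognized = []
    for filename, result in results.items():
        if result['words_result_num'] == 0:
            unrecognized.append(filename)
        else:
            # One entry per picture, however many lines were returned
            recognized[filename] = ' '.join(
                word['words'] for word in result['words_result']
            )
    return recognized, unrecognized


# Invented responses: one two-line result, one single line, one empty
fake_results = {
    'a.jpg': {'words_result_num': 2,
              'words_result': [{'words': 'AB'}, {'words': 'CD'}]},
    'b.jpg': {'words_result_num': 1, 'words_result': [{'words': 'XYZ'}]},
    'c.jpg': {'words_result_num': 0, 'words_result': []},
}
recognized, unrecognized = tally(fake_results)
print(len(recognized) + len(unrecognized), recognized, unrecognized)
```

Counted this way, the total always equals the number of pictures, and a two-line result shows up as one joined string instead of two separate entries.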
This is the picture used when introducing the verification code
This picture was not recognized either, which is enough to make one go bald. Next, let's process it and see whether OCR can then recognize it correctly.

    from PIL import Image

    filepath = r'D:\******\Verification code image\AncientMosaic Captcha Image.jpg'
    image = Image.open(filepath)
    # Pass in 'L' to convert the picture to grayscale
    image = image.convert('L')
    # Pass in '1' to binarize the picture
    image = image.convert('1')
    image.show()

Let's take a look at what the picture looks like after this transformation.
There are indeed some differences, so let's quickly see whether it can be recognized now. It still fails, so we keep adjusting.

    from PIL import Image

    filepath = r'D:\******\Verification code image\AncientMosaic Captcha Image.bmp'
    image = Image.open(filepath)
    # Pass in 'L' to convert the picture to grayscale
    image = image.convert('L')
    # Instead of convert('1'), binarize with an explicitly chosen threshold
    count = 170
    table = []
    for i in range(256):
        if i < count:
            table.append(0)
        else:
            table.append(1)
    image = image.point(table, '1')
    image.show()

Here I saved the picture in bmp format and then set the binarization threshold myself instead of relying on the default behavior of convert('1'). The default cut-off is 127, but note that Pillow applies dithering by default, so convert('1') does not behave like a clean threshold unless dithering is disabled. The picture must be converted to grayscale first; the threshold cannot be applied to the original color image directly. The table maps each of the 256 gray levels to 0 or 1, and the point method applies that mapping to every pixel to build the new verification code image.
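To make the role of the table concrete without opening an image, the sketch below runs the same two steps on a handful of pixels in plain Python. The pixel values and helper names are invented; to_gray uses the luma weights given in Pillow's documentation for convert('L') (the library's exact rounding may differ), and apply_table is only a stand-in for what the point method does per pixel.

```python
def to_gray(rgb):
    """Approximate gray level of an (R, G, B) pixel using luma weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def build_table(threshold):
    """Map gray levels below the threshold to 0 and the rest to 1."""
    return [0 if level < threshold else 1 for level in range(256)]


def apply_table(gray_pixels, table):
    """Look every gray value up in the table, one pixel at a time."""
    return [table[value] for value in gray_pixels]


# Invented pixels: black, white, a dark red and a light gray
pixels = [(0, 0, 0), (255, 255, 255), (200, 30, 30), (180, 180, 180)]
gray = [to_gray(p) for p in pixels]
binary = apply_table(gray, build_table(170))
print(gray)
print(binary)
```

Raising or lowering the threshold moves pixels such as the light gray one from one side to the other, which is the knob being tuned in the code above.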
Now some text is recognized, although at first I did not know why the result came out as "Jane". After some analysis I found the cause: I had set the "language_type" parameter to "CHN_ENG", the mixed Chinese and English mode. So I changed it to "ENG", the English type, and found that the picture was then recognized as characters, but the recognition was still not correct. Having tried the other methods I know without success, I was left speechless and decided to keep trying other features of the PIL library.

    from PIL import ImageFilter

    # Continue from the binarized image above
    # Find the edges
    image = image.filter(ImageFilter.FIND_EDGES)
    # image.show()
    # Edge enhancement
    image = image.filter(ImageFilter.EDGE_ENHANCE)
    image.show()

Still can't recognize it correctly, so I decided to try another verification code.
I picked this one, which has a shadow:

    from PIL import Image

    filepath = r'D:\******\Verification code image\CrossShadow2 Captcha Image.jpg'
    image = Image.open(filepath)
    # Pass in 'L' to convert the picture to grayscale
    image = image.convert('L')
    # Binarize with a custom threshold; note that the table is inverted here
    count = 230
    table = []
    for i in range(256):
        if i < count:
            table.append(1)
        else:
            table.append(0)
    image = image.point(table, '1')
    image.show()
After simple processing, we get this picture:
The recognition result is:
The recognition succeeded, and I am moved to tears! It seems Baidu OCR can recognize verification codes after all, but the recognition rate is rather low, and some images need to be processed first to improve accuracy. On standardized text, however, Baidu OCR is still very accurate.
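The table used for the shadowed picture is the mirror image of the earlier one: gray levels below the threshold map to 1 and the rest to 0. A minimal sketch of the two variants side by side in plain Python; the helper name, the invert flag and the gray values are invented for illustration.

```python
def build_table(threshold, invert=False):
    """Return a 256-entry 0/1 table; `invert` swaps the two outputs."""
    low, high = (1, 0) if invert else (0, 1)
    return [low if level < threshold else high for level in range(256)]


normal = build_table(230)
inverted = build_table(230, invert=True)

gray_row = [250, 240, 120, 60, 235]  # invented gray levels
normal_row = [normal[v] for v in gray_row]
inverted_row = [inverted[v] for v in gray_row]
print(normal_row)
print(inverted_row)
```

With the inverted table and a high threshold such as 230, only the very brightest pixels map to 0 and everything darker maps to 1.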
So compared with other verification codes, what makes this verification code easier to read by OCR?
A verification code like this is relatively easy to recognize. Moreover, the black text on a white background in the processed picture is very standard, regular text, so the recognition accuracy is high. More complex graphic verification codes require deeper image processing or a specially trained OCR. If you only need to recognize a single verification code, it is easier to look at the picture and type it in by hand; if there are a few more, you can also hand them over to a code-solving platform.