Do you keep running into UnicodeDecodeError or UnicodeEncodeError when writing Python code? Or do you print a string and get a screen full of garbled characters, which is enough to drive anyone up the wall?
Don't panic. This article walks through the history and the present state of character encoding, so that you understand it well enough to analyze and solve Python encoding problems on your own.
As everyone knows, a computer does not recognize characters, only 0s and 1s. Characters therefore have to be converted to 0s and 1s before the computer can work with them. This process of converting characters into binary numbers is character encoding.
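As a small illustration of that conversion, here is a minimal sketch that prints the bits behind a character. The helper name to_bits is hypothetical and not part of Python; it only wraps the built-in str.encode and format.

```python
def to_bits(ch: str, encoding: str = "utf-8") -> str:
    """Return the bits of a character's encoded bytes, one group per byte."""
    return " ".join(format(byte, "08b") for byte in ch.encode(encoding))


print(ord("A"))       # the number assigned to 'A': 65
print(to_bits("A"))   # 01000001
print(to_bits("罗"))  # 11100111 10111101 10010111
```

The same character can produce different bits under a different encoding, which is why both sides have to agree on the standard.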
Of course, character encoding needs an agreed standard, otherwise computers cannot interpret characters reliably. And as everyone knows about standards, they tend to be set abroad first and then again at home, so uniformity is hard to achieve.
It started with ASCII, which was designed for English. Each character is stored in one byte, and one byte is 8 bits, so a byte can take 256 values (2 to the 8th power); ASCII itself uses only 7 of those bits and defines 128 characters. That is more than enough for English, but with so many Chinese characters, ASCII is obviously not applicable. China defined its own encoding, GBK, and other countries defined theirs, so unification was difficult.
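The limits of ASCII are easy to see from Python itself. The following is a minimal sketch using only the built-in codecs; the variable names are just for illustration.

```python
letter = "A"
hanzi = "罗"

ascii_bytes = letter.encode("ascii")  # an English letter fits in one byte

try:
    hanzi.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                  # ASCII has no code for this character

gbk_bytes = hanzi.encode("gbk")       # GBK stores it in two bytes
print(ascii_bytes, ascii_ok, len(gbk_bytes))
```

This is also the typical way a UnicodeEncodeError shows up: the target encoding simply has no code for the character.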
This is where Unicode, the universal character set, came in. Its early encodings used two bytes for every character. That raises a new problem: English letters fit in one byte, but now they take two. Isn't that a waste of space? The variable-length encoding UTF-8 solves this problem: it uses one byte for ASCII letters and two to four bytes for other characters (three for most Chinese characters).
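The byte counts can be checked directly. This sketch compares UTF-8 with UTF-16 (little-endian, without a byte order mark) for a few sample characters; the samples are arbitrary choices for illustration.

```python
# "\u00e9" is é and "\U0001F600" is an emoji, written as escapes for portability.
samples = ["a", "\u00e9", "罗", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

print(utf8_lengths)   # [1, 2, 3, 4]
print(utf16_lengths)  # [2, 2, 2, 4]
```

UTF-8 wins clearly for English text, while for Chinese text a two-byte encoding is actually the more compact of the two.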
In summary, a fixed-width Unicode encoding takes up more space but is fast to process, and UTF-8 is the opposite. Therefore, text is generally handled as Unicode in memory and encoded as UTF-8 for storage and transmission. Please remember this carefully.
First of all, the default encoding in Python 3 is UTF-8.
import sys
print(sys.getdefaultencoding())  # utf-8
Next, Python 3 has two separate data types, str and bytes. Text is str, which can represent every character in the Unicode character set, while bytes represents binary data.
a = 'a'
b = '罗攀'
print(type(a), type(b))
c = b'\xe7\xbd\x97\xe6\x94\x80'
print(c, type(c))
# <class 'str'> <class 'str'>
# b'\xe7\xbd\x97\xe6\x94\x80' <class 'bytes'>
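The difference between the two types also shows up in length and indexing. A minimal sketch, assuming the same two-character name as the sample text:

```python
text = "罗攀"                 # str: a sequence of Unicode characters
data = text.encode("utf-8")   # bytes: a sequence of integers from 0 to 255

print(len(text))  # 2 (characters)
print(len(data))  # 6 (bytes)
print(text[0])    # 罗, indexing a str gives a one-character str
print(data[0])    # 231, indexing bytes gives an int (0xe7)
```

So len() on a str counts characters, while len() on bytes counts bytes, and the two only agree for pure ASCII text.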
The errors mentioned at the start, UnicodeDecodeError and UnicodeEncodeError, are simply decoding and encoding errors.
Simply put, converting characters into the 0s and 1s the computer recognizes is encoding, and converting 0s and 1s back into characters is decoding. Both directions must use the same encoding, otherwise you get an error or garbled text.
To convert between str and bytes, use the encode and decode methods.
a = '罗攀'
print(a.encode('utf-8'))
print(a.encode('utf-8').decode('utf-8'))
print(a.encode('gb2312').decode('utf-8'))
# b'\xe7\xbd\x97\xe6\x94\x80'
# 罗攀
# Traceback (most recent call last):
#   File "/Users/luopan/Python practice/coding problem.py", line 25, in <module>
#     print(a.encode('gb2312').decode('utf-8'))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
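A mismatch does not always end in a traceback. The sketch below shows three possible outcomes of decoding with the wrong codec, using the errors parameter of bytes.decode; the choice of latin-1 as the wrong codec is only an illustration.

```python
text = "罗攀"
gb_bytes = text.encode("gb2312")

# 1. The wrong codec raises UnicodeDecodeError.
try:
    gb_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# 2. With errors="replace", undecodable bytes become U+FFFD instead of raising.
lossy = gb_bytes.decode("utf-8", errors="replace")

# 3. A wrong codec can also "succeed" and silently return garbled text.
garbled = text.encode("utf-8").decode("latin-1")

# Only the codec that was used for encoding restores the original text.
restored = gb_bytes.decode("gb2312")
print(decode_failed, restored == text, garbled == text)
```

The third case is where garbled output usually comes from: every byte is valid in the wrong codec, so no error is raised.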
For example, I create a new txt file locally and save it encoded as UTF-16.
If we read the file directly, an error is reported, because open() falls back to the platform's default encoding when none is given, and that is UTF-8 here.
So we need to specify the encoding when opening the file.
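The file example can be reproduced roughly as follows. This is a sketch under assumptions: the file name demo.txt and the temporary directory are made up for illustration and stand in for the author's local file.

```python
import os
import tempfile

text = "罗攀"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Save the text as UTF-16.
    with open(path, "w", encoding="utf-16") as f:
        f.write(text)

    # Reading it back as UTF-8 fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True

    # Naming the encoding the file was written with recovers the text.
    with open(path, encoding="utf-16") as f:
        content = f.read()

print(wrong_codec_failed, content == text)
```

Passing encoding explicitly on every open() call also keeps the script from depending on whatever default the platform happens to use.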
See you next time~