When working with strings in Python, you may have encountered the error "UnicodeDecodeError: 'utf-8' codec can't decode byte X in position X: invalid continuation byte". It occurs when we specify an incorrect encoding while decoding bytes. To fix the issue, we have to specify the correct encoding.
UnicodeDecodeError: 'utf-8' codec can't decode byte X in position X: invalid continuation byte
Managing the encoding of character strings can sometimes cause problems in Python. For example, the error "UnicodeDecodeError: 'utf-8' codec can't decode byte xxx in position x: invalid continuation byte" occurs when a Python script tries to decode bytes as UTF-8 although they were not encoded that way. If you are handling a file and you don't know its encoding, there are ways to work around this error.
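To understand what the error means, we have to analyze the error message. A Python string holds Unicode characters and has no encoding of its own; an encoding only comes into play when the text is converted to or from bytes. When we encode a string in UTF-8, the resulting bytes follow the UTF-8 format, in which every non-ASCII character is written as a lead byte followed by one or more continuation bytes. If we instead encode the string with latin-1, for instance, and then try to decode those bytes as UTF-8, the decoder finds a lead byte that is not followed by the continuation byte it expects, and that causes the error "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte".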
Encoding is the process of turning a sequence of characters (letters, numbers, punctuation, and all other symbols) into bytes so that they can be transmitted and stored. Decoding is the opposite of encoding: it is the process of turning bytes back into a sequence of characters.
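As a small illustration of the two directions, the sketch below encodes the same character with two different encodings and decodes it again. The variable names are only for illustration.

```python
# The same character produces different bytes under different encodings.
text = 'é'

latin1_bytes = text.encode('latin-1')  # a single byte
utf8_bytes = text.encode('utf-8')      # a lead byte plus a continuation byte

print(latin1_bytes, utf8_bytes)

# Decoding with the matching encoding recovers the original character.
print(latin1_bytes.decode('latin-1'), utf8_bytes.decode('utf-8'))
```

The bytes differ, which is why bytes produced by one encoding cannot be safely decoded with the other.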
Here's a demonstration of the problem:
str_bytes = 'ééééééé'.encode('latin-1')
# ⛔️ UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
my_str = str_bytes.decode('utf-8')
The code runs into an error because we try to decode the bytes with an encoding that wasn't used to encode the string. To fix this, we have to decode the bytes using the same encoding we used to encode them. Here's the fix for the previous error:
str_bytes = 'ééééééé'.encode('latin-1')
my_str = str_bytes.decode('latin-1')
Now the program runs correctly and no errors occur.
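If you want to see exactly which byte triggered the failure, the exception object carries that information. The following is a minimal sketch that reuses the failing demonstration and reads the standard attributes of UnicodeDecodeError; the names bad_byte and details are only illustrative.

```python
str_bytes = 'ééééééé'.encode('latin-1')

details = None
try:
    str_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    # exc.object is the bytes being decoded; start/end delimit the bad span.
    bad_byte = exc.object[exc.start:exc.end]
    details = (exc.encoding, exc.reason, exc.start, bad_byte)

print(details)
```

This is handy in logs, because the position tells you where in the data to look.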
The Solution
In general, you ought to decode your strings with the same encoding that was used to encode them, but at times this issue is hard to avoid because it can be difficult to keep track of many encodings. That's why I present two solutions. The first one is to decode your strings with the correct encoding, and the second one covers reading from a file whose encoding you are not sure about.
Solution one
When decoding bytes with the decode() method, you have to make sure that the encoding is the one that was used to encode the string into bytes. In this example, I encode first_string as UTF-8, an encoding that can represent every Unicode character, and then decode it with UTF-8 again.
first_string = 'ééééééé'.encode('utf-8')
decoded_str = first_string.decode('utf-8')
print(decoded_str)
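When the original encoding is genuinely unknown, one common workaround is to try a short list of candidates in order. The helper below is a hypothetical sketch, not a standard library function, and both its name and its candidate list are assumptions. A successful decode does not prove the guess was right, and latin-1 accepts any byte sequence, so it only makes sense as the last resort.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings could decode the data')


text, used = decode_with_fallback('ééééééé'.encode('latin-1'))
print(used, text)
```

Putting UTF-8 first is a deliberate choice: it is strict, so bytes from other encodings that contain non-ASCII characters usually fail fast instead of decoding into the wrong characters.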
Solution two
File-handling functions have a binary mode that returns the file's contents as raw bytes. In this mode no decoding is performed, so the data is preserved exactly, whatever its encoding. To open a file for reading in binary mode, specify the "rb" mode.
# filePath is the path of the file to read
with open(filePath, 'rb') as file:
    content = file.read()  # bytes, not str
To write to a file in binary mode, you must use the "wb" or "ab" modes.
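To make the binary modes concrete, here is a minimal round-trip sketch. The temporary file and the name file_path are assumptions made only so the example can run on its own.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
file_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# 'wb' writes raw bytes; open() applies no encoding.
with open(file_path, 'wb') as file:
    file.write('ééééééé'.encode('latin-1'))

# 'rb' returns a bytes object, so no UnicodeDecodeError can occur here.
with open(file_path, 'rb') as file:
    content = file.read()

print(type(content), content)

# Decoding becomes a separate, explicit step once the encoding is known.
text = content.decode('latin-1')
print(text)
```

Reading in binary mode only postpones the decoding decision; you still need the right encoding to turn the bytes into text.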
If, despite the encoding problem, you want to open the file and read its content as UTF-8, you can pass an additional parameter to the "open()" function telling it to ignore errors. Bytes that cannot be decoded will be silently dropped from the result.
with open(filePath, encoding="utf8", errors='ignore') as file:
    content = file.read()  # undecodable bytes are skipped
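The same errors argument is accepted by bytes.decode(), which makes its effect easy to see on a small sample. The sketch below also shows the 'replace' handler, a standard alternative that is not covered above: it marks each undecodable byte instead of dropping it. The sample bytes are made up for illustration.

```python
raw = b'caf\xe9 au lait'  # Latin-1 bytes, not valid UTF-8

ignored = raw.decode('utf-8', errors='ignore')    # drops the bad byte
replaced = raw.decode('utf-8', errors='replace')  # inserts U+FFFD instead

print(ignored)
print(replaced)
```

With 'ignore' the data loss is invisible in the output, so 'replace' may be the safer choice when you need to notice that something went wrong.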
The "byte 0xff in position 0" error that appears when you try to decode a file as UTF-8 may simply indicate that the file is encoded in UTF-16, because UTF-16 files commonly begin with a byte order mark whose first byte is 0xFF. You can try opening the file with that encoding instead.
with open(filePath, encoding='utf-16') as file:
    content = file.read()
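Rather than guessing, you can look at the first bytes of the data for a byte order mark (BOM). The function below is a hypothetical helper, not part of the standard library; it only recognises the UTF-8 and UTF-16 marks defined in the codecs module and does not try to tell UTF-16 apart from UTF-32, whose little-endian mark starts with the same two bytes.

```python
import codecs


def sniff_bom(data):
    """Return an encoding name suggested by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    return None


sample = 'hello'.encode('utf-16')  # Python's utf-16 codec prepends a BOM
encoding = sniff_bom(sample)
print(encoding, sample.decode(encoding))
```

Files without a BOM return None here, so this check complements the other approaches rather than replacing them.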
This solution relies on the encoding parameter of the built-in "open()" function, which is only available in Python 3. If you are using Python 2, whose built-in "open()" does not accept an encoding, you will have to perform the conversion yourself after opening the file in binary mode.
with open(filePath, 'rb') as file:
    content = file.read()
# Decode first, then strip: stripping raw UTF-16 bytes could cut a character in half
content = content.decode("utf-16").rstrip(u"\n")
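If the same script has to run on both Python 2 and Python 3, another option is io.open(), which accepts an encoding argument in both versions (in Python 3 it is the same function as the built-in open()). The sketch below writes and reads a throwaway UTF-16 file; the temporary path and sample text are assumptions for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
file_path = os.path.join(tempfile.mkdtemp(), 'utf16_sample.txt')

with io.open(file_path, 'w', encoding='utf-16') as file:
    file.write(u'\xe9\xe9\xe9\n')  # three "é" characters and a newline

# io.open decodes while reading, so content is already text.
with io.open(file_path, encoding='utf-16') as file:
    content = file.read().rstrip(u'\n')

print(content)
```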
The Conclusion
Thank you for sticking with this tutorial all the way to the end. The error message "UnicodeDecodeError: 'utf-8' codec can't decode byte X in position X: invalid continuation byte" appears when you decode bytes with the decode() method, or read a file in text mode, and pass the wrong encoding. To fix this issue, find out which encoding was used to produce the bytes, for example the one passed to the encode() method, and pass the same encoding to decode(). If you are reading from a file rather than decoding a bytes object, you can either pass the correct encoding to the encoding parameter of open(), or open the file in binary mode and decode the bytes yourself.
If you found this article helpful, please comment and share. If you find any issues, let me know and I'll get back to you as soon as possible. Take care!