
Unicodedecodeerror utf8 Codec Cant Decode Byte 0xed in Position 48 Invalid Continuation Byte

If you are running into the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte", this article explains why it happens and how to work around it.

Reason for the "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" error

This problem is common when reading a CSV file with pandas. By default, the read_csv() function decodes the file as UTF-8, so it fails when the file contains bytes that are not valid UTF-8. This usually means the file was saved in a different encoding and contains special characters.
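To see the cause in isolation, here is a minimal sketch that does not need pandas. The bytes are an illustrative example chosen for this sketch, not data from the article's CSV file: the word "café" is encoded as Latin-1 and then decoded as UTF-8.

```python
# Illustrative bytes: "café" encoded as Latin-1, so "é" is the single byte 0xE9.
raw = "café".encode("latin1")

try:
    raw.decode("utf-8")      # 0xE9 is not followed by valid UTF-8 continuation bytes
    failed = False
except UnicodeDecodeError:
    failed = True

# The same bytes decode cleanly once the matching encoding is named.
text = raw.decode("latin1")
print(failed, text)
```

The bytes themselves are fine; the error only says that they do not form valid UTF-8.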

Now we will read a CSV file from the biomedical domain with pandas and see how the error happens.

You can download the CSV file here.

Code:

            import pandas as pd
            data = pd.read_csv("alldata_1_for_kaggle.csv")
            data.head()

Result:

          UnicodeDecodeError                        Traceback (most recent call last)
          <ipython-input-76-0c9089169b2f> in <module>
                1 import pandas as pd
          ----> 2 a = pd.read_csv('/content/drive/MyDrive/LearnShareIT/alldata_1_for_kaggle.csv')

          /usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

          UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 3: invalid start byte

Note: The byte value and the position depend on the file, so you may see the same error in this form: UnicodeDecodeError: 'utf-8' codec can't decode byte <<byte value>> in position <<position>>: invalid start byte.
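The fields in that message can also be read from the exception object. The sketch below assumes a made-up byte string that starts with 0xa5 (it is not the article's file) and uses the standard attributes of UnicodeDecodeError: object, start and reason.

```python
# Hypothetical file contents starting with the byte 0xa5.
raw = b"\xa5 price list"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]   # integer value of the offending byte
    position = exc.start               # index of that byte in the input
    reason = exc.reason                # short explanation from the codec
    print(f"byte {bad_byte:#x} at position {position}: {reason}")
```

Knowing the position makes it easier to open the file in a hex viewer and see which character caused the failure.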

Solutions to solve this problem

Solution for reading csv file:

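Python supports many standard encodings besides UTF-8, such as latin1 (also known as iso-8859-1) and ascii (also known as us-ascii). Because latin1 assigns a character to every possible byte value, decoding with it never raises this error, although some characters may come out wrong if the file was not actually written in latin1.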

You can pass the "encoding" parameter, a string that names the encoding read_csv() should use to decode the file.

In our example, we use "latin1" to decode the data.

Code:

            import pandas as pd
            data = pd.read_csv("alldata_1_for_kaggle.csv", encoding = 'latin1') # pass encoding parameter
            data.head()

Result:

             Unnamed: 0               0                                                  a
          0           0  Thyroid_Cancer  Thyroid surgery in  children in a single insti...
          1           1  Thyroid_Cancer  " The adopted strategy was the same as that us...
          2           2  Thyroid_Cancer  coronary arterybypass grafting thrombosis ï¬b...
          3           3  Thyroid_Cancer   Solitary plasmacytoma SP of the skull is an u...
          4           4  Thyroid_Cancer   This study aimed to investigate serum matrix ...
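When the real encoding of a file is unknown, one possible approach is to try a few candidates in order. The sketch below is an assumption-laden illustration, not part of pandas: decode_with_fallback is a hypothetical helper, the candidate list is a guess that suits many Western European files, and the sample bytes are made up.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Unreachable with the default list, because latin1 accepts every byte.
    raise ValueError("none of the candidate encodings could decode the data")

# Made-up sample: 0x99 is invalid as UTF-8 but is the trademark sign in cp1252.
text, used = decode_with_fallback(b"Trademark\x99")
print(used, text)
```

The encoding name found this way could then be passed to read_csv() through its encoding parameter. A successful decode only means the bytes were accepted, so it is still worth checking the text by eye.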

Solution for reading text and JSON files:

The initial content of the JSON file (a.json) and the text file (a.txt):

            {"student":[     { "firstName":"™œœ''™™œ""××""™"ˆ'γ°°'ˆ'"œ™"ε""íö", "lastName":"Doe" },     { "firstName":"Anna", "lastName":"Smith" },     { "firstName":"Peter", "lastName":"Jones" }   ] }          
            œMedical Informatics and œHealth Care Sciences          

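Open the file and read it in binary mode

Syntax: file_reader = open("path/to/file", "rb"), where "rb" is binary read mode. In binary mode, Python returns the raw bytes without decoding them, so reading the file cannot raise a UnicodeDecodeError.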

Read json file:

            import json

            file = open('a.json', 'rb')
            content = json.load(file)

            print(content)

Result:

          {'student': [{'firstName': "™œ\x9dœ\x9d''™™œ\x9d""××""™"ˆ'γ°°'ˆ'"œ\x9d™"ε""Ã\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}        

Read text file:

            file = open('a.txt', 'rb')

            print(file.read())

Result:

          b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'        
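Reading in binary mode only postpones the decoding step. As a rough sketch, the bytes shown in the result above can be written to a temporary file (a stand-in for a.txt, created here only so the example is self-contained), read back in binary mode, and then decoded explicitly.

```python
import os
import tempfile

# The bytes shown in the result above, written to a temporary stand-in file.
sample = b"\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

# Binary mode returns bytes, so no decoding (and no decoding error) happens here.
with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

# Decoding is now an explicit step with an encoding of your choice.
text = raw.decode("utf-8")
print(repr(text))
```

Keeping the two steps separate lets you inspect the raw bytes first and pick the encoding afterwards.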

Ignoring errors when reading the file

Syntax: file = open("path/to/file", "r", errors="ignore"). Passing errors="ignore" skips any bytes that cannot be decoded, which can lead to data loss.

Read json file:

            import json

            file = open('a.json', 'r', errors = 'ignore')
            content = json.load(file)
            print(content)

Result:

          {'student': [{'firstName': "â„¢Å"ÂÅ"Â''™™Å"Ââ€â€œÃƒâ€"Ãâ€"â€â€â„¢â€œË†â€™ÃŽÂ³Â°Â°â€™Ë†â€™â€œÅ"™“ε““ÃÂ\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}        

Read txt file:

            file = open('a.txt', 'r', errors='ignore')
            print(file.read())

Result:

          Å"Medical Informatics and Å"Health Care Sciences        
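To make the data loss visible, the sketch below compares errors="ignore" with two other standard error handlers, "replace" and "backslashreplace". The input bytes are a made-up example, not the contents of a.txt.

```python
# Made-up input: "café menu" saved as Latin-1, so "é" is the lone byte 0xE9.
raw = b"caf\xe9 menu"

ignored = raw.decode("utf-8", errors="ignore")             # drops the bad byte
replaced = raw.decode("utf-8", errors="replace")           # inserts U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")   # keeps it as \xe9

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

With "ignore" the character disappears silently, while the other two handlers at least leave a marker showing where something went wrong. The same errors values can be passed to open().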

Summary

"UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" is a common error when reading files. We hope this article has helped you understand the root of the problem and how to solve it.


Source: https://learnshareit.com/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-start-byte/
