How to decode incorrect character encoding in e-mail using Python
Sometimes the character encoding declared in an e-mail is incorrect. The most common scenario is as follows.
- Actual encoding is TIS-620.
- Header indicates ISO-8859-1.
- The e-mail reader displays contents in UTF-8.
As a result, the text is displayed as garbage and looks very difficult to recover. However, it is possible to solve this problem using Python.
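To see how the garbling arises, the scenario above can be reproduced in a few lines. This is only a sketch in Python 3, assuming the mail reader decodes the bytes strictly as ISO-8859-1 and then stores the text as UTF-8; the variable names are illustrative.

```python
# Reproduce the mis-labelled e-mail scenario (Python 3 sketch).
# The Thai sample text is written with escapes so the file stays ASCII.
original = "\u0e2a\u0e38\u0e01\u0e23\u0e35"

raw = original.encode("tis-620")        # bytes actually sent by the sender
mislabelled = raw.decode("iso-8859-1")  # reader trusts the wrong header
garbled = mislabelled.encode("utf-8")   # reader stores/displays it as UTF-8

# ascii() keeps the output printable on any terminal.
print(ascii(mislabelled))
print(garbled)
```

Every TIS-620 byte becomes one Latin-1 character, so no information is lost, which is why the process can be reversed.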
There are 3 steps.
- Decode from UTF-8.
- Encode to ISO-8859-1.
- Decode from TIS-620.
The result is the original text as a Unicode string, which can then be displayed or saved as UTF-8.
>>> s = '''\xc3\x8a\xc3\x98\xc2\xa1\xc3\x83\xc3\x95'''
>>> print s.decode('utf-8').encode('iso-8859-1').decode('tis-620')
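The session above is written for Python 2, where str holds bytes. In Python 3 the same three steps might look like the sketch below. The helper name fix_mislabelled and its parameters are hypothetical, chosen only for illustration.

```python
def fix_mislabelled(garbled, actual="tis-620", declared="iso-8859-1"):
    """Recover text that was sent as `actual` but labelled as `declared`.

    `garbled` may be UTF-8 bytes or an already decoded str.
    """
    if isinstance(garbled, bytes):
        garbled = garbled.decode("utf-8")         # step 1: decode from UTF-8
    return garbled.encode(declared).decode(actual)  # steps 2 and 3

s = b"\xc3\x8a\xc3\x98\xc2\xa1\xc3\x83\xc3\x95"
fixed = fix_mislabelled(s)

# ascii() is used so the output is safe on terminals that cannot show Thai;
# use print(fixed) to see the recovered text itself.
print(ascii(fixed))
```

If the text contains characters that do not exist in the declared encoding, the encode step raises UnicodeEncodeError, which is a useful signal that the guess about the encodings was wrong.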
- sugree's blog