How to decode incorrect character encoding in e-mail using Python

Sometimes the character encoding declared in an e-mail is incorrect. The most common scenario is as follows.

  • Actual encoding is TIS-620.
  • Header indicates ISO-8859-1.
  • The e-mail reader displays contents in UTF-8.
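The scenario above can be reproduced in a few lines. This is only a sketch written for Python 3 (an assumption; the session later in this post uses Python 2 syntax), and the variable names are made up for illustration. The sample word is the Thai text that the bytes used later in this post decode to.

```python
# Thai sample word, written with escapes so the source stays ASCII.
original = "\u0e2a\u0e38\u0e01\u0e23\u0e35"

# The sender's bytes are really TIS-620.
raw = original.encode("tis-620")

# The header claims ISO-8859-1, so the bytes are decoded with the wrong codec.
misread = raw.decode("iso-8859-1")

# The wrongly decoded text is then passed along as UTF-8.
garbled = misread.encode("utf-8")

print(garbled)  # b'\xc3\x8a\xc3\x98\xc2\xa1\xc3\x83\xc3\x95'
```

Each Thai character becomes an accented Latin character, which is why the garbled text looks unrelated to the original.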

As a result, the text is displayed as garbled characters and looks very difficult to recover. However, it is possible to solve this problem using Python.

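There are 3 steps, which undo the wrong conversions in reverse order.

  1. Decode from UTF-8 to get back the characters the reader displayed.
  2. Encode to ISO-8859-1 to recover the original bytes.
  3. Decode from TIS-620, the actual encoding.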

The result is the original text as a Unicode string, which can then be displayed or saved as UTF-8. For example, in a Python 2 session:

>>> s = '''\xc3\x8a\xc3\x98\xc2\xa1\xc3\x83\xc3\x95'''
>>> print s.decode('utf-8').encode('iso-8859-1').decode('tis-620')
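The same three steps can be sketched for Python 3, where byte strings and text are separate types. The function name `repair_mojibake` and its parameters are hypothetical, chosen only for this example; the default codec names follow the scenario described in this post.

```python
def repair_mojibake(garbled, declared="iso-8859-1", actual="tis-620"):
    """Recover text whose bytes were decoded with the wrong charset.

    garbled  -- UTF-8 bytes of the wrongly decoded text
    declared -- charset the header claimed (assumed ISO-8859-1 here)
    actual   -- charset the sender really used (assumed TIS-620 here)
    """
    shown = garbled.decode("utf-8")   # step 1: characters as displayed
    raw = shown.encode(declared)      # step 2: back to the original bytes
    return raw.decode(actual)         # step 3: decode with the real charset


s = b"\xc3\x8a\xc3\x98\xc2\xa1\xc3\x83\xc3\x95"
fixed = repair_mojibake(s)
print(ascii(fixed))  # '\u0e2a\u0e38\u0e01\u0e23\u0e35'
```

If step 2 raises UnicodeEncodeError, the garbled text contains characters that do not exist in ISO-8859-1, which suggests the guess about the declared charset is wrong.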
