How to read non-English characters in Mailman archives using Python
I was asked to help read e-mail in Thai, encoded as TIS-620 and archived by Mailman in HTML format. Since Thai characters are above the ASCII range, Mailman converted every Thai character to ASCII by running an entity-escaping function. For example, A would be encoded as &#65;. That means all Thai characters were converted to numeric entities from roughly &#128; to &#255;. Both IE and Firefox display these entities as Latin-1 characters no matter which encoding I force, e.g., TIS-620, because a numeric entity always refers to a Unicode code point rather than a byte in the page encoding.
Fortunately, Python helps me in just a few lines of code. See below.
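A small sketch of the browser side, assuming Python 3 and its standard html module (neither is used in the original post). It resolves five numeric entities the way a browser would, as Unicode code points, which is why Latin-1 letters appear instead of Thai. The entity string is the one that corresponds to the characters ÊØ¡ÃÕ used later in this post.

```python
import html

# Five numeric entities as they would appear in the archived HTML.
entities = "&#202;&#216;&#161;&#195;&#213;"

# html.unescape treats each number as a Unicode code point,
# which is also how browsers interpret numeric entities.
as_browser = html.unescape(entities)
print(as_browser)  # ÊØ¡ÃÕ

# The code points are the original TIS-620 byte values, not Thai code points.
print([ord(c) for c in as_browser])  # [202, 216, 161, 195, 213]
```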
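    import re
    # Python 2: chr() returns a one-byte string, so the output is raw TIS-620 bytes
    s = "&#202;&#216;&#161;&#195;&#213;"
    print re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), s)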
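The one-liner above targets Python 2, where chr() yields a byte. In Python 3, chr() yields a Unicode character, so the same idea needs an explicit decoding step. Here is a minimal sketch under that assumption; the function name decode_entities and its encoding parameter are hypothetical, chosen only for illustration.

```python
import re

def decode_entities(text, encoding="tis-620"):
    """Turn &#NNN; entities back into bytes, then decode the bytes."""
    def to_byte(match):
        code = int(match.group(1))
        # Values above 255 cannot be a single byte, so leave them untouched.
        return bytes([code]) if code < 256 else match.group(0)

    # Assumes the archive text is pure ASCII, as described above.
    raw = re.sub(rb"&#(\d+);", to_byte, text.encode("ascii"))
    return raw.decode(encoding)

thai = decode_entities("&#202;&#216;&#161;&#195;&#213;")
print(thai)  # five Thai characters: U+0E2A U+0E38 U+0E01 U+0E23 U+0E35
```

Working on bytes keeps the byte values and the final decoding separate, so another single-byte encoding can be tried by changing only the encoding argument.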
The code above uses the re module and a lambda function. You might also be interested in how to reverse the conversion. Below is a short example.
    # Python 2: replace every character with its numeric entity
    print re.sub(r'(.)', lambda m: '&#%d;' % ord(m.group(1)), "abc")
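For Python 3, the reverse direction could be sketched as follows. This is a variation rather than a direct port: by default it escapes only bytes above the ASCII range, and the name encode_entities and the escape_all flag are hypothetical. Passing escape_all=True gives the same result as the snippet above.

```python
def encode_entities(text, encoding="tis-620", escape_all=False):
    """Encode text to bytes and write the chosen bytes as &#NNN; entities."""
    raw = text.encode(encoding)
    return "".join(
        "&#%d;" % b if (escape_all or b >= 128) else chr(b)
        for b in raw
    )

# Thai sample text written with escapes: U+0E2A U+0E38 U+0E01 U+0E23 U+0E35
name = "\u0e2a\u0e38\u0e01\u0e23\u0e35"
print(encode_entities(name))                    # &#202;&#216;&#161;&#195;&#213;
print(encode_entities("abc", escape_all=True))  # &#97;&#98;&#99;
```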
Thanks Python! Note that "ÊØ¡ÃÕ" is my name in Thai. :-)
- sugree's blog