How to read non-English characters in Mailman archives using Python

Posted March 14th, 2006 by sugree

I was asked to help read e-mail in TIS-620 (Thai) encoding that Mailman had archived in HTML format. Since Thai characters lie above the ASCII range, Mailman converted every Thai character to ASCII by running an entity-escaping function on it. For example, A could be encoded as &#65;. That means all Thai characters were converted to entities in the range &#128; to &#255;, roughly. Both IE and Firefox display these entities as Latin-1 characters no matter which encoding I force, e.g., TIS-620.

Fortunately, Python solves this in just a few lines of code. See below.

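import re
# Each TIS-620 byte is stored in the archive as a numeric entity.
s = "&#202;&#216;&#161;&#195;&#213;"
# Replace every &#NNN; with the character whose code is NNN.
print(re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), s))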
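The snippet above dates from Python 2, where chr() returns a raw byte that a TIS-620 terminal renders as Thai. Under Python 3, chr() returns a Unicode character instead, so one extra step is needed. Here is a minimal sketch of that idea; the function name decode_mailman_entities and its encoding parameter are my own for illustration, and it assumes every entity value fits in one byte (0 to 255) and that the rest of the text is plain ASCII or Latin-1.

```python
import re

def decode_mailman_entities(text, encoding="tis-620"):
    """Turn &#NNN; entities (one per original byte) back into readable text.

    Hypothetical helper: assumes every entity value is in the range 0-255.
    """
    def to_latin1_char(match):
        # chr() maps the byte value to the Latin-1 character with the same code.
        return chr(int(match.group(1)))

    latin1_text = re.sub(r"&#(\d+);", to_latin1_char, text)
    # Recover the raw bytes, then interpret them in the real encoding.
    return latin1_text.encode("latin-1").decode(encoding)

name = decode_mailman_entities("&#202;&#216;&#161;&#195;&#213;")
```

With the sample string, name should hold the five Thai characters U+0E2A, U+0E38, U+0E01, U+0E23 and U+0E35. An entity value above 255 would make the encode("latin-1") step raise UnicodeEncodeError, so real archives may need extra handling.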

The code above uses the re module and a lambda function. You might also be interested in how to reverse the operation. Below is a short example.

# Replace every character with its numeric entity, e.g. "a" becomes "&#97;".
print(re.sub(r'(.)', lambda m: '&#%d;' % ord(m.group(1)), "abc"))
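For Thai text under Python 3, the reverse direction can go through the TIS-620 bytes first. This is only a sketch of the idea and not Mailman's actual escaping routine; the name encode_as_entities is made up for illustration. Unlike the one-liner above, it leaves ASCII characters untouched and escapes only the bytes above 127.

```python
def encode_as_entities(text, encoding="tis-620"):
    """Escape every non-ASCII byte of the encoded text as &#NNN;.

    Hypothetical helper, not taken from Mailman.
    """
    raw = text.encode(encoding)
    # Iterating over a bytes object yields integers in Python 3.
    return "".join(chr(b) if b < 128 else "&#%d;" % b for b in raw)

escaped = encode_as_entities("\u0e2a\u0e38\u0e01\u0e23\u0e35")
print(escaped)
```

Feeding the result back through the decoding step should give the original Thai string again.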

Thanks Python! Note that "ÊØ¡ÃÕ" is my name in Thai ("สุกรี" once the bytes are read as TIS-620). :-)

Technorati Tags: English, Programming, Python, Tips and Tricks, Character Encoding

howforge.com
