Fix Incorrect Filename Encoding

A friend asked me to help fix filenames that were stored in the wrong encoding. By default, most operating systems create filenames in a Unicode encoding such as UTF-8, which is what you want. However, in some cases an application creates filenames in another encoding, e.g., TIS-620 for Thai, and the result is unreadable filenames. This often happens when you download files and the application saves them under the same name on the local computer; the encoding may also be ISO-8859-1. You can prevent the problem by adjusting the configuration of those applications, but once it has already happened, the filenames themselves have to be fixed.
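
To make the problem concrete, here is a minimal Python 3 sketch (not part of the script in this post) of how one name turns into different bytes under UTF-8 and TIS-620, and how reading those bytes with the wrong encoding garbles it. The sample word and the variable names are only illustrative.

```python
# Python 3 sketch: one name, two byte encodings, and a wrong guess.
name = "\u0e44\u0e17\u0e22"  # the Thai word for "Thai", written with escapes

utf8_bytes = name.encode("utf-8")    # three bytes per Thai character
tis_bytes = name.encode("tis-620")   # one byte per Thai character

# Reading TIS-620 bytes as ISO-8859-1 never raises an error;
# it silently produces the wrong characters.
garbled = tis_bytes.decode("iso-8859-1")

# As long as the bytes are intact, the damage can be undone.
recovered = garbled.encode("iso-8859-1").decode("tis-620")

print(utf8_bytes)
print(tis_bytes)          # b'\xe4\xb7\xc2'
print(recovered == name)  # True
```

The point is that the bytes on disk are not damaged; they are only being interpreted with the wrong table, which is why a rename can repair them.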

Fortunately, Python has all the functions needed to fix this easily. I wrote a short, simple script (for Python 2) as follows.

#!/usr/bin/env python
 
import sys
import os
from optparse import OptionParser
 
class Application:
    def __init__(self):
        self.parse_args()
 
    def parse_args(self):
        parser = OptionParser(usage='usage: %prog [options] path ...')
        parser.add_option('-v','--verbose',dest='verbose',default=False,
                          action="store_true",help='verbose')
        parser.add_option('-i','--in',dest='in_encoding',default='utf-8',
                          help='input encoding (default=utf-8)')
        parser.add_option('-o','--out',dest='out_encoding',default='tis-620',
                          help='output encoding (default=tis-620)')
        parser.add_option('-r','--recursive',dest='recursive',default=False,
                          action="store_true",help='recursive')
        parser.add_option('--dryrun',dest='dryrun',default=False,
                          action="store_true",help='dry run')
        self.options,self.args = parser.parse_args()
 
    def verbose(self,message):
        if self.options.verbose:
            print message
 
    def error(self,message):
        print >> sys.stderr,'ERROR: %s' % message
 
    def run(self):
        for path in self.args:
            self.process(path)
 
    def process(self,path):
        if self.options.recursive:
            for root,dirs,files in os.walk(path,topdown=False):
                for name in files:
                    self.rename(os.path.join(root,name))
                for name in dirs:
                    self.rename(os.path.join(root,name))
        self.rename(path)
 
    def rename(self,path):
        path = os.path.realpath(path)
        dir = os.path.dirname(path)
        src = os.path.basename(path)
        dest = src.decode(self.options.in_encoding,'replace').encode(self.options.out_encoding,'replace')
        src = os.path.join(dir,src)
        dest = os.path.join(dir,dest)
        self.verbose("rename '%s' to '%s'" % (src,dest))
        try:
            if not self.options.dryrun:
                os.rename(src,dest)
        except OSError,why:
            self.error(str(why))
 
if __name__ == '__main__':
    app = Application()
    app.run()
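
The script above relies on Python 2, where a filename is a byte string with a decode() method. For readers on Python 3, the core conversion step could be sketched roughly as below, working on bytes paths. The helper names transcode_name and plan_rename are hypothetical, and the sketch only computes the new name; it does not touch the disk.

```python
import os


def transcode_name(raw_name, in_encoding="utf-8", out_encoding="tis-620"):
    """Decode the raw filename bytes, then encode them in the target encoding.

    Hypothetical helper. It mirrors the 'replace' error handling of the
    script above, so characters that cannot be converted are substituted
    rather than raising an exception.
    """
    text = raw_name.decode(in_encoding, "replace")
    return text.encode(out_encoding, "replace")


def plan_rename(raw_path, in_encoding="utf-8", out_encoding="tis-620"):
    """Return (source, destination) as bytes paths without renaming anything."""
    directory, base = os.path.split(raw_path)
    new_base = transcode_name(base, in_encoding, out_encoding)
    return raw_path, os.path.join(directory, new_base)


# Example: a UTF-8 encoded Thai name inside a directory called "docs".
source = os.path.join(b"docs", "\u0e44\u0e17\u0e22.txt".encode("utf-8"))
print(plan_rename(source))
```

Because 'replace' substitutes a question mark for anything the output encoding cannot represent, two different names can end up identical, so it is worth checking the planned destinations before calling os.rename on them.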

To make it as useful as possible, I gave it a few options.

usage: fixencoding.py [options] path ...

options:
  -h, --help            show this help message and exit
  -v, --verbose         verbose
  -i IN_ENCODING, --in=IN_ENCODING
                        input encoding (default=utf-8)
  -o OUT_ENCODING, --out=OUT_ENCODING
                        output encoding (default=tis-620)
  -r, --recursive       recursive
  --dryrun              dry run

For example, you can recursively convert filenames from utf-8 to tis-620 under C:\Downloads as follows.

python fixencoding.py -i utf-8 -o tis-620 -r C:\Downloads

If you are not sure about the input or output encoding, you can try it first with --dryrun and --verbose, which prints the planned renames without changing anything.

python fixencoding.py -i utf-8 -o tis-620 -r -v --dryrun C:\Downloads
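
Another way to narrow down an unknown input encoding is to look at how the raw bytes of one problem filename decode under a few candidates. The following is a small Python 3 sketch of that idea; preview_decodings and the candidate list are assumptions for illustration, not part of fixencoding.py.

```python
def preview_decodings(raw_name, candidates=("utf-8", "tis-620", "iso-8859-1")):
    """Try each candidate encoding on the raw filename bytes.

    Returns a list of (encoding, text) pairs; text is None when the bytes
    are not valid in that encoding.
    """
    results = []
    for encoding in candidates:
        try:
            results.append((encoding, raw_name.decode(encoding)))
        except UnicodeDecodeError:
            results.append((encoding, None))
    return results


# Raw bytes of a filename as they might appear on disk (sample value).
for encoding, text in preview_decodings(b"\xe4\xb7\xc2.txt"):
    # ascii() keeps the output printable on any terminal.
    print(encoding, ascii(text))
```

An encoding that fails to decode can be ruled out. Among those that succeed, you still have to judge by eye which result is the readable name, because ISO-8859-1 accepts any byte sequence.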


Thanks!

This is a big help to me. Some comments: the double underscores ('__') have disappeared from the code, and I pop() the first argument (the script name). Anyway, I just wanted to thank you for sharing! Regards

changed geshifilter

Thanks for the comment. I switched geshifilter to a newer version a long time ago, but this entry was written in the old format. That's why the code was not formatted correctly. Anyway, I have fixed it.
