Monday, May 21, 2007

UTF-8 transcoder

I needed a quick program to convert from one character set to another, specifically from Shift-JIS (a Japanese encoding) to UTF-8. I didn't really need the result, but I did want to know whether the file could be *legally* converted, that is, whether it would decode without errors. So here is what I came up with:

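#!/usr/bin/python

import codecs
import sys

print("opening file %s" % sys.argv[1])

# Decode the input as Shift-JIS; bytes that are invalid in that
# encoding raise UnicodeDecodeError during read().
fi = codecs.open(sys.argv[1], 'r', 'shift-jis')
data = fi.read()
fi.close()

# Re-encode the decoded text as UTF-8.
print("Writing result.txt")
fo = codecs.open('result.txt', 'w', 'utf-8')
fo.write(data)
fo.close()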


I later added an option to specify the source codec. Then, after trying several Japanese codecs one after another, I came up with a bulk transcoder that cycles through all the Japanese codecs until it finds a suitable one (i.e. one that doesn't throw an exception).
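
A bulk transcoder along those lines could be sketched roughly as follows. This is not the original script, only a minimal illustration: the names JAPANESE_CODECS, find_codec and transcode are hypothetical, and the order in which the codecs are tried is an assumption. The codec names themselves are the Japanese codecs that ship with Python's standard library.

```python
import sys

# Japanese codecs from the Python standard library.
# The ordering here is an assumption, not taken from the original script.
JAPANESE_CODECS = [
    "shift_jis", "cp932", "euc_jp", "iso2022_jp",
    "shift_jis_2004", "shift_jisx0213",
    "euc_jis_2004", "euc_jisx0213",
    "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004",
    "iso2022_jp_3", "iso2022_jp_ext",
]


def find_codec(raw, candidates=JAPANESE_CODECS):
    """Return (codec_name, text) for the first codec that decodes
    raw without raising, or (None, None) if every codec fails."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


def transcode(src_path, dst_path):
    """Read src_path as bytes, decode with the first codec that works,
    and write the text to dst_path as UTF-8. Returns the codec name
    that was used, or None if nothing decoded cleanly."""
    with open(src_path, "rb") as fi:
        raw = fi.read()
    name, text = find_codec(raw)
    if name is None:
        return None
    with open(dst_path, "wb") as fo:
        fo.write(text.encode("utf-8"))
    return name


if __name__ == "__main__" and len(sys.argv) == 3:
    used = transcode(sys.argv[1], sys.argv[2])
    print("decoded with: %s" % used)
```

One caveat with this approach: a codec that doesn't throw an exception is not necessarily the right one. For example, ISO-2022-JP data consists only of 7-bit bytes, so a Shift-JIS decoder may well accept it without complaint and produce garbage. The result still deserves a quick look by eye.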
