Understanding Unicode
Over the years I've studied up on Unicode and read the important pieces on the subject with enthusiasm as they came up. For example:
- Unicode pages at Wikipedia
- Tim Bray's: On the Goodness of Unicode and Characters vs. Bytes
- Joel's: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode ...
- Dive into Python's Unicode Chapter
Handling Unicode Errors in Python 2.x
Best to look at the language-specific docs in that case, right? As a budding Pythonista, I headed here:
So far so good, although a true understanding of what should be done was still muddy. There were plenty of nuts and bolts for handling issues as they arose, but I had no mental framework to latch onto: how it all works, why I was trapping these exceptions, or how to avoid them in the first place. Perhaps I'm a doofus (although surely not the only one), but it wasn't until a few years later that the answers started permeating my thick skull (through osmosis, I'd gather).
Coincidentally, about the time these ideas were solidifying through rote repetition, I read a fantastic yet short presentation and had the proverbial "aha" moment. The presentation below describes the problem and solution more elegantly than I could.
- Unicode In Python, Completely Demystified, courtesy Kumar McMillan
The Missing Link
- Decode early
- Unicode everywhere
- Encode late
To expand on this a bit, a Unicode object in Python is not the encoded text you find in a file, but rather a text-like object in memory. Codecs like "UTF-8" and "latin-1" allow a Unicode object to be encoded into bytes for transport, for example to disk files or across the network. To work with text correctly, we simply decode incoming bytes to Unicode on input, manipulate the Unicode objects (in their "natural" state), and encode them again on output! Simple to understand and code.
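To make the difference between the in-memory object and its encoded forms concrete, here is a small sketch. The sample word is my own choice for illustration; the literals are written so that the snippet should behave the same on Python 2.6+ and Python 3.3+.

```python
# One Unicode object in memory: four characters, the last one is e-acute.
text = u'caf\u00e9'

# The same text encoded for transport with two different codecs.
utf8_bytes = text.encode('utf-8')      # e-acute becomes two bytes
latin1_bytes = text.encode('latin-1')  # e-acute becomes one byte

# Decoding with the matching codec gets the original text back.
roundtrip_utf8 = utf8_bytes.decode('utf-8')
roundtrip_latin1 = latin1_bytes.decode('latin-1')

# Decoding with the wrong codec is where the familiar exceptions come from.
try:
    latin1_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

The point is that the byte strings differ while the Unicode object stays the same, which is why the codec has to be known at the boundary where bytes come in or go out.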
Below is a minimal example:
    # read in data from files, network, user, etc.
    infile = open('input.txt', 'r')
    data = infile.read()
    infile.close()

    # decode immediately to Unicode
    unicode_text = data.decode('utf-8')

    # manipulate in memory, add smiley
    unicode_text = unicode_text + u'Hello World! \u263B\n'

    # re-encode before transport or storage
    data = unicode_text.encode('utf-8')

    # save it
    outfile = open('output.txt', 'w')
    outfile.write(data)
    outfile.close()
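As a variation on the same idea, the decode and encode steps can be pushed to the file boundary with `io.open`, which is in the standard library from Python 2.6 onward and is the same function as the built-in `open` in Python 3. This is only a sketch: the helper name `append_greeting` and the temporary file names are made up for the illustration.

```python
import io
import os
import tempfile


def append_greeting(in_path, out_path, encoding='utf-8'):
    """Read text, add a greeting with a smiley, write it back out."""
    # Decode early: io.open hands back Unicode, not bytes.
    with io.open(in_path, 'r', encoding=encoding) as infile:
        unicode_text = infile.read()

    # Unicode everywhere: manipulate in memory.
    unicode_text = unicode_text + u'Hello World! \u263B\n'

    # Encode late: io.open encodes as the text is written.
    with io.open(out_path, 'w', encoding=encoding) as outfile:
        outfile.write(unicode_text)
    return unicode_text


# Small demonstration using throwaway files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'input.txt')
out_path = os.path.join(tmp_dir, 'output.txt')

with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write(u'na\u00efve\n')

result = append_greeting(in_path, out_path)

# Look at the raw bytes on disk to confirm the smiley was encoded as UTF-8.
with io.open(out_path, 'rb') as f:
    raw = f.read()
```

With this approach the rest of the program never touches encoded bytes, so there is no place for an implicit conversion to sneak in.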
With this newfound understanding, I found the very good "nuts and bolts" page below more helpful than it was before: