Handling Unicode in a Nutshell

As a "global citizen in a worldwide economy," (groan) it can only make sense to properly handle multiple languages in the software you develop.  That is, unless you'd prefer to ignore what might be the majority of potential users and/or profits.  Accordingly, for many years I've done my best to accommodate that goal in the programs I write or contribute to.

Understanding Unicode

Over the years I've studied up on Unicode and read the important pieces on the subject with enthusiasm as they came up.  For example:
Ok great, sold.  So now we know a lot about Unicode.  All good reading of course, but what now?  How do I correctly write a Unicode-aware program?  No clue ... as usual references bury you in mind-numbing detail, and articles and essays never seem to get that far, assuming you'd continue on in the language of your choice.  Metaphorically, we have been provided steps one through ten, but seven through nine have been (in)conveniently left out and left to the reader to figure out.

Handling Unicode Errors in Python 2.x

Best to look at the language specific docs in that case, right?  As a budding pythonista, I headed here:
Ok, it's getting clearer.  Around this time I start putting Chinese or Russian UTF-8 into test cases and am able to work around and patch the ensuing exceptions.  I guess we'll call these steps seven and eight.  ;)

So far so good, although true understanding of what should be done was still muddy.  While there were plenty of nuts and bolts to handle issues as they arose, there wasn't yet a mental framework available to latch onto about how it all works, why I am trapping these exceptions, nor how to avoid them in the first place.  Perhaps I'm a dufus (although surely not the only one), but it wasn't until a few years later when answers started permeating my thick skull, (through osmosis I'd gather).

Coincidentally, about the time these ideas were solidifying through rote repetition, I read this fantastic yet short presentation, whereby I had the proverbial "aha" moment.  The problem and solution are described more elegantly than I could below.
If you haven't had that moment yet, I'd like to condense the piece above further into the following nutshell for my own benefit as well as any others in need.

The Missing Link

How do I write a Unicode program properly in Python 2.x?  The answer:
  1. Decode early
  2. Unicode everywhere
  3. Encode late
By early, he means "at first encounter," late meaning, "just before output."

To expand on this a bit, Unicode in Python is not text, but rather a text-like object in memory.  Codecs like "UTF-8" and "latin-1" allow a Unicode object to be encoded for transport, for example to disk files or across the network.  To work with them correctly, we simply decode them to Unicode on input, manipulate them (in their "natural" state), and encode them on output!  Simple to understand and code.

Below is a minimal example:
# read in data from files, network, user, etc.
infile = file('input.txt', 'r')
data = infile.read()

# decode immediately to Unicode
unicode_text = data.decode('utf-8')

# manipulate in memory, add smiley
unicode_text = unicode_text + u'Hello World! \u263B\n'

# re-encode before transport or storage
data = unicode_text.encode('utf-8')

# save it
outfile = file('ouput.txt', 'w')
Of course Python 3 will make this easier more of the time as it uses Unicode strings and sets the system encoding to UTF-8 by default.  I'm still using Python 2.x for work projects however, and the nutshell above is still valid under non-default cases.

With this newfound understanding, I found the very good "nuts and bolts" page below more helpful than it was before:
What tips do you have about how to handle Unicode that haven't gotten enough attention?

No comments:

Post a Comment