The Hassle of Unicode (and Getting On With It in Python)
December 10, 2008
Let’s face facts.
- Unicode is a hassle [1]
- Not using unicode is a hassle, especially if you have one of those “weirdo” languages, and *gasp*, you want to read text *in your own language*.
I was faced with a simple task. Take some text, process it, and print out some results (in JSON). This should be trivial, and in a world where programming was invented by a multinational consortium, and designed from the first day to be compatible with all text, maybe it would be. Instead we have a world with a rich history of mutual incompatibility.
Text vs. Bytestreams
We English speakers are used to thinking about text as a series of bytes that maps one-to-one onto a set of glyphs (a-zA-Z0-9 and various control characters… i.e., the stuff on a US keyboard). One byte = one character = one glyph. Simple, right? But limiting: we only get 256 choices! A major driving idea of unicode is to reframe our thinking in terms of code points, abstract numbers that identify characters independently of how they are stored as bytes.
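Here is a minimal sketch of the code-point view. It assumes Python 3 (the rest of this post is Python 2), where `str` holds code points and `bytes` holds encoded data. The variable names are just for illustration.

```python
# Python 3 sketch: one character is one code point, but not always one byte.
text = "ma\u00f1ana"   # "manana" with LATIN SMALL LETTER N WITH TILDE (U+00F1)

code_points = [ord(ch) for ch in text]
latin1_bytes = text.encode("latin-1")   # one byte per character for this text
utf8_bytes = text.encode("utf-8")       # the n-with-tilde needs two bytes

print(code_points)
print(len(text), len(latin1_bytes), len(utf8_bytes))
```

The number of characters stays the same whichever encoding is chosen; only the byte count changes.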
If you want more details, try these sources:
- A.M. Kuchling’s Unicode Howto
- Joel’s “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”
- ReportLab’s Python Unicode Tutorial
- Wikipedia’s page on UTF-8
So, how do I deal with it? My solution is to try to get all bytestreams I encounter into unicode, and then use python to deal with it. Then when it comes time to deal with io again, output the unicode back into a bytestream. Python has tons of support for unicode, but it can be confusing to use (especially for me).
Memory aid:
- DECODE: bytes -> Unicode Object
- ENCODE: Unicode Object -> bytes
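The memory aid can be checked with a short round trip. This sketch assumes Python 3, where the two sides are separate types (`bytes` and `str`), which makes the direction of each call easier to see than in Python 2. The names `raw`, `text` and `back` are illustrative.

```python
# Python 3 sketch of the memory aid.
raw = b"Ense\xf1anza T\xe9cnica"   # bytes as they might arrive, Latin-1 encoded

text = raw.decode("latin-1")       # DECODE: bytes -> str (code points)
back = text.encode("latin-1")      # ENCODE: str -> bytes

print(type(raw).__name__, type(text).__name__, type(back).__name__)
print(back == raw)
```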
# our bytestream, which should contain a "small latin n with tilde" glyph
b='Ense\xf1anza T\xe9cnica'
assert unicode(b,'latin1') == b.decode("latin1")
for o in b: print (o,ord(o))
## the original problem...
import sys
if sys.version_info < (2,6):
    import simplejson as json
else:
    import json
json.dumps(b)
'''
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib64/python2.4/site-packages/simplejson/__init__.py", line 225, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python2.4/site-packages/simplejson/encoder.py", line 188, in encode
    return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: invalid data
'''
# understanding the error
try:
    json.dumps(b)
except Exception, E1:
    pass
try:
    unicode(b,'utf8')
except Exception, E2:
    pass
# the errors are the same enough
assert E1.__dict__ == E2.__dict__
So, what’s the problem here? We see the character “\xf1” has an ordinal value of 241, meaning it can’t be understood as ASCII. Implicitly, simplejson is trying to decode the bytestring as UTF-8 before dumping. Is “\xf1” valid in UTF-8? To learn more about this letter, we can do some experiments.
# our tilde n
n = '\xf1'
assert type(n) == type('') # it's a string
assert ord(n) == 241
assert n.decode('latin1').encode('utf8') == '\xc3\xb1'
assert '\xc3\xb1'.decode('utf8') == u'\xf1'
So, it’s “\xf1” sometimes, and “\xc3\xb1” sometimes? What the heck is going on? If we look at the “latin small letter n with tilde” page, we see that the “code point” associated with this *glyph* is represented in different ways in different encodings, as listed in the “encodings” section. In Latin-1, it’s “\xf1” and in UTF-8, it’s “\xc3\xb1”. Simplejson thought it was seeing a UTF-8 bytestream, when it was really seeing a bytestream intended to be viewed through the lens of Latin-1.
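The same mismatch can be reproduced in a few lines. This is a sketch assuming Python 3, with made-up variable names. It shows one code point with two byte representations, and what happens when bytes are read through the wrong lens in either direction.

```python
# Python 3 sketch: one code point, two byte representations.
n_tilde = "\u00f1"

as_latin1 = n_tilde.encode("latin-1")   # b'\xf1'
as_utf8 = n_tilde.encode("utf-8")       # b'\xc3\xb1'

# Reading Latin-1 bytes as UTF-8 fails outright.
try:
    as_latin1.decode("utf-8")
    misread_failed = False
except UnicodeDecodeError:
    misread_failed = True

# Reading UTF-8 bytes as Latin-1 does not fail, but yields two wrong characters.
mojibake = as_utf8.decode("latin-1")

print(as_latin1, as_utf8, misread_failed, len(mojibake))
```

The second case is the sneakier one: no exception is raised, the text is simply wrong.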
So what do we actually have? Encoding the unicode object as UTF-16 will give us a good view of what’s actually inside the unicode object (essentially, displaying the code points).
def _twobite(uni):
    ''' take a unicode object and make it into nice two-byte sequences
    to show better what's going on inside the unicode object '''
    if not type(uni) == type(u''):
        raise ValueError, "must be unicode type"
    enc = uni.encode('utf16')
    # step through the encoded bytes two at a time
    return [ enc[ii:ii+2] for ii in xrange(0, len(enc), 2) ]
print _twobite(n.decode('latin1'))
# ['\xff\xfe', '\xf1\x00']
# the first pair is the byte order mark; if the endian order on your
# machine is different, the bytes within each pair may be reversed
# use the unichr function to verify this
print _twobite(unichr(241)) # 241 -> "f1"
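A variant of the same idea that does not depend on the machine's byte order is sketched below. It assumes Python 3 and uses the `utf-16-be` codec, which fixes the byte order and omits the byte order mark. The helper name `utf16_units` is made up for this example.

```python
# Python 3 sketch: list the 16-bit code units of a string, without a BOM.
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as four-digit hex strings."""
    data = text.encode("utf-16-be")          # explicit byte order, no BOM
    count = len(data) // 2
    units = struct.unpack(">%dH" % count, data)
    return ["%04x" % u for u in units]

print(utf16_units("\u00f1"))
print(utf16_units("ma\u00f1ana"))
```

For characters in this range, each code unit is simply the code point, so the n with tilde shows up as `00f1`.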
Putting it all together
Let’s see if we can simply code our way out of it, using simplejson, and ignore the issue altogether:
json.dumps(b,encoding='latin1')   # works fine, but assumes we know what
                                  # encoding we already have for the bytestream.
json.dumps(b,ensure_ascii=False)  # ignores the problem, passing through
                                  # the string untouched
Both of those work fine, for what they do. Ideally though, I don’t want to have to think about what the strange text coming in is. I just want a guarantee that I have valid unicode (eventually encodable into UTF8), to use as I see fit [2].
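For comparison, here is how the `ensure_ascii` switch behaves once the text is already decoded. This sketch assumes Python 3, whose standard `json` module has no `encoding` argument for `dumps` and expects text rather than bytes. The variable names are illustrative.

```python
# Python 3 sketch: ensure_ascii only changes how non-ASCII text is written out.
import json

text = "Ense\u00f1anza T\u00e9cnica"

escaped = json.dumps(text)                      # non-ASCII becomes \uXXXX escapes
literal = json.dumps(text, ensure_ascii=False)  # characters are kept as they are

print(escaped)
print(literal)
print(json.loads(escaped) == json.loads(literal))
```

Both outputs load back to the same text, so the choice is about what the consumer of the JSON can handle, not about correctness.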
def _to_unicode(str, verbose=False):
    '''attempt to turn a bytestring of unknown encoding into unicode, using a limited set of encodings'''
    # fuller list of encodings at http://docs.python.org/library/codecs.html#standard-encodings
    if not str: return u''
    u = None
    # we could add more encodings here, as warranted.
    # order matters: latin1 accepts any byte sequence, so it has to come last.
    encodings = ('ascii', 'utf8', 'latin1')
    for enc in encodings:
        if u: break
        try:
            u = unicode(str,enc)
        except UnicodeDecodeError:
            if verbose: print "error for %s into encoding %s" % (str, enc)
    if not u:
        # only reached if every encoding above failed
        u = unicode(str, errors='replace')
        if verbose: print "using replacement character for %s" % str
    return u
assert json.dumps(_to_unicode(b)) == '"Ense\u00f1anza T\u00e9cnica"'
assert type(json.loads(json.dumps(_to_unicode(b)))) == type(u'')
print json.loads(json.dumps(_to_unicode(b)))
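The same try-each-encoding idea can be sketched for Python 3 (an assumption; the function above is Python 2). The name `to_text` and its default encoding list are made up for this example, and this is guessing in a fixed order, not real encoding detection.

```python
# Python 3 sketch: decode bytes by trying a short list of encodings in order.
def to_text(raw, encodings=("ascii", "utf-8", "latin-1")):
    """Return str for raw bytes, using the first encoding that decodes cleanly."""
    if isinstance(raw, str):
        return raw                      # already text, nothing to do
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace")

print(to_text(b"Ense\xf1anza T\xe9cnica"))          # Latin-1 bytes
print(to_text(b"Ense\xc3\xb1anza T\xc3\xa9cnica"))  # UTF-8 bytes
```

UTF-8 has to be tried before Latin-1, because Latin-1 will accept any byte sequence and would otherwise turn valid UTF-8 into the wrong characters.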
Our world of Babel [3] makes things complicated enough already. So, go forth into this land, waving the banner of UTF-8. Maybe some world builder will finally tell us the secrets of the universe, and we’ll be able to read them, instead of seeing a row of empty boxes.
Notes:
- At least until Python 3 gets to be the usual state of affairs.
- For example, I might want to take the text, and replace all latin-1 with “nearest ascii equivalents”. cf: The Unicode Hammer, ignoring the main text, and focusing on comment 3 and its use of the “unicodedata” module.
- Genesis 11:1-9, cf: http://en.wikipedia.org/wiki/Tower_of_Babel
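The “nearest ascii equivalents” idea from the second note can be sketched with the standard `unicodedata` module. This assumes Python 3 and is a general illustration of the approach, not a reproduction of the code in the comment cited there. The function name `nearest_ascii` is made up.

```python
# Python 3 sketch: strip accents by decomposing characters, then dropping marks.
import unicodedata

def nearest_ascii(text):
    """Return an ASCII-only approximation of text."""
    decomposed = unicodedata.normalize("NFKD", text)   # e.g. n-tilde -> n + combining tilde
    kept = [ch for ch in decomposed if not unicodedata.combining(ch)]
    # Anything still outside ASCII (no decomposition available) is dropped.
    return "".join(kept).encode("ascii", "ignore").decode("ascii")

print(nearest_ascii("Ense\u00f1anza T\u00e9cnica"))
```

This loses information on purpose, so it only makes sense for things like search keys or filenames, not for text people will read.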

December 10, 2008 at 7:57 pm
Your post betrays confusion about the meanings of the words “encode” and “decode”.
A Unicode string is an abstraction consisting of a sequence of code points. An *encoding* (such as UTF-8) is a way of representing a Unicode string as an actual sequence of bytes. You do not ever want to be dealing with encoded byte sequences in your application; you want only to be dealing with Unicode strings, which you construct by *decoding* some bytes that came off the wire or out of a file (or by referring to constants like u’Frobnitz’).
Here’s how the memory aid ought to go:
Encode: Unicode string -> bytes
Decode: bytes -> Unicode string
The fact that Python treats encoded byte sequences as “strings” was a design flaw, which has now happily been corrected in version 3.
December 10, 2008 at 8:27 pm
Jonathan,
Your explanation is quite cogent, and indeed, all of this confuses the heck out of me, and out of many people. Calling a python object “a series of code points” is a very simple and clear way of looking at it.
That said, I’m not a huge fan of the phrase “unicode string”, and thinking with that phrase is why I get so confused about this stuff!
Do you agree with this code below?
# bytes going to unicode code points, mapped through the latin-1 lookup table
"ma\xf1ana".decode("latin1")
December 11, 2008 at 1:43 am
Decoding is not generally a matter of using a “lookup table”, so your comment isn’t quite right. However, your code sample does illustrate “constructing a Unicode string by interpreting a sequence of bytes as latin-1-encoded text”. I understand your confusion over the overloading of the word “string”, but the “unicode” python type is much closer to the platonic “string” than the “str” python type, which is really a byte sequence.
IDLE 1.2.2
>>> s = u'Moxy Früvous'
>>> type(s)
>>> utf8 = s.encode('utf8')
>>> utf8
'Moxy Fr\xc3\xbcvous'
>>> type(utf8)
>>> latin1 = s.encode('latin-1')
>>> latin1
'Moxy Fr\xfcvous'
December 11, 2008 at 1:44 am
Your blog has swallowed some angle brackets.
>>> type(s)
<type 'unicode'>
>>> type(utf8)
<type 'str'>
December 11, 2008 at 6:07 am
I think you’re very confused about the relationship between unicode and byte strings.
> We english speakers are used to thinking about text as a series of characters, that maps one to one onto a set of glyphs (a-zA-Z0-9 and various control characters… i.e., the stuff on a US keyboard). The whole idea of unicode is to move away from this expectation, and to start thinking of thing in terms of bytestreams. Until I know an encoding for it, it’s just a stream.
This is almost the exact opposite of what Unicode actually is. A unicode string is a sequence of Unicode code points; equivalent to ASCII characters.
> So, how do I deal with it? My solution is to try to get everything I can into UTF-8, and then use python do deal with it. Python has tons of support for UTF-8, but it can be confusing to use.
The correct solution is to use Unicode strings. In Python, this is achieved by prefixing strings with u and using \u escapes.
For example, u'Ense\u00f1anza T\u00e9cnica'
You’re also getting confused between Latin-1 and UTF-8. These are two different Unicode encodings — \xF1 is valid Latin-1, but not valid UTF-8.
Unicode U+00F1 = UTF-8 "\xC3\xB1"
December 11, 2008 at 5:36 pm
Thanks for the good comments, John.
Indeed I was confused, as you point out. I know that I’m not the only one, and that all of my co-workers also bang our heads against that wall every time a bug ends up being unicode related. You mention:
“In Python, this is achieved by prefixing strings with u and using \u escapes.
For example, u’Ense\u00f1anza T\u00e9cnica’”
I know that it would be awfully nice if every bytestream I interact with came in with a properly labelled encoding, but many don’t, which is the point. In this case, simplejson seems to assume that any non-ascii stream is UTF8, and it bit me here.
I hope the revisions to this article are more consistent with your understanding of things.