The Hassle of Unicode (and Getting On With It in Python)

Let’s face facts.

  1. Unicode is a hassle [1]
  2. Not using Unicode is a hassle, especially if you have one of those “weirdo” languages and, *gasp*, you want to read text *in your own language*.

I was faced with a simple task.  Take some text, process it, and print out some results (in JSON).  This should be trivial, and in a world where programming
was invented by a multinational consortium, and designed from the first day to be compatible with all text, maybe it would be.  Instead we have a world with a rich history of mutual incompatibility.

Text vs. Bytestreams

We English speakers are used to thinking about text as a series of bytes that maps one-to-one onto a set of glyphs (a-zA-Z0-9 and various control characters… i.e., the stuff on a US keyboard).  One byte = one character = one glyph.  Simple, right?  But limiting… we only get 256 choices!  A major driving idea of Unicode is to reframe our thinking in terms of code points: abstract numbers assigned to characters, independent of how those numbers are stored as bytes.
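For instance, here is a minimal Python 2 sketch of the difference (the two bytes are the UTF-8 encoding of the “latin small letter n with tilde” glyph that shows up again below):

# the same glyph, seen as bytes vs. as a code point
raw = '\xc3\xb1'          # two bytes: the UTF-8 encoding of one glyph
u = raw.decode('utf8')    # one code point: u'\xf1'
print len(raw)            # 2 -- two bytes
print len(u)              # 1 -- one code point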

If you want more details, the Unicode HOWTO in the Python documentation is a good place to start.

So, how do I deal with it?  My solution is to get every bytestream I encounter into Unicode, and then use Python to deal with it.  Then, when it comes time to do I/O again, encode the Unicode back into a bytestream.  Python has tons of support for Unicode, but it can be confusing to use (especially for me).

Memory aid:

  • DECODE:  bytes -> Unicode Object
  • ENCODE:  Unicode Object -> bytes
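A quick round trip makes the memory aid concrete (a minimal Python 2 sketch; the source bytes here are assumed to be Latin-1):

raw = 'Ense\xf1anza'               # bytes from some outside source
text = raw.decode('latin1')        # DECODE: bytes -> unicode object
out = text.encode('utf8')          # ENCODE: unicode object -> bytes
assert out.decode('utf8') == text  # round-trips losslessly

Now, back to the problem at hand: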
# our bytestream, which should contain a "latin small letter n with tilde" glyph
b = 'Ense\xf1anza T\xe9cnica'
assert unicode(b, 'latin1') == b.decode("latin1")
for o in b: print (o, ord(o))   # each byte and its ordinal value

## the original problem...
import sys
if sys.version_info < (2,6):
    import simplejson as json
else:
    import json

json.dumps(b)

'''
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib64/python2.4/site-packages/simplejson/__init__.py", line 225, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python2.4/site-packages/simplejson/encoder.py", line 188, in encode
    return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: invalid data
'''

# understanding the error
try:
    json.dumps(b)
except Exception, E1:
    pass

try:
    unicode(b,'utf8')
except Exception, E2:
    pass

# the errors are the same enough
assert E1.__dict__ == E2.__dict__


So, what’s the problem here?  We see the character “\xf1” has an ordinal value of 241, meaning it can’t be understood as ASCII.  Implicitly, simplejson is trying to decode the string as UTF-8 before dumping.  Is “\xf1” valid in UTF-8?  To learn more about this letter, we can do some experiments.

# our tilde n
n = '\xf1'
assert type(n) == type('')  # it's a (byte) string
assert ord(n) == 241

assert n.decode('latin1').encode('utf8') == '\xc3\xb1'
assert '\xc3\xb1'.decode('utf8') == u'\xf1'

So, it’s “\xf1” sometimes, and “\xc3\xb1” sometimes?  What the heck is going on?  If we look at the “latin small letter n with tilde” page, we see that the “code point” associated with this *glyph* is represented different ways in different encodings, as listed in the “encodings” section.  In Latin-1, it’s “\xf1”, and in UTF-8, it’s “\xc3\xb1”.  (In UTF-8, any byte at or above 0x80 is part of a multi-byte sequence; 0xf1 is a lead byte that promises three continuation bytes, which never arrive here.)  Simplejson thought it was seeing a UTF-8 bytestream, when it was really seeing a bytestream intended to be viewed through the lens of Latin-1.
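As a further experiment, the standard-library “unicodedata” module can name the code point for us, and encoding it with different codecs shows each byte representation (a Python 2 sketch):

import unicodedata

ntilde = u'\xf1'                     # code point U+00F1
print unicodedata.name(ntilde)       # LATIN SMALL LETTER N WITH TILDE
print repr(ntilde.encode('latin1'))  # '\xf1'     -- one byte
print repr(ntilde.encode('utf8'))    # '\xc3\xb1' -- two bytes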

So what do we actually have?  Encoding the unicode object as UTF-16 gives us a good view of what’s actually inside the unicode object (essentially, displaying the code points).


def _twobite(uni):
    ''' take a unicode object and make it into nice two-byte sequences
    to show better what's going on inside the unicode object '''
    if not isinstance(uni, unicode):
        raise ValueError, "must be unicode type"
    enc = uni.encode('utf16')
    return [enc[ii:ii+2] for ii in xrange(0, len(enc), 2)]

print _twobite(n.decode('latin1'))
# ['\xff\xfe', '\xf1\x00']
# the first pair is the byte-order mark (BOM); if the endian order on your
# machine is different, the bytes within each pair may be reversed

# use the unichr function to verify this
print _twobite(unichr(241))  # 241 == 0xf1

Putting it all together

Let’s see if we can simply code our way out of it, using simplejson, and ignore the issue altogether:


json.dumps(b,encoding='latin1')  # works fine, but assumes we know what
                                 # encoding we already have for the bytestream.

json.dumps(b,ensure_ascii=False)  # ignores the problem, passing through
                                  # the string untouched
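For concreteness, here is roughly what each call returns (a Python 2 sketch; note the second result is still a plain byte string, not unicode):

print json.dumps(b, encoding='latin1')
# "Ense\u00f1anza T\u00e9cnica"  -- decoded, then escaped to pure ascii

print repr(json.dumps(b, ensure_ascii=False))
# '"Ense\xf1anza T\xe9cnica"'    -- the latin-1 bytes pass through untouched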

Both of those work fine, for what they do.  Ideally, though, I don’t want to have to think about what the strange text coming in is.  I just want a guarantee that I have valid unicode (eventually encodable into UTF-8), to use as I see fit [2].


def _to_unicode(s, verbose=False):
    '''attempt to coerce a non-utf-8 byte string into unicode, trying a limited set of encodings'''
    # fuller list of encodings at http://docs.python.org/library/codecs.html#standard-encodings
    if not s:  return u''
    u = None
    # we could add more encodings here, as warranted.  note that latin1
    # can decode *any* byte sequence, so the replacement fallback below
    # is only reachable if latin1 is removed from this list.
    encodings = ('ascii', 'utf8', 'latin1')
    for enc in encodings:
        if u:  break
        try:
            u = unicode(s, enc)
        except UnicodeDecodeError:
            if verbose: print "error for %s into encoding %s" % (s, enc)
    if not u:
        u = unicode(s, errors='replace')
        if verbose:  print "using replacement character for %s" % s
    return u

assert json.dumps(_to_unicode(b)) == '"Ense\u00f1anza T\u00e9cnica"'

assert type(json.loads(json.dumps(_to_unicode(b)))) == type(u'')  # round-trips as unicode
print json.loads(json.dumps(_to_unicode(b)))
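Putting the whole flow together, the article compresses into a few lines: decode on the way in, work in unicode, encode on the way out (a sketch; the file names are hypothetical):

# bytes in -> unicode -> process -> utf-8 bytes out
raw = open('input.txt', 'rb').read()        # bytes, encoding unknown
text = _to_unicode(raw)                     # bytes -> unicode object
out = json.dumps(text, ensure_ascii=False)  # dumps returns unicode here
open('output.json', 'wb').write(out.encode('utf8'))  # unicode -> utf-8 bytes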

Our world of Babel [3] makes things complicated enough already.  So, go forth into this land, waving the banner of UTF-8.  Maybe some world builder will finally tell us the secrets of the universe, and we’ll be able to read them, instead of seeing a row of empty boxes.

Notes:

  1. At least until Python 3 gets to be the usual state of affairs.
  2. For example, I might want to take the text, and replace all latin-1 with “nearest ascii equivalents”. cf: The Unicode Hammer, ignoring the main text,
    and focusing on comment 3 and its use of the “unicodedata” module.
  3. Genesis 11:1-9, cf:  http://en.wikipedia.org/wiki/Tower_of_Babel

8 Comments on “The Hassle of Unicode (and Getting On With It in Python)”

  1. Your post betrays confusion about the meanings of the words “encode” and “decode”.

    A Unicode string is an abstraction consisting of a sequence of code points. An *encoding* (such as UTF-8) is a way of representing a Unicode string as an actual sequence of bytes. You do not ever want to be dealing with encoded byte sequences in your application; you want only to be dealing with Unicode strings, which you construct by *decoding* some bytes that came off the wire or out of a file (or by referring to constants like u'Frobnitz').

    Here’s how the memory aid ought to go:

    Encode: Unicode string -> bytes
    Decode: bytes -> Unicode string

    The fact that Python treats encoded byte sequences as “strings” was a design flaw, which has now happily been corrected in version 3.

  2. writeonly says:

    Jonathan,

    Your explanation is quite cogent, and indeed, all of this confuses the heck out of me, and out of many people. Calling a python object “a series of code points” is a very simple and clear way of looking at it.

    That said, I’m not a huge fan of the phrase “unicode string”, and thinking with that phrase is why I get so confused about this stuff!

    Do you agree with this code below?

    # bytes going to unicode code points, mapped through the latin-1 lookup table
    "ma\xf1ana".decode("latin1")

  3. Decoding is not generally a matter of using a “lookup table”, so your comment isn’t quite right. However, your code sample does illustrate “constructing a Unicode string by interpreting a sequence of bytes as latin-1-encoded text”. I understand your confusion over the overloading of the word “string”, but the “unicode” python type is much closer to the platonic “string” than the “str” python type, which is really a byte sequence.

    IDLE 1.2.2
    >>> s = u'Moxy Früvous'
    >>> type(s)
    <type 'unicode'>
    >>> utf8 = s.encode('utf8')
    >>> utf8
    'Moxy Fr\xc3\xbcvous'
    >>> type(utf8)
    <type 'str'>
    >>> latin1 = s.encode('latin-1')
    >>> latin1
    'Moxy Fr\xfcvous'

  4. Your blog has swallowed some angle brackets.

    >>> type(s)
    <type ‘unicode’>
    >>> type(utf8)
    <type ‘str’>

  5. John Millikin says:

    I think you’re very confused about the relationship between unicode and byte strings.

    > We english speakers are used to thinking about text as a series of characters, that maps one to one onto a set of glyphs (a-zA-Z0-9 and various control characters… i.e., the stuff on a US keyboard). The whole idea of unicode is to move away from this expectation, and to start thinking of thing in terms of bytestreams. Until I know an encoding for it, it’s just a stream.

    This is almost the exact opposite of what Unicode actually is. A unicode string is a sequence of Unicode code points; equivalent to ASCII characters.

    > So, how do I deal with it? My solution is to try to get everything I can into UTF-8, and then use python do deal with it. Python has tons of support for UTF-8, but it can be confusing to use.

    The correct solution is to use Unicode strings. In Python, this is achieved by prefixing strings with u and using \u escapes.

    For example, u'Ense\u00f1anza T\u00e9cnica'

    You’re also getting confused between Latin-1 and UTF-8. These are two different Unicode encodings — \xF1 is valid Latin-1, but not valid UTF-8.

    Unicode U+00F1 = UTF-8 "\xC3\xB1"

  6. writeonly says:

    Thanks for the good comments, John.

    Indeed I was confused, as you point out. I know that I’m not the only one, and that all of my co-workers also bang out heads against that wall every time a bug ends up being unicode related. You mention:

    “In Python, this is achieved by prefixing strings with u and using \u escapes.

    For example, u'Ense\u00f1anza T\u00e9cnica'”

    I know that it would be awfully nice if every bytestream I interact with came in with a properly labelled encoding, but many don’t, which is the point. In this case, simplejson seems to assume that any non-ascii stream is UTF8, and it bit me here.

    I hope the revisions to this article are more consistent with your understanding of things.

  7. [...] your gonna work with unicode on your appengine app than your in for some other troubles. This article (and its comments) might help a [...]

  8. A quick tip: how to do a STRING search (instead of a substring search using indexof or find) and getting rid of Unicode encoding/decoding issues:

    For example: you need to find the word ‘cat’ in the following phrase: ‘Learning string concatenation in Python’. How do you accomplish this?

    import re

    term = 'cat'
    phrase = 'Learning string concatenation in Python'

    found = re.search(r'\b' + term + r'\b', phrase)

    if found:
        print 'Found!'
    else:
        print 'Not found.'

    Explanation: the ‘import re’ imports the regular expression package. The search method takes a pattern and a string as arguments. The pattern was assembled by using the expression ‘\b’, which means word boundary. Also, the prefix r (before word boundary expression) implicitly gets rid of the encoding issue.

    Note: I am not familiar with most Python concepts, and there could be some errors in my explanation. Nevertheless, it works for the intended purpose, and that’s the only intent: to provide a hands-on, to-the-point example of a simple task in Python. I will research the internals of this solution, but in the meantime, hope this can help.

