Notes on python unicode

From Tronche's wiki

Why is it so complicated (yet so simple in many other computer languages)?

No idea. Progress, maybe.

What is usually poorly understood is:

  • Why do I get errors when trying to output something like 'ê' (as in the French fête = party)? ê is only one character, so why all that unicode / utf headache?
  • Interaction with the terminal, interaction with files, and why they differ

A note on Unicode (just in case)

If you're from Western Europe, it's most confusing, since the alphabet, including éàñ and others, can be encoded on 8 bits, just like the bytes of a byte string. However, Python insists that what you put in a str is ASCII by default, that is bytes with codes ranging from 0 to 127. If you're using "é", maybe it fits in iso-8859-1, but when your Russian friend opens the file, she will most likely see a щ (the Cyrillic character that has the same code, 233, in iso-8859-5 as é has in iso-8859-1). So this is byte 233, which is é if you interpret it as iso-8859-1 and щ if you interpret it as iso-8859-5. This is why Unicode was created: so you can have both é and щ in the same character set (along with Arabic, Chinese-Japanese-Korean and many others), with room for more than a million code points. But with so many symbols, they can't fit in an 8-bit byte anymore, and this is where you have to devise a new coding system. UTF-8 is one of them: it uses the same codes as ASCII for the lower 128 characters, but beyond that you need multiple bytes to represent one character.
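
To make the byte 233 story concrete, here is a minimal sketch. It is written so that it should run on both Python 2.7 and Python 3 (the b'...' literal is a plain str on Python 2), and the variable names are only illustrative.

```python
raw = b'\xe9'  # a single byte with value 233

# The same byte gives two different characters depending on the charset.
as_latin1 = raw.decode('iso-8859-1')    # U+00E9, e with an acute accent
as_cyrillic = raw.decode('iso-8859-5')  # U+0449, Cyrillic letter shcha

# Both characters coexist in one unicode string, and UTF-8 can encode both.
both = as_latin1 + as_cyrillic
utf8 = both.encode('utf-8')             # two bytes per character here

# ASCII characters keep their one-byte code in UTF-8; the others need more.
samples = (u'a', u'\u00e9', u'\u0449', u'\u20ac')
widths = [len(ch.encode('utf-8')) for ch in samples]  # [1, 2, 2, 3]
```

The last sample (U+20AC, the euro sign) is there only to show that some characters need three bytes in UTF-8.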

How does it work?

1. str is an array of bytes.
2. unicode (the Python type) is an array of characters (for example, the Greek lowercase letter pi is one character).

Examples:

>>> u'\u03C0' # 03C0 is the Unicode code point of the Greek lowercase letter pi
u'\u03c0'
>>> print u'\u03C0'
π

NB: this only works if your terminal is set to an encoding capable of outputting this character (such as UTF-8).

When starting in interactive mode, Python (by default) tries to match the terminal output encoding. We can see this by typing:


>>> import sys
>>> sys.stdout.encoding
'UTF-8'

In my case, the terminal is set to interpret output sent to stdout as UTF-8, and Python matched that setting when starting. More on this later.

The point is that a str is a byte string, a succession of 8-bit numbers. Some of them can be interpreted as accented characters under various circumstances, but there is no one-to-one mapping: one code can represent various characters (for example, 233 is é in iso-8859-1 and щ in iso-8859-5), and conversely the character é is represented by 233 (hex e9) if encoded in iso-8859-1 and by hex c3 hex a9 if encoded in utf-8. A unicode string, in contrast, represents the characters independently of any representation, but to output it to a file or terminal you'll have to choose an encoding. Thus:

3. encode converts from unicode to a byte string (str in Python 2 talk)
4. decode converts from a byte string to unicode

A bit more to understand the difference:

>>> u'\u03C0'.encode('utf-8')
'\xcf\x80'

Thus, the representation of π in utf-8 is hex CF followed by hex 80 (and not hex 03 followed by hex C0, as the Unicode number might make you think).

>>> len(u'\u03C0')
1

The (unicode) string u'\u03C0' has a length of one character (Greek lowercase pi). In memory, that character takes at least 2 bytes (Python 2 stores unicode strings in 16-bit or 32-bit units, depending on the build).

>>> len(u'\u03C0'.encode('utf-8'))
2

When using UTF-8, it's 2 bytes.
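
Rules 3 and 4 and the two length checks can be put together in one short sketch. It assumes nothing beyond the standard codecs, should run on Python 2.7 and Python 3 alike, and the variable names are illustrative.

```python
pi = u'\u03c0'                     # one character: Greek lowercase pi

encoded = pi.encode('utf-8')       # rule 3: unicode -> bytes
decoded = encoded.decode('utf-8')  # rule 4: bytes -> unicode

# One character, but two bytes once encoded in UTF-8.
lengths = (len(pi), len(encoded))  # (1, 2)

# The byte count depends on the encoding chosen; the character count does not.
names = ('utf-8', 'utf-16-be', 'utf-32-be')
sizes = dict((name, len(pi.encode(name))) for name in names)
```

Here sizes maps 'utf-8' and 'utf-16-be' to 2 and 'utf-32-be' to 4, while len(pi) stays at 1 throughout.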

Analyzing errors

The "it shouldn't matter to my mind, yet it triggers an error" case

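Many errors come from Python trying to implicitly decode or encode. Here's an example. My terminal output is set for UTF-8 and the LANG environment variable is en_GB.UTF-8, so sys.stdout.encoding is 'UTF-8'; sys.getdefaultencoding(), the codec Python 2 uses for implicit conversions, is still 'ascii', as the traceback below shows. Terminal input is ISO-8859-1, so typing é yields the single byte hex e9.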

>>> 'é'
'\xe9' # So far so good
>>> 'é'.encode('iso-8859-15')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/iso8859_15.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

Note that this is a decode error while we're trying to encode.

"But that's unfair", you say. "Python has put 'é' in the byte array (that is byte with code hex e9 = dec 233), and that can be encoded in iso-8859-15. What's wrong with that ?"

You're not listening. encode converts from unicode to a byte string. You've passed a byte string to encode, and frankly, I'm of the opinion that encode shouldn't be defined on str at all. But it is (not any more in Python 3, which removes part of the confusion), so what is Python trying to do? First it decodes the string into unicode, then it encodes the result. The default codec for that first step is ascii, so what Python is really trying to do is

'é'.decode('ascii').encode('stuff')

And then BAM ! 'é' is not in pure ASCII (which is range(128), what Python is trying to tell us), thus decoding can't occur.
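
The two hidden steps can be spelled out explicitly. This is only a sketch of what Python 2 does implicitly; it uses a bytes literal so that it should also run on Python 3, where the implicit conversion no longer exists.

```python
raw = b'\xe9'  # what Python 2 stores for the byte string typed as é in ISO-8859-1

# Step 1 of the implicit conversion: decode with the default codec, ASCII.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # a *decode* error, raised before any encoding is attempted

# Naming the source encoding explicitly lets both steps succeed.
result = raw.decode('iso-8859-1').encode('iso-8859-15')
```

The error object carries the codec name ('ascii') and the position of the offending byte (0), which is what the traceback above reports.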

Contrast this with

>>> u'é'.encode('iso-8859-15')
'\xe9'

And remember rule #3.

Of course, you shouldn't feed encode with a str (only unicode), but that may happen if you didn't think about unicode when you started your program, or if you're using legacy code that ends up mixing str and unicode.
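
One possible way to cope with code that mixes str and unicode is to decode once at the boundary and encode once on the way out. The helper below is a hypothetical sketch: to_unicode and its default encoding are not part of the standard library, and you have to know (or decide) what encoding the legacy bytes are in.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly.

    Hypothetical helper for illustration; the default encoding is an
    arbitrary choice.
    """
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value

legacy = b'f\xeate'                      # "fête" as ISO-8859-1 bytes
text = to_unicode(legacy, 'iso-8859-1')  # decode once, at the boundary
same_text = to_unicode(u'f\u00eate')     # unicode passes through untouched

out = text.encode('utf-8')               # encode once, on the way out
```

With this pattern encode is only ever called on unicode objects, so the implicit ASCII decoding step never gets a chance to run.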

Works well on the terminal, crashes when redirecting to a file? (Unix)

Python assumes that when you're writing to a terminal, the program is most likely communicating with people, who use all sorts of symbols, so special symbols should display well on the terminal. When writing to a file, you should be more specific, because the file could be read on another machine in another environment (and thus files are just collections of bytes).

On Unix, Python takes the character set from the locale subsystem, for example from the LANG environment variable.

$ echo $LANG
en_GB.UTF-8
$ python -c 'import sys
print sys.stdout.encoding'
UTF-8

$ export LANG=en_GB.ISO-8859-1
$ python -c 'import sys
print sys.stdout.encoding'
ISO-8859-1

LANG is supposed to reflect the settings of your terminal. This is very likely to be true if you open a fresh terminal and don't mess with the settings.

Now what happens if you redirect to a file?

$ python -c 'print u"é"'
é
$ python -c 'print u"é"' > f
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

Damn. What happened? Start with the case that works. The unicode string u"é" contains a non-ASCII character, and you have to encode it to output it to a stream or a file. In my case (a UTF-8 terminal, which Python learns from the LANG environment variable), it's encoded to UTF-8, so Python really performs sys.stdout.write(u"é".encode("UTF-8")). That outputs two bytes to the terminal (hex c3 and hex a9, the UTF-8 encoding of e with an acute accent); the terminal gets those two bytes and happily displays é.
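
The terminal case can be replayed by hand. In this sketch the decode calls stand in for what a terminal does with the bytes it receives; that is a simplification, and the variable names are illustrative.

```python
text = u'\u00e9'                # e with an acute accent

# What ends up on stdout when the stream encoding is UTF-8.
written = text.encode('UTF-8')  # two bytes: hex c3, hex a9

# A terminal that interprets its input as UTF-8 recovers the character.
shown_utf8 = written.decode('utf-8')

# A terminal set to ISO-8859-1 would show two characters instead
# (U+00C3 and U+00A9), which is why LANG has to match the terminal.
shown_latin1 = written.decode('iso-8859-1')
```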

Now, what encoding is used if stdout is redirected to a file? As I said, the file could be read in another environment, and Python makes no assumption about it, so:

$ python -c '
import sys
print >>sys.stderr, sys.stdout.encoding'
UTF-8
$ python -c '
import sys
print >>sys.stderr, sys.stdout.encoding' > /dev/null # note the redirection here
None

Python has no encoding for stdout in that case, and thus falls back to its default encoding (sys.getdefaultencoding(), most likely ascii), which triggers the error.
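
One way out is to stop relying on the stream's encoding and pick a fallback yourself. The sketch below uses hypothetical names (encode_for and the two stand-in stream classes) to mimic a stdout with and without an encoding; it is an illustration of the idea, not the only remedy.

```python
def encode_for(stream, text, fallback='utf-8'):
    """Encode text for stream, using fallback when the stream has no encoding.

    Hypothetical helper; the fallback value is an assumption you must choose.
    """
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding)

class RedirectedStdout(object):
    """Stand-in for Python 2's redirected sys.stdout: encoding is None."""
    encoding = None

class Utf8Terminal(object):
    """Stand-in for sys.stdout attached to a UTF-8 terminal."""
    encoding = 'UTF-8'

text = u'\u00e9'

# What Python 2 effectively attempts when no stream encoding is known.
try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

redirected = encode_for(RedirectedStdout(), text)  # falls back to UTF-8
on_terminal = encode_for(Utf8Terminal(), text)     # uses the stream's encoding
```

In a real Python 2 program the resulting bytes would then go to sys.stdout.write(). Setting the PYTHONIOENCODING environment variable is another way to give a redirected stdout an encoding.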
