Python Rocks! and other rants

2005-03-01 06:48:16

Python and Unicode

Python has extensive support for Unicode data. Two issues that are not well documented elsewhere are the handling of non-ASCII characters in the Python interpreter and the use of the default system encoding. I cover those here.

For more general information about Python and Unicode see the references.

Unicode in the Python interpreter

Understanding the console encoding

Characters that you type at the interpreter prompt, or in response to a call to input() or raw_input(), are entered in the encoding of sys.stdin. Printed characters are interpreted using the encoding of sys.stdout. You can find out what these encodings are by looking at sys.stdin.encoding and sys.stdout.encoding.

In IDLE on Windows, the input and output encodings are both Cp1252, a superset of Latin-1:

>>> import sys
>>> sys.stdin.encoding
'cp1252'
>>> sys.stdout.encoding
'cp1252'

Under Windows, the DOS console normally uses Cp437:

>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> sys.stdin.encoding
'cp437'
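If you want to check this from a script rather than at the prompt, something like the following should work. It is only a sketch, written in Python 3 syntax rather than the Python 2 used in the sessions here, and the helper name console_encodings is my own invention. The getattr calls guard against a stream that has no encoding attribute, which can happen when input or output is redirected.

```python
import sys

def console_encodings():
    """Return the encodings reported for stdin and stdout (None if unknown)."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

# Print one line per stream, e.g. "stdout cp437" in a DOS console.
for name, enc in console_encodings().items():
    print(name, enc)
```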

The use of Cp437 can cause problems. For example:

>>> 'ä'
'\x84'

This is correct; in Cp437, ä is encoded as the byte \x84. But consider:

>>> u'ä'
u'\x84'

The Unicode string also has the value \x84, which is wrong: the bytes from the console were not decoded from Cp437. The correct Unicode code point for ä is \xe4!
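The byte-level steps can be reproduced outside the console. This is a sketch in Python 3 syntax, and it assumes the interpreter treated the incoming byte as Latin-1, which is what the session above suggests. The variable names are mine.

```python
# Python 3 sketch of what appears to happen in the Cp437 console session.
typed = "\u00e4"  # the character ä, written as an escape to keep the source ASCII

# The Cp437 console delivers ä as the single byte 0x84.
console_bytes = typed.encode("cp437")

# Reading that byte as Latin-1 maps it straight to code point U+0084.
misread = console_bytes.decode("latin-1")

# Reading it as Cp437 recovers the intended character, U+00E4.
correct = console_bytes.decode("cp437")

print(repr(console_bytes), hex(ord(misread)), hex(ord(correct)))
```

The only difference between the wrong and the right result is which codec is used to decode the same byte.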

Setting the console encoding

You can change the encoding used by the DOS console using the chcp command. For example:

D:>chcp 1252
Active code page: 1252

D:>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ä'
'\xe4'
>>> u'ä'
u'\xe4'

Cp1252 encodes ä as \xe4, the same value it has in Latin-1, and the first 256 Unicode code points are the same as Latin-1, so the Unicode string now has the correct value.
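This only holds for characters where Cp1252 agrees with Latin-1. A small check, again in Python 3 syntax and with names of my own choosing, shows ä lining up across all three while the euro sign does not:

```python
a_umlaut = "\u00e4"

# ä has the same value in Cp1252, Latin-1, and Unicode.
assert a_umlaut.encode("cp1252") == b"\xe4"
assert a_umlaut.encode("latin-1") == b"\xe4"
assert ord(a_umlaut) == 0xE4

# The euro sign is byte 0x80 in Cp1252 but code point U+20AC in Unicode,
# so reading its Cp1252 byte as Latin-1 gives the wrong character.
euro = "\u20ac"
euro_byte = euro.encode("cp1252")
as_latin1 = euro_byte.decode("latin-1")

print(repr(euro_byte), hex(ord(as_latin1)), hex(ord(euro)))
```

So chcp 1252 fixes the common Western European letters, but it is not a general cure.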

The default system encoding

Converting between Unicode and 8-bit strings

The string method decode() converts an 8-bit string to a Unicode string. For example, in a DOS console we can convert a string from Cp437 to Unicode like this:

>>> 'ä'.decode('cp437')
u'\xe4'

The inverse operation is the Unicode string method encode():

>>> u'\xe4'.encode('cp437')
'\x84'
>>> print u'\xe4'.encode('cp437')
ä

How do you remember which is which? I think of Unicode as the 'master' - it is pure, unencoded character data :-). To convert to any other representation, it must be encoded. Conversely, any encoded characters must be decoded into Unicode. Of course Unicode is really just another encoding, but I find this a useful mnemonic.
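The mnemonic also tells you how to get from one 8-bit encoding to another: decode to Unicode first, then encode. Here is a minimal sketch in Python 3 syntax. The function transcode is a hypothetical helper, not something from the standard library.

```python
def transcode(data, source, target):
    """Re-encode bytes from one encoding to another by way of Unicode text."""
    text = data.decode(source)   # encoded bytes -> Unicode
    return text.encode(target)   # Unicode -> encoded bytes

cp437_bytes = b"\x84"            # ä as a Cp437 console would send it
latin1_bytes = transcode(cp437_bytes, "cp437", "latin-1")
print(repr(latin1_bytes))        # b'\xe4'
```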

If no encoding is specified, a default encoding is used. The default encoding is normally ascii:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

The default encoding is used when an 8-bit string is implicitly converted to a Unicode string, and when a Unicode string is implicitly converted to an 8-bit string. The ascii codec only allows values in the range 0-127. This can cause some surprises:

>>> u=u'ä'
>>> u
u'\xe4'
>>> print u
ä
>>> str(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> a='ä'
>>> u+a
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
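One way to sidestep these surprises is to never rely on the implicit conversion and always name the encoding yourself. The sketch below uses Python 3 syntax, where mixing bytes and text is refused outright instead of being converted, so the ascii failures are triggered explicitly. The encoding latin-1 is an assumption about where the bytes came from.

```python
u = "\u00e4"   # Unicode text
a = b"\xe4"    # the same character as a Latin-1 byte

# The ascii codec rejects anything outside 0-127, in both directions.
try:
    u.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

try:
    a.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the right encoding before concatenating always works.
combined = u + a.decode("latin-1")
print(ascii(combined))
```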

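You can change the default encoding by creating a file called sitecustomize.py in the directory Python/Lib/site-packages. The call has to go there because sys.setdefaultencoding() is only available while Python is starting up; it is removed from the sys module afterwards. The file should contain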

import sys
sys.setdefaultencoding('latin-1')

(or whatever default encoding you want to use.) With the default encoding set to latin-1 and using the IDLE shell, I can do this:

>>> u=u'ä'
>>> u
u'\xe4'
>>> a='ä'
>>> a
'\xe4'
>>> u+a
u'\xe4\xe4'
>>> print u
ä
>>> print a
ä
>>> unicode(a)
u'\xe4'

In a DOS shell with the default encoding set to Cp437, you can see clearly that you should avoid typing non-ASCII characters in Unicode string literals:

>>> a='ä'
>>> u=u'ä'
>>> u+a
u'\x84\xe4'

u+a decodes a with the default encoding, which gives the correct Unicode value \xe4, but the literal u'ä' was stored as \x84, so the result contains two different codes for the same character! So if your console is set to an encoding other than Latin-1, you are better off using hexadecimal escapes or Unicode character names in your Unicode strings, e.g.:

>>> u'\N{LATIN SMALL LETTER A WITH DIAERESIS}'
u'\xe4'
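All of the console-independent spellings give the same character, and the unicodedata module can go back and forth between a character and its name. A short sketch in Python 3 syntax, with variable names of my own:

```python
import unicodedata

by_name = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
by_hex = "\xe4"
by_u = "\u00e4"
looked_up = unicodedata.lookup("LATIN SMALL LETTER A WITH DIAERESIS")

# Every spelling yields the single code point U+00E4.
assert by_name == by_hex == by_u == looked_up

# unicodedata.name() is handy for finding the name to put inside \N{...}.
print(unicodedata.name(by_name))
```

None of these depend on what the console does with the key you press, which is the whole point.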

Other references

There is a lot more to Python's Unicode support than I have covered here; I am just trying to fill in some of the cracks. For more information, see one of these references:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)

Unicode for Programmers (includes Unicode in Python) (Jason Orendorff)

Python Unicode Objects (Fredrik Lundh)

Python Unicode Tutorial (ReportLab)

End to End Unicode Web Applications in Python (Martin Doudoroff)

Unicode in Python (Thijs van der Vossen)

PEP 100 - Python Unicode Integration - This PEP defines the initial Python implementation of Unicode

PEP 263 - Defining Python Source Code Encodings - This PEP extends Python to allow source files to define their encoding