The string method decode() converts an 8-bit string to a Unicode string. For example, in a DOS console we can convert a string from Cp437 to Unicode like this:
>>> 'ä'.decode('cp437')
u'\xe4'
The inverse operation is the Unicode string method encode():
>>> u'\xe4'.encode('cp437')
'\x84'
>>> print u'\xe4'.encode('cp437')
ä
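The same round trip can be written as a small script. This is only a minimal sketch: it uses escapes rather than literal characters so that it does not depend on the console encoding, the variable names are mine, and it assumes an interpreter that accepts both the u'' and b'' prefixes (Python 2.6+ or 3.3+).

```python
# Round trip between a Unicode string and its Cp437 / Latin-1 byte forms.
text = u'\xe4'                        # LATIN SMALL LETTER A WITH DIAERESIS

cp437_bytes = text.encode('cp437')    # Unicode -> 8-bit string
latin1_bytes = text.encode('latin-1')

# The same character has different byte values in different encodings.
assert cp437_bytes == b'\x84'
assert latin1_bytes == b'\xe4'

# Decoding with the matching codec gets the original character back.
assert cp437_bytes.decode('cp437') == text
assert latin1_bytes.decode('latin-1') == text
```

Note that the byte value depends on the codec, while the Unicode value stays the same. That is the 'master' idea in action.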
How do you remember which is which? I think of Unicode as the 'master' - it is pure, unencoded character data :-). To convert to any other representation, it must be encoded. Conversely, any encoded characters must be decoded into Unicode. Of course Unicode is really just another encoding, but I find this a useful mnemonic.
If no encoding is specified, a default encoding is used. The default encoding is normally ascii:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The default encoding is used when an 8-bit string is implicitly converted to a Unicode string, and when a Unicode string is implicitly converted to an 8-bit string. The ascii codec only allows values in the range 0-127. This can cause some surprises:
>>> u=u'ä'
>>> u
u'\xe4'
>>> print u
ä
>>> str(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> a='ä'
>>> u+a
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
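The implicit conversions above behave roughly like explicit calls that name the ascii codec, so the same errors can be reproduced, and avoided, by naming the codec yourself. A sketch, assuming Python 2.6+ or 3.3+; the variable names are only for illustration:

```python
u = u'\xe4'   # Unicode string
a = b'\xe4'   # 8-bit string holding the Latin-1 byte for the same character

# Encoding with ascii fails for anything above 127 ...
try:
    u.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# ... and so does decoding.
try:
    a.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# An error handler can substitute a placeholder instead of raising.
replaced = u.encode('ascii', 'replace')

# Naming the real encoding of the bytes avoids the surprise entirely.
combined = u + a.decode('latin-1')

assert encode_failed and decode_failed
assert replaced == b'?'
assert combined == u'\xe4\xe4'
```

Decoding explicitly at the point where the bytes arrive means the result does not depend on whatever default encoding the interpreter happens to have.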
You can change the default encoding by creating a file called sitecustomize.py in the directory Python/Lib/site-packages. The file should contain:
import sys
sys.setdefaultencoding('latin-1')
(or whatever default encoding you want to use.)
With the default encoding set to latin-1 and using the IDLE shell, I can do this:
>>> u=u'ä'
>>> u
u'\xe4'
>>> a='ä'
>>> a
'\xe4'
>>> u+a
u'\xe4\xe4'
>>> print u
ä
>>> print a
ä
>>> unicode(a)
u'\xe4'
In a DOS shell with the default encoding set to Cp437, you can see clearly that you should avoid non-ASCII characters in Unicode string literals:
>>> a='ä'
>>> u=u'ä'
>>> u+a
u'\x84\xe4'
u+a decodes a to the correct Unicode value, but the literal u'ä' kept the raw Cp437 byte value \x84, so the result contains two different code points for the same character! So if your console is set to an encoding other than Latin-1, you are better off using hexadecimal escapes or explicit Unicode character names in your Unicode strings, e.g.:
>>> u'\N{LATIN SMALL LETTER A WITH DIAERESIS}'
u'\xe4'
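To see where the two different code points come from, here is a small sketch (assuming Python 2.6+ or 3.3+) that decodes the console byte both ways. I am assuming that treating the byte as Latin-1 is effectively what happened to the u'ä' literal in the DOS session; the names right and wrong are just labels for this example.

```python
import unicodedata

console_byte = b'\x84'    # the byte a Cp437 console produces for a-umlaut

right = console_byte.decode('cp437')     # decoded with the console's codec
wrong = console_byte.decode('latin-1')   # byte value copied through unchanged

assert right == u'\xe4'
assert wrong == u'\x84'   # a control character, not a letter

# The escaped and named forms always denote the intended character,
# whatever encoding the console uses.
assert right == u'\N{LATIN SMALL LETTER A WITH DIAERESIS}'
assert unicodedata.name(right) == 'LATIN SMALL LETTER A WITH DIAERESIS'
```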