Regular Python strings are not unicode, they are just plain bytes. To create a unicode string, use the 'u' prefix on the string literal:
>>> ustring = u'A unicode \u018e string \xf1'>>> ustringu'A unicode \u018e string \xf1'
A unicode string is a different type of object from regular "str" string, but the unicode string is compatible (they share the common superclass basestring), and the various libraries such as regular expressions work correctly if passed a unicode string instead of a regular string.
To convert a unicode string to bytes with an encoding such as 'utf-8', call the ustring.encode('utf-8') method on the unicode string. Going the other direction, the unicode(s, encoding) function converts encoded plain bytes to a unicode string:
## (ustring from above contains a unicode string)>>> s = ustring.encode('utf-8')>>> s'A unicode \xc6\x8e string \xc3\xb1' ## bytes of utf-8 encoding>>> t = unicode(s, 'utf-8') ## Convert bytes back to a unicode string>>> t == ustring ## It's the same as the original, yay!True
The built-in print does not work fully with unicode strings. You can encode() first to print in utf-8 or whatever. In the file-reading section, there's an example that shows how to open a text file with some encoding and read out unicode strings. Note that unicode handling is one area where Python 3.0 is significantly cleaned up vs. Python 2.x behavior described here.
The codecs module provides support for reading a unicode file.
import codecsf = codecs.open('foo.txt', 'rU', 'utf-8')for line in f:# here line is a *unicode* string
For writing, use f.write() since print does not fully support unicode.