With Strings Attached

Python 2.x has unfortunate interface constraints: the __str__ and __repr__ methods must return byte strings, but it is not clear what encoding these byte strings should use. While working on the NLTK Python 3 port, I tried to figure out how to deal with these methods (taking __unicode__, default encodings, and both Python 2.x and Python 3.x into account).

Background

A quote from the Python 2.7 docs:

object.__repr__(self)

Called by the repr built-in function and by string conversions (reverse quotes) to compute the "official" string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__ but not __str__, then __repr__ is also used when an "informal" string representation of instances of that class is required.

This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.

object.__str__(self)

Called by the str built-in function and by the print statement to compute the "informal" string representation of an object. This differs from __repr__ in that it does not have to be a valid Python expression: a more convenient or concise representation may be used instead. The return value must be a string object.

object.__unicode__(self)

Called to implement unicode built-in; should return a Unicode object. When this method is not defined, string conversion is attempted, and the result of string conversion is converted to Unicode using the system default encoding.

This is a nice overview, but let's try to understand what these magic methods are used for in practice.

__unicode__ is not used in Python 3.x; in Python 2.x it is used for:

# casting to unicode
unicode(foo)

# unicode string formatting
u"%s" % foo
u"{0}".format(foo)
logging.debug(u"Foo value: %s", foo)

__str__ is used in the following cases under Python 2.x:

# casting to str
str(foo)

# printing
print foo

# string formatting
"%s" % foo
"{0}".format(foo)
logging.debug("Foo value: %s", foo)

It is also used for unicode string formatting and for casting to unicode if the __unicode__ method is not available.
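A minimal illustration of this fallback (the class is made up; note that the __str__ result must be decodable with sys.getdefaultencoding() for this to work):

>>> class OnlyStr(object):
...     def __str__(self):
...         return 'from __str__'
...
>>> unicode(OnlyStr())        # no __unicode__: the str() result is decoded
u'from __str__'
>>> u"%s" % OnlyStr()         # same fallback for unicode formatting
u'from __str__'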

__repr__ is used in the following cases:

# by repr builtin
repr(foo)

# for string formatting using %r format specifier
"%r" % foo
logging.debug("Foo value: %r", foo)

# __str__ of Python container types uses __repr__ of items,
# see http://www.python.org/dev/peps/pep-3140/
str([foo, bar])

It is also used for inspecting objects in the REPL:

>>> foo
"value, returned by foo.__repr__()"

__repr__ is the first fallback for __str__ and the second fallback for __unicode__, so it is also used:

  • for printing, if __str__ is not available;
  • for casting to str, if __str__ is not available;
  • for casting to unicode, if __str__ and __unicode__ are not available;
  • for string formatting, if __str__ is not available;
  • for unicode string formatting, if __str__ and __unicode__ are not available.

When a class doesn't define __repr__ itself, the default implementation is used, which produces strings like <module.Class object at 0x108162710>.
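For example (the exact address will of course differ):

>>> class Empty(object):
...     pass
...
>>> repr(Empty())
'<__main__.Empty object at 0x108162710>'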

Highlights

There are some non-obvious things about these magic methods.

print always uses byte strings under Python 2.x, with a single exception: values of the built-in unicode type are special-cased for print in PyFile_WriteObject. This means that with print you can't create your own class with a __unicode__ method and have it enjoy the same automatic encoding as the built-in unicode type (well, maybe inheriting every single class from unicode could help somehow, but that would be weird).
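A demonstration of this special-casing (assuming a UTF-8 terminal; the class is made up and the address will differ):

>>> class U(object):
...     def __unicode__(self):
...         return u'привет'
...
>>> print u'привет'   # built-in unicode value: encoded to the console encoding
привет
>>> print U()         # __unicode__ is ignored; the fallback repr is printed
<__main__.U object at 0x108162710>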

The REPL always uses the byte string returned by __repr__ under Python 2.x; it uses neither __str__ nor __unicode__.

__unicode__ is never implicitly used for byte ("native") string formatting or for I/O (including print); it is also never implicitly used by the REPL.

There is a small gotcha with __str__, __unicode__ and string formatting:

>>> class Foo(object):
...     def __str__(self): return '__str__'
...     def __unicode__(self): return u'__unicode__'

>>> foo = Foo()
>>> "%s, %s" % (foo, 'byte string')
'__str__, byte string'

>>> "%s %s" % (foo, u'unicode string')
u'__unicode__, unicode string'

>>> u"%s %s" % (foo, 'byte string')
u'__unicode__, byte string'

Finally, things may be complicated further by the fact that string formatting and type casting may themselves be used inside __str__, __repr__ or __unicode__ implementations.

Default encodings

There is no single default encoding in Python; different encodings are used in different contexts.

Source code encoding

# -*- coding: utf-8 -*- at the top of the file denotes the encoding of the Python source code. It affects the handling of non-ASCII string literals. See PEP 263 for more info.
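For example (assuming the file is actually saved as UTF-8):

# -*- coding: utf-8 -*-
s = 'привет'    # byte string: the raw UTF-8 bytes, exactly as stored in the file
u = u'привет'   # unicode string: decoded from the file bytes using the declared encoding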

Default string encoding

This encoding is returned by sys.getdefaultencoding(). It is used for implicit conversions between unicode strings and byte strings that are not related to I/O.

Example 1 (don't do this!):

# under python 2.x
unicode('привет')

In this example 'привет' is a byte string encoded using the source code encoding; unicode('привет') tries to decode the byte string 'привет' using the sys.getdefaultencoding() encoding. If sys.getdefaultencoding() and the source code encoding don't match, either an exception will be raised or the decoded text will be unreadable.

Example 2 (don't do this!):

u"hello %s" % (u'привет'.encode('utf8'))

Under Python 2.x this would fail if sys.getdefaultencoding() is not UTF-8, because Python would try to decode the UTF-8-encoded byte string u'привет'.encode('utf8') using an incorrect encoding.
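The safe versions of both examples keep text as unicode and make every conversion explicit:

# under python 2.x
u'привет'                     # example 1: use a unicode literal directly...
'привет'.decode('utf-8')      # ...or decode explicitly, naming the encoding
u"hello %s" % u'привет'       # example 2: don't mix byte strings into unicode formatting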

sys.getdefaultencoding() is usually "ascii".

It is possible to set the default string encoding using a hack:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Using sys.setdefaultencoding() in the hope that it will fix some UnicodeDecodeError is certainly evil.

I don't have real arguments against (knowingly) using sys.setdefaultencoding(), but it just feels wrong. I think it is better not to rely on sys.setdefaultencoding(), sys.getdefaultencoding() and implicit encoding/decoding in any way.

sys.stdout.encoding

(There are also sys.stdin.encoding and sys.stderr.encoding.)

sys.stdout.encoding is used for encoding unicode strings during stdout I/O - notably for printing unicode strings (but not for printing objects with a __unicode__ method!).

In a terminal context (tty) this encoding is set by the Python interpreter to its best guess (using locale information and environment variables?); in a non-tty context (e.g. unix pipes) under Python 2.x it may be None (which means 7-bit ASCII, and exceptions if non-ASCII unicode strings are printed).

sys.stdout.encoding may be overridden using the PYTHONIOENCODING environment variable, but it is not always possible to change the encoding that the terminal uses; an incompatible PYTHONIOENCODING would lead to broken output in this case.
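The difference is easy to observe from a shell (Python 2.x; the tty value depends on the terminal settings):

$ python -c "import sys; print sys.stdout.encoding"
UTF-8
$ python -c "import sys; print sys.stdout.encoding" | cat
None
$ PYTHONIOENCODING=cp1251 python -c "import sys; print sys.stdout.encoding" | cat
cp1251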

Note

In Python 2.6 sys.stdout.write(unicode_string) uses sys.getdefaultencoding() instead of sys.stdout.encoding; see http://bugs.python.org/issue4947. Among other things, this affects the default logging.StreamHandler: when logging objects with a custom __unicode__ method, StreamHandler incorrectly ends up using UTF-8 instead of sys.stdout.encoding (as a result of several fallbacks).

See also: http://wiki.python.org/moin/PrintFails

locale.getpreferredencoding()

I didn't figure out what it is for :) It seems that this is an encoding "suggested" by the OS (via the ANSI C locale implementation); that it is not used by Python itself; and that it may be used by a developer as a hint in some cases where the desired encoding is not clear, but I may be totally wrong.

What's the problem?

__str__ and __repr__ must return byte strings under Python 2.x, but it is not clear what encoding they should use. It is possible to have all the default encodings differ on the same machine at the same time.

For example, on Windows XP with a Russian locale, the Python shell gives the following:

>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getpreferredencoding()
'cp1251'
>>> sys.stdout.encoding
'cp866'

To make things even more fun, if the output is redirected to a file then sys.stdout.encoding may become None (which is interpreted as 'ascii').

So the question is: what encoding should __str__ and __repr__ use?

Some considerations

  1. If we want inspecting objects in the REPL to work, the __repr__ encoding must be compatible with the console encoding. This means encoding the __repr__ result to sys.stdout.encoding or to 7bit ASCII (which is compatible with a broad set of encodings).

  2. If we want print to work in the REPL, the __str__ encoding must be compatible with the console encoding (so __str__ should also use either sys.stdout.encoding or 7bit ASCII).

  3. Relying on runtime global parameters (like sys.stdout.encoding) for str(obj) and repr(obj) is bad because it makes results non-interoperable and may break things (e.g. a shared log may end up containing text in mixed encodings).

  4. The %r format specifier is tricky because __repr__ doesn't have a unicode counterpart under Python 2.x. For example, consider this:

    u'привет, %r' % user
    

    user.__repr__() would be called, and the result (which is a byte string) would be decoded using sys.getdefaultencoding() before the interpolation; this would fail if user.__repr__() uses an encoding incompatible with sys.getdefaultencoding().

    Another innocent example:

    logging.debug("Foo: %r, bar: %s", foo, bar)
    

    If bar is a unicode string then the message will become unicode during the interpolation; this triggers implicit decoding of foo.__repr__(), which will fail if the result of foo.__repr__() is encoded to something incompatible with sys.getdefaultencoding(). Having foo.__unicode__ defined won't help because %r doesn't use it.

    This means that in order to have a robust %r, obj.__repr__() should return a string either in the sys.getdefaultencoding() encoding or in 7bit ASCII.

  5. It may look like the %s format specifier is immune to the %r issues when both __str__ and __unicode__ are defined, but in practice it is not. For example:

    greeting = "%s says hi to " % user
    print (greeting + user2.full_name)
    

    Is the error easy to spot? In this example greeting would be a byte string returned by user.__str__(), because __str__ (not __unicode__) is called during byte string formatting, even if __unicode__ is present. The concatenation with the unicode value user2.full_name would then fail if the encoding of user.__str__() is not compatible with sys.getdefaultencoding(), because Python would try to decode greeting using sys.getdefaultencoding().

    There are workarounds for the %r and %s examples above (see the sketch below); the point is that it is not hard to make a non-obvious mistake that will strike back only on another machine, in another environment, or maybe after some time.
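A sketch of such workarounds (it assumes user.__repr__() returns 7-bit ASCII; otherwise decode with the actual encoding):

# %r workaround: call repr() and decode it explicitly, naming the encoding
u'привет, %s' % repr(user).decode('ascii')

# %s workaround: use a unicode format string so that __unicode__ is picked up
greeting = u"%s says hi to " % user
print (greeting + user2.full_name)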

What to do

There are several options for __str__ and __repr__ under Python 2.x.

Juggling with encodings

It is tempting to encode the values of __str__ and __repr__ to sys.stdout.encoding, and maybe to set the system default encoding to sys.stdout.encoding as well. This way the REPL and print would work as expected, and the result would be readable whenever possible.

I think this approach is too magical and has serious drawbacks:

  • values depend on global configuration and are thus non-interoperable (this may break e.g. shared logging);
  • it is easy to shoot yourself in the foot when dealing with unicode (because the encoding of strings may vary between runs);
  • non-tty behavior may be surprising (e.g. str(obj) in web context).

7bit ASCII

The second option is to make __str__ and __repr__ return 7bit ASCII, using escaping and/or transliteration.

Escaping is often used to make an arbitrary string 7bit. Python 2.x does it itself:

  • repr(unicode_string) returns an escaped, 7bit-safe ASCII string;
  • str([unicode_string1, unicode_string2]) returns 7bit ASCII because the __repr__ of the elements is used for building the string representation of standard Python container types (see http://www.python.org/dev/peps/pep-3140/ ).
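For example, in the REPL:

>>> repr(u'привет')
"u'\\u043f\\u0440\\u0438\\u0432\\u0435\\u0442'"
>>> str([u'привет'])
"[u'\\u043f\\u0440\\u0438\\u0432\\u0435\\u0442']"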

String escaping is an "escaping" from the problems; non-English Python users are all used to the unhelpful Python 2.x output for non-ASCII data.

I think that limiting __str__ and __repr__ to 7bit in user code is not popular because (while it is the most robust way to deal with the issue under Python 2.x) it often makes the output unreadable.

In order to improve readability, transliteration may be considered.

For example, we may decide to have obj.__str__() return a transliterated 7bit ASCII version of obj.__unicode__(), and __repr__ return the "full" representation encoded to 7bit ASCII with escaping:

>>> obj = MyCls(unicode_data=u'ciào привет')
>>> print obj
ciao privet
>>> print unicode(obj)
ciào привет
>>> obj
<MyCls(u'ci\xe0o \u043f\u0440\u0438\u0432\u0435\u0442')>

It may look like the transliteration of obj.__str__() is not needed (because the print unicode(obj) output is even nicer). The advantages of transliterating obj.__str__() are:

  • consistent interface - print obj becomes useful for inspection;
  • __str__ is used in string formatting, so a transliterated obj.__str__() could make the representations of other objects more readable.

The main drawback of the transliteration is that a user might think that obj holds the data "ciao privet", not "ciào привет" (based on "print obj"). I think this is a serious issue, but I convinced myself that it may be OK for __str__ to behave like this because (according to the Python docs) __str__ is only an "informal" representation of an object. In my opinion, the following conditions are met with a transliterated __str__ and an escaped __repr__:

  • __repr__ is "information-rich and unambiguous";
  • __str__ provides a "convenient or concise representation".

According to the Python docs, we shouldn't transliterate __repr__ because transliteration is a lossy process.


Proper transliteration is complex; transliteration rules depend on the language used, and they are not limited to 1-to-1 mappings between characters. There is a registry of machine-readable rules at http://site.icu-project.org/ , and I've even seen a Python package for transliteration that uses these rules (can't remember the name). But in order to use these "proper" transliteration methods, the language of the text must be known in advance. A general solution using ICU rules would be quite complex: the language of the text should be guessed before transliterating, and if the text has parts written in different languages then it should somehow be split into monolingual parts. A nice project for studying statistics and machine learning, by the way :)

A popular option for transliteration is Unidecode. Unidecode supports many languages; it is small and fast because it uses a simple mapping between unicode codepoints and their ASCII representations.

Unidecode works quite well in practice, but it is plagued with licensing issues. It used to be dual-licensed under the (quite obscure) Perl Artistic License and the GPL; the license was later changed to GPL only; GPL may be a show-stopper in many cases.

Unidecode is a port of the Perl Text::Unidecode library; I've made another (very basic, 10 lines of Perl + 20 lines of Python) port named text-unidecode which is licensed under the Perl Artistic License (thanks to Steven Bird for the idea of re-porting).

For Western languages, removing diacritic marks is often enough to make text 7bit ASCII. In this case external libraries are not needed (thanks to Álvaro Justen for the suggestion).

An example implementation of the (non-GPL) transliteration method selection:

try:
    # Older versions of unidecode are licensed under Artistic License;
    # assume an older version is installed.
    from unidecode import unidecode

    def transliterate(text):
        return unidecode(text).encode('ascii')

except ImportError:
    try:
        # text-unidecode implementation is worse than unidecode's
        # so unidecode is preferred.
        from text_unidecode import unidecode

        def transliterate(text):
            return unidecode(text).encode('ascii')

    except ImportError:
        # I'm not sure about this part. The version below only
        # handles accents; this may be OK for many European languages
        # but will produce empty strings e.g. for Cyrillic.
        # Maybe try a yet another method if this returns an empty string?
        import unicodedata

        def transliterate(text):
            normalized_text = unicodedata.normalize('NFKD', text)
            return normalized_text.encode('ascii', 'ignore')

If GPL is OK, just use Unidecode.

The transliteration speed may be a concern, but transliteration is quite fast with Unidecode (or text-unidecode), and, anyway, it is human reading speed that should be optimized in case of __str__ (and __repr__ ?).

UTF-8 everywhere

The third option is to always encode __str__ and __repr__ results to UTF-8.

The advantages of this approach are that it is consistent, easy to understand, and easy to implement, and that for UTF-8 consoles print and the REPL work great. It also does "the right thing": all strings should be encoded to UTF-8.
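A minimal sketch of this approach (the Person class is made up; the pattern mirrors what Django used):

class Person(object):
    def __init__(self, name):
        self.name = name  # unicode text

    def __unicode__(self):
        return u'<Person: %s>' % self.name

    def __str__(self):
        # always UTF-8, regardless of console or default encodings
        return unicode(self).encode('utf-8')

    __repr__ = __str__  # __repr__ gets the same UTF-8 treatment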

However, in the real world it is not always possible to use only UTF-8 strings:

  • with this approach the REPL and print may fail (produce unreadable output, raise exceptions) when sys.stdout.encoding != 'UTF-8'. This case is very common on Windows or when redirecting output to a file or pipe. There may be hacks to overcome this (wrapping sys.stdout using codecs, anyone?), but hacks are hacks.
  • the %r format specifier is broken and should be avoided (the sys.setdefaultencoding() hack may fix this).
  • It is quite easy to misuse %s and get an exception or bad output.

This approach makes things more convenient for some people at the cost of breaking things for others. I personally don't like this approach (it feels ignorant), but it has many advantages, and it may be viable when

  1. the library is not widely used in a shell context, and
  2. developers are aware of %r issues.

This approach is currently (at the time of writing) used by Django (see https://code.djangoproject.com/ticket/18063).

Python 2.x - 3.x compatible code

All the tricks above are only necessary for Python 2.x, because Python 3.x fixed the main issue: __repr__ and __str__ must return unicode in Python 3.x. Developers shouldn't have to worry about sys.stdout.encoding and many other things under Python 3.x, because unicode is encoded only at the points where the expected encoding is known.

Error handling became stricter; issues like the 2.x %r issue are also not possible under Python 3.x.
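For example, Python 3.x refuses to mix bytes and text implicitly (the exact error message varies between 3.x versions):

>>> b"foo" + "bar"
Traceback (most recent call last):
  ...
TypeError: can't concat str to bytes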

Note

There are still issues though (e.g. http://bugs.python.org/issue1602).

I think that the best way is to write code with Python 3.x semantics (__str__ and __repr__ should return unicode) and have a decorator that fixes them for Python 2.x; this idea is from a thread at django-developers. The decorator may then be removed in the (far?) future, along with dropping Python 2.x support.

The decorator may look like this (a version with transliteration):

import sys

def python_2_unicode_compatible(klass):
    """
    This decorator defines __unicode__ method and fixes
    __repr__ and __str__ methods under Python 2.

    To support Python 2 and 3 with a single code base,
    define __str__ and __repr__ methods returning unicode
    text and apply this decorator to the class.

    Original __repr__ and __str__ would be available
    as _unicode_repr and __unicode__ (under both
    Python 2 and Python 3).
    """
    klass.__unicode__ = klass.__str__
    klass._unicode_repr = klass.__repr__

    if sys.version_info[0] == 2:
        klass.__str__ = lambda self: transliterate(self.__unicode__())
        klass.__repr__ = lambda self: to_7bit(self._unicode_repr())

    return klass

def to_7bit(text):
    # note: encode() doesn't accept keyword arguments under Python 2.x
    return text.encode('ascii', 'backslashreplace')
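A hypothetical usage example, with Python 3.x semantics in the method bodies:

@python_2_unicode_compatible
class MyCls(object):
    def __init__(self, unicode_data):
        self.data = unicode_data

    def __str__(self):
        return self.data  # unicode text

    def __repr__(self):
        return u"<MyCls(%r)>" % (self.data,)  # also unicode text

Under Python 3.x the methods are used as-is; under Python 2.x the class behaves like the MyCls example from the transliteration section above.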

This @python_2_unicode_compatible decorator could make things worse for some people in some cases: if they previously had __repr__ or __str__ return bytes in the same encoding as their terminal uses, then instead of nice readable text they would get a transliterated (hopefully readable) version of __str__ and an unreadable escaped __repr__.

There are (simple?) workarounds: users may check the unicode versions under Python 2.x:

print unicode(obj)
print obj._unicode_repr()

...or start moving to Python 3.

Final notes

I'm going to use the python_2_unicode_compatible decorator (the version with transliteration) in the NLTK port. "UTF-8 everywhere" doesn't look compelling because tty support is very important for NLTK.

The article turned out long; let's hope there are not too many mistakes in it. If you find a factual error, know of better ways to deal with __str__ and __repr__, or have something else to say, I'd be grateful for a note in the comments.
