Python: how is sys.stdout.encoding chosen?

2007-05-14

Or what to do if your Python programs complain they can’t output a string because of encoding problems.

Like this:

>>> print u'\N{left-pointing double angle quotation mark}'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 0: ordinal not in range(128)

It can’t perform the output because sys.stdout.encoding is ‘ascii’ and the ascii codec can’t encode that weird unicode character. I think I approve of this strict approach. It encourages explicitness and The Right Thing.

Okay. But in this instance I know what I’m doing and I want stdout to be treated as if it uses the UTF-8 encoding.

It turns out that Python picks the value for sys.stdout.encoding based on the value of the environment variable LC_CTYPE, so the following works:

$ LC_CTYPE=en_GB.utf-8 python
>>> print u'\N{left-pointing double angle quotation mark}'
«

It even prints out the right character.

BUT WHERE THE HELL IS THIS DOCUMENTED?


24 Responses to “Python: how is sys.stdout.encoding chosen?”

  1. glorkspangle Says:

    Yes, documentation for sys.stdout et al in the library reference is woeful. Google finds a mailing list posting telling me that sys.stdout.encoding comes from nl_langinfo(CODESET) (see nl_langinfo(3)).
    I believe you can also set this programmatically, but not in the way you would want. See the StreamRewriter code on this page:

    http://www.reportlab.com/i18n/python_unicode_tutorial.html
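    (A quick way to see from Python itself what nl_langinfo(CODESET) reports, using the standard locale module; Unix only, and a sketch rather than what the interpreter does at startup:)

```python
import locale

# adopt the encoding implied by the environment (LC_CTYPE et al.)
locale.setlocale(locale.LC_CTYPE, '')

# CODESET names the locale's character encoding, e.g. 'UTF-8'
codeset = locale.nl_langinfo(locale.CODESET)
print(codeset)
```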

  2. Gareth Rees Says:

    I think the decision to have ‘strict’ error reporting on stdout is a mistake. It violates the law of least surprise, which is that when you execute print foo you hope going to see at least some attempt at a representation of foo, however mangled it might be by your terminal. At least then you know where you stand and can figure out what to do next. If you’re new to the language and not yet familiar with encodings, it can be pretty dispiriting struggling with endless UnicodeEncodeErrors. The default error handling ought to be ‘replace’ or something similar.

    (From the Python mailing list, a similar rant and a typical user difficulty.)
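    (For illustration, here's what the ‘strict’ versus ‘replace’ error handlers do when encoding to ascii; a sketch of the error modes, not of what Python actually wires up for stdout:)

```python
u = u'\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}'

# 'strict' (the default) raises UnicodeEncodeError
try:
    u.encode('ascii')
    raised = False
except UnicodeEncodeError:
    raised = True

# 'replace' substitutes '?' for the unencodable character instead of raising
mangled = u.encode('ascii', 'replace')
```

    With ‘replace’ you'd at least see a mangled `?` on your terminal rather than a traceback.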

  3. Gareth Rees Says:

    Indeed, Googling for UnicodeEncodeError turns up a lot of Python users having trouble.

  4. drj11 Says:

    I think you make a good point. I would certainly be very annoyed if I was having some unrelated problem and inserted a debugging print only to find that I had to recursively debug a unicode encoding issue.

    As it happens ‘replace’ in my case would’ve meant a bug that I probably wouldn’t have spotted.

    Now that I've investigated a bit, Python’s behaviour turns out to be stubborn and arbitrary. When stdout is not a terminal, LC_CTYPE does not determine sys.stdout.encoding. A consequence is that LC_CTYPE=en_GB.utf-8 python -c 'print u"\N{left-pointing double angle quotation mark}"' works, but piping it through tee gives a UnicodeEncodeError.

    It’s perfectly reasonable for Python to have some idea of what the encoding for sys.stdout should be, but it’s equally reasonable for a Python program to want to change it (for example, I might have a Python program that converts from one specified encoding on stdin to another on stdout). I should just be able to change sys.stdout.encoding without having to jump through hoops.


  5. The reason for this behavior is that changing sys.stdout doesn’t make sense. If your terminal is ASCII, setting the stdout to UTF-8 does not magically make your terminal capable of accepting UTF-8.

    The problem here is sys.stdout not getting its encoding set correctly for the terminal it is in, and that’s at least partially because the terminals themselves are still pretty fuzzy on the whole encoding issue. As that improves, this should go away.

    It’s an inconvenient behavior, yes, but the inconvenience is fundamental; any other behavior would be actively wrong, even if you’d like it better for not being so twitchy.

  6. drj11 Says:

    Who says sys.stdout is attached to my terminal? Why does LC_CTYPE affect sys.stdout.encoding only when Python reckons that it is a terminal? The Single Unix Specification doesn’t say that LC_CTYPE should only apply to the terminal’s encoding. Why doesn’t Python use it when its output is redirected to a file?

    What if I change the encoding my terminal uses whilst I’m in the middle of an interactive Python session? This is perfectly easy to do using Terminal on OS X.

    What if I unplug my ASCII terminal from the serial port and plug in an EBCDIC terminal? What if I download a new font to the soft graphics area that changes the graphics from looking like ISO-8859-1 to ISO-8859-15?

    What if my Python application is writing out on stdout a file in some particular format and that format has some sections in UTF-8 and some in ISO-8859-1? Tar files, Zip files, EXIF files with comments in different encodings could all easily be that sort of file.

    But really, the killer for me seems to be, WHAT IF PYTHON GETS IT WRONG AND I KNOW BETTER!?

  7. arkanes Says:

    If you know better, then you explicitly encode your output into the correct format. Personally, I don’t even think the ascii default should be attempted and any IO with a unicode object should be an error.

    I don’t think you really understand exactly how unicode support in Python (or terminals) works. Python has 2 types of strings, string and unicode objects. The string object is simply a sequence of bytes. The unicode object is an idealized sequence of unicode characters, in an internal format determined at Python build time.

    IO *always* is done with string objects (that is, sequences of bytes). You can’t write a unicode object to a file or print it to a terminal. When you attempt to do so, Python converts (encodes) it to a string object, by default using the ascii encoding. This is a reasonable lowest common denominator default – it’s almost guaranteed to work on any reasonable IO device (7 bit protocols being a notable exception).

    You can *always* explicitly encode your unicode objects into whatever encoding you want. That’s exactly what you’d do with mixed encoding files, and it’s what you should do in general.
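    (Explicit encoding in that style looks like the following; a sketch, with the expected bytes assuming UTF-8 as the target encoding:)

```python
u = u'\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}'

# encode once, explicitly, into the encoding the consumer expects;
# the result is a plain byte string, safe to write to any stream
data = u.encode('utf-8')
assert data == b'\xc2\xab'  # UTF-8 for U+00AB
```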

    Also, anyone who knows enough about unicode issues to worry about changing the default output encoding should certainly know better than to think that the font used by the terminal has any bearing on the output encoding.


  8. “Who says sys.stdout is attached to my terminal?” – My mistake. Read “stream” for terminal.

    And the answer to all your questions really boils down to “Nothing is really designed to handle Unicode and multiple encodings properly.” To take one example: What if you change your encoding in your terminal on the fly? Well, how is Python supposed to know that, anyhow? Right now, there (probably) isn’t a way, even in theory, because there are just too many layers between the terminal itself and Python (the terminal, the shell, the library being used to manipulate the terminal, then Python itself), almost none of which can really deal with encodings properly by communicating with the other layers about the encoding in real time, because all of these layers basically date from 1970.

    Again, Python’s really the only part of that entire stack doing it right. All the others are fuzzy at best, simply lacking the information at worst.

  9. drj11 Says:

    @arkanes: I think I do understand how unicode support in Python works. Enough to know that when I write a unicode string to a file stream, f, the unicode string gets converted to a string using the encoding of the file stream (that’s «f.encoding»), not the default encoding (which is returned by the function sys.getdefaultencoding).

    When the endpoint of a stream is a video display terminal then the meaning of the characters that are output is exactly what shapes get displayed. If I have a terminal (a real one, for the sake of concreteness) that displays the octets 160-255 using the shapes of graphics from ISO 8859-1 then the terminal is effectively using the iso-8859-1 encoding. If I redefine the graphics for octet 0xA3 (POUND SIGN in ISO 8859-1) so that it is drawn as a EURO SIGN then I have effectively changed the encoding of the terminal to be ISO 8859-15. It’s all about what shapes the terminal displays.

    Then there’s cut-and-paste.

    I agree with you about bizarre files with multiple-encodings. Explicit is best.

    In my application I wanted the output in UTF-8, and I wanted it on stdout (not necessarily my terminal). Changing sys.stdout.encoding seems like a reasonable approach to me. Certainly easier than looking up a codec, creating a wrapped file stream that uses the codec, and then replacing sys.stdout. There’s no other way to change the encoding for sys.stdout, is there?

    Explicit encoding when my entire stream is in the same encoding is tedious. It’s also error prone because I have to make sure that I use the same encoding every time I explicitly encode a unicode string. In that case, which is surely very common, setting the encoding on the file stream and writing out unicode strings seems sensible.

  10. drj11 Says:

    @Jeremy: Well, a suggestion for what happens when I change my Terminal encoding on OS X during an interactive Python session is that I should be able to change sys.stdout.encoding by assigning to it. I just don’t think that Python is holding up its end of the stack very well, though I appreciate your comments about the rest of the stack upon which Python sits. If my application does have some way of knowing that the encoding on stdout has changed (for example, asking me, the user) then Python doesn’t make it easy for the application to make use of this knowledge.

    It’s not unreasonable to have a program that wants to encode its output in a particular encoding. The example I gave earlier still seems reasonable to me: a program that takes input in one encoding and recodes to a different encoding on its output, with both the input and output encoding specified on the command line. Clearly such a program should be able to use stdin and stdout so that it can form part of a pipe. So how in Python do I change sys.stdout to use a particular encoding? It’s a right pain.
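    (A sketch of such a recoding filter; the in-memory streams here are hypothetical stand-ins for stdin and stdout:)

```python
import codecs
import io

def recode(instream, outstream, in_enc, out_enc):
    # wrap the byte streams once, then copy text through them
    reader = codecs.getreader(in_enc)(instream)
    writer = codecs.getwriter(out_enc)(outstream)
    for chunk in iter(lambda: reader.read(4096), u''):
        writer.write(chunk)

# hypothetical in-memory streams standing in for stdin/stdout
src = io.BytesIO(b'\xab')             # U+00AB encoded in latin-1
dst = io.BytesIO()
recode(src, dst, 'latin-1', 'utf-8')
assert dst.getvalue() == b'\xc2\xab'  # same character, now in UTF-8
```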

    You invent too many layers. The shell doesn’t get in the way of Python and Python’s stdout. And there are essentially no libraries being used to manipulate the terminal (which remember, may or may not be stdout), there’s ioctl(2) and tty(4).

    Basically it’s Python and a (mostly) transparent 8-bit data stream. Yes, on Unix there’s no reasonable way to determine the encoding that should be used. But that doesn’t mean that Python knows best, it means that the programmer should know best and Python should let the programmer specify it easily. The more I think about it, the more C’s approach seems better (inconceivable).

    What it comes down to is this:

    – Python chooses sys.stdout.encoding based on the environment variable LC_CTYPE when stdout is a tty. THIS IS WRONG. Thought experiment: stdout and stderr go to different ttys each with a different encoding (this is trivial to set up using OS X). But Python will set the same encoding (based on LC_CTYPE) for both sys.stdout and sys.stderr. How can that be right?
    – Python makes sys.stdout.encoding read-only so when Python has picked the wrong value for it, it makes it hard to change.

    Python should either use LC_CTYPE as a default encoding for all streams regardless of whether they’re a tty or not; or, it should ignore LC_CTYPE.

    Python should allow sys.stdout.encoding to be written. Programmer knows best.

    There should be an ioctl(2) for getting and setting the terminal encoding. Then the drop-down in OS X Terminal that lets you select the encoding could actually influence correctly written programs.

  11. Tom Smith Says:

    Thank you thank you… to be honest, I don’t really get encodings.. I just want something that works… I know I spent days trying to figure this out before I could have known what it might be… my usual solution being to wrap [] around any print statements…

  12. drj11 Says:

    You’re welcome.

  13. Annoyed Says:

    drj11, you’re so right.

    I just want to write a small script that dumps a few utf-8 strings to stdout regardless of what encoding the terminal supports (the output is meant to go mainly to files).

    It’s incredibly stupid that python tries to guess the encoding for stdout based on LC_CTYPE (or whatever) and doesn’t let me change it in a simple way.

    There is no problem if the terminal gets garbled up (this is the way it should be). A much bigger problem is some irrelevant UnicodeEncodeError that kills the script.

    I seriously hope that this behaviour will be changed in the future. Python should guess (and depend on) the surrounding environment as little as possible. It makes Python less flexible.
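    (For what it’s worth, Python 2.6, released after this post, grew a PYTHONIOENCODING environment variable that overrides the guess for the standard streams. A sketch of checking it from a subprocess:)

```python
import os
import subprocess
import sys

# PYTHONIOENCODING (Python 2.6+) overrides the guessed stream encoding
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c',
     'import sys; sys.stdout.write(sys.stdout.encoding)'],
    env=env)
assert out == b'utf-8'
```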

  14. afrobeard Says:

    Code Monk, thanks for the LC_CTYPE tip; it helped me on my Linux box.

    However, the LC_CTYPE variable didn’t affect the deployment test server at my workplace [sigh! A windows machine, but not for long]. Further reading revealed it has something to do with code pages, since the Windows XP console does not support Unicode by default.

    For further information, read:

    http://mail.python.org/pipermail/python-list/2003-April/200079.html

    and http://www.perlmonks.org/?node_id=329433

    Hope this helps someone :).

  15. Rafael Xavier Says:

    I’m glad to see most of you share the same point of view as I do. It’s unbelievable python treats it that way!

  16. Rafael Xavier Says:

    When a unicode object is printed (or written to a file), it is converted to a “string”. If encode() is not called explicitly, it is converted using the stream’s encoding (.encoding).

    One option:
    >>> print u'\N{left-pointing double angle quotation mark}'.encode('utf-8')

  17. Eljay Says:

    Thank you for these Q&A! Python Unicode Strings, demystified.

    I now understand what is happening, and my prints (to stdout) no longer get the dreaded …

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

    …since I now encode my Unicode strings as strings (bytes) for output, and my terminal handles UTF-8 just fine. And, most importantly, I understand WHY I have to do that, and WHERE I have to do that.

  18. Eric Says:

    Rafael Xavier, thanks for the encode() tip – exactly what I was looking for!

  19. Rasmus Kaj Says:

    The fact that python respects LC_CTYPE automatically when output is a terminal but not when writing to something else is a BUG IN PYTHON, plain and simple.

    How else can you describe the fact that:

    ./myprog.py

    works just fine, but

    ./myprog.py | grep foo

    terminates with an exception about encoding to ASCII?

    Yes, I understand that unicode strings have to be converted to the output encoding, but I see no reason for the output encoding to default to ascii even though LC_CTYPE is set, as it apparently can be automagically set to utf-8 in other cases.

    And of course, it should be simple to set the output encoding explicitly; having to write an encoder class, or to explicitly encode each string printed, is ridiculous!

  20. Merwok Says:

    I like your point of view. Did you expose it on the mailing-list or file a bug?

    Best, Merwok

  21. drj11 Says:

    @Merwok: As far as I can tell all the Python people know about this behaviour and they think it is the right thing.

  22. toka Says:

    This issue is continually stealing my time …

    One workaround I found to switch to UTF-8 output is:

    import sys
    import codecs
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)
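    (The same wrapping trick can be demonstrated against a hypothetical in-memory byte stream standing in for stdout; a sketch:)

```python
import codecs
import io

# hypothetical stand-in for a byte-oriented stdout
fake_stdout = io.BytesIO()
wrapped = codecs.getwriter('utf8')(fake_stdout)

# unicode written to the wrapper is encoded to UTF-8 on the way out
wrapped.write(u'\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}')
assert fake_stdout.getvalue() == b'\xc2\xab'
```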
    
  23. dingo Says:

    what about in jython?

  24. drj11 Says:

    @dingo: Jython? People really use that?

