Title: WSGI Unicode Handling
Author: Armin Ronacher <email@example.com>
This proposal has been rejected, mainly for the following reasons:
From Ian Bicking:
I’ll add some commentary here, since I was the primary critic (of the limited audience before Armin withdrew this specification). Leaving this proposal here hopefully will be useful to later people considering this problem.
Changing the response app_iter is pretty heavy, and isn’t really an extension to WSGI; it’s a change to the core specification. Current WSGI implementors really expect str responses. When str goes away in Python 3000, they will have to expect bytes responses too, but that’s a relatively straightforward (though not trivial) change. Dealing with backward compatibility is quite difficult.
The use case I personally see in this is avoiding the confusion and overhead of encoding and decoding responses when there are intermediaries which handle the response in its unicode form. This is not uncommon – for instance, XML processing happens on unicode data, and ideally all text responses should be handled as unicode. Deciding the encoding, and then doing the proper decoding, is not completely trivial (though not terribly hard). It is hard enough that people will avoid it, and have avoided it, potentially working with str data when that was not correct. Similarly, it is important to send either properly-encoded data, or to change the encoding in the headers. Since encoding information can show up in multiple places (unfortunately), this can also be error-prone.
Despite these problems, sending unencoded data opens up a whole bunch of other problems, and realistically we get the union of all problems because we definitely cannot remove the sending of encoded text data. So everyone has to deal with both cases now, instead of just one case.
Anyway, that’s my take on this. – Ian
This specification proposes a possible implementation of unicode
support in WSGI. Currently all WSGI applications have to output
str objects.
Python ships two types of strings subclassing the abstract base class
basestring: str and unicode. In Python 3 unicode will be renamed to
str and a new class bytes will be introduced
(PEP 3100#atomic-types, PEP 3137). Also today many developers use
unicode objects because they support a wider range of characters
and functions like len() still return the correct output, even
when using multibyte encodings like utf-8.
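The point about len() can be illustrated with a small sketch. It is written for Python 3, which is an assumption on top of this text: there str plays the role of Python 2's unicode and bytes the role of Python 2's str. The variable names are made up for the example.

```python
# Python 3 sketch: str here corresponds to Python 2's unicode,
# and bytes corresponds to Python 2's str.
text = "H\u00e4llo W\u00f6rld"  # "Hällo Wörld", written with escapes

# The length of the text counts characters (code points).
char_count = len(text)

# The length of the encoded form counts bytes and depends on the encoding.
utf8_count = len(text.encode("utf-8"))
latin9_count = len(text.encode("iso-8859-15"))

print(char_count, utf8_count, latin9_count)  # prints: 11 13 11
```

The two umlauts take two bytes each in utf-8, so the byte length differs from the character length, while the single-byte iso-8859-15 encoding happens to match it.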
But at the moment all WSGI applications have to yield str objects,
which requires that users encode their data to a specific encoding by
hand. WSGI middlewares don’t know about the charset the application is
using.
A possible solution would be a new key in the environ called
wsgi.charset. The WSGI gateway would set this to None
per default, which means that yielding unicode objects results
in an exception. But if the charset is correctly defined, all returned
unicode objects get encoded in the defined encoding by the WSGI
gateway.
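The gateway-side rule could look roughly like the following. This is a minimal sketch under stated assumptions, not part of the proposal: it uses Python 3 (where str stands in for Python 2's unicode), and the helper name encode_chunk is hypothetical.

```python
def encode_chunk(environ, chunk):
    """Encode one response chunk as the proposal describes (hypothetical helper)."""
    if isinstance(chunk, bytes):
        # Already encoded by the application: pass it through untouched.
        return chunk
    charset = environ.get("wsgi.charset")
    if charset is None:
        # Default state: text output without a declared charset is an error.
        raise AssertionError("application returned text without a defined charset")
    return chunk.encode(charset)


print(encode_chunk({"wsgi.charset": "utf-8"}, "H\u00e4llo"))  # b'H\xc3\xa4llo'
```

Keeping bytes untouched means existing applications that already encode their output behave exactly as before.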
Middlewares could use this value to convert incoming form data to unicode automatically, so that the application developer doesn’t have to take care of this issue.
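A middleware of that kind might be sketched as follows. Everything here is an assumption made for illustration: the code targets Python 3, the class name DecodedQueryMiddleware and the environ key example.decoded_query are invented, and only the query string is handled, not request bodies.

```python
from urllib.parse import parse_qs


class DecodedQueryMiddleware:
    """Hypothetical middleware: decodes the query string using wsgi.charset."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        charset = environ.get("wsgi.charset")
        if charset is not None:
            # 'example.decoded_query' is an invented key, not part of any spec.
            environ["example.decoded_query"] = parse_qs(
                environ.get("QUERY_STRING", ""), encoding=charset
            )
        return self.app(environ, start_response)


def echo_app(environ, start_response):
    # Minimal application that just exposes what the middleware decoded.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [repr(environ.get("example.decoded_query")).encode("ascii", "replace")]


request = {"QUERY_STRING": "name=H%E4llo", "wsgi.charset": "iso-8859-15"}
DecodedQueryMiddleware(echo_app)(request, lambda status, headers: None)
decoded = request["example.decoded_query"]
```

With the charset set to iso-8859-15, the percent-escape %E4 is decoded to the character ä, so the application receives text instead of raw bytes.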
If this environment key is updated by the application, middlewares
would still see None as the charset because it’s updated on first
iteration only. So an application developer would need to wrap the
whole application, including middlewares, again with a new
middleware that updates this key. Another possibility would be that
the WSGI gateway provides a configuration value for the charset.
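The ordering problem and the wrapping workaround can be sketched like this. It is an illustration under assumptions: Python 3 is used, and RecordingMiddleware, CharsetMiddleware and run are hypothetical names invented for the example.

```python
def app(environ, start_response):
    # Generator application: this body does not run until the first iteration.
    environ["wsgi.charset"] = "utf-8"
    start_response("200 OK", [("Content-Type", "text/plain")])
    yield "Hello"


class RecordingMiddleware:
    """Hypothetical middleware that notes the charset visible at call time."""

    def __init__(self, app):
        self.app = app
        self.seen = []

    def __call__(self, environ, start_response):
        self.seen.append(environ.get("wsgi.charset"))
        return self.app(environ, start_response)


class CharsetMiddleware:
    """Hypothetical outermost wrapper that sets the charset before anything else runs."""

    def __init__(self, app, charset):
        self.app = app
        self.charset = charset

    def __call__(self, environ, start_response):
        environ["wsgi.charset"] = self.charset
        return self.app(environ, start_response)


def run(stack):
    # Stand-in for a gateway: fresh environ, then exhaust the response.
    environ = {"wsgi.charset": None}
    return list(stack(environ, lambda status, headers, exc_info=None: None))


inner = RecordingMiddleware(app)
run(inner)                                 # the middleware sees None
run(CharsetMiddleware(inner, "utf-8"))     # the middleware sees 'utf-8'
print(inner.seen)                          # [None, 'utf-8']
```

In the first run the application sets the key only once iteration starts, which is after the middleware has already looked at the environ. The outer wrapper fixes the order at the cost of one more layer.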
When encoding the output of the WSGI application, the gateway must also
look up the wsgi.charset key each time a unicode object is
found. Caching won’t work because the application must be able to
change the charset before each iteration:
```python
# -*- coding: utf-8 -*-

def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    # The charset may change between chunks.
    environ['wsgi.charset'] = 'utf-8'
    yield u'Hällo Wörld'
    environ['wsgi.charset'] = 'iso-8859-15'
    yield u'Hällo Wörld'
```
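To make the per-iteration lookup concrete, here is a small in-memory driver. It is a sketch under assumptions rather than a gateway: it is written for Python 3 (str in place of Python 2's unicode), the function name collect is hypothetical, and the application is restated in Python 3 syntax so the block runs on its own.

```python
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    environ["wsgi.charset"] = "utf-8"
    yield "H\u00e4llo W\u00f6rld"
    environ["wsgi.charset"] = "iso-8859-15"
    yield "H\u00e4llo W\u00f6rld"


def collect(application):
    """Hypothetical driver: encode each chunk with the charset valid at that moment."""
    environ = {"wsgi.charset": None}
    chunks = []
    for data in application(environ, lambda status, headers, exc_info=None: None):
        if isinstance(data, str):
            # Read the key for every chunk; a cached value would be stale
            # as soon as the application changes it.
            data = data.encode(environ["wsgi.charset"])
        chunks.append(data)
    return chunks


chunks = collect(app)
print(chunks)  # [b'H\xc3\xa4llo W\xc3\xb6rld', b'H\xe4llo W\xf6rld']
```

The same text ends up as two different byte sequences, which is only possible because the driver reads the key again for the second chunk.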
Here is a very simple CGI gateway that implements this functionality:
```python
# Python 2 code: unicode and the three-argument raise are Python 2 only.
import os
import sys


def run_with_cgi(app, charset=None):
    environ = dict(os.environ.items())
    environ['wsgi.charset'] = charset
    environ['wsgi.input'] = sys.stdin
    environ['wsgi.errors'] = sys.stderr
    environ['wsgi.version'] = (1, 0)
    environ['wsgi.multithread'] = False
    environ['wsgi.multiprocess'] = True
    environ['wsgi.run_once'] = True

    if environ.get('HTTPS', 'off').lower() in ('on', '1'):
        environ['wsgi.url_scheme'] = 'https'
    else:
        environ['wsgi.url_scheme'] = 'http'

    headers_set = []
    headers_sent = []

    def write(data):
        if not headers_set:
            raise AssertionError('write() before start_response()')
        elif not headers_sent:
            # Before the first output, send the stored headers.
            status, response_headers = headers_sent[:] = headers_set
            sys.stdout.write('Status: %s\r\n' % status)
            for header in response_headers:
                sys.stdout.write('%s: %s\r\n' % header)
            sys.stdout.write('\r\n')

        if isinstance(data, unicode):
            # Look up the charset on every call; the application may
            # change it between iterations.
            charset = environ['wsgi.charset']
            if charset is None:
                raise AssertionError('application returned unicode without '
                                     'defined charset')
            data = data.encode(charset)

        sys.stdout.write(data)
        sys.stdout.flush()

    def start_response(status, response_headers, exc_info=None):
        if exc_info:
            try:
                if headers_sent:
                    # Re-raise the original exception if headers were sent.
                    raise exc_info[0], exc_info[1], exc_info[2]
            finally:
                exc_info = None  # avoid a dangling circular reference
        elif headers_set:
            raise AssertionError('Headers already set!')

        headers_set[:] = [status, response_headers]
        return write

    result = app(environ, start_response)
    try:
        for data in result:
            if data:  # don't send headers until body appears
                write(data)
        if not headers_sent:
            write('')  # send headers now if the body was empty
    finally:
        if hasattr(result, 'close'):
            result.close()
```