350 lines
15 KiB
Plaintext
350 lines
15 KiB
Plaintext
============
|
|
Unicode data
|
|
============
|
|
|
|
Django supports Unicode data everywhere.
|
|
|
|
This document tells you what you need to know if you're writing applications
|
|
that use data or templates that are encoded in something other than ASCII.
|
|
|
|
Creating the database
|
|
=====================
|
|
|
|
Make sure your database is configured to be able to store arbitrary string
|
|
data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use
|
|
a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be
|
|
able to store certain characters in the database, and information will be lost.
|
|
|
|
* MySQL users, refer to the `MySQL manual`_ for details on how to set or alter
|
|
the database character set encoding.
|
|
|
|
* PostgreSQL users, refer to the `PostgreSQL manual`_ for details on creating
|
|
databases with the correct encoding.
|
|
|
|
* Oracle users, refer to the `Oracle manual`_ for details on how to set
|
|
(`section 2`_) or alter (`section 11`_) the database character set encoding.
|
|
|
|
* SQLite users, there is nothing you need to do. SQLite always uses UTF-8
|
|
for internal encoding.
|
|
|
|
.. _MySQL manual: https://dev.mysql.com/doc/refman/en/charset-database.html
|
|
.. _PostgreSQL manual: https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.6
|
|
.. _Oracle manual: https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/index.html
|
|
.. _section 2: https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/choosing-character-set.html
|
|
.. _section 11: https://docs.oracle.com/en/database/oracle/oracle-database/18/nlspg/character-set-migration.html
|
|
|
|
All of Django's database backends automatically convert strings into
|
|
the appropriate encoding for talking to the database. They also automatically
|
|
convert strings retrieved from the database into strings. You don't even need
|
|
to tell Django what encoding your database uses: that is handled transparently.
|
|
|
|
For more, see the section "The database API" below.
|
|
|
|
General string handling
|
|
=======================
|
|
|
|
Whenever you use strings with Django -- e.g., in database lookups, template
|
|
rendering or anywhere else -- you have two choices for encoding those strings.
|
|
You can use normal strings or bytestrings (starting with a 'b').
|
|
|
|
.. warning::
|
|
|
|
A bytestring does not carry any information with it about its encoding.
|
|
For that reason, we have to make an assumption, and Django assumes that all
|
|
bytestrings are in UTF-8.
|
|
|
|
If you pass a string to Django that has been encoded in some other format,
|
|
things will go wrong in interesting ways. Usually, Django will raise a
|
|
``UnicodeDecodeError`` at some point.
|
|
|
|
If your code only uses ASCII data, it's safe to use your normal strings,
|
|
passing them around at will, because ASCII is a subset of UTF-8.
|
|
|
|
Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set
|
|
to something other than ``'utf-8'`` you can use that other encoding in your
|
|
bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as
|
|
the result of template rendering (and email). Django will always assume UTF-8
|
|
encoding for internal bytestrings. The reason for this is that the
|
|
:setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the
|
|
application developer). It's under the control of the person installing and
|
|
using your application -- and if that person chooses a different setting, your
|
|
code must still continue to work. Ergo, it cannot rely on that setting.
|
|
|
|
In most cases when Django is dealing with strings, it will convert them to
|
|
strings before doing anything else. So, as a general rule, if you pass
|
|
in a bytestring, be prepared to receive a string back in the result.
|
|
|
|
Translated strings
|
|
------------------
|
|
|
|
Aside from strings and bytestrings, there's a third type of string-like
|
|
object you may encounter when using Django. The framework's
|
|
internationalization features introduce the concept of a "lazy translation" --
|
|
a string that has been marked as translated but whose actual translation result
|
|
isn't determined until the object is used in a string. This feature is useful
|
|
in cases where the translation locale is unknown until the string is used, even
|
|
though the string might have originally been created when the code was first
|
|
imported.
|
|
|
|
Normally, you won't have to worry about lazy translations. Just be aware that
|
|
if you examine an object and it claims to be a
|
|
``django.utils.functional.__proxy__`` object, it is a lazy translation.
|
|
Calling ``str()`` with the lazy translation as the argument will generate a
|
|
string in the current locale.
|
|
|
|
For more details about lazy translation objects, refer to the
|
|
:doc:`internationalization </topics/i18n/index>` documentation.
|
|
|
|
Useful utility functions
|
|
------------------------
|
|
|
|
Because some string operations come up again and again, Django ships with a few
|
|
useful functions that should make working with string and bytestring objects
|
|
a bit easier.
|
|
|
|
Conversion functions
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The ``django.utils.encoding`` module contains a few functions that are handy
|
|
for converting back and forth between strings and bytestrings.
|
|
|
|
* ``smart_str(s, encoding='utf-8', strings_only=False, errors='strict')``
|
|
converts its input to a string. The ``encoding`` parameter
|
|
specifies the input encoding. (For example, Django uses this internally
|
|
when processing form input data, which might not be UTF-8 encoded.) The
|
|
``strings_only`` parameter, if set to True, will result in Python
|
|
numbers, booleans and ``None`` not being converted to a string (they keep
|
|
their original types). The ``errors`` parameter takes any of the values
|
|
that are accepted by Python's ``str()`` function for its error
|
|
handling.
|
|
|
|
* ``force_str(s, encoding='utf-8', strings_only=False, errors='strict')`` is
|
|
identical to ``smart_str()`` in almost all cases. The difference is when the
|
|
first argument is a :ref:`lazy translation <lazy-translations>` instance.
|
|
While ``smart_str()`` preserves lazy translations, ``force_str()`` forces
|
|
those objects to a string (causing the translation to occur). Normally,
|
|
you'll want to use ``smart_str()``. However, ``force_str()`` is useful in
|
|
template tags and filters that absolutely *must* have a string to work with,
|
|
not just something that can be converted to a string.
|
|
|
|
* ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')``
|
|
is essentially the opposite of ``smart_str()``. It forces the first
|
|
argument to a bytestring. The ``strings_only`` parameter has the same
|
|
behavior as for ``smart_str()`` and ``force_str()``. This is
|
|
slightly different semantics from Python's builtin ``str()`` function,
|
|
but the difference is needed in a few places within Django's internals.
|
|
|
|
Normally, you'll only need to use ``force_str()``. Call it as early as
|
|
possible on any input data that might be either a string or a bytestring, and
|
|
from then on, you can treat the result as always being a string.
|
|
|
|
.. _uri-and-iri-handling:
|
|
|
|
URI and IRI handling
|
|
~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Web frameworks have to deal with URLs (which are a type of IRI). One
|
|
requirement of URLs is that they are encoded using only ASCII characters.
|
|
However, in an international environment, you might need to construct a
|
|
URL from an :rfc:`IRI <3987>` -- very loosely speaking, a :rfc:`URI <2396>`
|
|
that can contain Unicode characters. Use these functions for quoting and
|
|
converting an IRI to a URI:
|
|
|
|
* The :func:`django.utils.encoding.iri_to_uri()` function, which implements the
|
|
conversion from IRI to URI as required by :rfc:`3987#section-3.1`.
|
|
|
|
* The :func:`urllib.parse.quote` and :func:`urllib.parse.quote_plus`
|
|
functions from Python's standard library.
|
|
|
|
These two groups of functions have slightly different purposes, and it's
|
|
important to keep them straight. Normally, you would use ``quote()`` on the
|
|
individual portions of the IRI or URI path so that any reserved characters
|
|
such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to
|
|
the full IRI and it converts any non-ASCII characters to the correct encoded
|
|
values.
|
|
|
|
.. note::
|
|
Technically, it isn't correct to say that ``iri_to_uri()`` implements the
|
|
full algorithm in the IRI specification. It doesn't (yet) perform the
|
|
international domain name encoding portion of the algorithm.
|
|
|
|
The ``iri_to_uri()`` function will not change ASCII characters that are
|
|
otherwise permitted in a URL. So, for example, the character '%' is not
|
|
further encoded when passed to ``iri_to_uri()``. This means you can pass a
|
|
full URL to this function and it will not mess up the query string or anything
|
|
like that.
|
|
|
|
An example might clarify things here::
|
|
|
|
>>> from urllib.parse import quote
|
|
>>> from django.utils.encoding import iri_to_uri
|
|
>>> quote('Paris & Orléans')
|
|
'Paris%20%26%20Orl%C3%A9ans'
|
|
>>> iri_to_uri('/favorites/François/%s' % quote('Paris & Orléans'))
|
|
'/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans'
|
|
|
|
If you look carefully, you can see that the portion that was generated by
|
|
``quote()`` in the second example was not double-quoted when passed to
|
|
``iri_to_uri()``. This is a very important and useful feature. It means that
|
|
you can construct your IRI without worrying about whether it contains
|
|
non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the
|
|
result.
|
|
|
|
Similarly, Django provides :func:`django.utils.encoding.uri_to_iri()` which
|
|
implements the conversion from URI to IRI as per :rfc:`3987#section-3.2`.
|
|
|
|
An example to demonstrate::
|
|
|
|
>>> from django.utils.encoding import uri_to_iri
|
|
>>> uri_to_iri('/%E2%99%A5%E2%99%A5/?utf8=%E2%9C%93')
|
|
'/♥♥/?utf8=✓'
|
|
>>> uri_to_iri('%A9hello%3Fworld')
|
|
'%A9hello%3Fworld'
|
|
|
|
In the first example, the UTF-8 characters are unquoted. In the second, the
|
|
percent-encodings remain unchanged because they lie outside the valid UTF-8
|
|
range or represent a reserved character.
|
|
|
|
Both ``iri_to_uri()`` and ``uri_to_iri()`` functions are idempotent, which means the
|
|
following is always true::
|
|
|
|
iri_to_uri(iri_to_uri(some_string)) == iri_to_uri(some_string)
|
|
uri_to_iri(uri_to_iri(some_string)) == uri_to_iri(some_string)
|
|
|
|
So you can safely call it multiple times on the same URI/IRI without risking
|
|
double-quoting problems.
|
|
|
|
Models
|
|
======
|
|
|
|
Because all strings are returned from the database as ``str`` objects, model
|
|
fields that are character based (CharField, TextField, URLField, etc.) will
|
|
contain Unicode values when Django retrieves data from the database. This
|
|
is *always* the case, even if the data could fit into an ASCII bytestring.
|
|
|
|
You can pass in bytestrings when creating a model or populating a field, and
|
|
Django will convert it to strings when it needs to.
|
|
|
|
Taking care in ``get_absolute_url()``
|
|
-------------------------------------
|
|
|
|
URLs can only contain ASCII characters. If you're constructing a URL from
|
|
pieces of data that might be non-ASCII, be careful to encode the results in a
|
|
way that is suitable for a URL. The :func:`~django.urls.reverse` function
|
|
handles this for you automatically.
|
|
|
|
If you're constructing a URL manually (i.e., *not* using the ``reverse()``
|
|
function), you'll need to take care of the encoding yourself. In this case,
|
|
use the ``iri_to_uri()`` and ``quote()`` functions that were documented
|
|
above_. For example::
|
|
|
|
from urllib.parse import quote
|
|
from django.utils.encoding import iri_to_uri
|
|
|
|
def get_absolute_url(self):
|
|
url = '/person/%s/?x=0&y=0' % quote(self.location)
|
|
return iri_to_uri(url)
|
|
|
|
This function returns a correctly encoded URL even if ``self.location`` is
|
|
something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()``
|
|
call isn't strictly necessary in the above example, because all the
|
|
non-ASCII characters would have been removed in quoting in the first line.)
|
|
|
|
.. _above: `URI and IRI handling`_
|
|
|
|
Templates
|
|
=========
|
|
|
|
Use strings when creating templates manually::
|
|
|
|
from django.template import Template
|
|
t2 = Template('This is a string template.')
|
|
|
|
But the common case is to read templates from the filesystem. If your template
|
|
files are not stored with a UTF-8 encoding, adjust the :setting:`TEMPLATES`
|
|
setting. The built-in :py:mod:`~django.template.backends.django` backend
|
|
provides the ``'file_charset'`` option to change the encoding used to read
|
|
files from disk.
|
|
|
|
The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates.
|
|
This is set to UTF-8 by default.
|
|
|
|
Template tags and filters
|
|
-------------------------
|
|
|
|
A couple of tips to remember when writing your own template tags and filters:
|
|
|
|
* Always return strings from a template tag's ``render()`` method
|
|
and from template filters.
|
|
|
|
* Use ``force_str()`` in preference to ``smart_str()`` in these
|
|
places. Tag rendering and filter calls occur as the template is being
|
|
rendered, so there is no advantage to postponing the conversion of lazy
|
|
translation objects into strings. It's easier to work solely with
|
|
strings at that point.
|
|
|
|
.. _unicode-files:
|
|
|
|
Files
|
|
=====
|
|
|
|
If you intend to allow users to upload files, you must ensure that the
|
|
environment used to run Django is configured to work with non-ASCII file names.
|
|
If your environment isn't configured correctly, you'll encounter
|
|
``UnicodeEncodeError`` exceptions when saving files with file names that
|
|
contain non-ASCII characters.
|
|
|
|
Filesystem support for UTF-8 file names varies and might depend on the
|
|
environment. Check your current configuration in an interactive Python shell by
|
|
running::
|
|
|
|
import sys
|
|
sys.getfilesystemencoding()
|
|
|
|
This should output "UTF-8".
|
|
|
|
The ``LANG`` environment variable is responsible for setting the expected
|
|
encoding on Unix platforms. Consult the documentation for your operating system
|
|
and application server for the appropriate syntax and location to set this
|
|
variable.
|
|
|
|
In your development environment, you might need to add a setting to your
|
|
``~.bashrc`` analogous to:::
|
|
|
|
export LANG="en_US.UTF-8"
|
|
|
|
Form submission
|
|
===============
|
|
|
|
HTML form submission is a tricky area. There's no guarantee that the
|
|
submission will include encoding information, which means the framework might
|
|
have to guess at the encoding of submitted data.
|
|
|
|
Django adopts a "lazy" approach to decoding form data. The data in an
|
|
``HttpRequest`` object is only decoded when you access it. In fact, most of
|
|
the data is not decoded at all. Only the ``HttpRequest.GET`` and
|
|
``HttpRequest.POST`` data structures have any decoding applied to them. Those
|
|
two fields will return their members as Unicode data. All other attributes and
|
|
methods of ``HttpRequest`` return data exactly as it was submitted by the
|
|
client.
|
|
|
|
By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding
|
|
for form data. If you need to change this for a particular form, you can set
|
|
the ``encoding`` attribute on an ``HttpRequest`` instance. For example::
|
|
|
|
def some_view(request):
|
|
# We know that the data must be encoded as KOI8-R (for some reason).
|
|
request.encoding = 'koi8-r'
|
|
...
|
|
|
|
You can even change the encoding after having accessed ``request.GET`` or
|
|
``request.POST``, and all subsequent accesses will use the new encoding.
|
|
|
|
Most developers won't need to worry about changing form encoding, but this is
|
|
a useful feature for applications that talk to legacy systems whose encoding
|
|
you cannot control.
|
|
|
|
Django does not decode the data of file uploads, because that data is normally
|
|
treated as collections of bytes, rather than strings. Any automatic decoding
|
|
there would alter the meaning of the stream of bytes.
|