symbian-qemu-0.9.1-12/python-2.6.1/Doc/library/codecs.rst
changeset 1 2fb8b9db1c86
0:ffa851df0825 1:2fb8b9db1c86
       
     1 
       
     2 :mod:`codecs` --- Codec registry and base classes
       
     3 =================================================
       
     4 
       
     5 .. module:: codecs
       
     6    :synopsis: Encode and decode data and streams.
       
     7 .. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
       
     8 .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
       
     9 .. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>
       
    10 
       
    11 
       
    12 .. index::
       
    13    single: Unicode
       
    14    single: Codecs
       
    15    pair: Codecs; encode
       
    16    pair: Codecs; decode
       
    17    single: streams
       
    18    pair: stackable; streams
       
    19 
       
    20 This module defines base classes for standard Python codecs (encoders and
       
    21 decoders) and provides access to the internal Python codec registry which
       
    22 manages the codec and error handling lookup process.
       
    23 
       
    24 It defines the following functions:
       
    25 
       
    26 
       
    27 .. function:: register(search_function)
       
    28 
       
    29    Register a codec search function. Search functions are expected to take one
       
    30    argument, the encoding name in all lower case letters, and return a
       
    31    :class:`CodecInfo` object having the following attributes:
       
    32 
       
    33    * ``name`` The name of the encoding;
       
    34 
       
    35    * ``encode`` The stateless encoding function;
       
    36 
       
    37    * ``decode`` The stateless decoding function;
       
    38 
       
    39    * ``incrementalencoder`` An incremental encoder class or factory function;
       
    40 
       
    41    * ``incrementaldecoder`` An incremental decoder class or factory function;
       
    42 
       
    43    * ``streamwriter`` A stream writer class or factory function;
       
    44 
       
    45    * ``streamreader`` A stream reader class or factory function.
       
    46 
       
    47    The various functions or classes take the following arguments:
       
    48 
       
    49    *encode* and *decode*: These must be functions or methods which have the same
       
    50    interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
       
    51    Codec Interface). The functions/methods are expected to work in a stateless
       
    52    mode.
       
    53 
       
    54    *incrementalencoder* and *incrementaldecoder*: These have to be factory
       
    55    functions providing the following interface:
       
    56 
       
    57    ``factory(errors='strict')``
       
    58 
       
    59    The factory functions must return objects providing the interfaces defined by
       
    60    the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
       
    61    respectively. Incremental codecs can maintain state.
       
    62 
       
    63    *streamreader* and *streamwriter*: These have to be factory functions providing
       
    64    the following interface:
       
    65 
       
    66    ``factory(stream, errors='strict')``
       
    67 
       
    68    The factory functions must return objects providing the interfaces defined by
       
    69    the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
       
    70    Stream codecs can maintain state.
       
    71 
       
    72    Possible values for errors are ``'strict'`` (raise an exception in case of an
       
    73    encoding error), ``'replace'`` (replace malformed data with a suitable
       
    74    replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and
       
    75    continue without further notice), ``'xmlcharrefreplace'`` (replace with the
       
    76    appropriate XML character reference (for encoding only)) and
       
    77    ``'backslashreplace'`` (replace with backslashed escape sequences (for encoding
       
    78    only)) as well as any other error handling name defined via
       
    79    :func:`register_error`.
       
    80 
       
    81    In case a search function cannot find a given encoding, it should return
       
    82    ``None``.
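As a minimal sketch of the search function contract described above, the following registers a function that maps a made-up alias onto the existing UTF-8 codec. The alias ``demoutf8`` and the function name ``demo_search`` are assumptions chosen for illustration; they are not part of the module. The ``u''``/``b''`` literals are used so that the snippet is intended to run on both older and current interpreters.

```python
import codecs


def demo_search(name):
    # The registry passes the encoding name in lower case.
    # "demoutf8" is a hypothetical alias used only for this illustration.
    if name == "demoutf8":
        return codecs.lookup("utf-8")  # reuse an existing CodecInfo object
    return None  # not ours: let other search functions try


codecs.register(demo_search)

info = codecs.lookup("DemoUTF8")  # the name is normalized before the search
encoded = u"caf\xe9".encode("demoutf8")
```

Returning ``None`` for every other name keeps the function from shadowing codecs that other search functions provide.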
       
    83 
       
    84 
       
    85 .. function:: lookup(encoding)
       
    86 
       
    87    Looks up the codec info in the Python codec registry and returns a
       
    88    :class:`CodecInfo` object as defined above.
       
    89 
       
    90    Encodings are first looked up in the registry's cache. If not found, the list of
       
    91    registered search functions is scanned. If no :class:`CodecInfo` object is
       
    92    found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
       
    93    is stored in the cache and returned to the caller.
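The lookup behaviour can be sketched as follows. The unknown encoding name is deliberately made up so that the :exc:`LookupError` path is exercised.

```python
import codecs

info = codecs.lookup("UTF-8")  # lookup is case-insensitive

# The stateless functions return (output object, length consumed).
data, consumed = info.encode(u"caf\xe9")
text, used = info.decode(data)

try:
    codecs.lookup("no-such-encoding-xyz")  # hypothetical, unregistered name
except LookupError:
    missing = True
else:
    missing = False
```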
       
    94 
       
    95 To simplify access to the various codecs, the module provides these additional
       
    96 functions which use :func:`lookup` for the codec lookup:
       
    97 
       
    98 
       
    99 .. function:: getencoder(encoding)
       
   100 
       
   101    Look up the codec for the given encoding and return its encoder function.
       
   102 
       
   103    Raises a :exc:`LookupError` in case the encoding cannot be found.
       
   104 
       
   105 
       
   106 .. function:: getdecoder(encoding)
       
   107 
       
   108    Look up the codec for the given encoding and return its decoder function.
       
   109 
       
   110    Raises a :exc:`LookupError` in case the encoding cannot be found.
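A short sketch of :func:`getencoder` and :func:`getdecoder`, using Latin-1 as an arbitrary example encoding:

```python
import codecs

encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")

# Both functions return a tuple of (output object, length consumed).
data, chars_consumed = encode(u"\xe9t\xe9")
text, bytes_consumed = decode(data)
```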
       
   111 
       
   112 
       
   113 .. function:: getincrementalencoder(encoding)
       
   114 
       
   115    Look up the codec for the given encoding and return its incremental encoder
       
   116    class or factory function.
       
   117 
       
   118    Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
       
   119    doesn't support an incremental encoder.
       
   120 
       
   121    .. versionadded:: 2.5
       
   122 
       
   123 
       
   124 .. function:: getincrementaldecoder(encoding)
       
   125 
       
   126    Look up the codec for the given encoding and return its incremental decoder
       
   127    class or factory function.
       
   128 
       
   129    Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
       
   130    doesn't support an incremental decoder.
       
   131 
       
   132    .. versionadded:: 2.5
       
   133 
       
   134 
       
   135 .. function:: getreader(encoding)
       
   136 
       
   137    Look up the codec for the given encoding and return its StreamReader class or
       
   138    factory function.
       
   139 
       
   140    Raises a :exc:`LookupError` in case the encoding cannot be found.
       
   141 
       
   142 
       
   143 .. function:: getwriter(encoding)
       
   144 
       
   145    Look up the codec for the given encoding and return its StreamWriter class or
       
   146    factory function.
       
   147 
       
   148    Raises a :exc:`LookupError` in case the encoding cannot be found.
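The classes returned by :func:`getwriter` and :func:`getreader` wrap a binary stream. The sketch below uses an in-memory ``io.BytesIO`` object as a stand-in for a real file; that choice is an assumption made to keep the example self-contained.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for a file opened in binary mode

writer = codecs.getwriter("utf-8")(raw)
writer.write(u"caf\xe9\n")  # encoded to UTF-8 before reaching the stream

raw.seek(0)
reader = codecs.getreader("utf-8")(raw)
line = reader.readline()  # decoded back from the stream
```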
       
   149 
       
   150 
       
   151 .. function:: register_error(name, error_handler)
       
   152 
       
   153    Register the error handling function *error_handler* under the name *name*.
       
   154    *error_handler* will be called during encoding and decoding in case of an error,
       
   155    when *name* is specified as the errors parameter.
       
   156 
       
   157    For encoding *error_handler* will be called with a :exc:`UnicodeEncodeError`
       
   158    instance, which contains information about the location of the error. The error
       
   159    handler must either raise this or a different exception or return a tuple with a
       
   160    replacement for the unencodable part of the input and a position where encoding
       
   161    should continue. The encoder will encode the replacement and continue encoding
       
   162    the original input at the specified position. Negative position values will be
       
   163    treated as being relative to the end of the input string. If the resulting
       
    164    position is out of bounds, an :exc:`IndexError` will be raised.
       
   165 
       
    166    Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
       
    167    or :exc:`UnicodeTranslateError` will be passed to the handler and the
       
    168    replacement from the error handler will be put into the output directly.
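A minimal sketch of a custom encoding error handler follows. The handler name ``demo.questionmarks`` and the function ``question_marks`` are hypothetical names introduced for this illustration.

```python
import codecs


def question_marks(exc):
    # Handle encoding errors only; anything else is re-raised unchanged.
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]  # the unencodable slice
        # Return (replacement, position at which encoding should resume).
        return (u"?" * len(bad), exc.end)
    raise exc


codecs.register_error("demo.questionmarks", question_marks)

result = u"a\u20acb".encode("ascii", "demo.questionmarks")
```

Returning ``exc.end`` as the position makes the encoder resume right after the unencodable run.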
       
   169 
       
   170 
       
   171 .. function:: lookup_error(name)
       
   172 
       
   173    Return the error handler previously registered under the name *name*.
       
   174 
       
   175    Raises a :exc:`LookupError` in case the handler cannot be found.
       
   176 
       
   177 
       
   178 .. function:: strict_errors(exception)
       
   179 
       
   180    Implements the ``strict`` error handling.
       
   181 
       
   182 
       
   183 .. function:: replace_errors(exception)
       
   184 
       
   185    Implements the ``replace`` error handling.
       
   186 
       
   187 
       
   188 .. function:: ignore_errors(exception)
       
   189 
       
   190    Implements the ``ignore`` error handling.
       
   191 
       
   192 
       
   193 .. function:: xmlcharrefreplace_errors(exception)
       
   194 
       
   195    Implements the ``xmlcharrefreplace`` error handling.
       
   196 
       
   197 
       
   198 .. function:: backslashreplace_errors(exception)
       
   199 
       
   200    Implements the ``backslashreplace`` error handling.
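The effect of the predefined handlers can be compared by encoding the same text to ASCII under each name. This is a sketch only; the sample text and the unregistered handler name are arbitrary.

```python
import codecs

text = u"\u20ac5"  # EURO SIGN followed by an ASCII digit

results = {}
for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[name] = text.encode("ascii", name)

try:
    text.encode("ascii", "strict")
except UnicodeEncodeError:
    strict_raised = True
else:
    strict_raised = False

try:
    codecs.lookup_error("demo.not-registered")  # hypothetical handler name
except LookupError:
    unknown_handler = True
else:
    unknown_handler = False
```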
       
   201 
       
    202 To simplify working with encoded files or streams, the module also defines these
       
   203 utility functions:
       
   204 
       
   205 
       
    206 .. function:: open(filename[, mode[, encoding[, errors[, buffering]]]])
       
   207 
       
   208    Open an encoded file using the given *mode* and return a wrapped version
       
   209    providing transparent encoding/decoding.  The default file mode is ``'r'``
       
   210    meaning to open the file in read mode.
       
   211 
       
   212    .. note::
       
   213 
       
   214       The wrapped version will only accept the object format defined by the codecs,
       
   215       i.e. Unicode objects for most built-in codecs.  Output is also codec-dependent
       
   216       and will usually be Unicode as well.
       
   217 
       
   218    .. note::
       
   219 
       
   220       Files are always opened in binary mode, even if no binary mode was
       
   221       specified.  This is done to avoid data loss due to encodings using 8-bit
       
   222       values.  This means that no automatic conversion of ``'\n'`` is done
       
   223       on reading and writing.
       
   224 
       
   225    *encoding* specifies the encoding which is to be used for the file.
       
   226 
       
   227    *errors* may be given to define the error handling. It defaults to ``'strict'``
       
   228    which causes a :exc:`ValueError` to be raised in case an encoding error occurs.
       
   229 
       
   230    *buffering* has the same meaning as for the built-in :func:`open` function.  It
       
   231    defaults to line buffered.
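A round trip through :func:`codecs.open` might look like the sketch below. The temporary file is an assumption made so that the example cleans up after itself. Using the wrapped file as a context manager is also assumed to be available on the interpreter in use.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text written to the wrapper is encoded on its way to the file.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(u"caf\xe9\n")

    # Reading through the wrapper decodes the bytes again.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # The raw bytes on disk, read with the built-in open() for comparison.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)
```

Because the file is opened in binary mode underneath, the ``'\n'`` is stored as written on every platform.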
       
   232 
       
   233 
       
   234 .. function:: EncodedFile(file, input[, output[, errors]])
       
   235 
       
   236    Return a wrapped version of file which provides transparent encoding
       
   237    translation.
       
   238 
       
   239    Strings written to the wrapped file are interpreted according to the given
       
   240    *input* encoding and then written to the original file as strings using the
       
   241    *output* encoding. The intermediate encoding will usually be Unicode but depends
       
   242    on the specified codecs.
       
   243 
       
   244    If *output* is not given, it defaults to *input*.
       
   245 
       
   246    *errors* may be given to define the error handling. It defaults to ``'strict'``,
       
   247    which causes :exc:`ValueError` to be raised in case an encoding error occurs.
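As a sketch of the translation performed by :func:`EncodedFile`, the example below wraps in-memory streams (an assumption made for brevity) so that UTF-8 data handed to the wrapper is stored as Latin-1, and Latin-1 data in the underlying stream is read back as UTF-8.

```python
import codecs
import io

# Writing: input encoding UTF-8, output (file) encoding Latin-1.
target = io.BytesIO()
wrapped = codecs.EncodedFile(target, "utf-8", "latin-1")
wrapped.write(b"caf\xc3\xa9")  # UTF-8 bytes go in
stored = target.getvalue()  # Latin-1 bytes end up in the stream

# Reading: the same pairing, applied in the other direction.
source = io.BytesIO(b"caf\xe9")  # Latin-1 bytes in the underlying stream
recoded = codecs.EncodedFile(source, "utf-8", "latin-1").read()
```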
       
   248 
       
   249 
       
   250 .. function:: iterencode(iterable, encoding[, errors])
       
   251 
       
   252    Uses an incremental encoder to iteratively encode the input provided by
       
   253    *iterable*. This function is a :term:`generator`.  *errors* (as well as any
       
   254    other keyword argument) is passed through to the incremental encoder.
       
   255 
       
   256    .. versionadded:: 2.5
       
   257 
       
   258 
       
   259 .. function:: iterdecode(iterable, encoding[, errors])
       
   260 
       
   261    Uses an incremental decoder to iteratively decode the input provided by
       
   262    *iterable*. This function is a :term:`generator`.  *errors* (as well as any
       
   263    other keyword argument) is passed through to the incremental decoder.
       
   264 
       
   265    .. versionadded:: 2.5
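Both generators can be sketched with small lists of chunks. Note that the decoder input below splits a two-byte UTF-8 sequence across chunks, which the incremental decoder is expected to reassemble.

```python
import codecs

# Encode a sequence of text chunks lazily and join the resulting bytes.
encoded = b"".join(codecs.iterencode([u"caf", u"\xe9"], "utf-8"))

# Decode byte chunks; the sequence for U+00E9 is split across two chunks.
decoded = u"".join(codecs.iterdecode([b"caf\xc3", b"\xa9"], "utf-8"))
```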
       
   266 
       
   267 The module also provides the following constants which are useful for reading
       
   268 and writing to platform dependent files:
       
   269 
       
   270 
       
   271 .. data:: BOM
       
   272           BOM_BE
       
   273           BOM_LE
       
   274           BOM_UTF8
       
   275           BOM_UTF16
       
   276           BOM_UTF16_BE
       
   277           BOM_UTF16_LE
       
   278           BOM_UTF32
       
   279           BOM_UTF32_BE
       
   280           BOM_UTF32_LE
       
   281 
       
   282    These constants define various encodings of the Unicode byte order mark (BOM)
       
   283    used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
       
   284    stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
       
   285    :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
       
   286    native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
       
   287    :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
       
   288    :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
       
   289    encodings.
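The relationships between these constants can be checked directly. The helper ``strip_utf8_bom`` is a hypothetical function written for this sketch; it is not provided by the module.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE


def strip_utf8_bom(data):
    # Hypothetical helper: drop a leading UTF-8 signature if one is present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


payload = strip_utf8_bom(codecs.BOM_UTF8 + b"hello")
```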
       
   290 
       
   291 
       
   292 .. _codec-base-classes:
       
   293 
       
   294 Codec Base Classes
       
   295 ------------------
       
   296 
       
   297 The :mod:`codecs` module defines a set of base classes which define the
       
   298 interface and can also be used to easily write your own codecs for use in
       
   299 Python.
       
   300 
       
    301 Each codec has to define four interfaces to make it usable as a codec in Python:
       
    302 stateless encoder, stateless decoder, stream reader and stream writer. The
       
    303 stream readers and writers typically reuse the stateless encoder/decoder to
       
   304 implement the file protocols.
       
   305 
       
   306 The :class:`Codec` class defines the interface for stateless encoders/decoders.
       
   307 
       
   308 To simplify and standardize error handling, the :meth:`encode` and
       
   309 :meth:`decode` methods may implement different error handling schemes by
       
   310 providing the *errors* string argument.  The following string values are defined
       
   311 and implemented by all standard Python codecs:
       
   312 
       
   313 +-------------------------+-----------------------------------------------+
       
   314 | Value                   | Meaning                                       |
       
   315 +=========================+===============================================+
       
   316 | ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
       
   317 |                         | this is the default.                          |
       
   318 +-------------------------+-----------------------------------------------+
       
   319 | ``'ignore'``            | Ignore the character and continue with the    |
       
   320 |                         | next.                                         |
       
   321 +-------------------------+-----------------------------------------------+
       
   322 | ``'replace'``           | Replace with a suitable replacement           |
       
   323 |                         | character; Python will use the official       |
       
   324 |                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
       
   325 |                         | Unicode codecs on decoding and '?' on         |
       
   326 |                         | encoding.                                     |
       
   327 +-------------------------+-----------------------------------------------+
       
   328 | ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
       
   329 |                         | reference (only for encoding).                |
       
   330 +-------------------------+-----------------------------------------------+
       
   331 | ``'backslashreplace'``  | Replace with backslashed escape sequences     |
       
   332 |                         | (only for encoding).                          |
       
   333 +-------------------------+-----------------------------------------------+
       
   334 
       
    335 The set of allowed values can be extended via :func:`register_error`.
       
   336 
       
   337 
       
   338 .. _codec-objects:
       
   339 
       
   340 Codec Objects
       
   341 ^^^^^^^^^^^^^
       
   342 
       
   343 The :class:`Codec` class defines these methods which also define the function
       
   344 interfaces of the stateless encoder and decoder:
       
   345 
       
   346 
       
   347 .. method:: Codec.encode(input[, errors])
       
   348 
       
   349    Encodes the object *input* and returns a tuple (output object, length consumed).
       
   350    While codecs are not restricted to use with Unicode, in a Unicode context,
       
   351    encoding converts a Unicode object to a plain string using a particular
       
   352    character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
       
   353 
       
   354    *errors* defines the error handling to apply. It defaults to ``'strict'``
       
   355    handling.
       
   356 
       
   357    The method may not store state in the :class:`Codec` instance. Use
       
    358    :class:`StreamWriter` for codecs which have to keep state in order to make
       
   359    encoding/decoding efficient.
       
   360 
       
   361    The encoder must be able to handle zero length input and return an empty object
       
   362    of the output object type in this situation.
       
   363 
       
   364 
       
   365 .. method:: Codec.decode(input[, errors])
       
   366 
       
   367    Decodes the object *input* and returns a tuple (output object, length consumed).
       
   368    In a Unicode context, decoding converts a plain string encoded using a
       
   369    particular character set encoding to a Unicode object.
       
   370 
       
   371    *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
       
   372    Python strings, buffer objects and memory mapped files are examples of objects
       
   373    providing this slot.
       
   374 
       
   375    *errors* defines the error handling to apply. It defaults to ``'strict'``
       
   376    handling.
       
   377 
       
   378    The method may not store state in the :class:`Codec` instance. Use
       
    379    :class:`StreamReader` for codecs which have to keep state in order to make
       
   380    encoding/decoding efficient.
       
   381 
       
   382    The decoder must be able to handle zero length input and return an empty object
       
   383    of the output object type in this situation.
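A stateless codec can be sketched by subclassing :class:`Codec` and delegating to existing C-level helpers. The class name ``Latin1Codec`` is an assumption made for this illustration; the built-in Latin-1 codec is organized along similar lines, but this is not its actual source.

```python
import codecs


class Latin1Codec(codecs.Codec):
    # Neither method stores anything on the instance: the codec is stateless.

    def encode(self, input, errors="strict"):
        # Returns (encoded bytes, number of characters consumed).
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns (decoded text, number of bytes consumed).
        return codecs.latin_1_decode(input, errors)


codec = Latin1Codec()
encoded = codec.encode(u"\xe9")
empty_encoded = codec.encode(u"")  # zero-length input must be accepted
empty_decoded = codec.decode(b"")
```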
       
   384 
       
   385 The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
       
   386 the basic interface for incremental encoding and decoding. Encoding/decoding the
       
   387 input isn't done with one call to the stateless encoder/decoder function, but
       
   388 with multiple calls to the :meth:`encode`/:meth:`decode` method of the
       
   389 incremental encoder/decoder. The incremental encoder/decoder keeps track of the
       
   390 encoding/decoding process during method calls.
       
   391 
       
   392 The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
       
   393 same as if all the single inputs were joined into one, and this input was
       
   394 encoded/decoded with the stateless encoder/decoder.
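The equivalence stated above can be sketched by feeding a UTF-8 byte string to an incremental decoder one byte at a time and comparing the joined pieces with the result of the stateless decoder.

```python
import codecs

text = u"na\xefve caf\xe9"
data = text.encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed single bytes; incomplete sequences are buffered between calls.
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(decoder.decode(b"", True))  # final call flushes the decoder

incremental_result = u"".join(pieces)
stateless_result = codecs.getdecoder("utf-8")(data)[0]
```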
       
   395 
       
   396 
       
   397 .. _incremental-encoder-objects:
       
   398 
       
   399 IncrementalEncoder Objects
       
   400 ^^^^^^^^^^^^^^^^^^^^^^^^^^
       
   401 
       
   402 .. versionadded:: 2.5
       
   403 
       
   404 The :class:`IncrementalEncoder` class is used for encoding an input in multiple
       
   405 steps. It defines the following methods which every incremental encoder must
       
   406 define in order to be compatible with the Python codec registry.
       
   407 
       
   408 
       
   409 .. class:: IncrementalEncoder([errors])
       
   410 
       
   411    Constructor for an :class:`IncrementalEncoder` instance.
       
   412 
       
   413    All incremental encoders must provide this constructor interface. They are free
       
   414    to add additional keyword arguments, but only the ones defined here are used by
       
   415    the Python codec registry.
       
   416 
       
   417    The :class:`IncrementalEncoder` may implement different error handling schemes
       
   418    by providing the *errors* keyword argument. These parameters are predefined:
       
   419 
       
   420    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
       
   421 
       
   422    * ``'ignore'`` Ignore the character and continue with the next.
       
   423 
       
    424    * ``'replace'`` Replace with a suitable replacement character.
       
    425 
       
    426    * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
       
   427 
       
   428    * ``'backslashreplace'`` Replace with backslashed escape sequences.
       
   429 
       
   430    The *errors* argument will be assigned to an attribute of the same name.
       
   431    Assigning to this attribute makes it possible to switch between different error
       
   432    handling strategies during the lifetime of the :class:`IncrementalEncoder`
       
   433    object.
       
   434 
       
   435    The set of allowed values for the *errors* argument can be extended with
       
   436    :func:`register_error`.
       
   437 
       
   438 
       
   439    .. method:: encode(object[, final])
       
   440 
       
   441       Encodes *object* (taking the current state of the encoder into account)
       
   442       and returns the resulting encoded object. If this is the last call to
       
   443       :meth:`encode` *final* must be true (the default is false).
       
   444 
       
   445 
       
   446    .. method:: reset()
       
   447 
       
   448       Reset the encoder to the initial state.
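The state kept by an incremental encoder is visible with UTF-16, where the byte order mark is emitted only once. The sketch below assumes the standard ``utf-16`` codec and shows that :meth:`reset` returns the encoder to its initial state.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode(u"a")  # BOM plus one code unit
second = encoder.encode(u"b", True)  # last call: no BOM this time

encoder.reset()  # back to the initial state
again = encoder.encode(u"c", True)  # the BOM is emitted again
```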
       
   449 
       
   450 
       
   451 .. _incremental-decoder-objects:
       
   452 
       
   453 IncrementalDecoder Objects
       
   454 ^^^^^^^^^^^^^^^^^^^^^^^^^^
       
   455 
       
   456 The :class:`IncrementalDecoder` class is used for decoding an input in multiple
       
   457 steps. It defines the following methods which every incremental decoder must
       
   458 define in order to be compatible with the Python codec registry.
       
   459 
       
   460 
       
   461 .. class:: IncrementalDecoder([errors])
       
   462 
       
   463    Constructor for an :class:`IncrementalDecoder` instance.
       
   464 
       
   465    All incremental decoders must provide this constructor interface. They are free
       
   466    to add additional keyword arguments, but only the ones defined here are used by
       
   467    the Python codec registry.
       
   468 
       
   469    The :class:`IncrementalDecoder` may implement different error handling schemes
       
   470    by providing the *errors* keyword argument. These parameters are predefined:
       
   471 
       
   472    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
       
   473 
       
   474    * ``'ignore'`` Ignore the character and continue with the next.
       
   475 
       
   476    * ``'replace'`` Replace with a suitable replacement character.
       
   477 
       
   478    The *errors* argument will be assigned to an attribute of the same name.
       
   479    Assigning to this attribute makes it possible to switch between different error
       
   480    handling strategies during the lifetime of the :class:`IncrementalDecoder`
       
   481    object.
       
   482 
       
   483    The set of allowed values for the *errors* argument can be extended with
       
   484    :func:`register_error`.
       
   485 
       
   486 
       
   487    .. method:: decode(object[, final])
       
   488 
       
   489       Decodes *object* (taking the current state of the decoder into account)
       
   490       and returns the resulting decoded object. If this is the last call to
       
   491       :meth:`decode` *final* must be true (the default is false). If *final* is
       
   492       true the decoder must decode the input completely and must flush all
       
   493       buffers. If this isn't possible (e.g. because of incomplete byte sequences
       
   494       at the end of the input) it must initiate error handling just like in the
       
   495       stateless case (which might raise an exception).
       
   496 
       
   497 
       
   498    .. method:: reset()
       
   499 
       
   500       Reset the decoder to the initial state.
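The role of *final* can be sketched with a truncated UTF-8 sequence: without *final* the decoder waits for more input, with *final* it has to report the incomplete data. The byte values are an arbitrary example (the first two bytes of the three-byte encoding of U+20AC).

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Incomplete sequence: nothing can be returned yet, so the bytes are buffered.
partial = decoder.decode(b"\xe2\x82")

try:
    decoder.decode(b"", True)  # final call with an incomplete sequence
except UnicodeDecodeError:
    raised = True
else:
    raised = False

decoder.reset()  # discard the buffered bytes
after_reset = decoder.decode(b"ok", True)
```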
       
   501 
       
   502 
       
   503 The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
       
   504 working interfaces which can be used to implement new encoding submodules very
       
   505 easily. See :mod:`encodings.utf_8` for an example of how this is done.
       
   506 
       
   507 
       
   508 .. _stream-writer-objects:
       
   509 
       
   510 StreamWriter Objects
       
   511 ^^^^^^^^^^^^^^^^^^^^
       
   512 
       
   513 The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
       
   514 following methods which every stream writer must define in order to be
       
   515 compatible with the Python codec registry.
       
   516 
       
   517 
       
   518 .. class:: StreamWriter(stream[, errors])
       
   519 
       
   520    Constructor for a :class:`StreamWriter` instance.
       
   521 
       
   522    All stream writers must provide this constructor interface. They are free to add
       
   523    additional keyword arguments, but only the ones defined here are used by the
       
   524    Python codec registry.
       
   525 
       
   526    *stream* must be a file-like object open for writing binary data.
       
   527 
       
   528    The :class:`StreamWriter` may implement different error handling schemes by
       
   529    providing the *errors* keyword argument. These parameters are predefined:
       
   530 
       
   531    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
       
   532 
       
   533    * ``'ignore'`` Ignore the character and continue with the next.
       
   534 
       
    535    * ``'replace'`` Replace with a suitable replacement character.
       
    536 
       
    537    * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
       
   538 
       
   539    * ``'backslashreplace'`` Replace with backslashed escape sequences.
       
   540 
       
   541    The *errors* argument will be assigned to an attribute of the same name.
       
   542    Assigning to this attribute makes it possible to switch between different error
       
   543    handling strategies during the lifetime of the :class:`StreamWriter` object.
       
   544 
       
   545    The set of allowed values for the *errors* argument can be extended with
       
   546    :func:`register_error`.
       
   547 
       
   548 
       
   549    .. method:: write(object)
       
   550 
       
   551       Writes the object's contents encoded to the stream.
       
   552 
       
   553 
       
   554    .. method:: writelines(list)
       
   555 
       
   556       Writes the concatenated list of strings to the stream (possibly by reusing
       
   557       the :meth:`write` method).
       
   558 
       
   559 
       
   560    .. method:: reset()
       
   561 
       
   562       Flushes and resets the codec buffers used for keeping state.
       
   563 
       
   564       Calling this method should ensure that the data on the output is put into
       
   565       a clean state that allows appending of new fresh data without having to
       
   566       rescan the whole stream to recover state.
       
   567 
       
   568 
       
   569 In addition to the above methods, the :class:`StreamWriter` must also inherit
       
   570 all other methods and attributes from the underlying stream.
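Switching error handling during the lifetime of a stream writer can be sketched as follows. An in-memory ``io.BytesIO`` stream is assumed in place of a real file.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("ascii")(raw)  # errors defaults to 'strict'

try:
    writer.write(u"caf\xe9")  # U+00E9 cannot be encoded as ASCII
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# Assigning to the attribute changes the strategy for later writes.
writer.errors = "replace"
writer.writelines([u"caf\xe9", u"\n"])

written = raw.getvalue()
```

The failed write leaves the stream untouched because encoding raises before anything is passed to the underlying stream.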
       
   571 
       
   572 
       
   573 .. _stream-reader-objects:
       
   574 
       
   575 StreamReader Objects
       
   576 ^^^^^^^^^^^^^^^^^^^^
       
   577 
       
   578 The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
       
   579 following methods which every stream reader must define in order to be
       
   580 compatible with the Python codec registry.
       
   581 
       
   582 
       
   583 .. class:: StreamReader(stream[, errors])
       
   584 
       
   585    Constructor for a :class:`StreamReader` instance.
       
   586 
       
   587    All stream readers must provide this constructor interface. They are free to add
       
   588    additional keyword arguments, but only the ones defined here are used by the
       
   589    Python codec registry.
       
   590 
       
   591    *stream* must be a file-like object open for reading (binary) data.
       
   592 
       
   593    The :class:`StreamReader` may implement different error handling schemes by
       
   594    providing the *errors* keyword argument. These parameters are defined:
       
   595 
       
   596    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
       
   597 
       
   598    * ``'ignore'`` Ignore the character and continue with the next.
       
   599 
       
   600    * ``'replace'`` Replace with a suitable replacement character.
       
   601 
       
   602    The *errors* argument will be assigned to an attribute of the same name.
       
   603    Assigning to this attribute makes it possible to switch between different error
       
   604    handling strategies during the lifetime of the :class:`StreamReader` object.
       
   605 
       
   606    The set of allowed values for the *errors* argument can be extended with
       
   607    :func:`register_error`.
       
   608 
       
   609 
       
   610    .. method:: read([size[, chars[, firstline]]])
       
   611 
       
   612       Decodes data from the stream and returns the resulting object.
       
   613 
       
   614       *chars* indicates the number of characters to read from the
       
   615       stream. :meth:`read` will never return more than *chars* characters, but
       
   616       it might return fewer if not enough characters are available.
       
   617 
       
   618       *size* indicates the approximate maximum number of bytes to read from the
       
   619       stream for decoding purposes. The decoder can modify this setting as
       
   620       appropriate. The default value -1 indicates to read and decode as much as
       
   621       possible.  *size* is intended to prevent having to decode huge files in
       
   622       one step.
       
   623 
       
   624       *firstline* indicates that it would be sufficient to only return the first
       
   625       line, if there are decoding errors on later lines.
       
   626 
       
   627       The method should use a greedy read strategy meaning that it should read
       
   628       as much data as is allowed within the definition of the encoding and the
       
   629       given size, e.g.  if optional encoding endings or state markers are
       
   630       available on the stream, these should be read too.
       
   631 
       
   632       .. versionchanged:: 2.4
       
   633          *chars* argument added.
       
   634 
       
   635       .. versionchanged:: 2.4.2
       
   636          *firstline* argument added.
       
   637 
       
   638 
       
   639    .. method:: readline([size[, keepends]])
       
   640 
       
   641       Read one line from the input stream and return the decoded data.
       
   642 
       
   643       *size*, if given, is passed as size argument to the stream's
       
   644       :meth:`readline` method.
       
   645 
       
   646       If *keepends* is false, line-endings will be stripped from the lines
       
   647       returned.
       
   648 
       
   649       .. versionchanged:: 2.4
       
   650          *keepends* argument added.
       
   651 
       
   652 
       
   653    .. method:: readlines([sizehint[, keepends]])
       
   654 
       
   655       Read all lines available on the input stream and return them as a list of
       
   656       lines.
       
   657 
       
   658       Line-endings are implemented using the codec's decoder method and are
       
   659       included in the list entries if *keepends* is true.
       
   660 
       
   661       *sizehint*, if given, is passed as the *size* argument to the stream's
       
   662       :meth:`read` method.
       
   663 
       
   664 
       
   665    .. method:: reset()
       
   666 
       
   667       Resets the codec buffers used for keeping state.
       
   668 
       
   669       Note that no stream repositioning should take place.  This method is
       
   670       primarily intended to be able to recover from decoding errors.
       
   671 
       
   672 
       
   673 In addition to the above methods, the :class:`StreamReader` must also inherit
       
   674 all other methods and attributes from the underlying stream.
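
The following is a minimal sketch of the reader interface described above, not a reference implementation. It assumes an in-memory ``io.BytesIO`` object as the underlying binary stream and obtains the reader classes through :func:`getreader`; the sample data and variable names are made up for illustration.

```python
import codecs
import io

# Sample data (hypothetical): two UTF-8 encoded lines.
payload = u'gr\xfc\xdf\nwelt\n'.encode('utf-8')

# getreader() returns the StreamReader class (factory) for an encoding.
Utf8Reader = codecs.getreader('utf-8')

reader = Utf8Reader(io.BytesIO(payload))
first = reader.readline()        # first decoded line, line ending kept
rest = reader.readlines()        # remaining decoded lines as a list

# *chars* caps the number of decoded characters that read() returns.
head = Utf8Reader(io.BytesIO(payload)).read(chars=2)

# Error handling: 'replace' substitutes a replacement character,
# while the default 'strict' raises ValueError (or a subclass).
AsciiReader = codecs.getreader('ascii')
replaced = AsciiReader(io.BytesIO(b'a\xffb'), errors='replace').read()

try:
    AsciiReader(io.BytesIO(b'a\xffb')).read()
    strict_failed = False
except ValueError:
    strict_failed = True
```

Here ``first`` holds the first line including its line ending, ``rest`` the remaining line, and ``replaced`` contains ``U+FFFD`` in place of the undecodable byte.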
       
   675 
       
   676 The next two base classes are included for convenience. They are not needed by
       
   677 the codec registry, but may prove useful in practice.
       
   678 
       
   679 
       
   680 .. _stream-reader-writer:
       
   681 
       
   682 StreamReaderWriter Objects
       
   683 ^^^^^^^^^^^^^^^^^^^^^^^^^^
       
   684 
       
   685 The :class:`StreamReaderWriter` allows wrapping streams which work in both read
       
   686 and write modes.
       
   687 
       
   688 The design is such that one can use the factory functions returned by the
       
   689 :func:`lookup` function to construct the instance.
       
   690 
       
   691 
       
   692 .. class:: StreamReaderWriter(stream, Reader, Writer, errors)
       
   693 
       
   694    Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
       
   695    object. *Reader* and *Writer* must be factory functions or classes providing the
       
   696    :class:`StreamReader` and :class:`StreamWriter` interface respectively. Error handling
       
   697    is done in the same way as defined for the stream readers and writers.
       
   698 
       
   699 :class:`StreamReaderWriter` instances define the combined interfaces of
       
   700 :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
       
   701 methods and attributes from the underlying stream.
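
As a rough illustration of building such an object from the factories returned by :func:`lookup`, the sketch below wraps an in-memory ``io.BytesIO`` stream (an assumption; any file-like object opened for binary reading and writing should behave similarly). The variable names are illustrative only.

```python
import codecs
import io

info = codecs.lookup('utf-8')    # CodecInfo carrying the stream factories
raw = io.BytesIO()               # stand-in for a file opened in binary mode

rw = codecs.StreamReaderWriter(raw, info.streamreader,
                               info.streamwriter, 'strict')

rw.write(u'caf\xe9\n')           # encoded by the StreamWriter side
stored = raw.getvalue()          # bytes that reached the underlying stream

raw.seek(0)                      # rewind the underlying stream directly
text = rw.read()                 # decoded by the StreamReader side
```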
       
   702 
       
   703 
       
   704 .. _stream-recoder-objects:
       
   705 
       
   706 StreamRecoder Objects
       
   707 ^^^^^^^^^^^^^^^^^^^^^
       
   708 
       
   709 The :class:`StreamRecoder` provides a frontend-backend view of encoding data
       
   710 which is sometimes useful when dealing with different encoding environments.
       
   711 
       
   712 The design is such that one can use the factory functions returned by the
       
   713 :func:`lookup` function to construct the instance.
       
   714 
       
   715 
       
   716 .. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)
       
   717 
       
   718    Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
       
   719    *encode* and *decode* work on the frontend (the data returned by :meth:`read` and
       
   720    passed to :meth:`write`) while *Reader* and *Writer* work on the backend (reading and
       
   721    writing to the stream).
       
   722 
       
   723    You can use these objects to do transparent direct recodings from e.g. Latin-1
       
   724    to UTF-8 and back.
       
   725 
       
   726    *stream* must be a file-like object.
       
   727 
       
   728    *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
       
   729    *Writer* must be factory functions or classes providing objects of the
       
   730    :class:`StreamReader` and :class:`StreamWriter` interface respectively.
       
   731 
       
   732    *encode* and *decode* are needed for the frontend translation, *Reader* and
       
   733    *Writer* for the backend translation.  The intermediate format used is
       
   734    determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
       
   735    as the intermediate encoding.
       
   736 
       
   737    Error handling is done in the same way as defined for the stream readers and
       
   738    writers.
       
   739 
       
   740 
       
   741 :class:`StreamRecoder` instances define the combined interfaces of
       
   742 :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
       
   743 methods and attributes from the underlying stream.
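
A small sketch of the Latin-1 to UTF-8 recoding mentioned above, assuming a Latin-1 frontend and a UTF-8 backend held in ``io.BytesIO`` objects. The streams and variable names are hypothetical; only :func:`lookup` and :class:`StreamRecoder` come from this module.

```python
import codecs
import io

latin1 = codecs.lookup('latin-1')   # frontend: what the caller sees
utf8 = codecs.lookup('utf-8')       # backend: what the stream stores

# Reading: the stream holds UTF-8, the caller receives Latin-1 bytes.
source = io.BytesIO(u'caf\xe9'.encode('utf-8'))
recoder = codecs.StreamRecoder(source, latin1.encode, latin1.decode,
                               utf8.streamreader, utf8.streamwriter,
                               'strict')
front = recoder.read()

# Writing: the caller supplies Latin-1 bytes, the stream receives UTF-8.
sink = io.BytesIO()
writer = codecs.StreamRecoder(sink, latin1.encode, latin1.decode,
                              utf8.streamreader, utf8.streamwriter,
                              'strict')
writer.write(b'\xfc')
stored = sink.getvalue()
```

In this arrangement Unicode acts as the intermediate format: the backend reader decodes UTF-8 to Unicode and the frontend *encode* function turns that into Latin-1, with the reverse path used for writing.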
       
   744 
       
   745 
       
   746 .. _encodings-overview:
       
   747 
       
   748 Encodings and Unicode
       
   749 ---------------------
       
   750 
       
   751 Unicode strings are stored internally as sequences of codepoints (to be precise
       
   752 as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
       
   753 via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
       
   754 former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
       
   755 type. Once a Unicode object is used outside of CPU and memory, CPU endianness
       
   756 and how these arrays are stored as bytes become an issue.  Transforming a
       
   757 unicode object into a sequence of bytes is called encoding and recreating the
       
   758 unicode object from the sequence of bytes is known as decoding.  There are many
       
   759 different methods for how this transformation can be done (these methods are
       
   760 also called encodings). The simplest method is to map the codepoints 0-255 to
       
   761 the bytes ``0x0``-``0xff``. This means that a unicode object that contains
       
   762 codepoints above ``U+00FF`` can't be encoded with this method (which is called
       
   763 ``'latin-1'`` or ``'iso-8859-1'``). :meth:`unicode.encode` will raise a
       
   764 :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
       
   765 codec can't encode character u'\u1234' in position 3: ordinal not in
       
   766 range(256)``.
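
The failure described above can be reproduced with a short sketch. The sample string is arbitrary; it places a codepoint above ``U+00FF`` at position 3.

```python
# Codepoints 0-255 map directly onto single bytes in Latin-1.
ok = u'\xff'.encode('latin-1')

# U+1234 lies outside that range, so encoding fails at position 3.
text = u'abc\u1234'
try:
    text.encode('latin-1')
    error = None
except UnicodeEncodeError as exc:
    error = exc

position = error.start           # index of the offending character
```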
       
   767 
       
   768 There's another group of encodings (the so called charmap encodings) that choose
       
   769 a different subset of all unicode code points and how these codepoints are
       
   770 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
       
   771 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
       
   772 Windows). There's a string constant with 256 characters that shows you which
       
   773 character is mapped to which byte value.
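
To make the difference between charmap encodings concrete, the sketch below decodes the same byte with two codecs and inspects the table in the ``encodings.cp1252`` module. It assumes that module exposes its mapping as ``decoding_table``, which is how the charmap codecs shipped with CPython are laid out.

```python
import encodings.cp1252

# The 256-entry table: index = byte value, entry = Unicode character.
table = encodings.cp1252.decoding_table

# The same byte decodes to different characters in different charmaps.
as_cp1252 = b'\x80'.decode('cp1252')     # EURO SIGN in cp1252
as_latin1 = b'\x80'.decode('latin-1')    # a C1 control character in Latin-1
```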
       
   774 
       
   775 All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints
       
   776 defined in unicode. A simple and straightforward way to store each Unicode
       
   777 code point is to store each codepoint as two consecutive bytes. There are two
       
   778 possibilities: Store the bytes in big endian or in little endian order. These
       
   779 two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
       
   780 disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
       
   781 will always have to swap bytes on encoding and decoding. UTF-16 avoids this
       
   782 problem: Bytes will always be in natural endianness. When these bytes are read
       
   783 by a CPU with a different endianness, then bytes have to be swapped though. To
       
   784 be able to detect the endianness of a UTF-16 byte sequence, there's the so
       
   785 called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
       
   786 This character will be prepended to every UTF-16 byte sequence. The byte swapped
       
   787 version of this character (``0xFFFE``) is an illegal character that may not
       
   788 appear in a Unicode text. So when the first character in a UTF-16 byte sequence
       
   789 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
       
   790 Unfortunately up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
       
   791 a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
       
   792 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
       
   793 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
       
   794 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
       
   795 Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
       
   796 it's a device to determine the storage layout of the encoded bytes, and vanishes
       
   797 once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
       
   798 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
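
The byte order variants and the role of the BOM can be observed directly. This is a sketch using the ``BOM_*`` constants from this module; the single-character sample string is arbitrary.

```python
import codecs

# Explicit byte orders: no BOM is written.
big = u'A'.encode('utf-16-be')       # most significant byte first
little = u'A'.encode('utf-16-le')    # least significant byte first

# Plain 'utf-16' prepends a BOM and uses the machine's native order.
native = u'A'.encode('utf-16')
starts_with_bom = native[:2] == codecs.BOM_UTF16

# On decoding, the BOM selects the byte order and then vanishes.
from_be = (codecs.BOM_UTF16_BE + big).decode('utf-16')
from_le = (codecs.BOM_UTF16_LE + little).decode('utf-16')
```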
       
   799 
       
   800 There's another encoding that is able to encode the full range of Unicode
       
   801 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
       
   802 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
       
   803 parts: Marker bits (the most significant bits) and payload bits. The marker bits
       
   804 are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
       
   805 encoded like this (with x being payload bits, which when concatenated give the
       
   806 Unicode character):
       
   807 
       
   808 +-----------------------------------+----------------------------------------------+
       
   809 | Range                             | Encoding                                     |
       
   810 +===================================+==============================================+
       
   811 | ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
       
   812 +-----------------------------------+----------------------------------------------+
       
   813 | ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
       
   814 +-----------------------------------+----------------------------------------------+
       
   815 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
       
   816 +-----------------------------------+----------------------------------------------+
       
   817 | ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
       
   818 +-----------------------------------+----------------------------------------------+
       
   819 | ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
       
   820 +-----------------------------------+----------------------------------------------+
       
   821 | ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
       
   822 |                                   | 10xxxxxx                                     |
       
   823 +-----------------------------------+----------------------------------------------+
       
   824 
       
   825 The least significant bit of the Unicode character is the rightmost x bit.
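
The first four rows of the table can be sketched as bit operations. The helper ``utf8_bytes`` below is hypothetical and written only for illustration; it does not validate its input (for example, it does not reject surrogates) and is not how the built-in codec is implemented.

```python
def utf8_bytes(cp):
    """Encode one code point using the bit layout from the table above."""
    if cp < 0x80:                    # 0xxxxxxx
        parts = [cp]
    elif cp < 0x800:                 # 110xxxxx 10xxxxxx
        parts = [0xC0 | (cp >> 6),
                 0x80 | (cp & 0x3F)]
    elif cp < 0x10000:               # 1110xxxx 10xxxxxx 10xxxxxx
        parts = [0xE0 | (cp >> 12),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    else:                            # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        parts = [0xF0 | (cp >> 18),
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    return bytes(bytearray(parts))

euro = utf8_bytes(0x20AC)            # EURO SIGN, three bytes
builtin = u'\u20ac'.encode('utf-8')  # the same character via the codec
```

For ``U+20AC`` the payload bits ``0010 000010 101100`` are spread over the three bytes ``0xe2 0x82 0xac``, which matches what the built-in ``utf-8`` codec produces.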
       
   826 
       
   827 As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
       
   828 the decoded Unicode string (even if it's the first character) is treated as a
       
   829 ``ZERO WIDTH NO-BREAK SPACE``.
       
   830 
       
   831 Without external information it's impossible to reliably determine which
       
   832 encoding was used for encoding a Unicode string. Each charmap encoding can
       
   833 decode any random byte sequence. However that's not possible with UTF-8, as
       
   834 UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
       
   835 sequences. To increase the reliability with which a UTF-8 encoding can be
       
   836 detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
       
   837 ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
       
   838 is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
       
   839 sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
       
   840 that any charmap encoded file starts with these byte values (which would e.g.
       
   841 map to
       
   842 
       
   843    | LATIN SMALL LETTER I WITH DIAERESIS
       
   844    | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
       
   845    | INVERTED QUESTION MARK
       
   846 
       
   847 in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
       
   848 correctly guessed from the byte sequence. So here the BOM is not used to be able
       
   849 to determine the byte order used for generating the byte sequence, but as a
       
   850 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
       
   851 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
       
   852 decoding utf-8-sig will skip those three bytes if they appear as the first three
       
   853 bytes in the file.
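
A brief sketch of the signature handling described above, comparing the ``utf-8-sig`` codec with plain ``utf-8``. The sample text is arbitrary.

```python
import codecs

# Encoding with utf-8-sig prepends the three signature bytes.
data = u'hi'.encode('utf-8-sig')

# Plain utf-8 keeps the signature as a U+FEFF character ...
plain = data.decode('utf-8')
# ... while utf-8-sig skips it on decoding.
stripped = data.decode('utf-8-sig')

# Input without the signature decodes unchanged with utf-8-sig.
no_sig = b'hi'.decode('utf-8-sig')
```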
       
   854 
       
   855 
       
   856 .. _standard-encodings:
       
   857 
       
   858 Standard Encodings
       
   859 ------------------
       
   860 
       
   861 Python comes with a number of codecs built-in, either implemented as C functions
       
   862 or with dictionaries as mapping tables. The following table lists the codecs by
       
   863 name, together with a few common aliases, and the languages for which the
       
   864 encoding is likely used. Neither the list of aliases nor the list of languages
       
   865 is meant to be exhaustive. Notice that spelling alternatives that only differ in
       
   866 case or use a hyphen instead of an underscore are also valid aliases.
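
The alias rule can be checked with :func:`lookup`. The sketch below assumes only that all spellings resolve to the same codec; the canonical name reported by the ``name`` attribute may differ between Python versions, so it is not spelled out here.

```python
import codecs

# Spellings differing in case or in hyphen/underscore, plus two aliases
# taken from the table.
spellings = ['latin_1', 'latin-1', 'LATIN-1', 'Latin_1', 'iso-8859-1', 'L1']

# Collect the canonical codec name each spelling resolves to.
names = set(codecs.lookup(s).name for s in spellings)
```

If the aliases behave as described, ``names`` ends up with a single element.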
       
   867 
       
   868 Many of the character sets support the same languages. They vary in individual
       
   869 characters (e.g. whether the EURO SIGN is supported or not), and in the
       
   870 assignment of characters to code positions. For the European languages in
       
   871 particular, the following variants typically exist:
       
   872 
       
   873 * an ISO 8859 codeset
       
   874 
       
   875 * a Microsoft Windows code page, which is typically derived from an 8859 codeset,
       
   876   but replaces control characters with additional graphic characters
       
   877 
       
   878 * an IBM EBCDIC code page
       
   879 
       
   880 * an IBM PC code page, which is ASCII compatible
       
   881 
       
   882 +-----------------+--------------------------------+--------------------------------+
       
   883 | Codec           | Aliases                        | Languages                      |
       
   884 +=================+================================+================================+
       
   885 | ascii           | 646, us-ascii                  | English                        |
       
   886 +-----------------+--------------------------------+--------------------------------+
       
   887 | big5            | big5-tw, csbig5                | Traditional Chinese            |
       
   888 +-----------------+--------------------------------+--------------------------------+
       
   889 | big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
       
   890 +-----------------+--------------------------------+--------------------------------+
       
   891 | cp037           | IBM037, IBM039                 | English                        |
       
   892 +-----------------+--------------------------------+--------------------------------+
       
   893 | cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
       
   894 +-----------------+--------------------------------+--------------------------------+
       
   895 | cp437           | 437, IBM437                    | English                        |
       
   896 +-----------------+--------------------------------+--------------------------------+
       
   897 | cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
       
   898 |                 | IBM500                         |                                |
       
   899 +-----------------+--------------------------------+--------------------------------+
       
   900 | cp737           |                                | Greek                          |
       
   901 +-----------------+--------------------------------+--------------------------------+
       
   902 | cp775           | IBM775                         | Baltic languages               |
       
   903 +-----------------+--------------------------------+--------------------------------+
       
   904 | cp850           | 850, IBM850                    | Western Europe                 |
       
   905 +-----------------+--------------------------------+--------------------------------+
       
   906 | cp852           | 852, IBM852                    | Central and Eastern Europe     |
       
   907 +-----------------+--------------------------------+--------------------------------+
       
   908 | cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
       
   909 |                 |                                | Macedonian, Russian, Serbian   |
       
   910 +-----------------+--------------------------------+--------------------------------+
       
   911 | cp856           |                                | Hebrew                         |
       
   912 +-----------------+--------------------------------+--------------------------------+
       
   913 | cp857           | 857, IBM857                    | Turkish                        |
       
   914 +-----------------+--------------------------------+--------------------------------+
       
   915 | cp860           | 860, IBM860                    | Portuguese                     |
       
   916 +-----------------+--------------------------------+--------------------------------+
       
   917 | cp861           | 861, CP-IS, IBM861             | Icelandic                      |
       
   918 +-----------------+--------------------------------+--------------------------------+
       
   919 | cp862           | 862, IBM862                    | Hebrew                         |
       
   920 +-----------------+--------------------------------+--------------------------------+
       
   921 | cp863           | 863, IBM863                    | Canadian                       |
       
   922 +-----------------+--------------------------------+--------------------------------+
       
   923 | cp864           | IBM864                         | Arabic                         |
       
   924 +-----------------+--------------------------------+--------------------------------+
       
   925 | cp865           | 865, IBM865                    | Danish, Norwegian              |
       
   926 +-----------------+--------------------------------+--------------------------------+
       
   927 | cp866           | 866, IBM866                    | Russian                        |
       
   928 +-----------------+--------------------------------+--------------------------------+
       
   929 | cp869           | 869, CP-GR, IBM869             | Greek                          |
       
   930 +-----------------+--------------------------------+--------------------------------+
       
   931 | cp874           |                                | Thai                           |
       
   932 +-----------------+--------------------------------+--------------------------------+
       
   933 | cp875           |                                | Greek                          |
       
   934 +-----------------+--------------------------------+--------------------------------+
       
   935 | cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
       
   936 +-----------------+--------------------------------+--------------------------------+
       
   937 | cp949           | 949, ms949, uhc                | Korean                         |
       
   938 +-----------------+--------------------------------+--------------------------------+
       
   939 | cp950           | 950, ms950                     | Traditional Chinese            |
       
   940 +-----------------+--------------------------------+--------------------------------+
       
   941 | cp1006          |                                | Urdu                           |
       
   942 +-----------------+--------------------------------+--------------------------------+
       
   943 | cp1026          | ibm1026                        | Turkish                        |
       
   944 +-----------------+--------------------------------+--------------------------------+
       
   945 | cp1140          | ibm1140                        | Western Europe                 |
       
   946 +-----------------+--------------------------------+--------------------------------+
       
   947 | cp1250          | windows-1250                   | Central and Eastern Europe     |
       
   948 +-----------------+--------------------------------+--------------------------------+
       
   949 | cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
       
   950 |                 |                                | Macedonian, Russian, Serbian   |
       
   951 +-----------------+--------------------------------+--------------------------------+
       
   952 | cp1252          | windows-1252                   | Western Europe                 |
       
   953 +-----------------+--------------------------------+--------------------------------+
       
   954 | cp1253          | windows-1253                   | Greek                          |
       
   955 +-----------------+--------------------------------+--------------------------------+
       
   956 | cp1254          | windows-1254                   | Turkish                        |
       
   957 +-----------------+--------------------------------+--------------------------------+
       
   958 | cp1255          | windows-1255                   | Hebrew                         |
       
   959 +-----------------+--------------------------------+--------------------------------+
       
   960 | cp1256          | windows-1256                   | Arabic                         |
       
   961 +-----------------+--------------------------------+--------------------------------+
       
   962 | cp1257          | windows-1257                   | Baltic languages               |
       
   963 +-----------------+--------------------------------+--------------------------------+
       
   964 | cp1258          | windows-1258                   | Vietnamese                     |
       
   965 +-----------------+--------------------------------+--------------------------------+
       
   966 | euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
       
   967 +-----------------+--------------------------------+--------------------------------+
       
   968 | euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
       
   969 +-----------------+--------------------------------+--------------------------------+
       
   970 | euc_jisx0213    | eucjisx0213                    | Japanese                       |
       
   971 +-----------------+--------------------------------+--------------------------------+
       
   972 | euc_kr          | euckr, korean, ksc5601,        | Korean                         |
       
   973 |                 | ks_c-5601, ks_c-5601-1987,     |                                |
       
   974 |                 | ksx1001, ks_x-1001             |                                |
       
   975 +-----------------+--------------------------------+--------------------------------+
       
   976 | gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
       
   977 |                 | cn, euccn, eucgb2312-cn,       |                                |
       
   978 |                 | gb2312-1980, gb2312-80, iso-   |                                |
       
   979 |                 | ir-58                          |                                |
       
   980 +-----------------+--------------------------------+--------------------------------+
       
   981 | gbk             | 936, cp936, ms936              | Unified Chinese                |
       
   982 +-----------------+--------------------------------+--------------------------------+
       
   983 | gb18030         | gb18030-2000                   | Unified Chinese                |
       
   984 +-----------------+--------------------------------+--------------------------------+
       
   985 | hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
       
   986 +-----------------+--------------------------------+--------------------------------+
       
   987 | iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
       
   988 |                 | iso-2022-jp                    |                                |
       
   989 +-----------------+--------------------------------+--------------------------------+
       
   990 | iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
       
   991 +-----------------+--------------------------------+--------------------------------+
       
   992 | iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
       
   993 |                 |                                | Chinese, Western Europe, Greek |
       
   994 +-----------------+--------------------------------+--------------------------------+
       
   995 | iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
       
   996 |                 | iso-2022-jp-2004               |                                |
       
   997 +-----------------+--------------------------------+--------------------------------+
       
   998 | iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
       
   999 +-----------------+--------------------------------+--------------------------------+
       
  1000 | iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
       
  1001 +-----------------+--------------------------------+--------------------------------+
       
  1002 | iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
       
  1003 |                 | iso-2022-kr                    |                                |
       
  1004 +-----------------+--------------------------------+--------------------------------+
       
  1005 | latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
       
  1006 |                 | cp819, latin, latin1, L1       |                                |
       
  1007 +-----------------+--------------------------------+--------------------------------+
       
  1008 | iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
       
  1009 +-----------------+--------------------------------+--------------------------------+
       
  1010 | iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
       
  1011 +-----------------+--------------------------------+--------------------------------+
       
  1012 | iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
       
  1013 +-----------------+--------------------------------+--------------------------------+
       
  1014 | iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
       
  1015 |                 |                                | Macedonian, Russian, Serbian   |
       
  1016 +-----------------+--------------------------------+--------------------------------+
       
  1017 | iso8859_6       | iso-8859-6, arabic             | Arabic                         |
       
  1018 +-----------------+--------------------------------+--------------------------------+
       
  1019 | iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
       
  1020 +-----------------+--------------------------------+--------------------------------+
       
  1021 | iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
       
  1022 +-----------------+--------------------------------+--------------------------------+
       
  1023 | iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
       
  1024 +-----------------+--------------------------------+--------------------------------+
       
  1025 | iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
       
  1026 +-----------------+--------------------------------+--------------------------------+
       
  1027 | iso8859_13      | iso-8859-13                    | Baltic languages               |
       
  1028 +-----------------+--------------------------------+--------------------------------+
       
  1029 | iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
       
  1030 +-----------------+--------------------------------+--------------------------------+
       
  1031 | iso8859_15      | iso-8859-15                    | Western Europe                 |
       
  1032 +-----------------+--------------------------------+--------------------------------+
       
  1033 | johab           | cp1361, ms1361                 | Korean                         |
       
  1034 +-----------------+--------------------------------+--------------------------------+
       
  1035 | koi8_r          |                                | Russian                        |
       
  1036 +-----------------+--------------------------------+--------------------------------+
       
  1037 | koi8_u          |                                | Ukrainian                      |
       
  1038 +-----------------+--------------------------------+--------------------------------+
       
  1039 | mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
       
  1040 |                 |                                | Macedonian, Russian, Serbian   |
       
  1041 +-----------------+--------------------------------+--------------------------------+
       
  1042 | mac_greek       | macgreek                       | Greek                          |
       
  1043 +-----------------+--------------------------------+--------------------------------+
       
  1044 | mac_iceland     | maciceland                     | Icelandic                      |
       
  1045 +-----------------+--------------------------------+--------------------------------+
       
  1046 | mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
       
  1047 +-----------------+--------------------------------+--------------------------------+
       
  1048 | mac_roman       | macroman                       | Western Europe                 |
       
  1049 +-----------------+--------------------------------+--------------------------------+
       
  1050 | mac_turkish     | macturkish                     | Turkish                        |
       
  1051 +-----------------+--------------------------------+--------------------------------+
       
  1052 | ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
       
  1053 |                 | cyrillic-asian                 |                                |
       
  1054 +-----------------+--------------------------------+--------------------------------+
       
  1055 | shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
       
  1056 |                 | s_jis                          |                                |
       
  1057 +-----------------+--------------------------------+--------------------------------+
       
  1058 | shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
       
  1059 |                 | sjis2004                       |                                |
       
  1060 +-----------------+--------------------------------+--------------------------------+
       
  1061 | shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
       
  1062 |                 | s_jisx0213                     |                                |
       
  1063 +-----------------+--------------------------------+--------------------------------+
       
  1064 | utf_32          | U32, utf32                     | all languages                  |
       
  1065 +-----------------+--------------------------------+--------------------------------+
       
  1066 | utf_32_be       | UTF-32BE                       | all languages                  |
       
  1067 +-----------------+--------------------------------+--------------------------------+
       
  1068 | utf_32_le       | UTF-32LE                       | all languages                  |
       
  1069 +-----------------+--------------------------------+--------------------------------+
       
  1070 | utf_16          | U16, utf16                     | all languages                  |
       
  1071 +-----------------+--------------------------------+--------------------------------+
       
  1072 | utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
       
  1073 +-----------------+--------------------------------+--------------------------------+
       
  1074 | utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
       
  1075 +-----------------+--------------------------------+--------------------------------+
       
  1076 | utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
       
  1077 +-----------------+--------------------------------+--------------------------------+
       
  1078 | utf_8           | U8, UTF, utf8                  | all languages                  |
       
  1079 +-----------------+--------------------------------+--------------------------------+
       
  1080 | utf_8_sig       |                                | all languages                  |
       
  1081 +-----------------+--------------------------------+--------------------------------+
       
  1082 
       
  1083 A number of codecs are specific to Python, so their codec names have no meaning
       
  1084 outside Python. Some of them don't convert from Unicode strings to byte strings,
       
  1085 but instead use the property of the Python codecs machinery that any bijective
       
  1086 function with one argument can be considered as an encoding.
       
  1087 
       
  1088 For the codecs listed below, the result in the "encoding" direction is always a
       
  1089 byte string. The type of the result in the "decoding" direction is listed in the
       
  1090 "Operand type" column of the table.
       
  1091 
       
  1092 +--------------------+---------------------------+----------------+---------------------------+
       
  1093 | Codec              | Aliases                   | Operand type   | Purpose                   |
       
  1094 +====================+===========================+================+===========================+
       
  1095 | base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
       
  1096 |                    |                           |                | base64                    |
       
  1097 +--------------------+---------------------------+----------------+---------------------------+
       
  1098 | bz2_codec          | bz2                       | byte string    | Compress the operand      |
       
  1099 |                    |                           |                | using bz2                 |
       
  1100 +--------------------+---------------------------+----------------+---------------------------+
       
  1101 | hex_codec          | hex                       | byte string    | Convert operand to        |
       
  1102 |                    |                           |                | hexadecimal               |
       
  1103 |                    |                           |                | representation, with two  |
       
  1104 |                    |                           |                | digits per byte           |
       
  1105 +--------------------+---------------------------+----------------+---------------------------+
       
  1106 | idna               |                           | Unicode string | Implements :rfc:`3490`,   |
       
  1107 |                    |                           |                | see also                  |
       
  1108 |                    |                           |                | :mod:`encodings.idna`     |
       
  1109 +--------------------+---------------------------+----------------+---------------------------+
       
  1110 | mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
       
  1111 |                    |                           |                | operand according to the  |
       
  1112 |                    |                           |                | ANSI codepage (CP_ACP)    |
       
  1113 +--------------------+---------------------------+----------------+---------------------------+
       
  1114 | palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
       
  1115 +--------------------+---------------------------+----------------+---------------------------+
       
  1116 | punycode           |                           | Unicode string | Implements :rfc:`3492`    |
       
  1117 +--------------------+---------------------------+----------------+---------------------------+
       
  1118 | quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
       
  1119 |                    | quotedprintable           |                | quoted printable          |
       
  1120 +--------------------+---------------------------+----------------+---------------------------+
       
  1121 | raw_unicode_escape |                           | Unicode string | Produce a string that is  |
       
  1122 |                    |                           |                | suitable as raw Unicode   |
       
  1123 |                    |                           |                | literal in Python source  |
       
  1124 |                    |                           |                | code                      |
       
  1125 +--------------------+---------------------------+----------------+---------------------------+
       
  1126 | rot_13             | rot13                     | Unicode string | Returns the Caesar-cipher |
       
  1127 |                    |                           |                | encryption of the operand |
       
  1128 +--------------------+---------------------------+----------------+---------------------------+
       
  1129 | string_escape      |                           | byte string    | Produce a string that is  |
       
  1130 |                    |                           |                | suitable as string        |
       
  1131 |                    |                           |                | literal in Python source  |
       
  1132 |                    |                           |                | code                      |
       
  1133 +--------------------+---------------------------+----------------+---------------------------+
       
  1134 | undefined          |                           | any            | Raise an exception for    |
       
  1135 |                    |                           |                | all conversions. Can be   |
       
  1136 |                    |                           |                | used as the system        |
       
  1137 |                    |                           |                | encoding if no automatic  |
       
  1138 |                    |                           |                | :term:`coercion` between  |
       
  1139 |                    |                           |                | byte and Unicode strings  |
       
  1140 |                    |                           |                | is desired.               |
       
  1141 +--------------------+---------------------------+----------------+---------------------------+
       
  1142 | unicode_escape     |                           | Unicode string | Produce a string that is  |
       
  1143 |                    |                           |                | suitable as Unicode       |
       
  1144 |                    |                           |                | literal in Python source  |
       
  1145 |                    |                           |                | code                      |
       
  1146 +--------------------+---------------------------+----------------+---------------------------+
       
  1147 | unicode_internal   |                           | Unicode string | Return the internal       |
       
  1148 |                    |                           |                | representation of the     |
       
  1149 |                    |                           |                | operand                   |
       
  1150 +--------------------+---------------------------+----------------+---------------------------+
       
  1151 | uu_codec           | uu                        | byte string    | Convert the operand using |
       
  1152 |                    |                           |                | uuencode                  |
       
  1153 +--------------------+---------------------------+----------------+---------------------------+
       
  1154 | zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
       
  1155 |                    |                           |                | using zlib                |
       
  1156 +--------------------+---------------------------+----------------+---------------------------+
       
  1157 
       
  1158 .. versionadded:: 2.3
       
  1159    The ``idna`` and ``punycode`` encodings.
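       
       As a minimal sketch of the byte string codecs in the table above, the
       following round-trips a value through ``hex_codec``, ``base64_codec`` and
       ``zlib_codec``. It assumes the module-level helpers ``codecs.encode`` and
       ``codecs.decode`` are available and writes the ``b`` prefix explicitly, so
       that it does not depend on the ``encode`` method of the string type; the
       sample data is purely illustrative::
       
          import codecs
       
          data = b"hello world"
       
          # "Encoding" maps one byte string to another byte string.
          hexed = codecs.encode(data, "hex_codec")
          mime = codecs.encode(data, "base64_codec")
          packed = codecs.encode(data, "zlib_codec")
       
          # Two hexadecimal digits are produced per input byte.
          assert hexed == b"68656c6c6f20776f726c64"
       
          # "Decoding" applies the inverse function and restores the operand.
          assert codecs.decode(hexed, "hex_codec") == data
          assert codecs.decode(mime, "base64_codec") == data
          assert codecs.decode(packed, "zlib_codec") == data
       
       Because each of these codecs is a bijective function on byte strings,
       decoding restores the original operand.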
       
  1160 
       
  1161 
       
  1162 :mod:`encodings.idna` --- Internationalized Domain Names in Applications
       
  1163 ------------------------------------------------------------------------
       
  1164 
       
  1165 .. module:: encodings.idna
       
  1166    :synopsis: Internationalized Domain Names implementation
       
  1167 .. moduleauthor:: Martin v. Löwis
       
  1168 
       
  1169 .. versionadded:: 2.3
       
  1170 
       
  1171 This module implements :rfc:`3490` (Internationalized Domain Names in
       
  1172 Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
       
  1173 Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
       
  1174 and :mod:`stringprep`.
       
  1175 
       
  1176 These RFCs together define a protocol to support non-ASCII characters in domain
       
  1177 names. A domain name containing non-ASCII characters (such as
       
  1178 ``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
       
  1179 (ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
       
  1180 name is then used in all places where arbitrary characters are not allowed by
       
  1181 the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
       
  1182 on. This conversion is carried out in the application and, if possible, is
       
  1183 invisible to the user: the application should transparently convert Unicode
       
  1184 domain labels to IDNA on the wire, and convert ACE labels back to Unicode
       
  1185 before presenting them to the user.
       
  1186 
       
  1187 Python supports this conversion in several ways. The ``idna`` codec converts
       
  1188 between Unicode and ACE. Furthermore, the :mod:`socket` module
       
  1189 transparently converts Unicode host names to ACE, so that applications need not
       
  1190 be concerned about converting host names themselves when they pass them to the
       
  1191 socket module. On top of that, modules that have host names as function
       
  1192 parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
       
  1193 (:mod:`httplib` then also transparently sends an IDNA hostname in the
       
  1194 :mailheader:`Host` field if it sends that field at all).
       
  1195 
       
  1196 When receiving host names from the wire (such as in reverse name lookup), no
       
  1197 automatic conversion to Unicode is performed: Applications wishing to present
       
  1198 such host names to the user should decode them to Unicode.
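       
       The conversions described above can be sketched with the ``idna`` codec
       alone. This is a minimal illustration using the example domain name from
       this section; the non-ASCII character is written as an escape so that the
       snippet does not depend on the source file encoding, and the variable names
       are hypothetical::
       
          # Unicode host name as an application might receive it from the user.
          name = u"www.Alliancefran\xe7aise.nu"
       
          # Encode to the ASCII-compatible encoding (ACE) for use on the wire.
          ace = name.encode("idna")
          assert ace == b"www.xn--alliancefranaise-npb.nu"
       
          # Decode a name received from the wire before presenting it.
          # Nameprep lower-cased the label on encoding, so the upper-case "A"
          # of the original spelling is not restored.
          shown = ace.decode("idna")
          assert shown == u"www.alliancefran\xe7aise.nu"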
       
  1199 
       
  1200 The module :mod:`encodings.idna` also implements the nameprep procedure, which
       
  1201 performs certain normalizations on host names, to achieve case-insensitivity of
       
  1202 international domain names, and to unify similar characters. The nameprep
       
  1203 functions can be used directly if desired.
       
  1204 
       
  1205 
       
  1206 .. function:: nameprep(label)
       
  1207 
       
  1208    Return the nameprepped version of *label*. The implementation currently assumes
       
  1209    query strings, so ``AllowUnassigned`` is true.
       
  1210 
       
  1211 
       
  1212 .. function:: ToASCII(label)
       
  1213 
       
  1214    Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
       
  1215    assumed to be false.
       
  1216 
       
  1217 
       
  1218 .. function:: ToUnicode(label)
       
  1219 
       
  1220    Convert a label to Unicode, as specified in :rfc:`3490`.
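       
       A short sketch of calling the three functions directly on a single label
       (not a full dotted host name). The label chosen here is only an example,
       and the exact ACE string is deliberately not spelled out::
       
          from encodings.idna import nameprep, ToASCII, ToUnicode
       
          label = u"Fran\xe7ais"
       
          # nameprep performs case folding and normalization on the label.
          prepared = nameprep(label)
          assert prepared == u"fran\xe7ais"
       
          # ToASCII applies nameprep and then punycode; labels that need
          # encoding carry the "xn--" ACE prefix.
          ace = ToASCII(label)
          assert ace.startswith(b"xn--")
       
          # ToUnicode reverses the conversion, yielding the nameprepped form.
          assert ToUnicode(ace) == prepared
       
          # A label that is already plain ASCII passes through unchanged.
          assert ToASCII(u"www") == b"www"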
       
  1221 
       
  1222 
       
  1223 :mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
       
  1224 -------------------------------------------------------------
       
  1225 
       
  1226 .. module:: encodings.utf_8_sig
       
  1227    :synopsis: UTF-8 codec with BOM signature
       
  1228 .. moduleauthor:: Walter Dörwald
       
  1229 
       
  1230 .. versionadded:: 2.5
       
  1231 
       
  1232 This module implements a variant of the UTF-8 codec. On encoding, a UTF-8
       
  1233 encoded BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder
       
  1234 this is done only once (on the first write to the byte stream). On decoding,
       
  1235 an optional UTF-8 encoded BOM at the start of the data is skipped.
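       
       The behaviour can be illustrated with a small sketch. The incremental
       encoder from :func:`codecs.getincrementalencoder` is used here as a
       stand-in for a stateful stream writer, which is an assumption made for
       brevity; the sample text is arbitrary::
       
          import codecs
       
          text = u"caf\xe9"
       
          # Stateless encoding prepends the UTF-8 encoded BOM.
          encoded = text.encode("utf_8_sig")
          assert encoded == codecs.BOM_UTF8 + text.encode("utf_8")
       
          # Decoding skips the BOM if present, and accepts data without one.
          assert encoded.decode("utf_8_sig") == text
          assert text.encode("utf_8").decode("utf_8_sig") == text
       
          # A stateful encoder writes the BOM only with the first chunk.
          encoder = codecs.getincrementalencoder("utf_8_sig")()
          first = encoder.encode(u"a")
          second = encoder.encode(u"b")
          assert first == codecs.BOM_UTF8 + b"a"
          assert second == b"b"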
       
  1236