symbian-qemu-0.9.1-12/python-2.6.1/Doc/c-api/unicode.rst
changeset 1 2fb8b9db1c86
0:ffa851df0825 1:2fb8b9db1c86
       
     1 .. highlightlang:: c
       
     2 
       
     3 .. _unicodeobjects:
       
     4 
       
     5 Unicode Objects and Codecs
       
     6 --------------------------
       
     7 
       
     8 .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
       
     9 
       
    10 Unicode Objects
       
    11 ^^^^^^^^^^^^^^^
       
    12 
       
    13 
       
    14 These are the basic Unicode object types used for the Unicode implementation in
       
    15 Python:
       
    16 
       
    17 .. % --- Unicode Type -------------------------------------------------------
       
    18 
       
    19 
       
    20 .. ctype:: Py_UNICODE
       
    21 
       
     22    This type represents the storage type which Python uses internally as the
       
     23    basis for holding Unicode ordinals.  Python's default builds use a 16-bit type
       
    24    for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
       
    25    possible to build a UCS4 version of Python (most recent Linux distributions come
       
    26    with UCS4 builds of Python). These builds then use a 32-bit type for
       
    27    :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
       
    28    where :ctype:`wchar_t` is available and compatible with the chosen Python
       
    29    Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
       
    30    :ctype:`wchar_t` to enhance native platform compatibility. On all other
       
    31    platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
       
    32    short` (UCS2) or :ctype:`unsigned long` (UCS4).
       
    33 
       
    34 Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
       
    35 this in mind when writing extensions or interfaces.
       
    36 
       
    37 
       
    38 .. ctype:: PyUnicodeObject
       
    39 
       
    40    This subtype of :ctype:`PyObject` represents a Python Unicode object.
       
    41 
       
    42 
       
    43 .. cvar:: PyTypeObject PyUnicode_Type
       
    44 
       
    45    This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It
       
    46    is exposed to Python code as ``unicode`` and ``types.UnicodeType``.
       
    47 
       
    48 The following APIs are really C macros and can be used to do fast checks and to
       
    49 access internal read-only data of Unicode objects:
       
    50 
       
    51 
       
    52 .. cfunction:: int PyUnicode_Check(PyObject *o)
       
    53 
       
    54    Return true if the object *o* is a Unicode object or an instance of a Unicode
       
    55    subtype.
       
    56 
       
    57    .. versionchanged:: 2.2
       
    58       Allowed subtypes to be accepted.
       
    59 
       
    60 
       
    61 .. cfunction:: int PyUnicode_CheckExact(PyObject *o)
       
    62 
       
    63    Return true if the object *o* is a Unicode object, but not an instance of a
       
    64    subtype.
       
    65 
       
    66    .. versionadded:: 2.2
       
    67 
       
    68 
       
    69 .. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
       
    70 
       
    71    Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not
       
    72    checked).
       
    73 
       
    74 
       
    75 .. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
       
    76 
       
    77    Return the size of the object's internal buffer in bytes.  *o* has to be a
       
    78    :ctype:`PyUnicodeObject` (not checked).
       
    79 
       
    80 
       
    81 .. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
       
    82 
       
    83    Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o*
       
    84    has to be a :ctype:`PyUnicodeObject` (not checked).
       
    85 
       
    86 
       
    87 .. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
       
    88 
       
    89    Return a pointer to the internal buffer of the object. *o* has to be a
       
    90    :ctype:`PyUnicodeObject` (not checked).
       
    91 
       
    92 
       
    93 .. cfunction:: int PyUnicode_ClearFreeList(void)
       
    94 
       
    95    Clear the free list. Return the total number of freed items.
       
    96 
       
    97    .. versionadded:: 2.6
       
    98 
       
    99 Unicode provides many different character properties. The most often needed ones
       
   100 are available through these macros which are mapped to C functions depending on
       
   101 the Python configuration.
       
   102 
       
   103 .. % --- Unicode character properties ---------------------------------------
       
   104 
       
   105 
       
   106 .. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
       
   107 
       
   108    Return 1 or 0 depending on whether *ch* is a whitespace character.
       
   109 
       
   110 
       
   111 .. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
       
   112 
       
   113    Return 1 or 0 depending on whether *ch* is a lowercase character.
       
   114 
       
   115 
       
   116 .. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
       
   117 
       
   118    Return 1 or 0 depending on whether *ch* is an uppercase character.
       
   119 
       
   120 
       
   121 .. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
       
   122 
       
   123    Return 1 or 0 depending on whether *ch* is a titlecase character.
       
   124 
       
   125 
       
   126 .. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
       
   127 
       
   128    Return 1 or 0 depending on whether *ch* is a linebreak character.
       
   129 
       
   130 
       
   131 .. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
       
   132 
       
   133    Return 1 or 0 depending on whether *ch* is a decimal character.
       
   134 
       
   135 
       
   136 .. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
       
   137 
       
   138    Return 1 or 0 depending on whether *ch* is a digit character.
       
   139 
       
   140 
       
   141 .. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
       
   142 
       
   143    Return 1 or 0 depending on whether *ch* is a numeric character.
       
   144 
       
   145 
       
   146 .. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
       
   147 
       
   148    Return 1 or 0 depending on whether *ch* is an alphabetic character.
       
   149 
       
   150 
       
   151 .. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
       
   152 
       
   153    Return 1 or 0 depending on whether *ch* is an alphanumeric character.
       
   154 
       
   155 These APIs can be used for fast direct character conversions:
       
   156 
       
   157 
       
   158 .. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
       
   159 
       
   160    Return the character *ch* converted to lower case.
       
   161 
       
   162 
       
   163 .. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
       
   164 
       
   165    Return the character *ch* converted to upper case.
       
   166 
       
   167 
       
   168 .. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
       
   169 
       
   170    Return the character *ch* converted to title case.
       
   171 
       
   172 
       
   173 .. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
       
   174 
       
   175    Return the character *ch* converted to a decimal positive integer.  Return
       
   176    ``-1`` if this is not possible.  This macro does not raise exceptions.
       
   177 
       
   178 
       
   179 .. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
       
   180 
       
   181    Return the character *ch* converted to a single digit integer. Return ``-1`` if
       
   182    this is not possible.  This macro does not raise exceptions.
       
   183 
       
   184 
       
   185 .. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
       
   186 
       
   187    Return the character *ch* converted to a double. Return ``-1.0`` if this is not
       
   188    possible.  This macro does not raise exceptions.
       
   189 
       
   190 To create Unicode objects and access their basic sequence properties, use these
       
   191 APIs:
       
   192 
       
   193 .. % --- Plain Py_UNICODE ---------------------------------------------------
       
   194 
       
   195 
       
   196 .. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
       
   197 
       
   198    Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
       
   199    may be *NULL* which causes the contents to be undefined. It is the user's
       
   200    responsibility to fill in the needed data.  The buffer is copied into the new
       
   201    object. If the buffer is not *NULL*, the return value might be a shared object.
       
   202    Therefore, modification of the resulting Unicode object is only allowed when *u*
       
   203    is *NULL*.
       
   204 
       
   205 
       
   206 .. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
       
   207 
       
   208    Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
       
   209    buffer, *NULL* if *unicode* is not a Unicode object.
       
   210 
       
   211 
       
   212 .. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
       
   213 
       
   214    Return the length of the Unicode object.
       
   215 
       
   216 
       
   217 .. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
       
   218 
       
    219    Coerce an encoded object *obj* to a Unicode object and return a reference with
       
   220    incremented refcount.
       
   221 
       
   222    String and other char buffer compatible objects are decoded according to the
       
   223    given encoding and using the error handling defined by errors.  Both can be
       
   224    *NULL* to have the interface use the default values (see the next section for
       
   225    details).
       
   226 
       
   227    All other objects, including Unicode objects, cause a :exc:`TypeError` to be
       
   228    set.
       
   229 
       
   230    The API returns *NULL* if there was an error.  The caller is responsible for
       
   231    decref'ing the returned objects.
       
   232 
       
   233 
       
   234 .. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
       
   235 
       
   236    Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
       
   237    throughout the interpreter whenever coercion to Unicode is needed.
       
   238 
       
   239 If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
       
   240 Python can interface directly to this type using the following functions.
       
   241 Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
       
   242 the system's :ctype:`wchar_t`.
       
   243 
       
   244 .. % --- wchar_t support for platforms which support it ---------------------
       
   245 
       
   246 
       
   247 .. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
       
   248 
       
   249    Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
       
   250    Return *NULL* on failure.
       
   251 
       
   252 
       
   253 .. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
       
   254 
       
   255    Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most
       
    256    *size* :ctype:`wchar_t` characters are copied (excluding a possible trailing
       
   257    0-termination character).  Return the number of :ctype:`wchar_t` characters
       
   258    copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t`
       
   259    string may or may not be 0-terminated.  It is the responsibility of the caller
       
   260    to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
       
   261    required by the application.
       
   262 
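The termination caveat above is easy to overlook. The following is a minimal, standalone sketch of the caller-side pattern; it does not call the Python API. ``copy_wide`` is a hypothetical stand-in that models only the worst case of the documented contract (at most *size* characters copied, no terminator written), and ``copy_wide_terminated`` is an equally hypothetical wrapper that reserves one slot for the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the documented contract: copy at most `size`
   wide characters and never write a terminator.  Returns the count copied. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t src_len,
                           wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    size_t i;
    assert(dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return (ptrdiff_t)n;
}

/* Caller-side pattern: pass capacity - 1 as the size, then terminate
   explicitly at the returned length. */
static ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                                      wchar_t *dst, size_t capacity)
{
    ptrdiff_t n;
    if (capacity == 0)
        return -1;
    n = copy_wide(src, src_len, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

With a real Unicode object the same idea applies: pass one less than the buffer capacity as *size* and write the terminator at the returned index.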
       
   263 
       
   264 .. _builtincodecs:
       
   265 
       
   266 Built-in Codecs
       
   267 ^^^^^^^^^^^^^^^
       
   268 
       
   269 Python provides a set of builtin codecs which are written in C for speed. All of
       
   270 these codecs are directly usable via the following functions.
       
   271 
       
    272 Many of the following APIs take two arguments, *encoding* and *errors*.  These
       
    273 parameters have the same semantics as the ones of the
       
    274 builtin :func:`unicode` Unicode object constructor.
       
   275 
       
   276 Setting encoding to *NULL* causes the default encoding to be used which is
       
   277 ASCII.  The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
       
   278 as the encoding for file names. This variable should be treated as read-only: On
       
   279 some systems, it will be a pointer to a static string, on others, it will change
       
   280 at run-time (such as when the application invokes setlocale).
       
   281 
       
   282 Error handling is set by errors which may also be set to *NULL* meaning to use
       
   283 the default handling defined for the codec.  Default error handling for all
       
   284 builtin codecs is "strict" (:exc:`ValueError` is raised).
       
   285 
       
    286 The codecs all use a similar interface.  For simplicity, only deviations from the
       
    287 following generic ones are documented.
       
   288 
       
   289 These are the generic codec APIs:
       
   290 
       
   291 .. % --- Generic Codecs -----------------------------------------------------
       
   292 
       
   293 
       
   294 .. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
       
   295 
       
   296    Create a Unicode object by decoding *size* bytes of the encoded string *s*.
       
   297    *encoding* and *errors* have the same meaning as the parameters of the same name
       
   298    in the :func:`unicode` builtin function.  The codec to be used is looked up
       
   299    using the Python codec registry.  Return *NULL* if an exception was raised by
       
   300    the codec.
       
   301 
       
   302 
       
   303 .. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
       
   304 
       
   305    Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
       
   306    string object.  *encoding* and *errors* have the same meaning as the parameters
       
   307    of the same name in the Unicode :meth:`encode` method.  The codec to be used is
       
   308    looked up using the Python codec registry.  Return *NULL* if an exception was
       
   309    raised by the codec.
       
   310 
       
   311 
       
   312 .. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
       
   313 
       
   314    Encode a Unicode object and return the result as Python string object.
       
   315    *encoding* and *errors* have the same meaning as the parameters of the same name
       
   316    in the Unicode :meth:`encode` method. The codec to be used is looked up using
       
   317    the Python codec registry. Return *NULL* if an exception was raised by the
       
   318    codec.
       
   319 
       
   320 These are the UTF-8 codec APIs:
       
   321 
       
   322 .. % --- UTF-8 Codecs -------------------------------------------------------
       
   323 
       
   324 
       
   325 .. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
       
   326 
       
   327    Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
       
   328    *s*. Return *NULL* if an exception was raised by the codec.
       
   329 
       
   330 
       
   331 .. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
       
   332 
       
   333    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
       
   334    *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
       
   335    treated as an error. Those bytes will not be decoded and the number of bytes
       
   336    that have been decoded will be stored in *consumed*.
       
   337 
       
   338    .. versionadded:: 2.4
       
   339 
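To make the meaning of *consumed* concrete, here is a standalone sketch that computes how many leading bytes of a buffer form complete UTF-8 sequences. It is not the codec's implementation: the function names are hypothetical, and it only inspects lead bytes, so malformed input that a real decoder would report through its error handler is not detected here.

```c
#include <assert.h>
#include <stddef.h>

/* Expected length of a UTF-8 sequence, judged from its lead byte.  Bytes
   that cannot start a sequence are counted as length 1; a real decoder
   would hand them to its error handler instead. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC0 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF7) return 4;
    return 1;
}

/* Number of bytes at the start of `s` that make up whole sequences.
   A trailing incomplete sequence is left for the next chunk. */
static size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    assert(s != NULL || size == 0);
    while (i < size) {
        size_t n = utf8_seq_len(s[i]);
        if (n > size - i)
            break;          /* incomplete tail: stop before it */
        i += n;
    }
    return i;
}
```

A caller reading a stream in chunks would decode the first ``utf8_complete_prefix(...)`` bytes and prepend the remaining bytes to the next chunk, which is the role *consumed* plays for the stateful decoder.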
       
   340 
       
   341 .. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
       
   342 
       
   343    Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
       
   344    Python string object.  Return *NULL* if an exception was raised by the codec.
       
   345 
       
   346 
       
   347 .. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
       
   348 
       
   349    Encode a Unicode object using UTF-8 and return the result as Python string
       
   350    object.  Error handling is "strict".  Return *NULL* if an exception was raised
       
   351    by the codec.
       
   352 
       
   353 These are the UTF-32 codec APIs:
       
   354 
       
   355 .. % --- UTF-32 Codecs ------------------------------------------------------ */
       
   356 
       
   357 
       
   358 .. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
       
   359 
       
    360    Decode *size* bytes from a UTF-32 encoded buffer string and return the
       
   361    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
       
   362    handling. It defaults to "strict".
       
   363 
       
   364    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
       
   365    order::
       
   366 
       
   367       *byteorder == -1: little endian
       
   368       *byteorder == 0:  native order
       
   369       *byteorder == 1:  big endian
       
   370 
       
   371    and then switches if the first four bytes of the input data are a byte order mark
       
   372    (BOM) and the specified byte order is native order.  This BOM is not copied into
       
   373    the resulting Unicode string.  After completion, *\*byteorder* is set to the
       
   374    current byte order at the end of input data.
       
   375 
       
   376    In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
       
   377 
       
   378    If *byteorder* is *NULL*, the codec starts in native order mode.
       
   379 
       
   380    Return *NULL* if an exception was raised by the codec.
       
   381 
       
   382    .. versionadded:: 2.6
       
   383 
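The byte order convention above (-1, 0, 1) and the BOM check can be illustrated without the Python API. This is a sketch under stated assumptions: ``utf32_bom_order`` and ``utf32_read`` are hypothetical helper names, and the code covers only BOM detection and reading one code unit, not error handling.

```c
#include <assert.h>
#include <stddef.h>

/* Return -1 for a little-endian BOM, 1 for a big-endian BOM, and 0 when the
   first four bytes are not a BOM or fewer than four bytes are available. */
static int utf32_bom_order(const unsigned char *s, size_t size)
{
    assert(s != NULL || size == 0);
    if (size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
        return -1;
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return 1;
    return 0;
}

/* Read the 32-bit code unit at byte offset `pos` in the given order
   (negative: little endian, otherwise big endian). */
static unsigned long utf32_read(const unsigned char *s, size_t pos, int order)
{
    const unsigned char *p = s + pos;
    if (order < 0)
        return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
               ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}
```

A decoder following the description above would call the first helper only when it starts in native order mode, skip the four BOM bytes if one is found, and report the detected order back to the caller.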
       
   384 
       
   385 .. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
       
   386 
       
   387    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
       
   388    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
       
   389    trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
       
   390    by four) as an error. Those bytes will not be decoded and the number of bytes
       
   391    that have been decoded will be stored in *consumed*.
       
   392 
       
   393    .. versionadded:: 2.6
       
   394 
       
   395 
       
   396 .. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
       
   397 
       
    398    Return a Python string object holding the UTF-32 encoded value of the Unicode
       
   399    data in *s*.  If *byteorder* is not ``0``, output is written according to the
       
   400    following byte order::
       
   401 
       
   402       byteorder == -1: little endian
       
   403       byteorder == 0:  native byte order (writes a BOM mark)
       
   404       byteorder == 1:  big endian
       
   405 
       
   406    If byteorder is ``0``, the output string will always start with the Unicode BOM
       
   407    mark (U+FEFF). In the other two modes, no BOM mark is prepended.
       
   408 
       
   409    If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
       
   410    as a single codepoint.
       
   411 
       
   412    Return *NULL* if an exception was raised by the codec.
       
   413 
       
   414    .. versionadded:: 2.6
       
   415 
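The effect of the *byteorder* argument on the output can be sketched in plain C. The helpers below are hypothetical and operate on an array of code points rather than a :ctype:`Py_UNICODE` buffer; they show only the layout rule described above (a BOM is written for ``0``, none for ``-1`` or ``1``).

```c
#include <assert.h>
#include <stddef.h>

/* Report the host byte order using the same convention: -1 little, 1 big. */
static int native_order(void)
{
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 1 ? -1 : 1;
}

/* Store one 32-bit value at `out` in the given byte order. */
static void put_u32(unsigned char *out, unsigned long cp, int order)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = order < 0 ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)((cp >> shift) & 0xFF);
    }
}

/* Encode `n` code points into `out`, which must hold 4 * (n + 1) bytes.
   byteorder 0 selects native order and writes a BOM first.
   Returns the number of bytes written. */
static size_t utf32_encode(const unsigned long *cps, size_t n,
                           unsigned char *out, int byteorder)
{
    size_t pos = 0, i;
    int order = byteorder;
    assert(out != NULL);
    if (byteorder == 0) {
        order = native_order();
        put_u32(out, 0xFEFFUL, order);
        pos = 4;
    }
    for (i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, cps[i], order);
    return pos;
}
```

Writing the BOM only in native mode matches the text: a caller that forces an explicit byte order is expected to convey that order by other means.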
       
   416 
       
   417 .. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
       
   418 
       
   419    Return a Python string using the UTF-32 encoding in native byte order. The
       
   420    string always starts with a BOM mark.  Error handling is "strict".  Return
       
   421    *NULL* if an exception was raised by the codec.
       
   422 
       
   423    .. versionadded:: 2.6
       
   424 
       
   425 
       
   426 These are the UTF-16 codec APIs:
       
   427 
       
   428 .. % --- UTF-16 Codecs ------------------------------------------------------ */
       
   429 
       
   430 
       
   431 .. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
       
   432 
       
    433    Decode *size* bytes from a UTF-16 encoded buffer string and return the
       
   434    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
       
   435    handling. It defaults to "strict".
       
   436 
       
   437    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
       
   438    order::
       
   439 
       
   440       *byteorder == -1: little endian
       
   441       *byteorder == 0:  native order
       
   442       *byteorder == 1:  big endian
       
   443 
       
   444    and then switches if the first two bytes of the input data are a byte order mark
       
   445    (BOM) and the specified byte order is native order.  This BOM is not copied into
       
   446    the resulting Unicode string.  After completion, *\*byteorder* is set to the
       
    447    current byte order at the end of input data.
       
   448 
       
   449    If *byteorder* is *NULL*, the codec starts in native order mode.
       
   450 
       
   451    Return *NULL* if an exception was raised by the codec.
       
   452 
       
   453 
       
   454 .. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
       
   455 
       
   456    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
       
   457    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
       
   458    trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
       
   459    split surrogate pair) as an error. Those bytes will not be decoded and the
       
   460    number of bytes that have been decoded will be stored in *consumed*.
       
   461 
       
   462    .. versionadded:: 2.4
       
   463 
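The two kinds of incomplete tail mentioned above, an odd byte count and a split surrogate pair, can be shown with a small standalone function. This is a sketch, not the codec's logic: ``utf16_complete_prefix`` is a hypothetical name, the byte order is passed explicitly, and the code does not check that a high surrogate is actually followed by a low surrogate.

```c
#include <assert.h>
#include <stddef.h>

/* Number of leading bytes of `s` that form whole UTF-16 characters.
   order < 0 means little endian, otherwise big endian. */
static size_t utf16_complete_prefix(const unsigned char *s, size_t size,
                                    int order)
{
    size_t i = 0;
    assert(s != NULL || size == 0);
    while (size - i >= 2) {
        unsigned int unit = order < 0
            ? (unsigned int)(s[i] | (s[i + 1] << 8))
            : (unsigned int)((s[i] << 8) | s[i + 1]);
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            if (size - i < 4)
                break;      /* low surrogate has not arrived yet */
            i += 4;
        } else {
            i += 2;
        }
    }
    return i;               /* a single odd trailing byte is never counted */
}
```

The return value corresponds to what a stateful decoder would store in *consumed*: the caller keeps the remaining bytes and supplies them again together with the next chunk.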
       
   464 
       
   465 .. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
       
   466 
       
   467    Return a Python string object holding the UTF-16 encoded value of the Unicode
       
   468    data in *s*.  If *byteorder* is not ``0``, output is written according to the
       
   469    following byte order::
       
   470 
       
   471       byteorder == -1: little endian
       
   472       byteorder == 0:  native byte order (writes a BOM mark)
       
   473       byteorder == 1:  big endian
       
   474 
       
   475    If byteorder is ``0``, the output string will always start with the Unicode BOM
       
   476    mark (U+FEFF). In the other two modes, no BOM mark is prepended.
       
   477 
       
   478    If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
       
    479    represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
       
    480    value is interpreted as a UCS-2 character.
       
   481 
       
   482    Return *NULL* if an exception was raised by the codec.
       
   483 
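The surrogate pair representation mentioned above follows the standard UTF-16 arithmetic. The sketch below is independent of the Python API; ``utf16_units`` is a hypothetical helper and assumes :ctype:`unsigned short` holds at least 16 bits.

```c
#include <assert.h>

/* Encode one code point as UTF-16 code units.
   Returns the number of units written to `out` (1 or 2). */
static int utf16_units(unsigned long cp, unsigned short out[2])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;    /* BMP: one unit, value unchanged */
        return 1;
    }
    cp -= 0x10000UL;                    /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, while any code point below U+10000 is stored in a single unit.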
       
   484 
       
   485 .. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
       
   486 
       
   487    Return a Python string using the UTF-16 encoding in native byte order. The
       
   488    string always starts with a BOM mark.  Error handling is "strict".  Return
       
   489    *NULL* if an exception was raised by the codec.
       
   490 
       
   491 These are the "Unicode Escape" codec APIs:
       
   492 
       
   493 .. % --- Unicode-Escape Codecs ----------------------------------------------
       
   494 
       
   495 
       
   496 .. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
       
   497 
       
   498    Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
       
   499    string *s*.  Return *NULL* if an exception was raised by the codec.
       
   500 
       
   501 
       
   502 .. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
       
   503 
       
   504    Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
       
   505    return a Python string object.  Return *NULL* if an exception was raised by the
       
   506    codec.
       
   507 
       
   508 
       
   509 .. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
       
   510 
       
   511    Encode a Unicode object using Unicode-Escape and return the result as Python
       
   512    string object.  Error handling is "strict". Return *NULL* if an exception was
       
   513    raised by the codec.
       
   514 
       
   515 These are the "Raw Unicode Escape" codec APIs:
       
   516 
       
   517 .. % --- Raw-Unicode-Escape Codecs ------------------------------------------
       
   518 
       
   519 
       
   520 .. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
       
   521 
       
   522    Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
       
   523    encoded string *s*.  Return *NULL* if an exception was raised by the codec.
       
   524 
       
   525 
       
   526 .. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
       
   527 
       
   528    Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
       
   529    and return a Python string object.  Return *NULL* if an exception was raised by
       
   530    the codec.
       
   531 
       
   532 
       
   533 .. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
       
   534 
       
   535    Encode a Unicode object using Raw-Unicode-Escape and return the result as
       
   536    Python string object. Error handling is "strict". Return *NULL* if an exception
       
   537    was raised by the codec.
       
   538 
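As a rough illustration of what Raw-Unicode-Escape output looks like, the sketch below escapes a single code point. It rests on an assumption not stated in this section: that ordinals below 256 pass through as single bytes, while larger ones are written as ``\uXXXX`` or ``\UXXXXXXXX`` with lowercase hexadecimal digits. ``raw_escape`` is a hypothetical helper, not part of the API.

```c
#include <assert.h>
#include <stdio.h>

/* Write the escaped form of one code point into `out`, which must hold at
   least 11 bytes including the terminating NUL.
   Returns the number of characters written. */
static int raw_escape(unsigned long cp, char *out)
{
    assert(out != NULL);
    if (cp >= 0x10000UL)
        return sprintf(out, "\\U%08lx", cp);    /* 8 hex digits */
    if (cp >= 0x100UL)
        return sprintf(out, "\\u%04lx", cp);    /* 4 hex digits */
    out[0] = (char)cp;                          /* Latin-1 range: as-is */
    out[1] = '\0';
    return 1;
}
```

Under that assumption, U+20AC is written as the six characters ``\u20ac`` and U+00E9 as the single byte 0xE9.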
       
    539 These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
       
   540 ordinals and only these are accepted by the codecs during encoding.
       
   541 
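The range restriction described above amounts to a simple check per character. The following standalone sketch models it on an array of code points; ``latin1_encode`` is a hypothetical name, and reporting the failing index through *bad* is a choice made for this example, whereas the real codec raises an exception under "strict" error handling.

```c
#include <assert.h>
#include <stddef.h>

/* Encode `n` code points as Latin-1 into `out` (capacity at least n bytes).
   Returns 0 on success.  On the first code point above U+00FF, stores its
   index in *bad and returns -1. */
static int latin1_encode(const unsigned long *cps, size_t n,
                         unsigned char *out, size_t *bad)
{
    size_t i;
    assert(out != NULL && bad != NULL);
    for (i = 0; i < n; i++) {
        if (cps[i] > 0xFFUL) {
            *bad = i;
            return -1;
        }
        out[i] = (unsigned char)cps[i];     /* ordinal equals byte value */
    }
    return 0;
}
```

Decoding is the mirror image and cannot fail, since every byte value is a valid Latin-1 ordinal.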
       
   542 .. % --- Latin-1 Codecs -----------------------------------------------------
       
   543 
       
   544 
       
   545 .. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
       
   546 
       
   547    Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
       
   548    *s*.  Return *NULL* if an exception was raised by the codec.
       
   549 
       
   550 
       
   551 .. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
       
   552 
       
   553    Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
       
   554    a Python string object.  Return *NULL* if an exception was raised by the codec.
       
   555 
       
   556 
       
   557 .. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
       
   558 
       
   559    Encode a Unicode object using Latin-1 and return the result as Python string
       
   560    object.  Error handling is "strict".  Return *NULL* if an exception was raised
       
   561    by the codec.
       
   562 
       
   563 These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
       
   564 codes generate errors.
       
   565 
       
   566 .. % --- ASCII Codecs -------------------------------------------------------
       
   567 
       
   568 
       
   569 .. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
       
   570 
       
   571    Create a Unicode object by decoding *size* bytes of the ASCII encoded string
       
   572    *s*.  Return *NULL* if an exception was raised by the codec.
       
   573 
       
   574 
       
   575 .. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
       
   576 
       
   577    Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
       
   578    Python string object.  Return *NULL* if an exception was raised by the codec.
       
   579 
       
   580 
       
   581 .. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
       
   582 
       
   583    Encode a Unicode object using ASCII and return the result as Python string
       
   584    object.  Error handling is "strict".  Return *NULL* if an exception was raised
       
   585    by the codec.
       
   586 
       
   587 These are the mapping codec APIs:
       
   588 
       
   589 .. % --- Character Map Codecs -----------------------------------------------
       
   590 
       
   591 This codec is special in that it can be used to implement many different codecs
       
   592 (and this is in fact what was done to obtain most of the standard codecs
       
    593 included in the :mod:`encodings` package). The codec uses mappings to encode and
       
   594 decode characters.
       
   595 
       
   596 Decoding mappings must map single string characters to single Unicode
       
   597 characters, integers (which are then interpreted as Unicode ordinals) or None
       
   598 (meaning "undefined mapping" and causing an error).
       
   599 
       
   600 Encoding mappings must map single Unicode characters to single string
       
   601 characters, integers (which are then interpreted as Latin-1 ordinals) or None
       
   602 (meaning "undefined mapping" and causing an error).
       
   603 
       
    604 The mapping objects provided need only support the ``__getitem__`` mapping
       
   605 interface.
       
   606 
       
   607 If a character lookup fails with a :exc:`LookupError`, the character is copied as-is,
       
   608 meaning that its ordinal value is interpreted as a Unicode ordinal when decoding
       
   609 or as a Latin-1 ordinal when encoding. Because of this, mappings only need to
       
   610 contain entries for characters that map to different code points.
       
   611 
       
   612 
       
   613 .. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
       
   614 
       
   615    Create a Unicode object by decoding *size* bytes of the encoded string *s* using
       
   616    the given *mapping* object.  Return *NULL* if an exception was raised by the
       
   617    codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can be
       
   618    a dictionary that maps byte values, or a unicode string that is treated as a
       
   619    lookup table. Byte values greater than or equal to the length of the string and
       
   620    U+FFFE "characters" are treated as "undefined mapping".
       
   621 
       
   622    .. versionchanged:: 2.4
       
   623       Allowed unicode string as mapping argument.
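
The lookup-table rule can be modelled in plain C without the Python headers.
The following is only a sketch of the rule as described above, not the
interpreter's implementation; ``table_decode`` and ``UNDEFINED_MAPPING`` are
hypothetical names chosen for this illustration.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical marker mirroring the U+FFFE "undefined mapping" convention. */
   #define UNDEFINED_MAPPING 0xFFFEu

   /*
    * Decode n bytes from src through a lookup table of code points.
    * A byte is undefined if it indexes past the end of the table or if
    * its table entry is U+FFFE.  Returns the number of code points
    * written to dst, or -1 at the first undefined byte (this models
    * "strict" error handling only).
    */
   static long table_decode(const unsigned char *src, size_t n,
                            const unsigned int *table, size_t table_len,
                            unsigned int *dst)
   {
       size_t i;
       assert(src != NULL && table != NULL && dst != NULL);
       for (i = 0; i < n; i++) {
           unsigned char b = src[i];
           if (b >= table_len || table[b] == UNDEFINED_MAPPING)
               return -1;
           dst[i] = table[b];
       }
       return (long)n;
   }

With a three-entry table, bytes 0 to 2 are looked up directly and any byte
value of 3 or more is reported as undefined.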
       
   624 
       
   625 
       
   626 .. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
       
   627 
       
   628    Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
       
   629    *mapping* object and return a Python string object. Return *NULL* if an
       
   630    exception was raised by the codec.
       
   631 
       
   632 
       
   633 .. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
       
   634 
       
   635    Encode a Unicode object using the given *mapping* object and return the result
       
   636    as Python string object.  Error handling is "strict".  Return *NULL* if an
       
   637    exception was raised by the codec.
       
   638 
       
   639 The following codec API is special in that it maps Unicode to Unicode.
       
   640 
       
   641 
       
   642 .. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
       
   643 
       
   644    Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
       
   645    character mapping *table* to it and return the resulting Unicode object.  Return
       
   646    *NULL* when an exception was raised by the codec.
       
   647 
       
   648    The mapping *table* must map Unicode ordinal integers to Unicode ordinal
       
   649    integers or None (causing deletion of the character).
       
   650 
       
   651    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
       
   652    and sequences work well.  Unmapped character ordinals (ones which cause a
       
   653    :exc:`LookupError`) are left untouched and are copied as-is.
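
The three possible outcomes for a character (mapped, deleted, or left untouched)
can be sketched in plain C.  This is an illustrative model under stated
assumptions rather than the codec's code: ``struct xlate_entry`` and
``translate`` are hypothetical, and a ``drop`` flag stands in for ``None``.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* One entry of a hypothetical sparse translation table. */
   struct xlate_entry {
       unsigned int from;   /* source ordinal */
       unsigned int to;     /* target ordinal, ignored when drop is set */
       int drop;            /* nonzero plays the role of None: delete */
   };

   /*
    * Translate n ordinals from src into dst (dst needs room for n entries).
    * Ordinals without a table entry are copied unchanged, which models
    * the LookupError case.  Returns the number of ordinals written.
    */
   static size_t translate(const unsigned int *src, size_t n,
                           const struct xlate_entry *table, size_t table_len,
                           unsigned int *dst)
   {
       size_t i, j, out = 0;
       assert(src != NULL && dst != NULL);
       for (i = 0; i < n; i++) {
           const struct xlate_entry *hit = NULL;
           for (j = 0; j < table_len; j++) {
               if (table[j].from == src[i]) {
                   hit = &table[j];
                   break;
               }
           }
           if (hit == NULL)
               dst[out++] = src[i];      /* unmapped: copy as-is */
           else if (!hit->drop)
               dst[out++] = hit->to;     /* mapped to another ordinal */
           /* else: mapped to "None", so the character is deleted */
       }
       return out;
   }

Because deletions shrink the result, the returned length can be smaller than
*n* but never larger.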
       
   654 
       
   655 These are the MBCS codec APIs. They are currently only available on Windows and
       
   656 use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
       
   657 DBCS) is a class of encodings, not just one.  The target encoding is defined by
       
   658 the user settings on the machine running the codec.
       
   659 
       
   660 .. % --- MBCS codecs for Windows --------------------------------------------
       
   661 
       
   662 
       
   663 .. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
       
   664 
       
   665    Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
       
   666    Return *NULL* if an exception was raised by the codec.
       
   667 
       
   668 
       
   669 .. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
       
   670 
       
   671    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
       
   672    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
       
   673    a trailing lead byte, and the number of bytes that have been decoded will be
       
   674    stored in *consumed*.
       
   675 
       
   676    .. versionadded:: 2.5
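
The idea of leaving a trailing lead byte undecoded can be illustrated with a
plain C model.  The lead-byte range below is an assumption made for this sketch
only (real code pages differ, and Windows supplies its own lead-byte test);
``is_lead_byte`` and ``complete_prefix`` are hypothetical helpers.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Assumed lead-byte range for an illustrative double-byte encoding. */
   static int is_lead_byte(unsigned char b)
   {
       return b >= 0x81 && b <= 0x9F;
   }

   /*
    * Return how many of the n bytes in s form complete characters.
    * A lead byte at the very end has no trail byte yet, so it is left
    * unconsumed for the next call.
    */
   static size_t complete_prefix(const unsigned char *s, size_t n)
   {
       size_t i = 0;
       assert(s != NULL || n == 0);
       while (i < n) {
           if (is_lead_byte(s[i])) {
               if (i + 1 >= n)
                   break;          /* trailing lead byte: stop before it */
               i += 2;             /* lead byte plus its trail byte */
           } else {
               i += 1;             /* single-byte character */
           }
       }
       return i;
   }

A caller reading a stream in chunks would carry the unconsumed bytes over and
prepend them to the next chunk before decoding again.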
       
   677 
       
   678 
       
   679 .. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
       
   680 
       
   681    Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
       
   682    Python string object.  Return *NULL* if an exception was raised by the codec.
       
   683 
       
   684 
       
   685 .. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
       
   686 
       
   687    Encode a Unicode object using MBCS and return the result as Python string
       
   688    object.  Error handling is "strict".  Return *NULL* if an exception was raised
       
   689    by the codec.
       
   690 
       
   691 .. % --- Methods & Slots ----------------------------------------------------
       
   692 
       
   693 
       
   694 .. _unicodemethodsandslots:
       
   695 
       
   696 Methods and Slot Functions
       
   697 ^^^^^^^^^^^^^^^^^^^^^^^^^^
       
   698 
       
   699 The following APIs are capable of handling Unicode objects and strings on input
       
   700 (we refer to them as strings in the descriptions) and return Unicode objects or
       
   701 integers as appropriate.
       
   702 
       
   703 They all return *NULL* or ``-1`` if an exception occurs.
       
   704 
       
   705 
       
   706 .. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
       
   707 
       
   708    Concatenate two strings, giving a new Unicode string.
       
   709 
       
   710 
       
   711 .. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
       
   712 
       
   713    Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
       
   714    will be done at all whitespace substrings.  Otherwise, splits occur at the given
       
   715    separator.  At most *maxsplit* splits will be done.  If negative, no limit is
       
   716    set.  Separators are not included in the resulting list.
       
   717 
       
   718 
       
   719 .. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
       
   720 
       
   721    Split a Unicode string at line breaks, returning a list of Unicode strings.
       
   722    CRLF is considered to be one line break.  If *keepend* is 0, the line break
       
   723    characters are not included in the resulting strings.
       
   724 
       
   725 
       
   726 .. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
       
   727 
       
   728    Translate a string by applying a character mapping table to it and return the
       
   729    resulting Unicode object.
       
   730 
       
   731    The mapping table must map Unicode ordinal integers to Unicode ordinal integers
       
   732    or None (causing deletion of the character).
       
   733 
       
   734    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
       
   735    and sequences work well.  Unmapped character ordinals (ones which cause a
       
   736    :exc:`LookupError`) are left untouched and are copied as-is.
       
   737 
       
   738    *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
       
   739    use the default error handling.
       
   740 
       
   741 
       
   742 .. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
       
   743 
       
   744    Join a sequence of strings using the given separator and return the resulting
       
   745    Unicode string.
       
   746 
       
   747 
       
   748 .. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
       
   749 
       
   750    Return 1 if *substr* matches *str*[*start*:*end*] at the given tail end
       
   751    (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
       
   752    0 otherwise. Return ``-1`` if an error occurred.
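
The *direction* convention can be modelled on plain C strings.  This is a
sketch only: ``tailmatch`` is a hypothetical helper, it works on ``char``
strings instead of Unicode objects, and it assumes the indices are already in
range.

.. code-block:: c

   #include <assert.h>
   #include <string.h>

   /*
    * Model of the direction convention using plain C strings.
    * Assumes 0 <= start <= end <= strlen(str); negative or out-of-range
    * indices are not handled by this sketch.
    * direction == -1 tests the prefix of str[start:end],
    * direction == 1 tests its suffix.  Returns 1 on a match, 0 otherwise.
    */
   static int tailmatch(const char *str, const char *substr,
                        size_t start, size_t end, int direction)
   {
       size_t sublen = strlen(substr);
       assert(start <= end && end <= strlen(str));
       assert(direction == -1 || direction == 1);
       if (sublen > end - start)
           return 0;                   /* substr cannot fit in the slice */
       if (direction == -1)
           return memcmp(str + start, substr, sublen) == 0;
       return memcmp(str + end - sublen, substr, sublen) == 0;
   }

Note that the match is anchored to the slice boundaries, so narrowing *end*
changes which characters count as the suffix.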
       
   753 
       
   754 
       
   755 .. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
       
   756 
       
   757    Return the first position of *substr* in *str*[*start*:*end*] using the given
       
   758    *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
       
   759    backward search).  The return value is the index of the first match; a value of
       
   760    ``-1`` indicates that no match was found, and ``-2`` indicates that an error
       
   761    occurred and an exception has been set.
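
Forward and backward searching within a slice can be sketched in plain C.  The
helper below is hypothetical and models only the "found" and "not found"
results on ``char`` strings; it has no counterpart to the ``-2`` error return,
and it assumes the indices are already in range.

.. code-block:: c

   #include <assert.h>
   #include <string.h>

   /*
    * Search for substr in str[start:end].
    * direction == 1 scans forward and reports the lowest matching index,
    * direction == -1 scans backward and reports the highest one.
    * The index is relative to str, not to the slice.
    * Returns -1 when there is no match.
    */
   static long find_in_slice(const char *str, const char *substr,
                             size_t start, size_t end, int direction)
   {
       size_t sublen = strlen(substr);
       size_t i;
       assert(start <= end && end <= strlen(str));
       assert(direction == 1 || direction == -1);
       if (sublen > end - start)
           return -1;
       if (direction == 1) {
           for (i = start; i + sublen <= end; i++)
               if (memcmp(str + i, substr, sublen) == 0)
                   return (long)i;
       } else {
           /* i runs from end - sublen down to start, inclusive */
           for (i = end - sublen + 1; i-- > start; )
               if (memcmp(str + i, substr, sublen) == 0)
                   return (long)i;
       }
       return -1;
   }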
       
   762 
       
   763 
       
   764 .. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
       
   765 
       
   766    Return the number of non-overlapping occurrences of *substr* in
       
   767    ``str[start:end]``.  Return ``-1`` if an error occurred.
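
"Non-overlapping" means that a match consumes its characters before the scan
continues.  A minimal plain C sketch of that counting rule follows;
``count_nonoverlapping`` is a hypothetical helper that assumes in-range indices
and a non-empty *substr*.

.. code-block:: c

   #include <assert.h>
   #include <string.h>

   /*
    * Count non-overlapping occurrences of substr in str[start:end].
    * Assumes 0 <= start <= end <= strlen(str) and a non-empty substr.
    */
   static long count_nonoverlapping(const char *str, const char *substr,
                                    size_t start, size_t end)
   {
       size_t sublen = strlen(substr);
       size_t i = start;
       long count = 0;
       assert(sublen > 0 && start <= end && end <= strlen(str));
       while (i + sublen <= end) {
           if (memcmp(str + i, substr, sublen) == 0) {
               count++;
               i += sublen;    /* skip the whole match so matches never overlap */
           } else {
               i++;
           }
       }
       return count;
   }

Under this rule ``"aa"`` occurs twice in ``"aaaa"``, not three times.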
       
   768 
       
   769 
       
   770 .. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
       
   771 
       
   772    Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
       
   773    return the resulting Unicode object. *maxcount* == -1 means replace all
       
   774    occurrences.
       
   775 
       
   776 
       
   777 .. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
       
   778 
       
   779    Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
       
   780    respectively.
       
   781 
       
   782 
       
   783 .. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
       
   784 
       
   785    Rich compare two unicode strings and return one of the following:
       
   786 
       
   787    * ``NULL`` in case an exception was raised
       
   788    * :const:`Py_True` or :const:`Py_False` for successful comparisons
       
   789    * :const:`Py_NotImplemented` in case the type combination is unknown
       
   790 
       
   791    Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
       
   792    :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
       
   793    with a :exc:`UnicodeDecodeError`.
       
   794 
       
   795    Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
       
   796    :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
       
   797 
       
   798 
       
   799 .. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
       
   800 
       
   801    Return a new string object from *format* and *args*; this is analogous to
       
   802    ``format % args``.  The *args* argument must be a tuple.
       
   803 
       
   804 
       
   805 .. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
       
   806 
       
   807    Check whether *element* is contained in *container* and return true or false
       
   808    accordingly.
       
   809 
       
   810    *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
       
   811    there was an error.