.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is
   also possible to build a UCS4 version of Python (most recent Linux distributions
   come with UCS4 builds of Python). These builds then use a 32-bit type for
   :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
   where :ctype:`wchar_t` is available and compatible with the chosen Python
   Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
   :ctype:`wchar_t` to enhance native platform compatibility. On all other
   platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
   short` (UCS2) or :ctype:`unsigned long` (UCS4).

   Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
   this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList(void)

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
   size. *u* may be *NULL*, which causes the contents to be undefined. It is the
   user's responsibility to fill in the needed data. The buffer is copied into the
   new object. If the buffer is not *NULL*, the return value might be a shared
   object. Therefore, modification of the resulting Unicode object is only allowed
   when *u* is *NULL*.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.

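
The termination rule for :cfunc:`PyUnicode_AsWideChar` can be illustrated
without the Python headers. The following is a minimal sketch in plain C, not
CPython code. ``copy_wide_at_most`` is a hypothetical stand-in that follows the
contract described above (copy at most *size* elements, write no terminator),
and ``copy_wide_terminated`` shows one way a caller might reserve a slot and add
the terminator itself.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract described above: copies at most
   `size` elements and never writes a terminator. */
static ptrdiff_t copy_wide_at_most(const wchar_t *src, size_t src_len,
                                   wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    assert(src != NULL && dst != NULL);
    wmemcpy(dst, src, n);
    return (ptrdiff_t)n;
}

/* Caller-side pattern: keep one slot free and terminate explicitly. */
static ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                                      wchar_t *dst, size_t dst_capacity)
{
    ptrdiff_t copied;
    if (dst_capacity == 0)
        return -1;                  /* no room even for the terminator */
    copied = copy_wide_at_most(src, src_len, dst, dst_capacity - 1);
    dst[copied] = L'\0';            /* the caller adds the 0-termination */
    return copied;
}
```

With a four-element buffer, ``copy_wide_terminated(L"abcdef", 6, buf, 4)``
copies three characters and stores the terminator in ``buf[3]``.
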

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of builtin codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the builtin :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
builtin codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` builtin function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

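
The effect of *consumed* on a buffer that ends in the middle of a character can
be sketched in plain C. This is an illustration under simplifying assumptions,
not the CPython decoder. The hypothetical helper ``utf8_consumable`` only
inspects lead bytes near the end of the buffer to find a trailing incomplete
sequence, and it performs no validation of the data.

```c
#include <assert.h>
#include <stddef.h>

/* Expected sequence length for a UTF-8 lead byte, or 0 if `b` is not a
   lead byte (for example a continuation byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Return how many leading bytes of s[0..size) can be decoded now.  A
   trailing incomplete sequence is left for the next call. */
static size_t utf8_consumable(const unsigned char *s, size_t size)
{
    size_t back;
    assert(s != NULL || size == 0);
    for (back = 1; back <= 4 && back <= size; back++) {
        size_t need = utf8_seq_len(s[size - back]);
        if (need == 0)
            continue;               /* continuation byte: keep looking back */
        return need > back ? size - back : size;
    }
    return size;                    /* no lead byte found near the end */
}
```

For the bytes ``61 E2 82 AC`` (the letter "a" followed by U+20AC), passing only
the first three bytes yields 1, so the two bytes of the unfinished sequence
would be kept and prepended to the next chunk.
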

.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first four bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

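
The BOM handling described for :cfunc:`PyUnicode_DecodeUTF32` can be sketched in
plain C as follows. This is a simplified illustration, not the CPython
implementation, and the helper names ``utf32_skip_bom`` and ``utf32_read`` are
hypothetical. A caller following the rule above would consult the BOM only when
it starts in native order, that is, when ``*byteorder`` is 0.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the first four bytes of a UTF-32 buffer.  If they form a BOM
   (U+FEFF), store -1 (little endian) or 1 (big endian) in *byteorder and
   return 4, the number of bytes to skip.  Otherwise leave *byteorder
   unchanged and return 0. */
static size_t utf32_skip_bom(const unsigned char *s, size_t size,
                             int *byteorder)
{
    assert(byteorder != NULL);
    if (size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;
        return 4;
    }
    return 0;
}

/* Read one UTF-32 code unit in the given order (-1 little, 1 big). */
static unsigned long utf32_read(const unsigned char *p, int byteorder)
{
    if (byteorder == -1)
        return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
               ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}
```

The BOM itself is skipped rather than returned, which matches the statement
that it is not copied into the resulting Unicode string.
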

.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single codepoint.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

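
The *byteorder* table for :cfunc:`PyUnicode_EncodeUTF32` can be illustrated with
a small plain C encoder. This is a sketch under simplifying assumptions, not
CPython code. The names ``put_u32``, ``host_is_little_endian`` and
``utf32_encode`` are hypothetical, the input is taken to be an array of full
code points, and the caller is assumed to provide a large enough output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Store one code point as four bytes in the selected layout. */
static void put_u32(unsigned char *p, unsigned long v, int little_endian)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = little_endian ? 8 * i : 8 * (3 - i);
        p[i] = (unsigned char)((v >> shift) & 0xFF);
    }
}

/* Detect the byte order of the machine running this code. */
static int host_is_little_endian(void)
{
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Encode n code points into out: -1 little endian, 1 big endian,
   0 native order preceded by a BOM.  out must hold 4 * (n + 1) bytes.
   Returns the number of bytes written. */
static size_t utf32_encode(const unsigned long *cps, size_t n,
                           int byteorder, unsigned char *out)
{
    size_t i, pos = 0;
    int little = byteorder == 0 ? host_is_little_endian() : byteorder < 0;
    assert(cps != NULL && out != NULL);
    if (byteorder == 0) {
        put_u32(out, 0xFEFFUL, little);     /* BOM only in native mode */
        pos = 4;
    }
    for (i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, cps[i], little);
    return pos;
}
```

Encoding two code points therefore produces 8 bytes with an explicit byte
order and 12 bytes in native mode, the extra four being the BOM.
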

.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first two bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.

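
When :ctype:`Py_UNICODE` is a 32-bit type, a value above U+FFFF does not fit in
one UTF-16 code unit and has to be written as a surrogate pair. The arithmetic
is defined by the Unicode standard; the following plain C sketch shows it in
isolation. The function name ``utf16_units`` is hypothetical and the code is not
taken from CPython.

```c
#include <assert.h>
#include <stddef.h>

/* Split one code point into UTF-16 code units.  Returns 1 for code points
   in the BMP and 2 when a surrogate pair is needed. */
static size_t utf16_units(unsigned long cp, unsigned short out[2])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000UL;                                      /* 20 bits remain */
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For example, U+1D11E is split into the pair ``0xD834``, ``0xDD1E``, while
U+20AC is stored unchanged in a single unit.
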

.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

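
Because Latin-1 covers exactly the ordinals 0 to 255, encoding reduces to a
range check followed by a narrowing copy. The plain C sketch below illustrates
that idea with "strict"-style behaviour, stopping at the first ordinal that
cannot be represented. It is not the CPython codec, and the name
``latin1_encode_strict`` and its error reporting through ``bad`` are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Encode n code points as Latin-1.  Returns the number of bytes written,
   or -1 if a code point above U+00FF is found; *bad then holds its index
   and is otherwise left untouched. */
static long latin1_encode_strict(const unsigned long *cps, size_t n,
                                 unsigned char *out, size_t *bad)
{
    size_t i;
    assert(out != NULL && bad != NULL);
    for (i = 0; i < n; i++) {
        if (cps[i] > 0xFFUL) {
            *bad = i;
            return -1;
        }
        out[i] = (unsigned char)cps[i];
    }
    return (long)n;
}
```

Encoding U+0048 U+00E9 succeeds with two bytes, while U+0048 U+20AC fails at
index 1.
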
These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or
Latin-1 ordinal, respectively. Because of this, mappings only need to contain
entries for characters that map to different code points.


.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object. Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
   be a dictionary keyed by byte value, or a Unicode string, which is treated as a
   lookup table. Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

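
The lookup-table form of charmap decoding can be sketched in plain C. The
example is a simplified illustration, not the CPython codec. The name
``charmap_decode`` is hypothetical, the table is an array of 16-bit code points
indexed by byte value, and errors are handled in a "strict"-like way by
stopping at the first undefined mapping.

```c
#include <assert.h>
#include <stddef.h>

#define UNDEFINED_MAPPING 0xFFFEu   /* table entry meaning "no mapping" */

/* Decode n bytes through a lookup table of table_len code points.
   Returns the number of code points written, or -1 at the first byte that
   has no mapping; *bad then holds the position of that byte. */
static long charmap_decode(const unsigned char *s, size_t n,
                           const unsigned short *table, size_t table_len,
                           unsigned short *out, size_t *bad)
{
    size_t i;
    assert(table != NULL && out != NULL && bad != NULL);
    for (i = 0; i < n; i++) {
        /* A byte past the end of the table or a U+FFFE entry is undefined. */
        if (s[i] >= table_len || table[s[i]] == UNDEFINED_MAPPING) {
            *bad = i;
            return -1;
        }
        out[i] = table[s[i]];
    }
    return (long)n;
}
```

With the three-entry table ``{0x0041, 0x20AC, 0xFFFE}``, the bytes ``01 00``
decode to U+20AC U+0041, while byte ``02`` (a U+FFFE entry) and byte ``03``
(outside the table) are both reported as undefined.
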

.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object. Error handling is "strict". Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object. Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

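
The three outcomes of a translation lookup (replace, delete, copy as-is) can be
shown with a small plain C sketch. It is an illustration only, not CPython
code. The ``translate_entry`` structure, the ``DELETE_CHAR`` marker and the
``translate`` function are hypothetical stand-ins for a Python mapping whose
values are ordinals or None, and a failed search in the array plays the role of
a :exc:`LookupError`.

```c
#include <assert.h>
#include <stddef.h>

#define DELETE_CHAR (-1L)   /* plays the role of None in the table */

struct translate_entry {
    unsigned long from;     /* ordinal to look up */
    long to;                /* replacement ordinal, or DELETE_CHAR */
};

/* Translate n ordinals from in to out and return the output length.
   out must have room for n ordinals. */
static size_t translate(const unsigned long *in, size_t n,
                        const struct translate_entry *table,
                        size_t table_len, unsigned long *out)
{
    size_t i, j, pos = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        long mapped = (long)in[i];      /* default: unmapped, copy as-is */
        for (j = 0; j < table_len; j++) {
            if (table[j].from == in[i]) {
                mapped = table[j].to;
                break;
            }
        }
        if (mapped != DELETE_CHAR)
            out[pos++] = (unsigned long)mapped;
    }
    return pos;
}
```

With a table that maps ``a`` to ``A`` and deletes ``-``, the input ``a-b``
becomes ``Ab``: one character is replaced, one is dropped and one is copied
unchanged.
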
These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.
|
690 |
|
691 .. % --- Methods & Slots ---------------------------------------------------- |
|
692 |
|
693 |
|
694 .. _unicodemethodsandslots: |
|
695 |
|
696 Methods and Slot Functions |
|
697 ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
698 |
|
699 The following APIs are capable of handling Unicode objects and strings on input |
|
700 (we refer to them as strings in the descriptions) and return Unicode objects or |
|
701 integers as appropriate. |
|
702 |
|
703 They all return *NULL* or ``-1`` if an exception occurs. |
|
704 |
|
705 |
|
706 .. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right) |
|
707 |
|
708 Concat two strings giving a new Unicode string. |
|
709 |
|
710 |
|
711 .. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit) |
|
712 |
|
713 Split a string giving a list of Unicode strings. If sep is *NULL*, splitting |
|
714 will be done at all whitespace substrings. Otherwise, splits occur at the given |
|
715 separator. At most *maxsplit* splits will be done. If negative, no limit is |
|
716 set. Separators are not included in the resulting list. |
|
717 |
|
718 |
|
719 .. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend) |
|
720 |
|
721 Split a Unicode string at line breaks, returning a list of Unicode strings. |
|
722 CRLF is considered to be one line break. If *keepend* is 0, the line break |
|
723 characters are not included in the resulting strings. |
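
The line-break rule above can be illustrated with a small plain C model. This is
only a sketch: ``model_scan_line`` and ``model_count_lines`` are hypothetical
helpers, not part of the Python C API, :ctype:`wchar_t` stands in for
:ctype:`Py_UNICODE`, and only CR, LF and CRLF are recognised as line breaks.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of the Python C API: scan one line of s
   starting at offset pos (pos < len).  Stores the length of the line body
   in *body and the length of its line break (0, 1 or 2) in *brk.
   Simplification: only CR, LF and CRLF are treated as line breaks. */
static void
model_scan_line(const wchar_t *s, size_t len, size_t pos,
                size_t *body, size_t *brk)
{
    size_t i = pos;

    while (i < len && s[i] != L'\n' && s[i] != L'\r')
        i++;
    *body = i - pos;
    if (i == len)
        *brk = 0;                   /* last line, no terminator */
    else if (s[i] == L'\r' && i + 1 < len && s[i + 1] == L'\n')
        *brk = 2;                   /* CRLF counts as one line break */
    else
        *brk = 1;
}

/* Count the list elements a splitlines-style pass would produce. */
static size_t
model_count_lines(const wchar_t *s)
{
    size_t len = wcslen(s), pos = 0, n = 0, body, brk;

    while (pos < len) {
        model_scan_line(s, len, pos, &body, &brk);
        pos += body + brk;
        n++;
    }
    return n;
}
```

In this model, each resulting element would span ``body`` characters when
*keepend* is 0 and ``body + brk`` characters otherwise.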
|
724 |
|
725 |
|
726 .. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors) |
|
727 |
|
728 Translate a string by applying a character mapping table to it and return the |
|
729 resulting Unicode object. |
|
730 |
|
731 The mapping table must map Unicode ordinal integers to Unicode ordinal integers |
|
732 or ``None`` (causing deletion of the character). |
|
733 |
|
734 Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries |
|
735 and sequences work well. Unmapped character ordinals (ones which cause a |
|
736 :exc:`LookupError`) are left untouched and are copied as-is. |
|
737 |
|
738 *errors* has the usual meaning for codecs. It may be *NULL*, which indicates that |
|
739 the default error handling should be used. |
|
740 |
|
741 |
|
742 .. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq) |
|
743 |
|
744 Join a sequence of strings using the given separator and return the resulting |
|
745 Unicode string. |
|
746 |
|
747 |
|
748 .. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction) |
|
749 |
|
750 Return 1 if *substr* matches *str*[*start*:*end*] at the given tail end |
|
751 (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match), |
|
752 0 otherwise. Return ``-1`` if an error occurred. |
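
As a rough illustration of the *direction* argument, the following plain C
model mirrors the documented matching rule. It is a sketch under assumptions:
``model_tailmatch`` is a hypothetical name, :ctype:`wchar_t` stands in for
:ctype:`Py_UNICODE`, negative indices are simply clamped to 0, and any
non-positive *direction* is treated as a prefix match (only ``-1`` and ``1``
are documented).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical plain C model of the documented rule; this is not CPython
   code and it cannot report errors.  Returns 1 on a match, 0 otherwise. */
static int
model_tailmatch(const wchar_t *str, const wchar_t *substr,
                ptrdiff_t start, ptrdiff_t end, int direction)
{
    ptrdiff_t len = (ptrdiff_t)wcslen(str);
    ptrdiff_t sublen = (ptrdiff_t)wcslen(substr);

    /* Clip the slice to the string (simplified slice handling). */
    if (start < 0)
        start = 0;
    if (end > len)
        end = len;
    if (end - start < sublen)
        return 0;                   /* slice too short to contain substr */

    if (direction > 0)              /* suffix match: compare at the end */
        return wmemcmp(str + end - sublen, substr, (size_t)sublen) == 0;
    /* prefix match: compare at the start of the slice */
    return wmemcmp(str + start, substr, (size_t)sublen) == 0;
}
```

With this model, ``model_tailmatch(L"unicode", L"uni", 0, 7, -1)`` and
``model_tailmatch(L"unicode", L"code", 0, 7, 1)`` both report a match.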
|
753 |
|
754 |
|
755 .. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction) |
|
756 |
|
757 Return the first position of *substr* in *str*[*start*:*end*] using the given |
|
758 *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a |
|
759 backward search). The return value is the index of the first match; a value of |
|
760 ``-1`` indicates that no match was found, and ``-2`` indicates that an error |
|
761 occurred and an exception has been set. |
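
The effect of *direction* on the returned index can be sketched in plain C.
The helper below is hypothetical (``model_find`` is not part of the Python C
API), uses :ctype:`wchar_t` in place of :ctype:`Py_UNICODE`, and clamps
negative indices to 0. Because it has no exception state, it only models the
``-1`` (not found) result, not the ``-2`` error result.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical plain C model: return the index of the first match found
   when scanning str[start:end] in the given direction, or -1 if there is
   no match.  The -2 error result of the real API is not modelled. */
static ptrdiff_t
model_find(const wchar_t *str, const wchar_t *substr,
           ptrdiff_t start, ptrdiff_t end, int direction)
{
    ptrdiff_t len = (ptrdiff_t)wcslen(str);
    ptrdiff_t sublen = (ptrdiff_t)wcslen(substr);
    ptrdiff_t i;

    if (start < 0)
        start = 0;
    if (end > len)
        end = len;
    if (end - start < sublen)
        return -1;

    if (direction > 0) {
        /* forward search: lowest matching index wins */
        for (i = start; i + sublen <= end; i++)
            if (wmemcmp(str + i, substr, (size_t)sublen) == 0)
                return i;
    }
    else {
        /* backward search: highest matching index wins */
        for (i = end - sublen; i >= start; i--)
            if (wmemcmp(str + i, substr, (size_t)sublen) == 0)
                return i;
    }
    return -1;
}
```

A caller of the real function would additionally need to distinguish ``-1``
from ``-2`` before using the result as an index.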
|
762 |
|
763 |
|
764 .. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end) |
|
765 |
|
766 Return the number of non-overlapping occurrences of *substr* in |
|
767 ``str[start:end]``. Return ``-1`` if an error occurred. |
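
To make "non-overlapping" concrete, here is a minimal plain C model of the
counting rule. It is a sketch, not CPython code: ``model_count`` is a
hypothetical name, :ctype:`wchar_t` stands in for :ctype:`Py_UNICODE`,
negative indices are clamped to 0, and an empty *substr* is outside the scope
of the sketch.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical plain C model: count non-overlapping occurrences of substr
   in str[start:end], scanning left to right. */
static ptrdiff_t
model_count(const wchar_t *str, const wchar_t *substr,
            ptrdiff_t start, ptrdiff_t end)
{
    ptrdiff_t len = (ptrdiff_t)wcslen(str);
    ptrdiff_t sublen = (ptrdiff_t)wcslen(substr);
    ptrdiff_t i, n = 0;

    if (start < 0)
        start = 0;
    if (end > len)
        end = len;
    if (sublen == 0)
        return 0;       /* empty needle: not modelled by this sketch */

    i = start;
    while (i + sublen <= end) {
        if (wmemcmp(str + i, substr, (size_t)sublen) == 0) {
            n++;
            i += sublen;    /* skip the whole match, so matches never overlap */
        }
        else {
            i++;
        }
    }
    return n;
}
```

For example, the model counts two occurrences of ``"aa"`` in ``"aaaa"``, not
three.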
|
768 |
|
769 |
|
770 .. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount) |
|
771 |
|
772 Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and |
|
773 return the resulting Unicode object. *maxcount* == -1 means replace all |
|
774 occurrences. |
|
775 |
|
776 |
|
777 .. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right) |
|
778 |
|
779 Compare two strings and return -1, 0, 1 for less than, equal, and greater than, |
|
780 respectively. |
|
781 |
|
782 |
|
783 .. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op) |
|
784 |
|
785 Rich compare two Unicode strings and return one of the following: |
|
786 |
|
787 * *NULL* in case an exception was raised |
|
788 * :const:`Py_True` or :const:`Py_False` for successful comparisons |
|
789 * :const:`Py_NotImplemented` in case the type combination is unknown |
|
790 |
|
791 Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a |
|
792 :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails |
|
793 with a :exc:`UnicodeDecodeError`. |
|
794 |
|
795 Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`, |
|
796 :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`. |
|
797 |
|
798 |
|
799 .. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args) |
|
800 |
|
801 Return a new string object from *format* and *args*; this is analogous to |
|
802 ``format % args``. The *args* argument must be a tuple. |
|
803 |
|
804 |
|
805 .. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element) |
|
806 |
|
807 Check whether *element* is contained in *container* and return true or false |
|
808 accordingly. |
|
809 |
|
810 *element* has to coerce to a one-element Unicode string. ``-1`` is returned if |
|
811 there was an error. |