:mod:`codecs` --- Codec registry and base classes
=================================================

.. module:: codecs
   :synopsis: Encode and decode data and streams.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: Codecs
   pair: Codecs; encode
   pair: Codecs; decode
   single: streams
   pair: stackable; streams

This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry which
manages the codec and error handling lookup process.

It defines the following functions:


.. function:: register(search_function)

   Register a codec search function. Search functions are expected to take one
   argument, the encoding name in all lower case letters, and return a
   :class:`CodecInfo` object having the following attributes:

   * ``name`` The name of the encoding;

   * ``encode`` The stateless encoding function;

   * ``decode`` The stateless decoding function;

   * ``incrementalencoder`` An incremental encoder class or factory function;

   * ``incrementaldecoder`` An incremental decoder class or factory function;

   * ``streamwriter`` A stream writer class or factory function;

   * ``streamreader`` A stream reader class or factory function.

   The various functions or classes take the following arguments:

   *encode* and *decode*: These must be functions or methods which have the same
   interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
   Codec Interface). The functions/methods are expected to work in a stateless
   mode.

   *incrementalencoder* and *incrementaldecoder*: These have to be factory
   functions providing the following interface:

      ``factory(errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
   respectively. Incremental codecs can maintain state.

   *streamreader* and *streamwriter*: These have to be factory functions providing
   the following interface:

      ``factory(stream, errors='strict')``

   The factory functions must return objects providing the interfaces defined by
   the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
   Stream codecs can maintain state.

   Possible values for errors are ``'strict'`` (raise an exception in case of an
   encoding error), ``'replace'`` (replace malformed data with a suitable
   replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and
   continue without further notice), ``'xmlcharrefreplace'`` (replace with the
   appropriate XML character reference (for encoding only)) and
   ``'backslashreplace'`` (replace with backslashed escape sequences (for encoding
   only)) as well as any other error handling name defined via
   :func:`register_error`.

   In case a search function cannot find a given encoding, it should return
   ``None``.

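As a minimal sketch of the search-function contract described for
:func:`register`, the following registers a function that answers for one made-up
encoding name. The name ``example_latin1`` and the function ``example_search`` are
hypothetical and exist only for this illustration; the codec simply reuses the
components of the built-in ``latin-1`` codec. The snippet assumes an interpreter
that accepts both ``u''`` and ``b''`` literals.

```python
import codecs


def example_search(name):
    # The registry passes the encoding name in lower case.
    if name != "example_latin1":
        return None  # not ours: let other search functions try
    base = codecs.lookup("latin-1")
    # Reuse every component of the existing codec under the new name.
    return codecs.CodecInfo(
        name="example_latin1",
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(example_search)

info = codecs.lookup("example_latin1")
encoded, consumed = info.encode(u"caf\xe9")
```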

.. function:: lookup(encoding)

   Looks up the codec info in the Python codec registry and returns a
   :class:`CodecInfo` object as defined above.

   Encodings are first looked up in the registry's cache. If not found, the list of
   registered search functions is scanned. If no :class:`CodecInfo` object is
   found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
   is stored in the cache and returned to the caller.

To simplify access to the various codecs, the module provides these additional
functions which use :func:`lookup` for the codec lookup:


.. function:: getencoder(encoding)

   Look up the codec for the given encoding and return its encoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getdecoder(encoding)

   Look up the codec for the given encoding and return its decoder function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getincrementalencoder(encoding)

   Look up the codec for the given encoding and return its incremental encoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental encoder.

   .. versionadded:: 2.5


.. function:: getincrementaldecoder(encoding)

   Look up the codec for the given encoding and return its incremental decoder
   class or factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
   doesn't support an incremental decoder.

   .. versionadded:: 2.5


.. function:: getreader(encoding)

   Look up the codec for the given encoding and return its StreamReader class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.


.. function:: getwriter(encoding)

   Look up the codec for the given encoding and return its StreamWriter class or
   factory function.

   Raises a :exc:`LookupError` in case the encoding cannot be found.

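The lookup helpers above can be exercised as follows. This is an illustrative
sketch only: the variable names are arbitrary, and the unknown encoding name
``no-such-encoding-example`` is made up to trigger the :exc:`LookupError` path.

```python
import codecs

# Stateless encoder and decoder functions for UTF-8.
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Both return a tuple (output object, length consumed).
data, chars_consumed = encode(u"\u20ac")   # EURO SIGN
text, bytes_consumed = decode(data)

# An encoding that no search function knows raises LookupError.
try:
    codecs.lookup("no-such-encoding-example")
    found = True
except LookupError:
    found = False
```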

.. function:: register_error(name, error_handler)

   Register the error handling function *error_handler* under the name *name*.
   *error_handler* will be called during encoding and decoding in case of an error,
   when *name* is specified as the errors parameter.

   For encoding, *error_handler* will be called with a :exc:`UnicodeEncodeError`
   instance, which contains information about the location of the error. The error
   handler must either raise this or a different exception or return a tuple with a
   replacement for the unencodable part of the input and a position where encoding
   should continue. The encoder will encode the replacement and continue encoding
   the original input at the specified position. Negative position values will be
   treated as being relative to the end of the input string. If the resulting
   position is out of bounds, an :exc:`IndexError` will be raised.

   Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
   or :exc:`UnicodeTranslateError` will be passed to the handler and that the
   replacement from the error handler will be put into the output directly.

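A small sketch of a custom handler for the encoding case follows. The handler
name ``example_qmark`` and the function ``question_marks`` are hypothetical. The
handler returns the tuple described above: a replacement for the unencodable
part and the position at which encoding should resume.

```python
import codecs


def question_marks(exc):
    # Only the encoding case is handled in this sketch.
    if isinstance(exc, UnicodeEncodeError):
        # One "?" per unencodable character; resume right after them.
        return (u"?" * (exc.end - exc.start), exc.end)
    raise exc


codecs.register_error("example_qmark", question_marks)

result = u"a\u1234\u5678b".encode("ascii", "example_qmark")
```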

.. function:: lookup_error(name)

   Return the error handler previously registered under the name *name*.

   Raises a :exc:`LookupError` in case the handler cannot be found.


.. function:: strict_errors(exception)

   Implements the ``strict`` error handling.


.. function:: replace_errors(exception)

   Implements the ``replace`` error handling.


.. function:: ignore_errors(exception)

   Implements the ``ignore`` error handling.


.. function:: xmlcharrefreplace_errors(exception)

   Implements the ``xmlcharrefreplace`` error handling.


.. function:: backslashreplace_errors(exception)

   Implements the ``backslashreplace`` error handling.

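To make the built-in handlers concrete, this sketch encodes the same two
characters to ASCII under each scheme, and decodes one invalid byte with
``'replace'``. The handler name ``example-unregistered-handler`` is made up to
show the :exc:`LookupError` raised by :func:`lookup_error`.

```python
import codecs

text = u"\xe9\u20ac"   # neither character exists in ASCII

replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
xml_refs = text.encode("ascii", "xmlcharrefreplace")
escaped = text.encode("ascii", "backslashreplace")

# On decoding, 'replace' substitutes U+FFFD REPLACEMENT CHARACTER.
decoded = b"\xff".decode("ascii", "replace")

# Handlers are retrieved by name; unknown names raise LookupError.
handler = codecs.lookup_error("replace")
try:
    codecs.lookup_error("example-unregistered-handler")
    known = True
except LookupError:
    known = False
```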

To simplify working with encoded files or streams, the module also defines these
utility functions:


.. function:: open(filename, mode[, encoding[, errors[, buffering]]])

   Open an encoded file using the given *mode* and return a wrapped version
   providing transparent encoding/decoding. The default file mode is ``'r'``
   meaning to open the file in read mode.

   .. note::

      The wrapped version will only accept the object format defined by the codecs,
      i.e. Unicode objects for most built-in codecs. Output is also codec-dependent
      and will usually be Unicode as well.

   .. note::

      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using 8-bit
      values. This means that no automatic conversion of ``'\n'`` is done
      on reading and writing.

   *encoding* specifies the encoding which is to be used for the file.

   *errors* may be given to define the error handling. It defaults to ``'strict'``
   which causes a :exc:`ValueError` to be raised in case an encoding error occurs.

   *buffering* has the same meaning as for the built-in :func:`open` function. It
   defaults to line buffered.


.. function:: EncodedFile(file, input[, output[, errors]])

   Return a wrapped version of file which provides transparent encoding
   translation.

   Strings written to the wrapped file are interpreted according to the given
   *input* encoding and then written to the original file as strings using the
   *output* encoding. The intermediate encoding will usually be Unicode but depends
   on the specified codecs.

   If *output* is not given, it defaults to *input*.

   *errors* may be given to define the error handling. It defaults to ``'strict'``,
   which causes :exc:`ValueError` to be raised in case an encoding error occurs.

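The translation performed by :func:`EncodedFile` can be sketched with in-memory
files. The use of ``io.BytesIO`` as the underlying file is an assumption made to
keep the example self-contained; the two encodings are passed positionally as
*input* (UTF-8) and *output* (Latin-1).

```python
import codecs
import io

# Writing: UTF-8 bytes go in, Latin-1 bytes reach the underlying file.
raw_out = io.BytesIO()
wrapped_out = codecs.EncodedFile(raw_out, "utf-8", "latin-1")
wrapped_out.write(u"caf\xe9".encode("utf-8"))
stored = raw_out.getvalue()

# Reading: Latin-1 bytes in the file come back as UTF-8 bytes.
raw_in = io.BytesIO(b"caf\xe9")
wrapped_in = codecs.EncodedFile(raw_in, "utf-8", "latin-1")
fetched = wrapped_in.read()
```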

.. function:: iterencode(iterable, encoding[, errors])

   Uses an incremental encoder to iteratively encode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental encoder.

   .. versionadded:: 2.5


.. function:: iterdecode(iterable, encoding[, errors])

   Uses an incremental decoder to iteratively decode the input provided by
   *iterable*. This function is a :term:`generator`. *errors* (as well as any
   other keyword argument) is passed through to the incremental decoder.

   .. versionadded:: 2.5

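A brief sketch of both generators follows. The input lists are arbitrary; the
byte chunks are deliberately split in the middle of multi-byte UTF-8 sequences to
show that the incremental decoder carries state from one chunk to the next.

```python
import codecs

# Encode text piece by piece and join the resulting byte strings.
pieces = [u"caf", u"\xe9 ", u"\u20ac"]
encoded = b"".join(codecs.iterencode(pieces, "utf-8"))

# Decode bytes whose chunk boundaries fall inside multi-byte characters.
byte_chunks = [b"caf\xc3", b"\xa9 \xe2\x82", b"\xac"]
decoded = u"".join(codecs.iterdecode(byte_chunks, "utf-8"))
```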

The module also provides the following constants which are useful for reading
and writing to platform dependent files:


.. data:: BOM
          BOM_BE
          BOM_LE
          BOM_UTF8
          BOM_UTF16
          BOM_UTF16_BE
          BOM_UTF16_LE
          BOM_UTF32
          BOM_UTF32_BE
          BOM_UTF32_LE

   These constants define various encodings of the Unicode byte order mark (BOM)
   used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
   stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
   :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
   native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
   :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
   :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
   encodings.

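The relationship between the constants can be checked directly. In this sketch
the helper ``strip_utf8_bom`` is hypothetical and only illustrates one common use
of :const:`BOM_UTF8`, namely removing a leading UTF-8 signature from raw bytes.

```python
import codecs
import sys

# BOM_UTF16 follows the platform's native byte order.
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE

# The generic 'utf-16' codec prepends the native-order BOM when encoding.
encoded = u"a".encode("utf-16")
starts_with_bom = encoded.startswith(codecs.BOM_UTF16)


def strip_utf8_bom(data):
    # Drop a leading UTF-8 signature if one is present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```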

.. _codec-base-classes:

Codec Base Classes
------------------

The :mod:`codecs` module defines a set of base classes which define the
interface and can also be used to easily write your own codecs for use in
Python.

Each codec has to define four interfaces to make it usable as a codec in Python:
stateless encoder, stateless decoder, stream reader and stream writer. The
stream reader and writers typically reuse the stateless encoder/decoder to
implement the file protocols.

The :class:`Codec` class defines the interface for stateless encoders/decoders.

To simplify and standardize error handling, the :meth:`encode` and
:meth:`decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:

+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
| ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
|                         | this is the default.                          |
+-------------------------+-----------------------------------------------+
| ``'ignore'``            | Ignore the character and continue with the    |
|                         | next.                                         |
+-------------------------+-----------------------------------------------+
| ``'replace'``           | Replace with a suitable replacement           |
|                         | character; Python will use the official       |
|                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
|                         | Unicode codecs on decoding and '?' on         |
|                         | encoding.                                     |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
|                         | reference (only for encoding).                |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
|                         | (only for encoding).                          |
+-------------------------+-----------------------------------------------+

The set of allowed values can be extended via :func:`register_error`.


.. _codec-objects:

Codec Objects
^^^^^^^^^^^^^

The :class:`Codec` class defines these methods which also define the function
interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])

   Encodes the object *input* and returns a tuple (output object, length consumed).
   While codecs are not restricted to use with Unicode, in a Unicode context,
   encoding converts a Unicode object to a plain string using a particular
   character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamWriter` for codecs which have to keep state in order to make
   encoding efficient.

   The encoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.


.. method:: Codec.decode(input[, errors])

   Decodes the object *input* and returns a tuple (output object, length consumed).
   In a Unicode context, decoding converts a plain string encoded using a
   particular character set encoding to a Unicode object.

   *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
   Python strings, buffer objects and memory mapped files are examples of objects
   providing this slot.

   *errors* defines the error handling to apply. It defaults to ``'strict'``
   handling.

   The method may not store state in the :class:`Codec` instance. Use
   :class:`StreamReader` for codecs which have to keep state in order to make
   decoding efficient.

   The decoder must be able to handle zero length input and return an empty object
   of the output object type in this situation.

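The stateless contract can be sketched with a tiny :class:`Codec` subclass. The
class name ``ExampleCodec`` is hypothetical; it delegates to the module-level
functions ``codecs.latin_1_encode`` and ``codecs.latin_1_decode`` rather than
implementing a real character mapping, so that only the interface is on show:
a tuple return value, no stored state, and support for zero length input.

```python
import codecs


class ExampleCodec(codecs.Codec):
    # Both methods are stateless and return (output object, length consumed).

    def encode(self, input, errors="strict"):
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        return codecs.latin_1_decode(input, errors)


codec = ExampleCodec()
encoded = codec.encode(u"\xfc")
decoded = codec.decode(b"\xfc")
empty_encoded = codec.encode(u"")
empty_decoded = codec.decode(b"")
```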

The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
with multiple calls to the :meth:`encode`/:meth:`decode` method of the
incremental encoder/decoder. The incremental encoder/decoder keeps track of the
encoding/decoding process during method calls.

The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.


.. _incremental-encoder-objects:

IncrementalEncoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 2.5

The :class:`IncrementalEncoder` class is used for encoding an input in multiple
steps. It defines the following methods which every incremental encoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalEncoder([errors])

   Constructor for an :class:`IncrementalEncoder` instance.

   All incremental encoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalEncoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalEncoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: encode(object[, final])

      Encodes *object* (taking the current state of the encoder into account)
      and returns the resulting encoded object. If this is the last call to
      :meth:`encode` *final* must be true (the default is false).


   .. method:: reset()

      Reset the encoder to the initial state.

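One visible effect of encoder state is the byte order mark written by the
``utf-16`` codec. The sketch below, which assumes the standard ``utf-16`` codec
shipped with Python, shows that the BOM is emitted on the first call only, that
the joined output equals the stateless result, and that :meth:`reset` returns the
encoder to its initial state.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode(u"a")          # BOM plus the first character
second = encoder.encode(u"b", True)   # final call: no second BOM

# Joined output matches a single stateless encode of the joined input.
joined = first + second
stateless = u"ab".encode("utf-16")

# After reset() the encoder starts over and writes a BOM again.
encoder.reset()
again = encoder.encode(u"c", True)
```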

.. _incremental-decoder-objects:

IncrementalDecoder Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`IncrementalDecoder` class is used for decoding an input in multiple
steps. It defines the following methods which every incremental decoder must
define in order to be compatible with the Python codec registry.


.. class:: IncrementalDecoder([errors])

   Constructor for an :class:`IncrementalDecoder` instance.

   All incremental decoders must provide this constructor interface. They are free
   to add additional keyword arguments, but only the ones defined here are used by
   the Python codec registry.

   The :class:`IncrementalDecoder` may implement different error handling schemes
   by providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`IncrementalDecoder`
   object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: decode(object[, final])

      Decodes *object* (taking the current state of the decoder into account)
      and returns the resulting decoded object. If this is the last call to
      :meth:`decode` *final* must be true (the default is false). If *final* is
      true the decoder must decode the input completely and must flush all
      buffers. If this isn't possible (e.g. because of incomplete byte sequences
      at the end of the input) it must initiate error handling just like in the
      stateless case (which might raise an exception).


   .. method:: reset()

      Reset the decoder to the initial state.

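The role of *final* can be illustrated with the UTF-8 decoder. In this sketch an
incomplete byte sequence is buffered while *final* is false, and the same
incomplete sequence triggers error handling (here ``'strict'``, so an exception)
when *final* is true.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# The three bytes of U+20AC arrive in two pieces.
part1 = decoder.decode(b"\xe2\x82")   # incomplete: nothing is returned yet
part2 = decoder.decode(b"\xac")       # sequence completed
tail = decoder.decode(b"", True)      # final call flushes; nothing is left

# With final=True a truncated sequence cannot be buffered any more.
decoder.reset()
try:
    decoder.decode(b"\xe2\x82", True)
    truncated_accepted = True
except UnicodeDecodeError:
    truncated_accepted = False
```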

The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
working interfaces which can be used to implement new encoding submodules very
easily. See :mod:`encodings.utf_8` for an example of how this is done.


.. _stream-writer-objects:

StreamWriter Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
following methods which every stream writer must define in order to be
compatible with the Python codec registry.


.. class:: StreamWriter(stream[, errors])

   Constructor for a :class:`StreamWriter` instance.

   All stream writers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for writing binary data.

   The :class:`StreamWriter` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.

   * ``'backslashreplace'`` Replace with backslashed escape sequences.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamWriter` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: write(object)

      Writes the object's contents encoded to the stream.


   .. method:: writelines(list)

      Writes the concatenated list of strings to the stream (possibly by reusing
      the :meth:`write` method).


   .. method:: reset()

      Flushes and resets the codec buffers used for keeping state.

      Calling this method should ensure that the data on the output is put into
      a clean state that allows appending of new fresh data without having to
      rescan the whole stream to recover state.


In addition to the above methods, the :class:`StreamWriter` must also inherit
all other methods and attributes from the underlying stream.

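A short sketch of a stream writer in use follows. The underlying stream is an
``io.BytesIO`` object, which is an assumption made so the example needs no real
file; the writer class comes from :func:`getwriter`.

```python
import codecs
import io

raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)

writer.write(u"gr\xfc\xdf ")
writer.writelines([u"one\n", u"two\n"])
writer.reset()

payload = raw.getvalue()

# Other methods are passed through to the underlying stream.
position = writer.tell()
```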

.. _stream-reader-objects:

StreamReader Objects
^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
following methods which every stream reader must define in order to be
compatible with the Python codec registry.


.. class:: StreamReader(stream[, errors])

   Constructor for a :class:`StreamReader` instance.

   All stream readers must provide this constructor interface. They are free to add
   additional keyword arguments, but only the ones defined here are used by the
   Python codec registry.

   *stream* must be a file-like object open for reading (binary) data.

   The :class:`StreamReader` may implement different error handling schemes by
   providing the *errors* keyword argument. These parameters are predefined:

   * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.

   * ``'ignore'`` Ignore the character and continue with the next.

   * ``'replace'`` Replace with a suitable replacement character.

   The *errors* argument will be assigned to an attribute of the same name.
   Assigning to this attribute makes it possible to switch between different error
   handling strategies during the lifetime of the :class:`StreamReader` object.

   The set of allowed values for the *errors* argument can be extended with
   :func:`register_error`.


   .. method:: read([size[, chars[, firstline]]])

      Decodes data from the stream and returns the resulting object.

      *chars* indicates the number of characters to read from the
      stream. :meth:`read` will never return more than *chars* characters, but
      it might return fewer, if there are not enough characters available.

      *size* indicates the approximate maximum number of bytes to read from the
      stream for decoding purposes. The decoder can modify this setting as
      appropriate. The default value -1 indicates to read and decode as much as
      possible. *size* is intended to prevent having to decode huge files in
      one step.

      *firstline* indicates that it would be sufficient to only return the first
      line, if there are decoding errors on later lines.

      The method should use a greedy read strategy meaning that it should read
      as much data as is allowed within the definition of the encoding and the
      given size, e.g. if optional encoding endings or state markers are
      available on the stream, these should be read too.

      .. versionchanged:: 2.4
         *chars* argument added.

      .. versionchanged:: 2.4.2
         *firstline* argument added.


   .. method:: readline([size[, keepends]])

      Read one line from the input stream and return the decoded data.

      *size*, if given, is passed as size argument to the stream's
      :meth:`readline` method.

      If *keepends* is false line-endings will be stripped from the lines
      returned.

      .. versionchanged:: 2.4
         *keepends* argument added.


   .. method:: readlines([sizehint[, keepends]])

      Read all lines available on the input stream and return them as a list of
      lines.

      Line-endings are implemented using the codec's decoder method and are
      included in the list entries if *keepends* is true.

      *sizehint*, if given, is passed as the *size* argument to the stream's
      :meth:`read` method.


   .. method:: reset()

      Resets the codec buffers used for keeping state.

      Note that no stream repositioning should take place. This method is
      primarily intended to be able to recover from decoding errors.


In addition to the above methods, the :class:`StreamReader` must also inherit
all other methods and attributes from the underlying stream.

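The reading methods can be sketched against an in-memory stream. As before,
``io.BytesIO`` stands in for a real file and is an assumption of this example;
the reader class is obtained from :func:`getreader`.

```python
import codecs
import io

raw = io.BytesIO(u"alpha\nbeta\ngamma\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

first = reader.readline()                  # line ending kept by default
second = reader.readline(keepends=False)   # line ending stripped
rest = reader.readlines()                  # whatever is left, as a list
```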

The next two base classes are included for convenience. They are not needed by
the codec registry, but may prove useful in practice.


.. _stream-reader-writer:

StreamReaderWriter Objects
^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamReaderWriter` allows wrapping streams which work in both read
and write modes.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamReaderWriter(stream, Reader, Writer, errors)

   Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
   object. *Reader* and *Writer* must be factory functions or classes providing the
   :class:`StreamReader` and :class:`StreamWriter` interfaces, respectively. Error
   handling is done in the same way as defined for the stream readers and writers.

:class:`StreamReaderWriter` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

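A minimal sketch of constructing the wrapper from the factories found on a
:class:`CodecInfo` object follows. The in-memory ``io.BytesIO`` stream is an
assumption of the example, and it is rewound by hand between writing and reading.

```python
import codecs
import io

info = codecs.lookup("utf-8")
raw = io.BytesIO()

stream = codecs.StreamReaderWriter(
    raw, info.streamreader, info.streamwriter, "strict")

stream.write(u"gr\xfc\xdf\n")     # encoded by the writer
stored = raw.getvalue()

raw.seek(0)                       # rewind the underlying stream
text = stream.read()              # decoded by the reader
```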

.. _stream-recoder-objects:

StreamRecoder Objects
^^^^^^^^^^^^^^^^^^^^^

The :class:`StreamRecoder` provides a frontend-backend view of encoding data
which is sometimes useful when dealing with different encoding environments.

The design is such that one can use the factory functions returned by the
:func:`lookup` function to construct the instance.


.. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)

   Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
   *encode* and *decode* work on the frontend (the data returned by :meth:`read`
   and the data passed to :meth:`write`) while *Reader* and *Writer* work on the
   backend (reading and writing to the stream).

   You can use these objects to do transparent direct recodings from e.g. Latin-1
   to UTF-8 and back.

   *stream* must be a file-like object.

   *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
   *Writer* must be factory functions or classes providing objects of the
   :class:`StreamReader` and :class:`StreamWriter` interface respectively.

   *encode* and *decode* are needed for the frontend translation, *Reader* and
   *Writer* for the backend translation. The intermediate format used is
   determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
   as the intermediate encoding.

   Error handling is done in the same way as defined for the stream readers and
   writers.


:class:`StreamRecoder` instances define the combined interfaces of
:class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
methods and attributes from the underlying stream.

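The Latin-1 to UTF-8 recoding mentioned above might look like the following
sketch. It assumes an ``io.BytesIO`` backend that already holds Latin-1 data; the
frontend functions come from the UTF-8 codec and the backend factories from the
Latin-1 codec.

```python
import codecs
import io

utf8 = codecs.lookup("utf-8")
latin1 = codecs.lookup("latin-1")

# The backend stream holds Latin-1 bytes.
raw = io.BytesIO(u"caf\xe9".encode("latin-1"))

recoder = codecs.StreamRecoder(
    raw, utf8.encode, utf8.decode,
    latin1.streamreader, latin1.streamwriter, "strict")

# Reading yields UTF-8 bytes on the frontend.
front = recoder.read()

# Writing UTF-8 bytes stores Latin-1 bytes in the backend.
recoder.write(u" \xfc".encode("utf-8"))
backend = raw.getvalue()
```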
|
745 |
|
.. _encodings-overview:

Encodings and Unicode
---------------------

Unicode strings are stored internally as sequences of code points (to be precise
as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
type. Once a Unicode object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
Unicode object into a sequence of bytes is called encoding, and recreating the
Unicode object from the sequence of bytes is known as decoding. There are many
different methods for how this transformation can be done (these methods are
also called encodings). The simplest method is to map the code points 0-255 to
the bytes ``0x0``-``0xff``. This means that a Unicode object that contains
code points above ``U+00FF`` can't be encoded with this method (which is called
``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a
:exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
codec can't encode character u'\u1234' in position 3: ordinal not in
range(256)``.
767 |
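The Latin-1 behaviour just described can be tried out with a few lines. This is
only a sketch; the sample strings and variable names are made up for
illustration, and ``u''`` literals are used so that it should run on both older
and newer interpreters.

```python
# Code points 0-255 map directly onto the byte values 0x00-0xff.
encoded = u'caf\xe9'.encode('latin-1')

text = u'abc\u1234'
try:
    text.encode('latin-1')
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    failed_at = exc.start
    reason = exc.reason
```

The exception object exposes the position and reason that appear in the error
message quoted above.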
|
There's another group of encodings (the so-called charmap encodings) that choose
a different subset of all Unicode code points and how these code points are
mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
Windows). There's a string constant with 256 characters that shows you which
character is mapped to which byte value.
774 |
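A small sketch of what a charmap table looks like from Python. It assumes the
table is exposed under the name ``decoding_table``, which is how current
CPython sources of :file:`encodings/cp1252.py` spell it; treat that name as an
implementation detail rather than a documented interface.

```python
import encodings.cp1252

# decoding_table is indexed by byte value and yields the mapped character.
table = encodings.cp1252.decoding_table
table_size = len(table)

# The same byte means different things in different charmap encodings.
euro_cp1252 = b'\x80'.decode('cp1252')    # EURO SIGN in cp1252
ctrl_latin1 = b'\x80'.decode('latin-1')   # a control character in Latin-1
```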
|
All of these encodings can only encode 256 of the 65536 (or 1114112) code points
defined in Unicode. A simple and straightforward way that can store each Unicode
code point is to store each code point as two consecutive bytes. There are two
possibilities: Store the bytes in big endian or in little endian order. These
two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
will always have to swap bytes on encoding and decoding. UTF-16 avoids this
problem: Bytes will always be in natural endianness. When these bytes are read
by a CPU with a different endianness, then bytes have to be swapped though. To
be able to detect the endianness of a UTF-16 byte sequence, there's the
so-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
This character will be prepended to every UTF-16 byte sequence. The byte swapped
version of this character (``0xFFFE``) is an illegal character that may not
appear in a Unicode text. So when the first character in a UTF-16 byte sequence
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
Unfortunately, up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
it's a device to determine the storage layout of the encoded bytes, and vanishes
once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
799 |
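The byte-order behaviour of the UTF-16 codecs can be observed directly. The
sketch below uses the EURO SIGN as an arbitrary sample character; the variable
names are illustrative.

```python
import codecs
import sys

euro = u'\u20ac'

big = euro.encode('utf-16-be')      # most significant byte first
little = euro.encode('utf-16-le')   # least significant byte first

# Plain 'utf-16' uses the machine's native order and prepends a BOM.
native = euro.encode('utf-16')
if sys.byteorder == 'little':
    expected_bom = codecs.BOM_UTF16_LE
else:
    expected_bom = codecs.BOM_UTF16_BE

# On decoding, the BOM selects the byte order and vanishes from the result.
from_le = (codecs.BOM_UTF16_LE + little).decode('utf-16')
from_be = (codecs.BOM_UTF16_BE + big).decode('utf-16')

# With an explicit byte order, the same bytes decode to U+FEFF plus the text.
kept = (codecs.BOM_UTF16_LE + little).decode('utf-16-le')
```

The last line shows ``U+FEFF`` in its second role: when the byte order is
already fixed by the codec name, the character is decoded like any other.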
|
There's another encoding that is able to encode the full range of Unicode
characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
parts: Marker bits (the most significant bits) and payload bits. The marker bits
are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
encoded like this (with x being payload bits, which when concatenated give the
Unicode character):

+-----------------------------------+----------------------------------------------+
| Range                             | Encoding                                     |
+===================================+==============================================+
| ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
+-----------------------------------+----------------------------------------------+
| ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
+-----------------------------------+----------------------------------------------+
| ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
+-----------------------------------+----------------------------------------------+
| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
+-----------------------------------+----------------------------------------------+
| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
+-----------------------------------+----------------------------------------------+
| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
|                                   | 10xxxxxx                                     |
+-----------------------------------+----------------------------------------------+

The least significant bit of the Unicode character is the rightmost x bit.
826 |
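To make the bit layout concrete, here is a minimal sketch that packs a code
point by hand according to the first four rows of the table. The helper name
``utf8_encode_code_point`` is hypothetical and not part of :mod:`codecs`; the
sketch does no validation beyond the range check (for example, it does not
reject surrogates), so it is not a substitute for the real codec.

```python
def utf8_encode_code_point(cp):
    """Encode one code point following the first four rows of the table."""
    if cp < 0x80:
        data = [cp]                              # 0xxxxxxx
    elif cp < 0x800:
        data = [0xC0 | (cp >> 6),                # 110xxxxx
                0x80 | (cp & 0x3F)]              # 10xxxxxx
    elif cp < 0x10000:
        data = [0xE0 | (cp >> 12),               # 1110xxxx
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    elif cp < 0x200000:
        data = [0xF0 | (cp >> 18),               # 11110xxx
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    else:
        raise ValueError('code point outside the range handled by this sketch')
    return bytes(bytearray(data))


# Compare the hand-packed bytes with the built-in codec.
euro_by_hand = utf8_encode_code_point(0x20AC)
euro_builtin = u'\u20ac'.encode('utf-8')
```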
|
As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
the decoded Unicode string (even if it's the first character) is treated as a
``ZERO WIDTH NO-BREAK SPACE``.

Without external information it's impossible to reliably determine which
encoding was used for encoding a Unicode string. Each charmap encoding can
decode any random byte sequence. However that's not possible with UTF-8, as
UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
sequences. To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
that any charmap encoded file starts with these byte values (which would e.g.
map to

   | LATIN SMALL LETTER I WITH DIAERESIS
   | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
   | INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
correctly guessed from the byte sequence. So here the BOM is not used to be able
to determine the byte order used for generating the byte sequence, but as a
signature that helps in guessing the encoding. On encoding the utf-8-sig codec
will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
decoding utf-8-sig will skip those three bytes if they appear as the first three
bytes in the file.
854 |
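The difference between plain UTF-8 and the signature variant can be sketched as
follows; the sample text and variable names are illustrative.

```python
import codecs

with_sig = u'abc'.encode('utf-8-sig')

# Plain UTF-8 keeps the BOM as a U+FEFF character; utf-8-sig drops it.
as_utf8 = with_sig.decode('utf-8')
as_sig = with_sig.decode('utf-8-sig')

# Data without the signature is still accepted by utf-8-sig.
no_sig = b'abc'.decode('utf-8-sig')

# The three signature bytes interpreted as iso-8859-1 characters.
as_latin1 = codecs.BOM_UTF8.decode('iso-8859-1')
```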
|
855 |
|
.. _standard-encodings:

Standard Encodings
------------------

Python comes with a number of codecs built-in, either implemented as C functions
or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases.
867 |
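The alias rules can be checked with :func:`lookup`. The sketch below only
verifies that several spellings resolve to the same codec; the particular
spellings chosen are examples, and the variable names are illustrative.

```python
import codecs

# Different spellings, cases and hyphen/underscore variants of Latin-1.
latin1_spellings = ['latin_1', 'latin-1', 'LATIN1', 'iso-8859-1', 'L1']
resolved_latin1 = set(codecs.lookup(name).name for name in latin1_spellings)

# The same for UTF-8.
utf8_spellings = ['utf_8', 'utf-8', 'UTF8', 'U8']
resolved_utf8 = set(codecs.lookup(name).name for name in utf8_spellings)
```

Each set ends up with a single element, the canonical name stored in the
:class:`CodecInfo` object.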
|
Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
particular, the following variants typically exist:

* an ISO 8859 codeset

* a Microsoft Windows code page, which is typically derived from an 8859 codeset,
  but replaces control characters with additional graphic characters

* an IBM EBCDIC code page

* an IBM PC code page, which is ASCII compatible
881 |
|
+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
| ascii           | 646, us-ascii                  | English                        |
+-----------------+--------------------------------+--------------------------------+
| big5            | big5-tw, csbig5                | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp037           | IBM037, IBM039                 | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp437           | 437, IBM437                    | English                        |
+-----------------+--------------------------------+--------------------------------+
| cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
|                 | IBM500                         |                                |
+-----------------+--------------------------------+--------------------------------+
| cp737           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp775           | IBM775                         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp850           | 850, IBM850                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp852           | 852, IBM852                    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp856           |                                | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp857           | 857, IBM857                    | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp860           | 860, IBM860                    | Portuguese                     |
+-----------------+--------------------------------+--------------------------------+
| cp861           | 861, CP-IS, IBM861             | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| cp862           | 862, IBM862                    | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp863           | 863, IBM863                    | Canadian                       |
+-----------------+--------------------------------+--------------------------------+
| cp864           | IBM864                         | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp865           | 865, IBM865                    | Danish, Norwegian              |
+-----------------+--------------------------------+--------------------------------+
| cp866           | 866, IBM866                    | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| cp869           | 869, CP-GR, IBM869             | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp874           |                                | Thai                           |
+-----------------+--------------------------------+--------------------------------+
| cp875           |                                | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| cp949           | 949, ms949, uhc                | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| cp950           | 950, ms950                     | Traditional Chinese            |
+-----------------+--------------------------------+--------------------------------+
| cp1006          |                                | Urdu                           |
+-----------------+--------------------------------+--------------------------------+
| cp1026          | ibm1026                        | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1140          | ibm1140                        | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1250          | windows-1250                   | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| cp1252          | windows-1252                   | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| cp1253          | windows-1253                   | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| cp1254          | windows-1254                   | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| cp1255          | windows-1255                   | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| cp1256          | windows-1256                   | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| cp1257          | windows-1257                   | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jisx0213    | eucjisx0213                    | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_kr          | euckr, korean, ksc5601,        | Korean                         |
|                 | ks_c-5601, ks_c-5601-1987,     |                                |
|                 | ksx1001, ks_x-1001             |                                |
+-----------------+--------------------------------+--------------------------------+
| gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
|                 | cn, euccn, eucgb2312-cn,       |                                |
|                 | gb2312-1980, gb2312-80, iso-   |                                |
|                 | ir-58                          |                                |
+-----------------+--------------------------------+--------------------------------+
| gbk             | 936, cp936, ms936              | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| gb18030         | gb18030-2000                   | Unified Chinese                |
+-----------------+--------------------------------+--------------------------------+
| hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
|                 | iso-2022-jp                    |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
|                 |                                | Chinese, Western Europe, Greek |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
|                 | iso-2022-jp-2004               |                                |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
|                 | iso-2022-kr                    |                                |
+-----------------+--------------------------------+--------------------------------+
| latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
|                 | cp819, latin, latin1, L1       |                                |
+-----------------+--------------------------------+--------------------------------+
| iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
+-----------------+--------------------------------+--------------------------------+
| iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| iso8859_6       | iso-8859-6, arabic             | Arabic                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
+-----------------+--------------------------------+--------------------------------+
| iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_13      | iso-8859-13                    | Baltic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
+-----------------+--------------------------------+--------------------------------+
| iso8859_15      | iso-8859-15                    | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| johab           | cp1361, ms1361                 | Korean                         |
+-----------------+--------------------------------+--------------------------------+
| koi8_r          |                                | Russian                        |
+-----------------+--------------------------------+--------------------------------+
| koi8_u          |                                | Ukrainian                      |
+-----------------+--------------------------------+--------------------------------+
| mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
|                 |                                | Macedonian, Russian, Serbian   |
+-----------------+--------------------------------+--------------------------------+
| mac_greek       | macgreek                       | Greek                          |
+-----------------+--------------------------------+--------------------------------+
| mac_iceland     | maciceland                     | Icelandic                      |
+-----------------+--------------------------------+--------------------------------+
| mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
+-----------------+--------------------------------+--------------------------------+
| mac_roman       | macroman                       | Western Europe                 |
+-----------------+--------------------------------+--------------------------------+
| mac_turkish     | macturkish                     | Turkish                        |
+-----------------+--------------------------------+--------------------------------+
| ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
|                 | cyrillic-asian                 |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
|                 | s_jis                          |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
|                 | sjis2004                       |                                |
+-----------------+--------------------------------+--------------------------------+
| shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
|                 | s_jisx0213                     |                                |
+-----------------+--------------------------------+--------------------------------+
| utf_32          | U32, utf32                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_be       | UTF-32BE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_32_le       | UTF-32LE                       | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16          | U16, utf16                     | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
+-----------------+--------------------------------+--------------------------------+
| utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8           | U8, UTF, utf8                  | all languages                  |
+-----------------+--------------------------------+--------------------------------+
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
1082 |
|
A number of codecs are specific to Python, so their codec names have no meaning
outside Python. Some of them don't convert from Unicode strings to byte strings,
but instead use the property of the Python codecs machinery that any bijective
function with one argument can be considered as an encoding.

For the codecs listed below, the result in the "encoding" direction is always a
byte string. The result of the "decoding" direction is listed as operand type in
the table.
1091 |
|
+--------------------+---------------------------+----------------+---------------------------+
| Codec              | Aliases                   | Operand type   | Purpose                   |
+====================+===========================+================+===========================+
| base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
|                    |                           |                | base64                    |
+--------------------+---------------------------+----------------+---------------------------+
| bz2_codec          | bz2                       | byte string    | Compress the operand      |
|                    |                           |                | using bz2                 |
+--------------------+---------------------------+----------------+---------------------------+
| hex_codec          | hex                       | byte string    | Convert operand to        |
|                    |                           |                | hexadecimal               |
|                    |                           |                | representation, with two  |
|                    |                           |                | digits per byte           |
+--------------------+---------------------------+----------------+---------------------------+
| idna               |                           | Unicode string | Implements :rfc:`3490`,   |
|                    |                           |                | see also                  |
|                    |                           |                | :mod:`encodings.idna`     |
+--------------------+---------------------------+----------------+---------------------------+
| mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
|                    |                           |                | operand according to the  |
|                    |                           |                | ANSI codepage (CP_ACP)    |
+--------------------+---------------------------+----------------+---------------------------+
| palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
+--------------------+---------------------------+----------------+---------------------------+
| punycode           |                           | Unicode string | Implements :rfc:`3492`    |
+--------------------+---------------------------+----------------+---------------------------+
| quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
|                    | quotedprintable           |                | quoted printable          |
+--------------------+---------------------------+----------------+---------------------------+
| raw_unicode_escape |                           | Unicode string | Produce a string that is  |
|                    |                           |                | suitable as raw Unicode   |
|                    |                           |                | literal in Python source  |
|                    |                           |                | code                      |
+--------------------+---------------------------+----------------+---------------------------+
| rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher |
|                    |                           |                | encryption of the operand |
+--------------------+---------------------------+----------------+---------------------------+
| string_escape      |                           | byte string    | Produce a string that is  |
|                    |                           |                | suitable as string        |
|                    |                           |                | literal in Python source  |
|                    |                           |                | code                      |
+--------------------+---------------------------+----------------+---------------------------+
| undefined          |                           | any            | Raise an exception for    |
|                    |                           |                | all conversions. Can be   |
|                    |                           |                | used as the system        |
|                    |                           |                | encoding if no automatic  |
|                    |                           |                | :term:`coercion` between  |
|                    |                           |                | byte and Unicode strings  |
|                    |                           |                | is desired.               |
+--------------------+---------------------------+----------------+---------------------------+
| unicode_escape     |                           | Unicode string | Produce a string that is  |
|                    |                           |                | suitable as Unicode       |
|                    |                           |                | literal in Python source  |
|                    |                           |                | code                      |
+--------------------+---------------------------+----------------+---------------------------+
| unicode_internal   |                           | Unicode string | Return the internal       |
|                    |                           |                | representation of the     |
|                    |                           |                | operand                   |
+--------------------+---------------------------+----------------+---------------------------+
| uu_codec           | uu                        | byte string    | Convert the operand using |
|                    |                           |                | uuencode                  |
+--------------------+---------------------------+----------------+---------------------------+
| zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
|                    |                           |                | using gzip                |
+--------------------+---------------------------+----------------+---------------------------+
1157 |
|
.. versionadded:: 2.3
   The ``idna`` and ``punycode`` encodings.
1160 |
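A few of the Python-specific codecs can be exercised as follows. This is a
sketch under the assumption that :func:`codecs.encode` and :func:`codecs.decode`
are available (they are used here because they accept the byte-string codecs on
both older and newer interpreters); the sample data and variable names are
illustrative.

```python
import codecs

# hex_codec: two hexadecimal digits per input byte.
hex_form = codecs.encode(b'abc', 'hex_codec')
restored = codecs.decode(hex_form, 'hex_codec')

# zlib_codec: "encoding" compresses, "decoding" decompresses.
packed = codecs.encode(b'a' * 1000, 'zlib_codec')
unpacked = codecs.decode(packed, 'zlib_codec')

# punycode: operates on a Unicode string and yields an ASCII byte string.
label = u'b\xfccher'.encode('punycode')
```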
|
1161 |
|
:mod:`encodings.idna` --- Internationalized Domain Names in Applications
------------------------------------------------------------------------

.. module:: encodings.idna
   :synopsis: Internationalized Domain Names implementation
.. moduleauthor:: Martin v. Löwis

.. versionadded:: 2.3

This module implements :rfc:`3490` (Internationalized Domain Names in
Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
and :mod:`stringprep`.

These RFCs together define a protocol to support non-ASCII characters in domain
names. A domain name containing non-ASCII characters (such as
``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
(ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
name is then used in all places where arbitrary characters are not allowed by
the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
on. This conversion is carried out in the application, if possible invisibly to
the user: The application should transparently convert Unicode domain labels to
IDNA on the wire, and convert ACE labels back to Unicode before presenting them
to the user.

Python supports this conversion in several ways: The ``idna`` codec converts
between Unicode and the ACE. Furthermore, the :mod:`socket` module
transparently converts Unicode host names to ACE, so that applications need not
be concerned about converting host names themselves when they pass them to the
socket module. On top of that, modules that have host names as function
parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
(:mod:`httplib` then also transparently sends an IDNA hostname in the
:mailheader:`Host` field if it sends that field at all).

When receiving host names from the wire (such as in reverse name lookup), no
automatic conversion to Unicode is performed: Applications wishing to present
such host names to the user should decode them to Unicode.
1199 |
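The conversion between a Unicode host name and its ACE form can be sketched
with the ``idna`` codec, using the domain name from the text above; the
variable names are illustrative.

```python
name = u'www.Alliancefran\xe7aise.nu'

# Unicode host name -> ASCII-compatible encoding, as used on the wire.
ace = name.encode('idna')

# ACE -> Unicode, e.g. before presenting a host name to the user.
shown = ace.decode('idna')
```

Note that the round trip does not restore the original capitalisation, since
the non-ASCII label is normalized before it is encoded.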
|
The module :mod:`encodings.idna` also implements the nameprep procedure, which
performs certain normalizations on host names, to achieve case-insensitivity of
international domain names, and to unify similar characters. The nameprep
functions can be used directly if desired.


.. function:: nameprep(label)

   Return the nameprepped version of *label*. The implementation currently assumes
   query strings, so ``AllowUnassigned`` is true.


.. function:: ToASCII(label)

   Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
   assumed to be false.


.. function:: ToUnicode(label)

   Convert a label to Unicode, as specified in :rfc:`3490`.
1221 |
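The three functions operate on a single label rather than a whole host name.
A minimal sketch, using one label of the domain name from the example earlier
in this section (variable names are illustrative):

```python
from encodings import idna

label = u'Alliancefran\xe7aise'

# nameprep normalizes the label (here: case folding).
prepped = idna.nameprep(label)

# ToASCII applies nameprep and then the punycode-based ACE encoding.
ascii_label = idna.ToASCII(label)

# ToUnicode reverses the ACE encoding.
unicode_label = idna.ToUnicode(ascii_label)
```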
|
1222 |
|
:mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
-------------------------------------------------------------

.. module:: encodings.utf_8_sig
   :synopsis: UTF-8 codec with BOM signature
.. moduleauthor:: Walter Dörwald

.. versionadded:: 2.5

This module implements a variant of the UTF-8 codec: On encoding, a UTF-8 encoded
BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
is only done once (on the first write to the byte stream). For decoding, an
optional UTF-8 encoded BOM at the start of the data will be skipped.
1236 |
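The "only once" behaviour of the stateful encoder, and the skipping of the BOM
on decoding, can be sketched with the incremental codec classes. The chunk
boundaries and variable names below are arbitrary choices for illustration.

```python
import codecs

# The stateful encoder emits the signature with the first chunk only.
encoder = codecs.getincrementalencoder('utf-8-sig')()
first = encoder.encode(u'ab')
second = encoder.encode(u'cd', final=True)

# The decoder skips the signature even if it is split across chunks.
decoder = codecs.getincrementaldecoder('utf-8-sig')()
text = decoder.decode(b'\xef\xbb') + decoder.decode(b'\xbfabcd', final=True)
```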