'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

__all__ = ['dis',
           'genops',
          ]

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

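The stack behaviour described above can be watched on a tiny pickle. This is
only a sketch: it assumes the Python 3 standard library (pickle.loads and
pickletools.genops accepting bytes), and the pickle bytes are hand-written
for illustration rather than taken from this module.

```python
import pickle
import pickletools

# Hand-written protocol 0 pickle (an illustrative assumption):
# MARK, INT 1, INT 2, LIST, STOP.
data = b"(I1\nI2\nl."

# genops yields (opcode, arg, position) triples in execution order.
trace = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]

# Whatever is left on the stack when STOP runs is the unpickling result.
result = pickle.loads(data)
```

Here the two INT opcodes push 1 and 2, LIST collects everything above the
mark into a list, and STOP leaves that list as the result.
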
The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes store the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.

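A minimal sketch of the memo at work, again assuming the Python 3 standard
library and a hand-written protocol 0 pickle (the byte string is an
illustrative assumption): PUT stores the stack top under an index, and GET
pushes that same object again.

```python
import pickle
import pickletools

# Build an outer list and store it in memo slot 0, build an inner list and
# store it in memo slot 1, append the inner list to the outer one, fetch the
# inner list back from the memo, and append it a second time.
data = b"(lp0\n(lp1\nag1\na."

names = [op.name for op, arg, pos in pickletools.genops(data)]
outer = pickle.loads(data)
```

Because GET pushes the memoized object itself rather than a copy, both
elements of the resulting list are the same object.
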
At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling). "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.


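The object identity and recursion points above can be checked with a round
trip. This is a sketch assuming the Python 3 standard library; the variable
names are made up for the example.

```python
import pickle

shared = ["x"]
pair = [shared, shared]      # the same object appears twice
loop = []
loop.append(loop)            # a list that contains itself

pair_copy = pickle.loads(pickle.dumps(pair, protocol=0))
loop_copy = pickle.loads(pickle.dumps(loop, protocol=0))
```

After the round trip, pair_copy[0] and pair_copy[1] are one new object (not
two equal copies), and loop_copy[0] is loop_copy itself.
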
Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes. Instead, new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a pickle written with protocol 0 or 1 does not contain
its version number embedded within it (protocol 2 added the PROTO opcode for
that purpose). If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.

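Since each opcode records the protocol that introduced it, a pickle's
protocol can be estimated as the highest such value among its opcodes. The
helper below is a hypothetical sketch, assuming the Python 3 standard
library, where genops yields opcode objects carrying a .proto attribute.

```python
import pickle
import pickletools

def highest_opcode_protocol(data):
    # Hypothetical helper: the largest .proto value among all opcodes.
    return max(op.proto for op, arg, pos in pickletools.genops(data))

value = [1, 2, 3]
detected = [highest_opcode_protocol(pickle.dumps(value, protocol=p))
            for p in (0, 1, 2)]
```

For this particular value the detected numbers match the requested
protocols; in general the result is only a lower bound on what the pickler
was asked for, since a pickle need not use the newest opcodes available.
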
The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode. Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).

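The size difference between text mode and binary mode can be seen on a
single integer. This is a sketch assuming the Python 3 standard library; the
exact bytes of the text pickle may differ between Python versions, so only
the binary form is taken apart here.

```python
import pickle
import struct

value = -2**31               # 11 characters when written as a decimal string
text_pickle = pickle.dumps(value, protocol=0)
binary_pickle = pickle.dumps(value, protocol=1)

# In the binary pickle the opcode byte is followed by 4 little-endian bytes.
opcode = binary_pickle[:1]
payload = binary_pickle[1:5]
decoded = struct.unpack("<i", payload)[0]
```

The binary pickle is one BININT opcode byte, four payload bytes, and STOP,
which is shorter than the decimal text form of the same value.
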
The third major set of additions came in Python 2.3, and is called "protocol
2". This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).

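Several of the protocol 2 additions listed above show up directly in the
opcode names of small pickles. The helper is a hypothetical sketch, assuming
the Python 3 standard library.

```python
import pickle
import pickletools

def opcode_names(obj):
    # Hypothetical helper: opcode names in obj's protocol 2 pickle.
    data = pickle.dumps(obj, protocol=2)
    return [op.name for op, arg, pos in pickletools.genops(data)]

bool_ops = opcode_names(True)
tuple_ops = opcode_names((1, 2))
long_ops = opcode_names(2**70)
```

Each pickle starts with PROTO; True is a single NEWTRUE, the 2-tuple uses
TUPLE2 instead of MARK and TUPLE, and the large integer uses LONG1.
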
Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copy_reg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int

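r"""
The two count-prefixed layouts named by TAKEN_FROM_ARGUMENT1 and
TAKEN_FROM_ARGUMENT4 can be sketched with one reader. This is an
illustration only, assuming Python 3 (io.BytesIO and bytes literals);
read_counted is a hypothetical name, not a function of this module.

```python
import io
import struct

def read_counted(stream, count_size):
    # Hypothetical helper. count_size is 1 (unsigned byte count) or
    # 4 (signed little-endian count), mirroring TAKEN_FROM_ARGUMENT1/4.
    fmt = {1: "<B", 4: "<i"}[count_size]
    header = stream.read(count_size)
    if len(header) != count_size:
        raise ValueError("not enough data to read the byte count")
    n = struct.unpack(fmt, header)[0]
    if n < 0:
        raise ValueError("byte count < 0: %d" % n)
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("expected %d bytes, but only %d remain"
                         % (n, len(data)))
    return data

short_form = read_counted(io.BytesIO(b"\x03abcdef"), 1)
long_form = read_counted(io.BytesIO(b"\x03\x00\x00\x00abcdef"), 4)
```

Both calls consume the count and then exactly three payload bytes, leaving
the rest of the stream for the next opcode.
"""
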
class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import StringIO
    >>> read_uint1(StringIO.StringIO('\xff'))
    255
    """

    data = f.read(1)
    if data:
        return ord(data)
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import StringIO
    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    255
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import StringIO
    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    255
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)

def read_string4(f):
    r"""
    >>> import StringIO
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_string1(f):
    r"""
    >>> import StringIO
    >>> read_string1(StringIO.StringIO("\x00"))
    ''
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_unicodestringnl(f):
    r"""
    >>> import StringIO
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
    u'abc\uabcd'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)

def read_unicodestring4(f):
    r"""
    >>> import StringIO
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    'abcd\xea\xaf\x8d'
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian signed int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import StringIO
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
    1234

    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in '1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box. There's also a hack
    # for True and False here.
    if s == "00":
        return False
    elif s == "01":
        return True

    try:
        return int(s)
    except OverflowError:
        return long(s)

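r"""
The True/False hack in read_decimalnl_short can be observed from the
unpickling side. This is a sketch assuming the Python 3 standard library,
whose unpickler keeps the same special case; the pickles are hand-written
for illustration.

```python
import pickle

# INT followed by "01" or "00" is read back as a bool ...
true_value = pickle.loads(b"I01\n.")
false_value = pickle.loads(b"I00\n.")

# ... while a genuine integer never carries a leading zero.
one = pickle.loads(b"I1\n.")
```

So "I01" and "I1" differ only in the leading zero, yet the first yields
True and the second yields the plain integer 1.
"""
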
def read_decimalnl_long(f):
    r"""
    >>> import StringIO

    >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))
    1234L

    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
    123456789012345678901234L
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)
    return long(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    r"""
    >>> import StringIO
    >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    r"""
    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    '\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(StringIO.StringIO(raw + "\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and cPickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).
    """)

# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import StringIO
    >>> read_long1(StringIO.StringIO("\x00"))
    0L
    >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
    255L
    >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
    32767L
    >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
    -256L
    >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
    -32768L
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.
    """)

def read_long4(f):
    r"""
    >>> import StringIO
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
    255L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
    32767L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
    -256L
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
    -32768L
    >>> read_long4(StringIO.StringIO("\x00\x00\x00\x00"))
    0L
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long. If the size is 0, that's taken
    as a shortcut for the long 0L, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


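r"""
The little-endian 2's-complement interpretation used by long1 and long4 can
be sketched in a few lines. This assumes Python 3, where int.from_bytes is
available; decode_long_sketch is a hypothetical stand-in for the
decode_long imported above, not its actual source.

```python
def decode_long_sketch(data):
    # Little-endian two's complement; an empty byte string decodes to zero.
    return int.from_bytes(data, byteorder="little", signed=True)

samples = [b"", b"\xff\x00", b"\xff\x7f", b"\x00\xff", b"\x00\x80"]
decoded = [decode_long_sketch(s) for s in samples]
```

The payloads are the ones used in the doctests above, without their size
prefix: the high bit of the last byte decides the sign, which is why
"\xff\x00" is 255 while "\x00\xff" is -256.
"""
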
##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


pyint = StackObject(
    name='int',
    obtype=int,
    doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
    name='long',
    obtype=long,
    doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, long, bool),
    doc="A Python integer object (short or long), or "
        "a Python bool.")

pybool = StackObject(
    name='bool',
    obtype=(bool,),
    doc="A Python bool object.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pystring = StackObject(
    name='str',
    obtype=str,
    doc="A Python string object.")

pyunicode = StackObject(
    name='unicode',
    obtype=unicode,
    doc="A Python Unicode string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

    Opcodes that operate on a variable number of objects
    generally don't embed the count of objects in the opcode,
    or pull it off the stack. Instead the MARK opcode is used
    to push a special marker object on the stack, and then
    some other opcodes grab all the objects from the top of
    the stack down to (but not including) the topmost marker
    object.
    """)

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]
    to
        [...]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).
    """)

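r"""
The markobject/stackslice convention can be sketched with a plain Python
list standing in for the PM stack. This is an illustration only: MARK and
pop_to_mark are hypothetical names, and a real unpickler may organize its
stack differently.

```python
MARK = object()     # stand-in for the unique mark object

def pop_to_mark(stack):
    # Remove and return everything above the topmost mark, and remove
    # that mark as well.
    for index in range(len(stack) - 1, -1, -1):
        if stack[index] is MARK:
            items = stack[index + 1:]
            del stack[index:]
            return items
    raise ValueError("no mark on the stack")

stack = ["bottom", MARK, 1, MARK, 2, 3]
items = pop_to_mark(stack)
```

Only the topmost mark is consumed, so nested groups (a list inside a list,
say) each get their own mark and are collected innermost first.
"""
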
822 ############################################################################## |
|
823 # Descriptors for pickle opcodes. |
|
824 |
|
825 class OpcodeInfo(object): |
|
826 |
|
827 __slots__ = ( |
|
828 # symbolic name of opcode; a string |
|
829 'name', |
|
830 |
|
831 # the code used in a bytestream to represent the opcode; a |
|
832 # one-character string |
|
833 'code', |
|
834 |
|
835 # If the opcode has an argument embedded in the byte string, an |
|
836 # instance of ArgumentDescriptor specifying its type. Note that |
|
837 # arg.reader(s) can be used to read and decode the argument from |
|
838 # the bytestream s, and arg.doc documents the format of the raw |
|
839 # argument bytes. If the opcode doesn't have an argument embedded |
|
840 # in the bytestream, arg should be None. |
|
841 'arg', |
|
842 |
|
843 # what the stack looks like before this opcode runs; a list |
|
844 'stack_before', |
|
845 |
|
846 # what the stack looks like after this opcode runs; a list |
|
847 'stack_after', |
|
848 |
|
849 # the protocol number in which this opcode was introduced; an int |
|
850 'proto', |
|
851 |
|
852 # human-readable docs for this opcode; a string |
|
853 'doc', |
|
854 ) |
|
855 |
|
856 def __init__(self, name, code, arg, |
|
857 stack_before, stack_after, proto, doc): |
|
858 assert isinstance(name, str) |
|
859 self.name = name |
|
860 |
|
861 assert isinstance(code, str) |
|
862 assert len(code) == 1 |
|
863 self.code = code |
|
864 |
|
865 assert arg is None or isinstance(arg, ArgumentDescriptor) |
|
866 self.arg = arg |
|
867 |
|
868 assert isinstance(stack_before, list) |
|
869 for x in stack_before: |
|
870 assert isinstance(x, StackObject) |
|
871 self.stack_before = stack_before |
|
872 |
|
873 assert isinstance(stack_after, list) |
|
874 for x in stack_after: |
|
875 assert isinstance(x, StackObject) |
|
876 self.stack_after = stack_after |
|
877 |
|
878 assert isinstance(proto, int) and 0 <= proto <= 2 |
|
879 self.proto = proto |
|
880 |
|
881 assert isinstance(doc, str) |
|
882 self.doc = doc |
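# Illustration only (not part of the original module): a minimal sketch of
# how one OpcodeInfo record can be inspected from the outside. It assumes a
# Python 3 interpreter whose standard library ships this module as
# `pickletools`, exposing the same `opcodes` list and the same field names as
# the __slots__ above. `describe` is a hypothetical helper, not an existing API.

```python
import pickletools


def describe(name):
    """Return a small dict summarizing the OpcodeInfo record called `name`."""
    for op in pickletools.opcodes:
        if op.name == name:
            return {
                "name": op.name,
                "code": op.code,
                # arg is None when nothing is embedded in the bytestream.
                "arg": None if op.arg is None else op.arg.name,
                # Names of the StackObject descriptors before and after.
                "stack_before": [s.name for s in op.stack_before],
                "stack_after": [s.name for s in op.stack_after],
                "proto": op.proto,
            }
    raise KeyError(name)


if __name__ == "__main__":
    print(describe("BININT1"))
    print(describe("APPEND"))
```

# BININT1 carries an embedded one-byte argument, so its "arg" entry is a
# descriptor name; APPEND takes everything from the stack, so its "arg" is None.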
|
883 |
|
884 I = OpcodeInfo |
|
885 opcodes = [ |
|
886 |
|
887 # Ways to spell integers. |
|
888 |
|
889 I(name='INT', |
|
890 code='I', |
|
891 arg=decimalnl_short, |
|
892 stack_before=[], |
|
893 stack_after=[pyinteger_or_bool], |
|
894 proto=0, |
|
895 doc="""Push an integer or bool. |
|
896 |
|
897 The argument is a newline-terminated decimal literal string. |
|
898 |
|
899 The intent may have been that this always fit in a short Python int, |
|
900 but INT can be generated in pickles written on a 64-bit box that |
|
901 require a Python long on a 32-bit box. The difference between this |
|
902 and LONG then is that INT skips a trailing 'L', and produces a short |
|
903 int whenever possible. |
|
904 |
|
905 Another difference arises because, when bool was introduced as a |
|
906 distinct type in 2.3, builtin names True and False were also added to |
|
907 2.2.2, mapping to ints 1 and 0. For compatibility in both directions, |
|
908 True gets pickled as "I01\\n" (INT + "01\\n"), and False as "I00\\n". |
|
909 Leading zeroes are never produced for a genuine integer. The 2.3 |
|
910 (and later) unpicklers special-case these and return bool instead; |
|
911 earlier unpicklers ignore the leading "0" and return the int. |
|
912 """), |
|
913 |
|
914 I(name='BININT', |
|
915 code='J', |
|
916 arg=int4, |
|
917 stack_before=[], |
|
918 stack_after=[pyint], |
|
919 proto=1, |
|
920 doc="""Push a four-byte signed integer. |
|
921 |
|
922 This handles the full range of Python (short) integers on a 32-bit |
|
923 box, directly as binary bytes (1 for the opcode and 4 for the integer). |
|
924 If the integer is non-negative and fits in 1 or 2 bytes, pickling via |
|
925 BININT1 or BININT2 saves space. |
|
926 """), |
|
927 |
|
928 I(name='BININT1', |
|
929 code='K', |
|
930 arg=uint1, |
|
931 stack_before=[], |
|
932 stack_after=[pyint], |
|
933 proto=1, |
|
934 doc="""Push a one-byte unsigned integer. |
|
935 |
|
936 This is a space optimization for pickling very small non-negative ints, |
|
937 in range(256). |
|
938 """), |
|
939 |
|
940 I(name='BININT2', |
|
941 code='M', |
|
942 arg=uint2, |
|
943 stack_before=[], |
|
944 stack_after=[pyint], |
|
945 proto=1, |
|
946 doc="""Push a two-byte unsigned integer. |
|
947 |
|
948 This is a space optimization for pickling small positive ints, in |
|
949 range(256, 2**16). Integers in range(256) can also be pickled via |
|
950 BININT2, but BININT1 instead saves a byte. |
|
951 """), |
|
952 |
|
953 I(name='LONG', |
|
954 code='L', |
|
955 arg=decimalnl_long, |
|
956 stack_before=[], |
|
957 stack_after=[pylong], |
|
958 proto=0, |
|
959 doc="""Push a long integer. |
|
960 |
|
961 The same as INT, except that the literal ends with 'L', and always |
|
962 unpickles to a Python long. There doesn't seem to be a real purpose to the |
|
963 trailing 'L'. |
|
964 |
|
965 Note that LONG takes time quadratic in the number of digits when |
|
966 unpickling (this is simply due to the nature of decimal->binary |
|
967 conversion). Proto 2 added linear-time (in C; still quadratic-time |
|
968 in Python) LONG1 and LONG4 opcodes. |
|
969 """), |
|
970 |
|
971 I(name="LONG1", |
|
972 code='\x8a', |
|
973 arg=long1, |
|
974 stack_before=[], |
|
975 stack_after=[pylong], |
|
976 proto=2, |
|
977 doc="""Long integer using one-byte length. |
|
978 |
|
979 A more efficient encoding of a Python long; the long1 encoding |
|
980 says it all."""), |
|
981 |
|
982 I(name="LONG4", |
|
983 code='\x8b', |
|
984 arg=long4, |
|
985 stack_before=[], |
|
986 stack_after=[pylong], |
|
987 proto=2, |
|
988 doc="""Long integer using four-byte length. |
|
989 |
|
990 A more efficient encoding of a Python long; the long4 encoding |
|
991 says it all."""), |
|
992 |
|
993 # Ways to spell strings (8-bit, not Unicode). |
|
994 |
|
995 I(name='STRING', |
|
996 code='S', |
|
997 arg=stringnl, |
|
998 stack_before=[], |
|
999 stack_after=[pystring], |
|
1000 proto=0, |
|
1001 doc="""Push a Python string object. |
|
1002 |
|
1003 The argument is a repr-style string, with bracketing quote characters, |
|
1004 and perhaps embedded escapes. The argument extends until the next |
|
1005 newline character. |
|
1006 """), |
|
1007 |
|
1008 I(name='BINSTRING', |
|
1009 code='T', |
|
1010 arg=string4, |
|
1011 stack_before=[], |
|
1012 stack_after=[pystring], |
|
1013 proto=1, |
|
1014 doc="""Push a Python string object. |
|
1015 |
|
1016 There are two arguments: the first is a 4-byte little-endian signed int |
|
1017 giving the number of bytes in the string, and the second is that many |
|
1018 bytes, which are taken literally as the string content. |
|
1019 """), |
|
1020 |
|
1021 I(name='SHORT_BINSTRING', |
|
1022 code='U', |
|
1023 arg=string1, |
|
1024 stack_before=[], |
|
1025 stack_after=[pystring], |
|
1026 proto=1, |
|
1027 doc="""Push a Python string object. |
|
1028 |
|
1029 There are two arguments: the first is a 1-byte unsigned int giving |
|
1030 the number of bytes in the string, and the second is that many bytes, |
|
1031 which are taken literally as the string content. |
|
1032 """), |
|
1033 |
|
1034 # Ways to spell None. |
|
1035 |
|
1036 I(name='NONE', |
|
1037 code='N', |
|
1038 arg=None, |
|
1039 stack_before=[], |
|
1040 stack_after=[pynone], |
|
1041 proto=0, |
|
1042 doc="Push None on the stack."), |
|
1043 |
|
1044 # Ways to spell bools, starting with proto 2. See INT for how this was |
|
1045 # done before proto 2. |
|
1046 |
|
1047 I(name='NEWTRUE', |
|
1048 code='\x88', |
|
1049 arg=None, |
|
1050 stack_before=[], |
|
1051 stack_after=[pybool], |
|
1052 proto=2, |
|
1053 doc="""True. |
|
1054 |
|
1055 Push True onto the stack."""), |
|
1056 |
|
1057 I(name='NEWFALSE', |
|
1058 code='\x89', |
|
1059 arg=None, |
|
1060 stack_before=[], |
|
1061 stack_after=[pybool], |
|
1062 proto=2, |
|
1063 doc="""False. |
|
1064 |
|
1065 Push False onto the stack."""), |
|
1066 |
|
1067 # Ways to spell Unicode strings. |
|
1068 |
|
1069 I(name='UNICODE', |
|
1070 code='V', |
|
1071 arg=unicodestringnl, |
|
1072 stack_before=[], |
|
1073 stack_after=[pyunicode], |
|
1074 proto=0, # this may be pure-text, but it's a later addition |
|
1075 doc="""Push a Python Unicode string object. |
|
1076 |
|
1077 The argument is a raw-unicode-escape encoding of a Unicode string, |
|
1078 and so may contain embedded escape sequences. The argument extends |
|
1079 until the next newline character. |
|
1080 """), |
|
1081 |
|
1082 I(name='BINUNICODE', |
|
1083 code='X', |
|
1084 arg=unicodestring4, |
|
1085 stack_before=[], |
|
1086 stack_after=[pyunicode], |
|
1087 proto=1, |
|
1088 doc="""Push a Python Unicode string object. |
|
1089 |
|
1090 There are two arguments: the first is a 4-byte little-endian signed int |
|
1091 giving the number of bytes in the string. The second is that many |
|
1092 bytes, and is the UTF-8 encoding of the Unicode string. |
|
1093 """), |
|
1094 |
|
1095 # Ways to spell floats. |
|
1096 |
|
1097 I(name='FLOAT', |
|
1098 code='F', |
|
1099 arg=floatnl, |
|
1100 stack_before=[], |
|
1101 stack_after=[pyfloat], |
|
1102 proto=0, |
|
1103 doc="""Newline-terminated decimal float literal. |
|
1104 |
|
1105 The argument is repr(a_float), and in general requires 17 significant |
|
1106 digits for roundtrip conversion to be an identity (this is so for |
|
1107 IEEE-754 double precision values, which is what Python float maps to |
|
1108 on most boxes). |
|
1109 |
|
1110 In general, FLOAT cannot be used to transport infinities, NaNs, or |
|
1111 minus zero across boxes (or even on a single box, if the platform C |
|
1112 library can't read the strings it produces for such things -- Windows |
|
1113 is like that), but may do less damage than BINFLOAT on boxes with |
|
1114 greater precision or dynamic range than IEEE-754 double. |
|
1115 """), |
|
1116 |
|
1117 I(name='BINFLOAT', |
|
1118 code='G', |
|
1119 arg=float8, |
|
1120 stack_before=[], |
|
1121 stack_after=[pyfloat], |
|
1122 proto=1, |
|
1123 doc="""Float stored in binary form, with 8 bytes of data. |
|
1124 |
|
1125 This generally requires less than half the space of FLOAT encoding. |
|
1126 In general, BINFLOAT cannot be used to transport infinities, NaNs, or |
|
1127 minus zero, raises an exception if the exponent exceeds the range of |
|
1128 an IEEE-754 double, and retains no more than 53 bits of precision (if |
|
1129 there are more than that, "add a half and chop" rounding is used to |
|
1130 cut it back to 53 significant bits). |
|
1131 """), |
|
1132 |
|
1133 # Ways to build lists. |
|
1134 |
|
1135 I(name='EMPTY_LIST', |
|
1136 code=']', |
|
1137 arg=None, |
|
1138 stack_before=[], |
|
1139 stack_after=[pylist], |
|
1140 proto=1, |
|
1141 doc="Push an empty list."), |
|
1142 |
|
1143 I(name='APPEND', |
|
1144 code='a', |
|
1145 arg=None, |
|
1146 stack_before=[pylist, anyobject], |
|
1147 stack_after=[pylist], |
|
1148 proto=0, |
|
1149 doc="""Append an object to a list. |
|
1150 |
|
1151 Stack before: ... pylist anyobject |
|
1152 Stack after: ... pylist+[anyobject] |
|
1153 |
|
1154 although pylist is really extended in-place. |
|
1155 """), |
|
1156 |
|
1157 I(name='APPENDS', |
|
1158 code='e', |
|
1159 arg=None, |
|
1160 stack_before=[pylist, markobject, stackslice], |
|
1161 stack_after=[pylist], |
|
1162 proto=1, |
|
1163 doc="""Extend a list by a slice of stack objects. |
|
1164 |
|
1165 Stack before: ... pylist markobject stackslice |
|
1166 Stack after: ... pylist+stackslice |
|
1167 |
|
1168 although pylist is really extended in-place. |
|
1169 """), |
|
1170 |
|
1171 I(name='LIST', |
|
1172 code='l', |
|
1173 arg=None, |
|
1174 stack_before=[markobject, stackslice], |
|
1175 stack_after=[pylist], |
|
1176 proto=0, |
|
1177 doc="""Build a list out of the topmost stack slice, after markobject. |
|
1178 |
|
1179 All the stack entries following the topmost markobject are placed into |
|
1180 a single Python list, which single list object replaces all of the |
|
1181 stack from the topmost markobject onward. For example, |
|
1182 |
|
1183 Stack before: ... markobject 1 2 3 'abc' |
|
1184 Stack after: ... [1, 2, 3, 'abc'] |
|
1185 """), |
|
1186 |
|
1187 # Ways to build tuples. |
|
1188 |
|
1189 I(name='EMPTY_TUPLE', |
|
1190 code=')', |
|
1191 arg=None, |
|
1192 stack_before=[], |
|
1193 stack_after=[pytuple], |
|
1194 proto=1, |
|
1195 doc="Push an empty tuple."), |
|
1196 |
|
1197 I(name='TUPLE', |
|
1198 code='t', |
|
1199 arg=None, |
|
1200 stack_before=[markobject, stackslice], |
|
1201 stack_after=[pytuple], |
|
1202 proto=0, |
|
1203 doc="""Build a tuple out of the topmost stack slice, after markobject. |
|
1204 |
|
1205 All the stack entries following the topmost markobject are placed into |
|
1206 a single Python tuple, which single tuple object replaces all of the |
|
1207 stack from the topmost markobject onward. For example, |
|
1208 |
|
1209 Stack before: ... markobject 1 2 3 'abc' |
|
1210 Stack after: ... (1, 2, 3, 'abc') |
|
1211 """), |
|
1212 |
|
1213 I(name='TUPLE1', |
|
1214 code='\x85', |
|
1215 arg=None, |
|
1216 stack_before=[anyobject], |
|
1217 stack_after=[pytuple], |
|
1218 proto=2, |
|
1219 doc="""One-tuple. |
|
1220 |
|
1221 This code pops one value off the stack and pushes a tuple of |
|
1222 length 1 whose one item is that value back onto it. IOW: |
|
1223 |
|
1224 stack[-1] = tuple(stack[-1:]) |
|
1225 """), |
|
1226 |
|
1227 I(name='TUPLE2', |
|
1228 code='\x86', |
|
1229 arg=None, |
|
1230 stack_before=[anyobject, anyobject], |
|
1231 stack_after=[pytuple], |
|
1232 proto=2, |
|
1233 doc="""Two-tuple. |
|
1234 |
|
1235 This code pops two values off the stack and pushes a tuple |
|
1236 of length 2 whose items are those values back onto it. IOW: |
|
1237 |
|
1238 stack[-2:] = [tuple(stack[-2:])] |
|
1239 """), |
|
1240 |
|
1241 I(name='TUPLE3', |
|
1242 code='\x87', |
|
1243 arg=None, |
|
1244 stack_before=[anyobject, anyobject, anyobject], |
|
1245 stack_after=[pytuple], |
|
1246 proto=2, |
|
1247 doc="""Three-tuple. |
|
1248 |
|
1249 This code pops three values off the stack and pushes a tuple |
|
1250 of length 3 whose items are those values back onto it. IOW: |
|
1251 |
|
1252 stack[-3:] = [tuple(stack[-3:])] |
|
1253 """), |
|
1254 |
|
1255 # Ways to build dicts. |
|
1256 |
|
1257 I(name='EMPTY_DICT', |
|
1258 code='}', |
|
1259 arg=None, |
|
1260 stack_before=[], |
|
1261 stack_after=[pydict], |
|
1262 proto=1, |
|
1263 doc="Push an empty dict."), |
|
1264 |
|
1265 I(name='DICT', |
|
1266 code='d', |
|
1267 arg=None, |
|
1268 stack_before=[markobject, stackslice], |
|
1269 stack_after=[pydict], |
|
1270 proto=0, |
|
1271 doc="""Build a dict out of the topmost stack slice, after markobject. |
|
1272 |
|
1273 All the stack entries following the topmost markobject are placed into |
|
1274 a single Python dict, which single dict object replaces all of the |
|
1275 stack from the topmost markobject onward. The stack slice alternates |
|
1276 key, value, key, value, .... For example, |
|
1277 |
|
1278 Stack before: ... markobject 1 2 3 'abc' |
|
1279 Stack after: ... {1: 2, 3: 'abc'} |
|
1280 """), |
|
1281 |
|
1282 I(name='SETITEM', |
|
1283 code='s', |
|
1284 arg=None, |
|
1285 stack_before=[pydict, anyobject, anyobject], |
|
1286 stack_after=[pydict], |
|
1287 proto=0, |
|
1288 doc="""Add a key+value pair to an existing dict. |
|
1289 |
|
1290 Stack before: ... pydict key value |
|
1291 Stack after: ... pydict |
|
1292 |
|
1293 where pydict has been modified via pydict[key] = value. |
|
1294 """), |
|
1295 |
|
1296 I(name='SETITEMS', |
|
1297 code='u', |
|
1298 arg=None, |
|
1299 stack_before=[pydict, markobject, stackslice], |
|
1300 stack_after=[pydict], |
|
1301 proto=1, |
|
1302 doc="""Add an arbitrary number of key+value pairs to an existing dict. |
|
1303 |
|
1304 The slice of the stack following the topmost markobject is taken as |
|
1305 an alternating sequence of keys and values, added to the dict |
|
1306 immediately under the topmost markobject. Everything at and after the |
|
1307 topmost markobject is popped, leaving the mutated dict at the top |
|
1308 of the stack. |
|
1309 |
|
1310 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n |
|
1311 Stack after: ... pydict |
|
1312 |
|
1313 where pydict has been modified via pydict[key_i] = value_i for i in |
|
1314 1, 2, ..., n, and in that order. |
|
1315 """), |
|
1316 |
|
1317 # Stack manipulation. |
|
1318 |
|
1319 I(name='POP', |
|
1320 code='0', |
|
1321 arg=None, |
|
1322 stack_before=[anyobject], |
|
1323 stack_after=[], |
|
1324 proto=0, |
|
1325 doc="Discard the top stack item, shrinking the stack by one item."), |
|
1326 |
|
1327 I(name='DUP', |
|
1328 code='2', |
|
1329 arg=None, |
|
1330 stack_before=[anyobject], |
|
1331 stack_after=[anyobject, anyobject], |
|
1332 proto=0, |
|
1333 doc="Push the top stack item onto the stack again, duplicating it."), |
|
1334 |
|
1335 I(name='MARK', |
|
1336 code='(', |
|
1337 arg=None, |
|
1338 stack_before=[], |
|
1339 stack_after=[markobject], |
|
1340 proto=0, |
|
1341 doc="""Push markobject onto the stack. |
|
1342 |
|
1343 markobject is a unique object, used by other opcodes to identify a |
|
1344 region of the stack containing a variable number of objects for them |
|
1345 to work on. See markobject.doc for more detail. |
|
1346 """), |
|
1347 |
|
1348 I(name='POP_MARK', |
|
1349 code='1', |
|
1350 arg=None, |
|
1351 stack_before=[markobject, stackslice], |
|
1352 stack_after=[], |
|
1353 proto=0, |
|
1354 doc="""Pop all the stack objects at and above the topmost markobject. |
|
1355 |
|
1356 When an opcode using a variable number of stack objects is done, |
|
1357 POP_MARK is used to remove those objects, and to remove the markobject |
|
1358 that delimited their starting position on the stack. |
|
1359 """), |
|
1360 |
|
1361 # Memo manipulation. There are really only two operations (get and put), |
|
1362 # each in all-text, "short binary", and "long binary" flavors. |
|
1363 |
|
1364 I(name='GET', |
|
1365 code='g', |
|
1366 arg=decimalnl_short, |
|
1367 stack_before=[], |
|
1368 stack_after=[anyobject], |
|
1369 proto=0, |
|
1370 doc="""Read an object from the memo and push it on the stack. |
|
1371 |
|
1372 The index of the memo object to push is given by the newline-terminated |
|
1373 decimal string following. BINGET and LONG_BINGET are space-optimized |
|
1374 versions. |
|
1375 """), |
|
1376 |
|
1377 I(name='BINGET', |
|
1378 code='h', |
|
1379 arg=uint1, |
|
1380 stack_before=[], |
|
1381 stack_after=[anyobject], |
|
1382 proto=1, |
|
1383 doc="""Read an object from the memo and push it on the stack. |
|
1384 |
|
1385 The index of the memo object to push is given by the 1-byte unsigned |
|
1386 integer following. |
|
1387 """), |
|
1388 |
|
1389 I(name='LONG_BINGET', |
|
1390 code='j', |
|
1391 arg=int4, |
|
1392 stack_before=[], |
|
1393 stack_after=[anyobject], |
|
1394 proto=1, |
|
1395 doc="""Read an object from the memo and push it on the stack. |
|
1396 |
|
1397 The index of the memo object to push is given by the 4-byte signed |
|
1398 little-endian integer following. |
|
1399 """), |
|
1400 |
|
1401 I(name='PUT', |
|
1402 code='p', |
|
1403 arg=decimalnl_short, |
|
1404 stack_before=[], |
|
1405 stack_after=[], |
|
1406 proto=0, |
|
1407 doc="""Store the stack top into the memo. The stack is not popped. |
|
1408 |
|
1409 The index of the memo location to write into is given by the newline- |
|
1410 terminated decimal string following. BINPUT and LONG_BINPUT are |
|
1411 space-optimized versions. |
|
1412 """), |
|
1413 |
|
1414 I(name='BINPUT', |
|
1415 code='q', |
|
1416 arg=uint1, |
|
1417 stack_before=[], |
|
1418 stack_after=[], |
|
1419 proto=1, |
|
1420 doc="""Store the stack top into the memo. The stack is not popped. |
|
1421 |
|
1422 The index of the memo location to write into is given by the 1-byte |
|
1423 unsigned integer following. |
|
1424 """), |
|
1425 |
|
1426 I(name='LONG_BINPUT', |
|
1427 code='r', |
|
1428 arg=int4, |
|
1429 stack_before=[], |
|
1430 stack_after=[], |
|
1431 proto=1, |
|
1432 doc="""Store the stack top into the memo. The stack is not popped. |
|
1433 |
|
1434 The index of the memo location to write into is given by the 4-byte |
|
1435 signed little-endian integer following. |
|
1436 """), |
|
1437 |
|
1438 # Access the extension registry (predefined objects). Akin to the GET |
|
1439 # family. |
|
1440 |
|
1441 I(name='EXT1', |
|
1442 code='\x82', |
|
1443 arg=uint1, |
|
1444 stack_before=[], |
|
1445 stack_after=[anyobject], |
|
1446 proto=2, |
|
1447 doc="""Extension code. |
|
1448 |
|
1449 This code and the similar EXT2 and EXT4 allow using a registry |
|
1450 of popular objects that are pickled by name, typically classes. |
|
1451 It is envisioned that through a global negotiation and |
|
1452 registration process, third parties can set up a mapping between |
|
1453 ints and object names. |
|
1454 |
|
1455 In order to guarantee pickle interchangeability, the extension |
|
1456 code registry ought to be global, although a range of codes may |
|
1457 be reserved for private use. |
|
1458 |
|
1459 EXT1 has a 1-byte integer argument. This is used to index into the |
|
1460 extension registry, and the object at that index is pushed on the stack. |
|
1461 """), |
|
1462 |
|
1463 I(name='EXT2', |
|
1464 code='\x83', |
|
1465 arg=uint2, |
|
1466 stack_before=[], |
|
1467 stack_after=[anyobject], |
|
1468 proto=2, |
|
1469 doc="""Extension code. |
|
1470 |
|
1471 See EXT1. EXT2 has a two-byte integer argument. |
|
1472 """), |
|
1473 |
|
1474 I(name='EXT4', |
|
1475 code='\x84', |
|
1476 arg=int4, |
|
1477 stack_before=[], |
|
1478 stack_after=[anyobject], |
|
1479 proto=2, |
|
1480 doc="""Extension code. |
|
1481 |
|
1482 See EXT1. EXT4 has a four-byte integer argument. |
|
1483 """), |
|
1484 |
|
1485 # Push a class object, or module function, on the stack, via its module |
|
1486 # and name. |
|
1487 |
|
1488 I(name='GLOBAL', |
|
1489 code='c', |
|
1490 arg=stringnl_noescape_pair, |
|
1491 stack_before=[], |
|
1492 stack_after=[anyobject], |
|
1493 proto=0, |
|
1494 doc="""Push a global object (module.attr) on the stack. |
|
1495 |
|
1496 Two newline-terminated strings follow the GLOBAL opcode. The first is |
|
1497 taken as a module name, and the second as a class name. The class |
|
1498 object module.class is pushed on the stack. More accurately, the |
|
1499 object returned by self.find_class(module, class) is pushed on the |
|
1500 stack, so unpickling subclasses can override this form of lookup. |
|
1501 """), |
|
1502 |
|
1503 # Ways to build objects of classes pickle doesn't know about directly |
|
1504 # (user-defined classes). I despair of documenting this accurately |
|
1505 # and comprehensibly -- you really have to read the pickle code to |
|
1506 # find all the special cases. |
|
1507 |
|
1508 I(name='REDUCE', |
|
1509 code='R', |
|
1510 arg=None, |
|
1511 stack_before=[anyobject, anyobject], |
|
1512 stack_after=[anyobject], |
|
1513 proto=0, |
|
1514 doc="""Push an object built from a callable and an argument tuple. |
|
1515 |
|
1516 The opcode is named as a reminder of the __reduce__() method. |
|
1517 |
|
1518 Stack before: ... callable pytuple |
|
1519 Stack after: ... callable(*pytuple) |
|
1520 |
|
1521 The callable and the argument tuple are the first two items returned |
|
1522 by a __reduce__ method. Applying the callable to the argtuple is |
|
1523 supposed to reproduce the original object, or at least get it started. |
|
1524 If the __reduce__ method returns a 3-tuple, the last component is an |
|
1525 argument to be passed to the object's __setstate__, and then the REDUCE |
|
1526 opcode is followed by code to create setstate's argument, and then a |
|
1527 BUILD opcode to apply __setstate__ to that argument. |
|
1528 |
|
1529 If type(callable) is not ClassType, REDUCE complains unless the |
|
1530 callable has been registered with the copy_reg module's |
|
1531 safe_constructors dict, or the callable has a magic |
|
1532 '__safe_for_unpickling__' attribute with a true value. I'm not sure |
|
1533 why it does this, but I've sure seen this complaint often enough when |
|
1534 I didn't want to <wink>. |
|
1535 """), |
|
1536 |
|
1537 I(name='BUILD', |
|
1538 code='b', |
|
1539 arg=None, |
|
1540 stack_before=[anyobject, anyobject], |
|
1541 stack_after=[anyobject], |
|
1542 proto=0, |
|
1543 doc="""Finish building an object, via __setstate__ or dict update. |
|
1544 |
|
1545 Stack before: ... anyobject argument |
|
1546 Stack after: ... anyobject |
|
1547 |
|
1548 where anyobject may have been mutated, as follows: |
|
1549 |
|
1550 If the object has a __setstate__ method, |
|
1551 |
|
1552 anyobject.__setstate__(argument) |
|
1553 |
|
1554 is called. |
|
1555 |
|
1556 Else the argument must be a dict, the object must have a __dict__, and |
|
1557 the object is updated via |
|
1558 |
|
1559 anyobject.__dict__.update(argument) |
|
1560 |
|
1561 This may raise RuntimeError in restricted execution mode (which |
|
1562 disallows access to __dict__ directly); in that case, the object |
|
1563 is updated instead via |
|
1564 |
|
1565 for k, v in argument.items(): |
|
1566 setattr(anyobject, k, v) |
|
1567 """), |
|
1568 |
|
1569 I(name='INST', |
|
1570 code='i', |
|
1571 arg=stringnl_noescape_pair, |
|
1572 stack_before=[markobject, stackslice], |
|
1573 stack_after=[anyobject], |
|
1574 proto=0, |
|
1575 doc="""Build a class instance. |
|
1576 |
|
1577 This is the protocol 0 version of protocol 1's OBJ opcode. |
|
1578 INST is followed by two newline-terminated strings, giving a |
|
1579 module and class name, just as for the GLOBAL opcode (and see |
|
1580 GLOBAL for more details about that). self.find_class(module, name) |
|
1581 is used to get a class object. |
|
1582 |
|
1583 In addition, all the objects on the stack following the topmost |
|
1584 markobject are gathered into a tuple and popped (along with the |
|
1585 topmost markobject), just as for the TUPLE opcode. |
|
1586 |
|
1587 Now it gets complicated. If all of these are true: |
|
1588 |
|
1589 + The argtuple is empty (markobject was at the top of the stack |
|
1590 at the start). |
|
1591 |
|
1592 + It's an old-style class object (the type of the class object is |
|
1593 ClassType). |
|
1594 |
|
1595 + The class object does not have a __getinitargs__ attribute. |
|
1596 |
|
1597 then we want to create an old-style class instance without invoking |
|
1598 its __init__() method (pickle has waffled on this over the years; not |
|
1599 calling __init__() is current wisdom). In this case, an instance of |
|
1600 an old-style dummy class is created, and then we try to rebind its |
|
1601 __class__ attribute to the desired class object. If this succeeds, |
|
1602 the new instance object is pushed on the stack, and we're done. In |
|
1603 restricted execution mode it can fail (assignment to __class__ is |
|
1604 disallowed), and I'm not really sure what happens then -- it looks |
|
1605 like the code ends up calling the class object's __init__ anyway, |
|
1606 via falling into the next case. |
|
1607 |
|
1608 Else (the argtuple is not empty, it's not an old-style class object, |
|
1609 or the class object does have a __getinitargs__ attribute), the code |
|
1610 first insists that the class object have a __safe_for_unpickling__ |
|
1611 attribute. Unlike the __safe_for_unpickling__ check in REDUCE, |
|
1612 it doesn't matter whether this attribute has a true or false value, it |
|
1613 only matters whether it exists (XXX this is a bug; cPickle |
|
1614 requires the attribute to be true). If __safe_for_unpickling__ |
|
1615 doesn't exist, UnpicklingError is raised. |
|
1616 |
|
1617 Else (the class object does have a __safe_for_unpickling__ attr), |
|
1618 the class object obtained from INST's arguments is applied to the |
|
1619 argtuple obtained from the stack, and the resulting instance object |
|
1620 is pushed on the stack. |
|
1621 |
|
1622 NOTE: checks for __safe_for_unpickling__ went away in Python 2.3. |
|
1623 """), |
|
1624 |
|
1625 I(name='OBJ', |
|
1626 code='o', |
|
1627 arg=None, |
|
1628 stack_before=[markobject, anyobject, stackslice], |
|
1629 stack_after=[anyobject], |
|
1630 proto=1, |
|
1631 doc="""Build a class instance. |
|
1632 |
|
1633 This is the protocol 1 version of protocol 0's INST opcode, and is |
|
1634 very much like it. The major difference is that the class object |
|
1635 is taken off the stack, allowing it to be retrieved from the memo |
|
1636 repeatedly if several instances of the same class are created. This |
|
1637 can be much more efficient (in both time and space) than repeatedly |
|
1638 embedding the module and class names in INST opcodes. |
|
1639 |
|
1640 Unlike INST, OBJ takes no arguments from the opcode stream. Instead |
|
1641 the class object is taken off the stack, immediately above the |
|
1642 topmost markobject: |
|
1643 |
|
1644 Stack before: ... markobject classobject stackslice |
|
1645 Stack after: ... new_instance_object |
|
1646 |
|
1647 As for INST, the remainder of the stack above the markobject is |
|
1648 gathered into an argument tuple, and then the logic seems identical, |
|
1649 except that no __safe_for_unpickling__ check is done (XXX this is |
|
1650 a bug; cPickle does test __safe_for_unpickling__). See INST for |
|
1651 the gory details. |
|
1652 |
|
1653 NOTE: In Python 2.3, INST and OBJ are identical except for how they |
|
1654 get the class object. That was always the intent; the implementations |
|
1655 had diverged for accidental reasons. |
|
1656 """), |
|
1657 |
|
1658 I(name='NEWOBJ', |
|
1659 code='\x81', |
|
1660 arg=None, |
|
1661 stack_before=[anyobject, anyobject], |
|
1662 stack_after=[anyobject], |
|
1663 proto=2, |
|
1664 doc="""Build an object instance. |
|
1665 |
|
1666 The stack before should be thought of as containing a class |
|
1667 object followed by an argument tuple (the tuple being the stack |
|
1668 top). Call these cls and args. They are popped off the stack, |
|
1669 and the value returned by cls.__new__(cls, *args) is pushed back |
|
1670 onto the stack. |
|
1671 """), |
|
1672 |
|
1673 # Machine control. |
|
1674 |
|
1675 I(name='PROTO', |
|
1676 code='\x80', |
|
1677 arg=uint1, |
|
1678 stack_before=[], |
|
1679 stack_after=[], |
|
1680 proto=2, |
|
1681 doc="""Protocol version indicator. |
|
1682 |
|
1683 For protocol 2 and above, a pickle must start with this opcode. |
|
1684 The argument is the protocol version, an int in range(2, 256). |
|
1685 """), |
|
1686 |
|
1687 I(name='STOP', |
|
1688 code='.', |
|
1689 arg=None, |
|
1690 stack_before=[anyobject], |
|
1691 stack_after=[], |
|
1692 proto=0, |
|
1693 doc="""Stop the unpickling machine. |
|
1694 |
|
1695 Every pickle ends with this opcode. The object at the top of the stack |
|
1696 is popped, and that's the result of unpickling. The stack should be |
|
1697 empty then. |
|
1698 """), |
|
1699 |
|
1700 # Ways to deal with persistent IDs. |
|
1701 |
|
1702 I(name='PERSID', |
|
1703 code='P', |
|
1704 arg=stringnl_noescape, |
|
1705 stack_before=[], |
|
1706 stack_after=[anyobject], |
|
1707 proto=0, |
|
1708 doc="""Push an object identified by a persistent ID. |
|
1709 |
|
1710 The pickle module doesn't define what a persistent ID means. PERSID's |
|
1711 argument is a newline-terminated str-style (no embedded escapes, no |
|
1712 bracketing quote characters) string, which *is* "the persistent ID". |
|
1713 The unpickler passes this string to self.persistent_load(). Whatever |
|
1714 object that returns is pushed on the stack. There is no implementation |
|
1715 of persistent_load() in Python's unpickler: it must be supplied by an |
|
1716 unpickler subclass. |
|
1717 """), |
|
1718 |
|
1719 I(name='BINPERSID', |
|
1720 code='Q', |
|
1721 arg=None, |
|
1722 stack_before=[anyobject], |
|
1723 stack_after=[anyobject], |
|
1724 proto=1, |
|
1725 doc="""Push an object identified by a persistent ID. |
|
1726 |
|
1727 Like PERSID, except the persistent ID is popped off the stack (instead |
|
1728 of being a string embedded in the opcode bytestream). The persistent |
|
1729 ID is passed to self.persistent_load(), and whatever object that |
|
1730 returns is pushed on the stack. See PERSID for more detail. |
|
1731 """), |
|
1732 ] |
|
1733 del I |
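# Illustration only (not part of the original module): a rough sketch of how
# the mark-based opcodes documented above reshape the stack. This is not the
# real unpickler. It assumes the pickle has already been decoded into
# (opcode_name, arg) pairs, and it uses a hypothetical "PUSH" pseudo-opcode to
# stand for any opcode that pushes a single value (INT, STRING, ...). The names
# `MARK`, `pop_to_mark`, and `run` are made up for this example.

```python
MARK = object()  # stand-in for markobject: unique, never equal to real data


def pop_to_mark(stack):
    """Remove and return everything above the topmost MARK (MARK goes too)."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] is MARK:
            items = stack[i + 1:]
            del stack[i:]
            return items
    raise ValueError("no MARK on the stack")


def run(program):
    """Interpret (opcode_name, arg) pairs; return the object popped by STOP."""
    stack = []
    for name, arg in program:
        if name == "PUSH":            # hypothetical: stands for INT, STRING, ...
            stack.append(arg)
        elif name == "MARK":
            stack.append(MARK)
        elif name == "EMPTY_LIST":
            stack.append([])
        elif name == "EMPTY_DICT":
            stack.append({})
        elif name == "LIST":
            stack.append(pop_to_mark(stack))
        elif name == "TUPLE":
            stack.append(tuple(pop_to_mark(stack)))
        elif name == "TUPLE2":
            stack[-2:] = [tuple(stack[-2:])]
        elif name == "DICT":
            items = pop_to_mark(stack)
            # The slice alternates key, value, key, value, ...
            stack.append(dict(zip(items[0::2], items[1::2])))
        elif name == "APPENDS":
            items = pop_to_mark(stack)
            stack[-1].extend(items)   # the list is extended in place
        elif name == "SETITEMS":
            items = pop_to_mark(stack)
            for key, value in zip(items[0::2], items[1::2]):
                stack[-1][key] = value
        elif name == "POP_MARK":
            pop_to_mark(stack)
        elif name == "STOP":
            return stack.pop()
        else:
            raise ValueError("unsupported opcode %r" % name)
    raise ValueError("program ended without STOP")
```

# For instance, MARK, four PUSHes of 1, 2, 3, 'abc', then LIST and STOP walks
# through the "Stack before / Stack after" example given in the LIST entry.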
|
1734 |
|
1735 # Verify uniqueness of .name and .code members. |
|
1736 name2i = {} |
|
1737 code2i = {} |
|
1738 |
|
1739 for i, d in enumerate(opcodes): |
|
1740 if d.name in name2i: |
|
1741 raise ValueError("repeated name %r at indices %d and %d" % |
|
1742 (d.name, name2i[d.name], i)) |
|
1743 if d.code in code2i: |
|
1744 raise ValueError("repeated code %r at indices %d and %d" % |
|
1745 (d.code, code2i[d.code], i)) |
|
1746 |
|
1747 name2i[d.name] = i |
|
1748 code2i[d.code] = i |
|
1749 |
|
1750 del name2i, code2i, i, d |
|
1751 |
|
1752 ############################################################################## |
|
1753 # Build a code2op dict, mapping opcode characters to OpcodeInfo records. |
|
1754 # Also ensure we've got the same stuff as pickle.py, although the |
|
1755 # introspection here is dicey. |
|
1756 |
|
1757 code2op = {} |
|
1758 for d in opcodes: |
|
1759 code2op[d.code] = d |
|
1760 del d |
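# Illustration only (not part of the original module): a small sketch of using
# a code2op-style mapping to name the opcode chosen for a value. It assumes
# Python 3, where the standard library exposes this table as
# `pickletools.code2op` (keyed by one-character str) and pickles are bytes.
# `opcode_name_for` is a hypothetical helper. Protocol 2 is fixed so that the
# pickle always begins with PROTO plus its one-byte argument.

```python
import pickle
import pickletools


def opcode_name_for(obj):
    """Name of the first opcode after the PROTO header in a protocol-2 pickle."""
    data = pickle.dumps(obj, protocol=2)
    # Offset 0 is the PROTO opcode and offset 1 is its argument, so the
    # opcode that starts building obj sits at offset 2.
    code = chr(data[2])  # map the byte to the matching one-character key
    return pickletools.code2op[code].name


if __name__ == "__main__":
    for value in (5, 300, 70000, -1, 2 ** 40, True, None):
        print(repr(value), opcode_name_for(value))
```

# The integer cases mirror the space optimizations described in the BININT1,
# BININT2, BININT, and LONG1 entries: the smallest encoding that fits is used.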
|
1761 |
|
1762 def assure_pickle_consistency(verbose=False): |
|
1763 import pickle, re |
|
1764 |
|
1765 copy = code2op.copy() |
|
1766 for name in pickle.__all__: |
|
1767 if not re.match("[A-Z][A-Z0-9_]+$", name): |
|
1768 if verbose: |
|
1769 print "skipping %r: it doesn't look like an opcode name" % name |
|
1770 continue |
|
1771 picklecode = getattr(pickle, name) |
|
1772 if not isinstance(picklecode, str) or len(picklecode) != 1: |
|
1773 if verbose: |
|
1774 print ("skipping %r: value %r doesn't look like a pickle " |
|
1775 "code" % (name, picklecode)) |
|
1776 continue |
|
1777 if picklecode in copy: |
|
1778 if verbose: |
|
1779 print "checking name %r w/ code %r for consistency" % ( |
|
1780 name, picklecode) |
|
1781 d = copy[picklecode] |
|
1782 if d.name != name: |
|
1783 raise ValueError("for pickle code %r, pickle.py uses name %r " |
|
1784 "but we're using name %r" % (picklecode, |
|
1785 name, |
|
1786 d.name)) |
|
1787 # Forget this one. Any left over in copy at the end are a problem |
|
1788 # of a different kind. |
|
1789 del copy[picklecode] |
|
1790 else: |
|
1791 raise ValueError("pickle.py appears to have a pickle opcode with " |
|
1792 "name %r and code %r, but we don't" % |
|
1793 (name, picklecode)) |
|
1794 if copy: |
|
1795 msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"] |
|
1796 for code, d in copy.items(): |
|
1797 msg.append(" name %r with code %r" % (d.name, code)) |
|
1798 raise ValueError("\n".join(msg)) |
|
1799 |
|
1800 assure_pickle_consistency() |
|
1801 del assure_pickle_consistency |
|
1802 |
|
##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or string, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos is the value pickle.tell()
    returned just before the current opcode was read. If the pickle is a
    string object, it's wrapped in a StringIO object, and the latter's tell()
    result is used. Otherwise (the pickle doesn't have a tell(), and it's not
    obvious how to query its current position) pos is None.
    """

    import cStringIO as StringIO

    if isinstance(pickle, str):
        pickle = StringIO.StringIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code)
        if opcode is None:
            if code == "":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 pos is None and "<unknown>" or pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == '.':
            assert opcode.name == 'STOP'
            break

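"""
A short usage sketch for genops(), offered as an illustration rather than as
part of this module. It assumes the module is importable as pickletools (the
name the doctests below also use) and sticks to constructs that do not depend
on one particular Python version, so no exact opcode listing is shown.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2, (3, 4)], 0)

# Collect (position, opcode name, decoded argument) for every opcode.
ops = [(pos, op.name, arg) for op, arg, pos in pickletools.genops(data)]

for pos, name, arg in ops:
    if arg is None:
        print("%5d: %s" % (pos, name))
    else:
        print("%5d: %-10s %r" % (pos, name, arg))
```

Because the pickle is passed as a string, every pos is an integer offset: the
first triple is at position 0 and the last one is the STOP opcode, which
occupies the final byte of the pickle.
"""
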
##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or string, containing at least one
    pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT, BINPUT, or
    LONG_BINPUT opcodes. Passing the same memo object to another dis() call
    then allows disassembly to proceed across multiple pickles that were all
    created by the same pickler with the same memo. Ordinarily you don't need
    to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print >> out, "%5d:" % pos,

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        print >> out, line

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print >> out, "highest protocol among opcodes =", maxproto
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

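"""
The last line dis() prints is the highest protocol number among the opcodes
it saw. As a hedged sketch of the same bookkeeping without any printing, the
hypothetical helper highest_protocol() below (not part of this module) walks
genops() and keeps the maximum .proto value. It assumes the module is
importable as pickletools.

```python
import pickle
import pickletools

def highest_protocol(data):
    # Same idea as maxproto in dis(): start at -1 and keep the maximum.
    maxproto = -1
    for opcode, arg, pos in pickletools.genops(data):
        maxproto = max(maxproto, opcode.proto)
    return maxproto

value = [1, 2, (3, 4)]
# One result per protocol requested from pickle.dumps().
protos = [highest_protocol(pickle.dumps(value, p)) for p in (0, 1, 2)]
```

For this small list the result matches the protocol requested from
pickle.dumps(). In general it can be lower, since a pickle written at
protocol N need not contain any opcode introduced in protocol N.
"""
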
# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {'abc': u"def"}]
>>> pkl = pickle.dumps(x, 0)
>>> dis(pkl)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: I    INT        1
    8: a    APPEND
    9: I    INT        2
   12: a    APPEND
   13: (    MARK
   14: I        INT        3
   17: I        INT        4
   20: t        TUPLE      (MARK at 13)
   21: p    PUT        1
   24: a    APPEND
   25: (    MARK
   26: d        DICT       (MARK at 25)
   27: p    PUT        2
   30: S    STRING     'abc'
   37: p    PUT        3
   40: V    UNICODE    u'def'
   45: p    PUT        4
   48: s    SETITEM
   49: a    APPEND
   50: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl = pickle.dumps(x, 1)
>>> dis(pkl)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: U        SHORT_BINSTRING 'abc'
   24: q        BINPUT     3
   26: X        BINUNICODE u'def'
   34: q        BINPUT     4
   36: s        SETITEM
   37: e        APPENDS    (MARK at 3)
   38: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import random
>>> dis(pickle.dumps(random.random, 0))
    0: c    GLOBAL     'random random'
   15: p    PUT        0
   18: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: i        INST       'pickletools _Example' (MARK at 5)
   28: p    PUT        1
   31: (    MARK
   32: d        DICT       (MARK at 31)
   33: p    PUT        2
   36: S    STRING     'value'
   45: p    PUT        3
   48: I    INT        42
   52: s    SETITEM
   53: b    BUILD
   54: a    APPEND
   55: g    GET        1
   58: a    APPEND
   59: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: (        MARK
    5: c            GLOBAL     'pickletools _Example'
   27: q            BINPUT     1
   29: o            OBJ        (MARK at 4)
   30: q        BINPUT     2
   32: }        EMPTY_DICT
   33: q        BINPUT     3
   35: U        SHORT_BINSTRING 'value'
   42: q        BINPUT     4
   44: K        BININT1    42
   46: s        SETITEM
   47: b        BUILD
   48: h        BINGET     2
   50: e        APPENDS    (MARK at 3)
   51: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2
"""

_memo_test = r"""
>>> import pickle
>>> from StringIO import StringIO
>>> f = StringIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    _test()