|
1 '''"Executable documentation" for the pickle module. |
|
2 |
|
3 Extensive comments about the pickle protocols and pickle-machine opcodes |
|
4 can be found here. Some functions meant for external use: |
|
5 |
|
6 genops(pickle) |
|
7 Generate all the opcodes in a pickle, as (opcode, arg, position) triples. |
|
8 |
|
9 dis(pickle, out=None, memo=None, indentlevel=4) |
|
10 Print a symbolic disassembly of a pickle. |
|
11 ''' |
|
12 |
|
13 __all__ = ['dis', 'genops', 'optimize'] |
|
14 |
|
15 # Other ideas: |
|
16 # |
|
17 # - A pickle verifier: read a pickle and check it exhaustively for |
|
18 # well-formedness. dis() does a lot of this already. |
|
19 # |
|
20 # - A protocol identifier: examine a pickle and return its protocol number |
|
21 # (== the highest .proto attr value among all the opcodes in the pickle). |
|
22 # dis() already prints this info at the end. |
|
23 # |
|
24 # - A pickle optimizer: for example, tuple-building code is sometimes more |
|
25 # elaborate than necessary, catering for the possibility that the tuple |
|
26 # is recursive. Or lots of times a PUT is generated that's never accessed |
|
27 # by a later GET. |
|
28 |
|
29 |
|
30 """ |
|
31 "A pickle" is a program for a virtual pickle machine (PM, but more accurately |
|
32 called an unpickling machine). It's a sequence of opcodes, interpreted by the |
|
33 PM, building an arbitrarily complex Python object. |
|
34 |
|
35 For the most part, the PM is very simple: there are no looping, testing, or |
|
36 conditional instructions, no arithmetic and no function calls. Opcodes are |
|
37 executed once each, from first to last, until a STOP opcode is reached. |
|
38 |
|
39 The PM has two data areas, "the stack" and "the memo". |
|
40 |
|
41 Many opcodes push Python objects onto the stack; e.g., INT pushes a Python |
|
42 integer object on the stack, whose value is gotten from a decimal string |
|
43 literal immediately following the INT opcode in the pickle bytestream. Other |
|
44 opcodes take Python objects off the stack. The result of unpickling is |
|
45 whatever object is left on the stack when the final STOP opcode is executed. |
|
46 |
|
47 The memo is simply an array of objects, or it can be implemented as a dict |
|
48 mapping little integers to objects. The memo serves as the PM's "long term |
|
49 memory", and the little integers indexing the memo are akin to variable |
|
50 names. Some opcodes pop a stack object into the memo at a given index, |
|
51 and others push a memo object at a given index onto the stack again. |
|
52 |
|
53 At heart, that's all the PM has. Subtleties arise for these reasons: |
|
54 |
|
55 + Object identity. Objects can be arbitrarily complex, and subobjects |
|
56 may be shared (for example, the list [a, a] refers to the same object a |
|
57 twice). It can be vital that unpickling recreate an isomorphic object |
|
58 graph, faithfully reproducing sharing. |
|
59 |
|
60 + Recursive objects. For example, after "L = []; L.append(L)", L is a |
|
61 list, and L[0] is the same list. This is related to the object identity |
|
62 point, and some sequences of pickle opcodes are subtle in order to |
|
63 get the right result in all cases. |
|
64 |
|
65 + Things pickle doesn't know everything about. Examples of things pickle |
|
66 does know everything about are Python's builtin scalar and container |
|
67 types, like ints and tuples. They generally have opcodes dedicated to |
|
68 them. For things like module references and instances of user-defined |
|
69 classes, pickle's knowledge is limited. Historically, many enhancements |
|
70 have been made to the pickle protocol in order to do a better (faster, |
|
71 and/or more compact) job on those. |
|
72 |
|
73 + Backward compatibility and micro-optimization. As explained below, |
|
74 pickle opcodes never go away, not even when better ways to do a thing |
|
75 get invented. The repertoire of the PM just keeps growing over time. |
|
76 For example, protocol 0 had two opcodes for building Python integers (INT |
|
77 and LONG), protocol 1 added three more for more-efficient pickling of short |
|
78 integers, and protocol 2 added two more for more-efficient pickling of |
|
79 long integers (before protocol 2, the only ways to pickle a Python long |
|
80 took time quadratic in the number of digits, for both pickling and |
|
81 unpickling). "Opcode bloat" isn't so much a subtlety as a source of |
|
82 wearying complication. |
|
83 |
|
84 |
|
85 Pickle protocols: |
|
86 |
|
87 For compatibility, the meaning of a pickle opcode never changes. Instead new |
|
88 pickle opcodes get added, and each version's unpickler can handle all the |
|
89 pickle opcodes in all protocol versions to date. So old pickles continue to |
|
90 be readable forever. The pickler can generally be told to restrict itself to |
|
91 the subset of opcodes available under previous protocol versions too, so that |
|
92 users can create pickles under the current version readable by older |
|
93 versions. However, a pickle does not contain its version number embedded |
|
94 within it. If an older unpickler tries to read a pickle using a later |
|
95 protocol, the result is most likely an exception due to seeing an unknown (in |
|
96 the older unpickler) opcode. |
|
97 |
|
98 The original pickle used what's now called "protocol 0", and what was called |
|
99 "text mode" before Python 2.3. The entire pickle bytestream is made up of |
|
100 printable 7-bit ASCII characters, plus the newline character, in protocol 0. |
|
101 That's why it was called text mode. Protocol 0 is small and elegant, but |
|
102 sometimes painfully inefficient. |
|
103 |
|
104 The second major set of additions is now called "protocol 1", and was called |
|
105 "binary mode" before Python 2.3. This added many opcodes with arguments |
|
106 consisting of arbitrary bytes, including NUL bytes and unprintable "high bit" |
|
107 bytes. Binary mode pickles can be substantially smaller than equivalent |
|
108 text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte |
|
109 int as 4 bytes following the opcode, which is cheaper to unpickle than the |
|
110 (perhaps) 11-character decimal string attached to INT. Protocol 1 also added |
|
111 a number of opcodes that operate on many stack elements at once (like APPENDS |
|
112 and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE). |
|
113 |
|
114 The third major set of additions came in Python 2.3, and is called "protocol |
|
115 2". This added: |
|
116 |
|
117 - A better way to pickle instances of new-style classes (NEWOBJ). |
|
118 |
|
119 - A way for a pickle to identify its protocol (PROTO). |
|
120 |
|
121 - Time- and space- efficient pickling of long ints (LONG{1,4}). |
|
122 |
|
123 - Shortcuts for small tuples (TUPLE{1,2,3}}. |
|
124 |
|
125 - Dedicated opcodes for bools (NEWTRUE, NEWFALSE). |
|
126 |
|
127 - The "extension registry", a vector of popular objects that can be pushed |
|
128 efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but |
|
129 the registry contents are predefined (there's nothing akin to the memo's |
|
130 PUT). |
|
131 |
|
132 Another independent change with Python 2.3 is the abandonment of any |
|
133 pretense that it might be safe to load pickles received from untrusted |
|
134 parties -- no sufficient security analysis has been done to guarantee |
|
135 this and there isn't a use case that warrants the expense of such an |
|
136 analysis. |
|
137 |
|
138 To this end, all tests for __safe_for_unpickling__ or for |
|
139 copy_reg.safe_constructors are removed from the unpickling code. |
|
140 References to these variables in the descriptions below are to be seen |
|
141 as describing unpickling in Python 2.2 and before. |
|
142 """ |
|
143 |
|
144 # Meta-rule: Descriptions are stored in instances of descriptor objects, |
|
145 # with plain constructors. No meta-language is defined from which |
|
146 # descriptors could be constructed. If you want, e.g., XML, write a little |
|
147 # program to generate XML from the objects. |
|
148 |
|
149 ############################################################################## |
|
150 # Some pickle opcodes have an argument, following the opcode in the |
|
151 # bytestream. An argument is of a specific type, described by an instance |
|
152 # of ArgumentDescriptor. These are not to be confused with arguments taken |
|
153 # off the stack -- ArgumentDescriptor applies only to arguments embedded in |
|
154 # the opcode stream, immediately following an opcode. |
|
155 |
|
156 # Represents the number of bytes consumed by an argument delimited by the |
|
157 # next newline character. |
|
158 UP_TO_NEWLINE = -1 |
|
159 |
|
160 # Represents the number of bytes consumed by a two-argument opcode where |
|
161 # the first argument gives the number of bytes in the second argument. |
|
162 TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int |
|
163 TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int |
|
164 |
|
165 class ArgumentDescriptor(object): |
|
166 __slots__ = ( |
|
167 # name of descriptor record, also a module global name; a string |
|
168 'name', |
|
169 |
|
170 # length of argument, in bytes; an int; UP_TO_NEWLINE and |
|
171 # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length |
|
172 # cases |
|
173 'n', |
|
174 |
|
175 # a function taking a file-like object, reading this kind of argument |
|
176 # from the object at the current position, advancing the current |
|
177 # position by n bytes, and returning the value of the argument |
|
178 'reader', |
|
179 |
|
180 # human-readable docs for this arg descriptor; a string |
|
181 'doc', |
|
182 ) |
|
183 |
|
184 def __init__(self, name, n, reader, doc): |
|
185 assert isinstance(name, str) |
|
186 self.name = name |
|
187 |
|
188 assert isinstance(n, int) and (n >= 0 or |
|
189 n in (UP_TO_NEWLINE, |
|
190 TAKEN_FROM_ARGUMENT1, |
|
191 TAKEN_FROM_ARGUMENT4)) |
|
192 self.n = n |
|
193 |
|
194 self.reader = reader |
|
195 |
|
196 assert isinstance(doc, str) |
|
197 self.doc = doc |
|
198 |
|
199 from struct import unpack as _unpack |
|
200 |
|
201 def read_uint1(f): |
|
202 r""" |
|
203 >>> import StringIO |
|
204 >>> read_uint1(StringIO.StringIO('\xff')) |
|
205 255 |
|
206 """ |
|
207 |
|
208 data = f.read(1) |
|
209 if data: |
|
210 return ord(data) |
|
211 raise ValueError("not enough data in stream to read uint1") |
|
212 |
|
213 uint1 = ArgumentDescriptor( |
|
214 name='uint1', |
|
215 n=1, |
|
216 reader=read_uint1, |
|
217 doc="One-byte unsigned integer.") |
|
218 |
|
219 |
|
220 def read_uint2(f): |
|
221 r""" |
|
222 >>> import StringIO |
|
223 >>> read_uint2(StringIO.StringIO('\xff\x00')) |
|
224 255 |
|
225 >>> read_uint2(StringIO.StringIO('\xff\xff')) |
|
226 65535 |
|
227 """ |
|
228 |
|
229 data = f.read(2) |
|
230 if len(data) == 2: |
|
231 return _unpack("<H", data)[0] |
|
232 raise ValueError("not enough data in stream to read uint2") |
|
233 |
|
234 uint2 = ArgumentDescriptor( |
|
235 name='uint2', |
|
236 n=2, |
|
237 reader=read_uint2, |
|
238 doc="Two-byte unsigned integer, little-endian.") |
|
239 |
|
240 |
|
241 def read_int4(f): |
|
242 r""" |
|
243 >>> import StringIO |
|
244 >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00')) |
|
245 255 |
|
246 >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31) |
|
247 True |
|
248 """ |
|
249 |
|
250 data = f.read(4) |
|
251 if len(data) == 4: |
|
252 return _unpack("<i", data)[0] |
|
253 raise ValueError("not enough data in stream to read int4") |
|
254 |
|
255 int4 = ArgumentDescriptor( |
|
256 name='int4', |
|
257 n=4, |
|
258 reader=read_int4, |
|
259 doc="Four-byte signed integer, little-endian, 2's complement.") |
|
260 |
|
261 |
|
262 def read_stringnl(f, decode=True, stripquotes=True): |
|
263 r""" |
|
264 >>> import StringIO |
|
265 >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n")) |
|
266 'abcd' |
|
267 |
|
268 >>> read_stringnl(StringIO.StringIO("\n")) |
|
269 Traceback (most recent call last): |
|
270 ... |
|
271 ValueError: no string quotes around '' |
|
272 |
|
273 >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False) |
|
274 '' |
|
275 |
|
276 >>> read_stringnl(StringIO.StringIO("''\n")) |
|
277 '' |
|
278 |
|
279 >>> read_stringnl(StringIO.StringIO('"abcd"')) |
|
280 Traceback (most recent call last): |
|
281 ... |
|
282 ValueError: no newline found when trying to read stringnl |
|
283 |
|
284 Embedded escapes are undone in the result. |
|
285 >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'")) |
|
286 'a\n\\b\x00c\td' |
|
287 """ |
|
288 |
|
289 data = f.readline() |
|
290 if not data.endswith('\n'): |
|
291 raise ValueError("no newline found when trying to read stringnl") |
|
292 data = data[:-1] # lose the newline |
|
293 |
|
294 if stripquotes: |
|
295 for q in "'\"": |
|
296 if data.startswith(q): |
|
297 if not data.endswith(q): |
|
298 raise ValueError("strinq quote %r not found at both " |
|
299 "ends of %r" % (q, data)) |
|
300 data = data[1:-1] |
|
301 break |
|
302 else: |
|
303 raise ValueError("no string quotes around %r" % data) |
|
304 |
|
305 # I'm not sure when 'string_escape' was added to the std codecs; it's |
|
306 # crazy not to use it if it's there. |
|
307 if decode: |
|
308 data = data.decode('string_escape') |
|
309 return data |
|
310 |
|
311 stringnl = ArgumentDescriptor( |
|
312 name='stringnl', |
|
313 n=UP_TO_NEWLINE, |
|
314 reader=read_stringnl, |
|
315 doc="""A newline-terminated string. |
|
316 |
|
317 This is a repr-style string, with embedded escapes, and |
|
318 bracketing quotes. |
|
319 """) |
|
320 |
|
321 def read_stringnl_noescape(f): |
|
322 return read_stringnl(f, decode=False, stripquotes=False) |
|
323 |
|
324 stringnl_noescape = ArgumentDescriptor( |
|
325 name='stringnl_noescape', |
|
326 n=UP_TO_NEWLINE, |
|
327 reader=read_stringnl_noescape, |
|
328 doc="""A newline-terminated string. |
|
329 |
|
330 This is a str-style string, without embedded escapes, |
|
331 or bracketing quotes. It should consist solely of |
|
332 printable ASCII characters. |
|
333 """) |
|
334 |
|
335 def read_stringnl_noescape_pair(f): |
|
336 r""" |
|
337 >>> import StringIO |
|
338 >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk")) |
|
339 'Queue Empty' |
|
340 """ |
|
341 |
|
342 return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f)) |
|
343 |
|
344 stringnl_noescape_pair = ArgumentDescriptor( |
|
345 name='stringnl_noescape_pair', |
|
346 n=UP_TO_NEWLINE, |
|
347 reader=read_stringnl_noescape_pair, |
|
348 doc="""A pair of newline-terminated strings. |
|
349 |
|
350 These are str-style strings, without embedded |
|
351 escapes, or bracketing quotes. They should |
|
352 consist solely of printable ASCII characters. |
|
353 The pair is returned as a single string, with |
|
354 a single blank separating the two strings. |
|
355 """) |
|
356 |
|
357 def read_string4(f): |
|
358 r""" |
|
359 >>> import StringIO |
|
360 >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc")) |
|
361 '' |
|
362 >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef")) |
|
363 'abc' |
|
364 >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef")) |
|
365 Traceback (most recent call last): |
|
366 ... |
|
367 ValueError: expected 50331648 bytes in a string4, but only 6 remain |
|
368 """ |
|
369 |
|
370 n = read_int4(f) |
|
371 if n < 0: |
|
372 raise ValueError("string4 byte count < 0: %d" % n) |
|
373 data = f.read(n) |
|
374 if len(data) == n: |
|
375 return data |
|
376 raise ValueError("expected %d bytes in a string4, but only %d remain" % |
|
377 (n, len(data))) |
|
378 |
|
379 string4 = ArgumentDescriptor( |
|
380 name="string4", |
|
381 n=TAKEN_FROM_ARGUMENT4, |
|
382 reader=read_string4, |
|
383 doc="""A counted string. |
|
384 |
|
385 The first argument is a 4-byte little-endian signed int giving |
|
386 the number of bytes in the string, and the second argument is |
|
387 that many bytes. |
|
388 """) |
|
389 |
|
390 |
|
391 def read_string1(f): |
|
392 r""" |
|
393 >>> import StringIO |
|
394 >>> read_string1(StringIO.StringIO("\x00")) |
|
395 '' |
|
396 >>> read_string1(StringIO.StringIO("\x03abcdef")) |
|
397 'abc' |
|
398 """ |
|
399 |
|
400 n = read_uint1(f) |
|
401 assert n >= 0 |
|
402 data = f.read(n) |
|
403 if len(data) == n: |
|
404 return data |
|
405 raise ValueError("expected %d bytes in a string1, but only %d remain" % |
|
406 (n, len(data))) |
|
407 |
|
408 string1 = ArgumentDescriptor( |
|
409 name="string1", |
|
410 n=TAKEN_FROM_ARGUMENT1, |
|
411 reader=read_string1, |
|
412 doc="""A counted string. |
|
413 |
|
414 The first argument is a 1-byte unsigned int giving the number |
|
415 of bytes in the string, and the second argument is that many |
|
416 bytes. |
|
417 """) |
|
418 |
|
419 |
|
420 def read_unicodestringnl(f): |
|
421 r""" |
|
422 >>> import StringIO |
|
423 >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk")) |
|
424 u'abc\uabcd' |
|
425 """ |
|
426 |
|
427 data = f.readline() |
|
428 if not data.endswith('\n'): |
|
429 raise ValueError("no newline found when trying to read " |
|
430 "unicodestringnl") |
|
431 data = data[:-1] # lose the newline |
|
432 return unicode(data, 'raw-unicode-escape') |
|
433 |
|
434 unicodestringnl = ArgumentDescriptor( |
|
435 name='unicodestringnl', |
|
436 n=UP_TO_NEWLINE, |
|
437 reader=read_unicodestringnl, |
|
438 doc="""A newline-terminated Unicode string. |
|
439 |
|
440 This is raw-unicode-escape encoded, so consists of |
|
441 printable ASCII characters, and may contain embedded |
|
442 escape sequences. |
|
443 """) |
|
444 |
|
445 def read_unicodestring4(f): |
|
446 r""" |
|
447 >>> import StringIO |
|
448 >>> s = u'abcd\uabcd' |
|
449 >>> enc = s.encode('utf-8') |
|
450 >>> enc |
|
451 'abcd\xea\xaf\x8d' |
|
452 >>> n = chr(len(enc)) + chr(0) * 3 # little-endian 4-byte length |
|
453 >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk')) |
|
454 >>> s == t |
|
455 True |
|
456 |
|
457 >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1])) |
|
458 Traceback (most recent call last): |
|
459 ... |
|
460 ValueError: expected 7 bytes in a unicodestring4, but only 6 remain |
|
461 """ |
|
462 |
|
463 n = read_int4(f) |
|
464 if n < 0: |
|
465 raise ValueError("unicodestring4 byte count < 0: %d" % n) |
|
466 data = f.read(n) |
|
467 if len(data) == n: |
|
468 return unicode(data, 'utf-8') |
|
469 raise ValueError("expected %d bytes in a unicodestring4, but only %d " |
|
470 "remain" % (n, len(data))) |
|
471 |
|
472 unicodestring4 = ArgumentDescriptor( |
|
473 name="unicodestring4", |
|
474 n=TAKEN_FROM_ARGUMENT4, |
|
475 reader=read_unicodestring4, |
|
476 doc="""A counted Unicode string. |
|
477 |
|
478 The first argument is a 4-byte little-endian signed int |
|
479 giving the number of bytes in the string, and the second |
|
480 argument-- the UTF-8 encoding of the Unicode string -- |
|
481 contains that many bytes. |
|
482 """) |
|
483 |
|
484 |
|
485 def read_decimalnl_short(f): |
|
486 r""" |
|
487 >>> import StringIO |
|
488 >>> read_decimalnl_short(StringIO.StringIO("1234\n56")) |
|
489 1234 |
|
490 |
|
491 >>> read_decimalnl_short(StringIO.StringIO("1234L\n56")) |
|
492 Traceback (most recent call last): |
|
493 ... |
|
494 ValueError: trailing 'L' not allowed in '1234L' |
|
495 """ |
|
496 |
|
497 s = read_stringnl(f, decode=False, stripquotes=False) |
|
498 if s.endswith("L"): |
|
499 raise ValueError("trailing 'L' not allowed in %r" % s) |
|
500 |
|
501 # It's not necessarily true that the result fits in a Python short int: |
|
502 # the pickle may have been written on a 64-bit box. There's also a hack |
|
503 # for True and False here. |
|
504 if s == "00": |
|
505 return False |
|
506 elif s == "01": |
|
507 return True |
|
508 |
|
509 try: |
|
510 return int(s) |
|
511 except OverflowError: |
|
512 return long(s) |
|
513 |
|
514 def read_decimalnl_long(f): |
|
515 r""" |
|
516 >>> import StringIO |
|
517 |
|
518 >>> read_decimalnl_long(StringIO.StringIO("1234\n56")) |
|
519 Traceback (most recent call last): |
|
520 ... |
|
521 ValueError: trailing 'L' required in '1234' |
|
522 |
|
523 Someday the trailing 'L' will probably go away from this output. |
|
524 |
|
525 >>> read_decimalnl_long(StringIO.StringIO("1234L\n56")) |
|
526 1234L |
|
527 |
|
528 >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6")) |
|
529 123456789012345678901234L |
|
530 """ |
|
531 |
|
532 s = read_stringnl(f, decode=False, stripquotes=False) |
|
533 if not s.endswith("L"): |
|
534 raise ValueError("trailing 'L' required in %r" % s) |
|
535 return long(s) |
|
536 |
|
537 |
|
538 decimalnl_short = ArgumentDescriptor( |
|
539 name='decimalnl_short', |
|
540 n=UP_TO_NEWLINE, |
|
541 reader=read_decimalnl_short, |
|
542 doc="""A newline-terminated decimal integer literal. |
|
543 |
|
544 This never has a trailing 'L', and the integer fit |
|
545 in a short Python int on the box where the pickle |
|
546 was written -- but there's no guarantee it will fit |
|
547 in a short Python int on the box where the pickle |
|
548 is read. |
|
549 """) |
|
550 |
|
551 decimalnl_long = ArgumentDescriptor( |
|
552 name='decimalnl_long', |
|
553 n=UP_TO_NEWLINE, |
|
554 reader=read_decimalnl_long, |
|
555 doc="""A newline-terminated decimal integer literal. |
|
556 |
|
557 This has a trailing 'L', and can represent integers |
|
558 of any size. |
|
559 """) |
|
560 |
|
561 |
|
562 def read_floatnl(f): |
|
563 r""" |
|
564 >>> import StringIO |
|
565 >>> read_floatnl(StringIO.StringIO("-1.25\n6")) |
|
566 -1.25 |
|
567 """ |
|
568 s = read_stringnl(f, decode=False, stripquotes=False) |
|
569 return float(s) |
|
570 |
|
571 floatnl = ArgumentDescriptor( |
|
572 name='floatnl', |
|
573 n=UP_TO_NEWLINE, |
|
574 reader=read_floatnl, |
|
575 doc="""A newline-terminated decimal floating literal. |
|
576 |
|
577 In general this requires 17 significant digits for roundtrip |
|
578 identity, and pickling then unpickling infinities, NaNs, and |
|
579 minus zero doesn't work across boxes, or on some boxes even |
|
580 on itself (e.g., Windows can't read the strings it produces |
|
581 for infinities or NaNs). |
|
582 """) |
|
583 |
|
584 def read_float8(f): |
|
585 r""" |
|
586 >>> import StringIO, struct |
|
587 >>> raw = struct.pack(">d", -1.25) |
|
588 >>> raw |
|
589 '\xbf\xf4\x00\x00\x00\x00\x00\x00' |
|
590 >>> read_float8(StringIO.StringIO(raw + "\n")) |
|
591 -1.25 |
|
592 """ |
|
593 |
|
594 data = f.read(8) |
|
595 if len(data) == 8: |
|
596 return _unpack(">d", data)[0] |
|
597 raise ValueError("not enough data in stream to read float8") |
|
598 |
|
599 |
|
600 float8 = ArgumentDescriptor( |
|
601 name='float8', |
|
602 n=8, |
|
603 reader=read_float8, |
|
604 doc="""An 8-byte binary representation of a float, big-endian. |
|
605 |
|
606 The format is unique to Python, and shared with the struct |
|
607 module (format string '>d') "in theory" (the struct and cPickle |
|
608 implementations don't share the code -- they should). It's |
|
609 strongly related to the IEEE-754 double format, and, in normal |
|
610 cases, is in fact identical to the big-endian 754 double format. |
|
611 On other boxes the dynamic range is limited to that of a 754 |
|
612 double, and "add a half and chop" rounding is used to reduce |
|
613 the precision to 53 bits. However, even on a 754 box, |
|
614 infinities, NaNs, and minus zero may not be handled correctly |
|
615 (may not survive roundtrip pickling intact). |
|
616 """) |
|
617 |
|
618 # Protocol 2 formats |
|
619 |
|
620 from pickle import decode_long |
|
621 |
|
622 def read_long1(f): |
|
623 r""" |
|
624 >>> import StringIO |
|
625 >>> read_long1(StringIO.StringIO("\x00")) |
|
626 0L |
|
627 >>> read_long1(StringIO.StringIO("\x02\xff\x00")) |
|
628 255L |
|
629 >>> read_long1(StringIO.StringIO("\x02\xff\x7f")) |
|
630 32767L |
|
631 >>> read_long1(StringIO.StringIO("\x02\x00\xff")) |
|
632 -256L |
|
633 >>> read_long1(StringIO.StringIO("\x02\x00\x80")) |
|
634 -32768L |
|
635 """ |
|
636 |
|
637 n = read_uint1(f) |
|
638 data = f.read(n) |
|
639 if len(data) != n: |
|
640 raise ValueError("not enough data in stream to read long1") |
|
641 return decode_long(data) |
|
642 |
|
643 long1 = ArgumentDescriptor( |
|
644 name="long1", |
|
645 n=TAKEN_FROM_ARGUMENT1, |
|
646 reader=read_long1, |
|
647 doc="""A binary long, little-endian, using 1-byte size. |
|
648 |
|
649 This first reads one byte as an unsigned size, then reads that |
|
650 many bytes and interprets them as a little-endian 2's-complement long. |
|
651 If the size is 0, that's taken as a shortcut for the long 0L. |
|
652 """) |
|
653 |
|
654 def read_long4(f): |
|
655 r""" |
|
656 >>> import StringIO |
|
657 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00")) |
|
658 255L |
|
659 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f")) |
|
660 32767L |
|
661 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff")) |
|
662 -256L |
|
663 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80")) |
|
664 -32768L |
|
665 >>> read_long1(StringIO.StringIO("\x00\x00\x00\x00")) |
|
666 0L |
|
667 """ |
|
668 |
|
669 n = read_int4(f) |
|
670 if n < 0: |
|
671 raise ValueError("long4 byte count < 0: %d" % n) |
|
672 data = f.read(n) |
|
673 if len(data) != n: |
|
674 raise ValueError("not enough data in stream to read long4") |
|
675 return decode_long(data) |
|
676 |
|
677 long4 = ArgumentDescriptor( |
|
678 name="long4", |
|
679 n=TAKEN_FROM_ARGUMENT4, |
|
680 reader=read_long4, |
|
681 doc="""A binary representation of a long, little-endian. |
|
682 |
|
683 This first reads four bytes as a signed size (but requires the |
|
684 size to be >= 0), then reads that many bytes and interprets them |
|
685 as a little-endian 2's-complement long. If the size is 0, that's taken |
|
686 as a shortcut for the long 0L, although LONG1 should really be used |
|
687 then instead (and in any case where # of bytes < 256). |
|
688 """) |
|
689 |
|
690 |
|
691 ############################################################################## |
|
692 # Object descriptors. The stack used by the pickle machine holds objects, |
|
693 # and in the stack_before and stack_after attributes of OpcodeInfo |
|
694 # descriptors we need names to describe the various types of objects that can |
|
695 # appear on the stack. |
|
696 |
|
697 class StackObject(object): |
|
698 __slots__ = ( |
|
699 # name of descriptor record, for info only |
|
700 'name', |
|
701 |
|
702 # type of object, or tuple of type objects (meaning the object can |
|
703 # be of any type in the tuple) |
|
704 'obtype', |
|
705 |
|
706 # human-readable docs for this kind of stack object; a string |
|
707 'doc', |
|
708 ) |
|
709 |
|
710 def __init__(self, name, obtype, doc): |
|
711 assert isinstance(name, str) |
|
712 self.name = name |
|
713 |
|
714 assert isinstance(obtype, type) or isinstance(obtype, tuple) |
|
715 if isinstance(obtype, tuple): |
|
716 for contained in obtype: |
|
717 assert isinstance(contained, type) |
|
718 self.obtype = obtype |
|
719 |
|
720 assert isinstance(doc, str) |
|
721 self.doc = doc |
|
722 |
|
723 def __repr__(self): |
|
724 return self.name |
|
725 |
|
726 |
|
727 pyint = StackObject( |
|
728 name='int', |
|
729 obtype=int, |
|
730 doc="A short (as opposed to long) Python integer object.") |
|
731 |
|
732 pylong = StackObject( |
|
733 name='long', |
|
734 obtype=long, |
|
735 doc="A long (as opposed to short) Python integer object.") |
|
736 |
|
737 pyinteger_or_bool = StackObject( |
|
738 name='int_or_bool', |
|
739 obtype=(int, long, bool), |
|
740 doc="A Python integer object (short or long), or " |
|
741 "a Python bool.") |
|
742 |
|
743 pybool = StackObject( |
|
744 name='bool', |
|
745 obtype=(bool,), |
|
746 doc="A Python bool object.") |
|
747 |
|
748 pyfloat = StackObject( |
|
749 name='float', |
|
750 obtype=float, |
|
751 doc="A Python float object.") |
|
752 |
|
753 pystring = StackObject( |
|
754 name='str', |
|
755 obtype=str, |
|
756 doc="A Python string object.") |
|
757 |
|
758 pyunicode = StackObject( |
|
759 name='unicode', |
|
760 obtype=unicode, |
|
761 doc="A Python Unicode string object.") |
|
762 |
|
763 pynone = StackObject( |
|
764 name="None", |
|
765 obtype=type(None), |
|
766 doc="The Python None object.") |
|
767 |
|
768 pytuple = StackObject( |
|
769 name="tuple", |
|
770 obtype=tuple, |
|
771 doc="A Python tuple object.") |
|
772 |
|
773 pylist = StackObject( |
|
774 name="list", |
|
775 obtype=list, |
|
776 doc="A Python list object.") |
|
777 |
|
778 pydict = StackObject( |
|
779 name="dict", |
|
780 obtype=dict, |
|
781 doc="A Python dict object.") |
|
782 |
|
783 anyobject = StackObject( |
|
784 name='any', |
|
785 obtype=object, |
|
786 doc="Any kind of object whatsoever.") |
|
787 |
|
788 markobject = StackObject( |
|
789 name="mark", |
|
790 obtype=StackObject, |
|
791 doc="""'The mark' is a unique object. |
|
792 |
|
793 Opcodes that operate on a variable number of objects |
|
794 generally don't embed the count of objects in the opcode, |
|
795 or pull it off the stack. Instead the MARK opcode is used |
|
796 to push a special marker object on the stack, and then |
|
797 some other opcodes grab all the objects from the top of |
|
798 the stack down to (but not including) the topmost marker |
|
799 object. |
|
800 """) |
|
801 |
|
802 stackslice = StackObject( |
|
803 name="stackslice", |
|
804 obtype=StackObject, |
|
805 doc="""An object representing a contiguous slice of the stack. |
|
806 |
|
807 This is used in conjuction with markobject, to represent all |
|
808 of the stack following the topmost markobject. For example, |
|
809 the POP_MARK opcode changes the stack from |
|
810 |
|
811 [..., markobject, stackslice] |
|
812 to |
|
813 [...] |
|
814 |
|
815 No matter how many object are on the stack after the topmost |
|
816 markobject, POP_MARK gets rid of all of them (including the |
|
817 topmost markobject too). |
|
818 """) |
|
819 |
|
820 ############################################################################## |
|
821 # Descriptors for pickle opcodes. |
|
822 |
|
823 class OpcodeInfo(object): |
|
824 |
|
825 __slots__ = ( |
|
826 # symbolic name of opcode; a string |
|
827 'name', |
|
828 |
|
829 # the code used in a bytestream to represent the opcode; a |
|
830 # one-character string |
|
831 'code', |
|
832 |
|
833 # If the opcode has an argument embedded in the byte string, an |
|
834 # instance of ArgumentDescriptor specifying its type. Note that |
|
835 # arg.reader(s) can be used to read and decode the argument from |
|
836 # the bytestream s, and arg.doc documents the format of the raw |
|
837 # argument bytes. If the opcode doesn't have an argument embedded |
|
838 # in the bytestream, arg should be None. |
|
839 'arg', |
|
840 |
|
841 # what the stack looks like before this opcode runs; a list |
|
842 'stack_before', |
|
843 |
|
844 # what the stack looks like after this opcode runs; a list |
|
845 'stack_after', |
|
846 |
|
847 # the protocol number in which this opcode was introduced; an int |
|
848 'proto', |
|
849 |
|
850 # human-readable docs for this opcode; a string |
|
851 'doc', |
|
852 ) |
|
853 |
|
854 def __init__(self, name, code, arg, |
|
855 stack_before, stack_after, proto, doc): |
|
856 assert isinstance(name, str) |
|
857 self.name = name |
|
858 |
|
859 assert isinstance(code, str) |
|
860 assert len(code) == 1 |
|
861 self.code = code |
|
862 |
|
863 assert arg is None or isinstance(arg, ArgumentDescriptor) |
|
864 self.arg = arg |
|
865 |
|
866 assert isinstance(stack_before, list) |
|
867 for x in stack_before: |
|
868 assert isinstance(x, StackObject) |
|
869 self.stack_before = stack_before |
|
870 |
|
871 assert isinstance(stack_after, list) |
|
872 for x in stack_after: |
|
873 assert isinstance(x, StackObject) |
|
874 self.stack_after = stack_after |
|
875 |
|
876 assert isinstance(proto, int) and 0 <= proto <= 2 |
|
877 self.proto = proto |
|
878 |
|
879 assert isinstance(doc, str) |
|
880 self.doc = doc |
|
881 |
|
882 I = OpcodeInfo |
|
883 opcodes = [ |
|
884 |
|
885 # Ways to spell integers. |
|
886 |
|
887 I(name='INT', |
|
888 code='I', |
|
889 arg=decimalnl_short, |
|
890 stack_before=[], |
|
891 stack_after=[pyinteger_or_bool], |
|
892 proto=0, |
|
893 doc="""Push an integer or bool. |
|
894 |
|
895 The argument is a newline-terminated decimal literal string. |
|
896 |
|
897 The intent may have been that this always fit in a short Python int, |
|
898 but INT can be generated in pickles written on a 64-bit box that |
|
899 require a Python long on a 32-bit box. The difference between this |
|
900 and LONG then is that INT skips a trailing 'L', and produces a short |
|
901 int whenever possible. |
|
902 |
|
903 Another difference is due to that, when bool was introduced as a |
|
904 distinct type in 2.3, builtin names True and False were also added to |
|
905 2.2.2, mapping to ints 1 and 0. For compatibility in both directions, |
|
906 True gets pickled as INT + "I01\\n", and False as INT + "I00\\n". |
|
907 Leading zeroes are never produced for a genuine integer. The 2.3 |
|
908 (and later) unpicklers special-case these and return bool instead; |
|
909 earlier unpicklers ignore the leading "0" and return the int. |
|
910 """), |
|
911 |
|
912 I(name='BININT', |
|
913 code='J', |
|
914 arg=int4, |
|
915 stack_before=[], |
|
916 stack_after=[pyint], |
|
917 proto=1, |
|
918 doc="""Push a four-byte signed integer. |
|
919 |
|
920 This handles the full range of Python (short) integers on a 32-bit |
|
921 box, directly as binary bytes (1 for the opcode and 4 for the integer). |
|
922 If the integer is non-negative and fits in 1 or 2 bytes, pickling via |
|
923 BININT1 or BININT2 saves space. |
|
924 """), |
|
925 |
|
926 I(name='BININT1', |
|
927 code='K', |
|
928 arg=uint1, |
|
929 stack_before=[], |
|
930 stack_after=[pyint], |
|
931 proto=1, |
|
932 doc="""Push a one-byte unsigned integer. |
|
933 |
|
934 This is a space optimization for pickling very small non-negative ints, |
|
935 in range(256). |
|
936 """), |
|
937 |
|
938 I(name='BININT2', |
|
939 code='M', |
|
940 arg=uint2, |
|
941 stack_before=[], |
|
942 stack_after=[pyint], |
|
943 proto=1, |
|
944 doc="""Push a two-byte unsigned integer. |
|
945 |
|
946 This is a space optimization for pickling small positive ints, in |
|
947 range(256, 2**16). Integers in range(256) can also be pickled via |
|
948 BININT2, but BININT1 instead saves a byte. |
|
949 """), |
|
950 |
|
951 I(name='LONG', |
|
952 code='L', |
|
953 arg=decimalnl_long, |
|
954 stack_before=[], |
|
955 stack_after=[pylong], |
|
956 proto=0, |
|
957 doc="""Push a long integer. |
|
958 |
|
959 The same as INT, except that the literal ends with 'L', and always |
|
960 unpickles to a Python long. There doesn't seem a real purpose to the |
|
961 trailing 'L'. |
|
962 |
|
963 Note that LONG takes time quadratic in the number of digits when |
|
964 unpickling (this is simply due to the nature of decimal->binary |
|
965 conversion). Proto 2 added linear-time (in C; still quadratic-time |
|
966 in Python) LONG1 and LONG4 opcodes. |
|
967 """), |
|
968 |
|
969 I(name="LONG1", |
|
970 code='\x8a', |
|
971 arg=long1, |
|
972 stack_before=[], |
|
973 stack_after=[pylong], |
|
974 proto=2, |
|
975 doc="""Long integer using one-byte length. |
|
976 |
|
977 A more efficient encoding of a Python long; the long1 encoding |
|
978 says it all."""), |
|
979 |
|
980 I(name="LONG4", |
|
981 code='\x8b', |
|
982 arg=long4, |
|
983 stack_before=[], |
|
984 stack_after=[pylong], |
|
985 proto=2, |
|
986 doc="""Long integer using found-byte length. |
|
987 |
|
988 A more efficient encoding of a Python long; the long4 encoding |
|
989 says it all."""), |
|
990 |
|
991 # Ways to spell strings (8-bit, not Unicode). |
|
992 |
|
993 I(name='STRING', |
|
994 code='S', |
|
995 arg=stringnl, |
|
996 stack_before=[], |
|
997 stack_after=[pystring], |
|
998 proto=0, |
|
999 doc="""Push a Python string object. |
|
1000 |
|
1001 The argument is a repr-style string, with bracketing quote characters, |
|
1002 and perhaps embedded escapes. The argument extends until the next |
|
1003 newline character. |
|
1004 """), |
|
1005 |
|
1006 I(name='BINSTRING', |
|
1007 code='T', |
|
1008 arg=string4, |
|
1009 stack_before=[], |
|
1010 stack_after=[pystring], |
|
1011 proto=1, |
|
1012 doc="""Push a Python string object. |
|
1013 |
|
1014 There are two arguments: the first is a 4-byte little-endian signed int |
|
1015 giving the number of bytes in the string, and the second is that many |
|
1016 bytes, which are taken literally as the string content. |
|
1017 """), |
|
1018 |
|
1019 I(name='SHORT_BINSTRING', |
|
1020 code='U', |
|
1021 arg=string1, |
|
1022 stack_before=[], |
|
1023 stack_after=[pystring], |
|
1024 proto=1, |
|
1025 doc="""Push a Python string object. |
|
1026 |
|
1027 There are two arguments: the first is a 1-byte unsigned int giving |
|
1028 the number of bytes in the string, and the second is that many bytes, |
|
1029 which are taken literally as the string content. |
|
1030 """), |
|
1031 |
|
1032 # Ways to spell None. |
|
1033 |
|
1034 I(name='NONE', |
|
1035 code='N', |
|
1036 arg=None, |
|
1037 stack_before=[], |
|
1038 stack_after=[pynone], |
|
1039 proto=0, |
|
1040 doc="Push None on the stack."), |
|
1041 |
|
1042 # Ways to spell bools, starting with proto 2. See INT for how this was |
|
1043 # done before proto 2. |
|
1044 |
|
1045 I(name='NEWTRUE', |
|
1046 code='\x88', |
|
1047 arg=None, |
|
1048 stack_before=[], |
|
1049 stack_after=[pybool], |
|
1050 proto=2, |
|
1051 doc="""True. |
|
1052 |
|
1053 Push True onto the stack."""), |
|
1054 |
|
1055 I(name='NEWFALSE', |
|
1056 code='\x89', |
|
1057 arg=None, |
|
1058 stack_before=[], |
|
1059 stack_after=[pybool], |
|
1060 proto=2, |
|
1061 doc="""True. |
|
1062 |
|
1063 Push False onto the stack."""), |
|
1064 |
|
1065 # Ways to spell Unicode strings. |
|
1066 |
|
1067 I(name='UNICODE', |
|
1068 code='V', |
|
1069 arg=unicodestringnl, |
|
1070 stack_before=[], |
|
1071 stack_after=[pyunicode], |
|
1072 proto=0, # this may be pure-text, but it's a later addition |
|
1073 doc="""Push a Python Unicode string object. |
|
1074 |
|
1075 The argument is a raw-unicode-escape encoding of a Unicode string, |
|
1076 and so may contain embedded escape sequences. The argument extends |
|
1077 until the next newline character. |
|
1078 """), |
|
1079 |
|
1080 I(name='BINUNICODE', |
|
1081 code='X', |
|
1082 arg=unicodestring4, |
|
1083 stack_before=[], |
|
1084 stack_after=[pyunicode], |
|
1085 proto=1, |
|
1086 doc="""Push a Python Unicode string object. |
|
1087 |
|
1088 There are two arguments: the first is a 4-byte little-endian signed int |
|
1089 giving the number of bytes in the string. The second is that many |
|
1090 bytes, and is the UTF-8 encoding of the Unicode string. |
|
1091 """), |
|
1092 |
|
1093 # Ways to spell floats. |
|
1094 |
|
1095 I(name='FLOAT', |
|
1096 code='F', |
|
1097 arg=floatnl, |
|
1098 stack_before=[], |
|
1099 stack_after=[pyfloat], |
|
1100 proto=0, |
|
1101 doc="""Newline-terminated decimal float literal. |
|
1102 |
|
1103 The argument is repr(a_float), and in general requires 17 significant |
|
1104 digits for roundtrip conversion to be an identity (this is so for |
|
1105 IEEE-754 double precision values, which is what Python float maps to |
|
1106 on most boxes). |
|
1107 |
|
1108 In general, FLOAT cannot be used to transport infinities, NaNs, or |
|
1109 minus zero across boxes (or even on a single box, if the platform C |
|
1110 library can't read the strings it produces for such things -- Windows |
|
1111 is like that), but may do less damage than BINFLOAT on boxes with |
|
1112 greater precision or dynamic range than IEEE-754 double. |
|
1113 """), |
|
1114 |
|
1115 I(name='BINFLOAT', |
|
1116 code='G', |
|
1117 arg=float8, |
|
1118 stack_before=[], |
|
1119 stack_after=[pyfloat], |
|
1120 proto=1, |
|
1121 doc="""Float stored in binary form, with 8 bytes of data. |
|
1122 |
|
1123 This generally requires less than half the space of FLOAT encoding. |
|
1124 In general, BINFLOAT cannot be used to transport infinities, NaNs, or |
|
1125 minus zero, raises an exception if the exponent exceeds the range of |
|
1126 an IEEE-754 double, and retains no more than 53 bits of precision (if |
|
1127 there are more than that, "add a half and chop" rounding is used to |
|
1128 cut it back to 53 significant bits). |
|
1129 """), |
|
1130 |
|
1131 # Ways to build lists. |
|
1132 |
|
1133 I(name='EMPTY_LIST', |
|
1134 code=']', |
|
1135 arg=None, |
|
1136 stack_before=[], |
|
1137 stack_after=[pylist], |
|
1138 proto=1, |
|
1139 doc="Push an empty list."), |
|
1140 |
|
1141 I(name='APPEND', |
|
1142 code='a', |
|
1143 arg=None, |
|
1144 stack_before=[pylist, anyobject], |
|
1145 stack_after=[pylist], |
|
1146 proto=0, |
|
1147 doc="""Append an object to a list. |
|
1148 |
|
1149 Stack before: ... pylist anyobject |
|
1150 Stack after: ... pylist+[anyobject] |
|
1151 |
|
1152 although pylist is really extended in-place. |
|
1153 """), |
|
1154 |
|
1155 I(name='APPENDS', |
|
1156 code='e', |
|
1157 arg=None, |
|
1158 stack_before=[pylist, markobject, stackslice], |
|
1159 stack_after=[pylist], |
|
1160 proto=1, |
|
1161 doc="""Extend a list by a slice of stack objects. |
|
1162 |
|
1163 Stack before: ... pylist markobject stackslice |
|
1164 Stack after: ... pylist+stackslice |
|
1165 |
|
1166 although pylist is really extended in-place. |
|
1167 """), |
|
1168 |
|
1169 I(name='LIST', |
|
1170 code='l', |
|
1171 arg=None, |
|
1172 stack_before=[markobject, stackslice], |
|
1173 stack_after=[pylist], |
|
1174 proto=0, |
|
1175 doc="""Build a list out of the topmost stack slice, after markobject. |
|
1176 |
|
1177 All the stack entries following the topmost markobject are placed into |
|
1178 a single Python list, which single list object replaces all of the |
|
1179 stack from the topmost markobject onward. For example, |
|
1180 |
|
1181 Stack before: ... markobject 1 2 3 'abc' |
|
1182 Stack after: ... [1, 2, 3, 'abc'] |
|
1183 """), |
|
1184 |
|
1185 # Ways to build tuples. |
|
1186 |
|
1187 I(name='EMPTY_TUPLE', |
|
1188 code=')', |
|
1189 arg=None, |
|
1190 stack_before=[], |
|
1191 stack_after=[pytuple], |
|
1192 proto=1, |
|
1193 doc="Push an empty tuple."), |
|
1194 |
|
1195 I(name='TUPLE', |
|
1196 code='t', |
|
1197 arg=None, |
|
1198 stack_before=[markobject, stackslice], |
|
1199 stack_after=[pytuple], |
|
1200 proto=0, |
|
1201 doc="""Build a tuple out of the topmost stack slice, after markobject. |
|
1202 |
|
1203 All the stack entries following the topmost markobject are placed into |
|
1204 a single Python tuple, which single tuple object replaces all of the |
|
1205 stack from the topmost markobject onward. For example, |
|
1206 |
|
1207 Stack before: ... markobject 1 2 3 'abc' |
|
1208 Stack after: ... (1, 2, 3, 'abc') |
|
1209 """), |
|
1210 |
|
1211 I(name='TUPLE1', |
|
1212 code='\x85', |
|
1213 arg=None, |
|
1214 stack_before=[anyobject], |
|
1215 stack_after=[pytuple], |
|
1216 proto=2, |
|
1217 doc="""One-tuple. |
|
1218 |
|
1219 This code pops one value off the stack and pushes a tuple of |
|
1220 length 1 whose one item is that value back onto it. IOW: |
|
1221 |
|
1222 stack[-1] = tuple(stack[-1:]) |
|
1223 """), |
|
1224 |
|
1225 I(name='TUPLE2', |
|
1226 code='\x86', |
|
1227 arg=None, |
|
1228 stack_before=[anyobject, anyobject], |
|
1229 stack_after=[pytuple], |
|
1230 proto=2, |
|
1231 doc="""One-tuple. |
|
1232 |
|
1233 This code pops two values off the stack and pushes a tuple |
|
1234 of length 2 whose items are those values back onto it. IOW: |
|
1235 |
|
1236 stack[-2:] = [tuple(stack[-2:])] |
|
1237 """), |
|
1238 |
|
1239 I(name='TUPLE3', |
|
1240 code='\x87', |
|
1241 arg=None, |
|
1242 stack_before=[anyobject, anyobject, anyobject], |
|
1243 stack_after=[pytuple], |
|
1244 proto=2, |
|
1245 doc="""One-tuple. |
|
1246 |
|
1247 This code pops three values off the stack and pushes a tuple |
|
1248 of length 3 whose items are those values back onto it. IOW: |
|
1249 |
|
1250 stack[-3:] = [tuple(stack[-3:])] |
|
1251 """), |
|
1252 |
|
1253 # Ways to build dicts. |
|
1254 |
|
1255 I(name='EMPTY_DICT', |
|
1256 code='}', |
|
1257 arg=None, |
|
1258 stack_before=[], |
|
1259 stack_after=[pydict], |
|
1260 proto=1, |
|
1261 doc="Push an empty dict."), |
|
1262 |
|
1263 I(name='DICT', |
|
1264 code='d', |
|
1265 arg=None, |
|
1266 stack_before=[markobject, stackslice], |
|
1267 stack_after=[pydict], |
|
1268 proto=0, |
|
1269 doc="""Build a dict out of the topmost stack slice, after markobject. |
|
1270 |
|
1271 All the stack entries following the topmost markobject are placed into |
|
1272 a single Python dict, which single dict object replaces all of the |
|
1273 stack from the topmost markobject onward. The stack slice alternates |
|
1274 key, value, key, value, .... For example, |
|
1275 |
|
1276 Stack before: ... markobject 1 2 3 'abc' |
|
1277 Stack after: ... {1: 2, 3: 'abc'} |
|
1278 """), |
|
1279 |
|
1280 I(name='SETITEM', |
|
1281 code='s', |
|
1282 arg=None, |
|
1283 stack_before=[pydict, anyobject, anyobject], |
|
1284 stack_after=[pydict], |
|
1285 proto=0, |
|
1286 doc="""Add a key+value pair to an existing dict. |
|
1287 |
|
1288 Stack before: ... pydict key value |
|
1289 Stack after: ... pydict |
|
1290 |
|
1291 where pydict has been modified via pydict[key] = value. |
|
1292 """), |
|
1293 |
|
1294 I(name='SETITEMS', |
|
1295 code='u', |
|
1296 arg=None, |
|
1297 stack_before=[pydict, markobject, stackslice], |
|
1298 stack_after=[pydict], |
|
1299 proto=1, |
|
1300 doc="""Add an arbitrary number of key+value pairs to an existing dict. |
|
1301 |
|
1302 The slice of the stack following the topmost markobject is taken as |
|
1303 an alternating sequence of keys and values, added to the dict |
|
1304 immediately under the topmost markobject. Everything at and after the |
|
1305 topmost markobject is popped, leaving the mutated dict at the top |
|
1306 of the stack. |
|
1307 |
|
1308 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n |
|
1309 Stack after: ... pydict |
|
1310 |
|
1311 where pydict has been modified via pydict[key_i] = value_i for i in |
|
1312 1, 2, ..., n, and in that order. |
|
1313 """), |
|
1314 |
|
1315 # Stack manipulation. |
|
1316 |
|
1317 I(name='POP', |
|
1318 code='0', |
|
1319 arg=None, |
|
1320 stack_before=[anyobject], |
|
1321 stack_after=[], |
|
1322 proto=0, |
|
1323 doc="Discard the top stack item, shrinking the stack by one item."), |
|
1324 |
|
1325 I(name='DUP', |
|
1326 code='2', |
|
1327 arg=None, |
|
1328 stack_before=[anyobject], |
|
1329 stack_after=[anyobject, anyobject], |
|
1330 proto=0, |
|
1331 doc="Push the top stack item onto the stack again, duplicating it."), |
|
1332 |
|
1333 I(name='MARK', |
|
1334 code='(', |
|
1335 arg=None, |
|
1336 stack_before=[], |
|
1337 stack_after=[markobject], |
|
1338 proto=0, |
|
1339 doc="""Push markobject onto the stack. |
|
1340 |
|
1341 markobject is a unique object, used by other opcodes to identify a |
|
1342 region of the stack containing a variable number of objects for them |
|
1343 to work on. See markobject.doc for more detail. |
|
1344 """), |
|
1345 |
|
1346 I(name='POP_MARK', |
|
1347 code='1', |
|
1348 arg=None, |
|
1349 stack_before=[markobject, stackslice], |
|
1350 stack_after=[], |
|
1351 proto=0, |
|
1352 doc="""Pop all the stack objects at and above the topmost markobject. |
|
1353 |
|
1354 When an opcode using a variable number of stack objects is done, |
|
1355 POP_MARK is used to remove those objects, and to remove the markobject |
|
1356 that delimited their starting position on the stack. |
|
1357 """), |
|
1358 |
|
1359 # Memo manipulation. There are really only two operations (get and put), |
|
1360 # each in all-text, "short binary", and "long binary" flavors. |
|
1361 |
|
1362 I(name='GET', |
|
1363 code='g', |
|
1364 arg=decimalnl_short, |
|
1365 stack_before=[], |
|
1366 stack_after=[anyobject], |
|
1367 proto=0, |
|
1368 doc="""Read an object from the memo and push it on the stack. |
|
1369 |
|
1370 The index of the memo object to push is given by the newline-teriminated |
|
1371 decimal string following. BINGET and LONG_BINGET are space-optimized |
|
1372 versions. |
|
1373 """), |
|
1374 |
|
1375 I(name='BINGET', |
|
1376 code='h', |
|
1377 arg=uint1, |
|
1378 stack_before=[], |
|
1379 stack_after=[anyobject], |
|
1380 proto=1, |
|
1381 doc="""Read an object from the memo and push it on the stack. |
|
1382 |
|
1383 The index of the memo object to push is given by the 1-byte unsigned |
|
1384 integer following. |
|
1385 """), |
|
1386 |
|
1387 I(name='LONG_BINGET', |
|
1388 code='j', |
|
1389 arg=int4, |
|
1390 stack_before=[], |
|
1391 stack_after=[anyobject], |
|
1392 proto=1, |
|
1393 doc="""Read an object from the memo and push it on the stack. |
|
1394 |
|
1395 The index of the memo object to push is given by the 4-byte signed |
|
1396 little-endian integer following. |
|
1397 """), |
|
1398 |
|
1399 I(name='PUT', |
|
1400 code='p', |
|
1401 arg=decimalnl_short, |
|
1402 stack_before=[], |
|
1403 stack_after=[], |
|
1404 proto=0, |
|
1405 doc="""Store the stack top into the memo. The stack is not popped. |
|
1406 |
|
1407 The index of the memo location to write into is given by the newline- |
|
1408 terminated decimal string following. BINPUT and LONG_BINPUT are |
|
1409 space-optimized versions. |
|
1410 """), |
|
1411 |
|
1412 I(name='BINPUT', |
|
1413 code='q', |
|
1414 arg=uint1, |
|
1415 stack_before=[], |
|
1416 stack_after=[], |
|
1417 proto=1, |
|
1418 doc="""Store the stack top into the memo. The stack is not popped. |
|
1419 |
|
1420 The index of the memo location to write into is given by the 1-byte |
|
1421 unsigned integer following. |
|
1422 """), |
|
1423 |
|
1424 I(name='LONG_BINPUT', |
|
1425 code='r', |
|
1426 arg=int4, |
|
1427 stack_before=[], |
|
1428 stack_after=[], |
|
1429 proto=1, |
|
1430 doc="""Store the stack top into the memo. The stack is not popped. |
|
1431 |
|
1432 The index of the memo location to write into is given by the 4-byte |
|
1433 signed little-endian integer following. |
|
1434 """), |
|
1435 |
|
1436 # Access the extension registry (predefined objects). Akin to the GET |
|
1437 # family. |
|
1438 |
|
1439 I(name='EXT1', |
|
1440 code='\x82', |
|
1441 arg=uint1, |
|
1442 stack_before=[], |
|
1443 stack_after=[anyobject], |
|
1444 proto=2, |
|
1445 doc="""Extension code. |
|
1446 |
|
1447 This code and the similar EXT2 and EXT4 allow using a registry |
|
1448 of popular objects that are pickled by name, typically classes. |
|
1449 It is envisioned that through a global negotiation and |
|
1450 registration process, third parties can set up a mapping between |
|
1451 ints and object names. |
|
1452 |
|
1453 In order to guarantee pickle interchangeability, the extension |
|
1454 code registry ought to be global, although a range of codes may |
|
1455 be reserved for private use. |
|
1456 |
|
1457 EXT1 has a 1-byte integer argument. This is used to index into the |
|
1458 extension registry, and the object at that index is pushed on the stack. |
|
1459 """), |
|
1460 |
|
1461 I(name='EXT2', |
|
1462 code='\x83', |
|
1463 arg=uint2, |
|
1464 stack_before=[], |
|
1465 stack_after=[anyobject], |
|
1466 proto=2, |
|
1467 doc="""Extension code. |
|
1468 |
|
1469 See EXT1. EXT2 has a two-byte integer argument. |
|
1470 """), |
|
1471 |
|
1472 I(name='EXT4', |
|
1473 code='\x84', |
|
1474 arg=int4, |
|
1475 stack_before=[], |
|
1476 stack_after=[anyobject], |
|
1477 proto=2, |
|
1478 doc="""Extension code. |
|
1479 |
|
1480 See EXT1. EXT4 has a four-byte integer argument. |
|
1481 """), |
|
1482 |
|
1483 # Push a class object, or module function, on the stack, via its module |
|
1484 # and name. |
|
1485 |
|
1486 I(name='GLOBAL', |
|
1487 code='c', |
|
1488 arg=stringnl_noescape_pair, |
|
1489 stack_before=[], |
|
1490 stack_after=[anyobject], |
|
1491 proto=0, |
|
1492 doc="""Push a global object (module.attr) on the stack. |
|
1493 |
|
1494 Two newline-terminated strings follow the GLOBAL opcode. The first is |
|
1495 taken as a module name, and the second as a class name. The class |
|
1496 object module.class is pushed on the stack. More accurately, the |
|
1497 object returned by self.find_class(module, class) is pushed on the |
|
1498 stack, so unpickling subclasses can override this form of lookup. |
|
1499 """), |
|
1500 |
|
1501 # Ways to build objects of classes pickle doesn't know about directly |
|
1502 # (user-defined classes). I despair of documenting this accurately |
|
1503 # and comprehensibly -- you really have to read the pickle code to |
|
1504 # find all the special cases. |
|
1505 |
|
1506 I(name='REDUCE', |
|
1507 code='R', |
|
1508 arg=None, |
|
1509 stack_before=[anyobject, anyobject], |
|
1510 stack_after=[anyobject], |
|
1511 proto=0, |
|
1512 doc="""Push an object built from a callable and an argument tuple. |
|
1513 |
|
1514 The opcode is named to remind of the __reduce__() method. |
|
1515 |
|
1516 Stack before: ... callable pytuple |
|
1517 Stack after: ... callable(*pytuple) |
|
1518 |
|
1519 The callable and the argument tuple are the first two items returned |
|
1520 by a __reduce__ method. Applying the callable to the argtuple is |
|
1521 supposed to reproduce the original object, or at least get it started. |
|
1522 If the __reduce__ method returns a 3-tuple, the last component is an |
|
1523 argument to be passed to the object's __setstate__, and then the REDUCE |
|
1524 opcode is followed by code to create setstate's argument, and then a |
|
1525 BUILD opcode to apply __setstate__ to that argument. |
|
1526 |
|
1527 If type(callable) is not ClassType, REDUCE complains unless the |
|
1528 callable has been registered with the copy_reg module's |
|
1529 safe_constructors dict, or the callable has a magic |
|
1530 '__safe_for_unpickling__' attribute with a true value. I'm not sure |
|
1531 why it does this, but I've sure seen this complaint often enough when |
|
1532 I didn't want to <wink>. |
|
1533 """), |
|
1534 |
|
1535 I(name='BUILD', |
|
1536 code='b', |
|
1537 arg=None, |
|
1538 stack_before=[anyobject, anyobject], |
|
1539 stack_after=[anyobject], |
|
1540 proto=0, |
|
1541 doc="""Finish building an object, via __setstate__ or dict update. |
|
1542 |
|
1543 Stack before: ... anyobject argument |
|
1544 Stack after: ... anyobject |
|
1545 |
|
1546 where anyobject may have been mutated, as follows: |
|
1547 |
|
1548 If the object has a __setstate__ method, |
|
1549 |
|
1550 anyobject.__setstate__(argument) |
|
1551 |
|
1552 is called. |
|
1553 |
|
1554 Else the argument must be a dict, the object must have a __dict__, and |
|
1555 the object is updated via |
|
1556 |
|
1557 anyobject.__dict__.update(argument) |
|
1558 |
|
1559 This may raise RuntimeError in restricted execution mode (which |
|
1560 disallows access to __dict__ directly); in that case, the object |
|
1561 is updated instead via |
|
1562 |
|
1563 for k, v in argument.items(): |
|
1564 anyobject[k] = v |
|
1565 """), |
|
1566 |
|
1567 I(name='INST', |
|
1568 code='i', |
|
1569 arg=stringnl_noescape_pair, |
|
1570 stack_before=[markobject, stackslice], |
|
1571 stack_after=[anyobject], |
|
1572 proto=0, |
|
1573 doc="""Build a class instance. |
|
1574 |
|
1575 This is the protocol 0 version of protocol 1's OBJ opcode. |
|
1576 INST is followed by two newline-terminated strings, giving a |
|
1577 module and class name, just as for the GLOBAL opcode (and see |
|
1578 GLOBAL for more details about that). self.find_class(module, name) |
|
1579 is used to get a class object. |
|
1580 |
|
1581 In addition, all the objects on the stack following the topmost |
|
1582 markobject are gathered into a tuple and popped (along with the |
|
1583 topmost markobject), just as for the TUPLE opcode. |
|
1584 |
|
1585 Now it gets complicated. If all of these are true: |
|
1586 |
|
1587 + The argtuple is empty (markobject was at the top of the stack |
|
1588 at the start). |
|
1589 |
|
1590 + It's an old-style class object (the type of the class object is |
|
1591 ClassType). |
|
1592 |
|
1593 + The class object does not have a __getinitargs__ attribute. |
|
1594 |
|
1595 then we want to create an old-style class instance without invoking |
|
1596 its __init__() method (pickle has waffled on this over the years; not |
|
1597 calling __init__() is current wisdom). In this case, an instance of |
|
1598 an old-style dummy class is created, and then we try to rebind its |
|
1599 __class__ attribute to the desired class object. If this succeeds, |
|
1600 the new instance object is pushed on the stack, and we're done. In |
|
1601 restricted execution mode it can fail (assignment to __class__ is |
|
1602 disallowed), and I'm not really sure what happens then -- it looks |
|
1603 like the code ends up calling the class object's __init__ anyway, |
|
1604 via falling into the next case. |
|
1605 |
|
1606 Else (the argtuple is not empty, it's not an old-style class object, |
|
1607 or the class object does have a __getinitargs__ attribute), the code |
|
1608 first insists that the class object have a __safe_for_unpickling__ |
|
1609 attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE, |
|
1610 it doesn't matter whether this attribute has a true or false value, it |
|
1611 only matters whether it exists (XXX this is a bug; cPickle |
|
1612 requires the attribute to be true). If __safe_for_unpickling__ |
|
1613 doesn't exist, UnpicklingError is raised. |
|
1614 |
|
1615 Else (the class object does have a __safe_for_unpickling__ attr), |
|
1616 the class object obtained from INST's arguments is applied to the |
|
1617 argtuple obtained from the stack, and the resulting instance object |
|
1618 is pushed on the stack. |
|
1619 |
|
1620 NOTE: checks for __safe_for_unpickling__ went away in Python 2.3. |
|
1621 """), |
|
1622 |
|
1623 I(name='OBJ', |
|
1624 code='o', |
|
1625 arg=None, |
|
1626 stack_before=[markobject, anyobject, stackslice], |
|
1627 stack_after=[anyobject], |
|
1628 proto=1, |
|
1629 doc="""Build a class instance. |
|
1630 |
|
1631 This is the protocol 1 version of protocol 0's INST opcode, and is |
|
1632 very much like it. The major difference is that the class object |
|
1633 is taken off the stack, allowing it to be retrieved from the memo |
|
1634 repeatedly if several instances of the same class are created. This |
|
1635 can be much more efficient (in both time and space) than repeatedly |
|
1636 embedding the module and class names in INST opcodes. |
|
1637 |
|
1638 Unlike INST, OBJ takes no arguments from the opcode stream. Instead |
|
1639 the class object is taken off the stack, immediately above the |
|
1640 topmost markobject: |
|
1641 |
|
1642 Stack before: ... markobject classobject stackslice |
|
1643 Stack after: ... new_instance_object |
|
1644 |
|
1645 As for INST, the remainder of the stack above the markobject is |
|
1646 gathered into an argument tuple, and then the logic seems identical, |
|
1647 except that no __safe_for_unpickling__ check is done (XXX this is |
|
1648 a bug; cPickle does test __safe_for_unpickling__). See INST for |
|
1649 the gory details. |
|
1650 |
|
1651 NOTE: In Python 2.3, INST and OBJ are identical except for how they |
|
1652 get the class object. That was always the intent; the implementations |
|
1653 had diverged for accidental reasons. |
|
1654 """), |
|
1655 |
|
1656 I(name='NEWOBJ', |
|
1657 code='\x81', |
|
1658 arg=None, |
|
1659 stack_before=[anyobject, anyobject], |
|
1660 stack_after=[anyobject], |
|
1661 proto=2, |
|
1662 doc="""Build an object instance. |
|
1663 |
|
1664 The stack before should be thought of as containing a class |
|
1665 object followed by an argument tuple (the tuple being the stack |
|
1666 top). Call these cls and args. They are popped off the stack, |
|
1667 and the value returned by cls.__new__(cls, *args) is pushed back |
|
1668 onto the stack. |
|
1669 """), |
|
1670 |
|
1671 # Machine control. |
|
1672 |
|
1673 I(name='PROTO', |
|
1674 code='\x80', |
|
1675 arg=uint1, |
|
1676 stack_before=[], |
|
1677 stack_after=[], |
|
1678 proto=2, |
|
1679 doc="""Protocol version indicator. |
|
1680 |
|
1681 For protocol 2 and above, a pickle must start with this opcode. |
|
1682 The argument is the protocol version, an int in range(2, 256). |
|
1683 """), |
|
1684 |
|
1685 I(name='STOP', |
|
1686 code='.', |
|
1687 arg=None, |
|
1688 stack_before=[anyobject], |
|
1689 stack_after=[], |
|
1690 proto=0, |
|
1691 doc="""Stop the unpickling machine. |
|
1692 |
|
1693 Every pickle ends with this opcode. The object at the top of the stack |
|
1694 is popped, and that's the result of unpickling. The stack should be |
|
1695 empty then. |
|
1696 """), |
|
1697 |
|
1698 # Ways to deal with persistent IDs. |
|
1699 |
|
1700 I(name='PERSID', |
|
1701 code='P', |
|
1702 arg=stringnl_noescape, |
|
1703 stack_before=[], |
|
1704 stack_after=[anyobject], |
|
1705 proto=0, |
|
1706 doc="""Push an object identified by a persistent ID. |
|
1707 |
|
1708 The pickle module doesn't define what a persistent ID means. PERSID's |
|
1709 argument is a newline-terminated str-style (no embedded escapes, no |
|
1710 bracketing quote characters) string, which *is* "the persistent ID". |
|
1711 The unpickler passes this string to self.persistent_load(). Whatever |
|
1712 object that returns is pushed on the stack. There is no implementation |
|
1713 of persistent_load() in Python's unpickler: it must be supplied by an |
|
1714 unpickler subclass. |
|
1715 """), |
|
1716 |
|
1717 I(name='BINPERSID', |
|
1718 code='Q', |
|
1719 arg=None, |
|
1720 stack_before=[anyobject], |
|
1721 stack_after=[anyobject], |
|
1722 proto=1, |
|
1723 doc="""Push an object identified by a persistent ID. |
|
1724 |
|
1725 Like PERSID, except the persistent ID is popped off the stack (instead |
|
1726 of being a string embedded in the opcode bytestream). The persistent |
|
1727 ID is passed to self.persistent_load(), and whatever object that |
|
1728 returns is pushed on the stack. See PERSID for more detail. |
|
1729 """), |
|
1730 ] |
|
1731 del I |
|
1732 |
|
1733 # Verify uniqueness of .name and .code members. |
|
1734 name2i = {} |
|
1735 code2i = {} |
|
1736 |
|
1737 for i, d in enumerate(opcodes): |
|
1738 if d.name in name2i: |
|
1739 raise ValueError("repeated name %r at indices %d and %d" % |
|
1740 (d.name, name2i[d.name], i)) |
|
1741 if d.code in code2i: |
|
1742 raise ValueError("repeated code %r at indices %d and %d" % |
|
1743 (d.code, code2i[d.code], i)) |
|
1744 |
|
1745 name2i[d.name] = i |
|
1746 code2i[d.code] = i |
|
1747 |
|
1748 del name2i, code2i, i, d |
|
1749 |
|
1750 ############################################################################## |
|
1751 # Build a code2op dict, mapping opcode characters to OpcodeInfo records. |
|
1752 # Also ensure we've got the same stuff as pickle.py, although the |
|
1753 # introspection here is dicey. |
|
1754 |
|
1755 code2op = {} |
|
1756 for d in opcodes: |
|
1757 code2op[d.code] = d |
|
1758 del d |
|
1759 |
|
1760 def assure_pickle_consistency(verbose=False): |
|
1761 import pickle, re |
|
1762 |
|
1763 copy = code2op.copy() |
|
1764 for name in pickle.__all__: |
|
1765 if not re.match("[A-Z][A-Z0-9_]+$", name): |
|
1766 if verbose: |
|
1767 print "skipping %r: it doesn't look like an opcode name" % name |
|
1768 continue |
|
1769 picklecode = getattr(pickle, name) |
|
1770 if not isinstance(picklecode, str) or len(picklecode) != 1: |
|
1771 if verbose: |
|
1772 print ("skipping %r: value %r doesn't look like a pickle " |
|
1773 "code" % (name, picklecode)) |
|
1774 continue |
|
1775 if picklecode in copy: |
|
1776 if verbose: |
|
1777 print "checking name %r w/ code %r for consistency" % ( |
|
1778 name, picklecode) |
|
1779 d = copy[picklecode] |
|
1780 if d.name != name: |
|
1781 raise ValueError("for pickle code %r, pickle.py uses name %r " |
|
1782 "but we're using name %r" % (picklecode, |
|
1783 name, |
|
1784 d.name)) |
|
1785 # Forget this one. Any left over in copy at the end are a problem |
|
1786 # of a different kind. |
|
1787 del copy[picklecode] |
|
1788 else: |
|
1789 raise ValueError("pickle.py appears to have a pickle opcode with " |
|
1790 "name %r and code %r, but we don't" % |
|
1791 (name, picklecode)) |
|
1792 if copy: |
|
1793 msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"] |
|
1794 for code, d in copy.items(): |
|
1795 msg.append(" name %r with code %r" % (d.name, code)) |
|
1796 raise ValueError("\n".join(msg)) |
|
1797 |
|
1798 assure_pickle_consistency() |
|
1799 del assure_pickle_consistency |
|
1800 |
|
1801 ############################################################################## |
|
1802 # A pickle opcode generator. |
|
1803 |
|
1804 def genops(pickle): |
|
1805 """Generate all the opcodes in a pickle. |
|
1806 |
|
1807 'pickle' is a file-like object, or string, containing the pickle. |
|
1808 |
|
1809 Each opcode in the pickle is generated, from the current pickle position, |
|
1810 stopping after a STOP opcode is delivered. A triple is generated for |
|
1811 each opcode: |
|
1812 |
|
1813 opcode, arg, pos |
|
1814 |
|
1815 opcode is an OpcodeInfo record, describing the current opcode. |
|
1816 |
|
1817 If the opcode has an argument embedded in the pickle, arg is its decoded |
|
1818 value, as a Python object. If the opcode doesn't have an argument, arg |
|
1819 is None. |
|
1820 |
|
1821 If the pickle has a tell() method, pos was the value of pickle.tell() |
|
1822 before reading the current opcode. If the pickle is a string object, |
|
1823 it's wrapped in a StringIO object, and the latter's tell() result is |
|
1824 used. Else (the pickle doesn't have a tell(), and it's not obvious how |
|
1825 to query its current position) pos is None. |
|
1826 """ |
|
1827 |
|
1828 import cStringIO as StringIO |
|
1829 |
|
1830 if isinstance(pickle, str): |
|
1831 pickle = StringIO.StringIO(pickle) |
|
1832 |
|
1833 if hasattr(pickle, "tell"): |
|
1834 getpos = pickle.tell |
|
1835 else: |
|
1836 getpos = lambda: None |
|
1837 |
|
1838 while True: |
|
1839 pos = getpos() |
|
1840 code = pickle.read(1) |
|
1841 opcode = code2op.get(code) |
|
1842 if opcode is None: |
|
1843 if code == "": |
|
1844 raise ValueError("pickle exhausted before seeing STOP") |
|
1845 else: |
|
1846 raise ValueError("at position %s, opcode %r unknown" % ( |
|
1847 pos is None and "<unknown>" or pos, |
|
1848 code)) |
|
1849 if opcode.arg is None: |
|
1850 arg = None |
|
1851 else: |
|
1852 arg = opcode.arg.reader(pickle) |
|
1853 yield opcode, arg, pos |
|
1854 if code == '.': |
|
1855 assert opcode.name == 'STOP' |
|
1856 break |
|
1857 |
|
1858 ############################################################################## |
|
1859 # A pickle optimizer. |
|
1860 |
|
1861 def optimize(p): |
|
1862 'Optimize a pickle string by removing unused PUT opcodes' |
|
1863 gets = set() # set of args used by a GET opcode |
|
1864 puts = [] # (arg, startpos, stoppos) for the PUT opcodes |
|
1865 prevpos = None # set to pos if previous opcode was a PUT |
|
1866 for opcode, arg, pos in genops(p): |
|
1867 if prevpos is not None: |
|
1868 puts.append((prevarg, prevpos, pos)) |
|
1869 prevpos = None |
|
1870 if 'PUT' in opcode.name: |
|
1871 prevarg, prevpos = arg, pos |
|
1872 elif 'GET' in opcode.name: |
|
1873 gets.add(arg) |
|
1874 |
|
1875 # Copy the pickle string except for PUTS without a corresponding GET |
|
1876 s = [] |
|
1877 i = 0 |
|
1878 for arg, start, stop in puts: |
|
1879 j = stop if (arg in gets) else start |
|
1880 s.append(p[i:j]) |
|
1881 i = stop |
|
1882 s.append(p[i:]) |
|
1883 return ''.join(s) |
|
1884 |
|
1885 ############################################################################## |
|
1886 # A symbolic pickle disassembler. |
|
1887 |
|
1888 def dis(pickle, out=None, memo=None, indentlevel=4): |
|
1889 """Produce a symbolic disassembly of a pickle. |
|
1890 |
|
1891 'pickle' is a file-like object, or string, containing a (at least one) |
|
1892 pickle. The pickle is disassembled from the current position, through |
|
1893 the first STOP opcode encountered. |
|
1894 |
|
1895 Optional arg 'out' is a file-like object to which the disassembly is |
|
1896 printed. It defaults to sys.stdout. |
|
1897 |
|
1898 Optional arg 'memo' is a Python dict, used as the pickle's memo. It |
|
1899 may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes. |
|
1900 Passing the same memo object to another dis() call then allows disassembly |
|
1901 to proceed across multiple pickles that were all created by the same |
|
1902 pickler with the same memo. Ordinarily you don't need to worry about this. |
|
1903 |
|
1904 Optional arg indentlevel is the number of blanks by which to indent |
|
1905 a new MARK level. It defaults to 4. |
|
1906 |
|
1907 In addition to printing the disassembly, some sanity checks are made: |
|
1908 |
|
1909 + All embedded opcode arguments "make sense". |
|
1910 |
|
1911 + Explicit and implicit pop operations have enough items on the stack. |
|
1912 |
|
1913 + When an opcode implicitly refers to a markobject, a markobject is |
|
1914 actually on the stack. |
|
1915 |
|
1916 + A memo entry isn't referenced before it's defined. |
|
1917 |
|
1918 + The markobject isn't stored in the memo. |
|
1919 |
|
1920 + A memo entry isn't redefined. |
|
1921 """ |
|
1922 |
|
1923 # Most of the hair here is for sanity checks, but most of it is needed |
|
1924 # anyway to detect when a protocol 0 POP takes a MARK off the stack |
|
1925 # (which in turn is needed to indent MARK blocks correctly). |
|
1926 |
|
1927 stack = [] # crude emulation of unpickler stack |
|
1928 if memo is None: |
|
1929 memo = {} # crude emulation of unpicker memo |
|
1930 maxproto = -1 # max protocol number seen |
|
1931 markstack = [] # bytecode positions of MARK opcodes |
|
1932 indentchunk = ' ' * indentlevel |
|
1933 errormsg = None |
|
1934 for opcode, arg, pos in genops(pickle): |
|
1935 if pos is not None: |
|
1936 print >> out, "%5d:" % pos, |
|
1937 |
|
1938 line = "%-4s %s%s" % (repr(opcode.code)[1:-1], |
|
1939 indentchunk * len(markstack), |
|
1940 opcode.name) |
|
1941 |
|
1942 maxproto = max(maxproto, opcode.proto) |
|
1943 before = opcode.stack_before # don't mutate |
|
1944 after = opcode.stack_after # don't mutate |
|
1945 numtopop = len(before) |
|
1946 |
|
1947 # See whether a MARK should be popped. |
|
1948 markmsg = None |
|
1949 if markobject in before or (opcode.name == "POP" and |
|
1950 stack and |
|
1951 stack[-1] is markobject): |
|
1952 assert markobject not in after |
|
1953 if __debug__: |
|
1954 if markobject in before: |
|
1955 assert before[-1] is stackslice |
|
1956 if markstack: |
|
1957 markpos = markstack.pop() |
|
1958 if markpos is None: |
|
1959 markmsg = "(MARK at unknown opcode offset)" |
|
1960 else: |
|
1961 markmsg = "(MARK at %d)" % markpos |
|
1962 # Pop everything at and after the topmost markobject. |
|
1963 while stack[-1] is not markobject: |
|
1964 stack.pop() |
|
1965 stack.pop() |
|
1966 # Stop later code from popping too much. |
|
1967 try: |
|
1968 numtopop = before.index(markobject) |
|
1969 except ValueError: |
|
1970 assert opcode.name == "POP" |
|
1971 numtopop = 0 |
|
1972 else: |
|
1973 errormsg = markmsg = "no MARK exists on stack" |
|
1974 |
|
1975 # Check for correct memo usage. |
|
1976 if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"): |
|
1977 assert arg is not None |
|
1978 if arg in memo: |
|
1979 errormsg = "memo key %r already defined" % arg |
|
1980 elif not stack: |
|
1981 errormsg = "stack is empty -- can't store into memo" |
|
1982 elif stack[-1] is markobject: |
|
1983 errormsg = "can't store markobject in the memo" |
|
1984 else: |
|
1985 memo[arg] = stack[-1] |
|
1986 |
|
1987 elif opcode.name in ("GET", "BINGET", "LONG_BINGET"): |
|
1988 if arg in memo: |
|
1989 assert len(after) == 1 |
|
1990 after = [memo[arg]] # for better stack emulation |
|
1991 else: |
|
1992 errormsg = "memo key %r has never been stored into" % arg |
|
1993 |
|
1994 if arg is not None or markmsg: |
|
1995 # make a mild effort to align arguments |
|
1996 line += ' ' * (10 - len(opcode.name)) |
|
1997 if arg is not None: |
|
1998 line += ' ' + repr(arg) |
|
1999 if markmsg: |
|
2000 line += ' ' + markmsg |
|
2001 print >> out, line |
|
2002 |
|
2003 if errormsg: |
|
2004 # Note that we delayed complaining until the offending opcode |
|
2005 # was printed. |
|
2006 raise ValueError(errormsg) |
|
2007 |
|
2008 # Emulate the stack effects. |
|
2009 if len(stack) < numtopop: |
|
2010 raise ValueError("tries to pop %d items from stack with " |
|
2011 "only %d items" % (numtopop, len(stack))) |
|
2012 if numtopop: |
|
2013 del stack[-numtopop:] |
|
2014 if markobject in after: |
|
2015 assert markobject not in before |
|
2016 markstack.append(pos) |
|
2017 |
|
2018 stack.extend(after) |
|
2019 |
|
2020 print >> out, "highest protocol among opcodes =", maxproto |
|
2021 if stack: |
|
2022 raise ValueError("stack not empty after STOP: %r" % stack) |
|
2023 |
|
2024 # For use in the doctest, simply as an example of a class to pickle. |
|
2025 class _Example: |
|
2026 def __init__(self, value): |
|
2027 self.value = value |
|
2028 |
|
2029 _dis_test = r""" |
|
2030 >>> import pickle |
|
2031 >>> x = [1, 2, (3, 4), {'abc': u"def"}] |
|
2032 >>> pkl = pickle.dumps(x, 0) |
|
2033 >>> dis(pkl) |
|
2034 0: ( MARK |
|
2035 1: l LIST (MARK at 0) |
|
2036 2: p PUT 0 |
|
2037 5: I INT 1 |
|
2038 8: a APPEND |
|
2039 9: I INT 2 |
|
2040 12: a APPEND |
|
2041 13: ( MARK |
|
2042 14: I INT 3 |
|
2043 17: I INT 4 |
|
2044 20: t TUPLE (MARK at 13) |
|
2045 21: p PUT 1 |
|
2046 24: a APPEND |
|
2047 25: ( MARK |
|
2048 26: d DICT (MARK at 25) |
|
2049 27: p PUT 2 |
|
2050 30: S STRING 'abc' |
|
2051 37: p PUT 3 |
|
2052 40: V UNICODE u'def' |
|
2053 45: p PUT 4 |
|
2054 48: s SETITEM |
|
2055 49: a APPEND |
|
2056 50: . STOP |
|
2057 highest protocol among opcodes = 0 |
|
2058 |
|
2059 Try again with a "binary" pickle. |
|
2060 |
|
2061 >>> pkl = pickle.dumps(x, 1) |
|
2062 >>> dis(pkl) |
|
2063 0: ] EMPTY_LIST |
|
2064 1: q BINPUT 0 |
|
2065 3: ( MARK |
|
2066 4: K BININT1 1 |
|
2067 6: K BININT1 2 |
|
2068 8: ( MARK |
|
2069 9: K BININT1 3 |
|
2070 11: K BININT1 4 |
|
2071 13: t TUPLE (MARK at 8) |
|
2072 14: q BINPUT 1 |
|
2073 16: } EMPTY_DICT |
|
2074 17: q BINPUT 2 |
|
2075 19: U SHORT_BINSTRING 'abc' |
|
2076 24: q BINPUT 3 |
|
2077 26: X BINUNICODE u'def' |
|
2078 34: q BINPUT 4 |
|
2079 36: s SETITEM |
|
2080 37: e APPENDS (MARK at 3) |
|
2081 38: . STOP |
|
2082 highest protocol among opcodes = 1 |
|
2083 |
|
2084 Exercise the INST/OBJ/BUILD family. |
|
2085 |
|
2086 >>> import random |
|
2087 >>> dis(pickle.dumps(random.random, 0)) |
|
2088 0: c GLOBAL 'random random' |
|
2089 15: p PUT 0 |
|
2090 18: . STOP |
|
2091 highest protocol among opcodes = 0 |
|
2092 |
|
2093 >>> from pickletools import _Example |
|
2094 >>> x = [_Example(42)] * 2 |
|
2095 >>> dis(pickle.dumps(x, 0)) |
|
2096 0: ( MARK |
|
2097 1: l LIST (MARK at 0) |
|
2098 2: p PUT 0 |
|
2099 5: ( MARK |
|
2100 6: i INST 'pickletools _Example' (MARK at 5) |
|
2101 28: p PUT 1 |
|
2102 31: ( MARK |
|
2103 32: d DICT (MARK at 31) |
|
2104 33: p PUT 2 |
|
2105 36: S STRING 'value' |
|
2106 45: p PUT 3 |
|
2107 48: I INT 42 |
|
2108 52: s SETITEM |
|
2109 53: b BUILD |
|
2110 54: a APPEND |
|
2111 55: g GET 1 |
|
2112 58: a APPEND |
|
2113 59: . STOP |
|
2114 highest protocol among opcodes = 0 |
|
2115 |
|
2116 >>> dis(pickle.dumps(x, 1)) |
|
2117 0: ] EMPTY_LIST |
|
2118 1: q BINPUT 0 |
|
2119 3: ( MARK |
|
2120 4: ( MARK |
|
2121 5: c GLOBAL 'pickletools _Example' |
|
2122 27: q BINPUT 1 |
|
2123 29: o OBJ (MARK at 4) |
|
2124 30: q BINPUT 2 |
|
2125 32: } EMPTY_DICT |
|
2126 33: q BINPUT 3 |
|
2127 35: U SHORT_BINSTRING 'value' |
|
2128 42: q BINPUT 4 |
|
2129 44: K BININT1 42 |
|
2130 46: s SETITEM |
|
2131 47: b BUILD |
|
2132 48: h BINGET 2 |
|
2133 50: e APPENDS (MARK at 3) |
|
2134 51: . STOP |
|
2135 highest protocol among opcodes = 1 |
|
2136 |
|
2137 Try "the canonical" recursive-object test. |
|
2138 |
|
2139 >>> L = [] |
|
2140 >>> T = L, |
|
2141 >>> L.append(T) |
|
2142 >>> L[0] is T |
|
2143 True |
|
2144 >>> T[0] is L |
|
2145 True |
|
2146 >>> L[0][0] is L |
|
2147 True |
|
2148 >>> T[0][0] is T |
|
2149 True |
|
2150 >>> dis(pickle.dumps(L, 0)) |
|
2151 0: ( MARK |
|
2152 1: l LIST (MARK at 0) |
|
2153 2: p PUT 0 |
|
2154 5: ( MARK |
|
2155 6: g GET 0 |
|
2156 9: t TUPLE (MARK at 5) |
|
2157 10: p PUT 1 |
|
2158 13: a APPEND |
|
2159 14: . STOP |
|
2160 highest protocol among opcodes = 0 |
|
2161 |
|
2162 >>> dis(pickle.dumps(L, 1)) |
|
2163 0: ] EMPTY_LIST |
|
2164 1: q BINPUT 0 |
|
2165 3: ( MARK |
|
2166 4: h BINGET 0 |
|
2167 6: t TUPLE (MARK at 3) |
|
2168 7: q BINPUT 1 |
|
2169 9: a APPEND |
|
2170 10: . STOP |
|
2171 highest protocol among opcodes = 1 |
|
2172 |
|
2173 Note that, in the protocol 0 pickle of the recursive tuple, the disassembler |
|
2174 has to emulate the stack in order to realize that the POP opcode at 16 gets |
|
2175 rid of the MARK at 0. |
|
2176 |
|
2177 >>> dis(pickle.dumps(T, 0)) |
|
2178 0: ( MARK |
|
2179 1: ( MARK |
|
2180 2: l LIST (MARK at 1) |
|
2181 3: p PUT 0 |
|
2182 6: ( MARK |
|
2183 7: g GET 0 |
|
2184 10: t TUPLE (MARK at 6) |
|
2185 11: p PUT 1 |
|
2186 14: a APPEND |
|
2187 15: 0 POP |
|
2188 16: 0 POP (MARK at 0) |
|
2189 17: g GET 1 |
|
2190 20: . STOP |
|
2191 highest protocol among opcodes = 0 |
|
2192 |
|
2193 >>> dis(pickle.dumps(T, 1)) |
|
2194 0: ( MARK |
|
2195 1: ] EMPTY_LIST |
|
2196 2: q BINPUT 0 |
|
2197 4: ( MARK |
|
2198 5: h BINGET 0 |
|
2199 7: t TUPLE (MARK at 4) |
|
2200 8: q BINPUT 1 |
|
2201 10: a APPEND |
|
2202 11: 1 POP_MARK (MARK at 0) |
|
2203 12: h BINGET 1 |
|
2204 14: . STOP |
|
2205 highest protocol among opcodes = 1 |
|
2206 |
|
2207 Try protocol 2. |
|
2208 |
|
2209 >>> dis(pickle.dumps(L, 2)) |
|
2210 0: \x80 PROTO 2 |
|
2211 2: ] EMPTY_LIST |
|
2212 3: q BINPUT 0 |
|
2213 5: h BINGET 0 |
|
2214 7: \x85 TUPLE1 |
|
2215 8: q BINPUT 1 |
|
2216 10: a APPEND |
|
2217 11: . STOP |
|
2218 highest protocol among opcodes = 2 |
|
2219 |
|
2220 >>> dis(pickle.dumps(T, 2)) |
|
2221 0: \x80 PROTO 2 |
|
2222 2: ] EMPTY_LIST |
|
2223 3: q BINPUT 0 |
|
2224 5: h BINGET 0 |
|
2225 7: \x85 TUPLE1 |
|
2226 8: q BINPUT 1 |
|
2227 10: a APPEND |
|
2228 11: 0 POP |
|
2229 12: h BINGET 1 |
|
2230 14: . STOP |
|
2231 highest protocol among opcodes = 2 |
|
2232 """ |
|
2233 |
|
2234 _memo_test = r""" |
|
2235 >>> import pickle |
|
2236 >>> from StringIO import StringIO |
|
2237 >>> f = StringIO() |
|
2238 >>> p = pickle.Pickler(f, 2) |
|
2239 >>> x = [1, 2, 3] |
|
2240 >>> p.dump(x) |
|
2241 >>> p.dump(x) |
|
2242 >>> f.seek(0) |
|
2243 >>> memo = {} |
|
2244 >>> dis(f, memo=memo) |
|
2245 0: \x80 PROTO 2 |
|
2246 2: ] EMPTY_LIST |
|
2247 3: q BINPUT 0 |
|
2248 5: ( MARK |
|
2249 6: K BININT1 1 |
|
2250 8: K BININT1 2 |
|
2251 10: K BININT1 3 |
|
2252 12: e APPENDS (MARK at 5) |
|
2253 13: . STOP |
|
2254 highest protocol among opcodes = 2 |
|
2255 >>> dis(f, memo=memo) |
|
2256 14: \x80 PROTO 2 |
|
2257 16: h BINGET 0 |
|
2258 18: . STOP |
|
2259 highest protocol among opcodes = 2 |
|
2260 """ |
|
2261 |
|
2262 __test__ = {'disassembler_test': _dis_test, |
|
2263 'disassembler_memo_test': _memo_test, |
|
2264 } |
|
2265 |
|
2266 def _test(): |
|
2267 import doctest |
|
2268 return doctest.testmod() |
|
2269 |
|
2270 if __name__ == "__main__": |
|
2271 _test() |