python-2.5.2/win32/Lib/pickletools.py
changeset 0 ae805ac0140d
-1:000000000000 0:ae805ac0140d
       
     1 '''"Executable documentation" for the pickle module.
       
     2 
       
     3 Extensive comments about the pickle protocols and pickle-machine opcodes
       
     4 can be found here.  Some functions meant for external use:
       
     5 
       
     6 genops(pickle)
       
     7    Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
       
     8 
       
     9 dis(pickle, out=None, memo=None, indentlevel=4)
       
    10    Print a symbolic disassembly of a pickle.
       
    11 '''
       
    12 
       
    13 __all__ = ['dis',
       
    14            'genops',
       
    15           ]
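A minimal usage sketch of the two public functions named in the module docstring. It assumes a current Python 3 interpreter and its standard-library `pickletools`, not the Python 2.5 copy listed here; the sample tuple and the variable names are illustrative only.

```python
import io
import pickle
import pickletools

# Protocol 0 keeps the byte stream printable, which makes the listing easy to read.
data = pickle.dumps((1, "two"), protocol=0)

# genops() yields one (opcode, arg, position) triple per opcode in the pickle.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# dis() writes a symbolic disassembly to any object with a write() method.
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())
```

Passing an explicit `out` object keeps the disassembly available as a string instead of sending it to standard output.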
       
    16 
       
    17 # Other ideas:
       
    18 #
       
    19 # - A pickle verifier:  read a pickle and check it exhaustively for
       
    20 #   well-formedness.  dis() does a lot of this already.
       
    21 #
       
    22 # - A protocol identifier:  examine a pickle and return its protocol number
       
    23 #   (== the highest .proto attr value among all the opcodes in the pickle).
       
    24 #   dis() already prints this info at the end.
       
    25 #
       
    26 # - A pickle optimizer:  for example, tuple-building code is sometimes more
       
    27 #   elaborate than necessary, catering for the possibility that the tuple
       
    28 #   is recursive.  Or lots of times a PUT is generated that's never accessed
       
    29 #   by a later GET.
       
    30 
       
    31 
       
    32 """
       
    33 "A pickle" is a program for a virtual pickle machine (PM, but more accurately
       
    34 called an unpickling machine).  It's a sequence of opcodes, interpreted by the
       
    35 PM, building an arbitrarily complex Python object.
       
    36 
       
    37 For the most part, the PM is very simple:  there are no looping, testing, or
       
    38 conditional instructions, no arithmetic and no function calls.  Opcodes are
       
    39 executed once each, from first to last, until a STOP opcode is reached.
       
    40 
       
    41 The PM has two data areas, "the stack" and "the memo".
       
    42 
       
    43 Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
       
    44 integer object on the stack, whose value is gotten from a decimal string
       
    45 literal immediately following the INT opcode in the pickle bytestream.  Other
       
    46 opcodes take Python objects off the stack.  The result of unpickling is
       
    47 whatever object is left on the stack when the final STOP opcode is executed.
       
    48 
       
    49 The memo is simply an array of objects, or it can be implemented as a dict
       
    50 mapping little integers to objects.  The memo serves as the PM's "long term
       
    51 memory", and the little integers indexing the memo are akin to variable
       
    52 names.  Some opcodes pop a stack object into the memo at a given index,
       
    53 and others push a memo object at a given index onto the stack again.
       
    54 
       
    55 At heart, that's all the PM has.  Subtleties arise for these reasons:
       
    56 
       
    57 + Object identity.  Objects can be arbitrarily complex, and subobjects
       
    58   may be shared (for example, the list [a, a] refers to the same object a
       
    59   twice).  It can be vital that unpickling recreate an isomorphic object
       
    60   graph, faithfully reproducing sharing.
       
    61 
       
    62 + Recursive objects.  For example, after "L = []; L.append(L)", L is a
       
    63   list, and L[0] is the same list.  This is related to the object identity
       
    64   point, and some sequences of pickle opcodes are subtle in order to
       
    65   get the right result in all cases.
       
    66 
       
    67 + Things pickle doesn't know everything about.  Examples of things pickle
       
    68   does know everything about are Python's builtin scalar and container
       
    69   types, like ints and tuples.  They generally have opcodes dedicated to
       
    70   them.  For things like module references and instances of user-defined
       
    71   classes, pickle's knowledge is limited.  Historically, many enhancements
       
    72   have been made to the pickle protocol in order to do a better (faster,
       
    73   and/or more compact) job on those.
       
    74 
       
    75 + Backward compatibility and micro-optimization.  As explained below,
       
    76   pickle opcodes never go away, not even when better ways to do a thing
       
    77   get invented.  The repertoire of the PM just keeps growing over time.
       
    78   For example, protocol 0 had two opcodes for building Python integers (INT
       
    79   and LONG), protocol 1 added three more for more-efficient pickling of short
       
    80   integers, and protocol 2 added two more for more-efficient pickling of
       
    81   long integers (before protocol 2, the only ways to pickle a Python long
       
    82   took time quadratic in the number of digits, for both pickling and
       
    83   unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
       
    84   wearying complication.
       
    85 
       
    86 
       
    87 Pickle protocols:
       
    88 
       
    89 For compatibility, the meaning of a pickle opcode never changes.  Instead new
       
    90 pickle opcodes get added, and each version's unpickler can handle all the
       
    91 pickle opcodes in all protocol versions to date.  So old pickles continue to
       
    92 be readable forever.  The pickler can generally be told to restrict itself to
       
    93 the subset of opcodes available under previous protocol versions too, so that
       
    94 users can create pickles under the current version readable by older
       
    95 versions.  However, a pickle does not contain its version number embedded
       
    96 within it.  If an older unpickler tries to read a pickle using a later
       
    97 protocol, the result is most likely an exception due to seeing an unknown (in
       
    98 the older unpickler) opcode.
       
    99 
       
   100 The original pickle used what's now called "protocol 0", and what was called
       
   101 "text mode" before Python 2.3.  The entire pickle bytestream is made up of
       
   102 printable 7-bit ASCII characters, plus the newline character, in protocol 0.
       
   103 That's why it was called text mode.  Protocol 0 is small and elegant, but
       
   104 sometimes painfully inefficient.
       
   105 
       
   106 The second major set of additions is now called "protocol 1", and was called
       
   107 "binary mode" before Python 2.3.  This added many opcodes with arguments
       
   108 consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
       
   109 bytes.  Binary mode pickles can be substantially smaller than equivalent
       
   110 text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
       
   111 int as 4 bytes following the opcode, which is cheaper to unpickle than the
       
   112 (perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
       
   113 a number of opcodes that operate on many stack elements at once (like APPENDS
       
   114 and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
       
   115 
       
   116 The third major set of additions came in Python 2.3, and is called "protocol
       
   117 2".  This added:
       
   118 
       
   119 - A better way to pickle instances of new-style classes (NEWOBJ).
       
   120 
       
   121 - A way for a pickle to identify its protocol (PROTO).
       
   122 
       
    123 - Time- and space-efficient pickling of long ints (LONG{1,4}).
       
   124 
       
    125 - Shortcuts for small tuples (TUPLE{1,2,3}).
       
   126 
       
   127 - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
       
   128 
       
   129 - The "extension registry", a vector of popular objects that can be pushed
       
   130   efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
       
   131   the registry contents are predefined (there's nothing akin to the memo's
       
   132   PUT).
       
   133 
       
   134 Another independent change with Python 2.3 is the abandonment of any
       
   135 pretense that it might be safe to load pickles received from untrusted
       
   136 parties -- no sufficient security analysis has been done to guarantee
       
   137 this and there isn't a use case that warrants the expense of such an
       
   138 analysis.
       
   139 
       
   140 To this end, all tests for __safe_for_unpickling__ or for
       
   141 copy_reg.safe_constructors are removed from the unpickling code.
       
   142 References to these variables in the descriptions below are to be seen
       
   143 as describing unpickling in Python 2.2 and before.
       
   144 """
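The points above about the memo, object identity, recursive objects, and protocols can be observed from the outside. The following is a sketch only, assuming Python 3 and its standard `pickle` and `pickletools` modules; the sample objects are made up for illustration.

```python
import pickle
import pickletools

a = ["shared"]
original = [a, a]                 # the same list object appears twice

data = pickle.dumps(original, protocol=0)
restored = pickle.loads(data)

# Sharing survives the round trip: both slots refer to one object.
same_object = restored[0] is restored[1]

# The memo makes this possible: the first occurrence is stored (PUT),
# and the second occurrence is fetched again by index (GET).
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# A recursive list: L[0] is L itself.
L = []
L.append(L)
loop = pickle.loads(pickle.dumps(L, protocol=0))
is_recursive = loop[0] is loop

# For this data the protocol 0 stream is pure ASCII; a protocol 2 stream
# starts with the PROTO opcode.
ascii_only = all(byte < 128 for byte in data)
proto2 = pickle.dumps(original, protocol=2)
first_op_proto2 = next(pickletools.genops(proto2))[0].name
```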
       
   145 
       
   146 # Meta-rule:  Descriptions are stored in instances of descriptor objects,
       
   147 # with plain constructors.  No meta-language is defined from which
       
   148 # descriptors could be constructed.  If you want, e.g., XML, write a little
       
   149 # program to generate XML from the objects.
       
   150 
       
   151 ##############################################################################
       
   152 # Some pickle opcodes have an argument, following the opcode in the
       
   153 # bytestream.  An argument is of a specific type, described by an instance
       
   154 # of ArgumentDescriptor.  These are not to be confused with arguments taken
       
   155 # off the stack -- ArgumentDescriptor applies only to arguments embedded in
       
   156 # the opcode stream, immediately following an opcode.
       
   157 
       
   158 # Represents the number of bytes consumed by an argument delimited by the
       
   159 # next newline character.
       
   160 UP_TO_NEWLINE = -1
       
   161 
       
   162 # Represents the number of bytes consumed by a two-argument opcode where
       
   163 # the first argument gives the number of bytes in the second argument.
       
   164 TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
       
   165 TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int
       
   166 
       
   167 class ArgumentDescriptor(object):
       
   168     __slots__ = (
       
   169         # name of descriptor record, also a module global name; a string
       
   170         'name',
       
   171 
       
   172         # length of argument, in bytes; an int; UP_TO_NEWLINE and
       
   173         # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
       
   174         # cases
       
   175         'n',
       
   176 
       
   177         # a function taking a file-like object, reading this kind of argument
       
   178         # from the object at the current position, advancing the current
       
   179         # position by n bytes, and returning the value of the argument
       
   180         'reader',
       
   181 
       
   182         # human-readable docs for this arg descriptor; a string
       
   183         'doc',
       
   184     )
       
   185 
       
   186     def __init__(self, name, n, reader, doc):
       
   187         assert isinstance(name, str)
       
   188         self.name = name
       
   189 
       
   190         assert isinstance(n, int) and (n >= 0 or
       
   191                                        n in (UP_TO_NEWLINE,
       
   192                                              TAKEN_FROM_ARGUMENT1,
       
   193                                              TAKEN_FROM_ARGUMENT4))
       
   194         self.n = n
       
   195 
       
   196         self.reader = reader
       
   197 
       
   198         assert isinstance(doc, str)
       
   199         self.doc = doc
       
   200 
       
   201 from struct import unpack as _unpack
       
   202 
       
   203 def read_uint1(f):
       
   204     r"""
       
   205     >>> import StringIO
       
   206     >>> read_uint1(StringIO.StringIO('\xff'))
       
   207     255
       
   208     """
       
   209 
       
   210     data = f.read(1)
       
   211     if data:
       
   212         return ord(data)
       
   213     raise ValueError("not enough data in stream to read uint1")
       
   214 
       
   215 uint1 = ArgumentDescriptor(
       
   216             name='uint1',
       
   217             n=1,
       
   218             reader=read_uint1,
       
   219             doc="One-byte unsigned integer.")
       
   220 
       
   221 
       
   222 def read_uint2(f):
       
   223     r"""
       
   224     >>> import StringIO
       
   225     >>> read_uint2(StringIO.StringIO('\xff\x00'))
       
   226     255
       
   227     >>> read_uint2(StringIO.StringIO('\xff\xff'))
       
   228     65535
       
   229     """
       
   230 
       
   231     data = f.read(2)
       
   232     if len(data) == 2:
       
   233         return _unpack("<H", data)[0]
       
   234     raise ValueError("not enough data in stream to read uint2")
       
   235 
       
   236 uint2 = ArgumentDescriptor(
       
   237             name='uint2',
       
   238             n=2,
       
   239             reader=read_uint2,
       
   240             doc="Two-byte unsigned integer, little-endian.")
       
   241 
       
   242 
       
   243 def read_int4(f):
       
   244     r"""
       
   245     >>> import StringIO
       
   246     >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
       
   247     255
       
   248     >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
       
   249     True
       
   250     """
       
   251 
       
   252     data = f.read(4)
       
   253     if len(data) == 4:
       
   254         return _unpack("<i", data)[0]
       
   255     raise ValueError("not enough data in stream to read int4")
       
   256 
       
   257 int4 = ArgumentDescriptor(
       
   258            name='int4',
       
   259            n=4,
       
   260            reader=read_int4,
       
   261            doc="Four-byte signed integer, little-endian, 2's complement.")
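The three fixed-width readers above share one pattern: read exactly n bytes, fail if the stream is short, and decode with a little-endian `struct` format. A condensed sketch of that pattern, assuming Python 3 (`io.BytesIO` instead of `StringIO`); the helper name `read_fixed` is hypothetical and does not exist in this module.

```python
import io
import struct

def read_fixed(f, fmt):
    """Read one value described by a struct format from a binary stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("not enough data in stream: wanted %d bytes, got %d"
                         % (size, len(data)))
    return struct.unpack(fmt, data)[0]

# "<" selects little-endian with standard sizes: B = 1, H = 2, i = 4 bytes.
uint1_value = read_fixed(io.BytesIO(b"\xff"), "<B")
uint2_value = read_fixed(io.BytesIO(b"\xff\xff"), "<H")
int4_value = read_fixed(io.BytesIO(b"\x00\x00\x00\x80"), "<i")
```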
       
   262 
       
   263 
       
   264 def read_stringnl(f, decode=True, stripquotes=True):
       
   265     r"""
       
   266     >>> import StringIO
       
   267     >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
       
   268     'abcd'
       
   269 
       
   270     >>> read_stringnl(StringIO.StringIO("\n"))
       
   271     Traceback (most recent call last):
       
   272     ...
       
   273     ValueError: no string quotes around ''
       
   274 
       
   275     >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
       
   276     ''
       
   277 
       
   278     >>> read_stringnl(StringIO.StringIO("''\n"))
       
   279     ''
       
   280 
       
   281     >>> read_stringnl(StringIO.StringIO('"abcd"'))
       
   282     Traceback (most recent call last):
       
   283     ...
       
   284     ValueError: no newline found when trying to read stringnl
       
   285 
       
   286     Embedded escapes are undone in the result.
       
   287     >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
       
   288     'a\n\\b\x00c\td'
       
   289     """
       
   290 
       
   291     data = f.readline()
       
   292     if not data.endswith('\n'):
       
   293         raise ValueError("no newline found when trying to read stringnl")
       
   294     data = data[:-1]    # lose the newline
       
   295 
       
   296     if stripquotes:
       
   297         for q in "'\"":
       
   298             if data.startswith(q):
       
   299                 if not data.endswith(q):
       
    300                     raise ValueError("string quote %r not found at both "
       
   301                                      "ends of %r" % (q, data))
       
   302                 data = data[1:-1]
       
   303                 break
       
   304         else:
       
   305             raise ValueError("no string quotes around %r" % data)
       
   306 
       
   307     # I'm not sure when 'string_escape' was added to the std codecs; it's
       
   308     # crazy not to use it if it's there.
       
   309     if decode:
       
   310         data = data.decode('string_escape')
       
   311     return data
       
   312 
       
   313 stringnl = ArgumentDescriptor(
       
   314                name='stringnl',
       
   315                n=UP_TO_NEWLINE,
       
   316                reader=read_stringnl,
       
   317                doc="""A newline-terminated string.
       
   318 
       
   319                    This is a repr-style string, with embedded escapes, and
       
   320                    bracketing quotes.
       
   321                    """)
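A sketch of the quote-stripping step of read_stringnl, rewritten for Python 3 byte streams. It deliberately leaves out escape decoding, because the 'string_escape' codec used above is a Python 2 facility. The function name `read_quoted_line` is hypothetical, and the extra length check (so that a lone quote character is rejected) is an addition that the listed code does not make.

```python
import io

def read_quoted_line(f):
    """Return the bytes between matching quotes on the next line of f."""
    line = f.readline()
    if not line.endswith(b"\n"):
        raise ValueError("no newline found when trying to read a quoted line")
    body = line[:-1]                      # lose the newline

    for quote in (b"'", b'"'):
        if body.startswith(quote):
            # Assumption: require at least two characters, so a single quote
            # character cannot serve as both the opening and closing quote.
            if len(body) < 2 or not body.endswith(quote):
                raise ValueError("string quote %r not found at both ends of %r"
                                 % (quote, body))
            return body[1:-1]
    raise ValueError("no string quotes around %r" % body)

first = read_quoted_line(io.BytesIO(b"'abcd'\nefg\n"))
empty = read_quoted_line(io.BytesIO(b"''\n"))
```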
       
   322 
       
   323 def read_stringnl_noescape(f):
       
   324     return read_stringnl(f, decode=False, stripquotes=False)
       
   325 
       
   326 stringnl_noescape = ArgumentDescriptor(
       
   327                         name='stringnl_noescape',
       
   328                         n=UP_TO_NEWLINE,
       
   329                         reader=read_stringnl_noescape,
       
   330                         doc="""A newline-terminated string.
       
   331 
       
   332                         This is a str-style string, without embedded escapes,
       
   333                         or bracketing quotes.  It should consist solely of
       
   334                         printable ASCII characters.
       
   335                         """)
       
   336 
       
   337 def read_stringnl_noescape_pair(f):
       
   338     r"""
       
   339     >>> import StringIO
       
   340     >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
       
   341     'Queue Empty'
       
   342     """
       
   343 
       
   344     return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
       
   345 
       
   346 stringnl_noescape_pair = ArgumentDescriptor(
       
   347                              name='stringnl_noescape_pair',
       
   348                              n=UP_TO_NEWLINE,
       
   349                              reader=read_stringnl_noescape_pair,
       
   350                              doc="""A pair of newline-terminated strings.
       
   351 
       
   352                              These are str-style strings, without embedded
       
   353                              escapes, or bracketing quotes.  They should
       
   354                              consist solely of printable ASCII characters.
       
   355                              The pair is returned as a single string, with
       
   356                              a single blank separating the two strings.
       
   357                              """)
       
   358 
       
   359 def read_string4(f):
       
   360     r"""
       
   361     >>> import StringIO
       
   362     >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
       
   363     ''
       
   364     >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
       
   365     'abc'
       
   366     >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
       
   367     Traceback (most recent call last):
       
   368     ...
       
   369     ValueError: expected 50331648 bytes in a string4, but only 6 remain
       
   370     """
       
   371 
       
   372     n = read_int4(f)
       
   373     if n < 0:
       
   374         raise ValueError("string4 byte count < 0: %d" % n)
       
   375     data = f.read(n)
       
   376     if len(data) == n:
       
   377         return data
       
   378     raise ValueError("expected %d bytes in a string4, but only %d remain" %
       
   379                      (n, len(data)))
       
   380 
       
   381 string4 = ArgumentDescriptor(
       
   382               name="string4",
       
   383               n=TAKEN_FROM_ARGUMENT4,
       
   384               reader=read_string4,
       
   385               doc="""A counted string.
       
   386 
       
   387               The first argument is a 4-byte little-endian signed int giving
       
   388               the number of bytes in the string, and the second argument is
       
   389               that many bytes.
       
   390               """)
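The counted-string layout can be sketched in a few lines: a 4-byte little-endian signed count followed by that many payload bytes. This assumes Python 3; `read_counted` is a hypothetical name, not part of this module. The last line shows why the third doctest above reports 50331648: the bytes 00 00 00 03 are a big-endian 3, which read as little-endian gives 3 * 2**24.

```python
import io
import struct

def read_counted(f):
    """Read a 4-byte little-endian signed count, then that many bytes."""
    header = f.read(4)
    if len(header) != 4:
        raise ValueError("not enough data in stream to read the byte count")
    (n,) = struct.unpack("<i", header)
    if n < 0:
        raise ValueError("byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("expected %d bytes, but only %d remain" % (n, len(data)))
    return data

payload = read_counted(io.BytesIO(b"\x03\x00\x00\x00abcdef"))

# A count written in the wrong byte order decodes to a very large number.
big_endian_misread = struct.unpack("<i", b"\x00\x00\x00\x03")[0]
```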
       
   391 
       
   392 
       
   393 def read_string1(f):
       
   394     r"""
       
   395     >>> import StringIO
       
   396     >>> read_string1(StringIO.StringIO("\x00"))
       
   397     ''
       
   398     >>> read_string1(StringIO.StringIO("\x03abcdef"))
       
   399     'abc'
       
   400     """
       
   401 
       
   402     n = read_uint1(f)
       
   403     assert n >= 0
       
   404     data = f.read(n)
       
   405     if len(data) == n:
       
   406         return data
       
   407     raise ValueError("expected %d bytes in a string1, but only %d remain" %
       
   408                      (n, len(data)))
       
   409 
       
   410 string1 = ArgumentDescriptor(
       
   411               name="string1",
       
   412               n=TAKEN_FROM_ARGUMENT1,
       
   413               reader=read_string1,
       
   414               doc="""A counted string.
       
   415 
       
   416               The first argument is a 1-byte unsigned int giving the number
       
   417               of bytes in the string, and the second argument is that many
       
   418               bytes.
       
   419               """)
       
   420 
       
   421 
       
   422 def read_unicodestringnl(f):
       
   423     r"""
       
   424     >>> import StringIO
       
   425     >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
       
   426     u'abc\uabcd'
       
   427     """
       
   428 
       
   429     data = f.readline()
       
   430     if not data.endswith('\n'):
       
   431         raise ValueError("no newline found when trying to read "
       
   432                          "unicodestringnl")
       
   433     data = data[:-1]    # lose the newline
       
   434     return unicode(data, 'raw-unicode-escape')
       
   435 
       
   436 unicodestringnl = ArgumentDescriptor(
       
   437                       name='unicodestringnl',
       
   438                       n=UP_TO_NEWLINE,
       
   439                       reader=read_unicodestringnl,
       
   440                       doc="""A newline-terminated Unicode string.
       
   441 
       
   442                       This is raw-unicode-escape encoded, so consists of
       
   443                       printable ASCII characters, and may contain embedded
       
   444                       escape sequences.
       
   445                       """)
       
   446 
       
   447 def read_unicodestring4(f):
       
   448     r"""
       
   449     >>> import StringIO
       
   450     >>> s = u'abcd\uabcd'
       
   451     >>> enc = s.encode('utf-8')
       
   452     >>> enc
       
   453     'abcd\xea\xaf\x8d'
       
   454     >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
       
   455     >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
       
   456     >>> s == t
       
   457     True
       
   458 
       
   459     >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
       
   460     Traceback (most recent call last):
       
   461     ...
       
   462     ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
       
   463     """
       
   464 
       
   465     n = read_int4(f)
       
   466     if n < 0:
       
   467         raise ValueError("unicodestring4 byte count < 0: %d" % n)
       
   468     data = f.read(n)
       
   469     if len(data) == n:
       
   470         return unicode(data, 'utf-8')
       
   471     raise ValueError("expected %d bytes in a unicodestring4, but only %d "
       
   472                      "remain" % (n, len(data)))
       
   473 
       
   474 unicodestring4 = ArgumentDescriptor(
       
   475                     name="unicodestring4",
       
   476                     n=TAKEN_FROM_ARGUMENT4,
       
   477                     reader=read_unicodestring4,
       
   478                     doc="""A counted Unicode string.
       
   479 
       
   480                     The first argument is a 4-byte little-endian signed int
       
   481                     giving the number of bytes in the string, and the second
       
    482                     argument -- the UTF-8 encoding of the Unicode string --
       
   483                     contains that many bytes.
       
   484                     """)
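The same framing applied to text, as a sketch in Python 3: the count is the number of UTF-8 bytes, not the number of characters. The doctest above builds the count with chr(); `struct.pack("<i", ...)` is used here as an equivalent that also works for lengths above 255. All variable names are illustrative.

```python
import io
import struct

text = "abcd\uabcd"
encoded = text.encode("utf-8")        # 7 bytes: 4 ASCII bytes + 3 for U+ABCD

# Frame the payload with a 4-byte little-endian signed byte count.
framed = struct.pack("<i", len(encoded)) + encoded + b"junk"

stream = io.BytesIO(framed)
(count,) = struct.unpack("<i", stream.read(4))
decoded = stream.read(count).decode("utf-8")
rest = stream.read()                  # bytes after the counted string
```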
       
   485 
       
   486 
       
   487 def read_decimalnl_short(f):
       
   488     r"""
       
   489     >>> import StringIO
       
   490     >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
       
   491     1234
       
   492 
       
   493     >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
       
   494     Traceback (most recent call last):
       
   495     ...
       
   496     ValueError: trailing 'L' not allowed in '1234L'
       
   497     """
       
   498 
       
   499     s = read_stringnl(f, decode=False, stripquotes=False)
       
   500     if s.endswith("L"):
       
   501         raise ValueError("trailing 'L' not allowed in %r" % s)
       
   502 
       
   503     # It's not necessarily true that the result fits in a Python short int:
       
   504     # the pickle may have been written on a 64-bit box.  There's also a hack
       
   505     # for True and False here.
       
   506     if s == "00":
       
   507         return False
       
   508     elif s == "01":
       
   509         return True
       
   510 
       
   511     try:
       
   512         return int(s)
       
   513     except OverflowError:
       
   514         return long(s)
       
   515 
       
   516 def read_decimalnl_long(f):
       
   517     r"""
       
   518     >>> import StringIO
       
   519 
       
   520     >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
       
   521     Traceback (most recent call last):
       
   522     ...
       
   523     ValueError: trailing 'L' required in '1234'
       
   524 
       
   525     Someday the trailing 'L' will probably go away from this output.
       
   526 
       
   527     >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))
       
   528     1234L
       
   529 
       
   530     >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
       
   531     123456789012345678901234L
       
   532     """
       
   533 
       
   534     s = read_stringnl(f, decode=False, stripquotes=False)
       
   535     if not s.endswith("L"):
       
   536         raise ValueError("trailing 'L' required in %r" % s)
       
   537     return long(s)
       
   538 
       
   539 
       
   540 decimalnl_short = ArgumentDescriptor(
       
   541                       name='decimalnl_short',
       
   542                       n=UP_TO_NEWLINE,
       
   543                       reader=read_decimalnl_short,
       
   544                       doc="""A newline-terminated decimal integer literal.
       
   545 
       
   546                           This never has a trailing 'L', and the integer fit
       
   547                           in a short Python int on the box where the pickle
       
   548                           was written -- but there's no guarantee it will fit
       
   549                           in a short Python int on the box where the pickle
       
   550                           is read.
       
   551                           """)
       
   552 
       
   553 decimalnl_long = ArgumentDescriptor(
       
   554                      name='decimalnl_long',
       
   555                      n=UP_TO_NEWLINE,
       
   556                      reader=read_decimalnl_long,
       
   557                      doc="""A newline-terminated decimal integer literal.
       
   558 
       
   559                          This has a trailing 'L', and can represent integers
       
   560                          of any size.
       
   561                          """)
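The "00"/"01" special case can be seen by feeding hand-written protocol 0 pickles to the standard library. This sketch assumes Python 3, where int and long are a single type, so the LONG argument comes back as a plain int without a trailing 'L'. The helper `first_arg` is hypothetical.

```python
import pickle
import pickletools

def first_arg(data):
    """Return (opcode name, decoded argument) for the first opcode in data."""
    opcode, arg, pos = next(pickletools.genops(data))
    return opcode.name, arg

int_op = first_arg(b"I1234\n.")       # ordinary decimal literal
true_op = first_arg(b"I01\n.")        # the literal "01" stands for True
false_op = first_arg(b"I00\n.")       # the literal "00" stands for False
long_op = first_arg(b"L1234L\n.")     # LONG carries a trailing 'L'

# The unpickler applies the same rule: "01" is a bool, "1" is an integer.
loaded_true = pickle.loads(b"I01\n.")
loaded_one = pickle.loads(b"I1\n.")
```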
       
   562 
       
   563 
       
   564 def read_floatnl(f):
       
   565     r"""
       
   566     >>> import StringIO
       
   567     >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
       
   568     -1.25
       
   569     """
       
   570     s = read_stringnl(f, decode=False, stripquotes=False)
       
   571     return float(s)
       
   572 
       
   573 floatnl = ArgumentDescriptor(
       
   574               name='floatnl',
       
   575               n=UP_TO_NEWLINE,
       
   576               reader=read_floatnl,
       
   577               doc="""A newline-terminated decimal floating literal.
       
   578 
       
   579               In general this requires 17 significant digits for roundtrip
       
   580               identity, and pickling then unpickling infinities, NaNs, and
       
   581               minus zero doesn't work across boxes, or on some boxes even
       
   582               on itself (e.g., Windows can't read the strings it produces
       
   583               for infinities or NaNs).
       
   584               """)
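A small worked example of the 17-significant-digit claim, assuming IEEE-754 doubles (which Python uses on essentially every current platform). The value 0.1 + 0.2 is chosen only because 16 digits are not enough to recover it.

```python
x = 0.1 + 0.2

short_text = "%.16g" % x     # 16 significant digits
full_text = "%.17g" % x      # 17 significant digits

# Only the 17-digit text converts back to exactly the same double.
short_roundtrips = float(short_text) == x
full_roundtrips = float(full_text) == x
```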
       
   585 
       
   586 def read_float8(f):
       
   587     r"""
       
   588     >>> import StringIO, struct
       
   589     >>> raw = struct.pack(">d", -1.25)
       
   590     >>> raw
       
   591     '\xbf\xf4\x00\x00\x00\x00\x00\x00'
       
   592     >>> read_float8(StringIO.StringIO(raw + "\n"))
       
   593     -1.25
       
   594     """
       
   595 
       
   596     data = f.read(8)
       
   597     if len(data) == 8:
       
   598         return _unpack(">d", data)[0]
       
   599     raise ValueError("not enough data in stream to read float8")
       
   600 
       
   601 
       
   602 float8 = ArgumentDescriptor(
       
   603              name='float8',
       
   604              n=8,
       
   605              reader=read_float8,
       
   606              doc="""An 8-byte binary representation of a float, big-endian.
       
   607 
       
   608              The format is unique to Python, and shared with the struct
       
   609              module (format string '>d') "in theory" (the struct and cPickle
       
   610              implementations don't share the code -- they should).  It's
       
   611              strongly related to the IEEE-754 double format, and, in normal
       
   612              cases, is in fact identical to the big-endian 754 double format.
       
   613              On other boxes the dynamic range is limited to that of a 754
       
   614              double, and "add a half and chop" rounding is used to reduce
       
   615              the precision to 53 bits.  However, even on a 754 box,
       
   616              infinities, NaNs, and minus zero may not be handled correctly
       
   617              (may not survive roundtrip pickling intact).
       
   618              """)
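A sketch of the float8 layout using `struct`, assuming Python 3 on an IEEE-754 machine. It also shows the argument inside a real protocol 1 pickle, where the BINFLOAT opcode byte `G` is followed directly by the 8 big-endian bytes.

```python
import pickle
import struct

raw = struct.pack(">d", -1.25)            # big-endian 8-byte double
value = struct.unpack(">d", raw)[0]

# The little-endian encoding is the same 8 bytes in reverse order.
little = struct.pack("<d", -1.25)

# In a protocol 1 pickle: opcode byte, the 8-byte argument, then STOP.
binfloat = pickle.dumps(-1.25, protocol=1)
```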
       
   619 
       
   620 # Protocol 2 formats
       
   621 
       
   622 from pickle import decode_long
       
   623 
       
   624 def read_long1(f):
       
   625     r"""
       
   626     >>> import StringIO
       
   627     >>> read_long1(StringIO.StringIO("\x00"))
       
   628     0L
       
   629     >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
       
   630     255L
       
   631     >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
       
   632     32767L
       
   633     >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
       
   634     -256L
       
   635     >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
       
   636     -32768L
       
   637     """
       
   638 
       
   639     n = read_uint1(f)
       
   640     data = f.read(n)
       
   641     if len(data) != n:
       
   642         raise ValueError("not enough data in stream to read long1")
       
   643     return decode_long(data)
       
   644 
       
   645 long1 = ArgumentDescriptor(
       
   646     name="long1",
       
   647     n=TAKEN_FROM_ARGUMENT1,
       
   648     reader=read_long1,
       
   649     doc="""A binary long, little-endian, using 1-byte size.
       
   650 
       
   651     This first reads one byte as an unsigned size, then reads that
       
   652     many bytes and interprets them as a little-endian 2's-complement long.
       
   653     If the size is 0, that's taken as a shortcut for the long 0L.
       
   654     """)
       
   655 
       
   656 def read_long4(f):
       
   657     r"""
       
   658     >>> import StringIO
       
   659     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
       
   660     255L
       
   661     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
       
   662     32767L
       
   663     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
       
   664     -256L
       
   665     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
       
   666     -32768L
       
   667     >>> read_long4(StringIO.StringIO("\x00\x00\x00\x00"))
       
   668     0L
       
   669     """
       
   670 
       
   671     n = read_int4(f)
       
   672     if n < 0:
       
   673         raise ValueError("long4 byte count < 0: %d" % n)
       
   674     data = f.read(n)
       
   675     if len(data) != n:
       
   676         raise ValueError("not enough data in stream to read long4")
       
   677     return decode_long(data)
       
   678 
       
   679 long4 = ArgumentDescriptor(
       
   680     name="long4",
       
   681     n=TAKEN_FROM_ARGUMENT4,
       
   682     reader=read_long4,
       
   683     doc="""A binary representation of a long, little-endian.
       
   684 
       
   685     This first reads four bytes as a signed size (but requires the
       
   686     size to be >= 0), then reads that many bytes and interprets them
       
   687     as a little-endian 2's-complement long.  If the size is 0, that's taken
       
   688     as a shortcut for the long 0L, although LONG1 should really be used
       
   689     then instead (and in any case where # of bytes < 256).
       
   690     """)
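
The long1 and long4 encodings described above differ only in how the byte count is stored. The following is a minimal Python 3 sketch (an assumption; the file itself targets Python 2.5). The `*_sketch` names are hypothetical, and `int.from_bytes` stands in for `pickle.decode_long`.

```python
import io
import struct


def decode_long_sketch(data):
    """Interpret data as a little-endian two's-complement integer."""
    # An empty byte string decodes to 0, matching the size-0 shortcut.
    return int.from_bytes(data, "little", signed=True)


def _read_exact(stream, n, what):
    """Read exactly n bytes or raise ValueError."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read " + what)
    return data


def read_long1_sketch(stream):
    """Size is one unsigned byte, followed by that many payload bytes."""
    n = _read_exact(stream, 1, "long1")[0]
    return decode_long_sketch(_read_exact(stream, n, "long1"))


def read_long4_sketch(stream):
    """Size is a 4-byte little-endian signed int that must be >= 0."""
    n = struct.unpack("<i", _read_exact(stream, 4, "long4"))[0]
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    return decode_long_sketch(_read_exact(stream, n, "long4"))
```

The same two payload bytes decode to the same value under either reader; only the size prefix changes from one byte to four.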
       
   691 
       
   692 
       
   693 ##############################################################################
       
   694 # Object descriptors.  The stack used by the pickle machine holds objects,
       
   695 # and in the stack_before and stack_after attributes of OpcodeInfo
       
   696 # descriptors we need names to describe the various types of objects that can
       
   697 # appear on the stack.
       
   698 
       
   699 class StackObject(object):
       
   700     __slots__ = (
       
   701         # name of descriptor record, for info only
       
   702         'name',
       
   703 
       
   704         # type of object, or tuple of type objects (meaning the object can
       
   705         # be of any type in the tuple)
       
   706         'obtype',
       
   707 
       
   708         # human-readable docs for this kind of stack object; a string
       
   709         'doc',
       
   710     )
       
   711 
       
   712     def __init__(self, name, obtype, doc):
       
   713         assert isinstance(name, str)
       
   714         self.name = name
       
   715 
       
   716         assert isinstance(obtype, type) or isinstance(obtype, tuple)
       
   717         if isinstance(obtype, tuple):
       
   718             for contained in obtype:
       
   719                 assert isinstance(contained, type)
       
   720         self.obtype = obtype
       
   721 
       
   722         assert isinstance(doc, str)
       
   723         self.doc = doc
       
   724 
       
   725     def __repr__(self):
       
   726         return self.name
       
   727 
       
   728 
       
   729 pyint = StackObject(
       
   730             name='int',
       
   731             obtype=int,
       
   732             doc="A short (as opposed to long) Python integer object.")
       
   733 
       
   734 pylong = StackObject(
       
   735              name='long',
       
   736              obtype=long,
       
   737              doc="A long (as opposed to short) Python integer object.")
       
   738 
       
   739 pyinteger_or_bool = StackObject(
       
   740                         name='int_or_bool',
       
   741                         obtype=(int, long, bool),
       
   742                         doc="A Python integer object (short or long), or "
       
   743                             "a Python bool.")
       
   744 
       
   745 pybool = StackObject(
       
   746              name='bool',
       
   747              obtype=(bool,),
       
   748              doc="A Python bool object.")
       
   749 
       
   750 pyfloat = StackObject(
       
   751               name='float',
       
   752               obtype=float,
       
   753               doc="A Python float object.")
       
   754 
       
   755 pystring = StackObject(
       
   756                name='str',
       
   757                obtype=str,
       
   758                doc="A Python string object.")
       
   759 
       
   760 pyunicode = StackObject(
       
   761                 name='unicode',
       
   762                 obtype=unicode,
       
   763                 doc="A Python Unicode string object.")
       
   764 
       
   765 pynone = StackObject(
       
   766              name="None",
       
   767              obtype=type(None),
       
   768              doc="The Python None object.")
       
   769 
       
   770 pytuple = StackObject(
       
   771               name="tuple",
       
   772               obtype=tuple,
       
   773               doc="A Python tuple object.")
       
   774 
       
   775 pylist = StackObject(
       
   776              name="list",
       
   777              obtype=list,
       
   778              doc="A Python list object.")
       
   779 
       
   780 pydict = StackObject(
       
   781              name="dict",
       
   782              obtype=dict,
       
   783              doc="A Python dict object.")
       
   784 
       
   785 anyobject = StackObject(
       
   786                 name='any',
       
   787                 obtype=object,
       
   788                 doc="Any kind of object whatsoever.")
       
   789 
       
   790 markobject = StackObject(
       
   791                  name="mark",
       
   792                  obtype=StackObject,
       
   793                  doc="""'The mark' is a unique object.
       
   794 
       
   795                  Opcodes that operate on a variable number of objects
       
   796                  generally don't embed the count of objects in the opcode,
       
   797                  or pull it off the stack.  Instead the MARK opcode is used
       
   798                  to push a special marker object on the stack, and then
       
   799                  some other opcodes grab all the objects from the top of
       
   800                  the stack down to (but not including) the topmost marker
       
   801                  object.
       
   802                  """)
       
   803 
       
   804 stackslice = StackObject(
       
   805                  name="stackslice",
       
   806                  obtype=StackObject,
       
   807                  doc="""An object representing a contiguous slice of the stack.
       
   808 
       
   809                  This is used in conjunction with markobject, to represent all
       
   810                  of the stack following the topmost markobject.  For example,
       
   811                  the POP_MARK opcode changes the stack from
       
   812 
       
   813                      [..., markobject, stackslice]
       
   814                  to
       
   815                      [...]
       
   816 
       
   817                  No matter how many objects are on the stack after the topmost
       
   818                  markobject, POP_MARK gets rid of all of them (including the
       
   819                  topmost markobject too).
       
   820                  """)
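
The markobject/stackslice convention described above can be sketched with a plain Python list as the stack. This is an illustration only: `MARK` and `pop_to_mark` are hypothetical names, not part of pickletools or of the real unpickler.

```python
# A unique sentinel playing the role of markobject.
MARK = object()


def pop_to_mark(stack):
    """Remove the topmost MARK and everything above it.

    Returns the removed objects (the "stackslice"), without the MARK.
    """
    # Scan downward from the top of the stack for the topmost marker.
    for index in range(len(stack) - 1, -1, -1):
        if stack[index] is MARK:
            items = stack[index + 1:]
            del stack[index:]
            return items
    raise ValueError("no mark on the stack")


stack = ["below", MARK, 1, 2, 3, "abc"]
items = pop_to_mark(stack)
```

POP_MARK corresponds to calling `pop_to_mark` and discarding the result; opcodes that consume a variable number of objects would instead use the returned slice.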
       
   821 
       
   822 ##############################################################################
       
   823 # Descriptors for pickle opcodes.
       
   824 
       
   825 class OpcodeInfo(object):
       
   826 
       
   827     __slots__ = (
       
   828         # symbolic name of opcode; a string
       
   829         'name',
       
   830 
       
   831         # the code used in a bytestream to represent the opcode; a
       
   832         # one-character string
       
   833         'code',
       
   834 
       
   835         # If the opcode has an argument embedded in the byte string, an
       
   836         # instance of ArgumentDescriptor specifying its type.  Note that
       
   837         # arg.reader(s) can be used to read and decode the argument from
       
   838         # the bytestream s, and arg.doc documents the format of the raw
       
   839         # argument bytes.  If the opcode doesn't have an argument embedded
       
   840         # in the bytestream, arg should be None.
       
   841         'arg',
       
   842 
       
   843         # what the stack looks like before this opcode runs; a list
       
   844         'stack_before',
       
   845 
       
   846         # what the stack looks like after this opcode runs; a list
       
   847         'stack_after',
       
   848 
       
   849         # the protocol number in which this opcode was introduced; an int
       
   850         'proto',
       
   851 
       
   852         # human-readable docs for this opcode; a string
       
   853         'doc',
       
   854     )
       
   855 
       
   856     def __init__(self, name, code, arg,
       
   857                  stack_before, stack_after, proto, doc):
       
   858         assert isinstance(name, str)
       
   859         self.name = name
       
   860 
       
   861         assert isinstance(code, str)
       
   862         assert len(code) == 1
       
   863         self.code = code
       
   864 
       
   865         assert arg is None or isinstance(arg, ArgumentDescriptor)
       
   866         self.arg = arg
       
   867 
       
   868         assert isinstance(stack_before, list)
       
   869         for x in stack_before:
       
   870             assert isinstance(x, StackObject)
       
   871         self.stack_before = stack_before
       
   872 
       
   873         assert isinstance(stack_after, list)
       
   874         for x in stack_after:
       
   875             assert isinstance(x, StackObject)
       
   876         self.stack_after = stack_after
       
   877 
       
   878         assert isinstance(proto, int) and 0 <= proto <= 2
       
   879         self.proto = proto
       
   880 
       
   881         assert isinstance(doc, str)
       
   882         self.doc = doc
       
   883 
       
   884 I = OpcodeInfo
       
   885 opcodes = [
       
   886 
       
   887     # Ways to spell integers.
       
   888 
       
   889     I(name='INT',
       
   890       code='I',
       
   891       arg=decimalnl_short,
       
   892       stack_before=[],
       
   893       stack_after=[pyinteger_or_bool],
       
   894       proto=0,
       
   895       doc="""Push an integer or bool.
       
   896 
       
   897       The argument is a newline-terminated decimal literal string.
       
   898 
       
   899       The intent may have been that this always fit in a short Python int,
       
   900       but INT can be generated in pickles written on a 64-bit box that
       
   901       require a Python long on a 32-bit box.  The difference between this
       
   902       and LONG then is that INT skips a trailing 'L', and produces a short
       
   903       int whenever possible.
       
   904 
       
   905       Another difference arises because, when bool was introduced as a
       
   906       distinct type in 2.3, builtin names True and False were also added to
       
   907       2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
       
   908       True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
       
   909       Leading zeroes are never produced for a genuine integer.  The 2.3
       
   910       (and later) unpicklers special-case these and return bool instead;
       
   911       earlier unpicklers ignore the leading "0" and return the int.
       
   912       """),
       
   913 
       
   914     I(name='BININT',
       
   915       code='J',
       
   916       arg=int4,
       
   917       stack_before=[],
       
   918       stack_after=[pyint],
       
   919       proto=1,
       
   920       doc="""Push a four-byte signed integer.
       
   921 
       
   922       This handles the full range of Python (short) integers on a 32-bit
       
   923       box, directly as binary bytes (1 for the opcode and 4 for the integer).
       
   924       If the integer is non-negative and fits in 1 or 2 bytes, pickling via
       
   925       BININT1 or BININT2 saves space.
       
   926       """),
       
   927 
       
   928     I(name='BININT1',
       
   929       code='K',
       
   930       arg=uint1,
       
   931       stack_before=[],
       
   932       stack_after=[pyint],
       
   933       proto=1,
       
   934       doc="""Push a one-byte unsigned integer.
       
   935 
       
   936       This is a space optimization for pickling very small non-negative ints,
       
   937       in range(256).
       
   938       """),
       
   939 
       
   940     I(name='BININT2',
       
   941       code='M',
       
   942       arg=uint2,
       
   943       stack_before=[],
       
   944       stack_after=[pyint],
       
   945       proto=1,
       
   946       doc="""Push a two-byte unsigned integer.
       
   947 
       
   948       This is a space optimization for pickling small positive ints, in
       
   949       range(256, 2**16).  Integers in range(256) can also be pickled via
       
   950       BININT2, but BININT1 instead saves a byte.
       
   951       """),
       
   952 
       
   953     I(name='LONG',
       
   954       code='L',
       
   955       arg=decimalnl_long,
       
   956       stack_before=[],
       
   957       stack_after=[pylong],
       
   958       proto=0,
       
   959       doc="""Push a long integer.
       
   960 
       
   961       The same as INT, except that the literal ends with 'L', and always
       
   962       unpickles to a Python long.  There doesn't seem to be a real purpose
       
   963       to the trailing 'L'.
       
   964 
       
   965       Note that LONG takes time quadratic in the number of digits when
       
   966       unpickling (this is simply due to the nature of decimal->binary
       
   967       conversion).  Proto 2 added linear-time (in C; still quadratic-time
       
   968       in Python) LONG1 and LONG4 opcodes.
       
   969       """),
       
   970 
       
   971     I(name="LONG1",
       
   972       code='\x8a',
       
   973       arg=long1,
       
   974       stack_before=[],
       
   975       stack_after=[pylong],
       
   976       proto=2,
       
   977       doc="""Long integer using one-byte length.
       
   978 
       
   979       A more efficient encoding of a Python long; the long1 encoding
       
   980       says it all."""),
       
   981 
       
   982     I(name="LONG4",
       
   983       code='\x8b',
       
   984       arg=long4,
       
   985       stack_before=[],
       
   986       stack_after=[pylong],
       
   987       proto=2,
       
   988       doc="""Long integer using four-byte length.
       
   989 
       
   990       A more efficient encoding of a Python long; the long4 encoding
       
   991       says it all."""),
       
   992 
       
   993     # Ways to spell strings (8-bit, not Unicode).
       
   994 
       
   995     I(name='STRING',
       
   996       code='S',
       
   997       arg=stringnl,
       
   998       stack_before=[],
       
   999       stack_after=[pystring],
       
  1000       proto=0,
       
  1001       doc="""Push a Python string object.
       
  1002 
       
  1003       The argument is a repr-style string, with bracketing quote characters,
       
  1004       and perhaps embedded escapes.  The argument extends until the next
       
  1005       newline character.
       
  1006       """),
       
  1007 
       
  1008     I(name='BINSTRING',
       
  1009       code='T',
       
  1010       arg=string4,
       
  1011       stack_before=[],
       
  1012       stack_after=[pystring],
       
  1013       proto=1,
       
  1014       doc="""Push a Python string object.
       
  1015 
       
  1016       There are two arguments:  the first is a 4-byte little-endian signed int
       
  1017       giving the number of bytes in the string, and the second is that many
       
  1018       bytes, which are taken literally as the string content.
       
  1019       """),
       
  1020 
       
  1021     I(name='SHORT_BINSTRING',
       
  1022       code='U',
       
  1023       arg=string1,
       
  1024       stack_before=[],
       
  1025       stack_after=[pystring],
       
  1026       proto=1,
       
  1027       doc="""Push a Python string object.
       
  1028 
       
  1029       There are two arguments:  the first is a 1-byte unsigned int giving
       
  1030       the number of bytes in the string, and the second is that many bytes,
       
  1031       which are taken literally as the string content.
       
  1032       """),
       
  1033 
       
  1034     # Ways to spell None.
       
  1035 
       
  1036     I(name='NONE',
       
  1037       code='N',
       
  1038       arg=None,
       
  1039       stack_before=[],
       
  1040       stack_after=[pynone],
       
  1041       proto=0,
       
  1042       doc="Push None on the stack."),
       
  1043 
       
  1044     # Ways to spell bools, starting with proto 2.  See INT for how this was
       
  1045     # done before proto 2.
       
  1046 
       
  1047     I(name='NEWTRUE',
       
  1048       code='\x88',
       
  1049       arg=None,
       
  1050       stack_before=[],
       
  1051       stack_after=[pybool],
       
  1052       proto=2,
       
  1053       doc="""True.
       
  1054 
       
  1055       Push True onto the stack."""),
       
  1056 
       
  1057     I(name='NEWFALSE',
       
  1058       code='\x89',
       
  1059       arg=None,
       
  1060       stack_before=[],
       
  1061       stack_after=[pybool],
       
  1062       proto=2,
       
  1063       doc="""False.
       
  1064 
       
  1065       Push False onto the stack."""),
       
  1066 
       
  1067     # Ways to spell Unicode strings.
       
  1068 
       
  1069     I(name='UNICODE',
       
  1070       code='V',
       
  1071       arg=unicodestringnl,
       
  1072       stack_before=[],
       
  1073       stack_after=[pyunicode],
       
  1074       proto=0,  # this may be pure-text, but it's a later addition
       
  1075       doc="""Push a Python Unicode string object.
       
  1076 
       
  1077       The argument is a raw-unicode-escape encoding of a Unicode string,
       
  1078       and so may contain embedded escape sequences.  The argument extends
       
  1079       until the next newline character.
       
  1080       """),
       
  1081 
       
  1082     I(name='BINUNICODE',
       
  1083       code='X',
       
  1084       arg=unicodestring4,
       
  1085       stack_before=[],
       
  1086       stack_after=[pyunicode],
       
  1087       proto=1,
       
  1088       doc="""Push a Python Unicode string object.
       
  1089 
       
  1090       There are two arguments:  the first is a 4-byte little-endian signed int
       
  1091       giving the number of bytes in the string.  The second is that many
       
  1092       bytes, and is the UTF-8 encoding of the Unicode string.
       
  1093       """),
       
  1094 
       
  1095     # Ways to spell floats.
       
  1096 
       
  1097     I(name='FLOAT',
       
  1098       code='F',
       
  1099       arg=floatnl,
       
  1100       stack_before=[],
       
  1101       stack_after=[pyfloat],
       
  1102       proto=0,
       
  1103       doc="""Newline-terminated decimal float literal.
       
  1104 
       
  1105       The argument is repr(a_float), and in general requires 17 significant
       
  1106       digits for roundtrip conversion to be an identity (this is so for
       
  1107       IEEE-754 double precision values, which is what Python float maps to
       
  1108       on most boxes).
       
  1109 
       
  1110       In general, FLOAT cannot be used to transport infinities, NaNs, or
       
  1111       minus zero across boxes (or even on a single box, if the platform C
       
  1112       library can't read the strings it produces for such things -- Windows
       
  1113       is like that), but may do less damage than BINFLOAT on boxes with
       
  1114       greater precision or dynamic range than IEEE-754 double.
       
  1115       """),
       
  1116 
       
  1117     I(name='BINFLOAT',
       
  1118       code='G',
       
  1119       arg=float8,
       
  1120       stack_before=[],
       
  1121       stack_after=[pyfloat],
       
  1122       proto=1,
       
  1123       doc="""Float stored in binary form, with 8 bytes of data.
       
  1124 
       
  1125       This generally requires less than half the space of FLOAT encoding.
       
  1126       In general, BINFLOAT cannot be used to transport infinities, NaNs, or
       
  1127       minus zero, raises an exception if the exponent exceeds the range of
       
  1128       an IEEE-754 double, and retains no more than 53 bits of precision (if
       
  1129       there are more than that, "add a half and chop" rounding is used to
       
  1130       cut it back to 53 significant bits).
       
  1131       """),
       
  1132 
       
  1133     # Ways to build lists.
       
  1134 
       
  1135     I(name='EMPTY_LIST',
       
  1136       code=']',
       
  1137       arg=None,
       
  1138       stack_before=[],
       
  1139       stack_after=[pylist],
       
  1140       proto=1,
       
  1141       doc="Push an empty list."),
       
  1142 
       
  1143     I(name='APPEND',
       
  1144       code='a',
       
  1145       arg=None,
       
  1146       stack_before=[pylist, anyobject],
       
  1147       stack_after=[pylist],
       
  1148       proto=0,
       
  1149       doc="""Append an object to a list.
       
  1150 
       
  1151       Stack before:  ... pylist anyobject
       
  1152       Stack after:   ... pylist+[anyobject]
       
  1153 
       
  1154       although pylist is really extended in-place.
       
  1155       """),
       
  1156 
       
  1157     I(name='APPENDS',
       
  1158       code='e',
       
  1159       arg=None,
       
  1160       stack_before=[pylist, markobject, stackslice],
       
  1161       stack_after=[pylist],
       
  1162       proto=1,
       
  1163       doc="""Extend a list by a slice of stack objects.
       
  1164 
       
  1165       Stack before:  ... pylist markobject stackslice
       
  1166       Stack after:   ... pylist+stackslice
       
  1167 
       
  1168       although pylist is really extended in-place.
       
  1169       """),
       
  1170 
       
  1171     I(name='LIST',
       
  1172       code='l',
       
  1173       arg=None,
       
  1174       stack_before=[markobject, stackslice],
       
  1175       stack_after=[pylist],
       
  1176       proto=0,
       
  1177       doc="""Build a list out of the topmost stack slice, after markobject.
       
  1178 
       
  1179       All the stack entries following the topmost markobject are placed into
       
  1180       a single Python list, which single list object replaces all of the
       
  1181       stack from the topmost markobject onward.  For example,
       
  1182 
       
  1183       Stack before: ... markobject 1 2 3 'abc'
       
  1184       Stack after:  ... [1, 2, 3, 'abc']
       
  1185       """),
       
  1186 
       
  1187     # Ways to build tuples.
       
  1188 
       
  1189     I(name='EMPTY_TUPLE',
       
  1190       code=')',
       
  1191       arg=None,
       
  1192       stack_before=[],
       
  1193       stack_after=[pytuple],
       
  1194       proto=1,
       
  1195       doc="Push an empty tuple."),
       
  1196 
       
  1197     I(name='TUPLE',
       
  1198       code='t',
       
  1199       arg=None,
       
  1200       stack_before=[markobject, stackslice],
       
  1201       stack_after=[pytuple],
       
  1202       proto=0,
       
  1203       doc="""Build a tuple out of the topmost stack slice, after markobject.
       
  1204 
       
  1205       All the stack entries following the topmost markobject are placed into
       
  1206       a single Python tuple, which single tuple object replaces all of the
       
  1207       stack from the topmost markobject onward.  For example,
       
  1208 
       
  1209       Stack before: ... markobject 1 2 3 'abc'
       
  1210       Stack after:  ... (1, 2, 3, 'abc')
       
  1211       """),
       
  1212 
       
  1213     I(name='TUPLE1',
       
  1214       code='\x85',
       
  1215       arg=None,
       
  1216       stack_before=[anyobject],
       
  1217       stack_after=[pytuple],
       
  1218       proto=2,
       
  1219       doc="""One-tuple.
       
  1220 
       
  1221       This code pops one value off the stack and pushes a tuple of
       
  1222       length 1 whose one item is that value back onto it.  IOW:
       
  1223 
       
  1224           stack[-1] = tuple(stack[-1:])
       
  1225       """),
       
  1226 
       
  1227     I(name='TUPLE2',
       
  1228       code='\x86',
       
  1229       arg=None,
       
  1230       stack_before=[anyobject, anyobject],
       
  1231       stack_after=[pytuple],
       
  1232       proto=2,
       
  1233       doc="""Two-tuple.
       
  1234 
       
  1235       This code pops two values off the stack and pushes a tuple
       
  1236       of length 2 whose items are those values back onto it.  IOW:
       
  1237 
       
  1238           stack[-2:] = [tuple(stack[-2:])]
       
  1239       """),
       
  1240 
       
  1241     I(name='TUPLE3',
       
  1242       code='\x87',
       
  1243       arg=None,
       
  1244       stack_before=[anyobject, anyobject, anyobject],
       
  1245       stack_after=[pytuple],
       
  1246       proto=2,
       
  1247       doc="""Three-tuple.
       
  1248 
       
  1249       This code pops three values off the stack and pushes a tuple
       
  1250       of length 3 whose items are those values back onto it.  IOW:
       
  1251 
       
  1252           stack[-3:] = [tuple(stack[-3:])]
       
  1253       """),
       
  1254 
       
  1255     # Ways to build dicts.
       
  1256 
       
  1257     I(name='EMPTY_DICT',
       
  1258       code='}',
       
  1259       arg=None,
       
  1260       stack_before=[],
       
  1261       stack_after=[pydict],
       
  1262       proto=1,
       
  1263       doc="Push an empty dict."),
       
  1264 
       
  1265     I(name='DICT',
       
  1266       code='d',
       
  1267       arg=None,
       
  1268       stack_before=[markobject, stackslice],
       
  1269       stack_after=[pydict],
       
  1270       proto=0,
       
  1271       doc="""Build a dict out of the topmost stack slice, after markobject.
       
  1272 
       
  1273       All the stack entries following the topmost markobject are placed into
       
  1274       a single Python dict, which single dict object replaces all of the
       
  1275       stack from the topmost markobject onward.  The stack slice alternates
       
  1276       key, value, key, value, ....  For example,
       
  1277 
       
  1278       Stack before: ... markobject 1 2 3 'abc'
       
  1279       Stack after:  ... {1: 2, 3: 'abc'}
       
  1280       """),
       
  1281 
       
  1282     I(name='SETITEM',
       
  1283       code='s',
       
  1284       arg=None,
       
  1285       stack_before=[pydict, anyobject, anyobject],
       
  1286       stack_after=[pydict],
       
  1287       proto=0,
       
  1288       doc="""Add a key+value pair to an existing dict.
       
  1289 
       
  1290       Stack before:  ... pydict key value
       
  1291       Stack after:   ... pydict
       
  1292 
       
  1293       where pydict has been modified via pydict[key] = value.
       
  1294       """),
       
  1295 
       
  1296     I(name='SETITEMS',
       
  1297       code='u',
       
  1298       arg=None,
       
  1299       stack_before=[pydict, markobject, stackslice],
       
  1300       stack_after=[pydict],
       
  1301       proto=1,
       
  1302       doc="""Add an arbitrary number of key+value pairs to an existing dict.
       
  1303 
       
  1304       The slice of the stack following the topmost markobject is taken as
       
  1305       an alternating sequence of keys and values, added to the dict
       
  1306       immediately under the topmost markobject.  Everything at and after the
       
  1307       topmost markobject is popped, leaving the mutated dict at the top
       
  1308       of the stack.
       
  1309 
       
  1310       Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
       
  1311       Stack after:   ... pydict
       
  1312 
       
  1313       where pydict has been modified via pydict[key_i] = value_i for i in
       
  1314       1, 2, ..., n, and in that order.
       
  1315       """),
       
  1316 
       
  1317     # Stack manipulation.
       
  1318 
       
  1319     I(name='POP',
       
  1320       code='0',
       
  1321       arg=None,
       
  1322       stack_before=[anyobject],
       
  1323       stack_after=[],
       
  1324       proto=0,
       
  1325       doc="Discard the top stack item, shrinking the stack by one item."),
       
  1326 
       
  1327     I(name='DUP',
       
  1328       code='2',
       
  1329       arg=None,
       
  1330       stack_before=[anyobject],
       
  1331       stack_after=[anyobject, anyobject],
       
  1332       proto=0,
       
  1333       doc="Push the top stack item onto the stack again, duplicating it."),
       
  1334 
       
  1335     I(name='MARK',
       
  1336       code='(',
       
  1337       arg=None,
       
  1338       stack_before=[],
       
  1339       stack_after=[markobject],
       
  1340       proto=0,
       
  1341       doc="""Push markobject onto the stack.
       
  1342 
       
  1343       markobject is a unique object, used by other opcodes to identify a
       
  1344       region of the stack containing a variable number of objects for them
       
  1345       to work on.  See markobject.doc for more detail.
       
  1346       """),
       
  1347 
       
  1348     I(name='POP_MARK',
       
  1349       code='1',
       
  1350       arg=None,
       
  1351       stack_before=[markobject, stackslice],
       
  1352       stack_after=[],
       
  1353       proto=0,
       
  1354       doc="""Pop all the stack objects at and above the topmost markobject.
       
  1355 
       
  1356       When an opcode using a variable number of stack objects is done,
       
  1357       POP_MARK is used to remove those objects, and to remove the markobject
       
  1358       that delimited their starting position on the stack.
       
  1359       """),
       
  1360 
       
  1361     # Memo manipulation.  There are really only two operations (get and put),
       
  1362     # each in all-text, "short binary", and "long binary" flavors.
       
  1363 
       
  1364     I(name='GET',
       
  1365       code='g',
       
  1366       arg=decimalnl_short,
       
  1367       stack_before=[],
       
  1368       stack_after=[anyobject],
       
  1369       proto=0,
       
  1370       doc="""Read an object from the memo and push it on the stack.
       
  1371 
       
  1372       The index of the memo object to push is given by the newline-terminated
       
  1373       decimal string following.  BINGET and LONG_BINGET are space-optimized
       
  1374       versions.
       
  1375       """),
       
  1376 
       
  1377     I(name='BINGET',
       
  1378       code='h',
       
  1379       arg=uint1,
       
  1380       stack_before=[],
       
  1381       stack_after=[anyobject],
       
  1382       proto=1,
       
  1383       doc="""Read an object from the memo and push it on the stack.
       
  1384 
       
  1385       The index of the memo object to push is given by the 1-byte unsigned
       
  1386       integer following.
       
  1387       """),
       
  1388 
       
  1389     I(name='LONG_BINGET',
       
  1390       code='j',
       
  1391       arg=int4,
       
  1392       stack_before=[],
       
  1393       stack_after=[anyobject],
       
  1394       proto=1,
       
  1395       doc="""Read an object from the memo and push it on the stack.
       
  1396 
       
  1397       The index of the memo object to push is given by the 4-byte signed
       
  1398       little-endian integer following.
       
  1399       """),
       
  1400 
       
  1401     I(name='PUT',
       
  1402       code='p',
       
  1403       arg=decimalnl_short,
       
  1404       stack_before=[],
       
  1405       stack_after=[],
       
  1406       proto=0,
       
  1407       doc="""Store the stack top into the memo.  The stack is not popped.
       
  1408 
       
  1409       The index of the memo location to write into is given by the newline-
       
  1410       terminated decimal string following.  BINPUT and LONG_BINPUT are
       
  1411       space-optimized versions.
       
  1412       """),
       
  1413 
       
  1414     I(name='BINPUT',
       
  1415       code='q',
       
  1416       arg=uint1,
       
  1417       stack_before=[],
       
  1418       stack_after=[],
       
  1419       proto=1,
       
  1420       doc="""Store the stack top into the memo.  The stack is not popped.
       
  1421 
       
  1422       The index of the memo location to write into is given by the 1-byte
       
  1423       unsigned integer following.
       
  1424       """),
       
  1425 
       
  1426     I(name='LONG_BINPUT',
       
  1427       code='r',
       
  1428       arg=int4,
       
  1429       stack_before=[],
       
  1430       stack_after=[],
       
  1431       proto=1,
       
  1432       doc="""Store the stack top into the memo.  The stack is not popped.
       
  1433 
       
  1434       The index of the memo location to write into is given by the 4-byte
       
  1435       signed little-endian integer following.
       
  1436       """),
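The memo opcodes are easiest to see with an object that is referenced twice. The sketch below assumes Python 3 and protocol 0, so the text-mode PUT and GET appear. The names inner, puts and gets are illustrative only.

```python
import pickle
import pickletools

inner = []
data = pickle.dumps([inner, inner], 0)   # the same list is referenced twice

ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]

# Memo indices written by PUT and read back by GET.
puts = [arg for name, arg in ops if name == "PUT"]
gets = [arg for name, arg in ops if name == "GET"]

# Unpickling preserves the sharing: both slots hold the very same list.
restored = pickle.loads(data)
shared = restored[0] is restored[1]
```

The second reference to inner is emitted as a GET of the index that an earlier PUT stored, rather than as a second copy of the list.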
       
  1437 
       
  1438     # Access the extension registry (predefined objects).  Akin to the GET
       
  1439     # family.
       
  1440 
       
  1441     I(name='EXT1',
       
  1442       code='\x82',
       
  1443       arg=uint1,
       
  1444       stack_before=[],
       
  1445       stack_after=[anyobject],
       
  1446       proto=2,
       
  1447       doc="""Extension code.
       
  1448 
       
  1449       This code and the similar EXT2 and EXT4 allow using a registry
       
  1450       of popular objects that are pickled by name, typically classes.
       
  1451       It is envisioned that through a global negotiation and
       
  1452       registration process, third parties can set up a mapping between
       
  1453       ints and object names.
       
  1454 
       
  1455       In order to guarantee pickle interchangeability, the extension
       
  1456       code registry ought to be global, although a range of codes may
       
  1457       be reserved for private use.
       
  1458 
       
  1459       EXT1 has a 1-byte integer argument.  This is used to index into the
       
  1460       extension registry, and the object at that index is pushed on the stack.
       
  1461       """),
       
  1462 
       
  1463     I(name='EXT2',
       
  1464       code='\x83',
       
  1465       arg=uint2,
       
  1466       stack_before=[],
       
  1467       stack_after=[anyobject],
       
  1468       proto=2,
       
  1469       doc="""Extension code.
       
  1470 
       
  1471       See EXT1.  EXT2 has a two-byte integer argument.
       
  1472       """),
       
  1473 
       
  1474     I(name='EXT4',
       
  1475       code='\x84',
       
  1476       arg=int4,
       
  1477       stack_before=[],
       
  1478       stack_after=[anyobject],
       
  1479       proto=2,
       
  1480       doc="""Extension code.
       
  1481 
       
  1482       See EXT1.  EXT4 has a four-byte integer argument.
       
  1483       """),
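A hedged sketch of the extension registry in use, assuming Python 3, where the copy_reg module is named copyreg. The code 240 and the choice of fractions.Fraction are arbitrary picks for illustration; 240 is assumed to fall in the range set aside for private use. The registration is undone afterwards so the global registry is left untouched.

```python
import copyreg      # named copy_reg in Python 2
import fractions
import pickle
import pickletools

CODE = 240  # illustrative code, assumed to lie in the private-use range

copyreg.add_extension("fractions", "Fraction", CODE)
try:
    # At protocol 2 a registered global is written as an EXT opcode
    # instead of a GLOBAL opcode carrying the module and class names.
    data = pickle.dumps(fractions.Fraction, 2)
    ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]
    restored = pickle.loads(data)
finally:
    copyreg.remove_extension("fractions", "Fraction", CODE)
```

Because the code fits in one byte, EXT1 is the opcode expected here; larger codes would need EXT2 or EXT4.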
       
  1484 
       
  1485     # Push a class object, or module function, on the stack, via its module
       
  1486     # and name.
       
  1487 
       
  1488     I(name='GLOBAL',
       
  1489       code='c',
       
  1490       arg=stringnl_noescape_pair,
       
  1491       stack_before=[],
       
  1492       stack_after=[anyobject],
       
  1493       proto=0,
       
  1494       doc="""Push a global object (module.attr) on the stack.
       
  1495 
       
  1496       Two newline-terminated strings follow the GLOBAL opcode.  The first is
       
  1497       taken as a module name, and the second as a class name.  The class
       
  1498       object module.class is pushed on the stack.  More accurately, the
       
  1499       object returned by self.find_class(module, class) is pushed on the
       
  1500       stack, so unpickling subclasses can override this form of lookup.
       
  1501       """),
       
  1502 
       
  1503     # Ways to build objects of classes pickle doesn't know about directly
       
  1504     # (user-defined classes).  I despair of documenting this accurately
       
  1505     # and comprehensibly -- you really have to read the pickle code to
       
  1506     # find all the special cases.
       
  1507 
       
  1508     I(name='REDUCE',
       
  1509       code='R',
       
  1510       arg=None,
       
  1511       stack_before=[anyobject, anyobject],
       
  1512       stack_after=[anyobject],
       
  1513       proto=0,
       
  1514       doc="""Push an object built from a callable and an argument tuple.
       
  1515 
       
  1516       The opcode is named after the __reduce__() method.
       
  1517 
       
  1518       Stack before: ... callable pytuple
       
  1519       Stack after:  ... callable(*pytuple)
       
  1520 
       
  1521       The callable and the argument tuple are the first two items returned
       
  1522       by a __reduce__ method.  Applying the callable to the argtuple is
       
  1523       supposed to reproduce the original object, or at least get it started.
       
  1524       If the __reduce__ method returns a 3-tuple, the last component is an
       
  1525       argument to be passed to the object's __setstate__, and then the REDUCE
       
  1526       opcode is followed by code to create setstate's argument, and then a
       
  1527       BUILD opcode to apply __setstate__ to that argument.
       
  1528 
       
  1529       If type(callable) is not ClassType, REDUCE complains unless the
       
  1530       callable has been registered with the copy_reg module's
       
  1531       safe_constructors dict, or the callable has a magic
       
  1532       '__safe_for_unpickling__' attribute with a true value.  I'm not sure
       
  1533       why it does this, but I've sure seen this complaint often enough when
       
  1534       I didn't want to <wink>.
       
  1535       """),
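To make the "callable pytuple" stack picture concrete, here is a small hand-assembled protocol 0 pickle. This is a sketch assuming Python 3, where the module holding complex is named builtins (Python 2 would spell it __builtin__). The byte string is written by hand for illustration and is not output copied from a pickler.

```python
import pickle
import pickletools

# GLOBAL pushes the callable, MARK ... TUPLE builds the argument tuple,
# and REDUCE replaces both with callable(*pytuple).
data = b"cbuiltins\ncomplex\n(F1.0\nF2.0\ntR."

names = [op.name for op, arg, pos in pickletools.genops(data)]

# Only unpickle data you trust: REDUCE calls whatever callable was pushed.
value = pickle.loads(data)
```

Here REDUCE ends up calling complex(1.0, 2.0) and leaves the result on the stack for STOP.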
       
  1536 
       
  1537     I(name='BUILD',
       
  1538       code='b',
       
  1539       arg=None,
       
  1540       stack_before=[anyobject, anyobject],
       
  1541       stack_after=[anyobject],
       
  1542       proto=0,
       
  1543       doc="""Finish building an object, via __setstate__ or dict update.
       
  1544 
       
  1545       Stack before: ... anyobject argument
       
  1546       Stack after:  ... anyobject
       
  1547 
       
  1548       where anyobject may have been mutated, as follows:
       
  1549 
       
  1550       If the object has a __setstate__ method,
       
  1551 
       
  1552           anyobject.__setstate__(argument)
       
  1553 
       
  1554       is called.
       
  1555 
       
  1556       Else the argument must be a dict, the object must have a __dict__, and
       
  1557       the object is updated via
       
  1558 
       
  1559           anyobject.__dict__.update(argument)
       
  1560 
       
  1561       This may raise RuntimeError in restricted execution mode (which
       
  1562       disallows access to __dict__ directly); in that case, the object
       
  1563       is updated instead via
       
  1564 
       
  1565           for k, v in argument.items():
       
  1566               setattr(anyobject, k, v)
       
  1567       """),
       
  1568 
       
  1569     I(name='INST',
       
  1570       code='i',
       
  1571       arg=stringnl_noescape_pair,
       
  1572       stack_before=[markobject, stackslice],
       
  1573       stack_after=[anyobject],
       
  1574       proto=0,
       
  1575       doc="""Build a class instance.
       
  1576 
       
  1577       This is the protocol 0 version of protocol 1's OBJ opcode.
       
  1578       INST is followed by two newline-terminated strings, giving a
       
  1579       module and class name, just as for the GLOBAL opcode (and see
       
  1580       GLOBAL for more details about that).  self.find_class(module, name)
       
  1581       is used to get a class object.
       
  1582 
       
  1583       In addition, all the objects on the stack following the topmost
       
  1584       markobject are gathered into a tuple and popped (along with the
       
  1585       topmost markobject), just as for the TUPLE opcode.
       
  1586 
       
  1587       Now it gets complicated.  If all of these are true:
       
  1588 
       
  1589         + The argtuple is empty (markobject was at the top of the stack
       
  1590           at the start).
       
  1591 
       
  1592         + It's an old-style class object (the type of the class object is
       
  1593           ClassType).
       
  1594 
       
  1595         + The class object does not have a __getinitargs__ attribute.
       
  1596 
       
  1597       then we want to create an old-style class instance without invoking
       
  1598       its __init__() method (pickle has waffled on this over the years; not
       
  1599       calling __init__() is current wisdom).  In this case, an instance of
       
  1600       an old-style dummy class is created, and then we try to rebind its
       
  1601       __class__ attribute to the desired class object.  If this succeeds,
       
  1602       the new instance object is pushed on the stack, and we're done.  In
       
  1603       restricted execution mode it can fail (assignment to __class__ is
       
  1604       disallowed), and I'm not really sure what happens then -- it looks
       
  1605       like the code ends up calling the class object's __init__ anyway,
       
  1606       via falling into the next case.
       
  1607 
       
  1608       Else (the argtuple is not empty, it's not an old-style class object,
       
  1609       or the class object does have a __getinitargs__ attribute), the code
       
  1610       first insists that the class object have a __safe_for_unpickling__
       
  1611       attribute.  Unlike the __safe_for_unpickling__ check in REDUCE,
       
  1612       it doesn't matter whether this attribute has a true or false value, it
       
  1613       only matters whether it exists (XXX this is a bug; cPickle
       
  1614       requires the attribute to be true).  If __safe_for_unpickling__
       
  1615       doesn't exist, UnpicklingError is raised.
       
  1616 
       
  1617       Else (the class object does have a __safe_for_unpickling__ attr),
       
  1618       the class object obtained from INST's arguments is applied to the
       
  1619       argtuple obtained from the stack, and the resulting instance object
       
  1620       is pushed on the stack.
       
  1621 
       
  1622       NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
       
  1623       """),
       
  1624 
       
  1625     I(name='OBJ',
       
  1626       code='o',
       
  1627       arg=None,
       
  1628       stack_before=[markobject, anyobject, stackslice],
       
  1629       stack_after=[anyobject],
       
  1630       proto=1,
       
  1631       doc="""Build a class instance.
       
  1632 
       
  1633       This is the protocol 1 version of protocol 0's INST opcode, and is
       
  1634       very much like it.  The major difference is that the class object
       
  1635       is taken off the stack, allowing it to be retrieved from the memo
       
  1636       repeatedly if several instances of the same class are created.  This
       
  1637       can be much more efficient (in both time and space) than repeatedly
       
  1638       embedding the module and class names in INST opcodes.
       
  1639 
       
  1640       Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
       
  1641       the class object is taken off the stack, immediately above the
       
  1642       topmost markobject:
       
  1643 
       
  1644       Stack before: ... markobject classobject stackslice
       
  1645       Stack after:  ... new_instance_object
       
  1646 
       
  1647       As for INST, the remainder of the stack above the markobject is
       
  1648       gathered into an argument tuple, and then the logic seems identical,
       
  1649       except that no __safe_for_unpickling__ check is done (XXX this is
       
  1650       a bug; cPickle does test __safe_for_unpickling__).  See INST for
       
  1651       the gory details.
       
  1652 
       
  1653       NOTE:  In Python 2.3, INST and OBJ are identical except for how they
       
  1654       get the class object.  That was always the intent; the implementations
       
  1655       had diverged for accidental reasons.
       
  1656       """),
       
  1657 
       
  1658     I(name='NEWOBJ',
       
  1659       code='\x81',
       
  1660       arg=None,
       
  1661       stack_before=[anyobject, anyobject],
       
  1662       stack_after=[anyobject],
       
  1663       proto=2,
       
  1664       doc="""Build an object instance.
       
  1665 
       
  1666       The stack before should be thought of as containing a class
       
  1667       object followed by an argument tuple (the tuple being the stack
       
  1668       top).  Call these cls and args.  They are popped off the stack,
       
  1669       and the value returned by cls.__new__(cls, *args) is pushed back
       
  1670       onto the stack.
       
  1671       """),
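A sketch of NEWOBJ followed by BUILD, assuming Python 3. argparse.Namespace is used only because it is an importable class that defines no __reduce__ of its own; any such class should behave similarly.

```python
import argparse
import pickle
import pickletools

original = argparse.Namespace(level=3)
data = pickle.dumps(original, 2)      # NEWOBJ requires protocol 2

names = [op.name for op, arg, pos in pickletools.genops(data)]

# NEWOBJ creates the bare instance via cls.__new__(cls, *args);
# the later BUILD restores its attributes.
newobj_first = names.index("NEWOBJ") < names.index("BUILD")

restored = pickle.loads(data)
```

The instance's __init__ is not involved: the object is created by __new__ and then populated by BUILD.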
       
  1672 
       
  1673     # Machine control.
       
  1674 
       
  1675     I(name='PROTO',
       
  1676       code='\x80',
       
  1677       arg=uint1,
       
  1678       stack_before=[],
       
  1679       stack_after=[],
       
  1680       proto=2,
       
  1681       doc="""Protocol version indicator.
       
  1682 
       
  1683       For protocol 2 and above, a pickle must start with this opcode.
       
  1684       The argument is the protocol version, an int in range(2, 256).
       
  1685       """),
       
  1686 
       
  1687     I(name='STOP',
       
  1688       code='.',
       
  1689       arg=None,
       
  1690       stack_before=[anyobject],
       
  1691       stack_after=[],
       
  1692       proto=0,
       
  1693       doc="""Stop the unpickling machine.
       
  1694 
       
  1695       Every pickle ends with this opcode.  The object at the top of the stack
       
  1696       is popped, and that's the result of unpickling.  The stack should be
       
  1697       empty then.
       
  1698       """),
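The smallest protocol 2 pickle shows both machine-control opcodes framing the payload. This sketch assumes Python 3.

```python
import pickle
import pickletools

data = pickle.dumps(None, 2)

# Each triple is (opcode name, decoded argument, byte offset).
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]

first, last = ops[0], ops[-1]
```

PROTO occupies two bytes (the opcode and its 1-byte argument), so the payload starts at offset 2.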
       
  1699 
       
  1700     # Ways to deal with persistent IDs.
       
  1701 
       
  1702     I(name='PERSID',
       
  1703       code='P',
       
  1704       arg=stringnl_noescape,
       
  1705       stack_before=[],
       
  1706       stack_after=[anyobject],
       
  1707       proto=0,
       
  1708       doc="""Push an object identified by a persistent ID.
       
  1709 
       
  1710       The pickle module doesn't define what a persistent ID means.  PERSID's
       
  1711       argument is a newline-terminated str-style (no embedded escapes, no
       
  1712       bracketing quote characters) string, which *is* "the persistent ID".
       
  1713       The unpickler passes this string to self.persistent_load().  Whatever
       
  1714       object that returns is pushed on the stack.  There is no implementation
       
  1715       of persistent_load() in Python's unpickler:  it must be supplied by an
       
  1716       unpickler subclass.
       
  1717       """),
       
  1718 
       
  1719     I(name='BINPERSID',
       
  1720       code='Q',
       
  1721       arg=None,
       
  1722       stack_before=[anyobject],
       
  1723       stack_after=[anyobject],
       
  1724       proto=1,
       
  1725       doc="""Push an object identified by a persistent ID.
       
  1726 
       
  1727       Like PERSID, except the persistent ID is popped off the stack (instead
       
  1728       of being a string embedded in the opcode bytestream).  The persistent
       
  1729       ID is passed to self.persistent_load(), and whatever object that
       
  1730       returns is pushed on the stack.  See PERSID for more detail.
       
  1731       """),
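A minimal sketch of persistent IDs, assuming Python 3. Handle, REGISTRY, HandlePickler and HandleUnpickler are hypothetical names invented for this example. The pickler replaces each Handle with its key, and the unpickler's persistent_load() maps the key back to the live object.

```python
import io
import pickle
import pickletools

class Handle(object):
    """Stand-in for an external resource that should not be pickled by value."""
    def __init__(self, key):
        self.key = key

REGISTRY = {"db-1": Handle("db-1")}

class HandlePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Returning None means "pickle this object normally".
        return obj.key if isinstance(obj, Handle) else None

class HandleUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        return REGISTRY[pid]

buf = io.BytesIO()
HandlePickler(buf, 2).dump([REGISTRY["db-1"]])
data = buf.getvalue()

names = [op.name for op, arg, pos in pickletools.genops(data)]
restored = HandleUnpickler(io.BytesIO(data)).load()
```

At a binary protocol the ID is pushed on the stack and consumed by BINPERSID; at protocol 0 the text-mode PERSID would be used instead.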
       
  1732 ]
       
  1733 del I
       
  1734 
       
  1735 # Verify uniqueness of .name and .code members.
       
  1736 name2i = {}
       
  1737 code2i = {}
       
  1738 
       
  1739 for i, d in enumerate(opcodes):
       
  1740     if d.name in name2i:
       
  1741         raise ValueError("repeated name %r at indices %d and %d" %
       
  1742                          (d.name, name2i[d.name], i))
       
  1743     if d.code in code2i:
       
  1744         raise ValueError("repeated code %r at indices %d and %d" %
       
  1745                          (d.code, code2i[d.code], i))
       
  1746 
       
  1747     name2i[d.name] = i
       
  1748     code2i[d.code] = i
       
  1749 
       
  1750 del name2i, code2i, i, d
       
  1751 
       
  1752 ##############################################################################
       
  1753 # Build a code2op dict, mapping opcode characters to OpcodeInfo records.
       
  1754 # Also ensure we've got the same stuff as pickle.py, although the
       
  1755 # introspection here is dicey.
       
  1756 
       
  1757 code2op = {}
       
  1758 for d in opcodes:
       
  1759     code2op[d.code] = d
       
  1760 del d
       
  1761 
       
  1762 def assure_pickle_consistency(verbose=False):
       
  1763     import pickle, re
       
  1764 
       
  1765     copy = code2op.copy()
       
  1766     for name in pickle.__all__:
       
  1767         if not re.match("[A-Z][A-Z0-9_]+$", name):
       
  1768             if verbose:
       
  1769                 print "skipping %r: it doesn't look like an opcode name" % name
       
  1770             continue
       
  1771         picklecode = getattr(pickle, name)
       
  1772         if not isinstance(picklecode, str) or len(picklecode) != 1:
       
  1773             if verbose:
       
  1774                 print ("skipping %r: value %r doesn't look like a pickle "
       
  1775                        "code" % (name, picklecode))
       
  1776             continue
       
  1777         if picklecode in copy:
       
  1778             if verbose:
       
  1779                 print "checking name %r w/ code %r for consistency" % (
       
  1780                       name, picklecode)
       
  1781             d = copy[picklecode]
       
  1782             if d.name != name:
       
  1783                 raise ValueError("for pickle code %r, pickle.py uses name %r "
       
  1784                                  "but we're using name %r" % (picklecode,
       
  1785                                                               name,
       
  1786                                                               d.name))
       
  1787             # Forget this one.  Any left over in copy at the end are a problem
       
  1788             # of a different kind.
       
  1789             del copy[picklecode]
       
  1790         else:
       
  1791             raise ValueError("pickle.py appears to have a pickle opcode with "
       
  1792                              "name %r and code %r, but we don't" %
       
  1793                              (name, picklecode))
       
  1794     if copy:
       
  1795         msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
       
  1796         for code, d in copy.items():
       
  1797             msg.append("    name %r with code %r" % (d.name, code))
       
  1798         raise ValueError("\n".join(msg))
       
  1799 
       
  1800 assure_pickle_consistency()
       
  1801 del assure_pickle_consistency
       
  1802 
       
  1803 ##############################################################################
       
  1804 # A pickle opcode generator.
       
  1805 
       
  1806 def genops(pickle):
       
  1807     """Generate all the opcodes in a pickle.
       
  1808 
       
  1809     'pickle' is a file-like object, or string, containing the pickle.
       
  1810 
       
  1811     Each opcode in the pickle is generated, from the current pickle position,
       
  1812     stopping after a STOP opcode is delivered.  A triple is generated for
       
  1813     each opcode:
       
  1814 
       
  1815         opcode, arg, pos
       
  1816 
       
  1817     opcode is an OpcodeInfo record, describing the current opcode.
       
  1818 
       
  1819     If the opcode has an argument embedded in the pickle, arg is its decoded
       
  1820     value, as a Python object.  If the opcode doesn't have an argument, arg
       
  1821     is None.
       
  1822 
       
  1823     If the pickle has a tell() method, pos was the value of pickle.tell()
       
  1824     before reading the current opcode.  If the pickle is a string object,
       
  1825     it's wrapped in a StringIO object, and the latter's tell() result is
       
  1826     used.  Else (the pickle doesn't have a tell(), and it's not obvious how
       
  1827     to query its current position) pos is None.
       
  1828     """
       
  1829 
       
  1830     import cStringIO as StringIO
       
  1831 
       
  1832     if isinstance(pickle, str):
       
  1833         pickle = StringIO.StringIO(pickle)
       
  1834 
       
  1835     if hasattr(pickle, "tell"):
       
  1836         getpos = pickle.tell
       
  1837     else:
       
  1838         getpos = lambda: None
       
  1839 
       
  1840     while True:
       
  1841         pos = getpos()
       
  1842         code = pickle.read(1)
       
  1843         opcode = code2op.get(code)
       
  1844         if opcode is None:
       
  1845             if code == "":
       
  1846                 raise ValueError("pickle exhausted before seeing STOP")
       
  1847             else:
       
  1848                 raise ValueError("at position %s, opcode %r unknown" % (
       
  1849                                  pos is None and "<unknown>" or pos,
       
  1850                                  code))
       
  1851         if opcode.arg is None:
       
  1852             arg = None
       
  1853         else:
       
  1854             arg = opcode.arg.reader(pickle)
       
  1855         yield opcode, arg, pos
       
  1856         if code == '.':
       
  1857             assert opcode.name == 'STOP'
       
  1858             break
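The pos rules in the docstring can be checked with a short sketch, assuming Python 3, where pickles are bytes and are wrapped in io.BytesIO rather than StringIO. NoTell is a hypothetical wrapper that hides tell() on purpose.

```python
import io
import pickle
import pickletools

class NoTell(object):
    """File-like object exposing only read() and readline(), no tell()."""
    def __init__(self, raw):
        self._f = io.BytesIO(raw)
    def read(self, n):
        return self._f.read(n)
    def readline(self):
        return self._f.readline()

data = pickle.dumps((1, 2), 2)

# A bytes pickle is wrapped in a seekable buffer, so offsets are reported.
with_pos = [pos for op, arg, pos in pickletools.genops(data)]

# Without tell() there is no way to ask for the position, so pos is None.
no_pos = [pos for op, arg, pos in pickletools.genops(NoTell(data))]
```

Both runs decode the same opcodes; only the position column differs.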
       
  1859 
       
  1860 ##############################################################################
       
  1861 # A symbolic pickle disassembler.
       
  1862 
       
  1863 def dis(pickle, out=None, memo=None, indentlevel=4):
       
  1864     """Produce a symbolic disassembly of a pickle.
       
  1865 
       
  1866     'pickle' is a file-like object, or string, containing at least one
       
  1867     pickle.  The pickle is disassembled from the current position, through
       
  1868     the first STOP opcode encountered.
       
  1869 
       
  1870     Optional arg 'out' is a file-like object to which the disassembly is
       
  1871     printed.  It defaults to sys.stdout.
       
  1872 
       
  1873     Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
       
  1874     may be mutated by dis(), if the pickle contains any PUT-family opcode.
       
  1875     Passing the same memo object to another dis() call then allows disassembly
       
  1876     to proceed across multiple pickles that were all created by the same
       
  1877     pickler with the same memo.  Ordinarily you don't need to worry about this.
       
  1878 
       
  1879     Optional arg indentlevel is the number of blanks by which to indent
       
  1880     a new MARK level.  It defaults to 4.
       
  1881 
       
  1882     In addition to printing the disassembly, some sanity checks are made:
       
  1883 
       
  1884     + All embedded opcode arguments "make sense".
       
  1885 
       
  1886     + Explicit and implicit pop operations have enough items on the stack.
       
  1887 
       
  1888     + When an opcode implicitly refers to a markobject, a markobject is
       
  1889       actually on the stack.
       
  1890 
       
  1891     + A memo entry isn't referenced before it's defined.
       
  1892 
       
  1893     + The markobject isn't stored in the memo.
       
  1894 
       
  1895     + A memo entry isn't redefined.
       
  1896     """
       
  1897 
       
  1898     # Most of the hair here is for sanity checks, but most of it is needed
       
  1899     # anyway to detect when a protocol 0 POP takes a MARK off the stack
       
  1900     # (which in turn is needed to indent MARK blocks correctly).
       
  1901 
       
  1902     stack = []          # crude emulation of unpickler stack
       
  1903     if memo is None:
       
  1904         memo = {}       # crude emulation of unpickler memo
       
  1905     maxproto = -1       # max protocol number seen
       
  1906     markstack = []      # bytecode positions of MARK opcodes
       
  1907     indentchunk = ' ' * indentlevel
       
  1908     errormsg = None
       
  1909     for opcode, arg, pos in genops(pickle):
       
  1910         if pos is not None:
       
  1911             print >> out, "%5d:" % pos,
       
  1912 
       
  1913         line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
       
  1914                               indentchunk * len(markstack),
       
  1915                               opcode.name)
       
  1916 
       
  1917         maxproto = max(maxproto, opcode.proto)
       
  1918         before = opcode.stack_before    # don't mutate
       
  1919         after = opcode.stack_after      # don't mutate
       
  1920         numtopop = len(before)
       
  1921 
       
  1922         # See whether a MARK should be popped.
       
  1923         markmsg = None
       
  1924         if markobject in before or (opcode.name == "POP" and
       
  1925                                     stack and
       
  1926                                     stack[-1] is markobject):
       
  1927             assert markobject not in after
       
  1928             if __debug__:
       
  1929                 if markobject in before:
       
  1930                     assert before[-1] is stackslice
       
  1931             if markstack:
       
  1932                 markpos = markstack.pop()
       
  1933                 if markpos is None:
       
  1934                     markmsg = "(MARK at unknown opcode offset)"
       
  1935                 else:
       
  1936                     markmsg = "(MARK at %d)" % markpos
       
  1937                 # Pop everything at and after the topmost markobject.
       
  1938                 while stack[-1] is not markobject:
       
  1939                     stack.pop()
       
  1940                 stack.pop()
       
  1941                 # Stop later code from popping too much.
       
  1942                 try:
       
  1943                     numtopop = before.index(markobject)
       
  1944                 except ValueError:
       
  1945                     assert opcode.name == "POP"
       
  1946                     numtopop = 0
       
  1947             else:
       
  1948                 errormsg = markmsg = "no MARK exists on stack"
       
  1949 
       
  1950         # Check for correct memo usage.
       
  1951         if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
       
  1952             assert arg is not None
       
  1953             if arg in memo:
       
  1954                 errormsg = "memo key %r already defined" % arg
       
  1955             elif not stack:
       
  1956                 errormsg = "stack is empty -- can't store into memo"
       
  1957             elif stack[-1] is markobject:
       
  1958                 errormsg = "can't store markobject in the memo"
       
  1959             else:
       
  1960                 memo[arg] = stack[-1]
       
  1961 
       
  1962         elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
       
  1963             if arg in memo:
       
  1964                 assert len(after) == 1
       
  1965                 after = [memo[arg]]     # for better stack emulation
       
  1966             else:
       
  1967                 errormsg = "memo key %r has never been stored into" % arg
       
  1968 
       
  1969         if arg is not None or markmsg:
       
  1970             # make a mild effort to align arguments
       
  1971             line += ' ' * (10 - len(opcode.name))
       
  1972             if arg is not None:
       
  1973                 line += ' ' + repr(arg)
       
  1974             if markmsg:
       
  1975                 line += ' ' + markmsg
       
  1976         print >> out, line
       
  1977 
       
  1978         if errormsg:
       
  1979             # Note that we delayed complaining until the offending opcode
       
  1980             # was printed.
       
  1981             raise ValueError(errormsg)
       
  1982 
       
  1983         # Emulate the stack effects.
       
  1984         if len(stack) < numtopop:
       
  1985             raise ValueError("tries to pop %d items from stack with "
       
  1986                              "only %d items" % (numtopop, len(stack)))
       
  1987         if numtopop:
       
  1988             del stack[-numtopop:]
       
  1989         if markobject in after:
       
  1990             assert markobject not in before
       
  1991             markstack.append(pos)
       
  1992 
       
  1993         stack.extend(after)
       
  1994 
       
  1995     print >> out, "highest protocol among opcodes =", maxproto
       
  1996     if stack:
       
  1997         raise ValueError("stack not empty after STOP: %r" % stack)
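The mark handling in dis() above can be hard to follow in isolation. The following is a minimal sketch of just that part: it pairs each mark-consuming opcode with the MARK it removes, using genops() and the stack_before/stack_after descriptors. It assumes a Python 3 interpreter (the source above is Python 2) and a well-formed pickle; the helper name mark_pairs is hypothetical and not part of pickletools.

```python
import pickletools


def mark_pairs(data):
    """Map the offset of each mark-consuming opcode to the offset of its MARK.

    Hypothetical helper: a simplified version of the emulation in dis(),
    with no memo tracking and no error reporting.
    """
    mark = pickletools.markobject
    stack = []       # emulated stack of StackObject descriptors
    markstack = []   # offsets of the MARK opcodes still on the stack
    pairs = {}
    for opcode, arg, pos in pickletools.genops(data):
        before, after = opcode.stack_before, opcode.stack_after
        numtopop = len(before)
        # An opcode consumes a mark if its signature says so, or if it is
        # a plain POP that happens to find a mark on top of the stack.
        if mark in before or (opcode.name == "POP" and stack
                              and stack[-1] is mark):
            pairs[pos] = markstack.pop()
            while stack[-1] is not mark:
                stack.pop()
            stack.pop()
            try:
                # Only the items listed below the mark remain to be popped.
                numtopop = before.index(mark)
            except ValueError:
                numtopop = 0   # the POP removed the mark itself
        if numtopop:
            del stack[-numtopop:]
        if mark in after:
            markstack.append(pos)
        stack.extend(after)
    return pairs


# Protocol 0 pickle of the recursive tuple T, written out by hand.
recursive_tuple_p0 = b"((lp0\n(g0\ntp1\na00g1\n."
pairs = mark_pairs(recursive_tuple_p0)
print(pairs)
```

For this pickle the sketch pairs the POP at offset 16 with the MARK at offset 0, which is the case the doctest below calls out.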
       
  1998 
       
  1999 # For use in the doctest, simply as an example of a class to pickle.
       
  2000 class _Example:
       
  2001     def __init__(self, value):
       
  2002         self.value = value
       
  2003 
       
  2004 _dis_test = r"""
       
  2005 >>> import pickle
       
  2006 >>> x = [1, 2, (3, 4), {'abc': u"def"}]
       
  2007 >>> pkl = pickle.dumps(x, 0)
       
  2008 >>> dis(pkl)
       
  2009     0: (    MARK
       
  2010     1: l        LIST       (MARK at 0)
       
  2011     2: p    PUT        0
       
  2012     5: I    INT        1
       
  2013     8: a    APPEND
       
  2014     9: I    INT        2
       
  2015    12: a    APPEND
       
  2016    13: (    MARK
       
  2017    14: I        INT        3
       
  2018    17: I        INT        4
       
  2019    20: t        TUPLE      (MARK at 13)
       
  2020    21: p    PUT        1
       
  2021    24: a    APPEND
       
  2022    25: (    MARK
       
  2023    26: d        DICT       (MARK at 25)
       
  2024    27: p    PUT        2
       
  2025    30: S    STRING     'abc'
       
  2026    37: p    PUT        3
       
  2027    40: V    UNICODE    u'def'
       
  2028    45: p    PUT        4
       
  2029    48: s    SETITEM
       
  2030    49: a    APPEND
       
  2031    50: .    STOP
       
  2032 highest protocol among opcodes = 0
       
  2033 
       
  2034 Try again with a "binary" pickle.
       
  2035 
       
  2036 >>> pkl = pickle.dumps(x, 1)
       
  2037 >>> dis(pkl)
       
  2038     0: ]    EMPTY_LIST
       
  2039     1: q    BINPUT     0
       
  2040     3: (    MARK
       
  2041     4: K        BININT1    1
       
  2042     6: K        BININT1    2
       
  2043     8: (        MARK
       
  2044     9: K            BININT1    3
       
  2045    11: K            BININT1    4
       
  2046    13: t            TUPLE      (MARK at 8)
       
  2047    14: q        BINPUT     1
       
  2048    16: }        EMPTY_DICT
       
  2049    17: q        BINPUT     2
       
  2050    19: U        SHORT_BINSTRING 'abc'
       
  2051    24: q        BINPUT     3
       
  2052    26: X        BINUNICODE u'def'
       
  2053    34: q        BINPUT     4
       
  2054    36: s        SETITEM
       
  2055    37: e        APPENDS    (MARK at 3)
       
  2056    38: .    STOP
       
  2057 highest protocol among opcodes = 1
       
  2058 
       
  2059 Exercise the INST/OBJ/BUILD family.
       
  2060 
       
  2061 >>> import random
       
  2062 >>> dis(pickle.dumps(random.random, 0))
       
  2063     0: c    GLOBAL     'random random'
       
  2064    15: p    PUT        0
       
  2065    18: .    STOP
       
  2066 highest protocol among opcodes = 0
       
  2067 
       
  2068 >>> from pickletools import _Example
       
  2069 >>> x = [_Example(42)] * 2
       
  2070 >>> dis(pickle.dumps(x, 0))
       
  2071     0: (    MARK
       
  2072     1: l        LIST       (MARK at 0)
       
  2073     2: p    PUT        0
       
  2074     5: (    MARK
       
  2075     6: i        INST       'pickletools _Example' (MARK at 5)
       
  2076    28: p    PUT        1
       
  2077    31: (    MARK
       
  2078    32: d        DICT       (MARK at 31)
       
  2079    33: p    PUT        2
       
  2080    36: S    STRING     'value'
       
  2081    45: p    PUT        3
       
  2082    48: I    INT        42
       
  2083    52: s    SETITEM
       
  2084    53: b    BUILD
       
  2085    54: a    APPEND
       
  2086    55: g    GET        1
       
  2087    58: a    APPEND
       
  2088    59: .    STOP
       
  2089 highest protocol among opcodes = 0
       
  2090 
       
  2091 >>> dis(pickle.dumps(x, 1))
       
  2092     0: ]    EMPTY_LIST
       
  2093     1: q    BINPUT     0
       
  2094     3: (    MARK
       
  2095     4: (        MARK
       
  2096     5: c            GLOBAL     'pickletools _Example'
       
  2097    27: q            BINPUT     1
       
  2098    29: o            OBJ        (MARK at 4)
       
  2099    30: q        BINPUT     2
       
  2100    32: }        EMPTY_DICT
       
  2101    33: q        BINPUT     3
       
  2102    35: U        SHORT_BINSTRING 'value'
       
  2103    42: q        BINPUT     4
       
  2104    44: K        BININT1    42
       
  2105    46: s        SETITEM
       
  2106    47: b        BUILD
       
  2107    48: h        BINGET     2
       
  2108    50: e        APPENDS    (MARK at 3)
       
  2109    51: .    STOP
       
  2110 highest protocol among opcodes = 1
       
  2111 
       
  2112 Try "the canonical" recursive-object test.
       
  2113 
       
  2114 >>> L = []
       
  2115 >>> T = L,
       
  2116 >>> L.append(T)
       
  2117 >>> L[0] is T
       
  2118 True
       
  2119 >>> T[0] is L
       
  2120 True
       
  2121 >>> L[0][0] is L
       
  2122 True
       
  2123 >>> T[0][0] is T
       
  2124 True
       
  2125 >>> dis(pickle.dumps(L, 0))
       
  2126     0: (    MARK
       
  2127     1: l        LIST       (MARK at 0)
       
  2128     2: p    PUT        0
       
  2129     5: (    MARK
       
  2130     6: g        GET        0
       
  2131     9: t        TUPLE      (MARK at 5)
       
  2132    10: p    PUT        1
       
  2133    13: a    APPEND
       
  2134    14: .    STOP
       
  2135 highest protocol among opcodes = 0
       
  2136 
       
  2137 >>> dis(pickle.dumps(L, 1))
       
  2138     0: ]    EMPTY_LIST
       
  2139     1: q    BINPUT     0
       
  2140     3: (    MARK
       
  2141     4: h        BINGET     0
       
  2142     6: t        TUPLE      (MARK at 3)
       
  2143     7: q    BINPUT     1
       
  2144     9: a    APPEND
       
  2145    10: .    STOP
       
  2146 highest protocol among opcodes = 1
       
  2147 
       
  2148 Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
       
  2149 has to emulate the stack in order to realize that the POP opcode at 16 gets
       
  2150 rid of the MARK at 0.
       
  2151 
       
  2152 >>> dis(pickle.dumps(T, 0))
       
  2153     0: (    MARK
       
  2154     1: (        MARK
       
  2155     2: l            LIST       (MARK at 1)
       
  2156     3: p        PUT        0
       
  2157     6: (        MARK
       
  2158     7: g            GET        0
       
  2159    10: t            TUPLE      (MARK at 6)
       
  2160    11: p        PUT        1
       
  2161    14: a        APPEND
       
  2162    15: 0        POP
       
  2163    16: 0        POP        (MARK at 0)
       
  2164    17: g    GET        1
       
  2165    20: .    STOP
       
  2166 highest protocol among opcodes = 0
       
  2167 
       
  2168 >>> dis(pickle.dumps(T, 1))
       
  2169     0: (    MARK
       
  2170     1: ]        EMPTY_LIST
       
  2171     2: q        BINPUT     0
       
  2172     4: (        MARK
       
  2173     5: h            BINGET     0
       
  2174     7: t            TUPLE      (MARK at 4)
       
  2175     8: q        BINPUT     1
       
  2176    10: a        APPEND
       
  2177    11: 1        POP_MARK   (MARK at 0)
       
  2178    12: h    BINGET     1
       
  2179    14: .    STOP
       
  2180 highest protocol among opcodes = 1
       
  2181 
       
  2182 Try protocol 2.
       
  2183 
       
  2184 >>> dis(pickle.dumps(L, 2))
       
  2185     0: \x80 PROTO      2
       
  2186     2: ]    EMPTY_LIST
       
  2187     3: q    BINPUT     0
       
  2188     5: h    BINGET     0
       
  2189     7: \x85 TUPLE1
       
  2190     8: q    BINPUT     1
       
  2191    10: a    APPEND
       
  2192    11: .    STOP
       
  2193 highest protocol among opcodes = 2
       
  2194 
       
  2195 >>> dis(pickle.dumps(T, 2))
       
  2196     0: \x80 PROTO      2
       
  2197     2: ]    EMPTY_LIST
       
  2198     3: q    BINPUT     0
       
  2199     5: h    BINGET     0
       
  2200     7: \x85 TUPLE1
       
  2201     8: q    BINPUT     1
       
  2202    10: a    APPEND
       
  2203    11: 0    POP
       
  2204    12: h    BINGET     1
       
  2205    14: .    STOP
       
  2206 highest protocol among opcodes = 2
       
  2207 """
       
  2208 
       
  2209 _memo_test = r"""
       
  2210 >>> import pickle
       
  2211 >>> from StringIO import StringIO
       
  2212 >>> f = StringIO()
       
  2213 >>> p = pickle.Pickler(f, 2)
       
  2214 >>> x = [1, 2, 3]
       
  2215 >>> p.dump(x)
       
  2216 >>> p.dump(x)
       
  2217 >>> f.seek(0)
       
  2218 >>> memo = {}
       
  2219 >>> dis(f, memo=memo)
       
  2220     0: \x80 PROTO      2
       
  2221     2: ]    EMPTY_LIST
       
  2222     3: q    BINPUT     0
       
  2223     5: (    MARK
       
  2224     6: K        BININT1    1
       
  2225     8: K        BININT1    2
       
  2226    10: K        BININT1    3
       
  2227    12: e        APPENDS    (MARK at 5)
       
  2228    13: .    STOP
       
  2229 highest protocol among opcodes = 2
       
  2230 >>> dis(f, memo=memo)
       
  2231    14: \x80 PROTO      2
       
  2232    16: h    BINGET     0
       
  2233    18: .    STOP
       
  2234 highest protocol among opcodes = 2
       
  2235 """
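The memo test above relies on passing the same memo dict to two dis() calls. As a rough illustration of why that matters, here is a sketch for a Python 3 interpreter (an assumption; the source above targets Python 2, so io.BytesIO stands in for StringIO and the output goes to io.StringIO buffers). It repeats the test and then shows that, with a fresh memo, the second pickle's BINGET cannot be resolved.

```python
import io
import pickle
import pickletools

buf = io.BytesIO()
p = pickle.Pickler(buf, 2)
x = [1, 2, 3]
p.dump(x)
p.dump(x)            # the second dump only refers back to the memo
buf.seek(0)

# Shared memo: the BINPUT seen in the first pickle satisfies the
# BINGET in the second one.
memo = {}
first, second = io.StringIO(), io.StringIO()
pickletools.dis(buf, out=first, memo=memo)
pickletools.dis(buf, out=second, memo=memo)

# Fresh memo for each call: the second disassembly is rejected.
buf.seek(0)
pickletools.dis(buf, out=io.StringIO())
error = None
try:
    pickletools.dis(buf, out=io.StringIO())
except ValueError as exc:
    error = str(exc)

print(second.getvalue())
print(error)
```

Because dis() reads from the file object and stops at STOP, the second call picks up where the first one ended, just as in the doctest.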
       
  2236 
       
  2237 __test__ = {'disassembler_test': _dis_test,
       
  2238             'disassembler_memo_test': _memo_test,
       
  2239            }
       
  2240 
       
  2241 def _test():
       
  2242     import doctest
       
  2243     return doctest.testmod()
       
  2244 
       
  2245 if __name__ == "__main__":
       
  2246     _test()
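dis() ends by printing "highest protocol among opcodes", and the notes at the top of the module suggest a standalone protocol identifier. A minimal sketch of that idea, assuming Python 3 and using only genops() and the .proto attribute of each opcode, might look like this. The function name highest_protocol is hypothetical.

```python
import pickletools


def highest_protocol(data):
    """Return the highest .proto value among the opcodes in a pickle.

    Hypothetical helper; `data` is a bytes object or a binary file-like
    object, as accepted by pickletools.genops().
    """
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))


# Hand-written pickles, so the result does not depend on the pickler in use.
proto0 = b"(lp0\nI1\naI2\na."                  # MARK LIST PUT INT APPEND ...
proto2 = b"\x80\x02]q\x00h\x00\x85q\x01a."     # PROTO 2 ... TUPLE1 ...
print(highest_protocol(proto0), highest_protocol(proto2))
```

This reports the minimum protocol a reader must understand, which can be lower than the protocol number requested when the pickle was written.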