NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size, with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size. 4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads. Almost never written.
    Size: 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size. Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:

        for pagenumber, page in enumerate(pages):
            for word in page:
                d.setdefault(word, []).append(pagenumber)

    Note that the second example is a use case characterized by a get
    and a set to the same key. There are similar use cases with a
    __contains__ followed by a get, set, or del to the same key
    (sketched below). Part of the justification for d.setdefault is
    combining the two lookups into one.
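
For concreteness, a minimal sketch of the paired-access patterns
described above (d and k are stand-ins):

    d[k] = d.get(k, 0) + 1    # a get and a set to the same key
    if k in d:                # a membership test ...
        del d[k]              # ... followed by a del of the same key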
|

Membership Testing
    Dictionaries of any size. Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
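
A short example of the replacement-dictionary pattern mentioned above
(the keys and template are invented for illustration):

    d = {'tool': 'dict', 'lang': 'Python'}
    print '%(lang)s %(tool)s' % d    # one dictionary lookup per %(key)s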
|

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Small dicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
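
As a quick check of those figures (the 12-byte size is the 32-bit
PyDictEntry: me_hash, me_key, and me_value at 4 bytes apiece):

    ENTRY_SIZE = 12    # bytes per entry on a 32-bit box
    CACHE_LINE = 64    # bytes per cache line, per the assumption above
    print CACHE_LINE / float(ENTRY_SIZE)    # 5.333... entries per line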
|


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE. Currently set to 8.
    Must be a power of two. New dicts have to zero out every cell.
    Each additional 8 entries consumes 1.5 cache lines. Increasing
    the size improves the sparseness of small dictionaries but costs
    time to read in the additional cache lines if they are not
    already in cache. That case is common when keyword arguments are
    passed.

* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions. Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed. (A sketch of the load test and
    growth step follows this list.)

    The load test occurs in highly time-sensitive code. Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.

* Growth rate upon hitting maximum load. Currently set to *2.
    Raising this to *4 results in half the number of resizes, less
    effort to resize, and better sparseness for some (but not all)
    dict sizes, but it potentially doubles memory consumption,
    depending on the size of the dictionary. Setting it to *4
    eliminates every other resize step.

* Maximum sparseness (minimum dictionary load). What percentage
    of entries can be unused before the dictionary shrinks to
    free up memory and speed up iteration? (The current CPython
    code does not represent this parameter directly.)

* Shrinkage rate upon exceeding maximum sparseness. The current
    CPython code never even checks sparseness when deleting a
    key. When a new key is added, it resizes based on the number
    of active keys, so the addition may trigger shrinkage rather
    than growth.
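
A minimal Python sketch of the load test and growth step described
above; ma_fill (active plus dummy slots), ma_used (active slots), and
ma_mask (table size minus one) are the real struct field names, while
the functions themselves are illustrative only:

    def needs_resize(ma_fill, ma_mask):
        # Resize once more than 2/3 of the slots have ever been used.
        return ma_fill * 3 >= (ma_mask + 1) * 2

    def grown_size(ma_used, minsize=8):
        # Smallest power of two greater than twice the active count.
        size = minsize
        while size <= ma_used * 2:
            size <<= 1
        return size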
|

Tune-ups should be measured across a broad range of applications and
use cases. A change to any parameter will help in some situations and
hurt in others. The key is to find settings that help the most common
cases and do the least damage to the less common cases. Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions
than others). Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing. Those methods loop over every potential
entry. Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary iterates at least twice, once for the memset()
when it is created and once by dealloc().

Dictionary operations involving only a single key can be O(1) unless
resizing is possible. By checking for a resize only when the
dictionary can grow (and may *require* resizing), other operations
remain O(1), and the odds of resize thrashing or memory fragmentation
are reduced. In particular, an algorithm that empties a dictionary
by repeatedly invoking .pop() will see no resizing, which might
not be necessary at all because the dictionary is eventually
discarded entirely.
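
For example, an emptying loop such as the following (d is a stand-in)
triggers no resizes, because the resize check happens only on
insertion:

    while d:
        key, value = d.popitem()    # removal never shrinks the table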
|


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line. Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution. Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit). This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines. It also
occurs frequently in large dictionaries that have a common access pattern
where some keys are accessed much more frequently than others. The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string(): set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 step to an unrolled
version of the loop.
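
A Python model of the resulting probe order (illustrative only; the
real code is C in Objects/dictobject.c, where PERTURB_SHIFT is 5):

    def probe_order(h, mask, PERTURB_SHIFT=5):
        # Yield table slots in the order searched: each primary probe
        # is followed by its cache-adjacent neighbor (the i^1 step
        # above).  h is assumed non-negative here.
        i = h & mask
        perturb = h
        while True:
            yield i & mask
            yield (i ^ 1) & mask
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT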
|

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small
  dicts, which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to the more common pattern following a
  Zipf's law distribution), then there will be more benefit for large
  dictionaries because any given key is no more likely than another
  to already be in cache.

* In use cases with paired accesses to the same key, the second access
  is always in cache and gets no benefit from efforts to further
  improve cache locality.


Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller
dictionaries, then a custom search approach can be implemented that
exploits the small search space and cache locality.

* The simplest example is a linear search of contiguous entries. This
  is simple to implement, guaranteed to terminate rapidly, never
  searches the same entry twice, and precludes the need to check for
  dummy entries (a sketch follows this list).

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first. The organization
  adapts if the access pattern changes over time. Treaps are ideally
  suited for self-organization, with the most common entries at the
  top of the heap and a rapid binary search pattern. Most probes and
  results are located near the top of the tree, allowing them to fit
  in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
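
A pure-Python sketch of the linear-search idea, with a list of
(hash, key, value) triples standing in for the C entry array:

    def small_lookup(entries, key):
        # Scan at most eight contiguous entries; no probe sequence and
        # no dummy-entry checks are needed.
        h = hash(key)
        for eh, ek, ev in entries:
            if eh == h and (ek is key or ek == key):
                return ev
        raise KeyError(key)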
|


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method. Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and
   memory is not at a premium, the user may benefit from setting the
   maximum load ratio at 5% or 10% instead of the usual 2/3. This will
   sharply curtail the number of collisions but will increase iteration
   time. The builtin namespace is a prime example of a dictionary that
   can benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the
   ultimate size of the dictionary is known in advance. The dictionary
   can be pre-sized so that no resize operations are required during
   creation. Not only does this save resizes, but key insertion will
   go more quickly because the first half of the keys will be inserted
   into a more sparse environment than before. The preconditions for
   this strategy arise whenever a dictionary is created from a key or
   item sequence and the number of *unique* keys is known (see the
   sketch after this list).

3) If the key space is large and the access pattern is known to be
   random, then search strategies exploiting cache locality can be
   fruitful. The preconditions for this strategy arise in simulations
   and numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some
   of the keys, then the entries can be stored contiguously and
   accessed with a linear search or treap. This exploits knowledge of
   the data, cache locality, and a simplified search routine. It also
   eliminates the need to test for dummy entries on each probe. The
   preconditions for this strategy arise in symbol tables and in the
   builtin dictionary.
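
To put a number on strategy 2, a small sketch that counts the doubling
resizes pre-sizing would avoid, reusing the 2/3 load test from above
(the figures are illustrative):

    def resizes_avoided(n_keys, minsize=8):
        # Doubling steps passed through while n_keys unique keys are
        # inserted one at a time under a 2/3 maximum load.
        size, steps = minsize, 0
        while n_keys * 3 >= size * 2:
            size <<= 1
            steps += 1
        return steps

    print resizes_avoided(10000)    # prints 11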
|


Readonly Dictionaries
---------------------
Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state. This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions). The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->ma_mask + 1) steps. Alternatively, a separate tuple of keys can
be kept just for iteration.
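
A minimal sketch of the separate-tuple idea (d is a dictionary that
has finished its build stage; process() is a hypothetical consumer):

    keys = tuple(d)      # captured once the dict becomes read-only
    for k in keys:       # len(d) steps, independent of table size
        process(d[k])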
|


Caching Lookups
---------------
The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
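
A pure-Python model of that simplest incarnation (the real optimization
would live in C inside lookdict(); the class below is illustrative
only):

    class CachingDict(dict):
        """Dict that remembers the most recently accessed entry."""
        def __init__(self, *args, **kwds):
            dict.__init__(self, *args, **kwds)
            self._last = None              # cached (key, value) pair
        def __getitem__(self, key):
            if self._last is not None and self._last[0] == key:
                return self._last[1]       # hit: skip the table probe
            value = dict.__getitem__(self, key)
            self._last = (key, value)      # remember for the next access
            return value
        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)
            self._last = (key, value)      # a set often follows a get
        def __delitem__(self, key):
            dict.__delitem__(self, key)
            self._last = None              # invalidate the cache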