|
1 Tiny Code Generator - Fabrice Bellard. |
|
2 |
|
3 1) Introduction |
|
4 |
|
5 TCG (Tiny Code Generator) began as a generic backend for a C |
|
6 compiler. It was simplified to be used in QEMU. It also has its roots |
|
7 in the QOP code generator written by Paul Brook. |
|
8 |
|
9 2) Definitions |
|
10 |
|
11 The TCG "target" is the architecture for which we generate the |
|
12 code. It is of course not the same as the "target" of QEMU which is |
|
13 the emulated architecture. As TCG started as a generic C backend used |
|
14 for cross compiling, it is assumed that the TCG target is different |
|
15 from the host, although it is never the case for QEMU. |
|
16 |
|
17 A TCG "function" corresponds to a QEMU Translated Block (TB). |
|
18 |
|
19 A TCG "temporary" is a variable only live in a basic |
|
20 block. Temporaries are allocated explicitly in each function. |
|
21 |
|
22 A TCG "local temporary" is a variable only live in a function. Local |
|
23 temporaries are allocated explicitly in each function. |
|
24 |
|
25 A TCG "global" is a variable which is live in all the functions |
|
26 (equivalent of a C global variable). They are defined before the |
|
27 functions defined. A TCG global can be a memory location (e.g. a QEMU |
|
28 CPU register), a fixed host register (e.g. the QEMU CPU state pointer) |
|
29 or a memory location which is stored in a register outside QEMU TBs |
|
30 (not implemented yet). |
|
31 |
|
32 A TCG "basic block" corresponds to a list of instructions terminated |
|
33 by a branch instruction. |
|
34 |
|
35 3) Intermediate representation |
|
36 |
|
37 3.1) Introduction |
|
38 |
|
39 TCG instructions operate on variables which are temporaries, local |
|
40 temporaries or globals. TCG instructions and variables are strongly |
|
41 typed. Two types are supported: 32 bit integers and 64 bit |
|
42 integers. Pointers are defined as an alias to 32 bit or 64 bit |
|
43 integers depending on the TCG target word size. |
|
44 |
|
45 Each instruction has a fixed number of output variable operands, input |
|
46 variable operands and always constant operands. |
|
47 |
|
48 The notable exception is the call instruction which has a variable |
|
49 number of outputs and inputs. |
|
50 |
|
51 In the textual form, output operands usually come first, followed by |
|
52 input operands, followed by constant operands. The output type is |
|
53 included in the instruction name. Constants are prefixed with a '$'. |
|
54 |
|
55 add_i32 t0, t1, t2 (t0 <- t1 + t2) |
|
56 |
|
57 3.2) Assumptions |
|
58 |
|
59 * Basic blocks |
|
60 |
|
61 - Basic blocks end after branches (e.g. brcond_i32 instruction), |
|
62 goto_tb and exit_tb instructions. |
|
63 - Basic blocks start after the end of a previous basic block, or at a |
|
64 set_label instruction. |
|
65 |
|
66 After the end of a basic block, the content of temporaries is |
|
67 destroyed, but local temporaries and globals are preserved. |
|
68 |
|
69 * Floating point types are not supported yet |
|
70 |
|
71 * Pointers: depending on the TCG target, pointer size is 32 bit or 64 |
|
72 bit. The type TCG_TYPE_PTR is an alias to TCG_TYPE_I32 or |
|
73 TCG_TYPE_I64. |
|
74 |
|
75 * Helpers: |
|
76 |
|
77 Using the tcg_gen_helper_x_y it is possible to call any function |
|
78 taking i32, i64 or pointer types. Before calling an helper, all |
|
79 globals are stored at their canonical location and it is assumed that |
|
80 the function can modify them. In the future, function modifiers will |
|
81 be allowed to tell that the helper does not read or write some globals. |
|
82 |
|
83 On some TCG targets (e.g. x86), several calling conventions are |
|
84 supported. |
|
85 |
|
86 * Branches: |
|
87 |
|
88 Use the instruction 'br' to jump to a label. Use 'jmp' to jump to an |
|
89 explicit address. Conditional branches can only jump to labels. |
|
90 |
|
91 3.3) Code Optimizations |
|
92 |
|
93 When generating instructions, you can count on at least the following |
|
94 optimizations: |
|
95 |
|
96 - Single instructions are simplified, e.g. |
|
97 |
|
98 and_i32 t0, t0, $0xffffffff |
|
99 |
|
100 is suppressed. |
|
101 |
|
102 - A liveness analysis is done at the basic block level. The |
|
103 information is used to suppress moves from a dead variable to |
|
104 another one. It is also used to remove instructions which compute |
|
105 dead results. The later is especially useful for condition code |
|
106 optimization in QEMU. |
|
107 |
|
108 In the following example: |
|
109 |
|
110 add_i32 t0, t1, t2 |
|
111 add_i32 t0, t0, $1 |
|
112 mov_i32 t0, $1 |
|
113 |
|
114 only the last instruction is kept. |
|
115 |
|
116 3.4) Instruction Reference |
|
117 |
|
118 ********* Function call |
|
119 |
|
120 * call <ret> <params> ptr |
|
121 |
|
122 call function 'ptr' (pointer type) |
|
123 |
|
124 <ret> optional 32 bit or 64 bit return value |
|
125 <params> optional 32 bit or 64 bit parameters |
|
126 |
|
127 ********* Jumps/Labels |
|
128 |
|
129 * jmp t0 |
|
130 |
|
131 Absolute jump to address t0 (pointer type). |
|
132 |
|
133 * set_label $label |
|
134 |
|
135 Define label 'label' at the current program point. |
|
136 |
|
137 * br $label |
|
138 |
|
139 Jump to label. |
|
140 |
|
141 * brcond_i32/i64 cond, t0, t1, label |
|
142 |
|
143 Conditional jump if t0 cond t1 is true. cond can be: |
|
144 TCG_COND_EQ |
|
145 TCG_COND_NE |
|
146 TCG_COND_LT /* signed */ |
|
147 TCG_COND_GE /* signed */ |
|
148 TCG_COND_LE /* signed */ |
|
149 TCG_COND_GT /* signed */ |
|
150 TCG_COND_LTU /* unsigned */ |
|
151 TCG_COND_GEU /* unsigned */ |
|
152 TCG_COND_LEU /* unsigned */ |
|
153 TCG_COND_GTU /* unsigned */ |
|
154 |
|
155 ********* Arithmetic |
|
156 |
|
157 * add_i32/i64 t0, t1, t2 |
|
158 |
|
159 t0=t1+t2 |
|
160 |
|
161 * sub_i32/i64 t0, t1, t2 |
|
162 |
|
163 t0=t1-t2 |
|
164 |
|
165 * neg_i32/i64 t0, t1 |
|
166 |
|
167 t0=-t1 (two's complement) |
|
168 |
|
169 * mul_i32/i64 t0, t1, t2 |
|
170 |
|
171 t0=t1*t2 |
|
172 |
|
173 * div_i32/i64 t0, t1, t2 |
|
174 |
|
175 t0=t1/t2 (signed). Undefined behavior if division by zero or overflow. |
|
176 |
|
177 * divu_i32/i64 t0, t1, t2 |
|
178 |
|
179 t0=t1/t2 (unsigned). Undefined behavior if division by zero. |
|
180 |
|
181 * rem_i32/i64 t0, t1, t2 |
|
182 |
|
183 t0=t1%t2 (signed). Undefined behavior if division by zero or overflow. |
|
184 |
|
185 * remu_i32/i64 t0, t1, t2 |
|
186 |
|
187 t0=t1%t2 (unsigned). Undefined behavior if division by zero. |
|
188 |
|
189 ********* Logical |
|
190 |
|
191 * and_i32/i64 t0, t1, t2 |
|
192 |
|
193 t0=t1&t2 |
|
194 |
|
195 * or_i32/i64 t0, t1, t2 |
|
196 |
|
197 t0=t1|t2 |
|
198 |
|
199 * xor_i32/i64 t0, t1, t2 |
|
200 |
|
201 t0=t1^t2 |
|
202 |
|
203 * not_i32/i64 t0, t1 |
|
204 |
|
205 t0=~t1 |
|
206 |
|
207 * andc_i32/i64 t0, t1, t2 |
|
208 |
|
209 t0=t1&~t2 |
|
210 |
|
211 * eqv_i32/i64 t0, t1, t2 |
|
212 |
|
213 t0=~(t1^t2) |
|
214 |
|
215 * nand_i32/i64 t0, t1, t2 |
|
216 |
|
217 t0=~(t1&t2) |
|
218 |
|
219 * nor_i32/i64 t0, t1, t2 |
|
220 |
|
221 t0=~(t1|t2) |
|
222 |
|
223 * orc_i32/i64 t0, t1, t2 |
|
224 |
|
225 t0=t1|~t2 |
|
226 |
|
227 ********* Shifts/Rotates |
|
228 |
|
229 * shl_i32/i64 t0, t1, t2 |
|
230 |
|
231 t0=t1 << t2. Undefined behavior if t2 < 0 or t2 >= 32 (resp 64) |
|
232 |
|
233 * shr_i32/i64 t0, t1, t2 |
|
234 |
|
235 t0=t1 >> t2 (unsigned). Undefined behavior if t2 < 0 or t2 >= 32 (resp 64) |
|
236 |
|
237 * sar_i32/i64 t0, t1, t2 |
|
238 |
|
239 t0=t1 >> t2 (signed). Undefined behavior if t2 < 0 or t2 >= 32 (resp 64) |
|
240 |
|
241 * rotl_i32/i64 t0, t1, t2 |
|
242 |
|
243 Rotation of t2 bits to the left. Undefined behavior if t2 < 0 or t2 >= 32 (resp 64) |
|
244 |
|
245 * rotr_i32/i64 t0, t1, t2 |
|
246 |
|
247 Rotation of t2 bits to the right. Undefined behavior if t2 < 0 or t2 >= 32 (resp 64) |
|
248 |
|
249 ********* Misc |
|
250 |
|
251 * mov_i32/i64 t0, t1 |
|
252 |
|
253 t0 = t1 |
|
254 |
|
255 Move t1 to t0 (both operands must have the same type). |
|
256 |
|
257 * ext8s_i32/i64 t0, t1 |
|
258 ext8u_i32/i64 t0, t1 |
|
259 ext16s_i32/i64 t0, t1 |
|
260 ext16u_i32/i64 t0, t1 |
|
261 ext32s_i64 t0, t1 |
|
262 ext32u_i64 t0, t1 |
|
263 |
|
264 8, 16 or 32 bit sign/zero extension (both operands must have the same type) |
|
265 |
|
266 * bswap16_i32 t0, t1 |
|
267 |
|
268 16 bit byte swap on a 32 bit value. The two high order bytes must be set |
|
269 to zero. |
|
270 |
|
271 * bswap_i32 t0, t1 |
|
272 |
|
273 32 bit byte swap |
|
274 |
|
275 * bswap_i64 t0, t1 |
|
276 |
|
277 64 bit byte swap |
|
278 |
|
279 * discard_i32/i64 t0 |
|
280 |
|
281 Indicate that the value of t0 won't be used later. It is useful to |
|
282 force dead code elimination. |
|
283 |
|
284 ********* Type conversions |
|
285 |
|
286 * ext_i32_i64 t0, t1 |
|
287 Convert t1 (32 bit) to t0 (64 bit) and does sign extension |
|
288 |
|
289 * extu_i32_i64 t0, t1 |
|
290 Convert t1 (32 bit) to t0 (64 bit) and does zero extension |
|
291 |
|
292 * trunc_i64_i32 t0, t1 |
|
293 Truncate t1 (64 bit) to t0 (32 bit) |
|
294 |
|
295 * concat_i32_i64 t0, t1, t2 |
|
296 Construct t0 (64-bit) taking the low half from t1 (32 bit) and the high half |
|
297 from t2 (32 bit). |
|
298 |
|
299 * concat32_i64 t0, t1, t2 |
|
300 Construct t0 (64-bit) taking the low half from t1 (64 bit) and the high half |
|
301 from t2 (64 bit). |
|
302 |
|
303 ********* Load/Store |
|
304 |
|
305 * ld_i32/i64 t0, t1, offset |
|
306 ld8s_i32/i64 t0, t1, offset |
|
307 ld8u_i32/i64 t0, t1, offset |
|
308 ld16s_i32/i64 t0, t1, offset |
|
309 ld16u_i32/i64 t0, t1, offset |
|
310 ld32s_i64 t0, t1, offset |
|
311 ld32u_i64 t0, t1, offset |
|
312 |
|
313 t0 = read(t1 + offset) |
|
314 Load 8, 16, 32 or 64 bits with or without sign extension from host memory. |
|
315 offset must be a constant. |
|
316 |
|
317 * st_i32/i64 t0, t1, offset |
|
318 st8_i32/i64 t0, t1, offset |
|
319 st16_i32/i64 t0, t1, offset |
|
320 st32_i64 t0, t1, offset |
|
321 |
|
322 write(t0, t1 + offset) |
|
323 Write 8, 16, 32 or 64 bits to host memory. |
|
324 |
|
325 ********* QEMU specific operations |
|
326 |
|
327 * tb_exit t0 |
|
328 |
|
329 Exit the current TB and return the value t0 (word type). |
|
330 |
|
331 * goto_tb index |
|
332 |
|
333 Exit the current TB and jump to the TB index 'index' (constant) if the |
|
334 current TB was linked to this TB. Otherwise execute the next |
|
335 instructions. |
|
336 |
|
337 * qemu_ld_i32/i64 t0, t1, flags |
|
338 qemu_ld8u_i32/i64 t0, t1, flags |
|
339 qemu_ld8s_i32/i64 t0, t1, flags |
|
340 qemu_ld16u_i32/i64 t0, t1, flags |
|
341 qemu_ld16s_i32/i64 t0, t1, flags |
|
342 qemu_ld32u_i64 t0, t1, flags |
|
343 qemu_ld32s_i64 t0, t1, flags |
|
344 |
|
345 Load data at the QEMU CPU address t1 into t0. t1 has the QEMU CPU |
|
346 address type. 'flags' contains the QEMU memory index (selects user or |
|
347 kernel access) for example. |
|
348 |
|
349 * qemu_st_i32/i64 t0, t1, flags |
|
350 qemu_st8_i32/i64 t0, t1, flags |
|
351 qemu_st16_i32/i64 t0, t1, flags |
|
352 qemu_st32_i64 t0, t1, flags |
|
353 |
|
354 Store the data t0 at the QEMU CPU Address t1. t1 has the QEMU CPU |
|
355 address type. 'flags' contains the QEMU memory index (selects user or |
|
356 kernel access) for example. |
|
357 |
|
358 Note 1: Some shortcuts are defined when the last operand is known to be |
|
359 a constant (e.g. addi for add, movi for mov). |
|
360 |
|
361 Note 2: When using TCG, the opcodes must never be generated directly |
|
362 as some of them may not be available as "real" opcodes. Always use the |
|
363 function tcg_gen_xxx(args). |
|
364 |
|
365 4) Backend |
|
366 |
|
367 tcg-target.h contains the target specific definitions. tcg-target.c |
|
368 contains the target specific code. |
|
369 |
|
370 4.1) Assumptions |
|
371 |
|
372 The target word size (TCG_TARGET_REG_BITS) is expected to be 32 bit or |
|
373 64 bit. It is expected that the pointer has the same size as the word. |
|
374 |
|
375 On a 32 bit target, all 64 bit operations are converted to 32 bits. A |
|
376 few specific operations must be implemented to allow it (see add2_i32, |
|
377 sub2_i32, brcond2_i32). |
|
378 |
|
379 Floating point operations are not supported in this version. A |
|
380 previous incarnation of the code generator had full support of them, |
|
381 but it is better to concentrate on integer operations first. |
|
382 |
|
383 On a 64 bit target, no assumption is made in TCG about the storage of |
|
384 the 32 bit values in 64 bit registers. |
|
385 |
|
386 4.2) Constraints |
|
387 |
|
388 GCC like constraints are used to define the constraints of every |
|
389 instruction. Memory constraints are not supported in this |
|
390 version. Aliases are specified in the input operands as for GCC. |
|
391 |
|
392 The same register may be used for both an input and an output, even when |
|
393 they are not explicitly aliased. If an op expands to multiple target |
|
394 instructions then care must be taken to avoid clobbering input values. |
|
395 GCC style "early clobber" outputs are not currently supported. |
|
396 |
|
397 A target can define specific register or constant constraints. If an |
|
398 operation uses a constant input constraint which does not allow all |
|
399 constants, it must also accept registers in order to have a fallback. |
|
400 |
|
401 The movi_i32 and movi_i64 operations must accept any constants. |
|
402 |
|
403 The mov_i32 and mov_i64 operations must accept any registers of the |
|
404 same type. |
|
405 |
|
406 The ld/st instructions must accept signed 32 bit constant offsets. It |
|
407 can be implemented by reserving a specific register to compute the |
|
408 address if the offset is too big. |
|
409 |
|
410 The ld/st instructions must accept any destination (ld) or source (st) |
|
411 register. |
|
412 |
|
413 4.3) Function call assumptions |
|
414 |
|
415 - The only supported types for parameters and return value are: 32 and |
|
416 64 bit integers and pointer. |
|
417 - The stack grows downwards. |
|
418 - The first N parameters are passed in registers. |
|
419 - The next parameters are passed on the stack by storing them as words. |
|
420 - Some registers are clobbered during the call. |
|
421 - The function can return 0 or 1 value in registers. On a 32 bit |
|
422 target, functions must be able to return 2 values in registers for |
|
423 64 bit return type. |
|
424 |
|
425 5) Recommended coding rules for best performance |
|
426 |
|
427 - Use globals to represent the parts of the QEMU CPU state which are |
|
428 often modified, e.g. the integer registers and the condition |
|
429 codes. TCG will be able to use host registers to store them. |
|
430 |
|
431 - Avoid globals stored in fixed registers. They must be used only to |
|
432 store the pointer to the CPU state and possibly to store a pointer |
|
433 to a register window. |
|
434 |
|
435 - Use temporaries. Use local temporaries only when really needed, |
|
436 e.g. when you need to use a value after a jump. Local temporaries |
|
437 introduce a performance hit in the current TCG implementation: their |
|
438 content is saved to memory at end of each basic block. |
|
439 |
|
440 - Free temporaries and local temporaries when they are no longer used |
|
441 (tcg_temp_free). Since tcg_const_x() also creates a temporary, you |
|
442 should free it after it is used. Freeing temporaries does not yield |
|
443 a better generated code, but it reduces the memory usage of TCG and |
|
444 the speed of the translation. |
|
445 |
|
446 - Don't hesitate to use helpers for complicated or seldom used target |
|
447 intructions. There is little performance advantage in using TCG to |
|
448 implement target instructions taking more than about twenty TCG |
|
449 instructions. |
|
450 |
|
451 - Use the 'discard' instruction if you know that TCG won't be able to |
|
452 prove that a given global is "dead" at a given program point. The |
|
453 x86 target uses it to improve the condition codes optimisation. |