9.3 Instruction fetch, decoding and retirement ................................................................ 62
9.4 Instruction latency and throughput ............................................................................ 63
9.5 Break dependency chains......................................................................................... 64
9.6 Jumps and calls........................................................................................................ 65
10 Optimizing for size......................................................................................................... 72
10.1 Choosing shorter instructions.................................................................................. 72
10.2 Using shorter constants and addresses .................................................................. 73
10.3 Reusing constants .................................................................................................. 75
10.4 Constants in 64-bit mode ........................................................................................ 75
10.5 Addresses and pointers in 64-bit mode................................................................... 75
10.6 Making instructions longer for the sake of alignment............................................... 77
11 Optimizing memory access............................................................................................ 80
11.1 How caching works................................................................................................. 80
11.2 Trace cache............................................................................................................ 81
11.3 Alignment of data.................................................................................................... 82
11.4 Alignment of code ................................................................................................... 84
11.5 Organizing data for improved caching..................................................................... 86
11.6 Organizing code for improved caching.................................................................... 86
11.7 Cache control instructions....................................................................................... 87
12 Loops ............................................................................................................................ 87
12.1 Minimize loop overhead .......................................................................................... 87
12.2 Induction variables .................................................................................................. 90
12.3 Move loop-invariant code........................................................................................ 91
12.4 Find the bottlenecks................................................................................................ 91
12.5 Instruction fetch, decoding and retirement in a loop ................................................ 92
12.6 Distribute µops evenly between execution units...................................................... 92
12.7 An example of analysis for bottlenecks on PM........................................................ 93
12.8 Same example on Core2 ........................................................................................ 96
12.9 Loop unrolling ......................................................................................................... 98
12.10 Optimize caching ................................................................................................ 100
12.11 Parallelization ..................................................................................................... 101
12.12 Analyzing dependences...................................................................................... 102
12.13 Loops on processors without out-of-order execution ........................................... 105
12.14 Macro loops ........................................................................................................ 107
13 Vector programming.................................................................................................... 109
13.1 Conditional moves in SIMD registers .................................................................... 110
13.2 Using vector instructions with other types of data than they are intended for ........ 113
13.3 Shuffling data........................................................................................................ 115
13.4 Generating constants............................................................................................ 118
13.5 Accessing unaligned data ..................................................................................... 121
13.6 Using AVX instruction set and YMM registers....................................................... 125
13.7 Vector operations in general purpose registers ..................................................... 130
14 Multithreading.............................................................................................................. 131
14.1 Hyperthreading ..................................................................................................... 132
15 CPU dispatching.......................................................................................................... 132
15.1 Checking for operating system support for XMM and YMM registers .................... 134
16 Problematic Instructions .............................................................................................. 135
16.1 LEA instruction (all processors)............................................................................. 135
16.2 INC and DEC (all Intel processors) ....................................................................... 136
16.3 XCHG (all processors) .......................................................................................... 136
16.4 Shifts and rotates (P4) .......................................................................................... 136
16.5 Rotates through carry (all processors) .................................................................. 137
16.6 Bit test (all processors) ......................................................................................... 137
16.7 LAHF and SAHF (all processors).......................................................................... 137
16.8 Integer multiplication (all processors).................................................................... 137
16.9 Division (all processors)........................................................................................ 137
16.10 String instructions (all processors) ...................................................................... 142
16.11 WAIT instruction (all processors) ........................................................................ 143