3.
The microarchitecture of Intel and AMD CPUs
An optimization guide for assembly programmers and compiler makers
By Agner Fog. Copenhagen University College of Engineering.
Copyright © 1996 - 2009. Last updated 2009-05-05.
Contents
1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into µops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
3.2 Branch prediction in P1
3.3 Branch prediction in PMMX, PPro, P2, and P3
3.4 Branch prediction in P4 and P4E
3.5 Branch prediction in PM and Core2
3.6 Branch prediction in AMD
3.7 Indirect jumps on older processors
3.8 Returns (all processors except P1)
3.9 Static prediction
3.10 Close jumps
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
5.1 The pipeline in PPro, P2 and P3
5.2 Instruction fetch
5.3 Instruction decoding
5.4 Register renaming
5.5 ROB read
5.6 Out of order execution
5.7 Retirement
5.8 Partial register stalls
5.9 Store forwarding stalls
5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
6.1 The pipeline in PM
6.2 The pipeline in Core Solo and Duo
6.3 Instruction fetch
6.4 Instruction decoding
6.5 Loop buffer
6.6 Micro-op fusion
6.7 Stack engine
6.8 Register renaming
6.9 Register read stalls