The microarchitecture of Intel and AMD CPU's

An optimization guide for assembly programmers and compiler makers

By Agner Fog. Copenhagen University College of Engineering.

Contents

1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into uops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
3.2 Branch prediction in P1
3.3 Branch prediction in PMMX, PPro, P2, and P3
3.4 Branch prediction in P4 and P4E
3.5 Branch prediction in PM and Core2
3.6 Branch prediction in AMD64
3.7 Indirect jumps (all processors except PM and Core2)
3.8 Returns (all processors except P1)
3.9 Static prediction
3.10 Close jumps
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
5.1 The pipeline in PPro, P2 and P3
5.2 Instruction fetch
5.3 Instruction decoding
5.4 Register renaming
5.5 ROB read
5.6 Out of order execution
5.7 Retirement
5.8 Partial register stalls
5.9 Partial memory stalls
5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
6.1 The pipeline in PM
6.2 The pipeline in Core Solo and Duo
6.3 Instruction fetch
6.4 Instruction decoding
6.5 Loop buffer
6.6 Micro-op fusion
6.7 Stack engine
6.8 Register renaming
6.9 Register read stalls
6.10 Execution units
6.11 Execution units that are connected to both port 0 and 1
6.12 Retirement
6.13 Partial register access
6.14 Partial memory stalls
6.15 Bottlenecks in PM
7 Core 2 pipeline
7.1 Pipeline
7.2 Instruction fetch and predecoding
7.3 Instruction decoding
7.4 Micro-op fusion
7.5 Macro-op fusion
7.6 Stack engine
7.7 Register renaming
7.8 Register read stalls
7.9 Execution units
7.10 Retirement
7.11 Partial register access
7.12 Partial memory stalls
7.13 Cache and memory access
7.14 Breaking dependence chains
7.15 Bottlenecks in Core2
8 Pentium 4 (NetBurst) pipeline
8.1 Data cache
8.2 Trace cache
8.3 Instruction decoding
8.4 Execution units
8.5 Do the floating point and MMX units run at half speed?
8.6 Transfer of data between execution units
8.7 Retirement
8.8 Partial registers and partial flags
8.9 Partial memory access
8.10 Memory intermediates in dependence chains
8.11 Breaking dependence chains
8.12 Choosing the optimal instructions
8.13 Bottlenecks in P4 and P4E
9 AMD64 pipeline
9.1 The pipeline in AMD64
9.2 Instruction fetch
9.3 Predecoding and instruction length decoding
9.4 Single, double and vector path instructions
9.5 Integer execution pipes
9.6 Floating point execution pipes
9.7 Mixing instructions with different latency
9.8 64 bit versus 128 bit instructions
9.9 Data delay between differently typed instructions
9.10 Partial register access
9.11 Partial flag access
9.12 Partial memory stalls
9.13 Loops
9.14 Cache
9.15 Bottlenecks in AMD64
10 Comparison of microarchitectures
10.1 The AMD kernel
10.2 The Pentium 4 kernel
10.3 The Pentium M kernel
10.4 Intel Core 2 microarchitecture
10.5 Conclusion
10.6 Future trends
11 Literature

1 Introduction

1.1 About this manual

This is the third in a series of five manuals:

1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.

2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.

3. The microarchitecture of Intel and AMD CPU's: An optimization guide for assembly programmers and compiler makers.

4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.

5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize

The present manual describes the details of the microarchitectures of x86 microprocessors from Intel and AMD. The Itanium processor is not covered. The purpose of this manual is to enable assembly programmers and compiler makers to optimize software for a specific microprocessor. The main focus is on details that are relevant to calculations of how much time a piece of code takes to execute, such as the latencies of different execution units and the throughputs of various parts of the pipelines. Branch prediction algorithms are also covered in detail.

This manual will also be interesting to students of microarchitecture. But it must be noted that the technical descriptions are mostly based on my own research, which is limited to what is measurable. The descriptions of the "mechanics" of the pipelines are therefore limited to what can be measured by counting clock cycles or micro-operations (uops) and what can be deduced from these measurements. Mechanistic explanations in this manual should be regarded as a model which is useful for predicting microprocessor behavior. I have no way of knowing with certainty whether it is in accordance with the actual physical structure of the microprocessors. The main purpose of providing this information is to enable programmers and compiler makers to optimize their code.

On the other hand, my method of deducing information from measurements rather than relying on information published by microprocessor vendors provides a lot of new information that cannot be found anywhere else. Technical details published by microprocessor vendors are often superficial, incomplete, selective and sometimes misleading.

My findings are sometimes in disagreement with data published by microprocessor vendors. Reasons for this discrepancy might be that such data are theoretical while my data are obtained experimentally under a particular set of testing conditions. I do not claim that all information in this manual is exact. Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that microprocessor vendors base their technical manuals on.

I have done tests in various processor modes: unprotected and protected, 16-bit, 32-bit and 64-bit. Most timing results are independent of the processor mode. Important differences are noted where appropriate. Far jumps, far calls and interrupts have mostly been tested in 16-bit mode. Call gates etc. have not been tested. The detailed timing results are listed in manual 4: "Instruction tables".

Most of the information in this manual is based on my own research. Many people have sent me useful information and corrections, which I am very thankful for. I keep updating the manual whenever I have new important information. This manual is therefore more detailed, comprehensive and exact than other sources of information, and it contains many details not found anywhere else.

This manual is not for beginners. It is assumed that the reader has a good understanding of assembly programming and microprocessor architecture. If not, then please read some books on the subject and get some programming experience before you begin doing complicated optimizations. See the literature list in manual 2: "Optimizing subroutines in assembly language" or follow the links from www.agner.org/optimize

You may skip the chapters describing old microprocessor designs unless you are using these processors in embedded systems or you are interested in historical developments in microarchitecture.

Please don't send your programming questions to me. I am not going to do your homework for you. There are various discussion forums on the Internet where you can get answers to your programming questions if you cannot find the answers in the relevant books and manuals.

1.2 Microprocessor versions covered by this manual

The following families of x86 microprocessors are discussed in this manual:

Microprocessor name                            Abbreviation
Intel Pentium (without name suffix)            P1
Intel Pentium MMX                              PMMX
Intel Pentium Pro                              PPro
Intel Pentium II                               P2
Intel Pentium III                              P3
Intel Pentium 4 (NetBurst)                     P4
Intel Pentium 4 with EM64T, Pentium D, etc.    P4E
Intel Pentium M, Core Solo, Core Duo           PM
Intel Core 2                                   Core2
AMD Athlon 64, Opteron, etc.                   AMD64

Table 1.1. Microprocessor families

The abbreviations here are intended to distinguish between different kernel microarchitectures, regardless of trade names. The commercial names of microprocessors often blur the distinctions between different kernel technologies. The name Celeron applies to P2, P3, P4 or PM with less cache than the standard versions. The name Xeon applies to P2, P3, P4 or Core2 with more cache than the standard versions. The names Pentium D and Pentium Extreme Edition refer to P4E with multiple cores. The name Centrino applies to Pentium M, Core Solo and Core Duo processors. Core Solo is rather similar to Pentium M. Core Duo is similar too, but with two cores.

The name Sempron applies to a low-end version of Athlon 64 with less cache. Turion 64 is a mobile version. Opteron is a server version with more cache. Some versions of P4E, PM, Core2 and AMD64 processors have multiple cores.

The P1 and PMMX processors represent the fifth generation in the Intel x86 series of
microprocessors, and their processor kernels are very similar. PPro, P2 and P3 all have the
sixth generation kernel. These three processors are almost identical except for the fact that
new instructions are added to each new model. P4 is the first processor in the seventh
generation which, for obscure reasons, is not called seventh generation in Intel documents.
Quite unexpectedly, the generation number (the family number) returned by the CPUID
instruction in the P4 is not 7 but 15. The confusion is complete when the subsequent Intel
CPUs, Pentium M, Core and Core2, report generation number 6.
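The generation number discussed above is the family field of the processor signature that CPUID leaf 1 returns in EAX. The following is a minimal sketch of how that value can be decoded, assuming the commonly documented bit layout (stepping in bits 0-3, model in bits 4-7, family in bits 8-11, extended model in bits 16-19, extended family in bits 20-27). The function name and the two sample signature values are illustrative assumptions, not taken from this manual.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX value into family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family field is 15.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model is only used for base family 6 or 15 (Intel's rule).
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    # Illustrative signatures: one with family 15, one with family 6.
    for signature in (0x00000F29, 0x000006FB):
        print(hex(signature), decode_cpuid_signature(signature))
```

A family value of 15 corresponds to what the text calls P4 and P4E, while a family value of 6 covers both the sixth generation processors and the later PM and Core2, so the model number is needed to tell those apart.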

I have not tested earlier processors from AMD and cannot give a detailed description of

them. The 32-bit Athlon is similar to the Athlon 64 but less efficient in some respects.

Important differences are briefly mentioned.

The reader should be aware that different generations of microprocessors behave very

differently. Also, the Intel and AMD microarchitectures are very different. What is optimal for

one generation or one brand may not be optimal for the others.

2 Out-of-order execution (All processors except P1, PMMX)

The sixth generation of microprocessors, beginning with the PPro, provided an important
improvement in microarchitecture design, namely out-of-order execution. The idea is that if
the execution of a particular instruction is delayed because the input data for the instruction
are not available yet, then the microprocessor will try to find later instructions that it can do
first, if the input data for the later instructions are ready. Obviously, the microprocessor has
to check if the later instructions need the output from the earlier instruction. If each
instruction depends on the result of the preceding instruction, then we have no opportunities
for out-of-order execution. This is called a dependence chain. Manual 2: "Optimizing
subroutines in assembly language" gives examples of how to avoid long dependence
chains.

The logic for determining input dependences and the mechanisms for doing instructions as
soon as the necessary inputs are ready give us the further advantage that the
microprocessor can do several things at the same time. If we need to do an addition and a
multiplication, and neither instruction depends on the output of the other, then we can do
both at the same time, because they are using two different execution units. But we cannot
do two multiplications at the same time if we have only one multiplication unit.
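The interplay between dependences and execution units can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions that are not taken from the manual: every operation takes one clock cycle, the oldest ready instruction gets first pick of a free unit, and the names `Instr` and `schedule` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str                                  # label used to refer to the result
    unit: str                                  # execution unit needed, e.g. "add" or "mul"
    deps: list = field(default_factory=list)   # names of earlier instructions whose output is read


def schedule(instrs, units):
    """Return {name: start_cycle} for a toy out-of-order machine.

    instrs is given in program order; units maps a unit name to the number
    of such units. Every operation is assumed to take one clock cycle.
    """
    done_at = {}             # name -> first cycle in which the result can be used
    start = {}
    pending = list(instrs)
    cycle = 0
    while pending:
        free = dict(units)   # units still unused in this cycle
        issued = []
        for ins in pending:  # the oldest instruction gets first pick
            ready = all(done_at.get(d, cycle + 1) <= cycle for d in ins.deps)
            if ready and free.get(ins.unit, 0) > 0:
                free[ins.unit] -= 1
                start[ins.name] = cycle
                issued.append(ins)
        if not issued:
            raise ValueError("no instruction can start: missing unit or dependency")
        for ins in issued:
            done_at[ins.name] = cycle + 1
            pending.remove(ins)
        cycle += 1
    return start


if __name__ == "__main__":
    units = {"add": 1, "mul": 1}
    # Independent addition and multiplication: both start in cycle 0.
    print(schedule([Instr("a", "add"), Instr("m", "mul")], units))
    # Two multiplications but only one multiplier: the second one waits.
    print(schedule([Instr("m1", "mul"), Instr("m2", "mul")], units))
    # A dependence chain cannot be parallelized even with two adders.
    chain = [Instr("x", "add"), Instr("y", "add", ["x"]), Instr("z", "add", ["y"])]
    print(schedule(chain, {"add": 2}))
```

In the last case the second adder stays idle, because each instruction has to wait for the result of the one before it.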

Typically, everything in these microprocessors is highly pipelined in order to improve the

throughput. If, for example, a floating point addition takes 4 clock cycles, and the execution

unit is fully pipelined, then we can start one addition at time T, which will be finished at time

T+4, and start another addition at time T+1, which will be finished at time T+5. The

advantage of this technology is therefore highest if the code can be organized so that there

are as few dependences as possible between successive instructions.
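The timing in the paragraph above can be reproduced with a few lines of arithmetic. The sketch below assumes a single, fully pipelined unit that can accept one new operation per clock cycle and sets T = 0; the function name `finish_times` is made up for this illustration.

```python
def finish_times(count, latency, dependent):
    """Cycle at which each of `count` additions finishes on one pipelined unit.

    If `dependent` is true, every addition needs the result of the previous
    one (a dependence chain); otherwise all additions are independent.
    """
    times = []
    start = 0
    for _ in range(count):
        finish = start + latency
        times.append(finish)
        # A dependent addition has to wait for the previous result, while an
        # independent one can enter the pipeline on the very next clock cycle.
        start = finish if dependent else start + 1
    return times


if __name__ == "__main__":
    print(finish_times(4, latency=4, dependent=False))  # overlapped in the pipeline
    print(finish_times(4, latency=4, dependent=True))   # serialized by dependences
```

With a latency of 4, four independent additions finish at cycles 4, 5, 6 and 7, while four additions in a dependence chain finish at cycles 4, 8, 12 and 16.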

2.1 Instructions are split into uops

The microprocessors with out-of-order execution translate all instructions into
micro-operations, abbreviated µops or uops. A simple instruction such as ADD EAX,EBX
generates only one uop, while an instruction like ADD EAX,[MEM1] may generate two: one
for reading from memory into a temporary (unnamed) register, and one for adding the
contents of the temporary register to EAX. The instruction ADD [MEM1],EAX may
generate three uops: one for reading from memory, one for adding, and one for writing the
result back to memory. The advantage of this is that the uops can be executed out of order.
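The three cases described above can be written down as a small model. This is a sketch only: the uop counts follow the "may generate" wording of the text and differ between real processors, and the function name, the `tmp` register and the textual uop format are assumptions made for illustration.

```python
def split_add_into_uops(dest, src):
    """Return an illustrative list of uops for the instruction ADD dest,src.

    An operand written in square brackets, such as "[MEM1]", is a memory operand.
    """
    def is_memory(operand):
        return operand.startswith("[") and operand.endswith("]")

    if is_memory(dest) and is_memory(src):
        raise ValueError("ADD cannot take two memory operands")
    if is_memory(src):
        # Read-modify: load into a temporary register, then add to the destination.
        return [f"load  tmp <- {src}",
                f"add   {dest} <- {dest} + tmp"]
    if is_memory(dest):
        # Read-modify-write: load, add, then store the result back to memory.
        return [f"load  tmp <- {dest}",
                f"add   tmp <- tmp + {src}",
                f"store {dest} <- tmp"]
    # Register-to-register: a single uop.
    return [f"add   {dest} <- {dest} + {src}"]


if __name__ == "__main__":
    for dest, src in (("EAX", "EBX"), ("EAX", "[MEM1]"), ("[MEM1]", "EAX")):
        print(f"ADD {dest},{src}")
        for uop in split_add_into_uops(dest, src):
            print("   ", uop)
```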

Example:
