optimization_manuals (a set of manuals by a foreign author on various program optimization methods)

Uploaded 2009-06-17 10:31:59 · 3.18 MB ZIP

optimization_manuals.zip (7 files)

cppexamples.zip 59KB

microarchitecture.pdf 1.14MB

instruction_tables.pdf 978KB

asmexamples.zip 14KB

optimizing_cpp.pdf 769KB

optimizing_assembly.pdf 824KB

calling_conventions.pdf 356KB

The microarchitecture of Intel and AMD CPU's

An optimization guide for assembly programmers and compiler makers

By Agner Fog. Copenhagen University College of Engineering.

Contents

1 Introduction
1.1 About this manual
1.2 Microprocessor versions covered by this manual
2 Out-of-order execution (All processors except P1, PMMX)
2.1 Instructions are split into µops
2.2 Register renaming
3 Branch prediction (all processors)
3.1 Prediction methods for conditional jumps
3.2 Branch prediction in P1
3.3 Branch prediction in PMMX, PPro, P2, and P3
3.4 Branch prediction in P4 and P4E
3.5 Branch prediction in PM and Core2
3.6 Branch prediction in AMD
3.7 Indirect jumps on older processors
3.8 Returns (all processors except P1)
3.9 Static prediction
3.10 Close jumps
4 Pentium 1 and Pentium MMX pipeline
4.1 Pairing integer instructions
4.2 Address generation interlock
4.3 Splitting complex instructions into simpler ones
4.4 Prefixes
4.5 Scheduling floating point code
5 Pentium Pro, II and III pipeline
5.1 The pipeline in PPro, P2 and P3
5.2 Instruction fetch
5.3 Instruction decoding
5.4 Register renaming
5.5 ROB read
5.6 Out of order execution
5.7 Retirement
5.8 Partial register stalls
5.9 Store forwarding stalls
5.10 Bottlenecks in PPro, P2, P3
6 Pentium M pipeline
6.1 The pipeline in PM
6.2 The pipeline in Core Solo and Duo
6.3 Instruction fetch
6.4 Instruction decoding
6.5 Loop buffer
6.6 Micro-op fusion
6.7 Stack engine
6.8 Register renaming
6.9 Register read stalls
6.10 Execution units
6.11 Execution units that are connected to both port 0 and 1
6.12 Retirement
6.13 Partial register access
6.14 Store forwarding stalls
6.15 Bottlenecks in PM
7 Core 2 pipeline
7.1 Pipeline
7.2 Instruction fetch and predecoding
7.3 Instruction decoding
7.4 Micro-op fusion
7.5 Macro-op fusion
7.6 Stack engine
7.7 Register renaming
7.8 Register read stalls
7.9 Execution units
7.10 Retirement
7.11 Partial register access
7.12 Store forwarding stalls
7.13 Cache and memory access
7.14 Breaking dependence chains
7.15 Bottlenecks in Core2
8 Pentium 4 (NetBurst) pipeline
8.1 Data cache
8.2 Trace cache
8.3 Instruction decoding
8.4 Execution units
8.5 Do the floating point and MMX units run at half speed?
8.6 Transfer of data between execution units
8.7 Retirement
8.8 Partial registers and partial flags
8.9 Store forwarding stalls
8.10 Memory intermediates in dependence chains
8.11 Breaking dependence chains
8.12 Choosing the optimal instructions
8.13 Bottlenecks in P4 and P4E
9 AMD pipeline
9.1 The pipeline in AMD processors
9.2 Instruction fetch
9.3 Predecoding and instruction length decoding
9.4 Single, double and vector path instructions
9.5 Stack engine
9.6 Integer execution pipes
9.7 Floating point execution pipes
9.8 Mixing instructions with different latency
9.9 64 bit versus 128 bit instructions
9.10 Data delay between differently typed instructions
9.11 Partial register access
9.12 Partial flag access
9.13 Store forwarding stalls
9.14 Loops
9.15 Cache
9.16 Bottlenecks in AMD
10 Comparison of microarchitectures
10.1 The AMD kernel
10.2 The Pentium 4 kernel
10.3 The Pentium M kernel
10.4 Intel Core 2 microarchitecture
10.5 Conclusion
10.6 Future trends
11 Literature

1 Introduction

1.1 About this manual

This is the third in a series of five manuals:

1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.

2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.

3. The microarchitecture of Intel and AMD CPU's: An optimization guide for assembly programmers and compiler makers.

4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.

5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize

The present manual describes the details of the microarchitectures of x86 microprocessors from Intel and AMD. The Itanium processor is not covered. The purpose of this manual is to enable assembly programmers and compiler makers to optimize software for a specific microprocessor. The main focus is on details that are relevant to calculations of how much time a piece of code takes to execute, such as the latencies of different execution units and the throughputs of various parts of the pipelines. Branch prediction algorithms are also covered in detail.
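To make the kind of timing calculation mentioned above more concrete, here is a minimal sketch of a bottleneck estimate for one loop iteration. It is not a method taken from this manual. The function name `cycles_per_iteration`, its parameters, and all numbers in the example are hypothetical, and the model deliberately ignores cache misses, branch mispredictions and other stalls. Real latencies and throughputs would have to be taken from manual 4: "Instruction tables".

```python
def cycles_per_iteration(dep_chain_latency, total_uops, issue_width, uops_per_port):
    """Rough lower bound on clock cycles per loop iteration.

    dep_chain_latency: summed latency of the loop-carried dependency chain
    total_uops:        number of micro-operations in one iteration
    issue_width:       micro-operations the pipeline can handle per clock
    uops_per_port:     dict mapping an execution port name to the number of
                       micro-operations that must go through that port
    All inputs are illustrative assumptions supplied by the caller.
    """
    latency_bound = dep_chain_latency                    # limited by dependencies
    issue_bound = total_uops / issue_width               # limited by pipeline width
    port_bound = max(uops_per_port.values(), default=0)  # limited by busiest port
    # The slowest of the three limits determines the estimate.
    return max(latency_bound, issue_bound, port_bound)


# Hypothetical loop: 3-clock dependency chain, 8 micro-operations,
# a pipeline that handles 3 per clock, and three execution ports.
estimate = cycles_per_iteration(3, 8, 3, {"p0": 2, "p1": 3, "p2": 1})
print(estimate)
```

Whichever of the three limits is largest points to the likely bottleneck, and that is where optimization effort is best spent first.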

This manual will also be interesting to students of microarchitecture. But it must be noted that the technical descriptions are mostly based on my own research, which is limited to what is measurable. The descriptions of the "mechanics" of the pipelines are therefore limited to what can be measured by counting clock cycles or micro-operations (µops) and what can be deduced from these measurements. Mechanistic explanations in this manual should be regarded as a model which is useful for predicting microprocessor behavior. I have no way of knowing with certainty whether it is in accordance with the actual physical structure of the microprocessors. The main purpose of providing this information is to enable programmers and compiler makers to optimize their code.

On the other hand, my method of deducing information from measurements rather than relying on information published by microprocessor vendors provides a lot of new information that cannot be found anywhere else. Technical details published by microprocessor vendors are often superficial, incomplete, selective and sometimes misleading.

My findings are sometimes in disagreement with data published by microprocessor vendors. Reasons for this discrepancy might be that such data are theoretical while my data are obtained experimentally under a particular set of testing conditions. I do not claim that all information in this manual is exact. Some timings etc. can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that microprocessor vendors base their technical manuals on.

I have done tests in various processor modes: unprotected and protected, 16-bit, 32-bit and 64-bit. Most timing results are independent of the processor mode. Important differences are noted where appropriate. Far jumps, far calls and interrupts have mostly been tested in 16-bit mode. Call gates etc. have not been tested. The detailed timing results are listed in manual 4: "Instruction tables".

Most of the information in this manual is based on my own research. Many people have sent me useful information and corrections, which I am very thankful for. I keep updating the manual whenever I have new important information. This manual is therefore more detailed, comprehensive and exact than other sources of information; and it contains many details not found anywhere else.

This manual is not for beginners. It is assumed that the reader has a good understanding of assembly programming and microprocessor architecture. If not, then please read some books on the subject and get some programming experience before you begin doing complicated optimizations. See the literature list in manual 2: "Optimizing subroutines in assembly language" or follow the links from www.agner.org/optimize

Readers may skip the chapters describing old microprocessor designs unless they are using these processors in embedded systems or are interested in historical developments in microarchitecture.

Please don't send your programming questions to me. I am not going to do your homework for you. There are various discussion forums on the Internet where you can get answers to your programming questions if you cannot find the answers in the relevant books and manuals.

1.2 Microprocessor versions covered by this manual

The following families of x86 microprocessors are discussed in this manual:

Microprocessor name                                  Abbreviation
Intel Pentium (without name suffix)                  P1
Intel Pentium MMX                                    PMMX
Intel Pentium Pro                                    PPro
Intel Pentium II                                     P2
Intel Pentium III                                    P3
Intel Pentium 4 (NetBurst)                           P4
Intel Pentium 4 with EM64T, Pentium D, etc.          P4E
Intel Pentium M, Core Solo, Core Duo                 PM
Intel Core 2                                         Core2
AMD Athlon                                           AMD K7
AMD Athlon 64, Opteron, etc., 64-bit                 AMD K8
AMD Family 10h, Phenom, third generation Opteron     AMD K10

Table 1.1. Microprocessor families

The abbreviations here are intended to distinguish between different kernel microarchitectures, regardless of trade names. The commercial names of microprocessors often blur the distinctions between different kernel technologies. The name Celeron applies to P2, P3, P4 or PM with less cache than the standard versions. The name Xeon applies to P2, P3, P4 or Core2 with more cache than the standard versions. The names Pentium D and Pentium Extreme Edition refer to P4E with multiple cores. The name Centrino applies to Pentium M, Core Solo and Core Duo processors. Core Solo is rather similar to Pentium M. Core Duo is similar too, but with two cores.

The name Sempron applies to a low-end version of Athlon 64 with less cache. Turion 64 is a mobile version. Opteron is a server version with more cache. Some versions of P4E, PM, Core2 and AMD processors have multiple cores.

The P1 and PMMX processors represent the fifth generation in the Intel x86 series of microprocessors, and their processor kernels are very similar. PPro, P2 and P3 all have the sixth generation kernel. These three processors are almost identical except for the fact that new instructions are added to each new model. P4 is the first processor in the seventh generation which, for obscure reasons, is not called seventh generation in Intel documents. Quite unexpectedly, the generation number returned by the CPUID instruction in the P4 is not 7 but 15. The confusion is complete when the subsequent Intel CPUs (Pentium M, Core and Core2) report generation number 6.
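The generation number discussed here corresponds to the family field that CPUID function 1 returns in the EAX register. The following is a minimal sketch of how that value can be split into family, model and stepping, using the Intel convention for the extended fields. The function name `decode_cpuid_signature` is hypothetical, and the sample signature values are assumptions chosen only to illustrate the bit layout, not measurements from this manual.

```python
def decode_cpuid_signature(eax):
    """Split the EAX value of CPUID function 1 into family, model, stepping."""
    stepping = eax & 0xF              # bits 0-3
    base_model = (eax >> 4) & 0xF     # bits 4-7
    base_family = (eax >> 8) & 0xF    # bits 8-11
    ext_model = (eax >> 16) & 0xF     # bits 16-19
    ext_family = (eax >> 20) & 0xFF   # bits 20-27

    # The extended family is only added when the base family field is 15.
    if base_family == 0xF:
        family = base_family + ext_family
    else:
        family = base_family

    # Intel convention: the extended model is used for base family 6 or 15.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


# Illustrative signature values (assumed, see the note above).
samples = {
    "P4-style signature": 0x00000F29,
    "PM-style signature": 0x000006D8,
    "Core2-style signature": 0x000006F6,
    "AMD Family 10h-style signature": 0x00100F22,
}
for name, eax in samples.items():
    print(name, decode_cpuid_signature(eax))
```

The first sample decodes to family 15 and the next two to family 6, which matches the numbering described above. The last sample shows how a base family of 15 plus an extended family of 1 gives family 16, that is, 10h.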

The reader should be aware that different generations of microprocessors behave very differently. Also, the Intel and AMD microarchitectures are very different. What is optimal for one generation or one brand may not be optimal for the others.
