Best Practice Guide - AMD EPYC
Xu Guo, EPCC, UK
Ole Widar Saastad (Editor), University of Oslo, Norway
Version 2.0 by 18-02-2019

Table of Contents
1. Introduction
2. System Architecture / Configuration
2.1. Processor Architecture
2.2. Memory Architecture
2.2.1. Memory Bandwidth Benchmarking
3. Programming Environment / Basic Porting
3.1. Available Compilers
3.1.1. Compiler Flags
3.1.2. Compiler Performance
3.2. Available (Optimized) Numerical Libraries
3.2.1. Performance of libraries
3.2.2. Examples of numerical library usage
3.3. Available MPI Implementations
3.4. OpenMP
3.4.1. Compiler Flags
3.5. Basic Porting Examples
3.5.1. OpenSBLI
3.5.2. CASTEP
3.5.3. GROMACS
4. Performance Analysis
4.1. Available Performance Analysis Tools
4.1.1. perf (Linux utility)
4.1.2. AMD µProf
4.1.3. Performance reports
4.2. General Hints for Interpreting Results from all tools
5. Tuning
5.1. Advanced / Aggressive Compiler Flags
5.1.1. GNU compiler
5.1.2. Intel compiler
5.1.3. PGI (Portland) compiler
5.1.4. Compilers and flags
5.2. Single Core Optimization
5.2.1. Replace libm library
5.3. Advanced OpenMP Usage
5.3.1. Tuning / Environment Variables
5.3.2. Thread Affinity
5.4. Memory Optimization
5.4.1. Memory Affinity (OpenMP/MPI/Hybrid)
5.4.2. Memory Allocation (malloc) Tuning
5.4.3. Using Huge Pages
5.4.4. Monitoring NUMA pages
5.5. Possible Kernel Parameter Tuning
5.5.1. NUMA control
5.5.2. Scheduling control
6. Debugging
6.1. Available Debuggers
6.2. Compiler Flags
Further documentation

1. Introduction
Figure 1. The AMD EPYC Processor chip
The EPYC processors are the latest generation of processors from AMD Inc. Although they have not yet been widely adopted in systems on the TOP500 list, their performance might change this in the future.
The processors are based on the x86-64 architecture and provide vector units for a range of data types, the most relevant being 64-bit floating point. The vector units are 256 bits wide and can operate on four double-precision (64-bit) numbers at a time. The processors feature a high number of memory controllers, 8 in the EPYC 7601 model (see [6] for details) that was used for evaluation while writing this guide. They also provide 128 PCIe version 3.0 lanes.
This guide describes how to use the AMD EPYC processors in an HPC environment and reports some experiences with common tools on this processor, in addition to giving a general overview of the architecture and memory system. Because EPYC is a NUMA architecture, information about the nature of its NUMA layout is relevant. Some tuning and optimization techniques, as well as debugging, are also covered.
This mini guide covers the following tools: compilers, performance libraries, threading libraries (OpenMP), message passing libraries (MPI), memory access and allocation libraries, debuggers, performance profilers, etc.
Some benchmarks comparing compilers and libraries have been performed, and some recommendations and hints about how to use the Intel tools with this processor are presented. Although the AMD EPYC is an x86-64 architecture, it is not fully compatible with Intel processors when it comes to the new features found in the latest generations of Intel processors. Issues might arise when using highly optimized versions of the Intel libraries.
In contrast to the Intel tools, the GNU tools and tools from other independent vendors fully support EPYC. A set of compilers and development tools has been tested with satisfactory results.

2. System Architecture / Configuration
2.1. Processor Architecture
The x86 EPYC processor, designed by AMD, is a System-on-Chip (SoC) composed of up to 32 Zen cores per SoC. The Zen core supports Simultaneous Multithreading (SMT), which allows each core to run two threads, giving a maximum of 64 threads per CPU. Each EPYC processor provides 8 memory channels and 128 PCIe 3.0 lanes. EPYC supports both 1-socket and 2-socket configurations. In the 2-socket configuration, half of the PCIe lanes from each processor are used for communication between the two CPUs through AMD’s socket-to-socket interconnect, Infinity Fabric [3][4][5][7].
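The thread and lane counts in the paragraph above can be checked with a little arithmetic. The following is a minimal sketch using only the figures quoted in the text; the function and constant names are illustrative and not part of any AMD tool.

```python
# Figures quoted in the text for a fully populated EPYC processor.
CORES_PER_SOCKET = 32
SMT_THREADS_PER_CORE = 2
PCIE_LANES_PER_SOCKET = 128


def hardware_threads(sockets: int) -> int:
    """Hardware threads visible to the OS with SMT enabled."""
    return sockets * CORES_PER_SOCKET * SMT_THREADS_PER_CORE


def external_pcie_lanes(sockets: int) -> int:
    """PCIe lanes left for devices after the socket-to-socket link."""
    if sockets == 1:
        return PCIE_LANES_PER_SOCKET
    if sockets == 2:
        # Half of each processor's lanes carry the Infinity Fabric link.
        return sockets * (PCIE_LANES_PER_SOCKET // 2)
    raise ValueError("EPYC supports 1-socket and 2-socket systems only")


for n in (1, 2):
    print(n, "socket(s):", hardware_threads(n), "threads,",
          external_pcie_lanes(n), "PCIe lanes")
```

Under these figures a 2-socket system doubles the thread count, while the number of lanes available for devices stays at 128.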
Several resources provide information about the Zen architecture and the EPYC processor. The WikiChip web site is generally a good source of information [6]. The figures and much of the information below are taken from the WikiChip web pages and their article about Zen, where detailed information about cache sizes, pipelining, TLBs, etc. can be found. The table below lists only the cache and TLB sizes, as they might be of use to many programmers.
Table 1. Cache sizes and related information
Cache level/type    Size and information
L0 µOP              2,048 µOPs, 8-way set associative, 32 sets, 8-µOP line size
L1 instruction      64 KiB, 4-way set associative, 256 sets, 64 B line size, shared by the two threads, per core
L1 data             32 KiB, 8-way set associative, 64 sets, 64 B line size, write-back policy, 4-5 cycles latency for Int, 7-8 cycles latency for FP
L2                  512 KiB, 8-way set associative, 1,024 sets, 64 B line size, write-back policy, inclusive of L1, 17 cycles latency
L3                  Victim cache, 8 MiB/CCX, shared across all cores of the CCX, 16-way set associative, 8,192 sets, 64 B line size, 40 cycles latency
TLB instructions    8-entry L0 TLB, all page sizes; 64-entry L1 TLB, all page sizes; 512-entry L2 TLB, no 1G pages
TLB data            64-entry L1 TLB, all page sizes; 1,536-entry L2 TLB, no 1G pages
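The set counts in Table 1 follow from the usual relation sets = size / (ways * line size). The sketch below checks the table entries against that relation; the helper name `cache_sets` and the `CACHES` dictionary are illustrative only.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a multiple of ways * line size")
    return sets


# name: (size in bytes, ways, line size in bytes, sets listed in Table 1)
CACHES = {
    "L1 instruction": (64 * KIB, 4, 64, 256),
    "L1 data": (32 * KIB, 8, 64, 64),
    "L2": (512 * KIB, 8, 64, 1024),
    "L3 (per CCX)": (8 * MIB, 16, 64, 8192),
}

for name, (size, ways, line, listed) in CACHES.items():
    computed = cache_sets(size, ways, line)
    status = "matches" if computed == listed else "differs from"
    print(f"{name}: {computed} sets, {status} Table 1")
```

The same relation is handy when sizing blocked loops: a working set that maps more lines to one set than the cache has ways will start evicting its own data.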

Figure 2. Zen Block diagram
The Zen core contains a large number of different units, and it is not a simple task to figure out how two threads are scheduled on this array of execution units. The core is divided into two parts: an in-order front end and an out-of-order execution part. The front end decodes the x86-64 instructions into micro-operations, which a scheduler sends to the execution part. There is one unit for integer arithmetic and one for floating point arithmetic, and hence two separate pipelines, one for integer and one for floating point operations.
The floating point part handles all vector operations. The vector units are of special interest because they perform the vectorized floating point operations. There are four 128-bit vector units: two for multiplications, including fused multiply-add, and two for additions. Combined, they can perform 256-bit wide AVX2 instructions. The chip is optimized for 128-bit operations. The simple integer vector operations (e.g. shift, add) can all be done in one cycle, half the latency of AMD's previous architecture. Basic floating point math, including multiplication, has a latency of three cycles (one additional cycle for double precision). Fused multiply-add (FMA) has a latency of five cycles.
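One practical consequence of the latencies above is that a loop needs several independent operations in flight to keep the FMA units busy. The sketch below applies the usual latency times throughput estimate to the figures in the text. It assumes that each FMA unit is fully pipelined and can start a new operation every cycle, which the text does not state, and the names are illustrative.

```python
FMA_LATENCY_CYCLES = 5   # from the text
FMA_UNITS = 2            # the two 128-bit multiply/FMA units
UNIT_WIDTH_BITS = 128
DOUBLE_BITS = 64


def independent_ops_needed(latency_cycles: int, units: int) -> int:
    """Independent operations needed in flight to issue on every unit
    each cycle, assuming fully pipelined units (an assumption)."""
    return latency_cycles * units


ops = independent_ops_needed(FMA_LATENCY_CYCLES, FMA_UNITS)
doubles = ops * (UNIT_WIDTH_BITS // DOUBLE_BITS)
print(f"{ops} independent 128-bit FMAs in flight, "
      f"i.e. {doubles} double-precision accumulators")
```

Under these assumptions, a reduction with a single accumulator is limited by the FMA latency, whereas a loop unrolled over several independent accumulators can approach the issue rate of the units.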
AMD claims that the theoretical double-precision (DP) floating point performance can be calculated as:
Theoretical DP performance = #real_cores * 8 DP flop/clk * core frequency
For a 2-socket system this gives 2 * 32 cores * 8 DP flop/clk * 2.2 GHz = 1126.4 Gflops. This counts an FMA as two flops.
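The peak figure above is easy to reproduce for other core counts and clock frequencies. This is a minimal sketch of the formula; the function name is illustrative, and the breakdown of the 8 flop/clk into two 128-bit FMA units is one reading of the unit description in this section rather than a statement from AMD.

```python
def peak_dp_gflops(sockets: int, cores_per_socket: int,
                   flops_per_clk: int, freq_ghz: float) -> float:
    """Theoretical double-precision peak in Gflops."""
    return sockets * cores_per_socket * flops_per_clk * freq_ghz


# 2 FMA units * 2 doubles per 128-bit operation * 2 flops per FMA
FLOPS_PER_CLK = 2 * (128 // 64) * 2

peak = peak_dp_gflops(sockets=2, cores_per_socket=32,
                      flops_per_clk=FLOPS_PER_CLK, freq_ghz=2.2)
print(f"{peak:.1f} Gflops")
```

Only real cores enter the formula: SMT threads share the floating point units of their core and do not raise the theoretical peak.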