IntroductiontoIntelAVX资源-CSDN文库

SIMD

需积分: 10 34 浏览量 2018-07-23 16:28:23 上传评论收藏 1.44MB PDF 举报

资源推荐

资源详情

资源评论

Introduction to Intel® Advanced Vector Extensions

By Chris Lomont

Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for doing Single

Instruction Multiple Data (SIMD) operations on Intel® architecture CPUs. These instructions

extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions

(Intel® SSE)) by adding the following new features:

 The 128-bit SIMD registers have been expanded to 256 bits. Intel® AVX is designed to

support 512 or 1024 bits in the future.

 Three-operand, nondestructive operations have been added. Previous two-operand

instructions performed operations such as A = A + B, which overwrites a source operand;

the new operands can perform operations like A = B + C, leaving the original source

operands unchanged.

 A few instructions take four-register operands, allowing smaller and faster code by

removing unnecessary instructions.

 Memory alignment requirements for operands are relaxed.

 A new extension coding scheme (VEX) has been designed to make future additions easier

as well as making coding of instructions smaller and faster to execute.

Closely related to these advances are the new Fused–Multiply–Add (FMA) instructions, which

allow faster and more accurate specialized operations such as single instruction A = A * B + C. The

FMA instructions should be available in the second-generation Intel® Core™ CPU. Other features

include new instructions for dealing with Advanced Encryption Standard (AES) encryption and

decryption, a packed carry-less multiplication operation (PCLMULQDQ) useful for certain

encryption primitives, and some reserved slots for future instructions, such as a hardware random

number generator.

Instruction Set Overview

The new instructions are encoded using what Intel calls a VEX prefix, which is a two- or three-byte

prefix designed to clean up the complexity of current and future x86/x64 instruction encoding.

The two new VEX prefixes are formed from two obsolete 32-bit instructions—Load Pointer Using

DS (LDS—0xC4, 3-byte form) and Load Pointer Using ES (LES—0xC5, two-byte form)—which load

the DS and ES segment registers in 32-bit mode. In 64-bit mode, opcodes LDS and LES generate an

invalid-opcode exception, but under Intel® AVX, these opcodes are repurposed for encoding new

instruction prefixes. As a result, the VEX instructions can only be used when running in 64-bit

mode. The prefixes allow encoding more registers than previous x86 instructions and are required

for accessing the new 256-bit SIMD registers or using the three- and four-operand syntax. As a

user, you do not need to worry about this (unless you’re writing assemblers or disassemblers).

Intel® Advanced Vector Extensions

23 May 2011

Data is memory aligned when the data to be operated upon as an n-byte chunk is stored on an n-

byte memory boundary. For example, when loading 256-bit data into YMM registers, if the data

source is 256-bit aligned, the data is called aligned.

For Intel® SSE operations, memory alignment was required unless explicitly stated. For example,

under Intel SSE, there were specific instructions for memory-aligned and memory-unaligned

operations, such as the MOVAPD (move-aligned packed double) and MOVUPD (move-unaligned

packed double) instructions. Instructions not split in two like this required aligned accesses.

Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows

unaligned access; however, this access may come at a performance slowdown, so the old rule of

designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access

and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the

SSE instructions that explicitly required memory-aligned data: These instructions still require

aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the

Intel® Advanced Vector Extensions Programming Reference (see “For More Information” for a link).

Another performance concern besides unaligned data issues is that mixing legacy XMM-only

instructions and newer Intel AVX instructions causes delays, so minimize transitions between

VEX-encoded instructions and legacy Intel SSE code. Said another way, do not mix VEX-prefixed

instructions and non–VEX-prefixed instructions for optimal throughput. If you must do so,

minimize transitions between the two by grouping instructions of the same VEX/non-VEX class.

Alternatively, there is no transition penalty if the upper YMM bits are set to zero via VZEROUPPER or

VZEROALL, which compilers should automatically insert. This insertion requires an extra

instruction, so profiling is recommended.

Intel® AVX Instruction Classes

As mentioned, Intel® AVX adds support for many new instructions and extends current Intel SSE

instructions to the new 256-bit registers, with most old Intel SSE instructions having a V-prefixed

Intel AVX version for accessing new register sizes and three-operand forms. Depending on how

instructions are counted, there are up to a few hundred new Intel AVX instructions.

For example, the old two-operand Intel SSE instruction ADDPS xmm1, xmm2/m128 can now be

expressed in three-operand syntax as VADDPS xmm1, xmm2, xmm3/m128 or the 256-bit register

using the form VADDPS ymm1, ymm2, ymm3/m256. A few instructions allow four operands, such as

VBLENDVPS ymm1, ymm2, ymm3/m256, ymm4, which conditionally copies single-precision floating-

point values from ymm2 or ymm3/m256 to ymm1 based on masks in ymm4. This is an improvement on

the previous form, where xmm0 was implicitly needed, requiring compilers to free up xmm0. Now,

with all registers explicit, there is more freedom for register allocation. Here, m128 is a 128-bit

memory location, xmm1 is the 128-bit register, and so on.

Some new instructions are VEX only (not Intel SSE extensions), including many ways to move data

into and out of the YMM registers. Examples are the useful VBROADCASTS[S/D], which loads a

Intel® Advanced Vector Extensions

23 May 2011

single value into all elements of an XMM or YMM register, and ways to shuffle data around in a

Intel® AVX adds arithmetic instructions for variants of add, subtract, multiply, divide, square root,

compare, min, max, and round on single- and double-precision packed and scalar floating-point

data. Many new conditional predicates are also useful for 128-bit Intel SSE, giving 32 comparison

types. Intel® AVX also includes instructions promoted from previous SIMD covering logical, blend,

convert, test, pack, unpack, shuffle, load, and store.

The toolset adds new instructions, as well, including non-strided fetching (broadcast of single or

multiple data into a 256-bit destination, masked-move primitives for conditional load and store),

insert and extract multiple-SIMD data to and from 256-bit SIMD registers, permute primitives to

manipulate data within a register, branch handling, and packed testing instructions.

Future Additions

The Intel® AVX manual also lists some proposed future instructions, covered here for

completeness. This is not a guarantee that these instructions will materialize as written.

Two instructions (VCVTPH2PS and VCVTPS2PH) are reserved for supporting 16-bit floating-point

conversions to and from single– and double–floating-point types. The 16-bit format is called half-

precision and has a 10-bit mantissa (with an implied leading 1 for non-denormalized numbers,

resulting in 11-bit precision), 5-bit exponent (biased by 15), and 1-bit sign.

The proposed RDRAND instruction uses a cryptographically secure hardware digital random bit

generator to generate random numbers for 16- 32- , and 64-bit registers. On success, the carry flag

is set to 1 (CF=1). If not enough entropy is available, the carry flag is cleared (CF=0).

Finally, there are four instructions (RDFDBASE, RDGSBASE, WRFSBASE, and WRGSBASE) to read and

write FS and GS registers at all privilege levels in 64-bit mode.

Another future addition is the FMA instructions, which perform operations similar to

A = + A * B + C, where either of the plus signs (+) on the right can be changed to a minus sign (−)

and the three operands on the right can be in any order. There are also forms for interleaved

addition and subtraction. Packed FMA instructions can perform eight single-precision FMA

operations or four double-precision FMA operations with 256-bit vectors.

FMA operations such as A = A * B + C are better than performing one step at a time, because

intermediate results are treated as infinite precision, with rounding done on store, and thus are

more accurate for computation. This single rounding is what gives the “fused” prefix. They are also

faster than performing the computation in steps.

Each instruction comes in three forms for the ordering of the operands A, B, and C, with the

ordering corresponding to a three-digit extension: form 132 does A = AC + B, form 213 does

A = BA + C, and form 231 does A = BC + A. The ordering number is just the order of the operands

on the right side of the expression.

剩余20页未读，继续阅读

评论收藏

内容反馈

Dgelom

粉丝: 0
资源: 3

Introduction to Intel AVX

最新资源

Introduction to Intel AVX

Intel AVX编程参考文档

intel SSE2/3/4，AVX指令集

AVX指令集测试

avx格式监控播放转换工具

Introduction to Computing Systems

Introduction to Classical Mechanics

Introduction to Deep Learning.pdf

A Mathematical Introduction to Robotic Manipulation

Introduction to Space-Time wireless Communications

An introduction to copulas

Introduction to MIMO Communications

Introduction to Algorithms Third Edition

An Introduction to the Bootstrap

Introduction to Modern Cryptography

Introduction to Wireless and Mobile Systems

An Introduction to Fluid Dynamics (Batchelor)

Hack Audio：An Introduction to Computer Programming and DSP in Matlab.rar

An Introduction to Manifolds, 2nd edition by Loring W. Tu.

Introduction to Statistical Machine Learning

An Introduction to the Analysis of Algorithms

INTRODUCTION TO LINEAR ALGEBRA 清晰版 第五版 习题解答

第十五届蓝桥杯大赛软件赛省赛C++B组题目

C/C++中文参考手册离线最新版

代码随想录-八股文 pdf

编译器（gcc、g++）

第十五届蓝桥杯大赛软件赛省赛-C++A组题目

Qt5.9 C++开发指南.pdf 及示例源码

Qt （高仿Visio）流程图组件开发，源码分享

最新资源

INTRODUCTION TO LINEAR ALGEBRA 清晰版第五版习题解答