没有合适的资源?快使用搜索试试~ 我知道了~
资源推荐
资源详情
资源评论
- 1 -
NEON support in the RealView compiler
William Munns
18 June 2007
Introduction
This paper provides a simple introduction to the NEON
TM
Vector-SIMD architecture. It
continues by looking at the compiler support for SIMD, both through automatic recognition
and through the use of intrinsic functions.
NEON is a hybrid 64/128 bit SIMD architecture extension to the ARM v7-A profile,
targeted at multimedia applications. Positioning NEON within the processor allows it to
share the CPU resources for integer operation, loop control, and caching, significantly
reducing the area and power cost compared with a CPU plus hardware accelerator
combination. SIMD (Single Instruction Multiple Data) is where one instruction acts on
multiple data items, usually carrying out the same operation for all data.
The use of NEON instead of a CPU plus hardware accelerator combination allows savings
to be made in software development time as it creates a much simpler programming model
without forcing the programmer to search for ad-hoc concurrency and scheduling points.
On the ARM Cortex™-A8 the NEON unit is positioned in the pipeline so that loads can
come directly from the L2 cache. This means that a much larger dataset can be held in the
cache than would be allowed when executing ARM or Thumb
®
-2 code.
The NEON instruction set was designed to be an easy target for a compiler, including low
cost promotion/demotion and structure loads capable of accessing data from their natural
locations rather than forcing alignment to the vector size.
The RealView Development Tools
®
Suite version 3.1 supports NEON both in the standard
release using intrinsic functions and assembler, as well as through the vectorizing compiler
add-on which can recognise code sequences and automatically generate SIMD code. The
vectorizing compiler greatly reduces porting time, as well as reducing the requirement for
deep architectural knowledge.
© 2007 ARM Limited. All Rights Reserved.
ARM and RealView logo are registered
trademarks of ARM Ltd. All other trademarks
are the property of their respective owners and
are acknowledged
- 2 -
Overview of NEON Vector SIMD
SIMD is the name of the process for operating on multiple data items in parallel using the
same instruction. In the NEON extension, the data is organized into very long registers (64
or 128 bits wide). These registers can hold "vectors" of items which are 8, 16, 32 or 64 bits.
The traditional advice when optimizing or porting algorithms written in C/C++ is to use the
natural type of the machine for data handling (in the case of ARM 32 bits). The unwanted
bits can then be discarded by casting and/or shifting before storing to memory. The ability
of NEON to specify the data width in the instruction and hence use the whole register
width for useful information means keeping the natural type for the algorithm is both
possible and preferable. Keeping with the algorithms natural type reduces the cost of
porting an algorithm from one architecture to another and allows more data items to be
simultaneously operated on.
NEON appears to the programmer to have two banks of registers, 64 bit D registers and
128 bit Q registers. In reality the D and Q registers alias each other, so the 64 bit registers
D0 and D1 map against the same physical bits as the register Q0.
When an operation is performed on the registers the instruction specifies the layout of the
data contained in the source and, in certain cases, destination registers.
4 x 32 bit Data
8 x 16 bit Data
16 x 8 bit Data
128 bit Q register
- 3 -
Example: Add together the 16 bit integers stored in the 64 bit vector D2 and 64 bit vector
D1 storing the resultant items in the 64 bit register D0
VADD.I16 D0, D1, D2 This instruction will cause four 16 bit adds
Promotion/demotion of types
Promotion/demotion of types is a very common operation in C. Casting to larger types can
be used to avoid overflow or increase precision. Shifting into smaller types allows
compatibility at interfaces or reduced memory usage. In contrast with some other SIMD
architectures, NEON provides compound operations which combine type promotion with
arithmetic operations. This allows NEON code to make better use of the register file and
use fewer instructions.
Example: Multiply together the 16 bit integers stored in the 64 bit vectors D2 and D3
storing the resultant items in the 128 bit register Q0
VMUL.I32.S16 Q0, D2, D3 This instruction will cause four widening multiplies
Example: Shift right by #5 the four 32 bit integers stored in 128 bit vector Q1, truncate to
16 bits and store the resultant 16 bit integers in 64 bit register D0
VSHR.I16.I32 D0, Q1,#5 This instruction will cause four narrowing shifts
+
++
+
D1
D2
D0
* **
*
D2
D3
Q0
D
#
>> >> >> >>
Q1
D0
#5
- 4 -
Structure load and store operations
Often items are not held in memory as simple arrays, but rather arrays of structures for
logically grouped data items.
For example it is common to find a screen represented as an array of structures of pixels
rather than split into three arrays of red, green and blue items. Storing all components of
pixel data together allows faster operation for common operations such as colour
conversion or display, however it can cause difficulties for some SIMD implementations.
The NEON unit includes special structure load instructions which can load whole
structures and spilt them accordingly across multiple registers.
Example: Load 12 16 bit values from the address stored in R0, and split them over 64 bit
registers D0, D1 and D2. Update R0 to point at next structure.
VLD3.16 {D0,D1,D2}, [R0]!
Structure load and store better matches how engineers write code, so code usually does not
need to be rewritten to take advantage of it.
struct rgb_pixel
{
short r; /* Red */
short g; /* Green */
short b; /* Blue */
}s[X_SIZE*Y_SIZE]; /* screen */
- 5 -
Writing NEON code using the standard RealView compiler
The standard tools shipped with RealView Development Suite 3.1 have support for NEON
directly in the assembler and embedded assembler. The compiler also provides NEON
support using pseudo functions called intrinsics. Intrinsic functions compile into one or
more NEON instructions which are inserted at the call site. There is at least one intrinsic
for each NEON instruction, with multiple intrinsic functions where needed for signed and
unsigned types.
Using intrinsics, rather than programming in assembly language directly, allows the
compiler to schedule registers, as well as giving the programmer easy access to C variables
and arrays.
Using vector registers directly from assembler could lead to programming errors such as a
64 bit vector containing data of 8 bits wide is operated upon by a 16 bit adder. These kind
of faults can be very difficult to track down as only particular corner cases will trigger an
erroneous condition. In the previous addition example, the output will only differ if one of
the data items overflows into another. Using intrinsics is type-safe and will not allow
accidental mixing of signed/unsigned or differing width data.
Accessing vector types from C
The header file arm_neon.h is required to use the intrinsics and defines C style types for
vector operations. The C types are written in the form :
uint8x16_t Unsigned integers, 8 bits, vector of 16 items - 128 bit “Q” register
int16x4_t Signed integers, 16 bits, vector of four items - 64 bit "D" register
As there is a basic incompatibility between scalar (ARM) and vector (NEON) types it is
impossible to assign a scalar to a vector, even if they have the same bit length. Scalar
values and pointers can only be used with NEON instructions that use scalars directly.
Example: Extract an unsigned 32 bit integer from lane 0 of a NEON vector
result = vget_lane_u32(vec64a, 0)
Vector types are not operable using standard C operators except for assignment, so the
appropriate VADD should be used rather than the operator “+”.
Where there are vector types which differ only in number of elements (uint32x2_t,
uint32x4_t) there are specific instructions to ‘assign’ the top or bottom vector elements of a
128 bit value to a 64 bit value and vice-versa. This operation does not use any code space
as long as the registers can be scheduled as aliases.
Example
: Use the bottom 64 bits of a 128 bit register
vec64 = vget_low_u32(vec128);
剩余20页未读,继续阅读
资源评论
- 异次元空间19942018-02-04好用,我借助这个成功写了neon汇编优化画图
肖老板
- 粉丝: 160
- 资源: 30
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功