FastSortingAlgorithmsusingAVX-512onIntelKnightsLanding-24Apr2017(1704.08579)-计算机科学资源-CSDN文库

75 浏览量 2021-04-22 16:02:55 上传评论收藏 662KB PDF 举报

资源推荐

资源详情

资源评论

Fast Sorting Algorithms using

AVX-512 on Intel Knights Landing

Berenger Bramas

Max Planck Computing and Data Facility (MPCDF)

Email: Berenger.Bramas@mpcdf.mpg.de

This paper describes fast sorting techniques using the recent AVX-512 instruction

set. Our implementations beneﬁt from the latest possibilities oﬀered by AVX-512

to vectorize a two-parts hybrid algorithm: we sort the small arrays using a branch-

free Bitonic variant, and we provide a vectorized partitioning kernel which is the

main component of the well-known Quicksort. Our algorithm sorts in-place and

is straightforward to implement thanks to the new instructions. Meanwhile, we

also show how an algorithm can be adapted and implemented with AVX-512. We

report a performance study on the Intel KNL where our approach is faster than

the GNU C++ sort algorithm for any size in both integer and double ﬂoating-point

arithmetics by a factor of 4 in average.

Keywords: quicksort; sort; vectorization; AVX-512; KNL

1. INTRODUCTION

Sorting is a fundamental problem in computer science,

and eﬃcient implementations are critical for various

applications such as database servers [1] or image

rendering engines [2]. Moreover, sorting libraries are

massively used in software development to reduce the

complexity of speciﬁc algorithms. As an example,

once an array is sorted, it is possible to use

binary search and ﬁnd any item in logarithmic time.

Therefore, having eﬃcient sorting implementations on

new architectures can improve performance of a wide

range of applications.

From one CPU generation to the next, improvements

and changes are made at various levels. Some

modiﬁcations are hidden from the programmer and

might improve existing codes without any update.

This includes the low-level modules (speculation, out-

of-order execution, etc.) but also the CPU’s clock

frequency. On the other hand, some new features and

improvements require to be explicitly used. In this

category, the two dominant improvements are the usage

of vectorization units and multi-core parallelization.

Thus, to achieve high-performance on modern CPUs,

it is indispensable to vectorize a code, that is to take

advantage of the single instruction on multiple data

(SIMD) capacity. In fact, while the diﬀerence between a

scalar code and its vectorized equivalent was ”only” of a

factor of 2 in the year 2000 (SSE technology and double

precision), the diﬀerence is now up to a factor 4 on

most CPUs (AVX) and up to 8 (AVX-512) on the KNL.

Some algorithms or computational intensive kernels are

straightforward to vectorize and could even be auto-

vectorize by the compilers. However, data-processing

algorithms are generally not in this category because

the SIMD operations are not always well designed for

this purpose.

The Intel Knights Landing (KNL) processor [3] did

not follow the same path of improvement as the previous

CPUs. In fact, the clock frequency has been reduced

but the size of SIMD-vector and the number of cores in

the chip have been increased impressively (in addition

to having a new memory hierarchy). The KNL supports

the AVX-512 instruction set [4]: it supports Intel AVX-

512 foundational instructions (AVX-512F), Intel AVX-

512 conﬂict detection instructions (AVX-512CD), Intel

AVX-512 exponential and reciprocal instructions (AVX-

512ER), and Intel AVX-512 prefetch instructions (AVX-

512PF). AVX-512 allows to work on SIMD-vectors of

double the size compared to previous AVX(2) set, and

it comes with various new operations. Moreover, the

next-generation of Intel CPUs will support AVX-512

too. Consequently, the development of new algorithms

targeting Intel KNL should be beneﬁcial to the future

Intel Xeon SkyLake CPU as well.

In the current paper, we look at diﬀerent strategies

to develop an eﬃcient sorting algorithm on Intel KNL

using AVX-512. The contributions of this study are the

following:

• studying sorting algorithms on the Intel KNL

• proposing a new partitioning algorithm using AVX-

512

• deﬁning a new Bitonic sort variant to sort tiny

arrays using AVX-512

• implementing a new Quicksort variant using AVX-

512.

All in all, we show how we can obtain a fast and

arXiv:1704.08579v1 [cs.MS] 24 Apr 2017

2 B. Bramas

vectorized sorting algorithm

The rest of the paper is organized as follows. Sec-

tion 2 provides the background related to vectorization

and sorting. Then, we describe our proposal to sort tiny

arrays in Section 3, and our partitioning and Quicksort

variant in Section 4. Finally, in Section 5, we provide

a performance study of our method against the GNU

C++ standard sort (STL) implementation.

2. BACKGROUND

In this section, we provide the pre-requisite to our sort

implementation. We ﬁrst introduce the Quicksort and

the Bitonic sort. Then we describe the objectives of the

vectorization and we ﬁnish with an overview of some

related work.

2.1. Sorting Algorithms

2.1.1. Quicksort Overview

We brieﬂy describe the well-known Quicksort (QS)

algorithm to make the document self-content, but

readers comfortable with it can skip this section. QS

has been originally proposed in [5], and uses a divide-

and-conquer strategy by recursively partitioning the

input array until it ends with partitions of one value.

The partitioning puts values lower than a pivot at

the beginning of the array and the values greater at

the end with a linear complexity. QS has a worst-

case complexity of O(n

) but an average complexity

of O(n log n) in practice. The complexity is tied to the

choice of the pivot when partitioning, it must be close

to the median to ensure a low complexity. However,

its simplicity in terms of implementation and its speed

in practice has turned it into a very popular sorting

algorithm. Figure 1 shows the sort of an array using

QS.

3 1 2 0 5

1 0 2

5 3

3 50 1

≤ 2

≤ 1

≤ 5

∅

l=0

r=1 r=5

l=0

l=4

r=5

FIGURE 1: Quicksort example to sort [3, 1, 2, 0, 5] to

[0, 1, 2, 3, 5]. The pivot is equal to the value in the

middle: the ﬁrst pivot is 2, then at second recursion

level it is 1 and 5.

We provide in Algorithm 2 a possible implementation

of a scalar QS (here the term scalar refers to as single

The functions described in the current study are available

at https://gitlab.mpcdf.mpg.de/bbramas/avx-512-sort. This

repository includes a clean header-only library (branch master)

and a test ﬁle that generates the performance study of the current

manuscript (branch paper). The code is under MIT license.

value at the opposite of a SIMD vector). In this

implementation, the choice of the pivot is done naively

by selecting the value in the middle before partitioning.

This type of selection can result in very unbalanced

partitions, hence why more advanced heuristics have

been proposed in the past. As an example, the pivot

can be selected by taking a median from several values.

Algorithm 1: Quicksort

Input: array: an array to sort. length: the size of array.

Output: array: the array sorted.

1 function QS(array, length)

2 QS core(array, 0, length-1)

3 function QS core(array, left, right)

4 if left < right then

5 // Naive method, select value in the middle

6 pivot idx = ((right-left)/2) + left

7 swap(array[pivot idx], array[right])

8 partition bound = partition(array, left, right, array[right])

9 swap(array[partition bound], array[right])

10 QS core(array, left, partition bound-1)

11 QS core(array, partition bound+1, right)

12 end

13 function partition(array, left, right, pivot value)

14 for idx read ← left to right do

15 if array[idx read] ¡ pivot value then

16 swap(array[idx read], array[left])

17 left += 1

18 end

19 end

20 return left;

2.1.2. GNU std::sort Implementation (STL)

The worst case complexity of QS makes it no longer

suitable to be used as the standard C++ sort. In

fact, a complexity of O(n log n) in average was required

until year 2003 [6], but it is now a worst case limit [7]

that a pure QS implementation cannot guarantee.

Consequently, the current implementation is a 3-part

hybrid sorting algorithm [8] i.e. it relies on 3 diﬀerent

algorithms. First, the algorithm uses an introsort [9] to

a maximum depth of 2 × log

n (introsort is a hybrid of

quicksort and heap sort) to obtain small partitions that

are then sorted using an insertion sort.

2.1.3. Bitonic Sorting Network

In computer science, a sorting network is an abstract

description of how to sort values from a ﬁxed-size array;

how the values are compared and exchanged. A sorting

network can be represented graphically having each

input value as an horizontal lines and each compare and

exchange unit as a vertical connection between those

lines. It exists various sorting networks in the literature,

but we concentrate our description on the Bitonic sort

from [10]. This network is easy to implement and has an

algorithm complexity of O(n log(n)

). Moreover, it has

shown good performance on parallel computers [11] and

GPUs [12]. Figure 2a shows a Bitonic sorting network

to process 16 values. A sorting network can be seen as a

time-line where input values are transfered from left to

right and exchanged if needed at each vertical bar. We

show such execution in Figure 2b where we print the

Fast Sorting Algorithms using AVX-512 on Intel Knights Landing 3

intermediate steps while sorting an array of 8 values.

(a) Bitonic sorting network for input of size 16.

All vertical bars/switches exchange values in the

same direction.

(b) Example of 8 values sorted by a

Bitonic sorting network.

FIGURE 2: Bitonic sorting network examples. In red

boxes, the exchanges are done from extremities to the

center. Whereas in orange boxes, the exchanges are

done with a linear progression.

Two main strategies are possible when implementing

a sorting network. Either it can be implemented by

hard-coding the connections between the lines (which

can be seen as a direct mapping of the picture).

Or with a ﬂexible algorithm that performs the same

operations by using a formula/rule to decide when to

exchange the values. In Algorithm 2, we show a ﬂexible

implementation of a Bitonic sort which mimics the

networks presented in Figure 2a but without hard-coded

exchanges.

2.2. Vectorization

The term vectorization refers to a CPUs’ feature to

apply a single operation/instruction to a vector of

values instead of only a single one [13]. It is common to

refer to this concept as Flynn’s taxonomy term SIMD,

for single instruction on multiple data. By adding SIMD

instructions/registers to CPUs, it has been possible

to increase the peak performance of the single cores

despite the stagnation of the clock frequency since the

mid 2000s. In the rest of the paper, we use the term

SIMD-vector to call the data type managed by the

CPU, but it has no relation with an expandable vector

data structure such as std::vector. The size of a SIMD-

vector is variable and depends on both the instruction

set and the type of SIMD-vector element and is given

by the size of the registers in the chip. Simd-vector

extensions to the X86 instruction set, for example,

are SSE [14], AVX [15], and AVX-512 [4], supporting

SIMD-vectors of size 128, 256 and 512 bits, respectively.

An AVX-512 SIMD-vector is able to store 8 double

Algorithm 2: Bitonic sort

Input: array: an array to sort. length: the size of array (must be a

power of 2).

Output: array: the array sorted.

1 function bitonic sort(array, length)

2 for s=2 ; s <= n ; s= s ∗ 2 do

3 for i=0; i < n; i= i + s do

4 bitonic core(sub array(array,i), s);

5 end

6 end

7 function bitonic core(array, n)

8 step = n/2

9 leftPtr = arr

10 rightPtr = sub array(arr,n-1)

11 for k=0 ; k < step ; k= k + 1 do

12 test and exchange(leftPtr, rightPtr)

13 leftPtr = leftPtr + 1

14 rightPtr = rightPtr - 1

15 end

16 for step = n/2/2 ; step > 0 ; step = step/2 do

17 for i=0; i < n; i = i + step ∗ 2 do

18 leftPtr = sub array(arr,i)

19 rightPtr = sub array(arr,i+step)

20 for k=0 ; k < step ; k= k + 1 do

21 test and exchange(leftPtr, rightPtr)

22 leftPtr = leftPtr + 1

23 rightPtr = rightPtr + 1

24 end

25 end

26 end

precision ﬂoating-point numbers or 16 integer values,

for example. Figure 3 illustrates the diﬀerence between

a scalar summation and a SIMD-vector summation.

Throughout this document, we use intrinsic function

extension instead of the assembly language to write

vectorized code on top of the AVX-512 instruction set.

Intrinsics are small functions that should be replaced

by the compiler with a single assembly instruction.

Using intrinsics allows us to use high-level programming

languages (C++ in our case) while being able to tell the

compilers to use particular instructions.

float a; __m128 a;

__m256 a;

float b; __m128 b; __m256 b;

a+b a+b a+b

FIGURE 3: Summation example of single precision

ﬂoating-point values using : () scalar standard C++

code, () SSE SIMD-vector of 4 values , () AVX

SIMD-vector of 8 values.

2.2.1. AVX-512

The AVX-512 is a recent instruction set that has been

introduced with the Intel KNL CPU in the year 2016.

It oﬀers operations that do not have an equivalent in

previous extensions of the X86 instruction sets. As an

example, there are several new instructions that use a

mask (integer) to select values or conditionally apply an

operation on some values from a SIMD-vector. In the

following, we describe some of the instructions that we

use in our implementations.

4 B. Bramas

Load/set/store. As in previous instruction sets, AVX-

512 has instructions to load a contiguous block of values

from main memory and to transform it into a SIMD-

vector (load), ﬁll a SIMD-vector with a given value (set)

and move back a SIMD-vector into memory (store).

Store some. A new operation allows to save only

some values from a SIMD-vector into memory

(vpcmpd/vcmppd). The values are saved contiguously.

This is a major improvement because without this

instruction, several operations are needed to obtain the

same result. For example, to save some values from a

SIMD-vector v at address p in memory, one possibility

is to load the current values from p into a SIMD-vector

, permute the values in v to move the values to store

at the beginning, merge v and v

, and ﬁnally save the

resulting vector. The pseudo-code in Figure 4 describes

the results obtained with this store-some instruction.

1 void mm512 cmp epi32 mask (

2 in t ∗ pt r ,

3 mas k16 msk ,

4 m5 1 2i v a l u e s ) {

5 o f f s e t = 0 ;

6 fo r ( id x from 0 t o 15) {

7 i f ( msk AND s h i f t ( 1 , i d x ) ) {

8 p tr [ o f f s e t ] = v a l u e s [ i dx ] ;

9 o f f s e t += 1 ;

10 }

11 }

12 }

Store some (integer)

1 void mm512 cmp pd mask (

2 double ∗ p t r ,

3 mask 8 msk ,

4 m512d v a l u e s ) {

5 o f f s e t = 0 ;

6 fo r ( id x from 0 t o 7) {

7 i f ( msk AND s h i f t ( 1 , i d x ) ) {

8 p tr [ o f f s e t ] = v a l u e s [ i dx ] ;

9 o f f s e t += 1 ;

10 }

11 }

12 }

Store some (double)

FIGURE 4: AVX-512 store-some behavior for an

integer and a double ﬂoating-point vectors.

Vector permutation. Permuting the values inside

a vector was possible since AVX/AVX2 using

permutevar8x32 (instruction vperm(d,ps)). This in-

struction allows to re-order the values inside a SIMD-

vector using a second integer array which contains the

permutation indexes. AVX-512 also has this instruction

on its own SIMD-vector, and we synthetize its behavior

in Figure 5.

Min/Max. The min/max operations (vpmaxsd/vp-

minsd/vmaxpd/vminpd ) return a SIMD-vector where

1 m5 1 2i mm5 12 p ermu tex var e pi3 2 (

2 m5 1 2i permIdxs ,

3 m5 1 2i v a l u e s ) {

4 m5 1 2i r e s ;

5 fo r ( id x from 0 t o 15)

6 r e s [ i d x ] = v a l u e s [ permIdxs [ i d x ] ] ;

7 return r e s ;

8 }

Permute (integer)

1 m512d mm512 permutexvar pd (

2 m5 1 2i permIdxs ,

3 m512d v a l u e s ) {

4 m512d r e s ;

5 fo r ( id x from 0 t o 7)

6 r e s [ i d x ] = v a l u e s [ permIdxs [ i d x ] ] ;

7 return r e s ;

8 }

Permute (double)

FIGURE 5: AVX-512 permute behavior for an integer

and a double ﬂoating-point vectors.

each value correspond to the minimum/maximum value

of the two input vectors at the same position (they do

not return a single scalar as the global minimum/max-

imum among all the values). Such instructions exist in

SSE/SSE2/AVX too.

Comparison. In AVX-512, the value returned by a

test/comparison (vpcmpd/vcmppd) is a mask (integer)

and not a SIMD-vector of integers as it was in

SSE/AVX. Therefore, it is easy to modify and work

directly on the mask with arithmetic and binary

operations for scalar integers. The behavior of the

comparison is shown in Figure 6, where the mask is

ﬁlled with bits from the comparison results. AVX-512

provides several instructions that use this type of mask

like the conditional selection for instance.

Conditional selection. Among the mask-based instruc-

tions, the mask move (vmovdqa32/vmovapd) allows to

select values between two vectors using a mask. The be-

havior is show in Figure 7, where a value is taken from

the second vector where the mask is false and from the

ﬁrst vector otherwise. Achieving the same result was

possible in previous instruction set only using several

operations and not one dedicated instruction.

2.3. Vectorized Sorting Algorithms

The literature on sorting and vectorized sorting

implementations is immense. Therefore, we only

cite here some of the studies that we consider most

connected to our work.

The sorting technique from [16] tries to remove

branches and improves the prediction of a scalar sort,

and they show a speedup by a factor of 2 against the

STL (the implementation of the STL at that time was

剩余15页未读，继续阅读

评论收藏

内容反馈

weixin_38682279

粉丝: 9
资源: 889

Fast Sorting Algorithms using AVX-512 on Intel Knights Landing -...

最新资源

Fast Sorting Algorithms using AVX-512 on Intel Knights Landing -...

Algorithm-sorting-algorithms-performance-comparison.zip

sortingalgorithms-源码.rar

[麻省理工学院-算法导论].Introduction.to. Algorithms,.Second.Edition

Guide to Data Structures_A Concise Introduction Using Java-Springer(2017).pdf

Skiena-The_Algorithm_Design_Manual.pdf

[麻省理工学院-算法导论](英文版).chm

Data.Structures.and.Algorithms.USING.C

A Fast Elitist Non-Dominated Sorting Genetic

Problem.Solving.in.Data.Structures.and.Algorithms.Using.Cplusplus.epub

PyPI 官网下载 | django-sorting-bootstrap-2.4.0.tar.gz

C++标准库（第二版）英文版.pdf

算法导论--Introduction.to.Algorithms

Learning.JavaScript.Data.Structures.and.Algorithms.1783554878

Swift.Data.Structure.and.Algorithms

Disk-Based.Algorithms.for.Big.Data

Sorting-Algorithmsls-源码.rar

Fast Parallel GPU-Sorting Using a Hybrid Sort-计算机科学

源码+书Hands-on Data Structures and Algorithms with JavaScript

Java-various-sorting-algorithms.rar_algorithms

Qt 5实现串口调试助手 （源工程文件、0积分下载）

【SystemVerilog】路科验证V2学习笔记（全600页）.pdf

AutoSAR标准协议4.2.2

光伏-储能并网系统仿真.rar

NPPJSONViewer.zip

GD32替换STM32注意事项.pdf

XCP协议的规范文档

VS2015安装证书，JavaScript_ProjectSystem.msi，JavaScript_LanguageService.msi

CANoe通过CAPL脚本实现自动测试

蓝牙BLE协议中文版.pdf

最新资源

Qt 5实现串口调试助手（源工程文件、0积分下载）