Lifting C Semantics for Dataflow Optimization

Alexandru Calotoiu

acalotoiu@inf.ethz.ch

ETH Zurich, Switzerland

Tal Ben-Nun

talbn@inf.ethz.ch

ETH Zurich, Switzerland

Grzegorz Kwasniewski

gkwasnie@inf.ethz.ch

ETH Zurich, Switzerland

Johannes de Fine Licht

definelj@inf.ethz.ch

ETH Zurich, Switzerland

Timo Schneider

timo.schneider@inf.ethz.ch

ETH Zurich, Switzerland

Philipp Schaad

philipp.schaad@inf.ethz.ch

ETH Zurich, Switzerland

Torsten Hoefler

torsten.hoefler@inf.ethz.ch

ETH Zurich, Switzerland

Abstract

C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as device-specific properties such as memory hierarchies. The resulting code is often hard to understand, debug, and modify for different architectures. We propose to lift C programs to a parametric dataflow representation that lends itself to static data-centric analysis and enables automatic high-performance code generation. We separate writing code from optimizing for different hardware: simple, portable C source code is used to generate efficient specialized versions with a click of a button. Our approach can identify parallelism when no other compiler can, and outperforms a bespoke parallelized version of a scientific proxy application by up to 21%.

Keywords: parallelism, dataflow analysis, automatic parallelization

1 Introduction

Many performance-critical applications are written in C, as its machine model is usually closest to hardware and allows for bare-metal tuning to achieve highest performance. According to the TIOBE index [ ] in 2020, C was the most popular language in Internet searches. High-performance computing centers state that 25% of their users primarily use C [ ]. Since Kernighan's and Ritchie's original inception of the C language, systems have changed dramatically. Most architectures need specialized instructions, compiler directives, or libraries to be used efficiently. This usually leads to C programs where more lines of code are implementing optimizations tailored to the architecture than solving the actual problem.

Targeted optimization is tightly coupled to hardware architectures. A code written for GPUs using CUDA, a code written to exploit shared memory using OpenMP, and a code written for large supercomputers using the message passing interface (MPI) can be nominally written in C, but will vary widely even if they solve the same problem. The only aspect they are likely to have in common is the sequential algorithm each variant is based on. We argue that specializing the programs to an architecture treats the symptoms, but cannot eliminate the root cause: precisely because C was not designed for performance portability, optimizing C programs is both challenging and time consuming.

A powerful alternative to specialization is using tools provided by modern compilers such as polyhedral analysis [ ] to optimize and parallelize sequential C code, with results rivaling and even surpassing hand-tuned versions of the code. However, these are limited to static control parts (SCOPs) within functions [ ]. SCOPs impose constraints on what type of source code can be analyzed: indirect array accesses such as x[column_index[j]] are typically not permitted. The limitation is apparent in the following example (sparse matrix vector multiplication), as no optimization is possible due to the data-dependent indirect array accesses.

for (i = 0; i < N; i++)
  for (j = row_ptr[i]; j < row_ptr[i + 1]; j++)
    y[i] += A[col_idx[j]] * x[j];
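As a minimal sketch, the loop nest above can be wrapped in a function and run on a tiny input to see which accesses are data-dependent. The function names, the signature, and the input values are illustrative assumptions; only the loop body is taken from the text, exactly as printed. Note that each outer iteration writes only y[i], even though the inner bounds and the index into A are loaded from memory.

```c
#include <assert.h>
#include <stddef.h>

/* The loop nest from the text, wrapped in a function.
 * Array names follow the paper; the signature is an assumption. */
void spmv_as_printed(size_t N, const int *row_ptr, const int *col_idx,
                     const double *A, const double *x, double *y)
{
    assert(row_ptr && col_idx && A && x && y);
    for (size_t i = 0; i < N; i++) {
        /* Inner bounds are read from row_ptr, so they are not known statically. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            /* Indirect access: the element of A that is read depends on col_idx[j]. */
            y[i] += A[col_idx[j]] * x[j];
        }
    }
}

/* Hypothetical input with 2 rows and 3 stored entries; returns y[row]. */
double spmv_demo(int row)
{
    const int row_ptr[] = {0, 2, 3};
    const int col_idx[] = {1, 0, 2};
    const double A[] = {2.0, 3.0, 4.0};
    const double x[] = {1.0, 10.0, 100.0};
    double y[2] = {0.0, 0.0};

    assert(row == 0 || row == 1);
    spmv_as_printed(2, row_ptr, col_idx, A, x, y);
    return y[row];
}
```

Calling spmv_demo(0) and spmv_demo(1) from a main function returns the two entries of y for this input.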

In search of a more general solution, we observe that data movement is the most expensive part of most program executions when considering both energy and time [ ]. Data-centric programming and leveraging dataflow graphs is already widely performed in compiler analysis [ ], and recently emerging in graph analytics [ ], high performance computing [ ], and machine learning [ ]. Data-centric models are both productive and portable, as parallelism is inherently expressed as data-independent sections, regardless of the target hardware.

Our goal is to generate optimized, parallel code for different platforms by minimizing data movement. To achieve it, we extract the data movement semantics from most C programs into a parametric dataflow representation, where data movement can be better analyzed and transformed. While one cannot statically analyze the dataflow of all C programs, as can be shown by the Halting problem or Rice's theorem, we observe that high performance C codes, a subset of C programs without undefined behavior, recursion or function pointers, can be lifted.

arXiv:2112.11879v2 [cs.PL] 30 Dec 2021

Figure 1. Optimizing C programs by lifting dataflow. (Diagram labels: C program; C-to-DaCe translation (§2); dataflow coarsening (§3); data-centric program view (SDFG); dataflow optimization (§4); code generation [8]; optimized C program.)

We keep track of memory accesses using symbolic analysis of access patterns and leverage the dataflow across the entirety of a program. To showcase the opportunities provided by the data-centric approach we show how we can automatically expose data parallelism by identifying and optimizing updates to shared memory locations. We evaluate the effectiveness of our parallelization by providing an automatic method of deriving work/depth models for code we have parallelized. There is no need to annotate the code to recover parametrically-parallel sections, as we derive the required information directly from dataflow. The overall process is fully automatic and is summarized in Figure 1. Figure 2 shows a more detailed view of how the sparse matrix vector multiplication code is translated, transformed and optimized and will be discussed in detail in Sections 2, 3, and 4.

As we shall show, from the raw C codes, we are able not only to generate codes that perform equivalently or better than specialized tools such as polyhedral compilers, but also to operate on LULESH [ ], a scientific computing application, finding parallelization opportunities that no state-of-the-art tool detects, and even outperform the tuned parallel version provided by the application authors.

Contributions.

• We statically lift the semantics of dataflow from C into a data-centric intermediate representation.

• We use symbolic analysis of data access patterns across entire programs to expose optimizations and parallelism in unmodified C programs.

• We statically detect the update of a memory location as a distinct data access pattern to expose additional parallelism opportunities.

• We introduce an automatic, static work-depth analysis to objectively measure the degree to which we have exposed parallelism in sequential C code.

• On the LULESH [ ] high-performance scientific application, we automatically generate a parallel version that outperforms all other compilers and autoparallelizing tools and even surpasses the developers' own OpenMP parallelization by up to 21%.
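To illustrate why recognizing an update (a read, a combine, and a write to the same location) can expose parallelism, here is a minimal sketch, not the paper's implementation. It compares a sequential accumulation with the same accumulation split into independent chunks whose partial results are combined afterwards. All function names and the chunking scheme are assumptions made for illustration. Integers are used so that reordering the additions cannot change the result; with floating point values, reordering may change rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential form: every iteration updates the same location acc. */
long sum_sequential(const long *v, size_t n)
{
    assert(v || n == 0);
    long acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += v[k];                 /* update: read acc, add, write acc */
    return acc;
}

/* The same update split into independent partial sums. The chunks share no
 * state, so they could run in parallel; here they run one after another. */
long sum_chunked(const long *v, size_t n, size_t chunks)
{
    assert(chunks > 0 && (v || n == 0));
    long total = 0;
    for (size_t c = 0; c < chunks; c++) {
        size_t begin = c * n / chunks;
        size_t end = (c + 1) * n / chunks;
        long partial = 0;            /* private accumulator for this chunk */
        for (size_t k = begin; k < end; k++)
            partial += v[k];
        total += partial;            /* combine step resolves the shared write */
    }
    return total;
}

/* Returns 1 if both forms agree on a small hypothetical input. */
int update_demo(void)
{
    const long v[] = {1, 2, 3, 4, 5, 6, 7};
    return sum_sequential(v, 7) == 28 && sum_chunked(v, 7, 3) == 28;
}
```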

The code is available under https://github.com/spcl/c2dace.

2 From C to Data-Centric Programming

The data-centric programming paradigm revolves around memory, its movement, and its manipulation through computations. Rather than prioritizing control-flow constructs (e.g., sequential statements, loops), the core component of data-centric models is dataflow. Execution order is thus first a byproduct of data dependencies, and secondly a result of explicit control-flow. There are three governing principles to the paradigm: separation of data containers from computation, explicit data movement expressed as a first-class component, and providing control dependencies for cases where dataflow is not implied (e.g., data-dependent branches).

This is a crucial difference to control-centric C programs, where dataflow is implicit. In order to perform this paradigm shift we must execute a workflow to lift dataflow from C programs. Throughout this workflow, we must maintain semantic equivalence in every step of the translation. We separate the workflow into stages. First, we perform AST transformations to simplify the translation to the dataflow representation. Then, we parse the C code into a fine-grained dataflow representation. Next, we repeatedly coarsen that dataflow, after which we can perform optimizing transformation passes. Finally, we can generate optimized C source code for different architectures.

In this work, we focus on the Stateful Dataflow Multigraph (SDFG) IR [ ] as the data-centric representation. An SDFG is a directed graph, representing a state machine, where each node (state) is in itself a parametric directed acyclic multigraph. In the outer graph, edges contain state transition conditions and assignments. Each state is in turn an acyclic dataflow multigraph, with edges representing data movement and nodes representing data containers, computations, and parametric parallelism scopes. The components are summarized in Figure 2 and full operational semantics can be found in Ben-Nun et al. [8].

Using the DaCe framework, SDFGs were shown to accelerate a wide range of application classes in dense/sparse linear algebra and graph algorithms [ ], deep learning Transformer architectures [26], numerical weather prediction on FPGAs [ ], and extreme-scale quantum transport simulations on the world's largest supercomputer [51].

We provide a high-level overview mapping major C syntax elements [ ] to equivalent SDFG elements in Table 1. Below, we introduce the relevant SDFG components in more detail and discuss the more challenging aspects of C to SDFG translation.

C Language                                 SDFG Equivalent

Declarations and Types (§2.1)
  Primitive data type                      Scalar data container
  Array                                    Array data container
  Pointer                                  Access node to existing data container, or new data container if pointing to newly allocated memory

Expressions and Assignments (§2.2)
  Operators (Unary, Binary, ...)           Tasklet with incoming and outgoing memlets for read/written operands
  Array expression                         Memlet

Statements (§2.3)
  Compound (blocks)                        State
  Branching (if, ...)                      Branch conditions on state transition edges
  Iteration (for, while, ...)              State for compound statement, with states and transitions for loop logic
  Function flow (break, continue, return)  State transitions
  goto                                     State transition

Functions (§2.4)
  Function calls (with source)             Nested SDFG for content, memlets reduce shape of inputs and outputs
  External/Library calls                   Tasklet with library state
  Recursion                                Unsupported
  Function pointers                        No equivalent, unsupported

Parallelism (§2.6)
  (none)                                   Parametric map scope

Table 1. Mapping of major C syntax [ ] elements to SDFG representation.
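To make the Iteration row of Table 1 concrete, the following sketch rewrites a simple for loop as an explicit state machine in plain C. This is only an analogy for the idea of states connected by edges that carry conditions and assignments; it is not SDFG syntax and not output of the tool. The state names and function names are hypothetical, and the edge labels mirror those visible in Figure 2 (i=0, i<N, i>=N, i++).

```c
#include <assert.h>

/* Hypothetical state names for the loop logic. */
enum loop_state { S_INIT, S_GUARD, S_BODY, S_EXIT };

/* Reference: the ordinary C loop. */
int sum_loop(int N)
{
    int s = 0;
    for (int i = 0; i < N; i++)
        s += i;
    return s;
}

/* The same loop with control flow made explicit as states and transitions. */
int sum_state_machine(int N)
{
    int s = 0, i = 0;
    enum loop_state st = S_INIT;
    while (st != S_EXIT) {
        switch (st) {
        case S_INIT:                      /* edge assignment: i = 0 */
            i = 0;
            st = S_GUARD;
            break;
        case S_GUARD:                     /* edge conditions: i < N or i >= N */
            st = (i < N) ? S_BODY : S_EXIT;
            break;
        case S_BODY:                      /* loop body, then edge assignment i++ */
            s += i;
            i++;
            st = S_GUARD;
            break;
        case S_EXIT:
            break;
        }
    }
    return s;
}

/* Aborts if the two forms disagree for the given bound. */
void check_equivalence(int N)
{
    assert(sum_loop(N) == sum_state_machine(N));
}
```

Only the body state contains data movement and computation; everything else is transition logic, which is what the table entry "states and transitions for loop logic" refers to.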

2.1 Declarations and Types

We need to capture all instances where data is defined, read, and written. The first step is to capture all instances where data is defined, whether statically or at runtime. The equivalent to declarations in C is the creation of data containers in SDFGs.

Data containers are accessed using access nodes in SDFGs, and represent arrays, both one- and multi-dimensional. Scalars are thus specialized data containers, with just one instance of a primitive data type.

Some examples of data containers are shown below:

C code:

double **A, B[50][50];
float *C = (float *)malloc(sizeof(float) * 10);

Corresponding SDFG: access nodes C and B, with memlets C[0:10] and B[0:50, 0:50], followed by the rest of the graph (…).

Here, C will be registered as a one-dimensional single precision floating point array of 10 elements in the SDFG, and B as a two-dimensional double precision floating point array of 50 elements times 50 elements. We ensure no aliasing is possible in our representation by not creating a separate data container for pointers such as A. Containers will be created only if A is assigned to newly allocated memory. If A is assigned to an existing data container, A will simply be replaced with an access to that container.
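The pointer rule can be pictured with a small sketch in plain C. It shows a write through a pointer into an existing array next to the same write expressed directly on the array, which is the form that remains once the pointer is replaced by an access to the container. The function names, the chosen indices, and the value are hypothetical; how the tool computes offsets internally is not described here and is not implied by this sketch.

```c
#include <assert.h>

/* Existing data container, as in the declaration example. */
static double B[50][50];

/* Pointer form: p refers to row 3 of the existing container B. */
void write_via_pointer(void)
{
    double *p = B[3];
    p[2] = 1.5;
}

/* Form with p replaced by an access to B itself: no second name for the data. */
void write_via_container(void)
{
    B[3][2] = 1.5;
}

/* Resets the element, performs one of the two writes, and returns the result. */
double alias_demo(int use_pointer)
{
    assert(use_pointer == 0 || use_pointer == 1);
    B[3][2] = 0.0;
    if (use_pointer)
        write_via_pointer();
    else
        write_via_container();
    return B[3][2];
}
```

Both variants touch the same element, so an analysis that only sees accesses to B does not have to reason about a second name for the same memory.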

SDFGs rely on symbolic math to perform useful analyses and transformations. A symbol is defined as a scalar that will not be modified within any state. Symbols can only be set between states, in an inter-state edge. We can thus use symbolic expressions in memory offsets and integer sets, and differentiate them from runtime-computed scalars.

To analyze dynamically allocated memory such as malloc and variable-length arrays, we automatically create symbols out of integer scalar values, as we detail in Section 3.1.

2.2 Expressions and Assignments

Assignments are some of the most common constructs encountered in C. An assignment contains both data (read and written) and computation (as part of expressions), and we discuss their SDFG equivalents below.
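As a rough illustration of splitting an assignment into data and computation, the sketch below takes one iteration of the update y[i] += A[idx] * x[j] and makes each read and write explicit around a pure function. This is plain C used as an analogy for a tasklet with incoming and outgoing memlets; it is not SDFG syntax, and the function names and input values are assumptions.

```c
#include <assert.h>

/* Stand-in for a tasklet: a pure computation on values that touches no arrays. */
static double tasklet_mul(double a, double x)
{
    return a * x;
}

/* One iteration of y[i] += A[idx] * x[j] with every data movement explicit. */
void update_explicit(double *y, const double *A, const double *x,
                     int i, int idx, int j)
{
    assert(y && A && x);
    double a_in = A[idx];                   /* read of A[idx] */
    double x_in = x[j];                     /* read of x[j]   */
    double out = tasklet_mul(a_in, x_in);   /* computation    */
    y[i] += out;                            /* update of y[i] by summation */
}

/* Hypothetical input: y[0] starts at 1 and receives A[1] * x[0]. */
double explicit_demo(void)
{
    const double A[] = {2.0, 3.0};
    const double x[] = {5.0, 7.0};
    double y[] = {1.0};
    update_explicit(y, A, x, 0, 1, 0);
    return y[0];
}
```

Written this way, the three memory accesses are visible separately from the multiplication, which is the separation the data-centric view relies on.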

Input C code:

for (int i = 0; i < N; i++)
  for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
    y[i] += A[col_idx[j]] * x[j];

After AST transformation (§2), with index extraction:

for (int i = 0; i < N; i++)
  for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
  {
    int idx = col_idx[j];
    y[i] += A[idx] * x[j];
  }

Pipeline stages shown: C-to-DaCe translation (§2), dataflow coarsening (§3), dataflow optimization (§4). Transformation labels: index extraction, symbolic access, update detection.

Labels in the SDFG panels: interstate edges i=0, i<N, i>=N, i++, j=row_ptr[i], j<row_ptr[i+1], j>=row_ptr[i+1], idx=col_idx[j], j++; tasklet y = A*x with memlets A[idx], x[j], and y[i] (CR: Sum); map scopes [i=0:N] (omp parallel for) and [j=row_ptr[i]:row_ptr[i+1]] (omp parallel for).

SDFG components:
Data: Array containers
Tasklet: Fine-grained computation
Memlet: Data movement unit, with parallel write conflict resolution (CR) options
States: Control dependencies
Map: Parametric parallelism scope
Interstate edges: symbolic conditions and assignments dependencies

Figure 2. From C to Data-Centric Programming.
