Intel FPGA SDK for OpenCL Best Practices Guide


Documentation on using OpenCL with the Intel FPGA SDK, including an installation tutorial and examples.
Contents

    3.7 Avoiding Pointer Aliasing
    3.8 Avoid Expensive Functions
    3.9 Avoiding Work-Item ID-Dependent Backward Branching
4 Profiling Your Kernel to Identify Performance Bottlenecks
    4.1 Intel FPGA Dynamic Profiler for OpenCL Best Practices
    4.2 Intel FPGA Dynamic Profiler for OpenCL GUI
        4.2.1 Source Code Tab
        4.2.2 Kernel Execution Tab
        4.2.3 Autorun Captures Tab
    4.3 Interpreting the Profiling Information
        4.3.1 Stall, Occupancy, Bandwidth
        4.3.2 Activity
        4.3.3 Cache Hit
        4.3.4 Profiler Analyses of Example OpenCL Design Scenarios
        4.3.5 Autorun Profiler Data
    4.4 Intel FPGA Dynamic Profiler for OpenCL Limitations
5 Strategies for Improving Single Work-Item Kernel Performance
    5.1 Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback
        5.1.1 Removing Loop-Carried Dependency
        5.1.2 Relaxing Loop-Carried Dependency
        5.1.3 Simplifying Loop-Carried Dependency
        5.1.4 Transferring Loop-Carried Dependency to Local Memory
        5.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers
    5.2 Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays
    5.3 Good Design Practices for Single Work-Item Kernel
6 Strategies for Improving NDRange Kernel Data Processing Efficiency
    6.1 Specifying a Maximum Work-Group Size or a Required Work-Group Size
    6.2 Kernel Vectorization
        6.2.1 Static Memory Coalescing
    6.3 Multiple Compute Units
        6.3.1 Compute Unit Replication versus Kernel SIMD Vectorization
    6.4 Combination of Compute Unit Replication and Kernel SIMD Vectorization
    6.5 Resource-Driven Optimization
    6.6 Reviewing Kernel Properties and Loop Unroll Status in the HTML Report
7 Strategies for Improving Memory Access Efficiency
    7.1 General Guidelines on Optimizing Memory Accesses
    7.2 Optimize Global Memory Accesses
        7.2.1 Contiguous Memory Accesses
        7.2.2 Manual Partitioning of Global Memory
    7.3 Performing Kernel Computations Using Constant, Local or Private Memory
        7.3.1 Constant Cache Memory
        7.3.2 Preloading Data to Local Memory
        7.3.3 Storing Variables and Arrays in Private Memory
    7.4 Improving Kernel Performance by Banking the Local Memory
        7.4.1 Optimizing the Geometric Configuration of Local Memory Banks Based on Array Index
    7.5 Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor
    7.6 Minimizing the Memory Dependencies for Loop Pipelining
8 Strategies for Optimizing FPGA Area Usage
    8.1 Compilation Considerations
    8.2 Board Variant Selection Considerations
    8.3 Memory Access Considerations
    8.4 Arithmetic Operation Considerations
    8.5 Data Type Selection Considerations
A Additional Information
    A.1 Document Revision History

UG-OCL003 | 2017.12.08

1 Introduction

The Intel FPGA SDK for OpenCL Best Practices Guide provides guidance on leveraging the functionalities of the Intel FPGA Software Development Kit (SDK) for OpenCL(1) to optimize your OpenCL(2) applications for Intel FPGA products.

This document assumes that you are familiar with OpenCL concepts and application programming interfaces (APIs), as described in the OpenCL Specification version 1.0 by the Khronos Group. It also assumes that you have experience in creating OpenCL applications.

To achieve the highest performance of your OpenCL application for FPGAs, familiarize yourself with details of the underlying hardware.
In addition, understand the compiler optimizations that convert and map your OpenCL application to FPGAs.

For more information on the OpenCL Specification version 1.0, refer to the OpenCL Reference Pages on the Khronos Group website. For detailed information on the OpenCL APIs and programming language, refer to the OpenCL Specification version 1.0.

Related Links
    OpenCL Reference Pages
    OpenCL Specification version 1.0

1.1 FPGA Overview

Field-programmable gate arrays (FPGAs) are integrated circuits that you can configure repeatedly to perform an infinite number of functions.

An FPGA consists of several small computational units. Custom datapaths can be built directly into the fabric by programming the compute units and connecting them as shown in the following figure. Data flow is programmed directly into the architecture.

(1) The Intel FPGA SDK for OpenCL is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.
(2) OpenCL and the OpenCL logo are trademarks of Apple Inc. and used by permission of the Khronos Group.

Figure 1. FPGA Architecture
(Diagram showing DSP blocks, memory blocks with ADDR, DATAIN, DATAOUT, WE, and CLK ports for sides A and B, programmable logic, and routing switch modules.)

With FPGAs, low-level operations like bit masking, shifting, and addition are all configurable. Also, you can assemble these operations in any order. To implement computation pipelines, FPGAs integrate combinations of lookup tables (LUTs), registers, on-chip memories, and arithmetic hardware (for example, digital signal processor (DSP) blocks) through a network of reconfigurable connections. As a result, FPGAs achieve a high level of programmability. LUTs are responsible for implementing various logic functions. For example, reprogramming a LUT can change an operation from a bit-wise AND logic function to a bit-wise XOR logic function.

The key benefit of using FPGAs for algorithm acceleration is that they support wide, heterogeneous, and unique pipeline implementations. This characteristic is in contrast to many different types of processing units such as symmetric multiprocessors, DSPs, and graphics processing units (GPUs). In these types of devices, parallelism is achieved by replicating the same generic computation hardware multiple times. In FPGAs, however, you can achieve parallelism by duplicating only the logic that your algorithm exercises.

A processor implements an instruction set that limits the amount of work it can perform each clock cycle.
For example, most processors do not have a dedicated instruction that can execute the following C code:

    (((A + B) ^ C) & D) >> 2

Without a dedicated instruction for this C code example, a CPU, DSP, or GPU must execute multiple instructions to perform the operation. In contrast, you may think of an FPGA as a hardware platform that can implement any instruction set that your software algorithm requires. You can configure an FPGA to perform a sequence of operations that implements the code example above in a single clock cycle. An FPGA implementation connects specialized addition hardware with a LUT that performs the bit-wise XOR and AND operations. The device then leverages its programmable connections to perform a right shift by two bits without consuming any hardware resources. The result of this operation then becomes a part of subsequent operations to form complex pipelines.

1.2 Pipelines

In a pipelined architecture, input data passes through a sequence of stages. Each stage performs an operation that contributes to the final result, such as a memory operation or a calculation.

The designs of microprocessors, digital signal processors (DSPs), hardware accelerators, and other high-performance implementations of digital hardware often contain pipeline architectures.

For example, the diagram below represents the following example code fragment as a multistage pipeline:

    for (i = 0; i < 1024; i++)
        y[i] = (a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i] + h[i]) >> 3;

Figure 2. Example Multistage Pipeline Diagram
(Diagram showing a chain of Add stages fed by the inputs a[i] through h[i], followed by a Shift by 3 stage, laid out along a Pipeline Stage axis.)

With a pipelined architecture, each arithmetic operation passes into the pipeline one at a time. Therefore, as shown in the diagram above, a saturated pipeline consists of eight stages that calculate the arithmetic operations simultaneously and in parallel.
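To make the eight stages concrete, the loop above can be written so that each statement corresponds to one stage of the chain. This is a minimal sketch in plain C, not code from the guide: the names `pipeline_element` and `average8`, the `int` element type, and the strictly linear add chain are assumptions made for illustration. On a CPU the statements run one after another; in the hardware pipeline described above, each stage would work on a different loop iteration during the same clock cycle.

```c
#include <stddef.h>

/* Hypothetical helper: one pass through the chain for a single element.
   Seven additions followed by one shift give the eight stages. */
static int pipeline_element(int a, int b, int c, int d,
                            int e, int f, int g, int h)
{
    int s1 = a + b;    /* stage 1: add */
    int s2 = s1 + c;   /* stage 2: add */
    int s3 = s2 + d;   /* stage 3: add */
    int s4 = s3 + e;   /* stage 4: add */
    int s5 = s4 + f;   /* stage 5: add */
    int s6 = s5 + g;   /* stage 6: add */
    int s7 = s6 + h;   /* stage 7: add */
    return s7 >> 3;    /* stage 8: shift right by 3 */
}

/* Hypothetical wrapper: applies the chain to n elements, like the
   1024-iteration loop in the text. */
void average8(const int *a, const int *b, const int *c, const int *d,
              const int *e, const int *f, const int *g, const int *h,
              int *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = pipeline_element(a[i], b[i], c[i], d[i],
                                e[i], f[i], g[i], h[i]);
    }
}
```

Calling `average8` with eight input arrays and `n` set to 1024 reproduces the loop from the text; the intermediate names `s1` through `s7` only exist to show where stage boundaries could fall.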
In addition, because of the large number of loop iterations, the pipeline stages continue to perform these arithmetic instructions concurrently for each subsequent loop iteration.

Intel FPGA SDK for OpenCL Pipeline Approach

A new pipeline is constructed based on your design. As a result, it can accommodate the highly configurable nature of FPGAs.

Consider the following OpenCL code fragment:

    C = (A >> 5) + B;
    F = (D - E) << 3;
    G = C + F;

You can configure an FPGA to instantiate a complex pipeline structure that executes the entire code simultaneously. In this case, the SDK implements the code as two independent pipelined entities that feed into a pipelined adder, as shown in the figure below.

Figure 3. Example of the SDK's Pipeline Approach
(Diagram showing inputs D and E feeding a Subtraction followed by a Shift Left by 3, input A feeding a Shift Right by 5 followed by an Addition with B, and a final Addition that combines both paths.)

The Intel FPGA SDK for OpenCL Offline Compiler provides a custom pipeline structure that speeds up computation by allowing operations within a large number of work-items to occur concurrently. The offline compiler can create a custom pipeline that calculates the values for variables C, F, and G every clock cycle, as shown below. After a ramp-up phase, the pipeline sustains a throughput of one work-item per cycle.

Figure 4. An FPGA Pipeline with Three Operations Per Clock Cycle
(Timing diagram with rows for C = (A >> 5) + B, F = (D - E) << 3, and G = C + F plotted against Time in Clock Cycles.)

A traditional processor has a limited set of shared registers. Eventually, a processor must write the stored data out to memory to allow more data to occupy the registers. The offline compiler keeps data "live" by generating enough registers to store the data for all the active work-items within the pipeline.
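The ramp-up phase and the idea of keeping values live in registers can be sketched with a small cycle-by-cycle model. This is a simplified illustration in plain C, not the offline compiler's actual hardware: it assumes the fragment computes C = (A >> 5) + B, F = (D - E) << 3, and G = C + F, as the figure labels suggest, and it collapses the datapath into two stages separated by one hypothetical register, `stage_reg_t`. The function name `run_pipeline` and the unsigned 32-bit data type are also assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pipeline register between the two stages: it keeps C and F
   live for the work-item that is currently in flight. */
typedef struct {
    int valid;
    uint32_t C;
    uint32_t F;
} stage_reg_t;

/* Runs n work-items through the two-stage model, writes G[0..n-1], and
   returns the number of modeled clock cycles. */
size_t run_pipeline(const uint32_t *A, const uint32_t *B,
                    const uint32_t *D, const uint32_t *E,
                    uint32_t *G, size_t n)
{
    stage_reg_t reg = {0, 0, 0};
    size_t issued = 0, retired = 0, cycles = 0;

    while (retired < n) {
        /* Stage 2: consume the values registered in the previous cycle. */
        if (reg.valid) {
            G[retired++] = reg.C + reg.F;
        }
        /* Stage 1: start the next work-item; C and F do not depend on
           each other, so hardware can compute them side by side. */
        if (issued < n) {
            reg.C = (A[issued] >> 5) + B[issued];
            reg.F = (D[issued] - E[issued]) << 3;
            reg.valid = 1;
            issued++;
        } else {
            reg.valid = 0;
        }
        cycles++;
    }
    return cycles;
}
```

In this model, n work-items take n + 1 cycles: one cycle of ramp-up, then one result per cycle. A deeper pipeline would need one such register set per stage, which is the sense in which the registers scale with the number of active work-items.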
The following code example and figure illustrate a live variable C in the OpenCL pipeline:

    size_t index = get_global_id(0);

    C = A[index] + B[index];
    E[index] = C - D[index];

Figure 5. An FPGA Pipeline with a Live Variable C
(Diagram showing work-items with index = 0, 1, and 2 moving through the Load A[index] and Load B[index] stage, the stage that holds C, and the Store E[index] stage over clock cycles 0 to 2.)

1.3 Single Work-Item Kernel versus NDRange Kernel

Intel recommends that you structure your OpenCL kernel as a single work-item, if possible. However, if your kernel program does not have loop and memory dependencies, you may structure your application as an NDRange kernel because the kernel can execute multiple work-items in parallel efficiently.

The Intel FPGA SDK for OpenCL host can execute a kernel as a single work-item, which is equivalent to launching a kernel with an NDRange size of (1, 1, 1).

The OpenCL Specification version 1.0 describes this mode of operation as task parallel programming. A task refers to a kernel executed with one work-group that contains one work-item.

Generally, the host launches multiple work-items in parallel. However, this data parallel programming model is not suitable for situations where fine-grained data must be shared among parallel work-items. In these cases, you can maximize throughput by expressing your kernel as a single work-item. Unlike NDRange kernels, single work-item kernels follow a natural sequential model similar to C programming. Particularly, you do not have to partition the data across work-items.

To ensure high-throughput single work-item-based kernel execution on the FPGA, the Intel FPGA SDK for OpenCL Offline Compiler must process multiple pipeline stages in parallel at any given time.
This parallelism is realized by pipelining the iterations of loops.

Consider the following simple example code that shows accumulating with a single work-item:

     1  kernel void accum_swg (global int* a, global int* c, int size, int k_size) {
     2    int sum[1024];
     3    for (int k = 0; k < k_size; ++k) {
     4      for (int i = 0; i < size; ++i) {
     5        int j = k * size + i;
     6        sum[k] += a[j];
     7      }
     8    }
     9    for (int k = 0; k < k_size; ++k) {
    10      c[k] = sum[k];
    11    }
    12  }

During each loop iteration, data values from the global memory a are accumulated to sum[k]. In this example, the inner loop on line 4 has an initiation interval value of 1 with a latency of 11. The outer loop also has an initiation interval value greater than or equal to 1 with a latency of 8.

Note: The launch frequency of a new loop iteration is called the initiation interval (II). II refers to the number of hardware clock cycles for which the pipeline must wait before it can process the next loop iteration. An optimally unrolled loop has an II value of 1 because one loop iteration is processed every clock cycle.

Figure 6. Loop Analysis Report
(Screenshot of the Loops analysis view for kernel accum_swg, marked as single work-item execution, listing blocks accum_swg.B1, accum_swg.B2, and accum_swg.B4; the Details column notes that II is an approximation for B2 and B4.)
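The effect of the initiation interval on run time can be estimated with a common first-order model: the first iteration takes the full pipeline latency, and every further iteration starts II cycles after the previous one. The sketch below is an assumption-laden illustration in plain C, not a formula taken from the guide or from the compiler reports. The function name `estimate_loop_cycles` is hypothetical, and the model ignores memory stalls and the interaction between nested loops.

```c
#include <stdint.h>

/* Hypothetical first-order estimate of the clock cycles a pipelined loop
   needs: latency for the first iteration, plus ii cycles for each of the
   remaining iterations. Returns 0 for an empty loop. */
uint64_t estimate_loop_cycles(uint64_t latency, uint64_t ii,
                              uint64_t iterations)
{
    if (iterations == 0) {
        return 0;
    }
    return latency + ii * (iterations - 1);
}
```

With the inner-loop figures quoted above (II of 1, latency of 11), `estimate_loop_cycles(11, 1, 1024)` gives 1034 cycles for 1024 iterations. If the same loop had an II of 2, the estimate would rise to 2057 cycles, which is why an II of 1 is the target.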
