Intel FPGA SDK for OpenCL Best Practices Guide


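The Intel FPGA SDK for OpenCL Best Practices Guide is a detailed document for developing OpenCL applications on Intel FPGAs. It collects examples and best practices that help developers write and optimize OpenCL programs more effectively on Intel FPGA platforms. It covers FPGA fundamentals, pipelines, the differences between single work-item and NDRange kernels, and multi-threaded host applications. The FPGA overview introduces basic FPGA concepts and architecture and explains how FPGAs differ from CPUs and GPUs. The pipelines section explains the principles and uses of pipelining. The single work-item versus NDRange section distinguishes these two kernel execution models. The multi-threaded host application section describes how to build multi-threaded host programs that improve interaction with the FPGA.

The guide places particular emphasis on reviewing and analyzing the kernel report file (report.html), which is a key step in performance optimization. The report covers the high-level design report layout, the report summary, loop information, area information, and memory replication and stalls. The layout section helps developers understand the structure of the whole design, and the report summary gives a quick overview of kernel performance and resource usage. The loop information section analyzes how loops are implemented and includes worked examples such as changing the memory access pattern and using loop_coalesce to reduce the area consumed by nested loops. The area analysis covers both source-level and system-level views. To verify memory replication and stalls, the guide describes the System Viewer and the Kernel Memory Viewer, which help developers locate and resolve performance bottlenecks. It also walks through optimizing an OpenCL design example based on the information in the HTML report and explains how to read the corresponding reports.

The OpenCL kernel design best practices cover transferring data through Intel FPGA SDK for OpenCL channels or OpenCL pipes, including their characteristics, execution order, and how to optimize buffer inference for them. Loop unrolling, an important performance optimization, is explained together with its techniques and best practices. The guide also discusses optimizing floating-point operations, compares floating-point and fixed-point representations, and gives memory optimization tips such as allocating aligned memory, aligning structs, and avoiding pointer aliasing. It further explains kernel design concepts, including kernels, the global memory interconnect, local memory, nested loops, loops in single work-item kernels, and channels. These concepts are central to optimizing an OpenCL design because they determine how computation is distributed across the FPGA and how efficiently it executes. With this guide, developers can learn to use hardware resources more efficiently, improve kernel performance, and optimize memory access patterns and data transfers. It is suitable both for beginners and for experienced developers who want to deepen their FPGA development skills.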


Intel FPGA SDK for OpenCL™ Best Practices Guide

Updated for Intel Quartus Prime Design Suite: 17.1


UG-OCL003 | 2017.12.08


Contents

1 Introduction
1.1 FPGA Overview
1.2 Pipelines
1.3 Single Work-Item Kernel versus NDRange Kernel
1.4 Multi-Threaded Host Application
2 Reviewing Your Kernel's report.html File
2.1 High Level Design Report Layout
2.2 Reviewing the Report Summary
2.3 Reviewing Loop Information
2.3.1 Loop Analysis Report of an OpenCL Design Example
2.3.2 Changing the Memory Access Pattern Example
2.3.3 Reducing the Area Consumed by Nested Loops Using loop_coalesce
2.4 Reviewing Area Information
2.4.1 Area Analysis by Source
2.4.2 Area Analysis of System
2.5 Verifying Information on Memory Replication and Stalls
2.5.1 Features of the System Viewer
2.5.2 Features of the Kernel Memory Viewer
2.6 Optimizing an OpenCL Design Example Based on Information in the HTML Report
2.7 HTML Report: Area Report Messages
2.7.1 Area Report Message for Board Interface
2.7.2 Area Report Message for Function Overhead
2.7.3 Area Report Message for State
2.7.4 Area Report Message for Feedback
2.7.5 Area Report Message for Constant Memory
2.7.6 Area Report Messages for Private Variable Storage
2.8 HTML Report: Kernel Design Concepts
2.8.1 Kernels
2.8.2 Global Memory Interconnect
2.8.3 Local Memory
2.8.4 Nested Loops
2.8.5 Loops in a Single Work-Item Kernel
2.8.6 Channels
2.8.7 Load-Store Units
3 OpenCL Kernel Design Best Practices
3.1 Transferring Data Via Intel FPGA SDK for OpenCL Channels or OpenCL Pipes
3.1.1 Characteristics of Channels and Pipes
3.1.2 Execution Order for Channels and Pipes
3.1.3 Optimizing Buffer Inference for Channels or Pipes
3.1.4 Best Practices for Channels and Pipes
3.2 Unrolling Loops
3.3 Optimizing Floating-Point Operations
3.3.1 Floating-Point versus Fixed-Point Representations
3.4 Allocating Aligned Memory
3.5 Aligning a Struct with or without Padding
3.6 Maintaining Similar Structures for Vector Type Elements


3.7 Avoiding Pointer Aliasing
3.8 Avoid Expensive Functions
3.9 Avoiding Work-Item ID-Dependent Backward Branching
4 Profiling Your Kernel to Identify Performance Bottlenecks
4.1 Intel FPGA Dynamic Profiler for OpenCL Best Practices
4.2 Intel FPGA Dynamic Profiler for OpenCL GUI
4.2.1 Source Code Tab
4.2.2 Kernel Execution Tab
4.2.3 Autorun Captures Tab
4.3 Interpreting the Profiling Information
4.3.1 Stall, Occupancy, Bandwidth
4.3.2 Activity
4.3.3 Cache Hit
4.3.4 Profiler Analyses of Example OpenCL Design Scenarios
4.3.5 Autorun Profiler Data
4.4 Intel FPGA Dynamic Profiler for OpenCL Limitations
5 Strategies for Improving Single Work-Item Kernel Performance
5.1 Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback
5.1.1 Removing Loop-Carried Dependency
5.1.2 Relaxing Loop-Carried Dependency
5.1.3 Simplifying Loop-Carried Dependency
5.1.4 Transferring Loop-Carried Dependency to Local Memory
5.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers
5.2 Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays
5.3 Good Design Practices for Single Work-Item Kernel
6 Strategies for Improving NDRange Kernel Data Processing Efficiency
6.1 Specifying a Maximum Work-Group Size or a Required Work-Group Size
6.2 Kernel Vectorization
6.2.1 Static Memory Coalescing
6.3 Multiple Compute Units
6.3.1 Compute Unit Replication versus Kernel SIMD Vectorization
6.4 Combination of Compute Unit Replication and Kernel SIMD Vectorization
6.5 Resource-Driven Optimization
6.6 Reviewing Kernel Properties and Loop Unroll Status in the HTML Report
7 Strategies for Improving Memory Access Efficiency
7.1 General Guidelines on Optimizing Memory Accesses
7.2 Optimize Global Memory Accesses
7.2.1 Contiguous Memory Accesses
7.2.2 Manual Partitioning of Global Memory
7.3 Performing Kernel Computations Using Constant, Local or Private Memory
7.3.1 Constant Cache Memory
7.3.2 Preloading Data to Local Memory
7.3.3 Storing Variables and Arrays in Private Memory
7.4 Improving Kernel Performance by Banking the Local Memory
7.4.1 Optimizing the Geometric Configuration of Local Memory Banks Based on Array Index
7.5 Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor


1 Introduction

The Intel FPGA SDK for OpenCL™ Best Practices Guide provides guidance on using the features of the Intel FPGA Software Development Kit (SDK) for OpenCL(1) to optimize your OpenCL(2) applications for Intel FPGA products.

This document assumes that you are familiar with OpenCL concepts and application programming interfaces (APIs), as described in the OpenCL Specification version 1.0 by the Khronos Group. It also assumes that you have experience in creating OpenCL applications.

To achieve the highest performance of your OpenCL application on FPGAs, familiarize yourself with the details of the underlying hardware. In addition, understand the compiler optimizations that convert and map your OpenCL application to FPGAs.

For more information on the OpenCL Specification version 1.0, refer to the OpenCL Reference Pages on the Khronos Group website. For detailed information on the OpenCL APIs and programming language, refer to the OpenCL Specification version 1.0.
