3.7 Avoiding Pointer Aliasing........................................................................................86
3.8 Avoid Expensive Functions..................................................................................... 86
3.9 Avoiding Work-Item ID-Dependent Backward Branching............................................. 88
4 Profiling Your Kernel to Identify Performance Bottlenecks............................................ 89
4.1 Intel FPGA Dynamic Profiler for OpenCL Best Practices............................................... 90
4.2 Intel FPGA Dynamic Profiler for OpenCL GUI.............................................................90
4.2.1 Source Code Tab.......................................................................................90
4.2.2 Kernel Execution Tab................................................................................. 92
4.2.3 Autorun Captures Tab................................................................................ 94
4.3 Interpreting the Profiling Information...................................................................... 94
4.3.1 Stall, Occupancy, Bandwidth.......................................................................95
4.3.2 Activity....................................................................................................97
4.3.3 Cache Hit.................................................................................................98
4.3.4 Profiler Analyses of Example OpenCL Design Scenarios ................................. 98
4.3.5 Autorun Profiler Data............................................................................... 102
4.4 Intel FPGA Dynamic Profiler for OpenCL Limitations................................................. 102
5 Strategies for Improving Single Work-Item Kernel Performance................................. 104
5.1 Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback........................................................................................................104
5.1.1 Removing Loop-Carried Dependency..........................................................105
5.1.2 Relaxing Loop-Carried Dependency............................................................108
5.1.3 Simplifying Loop-Carried Dependency........................................................ 110
5.1.4 Transferring Loop-Carried Dependency to Local Memory............................... 113
5.1.5 Removing Loop-Carried Dependency by Inferring Shift Registers................... 115
5.2 Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays............. 116
5.3 Good Design Practices for Single Work-Item Kernel..................................................119
6 Strategies for Improving NDRange Kernel Data Processing Efficiency......................... 122
6.1 Specifying a Maximum Work-Group Size or a Required Work-Group Size..................... 122
6.2 Kernel Vectorization.............................................................................................124
6.2.1 Static Memory Coalescing.........................................................................125
6.3 Multiple Compute Units........................................................................................ 127
6.3.1 Compute Unit Replication versus Kernel SIMD Vectorization.......................... 128
6.4 Combination of Compute Unit Replication and Kernel SIMD Vectorization.................... 130
6.5 Resource-Driven Optimization............................................................................... 131
6.6 Reviewing Kernel Properties and Loop Unroll Status in the HTML Report..................... 132
7 Strategies for Improving Memory Access Efficiency.....................................................134
7.1 General Guidelines on Optimizing Memory Accesses.................................................134
7.2 Optimize Global Memory Accesses......................................................................... 135
7.2.1 Contiguous Memory Accesses................................................................... 136
7.2.2 Manual Partitioning of Global Memory........................................................ 137
7.3 Performing Kernel Computations Using Constant, Local or Private Memory.................. 138
7.3.1 Constant Cache Memory.......................................................................... 139
7.3.2 Preloading Data to Local Memory...............................................................139
7.3.3 Storing Variables and Arrays in Private Memory...........................................141
7.4 Improving Kernel Performance by Banking the Local Memory.................................... 141
7.4.1 Optimizing the Geometric Configuration of Local Memory Banks Based on Array Index........................................................................................... 144
7.5 Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor..... 146
Contents
Intel® FPGA SDK for OpenCL™ Best Practices Guide