HNMTP CONV: OPTIMIZE CONVOLUTION ALGORITHM FOR
SINGLE-IMAGE CONVOLUTION NEURAL NETWORK INFERENCE
ON MOBILE GPUS
A PREPRINT
Zhuoran Ji
Department of Computer Science
The University of Hong Kong
Hong Kong, China
jizr@hku.hk
September 9, 2019
ABSTRACT
Convolution neural networks are widely used in mobile applications. However, GPU convolution
algorithms are designed for mini-batch neural network training, and single-image convolution neural
network inference on mobile GPUs is not well studied. After discussing the differences in usage
and examining the existing convolution algorithms, we propose the HNMTP convolution
algorithm. The HNMTP convolution algorithm achieves a 14.6× speedup over the most popular im2col
convolution algorithm, and a 2.1× speedup over the fastest existing convolution algorithm (direct
convolution) that we are aware of.
Keywords Convolution Neural Network · Mobile GPU · Edge Computing Platforms
1 Introduction
This preprint is a technical report rather than a paper, so the introduction is kept short. In this report, we address the
challenges and opportunities of single-image convolution neural network inference on mobile GPUs. We discuss the
popular existing GPU convolution algorithms and propose the HNMTP (HNMTP does Not Map Threads to Pixels)
convolution algorithm.
2 Challenges and Opportunities of Single-Image Inference on Mobile GPUs
Even though both deal with convolution neural networks, single-image convolution neural network inference
on mobile GPUs is quite different from mini-batch convolution neural network training on high-end dedicated GPUs, if
not an entirely different story. The differences mainly come from three aspects: the disparity in the number of input
images, the gap between mobile GPUs and dedicated GPUs, and the different engineering considerations.
2.1 Single-Image Inference Reduces Thread-Level Parallelism
The most critical difference, and challenge, of single-image inference is that only one image is fed into the convolution
neural network. For a single image, the insufficiency of data parallelism prevents us from using as many threads as
mini-batch training does. There are two ways for GPUs to hide latency: thread-level parallelism (TLP) and instruction-
level parallelism (ILP). The insufficiency of threads reduces thread-level parallelism significantly and may lead to
low GPU utilization. For single-image inference, we therefore need to take better advantage of instruction-level parallelism.
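To make this trade-off concrete, the sketch below shows one common way to recover latency hiding when a single image provides too few threads: each thread computes several output pixels with independent accumulators, so their multiply-adds can be issued back-to-back. This is only an illustrative simplification under our own assumptions (a 1x1 convolution, CUDA rather than the OpenCL typically used on mobile GPUs, and hypothetical names such as conv1x1_ilp, C_in, and HW); it is not the HNMTP kernel itself.

// Illustrative CUDA sketch (not the HNMTP kernel): a 1x1 convolution in which
// each thread produces OUTPUTS_PER_THREAD pixels with independent accumulators,
// exposing instruction-level parallelism when thread-level parallelism is scarce.
#define OUTPUTS_PER_THREAD 4

// in  : [C_in,  HW]   input feature map (channels x pixels, row-major)
// w   : [C_out, C_in] 1x1 convolution weights
// out : [C_out, HW]   output feature map
__global__ void conv1x1_ilp(const float* __restrict__ in,
                            const float* __restrict__ w,
                            float* __restrict__ out,
                            int C_in, int C_out, int HW)
{
    int oc  = blockIdx.y;                                   // output channel
    int px0 = (blockIdx.x * blockDim.x + threadIdx.x) * OUTPUTS_PER_THREAD;

    // Independent accumulators: the multiply-adds in one iteration have no
    // dependence on each other, so they can be issued back-to-back.
    float acc[OUTPUTS_PER_THREAD] = {0.f, 0.f, 0.f, 0.f};

    for (int ic = 0; ic < C_in; ++ic) {
        float wv = w[oc * C_in + ic];                       // weight reused for all pixels
        #pragma unroll
        for (int k = 0; k < OUTPUTS_PER_THREAD; ++k) {
            int px = px0 + k;
            if (px < HW) acc[k] += wv * in[ic * HW + px];
        }
    }

    #pragma unroll
    for (int k = 0; k < OUTPUTS_PER_THREAD; ++k) {
        int px = px0 + k;
        if (px < HW) out[oc * HW + px] = acc[k];
    }
}

A thread that keeps a single running sum chains all of its multiply-adds through one register dependence, so each must wait for the previous one to complete; spreading the work over a few independent accumulators trades some thread count for instruction-level parallelism, which is exactly what single-image inference lacks.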
Instruction-level parallelism means issuing independent long-latency instructions back-to-back in the pipeline. The compiler is
responsible for scheduling and rearranging instructions to achieve high instruction-level parallelism. It also fuses