HNMTP CONV: OPTIMIZE CONVOLUTION ALGORITHM FOR
SINGLE-IMAGE CONVOLUTION NEURAL NETWORK INFERENCE
ON MOBILE GPUS
A PREPRINT
Zhuoran Ji
Department of Computer Science
The University of Hong Kong
Hong Kong, China
jizr@hku.hk
September 9, 2019
ABSTRACT
Convolution neural networks are widely used in mobile applications. However, GPU convolution
algorithms are designed for mini-batch neural network training, and single-image convolution neural
network inference on mobile GPUs is not well studied. After discussing the differences in usage
and examining the existing convolution algorithms, we propose the HNMTP convolution
algorithm. The HNMTP convolution algorithm achieves a 14.6× speedup over the most popular im2col
convolution algorithm, and a 2.1× speedup over the fastest existing convolution algorithm (direct
convolution) that we are aware of.
Keywords Convolution Neural Network · Mobile GPU · Edge Computing Platforms
1 Introduction
This preprint is a technical report rather than a paper, so the introduction is kept short. In this report, we address the
challenges and opportunities of single-image convolution neural network inference on mobile GPUs. We discuss the
popular existing GPU convolution algorithms and propose the HNMTP (HNMTP does Not Map Threads to Pixels)
convolution algorithm.
2 Challenges and Opportunities of Single-Image Inference on Mobile GPUs
Even though both deal with convolution neural networks, single-image convolution neural network inference
on mobile GPUs is quite different from mini-batch convolution neural network training on high-end dedicated GPUs, if
not an entirely different story. The differences mainly come from three aspects: the disparity in the number of input
images, the gap between mobile GPUs and dedicated GPUs, and the different engineering considerations.
2.1 Single-Image Inference Reduces Thread-Level Parallelism
The most critical difference, and challenge, of single-image inference is that only one image is fed into the convolution
neural network. For a single image, the insufficiency of data parallelism prevents us from using as many threads as
mini-batch training does. There are two ways for GPUs to hide latency: thread-level parallelism (TLP) and instruction-
level parallelism (ILP). The insufficiency of threads reduces thread-level parallelism significantly and may lead to
low GPU utilization. For single-image inference, we therefore need to take better advantage of instruction-level parallelism.
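To make this trade-off concrete, the sketch below shows one common way to recover latency hiding when a single image provides too few threads: each thread computes several output pixels with independent accumulators, so their multiply-adds can be issued back-to-back. This is only an illustrative simplification under our own assumptions (a 1x1 convolution, CUDA rather than the OpenCL typically used on mobile GPUs, and hypothetical names such as conv1x1_ilp, C_in, and HW); it is not the HNMTP kernel itself.

// Illustrative CUDA sketch (not the HNMTP kernel): a 1x1 convolution in which
// each thread produces OUTPUTS_PER_THREAD pixels with independent accumulators,
// exposing instruction-level parallelism when thread-level parallelism is scarce.
#define OUTPUTS_PER_THREAD 4

// in  : [C_in,  HW]   input feature map (channels x pixels, row-major)
// w   : [C_out, C_in] 1x1 convolution weights
// out : [C_out, HW]   output feature map
__global__ void conv1x1_ilp(const float* __restrict__ in,
                            const float* __restrict__ w,
                            float* __restrict__ out,
                            int C_in, int C_out, int HW)
{
    int oc  = blockIdx.y;                                   // output channel
    int px0 = (blockIdx.x * blockDim.x + threadIdx.x) * OUTPUTS_PER_THREAD;

    // Independent accumulators: the multiply-adds in one iteration have no
    // dependence on each other, so they can be issued back-to-back.
    float acc[OUTPUTS_PER_THREAD] = {0.f, 0.f, 0.f, 0.f};

    for (int ic = 0; ic < C_in; ++ic) {
        float wv = w[oc * C_in + ic];                       // weight reused for all pixels
        #pragma unroll
        for (int k = 0; k < OUTPUTS_PER_THREAD; ++k) {
            int px = px0 + k;
            if (px < HW) acc[k] += wv * in[ic * HW + px];
        }
    }

    #pragma unroll
    for (int k = 0; k < OUTPUTS_PER_THREAD; ++k) {
        int px = px0 + k;
        if (px < HW) out[oc * HW + px] = acc[k];
    }
}

A thread that keeps a single running sum chains all of its multiply-adds through one register dependence, so each must wait for the previous one to complete; spreading the work over a few independent accumulators trades some thread count for instruction-level parallelism, which is exactly what single-image inference lacks.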
Instruction-level parallelism means issuing independent long-latency instructions back-to-back in the pipeline. The compiler is
responsible for scheduling and rearranging instructions to achieve high instruction-level parallelism. It also fuses