Sparsification and Separation of Deep Learning Layers
for Constrained Resource Inference on Wearables
Sourav Bhattacharya§ and Nicholas D. Lane§,†
§Nokia Bell Labs, †University College London
ABSTRACT
Deep learning has revolutionized the way sensor data are analyzed and interpreted. The accuracy gains these approaches offer make them attractive for the next generation of mobile, wearable and embedded sensory applications. However, state-of-the-art deep learning algorithms typically require a significant amount of device and processor resources, even just for the inference stages that are used to discriminate high-level classes from low-level data. The limited availability of memory, computation, and energy on mobile and embedded platforms thus poses a significant challenge to the adoption of these powerful learning techniques. In this paper, we propose SparseSep, a new approach that leverages the sparsification of fully connected layers and the separation of convolutional kernels to reduce the resource requirements of popular deep learning algorithms. As a result, SparseSep allows large-scale DNNs and CNNs to run efficiently on mobile and embedded hardware with only minimal impact on inference accuracy. We experiment using SparseSep across a variety of common processors, such as the Qualcomm Snapdragon 400, ARM Cortex M0 and M3, and Nvidia Tegra K1, and show that it allows inference for various deep models to execute more efficiently; for example, on average requiring 11.3 times less memory and running 13.3 times faster on these representative platforms.
CCS Concepts
• Computing methodologies → Machine learning; Neural networks; • Computer systems organization → Embedded software;
Keywords
Wearable computing; deep learning; sparse coding; weight factorization
1. INTRODUCTION
Recognizing contextual signals and the everyday activity of users from raw sensor data is a core enabler for mobile and wearable applications. By monitoring user actions (via speech, ambient audio, motion) and context using a variety
of sensing modalities, mobile developers are able to provide both enhanced, and brand new, application features. While sensor-related applications and systems are still maturing, and are highly diverse, a notable characteristic is their reliance on making a wide variety of sensor inferences.
Accurately extracting context and activity information from noisy mobile sensor data remains an unsolved problem. Because the real world is highly complex, unpredictable and constantly changing, it often confuses the machine learning and signal processing algorithms used by mobile devices. One of the most promising directions today for overcoming such challenges is deep learning [1, 2]. Developments in this particular field of machine learning have caused the approaches and algorithms used in even mature sensing tasks to be completely changed (e.g., speech [3] and face [4] recognition). The study of deep learning usage for mobile applications is in its early stages (e.g., [5, 6, 7, 8]), but with promising initial results.
While deep learning offers important benefits to robust modeling, its integration into mobiles and wearables is complicated by the sizable system resource requirements these algorithms introduce. Barriers exist in the form of memory, computation and energy; these collectively prevent most deep models from executing directly on mobile hardware. Consequently, existing examples of deep learning for smartphones (e.g., speech recognition) remain largely cloud-assisted. This has a number of negative side-effects: first, inference execution becomes coupled to fluctuating and unpredictable network quality (e.g., latency, throughput); more importantly, it exposes users to privacy dangers [9] as sensitive data (e.g., audio) is processed off-device by a third party.
Allowing broader device-centric deep learning classification and prediction will require the development of brand-new techniques for optimized, resource-sensitive execution. Up to this point, the machine learning community has made excellent progress in training-time optimizations and is only now beginning to consider how these ideas transfer to inference time. Currently, most knowledge of deep learning algorithm behavior on constrained devices is largely limited to one-off task-specific experiences (e.g., [10, 11]). These systems are limited, however, to providing examples and evidence that local execution is feasible, although they do provide some insights for ways forward. What is required is a deeper study of these issues with an aim towards the development of techniques like off-line model optimization and runtime execution environments to match the resources (e.g., memory, computation and energy) present on edge devices like wearables and mobile phones.
In this work, we make significant progress towards the development of such algorithms and software by developing a sparse coding- and convolution kernel separation-based approach to optimizing deep learning model layers. This framework – SparseSep – includes: (1) a compiler, the Layer Compression Compiler (LCC), into which unchanged deep models are inserted and then optimized; (2) a runtime framework, the Sparse Inference Runtime (SIR), that is able to exploit the transformation of the model and realize radical reductions in computation, energy and memory usage; and (3) a separator, the Convolution Separation Runtime (CSR), that significantly reduces convolution operations. SparseSep allows a developer to adopt existing off-the-shelf deep models and scale their resource demands to match targets such as acceptable accuracy reduction and device limits (e.g., memory and necessary execution time).
The core concept of this work is the hypothesis that the computational and space complexity of deep learning models can be significantly improved through the sparse representation of key layers and the separation of convolution layers. Deep models often have millions of parameters spread throughout a number of hidden layers that capture robust representations of the data. Using theory from sparse dictionary learning, we investigate how the originally complex synaptic weight matrix can be captured in much smaller matrices that require less computational and memory resources. Critically, such theory affords the ability of these sparsified layers to be faithful to the originals, with theoretical bounds on important aspects such as reconstruction error. This is the first time this approach has been used.
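To pin down the idea in notation (a hedged illustration in our own symbols, not the paper's exact formulation): the synaptic weight matrix W of a layer, of size m by n, is approximated by the product of a sparse code matrix C and a small dictionary D, with k much smaller than min(m, n):

% Illustration (our notation, not the paper's): sparse factorization of a
% layer's weight matrix, with at most s nonzero entries per row of C.
\min_{\mathbf{C} \in \mathbb{R}^{m \times k},\; \mathbf{D} \in \mathbb{R}^{k \times n}}
  \lVert \mathbf{W} - \mathbf{C}\mathbf{D} \rVert_F^2
\quad \text{subject to} \quad
  \lVert \mathbf{c}_i \rVert_0 \le s \quad \text{for every row } \mathbf{c}_i \text{ of } \mathbf{C}.

Storing only the nonzeros of C together with the small D, instead of all mn entries of W, is what produces the memory and computation savings quantified later.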
Our experiments include both DNNs and CNNs, the most popular forms of deep learning today. Tests span audio classification tasks (ambient scene analysis and speaker identification) that are common in mobile sensing systems, along with image tasks (object recognition) seen in mobile vision devices like Google Glass. We find that, across a range of experiments and devices, SparseSep allows deep models to execute using (on average) only 26% of the original energy while sacrificing at most approximately 5% of the accuracy of these models. Specific examples include the Snapdragon 400 processor running a deep learning model for speaker identification with a 4.1 times improvement in execution time and a 17.6 times reduction in memory. Furthermore, we benchmark this deep learning version of speaker identification and find, as expected, that the deep model is much more robust than conventionally used models (such as random forests). Most important of all, we examine device restrictions found on other common processors like the Cortex M3 equipped with 32 KB of RAM. Not surprisingly, we find these processors cannot support any form of deep learning model tested (due to restrictions on computation and/or memory) – until we apply the SparseSep process.
The key scientific contributions of this research are:
• We propose, for the first time, a sparse coding-based approach to the optimization of deep learning inference execution. We also propose the use of a convolution kernel separation technique to minimize the overall computations of CNNs on resource-constrained platforms.
• To our knowledge, this work is the first to demonstrate very deep learning models (many-layer DNNs and CNNs) executing on severely constrained wearable hardware with acceptable levels of performance (energy efficiency, computation times).
Figure 1: A CNN mixes convolutional and feed-forward layers. [Figure: input → convolution layer → pooling layer → convolution layer → fully connected layers → output layer]

• We design and implement a prototype that realizes our approach to sparse dictionary learning and kernel separation in deep learning model representation and inference execution. We implement the necessary runtime components for 4 embedded and mobile processor platforms.
• We experiment with four different CNN and DNN models on large audio and image datasets. We demonstrate gains on the order of 11.3× improvements in memory and 13.3× in execution time under multiple experiment configurations, while suffering an accuracy loss of only ≈ 5%.
2. BACKGROUND
Popular deep learning architectures, such as Restricted Boltzmann Machines and Deep Belief Networks, share a common architecture; they are often collectively referred to as Deep Neural Networks. Typically, a DNN contains a number of fully-connected layers, where each layer is composed of a collection of nodes. Sensor measurements (e.g., audio, images) are fed to the first layer (the input layer). The final layer, also known as the output layer, corresponds to inference classes, with nodes capturing individual inference categories (e.g., music or cat). Layers in between the input and the output layer are referred to as hidden layers. The degree of influence between units of adjacent layers varies on a pairwise basis, determined by a weight value. Together with the synaptic connections and inherent non-linearity, the hidden layers transform raw data applied to the input layer into the prediction classes captured in the output layer.
DNN-based inferencing follows a feed-forward algorithm that operates on sensor data segments in isolation. The algorithm starts at the input layer and moves sequentially layer by layer, updating the activation states of all nodes one by one. The process finishes at the output layer when all nodes have been updated. Finally, the inferred class is identified as the class corresponding to the output layer node with the greatest state value.
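The feed-forward pass described above reduces to a chain of matrix-vector products with a non-linearity in between. A minimal sketch in NumPy (our illustration, not the authors' implementation; the weight list Ws, bias list bs and the ReLU non-linearity are assumptions):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dnn_inference(x, Ws, bs):
    # Propagate one sensor-data segment through the layers in sequence.
    a = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = relu(W @ a + b)            # update the activations of each hidden layer
    logits = Ws[-1] @ a + bs[-1]       # output layer
    return int(np.argmax(logits))      # class of the node with the greatest state value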
CNNs are another popular class of deep models that share architectural similarities with DNNs. As presented in Figure 1, a CNN model contains one or more convolutional layers, pooling or sub-sampling layers, and fully connected layers (equivalent to those used in DNNs). The objective of these layers is to extract simple representations from the input data, and then convert these representations into more complex representations at much coarser resolutions within the subsequent layers. For instance, first, convolutional filters (with small kernel width) are applied to the input data to capture local data properties. Next, max or min pooling is applied to make the representations invariant to translations. Pooling operations can also be seen as a form of dimensionality reduction. Lastly, fully connected layers (i.e., a DNN) help a CNN make predictions.
A CNN follows a sequential approach, as in DNNs, generating one isolated prediction at a time. Often in CNN-based predictions, sensor data is first vectorized into two dimensions. Next, the data is passed through a series of convolution, pooling and non-linear layers. The convolution and pooling layers can be viewed as a feature extractor that runs before the fully connected layers are engaged. Inference then proceeds exactly as previously described for DNNs until ultimately a classification is reached.
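The whole pipeline can be sketched compactly. The following NumPy illustration (ours, with a hypothetical single kernel and single dense layer; real CNNs stack many such layers) mirrors the convolution, pooling, non-linearity and fully connected flow described above:

import numpy as np

def conv2d(x, k):
    # Valid 2-D convolution in the deep learning sense (cross-correlation, no padding).
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, p=2):
    # Non-overlapping p x p max pooling: translation invariance, fewer dimensions.
    h, w = (x.shape[0] // p) * p, (x.shape[1] // p) * p
    return x[:h, :w].reshape(h // p, p, w // p, p).max(axis=(1, 3))

def cnn_inference(x, kernel, W, b):
    feat = np.maximum(0.0, max_pool(conv2d(x, kernel)))  # feature extraction stages
    logits = W @ feat.ravel() + b                        # fully connected (DNN) head
    return int(np.argmax(logits))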
Contrary to shallow learning-based models, deep learning models are usually large and often contain more than a million parameters. The large parameter space improves the capacity of these models, and they often outperform prior shallow models in terms of generalization performance. However, the accuracy gains come at the expense of high energy and memory costs. Although high-end wearables containing a GPU (e.g., the NVIDIA Tegra K1) can efficiently run deep models [12], the high resource demands make deep learning models unattractive for low-end wearables. In this paper, we explore sparse factorizations and convolutional kernel separations to optimize the resource demands of deep models, while maintaining the functional properties of the models.
3. DESIGN AND OPERATION
Beginning with this section, and spanning the following two, we detail the design and algorithms of SparseSep.
3.1 Design Goals
SparseSep is shaped by the following objectives.
• No Re-training. The training of a large deep model is the most time-consuming and computationally demanding task. For example, a large model such as GoogleNet is trained using thousands of CPU cores [13], which is beyond the current capabilities of a single wearable device. In this work, we focus mainly on the inference cycle of a deep model and perform no training on the resource-constrained devices. The training process also requires a very large training dataset, often inaccessible to developers [14]. Thus, new techniques are needed to compress popular cloud-scale deep learning models so that they run gracefully on wearable- and IoT-grade hardware.
• No Cloud Offloading. As noted in §1, offloading the execution of portions of deep models can result in leaking sensitive sensor data. By keeping inference completely local, users and applications have greater privacy protection, as neither the data nor any intermediate results ever leave the device.
• Target Low-resource Platforms. Even high-end mobile processors (such as the Tegra K1 [15]) still require careful resource use when executing deep learning models. But in this class of processors, the gap in resources is closing. However, for low-energy, highly portable wearable processors that lack GPUs or have only a few MBs of RAM (e.g., ARM Cortex M3 [16]), local execution of deep models remains impractical. For this reason, SparseSep turns to new ideas like the sparsification of weights and kernel separation, in search of the leaps in resource efficiency required to make these low-end processors viable.
• Minimize Model Changes. Deep models must undergo some degree of change to enable their operation on wearable hardware. However, a core tenet of SparseSep is to minimize the extent of such modifications and remain functionally faithful to the initial model architecture. For this reason, we frame the problem as one of deep model compression (originally formulated by the machine learning community), where model layer arrangements remain unchanged and only per-layer connections are changed through the insertion of additional summarizing layers. Thus, the degree of change made by SparseSep is a key metric that is minimized during model processing.
• Adopt Principled Approaches. Ad-hoc methods to alter a deep model – such as 'specializing' a model to recognize a smaller set of activities/contexts, or changing layer/unit parameters to generate a desired resource consumption profile – are dangerous, as they violate the domain experience of the modeling experts. Methods like sparse coding [17] and model compression [18] are supported by theoretical analysis [19]. Assessing whether a model can be altered solely by changes in the accuracy metric can be dangerous and can potentially hurt, for example, its ability to generalize.
3.2 Overview
We now briefly outline the core approach of SparseSep to optimize the architecture of large deep learning models so that they meet the constraints of target wearable devices. In §4 we provide the necessary theory and algorithms of this process, but we begin here with the key ideas.
The inference pipeline of a deep learning model is dominated by a series of matrix computations, especially multiplications, and convolutions. Attempts have been made to optimize the total number of computations by low-rank factorization of the weight matrix or by decomposing convolutional kernels into separable filters in an ad-hoc manner. Both weight factorization and kernel separation, however, require modifying the architecture of the model by inserting a new layer and updating weight components (see §4.1 and §4.4). Although counter-intuitive, the insertion of a new layer only achieves computational efficiency under certain conditions, which depend on, e.g., the size of the newly inserted layer, the size of the original weight matrix, and the size of the convolutional kernels. In §4.1, §4.2 and §4.4 we derive and show the conditions under which computational and memory efficiencies can be achieved.
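To see why inserting a layer can pay off, consider a back-of-the-envelope sketch (our illustration with hypothetical layer sizes, not the derivation of §4). Factorizing an m × n weight matrix through an inserted layer of k units stores and multiplies k(m + n) weights instead of mn, which is only a saving when k < mn/(m + n):

import numpy as np

m, n, k = 2048, 2048, 256            # hypothetical layer sizes and inserted-layer width
W = np.random.randn(m, n)            # stand-in for a trained weight matrix

# Truncated SVD yields an optimal rank-k factorization W ~= U @ V.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :k] * s[:k]            # (m, k): weights into the inserted layer
V = Vt[:k, :]                        # (k, n): weights out of the inserted layer

params_original = m * n              # 4,194,304 weights
params_factored = k * (m + n)        # 1,048,576 weights: a 4x reduction here
assert params_factored < params_original   # holds iff k < m*n / (m + n)

x = np.random.randn(n)
y = U @ (V @ x)                      # two small products replace one large one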
In this paper, we postulate that the computational and space efficiency of deep learning models can be further improved by adding sparsity constraints to the factorization process. Accordingly, we propose a sparse dictionary learning approach to enforcing a sparse factorization of the weight matrix (see §4.3). In §5.2 we show that under specific sparsity conditions the resource scalability of the proposed approach is significantly better than that of existing approaches.
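As a concrete, hedged illustration of sparsity-constrained factorization (not the paper's own solver, whose formulation appears in §4.3), off-the-shelf dictionary learning from scikit-learn can produce such a sparse code/dictionary pair:

import numpy as np
from sklearn.decomposition import DictionaryLearning

W = np.random.randn(512, 256)         # stand-in for fully connected layer weights

dict_learner = DictionaryLearning(
    n_components=64,                  # width of the inserted summarizing layer
    transform_algorithm='omp',        # orthogonal matching pursuit for the codes
    transform_n_nonzero_coefs=8,      # at most 8 nonzero coefficients per row code
    max_iter=10,
)
C = dict_learner.fit_transform(W)     # sparse code matrix, shape (512, 64)
D = dict_learner.components_          # dense dictionary, shape (64, 256)

# Only the nonzeros of C plus the small D need to be stored and multiplied
# at inference, in place of the full 512 x 256 matrix: W ~= C @ D.
rel_err = np.linalg.norm(W - C @ D) / np.linalg.norm(W)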
The weight factorization approach significantly reduces the memory footprint of both DNN and CNN models by optimizing the parameter space of the fully connected layers. The factorization also helps to reduce the overall number of operations needed and improves the inference time. However, the inference time improvement due to factorization is much more pronounced for DNNs than for CNNs. This is primarily because a major portion of the CNN-based inference time (often over 95%) is spent on performing convolution operations [12, 20], where the layer factorization technique has no influence. To overcome this limitation, we also propose a runtime convolution kernel separation technique that optimizes the convolution operations to reduce the overall inference time.
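A sketch of the underlying idea (ours, using a deliberately rank-1 kernel so the separation is exact; the technique for general learned kernels is developed in §4.4): a separable k × k kernel factors into a column filter and a row filter, so one pass costing k² multiplies per output pixel becomes two passes costing k multiplies each:

import numpy as np
from scipy.signal import convolve2d

k = np.outer([1.0, 2.0, 1.0], [-1.0, 0.0, 1.0])   # rank-1, Sobel-like 3x3 kernel

# SVD exposes separability: a rank-1 kernel has a single nonzero singular
# value and factors exactly into a column filter times a row filter.
u, s, vt = np.linalg.svd(k)
col = u[:, 0] * np.sqrt(s[0])          # vertical 1-D filter
row = vt[0, :] * np.sqrt(s[0])         # horizontal 1-D filter

x = np.random.randn(64, 64)
direct = convolve2d(x, k, mode='valid')            # ~9 multiplies per output pixel
separated = convolve2d(
    convolve2d(x, col[:, None], mode='valid'),     # ~3 multiplies per pixel...
    row[None, :], mode='valid')                    # ...plus ~3 more per pixel
assert np.allclose(direct, separated)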