Efficient Deep Learning: A Survey on Making Deep Learning
Models Smaller, Faster, and Better
GAURAV MENGHANI, Google Research, USA
Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. We present and motivate the problem of efficiency in deep learning, followed by a thorough survey of the five core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware) and the seminal work there. We also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support. Our hope is that this survey provides the reader with the mental model and the necessary understanding of the field to apply generic efficiency techniques to immediately get significant improvements, and also equips them with ideas for further research and experimentation to achieve additional gains.
ACM Reference Format:
Gaurav Menghani. 2021. Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. 1, 1 (June 2021), 43 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Deep Learning with neural networks has been the dominant methodology for training new machine learning models for the past decade. Its rise to prominence is often attributed to the ImageNet competition [45] in 2012. That year, a University of Toronto team submitted a deep convolutional network (AlexNet [92], named after the lead developer Alex Krizhevsky) that performed 41% better than the next best submission. As a result of this trailblazing work, there was a race to create deeper networks with an ever increasing number of parameters and complexity. Several model architectures such as VGGNet [141], Inception [146], ResNet [73], etc. successively beat previous records at ImageNet competitions in the subsequent years, while also increasing in their footprint (model size, latency, etc.).
This effect has also been noted in natural language understanding (NLU), where the Transformer [154] architecture, based primarily on attention layers, spurred the development of general-purpose language encoders like BERT [47], GPT-3 [26], etc. BERT specifically beat the state of the art on 11 NLU benchmarks when it was published. GPT-3 has also been used in several places in industry via its API. The common aspect amongst these domains is the rapid growth in model footprint (refer to Figure 1), and the cost associated with training and deploying these models.
Since deep learning research has been focused on improving the state of the art, progressive improvements on benchmarks like image classification, text classification, etc. have been correlated with an increase in the network complexity, number of parameters, the amount of training resources