Figure 1: Left: a single once-for-all network is trained to support versatile architectural configurations
including depth, width, kernel size, and resolution. Given a deployment scenario, a specialized sub-
network is directly selected from the once-for-all network without training. Middle: this approach
reduces the cost of specialized deep learning deployment from O(N) to O(1). Right: once-for-all
network followed by model selection can derive many accuracy-latency trade-offs by training only
once, compared to conventional methods that require repeated training.
2018¹) and highly dynamic deployment environments (different battery conditions, different latency requirements, etc.).
This paper introduces a new solution to tackle this challenge – designing a once-for-all network that
can be directly deployed under diverse architectural configurations, amortizing the training cost. The
inference is performed by selecting only part of the once-for-all network. It flexibly supports different
depths, widths, kernel sizes, and resolutions without retraining. A simple example of Once-for-All
(OFA) is illustrated in Figure 1 (left). Specifically, we decouple the model training stage from the neural architecture search stage. In the model training stage, we focus on improving the accuracy of all sub-networks that are derived by selecting different parts of the once-for-all network. In the neural architecture search stage, we sample a subset of sub-networks to train an accuracy predictor and latency predictors. Given the target hardware and constraint, a predictor-guided architecture search (Liu et al., 2018) is conducted to obtain a specialized sub-network at negligible cost. As such,
we reduce the total cost of specialized neural network design from O(N) to O(1) (Figure 1 middle).
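To make the specialization step concrete, the sketch below shows how a sub-network could be selected with the two predictors. It is a minimal illustration under assumed interfaces: the search-space bounds, sample_subnet_config, acc_predictor, and lat_predictor are hypothetical stand-ins, and a plain random search replaces the predictor-guided search described above.

```python
import random

# Hypothetical search-space bounds for illustration only.
DEPTHS = [2, 3, 4]
KERNEL_SIZES = [3, 5, 7]
WIDTH_EXPANSIONS = [3, 4, 6]
RESOLUTIONS = list(range(128, 225, 4))
NUM_UNITS = 5


def sample_subnet_config():
    """Randomly sample one sub-network configuration from the once-for-all space."""
    return {
        "resolution": random.choice(RESOLUTIONS),
        "depths": [random.choice(DEPTHS) for _ in range(NUM_UNITS)],
        "kernel_sizes": [random.choice(KERNEL_SIZES) for _ in range(NUM_UNITS)],
        "widths": [random.choice(WIDTH_EXPANSIONS) for _ in range(NUM_UNITS)],
    }


def search_subnet(acc_predictor, lat_predictor, latency_budget_ms, num_samples=10000):
    """Return the sampled config with the best predicted accuracy under the budget.

    acc_predictor / lat_predictor are callables mapping a config dict to a
    predicted accuracy / latency; they stand in for the learned predictors.
    """
    best_cfg, best_acc = None, -1.0
    for _ in range(num_samples):
        cfg = sample_subnet_config()
        if lat_predictor(cfg) > latency_budget_ms:
            continue  # violates the hardware latency constraint
        acc = acc_predictor(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg
```

Because the predictors are cheap to evaluate, this selection step takes seconds on a CPU rather than GPU days, which is what allows the design cost to be amortized across deployment scenarios.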
However, training the once-for-all network is a non-trivial task, since it requires joint optimization
of the weights to maintain the accuracy of a large number of sub-networks (more than 10^19 in our
experiments). It is computationally prohibitive to enumerate all sub-networks to get the exact gradient in each update step, while randomly sampling a few sub-networks in each step leads to significant accuracy drops. The challenge is that different sub-networks interfere with each other, making
the training process of the whole once-for-all network inefficient. To address this challenge, we
propose a progressive shrinking algorithm for training the once-for-all network. Instead of directly
optimizing the once-for-all network from scratch, we propose to first train the largest neural network
with maximum depth, width, and kernel size, then progressively fine-tune the once-for-all network to
support smaller sub-networks that share weights with the larger ones. As such, smaller sub-networks start from a better initialization (the most important weights of the larger sub-networks) and can be distilled from the larger ones, which greatly improves the training efficiency. From this perspective,
progressive shrinking can be viewed as a generalized network pruning method that shrinks multiple
dimensions (depth, width, kernel size, and resolution) of the full network rather than only the width
dimension. Moreover, it aims to maintain the accuracy of all sub-networks rather than a single pruned network.
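The following is a minimal sketch of one way the progressive shrinking schedule could be implemented in PyTorch. It is illustrative only: the elastic ofa_net interface, its sample_subnet and get_full_subnet helpers, and the stage list are hypothetical placeholders rather than the released implementation, and details such as elastic resolution, learning-rate schedules, and the exact distillation loss are omitted.

```python
import torch
import torch.nn.functional as F


def progressive_shrinking(ofa_net, loader, optimizer, stages, epochs_per_stage=25):
    """Train the once-for-all network stage by stage, from large to small.

    `ofa_net` is assumed to be an elastic network whose forward pass accepts a
    sub-network configuration; `stages` is a list of growing sub-network spaces, e.g.
        [{"kernel": [7]},                                    # full network only
         {"kernel": [3, 5, 7]},                              # + elastic kernel size
         {"kernel": [3, 5, 7], "depth": [2, 3, 4]},          # + elastic depth
         {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [3, 4, 6]}]
    """
    teacher = None  # the full network serves as teacher after the first stage
    for space in stages:
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                config = ofa_net.sample_subnet(space)      # random sub-network per step
                logits = ofa_net(images, config)
                loss = F.cross_entropy(logits, labels)
                if teacher is not None:                    # distill from the full network
                    with torch.no_grad():
                        soft_targets = F.softmax(teacher(images), dim=1)
                    loss = loss + F.kl_div(F.log_softmax(logits, dim=1),
                                           soft_targets, reduction="batchmean")
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        if teacher is None:
            teacher = ofa_net.get_full_subnet().eval()     # snapshot of the largest net
    return ofa_net
```

Each stage enlarges the space of sub-networks sampled per update step, so smaller sub-networks always start from weights that were already trained as part of larger ones.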
We extensively evaluated the effectiveness of OFA on ImageNet with many hardware platforms
(CPU, GPU, mCPU, mGPU, FPGA accelerator) and efficiency constraints. Under all deployment
scenarios, OFA consistently improves the ImageNet accuracy by a significant margin compared to
SOTA hardware-aware NAS methods while reducing GPU hours, dollars, and CO2 emission by
orders of magnitude. On the ImageNet mobile setting (less than 600M MACs), OFA achieves a new
SOTA 80.0% top-1 accuracy with 595M MACs (Figure 2). To the best of our knowledge, this is the first time that SOTA ImageNet top-1 accuracy reaches 80% under the mobile setting.
¹ https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/