Figure 1: Left: a single once-for-all network is trained to support versatile architectural configurations
including depth, width, kernel size, and resolution. Given a deployment scenario, a specialized sub-
network is directly selected from the once-for-all network without training. Middle: this approach
reduces the cost of specialized deep learning deployment from O(N) to O(1). Right: once-for-all
network followed by model selection can derive many accuracy-latency trade-offs by training only
once, compared to conventional methods that require repeated training.
2018¹) and highly dynamic deployment environments (different battery conditions, different latency requirements, etc.).
This paper introduces a new solution to tackle this challenge – designing a once-for-all network that
can be directly deployed under diverse architectural configurations, amortizing the training cost. The
inference is performed by selecting only part of the once-for-all network. It flexibly supports different
depths, widths, kernel sizes, and resolutions without retraining. A simple example of Once-for-All
(OFA) is illustrated in Figure 1 (left). Specifically, we decouple the model training stage from the neural architecture search stage. In the model training stage, we focus on improving the accuracy of all sub-networks that are derived by selecting different parts of the once-for-all network. In the neural architecture search stage, we sample a subset of sub-networks to train an accuracy predictor and latency predictors. Given the target hardware and constraint, a predictor-guided architecture search (Liu et al., 2018) is conducted to obtain a specialized sub-network at negligible cost. As such,
we reduce the total cost of specialized neural network design from O(N) to O(1) (Figure 1 middle).
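To make the specialization step concrete, the sketch below shows how a sub-network could be selected with the two predictors. It is a minimal illustration under assumed interfaces: the search-space bounds, sample_subnet_config, acc_predictor, and lat_predictor are hypothetical stand-ins, and a plain random search replaces the predictor-guided search described above.

```python
import random

# Hypothetical search-space bounds for illustration only.
DEPTHS = [2, 3, 4]
KERNEL_SIZES = [3, 5, 7]
WIDTH_EXPANSIONS = [3, 4, 6]
RESOLUTIONS = list(range(128, 225, 4))
NUM_UNITS = 5


def sample_subnet_config():
    """Randomly sample one sub-network configuration from the once-for-all space."""
    return {
        "resolution": random.choice(RESOLUTIONS),
        "depths": [random.choice(DEPTHS) for _ in range(NUM_UNITS)],
        "kernel_sizes": [random.choice(KERNEL_SIZES) for _ in range(NUM_UNITS)],
        "widths": [random.choice(WIDTH_EXPANSIONS) for _ in range(NUM_UNITS)],
    }


def search_subnet(acc_predictor, lat_predictor, latency_budget_ms, num_samples=10000):
    """Return the sampled config with the best predicted accuracy under the budget.

    acc_predictor / lat_predictor are callables mapping a config dict to a
    predicted accuracy / latency; they stand in for the learned predictors.
    """
    best_cfg, best_acc = None, -1.0
    for _ in range(num_samples):
        cfg = sample_subnet_config()
        if lat_predictor(cfg) > latency_budget_ms:
            continue  # violates the hardware latency constraint
        acc = acc_predictor(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg
```

Because the predictors are cheap to evaluate, this selection step takes seconds on a CPU rather than GPU days, which is what allows the design cost to be amortized across deployment scenarios.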
However, training the once-for-all network is a non-trivial task, since it requires joint optimization
of the weights to maintain the accuracy of a large number of sub-networks (more than 10^19 in our
experiments). It is computationally prohibitive to enumerate all sub-networks to get the exact gradient in each update step, while randomly sampling a few sub-networks in each step leads to significant accuracy drops. The challenge is that different sub-networks interfere with each other, making
the training process of the whole once-for-all network inefficient. To address this challenge, we
propose a progressive shrinking algorithm for training the once-for-all network. Instead of directly
optimizing the once-for-all network from scratch, we propose to first train the largest neural network
with maximum depth, width, and kernel size, then progressively fine-tune the once-for-all network to
support smaller sub-networks that share weights with the larger ones. As such, smaller sub-networks start from a better initialization (the most important weights of the larger sub-networks) and can be distilled from the larger ones, which greatly improves the training efficiency. From this perspective,
progressive shrinking can be viewed as a generalized network pruning method that shrinks multiple
dimensions (depth, width, kernel size, and resolution) of the full network rather than only the width
dimension. Moreover, it aims to maintain the accuracy of all sub-networks rather than a single pruned network.
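The following is a minimal sketch of one way the progressive shrinking schedule could be implemented in PyTorch. It is illustrative only: the elastic ofa_net interface, its sample_subnet and get_full_subnet helpers, and the stage list are hypothetical placeholders rather than the released implementation, and details such as elastic resolution, learning-rate schedules, and the exact distillation loss are omitted.

```python
import torch
import torch.nn.functional as F


def progressive_shrinking(ofa_net, loader, optimizer, stages, epochs_per_stage=25):
    """Train the once-for-all network stage by stage, from large to small.

    `ofa_net` is assumed to be an elastic network whose forward pass accepts a
    sub-network configuration; `stages` is a list of growing sub-network spaces, e.g.
        [{"kernel": [7]},                                    # full network only
         {"kernel": [3, 5, 7]},                              # + elastic kernel size
         {"kernel": [3, 5, 7], "depth": [2, 3, 4]},          # + elastic depth
         {"kernel": [3, 5, 7], "depth": [2, 3, 4], "width": [3, 4, 6]}]
    """
    teacher = None  # the full network serves as teacher after the first stage
    for space in stages:
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                config = ofa_net.sample_subnet(space)      # random sub-network per step
                logits = ofa_net(images, config)
                loss = F.cross_entropy(logits, labels)
                if teacher is not None:                    # distill from the full network
                    with torch.no_grad():
                        soft_targets = F.softmax(teacher(images), dim=1)
                    loss = loss + F.kl_div(F.log_softmax(logits, dim=1),
                                           soft_targets, reduction="batchmean")
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        if teacher is None:
            teacher = ofa_net.get_full_subnet().eval()     # snapshot of the largest net
    return ofa_net
```

Each stage enlarges the space of sub-networks sampled per update step, so smaller sub-networks always start from weights that were already trained as part of larger ones.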
We extensively evaluated the effectiveness of OFA on ImageNet with many hardware platforms
(CPU, GPU, mCPU, mGPU, FPGA accelerator) and efficiency constraints. Under all deployment
scenarios, OFA consistently improves the ImageNet accuracy by a significant margin compared to
SOTA hardware-aware NAS methods while reducing GPU hours, dollars, and CO2 emission by
orders of magnitude. On the ImageNet mobile setting (less than 600M MACs), OFA achieves a new
SOTA 80.0% top-1 accuracy with 595M MACs (Figure 2). To the best of our knowledge, this is the first time that SOTA ImageNet top-1 accuracy reaches 80% under the mobile setting.
¹ https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/