DORY: Automatic End-to-End
Deployment of Real-World DNNs
on Low-Cost IoT MCUs
Alessio Burrello, Angelo Garofalo, Nazareno Bruschi,
Giuseppe Tagliavini, Member, IEEE, Davide Rossi, Francesco Conti, Member, IEEE
Abstract—The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical
enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory
and often replace caches with scratchpads to reduce area overheads and increase energy efficiency – requiring explicit DMA-based
memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive
topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) – an automatic
tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint
Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it
generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY
augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target
GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device,
DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the
state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a
1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps – 15.4× better than an STM32-H743. We release all our
developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
Index Terms—Deep Neural Networks, IoT, edge computing, DNN acceleration
This is a post peer-review accepted manuscript; published version available online at ieeexplore.ieee.org/document/9381618 (doi: 10.1109/TC.2021.3066883).
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works.
1 INTRODUCTION
The Internet of Things (IoT) envisions billions of
wireless-connected end-nodes [1], which can sense, process,
and transmit data for a wide range of applications such
as surveillance [2], health monitoring [3], agriculture [4],
robotics [5], and others. However, this new computation
paradigm faces major challenges, including reliability,
security, and capacity, together with the production of
high-bandwidth data. In this scenario, edge-based Deep
Learning (DL) is an attractive approach thanks to its ca-
pability to extract high-level features from raw sensor data,
reducing off-node transmissions and improving security by
performing most of the processing in place.
Modern Deep Neural Network (DNN) inference tasks
run on cloud servers, personal computers, or smartphones.
Even in the most constrained scenario of mobile devices,
their execution can rely on gigabytes of memory and
significant processing power, within a power envelope of a
few watts. Conversely, DNNs deployed on a
microcontroller-based IoT end-node must deliver similar
performance while coping with i) strict constraints in terms
of memory (a few MB off-chip, and typically 1 MB on-
chip at most), ii) limited computational capabilities, and iii)
battery constraints and a peak power envelope of 100-200
mW.
• A. Burrello, A. Garofalo, N. Bruschi, D. Rossi and F. Conti are with
the Department of Electrical, Electronic and Information Engineering,
University of Bologna, 40136 Bologna, Italy.
G. Tagliavini is with the Department of Computer Science and
Engineering, University of Bologna, 40136 Bologna, Italy.
• This work was supported in part by the EU Horizon 2020 Research and
Innovation projects OPRECOMP (Open trans-PREcision COMPuting,
g.a. no. 732631) and WiPLASH (Wireless Plasticity for Heterogeneous
Massive Computer Architectures, g.a. no. 863337) and by the ECSEL
Horizon 2020 project AI4DI (Artificial Intelligence for Digital Industry,
g.a. no. 826060).
The deployment of DL-based algorithms on IoT end-nodes
demands aggressive hardware, software, and algorithmic
co-optimization to exploit the scarce resources on these
systems to the maximum degree [6]. In particular, the limited
availability of memory constitutes a real Deep Learning
Memory Wall [7]: a fundamental limitation to the maximum
performance of an embedded DNN compute system.
Recently introduced algorithmic improvements such as
quantized DNN inference [8] aim at matching a DNN’s full-
precision accuracy while using exclusively 8-bit (or smaller)
integer data to reduce memory occupation and execution
complexity. On the hardware side, accelerators [9], [10], [11]
and instruction set architecture (ISA) extensions [12] that
exploit quantization have been introduced to speed up the
computation, lessen the impact of memory constraints and
minimize energy consumption. As a result, 8-bit networks are
now supported by most mainstream frameworks, such as
TensorFlow and PyTorch; a minimal illustrative sketch of this
integer representation follows this paragraph. Recently proposed architectural
paradigms aim at maximizing DNN performance and effi-
ciency on IoT end-nodes while safeguarding the flexibility
of typical Microcontroller Units (MCUs), so that common
control-oriented MCU tasks can be mixed with DNNs and
non-DL-based data processing tasks. These architectures
often couple a conventional MCU with an accelerator [13],
[14]. Parallel Ultra-Low-Power computing (PULP), for exam-
ple, is an architectural paradigm based on flexible software-
oriented acceleration for DNNs and other data processing
tasks in multi-core end-nodes. The core idea of PULP is to
couple an I/O-dedicated core with a multi-core cluster of
processors optimized for data-parallel processing, sharing a
high-bandwidth multi-banked L1 memory [15].
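The following minimal C sketch illustrates the 8-bit integer representation mentioned above: real values are mapped to int8 through a scale factor, cutting the memory footprint by 4× with respect to fp32 while approximately preserving the original values. The symmetric per-tensor scheme, the function names, and the example scale are illustrative assumptions and do not reflect the exact quantization rules used by DORY or by any specific framework.

/* Minimal, self-contained sketch of 8-bit symmetric quantization.
 * Hypothetical helper names; not DORY's actual quantization rules. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Map a float buffer onto the int8 grid defined by a per-tensor scale. */
static void quantize_int8(const float *x, int8_t *q, int n, float scale)
{
    for (int i = 0; i < n; ++i) {
        float v = roundf(x[i] / scale);   /* nearest integer level   */
        if (v > 127.f)  v = 127.f;        /* clamp to the int8 range */
        if (v < -128.f) v = -128.f;
        q[i] = (int8_t)v;
    }
}

/* Recover an approximation of the original values from the int8 buffer. */
static void dequantize_int8(const int8_t *q, float *x, int n, float scale)
{
    for (int i = 0; i < n; ++i)
        x[i] = scale * (float)q[i];
}

int main(void)
{
    const float in[4] = { 0.10f, -0.73f, 1.20f, -1.27f };
    const float scale = 1.27f / 127.f;    /* covers roughly [-1.27, 1.27] */
    int8_t q[4];
    float  out[4];

    quantize_int8(in, q, 4, scale);       /* 4 bytes instead of 16 */
    dequantize_int8(q, out, 4, scale);
    for (int i = 0; i < 4; ++i)
        printf("%+.2f -> %4d -> %+.2f\n", in[i], q[i], out[i]);
    return 0;
}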
Accelerated IoT end-nodes employ multi-level hierar-
chies of on- and off-chip memories. In some cases, they do
away entirely with energy-expensive coherent data caches,
exploiting manually managed scratchpad memories instead
to maximize area and energy efficiency. For example, PULP
architectures complement a small (< 128 kB) L1 with a
bigger-but-slower (∼1 GB/s) on-chip L2 memory and with an
off-chip L3 low-power IoT DRAM [16] that provides