DORY: Automatic End-to-End
Deployment of Real-World DNNs
on Low-Cost IoT MCUs
Alessio Burrello, Angelo Garofalo, Nazareno Bruschi,
Giuseppe Tagliavini, Member, IEEE, Davide Rossi, Francesco Conti, Member, IEEE
Abstract—The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical
enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory
and often replace caches with scratchpads to reduce area overheads and increase energy efficiency – requiring explicit DMA-based
memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive
topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) – an automatic
tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint
Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it
generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY
augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target
GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device,
DORY achieves up to 2.5× better MAC/cycle than the GreenWaves proprietary software solution and 18.1× better than the
state-of-the-art result on an STM32-H743 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a
1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps – 15.4× better than an STM32-H743. We release all our
developments – the DORY framework, the optimized backend kernels, and the related heuristics – as open-source software.
Index Terms—Deep Neural Networks, IoT, edge computing, DNN acceleration
This is a post peer-review accepted manuscript; published version available online at ieeexplore.ieee.org/document/9381618 (doi: 10.1109/TC.2021.3066883).
©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works.
1 INTRODUCTION
The Internet of Things (IoT) envisions billions of
wireless-connected end-nodes [1], which can sense, process,
and transmit data for a wide range of applications such
as surveillance [2], health monitoring [3], agriculture [4],
robotics [5], and others. However, this new computation
paradigm faces major challenges, including reliability,
security, and capacity, together with the production of
high-bandwidth data. In this scenario, edge-based Deep
Learning (DL) is an attractive approach thanks to its ca-
pability to extract high-level features from raw sensor data,
reducing off-node transmissions and improving security by
performing most of the processing in place.
Modern Deep Neural Network (DNN) inference tasks
run on cloud servers, personal computers, or smartphones.
Even in the most constrained scenario of mobile devices,
their execution can rely on gigabytes of memory and
significant processing power, within a power envelope of a
few watts. Conversely, DNNs deployed on a
microcontroller-based IoT end-node must deliver similar
performance while coping with i) strict constraints in terms
of memory (a few MB off-chip, and typically 1 MB on-
chip at most), ii) limited computational capabilities, and iii)
battery constraints and a peak power envelope of 100-200
mW.
• A. Burrello, A. Garofalo, N. Bruschi, D. Rossi and F. Conti are with
the Department of Electrical, Electronic and Information Engineering,
University of Bologna, 40136 Bologna, Italy.
G. Tagliavini is with the Department of Computer Science and
Engineering, University of Bologna, 40136 Bologna, Italy.
• This work was supported in part by the EU Horizon 2020 Research and
Innovation projects OPRECOMP (Open trans-PREcision COMPuting,
g.a. no. 732631) and WiPLASH (Wireless Plasticity for Heterogeneous
Massive Computer Architectures, g.a. no. 863337) and by the ECSEL
Horizon 2020 project AI4DI (Artificial Intelligence for Digital Industry,
g.a. no. 826060).
The deployment of DL-based algorithms on IoT end-nodes
demands aggressive hardware, software, and algorithmic
co-optimization to exploit the scarce resources on these
systems to the maximum degree [6]. In particular, the limited
availability of memory constitutes a real Deep Learning
Memory Wall [7]: a fundamental limitation to the maximum
performance of an embedded DNN compute system.
Recently introduced algorithmic improvements such as
quantized DNN inference [8] aim at matching a DNN’s full-
precision accuracy while using exclusively 8-bit (or smaller)
integer data to reduce memory occupation and execution
complexity. On the hardware side, accelerators [9], [10], [11]
and instruction set architecture (ISA) extensions [12] that
exploit quantization have been introduced to speed up the
computation, lessen the impact of memory constraints and
minimize energy consumption. As a result, 8-bit networks are
now supported by most mainstream frameworks, such as
TensorFlow and PyTorch; a minimal illustrative sketch of this
integer representation follows this paragraph. Recently proposed architectural
paradigms aim at maximizing DNN performance and effi-
ciency on IoT end-nodes while safeguarding the flexibility
of typical Microcontroller Units (MCUs), so that common
control-oriented MCU tasks can be mixed with DNNs and
non-DL-based data processing tasks. These architectures
often couple a conventional MCU with an accelerator [13],
[14]. Parallel Ultra-Low-Power computing (PULP), for exam-
ple, is an architectural paradigm based on flexible software-
oriented acceleration for DNNs and other data processing
tasks in multi-core end-nodes. The core idea of PULP is to
couple an I/O-dedicated core with a multi-core cluster of
processors optimized for data-parallel processing, sharing a
high-bandwidth multi-banked L1 memory [15].
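The following minimal C sketch illustrates the 8-bit integer representation mentioned above: real values are mapped to int8 through a scale factor, cutting the memory footprint by 4× with respect to fp32 while approximately preserving the original values. The symmetric per-tensor scheme, the function names, and the example scale are illustrative assumptions and do not reflect the exact quantization rules used by DORY or by any specific framework.

/* Minimal, self-contained sketch of 8-bit symmetric quantization.
 * Hypothetical helper names; not DORY's actual quantization rules. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Map a float buffer onto the int8 grid defined by a per-tensor scale. */
static void quantize_int8(const float *x, int8_t *q, int n, float scale)
{
    for (int i = 0; i < n; ++i) {
        float v = roundf(x[i] / scale);   /* nearest integer level   */
        if (v > 127.f)  v = 127.f;        /* clamp to the int8 range */
        if (v < -128.f) v = -128.f;
        q[i] = (int8_t)v;
    }
}

/* Recover an approximation of the original values from the int8 buffer. */
static void dequantize_int8(const int8_t *q, float *x, int n, float scale)
{
    for (int i = 0; i < n; ++i)
        x[i] = scale * (float)q[i];
}

int main(void)
{
    const float in[4] = { 0.10f, -0.73f, 1.20f, -1.27f };
    const float scale = 1.27f / 127.f;    /* covers roughly [-1.27, 1.27] */
    int8_t q[4];
    float  out[4];

    quantize_int8(in, q, 4, scale);       /* 4 bytes instead of 16 */
    dequantize_int8(q, out, 4, scale);
    for (int i = 0; i < 4; ++i)
        printf("%+.2f -> %4d -> %+.2f\n", in[i], q[i], out[i]);
    return 0;
}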
Accelerated IoT end-nodes employ multi-level hierar-
chies of on- and off-chip memories. In some cases, they do
away entirely with energy-expensive coherent data caches,
exploiting manually managed scratchpad memories instead
to maximize area and energy efficiency. For example, PULP
architectures complement a small (< 128 kB) L1 with a
bigger-but-slower (∼1 GB/s) on-chip L2 memory and with an
off-chip L3 low-power IoT DRAM [16] that provides