An Energy-efficient Speech Classification Convolution Neural
Network Accelerator Based on FPGA and Quantization
Abstract
Deep convolutional neural networks (CNNs) have been shown to hold unique advantages over recurrent
neural networks (RNNs) in acoustic tasks. However, activation data in convolutional neural networks is
usually represented in floating-point format, which is both time-consuming and power-consuming to
compute. Quantization can convert activations to fixed-point format, replacing floating-point arithmetic
with faster and more energy-efficient fixed-point arithmetic. Based on this method, this article presents a
design-space search method to quantize a binary weight neural network with minimal accuracy loss. We
also design a dedicated accelerator on an FPGA platform that is high-throughput and energy-efficient
compared with CPU- or RNN-based accelerators.
Keywords: energy-efficient; reconfigurable computing; FPGA; quantization; sound classification
1. Introduction
Sound classification is a typical information
analysis task widely used in military applications
and speech control. Since Deng, Yu et al.
introduced RNN and LSTM (Long Short-Term
Memory) acoustic models into speech recognition
and sound classification, LSTM has achieved a
series of excellent results in this area [1][2].
However, deep neural networks based on RNNs
are hard to train and parallelize due to their
complex structure and recurrent computation.
When applied to real tasks, RNN models usually
demand high-performance GPUs and CPUs to
satisfy their computing power needs. Such
hardware platforms consume up to hundreds of
watts, which cannot meet the requirements of
energy-sensitive environments. In contrast,
CNNs have been found to achieve excellent
performance as acoustic models [3][4][5]. Audio
files can be transformed into feature maps or
feature matrices by filtering algorithms (such as
the Mel Frequency Cepstral Coefficients
algorithm) [6]; acoustic CNN models then run on
these maps just as computer vision models run on
input images. With small 3x3 or 5x5 convolution
kernels, CNNs can be trained and run forward
faster than RNNs, because convolution is easier
to parallelize and accelerate than recurrent
computation. This advantage makes it possible to
accelerate an acoustic CNN model on dedicated
power-efficient hardware platforms.
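As context for the MFCC step mentioned above, the following is a minimal NumPy sketch of the classic MFCC pipeline (framing, power spectrum, mel filterbank, DCT-II). The frame length, hop size, and filterbank construction here are illustrative assumptions, not the exact settings of [6] or of the model in Section 3; the output is cropped to the 20x49 shape used later in this paper.

```python
import numpy as np

def mfcc_feature_map(signal, sr=16000, n_fft=400, hop=160,
                     n_mels=20, n_frames=49):
    # Frame the signal and apply a Hann window.
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # power spectrum

    # Triangular mel filterbank (simplified construction).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)      # (frames, n_mels)

    # DCT-II over the mel axis yields the cepstral coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n_mels))
    mfcc = (logmel @ dct.T).T                    # (n_mels, frames)
    return mfcc[:, :n_frames]                    # crop to a 20x49 map
```

One second of 16 kHz audio yields 98 frames with these settings, which is then cropped to the 49-frame map the model consumes.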
Gradient descent [7], which is sensitive to
numerical fluctuation [8], is widely used to train
deep neural networks (DNNs). To pursue the best
DNN performance, it is necessary to store data in
full-precision format during training. However,
floating-point data requires a longer word length
to store and more circuitry to compute, leading to
higher energy consumption and larger circuit area
[9]. The computational complexity of floating-
point arithmetic also makes it difficult to reduce
cycle counts. Floating-point computation,
although it preserves precision well, has become
the bottleneck of power-efficient high-
performance computing.
Fortunately, some works [10][11] have
shown that floating-point data is unnecessary for
CNN inference; low-precision computing can
achieve similar performance. These works
provide quantization methods that turn weights
and activations into fixed-point, integer, or even
binary data with little accuracy loss. Building on
such quantized CNN models, BNN (Binary
Neural Network) accelerators [12][13][14] and
GPUs supporting 8-bit integer data have emerged.
These designs greatly reduce power consumption
and achieve speedups of up to hundreds of times
over CPU platforms. It turns out that hardware
paired with correspondingly quantized CNN
models can deliver excellent computing
performance as well as high energy efficiency.
Compared with CPUs and GPUs, ASICs and
FPGAs are more suitable for accelerating a
specific task because they can be customized for
it. These platforms can be tailored by arranging
pipelines and expanding parallelism, lowering
power consumption while raising computing
performance. Although ASICs hold large
advantages over FPGAs in power and speed,
expensive design and manufacturing costs limit
their general application. In contrast, FPGAs
keep a good balance among performance, power,
flexibility, and expense thanks to their
programmability and mature industrial design
flow. FPGAs are now widely used in cloud
computing and intelligent computing by
Microsoft, Amazon, and Alibaba [15][16],
becoming an important part of high-performance
computing.
The sound classification model, which
focuses on specific speech instructions or
acoustic signals, is a basic component of
intelligent scenario analysis in both the cloud and
at the edge. Such applications especially need a
low-power but high-performance computing
platform. Typical deep convolutional neural
networks handle coarse sound classification well.
However, there is still room to accelerate CNN
models and reduce the computing platform's
energy consumption through quantization and
customized hardware design. To implement such
a power-efficient sound classification platform,
we choose a typical CNN-based speech
classification model whose weights are +1 or -1
and whose activation data is in full-precision
floating-point format [17]. We design an
accelerator based on the Xilinx XCKU-115
FPGA platform and run this BWN (Binary
Weight Network) model on it. Compared with a
state-of-the-art CPU platform, our accelerator
achieves an 18-300x throughput speedup and
high energy efficiency. The main contributions of
this work are as follows:
1. We convert float-type feature data into
fixed-point data at each layer through a
design space exploration method. The
model's accuracy loss is compensated by the
performance of fixed-point computing on
FPGA platforms.
2. We design a multi-PE BWN accelerator
on FPGA with shared weight storage, a
balanced pipeline structure, and low-delay
pipelining between the CNN's layers. The
performance, power consumption, and
energy efficiency of this accelerator are also
discussed.
3. The target speech classification model is
tested in single-thread, multi-thread, and
multi-node environments to establish a
sufficient performance baseline. Compared
with these test results, our design has a clear
advantage in performance per watt and
throughput.
2. Neural Network Forwarding Quantization
When training a deep neural network,
researchers usually choose a full-precision data
format to ensure the best model accuracy. In
inference, however, these parameters no longer
change, so we can compress them offline. [17]
proposes an algorithm to compress floating-point
DNN parameters into binary data consisting of
+1 and -1. Compared with a common DNN with
floating-point weights and activations, this
compression not only sharply reduces parameter
storage, but also replaces multiplication and
division with addition and subtraction. Less
storage and fewer multiplications mean less
memory-access energy and fewer computing
cycles, leading to lower power consumption and
faster operation.
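To make the multiply-to-add/subtract substitution concrete, here is a minimal NumPy sketch (our illustration, not the exact algorithm of [17]): once every weight collapses to +1 or -1, a dot product reduces to summing the inputs at +1 positions and subtracting those at -1 positions.

```python
import numpy as np

def binarize_weights(w):
    """Deterministic binarization: every weight becomes +1 or -1,
    so each weight fits in a single bit of storage."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

def binary_dot(x, wb):
    """Dot product against binary weights wb in {+1, -1}:
    no multiplications, only additions and subtractions."""
    return x[wb == 1].sum() - x[wb == -1].sum()
```

For example, `binarize_weights([0.3, -1.2, 0.05, -0.4])` yields `[1, -1, 1, -1]`, and `binary_dot` over it agrees with the ordinary multiply-accumulate dot product.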
[18] introduces a method to turn activations
into binary format. Unlike the parameters of a
neural network, activation data fluctuates
numerically with different inputs (such as an
input image or input audio feature map).
Although [18] still maintains good model
accuracy on very deep CNNs like VGGNet,
binary activation data introduces a large loss of
numerical precision, which can seriously affect
some small CNNs [19][20]. In this situation,
turning floating-point data into fixed-point
format keeps a good balance between computing
performance and model accuracy: fixed-point
computation needs fewer computing cycles than
floating-point, and fixed-point formats can adapt
to the data's numerical distribution through
flexible allocation of integer and fractional bits.
As Fig 1 shows, when the integer part is
allocated more bits, it can represent larger values;
when the fractional part is given a longer word
length, numerical precision improves
correspondingly. However, increasing the length
of the fixed-point format adds cycles to the
computation or exceeds the hardware's limits, so,
taking both speed and accuracy into account, it is
important to find the best practical data format.
[Figure: two fixed-point encodings of true values, annotated with integer and decimal parts of the binary representation. With a short fractional field, a true value near 10.06 is stored as the fixed-point value 10.0625, giving an accuracy loss of 0.025; with extra fractional bits, a true value near 20.06 is stored as 20.060546875, reducing the accuracy loss to 0.000546875.]
Fig.1 How Bitwise Influences Numerical Precision
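The trade-off of Fig 1 can be sketched in a few lines of Python. The bit split and the round-to-nearest rule below are our illustrative assumptions (a real design might truncate instead); the point is only that more fractional bits shrink the quantization error, while the integer bits bound the representable magnitude.

```python
def to_fixed_point(x, int_bits, frac_bits):
    """Quantize x to an unsigned fixed-point grid with int_bits
    integer bits and frac_bits fractional bits (sign bit omitted
    for brevity), rounding to the nearest representable value."""
    scale = 2.0 ** frac_bits
    max_val = 2.0 ** int_bits - 1.0 / scale  # largest representable value
    q = round(x * scale) / scale             # nearest grid point
    return min(q, max_val)                   # saturate on overflow
```

With 4 fractional bits, 20.06 quantizes to 20.0625 (error 0.0025); with 10 fractional bits the error drops below 0.0005, mirroring the contrast in Fig 1.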
3. Speech Classification Model
3.1 Model Architecture and Weight
Binarization
This CNN-based speech recognition model
is trained on the TensorFlow speech commands
dataset. It recognizes six sorts of short speech
segments: "up", "down", "yes", "right", "left",
and "unknown words". The model first uses the
MFCC algorithm to turn an audio file into a
float-type tensor of dimension 20x49x1. This
tensor is then fed into a convolutional neural
network consisting of two convolution layers,
three fully connected layers, and binary weight
parameters. The detailed model architecture is
shown in Fig 2. All convolution kernels are 3x3
and the convolution stride is 1. There is no
padding or dilation in this network, which makes
it convenient for us to accelerate. Note that
activations are still in float format at this stage.
Through a softmax function, the model outputs
the probability of each of the six labels.
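A shape-level NumPy sketch of such a network follows. The channel counts and fully connected widths are hypothetical placeholders (Fig 2 holds the actual configuration), and the binary weights are random rather than trained; the sketch only demonstrates the data flow: two valid 3x3 convolutions shrink the 20x49 map, then three fully connected layers and a softmax produce six class probabilities.

```python
import numpy as np

def conv2d_valid(x, w):
    """'Valid' 2-D convolution: no padding, stride 1.
    x: (in_c, H, W), w: (out_c, in_c, k, k) -> (out_c, H-k+1, W-k+1)."""
    out_c, _, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((out_c, h, wd))
    for o in range(out_c):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def bwn_forward(x, seed=0):
    """Forward pass with random binary weights in {+1, -1};
    layer widths here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    sign = lambda shape: np.where(rng.standard_normal(shape) >= 0, 1.0, -1.0)
    h = np.maximum(conv2d_valid(x, sign((8, 1, 3, 3))), 0)    # -> 8 x 18 x 47
    h = np.maximum(conv2d_valid(h, sign((16, 8, 3, 3))), 0)   # -> 16 x 16 x 45
    a = h.reshape(-1)                                         # flatten
    for d_out in (128, 64, 6):                                # three FC layers
        a = sign((d_out, a.size)) @ a
        if d_out != 6:
            a = np.maximum(a, 0)
    e = np.exp(a - a.max())                                   # stable softmax
    return e / e.sum()
```

Running `bwn_forward` on a random 1x20x49 input returns a length-6 probability vector that sums to one, matching the model's six output labels.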
After the model parameters are fixed by the
training process, we can convert the float weight
values into {-1, +1}. Fig 3 shows how we process
the weight data. We assume the primitive
parameters follow a normal distribution; the
numerical range is then reshaped by a tanh
function and a series of scaling steps. Finally, all
parameters are discretized to -1 or +1.
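The described pipeline can be sketched as follows. The particular tanh and scaling steps are our reading of the text; note that tanh and a positive rescaling both preserve sign, so the final discretization coincides with taking the sign of each weight, and the intermediate squash/scale steps matter mainly for how the binarized model is trained.

```python
import numpy as np

def binarize_via_tanh(w):
    """Squash weights with tanh, rescale into [-1, 1] by the maximum
    magnitude (one plausible scaling choice), then discretize to +1/-1."""
    t = np.tanh(w)                     # squash the normal-like distribution
    t = t / np.max(np.abs(t))          # scale the range to [-1, 1]
    return np.where(t >= 0, 1.0, -1.0) # discretize
```

For example, weights `[0.7, -0.1, 2.3, -1.5]` discretize to `[+1, -1, +1, -1]`.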
This stage’s BWN model (activation data is still