Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
Matthieu Courbariaux*1    MATTHIEU.COURBARIAUX@GMAIL.COM
Itay Hubara*2             ITAYHUBARA@GMAIL.COM
Daniel Soudry3            DANIEL.SOUDRY@GMAIL.COM
Ran El-Yaniv2             RANI@CS.TECHNION.AC.IL
Yoshua Bengio1,4          YOSHUA.UMONTREAL@GMAIL.COM

1 Université de Montréal
2 Technion - Israel Institute of Technology
3 Columbia University
4 CIFAR Senior Fellow

*Indicates equal contribution. Ordering determined by coin flip.
Abstract
We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.
Introduction
Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016), and even abstract art (Mordvintsev et al., 2015).
Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphic Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015) and specialized computer hardware (Farabet et al., 2011a;b; Pham et al., 2012; Chen et al., 2014a;b; Esser et al., 2015).
This paper makes the following contributions:
• We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time and when computing the parameter gradients at train-time (see Section 1).
• We conduct two sets of experiments, each implemented on a different framework, namely Torch7 (Collobert et al., 2011) and Theano (Bergstra et al., 2010; Bastien et al., 2012), which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve nearly state-of-the-art results (see Section 2).
• We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce memory consumption (size and number of accesses) and replace most arithmetic operations with bit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.
• Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).
• The code for training and running our BNNs is available on-line, in both the Theano framework (https://github.com/MatthieuCourbariaux/BinaryNet) and the Torch framework (https://github.com/itayhubara/BinaryNet).
1. Binarized Neural Networks
In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through it.
1.1. Deterministic vs Stochastic Binarization
When training a BNN, we constrain both the weights and the activations to either +1 or −1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:
$$x^b = \operatorname{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \tag{1}$$
where $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable. It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:
$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{2}$$
where $\sigma$ is the “hard sigmoid” function:
$$\sigma(x) = \operatorname{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right) = \max\!\left(0, \min\!\left(1, \frac{x+1}{2}\right)\right). \tag{3}$$
The stochastic binarization is more appealing than the sign function, but it is harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.
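For concreteness, the following is a minimal NumPy sketch of the two binarization functions of Eqs. (1)-(3). The function names and the use of NumPy are our own choices for illustration; they are not taken from the paper's released code.

```python
import numpy as np

def binarize_deterministic(x):
    """Deterministic binarization, Eq. (1): Sign(x), with Sign(0) = +1."""
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    """Hard sigmoid, Eq. (3): clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(x, rng=None):
    """Stochastic binarization, Eq. (2): +1 with probability hard_sigmoid(x)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.where(rng.random(np.shape(x)) < hard_sigmoid(x), 1.0, -1.0)
```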
1.2. Gradient Computation and Accumulation
Although our BNN training method uses binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.
Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which, instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.
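To illustrate this accumulation scheme, below is a minimal NumPy sketch of a single linear layer trained with binary weights under plain SGD. The names are ours, most of Algorithm 1 is omitted (e.g. activation binarization and batch normalization), and the clipping of the real-valued weights to [-1, 1] is our reading of the paper's Algorithm 1 rather than something stated in the excerpt above.

```python
import numpy as np

def binarize(x):
    """Deterministic binarization, Eq. (1)."""
    return np.where(x >= 0, 1.0, -1.0)

def linear_forward(x, w_real):
    """The forward pass uses the binarized weights."""
    w_bin = binarize(w_real)
    return x @ w_bin, w_bin

def linear_backward(x, w_bin, grad_out):
    """The backward pass also uses the binarized weights, but the
    resulting weight gradient is real-valued."""
    grad_w = x.T @ grad_out      # real-valued gradient of the weights
    grad_x = grad_out @ w_bin.T  # gradient propagated through the binary weights
    return grad_w, grad_x

def sgd_update(w_real, grad_w, lr=0.01):
    """The real-valued gradient is accumulated in the real-valued weights,
    which are then clipped to [-1, 1] (our reading of Algorithm 1)."""
    return np.clip(w_real - lr * grad_w, -1.0, 1.0)
```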
1.3. Propagating Gradients Through Discretization
The derivative of the sign function is zero almost everywhere, making it apparently incompatible with backpropagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons. They found in their experiments that the fastest training was obtained when using the “straight-through estimator,” previously introduced in Hinton's lectures (Hinton, 2012).
We follow a similar approach, but we use the version of the straight-through estimator that takes into account the saturation effect and uses deterministic rather than stochastic sampling of the bit. Consider the sign function quantization
$$q = \operatorname{Sign}(r),$$
and assume that an estimator $g_q$ of the gradient $\partial C / \partial q$ has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of $\partial C / \partial r$ is simply
$$g_r = g_q \, 1_{|r| \le 1}. \tag{4}$$
Note that this preserves the gradient's information and cancels the gradient when $r$ is too large. Not cancelling the gradient when $r$ is too large significantly worsens performance. The use of this straight-through estimator is illustrated in Algorithm 1. The derivative $1_{|r| \le 1}$ can also be seen as propagating the gradient through the hard tanh, the piece-wise linear activation function $\operatorname{Htanh}(x) = \operatorname{Clip}(x, -1, 1) = \max(-1, \min(1, x))$.