Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1
Matthieu Courbariaux*1    MATTHIEU.COURBARIAUX@GMAIL.COM
Itay Hubara*2             ITAYHUBARA@GMAIL.COM
Daniel Soudry3            DANIEL.SOUDRY@GMAIL.COM
Ran El-Yaniv2             RANI@CS.TECHNION.AC.IL
Yoshua Bengio1,4          YOSHUA.UMONTREAL@GMAIL.COM

1 Université de Montréal
2 Technion - Israel Institute of Technology
3 Columbia University
4 CIFAR Senior Fellow

*Indicates equal contribution. Ordering determined by coin flip.
Abstract
We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.
Introduction
Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky et al., 2012; Szegedy et al., 2014), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), Atari and Go games (Mnih et al., 2015; Silver et al., 2016), and even abstract art (Mordvintsev et al., 2015).
Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphic Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015) and specialized computer hardware (Farabet et al., 2011a;b; Pham et al., 2012; Chen et al., 2014a;b; Esser et al., 2015).
This paper makes the following contributions:
• We introduce a method to train Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time and when computing the parameter gradients at train-time (see Section 1).
• We conduct two sets of experiments, each implemented on a different framework, namely Torch7 (Collobert et al., 2011) and Theano (Bergstra et al., 2010; Bastien et al., 2012), which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve nearly state-of-the-art results (see Section 2).
• We show that during the forward pass (both at run-time and train-time), BNNs drastically reduce memory consumption (size and number of accesses) and replace most arithmetic operations with bit-wise operations, which potentially leads to a substantial increase in power-efficiency (see Section 3). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could reduce the time complexity by 60%.
• Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 4).
• The code for training and running our BNNs is available on-line, in both the Theano framework (https://github.com/MatthieuCourbariaux/BinaryNet) and the Torch framework (https://github.com/itayhubara/BinaryNet).
1. Binarized Neural Networks
In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through it.
1.1. Deterministic vs Stochastic Binarization
When training a BNN, we constrain both the weights and the activations to either +1 or −1. Those two values are very advantageous from a hardware perspective, as we explain in Section 4. In order to transform the real-valued variables into those two values, we use two different binarization functions, as in (Courbariaux et al., 2015). Our first binarization function is deterministic:
$$x^b = \operatorname{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \tag{1}$$
where $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable. It is very straightforward to implement and works quite well in practice. Our second binarization function is stochastic:
$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{2}$$
where $\sigma$ is the “hard sigmoid” function:
$$\sigma(x) = \operatorname{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right) = \max\!\left(0, \min\!\left(1, \frac{x+1}{2}\right)\right). \tag{3}$$
The stochastic binarization is more appealing than the sign function, but it is harder to implement as it requires the hardware to generate random bits when quantizing. As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.
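For concreteness, the following is a minimal NumPy sketch of the two binarization functions of Eqs. (1)-(3). The function names and the use of NumPy are our own choices for illustration; they are not taken from the paper's released code.

```python
import numpy as np

def binarize_deterministic(x):
    """Deterministic binarization, Eq. (1): Sign(x), with Sign(0) = +1."""
    return np.where(x >= 0, 1.0, -1.0)

def hard_sigmoid(x):
    """Hard sigmoid, Eq. (3): clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(x, rng=None):
    """Stochastic binarization, Eq. (2): +1 with probability hard_sigmoid(x)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.where(rng.random(np.shape(x)) < hard_sigmoid(x), 1.0, -1.0)
```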
1.2. Gradient Computation and Accumulation
Although our BNN training method uses binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.
Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013). Our method of training BNNs can be seen as a variant of Dropout, in which, instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.
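To illustrate this accumulation scheme, below is a minimal NumPy sketch of a single linear layer trained with binary weights under plain SGD. The names are ours, most of Algorithm 1 is omitted (e.g. activation binarization and batch normalization), and the clipping of the real-valued weights to [-1, 1] is our reading of the paper's Algorithm 1 rather than something stated in the excerpt above.

```python
import numpy as np

def binarize(x):
    """Deterministic binarization, Eq. (1)."""
    return np.where(x >= 0, 1.0, -1.0)

def linear_forward(x, w_real):
    """The forward pass uses the binarized weights."""
    w_bin = binarize(w_real)
    return x @ w_bin, w_bin

def linear_backward(x, w_bin, grad_out):
    """The backward pass also uses the binarized weights, but the
    resulting weight gradient is real-valued."""
    grad_w = x.T @ grad_out      # real-valued gradient of the weights
    grad_x = grad_out @ w_bin.T  # gradient propagated through the binary weights
    return grad_w, grad_x

def sgd_update(w_real, grad_w, lr=0.01):
    """The real-valued gradient is accumulated in the real-valued weights,
    which are then clipped to [-1, 1] (our reading of Algorithm 1)."""
    return np.clip(w_real - lr * grad_w, -1.0, 1.0)
```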
1.3. Propagating Gradients Through Discretization
The derivative of the sign function is zero almost everywhere, making it apparently incompatible with backpropagation, since the exact gradient of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero. Note that this remains true even if stochastic quantization is used. Bengio (2013) studied the question of estimating or propagating gradients through stochastic discrete neurons. They found in their experiments that the fastest training was obtained when using the “straight-through estimator,” previously introduced in Hinton's lectures (Hinton, 2012).
We follow a similar approach, but we use the version of the straight-through estimator that takes into account the saturation effect and uses deterministic rather than stochastic sampling of the bit. Consider the sign function quantization
$$q = \operatorname{Sign}(r),$$
and assume that an estimator $g_q$ of the gradient $\partial C / \partial q$ has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of $\partial C / \partial r$ is simply
$$g_r = g_q \, 1_{|r| \le 1}. \tag{4}$$
Note that this preserves the gradient's information and cancels the gradient when $r$ is too large. Not cancelling the gradient when $r$ is too large significantly worsens performance. The use of this straight-through estimator is illustrated in Algorithm 1. The derivative $1_{|r| \le 1}$ can also be seen as propagating the gradient through the hard tanh, the piece-wise linear activation function $\operatorname{Htanh}(x) = \operatorname{Clip}(x, -1, 1) = \max(-1, \min(1, x))$.