Figure 1: Unrolling the reverse computation of an exactly reversible model on the repeat task yields a sequence-to-sequence computation. Left: The repeat task itself, where the model repeats each input token. Right: Unrolling the reversal. The model effectively uses the final hidden state to reconstruct all input tokens, implying that the entire input sequence must be stored in the final hidden state.
4 Impossibility of No Forgetting
We have shown that reversible RNNs in finite precision can be constructed by ensuring that no information is discarded. However, we were unable to find such an architecture that achieved acceptable performance on tasks such as language modeling³. This is consistent with prior work which found forgetting to be crucial to LSTM performance [23, 24]. In this section, we argue that this results from a fundamental
limitation of no-forgetting reversible models: if none of the hidden state can be forgotten, then the
hidden state at any given timestep must contain enough information to reconstruct all previous hidden
states. Thus, any information stored in the hidden state at one timestep must remain present at all
future timesteps to ensure exact reconstruction, overwhelming the storage capacity of the model.
We make this intuition concrete by considering an elementary sequence learning task, the repeat task.
In this task, an RNN is given a sequence of discrete tokens and must simply repeat each token at the
subsequent timestep. This task is trivially solvable by ordinary RNN models with only a handful of
hidden units, since it does not require modeling long-distance dependencies. But consider how an
exactly reversible model would perform the repeat task. Unrolling the reverse computation, as shown
in Figure 1, reveals a sequence-to-sequence computation in which the encoder and decoder weights
are tied. The encoder takes in the tokens and produces a final hidden state. The decoder uses this
final hidden state to produce the input sequence in reverse sequential order.
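To make the unrolled reverse computation concrete, the following is a minimal Python sketch of an idealized exactly reversible update over integer hidden states (our own illustration, not one of the architectures in this paper): each token is folded into the hidden state by a base-V encoding, so running the update backwards from the final hidden state recovers the inputs in reverse order, exactly as in Figure 1, and the hidden state must grow with the sequence length.

```python
# Idealized exactly reversible "RNN" over integer hidden states (illustration only).
# Each step folds one token into the hidden state without discarding any information,
# so the final hidden state alone suffices to reconstruct the whole input sequence.

VOCAB_SIZE = 3  # tokens are integers in {0, ..., VOCAB_SIZE - 1}

def forward(tokens, h0=0):
    """Forward pass: h_t = h_{t-1} * V + x_t (a base-V encoding of the sequence)."""
    h = h0
    for x in tokens:
        h = h * VOCAB_SIZE + x
    return h

def reverse(h, length):
    """Reverse pass: pop tokens back out of the final hidden state, newest first."""
    recovered = []
    for _ in range(length):
        recovered.append(h % VOCAB_SIZE)   # x_t = h_t mod V
        h = h // VOCAB_SIZE                # h_{t-1} = h_t div V
    return recovered                       # the input sequence in reverse order

tokens = [0, 2, 1, 1, 2]             # e.g. A=0, B=1, C=2
h_final = forward(tokens)
assert reverse(h_final, len(tokens)) == tokens[::-1]
# Note: h_final needs about len(tokens) * log2(VOCAB_SIZE) bits, i.e. the hidden
# state effectively stores the entire input sequence.
```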
Notice the relationship to another sequence learning task, the memorization task, used as part of
a curriculum learning strategy by Zaremba and Sutskever [25]. After an RNN observes an entire
sequence of input tokens, it is required to output the input sequence in reverse order. As shown in
Figure 1, the memorization task for an ordinary RNN reduces to the repeat task for an NF-RevRNN.
Hence, if the memorization task requires a hidden representation size that grows with the sequence
length, then so does the repeat task for NF-RevRNNs.
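To see how quickly the required hidden representation grows, here is a rough back-of-the-envelope calculation (our own illustration, using the idealized counting from footnote 4 in which each hidden unit stores 32 bits): memorizing T tokens drawn from a vocabulary of size V requires about T log2 V bits, hence at least T log2(V) / 32 hidden units.

```python
import math

def min_hidden_units(seq_len, vocab_size, bits_per_unit=32):
    """Lower bound on hidden units needed to memorize a sequence exactly,
    assuming each unit stores bits_per_unit bits (idealized counting)."""
    return math.ceil(seq_len * math.log2(vocab_size) / bits_per_unit)

# The required state grows linearly with sequence length.
for T in (16, 64, 256):
    print(T, min_hidden_units(T, vocab_size=10000))
# -> 7, 27, 107 units
```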
We confirmed experimentally that NF-RevGRU and NF-RevLSTM networks with limited capacity were unable to solve the repeat task⁴. Interestingly, the NF-RevGRU was able to memorize input
sequences using considerably fewer hidden units than the ordinary GRU or LSTM, suggesting it may
be a useful architecture for tasks requiring memorization. Consistent with the results on the repeat
task, the NF-RevGRU and NF-RevLSTM were unable to match the performance of even vanilla
RNNs on word-level language modeling on the Penn TreeBank dataset [14].
5 Reversibility with Forgetting
The impossibility of zero forgetting leads us to explore the second possibility for achieving reversibility: storing the information lost from the hidden state during the forward computation, then restoring it in the reverse computation. Initially, we investigated discrete forgetting, in which only an integral number of bits is allowed to be forgotten. This leads to a simple implementation: if n bits are forgotten in the forward pass, we can store these n bits on a stack, to be popped off and restored to the hidden state during reconstruction.
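The following is a minimal Python sketch of this stack discipline for a toy integer-valued hidden state (our own illustration; the actual algorithm and results for discrete forgetting are given in Appendix D): the forward step discards the n lowest-order bits of the hidden state and pushes them onto a stack, and the reverse step pops them back to reconstruct the previous hidden state exactly.

```python
# Toy illustration of discrete forgetting with a stack (integer hidden state).
# Forward: drop the n lowest-order bits of h and push them onto a stack.
# Reverse: pop those bits and reattach them, reconstructing h exactly.

def forget_step(h, n, stack):
    """Forget n bits of the hidden state, saving them for reconstruction."""
    lost_bits = h & ((1 << n) - 1)   # the n bits about to be discarded
    stack.append((lost_bits, n))     # store them on the stack
    return h >> n                    # hidden state after forgetting

def restore_step(h, stack):
    """Exactly invert forget_step using the saved bits."""
    lost_bits, n = stack.pop()
    return (h << n) | lost_bits

stack = []
h = 0b1011_0110
h1 = forget_step(h, 3, stack)        # forgets the low-order bits 110
assert restore_step(h1, stack) == h  # reconstruction is exact
```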
However, restricting our model to forget only an integral number of bits led to a substantial drop in performance compared to baseline models⁵. For the remainder of
³ We discuss our failed attempts in Appendix A.
⁴ We include full results and details in Appendix B. The argument presented applies to idealized RNNs able to implement any hidden-to-hidden transition and whose hidden units can store 32 bits each. We chose to use the LSTM and the NF-RevGRU as approximations to these idealized models since they performed best at their respective tasks.
⁵ Algorithmic details and experimental results for discrete forgetting are given in Appendix D.