Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2023. All
rights reserved. Draft of February 3, 2024.
CHAPTER 9
RNNs and LSTMs
Time will explain.
Jane Austen, Persuasion
Language is an inherently temporal phenomenon. Spoken language is a sequence of
acoustic events over time, and we comprehend and produce both spoken and written
language as a continuous input stream. The temporal nature of language is reflected
in the metaphors we use; we talk of the flow of conversations, news feeds, and twitter
streams, all of which emphasize that language is a sequence that unfolds in time.
This temporal nature is reflected in some language processing algorithms. For
example, the Viterbi algorithm we introduced for HMM part-of-speech tagging proceeds
through the input a word at a time, carrying forward information gleaned along the way.
Yet other machine learning approaches, like those we've studied for sentiment analysis
or other text classification tasks, don't have this temporal nature – they assume
simultaneous access to all aspects of their input.
The feedforward networks of Chapter 7 also assumed simultaneous access, although
they also had a simple model for time. Recall that we applied feedforward networks
to language modeling by having them look only at a fixed-size window of words, and
then sliding this window over the input, making independent predictions along the
way. This sliding-window approach is also used in the transformer architecture we
will introduce in Chapter 10.
This chapter introduces a deep learning architecture that offers an alternative
way of representing time: recurrent neural networks (RNNs), and their variants like
LSTMs. RNNs have a mechanism that deals directly with the sequential nature of
language, allowing them to handle the temporal nature of language without the use of
arbitrary fixed-sized windows. The recurrent network offers a new way to represent
the prior context, in its recurrent connections, allowing the model’s decision to
depend on information from hundreds of words in the past. We’ll see how to apply
the model to the task of language modeling, to sequence modeling tasks like part-
of-speech tagging, and to text classification tasks like sentiment analysis.
9.1 Recurrent Neural Networks
A recurrent neural network (RNN) is any network that contains a cycle within its
network connections, meaning that the value of some unit is directly, or indirectly,
dependent on its own earlier outputs as an input. While powerful, such networks
are difficult to reason about and to train. However, within the general class of
recurrent networks there are constrained architectures that have proven to be
extremely effective when applied to language. In this section, we consider a class
of recurrent networks referred to as Elman Networks (Elman, 1990) or simple
recurrent networks. These networks are useful in their own right and serve as the
basis for more complex approaches like the Long Short-Term Memory (LSTM) networks
discussed later in this chapter. In this chapter when we use the term RNN we'll be
referring to these simpler, more constrained networks (although you will often see
the term RNN used to mean any net with recurrent properties, including LSTMs).
Figure 9.1 Simple recurrent neural network after Elman (1990), with input x_t,
hidden layer h_t, and output y_t. The hidden layer includes a recurrent connection
as part of its input. That is, the activation value of the hidden layer depends on
the current input as well as the activation value of the hidden layer from the
previous time step.
Fig. 9.1 illustrates the structure of an RNN. As with ordinary feedforward networks,
an input vector representing the current input, x_t, is multiplied by a weight
matrix and then passed through a non-linear activation function to compute the
values for a layer of hidden units. This hidden layer is then used to calculate a
corresponding output, y_t. In a departure from our earlier window-based approach,
sequences are processed by presenting one item at a time to the network. We'll use
subscripts to represent time, thus x_t will mean the input vector x at time t. The
key difference from a feedforward network lies in the recurrent link shown in the
figure with the dashed line. This link augments the input to the computation at the
hidden layer with the value of the hidden layer from the preceding point in time.
The hidden layer from the previous time step provides a form of memory, or
context, that encodes earlier processing and informs the decisions to be made at
later points in time. Critically, this approach does not impose a fixed-length limit
on this prior context; the context embodied in the previous hidden layer can include
information extending back to the beginning of the sequence.
Adding this temporal dimension makes RNNs appear to be more complex than
non-recurrent architectures. But in reality, they’re not all that different. Given an
input vector and the values for the hidden layer from the previous time step, we’re
still performing the standard feedforward calculation introduced in Chapter 7. To
see this, consider Fig. 9.2 which clarifies the nature of the recurrence and how it
factors into the computation at the hidden layer. The most significant change lies in
the new set of weights, U, that connect the hidden layer from the previous time step
to the current hidden layer. These weights determine how the network makes use of
past context in calculating the output for the current input. As with the other weights
in the network, these connections are trained via backpropagation.
9.1.1 Inference in RNNs
Forward inference (mapping a sequence of inputs to a sequence of outputs) in an
RNN is nearly identical to what we've already seen with feedforward networks. To
compute an output y_t for an input x_t, we need the activation value for the hidden
layer h_t. To calculate this, we multiply the input x_t with the weight matrix W,
and the hidden layer from the previous time step h_{t−1} with the weight matrix U.
We add these values together and pass them through a suitable activation function,
g, to arrive at the activation value for the current hidden layer, h_t.
Figure 9.2 Simple recurrent neural network illustrated as a feedforward network,
with input x_t, previous hidden state h_{t−1}, hidden layer h_t, output y_t, and
weight matrices W, U, and V.
Once we have the values for the hidden layer, we proceed with the usual computation
to generate the output vector.
h_t = g(U h_{t−1} + W x_t)        (9.1)
y_t = f(V h_t)                    (9.2)
It's worthwhile here to be careful about specifying the dimensions of the input,
hidden and output layers, as well as the weight matrices, to make sure these
calculations are correct. Let's refer to the input, hidden and output layer
dimensions as d_in, d_h, and d_out respectively. Given this, our three parameter
matrices are: W ∈ R^{d_h × d_in}, U ∈ R^{d_h × d_h}, and V ∈ R^{d_out × d_h}.
In the commonly encountered case of soft classification, computing y_t consists
of a softmax computation that provides a probability distribution over the possible
output classes.

y_t = softmax(V h_t)              (9.3)
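To make the dimensions concrete, here is a minimal NumPy sketch of a single RNN step
implementing Eqs. 9.1 and 9.3; the layer sizes and the helper names (softmax,
rnn_step) are illustrative choices, not anything defined in the text.

import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, V, g=np.tanh):
    # Eq. 9.1: combine the current input and the previous hidden state.
    h_t = g(U @ h_prev + W @ x_t)
    # Eq. 9.3: turn the output layer into a distribution over classes.
    y_t = softmax(V @ h_t)
    return h_t, y_t

# Illustrative dimensions (not from the text): d_in = 4, d_h = 3, d_out = 5.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 5
W = rng.normal(size=(d_h, d_in))    # W ∈ R^{d_h × d_in}
U = rng.normal(size=(d_h, d_h))     # U ∈ R^{d_h × d_h}
V = rng.normal(size=(d_out, d_h))   # V ∈ R^{d_out × d_h}

x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
h_t, y_t = rnn_step(x_t, h_prev, W, U, V)
assert h_t.shape == (d_h,) and y_t.shape == (d_out,)
assert np.isclose(y_t.sum(), 1.0)   # softmax output sums to 1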
The fact that the computation at time t requires the value of the hidden layer from
time t −1 mandates an incremental inference algorithm that proceeds from the start
of the sequence to the end as illustrated in Fig. 9.3. The sequential nature of simple
recurrent networks can also be seen by unrolling the network in time as is shown in
Fig. 9.4. In this figure, the various layers of units are copied for each time step to
illustrate that they will have differing values over time. However, the various weight
matrices are shared across time.
function FORWARDRNN(x, network) returns output sequence y
   h_0 ← 0
   for i ← 1 to LENGTH(x) do
      h_i ← g(U h_{i−1} + W x_i)
      y_i ← f(V h_i)
   return y
Figure 9.3 Forward inference in a simple recurrent network. The matrices U, V and W are
shared across time, while new values for h and y are calculated with each time step.
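The same procedure can be written directly in Python. The sketch below reuses the
hypothetical rnn_step and parameter matrices from the earlier snippet and simply
loops over the input, with U, V, and W shared across time as in Fig. 9.3.

def forward_rnn(xs, h0, W, U, V):
    # Forward inference as in Fig. 9.3: one rnn_step per time step,
    # with the matrices U, V, and W shared across time.
    h, ys = h0, []
    for x in xs:
        h, y = rnn_step(x, h, W, U, V)
        ys.append(y)
    return ys

# A toy sequence of 6 input vectors.
xs = rng.normal(size=(6, d_in))
ys = forward_rnn(xs, np.zeros(d_h), W, U, V)
assert len(ys) == 6 and ys[0].shape == (d_out,)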
Figure 9.4 A simple recurrent neural network shown unrolled in time, with inputs
x_1, x_2, x_3, hidden states h_0 through h_3, and outputs y_1, y_2, y_3. Network
layers are recalculated for each time step, while the weights U, V and W are shared
across all time steps.
9.1.2 Training
As with feedforward networks, we'll use a training set, a loss function, and
backpropagation to obtain the gradients needed to adjust the weights in these
recurrent networks. As shown in Fig. 9.2, we now have 3 sets of weights to update:
W, the weights from the input layer to the hidden layer, U, the weights from the
previous hidden layer to the current hidden layer, and finally V, the weights from
the hidden layer to the output layer.
Fig. 9.4 highlights two considerations that we didn’t have to worry about with
backpropagation in feedforward networks. First, to compute the loss function for
the output at time t we need the hidden layer from time t −1. Second, the hidden
layer at time t influences both the output at time t and the hidden layer at time t + 1
(and hence the output and loss at t + 1). It follows from this that to assess the error
accruing to h_t, we'll need to know its influence on both the current output as
well as the ones that follow.
Tailoring the backpropagation algorithm to this situation leads to a two-pass
algorithm for training the weights in RNNs. In the first pass, we perform forward
inference, computing h_t and y_t, accumulating the loss at each step in time, and
saving the value of the hidden layer at each step for use at the next time step. In
the second phase, we process the sequence in reverse, computing the required
gradients as we go, computing and saving the error term for use in the hidden layer
for each step backward in time. This general approach is commonly referred to as
backpropagation through time (Werbos 1974, Rumelhart et al. 1986, Werbos 1990).
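As a concrete illustration of the two passes, here is a minimal NumPy sketch of
backpropagation through time for the simple RNN of Eqs. 9.1–9.3, assuming tanh for
g, a softmax output with cross-entropy loss, and the softmax helper from the earlier
snippet; the function name bptt and its interface are illustrative assumptions.

def bptt(xs, targets, h0, W, U, V):
    # First pass: forward inference, saving h_t and y_t at every step
    # and accumulating the cross-entropy loss.
    hs, ys = {-1: h0}, {}
    loss = 0.0
    for t in range(len(xs)):
        hs[t] = np.tanh(U @ hs[t - 1] + W @ xs[t])
        ys[t] = softmax(V @ hs[t])
        loss += -np.log(ys[t][targets[t]])
    # Second pass: move backward through time, carrying the error that
    # each hidden state receives from the steps that follow it.
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        dz_out = ys[t].copy()
        dz_out[targets[t]] -= 1.0           # gradient of softmax + cross-entropy
        dV += np.outer(dz_out, hs[t])
        dh = V.T @ dz_out + dh_next         # error from the current output and the future
        dz = (1.0 - hs[t] ** 2) * dh        # backprop through tanh
        dW += np.outer(dz, xs[t])
        dU += np.outer(dz, hs[t - 1])
        dh_next = U.T @ dz
    return loss, dW, dU, dV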
Fortunately, with modern computational frameworks and adequate computing resources,
there is no need for a specialized approach to training RNNs. As illustrated in
Fig. 9.4, explicitly unrolling a recurrent network into a feedforward computational
graph eliminates any explicit recurrences, allowing the network weights to be
trained directly. In such an approach, we provide a template that specifies the
basic structure of the network, including all the necessary parameters for the
input, output, and hidden layers, the weight matrices, as well as the activation
and output functions to be used. Then, when presented with a specific input
sequence, we can generate an unrolled feedforward network specific to that input,
and use that graph to perform forward inference or training via ordinary
backpropagation.
For applications that involve much longer input sequences, such as speech
recognition, character-level processing, or streaming continuous inputs, unrolling
an entire input sequence may not be feasible. In these cases, we can unroll the
input into manageable fixed-length segments and treat each segment as a distinct
training item.
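A minimal sketch of this segmenting, reusing the hypothetical bptt and parameter
matrices from above: each fixed-length chunk is treated as a distinct training
item. The helper name segments and the segment length are illustrative choices.

def segments(xs, targets, seg_len):
    # Cut a long training sequence into fixed-length chunks.
    for start in range(0, len(xs), seg_len):
        yield xs[start:start + seg_len], targets[start:start + seg_len]

# A long toy sequence; each chunk is treated as a distinct training item.
T = 1000
xs_long = rng.normal(size=(T, d_in))
targets_long = rng.integers(0, d_out, size=T)
for seg_x, seg_y in segments(xs_long, targets_long, seg_len=32):
    loss, dW, dU, dV = bptt(seg_x, seg_y, np.zeros(d_h), W, U, V)
    # ... apply a gradient update to W, U, and V here ...
    # (A common refinement is to carry the final hidden state of one
    # segment forward as the initial state of the next.)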
9.2 RNNs as Language Models
Let’s see how to apply RNNs to the language modeling task. Recall from Chapter 3
that language models predict the next word in a sequence given some preceding
context. For example, if the preceding context is “Thanks for all the” and we want
to know how likely the next word is “fish” we would compute:
P(fish|Thanks for all the)
Language models give us the ability to assign such a conditional probability to every
possible next word, giving us a distribution over the entire vocabulary. We can also
assign probabilities to entire sequences by combining these conditional probabilities
with the chain rule:
P(w_{1:n}) = ∏_{i=1}^{n} P(w_i | w_{<i})
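As a worked example, with purely hypothetical per-word conditional probabilities,
the probability of "Thanks for all the fish" is just the product of the next-word
conditionals (computed in log space to avoid underflow):

import math

# Hypothetical per-word conditional probabilities, purely for illustration.
cond_probs = [0.01,   # P(Thanks)
              0.20,   # P(for | Thanks)
              0.15,   # P(all | Thanks for)
              0.40,   # P(the | Thanks for all)
              0.05]   # P(fish | Thanks for all the)

# Summing log probabilities avoids underflow for long sequences.
log_p = sum(math.log(p) for p in cond_probs)
print(math.exp(log_p))   # P(Thanks for all the fish) ≈ 6e-06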
The n-gram language models of Chapter 3 compute the probability of a word given
counts of its occurrence with the n −1 prior words. The context is thus of size n−1.
For the feedforward language models of Chapter 7, the context is the window size.
RNN language models (Mikolov et al., 2010) process the input sequence one
word at a time, attempting to predict the next word from the current word and the
previous hidden state. RNNs thus don’t have the limited context problem that n-gram
models have, or the fixed context that feedforward language models have, since the
hidden state can in principle represent information about all of the preceding words
all the way back to the beginning of the sequence. Fig. 9.5 sketches this difference
between a FFN language model and an RNN language model, showing that the RNN
language model uses h_{t−1}, the hidden state from the previous time step, as a
representation of the past context.
9.2.1 Forward Inference in an RNN language model
Forward inference in a recurrent language model proceeds exactly as described in
Section 9.1.1. The input sequence X = [x_1; ...; x_t; ...; x_N] consists of a series
of words, each represented as a one-hot vector of size |V| × 1, and the output
prediction, y, is a
vector representing a probability distribution over the vocabulary. At each step, the
model uses the word embedding matrix E to retrieve the embedding for the current
word, and then combines it with the hidden layer from the previous step to compute a
new hidden layer. This hidden layer is then used to generate an output layer which is
passed through a softmax layer to generate a probability distribution over the
entire vocabulary.
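Putting the pieces together, one step of an RNN language model might look like the
following sketch, which reuses the NumPy setup and softmax helper from the earlier
snippets; the embedding size and all names here (E, rnn_lm_step, W_lm, U_lm, V_lm)
are illustrative assumptions rather than the text's notation.

d_emb, vocab_size = 8, 10                     # illustrative sizes
E = rng.normal(size=(d_emb, vocab_size))      # word embedding matrix
W_lm = rng.normal(size=(d_h, d_emb))
U_lm = rng.normal(size=(d_h, d_h))
V_lm = rng.normal(size=(vocab_size, d_h))

def rnn_lm_step(word_id, h_prev):
    # Embedding lookup: E[:, word_id] equals E multiplied by a one-hot vector.
    e_t = E[:, word_id]
    # Update the hidden state from the current word and the previous state.
    h_t = np.tanh(U_lm @ h_prev + W_lm @ e_t)
    # Softmax over the output layer gives a distribution over the vocabulary.
    y_t = softmax(V_lm @ h_t)
    return h_t, y_t

h = np.zeros(d_h)
for w in [3, 7, 1]:          # a toy sequence of word ids
    h, y = rnn_lm_step(w, h)
print(int(y.argmax()))       # id of the most probable next word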