DeepSentenceEmbeddingUsingLongShort-TermMemoryNetworks资源-CSDN文库

需积分: 22 145 浏览量 2017-03-02 10:51:29 上传评论收藏 1.65MB PDF 举报

资源推荐

资源详情

资源评论

Deep Sentence Embedding Using Long

Short-Term Memory Networks: Analysis and

Application to Information Retrieval

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song,

Rabab Ward

Abstract—This paper develops a model that addresses

sentence embedding, a hot topic in current natural lan-

guage processing research, using recurrent neural networks

(RNN) with Long Short-Term Memory (LSTM) cells. The

proposed LSTM-RNN model sequentially takes each word

in a sentence, extracts its information, and embeds it into

a semantic vector. Due to its ability to capture long term

memory, the LSTM-RNN accumulates increasingly richer

information as it goes through the sentence, and when it

reaches the last word, the hidden layer of the network

provides a semantic representation of the whole sentence.

In this paper, the LSTM-RNN is trained in a weakly

supervised manner on user click-through data logged by a

commercial web search engine. Visualization and analysis

are performed to understand how the embedding process

works. The model is found to automatically attenuate the

unimportant words and detects the salient keywords in

the sentence. Furthermore, these detected keywords are

found to automatically activate different cells of the LSTM-

RNN, where words belonging to a similar topic activate the

same cell. As a semantic representation of the sentence,

the embedding vector can be used in many different

applications. These automatic keyword detection and topic

allocation abilities enabled by the LSTM-RNN allow the

network to perform document retrieval, a difﬁcult language

processing task, where the similarity between the query and

documents can be measured by the distance between their

corresponding sentence embedding vectors computed by

the LSTM-RNN. On a web search task, the LSTM-RNN

embedding is shown to signiﬁcantly outperform several

existing state of the art methods. We emphasize that the

proposed model generates sentence embedding vectors that

are specially useful for web document retrieval tasks. A

comparison with a well known general sentence embedding

method, the Paragraph Vector, is performed. The results

show that the proposed method in this paper signiﬁcantly

outperforms it for web document retrieval task.

Index Terms—Deep Learning, Long Short-Term Mem-

ory, Sentence Embedding.

I. INTRODUCTION

H. Palangi and R. Ward are with the Department of Electrical and

Computer Engineering, University of British Columbia, Vancouver,

BC, V6T 1Z4 Canada (e-mail: {hamidp,rababw}@ece.ubc.ca)

L. Deng, Y. Shen, J.Gao, X. He, J. Chen and X. Song are

with Microsoft Research, Redmond, WA 98052 USA (e-mail:

{deng,jfgao,xiaohe,yeshen,jianshuc,xinson}@microsoft.com)

EARNING a good representation (or features) of

input data is an important task in machine learning.

In text and language processing, one such problem is

learning of an embedding vector for a sentence; that is, to

train a model that can automatically transform a sentence

to a vector that encodes the semantic meaning of the

sentence. While word embedding is learned using a

loss function deﬁned on word pairs, sentence embedding

is learned using a loss function deﬁned on sentence

pairs. In the sentence embedding usually the relationship

among words in the sentence, i.e., the context informa-

tion, is taken into consideration. Therefore, sentence em-

bedding is more suitable for tasks that require computing

semantic similarities between text strings. By mapping

texts into a uniﬁed semantic representation, the embed-

ding vector can be further used for different language

processing applications, such as machine translation [1],

sentiment analysis [2], and information retrieval [3].

In machine translation, the recurrent neural networks

(RNN) with Long Short-Term Memory (LSTM) cells, or

the LSTM-RNN, is used to encode an English sentence

into a vector, which contains the semantic meaning of

the input sentence, and then another LSTM-RNN is

used to generate a French (or another target language)

sentence from the vector. The model is trained to best

predict the output sentence. In [2], a paragraph vector

is learned in an unsupervised manner as a distributed

representation of sentences and documents, which are

then used for sentiment analysis. Sentence embedding

can also be applied to information retrieval, where the

contextual information are properly represented by the

vectors in the same space for fuzzy text matching [3].

In this paper, we propose to use an RNN to sequen-

tially accept each word in a sentence and recurrently map

it into a latent space together with the historical informa-

tion. As the RNN reaches the last word in the sentence,

the hidden activations form a natural embedding vector

for the contextual information of the sentence. We further

incorporate the LSTM cells into the RNN model (i.e. the

LSTM-RNN) to address the difﬁculty of learning long

term memory in RNN. The learning of such a model

arXiv:1502.06922v3 [cs.CL] 16 Jan 2016

is performed in a weakly supervised manner on the

click-through data logged by a commercial web search

engine. Although manually labelled data are insufﬁcient

in machine learning, logged data with limited feedback

signals are massively available due to the widely used

commercial web search engines. Limited feedback in-

formation such as click-through data provides a weak

supervision signal that indicates the semantic similarity

between the text on the query side and the clicked text

on the document side. To exploit such a signal, the

objective of our training is to maximize the similarity

between the two vectors mapped by the LSTM-RNN

from the query and the clicked document, respectively.

Consequently, the learned embedding vectors of the

query and clicked document are speciﬁcally useful for

web document retrieval task.

An important contribution of this paper is to analyse

the embedding process of the LSTM-RNN by visualizing

the internal activation behaviours in response to different

text inputs. We show that the embedding process of the

learned LSTM-RNN effectively detects the keywords,

while attenuating less important words, in the sentence

automatically by switching on and off the gates within

the LSTM-RNN cells. We further show that different

cells in the learned model indeed correspond to differ-

ent topics, and the keywords associated with a similar

topic activate the same cell unit in the model. As the

LSTM-RNN reads to the end of the sentence, the topic

activation accumulates and the hidden vector at the last

word encodes the rich contextual information of the

entire sentence. For this reason, a natural application

of sentence embedding is web search ranking, in which

the embedding vector from the query can be used to

match the embedding vectors of the candidate documents

according to the maximum cosine similarity rule. Evalu-

ated on a real web document ranking task, our proposed

method signiﬁcantly outperforms many of the existing

state of the art methods in NDCG scores. Please note

that when we refer to document in the paper we mean

the title (headline) of the document.

II. RELATED WORK

Inspired by the word embedding method [4], [5], the

authors in [2] proposed an unsupervised learning method

to learn a paragraph vector as a distributed representation

of sentences and documents, which are then used for

sentiment analysis with superior performance. However,

the model is not designed to capture the ﬁne-grained

sentence structure. In [6], an unsupervised sentence

embedding method is proposed with great performance

on large corpus of contiguous text corpus, e.g., the

BookCorpus [7]. The main idea is to encode the sentence

s(t) and then decode previous and next sentences, i.e.,

s(t−1) and s(t+1), using two separate decoders. The en-

coder and decoders are RNNs with Gated Recurrent Unit

(GRU) [8]. However, this sentence embedding method

is not designed for document retrieval task having a

supervision among queries and clicked and unclicked

documents. In [9], a Semi-Supervised Recursive Au-

toencoder (RAE) is proposed and used for sentiment

prediction. Similar to our proposed method, it does not

need any language speciﬁc sentiment parsers. A greedy

approximation method is proposed to construct a tree

structure for the input sentence. It assigns a vector per

word. It can become practically problematic for large

vocabularies. It also works both on unlabeled data and

supervised sentiment data.

Similar to the recurrent models in this paper, The

DSSM [3] and CLSM [10] models, developed for in-

formation retrieval, can also be interpreted as sentence

embedding methods. However, DSSM treats the input

sentence as a bag-of-words and does not model word

dependencies explicitly. CLSM treats a sentence as a bag

of n-grams, where n is deﬁned by a window, and can

capture local word dependencies. Then a Max-pooling

layer is used to form a global feature vector. Methods in

[11] are also convolutional based networks for Natural

Language Processing (NLP). These models, by design,

cannot capture long distance dependencies, i.e., depen-

dencies among words belonging to non-overlapping n-

grams. In [12] a Dynamic Convolutional Neural Network

(DCNN) is proposed for sentence embedding. Similar to

CLSM, DCNN does not rely on a parse tree and is easily

applicable to any language. However, different from

CLSM where a regular max-pooling is used, in DCNN a

dynamic k-max-pooling is used. This means that instead

of just keeping the largest entries among word vectors in

one vector, k largest entries are kept in k different vec-

tors. DCNN has shown good performance in sentiment

prediction and question type classiﬁcation tasks. In [13],

a convolutional neural network architecture is proposed

for sentence matching. It has shown great performance in

several matching tasks. In [14], a Bilingually-constrained

Recursive Auto-encoders (BRAE) is proposed to create

semantic vector representation for phrases. Through ex-

periments it is shown that the proposed method has great

performance in two end-to-end SMT tasks.

Long short-term memory networks were developed

in [15] to address the difﬁculty of capturing long term

memory in RNN. It has been successfully applied to

speech recognition, which achieves state-of-art perfor-

mance [16], [17]. In text analysis, LSTM-RNN treats a

sentence as a sequence of words with internal structures,

i.e., word dependencies. It encodes a semantic vector of

a sentence incrementally which differs from DSSM and

CLSM. The encoding process is performed left-to-right,

word-by-word. At each time step, a new word is encoded

into the semantic vector, and the word dependencies

embedded in the vector are “updated”. When the process

reaches the end of the sentence, the semantic vector has

embedded all the words and their dependencies, hence,

can be viewed as a feature vector representation of the

whole sentence. In the machine translation work [1], an

input English sentence is converted into a vector repre-

sentation using LSTM-RNN, and then another LSTM-

RNN is used to generate an output French sentence.

The model is trained to maximize the probability of

predicting the correct output sentence. In [18], there are

two main composition models, ADD model that is bag

of words and BI model that is a summation over bi-gram

pairs plus a non-linearity. In our proposed model, instead

of simple summation, we have used LSTM model with

letter tri-grams which keeps valuable information over

long intervals (for long sentences) and throws away use-

less information. In [19], an encoder-decoder approach is

proposed to jointly learn to align and translate sentences

from English to French using RNNs. The concept of

“attention” in the decoder, discussed in this paper, is

closely related to how our proposed model extracts

keywords in the document side. For further explanations

please see section V-A2. In [20] a set of visualizations

are presented for RNNs with and without LSTM cells

and GRUs. Different from our work where the target task

is sentence embedding for document retrieval, the target

tasks in [20] were character level sequence modelling for

text characters and source codes. Interesting observations

about interpretability of some LSTM cells and statistics

of gates activations are presented. In section V-A we

show that some of the results of our visualization are

consistent with the observations reported in [20]. We

also present more detailed visualization speciﬁc to the

document retrieval task using click-through data. We also

present visualizations about how our proposed model can

be used for keyword detection.

Different from the aforementioned studies, the method

developed in this paper trains the model so that sentences

that are paraphrase of each other are close in their

semantic embedding vectors — see the description in

Sec. IV further ahead. Another reason that LSTM-RNN

is particularly effective for sentence embedding, is its

robustness to noise. For example, in the web document

ranking task, the noise comes from two sources: (i) Not

every word in query / document is equally important,

and we only want to “remember” salient words using

the limited “memory”. (ii) A word or phrase that is

important to a document may not be relevant to a

given query, and we only want to “remember” related

words that are useful to compute the relevance of the

document for a given query. We will illustrate robustness

of LSTM-RNN in this paper. The structure of LSTM-

RNN will also circumvent the serious limitation of using

a ﬁxed window size in CLSM. Our experiments show

that this difference leads to signiﬁcantly better results in

web document retrieval task. Furthermore, it has other

advantages. It allows us to capture keywords and key

topics effectively. The models in this paper also do not

need the extra max-pooling layer, as required by the

CLSM, to capture global contextual information and they

do so more effectively.

III. SENTENCE EMBEDDING USING RNNS WITH AND

WITHOUT LSTM CELLS

In this section, we introduce the model of recurrent

neural networks and its long short-term memory version

for learning the sentence embedding vectors. We start

with the basic RNN and then proceed to LSTM-RNN.

A. The basic version of RNN

The RNN is a type of deep neural networks that

are “deep” in temporal dimension and it has been used

extensively in time sequence modelling [21], [22], [23],

[24], [25], [26], [27], [28], [29]. The main idea of using

RNN for sentence embedding is to ﬁnd a dense and

low dimensional semantic representation by sequentially

and recurrently processing each word in a sentence and

mapping it into a low dimensional vector. In this model,

the global contextual features of the whole text will be

in the semantic representation of the last word in the

text sequence — see Figure 1, where x(t) is the t-th

word, coded as a 1-hot vector, W

is a ﬁxed hashing

operator similar to the one used in [3] that converts the

word vector to a letter tri-gram vector, W is the input

weight matrix, W

rec

is the recurrent weight matrix, y(t)

is the hidden activation vector of the RNN, which can be

used as a semantic representation of the t-th word, and

y(t) associated to the last word x(m) is the semantic

representation vector of the entire sentence. Note that

this is very different from the approach in [3] where the

bag-of-words representation is used for the whole text

and no context information is used. This is also different

from [10] where the sliding window of a ﬁxed size (akin

to an FIR ﬁlter) is used to capture local features and a

max-pooling layer on the top to capture global features.

In the RNN there is neither a ﬁxed-sized window nor

a max-pooling layer; rather the recurrence is used to

capture the context information in the sequence (akin

to an IIR ﬁlter).

The mathematical formulation of the above RNN

model for sentence embedding can be expressed as

l(t) = W

x(t)

y(t) = f(Wl(t) + W

rec

y(t − 1) + b) (1)

where W and W

rec

are the input and recurrent matrices

to be learned, W

is a ﬁxed word hashing operator, b

…

Embedding vector

𝒍(2)

𝒍(1)

𝒍(𝑚)

Fig. 1. The basic architecture of the RNN for sentence embedding,

where temporal recurrence is used to model the contextual information

across words in the text string. The hidden activation vector corre-

sponding to the last word is the sentence embedding vector (blue).

is the bias vector and f(·) is assumed to be tanh(·).

Note that the architecture proposed here for sentence

embedding is slightly different from traditional RNN in

that there is a word hashing layer that convert the high

dimensional input into a relatively lower dimensional

letter tri-gram representation. There is also no per word

supervision during training, instead, the whole sentence

has a label. This is explained in more detail in section

IV.

B. The RNN with LSTM cells

Although RNN performs the transformation from the

sentence to a vector in a principled manner, it is generally

difﬁcult to learn the long term dependency within the

sequence due to vanishing gradients problem. One of

the effective solutions for this problem in RNNs is

using memory cells instead of neurons originally pro-

posed in [15] as Long Short-Term Memory (LSTM) and

completed in [30] and [31] by adding forget gate and

peephole connections to the architecture.

We use the architecture of LSTM illustrated in Fig.

2 for the proposed sentence embedding method. In this

ﬁgure, i(t), f(t) , o(t) , c(t) are input gate, forget gate,

output gate and cell state vector respectively, W

, W

and W

are peephole connections, W

, W

reci

and b

i = 1, 2, 3, 4 are input connections, recurrent connections

and bias values, respectively, g(·) and h(·) are tanh(·)

function and σ(·) is the sigmoid function. We use this

architecture to ﬁnd y for each word, then use the y(m)

corresponding to the last word in the sentence as the

semantic vector for the entire sentence.

Considering Fig. 2, the forward pass for LSTM-RNN

󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛󰇜





󰇛 󰇜

󰇛 󰇜

Input Gate Output Gate

󰇛 󰇜

Forget Gate

Cell







󰇛󰇜

󰇛󰇜

󰇛󰇜

󰇛  󰇜



󰇛 󰇜



󰇛󰇜

󰇛󰇜

󰇛󰇜

󰇛  󰇜













󰇛  󰇜











󰇛  󰇜





















󰇛  󰇜







󰇛  󰇜





Fig. 2. The basic LSTM architecture used for sentence embedding

model is as follows:

(t) = g(W

l(t) + W

rec4

y(t − 1) + b

)

i(t) = σ(W

l(t) + W

rec3

y(t − 1) + W

c(t − 1) + b

)

f(t) = σ(W

l(t) + W

rec2

y(t − 1) + W

c(t − 1) + b

)

c(t) = f(t) ◦ c(t − 1) + i(t) ◦ y

(t)

o(t) = σ(W

l(t) + W

rec1

y(t − 1) + W

c(t) + b

)

y(t) = o(t) ◦ h(c(t)) (2)

where ◦ denotes Hadamard (element-wise) product. A

diagram of the proposed model with more details is

presented in section VI of Supplementary Materials.

IV. LEARNING METHOD

To learn a good semantic representation of the input

sentence, our objective is to make the embedding vectors

for sentences of similar meaning as close as possible,

and meanwhile, to make sentences of different meanings

as far apart as possible. This is challenging in practice

since it is hard to collect a large amount of manually

labelled data that give the semantic similarity signal

between different sentences. Nevertheless, the widely

used commercial web search engine is able to log

massive amount of data with some limited user feedback

signals. For example, given a particular query, the click-

through information about the user-clicked document

among many candidates is usually recorded and can be

used as a weak (binary) supervision signal to indicate

the semantic similarity between two sentences (on the

query side and the document side). In this section, we

explain how to leverage such a weak supervision signal

to learn a sentence embedding vector that achieves the

aforementioned training objective. Please also note that

above objective to make sentences with similar meaning

as close as possible is similar to machine translation

剩余24页未读，继续阅读

评论收藏

内容反馈

ken_henderson

粉丝: 2
资源: 11

Deep Sentence Embedding Using Long Short-Term Memory Networks

Classifying Relations via Long Short Term Memory Networks

Long Short Term Memory Networks for Anomaly Detection in Time Series

Code for Long Short-Term Memory Networks With Python.zip

Long Short-Term Memory Networks With Python.pdf

Long Short-Term Memory Networks With Python

Dependency-based Siamese Long Short-Term Memory Network for Learning Sentence Representations

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 模型

A Structured Self-attentive Sentence Embedding

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

CNN-for-Sentence-Classification-in-Keras, 在Keras中用于句子分类的卷积神经网络.zip

sentence_pattern_2007-0911

short-sentence-similarity-nlp-model:该模型主要是用于短文本语义相似度匹配场景

Python库 | ja_sentence_segmenter-0.0.1-py3-none-any.whl

PyPI 官网下载 | ja_sentence_segmenter-0.0.1-py3-none-any.whl

Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN

Python库 | sentencepiece-0.1.4-cp35-cp35m-manylinux1_i686.whl

sentence-transformers/allenai-specter 模型

BERT-Embedding-Frequently-Asked-Question:使用BERT的基于常见问题的问答系统

Universal Sentence Encoder

DNN Sentence Classification

Python库 | sentencepiece-0.1.96-cp37-cp37m-win_amd64.whl

Python-Implementationoftheimagesentenceembeddingmethod

sentence-transformers-new.rar

Python库 | sentencepiece-0.1.7-cp36-cp36m-win32.whl

stable-diffusion部署需要的包

大规模语言模型：从理论到实践

人工智能大模型介绍.pptx

21个免费无限制免登录chatgpt资源， OpenAI GPT-4\3.5 模型的智能对话链接

diabetes糖尿病数据集

最新资源