Deep Sentence Embedding Using Long Short-Term Memory Networks


This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information into the semantic vector, and the word dependencies embedded in the vector are "updated". When the process reaches the end of the sentence, the semantic vector has embedded all the words and their dependencies, and hence can be viewed as a feature vector representation of the whole sentence. In the machine translation work [1], an input English sentence is converted into a vector representation using LSTM-RNN, and then another LSTM-RNN is used to generate an output French sentence. The model is trained to maximize the probability of predicting the correct output sentence. In [18], there are two main composition models: the ADD model, which is a bag of words, and the BI model, which is a summation over bi-gram pairs plus a non-linearity. In our proposed model, instead of simple summation, we have used an LSTM model with letter tri-grams, which keeps valuable information over long intervals (for long sentences) and throws away useless information. In [19], an encoder-decoder approach is proposed to jointly learn to align and translate sentences from English to French using RNNs. The concept of "attention" in the decoder, discussed in that paper, is closely related to how our proposed model extracts keywords on the document side. For further explanations please see Section V-A.2. In [20], a set of visualizations is presented for RNNs with and without LSTM cells and GRUs. Different from our work, where the target task is sentence embedding for document retrieval, the target tasks in [20] were character-level sequence modelling for text characters and source code. Interesting observations about the interpretability of some LSTM cells and statistics of gate activations are presented. In Section V-A we show that some of the results of our visualization are consistent with the observations reported in [20]. We also present more detailed visualization specific to the document retrieval task using click-through data, as well as visualizations of how our proposed model can be used for keyword detection.

Different from the aforementioned studies, the method developed in this paper trains the model so that sentences that are paraphrases of each other are close in their semantic embedding vectors; see the description in Sec. IV further ahead. Another reason that LSTM-RNN is particularly effective for sentence embedding is its robustness to noise. For example, in the web document ranking task, the noise comes from two sources: (i) not every word in the query/document is equally important, and we only want to "remember" salient words using the limited "memory"; (ii) a word or phrase that is important to a document may not be relevant to a given query, and we only want to "remember" related words that are useful to compute the relevance of the document for the given query. We will illustrate the robustness of LSTM-RNN in this paper. The structure of LSTM-RNN will also circumvent the serious limitation of using a fixed window size in CLSM. Our experiments show that this difference leads to significantly better results in the web document retrieval task. Furthermore, it has other advantages: it allows us to capture keywords and key topics effectively. The models in this paper also do not need the extra max-pooling layer, required by the CLSM to capture global contextual information, and they do so more effectively.

III. SENTENCE EMBEDDING USING RNNS WITH AND WITHOUT LSTM CELLS

In this section we introduce the model of recurrent neural networks and its long short-term memory version for learning the sentence embedding vectors. We start with the basic RNN and then proceed to LSTM-RNN.

A. The basic version of RNN

The RNN is a type of deep neural network that is "deep" in the temporal dimension, and it has been used extensively in time sequence modelling [21]-[29]. The main idea of using RNN for sentence embedding is to find a dense and low dimensional semantic representation by sequentially and recurrently processing each word in a sentence and mapping it into a low dimensional vector. In this model, the global contextual features of the whole text will be in the semantic representation of the last word in the text sequence; see Figure 1, where x(t) is the t-th word, coded as a 1-hot vector, W_h is a fixed hashing operator similar to the one used in [3] that converts the word vector to a letter tri-gram vector, W is the input weight matrix, W_rec is the recurrent weight matrix, y(t) is the hidden activation vector of the RNN, which can be used as a semantic representation of the t-th word, and y(t) associated with the last word x(m) is the semantic representation vector of the entire sentence. Note that this is very different from the approach in [3], where the bag-of-words representation is used for the whole text and no context information is used. This is also different from [10], where a sliding window of a fixed size (akin to an FIR filter) is used to capture local features and a max-pooling layer on top captures global features. In the RNN there is neither a fixed-sized window nor a max-pooling layer; rather, the recurrence is used to capture the context information in the sequence (akin to an IIR filter).

The mathematical formulation of the above RNN model for sentence embedding can be expressed as

l(t) = W_h x(t)
y(t) = f(W l(t) + W_rec y(t-1) + b)    (1)

where W and W_rec are the input and recurrent matrices to be learned, W_h is a fixed word hashing operator, b is the bias vector, and f(·) is assumed to be tanh(·).

Fig. 1. The basic architecture of the RNN for sentence embedding, where temporal recurrence is used to model the contextual information across words in the text string. The hidden activation vector corresponding to the last word is the sentence embedding vector (blue).

Fig. 2. The basic LSTM architecture used for sentence embedding.
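The recurrence in (1) can be sketched in a few lines of NumPy. The dimensions, random weights, and the function name are illustrative assumptions, not the paper's implementation; the key point is that the hidden activation after the last word serves as the sentence embedding.

```python
import numpy as np

def rnn_sentence_embedding(X, W, W_rec, b):
    """Run the basic RNN of Eq. (1) over a sentence and return y(m),
    the hidden activation at the last word, as the sentence embedding.

    X: (m, d) matrix whose t-th row plays the role of l(t) = W_h x(t),
    i.e. the letter-tri-gram representation of the t-th word.
    """
    h = np.zeros(W_rec.shape[0])           # y(0) = 0
    for l_t in X:                          # sequentially absorb each word
        h = np.tanh(W @ l_t + W_rec @ h + b)
    return h                               # y(m): embedding of the whole sentence

# Toy dimensions (illustrative only): 6-dim tri-gram input, 4-dim embedding.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))            # a 3-word "sentence"
W = rng.standard_normal((4, 6)) * 0.1
W_rec = rng.standard_normal((4, 4)) * 0.1
b = np.zeros(4)
emb = rnn_sentence_embedding(X, W, W_rec, b)
```

Because f(·) is tanh(·), every coordinate of the embedding stays in (-1, 1).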
Note that the architecture proposed here for sentence embedding is slightly different from a traditional RNN, in that there is a word hashing layer that converts the high dimensional input into a relatively lower dimensional letter tri-gram representation. There is also no per-word supervision during training; instead, the whole sentence has a label. This is explained in more detail in Section IV.

B. The RNN with LSTM cells

Although the RNN performs the transformation from the sentence to a vector in a principled manner, it is generally difficult to learn the long term dependency within the sequence due to the vanishing gradients problem. One of the effective solutions for this problem in RNNs is using memory cells instead of neurons, originally proposed in [5] as Long Short-Term Memory (LSTM) and completed in [30] and [31] by adding the forget gate and peephole connections to the architecture.

We use the architecture of the LSTM illustrated in Fig. 2 for the proposed sentence embedding method. In this figure, i(t), f(t), o(t) and c(t) are the input gate, forget gate, output gate and cell state vector, respectively, W_p1, W_p2 and W_p3 are peephole connections, W_i, W_reci and b_i, i = 1, 2, 3, 4, are input connections, recurrent connections and bias values, respectively, g(·) and h(·) are tanh(·) functions, and σ(·) is the sigmoid function. We use this architecture to find y for each word, then use the y(m) corresponding to the last word in the sentence as the semantic vector for the entire sentence.

Considering Fig. 2, the forward pass for the LSTM-RNN is

y_g(t) = g(W_4 l(t) + W_rec4 y(t-1) + b_4)
i(t) = σ(W_3 l(t) + W_rec3 y(t-1) + W_p3 c(t-1) + b_3)
f(t) = σ(W_2 l(t) + W_rec2 y(t-1) + W_p2 c(t-1) + b_2)
c(t) = f(t) ∘ c(t-1) + i(t) ∘ y_g(t)
o(t) = σ(W_1 l(t) + W_rec1 y(t-1) + W_p1 c(t) + b_1)
y(t) = o(t) ∘ h(c(t))    (2)

where ∘ denotes the Hadamard (element-wise) product. A diagram of the proposed model with more details is presented in Section VI of the Supplementary Materials.

IV. LEARNING METHOD

To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meaning as close as possible, and meanwhile to make sentences of different meanings as far apart as possible. This is challenging in practice, since it is hard to collect a large amount of manually labelled data that gives the semantic similarity signal between different sentences. Nevertheless, a widely used commercial web search engine is able to log a massive amount of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded, and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side). In this section, we explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective. Please also note that the above objective of making sentences with similar meanings as close as possible is similar to machine translation tasks, where two sentences belong to two different languages with similar meanings and we want to make their semantic representations as close as possible.

Fig. 3. The click-through signal can be used as a (binary) indication of the semantic similarity between the sentence on the query side and the sentence on the document side. The negative samples are randomly sampled from the training data.

The per-query loss is

l_r(Λ) = log(1 + Σ_{j=1}^{n} e^{-γ Δ_{r,j}})    (5)

where Δ_{r,j} = R(Q_r, D_r^+) - R(Q_r, D_{r,j}^-), R(·,·) is the cosine similarity defined in (3), D_r^+ is the positive (clicked) document for the r-th query, D_{r,j}^- is the j-th negative candidate document for the r-th query, and n denotes the number of negative samples used during training. Equivalently, the probability of the clicked document given the r-th query is the scaled softmax

P(D_r^+ | Q_r) = e^{γ R(Q_r, D_r^+)} / (e^{γ R(Q_r, D_r^+)} + Σ_{j=1}^{n} e^{γ R(Q_r, D_{r,j}^-)})

where N is the number of query/clicked-document pairs in the corpus. The expression in (5) is a logistic loss over Δ_{r,j}. It upper-bounds the pairwise accuracy, i.e., the 0-1 loss.
Since the similarity measure is the cosine function, Δ_{r,j} ∈ [-2, 2]. To have a larger range for Δ_{r,j}, we use γ for scaling; it helps to penalize the prediction error more. Its value is set empirically by experiments on a held-out dataset.

We now describe how to train the model to achieve the above objective using the click-through data logged by a commercial search engine. For a complete description of the click-through data, please refer to Section 2 in [32]. To begin with, we adopt the cosine similarity between the semantic vectors of two sentences as a measure of their similarity:

R(Q, D) = y_Q(T_Q)^T y_D(T_D) / (||y_Q(T_Q)|| · ||y_D(T_D)||)    (3)

where T_Q and T_D are the lengths of sentence Q and sentence D, respectively. In the context of training over click-through data, we will use Q and D to denote a "query" and a "document", respectively. In Figure 3, we show the sentence embedding vectors corresponding to the query, y_Q(T_Q), and all the documents, y_{D+}(T_{D+}), y_{D_1^-}(T_{D_1^-}), ..., y_{D_n^-}(T_{D_n^-}), where the subscript D+ denotes the (clicked) positive sample among the documents, and the subscript D_j^- denotes the j-th (un-clicked) negative sample. All these embedding vectors are generated by feeding the sentences into the RNN or LSTM-RNN model described in Sec. III and taking the y corresponding to the last word; see the blue box in Figure 3.

We want to maximize the likelihood of the clicked document given the query, which can be formulated as the following optimization problem:

L(Λ) = min_Λ { -log Π_{r=1}^{N} P(D_r^+ | Q_r) } = min_Λ Σ_{r=1}^{N} l_r(Λ)    (4)

where Λ denotes the collection of the model parameters; in the regular RNN case it includes W_rec and W in Figure 1, and in the LSTM-RNN case it includes W_1, W_2, W_3, W_4, W_rec1, W_rec2, W_rec3, W_rec4, W_p1, W_p2, W_p3, b_1, b_2, b_3 and b_4 in Figure 2. D_r^+ is the clicked document for the r-th query, and P(D_r^+ | Q_r) is the probability of the clicked document given the r-th query.

To train the RNN and LSTM-RNN, we use Back Propagation Through Time (BPTT). The update equations for parameter Λ at epoch k are as follows:

ΔΛ_k = Λ_k - Λ_{k-1}
ΔΛ_k = μ_{k-1} ΔΛ_{k-1} - ε_{k-1} ∇L(Λ_{k-1} + μ_{k-1} ΔΛ_{k-1})    (6)

where ∇L(·) is the gradient of the cost function in (4), ε is the learning rate, and μ_k is a momentum parameter determined by the scheduling scheme used for training. The above equations are equivalent to the Nesterov method in [33]. To see why, please refer to Appendix A.1 of [34], where the Nesterov method is derived as a momentum method. The gradient of the cost function, ∇L(Λ), is

∇L(Λ) = Σ_{r=1}^{N} Σ_{j=1}^{n} Σ_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ    (7)

where T is the number of time steps over which we unfold the network. The coefficient α_{r,j} in (7) and the error signals for the different parameters of RNN and LSTM-RNN that are necessary for training are presented in Appendix A. A full derivation of the gradients for both models is presented in Section III of the Supplementary Materials.

To accelerate training by parallelization, we use mini-batch training and one large update instead of incremental updates during back propagation through time. To resolve the gradient explosion problem we use gradient clipping and re-normalization as in [35], [24].

Algorithm 1 (Training LSTM-RNN for Sentence Embedding) summarizes the training procedure. Its inputs are: a fixed step size ε, a scheduling scheme for μ, a gradient clip threshold th_G, the maximum number of epochs nEpoch, the total number of query/clicked-document pairs N, the number of un-clicked (negative) documents for a given query n, and the maximum sequence length for truncated BPTT, T. Its outputs are two trained models, one on the query side, Λ_Q, and one on the document side, Λ_D. For initialization, all parameters in Λ_Q and Λ_D are set to small random numbers.
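The cosine similarity (3), the scaled softmax over candidates, and the per-query logistic loss (5) fit in a short sketch. The value γ = 10 and the toy vectors below are illustrative assumptions (the paper sets γ empirically on held-out data); note that the loss in (5) is exactly -log P(D^+|Q).

```python
import numpy as np

def cosine_sim(q, d):
    """R(Q, D) of Eq. (3): cosine between two sentence embeddings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def clicked_prob_and_loss(y_q, y_pos, y_negs, gamma=10.0):
    """P(D+|Q) via the scaled softmax, and the per-query loss
    l_r = log(1 + sum_j exp(-gamma * Delta_{r,j})) of Eq. (5)."""
    r_pos = cosine_sim(y_q, y_pos)
    r_negs = np.array([cosine_sim(y_q, y_n) for y_n in y_negs])
    logits = gamma * np.concatenate(([r_pos], r_negs))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over all candidates
    deltas = r_pos - r_negs                 # Delta_{r,j}, each in [-2, 2]
    loss = np.log1p(np.exp(-gamma * deltas).sum())
    return probs[0], loss

rng = np.random.default_rng(2)
y_q = rng.standard_normal(8)
y_pos = y_q + 0.1 * rng.standard_normal(8)  # clicked doc: close to the query
y_negs = rng.standard_normal((4, 8))        # n = 4 random negatives
p_click, loss = clicked_prob_and_loss(y_q, y_pos, y_negs)
```

Minimizing the summed losses (4) is therefore the same as maximizing the likelihood of the clicked documents.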
To accelerate the convergence, we use the Nesterov method [33] and found it effective in training both RNN and LSTM-RNN for sentence embedding. We have used a simple yet effective scheduling for μ_k: for both RNN and LSTM-RNN models, in the first and last 2% of all parameter updates μ_k = 0.9, and for the other 96% of all parameter updates μ_k = 0.995. We have used a fixed step size for training the RNN and a fixed step size for training the LSTM-RNN. A summary of the training method for LSTM-RNN is presented in Algorithm 1. The procedure iterates over epochs and mini-batches: for each query r and each negative sample j, the gradient terms are computed with truncated BPTT (using the error-signal equations of Appendix A) and summed over j and over the queries in the mini-batch, for both the query-side model Λ_Q and the document-side model Λ_D; the accumulated gradients ∇L(Λ_{k,Q}) and ∇L(Λ_{k,D}) are re-scaled to the threshold th_G whenever their norms exceed it; the updates ΔΛ_{k,Q} and ΔΛ_{k,D} are computed as in (6); and the parameters are updated accordingly until the maximum number of epochs is reached.

V. ANALYSIS OF THE SENTENCE EMBEDDING PROCESS AND PERFORMANCE EVALUATION

To understand how the LSTM-RNN performs sentence embedding, we use visualization tools to analyze the semantic vectors generated by our model. We would like to answer the following questions: (i) How are word dependencies and context information captured? (ii) How does the LSTM-RNN attenuate unimportant information and detect critical information from the input sentence; or, how are the keywords embedded into the semantic vector? (iii) How are the global topics identified by the LSTM-RNN?

To answer these questions, we train the RNN with and without LSTM cells on a click-through dataset logged by a commercial web search engine; the training method has been described in Sec. IV. A description of the corpus is as follows. The training set includes 200,000 positive query/document pairs, where only the click signal is used as weak supervision for training the LSTM. The relevance judgement set (test set) is constructed as follows. First, the queries are sampled from a year of search engine logs. Adult, spam, and bot queries are all removed. Queries are de-duplicated so that only unique queries remain. To reflect a natural query distribution, we do not try to control the quality of these queries. For example, in our query sets there are around 20% misspelled queries, around 20% navigational queries and 10% transactional queries, etc. Second, for each query, we collect web documents to be judged by issuing the query to several popular search engines (e.g., Google, Bing) and fetching the top-10 retrieval results from each. Finally, the query-document pairs are judged by a group of well-trained assessors. In this study all the queries are preprocessed as follows: the text is white-space tokenized and lower-cased, numbers are retained, and no stemming/inflection treatment is performed. Unless stated otherwise, in the experiments we used 4 negative samples, i.e., n = 4 in Fig. 3.

We now proceed to perform a comprehensive analysis by visualizing the trained RNN and LSTM-RNN models. In particular, we will visualize the on-and-off behaviors of the input gates, output gates, cell states and the semantic vectors in the LSTM-RNN model, which reveals how the model extracts useful information from the input sentence and embeds it properly into the semantic vector according to the topic information.

Although we gave the full learning formulas for all the model parameters in the previous section, we remove the peephole connections and the forget gate from the LSTM-RNN model in the current task. This is because the length of each sequence, i.e., the number of words in a query or a document, is known in advance, and we set the state of each cell to zero at the beginning of a new sequence; therefore, forget gates are not a great help here. Also, as long as the order of words is kept, the precise timing in the sequence is not of great concern, so peephole connections are not that important as well. Removing the peephole connections and the forget gate also reduces the amount of training time, since a smaller number of parameters needs to be learned.

Fig. 4. Query: "hotels in shanghai". Activations of the input gate i(t), cell state c(t), output gate o(t) and embedding vector y(t). Since the sentence ends at the third word, all the values to the right of it are zero (green color).
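The clipping step and the Nesterov update (6) can be sketched on a toy quadratic objective. The objective, threshold, step size, and constant μ below are illustrative assumptions (the paper schedules μ_k between 0.9 and 0.995 and supplies BPTT gradients of (4) instead).

```python
import numpy as np

def clip(g, th):
    """Re-scale the gradient when its norm exceeds the threshold th_G."""
    norm = np.linalg.norm(g)
    return g * (th / norm) if norm > th else g

def nesterov_step(lam, delta_prev, grad_fn, mu, eps, th):
    """One update of Eq. (6): clip the gradient taken at the look-ahead
    point lam + mu * delta_prev, then apply the momentum update."""
    g = clip(grad_fn(lam + mu * delta_prev), th)
    delta = mu * delta_prev - eps * g
    return lam + delta, delta

# Toy objective L(lam) = 0.5 * ||lam||^2, whose gradient is lam itself.
grad_fn = lambda lam: lam
lam, delta = np.array([3.0, -4.0]), np.zeros(2)
for _ in range(300):   # constant mu here; the text schedules mu_k per update
    lam, delta = nesterov_step(lam, delta, grad_fn, mu=0.9, eps=0.05, th=10.0)
```

On this toy problem the iterates spiral into the minimum at the origin; in the full model the same update is applied in parallel to Λ_Q and Λ_D.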
Fig. 5. Document: "shanghai hotels accommodation hotel in shanghai discount and reservation". Activations of the input gate i(t), cell state c(t), output gate o(t) and embedding vector y(t). Since the sentence ends at the ninth word, all the values to the right of it are zero (green color).

A. Analysis

In this section we examine how the information in the input sentence is sequentially extracted and embedded into the semantic vector over time by the LSTM-RNN model.

1) Attenuating Unimportant Information: First, we examine the evolution of the semantic vector and how unimportant words are attenuated. Specifically, we feed the following input sentences from the test dataset into the trained LSTM-RNN model:

Query: "hotels in shanghai"
Document: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Activations of the input gate, output gate, cell state and the embedding vector for each cell, for the query and the document, are shown in Fig. 4 and Fig. 5, respectively. The vertical axis is the cell index, the horizontal axis is the word index from 1 to 10, numbered from left to right in the sequence of words, and the color codes show activation values. (If this is not clearly visible, please refer to Fig. 1 in Section I of the Supplementary Materials; we have adjusted the color bar of all figures to have the same range, and for this reason the structure might not be clearly visible. More visualization examples can also be found in Section IV of the Supplementary Materials.) From Figs. 4-5, we make the following observations:

- The semantic representation y(t) and the cell states c(t) evolve over time. Valuable context information is gradually absorbed into c(t) and y(t), so that the information in these two vectors becomes richer over time, and the semantic information of the entire input sentence is embedded into the vector y(t), which is obtained by applying the output gates to the cell states c(t).
- The input gates evolve in such a way that they attenuate the unimportant information and detect the important information in the input sentence. For example, in Fig. 5(a), most of the input gate values corresponding to word 3, word 7 and word 9 are very small (light green-yellow color); these correspond to the words "accommodation", "discount" and "reservation", respectively, in the document sentence. Interestingly, the input gates reduce the effect of these three words in the final semantic representation, y(t), such that the semantic similarity between the sentences on the query and document sides is not affected by these words.

2) Keywords Extraction: In this section, we show how the trained LSTM-RNN extracts the important information, i.e., keywords, from the input sentences. To this end, we backtrack the semantic representations, y(t), over time. Whenever there is a large enough change in a cell's activation value y(t), we assume an important keyword has been detected by the model. We illustrate the result using the above example ("hotels in shanghai"). The evolution of the activations y(t) of the 10 most active cells over time is shown in Fig. 6 for the query and the document sentences (likewise, the vertical axis is the cell index and the horizontal axis is the word index in the sentence). From Fig. 6 we also observe that different words activate different cells. In Tables I-II, we show the number of cells each word activates.

TABLE II. Key words for the document "shanghai hotels accommodation hotel in shanghai discount and reservation": number of assigned cells out of 10 per word, for left-to-right and right-to-left readings. [Entries not legible in this copy.]
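The backtracking heuristic above can be sketched as follows: flag word t for a cell whenever the cell's activation changes by more than a threshold between t-1 and t. The threshold and the activation matrix are hypothetical illustrative values, not taken from the paper's figures.

```python
import numpy as np

def detect_keywords(Y, words, threshold=0.2):
    """Count, per word, how many cells show a large change in activation
    |y(t) - y(t-1)|, mirroring the keyword-detection heuristic in the text.
    Y: (m, n_cells) activations y(t); activations before the first word are 0,
    so the first word always produces a considerable change."""
    Y_prev = np.vstack([np.zeros(Y.shape[1]), Y[:-1]])
    change = np.abs(Y - Y_prev)                 # per-cell activation change
    hits = change > threshold                   # (m, n_cells) boolean flags
    return {w: int(hits[t].sum()) for t, w in enumerate(words)}

words = ["hotels", "in", "shanghai"]
# Hypothetical activations for 4 cells over 3 words (illustrative numbers).
Y = np.array([[0.5, 0.0, 0.1, 0.4],
              [0.5, 0.05, 0.1, 0.45],
              [0.1, 0.6, 0.7, 0.4]])
cells_per_word = detect_keywords(Y, words)
```

With these numbers the content words "hotels" and "shanghai" trigger several cells while the stop word "in" triggers none, which is the qualitative pattern reported in Tables I-II.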
For simplicity, we use the following 三咔翻 simple approach: for each given query we look into the keywords that are extracted by the 5 most active cells of LSTM-rnn and list them in Table Ill interesting .o5 each cell collects keywords of a specific topic. For example. cell 26 in Table iii extracts keywords related the keywords related to the topic heallh. ainly focus on to the topic food and cells 2 and 6 m (a)y(t) top 10 for query (b)y(t) top 10 for document ig. 6. Activation values, y(t), of 10 most active cells for Query hotels in shanghai and Document: shanghai hotels accommodation B. Performance evaluation hotel in shanghai discount and reservation 1)Web Document retrieval Task: In this section, we apply the proposed sentence embedding method to an TABLE I important web document retrieval task for a commercial KEY WORDS FOR QUERY:“ hotels in shanghai” hotels t72 web search engine. Specifically, the rnn models(with Number of assigned and without LSTM cells )embed the sentences from the cells out of 1o Left to Right query and the document sides into their corresponding Number of assigned semantic vectors, and then compute the cosine similarity cells out of 10 Right to left 0 between these vectors to measure the semantic similarit between the query and candidate documents Experimental results for this task are shown in Table activates We used Bidirectional LSTM-Rnn to get the IV using the standard metric mean Normalized Dis results of these tables where in the first row, LSTM-RNN counted Cumulative Gain(NDCG)[361(the higher the reads sentences from left to right and in the second row better) for evaluating the ranking performance of the RNN and lsTm-Rnn on a standalone human -rated test it reads sentences from right to left. In these tables we dataset. 
We also trained several strong baselines, such as labelled a word as a key word if more than 40% of top dssm [3 and CLSM [10], on the same training dataset 0 active cells in both directions declare it as keyword The boldface numbers in the table show that the number and evaluated their performance on the same task. For of cells assigned to that word is more than 4, i.c., 40% fair comparison, our proposed rnn and LSTM-Rnn of top 10 active cells. From the tables, we observe that models are trained with the same number of parameters the key words activate more cells than the unimportant as the dsSM and Cl sm models(14. 4M parameters) Besides, we also include in Table iv two well-known words, meaning that they are selectively embedded into information retrieval (IR) models, BM25 and PLsA, for semantic vector the sake of benchmarking The bm25 model uses the 3)Topic Allocation: Now, we further show that the bag-of-words representation for queries and documents trained LSTM-Rnn model not only detects the key words, but also allocates them properly to different cells which is a state-of-the-art document ranking model based according to the topics they belong to. To do this, we go on term matching, widely used as a baseline in IR through the test dataset using the trained LSTM-RNN society. PLSA(Probabilistic Latent Semantic Analysis model and search for the key words that are detecte ed is a topic model proposed in [371, which is trained ,Note that before presenting the first word of the sequence, activation the documents side from the same training dataset. We values are initially zero so that there is always a considerable change in experimented with a varying number of topics from 100 the cell states after presenting the first word. For this reason, we have to 500 for PLSA, which gives similar performance, and Morcover. another keyword cxtraction cxamplc can be found in scction we report in Table [v the results of using 500 topics IV of supplementary materials. 
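For reference, NDCG@k compares the discounted cumulative gain of a ranking against the ideal ordering. The sketch below uses the common rel / log2(rank + 1) gain; the exact gain and discount conventions of the evaluation in [36] may differ, and the relevance labels are hypothetical.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the model's ranking normalized by the ideal DCG.
    relevances: human-rated gains listed in the model's ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # 1/log2(rank+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical relevance labels for a ranked list of 5 documents.
score = ndcg_at_k([3, 2, 3, 0, 1], k=3)
```

A perfectly ordered list scores exactly 1.0; swapping a highly relevant document below a less relevant one lowers the score.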
TABLE III. Key words assigned to each cell of the LSTM-RNN for different queries of two topics, "food" and "health" (cells 1-32). [Entries not legible in this copy.]

Results for a language model based method, the uni-gram language model (ULM) with Dirichlet smoothing, are also presented in the table.

To compare the performance of the proposed method with general sentence embedding methods on the document retrieval task, we also performed experiments using two general sentence embedding methods.

1) In the first experiment, we used the method proposed in [2] that generates embedding vectors known as Paragraph Vectors; it is also known as doc2vec. It maps each word to a vector and then uses the vectors representing all words inside a context window to predict the vector representation of the next word. The main idea in this method is to use an additional paragraph token from previous sentences in the document inside the context window. This paragraph token is mapped to vector space using a different matrix from the one used to map the words. A primary version of this method is known as word2vec, proposed in [39]; the only difference is that word2vec does not include the paragraph token.

To use doc2vec on our dataset, we first trained the doc2vec model on both the train set (about 200,000 query-document pairs) and the test set (about 900,000 query-document pairs). This gives us an embedding vector for every query and document in the dataset. We used the following parameters for training:

- min-count = 1: the minimum number of words per sentence; sentences with fewer words are ignored. We set it to 1 to make sure we do not throw away anything.
- window = 5: the fixed window size explained in [2]. We tried different window sizes; they resulted in only about a 0.4% difference in the final NDCG values.
- size = 100: the feature vector dimension. We used 400 as well but did not get significantly different NDCG values.
- sample = 1e-4: the down-sampling ratio for words that are repeated often in the corpus.
- negative = 5: the number of noise words, i.e., words used for negative sampling.

We used 30 epochs of training; we ran an experiment with 100 epochs but did not observe much difference in the results. We used gensim [40] to perform the experiments. To make sure that a meaningful model was trained, we used the trained doc2vec model to find the most similar words to two sample words in our dataset, e.g., the words "pizza" and "infection". The resulting words and corresponding scores are presented in Section V of the Supplementary Materials; as observed from the resulting words, the trained model is meaningful and can recognise semantic similarity.

Doc2vec also assigns an embedding vector to each query and document in our test set. We used these embedding vectors to calculate the cosine similarity score between each query-document pair in the test set, and used these scores to calculate the NDCG values reported in Table IV for the Doc2Vec model.

Comparing the results of the doc2vec model with our proposed method on the document retrieval task shows that the proposed method in this paper significantly outperforms doc2vec. One reason is that doc2vec is a very general sentence embedding method. This experiment shows that it is not a good idea to use a general sentence embedding method for this task, and that a better, task-oriented cost function, like the one proposed in this paper, is necessary.

Fig. 7. LSTM-RNN compared to RNN during training: the vertical axis is the logarithmic scale of the training cost L(Λ) in (4); the horizontal axis is the number of epochs during training.
2) In the second experiment, we used Skip-Thought vectors. During training, the skip-thought method takes a tuple (s(t-1), s(t), s(t+1)), encodes the sentence s(t) using one encoder, and tries to reconstruct the previous and next sentences, i.e., s(t-1) and s(t+1), using two separate decoders. The model uses RNNs with Gated Recurrent Units (GRU), which are shown to perform as well as LSTM. In the paper, the authors emphasize that "our model depends on having a training corpus of contiguous text". Therefore, training it on our training set, where we barely have more than one sentence per query or document title, would not be fair. However, since their model is trained on 11,038 books from the BookCorpus dataset [7], which includes about 74 million sentences, we can use the trained model as an off-the-shelf sentence embedding method, as the authors conclude in their paper.

To do this, we downloaded their trained models and word embeddings (more than 2 GB in size) available from "ryankiros/skip-thoughts". Then we encoded each query and its corresponding document title in our test set as vectors. We used the combine-skip sentence embedding method, a vector of size 4800 × 1, which is the concatenation of a uni-skip vector, i.e., a unidirectional encoder resulting in a 2400 × 1 vector, and a bi-skip vector, i.e., a bidirectional encoder resulting in a 1200 × 1 vector from the forward encoder and another 1200 × 1 vector from the backward encoder. The authors report their best results with the combine-skip encoder. Using the 4800 × 1 embedding vectors for each query and document, we calculated the similarity scores and the NDCG values for the whole test set, which are reported in Table IV.

The proposed method in this paper performs significantly better than the off-the-shelf skip-thought method on the document retrieval task. Nevertheless, given that we used skip-thought purely as an off-the-shelf sentence embedding method, its result is good. This result also confirms that learning embedding vectors with a model and cost function specifically designed for the document retrieval task is necessary.

As shown in Table IV, the LSTM-RNN significantly outperforms all these models, and exceeds the best baseline model (CLSM) by 1.3% in NDCG@1 score, which is a statistically significant improvement. As we pointed out in Sec. V-A, such an improvement comes from the LSTM-RNN's ability to embed the contextual and semantic information of the sentences into a finite dimension vector. In Table IV we have also presented the results when different numbers of negative samples, n, are used. Generally, by increasing n we expect the performance to improve, because more negative samples result in a more accurate approximation of the partition function in (5).

The results of using the bidirectional LSTM-RNN are also presented in Table IV. In this model, one LSTM-RNN reads queries and documents from left to right, and the other LSTM-RNN reads queries and documents from right to left. Then the embedding vectors from the left-to-right and right-to-left LSTM-RNNs are concatenated to compute the cosine similarity score and the NDCG values.

A comparison between the value of the cost function during training for LSTM-RNN and RNN on the click-through data is shown in Fig. 7. From this figure we conclude that LSTM-RNN is optimizing the cost function in (4) more effectively. Please note that all parameters of both models are initialized randomly.
