Word Vectors, Seminal Paper 1: Efficient Estimation of Word Representations in Vector Space

This is the first of the seminal word-vector papers, in which the authors first propose learning word vectors. In natural language processing tasks, the first question is how a word should be represented inside a computer. There are usually two kinds of representation: the one-hot representation and the distributed representation.
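To make the contrast concrete, here is a minimal Python sketch (not from the paper; the toy vocabulary, the dimensionality and the random vectors are invented purely for illustration) showing the two kinds of representation side by side:

```python
import numpy as np

# A toy vocabulary; the words, indices and vectors below are illustrative only.
vocab = ["king", "queen", "man", "woman"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """One-hot representation: a sparse V-dimensional vector with a single 1."""
    v = np.zeros(vocab_size)
    v[word_to_id[word]] = 1.0
    return v

# Distributed representation: a dense, low-dimensional real-valued vector,
# normally learned from data; here it is randomly initialized as a stand-in.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 3))      # V x D lookup table, D = 3

print(one_hot("king"))                    # [1. 0. 0. 0.] -- any two words are equally far apart
print(embedding[word_to_id["king"]])      # dense vector; trained vectors place similar words close
```

The one-hot vector carries no notion of similarity between words, while learned distributed vectors can; the paper below is about learning such vectors efficiently.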
For all the following models, the training complexity is proportional to

O = E × T × Q,    (1)

where E is the number of training epochs, T is the number of words in the training set and Q is defined further for each model architecture. A common choice is E = 3-50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation.

2.1 Feedforward Neural Net Language Model (NNLM)

The probabilistic feedforward neural network language model has been proposed in [1]. It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.

The NNLM architecture becomes complex for computation between the projection and the hidden layer, as values in the projection layer are dense. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute the probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity per each training example is

Q = N × D + N × D × H + H × V,    (2)

where the dominating term is H × V. However, several practical solutions were proposed for avoiding it: either using hierarchical versions of the softmax [25, 23, 18], or avoiding normalized models completely by using models that are not normalized during training [4, 9]. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log2(V). Thus, most of the complexity is caused by the term N × D × H.

In our models, we use hierarchical softmax where the vocabulary is represented as a Huffman binary tree. This follows previous observations that the frequency of words works well for obtaining classes in neural net language models [16]. Huffman trees assign short binary codes to frequent words, and this further reduces the number of output units that need to be evaluated: while a balanced binary tree would require log2(V) outputs to be evaluated, the Huffman tree based hierarchical softmax requires only about log2(Unigram_perplexity(V)). For example, when the vocabulary size is one million words, this results in about a two times speedup in evaluation. While this is not a crucial speedup for neural network LMs, as the computational bottleneck is in the N × D × H term, we will later propose architectures that do not have hidden layers and thus depend heavily on the efficiency of the softmax normalization.
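To illustrate why the Huffman tree reduces the work of the hierarchical softmax, the following standalone sketch (not the authors' code; the word counts are hypothetical) builds Huffman code lengths from word frequencies and compares the expected number of output units evaluated per training word against the log2(V) cost of a balanced tree:

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs):
    """Build a Huffman tree over word frequencies and return each word's code
    length (its depth in the tree). Frequent words receive shorter codes."""
    heap = [(f, i, [w]) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    depth = {w: 0 for w in freqs}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, ws1 = heapq.heappop(heap)
        f2, _, ws2 = heapq.heappop(heap)
        for w in ws1 + ws2:
            depth[w] += 1                      # every merge adds one level above these words
        heapq.heappush(heap, (f1 + f2, counter, ws1 + ws2))
        counter += 1
    return depth

# Hypothetical Zipf-like corpus counts, for illustration only.
corpus = ["the"] * 50 + ["of"] * 30 + ["cat"] * 10 + ["sat"] * 5 + ["zymurgy"] * 1
freqs = Counter(corpus)
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())

# Expected number of output units (tree nodes) evaluated per training word.
expected = sum(freqs[w] * lengths[w] for w in freqs) / total
print(f"Huffman expected code length: {expected:.2f}")
print(f"Balanced binary tree: {math.log2(len(freqs)):.2f}")
```

Because frequent words sit near the root, the frequency-weighted average code length is shorter than log2(V), which is exactly the saving the paper attributes to the Huffman-based hierarchical softmax.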
2.2 Recurrent Neural Net Language Model (RNNLM)

Recurrent neural network based language models have been proposed to overcome certain limitations of the feedforward NNLM, such as the need to specify the context length (the order of the model N), and because theoretically RNNs can efficiently represent more complex patterns than the shallow neural networks [15, 2]. The RNN model does not have a projection layer; only input, hidden and output layers. What is special for this type of model is the recurrent matrix that connects the hidden layer to itself, using time-delayed connections. This allows the recurrent model to form some kind of short term memory, as information from the past can be represented by the hidden layer state that gets updated based on the current input and the state of the hidden layer in the previous time step.

The complexity per training example of the RNN model is

Q = H × H + H × V,    (3)

where the word representations D have the same dimensionality as the hidden layer H. Again, the term H × V can be efficiently reduced to H × log2(V) by using hierarchical softmax. Most of the complexity then comes from H × H.

2.3 Parallel Training of Neural Networks

To train models on huge data sets, we have implemented several models on top of a large-scale distributed framework called DistBelief [6], including the feedforward NNLM and the new models proposed in this paper. The framework allows us to run multiple replicas of the same model in parallel, and each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, we use mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad [7]. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores at different machines in a data center.

3 New Log-linear Models

In this section, we propose two new model architectures for learning distributed representations of words that try to minimize computational complexity. The main observation from the previous section was that most of the complexity is caused by the non-linear hidden layer in the model. While this is what makes neural networks so attractive, we decided to explore simpler models that might not be able to represent the data as precisely as neural networks, but can possibly be trained on much more data efficiently.

The new architectures directly follow those proposed in our earlier work [13, 14], where it was found that a neural network language model can be successfully trained in two steps: first, continuous word vectors are learned using a simple model, and then the N-gram NNLM is trained on top of these distributed representations of words. While there has been a later substantial amount of work that focuses on learning word vectors, we consider the approach proposed in [13] to be the simplest one. Note that related models have been proposed also much earlier [26, 8].

3.1 Continuous Bag-of-Words Model

The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). We call this architecture a bag-of-words model as the order of words in the history does not influence the projection. Furthermore, we also use words from the future; we have obtained the best performance on the task introduced in the next section by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. Training complexity is then

Q = N × D + D × log2(V).    (4)

We denote this model further as CBOW, as unlike the standard bag-of-words model, it uses a continuous distributed representation of the context. The model architecture is shown in Figure 1. Note that the weight matrix between the input and the projection layer is shared for all word positions in the same way as in the NNLM.
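The following is a minimal sketch of a single CBOW training step, assuming toy dimensions and a plain softmax output for readability (the paper itself uses the hierarchical softmax; the matrices and all names here are illustrative, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 50                       # toy vocabulary size and vector dimensionality
P = rng.normal(0.0, 0.1, (V, D))      # shared input projection = the word vectors
W = rng.normal(0.0, 0.1, (D, V))      # output weights (plain softmax for clarity)

def cbow_step(context_ids, center_id, lr=0.025):
    """One CBOW update: average the context projections and predict the middle word."""
    h = P[context_ids].mean(axis=0)            # word order in the context does not matter
    scores = h @ W
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    grad = probs.copy()                        # gradient of cross-entropy w.r.t. scores
    grad[center_id] -= 1.0
    h_grad = W @ grad                          # gradient w.r.t. the averaged projection
    W[:] -= lr * np.outer(h, grad)             # update output weights
    P[context_ids] -= lr * h_grad / len(context_ids)   # update the shared word vectors
    return -np.log(probs[center_id])           # loss, useful for monitoring

# Four history and four future words predicting the middle word, as in the paper.
loss = cbow_step(context_ids=[3, 17, 42, 7, 99, 5, 23, 64], center_id=11)
```

Removing the hidden layer is what keeps the per-example cost at N × D plus the output term, instead of the N × D × H term that dominates the NNLM.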
3.2 Continuous Skip-gram Model

The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with a continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves the quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.

The training complexity of this architecture is proportional to

Q = C × (D + D × log2(V)),    (5)

where C is the maximum distance of the words. Thus, if we choose C = 5, for each training word we will select randomly a number R in range <1; C>, and then use R words from the history and R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.

Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.

4 Results

To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that the word France is similar to Italy and perhaps some other countries, it is much more challenging when subjecting those vectors to a more complex similarity task, as follows. We follow the previous observation that there can be many different types of similarities between words; for example, the word big is similar to bigger in the same sense that small is similar to smaller. An example of another type of relationship can be the word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?"

Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector("biggest") - vector("big") + vector("small"). Then, we search in the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (the word smallest) using this method.

Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin.
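A minimal sketch of the vector-arithmetic search described above, assuming a hypothetical dictionary word_vectors that maps words to numpy vectors from an already-trained model:

```python
import numpy as np

def analogy(word_vectors, a, b, c):
    """Return the word closest (by cosine similarity) to
    vector(b) - vector(a) + vector(c), excluding the three input words,
    e.g. analogy(word_vectors, "big", "biggest", "small") should give "smallest"."""
    x = word_vectors[b] - word_vectors[a] + word_vectors[c]
    x = x / np.linalg.norm(x)
    best_word, best_sim = None, -np.inf
    for w, v in word_vectors.items():
        if w in (a, b, c):                              # discard the input question words
            continue
        sim = float(np.dot(x, v) / np.linalg.norm(v))   # cosine similarity
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# Usage with a hypothetical, already-trained vector table:
# word_vectors = {"big": np.array([...]), "biggest": np.array([...]), ...}
# print(analogy(word_vectors, "big", "biggest", "small"))
```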
Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.

Table 1: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set.

Type of relationship | Word Pair 1 | Word Pair 2
Common capital city | Athens Greece | Oslo Norway
All capital cities | Astana Kazakhstan | Harare Zimbabwe
Currency | Angola kwanza | Iran rial
City-in-state | Chicago Illinois | Stockton California
Man-Woman | brother sister | grandson granddaughter
Adjective to adverb | apparent apparently | rapid rapidly
Opposite | possibly impossibly | ethical unethical
Comparative | great greater | tough tougher
Superlative | easy easiest | lucky luckiest
Present participle | think thinking | read reading
Nationality adjective | Switzerland Swiss | Cambodia Cambodian
Past tense | walking walked | swimming swam
Plural nouns | mouse mice | dollar dollars
Plural verbs | work works | speak speaks

4.1 Task Description

To measure the quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions, and nine types of syntactic questions. Two examples from each category are shown in Table 1. Overall, there are 8869 semantic and 10675 syntactic questions. The questions in each category were created in two steps: first, a list of similar word pairs was created manually. Then, a large list of questions is formed by connecting two word pairs. For example, we made a list of 68 large American cities and the states they belong to, and formed about 2.5K questions by picking two word pairs at random. We have included in our test set only single token words, thus multi-word entities are not present (such as New York).

We evaluate the overall accuracy for all question types, and for each question type separately (semantic, syntactic). A question is assumed to be correctly answered only if the closest word to the vector computed using the above method is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% accuracy is likely to be impossible, as the current models do not have any input information about word morphology. However, we believe that the usefulness of the word vectors for certain applications should be positively correlated with this accuracy metric. Further progress can be achieved by incorporating information about the structure of words, especially for the syntactic questions.
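One possible shape of the evaluation loop for this exact-match metric is sketched below; it reuses the analogy helper from the previous sketch and assumes a hypothetical question file containing one "a b c d" quadruple per line (the file name and format are assumptions, not something specified in the paper):

```python
def evaluate_questions(word_vectors, path):
    """Exact-match accuracy over analogy questions, one quadruple "a b c d" per line.
    A question counts as correct only if the single nearest word equals d;
    synonyms and near-misses count as mistakes, matching the paper's metric."""
    total, correct = 0, 0
    with open(path) as f:
        for line in f:
            words = line.strip().lower().split()
            if len(words) != 4:
                continue                           # skip headers or malformed lines
            a, b, c, d = words
            if any(w not in word_vectors for w in (a, b, c, d)):
                continue                           # skip out-of-vocabulary questions
            total += 1
            # analogy() is the nearest-neighbour helper defined in the earlier sketch.
            if analogy(word_vectors, a, b, c) == d:
                correct += 1
    return correct / total if total else 0.0

# accuracy = evaluate_questions(word_vectors, "questions-words.txt")   # hypothetical path
```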
4.2 Maximization of Accuracy

We have used a Google News corpus for training the word vectors. This corpus contains about 6B tokens. We have restricted the vocabulary size to the 1 million most frequent words. Clearly, we are facing a time constrained optimization problem, as it can be expected that both using more data and higher dimensional word vectors will improve the accuracy. To estimate the best choice of model architecture for obtaining as good as possible results quickly, we have first evaluated models trained on subsets of the training data, with the vocabulary restricted to the most frequent 30k words. The results using the CBOW architecture with different choices of word vector dimensionality and increasing amount of training data are shown in Table 2.

It can be seen that after some point, adding more dimensions or adding more training data provides diminishing improvements. So, we have to increase both the vector dimensionality and the amount of training data together. While this observation might seem trivial, it must be noted that it is currently popular to train word vectors on relatively large amounts of data, but with insufficient size (such as 50-100). Given Equation 4, increasing the amount of training data twice results in about the same increase of computational complexity as increasing the vector size twice.

Table 2: Accuracy on a subset of the Semantic-Syntactic Word Relationship test set, using word vectors from the CBOW architecture with limited vocabulary. Only questions containing words from the most frequent 30k words are used.

Dimensionality / Training words | 24M | 49M | 98M | 196M | 391M | 783M
50  | 13.4 | 15.7 | 18.6 | 19.1 | 22.5 | 23.2
100 | 19.4 | 23.1 | 27.8 | 28.7 | 33.4 | 32.2
300 | 23.2 | 29.2 | 35.3 | 38.6 | 43.7 | 45.9
600 | 24.0 | 30.1 | 36.5 | 40.8 | 46.6 | 50.4

For the experiments reported in Tables 2 and 4, we used three training epochs with stochastic gradient descent and backpropagation. We chose a starting learning rate of 0.025 and decreased it linearly, so that it approaches zero at the end of the last training epoch.
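A small sketch of this linear decay schedule is shown below; the tiny floor on the learning rate is an implementation convenience assumed here, not something stated in the paper:

```python
def linear_lr(words_processed, total_words, start_lr=0.025, floor=1e-4):
    """Learning rate decreased linearly from start_lr so that it approaches zero
    by the time the last training word of the last epoch has been processed."""
    lr = start_lr * (1.0 - words_processed / total_words)
    return max(lr, floor)

# total_words = epochs * corpus_size            # e.g. three passes over the corpus
# for i, example in enumerate(training_stream): # hypothetical stream of (context, target) pairs
#     lr = linear_lr(i, total_words)
#     take_gradient_step(example, lr)           # placeholder for the actual SGD update
```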
4.3 Comparison of Model Architectures

First we compare different model architectures for deriving the word vectors using the same training data and using the same dimensionality of 640 for the word vectors. In the further experiments, we use the full set of questions in the new Semantic-Syntactic Word Relationship test set, i.e. unrestricted to the 30k vocabulary. We also include results on a test set introduced in [20] that focuses on syntactic similarity between words (we thank Geoff Zweig for providing us the test set).

The training data consists of several LDC corpora and is described in detail in [18] (320M words, 82K vocabulary). We used these data to provide a comparison to a previously trained recurrent neural network language model that took about 8 weeks to train on a single CPU. We trained a feedforward NNLM with the same number of 640 hidden units using the DistBelief parallel training [6], using a history of 8 previous words (thus, the NNLM has more parameters than the RNNLM, as the projection layer has size 640 × 8).

Table 3: Comparison of architectures using models trained on the same data, with 640-dimensional word vectors. The accuracies are reported on our Semantic-Syntactic Word Relationship test set, and on the syntactic relationship test set of [20].

Architecture | Semantic Accuracy [%] | Syntactic Accuracy [%] | MSR Word Relatedness Test Set [20]
RNNLM | 9 | 36 | 35
NNLM | 23 | 53 | 47
CBOW | 24 | 64 | 61
Skip-gram | 55 | 59 | 56

In Table 3 it can be seen that the word vectors from the RNN (as used in [20]) perform well mostly on the syntactic questions. The NNLM vectors perform significantly better than the RNN; this is not surprising, as the word vectors in the RNNLM are directly connected to a non-linear hidden layer. The CBOW architecture works better than the NNLM on the syntactic tasks, and about the same on the semantic one. Finally, the Skip-gram architecture works slightly worse on the syntactic task than the CBOW model (but still better than the NNLM), and much better on the semantic part of the test than all the other models.

Next, we evaluated our models trained using one CPU only and compared the results against publicly available word vectors. The comparison is given in Table 4. The CBOW model was trained on a subset of the Google News data in about a day, while training time for the Skip-gram model was about three days.

Table 4: Comparison of publicly available word vectors on the Semantic-Syntactic Word Relationship test set, and word vectors from our models. Full vocabularies are used.

Model | Vector Dimensionality | Training words | Semantic [%] | Syntactic [%] | Total [%]
Collobert-Weston NNLM | 50 | 660M | 9.3 | 12.3 | 11.0
Turian NNLM | 50 | 37M | 1.4 | 2.6 | 2.1
Turian NNLM | 200 | 37M | 1.4 | 2.2 | 1.8
Mnih NNLM | 50 | 37M | 1.8 | 9.1 | 5.8
Mnih NNLM | 100 | 37M | 3.3 | 13.2 | 8.8
Mikolov RNNLM | 80 | 320M | 4.9 | 18.4 | 12.7
Mikolov RNNLM | 640 | 320M | 8.6 | 36.5 | 24.6
Huang NNLM | 50 | 990M | 13.3 | 11.6 | 12.3
Our NNLM | 20 | 6B | 12.9 | 26.4 | 20.3
Our NNLM | 50 | 6B | 27.9 | 55.8 | 43.2
Our NNLM | 100 | 6B | 34.2 | 64.5 | 50.8
CBOW | 300 | 783M | 15.5 | 53.1 | 36.1
Skip-gram | 300 | 783M | 50.0 | 55.9 | 53.3

For the experiments reported further, we used just one training epoch (again, we decrease the learning rate linearly so that it approaches zero at the end of training). Training a model on twice as much data using one epoch gives comparable or better results than iterating over the same data for three epochs, as is shown in Table 5, and provides an additional small speedup.

Table 5: Comparison of models trained for three epochs on the same data and models trained for one epoch. Accuracy is reported on the full Semantic-Syntactic data set.

Model | Vector Dimensionality | Training words | Semantic [%] | Syntactic [%] | Total [%] | Training time [days]
3 epoch CBOW | 300 | 783M | 15.5 | 53.1 | 36.1 | 1
3 epoch Skip-gram | 300 | 783M | 50.0 | 55.9 | 53.3 | 3
1 epoch CBOW | 300 | 783M | 13.8 | 49.9 | 33.6 | 0.3
1 epoch CBOW | 300 | 1.6B | 16.1 | 52.6 | 36.1 | 0.6
1 epoch CBOW | 600 | 783M | 15.4 | 53.3 | 36.2 | 0.7
1 epoch Skip-gram | 300 | 783M | 45.6 | 52.2 | 49.2 | 1
1 epoch Skip-gram | 300 | 1.6B | 52.2 | 55.1 | 53.8 | 2
1 epoch Skip-gram | 600 | 783M | 56.7 | 54.5 | 55.5 | 2.5

4.4 Large Scale Parallel Training of Models

As mentioned earlier, we have implemented various models in a distributed framework called DistBelief. Below we report the results of several models trained on the Google News 6B data set, with mini-batch asynchronous gradient descent and the adaptive learning rate procedure called Adagrad [7]. We used 50 to 100 model replicas during the training. The number of CPU cores is an estimate, since the data center machines are shared with other production tasks and the usage can fluctuate quite a bit. Note that due to the overhead of the distributed framework, the CPU usage of the CBOW model and the Skip-gram model are much closer to each other than in their single-machine implementations. The results are reported in Table 6.

Table 6: Comparison of models trained using the DistBelief distributed framework. Note that training of NNLM with 1000-dimensional vectors would take too long to complete.

Model | Vector Dimensionality | Training words | Semantic [%] | Syntactic [%] | Total [%] | Training time [days × CPU cores]
NNLM | 100 | 6B | 34.2 | 64.5 | 50.8 | 14 × 180
CBOW | 1000 | 6B | 57.3 | 68.9 | 63.7 | 2 × 140
Skip-gram | 1000 | 6B | 66.1 | 65.1 | 65.6 | 2.5 × 125

4.5 Microsoft Research Sentence Completion Challenge

The Microsoft Sentence Completion Challenge has been recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select the word that is the most coherent with the rest of the sentence, given a list of five reasonable choices.
The performance of several techniques has already been reported on this set, including N-gram models, the LSA-based model [32], the log-bilinear model [24] and a combination of recurrent neural networks that currently holds the state of the art performance of 55.4% accuracy on this benchmark [19].

We have explored the performance of the Skip-gram architecture on this task. First, we train the 640-dimensional model on the 50M words provided in [32]. Then, we compute the score of each sentence in the test set by using the unknown word at the input, and predict all surrounding words in the sentence. The final sentence score is then the sum of these individual predictions. Using the sentence scores, we choose the most likely sentence.

A short summary of some previous results together with the new results is presented in Table 7. While the Skip-gram model itself does not perform on this task better than LSA similarity, the scores from this model are complementary to scores obtained with RNNLMs, and a weighted combination leads to a new state of the art result of 58.9% accuracy (59.2% on the development part of the set and 58.7% on the test part of the set).

Table 7: Comparison and combination of models on the Microsoft Sentence Completion Challenge.

Architecture | Accuracy [%]
4-gram [32] | 39
Average LSA similarity [32] | 49
Log-bilinear model [24] | 54.8
RNNLMs [19] | 55.4
Skip-gram | 48.0
Skip-gram + RNNLMs | 58.9

5 Examples of the Learned Relationships

Table 8 shows words that follow various relationships. We follow the approach described above: the relationship is defined by subtracting two word vectors, and the result is added to another word. Thus for example, Paris - France + Italy = Rome. As can be seen, accuracy is quite good, although there is clearly a lot of room for further improvements (note that using our accuracy metric that assumes exact match, the results in Table 8 would score only about 60%). We believe that word vectors trained on even larger data sets with larger dimensionality will perform significantly better, and will enable the development of new innovative applications. Another way to improve accuracy is to provide more than one example of the relationship. By using ten examples instead of one to form the relationship vector (we average the individual vectors together), we have observed an improvement of accuracy of our best models by about 10% absolutely on the Semantic-Syntactic test.

Table 8: Examples of the word pair relationships, using the best word vectors from Table 4 (Skip-gram model trained on 783M words with 300 dimensionality).

Relationship | Example 1 | Example 2 | Example 3
France - Paris | Italy: Rome | Japan: Tokyo | Florida: Tallahassee
big - bigger | small: larger | cold: colder | quick: quicker
Miami - Florida | Baltimore: Maryland | Dallas: Texas | Kona: Hawaii
Einstein - scientist | Messi: midfielder | Mozart: violinist | Picasso: painter
Sarkozy - France | Berlusconi: Italy | Merkel: Germany | Koizumi: Japan
copper - Cu | zinc: Zn | gold: Au | uranium: plutonium
Berlusconi - Silvio | Sarkozy: Nicolas | Putin: Medvedev | Obama: Barack
Microsoft - Windows | Google: Android | IBM: Linux | Apple: iPhone
Microsoft - Ballmer | Google: Yahoo | IBM: McNealy | Apple: Jobs
Japan - sushi | Germany: bratwurst | France: tapas | USA: pizza

It is also possible to apply the vector operations to solve different tasks. For example, we have observed good accuracy for selecting out-of-the-list words, by computing the average vector for a list of words, and finding the most distant word vector. This is a popular type of problem in certain human intelligence tests. Clearly, there are still a lot of discoveries to be made using these techniques.
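A sketch of the out-of-the-list selection, assuming the same hypothetical word_vectors dictionary as in the earlier sketches:

```python
import numpy as np

def odd_one_out(word_vectors, words):
    """Return the word whose vector is least similar (cosine) to the average of the list,
    e.g. odd_one_out(word_vectors, ["breakfast", "lunch", "dinner", "chair"]) -> "chair"."""
    vectors = np.array([word_vectors[w] / np.linalg.norm(word_vectors[w]) for w in words])
    mean = vectors.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    sims = vectors @ mean                 # cosine similarity of each word to the list average
    return words[int(np.argmin(sims))]

# Averaging works the same way for building a relationship vector from several example
# pairs, as described above:
# relation = np.mean([word_vectors[b] - word_vectors[a] for a, b in example_pairs], axis=0)
```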
6 Conclusion

In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks. We observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set. Using the DistBelief distributed framework, it should be possible to train the CBOW and Skip-gram models even on corpora with one trillion words, for basically unlimited size of the vocabulary. That is several orders of magnitude larger than the best previously published results for similar models.

An interesting task where the word vectors have recently been shown to significantly outperform the previous state of the art is the SemEval-2012 Task 2 [11]. The publicly available RNN vectors were used together with other techniques to achieve over 50% increase in Spearman's rank correlation over the previous best result [31]. The neural network based word vectors were previously applied to many other NLP tasks, for example sentiment analysis [12] and paraphrase detection [28]. It can be expected that these applications can benefit from the model architectures described in this paper.

Our ongoing work shows that the word vectors can be successfully applied to automatic extension of facts in Knowledge Bases, and also for verification of the correctness of existing facts. Results from machine translation experiments also look very promising. In the future, it would also be interesting to compare our techniques to Latent Relational Analysis [30] and others. We believe that our comprehensive test set will help the research community to improve the existing techniques for estimating the word vectors. We also expect that high quality word vectors will become an important building block for future NLP applications.
