Relation Classification via Convolutional Deep Neural Network

所需积分/C币:16 2018-08-24 09:29:38 833KB PDF

The state-of-the-art methods used for relation classification are primarily based on statistical machine learning, and their performance strongly depends on the quality of the extracted features. The extracted features are often derived from the output of pre-existing natural language processing (NL
People have been moving back O WE into [downtown] Window oces 888883PF Representation Feature lexical level entence level Extraction features max over times nh(w,x口) Output ●●●●●● Sentence level Figure 1: Architecture of the neural network used Figure 2: The framework used for extracting sen for relation classification tence level features 2008), subsequence kernel (Mooney and bunescu, 2005) and dependency tree kernel (bunescu and Mooney, 2005), have been proposed to sol ve the relation classification problem. However, the methods mentioned above suffer from a lack of sufficient labeled data for training. Mintz et al. (2009) proposed distant supervision(DS)to address this problem. The DS method selects sentences that match the facts in a knowledge base as positive examples. The ds algorithm sometimes faces the problem of wrong labels, which results in noisy labeled data. To address the shortcoming of DS, Riedel et al. (2010)and Hoffmann et al.(2011)cast the relaxed Ds assumption as multi-instance learning. Furthermore, Taka matsu ct al.(2012) noted that the relaxed ds assumption would fail and proposed a novel generative model to model the heuristic labeling process in order to reduce the wrong labels The supervised method has been demonstrated to be effective for relation detection and yields rela tively high performance. However, the performance of this method strongly depends on the quality of the designed features. With the recent revival of interest in dnn. many researchers have concentrated on us ing Deep Learning to learn features. In NLP, such methods are primarily based on learning a distributed representation for each word, which is also called a word embeddings(Turian et al., 2010). Socher et al (2012) present a novel recursive neural network(RNN) for relation classification that learns vectors in the syntactic tree path that connects two nominals to determine their semantic relationship. Hashimoto et al.(2013)also use an rnn for relation classification; their method allows for the explicit weighting of important phrases for the target task. As mentioned in Section 1, it is difficult to design high quality features using the existing NLP tools. In this paper, we propose a convolutional dnn to extract lexical and sentence level features for relation classification; our method effectively alleviates the shortcomings of traditional features 3 Methodology 3.1 The Neural Network architecture Figure 1 describes the architecture of the neural network that we use for relation classification. The network takes an input sentence and discovers multiple levels of feature extraction where higher levels represent more abstract aspects of the inputs. It primarily includes the following three components: Word Representation, Feature Extraction and Output. The system does not need any complicated syntactic or semantic preprocessing, and the input of the system is a sentence with two marked nouns. Then, the word tokens are transformed into vectors by looking up word embeddings. In succession, the lexical and sentence level features are respectively extracted and then directly concatenated to form the final feature vector. Finally, to compute the confidence of each relation, the feature vector is fed into a softmax classifier. The output of the classifier is a vector, the dimension of which is equal to the number of predefined relation types. The value of each dimension is the confidence score of the corresponding relation 2337 Left and right tokens of noun 1 Left and right tokens of noun 2 L5 WordNet hypernyms of noun Table i: i l1 level features 3.2 Word Representation In the word representation component, each input word token is transformed into a vector by looking up word embeddings. Collobert et al. (2011)reported that word embeddings learned from significant amounts of unlabeled data are far more satisfactory than the randomly initialized embeddings. In relation classification, we should first concentrate on learning discriminative word embeddings, which carry more syntactic and semantic information, using significant amounts of unlabeled data. Unfortunately, it usuall takes a long time to train the word embeddings. However, there are many trained word embeddings that are freely available(Turian et aL., 2010). A comparison of the available word embeddings is beyond the scope of this paper. Our experiments directly utilize the trained embeddings provided by turian et a.(2010 3.3 Lexical level features Lexical level features serve as important cues for deciding relations. The traditional lexical level features primarily include the nouns themselves, the types of the pairs of nominals and word sequences between the entities, the quality of which strongly depends on the results of existing NLP tools. Alternatively, this paper uses generic word embeddings as the source of base features. We select the word embeddings of marked nouns and the context tokens. Morcover, the WordNet hypernyms* are adopted aS mvrnn ( Socher et al., 2012). All of these features are concatenated into our lexical level features vector l. Table I presents the selected word embeddings that are related to the marked nouns in the sentence 3. 4 Sentence Level features As mentioned in section 3.2, all of the tokens are represented as word vectors, which have been demon- strated to correlate well with human judgments of word similarity. Despite their success, single word vector models are severely limited because they do not capture long distance features and semantic com- positionality, the important quality of natural language that allows humans to understand the meanings of a longer expression. In this section, we propose a max-pooled convolutional neural network to offer sentence level representation and automatically extract sentence level features. Figure 2 shows the frame work for sentence level feature extraction. In the Window processing component, each token is further represented as word Features ( Wf)and position Features (Pf)(see section 3. 4. 1 and 3. 4.2). Then, the vector goes through a convolutional component. Finally, we obtain the sentence level features through a non-linear transformation 3. 4.1 Word Features Distributional hypothesis theory(Harris, 1954)indicates that words that occur in the same context tend to have similar meanings. To capture this characteristic, the wf combines a words vector representation and the vector representations of the words in its context. Assume that we have the following sequence or words PeopleJo havel beeng moving back4 intos [downtown] 6 The marked nouns are associated with a label y that defines the relation type that the marked pair contains Each word is also associated with an index into the word embeddings all of the word tokens of the sentence S are then represented as a list of vectors(x0, x1,..., x6), where xi corresponds to the word Collobert et al.(2011) proposed a pairwise ranking approach to train the word embeddings, and the total training time for an English corpus(wikipedia)was approximately four weeks 2338 embedding of the i-th word in the sentence. To use a context size of w we combine the size w windows of vectors into a richer fcature. For cxample, when we take w= 3, the WF of the third word"moving' in the sentence S is expressed as x2, X3, x4]. Similarly, considering the whole sentence, the WF can be represented as follows xs,X∩,X x∩,x1,X 3.4.2 Position features Relation classification is a very complex task. Traditionally, structure features(e.g, the shortest depen- dency path between nominals) are used to solve this problem(Bunescu and Mooney, 2005). Apparently it is not possible to capture such structure information only through WF. It is necessary to specify which input tokens are the target nouns in the sentence. For this purpose pf are proposed for relation classi- fication. In this paper, the PF is the combination of the relative distances of the current word to wl and 2. For example, the relative distances of moving” in sentence S to " people”and“ downtown”are3 and-3, respectively. In our method, the relative distances also are mapped to a vector of dimension de(a hyperparameter); this vector is randomly initialized. Then, we obtain the distance vectors d and dy with respect to the relative distances of the current word to wl and w2, and PF=d1, d2. Combining the WF and PF, the word is represented as [WF, PF], which is subsequently fed into the convolution component of the algorithm 3.4.3 Convolution We will see that the word representation approach can capture contextual information through combina tions of vectors in a window. However, it only produces local features around each word of the sentence In relation classification, an input sentence that is marked with target nouns only corresponds to a lation type rather than predicting label for each word. Thus, it might be necessary to utilize all of the local features and predict a relation globally. When using neural network, the convolution approach is a natural method to merge all of the features. Similar to Collobert et al. (2011), we first process the output of Window Processing using a linear transformation Z= WIX X E ROX is the output of the window Processing task, where 1o=wX n, n(a hyperparameter) is the dimension of feature vector, and t is the token number of the input sentence. W1ERTIXno, where n hyperparameter) is the size of hidden layer l, is the linear transformation matrix. We can see that the features share the same weights across all times, which greatly reduces the number of free parameters to learn. After the linear transformation is applied, the output Z E Rmlt is dependent on t. To determine the most useful feature in the cach dimension of the feature vectors we perform a max operation over time on Z Z(i,)0 (2) where Z(i, )denote the i-th row of matrix Z. Finally, we obtain the feature vector n im1, m2. ..,mn, the dimension of which is no longer related to the sentence length 3.4.4 Sentence Level Feature Vector To learn more complex features, we designed a non-linear layer and selected hyperbolic tanh as the activation function. One useful property of tanh is that its derivative can be expressed in terms of the function value itself tanh It has the advantage of making it easy to compute the gradient in the backpropagation training procedure Formally, the non-linear transformation can be written as g= tanh(W2m xs and xe are special word embeddings that correspond to the beginning and end of the sentence, respectively 2339 W2ER2X1 is the linear transformation matrix, where n2(a hyperparameter) is the size of hidden layer 2. Compared with m E RnIX, g E R72XI can be considered higher level features(sentence level features) 3.5 Output The automatically learned lexical and sentence level features mentioned above are concatenated into a le vector f=1, g. To compute the confidence of each relation, the feature vector f E Rn3 X(ny equals n2 plus the dimension of the lexical level features)is fed into a softmax classifier f (5) W3 E RnAXn3 is the transformation matrix and o E Rnaxl is the final output of the network, where n4 Is equa the number of possible relation types for the relation classification system. Each output can be then interpreted as the confidence score of the corresponding relation. This score can be interpreted as a conditional probability by applying a softmax operation(see Section 3.6) 3.6 Backpropagation Training The dnn based relation classification method proposed here could be stated as a quintuple 6 (X, N, W1, 2. W3). In this paper, each input sentence is considered independently. Given an in put example s, the network with parameter 0 outputs the vector o, where the i-th component oi contains the score for relation i. To obtain the conditional probability p(il. c, 0), we apply a softmax operation over all relation types: p(i. 0 (6) k=1 Given all our(suppose t)training examples(z(); y()), we can then write down the log likelihood of the parameters as follows ogp(y To compute the network parameter B, we maximize the log likelihood (8)using a simple optimization technique called stochastic gradient descent(SGD) N,W1, W2 and w3 are randomly initialized and X is initialized using the word embeddings Because the parameters are in different layers of the neural network, we implement the backpropagation algorithm: the differentiation chain rule is applied through the network until the word embedding layer is reached by iteratively selecting an example (a, y) and applying the following update rule e←b+4gp(91x,b) 00 (8) 4 Dataset and Evaluation metrics To evaluate the performance of our proposed method, we use the semEval-2010 Task 8 dataset(Hen- x et al., 2010). The dataset is freely available and contains 10, 717 annotated examples, including 8,000 training instances and 2, 717 test instances. There are 9 relationships(with two directions) and an undirected Other class. The following are examples of the included relationships: Cause-Effect, Component-Whole and Entity-Origin. In the official evaluation framework, directionality is taken into account. A pair is counted as correct if the order of the words in the relationship is correct. For example both of the following instances S1 and S2 have the relationship Component -Whole SI: The hafte, of the axec, is make. Component-Whole(el, e2 S2: This machine]e, has two units]ez..= Component-Whole(c2, c1) represents the word embeddings of WordNet hy pernyms 2340 Figure 3: Effect of hyperparameters However, these two instances cannot be classified into the same category because Component- Whole(el, e2) and Component-Whole(ey, el) are different relationships. Furthermore, the official rank- ing of the participating systems is based on the macro-averaged Fl-scores for the nine proper relations (excluding Other). To compare our results with those obtained in previous studies, we adopt the macro averaged Fl-score and also account for directionality into account in our following experimenl op macro 5 Experiments In this section we conduct three sets of experiments. The first is to test several variants via cross validation to gain some understanding of how the choice of hyperparameters impacts upon the perfor mance. In the second set of experiments, we make comparison of the performance among the convolu- tional dnn learned features and various traditional features. The goal of the third set of experiments is to evaluate the effectiveness of each extracted feature 5.1 Parameter Settings In this section, we experimentally study the effects of the three parameters in our proposed method the window size in the convolutional component w, the number of hidden layer l, and the number of hidden layer 2. Because there is no official development dataset, we tuned the hyperparameters by trying different architectures via 5-fold cross-validation In Figure 3, we respectively vary the number of hyper parameters w, n1 and n2 and compute the Fl We can see that it does not improve the performance when the window size is greater than 3. Moreover, because the size of our training dataset is limited, the network is prone to overfitting, especially when using large hidden layers. From Figure 3, we can see that the parameters have a limited impact on the results when increasing the numbers of both hidden layers I and 2. Because the distance dimension has little effect on the result (this is not illustrated in Figure 3), we heuristically choose de = 5. Finally, the word dimension and learning rate are the same as in Collobert et al.(2011). Table 2 reports all the hyperparameters used in the following experiments Hyperparameter Window size Word dim. Distance dim. Hidden layer/ Hidden layer 2 Learning rate value 3 50 de =5 m1=200 m22=100 入=0.01 Table 2: Hyperparameters used in our experiments 5.2 Results of Comparison Experiments To obtain the final performance of our automatically learned features, we select seven approaches as com petitors to be compared with our method in Table 3. The first five competitors are described in Hendrickx et al.(2010), all of which use traditional features and employ svm or MaxEnt as the classifier. These systems design a series of features and take advantage of a variety of resources(WordNet, ProBank and FrameNet, for example). RNN represents recursive neural networks for relation classification, as The corpus contains a Perl-based automatic evaluation tool 2341 SVM POS, stemming, syntactic patterns 60.1 POS, stemming, syntactic patterns, WordNet 74.8 MaxEnt POS, morphological, noun conpound, Thesauri, Google n-grans, WordNet 77.6 VM POS, prefixes, morphological, WordNet, dependency parse, Levin classed, ProBank,I 82 FrameNet, NomLex-Plus, Google n-gram, paraphrases, TextRunner RNN 74.8 POS NER. WordNet 77.6 MVRNN POS NER. Word 824 Proposed word pair, words around word pair, WordNet Table 3 Classifier. their feature sets and the f1-score for relation classification proposed by Socher et al.(2012). This method learns vectors in the syntactic tree path that connect two nominals to determine their semantic relationship. The MVRN model builds a single compositional semantics for the minimal constituent, including both nominals as RNN (socher et al., 2012). It is almost certainly too much to expect a single fixed transformation to be able to capture the meaning combination effects of all natural language operators. Thus, MVRNN assigns a matrix to every word and modifies the meanings of other words instead of only considering word embeddings in the recursive procedure Table 3 illustrates the macro-averaged Fi measure results for these competing methods along with the resources, features and classifier used by each method. Based on these results, we make the following observations. (1)Richer feature sets lead to better performance when using traditional features. This improvement can be explained by the need for semantic generalization from training to test data. The quality of traditional features relies on human ingenuity and prior nlp knowledge. It is almost impossible to manually choose the best feature sets (2) RNN and MVRNN contain feature learning procedures; thus, they depend on the syntactic tree used in the recursive procedures. Errors in syntactic parsing inhibit the ability of these methods to learn high quality features. RNN cannot achieve a higher performance than the best method that uses traditional features, even when pos, ner and wordnet are added to the training dataset compared with rNN, the MVRNN model can capture the meaning combination effectively and achieve a higher pe (3) Our method achieves the best performance among all of the compared methods. We also perform a t-test(p s0.05), which indicates that our method significantly outperforms all of the compared methods 5.3 The effect of learned features Feature Sets Lexical 34.7 53.1 594 659 73.3 Sentence WE 697 +PF Combination all 82.7 Table 4: Score obtained for various sets of features on for the test set. The bottom portion of the table shows the best combination of lexical and sentence level features In our method the network extract lexical and sentence level features. The lexical level features pri- marily contain five sets of features Ll to L5 ) We performed ablation tests on the five sets of features from the lexical part of Table 4 to determine which type of features contributed the most. The results are 2342 presented in Table 4, from which we can observe that our learned lexical level features are effective for relation classification. The F1-score is improved remarkably when new features are added. Similarly, we perform experiment on the sentence level features. The system achieves approximately 9.2% improve ments when adding pf. when all of the lexical and sentence level features are combined we achieve the best resul 6 Conclusion In this paper, we exploit a convolutional deep neural network (dnn to extract lexical and sentence level features for relation classification. In the network, position features(PF) are successfully proposed to specify the pairs of nominals to which we expect to assign relation labels. The system obtains a significant improvement when PF are added. The automatically learned features yield excellent results and can replace the elaborately designed features that are based on the outputs of existing nP tools Acknowledgments This work was sponsored by the National Basic Research Program of China(No. 2014CB340503)and the National Natural Science Foundation of China(No. 61272332, 61333018, 61202329, 61303180) This work was supported in part by Noahs Ark Lab of Huawei Tech Co Ltd. We thank the anonymous reviewers for their insightful comments References Nguyen Bach and Sameer Badaskar. 2007. A review of relation extraction. Literature review for Language and Statistics ll Razvan C Bunescu and Raymond J. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 724-731 Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. 2005. Unsupervised feature selection for relation extraction. In Proceedings of the international Joint Conference on Natural language Processing, pages 262- 267 Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011 Natural language processing(almost)from scratch. The Journal of Machine Learning research, 12: 2493-2537 Zellig Harris. 1954. Distributional structure. Word, 10(23): 146-162 Takaaki Hasegawa, Satoshi Sekine, and Ralph grishman. 2004. Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pages 415422 Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In Proceeding s of the 2073 Conference on Empirical Methods in Natural language Processing, pages 1372-1376 Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O. seaghdha, Sebastian Pado, marco Pennacchiotti, Lorenza romano, and Stan Szpakowicz. 2010. Semeval-2010 task 8: Multi-way classification Evaluation, SemEval'10, pages 33-38 minals. In Proceedings of the 5th International Workshop on Semantic of semantic relations between pairs of nor Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel s. Weld. 2011. Know ledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meet- ing of the Association for Computational linguistics: Human Lang uage Technologies- Volume 1 pages 541 Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the 42nd Annual Meeting on Association for Computational linguistics on Interactive poster and demonstration sessions 2343 Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP Volume 2, pages 1003-1011 Raymond J Mooney and Razvan C Bunescu. 2005. Subsequence kernels for relation extraction. In Advances in neural information processing systems, pages 171-178 Longhua Qian, Guodong Zhou, Fang Kong, Qiaoming Zhu, and Peide Qian. 2008. Exploiting constituent depen dencies for tree kernel-based semantic relation extraction. In Proceedings of the 22nd International Conference on Computational linguistics, pages 697-704 Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part ll, pages 148-163 Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural language Learning, pages 1201-121I Fabian M. Suchanek, Georgiana Ifrim, and Gerhard Weikum. 2006. Combining linguistic and statistical analysis to extract relations from web documents. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 712-717 Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 5Oth annual Meeting of the Association for Computational Linguistics Long Papers- Volume 1, pages 721-729 Joseph turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384-394 Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. The Journal of Machine learning research, 3: 1083-1 106 GuoDong Zhou, Su Jian, Zhang Jie, and Zhang Min. 200.5. Exploring various know ledge in relation extracti In Proceedings of the 43rd Annual meeting on Association for Computational Linguistics, pages 427-434 on 2344

  • 领英

  • GitHub


关注 私信 TA的资源