Ant Financial AI Department researcher's ICML contributed papers 01.pdf

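As machine learning grows in popularity and the "Chinese force" within it steadily strengthens, more and more Chinese organizations rank near the top at the major conferences and look poised to compete for first place. At this ICML, for example, Tsinghua University had 12 papers accepted. The number of authors of Chinese descent is also striking: Le Song, a tenured associate professor at the Georgia Institute of Technology and associate director of its Center for Machine Learning, put his name to 8 papers. Professor Song's other role is researcher in the Artificial Intelligence Department of Ant Financial. Ant Financial has become one of the representatives of the "Chinese force" at ICML, contributing 8 papers to the conference. Six of them are high-quality oral papers that became the focus of workshops on the agenda and drew lively discussion from attending experts. Almost every one of these papers lists world-class academic experts among its authors. For example, artificial…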
Learning to Explain: An Information-Theoretic Perspective on Model Interpretation

We have thus defined a new random vector $X_S \in \mathbb{R}^k$; see Figure 1 for a probabilistic graphical model representing its construction. We formulate instancewise feature selection as seeking an explainer $\mathcal{E}$ that optimizes the criterion

$$\max_{\mathcal{E}} \; I(X_S; Y) \quad \text{subject to} \quad S \sim \mathcal{E}(X). \tag{1}$$

In words, we aim to maximize the mutual information between the response variable from the model and the selected features, as a function of the choice of selection rule.

It turns out that a global optimum of Problem (1) has a natural information-theoretic interpretation: it corresponds to the minimization of the expected length of the encoded message for the model $P_m(Y \mid x)$ using $P_m(Y \mid x_S)$, where the latter corresponds to the conditional distribution of $Y$ upon observing the selected sub-vector. More concretely, we have the following:

Theorem 1. Letting $\mathbb{E}_m[\,\cdot \mid x]$ denote the expectation over $P_m(\cdot \mid x)$, define

$$\mathcal{E}^*(x) := \arg\min_{S} \; \mathbb{E}_m\!\left[\log \frac{1}{P_m(Y \mid x_S)} \,\Big|\, x\right]. \tag{2}$$

Then $\mathcal{E}^*$ is a global optimum of Problem (1). Conversely, any global optimum of Problem (1) degenerates to $\mathcal{E}^*$ almost surely over the marginal distribution $P_X$.

The proof of Theorem 1 is left to the Appendix. In practice, the above global optimum is obtained only if the explanation family $\mathcal{E}$ is sufficiently large. In the case when $P_m(Y \mid x_S)$ is unknown or computationally expensive to estimate accurately, we can choose to restrict $\mathcal{E}$ to suitably controlled families so as to prevent overfitting.

3. Proposed method

A direct solution to Problem (1) is not possible, so we need to approach it by a variational approximation. In particular, we derive a lower bound on the mutual information, and we approximate the model conditional distribution $P_m$ by a suitably rich family of functions.

3.1. Obtaining a tractable variational formulation

We now describe the steps taken to obtain a tractable variational formulation.

A variational lower bound: Mutual information between $X_S$ and $Y$ can be expressed in terms of the conditional distribution of $Y$ given $X_S$:

$$I(X_S; Y) = \mathbb{E}\!\left[\log \frac{P_m(X_S, Y)}{P(X_S)\,P_m(Y)}\right] = \mathbb{E}\!\left[\log \frac{P_m(Y \mid X_S)}{P_m(Y)}\right] = \mathbb{E}\big[\log P_m(Y \mid X_S)\big] + \text{Const.} = \mathbb{E}_X \mathbb{E}_{S \mid X} \mathbb{E}_{Y \mid X_S}\big[\log P_m(Y \mid X_S)\big] + \text{Const.}$$

For a generic model, it is impossible to compute expectations under the conditional distribution $P_m(\cdot \mid x_S)$. Hence we introduce a variational family for approximation:

$$\mathcal{Q} := \big\{ Q \mid Q = \{x_S \to Q_S(Y \mid x_S),\; S \in \wp_k\} \big\}. \tag{3}$$

Note each member $Q$ of the family $\mathcal{Q}$ is a collection of conditional distributions $Q_S(Y \mid x_S)$, one for each choice of $k$-sized feature subset $S$. For any $Q$, an application of Jensen's inequality yields the lower bound

$$\mathbb{E}_{Y \mid X_S}\big[\log P_m(Y \mid X_S)\big] \;\ge\; \int P_m(Y \mid X_S) \log Q_S(Y \mid X_S) \;=\; \mathbb{E}_{Y \mid X_S}\big[\log Q_S(Y \mid X_S)\big],$$

where equality holds if and only if $P_m(Y \mid X_S)$ and $Q_S(Y \mid X_S)$ are equal in distribution. We have thus obtained a variational lower bound of the mutual information $I(X_S; Y)$. Problem (1) can thus be relaxed as maximizing the variational lower bound, over both the explanation $\mathcal{E}$ and the conditional distribution $Q$:

$$\max_{\mathcal{E}, Q} \; \mathbb{E}\big[\log Q_S(Y \mid X_S)\big] \quad \text{such that} \quad S \sim \mathcal{E}(X). \tag{4}$$

For generic choices of $\mathcal{Q}$ and $\mathcal{E}$, it is still difficult to solve the variational approximation (4). In order to obtain a tractable method, we need to restrict both $\mathcal{Q}$ and $\mathcal{E}$ to suitable families over which it is efficient to perform optimization.

A single neural network for parametrizing $Q$: Recall that $Q = \{Q_S(\cdot \mid x_S),\; S \in \wp_k\}$ is a collection of conditional distributions with cardinality $|Q| = \binom{d}{k}$. We assume $X$ is a continuous random vector, and $P_m(Y \mid x)$ is continuous with respect to $x$. Then we introduce a single neural network function $g_\alpha : \mathbb{R}^d \to \Delta_{c-1}$ for parametrizing $Q$, where the codomain is the $(c-1)$-simplex $\Delta_{c-1} = \{y \in [0,1]^c : 0 \le y_i \le 1,\; \sum_{i=1}^{c} y_i = 1\}$ for the class distribution, and $\alpha$ denotes the learnable parameters. We define $Q_S(Y \mid x_S) := g_\alpha(\tilde{x}_S)$, where $\tilde{x}_S \in \mathbb{R}^d$ is transformed from $x$ with entries not in $S$ replaced by zeros:

$$(\tilde{x}_S)_i = \begin{cases} x_i, & i \in S, \\ 0, & i \notin S. \end{cases}$$

When $X$ contains discrete features, we embed each discrete feature with a vector, and the vector representing a specific feature is set to zero simultaneously when the corresponding feature is not in $S$.
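As a small illustration of the parametrization $Q_S(Y \mid x_S) := g_\alpha(\tilde{x}_S)$, the NumPy sketch below zeroes out the entries of an input outside a chosen subset and computes a Monte Carlo estimate of the variational objective $\mathbb{E}[\log Q_S(Y \mid X_S)]$. The names `mask_features`, `toy_g_alpha` and `lower_bound_estimate`, and the linear-plus-softmax stand-in for $g_\alpha$, are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def mask_features(x, subset):
    """Return x~_S: a copy of x with every entry outside `subset` set to zero."""
    x_tilde = np.zeros_like(x)
    idx = list(subset)
    x_tilde[idx] = x[idx]
    return x_tilde

def toy_g_alpha(x_tilde, weights):
    """Hypothetical stand-in for g_alpha: a linear map followed by a softmax."""
    logits = weights @ x_tilde
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def lower_bound_estimate(xs, ys, subsets, weights):
    """Monte Carlo estimate of E[log Q_S(Y | X_S)] over paired samples."""
    log_q = [
        np.log(toy_g_alpha(mask_features(x, s), weights)[y])
        for x, y, s in zip(xs, ys, subsets)
    ]
    return float(np.mean(log_q))

x = np.array([1.0, -2.0, 3.0, 0.5])
x_tilde = mask_features(x, {0, 2})          # keep features 0 and 2 only
```

In a real model the subsets would be drawn by the explainer and `toy_g_alpha` would be a trained network; the sketch only shows how masking and the log-likelihood term fit together.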
3.2. Continuous relaxation of subset sampling

Direct estimation of the objective function in equation (4) requires summing over $\binom{d}{k}$ combinations of feature subsets after the variational approximation. Several tricks exist for tackling this issue, like REINFORCE-type algorithms (Williams, 1992), or a weighted sum of features parametrized by deterministic functions of $X$. (A similar concept to the second trick is the "soft attention" structure in vision (Ba et al., 2014) and NLP (Bahdanau et al., 2014), where the weight of each feature is parametrized by a function of the respective feature itself.) We employ an alternative approach generalized from Concrete Relaxation (Gumbel-softmax trick) (Jang et al., 2017; Maddison et al., 2014; 2016), which empirically has a lower variance than REINFORCE and encourages discreteness (Raffel et al., 2017).

The Gumbel-softmax trick uses the concrete distribution as a continuous differentiable approximation to a categorical distribution. In particular, suppose we want to approximate a categorical random variable represented as a one-hot vector in $\mathbb{R}^d$ with category probability $p_1, p_2, \dots, p_d$. The random perturbation for each category is independently generated from a Gumbel(0, 1) distribution:

$$G_i = -\log(-\log u_i), \quad u_i \sim \text{Uniform}(0, 1).$$

We add the random perturbation to the log probability of each category and take a temperature-dependent softmax over the $d$-dimensional vector:

$$C_i = \frac{\exp\{(\log p_i + G_i)/\tau\}}{\sum_{j=1}^{d} \exp\{(\log p_j + G_j)/\tau\}}.$$

The resulting random vector $C = (C_1, \dots, C_d)$ is called a Concrete random vector, which we denote by

$$C \sim \text{Concrete}(\log p_1, \dots, \log p_d).$$

We apply the Gumbel-softmax trick to approximate weighted subset sampling. We would like to sample a subset $S$ of $k$ distinct features out of the $d$ dimensions. The sampling scheme for $S$ can be equivalently viewed as sampling a $k$-hot random vector $Z$ from $D_k^d := \{z \in \{0,1\}^d \mid \sum_i z_i = k\}$, with each entry of $z$ being one if it is in the selected subset $S$ and being zero otherwise. An importance score which depends on the input vector is assigned to each feature. Concretely, we define $w_\theta : \mathbb{R}^d \to \mathbb{R}^d$ that maps the input to a $d$-dimensional vector, with the $i$th entry of $w_\theta(X)$ representing the importance score of the $i$th feature.

We start with approximating sampling $k$ distinct features out of $d$ features by the sampling scheme below: Sample a single feature out of $d$ features independently for $k$ times. Discard the overlapping features and keep the rest. Such a scheme samples at most $k$ features, and is easier to approximate by a continuous relaxation. We further approximate the above scheme by independently sampling $k$ independent Concrete random vectors, and then we define a $d$-dimensional random vector $V$ that is the elementwise maximum of $C^1, C^2, \dots, C^k$:

$$C^j \sim \text{Concrete}(w_\theta(X)) \;\; \text{i.i.d. for } j = 1, 2, \dots, k, \qquad V = (V_1, V_2, \dots, V_d), \quad V_i = \max_j C_i^j.$$

The random vector $V$ is then used to approximate the $k$-hot random vector $Z$ during training.

We write $V = V(\theta, \zeta)$ as $V$ is a function of $\theta$ and a collection of auxiliary random variables $\zeta$ sampled independently from the Gumbel distribution. Then we use the elementwise product $V(\theta, \zeta) \odot X$ between $V$ and $X$ as an approximation of $\tilde{X}_S$.

3.3. The final objective and its optimization

After having applied the continuous approximation of feature subset sampling, we have reduced Problem (4) to the following:

$$\max_{\theta, \alpha} \; \mathbb{E}_{X, Y, \zeta}\big[\log g_\alpha(V(\theta, \zeta) \odot X, Y)\big], \tag{5}$$

where $g_\alpha$ denotes the neural network used to approximate the model conditional distribution, and the quantity $\theta$ is used to parametrize the explainer. In the case of classification with $c$ classes, we can write

$$\mathbb{E}_{X, \zeta}\left[\sum_{y=1}^{c} P_m(y \mid X) \log g_\alpha(V(\theta, \zeta) \odot X, y)\right]. \tag{6}$$

Note that the expectation operator $\mathbb{E}_{X, \zeta}$ does not depend on the parameters $(\alpha, \theta)$, so that during the training stage, we can apply stochastic gradient methods to jointly optimize the pair $(\alpha, \theta)$. In each update, we sample a mini-batch of unlabeled data with their class distributions from the model to be explained, and the auxiliary random variables $\zeta$, and we then compute a Monte Carlo estimate of the gradient of the objective function (6).

3.4. The explaining stage

During the explaining stage, the learned explainer maps each sample $X$ to a weight vector $w_\theta(X)$ of dimension $d$, each entry representing the importance of the corresponding feature for the specific sample $X$. In order to provide a deterministic explanation for a given sample, we rank features according to the weight vector, and the $k$ features with the largest weights are picked as the explaining features.

For each sample, only a single forward pass through the neural network parametrizing the explainer is required to yield an explanation. Thus our algorithm is much more efficient in the explaining stage compared to other model-agnostic explainers like LIME or Kernel SHAP, which require thousands of evaluations of the original model per sample.

4. Experiments

We carry out experiments on both synthetic and real data sets. For all experiments, we use RMSprop (Maddison et al., 2016) with the default hyperparameters for optimization. We also fix the step size to be 0.001 across experiments. The temperature for Gumbel-softmax approximation is fixed to be 0.1.
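The relaxed subset sampling of Section 3.2 and the deterministic top-k rule of Section 3.4 can be sketched in a few lines of NumPy, here with the temperature of 0.1 quoted above. This is a minimal sketch under stated assumptions: the function names and the example score vector are hypothetical, the scores stand in for the output $w_\theta(x)$ of a trained explainer, and no gradients are computed (the paper's implementation uses TensorFlow).

```python
import numpy as np

def sample_concrete(logits, tau, rng):
    """Draw one Concrete (Gumbel-softmax) random vector for the given logits."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))            # G_i = -log(-log u_i)
    z = (logits + gumbel) / tau
    z = z - z.max()                         # stabilise the softmax
    e = np.exp(z)
    return e / e.sum()

def relaxed_subset(scores, k, tau, rng):
    """Elementwise maximum of k independent Concrete vectors (the vector V)."""
    samples = np.stack([sample_concrete(scores, tau, rng) for _ in range(k)])
    return samples.max(axis=0)

def explain(scores, k):
    """Explaining stage: indices of the k largest importance scores."""
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.1, -1.0, 1.5, 0.0])   # hypothetical w_theta(x)
v = relaxed_subset(scores, k=2, tau=0.1, rng=rng)
top = explain(scores, k=2)
```

During training, `v` would multiply the input elementwise before it is passed to the variational network; at explanation time only `explain` is needed, which is why a single forward pass suffices.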
Code for reproducing the key results is available online at https://github.com/Jianbo-Lab/L2X.

Figure 2. The clock time (in log scale) of explaining 10,000 samples for each method. The training time of L2X is shown in translucent bars.

4.1. Synthetic Data

We begin with experiments on four synthetic data sets:

2-dimensional XOR as binary classification. The input vector $X$ is generated from a 10-dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y = 1 \mid X) \propto \exp\{X_1 X_2\}$.

Orange skin. The input vector $X$ is generated from a 10-dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y = 1 \mid X) \propto \exp\{\sum_{i=1}^{4} X_i^2 - 4\}$.

Nonlinear additive model. Generate $X$ from a 10-dimensional standard Gaussian. The response variable $Y$ is generated from $P(Y = 1 \mid X) \propto \exp\{-100 \sin(2X_1) + 2|X_2| + X_3 + \exp\{-X_4\}\}$.

Switch feature. Generate $X_1$ from a mixture of two Gaussians centered at $\pm 3$ respectively with equal probability. If $X_1$ is generated from the Gaussian centered at 3, the 2-5th dimensions are used to generate $Y$ like the orange skin model. Otherwise, the 6-9th dimensions are used to generate $Y$ from the nonlinear additive model.

The first three data sets are modified from commonly used data sets in the feature selection literature (Chen et al., 2017). The fourth data set is designed specifically for instancewise feature selection. Every sample in the first data set has the first two dimensions as true features, where each dimension itself is independent of the response variable $Y$ but the combination of them has a joint effect on $Y$. In the second data set, the samples with positive labels are centered around a sphere in a four-dimensional space. The sufficient statistic is formed by an additive model of the first four features. The response variable in the third data set is generated from a nonlinear additive model using the first four features. The last data set switches important features (roughly) based on the sign of the first feature. The 1-5 features are true for samples with $X_1$ generated from the Gaussian centered at 3, and the 1, 6-9 features are true otherwise.

We compare our method L2X (for "Learning to Explain") with several strong existing algorithms for instancewise feature selection, including Saliency (Simonyan et al., 2013), DeepLIFT (Shrikumar et al., 2017), SHAP (Lundberg & Lee, 2017), and LIME (Ribeiro et al., 2016). Saliency refers to the method that computes the gradient of the selected class with respect to the input feature and uses the absolute values as importance scores. SHAP refers to Kernel SHAP. The number of samples used for explaining each instance for LIME and SHAP is set as default for all experiments. We also compare with a method that ranks features by the input feature times the gradient of the selected class with respect to the input feature. Shrikumar et al. (2017) showed it is equivalent to LRP (Bach et al., 2015) when activations are piecewise linear, and used it as a strong baseline. We call it "Taylor" as it is the first-order Taylor approximation of the model.

Our experimental setup is as follows. For each data set, we train a neural network model with three hidden dense layers. We can safely assume the neural network has successfully captured the important features, and ignored noise features, based on its error rate. Then we use Taylor, Saliency, DeepLIFT, SHAP, LIME, and L2X for instancewise feature selection on the trained neural network models. For L2X, the explainer is a neural network composed of two hidden layers. The variational family is composed of three hidden layers. All layers are linear with dimension 200. The number of desired features $k$ is set to the number of true features.

The underlying true features are known for each sample, and hence the median ranks of selected features for each sample in a validation data set are reported as a performance metric, the box plots of which have been plotted in Figure 3. We observe that L2X outperforms all other methods on the nonlinear additive and feature switching data sets. On the XOR model, DeepLIFT, SHAP and L2X achieve the best performance. On the orange skin model, all algorithms have near optimal performance, with L2X and LIME achieving the most stable performance across samples.

We also report the clock time of each method in Figure 2, where all experiments were performed on a single NVidia Tesla K80 GPU, coded in TensorFlow. Across all the four data sets, SHAP and LIME are the least efficient as they require multiple evaluations of the model. DeepLIFT, Taylor and Saliency require a backward pass of the model. DeepLIFT is the slowest among the three, probably due to the fact that backpropagation of gradients for Taylor and Saliency are built-in operations of TensorFlow, while backpropagation in DeepLIFT is implemented with high-level operations in TensorFlow. Our method L2X is the most efficient in the explanation stage as it only requires a forward pass of the subset sampler. It is much more efficient compared to SHAP and LIME even after the training time has been taken into consideration, when a moderate number of samples (10,000) need to be explained. As the scale of the data to be explained increases, the training of L2X accounts for a smaller proportion of the overall time. Thus the relative efficiency of L2X to other algorithms increases with the size of a data set.

Figure 3. The box plots for the median ranks of the influential features by each sample, over 10,000 samples for each data set. The red line and the dotted blue line on each box are the median and the mean respectively. Lower median ranks are better. The dotted green lines indicate the optimal median rank.
| Truth | Model | Key words |
|---|---|---|
| positive | positive | Ray Liotta and Tom Hulce shine in this sterling example of brotherly love and commitment. Hulce plays Dominick, (Nicky) a mildly mentally handicapped young man who is putting his 12 minutes younger, twin brother, Liotta, who plays Eugene, through medical school. It is set in Baltimore and deals with the issues of sibling rivalry, the unbreakable bond of twins, child abuse and good always winning out over evil. It is captivating, and filled with laughter and tears. If you have not yet seen this film, please rent it, I promise, you'll be amazed at how such a wonderful film could go unnoticed. |
| negative | negative | Sorry to go against the flow but I thought this film was unrealistic, boring and way too long. I got tired of watching Gena Rowlands long arduous battle with herself and the crisis she was experiencing. Maybe the film has some cinematic value or represented an important step for the director but for pure entertainment value. I wish I would have skipped it. |
| negative | positive | This movie is chilling reminder of Bollywood being just a parasite of Hollywood. Bollywood also tends to feed on past blockbusters for furthering its industry. Vidhu Vinod Chopra made this movie with the reasoning that a cocktail mix of deewar and on the waterfront will bring home an oscar. It turned out to be rookie mistake. Even the idea of the title is inspired from the Elia Kazan classic. In the original, Brando is shown as raising doves as symbolism of peace. Bollywood must move out of Hollywoods shadow if it needs to be taken seriously. |
| positive | negative | When a small town is threatened by a child killer, a lady police officer goes after him by pretending to be his friend. As she becomes more and more emotionally involved with the murderer her psyche begins to take a beating causing her to lose focus on the job of catching the criminal. Not a film of high voltage excitement, but solid police work and a good depiction of the faulty mind of a psychotic loser. |

Table 2. True labels and labels predicted by the model are in the first two columns. Key words picked by L2X are highlighted in yellow.

4.2. IMDB

The Large Movie Review Dataset (IMDB) is a dataset of movie reviews for sentiment classification (Maas et al., 2011). It contains 50,000 labeled movie reviews, with a split of 25,000 for training and 25,000 for testing. The average document length is 231 words, and 10.7 sentences. We use L2X to study two popular classes of models for sentiment analysis on the IMDB data set.

4.2.1. Explaining a CNN model with key words

Convolutional neural networks (CNN) have shown excellent performance for sentiment analysis (Kim, 2014; Zhang & Wallace, 2015). We use a simple CNN model on Keras (Chollet et al., 2015) for the IMDB data set, which is composed of a word embedding of dimension 50, a 1-D convolutional layer of kernel size 3 with 250 filters, a max-pooling layer and a dense layer of dimension 250 as hidden layers. Both the convolutional and the dense layers are followed by ReLU as nonlinearity, and Dropout (Srivastava et al., 2014) as regularization. Each review is padded/cut to 400 words. The CNN model achieves 90% accuracy on the test data, close to the state-of-the-art performance (around 94%). We would like to find out which $k$ words have the most influence on the decision of the model in a specific review. The number of key words is fixed to be $k = 10$ for all the experiments.

The explainer of L2X is composed of a global component and a local component (see Figure 2 in Yang et al. (2018)). The input is initially fed into a common embedding layer followed by a convolutional layer with 100 filters. Then the local component processes the common output using two convolutional layers with 50 filters, and the global component processes the common output using a max-pooling layer followed by a 100-dimensional dense layer. Then we concatenate the global and local outputs corresponding to each feature, and process them through one convolutional layer with 50 filters, followed by a Dropout layer (Srivastava et al., 2014). Finally a convolutional network with kernel size 1 is used to yield the output. All previous convolutional layers are of kernel size 3, and ReLU is used as nonlinearity. The variational family is composed of a word embedding layer of the same size, followed by an average pooling and a 250-dimensional dense layer. Each entry of the output vector $V$ from the explainer is multiplied with the embedding of the respective word in the variational family. We use both automatic metrics and human annotators to validate the effectiveness of L2X.

| Truth | Predicted | Key sentence |
|---|---|---|
| positive | positive | There are few really hilarious films about science fiction but this one will knock your sox off. The lead Martians Jack Nicholson take-off is side-splitting. The plot has a very clever twist that has be seen to be enjoyed. This is a movie with heart and excellent acting by all. Make some popcorn and have a great |
| negative | negative | You get 5 writers together, have each write a different story with a different genre, and then you try to make one movie out of it. Its action its adventure, its sci-fi. its we its a mess. Sorry, but this movie absolutely stinks. 4.5 is giving it an awefully high rating. That said, its movies like this that make me think i could write movies, and i can barely write |
| negative | positive | This movie is not the same as the 1954 version with Judy Garland and James Mason, and that is a shame because the 1954 version is, in my opinion, much better. I am not denying Barbra Streisand's talent at all. She is a good actress and brilliant singer. I am not acquainted with Kris Kristofferson's other work and therefore I can't pass judgment on it. However, this movie leaves much to be desired. It is paced slowly, it has gratuitous nudity and foul language, and can be very difficult to sit through. However. I am not a big fan of rock music, so its only natural that I would like the Judy Garland version better. See the 1976 film with Barbra and Kris, and judge for yourself. |
| positive | negative | The first time you see the second renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing? |

Table 3. True labels and labels from the model are shown in the first two columns. Key sentences picked by L2X are highlighted in yellow.

Post-hoc accuracy. We introduce post-hoc accuracy for quantitatively validating the effectiveness of our method. Each model explainer outputs a subset of features $X_S$ for each specific sample $X$. We use $P_m(y \mid \tilde{X}_S)$ to approximate $P_m(y \mid X_S)$. That is, we feed in the sample $X$ to the model with unselected words masked by zero paddings. Then we compute the accuracy of using $P_m(y \mid \tilde{X}_S)$ to predict samples in the test data set labeled by $P_m(y \mid X)$, which we call post-hoc accuracy as it is computed after instancewise feature selection.

Human accuracy. When designing human experiments, we assume that the key words convey an attitude toward a movie, and can thus be used by a human to infer the review sentiment. This assumption has been partially validated given the aligned outcomes provided by post-hoc accuracy and by human judges, because the alignment implies the consistency between the sentiment judgement based on selected words from the original model and that from humans. Based on this assumption, we ask humans on Amazon Mechanical Turk (AMT) to infer the sentiment of a review given the ten key words selected by each explainer. The words adjacent to each other, like "not good at all," keep their adjacency on the AMT interface if they are selected simultaneously. The reviews from different explainers have been mixed randomly, and the final sentiment of each review is averaged over the results of multiple human annotators. We measure whether the labels from humans based on selected words align with the labels provided by the model, in terms of the average accuracy over 500 reviews in the test data set. Some reviews are labeled as "neutral" based on selected words, which is because the selected key words do not contain sentiment, or the selected key words contain comparable numbers of positive and negative words. Thus these reviews are neither put in the positive nor in the negative class when we compute accuracy. We call this metric human accuracy.

Figure 4. The above figure shows ten randomly selected figures of 3 and 8 in the validation set. The first line includes the original digits while the second line does not. The selected patches are colored with red if the pixel is activated (white) and blue otherwise.

The result is reported in Table 4. We observe that the model prediction based on only ten words selected by L2X aligns with the original prediction for over 90% of the data. The human judgement given ten words also aligns with the model prediction for 84.4% of the data. The human accuracy is even higher than that based on the original review, which is 83.3% (Yang et al., 2018). This indicates that the words selected by L2X can serve as key words for humans to understand the model behavior. Table 2 shows the results of our model on four examples.
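The post-hoc accuracy metric defined above can be written down compactly. The sketch below is an illustration under stated assumptions: `post_hoc_accuracy`, `toy_model` and the boolean `selected` mask are hypothetical names, and the toy two-class scorer merely stands in for the trained CNN.

```python
import numpy as np

def post_hoc_accuracy(model_predict, x, selected):
    """Fraction of samples whose predicted class is unchanged after masking.

    model_predict: callable mapping an (n, d) array to (n, c) class scores.
    selected: (n, d) boolean array, True for the features kept by the explainer.
    """
    full_labels = model_predict(x).argmax(axis=1)
    masked_x = np.where(selected, x, 0.0)   # unselected features are zeroed
    masked_labels = model_predict(masked_x).argmax(axis=1)
    return float((full_labels == masked_labels).mean())

def toy_model(x):
    """Hypothetical stand-in model: class 0 if the feature sum is positive."""
    s = x.sum(axis=1)
    return np.stack([s, -s], axis=1)

x = np.array([[3.0, -1.0, 0.5],
              [-2.0, 1.0, 0.2]])
selected = np.array([[True, False, False],
                     [False, True, False]])
acc = post_hoc_accuracy(toy_model, x, selected)
```

In this toy case the first sample keeps its label after masking and the second does not, so `acc` is 0.5; note that the reference labels come from the model itself, not from the ground truth.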
4.2.2. Explaining hierarchical LSTM

Another competitive class of models in sentiment analysis uses hierarchical LSTM (Hochreiter & Schmidhuber, 1997; Li et al., 2015). We build a simple hierarchical LSTM by putting one layer of LSTM on top of word embeddings, which yields a representation vector for each sentence, and then using another LSTM to encode all sentence vectors. The output representation vector by the second LSTM is passed to the class distribution via a linear layer. Both the two LSTMs and the word embedding are of dimension 100. The word embedding is pretrained on a large corpus (Mikolov et al., 2013). Each review is padded to contain 15 sentences. The hierarchical LSTM model gets around 90% accuracy on the test data. We take each sentence as a single feature group, and study which sentence is the most important in each review for the model.

The explainer of L2X is composed of a 100-dimensional word embedding followed by a convolutional layer and a max pooling layer to encode each sentence. The encoded sentence vectors are fed through three convolutional layers and a dense layer to get sampling weights for each sentence. The variational family also encodes each sentence with a convolutional layer and a max pooling layer. The encoding vectors are weighted by the output of the subset sampler, and passed through an average pooling layer and a dense layer to the class probability. All convolutional layers are of filter size 150 and kernel size 3. In this setting, L2X can be interpreted as a hard attention model (Xu et al., 2015) that employs the Gumbel-softmax trick.

Comparison is carried out with the same metrics. For human accuracy, one selected sentence for each review is shown to human annotators. The other experimental setups are kept the same as above. We observe that post-hoc accuracy reaches 84.4% with one sentence selected by L2X, and human judgements using one sentence align with the original model prediction for 77.4% of data. Table 3 shows the explanations from our model on four examples.

| | IMDB-Word | IMDB-Sent | MNIST |
|---|---|---|---|
| Post-hoc accuracy | 0.908 | 0.849 | 0.958 |
| Human accuracy | 0.844 | 0.774 | |

Table 4. Post-hoc accuracy and human accuracy of L2X on three models: a word-based CNN model on IMDB, a hierarchical LSTM model on IMDB, and a CNN model on MNIST.

4.3. MNIST

The MNIST data set contains 28 x 28 images of handwritten digits (LeCun et al., 1998). We form a subset of the MNIST data set by choosing images of digits 3 and 8, with 11,982 images for training and 1,984 images for testing. Then we train a simple neural network for binary classification over the subset, which achieves accuracy 99.7% on the test data set. The neural network is composed of two convolutional layers of kernel size 5 and a dense linear layer at last. The two convolutional layers contain 8 and 16 filters respectively, and both are followed by a max pooling layer of pool size 2. We try to explain each sample image with $k = 4$ image patches on the neural network model, where each patch contains 4 x 4 pixels, obtained by dividing each 28 x 28 image into 7 x 7 patches. We use patches instead of raw pixels as features for better visualization.

We parametrize the explainer and the variational family with three-layer and two-layer convolutional networks respectively, with max pooling added after each hidden layer. The 7 x 7 vector sampled from the explainer is upsampled (with repetition) to size 28 x 28 and multiplied with the input raw pixels.

We use only the post-hoc accuracy for this experiment, with results shown in Table 4. The predictions based on 4 patches selected by L2X out of 49 align with those from original images for 95.8% of data. Randomly selected examples with explanations are shown in Figure 4. We observe that L2X captures most of the informative patches, in particular those containing patterns that can distinguish 3 and 8.
MNIST

The MNIST data set contains 28 × 28 images of handwritten digits (LeCun et al., 1998). We form a subset of the MNIST data set by choosing images of digits 3 and 8, with 11,982

L2X is the first method to realize real-time interpretation of a black-box model. We have shown the efficiency and the capacity of L2X for instancewise feature selection on both synthetic and real data sets.

Acknowledgements

L.S. was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS. We thank Nilesh Tripuraneni for comments about the Gumbel trick.

References

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803-1831, 2010.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.

Chen, J., Stern, M., Wainwright, M. J., and Jordan, M. I. Kernel feature selection via conditional covariance minimization. In Advances in Neural Information Processing Systems 30, pp. 6949-6958. 2017.

Chollet, F. et al. Keras. https://github.com/keras-team/keras, 2015.

Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.

Gao, S., Ver Steeg, G., and Galstyan, A. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, pp. 487-495, 2016.

Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157-1182, 2003.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. stat, 1050:1, 2017.

Kim, Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Kindermans, P.-J., Schütt, K., Müller, K.-R., and Dähne, S. Investigating the influence of noise and distractors on the interpretation of neural networks. arXiv preprint arXiv:1611.07270, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, J., Luong, M.-T., and Jurafsky, D. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.

Lipton, Z. C. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. pp. 4768-4777, 2017.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 142-150. Association for Computational Linguistics, 2011.

Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086-3094, 2014.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.

Raffel, C., Luong, T., Liu, P. J., Weiss, R. J., and Eck, D. Online and linear-time attention by enforcing monotonic alignments. arXiv preprint arXiv:1704.00784, 2017.

Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. ACM, 2016.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 3145-3153. PMLR, 06-11 Aug 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048-2057, 2015.

Yang, P., Chen, J., Hsieh, C.-J., Wang, J.-L., and Jordan, M. I. Greedy attack and Gumbel attack: Generating adversarial examples for discrete data. arXiv preprint arXiv:1805.12316, 2018.

Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.

A. Proof of Theorem 1

Forward direction: Any explanation is represented as a conditional distribution of the feature subset over the input vector. Given the definition of $S^*$, we have for any $X$ and any explanation $\mathcal{E}: S \mid X$,

$$\mathbb{E}_{S \mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X] \le \mathbb{E}_m[\log P_m(Y \mid X_{S^*(X)}) \mid X].$$

In the case when $S^*(X)$ is a set instead of a singleton, we identify $S^*(X)$ with any distribution that assigns arbitrary probability to each element in $S^*(X)$, with zero probability outside $S^*(X)$. With abuse of notation, $S^*$ indicates both the set function that maps every $X$ to a set $S^*(X)$ and any real-valued function that maps $X$ to an element in $S^*(X)$.

Taking expectation over the distribution of $X$, and subtracting $\mathbb{E}[\log P_m(Y)]$ from both sides, we have

$$I(X_S; Y) \le I(X_{S^*}; Y)$$

for any explanation $\mathcal{E}: S \mid X$.

Reverse direction: The reverse direction is proved by contradiction. Assume the optimal explanation $P(S \mid X)$ is such that there exists a set $M$ of nonzero probability, over which $P(S \mid X)$ does not degenerate to an element in $S^*(X)$. Concretely, we define $M$ as

$$M = \{x : P(S \notin S^*(x) \mid X = x) > 0\}.$$

For any $x \in M$, we have

$$\mathbb{E}_{S \mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X = x] < \mathbb{E}_m[\log P_m(Y \mid X_{S^*(x)}) \mid X = x], \tag{7}$$

where $S^*(x)$ is a deterministic function in the set of distributions that assign arbitrary probability to each element in $S^*(x)$, with zero probability outside $S^*(x)$. Outside $M$, we always have

$$\mathbb{E}_{S \mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X = x] \le \mathbb{E}_m[\log P_m(Y \mid X_{S^*(x)}) \mid X = x] \tag{8}$$

from the definition of $S^*$. As $M$ has nonzero probability under $P(X)$, combining Equation 7 and Equation 8 and taking expectation with respect to $P(X)$, we have

$$I(X_S; Y) < I(X_{S^*}; Y),$$

which is a contradiction.
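The forward direction of the proof of Theorem 1 is a pointwise argument: choosing, for every input, a subset that maximizes the conditional expected log-likelihood cannot be beaten by any other (possibly random) explainer. The following Python sketch checks this numerically on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: three binary features, a binary response, subsets of size k = 1, randomly drawn distributions, and the reading of P_m(Y | x_S) as the model output averaged over the unselected features under P(X).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: d binary features, binary response, subsets of size k.
d, k = 3, 1
xs = list(itertools.product([0, 1], repeat=d))        # all 2**d inputs
p_x = rng.dirichlet(np.ones(len(xs)))                  # P(X = x), strictly positive
p_y1 = rng.uniform(0.05, 0.95, size=len(xs))
p_y_given_x = np.stack([1 - p_y1, p_y1], axis=1)       # P_m(Y = y | x), shape (8, 2)
subsets = list(itertools.combinations(range(d), k))    # candidate subsets S


def p_y_given_xs(x, S):
    """Assumed P_m(Y | X_S = x_S): average of P_m(Y | x') over x' that agree with x on S."""
    mask = np.array([all(xp[j] == x[j] for j in S) for xp in xs])
    w = p_x * mask
    return (w[:, None] * p_y_given_x).sum(axis=0) / w.sum()


# score[i, s] = E_m[ log P_m(Y | X_S) | X = x_i ] for the s-th subset.
score = np.array([
    [p_y_given_x[i] @ np.log(p_y_given_xs(x, S)) for S in subsets]
    for i, x in enumerate(xs)
])


def objective(explainer):
    """Expected score of an explainer given as a matrix of P(S | x), one row per input."""
    return float(np.sum(p_x[:, None] * explainer * score))


# Deterministic explainer that picks a pointwise maximizer for every input.
star = np.zeros_like(score)
star[np.arange(len(xs)), score.argmax(axis=1)] = 1.0
best = objective(star)

# Compare against many randomly drawn stochastic explainers.
random_values = [
    objective(rng.dirichlet(np.ones(len(subsets)), size=len(xs)))
    for _ in range(1000)
]
gap = best - max(random_values)
print("pointwise-optimal objective:", best)
print("margin over best random explainer:", gap)
```

The objective here is the expected log-likelihood term only. It differs from the mutual information by a term that does not depend on the explainer, so the ranking of explainers is the same.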
