
As machine learning continues to heat up and the "Chinese contingent" within it grows stronger, more and more Chinese organizations are ranking near the top at the major conferences, with a real shot at the leading position. At this year's ICML, for example, Tsinghua University had 12 papers accepted, and the number of papers by ethnically Chinese authors is striking: Le Song, tenured associate professor at Georgia Tech and associate director of its Center for Machine Learning, is a named author on 8 papers. Professor Song's other role is that of researcher in Ant Financial's artificial intelligence department. Ant Financial thus became one of the representatives of this "Chinese contingent" at ICML, contributing 8 papers to the conference, six of them oral papers that took center stage in the sessions and drew lively discussion from attendees. Nearly every one of these papers lists world-class academic experts among its authors.
Learning to Explain: An Information-Theoretic Perspective on Model Interpretation

We have thus defined a new random vector X_S ∈ ℝ^d; see Figure 1 for a probabilistic graphical model representing its construction. We formulate instancewise feature selection as seeking an explainer E that optimizes the criterion

    max_E I(X_S; Y)  subject to  S ~ E(X).    (1)

In words, we aim to maximize the mutual information between the response variable from the model and the selected features, as a function of the choice of selection rule.

It turns out that a global optimum of Problem (1) has a natural information-theoretic interpretation: it corresponds to the minimization of the expected length of the encoded message for the model P_m(Y | x) using P_m(Y | x_S), where the latter is the conditional distribution of Y upon observing the selected sub-vector. Concretely, we have the following:

Theorem 1. Letting E_m[· | x] denote the expectation over P_m(Y | x), define

    E*(x) := argmin_S E_m[ log P_m(Y | x) / P_m(Y | x_S) | x ].

Then E* is a global optimum of Problem (1). Conversely, any global optimum of Problem (1) degenerates to E* almost surely over the marginal distribution P_X.

The proof of Theorem 1 is left to the Appendix. In practice, the above global optimum is obtained only if the explanation family is sufficiently large. When P_m(Y | x_S) is unknown or computationally expensive to estimate accurately, we can restrict the explanation to suitably controlled families so as to prevent overfitting.

3. Proposed method

A direct solution to Problem (1) is not possible, so we approach it by a variational approximation. In particular, we derive a lower bound on the mutual information, and we approximate the model conditional distribution P_m by a suitably rich family of functions.

3.1. Obtaining a tractable variational formulation

We now describe the steps taken to obtain a tractable variational formulation.

A variational lower bound: The mutual information between X_S and Y can be expressed in terms of the conditional distribution of Y given X_S:

    I(X_S; Y) = E[ log P_m(X_S, Y) / (P(X_S) P_m(Y)) ]
              = E[ log P_m(Y | X_S) / P_m(Y) ]
              = E[ log P_m(Y | X_S) ] + Const
              = E_X E_{S|X} E_{Y|X_S}[ log P_m(Y | X_S) ] + Const.

For a generic model, it is impossible to compute expectations under the conditional distribution P_m(Y | x_S). Hence we introduce a variational family for approximation:

    Q := { Q : Q = { x_S → Q_S(Y | x_S), S ∈ ℘_k } }.    (3)

Note that each member Q of the family Q is a collection of conditional distributions Q_S(Y | x_S), one for each choice of k-sized feature subset S. For any Q, an application of Jensen's inequality yields the lower bound

    E_{Y|X_S}[ log P_m(Y | X_S) ] ≥ ∫ P_m(Y | X_S) log Q_S(Y | X_S) = E_{Y|X_S}[ log Q_S(Y | X_S) ],

where equality holds if and only if P_m(Y | X_S) and Q_S(Y | X_S) are equal in distribution. We have thus obtained a variational lower bound of the mutual information I(X_S; Y). Problem (1) can thus be relaxed as maximizing the variational lower bound over both the explanation E and the conditional distribution Q:

    max_{E, Q} E[ log Q_S(Y | X_S) ]  such that  S ~ E(X).    (4)

For generic choices of Q and E, it is still difficult to solve the variational approximation (4). In order to obtain a tractable method, we need to restrict both Q and E to suitable families over which it is efficient to perform optimization.

A single neural network for parametrizing Q: Recall that Q = { Q_S(· | x_S), S ∈ ℘_k } is a collection of conditional distributions with cardinality |Q| = (d choose k). We assume X is a continuous random vector, and P_m(Y | ·) is continuous with respect to x. We then introduce a single neural network function g_α : ℝ^d → Δ_{c−1} for parametrizing Q, where

    Δ_{c−1} := { y ∈ [0, 1]^c : Σ_{i=1}^{c} y_i = 1 }

is the probability simplex over the class distribution, and α denotes the learnable parameters. We define Q_S(Y | x_S) := g_α(x̃_S), where x̃_S ∈ ℝ^d is obtained from x by replacing the entries not in S with zeros:

    (x̃_S)_i = x_i if i ∈ S, and 0 if i ∉ S.

When X contains discrete features, we embed each discrete feature with a vector, and the vector representing a specific feature is set to zero simultaneously when the corresponding feature is not in S.
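As a concrete illustration of the zero-masking construction above (entries of x not in S replaced by zeros before the result is fed to g_α), here is a minimal NumPy sketch; the helper name is ours, not from the paper's released code:

```python
import numpy as np

def mask_features(x, S):
    """Zero out the entries of x outside the selected subset S
    (the x-tilde_S construction). Hypothetical helper for illustration."""
    x_masked = np.zeros_like(x)
    idx = list(S)
    x_masked[idx] = x[idx]
    return x_masked

x = np.array([0.5, -1.2, 3.0, 0.7])
masked = mask_features(x, {0, 2})  # entries outside {0, 2} are zeroed
```

The original vector is left untouched, so the same sample can be masked with different candidate subsets.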
3.2. Continuous relaxation of subset sampling

Direct estimation of the objective function in Equation (4) requires summing over the (d choose k) combinations of feature subsets after the variational approximation. Several tricks exist for tackling this issue, like REINFORCE-type algorithms (Williams, 1992), or a weighted sum of features parametrized by deterministic functions of X. (A similar concept to the second trick is the "soft attention" structure in vision (Ba et al., 2014) and NLP (Bahdanau et al., 2014), where the weight of each feature is parametrized by a function of the respective feature itself.) We employ an alternative approach generalized from the Concrete relaxation (Gumbel-softmax trick) (Jang et al., 2017; Maddison et al., 2014; 2016), which empirically has lower variance than REINFORCE and encourages discreteness (Raffel et al., 2017).

The Gumbel-softmax trick uses the Concrete distribution as a continuous, differentiable approximation to a categorical distribution. In particular, suppose we want to approximate a categorical random variable, represented as a one-hot vector in ℝ^d, with category probabilities p_1, p_2, ..., p_d. The random perturbation for each category is independently generated from a Gumbel(0, 1) distribution:

    G_i = −log(−log u_i),  u_i ~ Uniform(0, 1).

We add the random perturbation to the log probability of each category and take a temperature-τ softmax over the d-dimensional vector:

    C_i = exp{(log p_i + G_i)/τ} / Σ_{j=1}^{d} exp{(log p_j + G_j)/τ}.

The resulting random vector C = (C_1, ..., C_d) is called a Concrete random vector, which we denote by C ~ Concrete(log p_1, ..., log p_d).

We apply the Gumbel-softmax trick to approximate weighted subset sampling. We would like to sample a subset S of k distinct features out of the d dimensions. The sampling scheme for S can be equivalently viewed as sampling a k-hot random vector Z from D^d_k := { z ∈ {0,1}^d : Σ_i z_i = k }, with each entry of z being one if it is in the selected subset S and zero otherwise. An importance score depending on the input vector is assigned to each feature: concretely, we define w_θ : ℝ^d → ℝ^d that maps the input to a d-dimensional vector, with the ith entry of w_θ(x) representing the importance score of the ith feature.

We start by approximating the sampling of k distinct features out of d with the following scheme: sample a single feature out of the d features independently k times; discard the overlapping features and keep the rest. Such a scheme samples at most k features, and is easier to approximate by a continuous relaxation. We further approximate this scheme by independently sampling k Concrete random vectors, and then defining a d-dimensional random vector V as their elementwise maximum:

    C^i ~ Concrete(w_θ(X))  i.i.d. for i = 1, 2, ..., k,
    V = (V_1, V_2, ..., V_d),  V_j = max_i C^i_j.

The random vector V is then used to approximate the k-hot random vector Z during training. We write V = V(θ, ζ), as V is a function of θ and a collection ζ of auxiliary random variables sampled independently from the Gumbel distribution. Then we use the elementwise product V(θ, ζ) ⊙ X between V and X as an approximation of x̃_S.

3.3. The final objective and its optimization

Having applied the continuous approximation of feature subset sampling, we have reduced Problem (4) to the following:

    max_{α, θ} E_{X, Y, ζ}[ log g_α(V(θ, ζ) ⊙ X, Y) ],    (5)

where α denotes the parameters of the neural network used to approximate the model conditional distribution, and θ parametrizes the explainer. In the case of classification with c classes, we can write

    max_{α, θ} E_{X, ζ}[ Σ_{y=1}^{c} P_m(y | X) log g_α(V(θ, ζ) ⊙ X, y) ].    (6)

Note that the expectation operator E_{X, ζ} does not depend on the parameters (α, θ), so during the training stage we can apply stochastic gradient methods to jointly optimize the pair (α, θ). In each update, we sample a mini-batch of unlabeled data with their class distributions from the model to be explained, together with the auxiliary random variables ζ, and we then compute a Monte Carlo estimate of the gradient of the objective function (6).

3.4. The explaining stage

During the explaining stage, the learned explainer maps each sample X to a weight vector w_θ(X) of dimension d, each entry representing the importance of the corresponding feature for the specific sample X. To provide a deterministic explanation for a given sample, we rank features according to the weight vector, and the k features with the largest weights are picked as the explaining features. For each sample, only a single forward pass through the neural network parametrizing the explainer is required to yield an explanation. Our algorithm is thus much more efficient in the explaining stage than other model-agnostic explainers like LIME or Kernel SHAP, which require thousands of evaluations of the original model per sample.
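The continuous subset-sampling relaxation above (k i.i.d. Concrete vectors combined by an elementwise maximum) can be sketched in a few lines of NumPy. This is an illustrative re-implementation under our own function names and defaults, not the paper's released TensorFlow code; in training, the same computation would be written with differentiable tensor operations so gradients flow through θ:

```python
import numpy as np

def sample_concrete(log_w, tau=0.1, rng=None):
    """One Concrete (Gumbel-softmax) sample for unnormalized log weights
    log_w at temperature tau."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=log_w.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) perturbations
    z = (log_w + g) / tau
    z -= z.max()                       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def sample_subset_relaxation(log_w, k, tau=0.1, rng=None):
    """Relaxed k-hot vector: elementwise max of k i.i.d. Concrete vectors,
    V_j = max_i C^i_j."""
    rng = np.random.default_rng() if rng is None else rng
    C = np.stack([sample_concrete(log_w, tau, rng) for _ in range(k)])
    return C.max(axis=0)

log_w = np.log(np.array([0.05, 0.4, 0.05, 0.4, 0.1]))  # toy importance scores
V = sample_subset_relaxation(log_w, k=2)
# V is close to one on (at most) k high-weight coordinates, close to zero elsewhere
```

At a low temperature such as 0.1 each Concrete draw is nearly one-hot, so the maximum behaves like a (relaxed) union of k sampled features.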
4. Experiments

We carry out experiments on both synthetic and real data sets. For all experiments, we use RMSprop (Maddison et al., 2016) with the default hyperparameters for optimization. We also fix the step size to 0.001 across experiments. The temperature for the Gumbel-softmax approximation is fixed to 0.1. Code for reproducing the key results is available online at https://github.com/Jianbo-Lab/L2X.

Figure 2. The clock time (in log scale) of explaining 10,000 samples for each method. The training time of L2X is shown in translucent bars.

4.1. Synthetic Data

We begin with experiments on four synthetic data sets:

- 2-dimensional XOR, as binary classification. The input vector X is generated from a 10-dimensional standard Gaussian. The response variable Y is generated from P(Y = 1 | X) ∝ exp{X_1 X_2}.

- Orange skin. The input vector X is generated from a 10-dimensional standard Gaussian. The response variable Y is generated from P(Y = 1 | X) ∝ exp{Σ_{i=1}^{4} X_i^2 − 4}.

- Nonlinear additive model. Generate X from a 10-dimensional standard Gaussian. The response variable Y is generated from P(Y = 1 | X) ∝ exp{−100 sin(2X_1) + 2|X_2| + X_3 + exp{−X_4}}.

- Switch feature. Generate X_1 from a mixture of two Gaussians centered at ±3 respectively, with equal probability. If X_1 is generated from the Gaussian centered at 3, the 2nd–5th dimensions are used to generate Y as in the orange skin model. Otherwise, the 6th–9th dimensions are used to generate Y from the nonlinear additive model.
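For instance, the first ("2-dimensional XOR") data set can be generated as follows. This is a sketch under one assumption we make explicit: we read P(Y = 1 | X) ∝ exp{X_1 X_2} as a logistic model with logit X_1 X_2, since the normalization is left implicit above; the function name is ours, not from the paper's released code:

```python
import numpy as np

def generate_xor(n, d=10, rng=None):
    """Toy generator for the 2-dimensional XOR data set. Assumes
    P(Y=1|X) = sigmoid(X1 * X2), one reading of P(Y=1|X) ∝ exp{X1*X2}."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))
    p = 1.0 / (1.0 + np.exp(-X[:, 0] * X[:, 1]))  # P(Y = 1 | X)
    Y = (rng.uniform(size=n) < p).astype(int)
    return X, Y

X, Y = generate_xor(1000)
# only the product X[:, 0] * X[:, 1] carries signal; the remaining
# dimensions are pure noise features
```

Each of X_1 and X_2 alone is uninformative about Y here; only their product is, which is exactly the joint effect the data set is designed to test.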
In the second Tesla k80 gPu. coded in TensorFlow. Across all the four data set, the samples with positive labels centered around a data sets, SHAP and liME are the least efficient as they require multiple evaluations of the model deeplifT, tay sphere in a fourdimensional space. the sufficient statistic is formed by an additive model of the first four features. The lor and Saliency requires a backward pass of the model response variable in the third data set is generated from a DeepLIFT is the slowest among the three, probably due to nonlinear additive model using the first four features The the fact that backpropagation of gradients for Taylor and Saliency are builtin operations of TensorFlow, while back last data set switches important features(roughly) based on the sign of the first feature The 15 features are true for propagation in deepLIFT is implemented with highlevel operations in Tensor Flow. Our method L2X is the most samples with X1 generated from the Gaussian centered at 3, and the 1, 69 features are true otherwise efficient in the explanation stage as it only requires a for ward pass of the subset sampler. It is much more efficient We compare our method L2X (for"Learning to Explain") compared to SHAP and LIME even after the training time with several strong existing algorithms for instancewise has been taken into consideration, when a moderate numbe Learning to Explain: An InformationTheoretic Perspective on Model Interpretation Orange skin M 日 2 二,二 10 Nonlinear additive 10 Feature switching Figure 3. The box plots for the median ranks of the influential features by each sample, over 10, 000 samples for each data set. The red line and the dotted blue line on each box is the median and the mean respectively. lower median ranks are better The dotted green lines indicate the optimal median rank Truth Model Key words positive positive Ray Liotta and To Hulce shine in this sterling example of brotherly love and cunmniunent. 
Hulce plays Dominick, (nicky)a mildly mentally handicapped young man who is putting his 12 minutes younger, twin brother, I iotta, who plays Eugene, through medical school. It is set in Baltimore and deals with the issues of sibling rivalry, the unbreakable bond of twins, child abuse and good always winning out over evil. It is captivating, and filled with laughter and tears. If you have not yet seen this film, please rent it, I promise, you' ll be amazed at how such a wonderful film could go unnoticed negativenegative Sorry to go against the flow but I thought this film was unrealistic, boring and way too long. I got tired of atching Gena Rowlands long arduous battle with herself and the crisis she was experiencing. Maybe the film has some cinematic value or represented an important step for the director but for pure entertainment alue. I wish I would have skipped it negative positive This movie is chilling reminder of Bollywood being just a parasite of Ilollywood. Bollywood also tends to feed on past blockbusters for furthering its industry. Vidhu Vinod Chopra made this movie with the reasoning that a cocktail mix of deewar and on the waterfront will bring home an oscar. It turned out to be rookie mistake. Even the idea of the title is inspired from the Elia Kazan classic. In the original, Brando is shown as raising doves as symbolism of peace. Bollywood must move out of Holly woods shadow if it needs to be taken seriously positive negative When a small town is threatened by a child killer, a lady police officer goes after him by pretending to be his friend. As she becomes more and more emotionally involved with the murderer her psyche begins to take a heating causing her to lose focus on the job of catching the criminal. not a film of high voltage excitement, but solid police work and a good depiction of the faulty mind of a psychotic loser Tablc 2. Truc labels and labels prcdictcd by thc modcl arc in thc first two columns. 
Kcy words picked by L2X arc highlighted in ycllow of samples (10,000)need to be explained. As the scale of split of 25, 000 for training and 25, 000 for testing. The the data to be explained increases, the training of L2X ac average document length is 231 words, and 10.7 sentences counts for a smaller proportion of the overall time. Thus We use l2x to study two popular classes of models for the relative efficiency of L2X to other algorithms increases sentiment analysis on the IMDB data set with the size of a data set 4.2.1 EXPLAINING A CNN MODEL WITH KEY WORDS 4.2. VDB Convolutional neural networks(CNN have shown excel The Large Movie Review Dataset(IMDB )is a dataset of lent performance for sentiment analysis(Kim, 2014; Zhang movie reviews for sentiment classification(Maas et al.,& Wallace, 2015). We use a simple Cnn model on 2011). It contains 50, 000 labeled movie reviews. with a Keras( Chollet et al., 2015) for the imDb data set, which Learning to Explain: An InformationTheoretic Perspective on Model Interpretation Truth Predicted Key sentence positive positive There are few really hilarious films about science fiction but this one will knock your sox off. The lead Martians Jack Nicholson takeoff is sidesplitting. The plot has a very clever twist that has be seen to be enjoyed. This is a movie with heart and excellent acting by all. Make some popcorn and have a great negative negative You get 5 writers together, have each write a different story with a different genre, and then you try to make one movie out of it Its action its adventure, its scifi. its we its a mess. Sorry, but this movie absolutely stinks. 4.5 is giving it an awefully high rating. That said, its movies like this that make me think i could write movies, and i can barely write negative positive This movie is not the same as the 1954 version with Judy garland and James mason, and that is a shame because the 1954 version is, in my opinion, much better. 
I am not denying Barbra Streisand,'s talent at all She is a good actress and brilliant singer. I am not acquainted with Kris Kristofferson's other work and herefore I can't pass judgment on it. However, this movie leaves much to be desired. It is paced slowly, it has gratuitous nudity and foul language, and can be very difficult to sit through. However. I am not a big fan of rock music, so its only natural that I would like the judy garland version better. See the 1976 film with Barbra and Kris, and judge for yourself. positive negative The first time you see the second renaissance it may look boring. Look at it at least twice and definitely watch part 2. it will change your view of the matrix. Are the human people the ones who started the war Is ai a bad thing? Table 3. True labels and labels from the model are shown in the first two columns. Key sentences picked by L2X highlighted in yellow. is composed of a word embedding of dimension 50, a 1D Each model explainer outputs a subset of features Xs for convolutional layer of kernel size 3 with 250 filters, a max each specific sample X. We use Pmy Xs)to approximate pooling layer and a dense layer of dimension 250 as hidden Pm (y Xs). That is, we feed in the sample X to the model layers. Both the convolutional and the dense layers are fol with unselected words masked by zero paddings. Then lowed by RelU as nonlinearity, and Dropout(Srivastava we compute the accuracy of using Pm(y Xs) to predict et al, 2014) as regularization. Each review is padded/cut to samples in the test data set labeled by Pm(y X), which we 400 words. The CNn model achieves 90% accuracy on the call posthoc accuracy as it is computed after instancewise test data, close to the stateoftheart performance(around feature selection 94%). We would like to find out which k words make the most influence on the decision of the model in a specific review. The number of key words is fixed to be k=10 for Human accuracy. 
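The post-hoc accuracy metric just described can be sketched as follows. This is a minimal NumPy version with hypothetical names (`model_predict` stands for the model P_m and is assumed to map a batch of inputs to class probabilities); it is not from the paper's released code:

```python
import numpy as np

def post_hoc_accuracy(model_predict, X, S_list):
    """Agreement between the model's prediction on the full input and its
    prediction on the input with unselected features zeroed out."""
    full = np.argmax(model_predict(X), axis=1)           # labels from Pm(y | X)
    X_masked = np.zeros_like(X)
    for i, S in enumerate(S_list):                       # keep only selected features
        idx = list(S)
        X_masked[i, idx] = X[i, idx]
    masked = np.argmax(model_predict(X_masked), axis=1)  # labels from Pm(y | x-tilde_S)
    return float(np.mean(full == masked))
```

If an explainer's selected subsets preserve the model's decision, this metric is close to 1; selecting uninformative features drives it toward chance level.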
The explainer of L2X is composed of a global component and a local component (see Figure 2 in Yang et al. (2018)). The input is initially fed into a common embedding layer followed by a convolutional layer with 100 filters. Then the local component processes the common output using two convolutional layers with 50 filters, and the global component processes the common output using a max-pooling layer followed by a 100-dimensional dense layer. We then concatenate the global and local outputs corresponding to each feature, and process them through one convolutional layer with 50 filters, followed by a Dropout layer (Srivastava et al., 2014). Finally, a convolutional layer of kernel size 1 is used to yield the output. All previous convolutional layers have kernel size 3, and ReLU is used as the nonlinearity.

The variational family is composed of a word embedding layer of the same size, followed by average pooling and a 250-dimensional dense layer. Each entry of the output vector V from the explainer is multiplied with the embedding of the respective word in the variational family. We use both automatic metrics and human annotators to validate the effectiveness of L2X.

Figure 4. Ten randomly selected figures of 3s and 8s in the validation set. The first row includes the original digits while the second row does not. The selected patches are colored red if the pixel is activated (white) and blue otherwise.

Human accuracy. When designing the human experiments, we assume that the key words convey an attitude toward a movie, and can thus be used by a human to infer the review sentiment. This assumption has been partially validated by the aligned outcomes of post-hoc accuracy and human judges, because the alignment implies consistency between the sentiment judgement based on the selected words from the original model and that from humans. Based on this assumption, we ask humans on Amazon Mechanical Turk (AMT) to infer the sentiment of a review given the ten key words selected by each explainer. Words adjacent to each other, like "not good at all," keep their adjacency on the AMT interface if they are selected simultaneously. The reviews from different explainers are mixed randomly, and the final sentiment of each review is averaged over the results of multiple human annotators. We measure whether the labels from humans based on the selected words align with the labels provided by the model, in terms of the average accuracy over 500 reviews in the test data set. Some reviews are labeled as "neutral" based on the selected words, either because the selected key words carry no sentiment or because they contain comparable numbers of positive and negative words; these reviews are put in neither the positive nor the negative class when we compute accuracy. We call this metric human accuracy.

The results are reported in Table 4. We observe that the model prediction based on only the ten words selected by L2X aligns with the original prediction for over 90% of the data. The human judgement given ten words also aligns with the model prediction for 84.4% of the data. This human accuracy is even higher than that based on the original review, which is 83.3% (Yang et al., 2018). This indicates that the words selected by L2X can serve as key words for humans to understand the model behavior. Table 2 shows the results of our model on four examples.
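The human accuracy metric above, with its exclusion of "neutral" judgements, can be sketched as follows. This is a simplified illustration under our own label encoding (1 = positive, 0 = negative, None = neutral), and it takes one final human label per review rather than averaging over individual annotators as the experiment does:

```python
import numpy as np

def human_accuracy(human_labels, model_labels):
    """Fraction of non-neutral human judgements that agree with the model's
    labels; neutral reviews are excluded from the accuracy computation."""
    kept = [(h, m) for h, m in zip(human_labels, model_labels) if h is not None]
    if not kept:
        return float("nan")
    return float(np.mean([h == m for h, m in kept]))

# four reviews: the third is judged neutral and therefore excluded
acc = human_accuracy([1, 0, None, 1], [1, 0, 1, 0])
```

Excluding neutral judgements means the denominator shrinks when the selected words carry no usable sentiment, which is worth keeping in mind when comparing explainers.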
EXPLAINING HIERARCHICAL LSTM Posthoc accuracy 0.90.8 0.849 0.958 Human accuracy 0.844 0.774 Another competitive class of models in sentiment analysis uses hierarchical LSTM(Hochreiter schmidhuber, 199 Table 4. Posthoc accuracy and human accuracy of L2X on three Li et aL., 2015). We build a simple hierarchical LSTM by models a wordbased CNN model on IMDB, a hierarchical LSTM putting one layer of LSTM on top of word embeddings, model on IMDB, and a CNN model on MNIST which yields a representation vector for each sentence, and images for training and 1, 984 images for testing. Then we then using another LSTM to encoder all sentence vectors. train a simple neural network for binary classification over The output representation vector by the second LsTm is the subset, which achieves accuracy 99. 7% on the test data passed to the class distribution via a linear layer. Both set. The neural network is composed of two convolutional the two LSTMs and the word embedding are of dimension layers of kermel size 5 and a dense linear layer at last. The 100. The word embedding is pretrained on a large cor two convolutional layers contains and 16 filters respec pus(Mikolov et al., 2013). Each review is padded to contain tively, and both are followed by a max pooling layer of pool 15 sentences. The hierarchical LSTM model gets around size 2. We try to explain each sample image with k=4 im 90%0 accuracy on the test data. We take each sentence as a age patches on the neural network model, where each patch single feature group, and study which sentence is the most contains 4 x 4 pixels, obtained by dividing each 28 x 28 important in each review for the model image into 7x 7 patches. We use patches instead of raw The explainer of L2X is composed of a 100dimensional pixels as features for better visualization. word embedding followed by a convolutional layer and a We parametrize the explainer and the variational family max pooling layer to encode each sentence. 
The encoded with threelayer and twolayer convolutional networks re sentence vectors are fed through three convolutional layers spectively, with max pooling added after each hidden layer and a dense layer to get sampling weights for each sentence. The 7 7 vector sampled from the explainer is upsampled The variational family also encodes each sentence with a (with repetition) to size 28 x 28 and multiplied with the convolutional layer and a max pooling layer. The encoding input raw pixels vectors are weighted by the output of the subset sampler, We use only the posthoc accuracy for experiment, with and passed through an average pooling layer and a dense layer to the class probability. all convolutional layers are of results shown in Table 4. The predictions based on 4 patches filter size 150 and kernel size 3. In this setting L2X can bt selected by L2X out of 49 align with those from original interpreted as a hard attention model(Xu et al., 2015) that images for 95.8% of data. Randomly selected examples employs the gumbelsoftmax trick with explanations are shown in Figure 4. We observe that L2X captures most of the informative patches, in particular Comparison is carried out with the same metrics. For human those containing patterns that can distinguish 3 and 8 accuracy, one selected sentence for each review is shown to human annotators. The other experimental setups are 5. Conclusion kept the same as above. We observe that posthoc accu racy reaches 84.4 with one sentence selected by L2X, and We have proposed a framework for instancewise feature human judgements using one sentence align with the origi selection via mutual information. and a method L2X which nal model prediction for 77. 4% of data. Table 3 shows the seeks a variational approximation of the mutual information explanations from our model on four examples and makes use of a gumbelsoftmax relaxation of discrete subset sampling during training. To our best knowledge 4.3. 
MNIST

The MNIST data set contains 28 x 28 images of handwritten digits (LeCun et al., 1998). We form a subset of the MNIST data set by choosing images of digits 3 and 8, with 11,982 images.

L2X is the first method to realize real-time interpretation of a black-box model. We have shown the efficiency and the capacity of L2X for instancewise feature selection on both synthetic and real data sets.

Learning to Explain: An Information-Theoretic Perspective on Model Interpretation

Acknowledgements

L.S. was also supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVidia and Amazon AWS. We thank Nilesh Tripuraneni for comments about the Gumbel trick.

A. Proof of Theorem 1

Forward direction: Any explanation is represented as a conditional distribution of the feature subset over the input vector. Given the definition of $S^*$, we have for any $X$ and any explanation $\mathcal{E}: S \mid X$,

$$\mathbb{E}_{S\mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X] \le \mathbb{E}_m[\log P_m(Y \mid X_{S^*(X)}) \mid X].$$

In the case when $S^*(X)$ is a set instead of a singleton, we identify $S^*(X)$ with any distribution that assigns arbitrary probability to each element in $S^*(X)$ and zero probability outside $S^*(X)$. With abuse of notation, $S^*$ indicates both the set function that maps every $X$ to a set $S^*(X)$ and any function that maps $X$ to an element in $S^*(X)$. Taking expectation over the distribution of $X$, and subtracting $\mathbb{E}[\log P_m(Y)]$ from both sides, we have

$$I(X_S; Y) \le I(X_{S^*}; Y)$$

for any explanation $\mathcal{E}: S \mid X$.

Reverse direction: The reverse direction is proved by contradiction. Assume the optimal explanation $P(S \mid X)$ is such that there exists a set $M$ of nonzero probability over which $P(S \mid X)$ does not degenerate to an element in $S^*(X)$. Concretely, we define $M$ as

$$M = \{x : P(S \notin S^*(x) \mid X = x) > 0\}.$$

For any $x \in M$, we have

$$\mathbb{E}_{S\mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X = x] < \mathbb{E}_m[\log P_m(Y \mid X_{S^*(x)}) \mid X = x], \qquad (7)$$

where $S^*(x)$ is a deterministic function in the set of distributions that assign arbitrary probability to each element in $S^*(x)$ and zero probability outside $S^*(x)$. Outside $M$, we always have

$$\mathbb{E}_{S\mid X}\,\mathbb{E}_m[\log P_m(Y \mid X_S) \mid X = x] \le \mathbb{E}_m[\log P_m(Y \mid X_{S^*(x)}) \mid X = x] \qquad (8)$$

from the definition of $S^*$. As $M$ is of nonzero measure under $P(X)$, combining Equation (7) and Equation (8) and taking expectation with respect to $P(X)$, we have

$$I(X_S; Y) < I(X_{S^*}; Y),$$

which is a contradiction.

References

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014.

Chen, J., Stern, M., Wainwright, M. J., and Jordan, M. I. Kernel feature selection via conditional covariance minimization. In Advances in Neural Information Processing Systems 30, pp. 6949–6958, 2017.

Chollet, F. et al. Keras. https://github.com/keras-team/keras, 2015.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.

Gao, S., Ver Steeg, G., and Galstyan, A. Variational information maximization for feature selection. In Advances in Neural Information Processing Systems, pp. 487–495, 2016.

Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. stat, 1050:1, 2017.

Kim, Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Kindermans, P.-J., Schütt, K., Müller, K.-R., and Dähne, S. Investigating the influence of noise and distractors on the interpretation of neural networks. arXiv preprint arXiv:1611.07270, 2016.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, J., Luong, M.-T., and Jurafsky, D. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057, 2015.

Lipton, Z. C. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4768–4777, 2017.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pp. 142–150. Association for Computational Linguistics, 2011.

Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

Raffel, C., Luong, T., Liu, P. J., Weiss, R. J., and Eck, D. Online and linear-time attention by enforcing monotonic alignments. arXiv preprint arXiv:1704.00784, 2017.

Ribeiro, M. T., Singh, S., and Guestrin, C. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.

Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 3145–3153. PMLR, 2017.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057, 2015.

Yang, P., Chen, J., Hsieh, C.-J., Wang, J.-L., and Jordan, M. I. Greedy attack and Gumbel attack: Generating adversarial examples for discrete data. arXiv preprint arXiv:1805.12316, 2018.

Zhang, Y. and Wallace, B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
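The masking construction behind the random vector X_S described above can be sketched in a few lines: an explainer scores each feature of an input, the top k features form the subset S, and every coordinate outside S is replaced by a zero baseline. This is a minimal illustration, not the authors' code; the toy input, the scores, and the helper names `top_k_subset` and `mask_features` are all assumptions for the example (in the paper the inputs are 784-pixel MNIST images of digits 3 and 8).

```python
def top_k_subset(scores, k):
    """Pick the indices of the k highest-scoring features, as a trained
    explainer would for a given input."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[-k:]

def mask_features(x, subset):
    """Return x_S: x with every coordinate outside `subset` zeroed out."""
    keep = set(subset)
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

x = [0.5, -1.0, 2.0, 0.1, 0.7, -0.3]      # toy 6-dimensional input
scores = [0.1, 0.9, 0.8, 0.05, 0.2, 0.3]  # hypothetical explainer scores
S = top_k_subset(scores, k=2)             # selects features 2 and 1
x_s = mask_features(x, S)                 # [0.0, -1.0, 2.0, 0.0, 0.0, 0.0]
```

The variational objective is then the expected log-likelihood that the approximating model Q_S assigns to the response Y given such masked inputs, which is what makes the selection rule trainable end to end.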
