An introductory survey on transfer learning: A Survey on Transfer Learning (Pan and Yang)

TABLE 1
Relationship between Traditional Machine Learning and Various Transfer Learning Settings

  Learning Settings                | Source and Target Domains | Source and Target Tasks
  Traditional machine learning     | the same                  | the same
  Inductive transfer learning      | the same                  | different but related
  Unsupervised transfer learning   | different but related     | different but related
  Transductive transfer learning   | different but related     | the same

…is document classification, and each term is taken as a binary feature, then X is the space of all term vectors, x_i is the ith term vector corresponding to some documents, and X is a particular learning sample. In general, if two domains are different, then they may have different feature spaces or different marginal probability distributions.

Given a specific domain, D = {X, P(X)}, a task consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which is not observed but can be learned from the training data, which consist of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The function f(·) can be used to predict the corresponding label, f(x), of a new instance x. From a probabilistic viewpoint, f(x) can be written as P(y|x). In our document classification example, Y is the set of all labels, which is {True, False} for a binary classification task, and y_i is "True" or "False."

For simplicity, in this survey, we only consider the case where there is one source domain D_S and one target domain D_T, as this is by far the most popular case in the research works in the literature. More specifically, we denote the source domain data as D_S = {(x_{S_1}, y_{S_1}), ..., (x_{S_{n_S}}, y_{S_{n_S}})}, where x_{S_i} ∈ X_S is the data instance and y_{S_i} ∈ Y_S is the corresponding class label. In our document classification example, D_S can be a set of term vectors together with their associated true or false class labels. Similarly, we denote the target-domain data as D_T = {(x_{T_1}, y_{T_1}), ..., (x_{T_{n_T}}, y_{T_{n_T}})}, where the input x_{T_i} is in X_T and y_{T_i} ∈ Y_T is the corresponding output. In most cases, 0 ≤ n_T ≪ n_S.

We now give a unified definition of transfer learning.

Definition 1 (Transfer Learning). Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T.

In the above definition, a domain is a pair D = {X, P(X)}. Thus, the condition D_S ≠ D_T implies that either X_S ≠ X_T or P_S(X) ≠ P_T(X). For example, in our document classification example, this means that between a source document set and a target document set, either the term features are different between the two sets (e.g., they use different languages), or their marginal distributions are different.

Similarly, a task is defined as a pair T = {Y, P(Y|X)}. Thus, the condition T_S ≠ T_T implies that either Y_S ≠ Y_T or P(Y_S|X_S) ≠ P(Y_T|X_T). When the target and source domains are the same, i.e., D_S = D_T, and their learning tasks are the same, i.e., T_S = T_T, the learning problem becomes a traditional machine learning problem. When the domains are different, then either 1) the feature spaces between the domains are different, i.e., X_S ≠ X_T, or 2) the feature spaces between the domains are the same but the marginal probability distributions between the domain data are different, i.e., P(X_S) ≠ P(X_T), where X_{S_i} ∈ X_S and X_{T_i} ∈ X_T. As an example, in our document classification example, case 1 corresponds to when the two sets of documents are described in different languages, and case 2 may correspond to when the source domain documents and the target domain documents focus on different topics.

Given specific domains D_S and D_T, when the learning tasks T_S and T_T are different, then either 1) the label spaces between the domains are different, i.e., Y_S ≠ Y_T, or 2) the conditional probability distributions between the domains are different, i.e., P(Y_S|X_S) ≠ P(Y_T|X_T), where Y_{S_i} ∈ Y_S and Y_{T_i} ∈ Y_T. In our document classification example, case 1 corresponds to the situation where the source domain has binary document classes, whereas the target domain has 10 classes to classify the documents to. Case 2 corresponds to the situation where the source and target documents are very unbalanced in terms of the user-defined classes.

In addition, when there exists some relationship, explicit or implicit, between the feature spaces of the two domains, we say that the source and target domains are related.
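To make the notation above concrete, here is a small, hypothetical sketch of the document-classification running example; the topics, word lists, and labels are invented purely for illustration.

```python
# Illustrative only: two tiny document sets in a shared term-vector space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

source_docs = ["stock market prices rise", "bank reports quarterly profit"]   # finance topic
target_docs = ["team wins the final match", "player scores in overtime"]      # sports topic

# X is the feature space of binary term vectors (shared vocabulary, so X_S = X_T).
vectorizer = CountVectorizer(binary=True)
X_all = vectorizer.fit_transform(source_docs + target_docs).toarray()
X_S, X_T = X_all[:2], X_all[2:]

# The empirical term distributions differ, i.e. P(X_S) != P(X_T), even though the
# feature space is the same -- case 2 in the discussion above.
print("source term frequencies:", X_S.mean(axis=0))
print("target term frequencies:", X_T.mean(axis=0))

# A task is a label space Y plus a predictive function f(x) ~ P(y|x); here the
# (invented) label marks whether a document is "relevant", Y = {True, False}.
y_S = np.array([True, False])
```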
2.3 A Categorization of Transfer Learning Techniques

In transfer learning, the main research issues are: 1) what to transfer, 2) how to transfer, and 3) when to transfer.

"What to transfer" asks which part of knowledge can be transferred across domains or tasks. Some knowledge is specific to individual domains or tasks, and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer the knowledge, which corresponds to the "how to transfer" issue.

"When to transfer" asks in which situations transferring skills should be done. Likewise, we are interested in knowing in which situations knowledge should not be transferred. In some situations, when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even hurt the performance of learning in the target domain, a situation which is often referred to as negative transfer. Most current work on transfer learning focuses on "what to transfer" and "how to transfer," by implicitly assuming that the source and target domains are related to each other. However, how to avoid negative transfer is an important open issue that is attracting more and more attention.

Based on the definition of transfer learning, we summarize the relationship between traditional machine learning and various transfer learning settings in Table 1, where we categorize transfer learning under three subsettings, inductive transfer learning, transductive transfer learning, and unsupervised transfer learning, based on different situations between the source and target domains and tasks.

TABLE 2
Different Settings of Transfer Learning

  Transfer Learning Settings      | Related Areas                                             | Source Domain Labels | Target Domain Labels | Tasks
  Inductive transfer learning     | Multi-task learning                                       | Available            | Available            | Regression, Classification
  Inductive transfer learning     | Self-taught learning                                      | Unavailable          | Available            | Regression, Classification
  Transductive transfer learning  | Domain adaptation, Sample selection bias, Covariate shift | Available            | Unavailable          | Regression, Classification
  Unsupervised transfer learning  |                                                           | Unavailable          | Unavailable          | Clustering, Dimensionality reduction

1. In the inductive transfer learning setting, the target task is different from the source task, no matter whether the source and target domains are the same or not. In this case, some labeled data in the target domain are required to induce an objective predictive model f_T(·) for use in the target domain. In addition, according to different situations of labeled and unlabeled data in the source domain, we can further categorize the inductive transfer learning setting into two cases:

   a. A lot of labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the multitask learning setting. However, the inductive transfer learning setting only aims at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and source tasks simultaneously.

   b. No labeled data in the source domain are available. In this case, the inductive transfer learning setting is similar to the self-taught learning setting, which was first proposed by Raina et al. [22]. In the self-taught learning setting, the label spaces between the source and target domains may be different, which implies that the side information of the source domain cannot be used directly. Thus, it is similar to the inductive transfer learning setting where the labeled data in the source domain are unavailable.

2. In the transductive transfer learning setting, the source and target tasks are the same, while the source and target domains are different. In this situation, no labeled data in the target domain are available, while a lot of labeled data in the source domain are available. In addition, according to different situations between the source and target domains, we can further categorize the transductive transfer learning setting into two cases:

   a. The feature spaces between the source and target domains are different, X_S ≠ X_T.

   b. The feature spaces between domains are the same, X_S = X_T, but the marginal probability distributions of the input data are different, P(X_S) ≠ P(X_T).

   The latter case of the transductive transfer learning setting is related to domain adaptation for knowledge transfer in text classification [23] and to sample selection bias [24] or covariate shift [25], whose assumptions are similar.

3. Finally, in the unsupervised transfer learning setting, similar to the inductive transfer learning setting, the target task is different from but related to the source task. However, unsupervised transfer learning focuses on solving unsupervised learning tasks in the target domain, such as clustering, dimensionality reduction, and density estimation [26], [27]. In this case, there are no labeled data available in either the source or the target domain in training.

The relationship between the different settings of transfer learning and the related areas is summarized in Table 2 and Fig. 2.
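The categorization in Table 2 can be restated as a small decision helper; this is only an illustration of the taxonomy above, not part of the survey.

```python
# Restating Table 2 as code: map label availability and task identity to the
# corresponding transfer learning setting. Purely illustrative.
def transfer_setting(source_labels: bool, target_labels: bool, same_task: bool) -> str:
    if not same_task and target_labels:
        # Related areas: multi-task learning (source labels available) or
        # self-taught learning (source labels unavailable).
        return "inductive transfer learning"
    if same_task and source_labels and not target_labels:
        # Related areas: domain adaptation, sample selection bias, covariate shift.
        return "transductive transfer learning"
    if not same_task and not source_labels and not target_labels:
        return "unsupervised transfer learning"
    return "not covered by Table 2"

print(transfer_setting(source_labels=True, target_labels=True, same_task=False))    # inductive
print(transfer_setting(source_labels=True, target_labels=False, same_task=True))    # transductive
print(transfer_setting(source_labels=False, target_labels=False, same_task=False))  # unsupervised
```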
Approaches to transfer learning in the above three different settings can be summarized into four cases based on "what to transfer." Table 3 shows these four cases with a brief description.

TABLE 3
Different Approaches to Transfer Learning

  Instance-transfer: Re-weight some labeled data in the source domain for use in the target domain [6], [28], [29], [30], [31], [24], [32], [33], [34], [35].
  Feature-representation-transfer: Find a "good" feature representation that reduces the difference between the source and the target domains and the error of classification and regression models [22], [36], [37], [38], [39], [8], [40], [41], [42], [43], [44].
  Parameter-transfer: Discover shared parameters or priors between the source domain and target domain models, which can benefit transfer learning [45], [46], [47], [48], [49].
  Relational-knowledge-transfer: Build a mapping of relational knowledge between the source domain and the target domain. Both domains are relational domains and the i.i.d. assumption is relaxed in each domain [50], [51], [52].

The first context can be referred to as the instance-based transfer learning (or instance-transfer) approach [6], [28], [29], [30], [31], [24], [32], [33], [34], [35], which assumes that certain parts of the data in the source domain can be reused for learning in the target domain by reweighting. Instance reweighting and importance sampling are two major techniques in this context.

A second case can be referred to as the feature-representation-transfer approach [22], [36], [37], [38], [39], [8], [40], [41], [42], [43], [44]. The intuitive idea behind this case is to learn a "good" feature representation for the target domain. In this case, the knowledge used to transfer across domains is encoded into the learned feature representation. With the new feature representation, the performance of the target task is expected to improve significantly.

A third case can be referred to as the parameter-transfer approach [45], [46], [47], [48], [49], which assumes that the source tasks and the target tasks share some parameters or prior distributions of the hyperparameters of the models. The transferred knowledge is encoded into the shared parameters or priors. Thus, by discovering the shared parameters or priors, knowledge can be transferred across tasks.

Finally, the last case can be referred to as the relational-knowledge-transfer problem [50], which deals with transfer learning for relational domains. The basic assumption behind this context is that some relationship among the data in the source and target domains is similar. Thus, the knowledge to be transferred is the relationship among the data. Recently, statistical relational learning techniques dominate this context [51], [52].

[Fig. 2. An overview of different settings of transfer learning.]

Table 4 shows the cases where the different approaches are used for each transfer learning setting. We can see that the inductive transfer learning setting has been studied in many research works, while the unsupervised transfer learning setting is a relatively new research topic and has only been studied in the context of the feature-representation-transfer case. In addition, the feature-representation-transfer approach has been proposed for all three settings of transfer learning. However, the parameter-transfer and the relational-knowledge-transfer approaches are only studied in the inductive transfer learning setting, which we discuss in detail below.

TABLE 4
Different Approaches Used in Different Settings

                                    | Inductive Transfer Learning | Transductive Transfer Learning | Unsupervised Transfer Learning
  Instance-transfer                 | yes                         | yes                            |
  Feature-representation-transfer   | yes                         | yes                            | yes
  Parameter-transfer                | yes                         |                                |
  Relational-knowledge-transfer     | yes                         |                                |
3 INDUCTIVE TRANSFER LEARNING

Definition 2 (Inductive Transfer Learning). Given a source domain D_S and a learning task T_S, a target domain D_T and a learning task T_T, inductive transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where T_S ≠ T_T.

Based on the above definition of the inductive transfer learning setting, a few labeled data in the target domain are required as the training data to induce the target predictive function. As mentioned in Section 2.3, this setting has two cases: 1) labeled data in the source domain are available and 2) labeled data in the source domain are unavailable while unlabeled data in the source domain are available. Most transfer learning approaches in this setting focus on the former case.

3.1 Transferring Knowledge of Instances

The instance-transfer approach to the inductive transfer learning setting is intuitively appealing: although the source domain data cannot be reused directly, there are certain parts of the data that can still be reused together with a few labeled data in the target domain.

Dai et al. [6] proposed a boosting algorithm, TrAdaBoost, which is an extension of the AdaBoost algorithm, to address the inductive transfer learning problems. TrAdaBoost assumes that the source and target-domain data use exactly the same set of features and labels, but the distributions of the data in the two domains are different. In addition, TrAdaBoost assumes that, due to the difference in distributions between the source and the target domains, some of the source domain data may be useful in learning for the target domain, but some of them may not and could even be harmful. It attempts to iteratively reweight the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more for the target domain. In each round of iteration, TrAdaBoost trains the base classifier on the weighted source and target data. The error is only calculated on the target data. Furthermore, TrAdaBoost uses the same strategy as AdaBoost to update the incorrectly classified examples in the target domain, while using a different strategy from AdaBoost to update the incorrectly classified source examples in the source domain. Theoretical analysis of TrAdaBoost is also given in [6].

Jiang and Zhai [30] proposed a heuristic method to remove "misleading" training examples from the source domain based on the difference between the conditional probabilities P(y_T|x_T) and P(y_S|x_S). Liao et al. [31] proposed a new active learning method to select the unlabeled data in a target domain to be labeled with the help of the source domain data. Wu and Dietterich [53] integrated the source domain (auxiliary) data into a Support Vector Machine (SVM) framework for improving the classification performance.
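A minimal sketch of the reweighting scheme just described, assuming a shared feature space and binary labels in {0, 1}. It follows the weight-update rules of TrAdaBoost as summarized above, but for brevity returns only the final base classifier rather than the weighted vote over the later rounds used in [6]; the toy data are random.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_sketch(X_src, y_src, X_tar, y_tar, n_rounds=10):
    n_s = len(X_src)
    X = np.vstack([X_src, X_tar])
    y = np.concatenate([y_src, y_tar])
    w = np.ones(len(X))                                   # instance weights
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    clf = None
    for _ in range(n_rounds):
        p = w / w.sum()
        clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        err = np.abs(clf.predict(X) - y).astype(float)    # 0/1 error per instance
        # The error rate is measured on the target-domain data only.
        eps = np.clip(np.sum(p[n_s:] * err[n_s:]) / p[n_s:].sum(), 1e-6, 0.499)
        beta_tar = eps / (1.0 - eps)
        # AdaBoost-style update: increase weights of misclassified target examples.
        w[n_s:] *= beta_tar ** (-err[n_s:])
        # Different update for the source: decrease weights of the misclassified
        # ("bad") source examples so they contribute less in later rounds.
        w[:n_s] *= beta_src ** err[:n_s]
    return clf

# Toy usage with random data sharing one feature space (illustration only).
rng = np.random.RandomState(0)
X_src, y_src = rng.randn(100, 5), rng.randint(0, 2, 100)
X_tar, y_tar = rng.randn(20, 5) + 0.5, rng.randint(0, 2, 20)
model = tradaboost_sketch(X_src, y_src, X_tar, y_tar)
print(model.predict(X_tar[:5]))
```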
3.2 Transferring Knowledge of Feature Representations

The feature-representation-transfer approach to the inductive transfer learning problem aims at finding "good" feature representations to minimize domain divergence and classification or regression model error. Strategies to find "good" feature representations are different for different types of source domain data. If a lot of labeled data in the source domain are available, supervised learning methods can be used to construct a feature representation. This is similar to common feature learning in the field of multitask learning [40]. If no labeled data in the source domain are available, unsupervised learning methods are proposed to construct the feature representation.

3.2.1 Supervised Feature Construction

Supervised feature construction methods for the inductive transfer learning setting are similar to those used in multitask learning. The basic idea is to learn a low-dimensional representation that is shared across related tasks. In addition, the learned new representation can reduce the classification or regression model error of each task as well.

Argyriou et al. [40] proposed a sparse feature learning method for multitask learning. In the inductive transfer learning setting, the common features can be learned by solving an optimization problem, given as follows:

  \arg\min_{A, U} \sum_{t \in \{S, T\}} \sum_{i=1}^{n_t} L\big(y_{t_i}, \langle a_t, U^T x_{t_i} \rangle\big) + \gamma \|A\|_{2,1}^2        (1)
  s.t.  U \in O^d.

In this equation, S and T denote the tasks in the source domain and the target domain, respectively. A = [a_S, a_T] is a matrix of parameters, and U is a d x d orthogonal matrix (mapping function) for mapping the original high-dimensional data to low-dimensional representations. The (r, p)-norm of A is defined as \|A\|_{r,p} := (\sum_{i=1}^{d} \|a^i\|_r^p)^{1/p}. The optimization problem (1) estimates the low-dimensional representations U^T X_T and U^T X_S and the parameters A of the model at the same time. The optimization problem (1) can be further transformed into an equivalent convex optimization formulation and solved efficiently. In a follow-up work, Argyriou et al. [41] proposed a spectral regularization framework on matrices for multitask structure learning.

Lee et al. [42] proposed a convex optimization algorithm for simultaneously learning metapriors and feature weights from an ensemble of related prediction tasks. The metapriors can be transferred among different tasks. Jebara [43] proposed to select features for multitask learning with SVMs. Rückert and Kramer [54] designed a kernel-based approach to inductive transfer, which aims at finding a suitable kernel for the target data.
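The coupling in problem (1) comes from the (2,1)-norm penalty, which forces the source and target tasks to rely on a common set of features. A rough, runnable stand-in is scikit-learn's MultiTaskLasso, which applies the same l2,1-type penalty to a coefficient matrix shared across tasks; unlike problem (1), it omits the learned orthogonal mapping U and assumes both tasks are observed on the same inputs, and the data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
n, d = 60, 20
X = rng.randn(n, d)
true_w = np.zeros((d, 2))
true_w[:3, 0] = [1.0, -2.0, 0.5]       # source task depends on features 0..2
true_w[:3, 1] = [0.8, -1.5, 0.7]       # target task depends on the same features
Y = X @ true_w + 0.1 * rng.randn(n, 2)  # column 0: source task, column 1: target task

# The l2,1 penalty zeroes out whole rows of the coefficient matrix, so both
# tasks are forced to rely on the same small set of shared features.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
shared_features = np.where(np.abs(model.coef_).sum(axis=0) > 1e-8)[0]
print("features shared by the source and target tasks:", shared_features)
```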
3.2.2 Unsupervised Feature Construction

In [22], Raina et al. proposed to apply sparse coding [55], which is an unsupervised feature construction method, for learning higher level features for transfer learning. The basic idea of this approach consists of two steps. In the first step, higher level basis vectors b = {b_1, b_2, ..., b_s} are learned on the source domain data by solving the optimization problem (2), shown as follows:

  \min_{a, b} \sum_{i} \Big\| x_{S_i} - \sum_{j} a_{S_i}^{j} b_j \Big\|_2^2 + \beta \| a_{S_i} \|_1        (2)
  s.t.  \| b_j \|_2 \le 1,  \forall j \in 1, \ldots, s.

In this equation, a_{S_i}^{j} is a new representation of basis b_j for input x_{S_i}, and β is a coefficient to balance the feature construction term and the regularization term. After learning the basis vectors b, in the second step, an optimization algorithm (3) is applied on the target-domain data to learn higher level features based on the basis vectors b:

  a_{T_i}^{*} = \arg\min_{a_{T_i}} \Big\| x_{T_i} - \sum_{j} a_{T_i}^{j} b_j \Big\|_2^2 + \beta \| a_{T_i} \|_1.        (3)

Finally, discriminative algorithms can be applied to {a_{T_i}^{*}} with corresponding labels to train classification or regression models for use in the target domain. One drawback of this method is that the so-called higher level basis vectors learned on the source domain in the optimization problem (2) may not be suitable for use in the target domain.
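The two-step procedure of problems (2) and (3) maps directly onto standard sparse-coding tooling. The following sketch, on random data, uses scikit-learn's DictionaryLearning to learn the basis on source data and sparse_encode to compute the new target representations; the dictionary size and penalty are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_source = rng.randn(200, 30)                      # plentiful unlabeled source data
X_target = rng.randn(40, 30)                       # a few labeled target examples
y_target = rng.randint(0, 2, 40)

# Step 1 (problem (2)): learn higher level basis vectors b on the source data.
dico = DictionaryLearning(n_components=10, alpha=1.0, max_iter=50,
                          random_state=0).fit(X_source)
basis = dico.components_                           # rows are the basis vectors b_j

# Step 2 (problem (3)): encode the target data against the fixed basis to get
# the new sparse representations a*_{T_i}.
A_target = sparse_encode(X_target, basis, alpha=1.0)

# Finally, train a discriminative model on the new features.
clf = LogisticRegression(max_iter=1000).fit(A_target, y_target)
print("training accuracy on the toy target data:", clf.score(A_target, y_target))
```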
Recently, manifold learning methods have been adapted for transfer learning. In [44], Wang and Mahadevan proposed a Procrustes analysis-based approach to manifold alignment without correspondences, which can be used to transfer knowledge across domains via the aligned manifolds.

3.3 Transferring Knowledge of Parameters

Most parameter-transfer approaches to the inductive transfer learning setting assume that individual models for related tasks should share some parameters or prior distributions of hyperparameters. Most approaches described in this section, including a regularization framework and a hierarchical Bayesian framework, are designed to work under multitask learning. However, they can be easily modified for transfer learning. As mentioned above, multitask learning tries to learn both the source and target tasks simultaneously and perfectly, while transfer learning only aims at boosting the performance of the target domain by utilizing the source domain data. Thus, in multitask learning, the weights of the loss functions for the source and target data are the same. In contrast, in transfer learning, the weights in the loss functions for different domains can be different. Intuitively, we may assign a larger weight to the loss function of the target domain to make sure that we can achieve better performance in the target domain.

Lawrence and Platt [45] proposed an efficient algorithm known as MT-IVM, which is based on Gaussian Processes (GP), to handle the multitask learning case. MT-IVM tries to learn parameters of a Gaussian Process over multiple tasks by sharing the same GP prior. Bonilla et al. [46] also investigated multitask learning in the context of GP.
The authors proposed to use a free-form covariance matrix over tasks to model intertask dependencies, where a GP prior is used to induce correlations between tasks. Schwaighofer et al. [47] proposed to use a hierarchical Bayesian framework (HB) together with GP for multitask learning.

Besides transferring the priors of the GP models, some researchers also proposed to transfer parameters of SVMs under a regularization framework. Evgeniou and Pontil [48] borrowed the idea of HB to SVMs for multitask learning. The proposed method assumed that the parameter, w, in SVMs for each task can be separated into two terms: one is a common term over tasks and the other is a task-specific term. In inductive transfer learning,

  w_S = w_0 + v_S  and  w_T = w_0 + v_T,

where w_S and w_T are parameters of the SVMs for the source task and the target learning task, respectively. w_0 is a common parameter, while v_S and v_T are specific parameters for the source task and the target task, respectively. By assuming f_t = w_t · x to be a hyperplane for task t, an extension of SVMs to the multitask learning case can be written as follows:

  \min_{w_0, v_t, \xi_{t_i}} \; \sum_{t \in \{S, T\}} \sum_{i=1}^{n_t} \xi_{t_i} + \frac{\lambda_1}{2} \sum_{t \in \{S, T\}} \| v_t \|^2 + \lambda_2 \| w_0 \|^2
  s.t.  y_{t_i} (w_0 + v_t) \cdot x_{t_i} \ge 1 - \xi_{t_i},
        \xi_{t_i} \ge 0,  i \in \{1, 2, \ldots, n_t\}  and  t \in \{S, T\}.

By solving the optimization problem above, we can learn the parameters w_0, v_S, and v_T simultaneously.

Several researchers have pursued the parameter-transfer approach further. Gao et al. [49] proposed a locally weighted ensemble learning framework to combine multiple models for transfer learning, where the weights are dynamically assigned according to a model's predictive power on each test example in the target domain.
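The decomposition w_t = w_0 + v_t above can be emulated with a simple feature-augmentation trick: give every instance a shared feature block plus a block reserved for its own task and train a single linear SVM on the result. This is only a rough approximation of the formulation in [48] (the λ1/λ2 trade-off is fixed by how the blocks are scaled), and the data here are random.

```python
import numpy as np
from sklearn.svm import LinearSVC

def augment(X, task, n_tasks=2):
    """Map x of task t to [x, 0, ..., x, ..., 0]: one shared block plus one
    block per task.  A linear weight vector over this representation then
    decomposes as w_0 (shared block) + v_t (task block), mirroring w_t = w_0 + v_t."""
    n, d = X.shape
    out = np.zeros((n, d * (1 + n_tasks)))
    out[:, :d] = X                                   # shared block  -> w_0
    out[:, d * (1 + task): d * (2 + task)] = X       # task-t block  -> v_t
    return out

rng = np.random.RandomState(0)
X_S, y_S = rng.randn(200, 10), rng.randint(0, 2, 200)   # source task (t = 0), many examples
X_T, y_T = rng.randn(20, 10), rng.randint(0, 2, 20)     # target task (t = 1), few examples

X_aug = np.vstack([augment(X_S, 0), augment(X_T, 1)])
y_aug = np.concatenate([y_S, y_T])
svm = LinearSVC(C=1.0, max_iter=5000).fit(X_aug, y_aug)

d = X_S.shape[1]
w0, v_S, v_T = svm.coef_[0, :d], svm.coef_[0, d:2 * d], svm.coef_[0, 2 * d:]
print("||w_0|| =", np.linalg.norm(w0))
print("||v_S|| =", np.linalg.norm(v_S), " ||v_T|| =", np.linalg.norm(v_T))
# Parameters effectively used for target-task predictions: w_T = w_0 + v_T.
```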
3.4 Transferring Relational Knowledge

Different from the other three contexts, the relational-knowledge-transfer approach deals with transfer learning problems in relational domains, where the data are non-i.i.d. and can be represented by multiple relations, such as networked data and social network data. This approach does not assume that the data drawn from each domain be independent and identically distributed (i.i.d.) as traditionally assumed. It tries to transfer the relationship among data from a source domain to a target domain. In this context, statistical relational learning techniques are proposed to solve these problems.

Mihalkova et al. [50] proposed an algorithm, TAMAR, that transfers relational knowledge with Markov Logic Networks (MLNs) across relational domains. MLNs [56] are a powerful formalism for statistical relational learning, which combines the compact expressiveness of first-order logic with the flexibility of probability. In MLNs, entities in a relational domain are represented by predicates and their relationships are represented in first-order logic. TAMAR is motivated by the fact that if two domains are related to each other, there may exist mappings to connect entities and their relationships from a source domain to a target domain. For example, a professor can be considered as playing a similar role in an academic domain as a manager in an industrial management domain. In addition, the relationship between a professor and his or her students is similar to the relationship between a manager and his or her workers. Thus, there may exist a mapping from professor to manager and a mapping from the professor-student relationship to the manager-worker relationship. In this vein, TAMAR tries to use an MLN learned for a source domain to aid in the learning of an MLN for a target domain. Basically, TAMAR is a two-stage algorithm. In the first step, a mapping is constructed from a source MLN to the target domain based on the weighted pseudo log-likelihood measure (WPLL). In the second step, a revision is done for the mapped structure in the target domain through the FORTE algorithm [57], which is an inductive logic programming (ILP) algorithm for revising first-order theories. The revised MLN can be used as a relational model for inference or reasoning in the target domain.

In the AAAI-2008 workshop on transfer learning for complex tasks (http://www.cs.utexas.edu/~mtaylor/aaai08tl/), Mihalkova and Mooney [51] extended TAMAR to the single-entity-centered setting of transfer learning, where only one entity in a target domain is available. Davis and Domingos [52] proposed an approach to transferring relational knowledge based on a form of second-order Markov logic. The basic idea of the algorithm is to discover structural regularities in the source domain in the form of Markov logic formulas with predicate variables, and to instantiate these formulas with predicates from the target domain.

4 TRANSDUCTIVE TRANSFER LEARNING

The term transductive transfer learning was first proposed by Arnold et al. [58], where they required that the source and target tasks be the same, although the domains may be different. On top of these conditions, they further required that all unlabeled data in the target domain be available at training time, but we believe that this condition can be relaxed; instead, in our definition of the transductive transfer learning setting, we only require that part of the unlabeled target data be seen at training time in order to obtain the marginal probability for the target data.

Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning refers to the situation where all test data are required to be seen at training time and the learned model cannot be reused for future data. Thus, when some new test data arrive, they must be classified together with all existing data. In our categorization of transfer learning, in contrast, we use the term transductive to emphasize the concept that in this type of transfer learning, the tasks must be the same and there must be some unlabeled data available in the target domain.

Definition 3 (Transductive Transfer Learning). Given a source domain D_S and a corresponding learning task T_S, a target domain D_T and a corresponding learning task T_T, transductive transfer learning aims to improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T and T_S = T_T. In addition, some unlabeled target-domain data must be available at training time.

This definition covers the work of Arnold et al. [58], since the latter considered domain adaptation, where the difference lies between the marginal probability distributions of the source and target data; i.e., the tasks are the same but the domains are different.

Similar to the traditional transductive learning setting, which aims to make the best use of the unlabeled test data for learning, in our classification scheme under transductive transfer learning we also require that some unlabeled target-domain data be given. In the above definition of transductive transfer learning, the source and target tasks are the same, which implies that one can adapt the predictive function learned in the source domain for use in the target domain through some unlabeled target-domain data. As mentioned in Section 2.3, this setting can be split into two cases: 1) the feature spaces between the source and target domains are different, X_S ≠ X_T, and 2) the feature spaces between domains are the same, X_S = X_T, but the marginal probability distributions of the input data are different, P(X_S) ≠ P(X_T). This is similar to the requirements in domain adaptation and sample selection bias. Most approaches described in the following sections are related to case 2 above.
4.1 Transferring the Knowledge of Instances

Most instance-transfer approaches to the transductive transfer learning setting are motivated by importance sampling. To see how importance-sampling-based methods may help in this setting, we first review the problem of empirical risk minimization (ERM) [60]. In general, we might want to learn the optimal parameters θ* of the model by minimizing the expected risk,

  \theta^{*} = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x, y) \in P}\,[\, l(x, y, \theta) \,],

where l(x, y, θ) is a loss function that depends on the parameter θ. However, since it is hard to estimate the probability distribution P, we choose to minimize the ERM instead,

  \theta^{*} = \arg\min_{\theta \in \Theta} \; \frac{1}{n} \sum_{i=1}^{n} l(x_i, y_i, \theta),

where n is the size of the training data.

In the transductive transfer learning setting, we want to learn an optimal model for the target domain by minimizing the expected risk,

  \theta^{*} = \arg\min_{\theta \in \Theta} \; \sum_{(x, y) \in D_T} P(D_T) \, l(x, y, \theta).

However, since no labeled data in the target domain are observed in the training data, we have to learn a model from the source domain data instead. If P(D_S) = P(D_T), then we may simply learn the model by solving the following optimization problem for use in the target domain,

  \theta^{*} = \arg\min_{\theta \in \Theta} \; \sum_{(x, y) \in D_S} P(D_S) \, l(x, y, \theta).

Otherwise, when P(D_S) ≠ P(D_T), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain, as follows:

  \theta^{*} = \arg\min_{\theta \in \Theta} \; \sum_{(x, y) \in D_S} \frac{P_T(x, y)}{P_S(x, y)} P(D_S) \, l(x, y, \theta)
             \approx \arg\min_{\theta \in \Theta} \; \sum_{i=1}^{n_S} \frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})} \, l(x_{S_i}, y_{S_i}, \theta).

Therefore, by adding different penalty values to each instance (x_{S_i}, y_{S_i}) with the corresponding weight P_T(x_{T_i}, y_{T_i}) / P_S(x_{S_i}, y_{S_i}), we can learn a precise model for the target domain. Furthermore, since P(Y_T|X_T) = P(Y_S|X_S), the difference between P(D_S) and P(D_T) is caused by P(X_S) and P(X_T), and

  \frac{P_T(x_{T_i}, y_{T_i})}{P_S(x_{S_i}, y_{S_i})} = \frac{P(x_{T_i})}{P(x_{S_i})}.

If we can estimate P(x_{T_i}) / P(x_{S_i}) for each instance, we can solve the transductive transfer learning problems.

There exist various ways to estimate P(x_{T_i}) / P(x_{S_i}). Zadrozny [24] proposed to estimate the terms P(x_{S_i}) and P(x_{T_i}) independently by constructing simple classification problems.
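One common way to obtain this ratio without explicit density estimation (and a rough illustration of the classification-based estimate mentioned above) is to train a probabilistic classifier to distinguish source from target inputs and convert its outputs into importance weights, which are then plugged into a weighted empirical risk. The data, estimator, and constants below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_S = rng.normal(0.0, 1.0, size=(500, 2))
y_S = (X_S[:, 0] + X_S[:, 1] > 0).astype(int)
X_T = rng.normal(0.7, 1.0, size=(300, 2))           # same task, shifted P(X)

# Estimate the density ratio P_T(x)/P_S(x) with a domain classifier:
# if g(x) = P(domain = target | x), the ratio is (g / (1 - g)) * (n_S / n_T).
domain_X = np.vstack([X_S, X_T])
domain_y = np.concatenate([np.zeros(len(X_S)), np.ones(len(X_T))])
g = LogisticRegression().fit(domain_X, domain_y).predict_proba(X_S)[:, 1]
weights = (g / (1.0 - g)) * (len(X_S) / len(X_T))

# Weighted empirical risk minimization: each source instance's loss is
# scaled by its estimated importance weight.
model = LogisticRegression().fit(X_S, y_S, sample_weight=weights)
print("importance weight range:", weights.min(), weights.max())
```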
In this area, Daume [39] integrated with cross-validation to pe erform mo del selection proposed a kernel-mapping function for NLP problems, automatically in two steps: 1)estimating the weights of the which maps the data from both source and target domains to source domain data and 2)training models on the reweighted a high-dimensional feature space, where standard discrimi- data. Bickel et al. [33 combined the two steps in a unified native learning methods are used to train the classifiers However, the constructed kernel-mapping function is framework by deriving a kernel-logistic regression classifier. domain knowledge driven. It is not easy to generalize the Besides sample reweighting techniques, Dai et al. [28 kernel mapping to other areas or applications. Blitzer et al extended a traditional Naive Bayesian classifier for the [62] analyzed the uniform convergence bounds for algo- transductive transfer learning problems. For more informa- rithms that minimized a convex combination of source and tion on importance sampling and reweighting methods for target empirical risks covariate shift or sample selection bias, readers can refer to a In[36], Daiet al proposed a coclustering-based algorithm recently published book [29]by Quionero-Candela et al. One to propagate the label information across different domains In [63], Xing et al. proposed a novel algorithm known as can also consult a tutorial on Sample Selection Bias by Fan bridged refinement to correct the labels predickedby a shift- and Sugiyama in ICDM-08 unaware classifier toward a target distribution and take the 4.2 Transferring Knowledge of Feature mixture distribution of the training and test data as a bridge Representations to better transfer from the training data to the test data. In Most feature-representation-transfer approaches to the [64\, Ling et al. proposed a spectral classification framework transductive transfer learning setting are under unsuper- for cross-domain transfer learning problem, where the vised learning frameworks. Blitzer et al. [38] proposed a objective function is introduced to seek consistency between structural correspondence learning(SCL)algorithm, which the in-domain supervision and the out-of-domain intrinsic tends [371 to make use of the unlabeled data from the structure In [651, Xue et al. pr roposed a cross-domain text target domain to extract some relevant features that may classification algorithm that extended the traditional prob reduce the difference between the domains. The first step of abilistic latent semantic analysis(PLSA)algorithm to SCL is to define a set of pivot features(the number of pivot integrate labeled and unlabeled data from different but feature is denoted by ma)on the unlabeled data from both related domains, into a unified probabilistic model. The new model is called Topic-bridged PLSA, or TPLSA 5.Tutorialslidescanbefoundathttp://www.cs.columbia.edu/-tan/Transferlearningviadimensionalityreductionwas 6. The pivot fcatures arc domain specific and dcpend on prior recently proposed by Pan et al.[66]. In this work, Pan et al exploited the Maximum Mean Discrepancy Embedding 1354 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. VOL 22. NO. 10. OCTOBER 2010 reduction, to learn a low-dimensional space to reduce the iteratively to find the best subspace for the target datz Fun (MMDE) method, originally designed for dimensionality data to reduce the dimensions. These two steps difference of distributions between different domai transductive transfer learning. 
However, MMDE may suffer 6 TRANSFER BOUNDS AND NEGATIVE TRANSFER 67 Pan et al further proposed an efficient feature extraction algorithm, An important issue is to recognize the limit of the power of known as Transfer Component Analysis(TCA)to overcome transfer learning. In [68], Mahmud and Ray analyzed the the drawback of mmde case of transfer learning using Kolmogorov complexity, where some theoretical bounds are proved. In particular, 5 UNSUPERVISED TRANSFER LEARNING the authors used conditional Kolmogorov complexity to measure relatedness between tasks and transfer the"right Definition 4(Unsupervised Transfer Learning). Given a amount of information in a sequential transfer learning task source domain Ds with a learning task T s, a target domain Dr under a Bayesian framework and a corresponding learning task Tr, unsupervised transfer Recently, Eaton et al. [69] proposed a novel graph-based learning aims to help improve the learning of the target method for knowledge transfer, where the relationships predictive function fr(- )in Dr using the knowledge in Ds and between source tasks are modeled by embedding the set of S, where Ts f Tr and ]'s and ]r are not observable arned source models in a graph using transferability as the Transferring to a new task ey mapping Based on the definition of the unsupervised transfer problem into the graph and then learning a function on this learnng setting no labeled data are observed in the source graph that automatically determines the parameters to and target domains in training. So far, there is little research transfer to the new learning task work on this setting. Recently, Self-taught clustering(StCh Negative transfer happens when the source domain data [26] and transferred discriminative analysis(TDA)[27 and task contribute to the reduced performance of learning algorithms are proposed to transfer clustering and transfer in the target domain Despite the fact that how to avoid dimensionality reduction problems, respectively negative transfer is a very important issue, little research 5.1 Transferring Knowledge of Feature work has been published on this topic. Rosenstein et al. 70J Representations empirically showed that if two tasks are too dissimilar, then Dai et al. [26] studied a new case of clustering problems brute-force transfer may hurt the performance of the target known as self-taught clustering. Self-taught clustering is an task. Some works have been exploited to analyze related- instance of unsupervised transfer learning, which aims at ness among tasks and task clustering techniques, such as clustering a small collection of unlabeled data in the [71 ,[72], which may help provide guidance on how to target domain with the help of a large amount of avoid negative transfer automatically. Bakker and Heskes unlabeled data in the source domain STC tries to learn 72 adopted a Bayesian approach in which some of the a common feature space across domains, which helps in model parameters are shared for all tasks and others more clustering in the target domain. The objective function of loosely connected through a joint prior distribution that can StC is shown as follows be learned from the data. Thus, the data are clustered based on the task parameters where tasks in the same cluster are 1(1,2)-1(X,2)-A(X,2-s,2, supposed to be related to each other. Argyriou et al. 73 considered situations in which the learning tasks can be divided into groups. 
5 UNSUPERVISED TRANSFER LEARNING

Definition 4 (Unsupervised Transfer Learning). Given a source domain D_S with a learning task T_S, a target domain D_T and a corresponding learning task T_T, unsupervised transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where T_S ≠ T_T and Y_S and Y_T are not observable. (In unsupervised transfer learning, the predicted labels are latent variables, such as clusters or reduced dimensions.)

Based on the definition of the unsupervised transfer learning setting, no labeled data are observed in the source and target domains in training. So far, there is little research work on this setting. Recently, the Self-Taught Clustering (STC) [26] and Transferred Discriminative Analysis (TDA) [27] algorithms were proposed to address transfer clustering and transfer dimensionality reduction problems, respectively.

5.1 Transferring Knowledge of Feature Representations

Dai et al. [26] studied a new case of clustering problems, known as self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning, which aims at clustering a small collection of unlabeled data in the target domain with the help of a large amount of unlabeled data in the source domain. STC tries to learn a common feature space across domains, which helps in clustering in the target domain. The objective function of STC is shown as follows:

  J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}) = \big[ I(X_T, Z) - I(\tilde{X}_T, \tilde{Z}) \big] + \lambda \big[ I(X_S, Z) - I(\tilde{X}_S, \tilde{Z}) \big],        (7)

where X_S and X_T are the source and target domain data, respectively. Z is a feature space shared by X_S and X_T, and I(·,·) is the mutual information between two random variables. Suppose that there exist three clustering functions C_{X_T}: X_T → \tilde{X}_T, C_{X_S}: X_S → \tilde{X}_S, and C_Z: Z → \tilde{Z}, where \tilde{X}_T, \tilde{X}_S, and \tilde{Z} are the corresponding clusters of X_T, X_S, and Z, respectively. The goal of STC is to learn \tilde{X}_T by solving the optimization problem

  \arg\min_{\tilde{X}_T, \tilde{X}_S, \tilde{Z}} \; J(\tilde{X}_T, \tilde{X}_S, \tilde{Z}).        (8)

An iterative algorithm for solving the optimization function (8) was given in [26].

Similarly, Wang et al. [27] proposed a TDA algorithm to solve the transfer dimensionality reduction problem. TDA first applies clustering methods to generate pseudoclass labels for the target unlabeled data. It then applies dimensionality reduction methods to the target data and labeled source data to reduce the dimensions. These two steps run iteratively to find the best subspace for the target data.

6 TRANSFER BOUNDS AND NEGATIVE TRANSFER

An important issue is to recognize the limit of the power of transfer learning. In [68], Mahmud and Ray analyzed the case of transfer learning using Kolmogorov complexity, where some theoretical bounds are proved. In particular, the authors used conditional Kolmogorov complexity to measure relatedness between tasks and transfer the "right" amount of information in a sequential transfer learning task under a Bayesian framework.

Recently, Eaton et al. [69] proposed a novel graph-based method for knowledge transfer, where the relationships between source tasks are modeled by embedding the set of learned source models in a graph using transferability as the metric. Transferring to a new task proceeds by mapping the problem into the graph and then learning a function on this graph that automatically determines the parameters to transfer to the new learning task.

Negative transfer happens when the source domain data and task contribute to reduced performance of learning in the target domain. Despite the fact that how to avoid negative transfer is a very important issue, little research work has been published on this topic. Rosenstein et al. [70] empirically showed that if two tasks are too dissimilar, then brute-force transfer may hurt the performance of the target task. Some works have analyzed relatedness among tasks and task clustering techniques, such as [71], [72], which may help provide guidance on how to avoid negative transfer automatically. Bakker and Heskes [72] adopted a Bayesian approach in which some of the model parameters are shared for all tasks and others are more loosely connected through a joint prior distribution that can be learned from the data. Thus, the data are clustered based on the task parameters, where tasks in the same cluster are supposed to be related to each other. Argyriou et al. [73] considered situations in which the learning tasks can be divided into groups. Tasks within each group are related by sharing a low-dimensional representation, which differs among different groups. As a result, tasks within a group can find it easier to transfer useful knowledge.

7 APPLICATIONS OF TRANSFER LEARNING

Recently, transfer learning techniques have been applied successfully in many real-world applications. Raina et al. [74] and Dai et al. [36], [28] proposed to use transfer learning techniques to learn text data across domains. Blitzer et al. [38] proposed to use SCL for solving NLP problems, and an extension of SCL was proposed in [8] for solving sentiment classification problems. Wu and Dietterich [53] proposed to use both inadequate target domain data and plenty of low-quality source domain data for image classification problems. Arnold et al. [58] proposed to use transductive transfer learning methods to solve named-entity recognition problems. In [75], [76], [78], [79], transfer learning techniques are proposed to extract knowledge from WiFi localization models across time periods, space, and mobile devices, to benefit WiFi localization…
