Canonical correlation analysis: An overview with application to learning methods

1 Introduction

During recent years there have been advances in data learning using kernel methods. Kernel representations offer an alternative to learning non-linear functions by projecting the data into a high-dimensional feature space to increase the computational power of linear learning machines, though this still leaves open the issue of how best to choose the features or the kernel function in ways that will improve performance. We review some of the methods that have been developed for learning the feature space.

Principal Component Analysis (PCA) is a multivariate data analysis procedure that involves a transformation of a number of possibly correlated variables into a smaller number of uncorrelated variables known as principal components. PCA only makes use of the training inputs while making no use of the labels.

Independent Component Analysis (ICA), in contrast to correlation-based transformations such as PCA, not only decorrelates the signals but also reduces higher-order statistical dependencies, attempting to make the signals as independent as possible. In other words, ICA is a way of finding a linear, not only orthogonal, co-ordinate system in any multivariate data. The directions of the axes of this co-ordinate system are determined by both the second- and higher-order statistics of the original data. The goal is to perform a linear transform which makes the resulting variables as statistically independent from each other as possible.

Partial Least Squares (PLS) is a method similar to canonical correlation analysis. It selects feature directions that are useful for the task at hand, though PLS uses only one view of an object and the label as the corresponding pair. PLS can be thought of as a method which looks for directions that are good at distinguishing the different labels.

Canonical Correlation Analysis (CCA) is a method of correlating linear relationships between two multidimensional variables.
CCA can be seen as using complex labels as a way of guiding feature selection towards the underlying semantics. CCA makes use of two views of the same semantic object to extract a representation of the semantics. The main difference between CCA and the other three methods is that CCA is closely related to mutual information (Borga 1998 [3]). Hence CCA can be easily motivated in information-based tasks and is our natural selection.

Proposed by H. Hotelling in 1936 [12], CCA can be seen as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximised. In an attempt to increase the flexibility of the feature selection, a kernelisation of CCA (KCCA) has been applied to map the hypotheses to a higher-dimensional feature space. KCCA has been applied in preliminary work by Fyfe & Lai [8] and Akaho [1], and more recently by Vinokourov et al. [19] with improved results.

During recent years there has been a vast increase in the amount of multimedia content available both off-line and online, though we are unable to access or make use of this data unless it is organised in such a way as to allow efficient browsing. To enable content-based retrieval with no reference to labelling, we attempt to learn the semantic representation of images and their associated text. We present a general approach using KCCA that can be used for content-based [11] as well as mate-based retrieval [18, 11]. In both cases we compare the KCCA approach to the Generalised Vector Space Model (GVSM),
which aims at capturing some term-term correlations by looking at co-occurrence information.

This study aims to serve as a tutorial and to give additional novel contributions in the following ways.

We follow the work of Borga [4] in representing the eigenproblem as two eigenvalue equations, as this allows us to reduce the computation time and the dimensionality of the eigenvectors.

Further to that, we follow the idea of Bach & Jordan [2] to compute a new correlation matrix with reduced dimensionality. Though Bach & Jordan [2] address a very different problem, they use the same underlying technique of Cholesky decomposition to re-represent the kernel matrices. We show that using partial Gram-Schmidt orthogonalisation [6] is equivalent to incomplete Cholesky decomposition, in the sense that incomplete Cholesky decomposition can be seen as a dual implementation of partial Gram-Schmidt.

We show that the general approach can be adapted to two different types of problems, content and mate retrieval, by changing only the selection of eigenvectors used in the semantic projection.

To simplify the learning of the KCCA we explore a method of selecting the regularisation parameter a priori such that it gives a value that performs well in several different tasks.

In this study we also present a generalisation of the framework for canonical correlation analysis. Our approach is based on the works of Gifi (1990) and Kettenring (1971). The purpose of the generalisation is to extend canonical correlation as an associativity measure between two sets of variables to more than two sets, whilst preserving most of its properties. The generalisation starts with the optimisation problem formulation of canonical correlation; by changing the objective function we arrive at the multi-set problem.
Applying similar constraint sets in the optimisation problems, we find that the feasible solutions are singular vectors of matrices which are derived in the same way for the original and the generalised problem.

In Section 2 we present the theoretical background of CCA. In Section 3 we present the CCA and KCCA algorithms. Approaches to deal with the computational problems that arise in Section 3 are presented in Section 4. Our experimental results are presented in Section 5. In Section 6 we present the generalisation framework for CCA, while Section 7 draws final conclusions.

2 Theoretical foundations

Proposed by H. Hotelling in 1936 [12], canonical correlation analysis can be seen as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables onto these basis vectors are mutually maximised. Correlation analysis is dependent on the co-ordinate system in which the variables are described, so even if there is a very strong linear relationship between two sets of multidimensional variables, depending on the co-ordinate system used this relationship might not be visible as a correlation. Canonical correlation analysis seeks a pair of linear transformations, one for each of the sets of variables, such that when the sets of variables are transformed the corresponding co-ordinates are maximally correlated.

Consider a multivariate random vector of the form (x, y). Suppose we are given a sample of instances S = ((x_1, y_1), ..., (x_n, y_n)) of (x, y); we use S_x to denote (x_1, ..., x_n) and similarly S_y to denote (y_1, ..., y_n). We can consider defining a new co-ordinate for x by choosing a direction w_x and projecting x onto that direction,

    x \mapsto \langle w_x, x \rangle.

If we do the same for y by choosing a direction w_y, we obtain a sample of the new x co-ordinate as

    S_{x,w_x} = (\langle w_x, x_1 \rangle, ..., \langle w_x, x_n \rangle),

with the corresponding values of the new y co-ordinate being

    S_{y,w_y} = (\langle w_y, y_1 \rangle, ..., \langle w_y, y_n \rangle).

The first stage of canonical correlation is to choose w_x and w_y to maximise the correlation between the two vectors. In other words, the function to be maximised is

    \rho = \max_{w_x, w_y} \mathrm{corr}(S_x w_x, S_y w_y) = \max_{w_x, w_y} \frac{\langle S_x w_x, S_y w_y \rangle}{\| S_x w_x \| \, \| S_y w_y \|}.

If we use \hat{E}[f(x, y)] to denote the empirical expectation of the function f(x, y), where

    \hat{E}[f(x, y)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i, y_i),

we can rewrite the correlation expression as

    \rho = \max_{w_x, w_y} \frac{\hat{E}[\langle w_x, x \rangle \langle w_y, y \rangle]}{\sqrt{\hat{E}[\langle w_x, x \rangle^2] \, \hat{E}[\langle w_y, y \rangle^2]}} = \max_{w_x, w_y} \frac{w_x' \hat{E}[x y'] w_y}{\sqrt{w_x' \hat{E}[x x'] w_x \; w_y' \hat{E}[y y'] w_y}},

where we use A' to denote the transpose of a vector or matrix A.

Now observe that the covariance matrix of (x, y) is

    C(x, y) = \hat{E}\left[ \binom{x}{y} \binom{x}{y}' \right] = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix} = C.    (2.1)

The total covariance matrix C is a block matrix where the within-sets covariance matrices are C_{xx} and C_{yy} and the between-sets covariance matrices are C_{xy} = C_{yx}'. Hence, we can rewrite the function \rho as

    \rho = \max_{w_x, w_y} \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}.    (2.2)

The maximum canonical correlation is the maximum of \rho with respect to w_x and w_y.

3 Algorithm

In this section we give an overview of the Canonical Correlation Analysis (CCA) and Kernel CCA (KCCA) algorithms, where we formulate the optimisation problem as a generalised eigenproblem.

3.1 Canonical Correlation Analysis

Observe that the solution of equation (2.2) is not affected by re-scaling w_x or w_y, either together or independently, so that, for example, replacing w_x by \alpha w_x gives the quotient

    \frac{\alpha w_x' C_{xy} w_y}{\sqrt{\alpha^2 w_x' C_{xx} w_x \; w_y' C_{yy} w_y}} = \frac{w_x' C_{xy} w_y}{\sqrt{w_x' C_{xx} w_x \; w_y' C_{yy} w_y}}.

Since the choice of re-scaling is therefore arbitrary, the CCA optimisation problem formulated in equation (2.2) is equivalent to maximising the numerator subject to

    w_x' C_{xx} w_x = 1,
    w_y' C_{yy} w_y = 1.

The corresponding Lagrangian is

    L(\lambda, w_x, w_y) = w_x' C_{xy} w_y - \frac{\lambda_x}{2}(w_x' C_{xx} w_x - 1) - \frac{\lambda_y}{2}(w_y' C_{yy} w_y - 1).

Taking derivatives with respect to w_x and w_y, we obtain

    \frac{\partial L}{\partial w_x} = C_{xy} w_y - \lambda_x C_{xx} w_x = 0,    (3.1)
    \frac{\partial L}{\partial w_y} = C_{yx} w_x - \lambda_y C_{yy} w_y = 0.    (3.2)

Subtracting w_y' times the second equation from w_x' times the first, we have

    0 = w_x' C_{xy} w_y - \lambda_x w_x' C_{xx} w_x - w_y' C_{yx} w_x + \lambda_y w_y' C_{yy} w_y,

which together with the constraints implies that \lambda_y - \lambda_x = 0; let \lambda = \lambda_x = \lambda_y. Assuming C_{yy} is invertible, we have

    w_y = \frac{C_{yy}^{-1} C_{yx} w_x}{\lambda},    (3.3)

and so substituting into equation (3.1) gives

    C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x.    (3.4)

We are left with a generalised eigenproblem of the form Ax = \lambda Bx. We can therefore find the co-ordinate system that optimises the correlation between corresponding co-ordinates by first solving for the generalised eigenvectors of equation (3.4) to obtain the sequence of w_x's and then using equation (3.3) to find the corresponding w_y's.

If C_{xx} is invertible we are able to formulate equation (3.4) as a standard eigenproblem, although to ensure a symmetric standard eigenproblem we do the following. As the covariance matrices C_{xx} and C_{yy} are symmetric positive definite, we are able to decompose them using a complete Cholesky decomposition (more details on Cholesky decomposition can be found in Section 4.2):

    C_{xx} = R_{xx} R_{xx}',

where R_{xx} is a lower triangular matrix. If we let u_x = R_{xx}' w_x, we are able to rewrite equation (3.4) as follows:

    C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 R_{xx} u_x,
    R_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} R_{xx}'^{-1} u_x = \lambda^2 u_x.

We are therefore left with a symmetric standard eigenproblem of the form Ax = \lambda x.

3.2 Kernel Canonical Correlation Analysis

CCA may not extract useful descriptors of the data because of its linearity. Kernel CCA offers an alternative solution by first projecting the data into a higher-dimensional feature space,

    \phi : x = (x_1, ..., x_n) \mapsto \phi(x) = (\phi_1(x), ..., \phi_N(x)) \quad (n < N),

before performing CCA in the new feature space, essentially moving from the primal to the dual representation approach. Kernels are methods of implicitly mapping data into a higher-dimensional feature space, a method known as the kernel trick.

Definition 1. \langle x, y \rangle denotes the Euclidean inner product of the vectors x and y, also written x'y. A kernel is a function K such that for all x, z \in X,

    K(x, z) = \langle \phi(x), \phi(z) \rangle,    (3.5)

where \phi is a mapping from X to a feature space F, \phi : X \to F.

Kernels offer a great deal of flexibility, as they can be generated from other kernels.
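As an illustrative sketch (not part of the original paper), the primal CCA algorithm of Section 3.1 can be checked numerically with NumPy: build the block covariances of equation (2.1), symmetrise the generalised eigenproblem (3.4) with a Cholesky factor of C_xx, and recover w_y via equation (3.3). The toy paired views below, sharing one latent signal, are an assumption for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired views: both views carry the same latent signal z plus noise.
n = 500
z = rng.standard_normal(n)                                   # shared latent variable
X = np.column_stack([z, rng.standard_normal(n)])             # view 1, shape (n, 2)
Y = np.column_stack([-z + 0.1 * rng.standard_normal(n),
                     rng.standard_normal(n)])                # view 2, shape (n, 2)
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# Block covariance matrices, as in equation (2.1).
Cxx = X.T @ X / n
Cyy = Y.T @ Y / n
Cxy = X.T @ Y / n

# Symmetrise the generalised eigenproblem (3.4) via Cxx = R R' (R lower triangular):
#   R^{-1} Cxy Cyy^{-1} Cyx R'^{-1} u = lambda^2 u,  with  w_x = R'^{-1} u.
R = np.linalg.cholesky(Cxx)
Rinv = np.linalg.inv(R)
M = Rinv @ Cxy @ np.linalg.solve(Cyy, Cxy.T) @ Rinv.T
evals, evecs = np.linalg.eigh(M)          # ascending; eigenvalues are lambda^2

lam = np.sqrt(evals[-1])                  # top canonical correlation
wx = Rinv.T @ evecs[:, -1]
wy = np.linalg.solve(Cyy, Cxy.T @ wx) / lam   # equation (3.3)

corr = np.corrcoef(X @ wx, Y @ wy)[0, 1]
print(round(lam, 3), round(abs(corr), 3))     # both near 1 for this strongly shared signal
```

The sample correlation of the projected co-ordinates agrees with the eigenvalue route, and the arbitrary signs of the latent signal in the two views are absorbed into the directions, as the re-scaling argument above suggests.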
In the kernel formulation the data only appears through entries in the Gram matrix; this approach therefore gives the further advantage that the number of tuneable parameters and the updating time do not depend on the number of attributes being used.

Using the definition of the covariance matrix in equation (2.1), we can rewrite the covariance matrix C using the data matrices (of vectors) X and Y, which have the sample vectors as rows and are therefore of size m x N; we obtain

    C_{xx} = X'X, \quad C_{xy} = X'Y.

The directions w_x and w_y (of length N) can be rewritten as projections of the data onto directions \alpha and \beta (of length m):

    w_x = X'\alpha, \quad w_y = Y'\beta.

Substituting into equation (2.2), we obtain the following:

    \rho = \max_{\alpha, \beta} \frac{\alpha' X X' Y Y' \beta}{\sqrt{\alpha' X X' X X' \alpha \; \beta' Y Y' Y Y' \beta}}.    (3.6)

Let K_x = XX' and K_y = YY' be the kernel matrices corresponding to the two representations. Substituting into equation (3.6), we have

    \rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{\alpha' K_x^2 \alpha \; \beta' K_y^2 \beta}}.    (3.7)

We find that in equation (3.7) the variables are now represented in the dual form.

Observe that, as with the primal form presented in equation (2.2), equation (3.7) is not affected by re-scaling of \alpha and \beta, either together or independently. Hence the KCCA optimisation problem formulated in equation (3.7) is equivalent to maximising the numerator subject to

    \alpha' K_x^2 \alpha = 1,
    \beta' K_y^2 \beta = 1.

The corresponding Lagrangian is

    L(\lambda, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}(\alpha' K_x^2 \alpha - 1) - \frac{\lambda_\beta}{2}(\beta' K_y^2 \beta - 1).

Taking derivatives with respect to \alpha and \beta, we obtain

    \frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha K_x^2 \alpha = 0,    (3.8)
    \frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta K_y^2 \beta = 0.    (3.9)

Subtracting \beta' times the second equation from \alpha' times the first, we have

    0 = \alpha' K_x K_y \beta - \lambda_\alpha \alpha' K_x^2 \alpha - \beta' K_y K_x \alpha + \lambda_\beta \beta' K_y^2 \beta,

which together with the constraints implies that \lambda_\alpha - \lambda_\beta = 0; let \lambda = \lambda_\alpha = \lambda_\beta. Considering the case where the kernel matrices K_x and K_y are invertible, we have

    \beta = \frac{K_y^{-1} K_x \alpha}{\lambda},

and substituting into equation (3.8) we obtain

    K_x K_y K_y^{-1} K_x \alpha - \lambda^2 K_x K_x \alpha = 0,

hence

    K_x K_x \alpha = \lambda^2 K_x K_x \alpha.    (3.10)

We are left with a generalised eigenproblem of the form Ax = \lambda x. We can deduce from equation (3.10) that \lambda = 1 for every vector \alpha; hence we can choose the projections \alpha to be the unit vectors j_i, i = 1, ..., m, while the corresponding \beta are the columns of K_y^{-1} K_x. Hence when K_x or K_y is invertible, perfect correlation can be formed.
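The conclusion that invertible kernel matrices force \lambda = 1 can be checked directly. The sketch below is illustrative, not from the paper: the Gaussian kernel, its width, and the data are assumptions, and kernel centring is omitted for brevity. It builds full-rank Gram matrices from two deliberately unrelated samples and shows that the dual projections K_x \alpha and K_y \beta are perfectly correlated anyway.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two small samples with no real relationship between the views.
n = 30
X = rng.standard_normal((n, 3))
Y = rng.standard_normal((n, 3))

def gaussian_kernel(A, sigma=1.0):
    # Pairwise squared distances, then the RBF kernel (full rank on distinct points).
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

Kx = gaussian_kernel(X)
Ky = gaussian_kernel(Y)

# As derived from (3.10): with Kx, Ky invertible, any alpha works, and
# beta = Ky^{-1} Kx alpha gives lambda = 1.
alpha = rng.standard_normal(n)
beta = np.linalg.solve(Ky, Kx @ alpha)

corr = np.corrcoef(Kx @ alpha, Ky @ beta)[0, 1]
print(round(corr, 6))    # 1.0: spurious perfect correlation on unrelated data
```

By construction K_y \beta equals K_x \alpha exactly, so the correlation is 1 whatever the data, which is precisely the overfitting problem the next section addresses.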
Since kernel methods provide high-dimensional representations, such independence is not uncommon. It is therefore clear that a naive application of CCA in a kernel-defined feature space will not provide useful results. In the next section we investigate how this problem can be avoided.

4 Computational Issues

We observe from equation (3.10) that if K_x is invertible, maximal correlation is obtained, suggesting learning is trivial. To force non-trivial learning we introduce a control on the flexibility of the projections by penalising the norms of the associated weight vectors by a convex combination of constraints based on Partial Least Squares. Another computational issue that can arise is the use of large training sets, as this can lead to computational problems and degeneracy. To overcome this issue we apply partial Gram-Schmidt orthogonalisation (equivalently, incomplete Cholesky decomposition) to reduce the dimensionality of the kernel matrices.

4.1 Regularisation

To force non-trivial learning on the correlation we introduce a control on the flexibility of the projection mappings, using Partial Least Squares (PLS) to penalise the norms of the associated weights. We convexly combine the PLS term with the KCCA term in the denominator of equation (3.7), obtaining

    \rho = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{(\alpha' K_x^2 \alpha + \kappa \| w_x \|^2)(\beta' K_y^2 \beta + \kappa \| w_y \|^2)}}
         = \max_{\alpha, \beta} \frac{\alpha' K_x K_y \beta}{\sqrt{(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha)(\beta' K_y^2 \beta + \kappa \beta' K_y \beta)}}.

We observe that the new regularised equation is not affected by re-scaling of \alpha or \beta, hence the optimisation problem is subject to

    \alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha = 1,
    \beta' K_y^2 \beta + \kappa \beta' K_y \beta = 1.

The corresponding Lagrangian is

    L(\lambda_\alpha, \lambda_\beta, \alpha, \beta) = \alpha' K_x K_y \beta - \frac{\lambda_\alpha}{2}(\alpha' K_x^2 \alpha + \kappa \alpha' K_x \alpha - 1) - \frac{\lambda_\beta}{2}(\beta' K_y^2 \beta + \kappa \beta' K_y \beta - 1).

Taking derivatives with respect to \alpha and \beta:

    \frac{\partial L}{\partial \alpha} = K_x K_y \beta - \lambda_\alpha (K_x^2 \alpha + \kappa K_x \alpha),    (4.1)
    \frac{\partial L}{\partial \beta} = K_y K_x \alpha - \lambda_\beta (K_y^2 \beta + \kappa K_y \beta).    (4.2)

Subtracting \beta' times the second equation from \alpha' times the first, we have

    0 = \alpha' K_x K_y \beta - \lambda_\alpha \alpha' (K_x^2 \alpha + \kappa K_x \alpha) - \beta' K_y K_x \alpha + \lambda_\beta \beta' (K_y^2 \beta + \kappa K_y \beta)
      = \lambda_\beta \beta' (K_y^2 \beta + \kappa K_y \beta) - \lambda_\alpha \alpha' (K_x^2 \alpha + \kappa K_x \alpha),

which together with the constraints implies that \lambda_\alpha = \lambda_\beta.
