# Introduction to Statistical Learning Theory

Bousquet, Boucheron & Lugosi
It turns out that there are many ways to do so, but no best one. For example, in physics, people tend to prefer models which have a small number of constants and which correspond to simple mathematical formulas. Often, the length of description of a model in a coding language can be an indication of its complexity. In classical statistics, the number of free parameters of a model is usually a measure of its complexity. Surprising as it may seem, there is no universal way of measuring simplicity (or its counterpart, complexity), and the choice of a specific measure inherently depends on the problem at hand. It is actually in this choice that the designer of the learning algorithm introduces knowledge about the specific phenomenon under study.

This lack of a universally best choice can actually be formalized in what is called the *No Free Lunch theorem*, which in essence says that, if there is no assumption on how the past (i.e. training data) is related to the future (i.e. test data), prediction is impossible. Even more, if there is no a priori restriction on the possible phenomena that are expected, it is impossible to generalize, and there is thus no algorithm better than any other (any algorithm would be beaten by another one on some phenomenon).

Hence the need to make assumptions, like the fact that the phenomenon we observe can be explained by a simple model. However, as we said, simplicity is not an absolute notion, and this leads to the statement that data cannot replace knowledge, or in pseudo-mathematical terms:

$$\text{Generalization} = \text{Data} + \text{Knowledge}$$

### 1.2 Assumptions

We now make more precise the assumptions that are made by the Statistical Learning Theory framework. Indeed, as we said before, we need to assume that the future (i.e. test) observations are related to the past (i.e. training) ones, so that the phenomenon is somewhat stationary. At the core of the theory is a probabilistic model of the phenomenon (or data generation process).
Within this model, the relationship between past and future observations is that they both are sampled independently from the same distribution (i.i.d.). The independence assumption means that each new observation yields maximum information. The identical distribution means that the observations give information about the underlying phenomenon (here a probability distribution).

An immediate consequence of this very general setting is that one can construct algorithms (e.g. $k$-nearest neighbors with appropriate $k$) that are *consistent*, which means that, as one gets more and more data, the predictions of the algorithm are closer and closer to the optimal ones. So this seems to indicate that we can have some sort of universal algorithm. Unfortunately, any (consistent) algorithm can have an arbitrarily bad behavior when given a finite training set. These notions are formalized in Appendix B.

Again, this discussion indicates that generalization can only come when one adds specific knowledge to the data. Each learning algorithm encodes specific knowledge (or a specific assumption about what the optimal classifier looks like), and works best when this assumption is satisfied by the problem to which it is applied.

**Bibliographical remarks.** Several textbooks, surveys, and research monographs have been written on pattern classification and statistical learning theory. A partial list includes Anthony and Bartlett [2], Breiman, Friedman, Olshen and Stone [3], Devroye, Györfi, and Lugosi [4], Duda and Hart [5], Fukunaga [6], Kearns and Vazirani [7], Kulkarni, Lugosi, and Venkatesh [8], Lugosi [9], McLachlan [10], Mendelson [11], Natarajan [12], Vapnik [13, 14, 1], and Vapnik and Chervonenkis [15].

## 2 Formalization

We consider an input space $\mathcal{X}$ and an output space $\mathcal{Y}$. Since we restrict ourselves to binary classification, we choose $\mathcal{Y} = \{-1, 1\}$. Formally, we assume that the pairs $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ are random variables distributed according to an unknown distribution $P$.
We observe a sequence of $n$ i.i.d. pairs $(X_i, Y_i)$ sampled according to $P$, and the goal is to construct a function $g : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$.

We need a criterion to choose this function $g$. This criterion is a low probability of error $P(g(X) \neq Y)$. We thus define the *risk* of $g$ as

$$R(g) = P(g(X) \neq Y) = \mathbb{E}\left[\mathbf{1}_{g(X) \neq Y}\right].$$

Notice that $P$ can be decomposed as $P_X \times P(Y|X)$. We introduce the *regression function* $\eta(x) = \mathbb{E}[Y|X=x] = 2\,\mathbb{P}[Y=1|X=x] - 1$ and the *target function* (or Bayes classifier) $t(x) = \operatorname{sgn} \eta(x)$. This function achieves the minimum risk over all possible measurable functions:

$$R(t) = \inf_g R(g).$$

We will denote the value $R(t)$ by $R^*$, called the *Bayes risk*. In the deterministic case, one has $Y = t(X)$ almost surely ($\mathbb{P}[Y=1|X] \in \{0, 1\}$) and $R^* = 0$. In the general case we can define the *noise level* as $s(x) = \min(\mathbb{P}[Y=1|X=x],\, 1 - \mathbb{P}[Y=1|X=x]) = (1 - |\eta(x)|)/2$ (with $s(X) = 0$ almost surely in the deterministic case), and this gives $R^* = \mathbb{E}[s(X)]$.

Our goal is thus to identify this function $t$, but since $P$ is unknown we cannot directly measure the risk, and we also cannot know directly the value of $t$ at the data points. We can only measure the agreement of a candidate function with the data. This is called the *empirical risk*:

$$R_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{g(X_i) \neq Y_i}.$$

It is common to use this quantity as a criterion to select an estimate of $t$.

### 2.1 Algorithms

Now that the goal is clearly specified, we review the common strategies to (approximately) achieve it. We denote by $g_n$ the function returned by the algorithm.

Because one cannot compute $R(g)$ but only approximate it by $R_n(g)$, it would be unreasonable to look for the function minimizing $R_n(g)$ among all possible functions. Indeed, when the input space is infinite, one can always construct a function $g_n$ which perfectly predicts the labels of the training data (i.e. $g_n(X_i) = Y_i$ and $R_n(g_n) = 0$) but behaves on the other points as the opposite of the target function $t$, i.e. $g_n(X) = -t(X)$, so that $R(g_n) = 1$. So one would have minimum empirical risk but maximum risk. It is thus necessary to prevent this *overfitting* situation.
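Before turning to remedies, the basic quantities defined above can be checked numerically. The following sketch uses a toy distribution of our own choosing (nothing here comes from the text): $X$ uniform on $[-1, 1]$, $\mathbb{P}[Y=1|X=x] = 0.8$ for $x > 0$ and $0.2$ otherwise, so that $t(x) = \operatorname{sgn}(x)$, the noise level is $s(x) = 0.2$ everywhere, and $R^* = 0.2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P on X x {-1, 1}:
# X ~ Uniform[-1, 1], and P[Y = 1 | X = x] = 0.8 if x > 0 else 0.2.
def sample(n):
    x = rng.uniform(-1, 1, n)
    p_pos = np.where(x > 0, 0.8, 0.2)
    y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
    return x, y

# Bayes classifier t(x) = sgn(eta(x)) = sgn(2 P[Y=1|X=x] - 1) = sgn(x) here.
def bayes(x):
    return np.where(x > 0, 1, -1)

# Empirical risk R_n(g) = (1/n) sum 1_{g(X_i) != Y_i}.
def empirical_risk(g, x, y):
    return np.mean(g(x) != y)

x, y = sample(100_000)
print(empirical_risk(bayes, x, y))  # close to the Bayes risk R* = E[s(X)] = 0.2
```

With a large sample, the empirical risk of the Bayes classifier concentrates around $R^*$; no classifier can do better in expectation on this distribution.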
There are essentially two ways to prevent this overfitting (and they can be combined): the first is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for "complicated" functions).

**Empirical Risk Minimization.** This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model $\mathcal{G}$ of possible functions and to minimize the empirical risk in that model:

$$g_n = \arg\min_{g \in \mathcal{G}} R_n(g).$$

Of course, this will work best when the target function belongs to $\mathcal{G}$. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.

**Structural Risk Minimization.** The idea here is to choose an infinite sequence $\{\mathcal{G}_d : d = 1, 2, \ldots\}$ of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

$$g_n = \arg\min_{g \in \mathcal{G}_d,\, d \in \mathbb{N}} R_n(g) + \operatorname{pen}(d, n).$$

The penalty $\operatorname{pen}(d, n)$ gives preference to models where estimation error is small and measures the size or *capacity* of the model.

**Regularization.** Another approach, usually easier to implement, consists in choosing a large model $\mathcal{G}$ (possibly dense in the set of continuous functions, for example) and defining on $\mathcal{G}$ a *regularizer*, typically a norm $\|g\|$. Then one has to minimize the *regularized empirical risk*:

$$g_n = \arg\min_{g \in \mathcal{G}} R_n(g) + \lambda \|g\|^2.$$

(Strictly speaking, the overfitting construction described above is only possible if the probability distribution satisfies some mild conditions, e.g. has no atoms. Otherwise, it may not be possible to achieve $R(g_n) = 1$, but even in this case, provided the support of $P$ contains infinitely many points, a similar phenomenon occurs.)

Compared to SRM, there is here a free parameter $\lambda$, called the *regularization parameter*, which allows one to choose the right trade-off between fit and complexity. Tuning $\lambda$ is usually a hard problem, and most often one uses extra validation data for this task. Most existing (and successful) methods can be thought of as regularization methods.

**Normalized Regularization.** There are other possible approaches when the regularizer can, in some sense, be "normalized", i.e. when it corresponds to some probability distribution over $\mathcal{G}$. Given a probability distribution $\pi$ defined on $\mathcal{G}$ (usually called a *prior*), one can use as a regularizer $-\log \pi(g)$. Reciprocally, from a regularizer of the form $\|g\|^2$, if there exists a measure $\mu$ on $\mathcal{G}$ such that $\int e^{-\lambda \|g\|^2}\, d\mu(g) < \infty$ for some $\lambda > 0$, then one can construct a prior corresponding to this regularizer. For example, if $\mathcal{G}$ is the set of hyperplanes in $\mathbb{R}^d$ going through the origin, $\mathcal{G}$ can be identified with $\mathbb{R}^d$ and, taking $\mu$ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on $\mathbb{R}^d$ as a prior.

This type of normalized regularizer, or prior, can be used to construct another probability distribution $\rho$ on $\mathcal{G}$ (usually called the *posterior*), as

$$\rho(g) = \frac{e^{-\gamma R_n(g)}\, \pi(g)}{Z(\gamma)},$$

where $\gamma \geq 0$ is a free parameter and $Z(\gamma)$ is a normalization factor. There are several ways in which this $\rho$ can be used. If we take the function maximizing it, we recover regularization, as

$$\arg\max_{g} \rho(g) = \arg\min_{g}\, \gamma R_n(g) - \log \pi(g),$$

where the regularizer is $-\gamma^{-1} \log \pi(g)$. Also, $\rho$ can be used to randomize the predictions: in that case, before computing the predicted label for an input $x$, one samples a function $g$ according to $\rho$ and outputs $g(x)$.
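For a finite class this posterior, and the two ways of using it, can be sketched in a few lines. Everything below (the threshold class, uniform prior, temperature $\gamma = 20$, and toy data) is an illustrative assumption of ours, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: noiseless threshold problem, t(x) = sgn(x - 0.2).
x = rng.uniform(-1, 1, 100)
y = np.sign(x - 0.2)

thetas = np.linspace(-1, 1, 101)               # finite class G: g_theta(x) = sgn(x - theta)
prior = np.full(len(thetas), 1 / len(thetas))  # uniform prior pi on G
gamma = 20.0                                   # free temperature parameter

emp_risk = np.array([np.mean(np.sign(x - t) != y) for t in thetas])

# Posterior rho(g) = exp(-gamma * R_n(g)) * pi(g) / Z(gamma).
rho = np.exp(-gamma * emp_risk) * prior
rho /= rho.sum()

def randomized_predict(x_new):
    """Sample g ~ rho, then output g(x_new)."""
    theta = rng.choice(thetas, p=rho)
    return np.sign(x_new - theta)

def averaged_predict(x_new):
    """Output sgn(E_rho[g(x_new)]), the expected prediction under rho."""
    return np.sign(np.sum(rho * np.sign(x_new - thetas)))

print(averaged_predict(0.8), averaged_predict(-0.8))
```

Because $\rho$ concentrates on thresholds with small empirical risk (near $0.2$ here), both prediction rules agree with the target away from the threshold.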
This procedure is usually called *Gibbs classification*. Another way in which the distribution $\rho$ constructed above can be used is to take the expected prediction of the functions in $\mathcal{G}$:

$$g_n(x) = \operatorname{sgn}\left(\mathbb{E}_\rho[g(x)]\right).$$

This is typically called *Bayesian averaging*.

(Notes: using $-\log \pi(g)$ as a regularizer is fine when $\mathcal{G}$ is countable; in the continuous case, one has to consider the density associated to $\pi$, and we omit these details. Generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care; one can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space. Note also that minimizing $\gamma R_n(g) - \log \pi(g)$ is equivalent to minimizing $R_n(g) - \gamma^{-1} \log \pi(g)$.)

At this point we have to insist again on the fact that the choice of the class $\mathcal{G}$, and of the associated regularizer or prior, has to come from a priori knowledge about the task at hand, and there is no universally best choice.

### 2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ and produces a function $g_n$ which depends on this data. We want to estimate the risk of $g_n$. However, $R(g_n)$ is a random variable (since it depends on the data), and it cannot be computed from the data (since it also depends on the unknown $P$). Estimates of $R(g_n)$ thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model $\mathcal{G}$, it is possible, by introducing the best function $g^*$ in $\mathcal{G}$, with $R(g^*) = \inf_{g \in \mathcal{G}} R(g)$, to write

$$R(g_n) - R^* = \left[R(g^*) - R^*\right] + \left[R(g_n) - R(g^*)\right].$$

The first term on the right hand side is usually called the *approximation error*, and measures how well functions in $\mathcal{G}$ can approach the target (it would be zero if $t \in \mathcal{G}$).
The second term, called the *estimation error*, is a random quantity (it depends on the data) and measures how close $g_n$ is to the best possible choice in $\mathcal{G}$.

Estimating the approximation error is usually hard, since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of $R^*$, or on the noise function $s$.

It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error. (For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size; this is the case, for example, of Structural Risk Minimization or regularization based algorithms.)

Another possible decomposition of the risk is the following:

$$R(g_n) = R_n(g_n) + \left[R(g_n) - R_n(g_n)\right].$$

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) $R(g_n) - R_n(g_n)$.

To summarize, we write the three types of results we may be interested in:

- *Error bound*: $R(g_n) \leq R_n(g_n) + B(n, \mathcal{G})$. This corresponds to the estimation of the risk from an empirical quantity.
- *Error bound relative to the best in the class*: $R(g_n) \leq R(g^*) + B(n, \mathcal{G})$. This tells how "optimal" the algorithm is, given the model it uses.
- *Error bound relative to the Bayes risk*: $R(g_n) \leq R^* + B(n, \mathcal{G})$. This gives theoretical guarantees on the convergence to the Bayes risk.

## 3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds).
The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

### 3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk $R(g_n) = \mathbb{E}\left[\mathbf{1}_{g_n(X) \neq Y}\right]$ of the function $g_n$ returned by the algorithm after seeing the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. This quantity cannot be observed ($P$ is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to relate it to an estimate such as the empirical risk $R_n(g_n)$. This relationship can take the form of upper and lower bounds for

$$\mathbb{P}\left[R(g_n) - R_n(g_n) > \varepsilon\right].$$

For convenience, let $Z_i = (X_i, Y_i)$ and $Z = (X, Y)$. Given $\mathcal{G}$, define the *loss class*

$$\mathcal{F} = \left\{ f : (x, y) \mapsto \mathbf{1}_{g(x) \neq y} \,:\, g \in \mathcal{G} \right\}. \tag{1}$$

Notice that $\mathcal{G}$ contains functions with range in $\{-1, 1\}$ while $\mathcal{F}$ contains non-negative functions with range in $\{0, 1\}$. In the remainder of the tutorial, we will go back and forth between $\mathcal{F}$ and $\mathcal{G}$ (as there is a bijection between them), sometimes stating the results in terms of functions in $\mathcal{F}$ and sometimes in terms of functions in $\mathcal{G}$. It will be clear from the context which classes $\mathcal{G}$ and $\mathcal{F}$ we refer to, and $\mathcal{F}$ will always be derived from the last mentioned class $\mathcal{G}$ in the way of (1).

We use the shorthand notation $Pf = \mathbb{E}[f(X, Y)]$ and $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i, Y_i)$. $P_n$ is usually called the *empirical measure* associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

$$P f_n - P_n f_n. \tag{2}$$

An *empirical process* is a collection of random variables indexed by a class of functions, such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data):

$$\left\{ Pf - P_n f \right\}_{f \in \mathcal{F}}.$$

One of the most studied quantities associated to empirical processes is their supremum:

$$\sup_{f \in \mathcal{F}} Pf - P_n f.$$

It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2).
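For a finite class this supremum can be simulated directly. The sketch below (toy class and distribution are our own illustrative assumptions) draws one sample and evaluates $\sup_{f \in \mathcal{F}} (Pf - P_n f)$; the true risks are available in closed form here because $P$ is known:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy deterministic problem: t(x) = sgn(x), X ~ Uniform[-1, 1].
thetas = np.linspace(-1, 1, 51)   # finite class G of thresholds g_theta(x) = sgn(x - theta)

# True risk P f_theta = P[sgn(X - theta) != sgn(X)] = |theta| / 2 for this distribution.
true_risk = np.abs(thetas) / 2

x = rng.uniform(-1, 1, 50)        # one training sample of size n = 50
y = np.sign(x)
emp_risk = np.array([np.mean(np.sign(x - t) != y) for t in thetas])

# Supremum over the class of (P f - P_n f): the uniform deviation.
sup_dev = np.max(true_risk - emp_risk)
print(sup_dev)
```

Note that `sup_dev` is a random variable: rerunning with another sample gives another value, which is exactly why the theory studies its distribution and supremum bounds rather than a single draw.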
This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

### 3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in, as follows:

$$R(g) - R_n(g) = \mathbb{E}[f(Z)] - \frac{1}{n}\sum_{i=1}^n f(Z_i).$$

It is easy to recognize here the difference between the expectation and the empirical average of the random variable $f(Z)$. By the law of large numbers, we immediately obtain that

$$\mathbb{P}\left[\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n f(Z_i) - \mathbb{E}[f(Z)] = 0\right] = 1.$$

This indicates that with enough samples, the empirical risk of a function is a good approximation to its true risk. It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

**Theorem 1 (Hoeffding).** *Let $Z_1, \ldots, Z_n$ be $n$ i.i.d. random variables with $f(Z) \in [a, b]$. Then for all $\varepsilon > 0$, we have*

$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n f(Z_i) - \mathbb{E}[f(Z)]\right| > \varepsilon\right] \leq 2 \exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by $\delta$. Then

$$\mathbb{P}\left[\left|P_n f - P f\right| > (b-a)\sqrt{\frac{\log \frac{2}{\delta}}{2n}}\right] \leq \delta,$$

or (by inversion, see Appendix A), with probability at least $1 - \delta$,

$$\left|P_n f - P f\right| \leq (b-a)\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.$$

Applying this to $f(z) = \mathbf{1}_{g(x) \neq y}$, we get that for any $g$, and any $\delta > 0$, with probability at least $1 - \delta$,

$$R(g) \leq R_n(g) + \sqrt{\frac{\log \frac{2}{\delta}}{2n}}. \tag{3}$$

Notice that one has to consider a fixed function $g$, and the probability is with respect to the sampling of the data; if the function depends on the data, this does not apply.

### 3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function $f \in \mathcal{F}$, there is a set $S$ of samples for which $Pf - P_n f \leq \sqrt{\frac{\log \frac{2}{\delta}}{2n}}$ (and this set of samples has measure $\mathbb{P}[S] \geq 1 - \delta$). However, these sets $S$ may be different for different functions.
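For a single fixed function, the guarantee does hold, and is easy to check by simulation. In this sketch all numbers are illustrative assumptions of ours: the 0/1 losses of a fixed $g$ with true risk $0.3$ are Bernoulli variables, and we count how often the deviation exceeds the bound at $\delta = 0.05$ over many independent training samples:

```python
import numpy as np

rng = np.random.default_rng(4)

n, trials, delta = 100, 2000, 0.05
p_error = 0.3                                   # true risk R(g) of some fixed g
bound = np.sqrt(np.log(2 / delta) / (2 * n))    # (b - a) = 1 for the 0/1 loss

# Each row is one training sample; entries are the losses 1_{g(X_i) != Y_i}.
losses = rng.uniform(size=(trials, n)) < p_error
deviations = np.abs(losses.mean(axis=1) - p_error)

# Hoeffding: the deviation exceeds the bound with probability at most delta.
exceed_rate = (deviations > bound).mean()
print(exceed_rate)  # well below delta = 0.05 (Hoeffding is conservative)
```

The simulation illustrates both the guarantee and its slack: the bound holds for this fixed $g$ far more often than $1 - \delta$, but it says nothing yet about all functions simultaneously.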
In other words, for the observed sample, only some of the functions in $\mathcal{F}$ will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following: if we take for $\mathcal{G}$ the class of all $\{-1, 1\}$-valued (measurable) functions, then for any fixed sample, there exists a function $f \in \mathcal{F}$ such that $Pf - P_n f = 1$. To see this, take the function $g$ which satisfies $g(X_i) = Y_i$ on the data and predicts the opposite of the target everywhere else, so that $P_n f = 0$ while $Pf = 1$. This does not contradict Hoeffding's inequality, but it shows that the inequality does not yield what we need.

Figure 2 illustrates the above argumentation; the horizontal axis corresponds …

*Fig. 2. Convergence of the empirical risk to the true risk over the class of functions.*
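The pathological function in this argument can be exhibited directly. In the sketch below (toy deterministic distribution, our own assumption), a classifier that memorizes the 20 training points and flips the target everywhere else has $P_n f = 0$ while $P f \approx 1$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Deterministic target t(x) = sgn(x), X ~ Uniform[-1, 1].
x_train = rng.uniform(-1, 1, 20)
y_train = np.sign(x_train)

def g_bad(x):
    """Memorize the training labels, predict the opposite of t elsewhere."""
    x = np.atleast_1d(x)
    out = -np.sign(x)                      # wrong everywhere...
    for xi, yi in zip(x_train, y_train):
        out[np.isclose(x, xi)] = yi        # ...except on the sample itself
    return out

emp_risk = np.mean(g_bad(x_train) != y_train)          # P_n f = 0
x_test = rng.uniform(-1, 1, 10_000)
true_risk = np.mean(g_bad(x_test) != np.sign(x_test))  # P f close to 1
print(emp_risk, true_risk)
```

Hoeffding's bound (3) is not violated, because this $g$ was chosen *after* seeing the data; the bound only covers functions fixed in advance, which is exactly the gap the uniform bounds of the next sections close.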
