Cover, T. M., and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. IT-13, no. 1, pp. 21-27, January 1967.

For the example of Section III:

    P_e(k; n) is monotonically increasing in k, for any n;
    P_e(k; n) is monotonically decreasing in n, for any k;
    P_e(k; n) -> 0 as n -> infinity, for any fixed k.

In general, then, the 1-NN rule is strictly better than the k-NN rule, k > 1, in those cases where the supports of the densities f_1, f_2, ..., f_M are such that each in-class distance is greater than any between-class distance.

Fig. 1. Admissibility of nearest neighbor rule.

IV. BAYES PROCEDURE

In this section we shall present the simplest version of the Bayes decision procedure for minimizing the probability of error in classifying a given observation x into one of M categories. All the statistics will be assumed known. Bear in mind, however, that the NN rule is nonparametric, or distribution free, in the sense that it does not depend on any assumptions about the underlying statistics for its application.

Let x denote the measurements on an individual and X the sample space of possible values of x; we shall refer to x as the observation. On the basis of x a decision must be made about the membership of the individual in one of M specified categories.

For the purposes of defining the Bayes risk, we assume f_1(x), f_2(x), ..., f_M(x), probability densities at x with respect to a sigma-finite measure v, such that an individual in category i gives rise to an observation x according to density f_i. Let L(i, j) be the loss incurred by assigning an individual from category i to category j. Let eta_1, eta_2, ..., eta_M, eta_i >= 0, SUM_i eta_i = 1, be the prior probabilities of the M categories. The conditional probability eta^_i(x) of an individual with measurements x belonging to category i is, by the Bayes theorem,

    eta^_i(x) = eta_i f_i(x) / SUM_{j=1}^{M} eta_j f_j(x),    i = 1, 2, ..., M.    (4)

Thus the random variable x transforms the prior probability vector eta into the posterior probability vector eta^(x). If the statistician decides to place an individual with measurements x into category j, the conditional loss is

    r_j(x) = SUM_{i=1}^{M} eta^_i(x) L(i, j).    (5)

For a given x the conditional loss is minimum when the individual is assigned to the category j for which r_j(x) is lowest. Minimizing the conditional expected loss obviously minimizes the unconditional expected loss. Thus the minimizing decision rule delta*, called the Bayes decision rule with respect to eta, is given by deciding the category j for which r_j(x) is lowest. Using delta*, the conditional Bayes risk r*(x) is

    r*(x) = min_j r_j(x) = min_j SUM_i eta^_i(x) L(i, j),    (6)

and the resulting overall minimum expected risk R*, called the Bayes risk, is given by

    R* = E[r*(x)],

where the expectation is with respect to the compound density

    f(x) = SUM_i eta_i f_i(x).

The Bayes risk serves merely as a reference, the limit of excellence beyond which it is not possible to go.
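The Bayes procedure above reduces, for each observation x, to one matrix-vector product and an argmin over categories. A minimal sketch (the posterior vector and the loss matrix below are invented for illustration, not taken from the paper):

```python
import numpy as np

def bayes_decide(posteriors, loss):
    """Return the category j minimizing the conditional risk
    r_j(x) = sum_i eta^_i(x) * L(i, j), together with that minimum risk."""
    risks = posteriors @ loss          # r_j for every candidate j
    j = int(np.argmin(risks))
    return j, float(risks[j])

# Two categories with 0-1 loss: the rule reduces to picking the larger
# posterior, and the conditional Bayes risk is min(eta^_1, eta^_2).
loss01 = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
eta = np.array([0.7, 0.3])             # hypothetical posterior vector
j, r_star = bayes_decide(eta, loss01)
print(j, r_star)                       # -> 0 0.3
```

With a general (asymmetric) loss matrix the same two lines still apply; only the matrix changes.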
V. CONVERGENCE OF NEAREST NEIGHBORS

Most of the properties of the NN rules hinge on the assumption that the conditional distributions of theta'_n and theta approach one another when x'_n -> x. In order to put bounds on the NN risk for as wide a class of underlying statistics as possible, it will be necessary to determine the weakest possible conditions on the statistics which guarantee the above convergence.

Lemma (Convergence of the Nearest Neighbor): Let x and x_1, x_2, ... be independent identically distributed random variables taking values in a separable metric space X. Let x'_n denote the nearest neighbor to x from the set {x_1, x_2, ..., x_n}. Then x'_n -> x with probability one.

Remark: In particular, x'_n -> x with probability one for any probability measure in Euclidean n-space. We prove the lemma in this generality in order to include in its coverage such standard pathological candidates for counterexamples as the Cantor ternary distribution function defined on the real line. Since the convergence of the nearest neighbor to x is independent of the metric, the bounds on the risks of the NN rule will be independent of the metric on X.

Proof: Let S_x(r) be the sphere {x~ in X : d(x~, x) <= r} of radius r centered at x, where d is the metric defined on X.

Consider first a point x in X having the property that every sphere S_x(r), r > 0, has nonzero probability measure. Then, for any delta > 0,

    P{d(x'_n, x) >= delta} = (1 - P(S_x(delta)))^n -> 0,

and therefore, since d(x'_n, x) is monotonically decreasing in n, the nearest neighbor x'_n converges to x with probability one.

It remains to argue that the random variable x has this property with probability one. We shall do so by proving that the set N of points failing to have this property has probability measure zero. Accordingly, let N be the set of all x for which there exists some r_x sufficiently small that P(S_x(r_x)) = 0.

By the definition of the separability of X, there exists a countable dense subset A of X. For each x in N there exists, by the denseness of A, an a_x in A for which a_x lies in S_x(r_x/3). Thus there exists a small sphere S_{a_x}(r_x/2) which is strictly contained in the original sphere S_x(r_x) and which contains x; hence P(S_{a_x}(r_x/2)) = 0. Then the possibly uncountable set N is contained in the countable union (by the countability of A) of such spheres of probability zero. Since N is contained in a countable union of sets of measure zero, P(N) = 0, as was to be shown.
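The convergence lemma lends itself to a quick numerical illustration. In the sketch below the samples are uniform on the unit square with the Euclidean metric (both choices made only for the demo); the nearest-neighbor distance d(x'_n, x) over growing prefixes of the sample is monotone non-increasing and shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_distance(x, samples):
    """d(x'_n, x): distance from x to its nearest neighbor among samples."""
    return float(np.min(np.linalg.norm(samples - x, axis=1)))

x = rng.random(2)                      # a fixed query point in the unit square
samples = rng.random((10_000, 2))      # x_1, x_2, ... i.i.d. uniform

# Prefix sets only grow, so d(x'_n, x) is monotonically non-increasing in n
# and, by the lemma, tends to 0 with probability one.
dists = [nn_distance(x, samples[:n]) for n in (10, 100, 1_000, 10_000)]
print(dists)
```

The printed distances decrease roughly like n^(-1/2) in the plane, consistent with the probability-one convergence the lemma guarantees.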
VI. NEAREST NEIGHBOR PROBABILITY OF ERROR

Let x'_n in {x_1, x_2, ..., x_n} be the nearest neighbor to x, and let theta'_n be the category to which the individual having measurement x'_n belongs. If theta is indeed the category of x, the NN rule incurs loss L(theta, theta'_n). If (x, theta), (x_1, theta_1), ..., (x_n, theta_n) are random variables, we define the n-sample NN risk R(n) by the expectation

    R(n) = E[L(theta, theta'_n)],

and the (large-sample) NN risk R by

    R = lim_{n -> infinity} R(n).

Throughout this discussion we shall assume that the pairs (x, theta), (x_1, theta_1), ..., (x_n, theta_n) are independent identically distributed random variables in X x {1, 2, ..., M}. Of course, except in trivial cases, there will be some dependence between the elements x_i, theta_i of each pair.

We shall first consider the M = 2 category problem with the probability of error criterion given by the 0-1 loss matrix

    L = [ 0  1 ]
        [ 1  0 ],    (12)

where L counts an error whenever a mistake in classification is made.
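The NN rule itself is a one-line decision: assign x the category of its nearest neighbor. A self-contained sketch with a tiny invented two-category training set:

```python
import numpy as np

def nn_classify(x, samples, labels):
    """1-NN rule: assign x the category theta' of its nearest neighbor x'."""
    i = int(np.argmin(np.linalg.norm(samples - x, axis=1)))
    return labels[i]

# Tiny hypothetical training set (categories 1 and 2), for illustration only.
samples = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.8, 2.1]])
labels  = np.array([1, 1, 2, 2])

print(nn_classify(np.array([0.1, 0.0]), samples, labels))  # -> 1
print(nn_classify(np.array([1.9, 1.9]), samples, labels))  # -> 2
```

Averaging the 0-1 loss of this rule over many i.i.d. test pairs (x, theta) estimates the n-sample risk R(n) defined above.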
The following theorem is the principal result of this discussion.

Theorem: Let X be a separable metric space. Let f_1 and f_2 be such that, with probability one, x is either 1) a continuity point of f_1 and f_2, or 2) a point of nonzero probability measure. Then the NN risk R (probability of error) has the bounds

    R* <= R <= 2R*(1 - R*).    (13)

These bounds are as tight as possible.

Remarks: In particular, the hypotheses of the theorem are satisfied for probability densities which consist of any mixture of delta-functions and piecewise continuous density functions on Euclidean d-space. Observe that 0 <= R* <= R <= 2R*(1 - R*) <= 1/2; so R* = 0 if and only if R = 0, and R* = 1/2 if and only if R = 1/2. Thus in the extreme cases of complete certainty and complete uncertainty the NN probability of error equals the Bayes probability of error. Conditions for equality of R and 2R*(1 - R*) for other values of R* will be developed in the proof.

Proof: Let us condition on the random variables x and x'_n in the n-sample NN problem. The conditional NN risk r(x, x'_n) is then given, upon using the conditional independence of theta and theta'_n, by

    r(x, x'_n) = E[L(theta, theta'_n) | x, x'_n]
               = Pr{theta = 1 | x} Pr{theta'_n = 2 | x'_n} + Pr{theta = 2 | x} Pr{theta'_n = 1 | x'_n},    (14)

where the expectation is taken over theta and theta'_n. By the development of (4), the above may be written as

    r(x, x'_n) = eta^_1(x) eta^_2(x'_n) + eta^_2(x) eta^_1(x'_n).    (15)

We wish first to show that r(x, x'_n) converges to the random variable 2 eta^_1(x) eta^_2(x) with probability one.

We have not required that f_1, f_2 be continuous at the points x of nonzero probability measure v({x}), because these points may be trivially taken into account as follows. Let v({x_0}) > 0; then

    P{x'_n != x_0 | x = x_0} = (1 - v({x_0}))^n -> 0.    (16)

Since x'_n, once equalling x_0, equals x_0 thereafter, r(x, x'_n) -> 2 eta^_1(x_0) eta^_2(x_0) with probability one (conditioned on x = x_0).

For the remaining points, the hypothesized continuity of f_1 and f_2 is needed. Here x is a continuity point of f_1 and f_2 with conditional probability one (conditioned on x such that v({x}) = 0). Then, since eta^ is continuous in f_1 and f_2, x is a continuity point of eta^ with probability one. By the lemma, x'_n converges to the random variable x with probability one. Hence, with probability one,

    eta^(x'_n) -> eta^(x),    (18)

and, from (15), with probability one,

    r(x, x'_n) -> r(x) = 2 eta^_1(x) eta^_2(x),    (19)

where r(x) is the limit of the n-sample conditional NN risk. As shown in (6), the conditional Bayes risk is

    r*(x) = min {eta^_1(x), eta^_2(x)} = min {eta^_1(x), 1 - eta^_1(x)},    (20)

and, by the symmetry of r(x) in eta^_1, we may write

    r(x) = 2 eta^_1(x)(1 - eta^_1(x)) = 2 r*(x)(1 - r*(x)).    (21)

Thus, as a by-product of the proof, we have shown in the large-sample case that, with probability one, a randomly chosen x will be misclassified with probability 2 r*(x)(1 - r*(x)). For the overall NN risk R we have, by definition,

    R(n) = E[r(x, x'_n)],    (22)

where the expectation is taken over x and x'_n. Now r is bounded by one; so, applying the dominated convergence theorem,

    R = lim_{n} E[r(x, x'_n)] = E[lim_{n} r(x, x'_n)].

The limit, from (19) and (21), yields

    R = E[2 eta^_1(x) eta^_2(x)] = E[2 r*(x)(1 - r*(x))].    (24)

Since the Bayes risk R* is the expectation of r*(x), we have

    R = 2R*(1 - R*) - 2 Var r*(x),    (25)

and hence

    R <= 2R*(1 - R*),    (26)

with equality if and only if Var r*(x) = 0, which holds if and only if r*(x) = R* with probability one.

Rewriting (24), we have

    R = E[r*(x)] + E[r*(x)(1 - 2 r*(x))] = R* + E[r*(x)(1 - 2 r*(x))] >= R*,

with equality if and only if r*(x)(1 - 2 r*(x)) = 0 almost everywhere (with respect to v). Thus the lower bound R = R* is achieved if and only if r*(x) equals 0 or 1/2 almost everywhere. Examples of probability distributions achieving the upper and lower bounds will be given at the end of this section, following the extension to M categories.
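The bounds (13) and the variance identity (25) can be checked numerically on a concrete two-category problem. The unit-variance normal densities below, centered at -1 and +1 with equal priors, are an invented example (not from the paper), and the integrals are approximated by Riemann sums on a fine grid:

```python
import numpy as np

xs = np.linspace(-8.0, 8.0, 200_001)
dx = xs[1] - xs[0]
phi = lambda m: np.exp(-0.5 * (xs - m) ** 2) / np.sqrt(2.0 * np.pi)
f1, f2 = phi(-1.0), phi(1.0)
f = 0.5 * f1 + 0.5 * f2                  # compound density f(x)
eta1 = 0.5 * f1 / f                      # posterior of category 1

r_star = np.minimum(eta1, 1.0 - eta1)    # conditional Bayes risk, as in (20)
r_nn = 2.0 * eta1 * (1.0 - eta1)         # large-sample conditional NN risk (21)

R_star = np.sum(r_star * f) * dx         # Bayes risk R*
R_nn = np.sum(r_nn * f) * dx             # NN risk R, as in (24)
var_r = np.sum(r_star**2 * f) * dx - R_star**2

print(R_star, R_nn)                      # R* is close to Phi(-1), about 0.159
assert R_star <= R_nn <= 2 * R_star * (1 - R_star)
assert abs(R_nn - (2 * R_star * (1 - R_star) - 2 * var_r)) < 1e-9
```

The first assertion is the bound chain (13); the second is the identity (25), which holds exactly (up to rounding) because it is pure algebra on the same sums.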
Consider now the M-category problem with the probability of error criterion given by the loss function L(i, j) = 0 for i = j, and L(i, j) = 1 for i != j.

Theorem (Extension to M categories): Let X be a separable metric space. Let f_1, f_2, ..., f_M be probability densities with respect to some probability measure v such that, with probability one, x is either 1) a continuity point of f_1, f_2, ..., f_M, or 2) a point of nonzero probability measure. Then the NN probability of error R has the bounds

    R* <= R <= R* (2 - M R* / (M - 1)).

These bounds are as tight as possible.

Proof: Since x'_n -> x with probability one, the posterior probability vector eta^(x'_n) -> eta^(x) with probability one. The conditional n-sample NN risk r(x, x'_n) is

    r(x, x'_n) = E[L(theta, theta'_n) | x, x'_n] = SUM_{i != j} eta^_i(x) eta^_j(x'_n),    (30)

which converges with probability one to the large-sample conditional risk r(x) defined by

    r(x) = SUM_{i != j} eta^_i(x) eta^_j(x) = 1 - SUM_i eta^_i^2(x).    (31)

The conditional Bayes risk r*(x), obtained by selecting, for a given x, the maximum eta^_i(x), say eta^_j(x), is given by

    r*(x) = 1 - max_i {eta^_i(x)} = 1 - eta^_j(x).    (32)

The substitution trick of (21) can no longer be used when M > 2. Instead, by the Cauchy-Schwarz inequality,

    (M - 1) SUM_{i != j} eta^_i^2 >= (SUM_{i != j} eta^_i)^2 = (r*(x))^2.    (33)

Adding (M - 1) eta^_j^2 to each side,

    (M - 1) SUM_i eta^_i^2 >= (r*(x))^2 + (M - 1)(1 - r*(x))^2,    (34)

so that, for almost every x (with respect to the probability measure v),

    SUM_i eta^_i^2 >= (r*(x))^2 / (M - 1) + (1 - r*(x))^2.    (35)

Substituting (35) into (31),

    r(x) <= 2 r*(x) - M (r*(x))^2 / (M - 1).    (36)

Taking expectations, and using the dominated convergence theorem as before,

    R <= 2R* - (M/(M - 1)) E[(r*(x))^2] <= 2R* - (M/(M - 1)) (R*)^2 = R* (2 - M R* / (M - 1)),

with equality if and only if Var r*(x) = 0. Of course, Var r*(x) = 0 implies r*(x) = R* with probability one.

The upper bound is attained for the no-information experiment f_1 = f_2 = ... = f_M, with eta_1 = 1 - R* and eta_i = R*/(M - 1), i = 2, ..., M. The lower bound R = R* is attained, for example, when eta_i = 1/M, i = 1, ..., M, and

    f_i(x) = M/(M - 1)  for x in [0, 1] outside the interval [(i - 1)/M, i/M],
    f_i(x) = 0          elsewhere.    (39)
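Both the M-category bound and its attainment can be verified mechanically. The sketch below checks inequality (36) on random posterior vectors and confirms that the no-information posteriors (1 - R*, R*/(M-1), ..., R*/(M-1)) meet the upper bound exactly; M and R* are arbitrary demo values:

```python
import numpy as np

rng = np.random.default_rng(2)

def risks(eta):
    """Conditional Bayes risk (32) and large-sample NN risk (31)
    for a posterior probability vector eta."""
    eta = np.asarray(eta, dtype=float)
    return 1.0 - eta.max(), 1.0 - float(np.sum(eta**2))

M = 4
# Inequality (36): r <= 2 r* - M r*^2 / (M - 1), on random posteriors.
for _ in range(1000):
    r_star, r_nn = risks(rng.dirichlet(np.ones(M)))
    assert r_nn <= 2*r_star - M*r_star**2/(M - 1) + 1e-12

# Attainment of the upper bound at the no-information experiment.
R_star = 0.3                            # arbitrary demo value
eta = np.r_[1 - R_star, np.full(M - 1, R_star/(M - 1))]
r_star, r_nn = risks(eta)
assert abs(r_nn - R_star*(2 - M*R_star/(M - 1))) < 1e-12
print("bound (36) holds; attained at the no-information experiment")
```

Equality holds here because the Cauchy-Schwarz step (33) is tight exactly when the M - 1 non-maximal posteriors are equal, which is what the no-information experiment produces.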
VII. EXAMPLE

Let the real-valued random variable x have triangular densities f_1 and f_2 with prior probabilities eta_1 = eta_2 = 1/2, as shown in Fig. 2:

    f_1(x) = 2 - 2x,    f_2(x) = 2x,    0 <= x <= 1.

Fig. 2. Triangle densities for example.

The density f = eta_1 f_1 + eta_2 f_2 of x is uniform on [0, 1], thus facilitating calculation of the distribution of the nearest neighbor x'_n. The probability of error for this example in the n-sample single-NN case is

    R(n) = E[eta^_1(x) eta^_2(x'_n) + eta^_2(x) eta^_1(x'_n)]
         = E[(1 - x) x'_n + x (1 - x'_n)].    (40)

Upon performing a lengthy but straightforward calculation, we obtain

    R(n) = 1/3 + (3n + 5) / (2(n + 1)(n + 2)(n + 3)).    (41)

Thus

    R = lim_{n -> infinity} R(n) = 1/3.    (42)

The NN risk R is to be compared to the Bayes risk

    R* = INT min (eta_1 f_1, eta_2 f_2) dv = 1/4.    (43)

Exhibiting corresponding terms, we have

    R* <= R <= 2R*(1 - R*),  or  1/4 <= 1/3 <= 3/8.    (44)

In this example we have found an exact expression for the NN risk R(n) for any finite sample size. Observe that R(1) = 1/2, in agreement with simpler considerations, and that R(n) converges to its limit approximately as 1/n^2.

VIII. THE k-NN RULE

From Section V it is also possible to conclude that the kth nearest neighbor to x converges to x with probability one as the sample size n increases with k fixed. Since each of the k nearest neighbors casts a conditionally independent vote as to the category of x, we may conclude, in the M = 2 category case for odd k, that the conditional k-NN risk r_k(x) is given in the limit (with probability one), as n increases, by

    r_k(x) = SUM_{i=(k+1)/2}^{k} C(k, i) [ eta^_1(x)^i (1 - eta^_1(x))^{k-i+1} + eta^_1(x)^{k-i+1} (1 - eta^_1(x))^i ].    (45)

Note that the conditional k-NN risks r_k(x) are monotonically decreasing in k (to min {eta^_1(x), 1 - eta^_1(x)}), as we might suspect. Thus the unconditional k-NN risks R_k will also be monotonically decreasing in k (to R*).

Observe that in (45) r_k is symmetric in eta^_1 and 1 - eta^_1. Thus r_k may be expressed solely in terms of r* = min {eta^_1, 1 - eta^_1} in the form

    r_k(x) = rho_k(r*(x)).    (46)

Now let rho^_k(r*) be defined to be the least concave function greater than rho_k(r*). Then

    rho_k(r*) <= rho^_k(r*),    (47)

and, by Jensen's inequality,

    R_k = E[rho_k(r*)] <= E[rho^_k(r*)] <= rho^_k(E r*) = rho^_k(R*),    (48)

so rho^_k(R*) is an upper bound on the large-sample k-NN risk R_k. It may further be shown, for any R*, that rho^_k(R*) is the least upper bound on R_k by demonstrating simple statistics which achieve it. Hence we have the bounds

    R* <= R_k <= rho^_k(R*) <= ... <= rho^_1(R*) = 2R*(1 - R*),    (49)

where the upper and lower bounds on R_k are as tight as possible.
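The triangular-density example invites a simulation. The Monte Carlo below estimates the n-sample risk directly from (40), with the test point x and the n samples i.i.d. uniform on [0, 1], and compares it against a closed form rederived for this sketch (it matches R(1) = 1/2, the limit 1/3, and the 1/n^2 rate noted in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def R_exact(n):
    """Closed form for the n-sample NN risk of the triangular example,
    rederived from (40) for this sketch."""
    return 1/3 + (3*n + 5) / (2*(n + 1)*(n + 2)*(n + 3))

def R_mc(n, trials=200_000):
    """Monte Carlo estimate of (40): E[(1-x) x' + x (1-x')], where x' is
    the nearest of n i.i.d. uniform samples to the uniform test point x."""
    x = rng.random(trials)
    samples = rng.random((trials, n))
    nn = samples[np.arange(trials),
                 np.argmin(np.abs(samples - x[:, None]), axis=1)]
    return float(np.mean((1 - x) * nn + x * (1 - nn)))

for n in (1, 2, 10):
    print(n, R_exact(n), R_mc(n))     # the two columns agree to ~1e-3
```

Averaging the conditional risk (40) instead of simulating labels keeps the Monte Carlo variance small, so a few hundred thousand trials suffice.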
IX. CONCLUSIONS

The single NN rule has been shown to be admissible among the class of k-NN rules for the n-sample case, for any n. It has been shown that the NN probability of error R, in the M-category classification problem, is bounded below by the Bayes probability of error R* and above by R*(2 - MR*/(M - 1)). Thus any other decision rule based on the infinite sample set can cut the probability of error by at most one half. In this sense, half of the available classification information in an infinite collection of classified samples is contained in the nearest neighbor.

A Generalized Form of Price's Theorem and Its Converse

JOHN L. BROWN, JR., SENIOR MEMBER, IEEE

Abstract: The case of n unity-variance random variables x_1, x_2, ..., x_n governed by the joint probability density w(x_1, x_2, ..., x_n) is considered, where the density depends on the (normalized) cross-covariances rho_ij = E[(x_i - x~_i)(x_j - x~_j)]. It is shown that the condition

    dE[f(x_1, x_2, ..., x_n)]/d(rho_ij) = E[d^2 f / dx_i dx_j]    (2)

holds for an "arbitrary" function f(x_1, x_2, ..., x_n) of n variables if and only if the underlying density w(x_1, x_2, ..., x_n) is the usual n-dimensional Gaussian density for correlated random variables. This result establishes a generalized form of Price's theorem in which: 1) the relevant condition subsumes Price's original condition; 2) the proof is accomplished without appeal to Laplace integral expansions; and 3) conditions referring to derivatives with respect to diagonal terms rho_ii are avoided, so that the unity-variance assumption can be retained.
I. INTRODUCTION

PRICE'S THEOREM and its various extensions have proved useful in the computation of output correlations between zero-memory nonlinearities subjected to jointly Gaussian inputs. In its original form, the theorem considered n jointly normal random variables x_1, x_2, ..., x_n, with respective means x~_1, x~_2, ..., x~_n, and nth-order joint probability density

    w(x_1, x_2, ..., x_n) = (2 pi)^{-n/2} |M|^{-1/2} exp[ -(1/(2|M|)) SUM_{r,s} M_rs (x_r - x~_r)(x_s - x~_s) ],

where |M| is the determinant of M = [rho_rs], rho_rs is the correlation coefficient of x_r and x_s, and M_rs is the cofactor of rho_rs in M.

From [1], the theorem statement is as follows. Let there be n zero-memory nonlinear devices specified by the input-output relationships f_i(x), i = 1, 2, ..., n. Let each x_i be the single input to a corresponding f_i.

(Manuscript received February 10, 1966; revised May 2, 1966. The author is with the Ordnance Research Laboratory, Pennsylvania State University, State College, Pa.)
