LOF: Identifying DensityBased Local Outliers

这篇论文是Breunig于2000年发表在Proc. ACM SIGMOD 2000 Int. Conf. On Management of Data的关于LOF算法的经典论文，需要了解该算法的同学可以详细读一读。
relative to their local neighborhoods, particularly with respect to the densities of the neighborhoods. These outliers are regarded as To illustrate, consider the example given in Figure 1. This is a sim reachdist 1. 0)=kdistance(o) ple 2dimensional dataset containing 502 objects. There are 400 ob ects in thc first cluster Cl, 100 objects in thc cluster C? additional objects o1 and un. In this example, C2 forms a denser ter than C. According to Hawkins' definition, both ol and o2 reachdisik(p,, o) can be called outliers, whereas objects in C and C, should not be With our notion of a local " outlier we wish to label both o, and o 2 as outliers. In contrast within the framework of distancebased out liers, only o is a reasonable DB(pct, dmin)outlier in the following Figure 2: reachdist(p i,o)and reachdisi(p2, 0), for k=4 scnsc. If for cvcry object q in Cl, the distance bctwccn q and its nearest neighbor is greater than the distance between Un and C? C.e., d(o. C2)), we can in fact show that there is no appropriate val Definition 4:(kdistance neighborhood of an object p) ue of pct and dmin such that o2 is a DB(pct, dminoutlier but the the Givcn thc kdistancc of p, thc kdistance neighborhood of p objects in CI are not contains every object whose distance from p is not greater than The reason is as follows If the dmin value is less than the distance the kdistance,ie. Nk distanced(p)={a∈D钟p}或P,q)"k d(o2, C2), then all 501 objects (pct=100*501/502)arc furthcr away from O? than dmin. But the same condition holds also for every ob These objects g are called the knearest neighbors of p ject q in CI. Thus, in this case, O] and all objects in Ci are Db(pct, Whenever no confusion arises, we simplify our notation to use dmin)outliers N(p)as a shorthand for Nkdistance(p (p). Note that in definition 3 Otherwise, if the dmin value is greater than the distance d(o,, C,) the kdistance(p)is well defined for any positive integer k, although then it is easy to see that: o, is a db(pct, dminoutlier implies that the object o may not be unique. In this case, the cardinality of Nk(p) there are many objects q in C1 such that q is also a DB(pct, dmin) is greater than k. For example, suppose that there are: (i)l object outlier. This is because the cardinality of the se with distance I unit from p;(11)2 objects with distance 2 units from p:and(ii3 objects with distance 3 units from p. Then 2dis ped d(, 02)dmin is always bigger than the cardinality of tance(p) is identical to 3distance(p). And there are 3 objects of 4 the set (pcdI dp. q)"dmin). Thus, in this case, if o2 is a distance()from p. Thus, the cardinality of N4(p) can be greater DB(pCt, dmin)outlier, so are many objects q in CI. Worse still, than 4. in this case 6 there are values ofpcl and dmin such that while o, is not an outlier Definition 5:(reachability distance of an object p w.r.t. ob ne g in CI are ject o Let k be a natural numbcr. The reachability distance of object 4. FORMAL DEFINITION OF LOCAL p with respect to object o is defined as OUTLIERS reachdisIkp, o)=max( kdistance(o), d(p, o)) The above example shows that the global view taken by Db(pct, Figure 2 illustrates the idea of reachability distance with k4 In dmin)outliers is meaningful and adequate under certain conditions, tuitively, if object p is far away from o(e. g. p2 in the figure), then but not satisfactory for the general case when clusters of different the reachability distance between the two is simply their actual dis densities exist. In this section, we develop a formal definition of lo tance. However, if they are"sufficiently"close(e.g, P, in the fig cal outliers, which avoids the shortcomings presented in the previ ure), the actual distance is replaced by the kdistance of o. The rea ous section. The key difference between our notion and existing no son is that in so doing, the statistical fluctuations of d(p, for all th tions of outliers is that being outlying is not a binary property p's close to o can be significantly reduced. The strength of this Instead, we assign to each object an outlier factor, which is the de smoothing cffcct can bc controlled by the parameter h. Thc highcr gree the object is being outlying the value of k, the more similar the reachability distances for ob We begin with the notions of the kdistance of object p, and, corre jects within the same neighborhood spondingly, the kdistance neighborhood of p So far, we have defined kdistance(p) and reachdistk(p)for any Definition 3: (kdistance of an object p) positive integer k But for the purpose of defining outliers, we focus For any positive integer k, the kdistance of object p, denoted as on a specific instantiation of k which links us back to densitybased kdistance(p), is defined as the distance d(p, o) between p and an clustering. In a typical densitybased clustering algorithm, such as object o∈ D such that: [7].[3].[22], or[ll], there are two parameters that define the notion of density: (i) a parameter MinPls specifying a minimum number of or at least k objects o’∈Dl{p} it holds that objects; (ii) a parameter specifying a volume. These two parameters d(p, o)"d(p, o), and dctcrminc a density thr'eshold for thc clustcring algorithms to opc (ii) for at most k1 objects o'EDIp) it holds that ate. That is, objects or regions are connected if their neighborhood cp,o)<纵(P,0). densities exceed the given density threshold to detect density based outliers, however, it is necessary to compare the densities of 5.1 LOF for Objects Deep in a Cluster different sets of objects, which means that we have to determine the density of sets of objects dynamically. Therefore, we keep MinPts In section 3, we motivate the notion of a local outlier using figure I as thc only parameter and usc the valucs rcachdistminPts(p, o),for In particular, wc hopc to labcl o2 as outlying, but labcl all objects in O E NMinPIs(p), as a measure of the volume to determine the densit the cluster CI as nonoutlying. Below, we show that for most ob in the neighborhood of an object p jects in Ci its LOF is approximately l, indicating that they cannot bc labcled as outlying Definition 6: (local reachability density of an object p) Lemma 1: Let C be a collection of objects. Let reachdistnin de The local reachability density of p is defined as notc the minimum reachability distance of objects in C, 1.c., reach reachdistminPas(p, o distmmin p, q distmax denote the maximum reachability distance of objects in C. OCN ird Minptslp Let e be defined as(reachdistmaxireachdistmin1) MinIs( Then for all objects P E C, such that (i)all thc Min Pisncarcst neighbors g of p arc in C, and Intuitively, the local reachability density of an object p is the in (iall the MinPtsnearest neighbors o of q are also in C, verse of the average reachability distance based on the Min Pts it holds that 1/(1+8)"LOF(P)"(1+E) nearest neighbors of p. Note that the local density can be if all the Proof (Sketch): For all MinPtsnearest neighbors q of P, reach reachability distances in the summation are 0. This may occur for dist(p, q)> reachdistmin. Then thc local reachability density an object p if there are at least Min Pis objects, different from p, but of p, as per definition 6, is l/reachdistmin. On the other sharing the same spatial coordinates, i. e. if there are at least MinPts hand, reachdisz(p, q)"reachdistmax. Thus, the local reach duplicates of p in the dataset. For simplicity, we will not handle this ability density of p is 2 1/reuchcdislmux case explicitly but simply assume that there are no duplicates ( To Let q be a MinPtsnearest neighbor of p. By an argument iden deal with duplicates, we can base our notion of neighborhood on a tical to thc onc for p abovc, thc local reachability dcnsity of q is kdistinctdistance, defined analogously to kdistance in definition also between 1/reachdistmar and 1/reachdistmin 3, with the additional requirement that there be at least k objects Thus, by definition 7, we have reachdistmin/reachdistmax with different spatial coordinates. LOFP)"reachdistmaxc/reachdistmin. Hence, we estab Definition 7: (local)outlier factor of an object p) ishl(1+ε)"O/(m)"(1+) The(local) outlier factor of p is defined as The interpretation of lemma I is as follows Intuitively, corre ird sponds to a"cluster". Let us consider the objects p that are"deep dri inside the cluster. which means that all the minPtsnearest neigh LOFMinPasp)=OCN inPus(p)Min Pte(P) bors g of p are in C, and that, in turn, all the MinPtsnearest neigh bors of g are also in C. For such deep objects P, the LOF of p is The outlier factor of object p captures the degree to which we call bounded If C is a"tight"cluster, the a value in lemma 1 can be quite small, thus forcing the LOF of p to be quite close to 1 p an outlier. It is the avcragc of thc ratio of thc local reachability density of p and those of p's MinPtsnearest neighbors. It is easy to To return to the example in figure I, we can apply lemma I to con see that the lower p's local reachability density is. and the higher the clude that the LOFs of most obiects in cluster C, are close to l local reachability densities of p's MinPlsnearest neighbors are, the higher is the LOF value of p In the following section, the formal 5.2 A General Upper and Lower bound on properties of LOF are made precise. To simplify notation, we drop LOF the subscript Min Pts from reachdist, Ird and LOF, if no confusion Lemma I above shows a basic property of loF, namely that for ob arises jects deep inside a cluster, their LOFs are close to l, and should not be labeled as a local outlier. A few immediate questions come to 5. PROPERTIES OF LOCAL OUTLIERS mind. What about those objects that are near the periphery of the In this section, we conduct a detailed analysis on the properties of cluster? And what about those objects that are outside the cluster, LOF. The goal is to show that our definition of LOF captures the such as on in figure 1? Can we get an upper and lower bound on the spirit of local outliers, and enjoys many desirable propcrtics. Spo LOF of these objects? cifically, we show that for most objects p in a cluster, the lOF of p Theorem 1 below shows a general upper and lower bound on is approximately equal to 1. As for other objects, including those LOF(p) for any object p. As such, theorem 1 generalizes lemma 1 outside of a cluster, we give a general theorem giving a lower and along two dimensions. First, theorem l applies to any object P, and upper bound on the loF. Furthermore, we analyze the tightness of is not restricted to objects deep inside a cluster. Second, even for our bounds. We show that the bounds are tight for important classes objects dccp insidc a cluster, the bound givcn by theorem I can bc f objects. However, for other classes of objects, the bounds may tighter than the bound given by lemma 1, implying that the epsilon not be as tight For the latter, we g ive another theorem specifying defined in lemma i can be made closer to zero. This is because in better bounds lemma 1. the values of reachdistmin and reachdistmar are ob tained based on a larger set of reachability distances. In contrast, in theorem 1, this minimum and maximum are based on just the MinPtsnearest neighborhoods of the objects under consideration, directmin(P) Proof (Sketch): (a) indirect LOF(p): VOENMiRP(p): reachdist(, o)2 dir by definition of directmin(p) NMin Pis(p) Vq(o): reachdist(o, q)"indirectmar(p) x=0×imin by definition of indireclmor(p → LOFMinPts(p)≥4 > LOFMinPts(p)"6 reachdist(o, q) Figure 3: Illustration of theorem I 31vMinPi (o) NInnIs(Ol indirect ,1c giving rise to tighter bounds. In section 5.3, we will analyze in greater details the tightness of the bounds given in theorem 1 lrd(U)≥ indirectmax(p) Before we present theorem 1, we define the following terms. For Thus. it follows that any object P, let directmin(p) denote the minimum reachability dis ird(o tance between p and a MinPtsnearest neighbor of p,i.e Ird(p) directmin(p)min( reuchdist(p, 4 y e NMinPis(p)) LOF(p) Similarly, let direct max(p)denote the corresponding maximum, i.e MinI directa(p)=max i reachdisu(p, q) g E NMinPts(p)? Furthermore, to generalize these definitions to the minPtsnearest neighbor g of p, let indirectmin(p)denote the minimum reachability indirect(p) distance between g and a MinPtsnearest neighbor of g, i.e oCNMinPis(p) indirectmin(p)=min i reachdist(, o)g E NMinP()and o cireclmin(py direct i n(p) ∈ mInis(q)} MinIs(p) indirect a(p) Similarly, lct indirectmax(p) dcnotc thc corrcsponding maximum In the sequel, we refer to ps Min Plsnearest neighborhood as ps di (b)1O/(p) indirectmin(p) analogously rect neighborhood, and refer to q's MinPitsnearest neighbors as p's direct neighbors, whenever q is a MinPtsncarcst neighbor of p To illustrate the theorem using the example in figure 3, suppose Figure 3 gives a simple example to illustrate these definitions. In that dunin is 4 times that ofinax, and dmax is 6 times that of imin.Then this cxamplc, object p lics somc distance away from a cluster of ob jects C. For ease of understanding, let Min Pts=3. the directmin( p) by theorem 1. the loF of p is between 4 and 6. It should also be clear from theorem I that LOF(p) has an easytounderstand inter value is marked as dmin in the tigure; the directmar(p) value is pretation. It is simply a function of the reachability distances in p's marked as dmar. Because p is relatively far away from C, the 3dis dircct ncighborhood rclativc to thosc in p's indirect ncighborhood tance of every object g in C is much smaller than the actual distance between p and q. Thus, from definition 5, the reachability distance 5.3 The Tightness of the bounds of p w.r.t. g is given by the actual distance between p and g. Now As discussed before, theorem I is a general result with the specified among the 3nearest neighbors of p, we in turn find their minimum upper and lower bounds for LOF applicable to any object p. An im and maximum reachability distances to thcir 3ncarest neighbors mediate question comes to mind. How good or tight are these In the figure, the indirectmin(p) and indirectmax(p) values are bounds? In other words, if we use /OMg to denote the upper marked as imin and inar respectively bound directmarindirectmin, and use LOFmin to denote the lower Theorem 1: Lct p bc an object from thc databaSc D, and bound directmin / indirectmax, how largc is thc spread or diffcrencc 1 MinPts"D between LOFmur and LOFmin? In the follow ing we study this issue Then it is the case that a key part of the following analysis is to show that the spread lop directi(p) directa(p) LOFE marLOFmin is dependent on the ratio of direcvindirect. It turns out dired indirectmin(p) that the spread is small under some conditions but not so small der other conditions LOFLOF 15.00 LOF HFICEA indirect direct. pct direct·pct direct 0.00 LOF LOFmux Pct=10 indirect indirect pct indirect indirect·pct 500 /+ 4 0.00 15 1+ proportion direc indirect Figure 4: Upper and lower bound on LoF depending on direct/ indirect for diffcrent values of pct Figure 5 shows that(LOFmar LOFmin)(direct/indirect) is only de pendent on the pcrccntagc valuc pct. Its valuc approaches infinit Givcn directmin(p) and directmax(p)as dcfincd abovc, wc usc di if pct approaches 100, but it is very small for reasonable values of recl(p) to denote the mean value of directmin(p) and direclmar(p) pct. This also verifies that the relative fluctuation of the LOF is con Similarly, we use indirect(p) to denote the mean value of indirect stant for a fixed percentage pcl, as we have seen in figure 4 nd indirectmar(p). In the sequel, whe nfusion To summarize, if the fluctuation of the average reachability dis arises, we drop the parameter p, e.g, direct as a shorthand of di tances in the direct and indirect neighborhoods is small (i.e, Pcl is low), theorem 1 estimates the LOF very well, as the minimum and maximum lof bounds arc closc to cach othcr Thcre are two im Now to make our following analysis easier to understand, we sim portant cases for which this is true plify our discussion by requiring that lFec direc lmin)/direct=(indirec indirec min)/indirect The percentage pct is very low for an object p, if the fluctuation That is we assume that the reachability distances in the direct and of the reachability dista indirect neighborhoods fluctuate by the same amount. Because of MinPtsnearest neighbors of p belong to the same cluster. In this this simplification, we can use a single parameter pct in the sequel case, the values direclmin, dlirec'lmar: indireclmin and indirec nax to control the fluctuation. More specifically, in figure 4. pct=x% Is consistent with the result cstablished in lcmma/e to 1.This are almost identical, resulting in the lof being close to 1. This corresponds to the situation where directmax= direct(11x), di rectmin=direct (Ix%), indirectmax indirect (1+x%)and indi The argument above can be generalized to an object p which is rectminindircct*(1x%o). Figure 4 shows the situations when pct not located dccp insidc a cluster, but whose MinPtsncarcst is set to 1%0, 5o and 10%o The spread between lO/max and lor,min neighbors all belong to the same cluster (as depicted in e sure 3). In this case, even though LOF may not be close to l fis Increases as pci increases e bounds on LOF as predicted by theorem l are tight More importantly, figure 4 shows that, for a fixed percentage oct=x%, the spread between LOFmar and LOFmin grows linearly 5.4 Bounds for Objects whose Direct Neighbor with rcspcct to thc ratio direct/indirect. This mcans that the rclativc hoods Overlap Multiple Clusters span(LOFmox LOFmin/(direct/indirect) is constant. Stated differ So far wc havc analy zcd the tightness of the bounds given in cntly, thc rclativc fluctuation of thc loF dcpcnds only on thc ratios theorem 1, and have given two conditions under which the bounds of the underlying reachability distances and not on their absolute are tight. An immediate question that comes to mind is: under what values. This highlights the spirit of local outliers To be more precise, in fact, the whole situation is best captured in the 3dimensional space where the three dimensions are: LOF LOFmin),(direct/indirect), and pct. Figurc 4 then rcprcscnts a Sc of 2d projections on the first two dimensions. But figure 4 does not show the strength of the dependency between the relative 5000 fluctuation of the LOF and the rclativc fluctuation of pct. for this purpose, figure 5 is useful. The yaxis of the figure shows the ratio between the two dimensions (LOFmax LOFmin)and(direct/indi rect)in thc 3dimcnsional spacc mentioned abovc, and the xaxis corresponds to the other dimension pct. To understand the shape of 0⑩203405060789010 the curve in figure 5, we have to take a closer look at the ratio loF LOFmin/(dlirecllindirecl) percentage of fluctuation pct Figure 5: Relative span for loF depending on percentage of fluctuation for d and w MinIs =6 LOF(P)4. 5. direclmin(p)l. Indirectmax( p) i and LOF(P)·5, Ireclmax indirectmin( p) We give a proof sketch of theorem 2 in the appendix. Theorem 2 gcncralizcs thcorcm I in taking into considcration thc ratios of thc Figure 6: llustration of theorem 2 MinPtsnearest neighbors coming from multiple clusters. As such condition are the bounds not tight? Based on figure 5. it the MinPts there is the follow ing corollary ncarcst neighbors of an object p belong to dittcrcnt clustcrs having Corollary 1: If the number of partitions in theorem 2 is 1,then different densities, the value for pct may be very large. Then based LOFmin and LOFmax given in theorem 2 are exactly the same corre on figure 5, the spread between LOFmax and LOFmin value can be sponding bounds given in theoren 1 large. In this case, the bounds given in theorem I do not work well As an example, let us consider the situation shown in figure 1 6. THE IMPACT OF THE PARAMETER again. For object O2, because all its MinPisnearest neighbors come MNPTS from the same cluster C2, the bounds given by theorem l on the LOF of o, is expected to be tight. In contrast, the MinPtsnearest In the previous section, we have analyzed the formal properties of LOF. For objects deep inside a cluster, we have shown that the lOF neighbors of oi come from both clusters Ci and ( In this case, the is approximately equal to 1. For other objects, we have established given bounds on the loF of 0, may not be as good two sets of upper and lower bounds on the LoF, depending on Theorem 2 below intends to give better bounds on the lOF of ob whether the MinPtsnearest neighbors come from one or more clus cct p when p's MinPtsncarcst ncighborhood overlaps with morc ters. It is important to note that all the previous results are based on than one cluster. The intuitive meaning of theorem 2 is that, when a given MinPts value. In this section, we discuss how the LOF val we partition the MinPtsnearest neighbors of p into several groups, ue is influenced by the choice of the MinPts value, and how to de ch group contributes proportionally to the LOF ofp termine the right Min Pts values for the loF computation An example is shown in figure 6 for MinPis=6. In this case, 3 of ob ject p's 6nearest neighbors come from cluster Cl, and the other 3 6.1 How LOF Varies according to changing come from cluster C2. Then according to theorem 2, LOFmin is giv MinPts values en by(0.5"d min I 0.5d2mminy(o5/ t0.5/), where dImin Given the analytic results established in the previous section sev and demin give the minimum reachability distances between p and eral interesting questions come to mind. How does the lof value the 6nearest neighbors of p in Ci and Cz respectively, and ilmax change when the MinPts value is adjusted? Given an increasing se quence of Min Pis values, is there a corresponding monotonic se and i2 mar give the maximum reachability distances between q and quence of changes to LOF? That is, does loF decrease or increase s 6nearest neighbors, where q is a 6nearest neighbor of p from monotonically? CI and C, respectively. For simplicity, figure 6 does not show the case for the upper bound lofmax Unfortunately, the reality is that LOf neither decreases nor increas es monotonically. Figure 7 shows a simple scenario where all the Theorem 2: Let p be an object from the database D, objects are distributed following a Gaussian distribution. For each 1"Min Pls"Db and C1, C2.,Cn be a partition of Nminpi(p), i.e. NMinPts(p)=C LOF values, as well as the standard deviation, are show d mean MinPts value between 2 and 50. the minimum maximum C;↑ 2 for l"i"n,i↑ Let us consider the maximum LOF as an example. Initially, when Furthcrmorc, Ict $i=CI yNMinPis(p) bc the percentage of ob the MinPts value is set to 2, this reduces to using the actual inter object distance d(p, o) in definition 5. B Jects in ps neighborhood, which are also in (i Let the notions ue, the statistical fluctuations in reachability distances and in LOF direclmin(p), direc(max(p), indirec(min(p), and arc wcakcncd. Thus, thcrc is an initial drop on the maximum LOF value. However. as the MinPts value continues to increase, the in direct max(p) be defined analogously to directmin(p), direct maximum LOF value goes up and down, and eventually stabilizes max(p), indirectmin(p), and indirectmar(p) but restricted to the set Ci to some value (e.g,, directmin (p) denotes the minimum reachability distance be If the lof value changes nonmonotonically even for such a pure distribution like the gaussian distribution the lof value changes tween p and a Min PIsnearest neighbor of p in the set C;) more wildly for more complex situations. Figure shows a twodi Then, it holds that (a) mcnsional datasct containing thrcc clusters, whcrc Si consists of 10 Gaussian distribution F Tl 2.5 g 15 max 05 with std. d 16111621263136414651 Minpts Figure 7: Fluctuation of the outlicrfactors within a Gaussian cluster objects, S2 of 35 objects and S3 of 500 objects. On the right side are relative to this cluster. This value could be applicationdependent representative plots for one object from each of these clusters. The For most of the datasets we experimented with, picking 10 to 20 ap lots show the lof over minPts for the range from 10 to 50. While pears to work well in general the lOF of an object in Sy is very stable around I, the lO,s of the Next, we turn to the sclcction of a rcasonablc valuc of MinPtsUB objects in S, and S3 change more wildly the upper bound value of the range of MinPts values. Like the lower bound Min PtsLB, the upper bound also has an associated meaning 6.2 Determining a Range of MinPts values Let C be a set/cluster of close by" objects. Then MinPisUB can be Because the LOF value can go up and down, we propose as a heu regarded as the maximum cardinality of C for all objects in C to po ristic that we use a range of MinPts values. In the following, we tentially be local outliers By close by "we mean, that the direct provide guidelines as to how this range can be picked. We use min, directmax, indirectmin and indircctmax values are all very simi MinPtslB and minPtsUb to denote the lower bound and the"up lar. In this case, for MinPts values exceeding MinPts(B, theorem I per bound” of the range requires that the loF of all objects in C be close to 1. Hencc, the Let us first determine a reasonable value of MinPtsLB. Clearly guideline we provide for picking Min Pts UB is the maximum num MinPtslB can bc as small as 2. Howcvcr, as cxplaincd abovc and ber of close by "objects that can potentially be local outliers before definition 5. it is wise to remove unwanted statistical fluctu As an example, let us consider the situation shown in figure 8 ations due to Min Pis being too small. As an example, for the Gaus again. Recall that S consists of 10 objects, S, of35 objects and s sian distribution shown in figure 7, the standard deviation of LOF of 500 objects. From the plots, it is clear that the objects in S3 are only stabilizes when Min Pts lB is at least 10. As another extreme never outliers, always having their lOF values close to l. In con example, suppose we turn the (iaussian distribution in figure 7 to a trast, the objects in S are strong outliers for MinPts values between uniform distribution. It turns out that for min Pts less than 10. there 10 and 35. The objects in S, are outliers starting at Min Pts=45 can be objects whose LOF are significant greater than 1. This is The reason for the last two effects is that, beginning at Min Pts=36 counterintuitive because in a uniforn distribution, no objec the MinPtsnearest neighborhoods of the objects in S, start to in should be labeled as outlying. Thus, the first guideline we provide for picking Min PisLB is that it should bc at Icast 10 to rcmovc un cludc somc object(s) from S. From thcrc on, thc objects in S and wanted statistical fluctuations S2 exhibit roughly the same behavior. Now at Min Pis=45, the members of"combined"set of objects S and s, start to include The second guideline we provide for picking MinPts/B is based on a more subtle observation. Consider a simple situation of one objec object(s) from S3 in their neighborhoods, and thus starting to be p and a set/cluster C of objects. If C contains fewer than MinPtsLB come outliers relative to S3. Depending on the application domain, objects, then the set of MinPisnearest neighbors of each object in we may want to consider a group of 35 objects (like S2)acluster or Cwill include p, and vice versa. Thus, by applying theorem l, the a bunch of " close by local outliers. To facilitate this we can l OFof n and all the objects in C will be quite similar, thus making choose a MinPtsUB valuc accordingly, that is cithcr smaller than 35 p indistinguishable from the objects in C. or larger than 35. A similar argument can be made for MinPtsLB If, on the other hand, C contains more than Minpts B objects, the with respect to the minimum number of objects relative to which MinPtsnearest neighborhoods of the objects deep in C will not other objects can be considered local outliers contain p, but some objects of C will be included in p's neighbor Having determined MinPtsLB and MinPis UB, we can compute for the distance be d and the h object its LOF val the he density of C, the loF of p can be quite different from that of an ob ristic franking all objects with respect to the maximum Lof value ect in C. The key observation here is that Min/ts/B can be ithin the spe gc Is, thc ranking of an object p 1s ed as the minimum number of objects a""(like C above)has based on: max(LOFMinPts(p) Min Pts"MinPts"Min PtS UB) to contain, so that other objects (like p above)can be local outliers 8 point in S1 point in S point in S3 Example dataset MinPts(10 to 50) Min Pts(10 to 50) MinPts(10 to 50) Figure 8: Ranges of loF values for different objects in a sample dataset Given all the LOf values within the range instead of taking the players, for which we happen to have a domain expert " handy, maximun, we could take other aggregates, such as the minimun or who confirmed the meaning fulness of the outliers found. The last the mean. The situation in figure 8 shows that taking the minimum subsection contains performance experiments showing the practi could bc inappropriate as thc minimum may crasc thc outlying na cability of our approach cvcn for largc, highdimcnsional datasets ture of an object completely. Taking the mean may also have the ef fect of diluting the outlying nature of the object. We propose to take Additionally, we conducted experiments with a 64dimensiona dataset, to demonstrate that our definitions are reasonable in very the maximum to highlight the instance at which the object is the most outl high dimensional spaces. The feature vectors used are color histo grams extracted from tv snapshots [2]. We indentified multiple clusters, e.g. a cluster of pictures from a tennis match, and reason 7. EXPERIMENTS able local outliers with LOF values of up to 7 In this section, with the proposed heuristic of taking the maximum LOF value within the range, we show that our ideas can be used to 7.1 A Synthetic Example successfully identify outliers which appear to be meaningful bu The left side of figure 9 shows a 2dimensional dataset containing cannot be identified by other methods. We start with a synthetical one low density Gaussian cluster of 200 objects and three large 2dimensional dataset for which we show the outlier factors for all lusters of 500 objects each. Among these three, one is a dense objects, in order to give an intuitive notion of the lO/ values com Gaussian cluster and the other two are uniform clusters of different puted. The second example uses the realworld dataset that has densities. Furthermore, it contains a couple of outliers. On the right been used in [kn98] to evaluate the Db(pct, dmin )outliers. We re sidc of figure 9 wc plot thc LOF of all thc objects for MinPts=40 peat their experiments to validate our method. In the third example, as a third dimension. We see that the objects in the uniform clusters we identify meaningfull outliers in a data base of german soccer ill have their LOFequal to 1. Most objects in the Gaussian clusters 14 12 810 8 6 Figure 9: Outlierfactors for points in a sample dataset (Min Pis=40) also have 1 as their LOF values Slightly outside the gaussian clus ters, there are several weak outliers, i.e., those with relatively low, Outlier Games Goals Rank Player name Factor Played Scored/Positio but larger than 1, LOF values. The remaining seven objects all have gnificantly larger LOF valucs. Furthcrmorc, it is clcar from thc Michael preetz Offense figure that the value of the lof for each of these outliers depends 21.70 Michael Schjonberg 15 Defense on the density of the cluster(s) relative to which the object is an out 1.67 HansJorg Butt. Goalie lier, and the distance of the outlier to the cluster(s) 4 1.63 Ulf Kirsten Offense 7.2 Hockey Data Giovane ether 320 6793 Offense ImInllInulInl 0 In [13], the authors conducted a number of experiments on histori cal NhL player data; scc [13] for a morc dctailcd cxplanation of thc meulan attributes used. We repeat their experiments on the nhl96 dataset, computing the maximum LOF in the MinPts range of 30 to 50 maximum For the first test, on the 3dimensional subspace of points scored, mean 18.01.9 plusminus statistics and penaltyminutes, they identified Vladimir standard deviation 11.0 3.0 Konstantinov as the only DB(O998, 263044)outlier. He was also our top outlier with the LOF value of 2. 4. The second strongest lo Table 3: Results of the soccer player dataset cal outlier, with the LOF of2. 0, is Matthew Barnaby For most out and scored 7 goals. He was the only goalie to score any goal; he too liers found, we do not explain why they are outliers from a domain expert standpoint here: the interested reader can find this informa kicked the penalty shots for his team. On rank positions four and fivc, wc found Ulf Kirsten and Giovanc Elber, two offensive play Lion in [13]. The point here is that by ranking outliers with their maximum LOF value, we get almost identical results. In the next ers with very high scoring averages subsection, wc show how this approach can identify somc outliers 7.4 Performance that [13] cannot find. In this section, we evaluate the performance of the computation of In the sccond test, thcy idcntificd the Db(0.997, 5)outliers in thc 3 dimensional subspace of games played. goals scored and shooting LOF. The following experiments were conducted on an Pentium I11450 workstation with 256 MB main memory running linux 2.2 percentage, finding Chris Osgood and Mario lemieux as outliers Again, they are our top outliers, Chris Osgood with the LOF of 6.0 All algorithms wcrc implcmcntcd in Java and cxccutcd on the IBM and mario Lemieux with the loF of 2. 8 On our ranked list base JVM 1. 8. The datasets used were generated randomly, containing different numbers of gaussian clusters of different sizes and densi on LOF, Stevc Poapst, ranked third with the lOF of 2.5, played ties. All times are wallclock times. i. e include CPUtime and I/o only three games, scored once and had a shooting percentage o 50% To compute the lof values within the range between MinPtsLB and Min Pis UB, for all the n objects in the database D, we imple 7.3 Soccer Data mented a twostep algorithm. In the first step, the MinPtsCBnear cst neighborhoods arc found, and in thc sccond stcp the lOFs arc In the following experiment, we computed the local outliers for a computed Let us look at these two steps in detail database of soccerplayer information from the" FuBball 1 Bundes liga"(the Gcrman national soccer Icaguc)for thc season 1998/99 In the first step, the MinPtsUBncarcst neighbors for cvcry point p The database consists of 375 players, containing the name, the are materialized, together with their distances to p. The result of this number of games played, the number of goals scored and the posi step is a materialization database M of size n"MinPts Ub distances tion of the player(goalie, defense, center, offense). From these we Note that the size of this intermediate result is independent of the derived the average number of goals scored per game, and per dimension of the original data. The runtime complexity of this step formcd outlier dctection on the thrccdimcnsional subspacc o number of games, average number of goals per game and position (coded as an integer In general, this dataset can be partitioned into four clusters corresponding to the positions of the players. Wecom 14000 5d puted the lOF values in the minPts range of 30 to 50. Below we 12000 discuss all the local outliers with LOF>1.5(scc tablc 3), and cx plain why they are exceptional 9000 Thc strongest outlier is Michacl Prcctz, who played the maximum number of games and also scored the maximum number of goals, which made him the top scorer in the league( Torschutzenkonig") He was an outlier relative to the cluster of offensive players. The 2000 second strongest outlier is Michael Schjonberg. Ile played an aver 0 agc numbcr of games but he was an outlier bccausc most other dc 200 30U 4005006007 fense players had a much lower average number of goals scored per n[*1000 game. The reason for this is that he kicked the penalty shots (elf meter")for his team. The player that was ranked third is HansJOrg Figure 10: Runtime of the materialization of the 50nn queries for different dataset sizes and Butt, a goalie who played the maximum number of games possible diffcrent dimensions using an index 10
 《异常检测——从经典算法到深度学习》2 基于LOF的异常检测算法 5609202005232. 基于 LOF 的异常检测算法 此篇主要介绍以下内容： LOF 算法概述 LOF 算法应用实例 小结 2.1 LOF算法概述 LOF (Local Outliers Factor，局部异常因子) 算法 是一种非监督异常检测算法，它是通过计算给定数据点相对于其邻域的局部密度偏差而实现异常检测。LOF: Identifying DensityBased Local Outliers 论文下载 核心思路： LOF算法是通过比较每个点p和邻域点的密度来判断该点是否为异常：点p的密度越低，越有可能是异常点。
 【异常检测】局部离群因子（LOF）：识别基于密度的局部异常值（文献学习） 45920210217LOF: Identifying DensityBased Local Outliers Markus M. Breunig, HansPeter Kriegel, Raymond T. Ng, Jörg Sander Proc. ACM SIGMOD 2000 Int. Conf. On Management of Data, Dalles, TX, 2000 局部离群因子（LOF）：识别基于密度的局部异常值 摘要 对于许多KDD应用程序，例如检测电子商务中的犯罪活动、发现罕见的实例或异常值，可能比发
 3.60MB
纳指LOF：2020年年度报告.PDF
20210330纳指LOF：2020年年度报告.PDF
 7.22MB
生物LOF：2020年年度报告.PDF
20210329生物LOF：2020年年度报告.PDF
 1.78MB
纳指LOF：2019年半年度报告.PDF
20210530纳指LOF：2019年半年度报告.PDF
 2.26MB
湾区LOF：2019年年度报告.PDF
20210522湾区LOF：2019年年度报告.PDF
 52.92MB
MachineLearningLoF:努力工作，休息自然会照顾自己源码
20210329机器学习LoF 努力工作，休息自然会照顾自己。 第四周 5.1排序和搜索：为什么要打扰这些简单的任务？ 5.2卫星数据和密钥 5.3工作原理：卡片分类 5.4伪代码 5.5用于插入排序的Python代码 5.6正确性 5.7就地分拣 5.8...
 2KB
局部异常因子算法Local Outlier Factor(LOF)matlab
20190402基于密度剔除噪声点和异常数据 局部离群因子 表示点p的邻域点Nk(p)Nk(p)的局部可达密度与点p的局部可达密度之比的平均数。 如果这个比值越接近1，说明p的其邻域点密度差不多，p可能和邻域同属一簇；...
 4.12MB
DSA_LoF:信仰的飞跃源码
20210414DSA_LoF 信仰的飞跃 函数和数组 ＃ 标题 解决方案 时间 空间 困难 标签 笔记 0 O（位数） O（1） 简单的 1个 O（位数） O（1） 简单的 2个 O（位数） O（1） 简单的 递归 迭代是人的，递归神圣的。 基本问题...
 10KB
JoFund：基金净值接口源码
20210217接口1：//天天基金接口当日净值增长不准/ **数据格式：{“ fundcode”：“ 161725”，//基金代码“ name”：“招商中证白酒指数（LOF）”“，//基金名称“ jzrq”：“ 20210209”，//上一个结算日期“ dwjz”：“ ...
 744B
Local Outlier Factor(LOF算法matlab程序)
20181116Local Outlier Factor(LOF算法matlab程序)，常用于离群点检验，异常值剔除等应用中
 1.3MB
practicedashboard:实践。源码
20210508ReactJS提供了一个lof准则并可以执行SPA。 易于扩展，因为它符合惯例 您只需遵循“组件”的概念即可扩展您的webApp。 ReactJS专注于MVC的V。因此您可以快速学习它。 如果要缩放到SPA，可以使用FLUX。 为什么不...
 692KB
MartinLof类型理论中的编程：简介Programming in MartinLof's Type Theory: An Introduction
20191104本书对MartinLof的类型理论进行了详尽的介绍，并提供了有关多态集，子集，单态集以及全套有用示例的信息。
 696B
Local Outlier Factor（LOF）异常检测算法一维DEMO
20180326LOF异常检测算法一维DEMO, 找了好久没有找到，自己写了一个，可以帮助理解概念
 13KB
lof离群算法MATLAB实现有数据
20200730基于密度的局部异常因子检测方法对局部离 群点有极高的灵敏性，若检测对象中不存在离群 点， 各对象的 LOF 数值近似等于 1， 若存在离群点， 则离群点的LOF数值随离群程度的增大而增大
 102.15MB
martinlof:Per MartinLöf的论文源码
20210503PerMartinLöf的收藏作品 如果有任何不正确或遗失的作品，请通知我们。 标题 日期 1966年 1969年 关于构造数学的注释（Almqvist和Wiksell） 1970年 类型理论（斯德哥尔摩大学预印本） ...（逻辑讨论会布里斯托尔会议...
 38KB
YourFathersNightmare:Commodore 64 迷宫游戏源码
20210529你父亲的噩梦 Commodore 64 迷宫游戏 使用 MiniLD52 的交叉汇编程序 CBMPrgStudio 编写。 关于游戏的一些内容： : 游戏玩法： : LOfHyg_vlg
 1KB
LOF异常值剔除算法
20180825LOF算法：剔除异常值，用于数据量不大，使用简单，并具有可视化功能，可将异常数据在图上显示出来，方便耐用。。。。所需积分不知道为啥被提高了，在此重新改一下传
 291KB
LOF业绩的实证研究
20200201LOF业绩的实证研究，洪玉庭，张红，LOF(Listed Openended Funds) 作为一种新的基金创新形式，本文总结了LOF在我国的发展特点。借鉴了Treynor与Mauzy的评价基金择时能力与择股能力�
 751KB
使用案例学习LOF算法
20150622本文使用具体的案例详细讲解LOF算法的计算过程，需要对该算法有清楚理解的tx值得下载，可以帮助你快速准确的理解异常检测中的LOF算法。
 7.28MB
AnomalyDetection:无监督和半监督异常检测隔离森林内核PCA检测ADOA等源码
20210510AnomalyDetection Author： MaXiao EMail： 备注： 若文档无法正常显示图片，请参考右方链接： 第一部分：无监督异常检测 (Unsupervised Detection) 1. 算法 1.1 Isolation Forest 算法论文： ...

下载
松下PLC入门到精通全套视频教程 FPXH步进伺服模拟量温度模块 MODBUS通讯教程视频+松下PLC编程软件.zip
松下PLC入门到精通全套视频教程 FPXH步进伺服模拟量温度模块 MODBUS通讯教程视频+松下PLC编程软件.zip

下载
天线宝宝总结考试不挂科
天线宝宝总结考试不挂科

下载
物流行业：从亚马逊看京东物流发展空间20210614兴业证券51页.pdf.pdf
物流行业：从亚马逊看京东物流发展空间20210614兴业证券51页.pdf.pdf

下载
中国社会办医集团化管理之路.pdf.pdf
中国社会办医集团化管理之路.pdf.pdf

下载
基于MATLAB的图像拼接技术及其实现.pdf
基于MATLAB的图像拼接技术及其实现.pdf

下载
威纶触摸屏宏指令编程 视频教程（13集) +程序+手册 在送威伦触摸屏基础教程视频.zip
威纶触摸屏宏指令编程 视频教程（13集) +程序+手册 在送威伦触摸屏基础教程视频.zip

下载
新潮传媒奶制品行业洞察及营销策略.pdf.pdf
新潮传媒奶制品行业洞察及营销策略.pdf.pdf

下载
海南自由贸易港建设白皮书（2021）.pdf.pdf
海南自由贸易港建设白皮书（2021）.pdf.pdf

下载
java的概述.docx
java的概述.docx

下载
content_1622475543810(1).pdf
content_1622475543810(1).pdf