# Implementing the LOF Algorithm in Python
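As orientation for the notebook cells that follow, here is a minimal, dependency-free sketch of the score they compute. It assumes the standard LOF definition: the k-distance of a point, the reachability distance `max(d(p, o), k_dist(o))`, the local reachability density as the inverse of the mean reachability distance, and the LOF as the mean neighbour density divided by the point's own density. The function name `lof_scores` and the toy data are hypothetical, chosen only for illustration; the notebook itself uses NumPy and scikit-learn.

```python
import math


def lof_scores(points, k):
    """Brute-force LOF for a small list of coordinate tuples (illustrative only).

    Assumes no k+1 points coincide, otherwise a mean reachability distance
    could be zero.
    """
    n = len(points)
    # k nearest neighbours of every point (excluding itself), nearest first
    knn = []
    for i in range(n):
        others = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        knn.append(others[:k])
    # k-distance: distance from each point to its k-th nearest neighbour
    k_dist = [neigh[-1][0] for neigh in knn]
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for neigh in knn:
        reach = [max(d, k_dist[j]) for d, j in neigh]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours divided by the point's own density
    return [
        sum(lrd[j] for _, j in neigh) / (len(neigh) * lrd[i])
        for i, neigh in enumerate(knn)
    ]


# toy data: a 5 x 5 unit grid plus one far-away point (index 25)
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
scores = lof_scores(grid + [(20.0, 20.0)], k=3)
print(scores.index(max(scores)))  # prints 25, the isolated point
```

On this toy set every grid point scores below 2 and the isolated point scores far above it, which matches the threshold of about 2 used in the notebook.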


    ++++ took 9.99999 msecs for Outlier scoring

Now let's look at the histogram of the outlier score, to choose the optimal threshold for deciding whether a data point is an outlier or not.

In :

    # weights normalise the histogram so that the bars sum to 1
    weights = np.ones_like(outlier_score) / outlier_score.shape[0]
    hist(outlier_score, bins=50, weights=weights, histtype='stepfilled', color='cyan')
    title('Distribution of outlier score')

Out :

    <matplotlib.text.Text at 0x36030588>

[Figure: histogram titled "Distribution of outlier score"]

It can be observed that the optimal outlier score threshold for deciding whether a data point is an outlier or not is around 2 in most cases, so let's use it to see our results.

In :

    threshold = 2
    # plot non outliers as green
    scatter(data[:, 0], data[:, 1], c='green', s=10, edgecolors='None', alpha=0.5)
    # find the outliers and plot the outliers
    idx = np.where(outlier_score > threshold)
    scatter(data[idx, 0], data[idx, 1], c='red', s=10, edgecolors='None', alpha=0.5)

Out :

    <matplotlib.collections.PathCollection at 0x3640e6a0>

[Figure: scatter plot of the data, outliers in red]

We have seen the results of LOF with the naive approach for KNN queries. Now let's see the optimisation with KD-Trees.

## Using KD Trees

### KD-Tree insertion and KNN query

In :

    from sklearn.neighbors import KDTree as Tree
    tic = time.time()
    BT = Tree(data, leaf_size=5, p=2)
    # query for k + 1 nearest, because one of the returned points is the point itself
    dx, idx_knn = BT.query(data[:, :], k=k + 1)
    print('+ took %g msecs for Tree KNN Querying' % ((time.time() - tic) * 1000))

    + took 122 msecs for Tree KNN Querying

### LRD computation

In :

    tic = time.time()
    # drop the first column (the point itself)
    dx, idx_knn = dx[:, 1:], idx_knn[:, 1:]
    # get the radius for each point in the dataset
    # radius is the distance of the kth nearest point for each point in the dataset
    radius = dx[:, -1]
    # calculate the local reachability density
    # (LRD holds the mean reachability distance; its inverse, rho, is the density)
    LRD = np.mean(np.maximum(dx, radius[idx_knn]), axis=1)
    print('++ took %g msecs for LRD computation' % ((time.time() - tic) * 1000))

    ++ took 8.99982 msecs for LRD computation

The rest is the same as before, so I am just replicating the result for completeness.

In :

    # calculating the outlier score
    tic = time.time()
    rho = 1. / np.array(LRD)  # inverse of LRD
    outlier_score = np.sum(rho[idx_knn], axis=1) / np.array(rho, dtype=np.float16)
    outlier_score *= 1. / k
    print('++++ took %g msecs for Outlier scoring' % ((time.time() - tic) * 1000))

    # plotting the histogram of outlier score
    weights = np.ones_like(outlier_score) / outlier_score.shape[0]  # normalise the histogram
    hist(outlier_score, bins=50, weights=weights, histtype='stepfilled', color='cyan')
    title('Distribution of outlier score')

    # plotting the result
    threshold = 2
    # plot non outliers as green
    scatter(data[:, 0], data[:, 1], c='green', s=10, edgecolors='None', alpha=0.5)
    # find the outliers and plot the outliers
    idx = np.where(outlier_score > threshold)
    scatter(data[idx, 0], data[idx, 1], c='red', s=10, edgecolors='None', alpha=0.5)

    ++++ took 4.00019 msecs for Outlier scoring

Out :

    <matplotlib.collections.PathCollection at 0x36ad0b38>

[Figures: histogram titled "Distribution of outlier score"; scatter plot of the data, outliers in red]

The results are the same, as they should be.

## Putting everything together

Let's create a class to combine everything. It will be important for evaluating performance. From the results above, we note that most of the time is spent on KNN querying.

In :

    import numpy as np
    import matplotlib.pyplot as plt
    import sys
    from sklearn.neighbors import DistanceMetric
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import KDTree as Tree

    def exit():
        sys.exit()

    class LOF:
        def __init__(self, k=3):
            self.k = k

        # a function to create synthetic test data
        def generate_data(self, n=500, dim=3):
            # NOTE: the assignment of the cluster sizes n1, n2 and n3 is illegible
            # in the source; they must sum to n so that the noise array below fits
            # cluster of gaussian random data
            data1, _ = make_blobs(n1, dim, centers=3)
            # cluster of uniform random variable
            data2 = np.random.uniform(0, 25, size=(n2, dim))
            # cluster of dense uniform random variable
            data3 = np.random.uniform(100, 200, size=(n3, dim))
            # mix the three datasets
            self.data = np.vstack((np.vstack((data1, data2)), data3))
            np.random.shuffle(self.data)
            # add some noise: zipf is a skewed distribution
            zipf_alpha = 2.5
            noise = np.random.zipf(zipf_alpha, (n, dim)) * \
                np.sign((np.random.randint(2, size=(n, dim)) - 0.5))
            self.data += noise

        # KNN querying with naive approach
        def knn_naive(self):
            # distance between points
            import time
            tic = time.time()
            dist = DistanceMetric.get_metric('euclidean').pairwise(self.data)
            print('+ took %g msecs for Distance computation' % ((time.time() - tic) * 1000))
            tic = time.time()
            # get the radius for each point in the dataset
            # radius is the distance of the kth nearest point for each point in the dataset
            self.idx_knn = np.argsort(dist, axis=1)[:, 1:self.k + 1]  # by row, get k nearest neighbours
            radius = np.linalg.norm(self.data - self.data[self.idx_knn[:, -1]], axis=1)  # radius
            print('++ took %g msecs for KNN Querying' % ((time.time() - tic) * 1000))
            # calculate the local reachability density
            LRD = []
            for i in range(self.idx_knn.shape[0]):
                LRD.append(np.mean(np.maximum(dist[i, self.idx_knn[i]], radius[self.idx_knn[i]])))
            return np.array(LRD)

        # KNN querying with KDTrees
        def knn_tree(self):
            import time
            tic = time.time()
            BT = Tree(self.data, leaf_size=5, p=2)
            # query for k + 1 nearest, because one of the returned points is the point itself
            dx, self.idx_knn = BT.query(self.data[:, :], k=self.k + 1)
            print('+ took %g msecs for Tree KNN Querying' % ((time.time() - tic) * 1000))
            dx, self.idx_knn = dx[:, 1:], self.idx_knn[:, 1:]
            # get the radius for each point in the dataset
            # radius is the distance of the kth nearest point for each point in the dataset
            radius = dx[:, -1]
            # calculate the local reachability density
            LRD = np.mean(np.maximum(dx, radius[self.idx_knn]), axis=1)
            return LRD

        def train(self, data=None, method='Naive'):
            # check if a dataset is provided for training
            try:
                # "is not None" avoids an elementwise comparison on NumPy arrays
                assert data is not None and data.shape[0]
                self.data = data
                n = self.data.shape[0]  # number of data points
            except AssertionError:
                try:
                    n = self.data.shape[0]  # number of data points
                except AttributeError:
                    print('No data to fit the model, please provide data or call generate_data method')
                    exit()
            try:
                assert method.lower() in ['naive', 'n', 'tree', 't']
            except AssertionError:
                print('Method must be Naive|n or tree|t')
                exit()
            # find rho, which is the inverse of LRD
            if method.lower() in ['naive', 'n']:
                rho = 1. / self.knn_naive()
            elif method.lower() in ['tree', 't']:
                rho = 1. / self.knn_tree()
            self.score = np.sum(rho[self.idx_knn], axis=1) / np.array(rho, dtype=np.float16)
            self.score *= 1. / self.k

        def plot(self, threshold=None):
            # set the threshold
            if not threshold:
                from scipy.stats.mstats import mquantiles
                threshold = max(mquantiles(self.score, prob=0.95), 2)
            self.threshold = threshold
            # reduce data to 2d if required
            if self.data.shape[1] > 2:
                from sklearn.decomposition import PCA
                pca = PCA(n_components=2)
                self.data = pca.fit_transform(self.data)
            # plot non outliers as green
            plt.figure()
            plt.scatter(self.data[:, 0], self.data[:, 1], c='green', s=10, edgecolors='None', alpha=0.5)
            # find the outliers and plot the outliers
            idx = np.where(self.score > self.threshold)
            plt.scatter(self.data[idx, 0], self.data[idx, 1], c='red', s=10, edgecolors='None', alpha=0.5)
            plt.legend(['Normal', 'Outliers'])
            # plot the distribution of outlier score
            plt.figure()
            weights = np.ones_like(self.score) / self.score.shape[0]
            plt.hist(self.score, bins=25, weights=weights, histtype='stepfilled', color='cyan')
            plt.title('Distribution of outlier score')

## Performance Evaluation

Let's create a function to evaluate the performance.

In :

    def perf_test(n_list=None, methods=['Tree', 'Naive'], plot=False):
        import time
        if not n_list:
            n_list = [2 ** i for i in range(7, 14)]
        result = []
        result.append(n_list)
        for m in methods:
            temp = []
            for n in n_list:
                tic = time.time()
                lof = LOF(k=5)
                lof.generate_data(n=n, dim=2)
                lof.train(method=m)
                temp.append(1000000 * (time.time() - tic))  # microseconds, for the plot
                print('Took %g msecs with %s method for %d datapoints' %
                      ((time.time() - tic) * 1000, m, n))
            result.append(temp)
        if plot:
            fig, ax = plt.subplots()
            ax.set_xscale('log', basex=2)
            ax.set_yscale('log', basey=10)
            plt.plot(result[0], result[1], 'm*-', ms=10, mec=None)
            try:
                plt.plot(result[0], result[2], 'co--', ms=8, mec=None)
            except IndexError:
                pass
            plt.xlabel('Number of data points $n$')
            plt.ylabel(r'Time of execution $\mu secs$')
            plt.legend(methods, loc='upper left')
            plt.show()

Now let's compare the performance of the two methods, the Naive and KDTree implementations.

In :

    perf_test(methods=['Tree', 'Naive'], n_list=[2 ** i for i in range(4, 14)], plot=True)

    Took 2.00009 msecs with Tree method for 16 datapoints
    Took 1.99986 msecs with Tree method for 32 datapoints
    Took 2.00009 msecs with Tree method for 64 datapoints
    Took 3.00002 msecs with Tree method for 128 datapoints
    Took 4.99988 msecs with Tree method for 256 datapoints
    Took 11.0002 msecs with Tree method for 512 datapoints
    Took 20.9999 msecs with Tree method for 1024 datapoints
    Took 48.0001 msecs with Tree method for 2048 datapoints
    Took 106 msecs with Tree method for 4096 datapoints
    Took 179 msecs with Tree method for 8192 datapoints
    Took 3.00002 msecs with Naive method for 16 datapoints
    Took 3.00002 msecs with Naive method for 32 datapoints
    Took 6.00004 msecs with Naive method for 64 datapoints
    Took 13 msecs with Naive method for 128 datapoints
    Took 30.9999 msecs with Naive method for 256 datapoints
    Took 82.9999 msecs with Naive method for 512 datapoints
    Took 249 msecs with Naive method for 1024 datapoints
    Took 834 msecs with Naive method for 2048 datapoints
    Took 3734 msecs with Naive method for 4096 datapoints
    Took 15796 msecs with Naive method for 8192 datapoints

[Figure: log-log plot of execution time against the number of data points n, for the Tree and Naive methods]

We see that the KD-Tree outperforms the Naive method for large n, although the gap is small for small datasets. On my PC, I cannot run the Naive method beyond 2^[…] datapoints without getting a MemoryError.
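The memory limit of the naive approach can be estimated with simple arithmetic. The sketch below assumes that the dominant allocations are the n x n matrix of pairwise distances (float64) and the n x n index array that `np.argsort(dist, axis=1)` returns (8-byte integers on a 64-bit platform); temporary buffers are ignored. The helper name `naive_memory_bytes` is hypothetical.

```python
def naive_memory_bytes(n, itemsize=8):
    """Rough lower bound on the memory used by the naive KNN step."""
    dist_matrix = n * n * itemsize  # float64 pairwise distances
    sort_index = n * n * itemsize   # integer indices from argsort over every row
    return dist_matrix + sort_index


for exp in (13, 14, 15, 16):
    n = 2 ** exp
    gib = naive_memory_bytes(n) / 2.0 ** 30
    print("n = 2^%d: about %.0f GiB" % (exp, gib))
```

Under these assumptions the requirement quadruples every time n doubles, which is why the naive method stops being practical after only a few more doublings, while the KD-Tree query never materialises the full distance matrix.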
So let's evaluate the performance of KD-Trees up to 1 million datapoints.

In :

    perf_test(methods=['Tree'], n_list=[2 ** i for i in range(4, 21)], plot=True)

    Took 2.00009 msecs with Tree method for 16 datapoints
    Took 2.00009 msecs with Tree method for 32 datapoints
    Took 1.99986 msecs with Tree method for 64 datapoints
    Took 3.00002 msecs with Tree method for 128 datapoints
    Took 6.00004 msecs with Tree method for 256 datapoints
    Took 9.00006 msecs with Tree method for 512 datapoints
    Took 20 msecs with Tree method for 1024 datapoints
    Took 50 msecs with Tree method for 2048 datapoints
    Took 108 msecs with Tree method for 4096 datapoints
    Took 194 msecs with Tree method for 8192 datapoints
    Took 396 msecs with Tree method for 16384 datapoints
    Took 837 msecs with Tree method for 32768 datapoints
    Took 1741 msecs with Tree method for 65536 datapoints
    Took 3596 msecs with Tree method for 131072 datapoints
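One way to read these timings is to ask how the run time grows each time n doubles. The sketch below takes the largest measurements printed in the two runs above and computes the base-2 logarithm of each successive ratio, an empirical scaling exponent: about 1 means roughly linear growth, about 2 means quadratic. The helper name `doubling_exponents` is hypothetical, and single-run timings like these are noisy, so the values are only indicative.

```python
import math

# msecs copied from the printed output, for n = 1024, 2048, 4096, 8192 (Naive run)
naive_ms = [249, 834, 3734, 15796]
# msecs for n = 8192, 16384, 32768, 65536, 131072 (Tree-only run)
tree_ms = [194, 396, 837, 1741, 3596]


def doubling_exponents(times):
    """log2 of the ratio between consecutive timings, where n doubles each step."""
    return [math.log2(later / earlier) for earlier, later in zip(times, times[1:])]


naive_exp = doubling_exponents(naive_ms)
tree_exp = doubling_exponents(tree_ms)
print("Naive:", ", ".join("%.2f" % e for e in naive_exp))
print("Tree: ", ", ".join("%.2f" % e for e in tree_exp))
```

The naive exponents settle near 2, consistent with building and sorting an n x n distance matrix, while the tree exponents stay only slightly above 1, consistent with the near-linear (n log n) behaviour expected from KD-Tree queries in low dimensions.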
