[1]倪巍伟,陈耿,陆介平,等.基于nested-loop的大数据集快速离群点检测算法[J].东南大学学报(自然科学版),2006,36(3):463-466.[doi:10.3969/j.issn.1001-0505.2006.03.027]
 Ni Weiwei,Chen Geng,Lu Jieping,et al.Efficient nested-loop based outlier detection algorithm for large data set[J].Journal of Southeast University (Natural Science Edition),2006,36(3):463-466.[doi:10.3969/j.issn.1001-0505.2006.03.027]
点击复制

基于nested-loop的大数据集快速离群点检测算法()
分享到:

《东南大学学报(自然科学版)》[ISSN:1001-0505/CN:32-1178/N]

卷:
36
期数:
2006年第3期
页码:
463-466
栏目:
计算机科学与工程
出版日期:
2006-05-20

文章信息/Info

Title:
Efficient nested-loop based outlier detection algorithm for large data set
作者:
倪巍伟1 陈耿2 陆介平1 孙志挥1
1 东南大学计算机科学与工程学院, 南京 210096; 2 南京审计学院审计信息工程重点实验室,南京 210029
Author(s):
Ni Weiwei1 Chen Geng2 Lu Jieping1 Sun Zhihui1
1 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2 Key Laboratory of Audit Information Engineering, Nanjing Audit University, Nanjing 210029, China
关键词:
大数据集 模信息表 向量内积不等式 离群点检测
Keywords:
large data set mode table vectors’ inner product inequation outlier detection
分类号:
TP311
DOI:
10.3969/j.issn.1001-0505.2006.03.027
摘要:
针对已有的多数离群点检测算法存在扩展性差,不能有效应用于大数据集的问题,在已有的基于距离的离群点检测算法的基础上,设计模信息表存储结构,利用向量内积不等式关系以及合理的存储分配和调度策略,提出一种高效离群点检测算法DBoda.该算法通过在预处理中存储每个点的模信息,减少点间距离的计算量,并对嵌套循环方法进行优化,进一步减少I/O的开销.理论分析和试验结果表明,所提算法具有时间消耗小和适用于处理大数据集的特点,可以有效地解决离群点检测中的算法时间复杂性和算法扩展性问题.
Abstract:
Most of the existed outlier detection algorithms have the limitation in algorithms’ expansibility, and cannot be used efficiently for the large data set. To solve this problem, mode storage structure and vectors’ inner product inequation are designed, the suitable storage allocating method and the I/O strategy are adopted. Furthermore, based on the existed distance-based outlier detection algorithm, an efficient nested-based outlier detection algorithm DBoda is proposed, which is suitable for the large data set. Two strategies are adopted in the algorithm. Firstly, during the pretreatment process, each data point’s mode information is stored to reduce the computation work. Secondly, optimization is adopted in the nested loop step to reduce I/O. Theoretical analysis and experiment results testify that DBoda is efficient and suitable to deal with large data set. It can solve the time complexity and expansibility problem of outlier detection algorithms.

参考文献/References:

[1] Han J,Kamber M.Data mining[M].New York:Morgan Kaufmann Publishers,2001:1-321.
[2] Johnson T,Kwok I,Ng R.Fast computation of 2-dimensional depth contours [C] // Proc 4th Int Conf on Knowledge Discovery and Data Mining.New York,1998:224-228.
[3] Knorr E,Ng R.Algorithms for mining distance-based outliers in large datasets[C] //Gupta Ashish,ed. Proceedings of the 24th Conference on VLDB.New York,1998:392-403.
[4] Yu D,Sheikholeslami G,Zhang A.Findout:finding outliers in very large datasets [R].Technical report 99-03,Department of Computer Science and Engineering,State University of New York at Buffalo,1999.
[5] Breunig M M,Kreigel H P,Ng R T,et al.LOF:identifying density-based local outliers[C] //Chen Weidong,Naughton Jeffrey F,Bernstein Philip A,eds.Proceedings of the ACM SIGMOD Conference.Dallas,TX,2000:93-104.
[6] Joshi M,Agarwal R,Kumar V.Mining needles in a haystack:classifying rare classes via two-phase rule induction[C] //Aref Walid G,ed.Proc of ACM SIGMOD Int Conf on Management of Data.Santa Barbara,CA,2001:91-102.
[7] Hawkins D. Identification of outliers[M].London:Chapman and Hall,1980:1-45.
[8] Breunig M M,Kriegel H P,Ng R T,et al.Optics-of:identifying local outliers[C] // Proc of PKDD’99,Lecture Notes in Computer Science(LNAI 1704).Springer Verlag,1999:262-270.
[9] Li Cunhua,Sun Zhihui.Approximation approach to grid-based clustering algorithms and its error evaluation[J].Journal of Software,2003,14(7):1267-1274.
[10] Ramaswamy S,Rastogi R,Shim K.Efficient algorithms for mining outliers from large data sets[J].ACM Sigmoid Record,2000,29(2):427-438.

备注/Memo

备注/Memo:
基金项目: 国家自然科学基金资助项目(70371015)、高等学校博士学科点专项科研基金资助项目(20040286009)、审计署审计科研所专项资助项目(SK2006007).
作者简介: 倪巍伟(1979—),男,博士; 孙志挥(联系人),男, 教授,博士生导师,sunzh@seu.edu.cn.
更新日期/Last Update: 2006-05-20