[1] Hu Wenyu, Sun Zhihui, Zhang Baili. Sampling method using optimizable K-dissimilarity for distributed data mining[J]. Journal of Southeast University (Natural Science Edition), 2008, 38(3): 385-389. [doi:10.3969/j.issn.1001-0505.2008.03.005]

Sampling method using optimizable K-dissimilarity for distributed data mining

Journal of Southeast University (Natural Science Edition) [ISSN:1001-0505/CN:32-1178/N]

Volume:
38
Issue:
No. 3, 2008
Pages:
385-389
Section:
Computer Science and Engineering
Publication date:
2008-05-20

Article Info

Title:
Sampling method using optimizable K-dissimilarity for distributed data mining
Author(s):
Hu Wenyu 1,2; Sun Zhihui 1; Zhang Baili 1
1 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2 Department of Computer and Information Science, Fujian University of Technology, Fuzhou 350014, China
Keywords:
distributed data mining (DDM); optimizable K-dissimilarity selection method; agent
Classification number (CLC):
TP311.13
DOI:
10.3969/j.issn.1001-0505.2008.03.005
Abstract:
To overcome the shortcomings of distributed data mining approaches that rely on centralized processing, and to carry out distributed data mining (DDM) tasks effectively, a technique is needed for obtaining a diverse, representative sample from distributed data sources. A novel data sampling algorithm for DDM environments, OptiSim-DDM, is proposed. Its core is data selection based on optimizable K-dissimilarity: it combines mobile agent technology with an extended optimizable K-dissimilarity method for selecting diverse representative subsets, and it selects a diverse, representative sample of the global dataset by visiting the distributed data sites in turn. By reducing the size of the dataset to be mined, the method lowers the time and space complexity of the data mining algorithms, cuts network communication costs, and improves execution efficiency. It is suited to DDM tasks in which the data at the individual sites are interrelated and interdependent. Experimental results confirm that the method is feasible and effective.
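The abstract gives only the outline of the selection procedure, so the following Python sketch is an illustration under stated assumptions, not the authors' OptiSim-DDM implementation. It assumes the general OptiSim scheme (draw a subsample of up to `k` candidates that are at least `radius` away from everything already selected, keep the most dissimilar one, and recycle the rest) and models the mobile agent simply as a loop that visits the sites in turn. The function names, the Euclidean distance, and the parameters `k`, `radius`, `m`, and `seed` are all hypothetical choices made for this example.

```python
import math
import random


def nearest_selected(point, selected):
    """Distance from `point` to the closest already-selected point."""
    return min(math.dist(point, s) for s in selected)


def optisim_round_robin(sites, k, radius, m, seed=0):
    """Pick up to `m` diverse representatives, visiting the sites in turn.

    sites  : list of lists of numeric tuples, one list per data site
    k      : maximum subsample size examined per selection step
    radius : minimum distance allowed between any two selected points
    m      : target sample size
    """
    if m <= 0:
        return []
    rng = random.Random(seed)
    pools = [list(site) for site in sites]      # untested candidates per site
    for pool in pools:
        rng.shuffle(pool)
    recycle = [[] for _ in pools]               # candidates to retry later

    # Assumption: the first non-empty site supplies an arbitrary seed point.
    first = next((i for i, pool in enumerate(pools) if pool), None)
    if first is None:
        return []
    selected = [pools[first].pop()]

    site, idle = first, 0                       # idle = consecutive exhausted sites
    while len(selected) < m and idle < len(pools):
        site = (site + 1) % len(pools)          # the "agent" moves to the next site
        subsample = []
        while len(subsample) < k:
            if not pools[site]:
                if not recycle[site]:
                    break                       # nothing left to draw at this site
                pools[site], recycle[site] = recycle[site], []
            cand = pools[site].pop()
            # Candidates closer than `radius` to the sample are dropped for good.
            if nearest_selected(cand, selected) >= radius:
                subsample.append(cand)
        if not subsample:
            idle += 1                           # this site is fully exhausted
            continue
        idle = 0
        # Keep the candidate that is most dissimilar to the current sample.
        best = max(subsample, key=lambda p: nearest_selected(p, selected))
        selected.append(best)
        subsample.remove(best)
        recycle[site].extend(subsample)         # the others get another chance
    return selected


if __name__ == "__main__":
    demo_sites = [[(0, 0), (0.1, 0), (5, 5)], [(10, 0), (10, 0.2)], [(0, 10)]]
    print(optisim_round_robin(demo_sites, k=2, radius=1.0, m=4))
```

In this sketch a small `k` behaves more like random sampling with an exclusion radius, while a large `k` approaches maximum-dissimilarity selection; only the selected points, not the full site data, would need to travel between sites.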

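Memo

Author biographies: Hu Wenyu (born 1963), female, Ph.D. candidate, associate professor; Sun Zhihui (corresponding author), male, professor, doctoral supervisor, sunzh@seu.edu.cn.
Funding: National Natural Science Foundation of China (70371015); Doctoral Program Research Fund for Higher Education Institutions, Ministry of Education (20040286009); Science and Technology Project of the Fujian Provincial Department of Education (JB06142).
Citation format: Hu Wenyu, Sun Zhihui, Zhang Baili, et al. Sampling method using optimizable K-dissimilarity for distributed data mining[J]. Journal of Southeast University: Natural Science Edition, 2008, 38(3): 385-389.
Last Update: 2008-05-20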