[1]於跃成,王建东,郑关胜,等.基于约束信息的并行k-means算法[J].东南大学学报(自然科学版),2011,41(3):505-508.[doi:10.3969/j.issn.1001-0505.2011.03.014]
 Yu Yuecheng,Wang Jiandong,Zheng Guansheng,et al.Parallel k-means algorithm based on constrained information[J].Journal of Southeast University (Natural Science Edition),2011,41(3):505-508.[doi:10.3969/j.issn.1001-0505.2011.03.014]
点击复制

基于约束信息的并行k-means算法()
分享到:

《东南大学学报(自然科学版)》[ISSN:1001-0505/CN:32-1178/N]

卷:
41
期数:
2011年第3期
页码:
505-508
栏目:
计算机科学与工程
出版日期:
2011-05-20

文章信息/Info

Title:
Parallel k-means algorithm based on constrained information
作者:
於跃成12王建东1郑关胜1陈斌1
(1南京航空航天大学信息科学与技术学院,南京 210016)(2江苏科技大学计算机科学与工程学院,镇江 212003)
Author(s):
Yu Yuecheng12Wang Jiandong1Zheng Guansheng1Chen Bin1
(1College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)
(2College of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China)
关键词:
k-means并行k-means约束聚类约束并行k-means
Keywords:
k-means parallel k-means constrained clustering constrained parallel k-means
分类号:
TP391.4
DOI:
10.3969/j.issn.1001-0505.2011.03.014
摘要:
为获得分布式数据集上用户所期望的聚类结果,提出了基于约束信息的并行k-means聚类算法.在分析并行k-means能够有效实现对水平分布式数据集进行聚类的基础上,修改并行k-means算法的目标函数,设计约束并行k-means算法,将站点用户的约束信息以chunklet的形式引入到分布式聚类过程,从而引导算法执行有偏搜索.约束并行k-means算法在理论上保证无约束样本簇内距离最小的同时能够确保chunklet约束中的样本与对应的簇中心之间的平均距离最小.实验结果表明,约束并行k-means算法能够有效改善并行k-means的聚类精度,同时在分布式环境下能够得到与已有约束聚类算法在集中式数据集上相等价的聚类结果.
Abstract:
In order to obtain the desired clustering results on the distributed data set, a parallel k-means algorithm is presented based on constrained information. On the basis of the facts that the parallel k-means algorithm can be effectively used in clustering the horizontal distributed data set, the objective function of the parallel k-means algorithm is modified, and the constrained parallel k-means algorithm is designed, then the constrained information of site users is introduced into the distributed clustering process in the form of chunklets, which can guide the algorithm to a bias search. Theoretically the algorithm guarantees the inter-cluster distance among the unconstrained samples to be the closest, and guarantees the average distance between constrained samples in a chunklet and the corresponding cluster center to be the closest one. The results from the experiments show that the algorithm can effectively enhance the clustering precision of parallel k-means, meanwhile it can obtain the clustering results on the distributed data set, which are equivalent to the results of the constrained k-means algorithm running on a centralized data set.

参考文献/References:

[1] Inan A,Kaya S V,Saygin Y,et al.Privacy preserving clustering on horizontally partitioned data [J].Data and Knowledge Engineering,2007,63(3):646-666.
[2] Dhillon I S,Modha D S.A data-clustering algorithm on distributed memory multiprocessors [C]//Proceedings of the KDD’99 Workshop on High Performance Knowledge Discovery.San Digeo,USA,1999:245-260.
[3] MacQueen J.Some methods for classification and analysis of multivariate observations [C]//The 5th Berkeley Symposium on Mathematical Statistics and Probability.Berkeley,USA,1967:281-297.
[4] Bandyopadhyay S,Giannella C,Maulik U,et al.Clustering distributed data streams in peer to peer environments [J].Information Sciences,2006,176(14):1952-1985.
[5] Datta S,Bhaduri K,Giannella C,et al.Distributed data mining in peer-to-peer networks [J].Internet Computing,2006,10(4):18-26.
[6] Datta S,Giannella C,Kargupta H.Approximate distributed k-means clustering over a peer-to-peer network [J].IEEE Transactions on Knowledge and Data Engineering,2009,21(10):1372-1388.
[7] Jin R M,Goswami A,Agrawal G.Fast and exact out-of-core and distributed k-means clustering [J].Knowledge and Information Systems,2006,10(1):17-40.
[8] Wagstaff K,Cardie C,Rogers S,et al.Constrained k-means clustering with background knowledge [C]//Proceeding of 18th International Conference on Machine Learning.Williamstown,USA,2001:577-584.
[9] Basu S,Banerjee A,Mooney R J.Active semi-supervision for pairwise constrained clustering [C]//Proceedings of the 2004 SIAM International Conference on Data Mining.Lake Buena Vista,FL,USA,2004:333-344.
[10] Bar-Hillel A,Hertz T,Shental N,et al.Learning a Mahalanobis metric from equivalence constraints [J].Journal of Machine Learning Research,2005,6:937-965.
[11] Alipanahi B,Biggs M,Ghodsi A,et al.Distance metric learning vs.fisher discriminant analysis[C]//Proceedings of the 23th AAAI Conference on Artificial Intelligence.Chicago,USA,2008:598-603.

相似文献/References:

[1]吉根林,凌霄汉,杨明.一种基于集成学习的分布式聚类算法[J].东南大学学报(自然科学版),2007,37(4):585.[doi:10.3969/j.issn.1001-0505.2007.04.008]
 Ji Genlin,Ling Xiaohan,Yang Ming.Distributed clustering algorithm based on ensemble learning[J].Journal of Southeast University (Natural Science Edition),2007,37(3):585.[doi:10.3969/j.issn.1001-0505.2007.04.008]

备注/Memo

备注/Memo:
作者简介:於跃成(1971—),男,博士生;王建东(联系人),男,教授,博士生导师,aics@nuaa.edu.cn.
基金项目:国家高技术研究发展计划(863计划)资助项目(2006AA12A106)、国家自然科学基金资助项目(60903130).
引文格式: 於跃成,王建东,郑关胜,等.基于约束信息的并行k-means算法[J].东南大学学报:自然科学版,2011,41(3):505-508.[doi:10.3969/j.issn.1001-0505.2011.03.014]
更新日期/Last Update: 2011-05-20