41

2011, No. 3

505-508

2011-05-20

Parallel k-means algorithm based on constrained information

(1. College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)
(2. College of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China)

10.3969/j.issn.1001-0505.2011.03.014

To obtain the desired clustering results on a distributed data set, a parallel k-means algorithm based on constrained information is presented. Because the parallel k-means algorithm can effectively cluster horizontally distributed data sets, its objective function is modified to design a constrained parallel k-means algorithm. The constraint information supplied by site users is introduced into the distributed clustering process in the form of chunklets, which guides the algorithm toward a biased search. In theory, the algorithm guarantees that the inter-cluster distance among the unconstrained samples is the smallest, and that the average distance between the constrained samples in a chunklet and the corresponding cluster center is the smallest. Experimental results show that the algorithm effectively improves the clustering precision of parallel k-means, and that the clustering results it obtains on the distributed data set are equivalent to those of the constrained k-means algorithm running on a centralized data set.
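
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`assign_site`, `constrained_parallel_kmeans`), the representation of a chunklet as a list of sample indices local to one site, the use of Euclidean distance, and the simple loop standing in for message passing between sites are all assumptions made for illustration. Each site assigns a whole chunklet to the center with the smallest average distance, assigns every unconstrained sample to its nearest center, and reports only per-cluster sums and counts, from which the global centers are recomputed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_site(points, chunklets, centers):
    """Local step on one site (hypothetical helper).

    points:    list of coordinate tuples held by this site
    chunklets: list of index lists; samples in one list must share a cluster
    Returns per-cluster coordinate sums, per-cluster counts, and labels.
    """
    k, d = len(centers), len(centers[0])
    labels = [None] * len(points)

    # Constrained samples: the whole chunklet goes to the center with the
    # smallest average distance to its members.
    for chunk in chunklets:
        best = min(
            range(k),
            key=lambda j: sum(dist(points[i], centers[j]) for i in chunk) / len(chunk),
        )
        for i in chunk:
            labels[i] = best

    # Unconstrained samples: ordinary nearest-center assignment.
    for i, p in enumerate(points):
        if labels[i] is None:
            labels[i] = min(range(k), key=lambda j: dist(p, centers[j]))

    # Sufficient statistics that a site would send to the coordinator.
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for a in range(d):
            sums[j][a] += p[a]
    return sums, counts, labels


def constrained_parallel_kmeans(sites, centers, iters=20):
    """sites: list of (points, chunklets) pairs, one pair per site."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        k, d = len(centers), len(centers[0])
        total = [[0.0] * d for _ in range(k)]
        n = [0] * k
        # In a real deployment each site would run this step in parallel.
        for points, chunklets in sites:
            sums, counts, _ = assign_site(points, chunklets, centers)
            for j in range(k):
                n[j] += counts[j]
                for a in range(d):
                    total[j][a] += sums[j][a]
        # Global update; an empty cluster keeps its previous center.
        new = [
            tuple(total[j][a] / n[j] for a in range(d)) if n[j] else centers[j]
            for j in range(k)
        ]
        if new == centers:
            break
        centers = new
    return centers
```

Because only sums and counts cross site boundaries, running this sketch on several sites gives the same centers as running it on a single site that holds all samples and all chunklets, which mirrors the equivalence reported in the abstract.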

