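Abstract
Clustering is an important branch of data mining, used to discover previously unknown information contained in data. Research on clustering algorithms has a long history, and in recent decades the importance of clustering and its interdisciplinary ties to other fields of scientific research have been widely recognized. With the rapid development of cluster analysis techniques and the steady expansion of their applications, cluster analysis has increasingly become a prominent research topic in data mining.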
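The first part of this thesis briefly introduces the definition of clustering and the main problems it studies. Clustering partitions data into a number of clusters such that entities within one cluster are similar to each other, while entities in different clusters are dissimilar. At present, clustering is applied mainly in image processing, pattern recognition, customer information analysis, financial analysis, medicine, and many other fields.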
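The second part briefly introduces five typical families of clustering methods: partition-based, hierarchy-based, density-based, grid-based, and model-based. Each family has its own advantages and disadvantages, so a suitable method must be chosen according to the specific requirements of the data being analyzed.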
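The third part focuses on the K-means clustering algorithm, a typical partition-based clustering algorithm. It clusters the data by repeated iteration, stops once the convergence condition is met, and then outputs the clustering result. Because the idea is simple and easy to carry out, K-means has become one of the most commonly used clustering algorithms. This thesis also presents applications of K-means to two-dimensional data clustering and to document clustering. K-means nevertheless has shortcomings: it is very sensitive to the choice of initial cluster centers and easily ends in a local optimum; the number of clusters k usually has to be given by the user in advance; it is sensitive to noise and isolated points; and it is not well suited to clustering large volumes of data. In practice, therefore, K-means often has to be combined with other clustering algorithms.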
The final part summarizes the preceding discussion and gives an outlook on future work.
Keywords: data mining; clustering algorithm; K-means; partitioning
ABSTRACT
Clustering is an important branch of data mining, used to discover unknown information contained in data. Research on cluster analysis has a long history, and in recent decades the importance of clustering and its interdisciplinary ties to other areas of scientific research have been widely recognized. With the rapid development of cluster analysis techniques and the expansion of their applications, cluster analysis has become a prominent research topic in data mining.
The first part of this paper introduces the definition of clustering and the main problems it addresses. Clustering divides data into a number of clusters such that entities within a cluster are similar to each other, while entities in different clusters are dissimilar. At present, clustering is used mainly in image processing, pattern recognition, customer information analysis, financial analysis, medicine, and many other fields.
The second part introduces five typical families of clustering methods: partition-based, hierarchy-based, density-based, grid-based, and model-based. Each has advantages and disadvantages, so a suitable clustering method must be chosen according to the specific requirements of the data at hand.
The third part focuses on the K-means clustering algorithm, a typical partition-based clustering algorithm. It clusters the data through repeated iteration, stops when the convergence condition is satisfied, and outputs the clustering result. Because the method is simple and easy to apply, it has become one of the most commonly used clustering algorithms. This paper also presents applications of K-means to two-dimensional data clustering and to document clustering. K-means has several disadvantages, however: it is very sensitive to the choice of initial cluster centers and easily converges to a local optimum; the number of clusters k usually has to be specified by the user in advance; it is sensitive to noise and isolated points; and it is not well suited to clustering large data sets. In practice, therefore, K-means often needs to be combined with other clustering algorithms.
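The iterative procedure summarized above can be sketched in a few lines. This is only a minimal illustration of the standard K-means iteration, not code taken from the thesis. The function name `kmeans`, the parameters `init`, `max_iter`, and `seed`, the random choice of initial centers, and the stopping rule (stop when no assignment changes) are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Minimal K-means sketch (hypothetical helper, not from the thesis).

    points: list of equal-length tuples of floats
    k:      number of clusters, supplied by the caller in advance
    init:   optional list of k starting centers
    Returns (centers, labels).
    """
    if init is not None:
        centers = [tuple(c) for c in init]
    else:
        # Assumption: pick k distinct input points at random as starting centers.
        centers = [tuple(p) for p in random.Random(seed).sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Assumed convergence test: stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    centers, labels = kmeans(data, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
    print(centers, labels)
```

Passing different `init` values (or different `seed` values) is a simple way to observe the sensitivity to initial centers mentioned above: the result depends on where the iteration starts, and `k` must be chosen before the algorithm runs.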