Data Mining: Cluster Analysis (lecture slides, 44 slides)
Clustering

Overview
- Partitioning Methods: K-Means, Sequential Leader
- Model Based Methods
- Density Based Methods
- Hierarchical Methods

What is cluster analysis?
- Finding groups of objects:
  - Objects similar to each other are in the same group.
  - Objects are different from those in other groups.
- Unsupervised learning: no labels, data driven.

Clusters
(figures: inter-cluster and intra-cluster distances)

Applications of Clustering
- Marketing: finding groups of customers with similar behaviours.
- Biology: finding groups of animals or plants with similar features.
- Bioinformatics: clustering microarray data, genes and sequences.
- Earthquake Studies: clustering observed earthquake epicenters to identify dangerous zones.
- WWW: clustering weblog data to discover groups of similar access patterns.
- Social Networks: discovering groups of individuals with close friendships internally.

Earthquakes (figure)

Image Segmentation (figure)

The Big Picture (figure)
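The inter-/intra-cluster distinction above can be made concrete with a small sketch (hypothetical 2-D points and Euclidean distance, not from the slides): a good clustering has small average distances within a cluster and large average distances between clusters.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

# Two hypothetical, well-separated clusters of 2-D points.
cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
cluster_b = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

def mean_pairwise(points):
    """Average intra-cluster distance: mean over all pairs within one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_between(ps, qs):
    """Average inter-cluster distance: mean over all cross-cluster pairs."""
    return sum(dist(p, q) for p in ps for q in qs) / (len(ps) * len(qs))

intra = mean_pairwise(cluster_a)
inter = mean_between(cluster_a, cluster_b)
assert inter > intra  # large inter-, small intra-cluster distances
```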
Requirements
- Scalability.
- Ability to deal with different types of attributes.
- Ability to discover clusters with arbitrary shape.
- Minimal requirements for domain knowledge.
- Ability to deal with noise and outliers.
- Insensitivity to the order of input records.
- Incorporation of user-defined constraints.
- Interpretability and usability.

Practical Considerations
- Scaling matters! Normalization or not? (figures)

Evaluation
(figures: comparing alternative clusterings)

Silhouette
- A method for the interpretation and validation of clusters of data.
- A succinct graphical representation of how well each data point lies within its cluster compared to other clusters.
- a(i): the average dissimilarity of point i to all other points in the same cluster.
- b(i): the lowest average dissimilarity of point i to the points of any other cluster.
- s(i) = (b(i) - a(i)) / max(a(i), b(i)); s(i) close to 1 means i lies well within its cluster.

K-Means
1. Determine the value of K.
2. Choose K cluster centres randomly.
3. Assign each data point to its closest centroid.
4. Use the mean of each cluster to update its centroid.
5. Repeat until no assignment changes.
6. Return the K centroids.

Reference: J. MacQueen (1967) "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297.

Comments on K-Means
Pros:
- Simple, and works well for regular disjoint clusters.
- Converges relatively fast.
- Relatively efficient and scalable: O(tkn), where t is the number of iterations, k the number of centroids and n the number of data points.
Cons:
- The value of K must be specified in advance, which can be difficult; domain knowledge may help.
- May converge to local optima; in practice, try different initial centroids.
- May be sensitive to noisy data and outliers, since each centroid is a mean of data points.
- Not suitable for clusters of non-convex shapes.

The Influence of Initial Centroids (figures)

Sequential Leader Clustering
- A very efficient clustering algorithm: no iteration, a single pass of the data.
- No need to specify K in advance.
- Choose a cluster threshold value.
- For every new data point:
  - Compute the distance between the new data point and every cluster's centre.
  - If the minimum distance is smaller than the chosen threshold, assign the new data point to the corresponding cluster and re-compute that cluster's centre.
  - Otherwise, create a new cluster with the new data point as its centre.
- Clustering results may be influenced by the order of the data points.

Gaussian Mixture (figures)

Clustering by Mixture Models (figures)

K-Means Revisited
- Model parameters: the component means and covariances, θ = ((μ1, Σ1), (μ2, Σ2)).
- Latent parameters: the cluster assignments z = (z1, z2, ...), with zi ∈ {1, 2}.

Expectation Maximization (derivation slides)

EM: Gaussian Mixture
- E-step: compute posterior responsibilities via Bayes' rule, P(z = k | x) ∝ P(x | z = k) P(z = k).

Density Based Methods
- Generate clusters of arbitrary shapes.
- Robust against noise.
- No K value required in advance.
- Somewhat similar to human vision.
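The K-Means steps listed above can be sketched in a few lines. This is a minimal pure-Python illustration with made-up 2-D points (not the slides' own code); it follows the slide's loop exactly: random centres, assign, update means, stop when nothing moves.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of 2-D points (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: K random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 3: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # step 4: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 5: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups; K-Means should find their means.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(points, k=2)
```

In practice the result depends on the random initial centres (as the "Influence of Initial Centroids" slides show), so one usually runs the algorithm several times with different seeds and keeps the best clustering.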
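The single-pass sequential leader procedure can be sketched the same way (hypothetical threshold and points; assumes Euclidean distance and mean-of-members centres):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def leader_cluster(points, threshold):
    """One pass over the data: returns cluster centres and their members."""
    centres, members = [], []
    for p in points:
        if centres:
            i = min(range(len(centres)), key=lambda j: dist(p, centres[j]))
            if dist(p, centres[i]) < threshold:
                # assign to the nearest cluster and re-compute its centre
                members[i].append(p)
                centres[i] = tuple(sum(c) / len(c) for c in zip(*members[i]))
                continue
        # otherwise start a new cluster with p as its centre
        centres.append(p)
        members.append([p])
    return centres, members

centres, members = leader_cluster(
    [(0, 0), (0.5, 0), (10, 10), (10.5, 10)], threshold=2.0)
```

Feeding the same points in a different order can produce different clusters, which is exactly the order sensitivity the slide warns about.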
DBSCAN
- Density-Based Spatial Clustering of Applications with Noise.
- Density: the number of points within a specified radius.
- Core point: a point with high density.
- Border point: a point with low density, but in the neighbourhood of a core point.
- Noise point: neither a core point nor a border point.
(figures: core, border and noise points; directly density reachable, density reachable, density connected)
- A cluster is defined as a maximal set of density-connected points.
- Start from a randomly selected unvisited point P.
- If P is a core point, build a cluster by gradually adding all points that are density reachable to the current point set.
- Noise points are discarded (left unlabelled).

Hierarchical Clustering
- Produces a set of nested, tree-like clusters.
- Can be visualized as a dendrogram.
- A clustering is obtained by cutting the dendrogram at the desired level.
- No need to specify K in advance.
- May correspond to meaningful taxonomies.

Agglomerative Methods
- Bottom-up method:
  1. Assign each data point to its own cluster.
  2. Calculate the proximity matrix.
  3. Merge the pair of closest clusters.
  4. Repeat until only a single cluster remains.
- How to calculate the distance between clusters?
  - Single link: minimum distance between points.
  - Complete link: maximum distance between points.

Examples (figures: single-link dendrogram; MIN vs. MAX merge orders on points A-E)

Reading Materials
Text books:
- R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Chapter 10, 2nd Edition, John Wiley & Sons.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann.
Survey papers:
- A. K. Jain, M. N. Murty and P. J. Flynn (1999) "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), pp. 264-323.
- R. Xu and D. Wunsch (2005) "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16(3), pp. 645-678.
- A. K. Jain (2010) "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, pp. 651-666.
Online tutorials:
- http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
- http://www.autonlab.org/tutorials/kmeans.html
- http://users.informatik.uni-halle.de/hinnebur/ClusterTutorial
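Returning to DBSCAN: the procedure described earlier can be sketched as follows. This is a minimal illustration with a naive O(n²) neighbourhood search; the eps and min_pts values and the sample points are made up.

```python
from math import dist  # Euclidean distance (Python 3.8+)

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        # all points within radius eps of points[i] (includes i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:          # not a core point (may become a border later)
            labels[i] = NOISE
            continue
        labels[i] = cluster              # start a new cluster from this core point
        seeds = list(nbrs)
        while seeds:                     # grow by density reachability
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # noise next to a core point -> border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # j is itself a core point: expand further
                seeds.extend(j_nbrs)
            # else: j is a border point; do not expand from it
        cluster += 1
    return labels

# Four dense points form one cluster; the far-away point is labelled noise.
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)], eps=1.5, min_pts=3)
```

Unlike K-Means, no K is supplied: the number of clusters falls out of the density threshold (eps, min_pts), and arbitrary cluster shapes are possible.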