Artificial Intelligence and Data Mining: teaching slides, lect-5-13 (34 slides)
Chapter 3 Basic Data Mining Techniques
3.3 The K-Means Algorithm (for cluster analysis)
5/12/2023, AI&DM BUPT

1. What is Cluster Analysis (clustering)?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- High-quality clusters:
  - high intra-class similarity
  - low inter-class similarity
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised learning (unsupervised classification): there are no predefined classes. It is a form of learning by observation, rather than learning by examples.
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Examples of Clustering Applications (I)
Cluster analysis in customer segmentation
- When consuming the same kind of goods or services, different customers show different consumption characteristics. By studying these characteristics, a company can design different marketing mixes and thereby capture the maximum consumer surplus. This is the main purpose of customer segmentation.
- Commonly used customer classification methods fall into three groups:
  - Experience-based description: decision makers divide customers into classes according to their experience.
  - Traditional statistical methods: customer classes are derived from simple statistics over customer attributes.
  - Non-traditional statistical methods: clustering, an approach based on artificial intelligence techniques.

Examples of Clustering Applications (II)
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Insurance: identify groups of motor insurance policy holders with a high average claim cost
- City planning: identify groups of houses according to their house types, values, and geographical locations

Example
[Figure slides; the images are not preserved in the extracted text.]

2. The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.

The K-Means Clustering Algorithm: Example
[Figure slides; the images are not preserved in the extracted text.]
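
The five steps of the K-Means algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's own implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the `seed` and `max_iter` parameters, the choice of squared Euclidean distance, and the rule that an empty cluster keeps its previous center are all introduced here for the example. For simplicity every instance is assigned in step 3, including the instances chosen as initial centers.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, clusters, total squared error).

    Step 1 (choosing K) is left to the caller through the argument k.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: K random instances as centers
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Step 4: recompute each center (assumption: an empty cluster
        # keeps its previous center).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    # Squared error of the last assignment with respect to the centers.
    error = sum(squared_distance(p, centers[i])
                for i, c in enumerate(clusters) for p in c)
    return centers, clusters, error
```

A call such as `k_means([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], 2)` returns the two centers, the instances grouped by cluster, and the total squared error of the result. The returned error gives one number by which runs with different values of `seed` can be compared.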

Problem: We may see a different final cluster configuration for each alternative choice of the initial centers.
Solution: Try different initial centers, but set a maximum acceptable squared error as the criterion for accepting a result.

3. General Considerations
- Requires real-valued data.
- The number of clusters present in the data must be selected by the user.
- Works best when the clusters in the data are of approximately equal size.
- Attribute significance cannot be determined.
- Lacks explanation capabilities.

4. Types of data in clustering analysis
- 4.1 Interval-scaled variables
- 4.2 Binary variables
- 4.3 Nominal variables
- 4.4 Ordinal variables
- 4.5 Variables of mixed types
Distance is normally used to measure the similarity or dissimilarity between two data objects.

4.1 Interval-scaled variables
Some popular measures include:
- Minkowski distance:
  $d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\right)^{1/q}$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance:
  $d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$
- If q = 2, d is the Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$
Requirements for a distance function:
- $d(i,j) \ge 0$
- $d(i,i) = 0$
- $d(i,j) = d(j,i)$
- $d(i,j) \le d(i,k) + d(k,j)$

4.1 Interval-scaled variables (cont.)
Standardize the data:
- Find the mean of each variable f:
  $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
- Calculate the mean absolute deviation:
  $s_f = \frac{1}{n}\left(|x_{1f}-m_f| + |x_{2f}-m_f| + \cdots + |x_{nf}-m_f|\right)$
- Calculate the standardized measurement (z-score):
  $z_{if} = \frac{x_{if}-m_f}{s_f}$

4.2 Binary Variables
A contingency table for binary data:

                    Object j
                    1      0      sum
    Object i   1    a      b      a+b
               0    c      d      c+d
             sum   a+c    b+d      p

- Simple matching coefficient (if the binary variable is symmetric):
  $d(i,j) = \frac{b+c}{a+b+c+d}$
- Jaccard coefficient (if the binary variable is asymmetric):
  $d(i,j) = \frac{b+c}{a+b+c}$

4.2 Binary Variables (cont.)
Example
- gender is a symmetric attribute
- the remaining attributes are asymmetric binary attributes
- let the values Y and P be set to 1, and the value N be set to 0

4.3 Nominal Variables
- A nominal variable can be treated as a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
- Method: simple matching (symmetric)
  $d(i,j) = \frac{p-m}{p}$
  where m is the number of matches and p is the total number of variables.

4.4 Ordinal Variables
- Variables for which order is important, e.g., rank.
- Can be treated like interval-scaled variables:
  - replace each $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if}-1}{M_f-1}$
  - compute the dissimilarity using the methods for interval-scaled variables
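
The measures of sections 4.1 to 4.4 can be sketched in Python as follows. This is an illustrative sketch only: all function names are hypothetical, binary objects are assumed to be encoded as lists of 0/1 values, ordinal values are assumed to be given as ranks 1..M, and the guards against zero denominators are left out for brevity.

```python
def minkowski(x, y, q=2):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def standardize(values):
    """z-scores of one variable, using the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(v - m) for v in values) / n  # mean absolute deviation s_f
    return [(v - m) / s for v in values]     # assumes s > 0


def binary_dissimilarity(x, y, symmetric=True):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    # Negative matches (d) count only for symmetric variables.
    denominator = a + b + c + (d if symmetric else 0)
    return (b + c) / denominator             # assumes denominator > 0


def nominal_dissimilarity(x, y):
    """(p - m) / p, with m matches among p nominal variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def ordinal_to_interval(rank, num_states):
    """Map a rank in 1..num_states onto [0, 1]; assumes num_states > 1."""
    return (rank - 1) / (num_states - 1)
```

For instance, `minkowski((0, 0), (3, 4), q=1)` is the Manhattan distance between the two points and `minkowski((0, 0), (3, 4), q=2)` the Euclidean one, while `binary_dissimilarity(x, y, symmetric=False)` ignores the positions where both objects are 0.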

4.5 Variables of Mixed Types
- A database may contain all types of variables: symmetric binary, asymmetric binary, nominal, ordinal, and interval-scaled.
- One may use a weighted formula to combine their effects. The contribution $d_{ij}^{(f)}$ of variable f depends on its type:
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise
  - f is interval-scaled: use the normalized distance
  - f is ordinal: compute the ranks $r_{if}$ and $z_{if}$, and treat $z_{if}$ as interval-scaled

4.5 Variables of Mixed Types (cont.)
- The weighted formula that combines the effects is
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- The weight $\delta_{ij}^{(f)}$ is 0 if
  - $x_{if}$ or $x_{jf}$ is missing, or
  - $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary;
  otherwise $\delta_{ij}^{(f)} = 1$.

5. More about clustering algorithms: K-means & K-medoids
- Partitioning method: construct a partition of n objects into a set of k clusters
- Similarity function: usually a distance
- k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
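
As one concrete candidate for the distance used by such partitioning methods, the weighted combination for mixed-type variables from section 4.5 can be sketched as follows. The sketch rests on several assumptions that are not in the slides: the function name `mixed_dissimilarity`, the string labels for variable types, `None` as the marker for a missing value, dividing an interval-scaled difference by the variable's range as the "normalized distance", and ordinal values supplied as ranks 1..M.

```python
def mixed_dissimilarity(x, y, kinds, ranges=None, levels=None):
    """Weighted dissimilarity between two objects with mixed variable types.

    kinds[f]  : "symmetric", "asymmetric", "nominal", "interval" or "ordinal"
    ranges[f] : (min, max) of interval variable f (assumed known in advance)
    levels[f] : number of ordered states of ordinal variable f
    """
    ranges = ranges or {}
    levels = levels or {}
    total, weight = 0.0, 0
    for f, (a, b, kind) in enumerate(zip(x, y, kinds)):
        # Weight 0: one of the two values is missing.
        if a is None or b is None:
            continue
        # Weight 0: both values are 0 on an asymmetric binary variable.
        if kind == "asymmetric" and a == 0 and b == 0:
            continue
        if kind in ("symmetric", "asymmetric", "nominal"):
            d = 0.0 if a == b else 1.0
        elif kind == "interval":
            low, high = ranges[f]
            d = abs(a - b) / (high - low)   # assumes high > low
        elif kind == "ordinal":
            m = levels[f]                   # assumes m > 1
            d = abs((a - 1) / (m - 1) - (b - 1) / (m - 1))
        else:
            raise ValueError(f"unknown variable type: {kind}")
        total += d
        weight += 1
    if weight == 0:
        raise ValueError("no variable is comparable for these two objects")
    return total / weight
```

With `kinds = ("symmetric", "asymmetric", "nominal", "interval", "ordinal")`, two objects that are both 0 on the asymmetric variable are compared on the remaining four variables only, so the sum of the four contributions is divided by 4 rather than 5.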

Comments on the K-Means Method
- Strength
  - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- Weakness
  - Applicable only when a mean is defined, which leaves open how to handle categorical data
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM (Partitioning Around Medoids)
PAM (Kaufman and Rousseeuw, 1987) uses real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih.
3. For each pair of i and h: if TCih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
PAM works effectively for small data sets, but does not scale well for large data sets.
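
The PAM steps above can be sketched as follows. This is a simplified illustration rather than the published algorithm in full: it applies the first swap that lowers the total cost instead of searching for the best swap, it takes the first k objects as the arbitrary initial medoids, and it recomputes the total cost from scratch for every candidate swap. The names `pam` and `total_cost` and the caller-supplied `dist` function are assumptions of this sketch, and objects are assumed to be hashable (for example tuples).

```python
def total_cost(points, medoids, dist):
    """Sum of distances from each object to its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist):
    """Return (medoids, clusters) found by a greedy PAM-style search."""
    medoids = list(points[:k])  # step 1: arbitrary initial medoids
    improved = True
    while improved:             # step 4: repeat until no swap helps
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                # Step 2: TC_ih, the change in total cost if medoid i
                # is replaced by the non-selected object h.
                tc = (total_cost(points, candidate, dist)
                      - total_cost(points, medoids, dist))
                # Step 3: accept the swap only when it lowers the cost.
                if tc < 0:
                    medoids = candidate
                    improved = True
    # Assign every object to its most similar (closest) medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: dist(p, m))
        clusters[nearest].append(p)
    return medoids, clusters
```

Because every accepted swap strictly lowers the total cost and there are finitely many medoid sets, the loop terminates. Evaluating each candidate costs O(kn) distance computations here, which illustrates why the method does not scale well to large data sets.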

6. Agglomerative Clustering (10.4)
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.

Agglomerative Clustering: An Example
[Figure slides; the images are not preserved in the extracted text.]
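
The agglomerative procedure above can be sketched as follows. The slides do not say how the similarity of two clusters is measured, so this sketch assumes single linkage (the distance between the two closest members); the function name `agglomerative` and the caller-supplied `dist` function are likewise assumptions.

```python
def agglomerative(points, dist):
    """Return the list of clusterings produced by successive merges.

    history[0] has one cluster per instance; history[-1] has a single
    cluster containing every instance.
    """
    clusters = [[p] for p in points]         # step 1: one cluster each
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:                 # step 2
        best = None
        # Step 2a: find the two most similar clusters (single linkage).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Step 2b: merge the chosen pair into a single cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history
```

Step 3 then amounts to picking one entry of `history`, for example the one with the desired number of clusters: for n instances, `history[n - k]` holds the clustering with k clusters.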

Summary: Requirements of Clustering Algorithms
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Challenges and Further Research
- Considerable progress has been made in scalable clustering methods:
  - Partitioning: k-means, k-medoids, CLARANS
  - Hierarchical: BIRCH, CURE
  - Density-based: DBSCAN, CLIQUE, OPTICS, DENCLUE
  - Grid-based: STING, WaveCluster
  - Model-based: AutoClass, COBWEB
- Current clustering techniques do not address all the requirements adequately

Homework
1. Perform the third iteration of the k-means algorithm for the example given in the section “An Example Using K-Means”. What are the new cluster centers?
2. Suppose that the data mining task is to cluster the following 8 points (with (x, y) representing location) into 3 clusters:
   A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)
   The distance function is Manhattan distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only:
   (a) the three cluster centers after the first round of execution;
   (b) the final three clusters.