高性能数据挖掘技术及其应用.ppt

上传人:牧羊曲112 文档编号:6493753 上传时间:2023-11-06 格式:PPT 页数:71 大小:3.86MB
返回 下载 相关 举报
高性能数据挖掘技术及其应用.ppt_第1页
第1页 / 共71页
高性能数据挖掘技术及其应用.ppt_第2页
第2页 / 共71页
高性能数据挖掘技术及其应用.ppt_第3页
第3页 / 共71页
高性能数据挖掘技术及其应用.ppt_第4页
第4页 / 共71页
高性能数据挖掘技术及其应用.ppt_第5页
第5页 / 共71页
点击查看更多>>
资源描述

《高性能数据挖掘技术及其应用.ppt》由会员分享,可在线阅读,更多相关《高性能数据挖掘技术及其应用.ppt(71页珍藏版)》请在三一办公上搜索。

1、刘 莹 博士 副教授中国科学院研究生院信息科学与工程学院,高性能数据挖掘技术及其应用,2023/11/6,Ying Liu,2,简介,1999/07,北京大学,计算机科学与技术,学士2001/12,美国西北大学(Northwestern University),计算机工程,硕士2005/06,美国西北大学(Northwestern University),计算机工程,博士2005/06 2005/11,助理研究员,美国西北大学2006/01 今,副教授,中国科学院研究生院信息科学与工程学院,虚拟经济与数据科学研究中心,2023/11/6,Ying Liu,3,科研经历,美国国家航空航天局(NA

2、SA):Mass Storage Performance Information System美国能源部(DOE):Scientific Data Management Integrated Software Infrastructure CenterIntel公司:Characterizing Scalable Data Mining Kernels/Primitives on SMPs美国国家科学基金(NSF):High-Performance Techniques,Designs and Implementation of Software Infrastructure for Chan

3、ge Detection and Mining(IIS-0536994),2023/11/6,Ying Liu,4,科研经历,负责中国人民银行横向课题个人信用评分系统研究主持自然科学基金创新群体项目子课题海量数据的挖掘技术的研究主持自然科学基金重点项目子课题可信软件过程的基本属性和度量模型主持教育部留学归国人员启动基金基于传感器网络的交通数据流挖掘主持中科院研究生院院长基金基于效用的数据挖掘理论与技术的研究,2023/11/6,Ying Liu,5,科研成果,大规模科学模拟计算中的高性能数据挖掘天体物理模拟中的聚类算法HOP的并行方案适用于超大规模的科学模拟计算中,取得了非常好的加速比被美国圣

4、地亚哥超级计算中心(SDSC)使用可扩展的数据挖掘算法的性能评估可扩展的数据挖掘算法的性能评估发布了NU-Minebench,第一个数据挖掘算法的基准组(benchmark suite),被下载1666次(2005/06/15 今)被Intel公司使用,2023/11/6,Ying Liu,6,提纲,数据挖掘简介高性能(并行/分布式)数据挖掘应用实例介绍天体模拟(cosmological simulation)天文(astronomy)航天(space operation)生态系统(ecosystem)生物信息学(bioinformatics)总结,2023/11/6,Ying Liu,7,数

5、据挖掘,自动的、从”海量”数据中挖掘出隐藏的、潜在的、有价值的知识的技术挖掘的结果(知识)是用户感兴趣的,管理决策支持系统数据挖掘技术的特点海量数据从历史的数据中自动寻找高效可扩展性好模型更新快应用性强,2023/11/6,Ying Liu,8,数据挖掘的动机 商业角度,收集和存储的数据量太大 电子商务商业交易数据信用卡交易保险CPU的处理速度每年增长15%,不能满足数据量增长的需要提供更好的个性化服务,先进的客户关系管理手段等,数据爆炸,知识贫乏,2023/11/6,Ying Liu,9,数据挖掘的动机 科学计算角度,海量数据(GB/hour)遥感数据天文望远镜巡天基因表达微阵列(Micro

6、arrays)科学模拟帮助科学家对数据进行多种分析,如分类、分层等,2023/11/6,Ying Liu,10,数据挖掘的起源,交叉学科统计方法机器学习方法神经网络数据库并行计算传统方法的局限性在于海量数据高维数据异构数据复杂数据类型,2023/11/6,Ying Liu,11,流程,Data Cleaningand Integration,Databases,Data Warehouse,Knowledge,Selection and Transformation,Data Mining,Pattern Evaluation,Flat files,2023/11/6,Ying Liu,12,

7、数据挖掘的主要技术,聚类(clustering)异常点检测(anomaly detection)分类(classification)预测(prediction)关联规则(association rules mining)顺序模式(sequential pattern)时间序列(time-series),2023/11/6,Ying Liu,13,聚类,自动将数据分成若干簇,使得不同簇的数据项相似性最小,簇内数据项的相似性最大。(不依赖于预先定义好的类,不需要训练集)应用模式识别地理信息系统 图像处理生物基因序列分析天体模拟文档聚类,常用算法K-means,BIRCH,DBSACN,EM,202

8、3/11/6,Ying Liu,14,异常点检测,从数据集中找出与正常行为有显著差异的数据项应用信用卡欺诈医疗数据分析网络入侵检测常用算法聚类Statistical-based,Distance-based,Deviation-based,2023/11/6,Ying Liu,15,分类,根据从训练集数据(training data)中分析得来的数据各域与已知类别间的函数关系,预测一个新的数据记录的类别应用市场预测客户关系管理(CRM)营销策略信用评分常用算法决策树贝叶斯分类神经网络K-近邻,2023/11/6,Ying Liu,16,分类,class,2023/11/6,Ying Liu,1

9、7,预测,根据历史数据建立数学模型,预测新的记录的一个属性的值。回归(Regression)方法,线性、非线性曲线拟合常用算法线性回归Logistic回归,y,2023/11/6,Ying Liu,18,关联规则,从数据中找出频繁集(frequent itemsets),并且找出频繁集中数据项间的相互影响作用应用市场组合分析套装产品分析目录设计交叉销售常用算法AprioriDICFP-growth,A为“北京附近有冷涡”,B为“北京地区有降水”,A、B同时出现的概率较高(s=60%),P(B|A)高(c=75%),A 导致B,2023/11/6,Ying Liu,19,顺序模式,从与时间顺序有

10、关的数据中找出频繁的(frequent events),然后寻找出频繁集中数据项间的相互影响作用应用电信市场营销DNA序列分析常用算法GSPSPADE,买PC,买打印机,买墨盒,买新的CPU,Time,凡是购买了新电脑的顾客,9个月后很可能又要买新的CPU,营销手段:9个月后主动向用户推荐,以保持客户,2023/11/6,Ying Liu,20,时间序列,随时间变化的数值序列,分析序列的周期,不同序列的相似度,以及预测序列的趋势应用股票价格医疗诊断电力消耗交通流,time,price,2023/11/6,Ying Liu,21,Why High Performance Data Mining?

11、,Lots of data being collected in commercial and scientific world,massive data setsStrong competitive pressure to extract and use the information from the data,e.g.Climate simulationAstrophysicsMolecular biology,2023/11/6,Ying Liu,22,Why High Performance Data Mining?,Data and/or computational resourc

12、es needed for analysis are often distributedSometimes the choice is distributed data mining or no data miningOwnership,privacy,security issues,Accelerate the computation Use more memory from multiple machines,Solution:parallel computing!,2023/11/6,Ying Liu,23,Progress in HPC-past 6 decades,ENIACS 19

13、45 100 K Hz 5 K Additions/second 357 Multiplications/second,IBM Blue Gene/L,CPU power increasing by a factor of 30-100 every decade Multi-Giga Hz,multi-Gigabyte,multi-core CPUs are commodity Teraflops computers are common Petaflops scale computing within reach,Jaguar-Cray XT4/XT3-Oak Ridge National

14、Laboratory,EKA(HP Cluster Platform 3000BL)-Computational Research Laboratories,2023/11/6,Ying Liu,24,TOP 10 Machines 7/2008,2023/11/6,Ying Liu,25,2023/11/6,Ying Liu,26,Supercomputers in China,2004年6月,曙光超级服务器,每秒峰值运算速度万亿次,列全球第十,位于上海超级计算中心2008年6月,曙光5超级服务器,每秒峰值运算速度160万亿次,位于上海超级计算中心联想深腾6800网格超级计算机,265个四路

15、节点机,1060个处理器芯片,每秒峰值运算速度5万亿次,列2003年11月世界TOP500第14名,位于中科院网络信息中心,2023/11/6,Ying Liu,27,体系结构(Architectures),Shared Address SpaceAll processors share a single global address spaceSingle address space facilitates a simple programming modelExamples:SGI Origin 3000,IBM SP2,2023/11/6,Ying Liu,28,体系结构(Archite

16、ctures),Message passing platformEach processor has local memory with local address spaceOnly way to exchange data is using explicit message passingTime taken for message depends on the relative locations of the source and destination processorsPerformance of a parallel program determined by how well

17、 the location of data matches its useExample:clusters,IBM SP and SGI Origin 3000 support it,2023/11/6,Ying Liu,29,体系结构(Architectures),Clusters of 4-way SMPs,HybridMost popular,2023/11/6,Ying Liu,30,Parallel Programming,Construct or modify a series program for solving a given problem on a parallel ma

18、chineThe programmers responsibility to identify the ways to decompose the computation and extract concurrencyAn exact copy of the program on each processorComplex programming,2023/11/6,Ying Liu,31,Parallel Programming,Data parallelismPartition the data across processorsEach processor performs the sa

19、me operations on its local data partitioningTask parallelismAssign independent modules to different processorsEach processor performs different operations,2023/11/6,Ying Liu,32,提纲,数据挖掘简介高性能(并行/分布式)数据挖掘应用实例介绍天体模拟(cosmological simulation)航天(space operation)生态系统(ecosystem)生物信息学(bioinformatics)天文(astron

20、omy)总结,2023/11/6,Ying Liu,33,天体模拟(Cosmological Simulation),N-body simulation numerically approximates the evolution of the universeEach body represents a galaxy or a star,and bodies attract each other through the gravitational forceSimilar applicationsProtein foldingTurbulent fluid flow simulation,2

21、023/11/6,Ying Liu,34,2023/11/6,Ying Liu,35,HOP Clustering Algorithm,Difficult to discern which particles belong to the same group or cluster,computational intensiveHOP,density-based clustering algorithm by Daniel J.Eisenstein,Piet Hut,1998Automatically identify groups of particles in N-body simulati

22、onParticle attributes:mass,three-dimension coordinatesFour processing stages:Constructing a KD tree Generating density Hopping Grouping,2023/11/6,Ying Liu,36,Find the median particle on the longest axis Recursively bisect the particles along the longest axis Nearby particles are in the same sub-doma

23、in Each internal node contains boundary,Two-dimensional KD Tree,HOP Clustering Algorithm,2023/11/6,Ying Liu,37,Generating density Traverse the tree to find Ndens neighbors for every particle Assign an estimated density to every particle Hopping Associate each particle with its densest neighbor Each

24、particle“hops”to its densest neighbor till it reaches a particle that is its own densest neighbor Grouping Define particles associated to the same densest neighbor as a group Refine and merge groups,HOP Clustering Algorithm,2023/11/6,Ying Liu,38,Key idea Load balance Assign approximate equal number

25、of particles to each processor Minimize communication overheads Requests for potential required remote particles are packed into a single message,and the required particles are transferred to the requesting processors,HOP Clustering Parallelization Ying Liu,Wei-keng Liao,Alok Choudhary,Northwestern

26、University,USA,2023/11/6,Ying Liu,39,Assign approximate equal number of particles to each processor Find the median particle in parallel on the longest axis Bisect particles along the longest axis Exchange particles between bisected processors Build local KD tree Maintain a global tree on each proce

27、ssor with no real particle transfer,Construct Parallel KD Tree,HOP Parallelization,2023/11/6,Ying Liu,40,Generate Density,Intersection test Send out a single message to request the required remote particles Transfer the required particles Search for neighbors Calculate density,HOP Parallelization,20

28、23/11/6,Ying Liu,41,Hopping Hop to its highest density neighbor Book the required remote particles and send out requests Transfer the required particles to requesting processors Grouping Particles linked to the same densest particle are defined as a group Refine groups,HOP Parallelization,2023/11/6,

29、Ying Liu,42,Experiment,ENZOAn adaptive mesh refinement(AMR),grid-based hybrid code(hydro+N-Body),simulate the cosmological structure formationUse the algorithms of Berger&Collela to improve spatial and temporal resolution in regions of large gradients,such as gravitationally collapsing objectsSoftwa

30、re is flexible,can simulate a wide range of cosmological situationsParallelized using MPI and can run on any shared or distributed memory parallel supercomputer or clustersSimulations on 1024 processors have been carried out on the San Diego Supercomputing Centers Blue Horizon,an IBM SP,2023/11/6,Yi

31、ng Liu,43,Data set 1,Data set 2,Data Source,Each data set contains 491520 particles,2023/11/6,Ying Liu,44,Density generation is the most time consuming stage Data set 2 takes longer execution time,Total Execution Time,Data set 1,Data set 2,Performance Evaluation,2023/11/6,Ying Liu,45,The overall per

32、formance scales up on IBM SP2 and SGI Origin2000 when increasing number of processors It scales up to 32 processors on Linux Cluster,Speedups for Total Execution Time,Data set 1,Data set 2,Performance Evaluation,2023/11/6,Ying Liu,46,Generating density stage scales up on IBM SP2 and SGI Origin2000 I

33、t scales up to 32 processors on Linux Cluster,Speedups for Generating Density,Data set 1,Data set 2,Performance Evaluation,2023/11/6,Ying Liu,47,Data set 1,Data set 2,Communication time does not scale well Communication time increases when number of processors goes beyond 32,Communication Costs,Perf

34、ormance Evaluation,2023/11/6,Ying Liu,48,提纲,数据挖掘简介应用实例介绍天体模拟(cosmological simulation)天文(astronomy)航天(space operation)生态系统(ecosystem)生物信息学(bioinformatics)高性能(并行/分布式)数据挖掘总结,2023/11/6,Ying Liu,49,天文(Astronomy)University of Baltimore,USA,Predictive Mining of Time Series Data in Astronomy发觉相同天体或者不同天体间有趣的

35、周期性的行为或者巧合。应用这种周期性的行为来预测或者分析天体行为算法将每个望远镜收集的数据看成时间序列对时间序列用sliding window处理,得到子序列对这个数据的子序列使用聚类的算法进行分析,得到这个子序列中各种pattern用这些pattern来表示这段时间序列意义如果pattern A出现在时间序列1当中,那么在此后T时间之内有c%的几率,pattern B会出现在时间序列1,得到有意义的关联规则对不同的时间序列的pattern进行比较,2023/11/6,Ying Liu,50,天文(Astronomy),分析天体周期性行为的框架,2023/11/6,Ying Liu,51,天文

36、(Astronomy)Lawrence Livermore National Laboratory,USA,Mining the FIRST survey for galaxies with a bent-double morphologyFIRST:Faint Images of the Radio Sky at Twenty CentimetersRadio equivalent of the Palomar Observatory Sky Survey(POSS)10,000 square degrees survey of the North Galactic CapUsing the

37、 NRAO Very Large Array(VLA),B configuration,2023/11/6,Ying Liu,52,天文(Astronomy),The FIRST data1.8 pixels,resolution 5,rms0.15mJy90 radio sources per square-degree at 1mJy thresholdThe morphological type of a radio source provides clues to their emission mechanism,source class,and the properties of t

38、he surrounding mediumThe raw data from the telescopes is extensively processedImages maps and catalog available(),2023/11/6,Ying Liu,53,天文(Astronomy),Use data mining to find“bent-doubles”in FIRSTFIRST astronomers interested in“bent-doubles”indicates presence of clusters of galaxies first“identify”us

39、ing a visual technique followed by optical observations and checks with other surveysVisual identification is no longer feasible subjective,tedious,likely to miss cases.900,000 galaxies in the full surveyGoal:replace the visual identification of bent doubles by a semi-automated one,2023/11/6,Ying Li

40、u,54,天文(Astronomy),Detecting bent-double galaxies in 250GB image data,78MB catalog data(as of 7/2000),2023/11/6,Ying Liu,55,天文(Astronomy),MethodologyGroup the catalog entries into a“galaxy”Separate sources based on number of catalog entries 1-entry sources unlikely to be bent-doubles 3-entry sources

41、 all“interesting”study the 2-and 3-entry sources separatelyresults in splitting a small training set(313-118+195),2023/11/6,Ying Liu,56,天文(Astronomy),Calculate features for a galaxy(103 features)Use the features to train a decision treeUse the tree to classify the unlabeled galaxies and validate the

42、 resultsUse validated results to enhance training set,2023/11/6,Ying Liu,57,天文(Astronomy),Results using a single tree for 3-entry sources were satisfactory Labeled training set:167 bents,28 non-bents Performed several inner iterations using pruned trees(c5.0 decision tree software)Ten 10-fold cross-

43、validation errors:mean(SE)using all the features:9.7%(0.3%)using triple features only:10.7%(0.3%)Discriminating features include geometrically calculated angles,relative distances,ellipticity and symmetry measures,2023/11/6,Ying Liu,58,New Trends GPUs+CUDA,GPU(Graphic Processing Unit),图形处理器,专用处理器,CP

44、U和GPU每秒浮点运算数,2023/11/6,Ying Liu,59,New Trends GPUs+CUDA,GPU与CPU结构区别更多的晶体管高内存带宽驱动的多核GPU优势成本低(几百美元)多线程(几百个线程)处理计算密集型数据的效率远高于CPUGPU缺点编程难,2023/11/6,Ying Liu,60,New Trends GPUs+CUDA,CUDA一个基于业界标准的C语言的编程环境,用于开发GPU的计算应用程序GPU并行执行非常多线程CPU把计算密集的、并行度高的部分卸载给GPU易编程,软件层次结构,2023/11/6,Ying Liu,61,New Trends GPUs+CUD

45、A,原来只能由workstation完成的工作,可以由PC完成超级计算一次革命性的进步成功例子斯坦福大学利用CUDA开发了在GPU上运行的foldinghome,最高运行速度比CPU快140倍。Foldinghome进行蛋白质折叠模拟,找出蛋白质误折叠的后果。Elemental Technologies利用CUDA开发了在使用基于GPU的Badaboom软件后,视频编码的转换过程最高比传统方法快了18倍。有了CUDA的帮助,地理信息系统中,从前需要20分钟才能完成的运算现在只需30秒即可完成,而从前需要30到40秒钟完成的运算现在能够实现实时运算。CUDA技术是自微处理器发明以来计算行业内所诞

46、生的最具革命性的技术。”M,2023/11/6,Ying Liu,62,New Trends GPUs+CUDA,伊利诺伊大学(UIUC)利用GPU进行并行分子动力学研究,用于分析大型生物分子系统。“未来计算性能的加强将直接来自多核GPU(图形处理器)大规模并行硬件。目前的最大挑战是将代码实现并行化,以便更好地利用相关的硬件,而CUDA取得了突破性的进展,推进了这一领域的发展。”胡文美教授 金融分析、天体物理学、地震成像等各个领域的开发人员正在受益于CUDA的开发工具。“凭借CUDA,我们很容易地就可利用GPU的处理能力,减少时间和资金的投入。一台主机系统配备两块Tesla D870的成本要比

47、组建16核集群低很多。”Technician“Volera只用了12个GPU(图形处理器)就能实时分析美国整个期权市场,延迟时间不超过10微秒。而达到这样的速度则通常会至少需要60个传统的1U服务器。通过使用GPU,我们的客户可以用更小的维护成本、更低的电能消耗以及更小的占地空间实现更好的效益。”Hanweck Associates,2023/11/6,Ying Liu,63,Data Mining on GPU,University of VirginiaUniversity of Illinois at Urbana-Champaign University of California a

48、t Davis中国科学院研究生院,2023/11/6,Ying Liu,64,High Performance Scientific Data Mining Projects,Hillol Kargupta,University of Maryland,USADistributed Data Mining for Scalable Analysis of Data from Virtual Observatories,NASA,2007-2010Astronomers are unable to tap the riches of this collection of gigabyte,ter

49、abyte,and(eventually)petabyte catalogs without a computational backbone that includes support for queries and data mining across distributed virtual tables of decentralized,joined,and integrated sky survey catalogs.(1)Design and implement distributed algorithms for computing statistical primitives,p

50、rincipal component analysis,and outlier detection from distributed astronomy catalogs and their partial images stored in users local data management systems.(2)Develop a prototype system which will offer a rich collection of web-services based on various DDM algorithms.(3)Search for unusual correlat

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 生活休闲 > 在线阅读


备案号:宁ICP备20000045号-2

经营许可证:宁B2-20210002

宁公网安备 64010402000987号