An Introduction of Big Data[大数据的介绍](PPT34).ppt

上传人:文库蛋蛋多 文档编号:2366607 上传时间:2023-02-15 格式:PPT 页数:34 大小:2.66MB
返回 下载 相关 举报
An Introduction of Big Data[大数据的介绍](PPT34).ppt_第1页
第1页 / 共34页
An Introduction of Big Data[大数据的介绍](PPT34).ppt_第2页
第2页 / 共34页
An Introduction of Big Data[大数据的介绍](PPT34).ppt_第3页
第3页 / 共34页
An Introduction of Big Data[大数据的介绍](PPT34).ppt_第4页
第4页 / 共34页
An Introduction of Big Data[大数据的介绍](PPT34).ppt_第5页
第5页 / 共34页
点击查看更多>>
资源描述

《An Introduction of Big Data[大数据的介绍](PPT34).ppt》由会员分享,可在线阅读,更多相关《An Introduction of Big Data[大数据的介绍](PPT34).ppt(34页珍藏版)》请在三一办公上搜索。

1、An Introduction of Big Data,WEB GROUP2011.9.24,1,2,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,3,Information Explosion57%every year(IDC)Double every 1.5 years988EB(1EB=1024PB)data will be produ

2、ced in 2010(IDC)18 million times of all info in books IT850 million photos&8 million videos/day(Facebook)50PB web pages,500PB log(Baidu)Telco(Log,multimedia data)Enterprise Storage Public UtilitiesHealth Care(medical images-photos)Public Traffic(surveillance-videos),What is Big Data,4,DefinitionBig

3、data is the confluence of the three trends consisting of Big Transaction Data,Big Interaction and Big Data ProcessingQuestions?Big Data=Large-Scale Data(Massive Data),What is Big Data,Structural and Semi-Structural Transaction Data,.Unstructured dataInteraction Data,5,The properties of Big DataHugeD

4、istributedDispersed over many serversDynamicItems add/deleted/modified continuouslyHeterogeneousMany agents access/update dataNoisyInherentUnintentionalMaliciousUnstructured/semi-structuredNo database schemaComplex interrelationships,What is Big Data,6,Outline,What is Big DataThe Framework of Big Da

5、taThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,7,The Framework of Big Data,8,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,9,The Applicati

6、ons of Big Data,Celestial bodyExobiology,Inheritance Sequence of cancer,AdvertisementFinding communities,SNAFinding communities,Data MiningConsuming habit,Changing router,10,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related

7、with Big DataConclusions,11,Efficiency requirements for AlgorithmTraditionally,“efficient”algorithmsRun in(small)polynomial time:O(nlogn)Use linear space:O(n)For large data sets,efficient algorithmsMust run in linear or even sub-linear time:o(n)Must use up to poly-logarithmic space:(logn)2Mining Big

8、 DataAssociation Rule and Frequent PatternsTwo parameters:support,confidenceClusteringDistance measure(L1,L2,L,Edit Distance,etc,.)Graph structureSocial Networks,Degree distribution(heavy trail),The Challenges of Big Data,12,Clean Big DataNoise in data distorts Computation resultsSearch resultsNeed

9、automatic methods for“cleaning”the dataDuplicate eliminationQuality evaluationComputing ModelAccuracy and ApproximationEfficiency,The Challenges of Big Data,13,Abstract Model of Computing,Computing Model of Big Data,13,Approximation of,Data,(n is very large),Approximation of f(x)is sufficient Progra

10、m can be randomized,Computer Program,Examples,Mean,Parity,Random Sampling,Computing Model of Big Data,14,Query a few data items,Data,Examples,MeanO(1)queries,Parityn queries,Approximation of,(n is very large),Computer Program,15,AdvantagesUltra-efficientSub-linear running time&space(could even be in

11、dependent of data set size)DisadvantagesMay require random accessDoesnt fit many problems,Random Sampling,Data Streams,Computing Model of Big Data,16,Data,Computer Program,Stream through the data;Use limited memory,Examples,MeanO(1)memory,Parity1 bit of memory,Approximation of,(n is very large),17,A

12、dvantagesSequential accessLimited memoryDisadvantagesRunning time is at least linearToo restricted for some problems,Random Sampling,Sketching,Computing Model of Big Data,18,Data1,Data2,Data1,Data2,Sketch2,Sketch1,Compress eachdata segment intoa small“sketch”Compute overthe sketches,Examples,Equalit

13、yO(1)size sketch,Hamming distanceO(1)size sketch,Lp distance(p 2)(n1-2/p)size sketch,Approximation of,(n is very large),19,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataResearch works related with Big DataConclusions,20,Finding Maximal Cliques

14、 in Massive Networks by H*-Graph(Sigmod 2010)Large-Scale Collective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011),Research works related with Big Data,21,Massive graph dataGraph is a powerful modeling tool for analyzing massive networks.Graph data is ever

15、ywhere(e.g.chemistry,biology,image,vision,social networks,the Web,etc.).The outstanding property of graph data is massive.,Finding Maximal Cliques in Massive Networks by H*-Graph,22,MotivationThis has become a serious concern in view of the massive volume of todays fast-growing network graphs.Web gr

16、aph has over 1 trillion webpages(google)Social networks have millions to billions of users(Facebook,Linkedin)Maximal Clique Enumeration(MCE)is very useful and helpful for analyzing massive graph data.How to find MCE in massive graph?The best algorithm require memory space linear in the size of the i

17、nput graph,which is clearly infeasible on massive graph.,Finding Maximal Cliques in Massive Networks by H*-Graph,23,Challenges&MethodsClique:A subset of vertices such that every two vertices are connected.Clique problem is NP-Complete.Maximal Clique:If no more vertices can be added to the clique.The

18、 Graph has 5 maximal cliques1,2,5,2,3,3,4,4,5 and 4,6Due to Massive graph,authors provide an External-memory algorithm for MCE(ExtMCE).One critical problem must be handled.What portion should be chosen at each recursive step and how?H*-graph is a core of graph,which can be stand for the massive grap

19、h.Only finding MCE in H*-graph that can fit into memory.H*-graph is the largest set of h vertices in G that have degree at least h.Therefore,authors maintain and update MCE in H*-graph.,Finding Maximal Cliques in Massive Networks by H*-Graph,24,Finding Maximal Cliques in Massive Networks by H*-Graph

20、(Sigmod 2010)Large-Scale Collective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011)The Anatomy of a Large-Scale Social Search Engine(WWW2010),Research works related with Big Data,25,MotivationTwo kinds of ApproachesPair-wise Entity MatchingLabel pairs as ma

21、tch/non-match independentlyIgnoring the relational informationLow accuracyCollective Entity MatchingLabel all pairs collectively High accuracyOften scale only to a few 1000 entities,Large-Scale Collective Entity Matching,How can we scale Collective Entity Matching to millions of entities?,26,Motivat

22、ionTwo kinds of ApproachesPair-wise Entity MatchingLabel pairs as match/non-match independentlyIgnoring the relational informationLow accuracyCollective Entity MatchingLabel all pairs collectively High accuracyOften scale only to a few 1000 entities,Large-Scale Collective Entity Matching,How can we

23、scale Collective Entity Matching to millions of entities?,27,MethodThe scalable EM framework consists of three key componentsModeling an entity matcher as a block boxRunning multiple instances of the matcher on small subsets of entitiesUsing message passing across the instances to control the intera

24、ction between different runs of the matcher.,Large-Scale Collective Entity Matching,Vibhor Rastogi,Nilesh Dalvi,Minos Garofalakis,Large-Scale Collective Entity Matching.Proceedings of the VLDB Endowment,Vol.4,No.4,28,Finding Maximal Cliques in Massive Networks by H*-Graph(Sigmod 2010)Large-Scale Col

25、lective Entity Matching(VLDB2011)Estimating Sizes of Social Networks via Biased Sampling(WWW 2011),Research works related with Big Data,29,MotivationSocial network have become pretty big:Facebook(650,000,000)Qzone(200,000,000)Twitter(175,000,000)No public API for population size queries.Exhaustive c

26、rawl is time/space/communication intensive and violates“politeness”Goal:Obtaining estimates for sizes of populations in social network with limit public API calls.,Estimating Sizes of Social Networks via Biased Sampling,MethodBiased sampling random walk on directed graphConstruct 4 statistics:C the

27、number of collisions.C the number of non-unique elements the sum of the sampled nodes degrees.the sum of the inverse sampled nodes degrees.Two way to estimate the number of nodes:At least samples are needed to guarantee the accuracy of the estimate.,Large-Scale Collective Entity Matching,collision b

28、ased estimator,non-unique element based estimator,Example,1,2,3,5,4,c,3,0,1/3,-,3,f,c,b,d,d,3,3,2,4,4,0,0,0,1,2,5,9,12,15,19,5/6,13/12,17/12,21/12,2,-,-,-,13,9,seed,32,Outline,What is Big DataThe Framework of Big DataThe Applications of Big DataThe Challenges of Big DataSome research works related Big DataConclusions,33,Conclusions,Data on todays scales require scientific and computational intelligence.Big Data is a challenge and an opportunity for us.,34,Thank You,

展开阅读全文
相关资源
猜你喜欢
相关搜索
资源标签

当前位置:首页 > 建筑/施工/环境 > 项目建议


备案号:宁ICP备20000045号-2

经营许可证:宁B2-20210002

宁公网安备 64010402000987号