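Cloud Computing and Cloud Data Management
Jiaheng Lu (陆嘉恒), Renmin University of China
Frontier Workshop on Advanced Data Management

Outline
- Overview of cloud computing
- Google cloud computing technologies: GFS, Bigtable and MapReduce
- Yahoo cloud computing technologies and Hadoop
- Challenges of cloud data management

Renmin University's new course on distributed systems and cloud computing
- Overview of distributed systems
- Survey of distributed cloud computing technologies
- Distributed cloud computing platforms
- Distributed cloud computing program development

Part 1: Overview of Distributed Systems
- Chapter 1: Introduction to distributed systems
- Chapter 2: Client-server architecture
- Chapter 3: Distributed objects
- Chapter 4: Common Object Request Broker Architecture (CORBA)

Part 2: Survey of Cloud Computing
- Chapter 5: Introduction to cloud computing
- Chapter 6: Cloud services
- Chapter 7: Comparison of cloud-related technologies
  - 7.1 Grid computing and cloud computing
  - 7.2 Utility computing and cloud computing
  - 7.3 Parallel and distributed computing and cloud computing
  - 7.4 Cluster computing and cloud computing

Part 3: Cloud Computing Platforms
- Chapter 8: The three core technologies of the Google cloud platform
- Chapter 9: Technologies of the Yahoo cloud platform
- Chapter 10: Technologies of the Aneka cloud platform
- Chapter 11: Technologies of the Greenplum cloud platform
- Chapter 12: Technologies of the Amazon Dynamo cloud platform

Part 4: Cloud Computing Platform Development
- Chapter 13: Development on Hadoop
- Chapter 14: Development on HBase
- Chapter 15: Development on Google Apps
- Chapter 16: Development on MS Azure
- Chapter 17: Development on Amazon EC2

Cloud computing
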
Why do we use cloud computing?
Case 1:
- Write a file and save it
- The computer goes down and the file is lost
- Files always stored in the cloud are never lost

Why do we use cloud computing?
Case 2:
- Use IE: download, install, use
- Use QQ: download, install, use
- Use C++: download, install, use
- Or get the service from the cloud

What is the cloud, and what is cloud computing?
Cloud
- On-demand resources or services over the Internet
- With the scale and reliability of a data center

What is the cloud, and what is cloud computing?
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
- Users need not have knowledge of, expertise in, or control over the technology infrastructure in the cloud that supports them.

Characteristics of cloud computing
- Virtual: software, databases, Web servers, operating systems, storage and networking as virtual servers.
- On demand: add and subtract processors, memory, network bandwidth, storage.

Types of cloud service
- IaaS: Infrastructure as a Service
- PaaS: Platform as a Service
- SaaS: Software as a Service

SaaS
- Software delivery model
- No hardware or software to manage
- Service delivered through a browser
- Customers use the service on demand
- Instant scalability

SaaS: Examples
- Your current CRM package is not managing the load, or you simply don't want to host it in-house. Use a SaaS provider such as S
- Your email is hosted on an Exchange server in your office and it is very slow. Outsource this using Hosted Exchange.

PaaS
- Platform delivery model
- Platforms are built upon infrastructure, which is expensive
- Estimating demand is not a science!
- Platform management is not fun!

PaaS: Examples
- You need to host a large file (5 MB) on your website and make it available to 35,000 users for only two months. Use CloudFront from Amazon.
- You want to start storage services on your network for a large number of files and you do not have the storage capacity. Use Amazon S3.

IaaS
- Computer infrastructure delivery model
- A platform virtualization environment
- Computing resources, such as storage and processing capacity
- Virtualization taken a step further

IaaS: Examples
- You want to run a batch job but you don't have the infrastructure necessary to run it in a timely manner. Use Amazon EC2.
- You want to host a website, but only for a few days. Use Flexiscale.

Cloud computing and other computing techniques

The 21st Century Vision Of Computing
Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project, which seeded the Internet, said:
"As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of computer utilities which, like present electric and telephone utilities, will service individual homes and offices across the country."

The 21st Century Vision Of Computing
Sun Microsystems co-founder Bill Joy also indicated:
"It would take time until these markets to mature to generate this kind of value. Predicting now which companies will capture the value is impossible. Many of them have not even been created yet."

Definitions
- Utility computing: the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.
- Computer cluster: a group of linked computers, working together closely so that in many respects they form a single computer.
- Grid computing: the application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.
- Cloud computing: a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.

Grid Computing & Cloud Computing
- They have a lot in common: intention, architecture and technology
- They differ in: programming model, business model, compute model, applications, and virtualization

Grid Computing & Cloud Computing
The problems are mostly the same:
- manage large facilities;
- define methods by which consumers discover, request and use resources provided by the central facilities;
- implement the often highly parallel computations that execute on those resources.

Grid Computing & Cloud Computing
Virtualization
- Grid: Grids do not rely on virtualization as much as Clouds do; each individual organization maintains full control of its resources
- Cloud: virtualization is an indispensable ingredient for almost every Cloud

Any questions or comments?

Google Cloud computing techniques

The Google File System (GFS)
- A scalable distributed file system for large, distributed, data-intensive applications
- Multiple GFS clusters are currently deployed. The largest ones have:
  - 1000+ storage nodes
  - 300+ terabytes of disk storage
  - heavy access by hundreds of clients on distinct machines

Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment

Intro: Observations 1
1. Component failures are the norm
   - Constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system
2. Huge files (by traditional standards)
   - Multi-GB files are common
   - I/O operations and block sizes must be revisited

Intro: Observations 2
3. Most files are mutated by appending new data
   - This is the focus of performance optimization and atomicity guarantees
4. Co-designing the applications and APIs benefits the overall system by increasing flexibility

The Design
- A cluster consists of a single master and multiple chunkservers and is accessed by multiple clients

The Master
- Maintains all file system metadata: namespace, access control info, file-to-chunk mappings, chunk (including replica) locations, etc.
- Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state

The Master
- Helps make sophisticated chunk placement and replication decisions, using global knowledge
- For reading and writing, the client contacts the master to get chunk locations, then deals directly with chunkservers
  - The master is not a bottleneck for reads/writes

Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle
  - The handle is assigned by the master at chunk creation
- Chunk size is 64 MB
- Each chunk is replicated on 3 (default) servers

Clients
- Linked to apps using the file system API
- Communicate with the master and chunkservers for reading and writing
  - Master interactions only for metadata
  - Chunkserver interactions for data
- Only cache metadata information
  - Data is too large to cache

Chunk Locations
- The master does not keep a persistent record of the locations of chunks and replicas
- It polls chunkservers for this information at startup, and when chunkservers join or leave
- It stays up to date by controlling the placement of new chunks and through HeartBeat messages (when monitoring chunkservers)

Operation Log
- Record of all critical metadata changes
- Stored on the master and replicated on other machines
- Defines the order of concurrent operations
- Also used to recover the file system state

System Interactions: Leases and Mutation Order
- Leases maintain a mutation order across all chunk replicas
- The master grants a lease to a replica, called the primary
- The primary chooses the serial mutation order, and all replicas follow this order
- Minimizes management overhead for the master

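
The read path described in the Master, Chunkservers and Clients slides can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `Master`, `ChunkServer`, `read`, the file name and the chunk handles are hypothetical names made up for illustration, not the GFS API, and the demo uses 8-byte chunks instead of 64 MB so the data stays readable. It shows a client translating a byte offset into a chunk index, asking the master only for metadata (chunk handle and replica locations), and then fetching the bytes directly from a chunkserver.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the slides


def chunk_index(offset: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Translate a byte offset within a file into a chunk index."""
    return offset // chunk_size


@dataclass
class ChunkServer:
    """Hypothetical chunkserver: stores chunk bytes keyed by chunk handle."""
    chunks: dict = field(default_factory=dict)

    def read(self, handle: int, start: int, length: int) -> bytes:
        return self.chunks[handle][start:start + length]


@dataclass
class Master:
    """Hypothetical master: holds metadata only, never file data."""
    file_to_handles: dict = field(default_factory=dict)    # filename -> [handle, ...]
    handle_to_servers: dict = field(default_factory=dict)  # handle -> [ChunkServer, ...]

    def lookup(self, filename: str, index: int):
        handle = self.file_to_handles[filename][index]
        return handle, self.handle_to_servers[handle]


def read(master: Master, filename: str, offset: int, length: int,
         chunk_size: int = CHUNK_SIZE) -> bytes:
    """Client-side read that stays within one chunk (a simplification)."""
    index = chunk_index(offset, chunk_size)
    handle, replicas = master.lookup(filename, index)   # metadata from the master
    # Data comes straight from a chunkserver; the master is not on this path.
    return replicas[0].read(handle, offset % chunk_size, length)


if __name__ == "__main__":
    size = 8  # tiny chunks for the demo only
    data = b"abcdefghijklmnopqrstuvwx"
    servers = [ChunkServer() for _ in range(3)]  # 3 replicas, the default in the slides
    master = Master()
    for n, start in enumerate(range(0, len(data), size)):
        handle = 1000 + n  # hypothetical chunk handle
        for server in servers:
            server.chunks[handle] = data[start:start + size]
        master.handle_to_servers[handle] = servers
        master.file_to_handles.setdefault("/logs/a", []).append(handle)
    print(read(master, "/logs/a", offset=10, length=4, chunk_size=size))  # b'klmn'
```

Because the master answers only the lookup, its load grows with the number of chunk lookups rather than with the number of bytes transferred, which is the sense in which the slides say it is not a bottleneck.
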
Atomic Record Append
- The client specifies the data to write; GFS chooses and returns the offset it writes to, and appends the data to each replica at least once
- Heavily used by Google's distributed applications
- No need for a distributed lock manager
  - GFS chooses the offset, not the client

Atomic Record Append: How?
- Follows a similar control flow to other mutations
- The primary tells secondary replicas to append at the same offset as the primary
- If the append fails at any replica, it is retried by the client
  - So replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record

Atomic Record Append: How?
- GFS does not guarantee that all replicas are bitwise identical
  - It only guarantees that data is written at least once as an atomic unit
- Data must be written at the same offset on all chunk replicas for success to be reported

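
The at-least-once behaviour of record append can be made concrete with a small simulation. It is a sketch, not GFS code: `Replica`, `record_append` and the `failures` fault-injection plan are hypothetical, and filling the gap left by a failed attempt with zero bytes is an assumption of this sketch. The primary picks the offset, every replica writes at that same offset, and a failed attempt is retried at a new offset, so replicas that succeeded the first time end up holding a duplicate.

```python
class Replica:
    """Hypothetical chunk replica backed by a growable byte buffer."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset: int, record: bytes) -> None:
        # Fill any gap left by an earlier failed attempt (assumed zero padding).
        if len(self.data) < offset:
            self.data.extend(b"\x00" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record


def record_append(primary, secondaries, record, failures=()):
    """Append `record` at least once and return the offset reported to the client.

    `failures` is a made-up fault-injection plan: a collection of
    (attempt, replica_position) pairs whose write is dropped.
    Position 0 is the primary.
    """
    replicas = [primary] + list(secondaries)
    attempt = 0
    while True:
        offset = len(primary.data)  # the primary chooses the offset
        succeeded = True
        for position, replica in enumerate(replicas):
            if (attempt, position) in failures:
                succeeded = False   # this replica missed the write
                continue
            replica.write_at(offset, record)
        if succeeded:
            return offset           # same offset on every replica
        attempt += 1                # the client retries the whole append


if __name__ == "__main__":
    p, s1, s2 = Replica(), Replica(), Replica()
    record_append(p, [s1, s2], b"AAAA")
    off = record_append(p, [s1, s2], b"BBBB", failures={(0, 2)})
    print(off)            # 8
    print(bytes(p.data))  # b'AAAABBBBBBBB' (duplicate of BBBB)
    print(bytes(s2.data)) # b'AAAA\x00\x00\x00\x00BBBB' (gap, then the record)
```

After the retry the replicas are not bitwise identical, yet every one of them holds the record at the offset returned to the client, which matches the guarantee stated above.
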
Detecting Stale Replicas
- The master keeps a chunk version number to distinguish up-to-date and stale replicas
- The version is increased when granting a lease
- If a replica is not available, its version is not increased
- The master detects stale replicas when chunkservers report their chunks and versions
- Stale replicas are removed during garbage collection

Garbage Collection
- When a client deletes a file, the master logs it like other changes and renames the file to a hidden name
- The master removes files hidden for longer than 3 days when scanning the file system namespace
  - Their metadata is also erased
- During HeartBeat messages, each chunkserver sends the master a subset of its chunks, and the master tells it which chunks have no metadata
  - The chunkserver removes these chunks on its own

Fault Tolerance: High Availability
- Fast recovery
  - Master and chunkservers can restart in seconds
- Chunk replication
- Master replication
  - "Shadow" masters provide read-only access when the primary master is down
  - Mutations are not done until recorded on all master replicas

Fault Tolerance: Data Integrity
- Chunkservers use checksums to detect corrupt data
  - Since replicas are not bitwise identical, each chunkserver maintains its own checksums
- For reads, the chunkserver verifies the checksum before sending the chunk
- Checksums are updated during writes

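
The read-side integrity check can be sketched as follows. The slides only say that chunkservers keep their own checksums and verify them before sending data; the use of CRC32 via `zlib.crc32`, the 64 KB block granularity, and the names `block_checksums`, `verified_read` and `CorruptBlockError` are all assumptions of this sketch.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum granularity for this sketch


class CorruptBlockError(Exception):
    """Raised when a stored block no longer matches its checksum."""


def block_checksums(chunk: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compute one CRC32 per fixed-size block of the chunk."""
    return [zlib.crc32(chunk[i:i + block_size])
            for i in range(0, len(chunk), block_size)]


def verified_read(chunk: bytes, checksums: list, offset: int, length: int,
                  block_size: int = BLOCK_SIZE) -> bytes:
    """Verify every block overlapping the requested range, then return the data."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    for b in range(first, last + 1):
        block = chunk[b * block_size:(b + 1) * block_size]
        if zlib.crc32(block) != checksums[b]:
            raise CorruptBlockError(f"block {b} failed checksum verification")
    return chunk[offset:offset + length]


if __name__ == "__main__":
    chunk = b"abcdefghij"
    sums = block_checksums(chunk, block_size=4)
    print(verified_read(chunk, sums, 2, 5, block_size=4))  # b'cdefg'
    damaged = b"abcdXfghij"                                # one byte flipped in block 1
    try:
        verified_read(damaged, sums, 2, 5, block_size=4)
    except CorruptBlockError as err:
        print(err)                                         # block 1 failed checksum verification
```

Only the blocks that overlap the requested range are checked, so a read of an undamaged region still succeeds even if another block of the same chunk is corrupt.
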
Introduction to MapReduce

MapReduce: Insight
- "Consider the problem of counting the number of occurrences of each word in a large collection of documents"
- How would you do it in parallel?

MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp
- Users implement an interface of two primary methods:
  1. Map: (key1, val1) -> (key2, val2)
  2. Reduce: (key2, list of val2) -> val3

Map operation
- Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g. (docid, doc-content)
- Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query

Reduce operation
- On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer
- Can be visualized as an aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute

Pseudo-code
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));

MapReduce: Execution overview

MapReduce: Example

MapReduce in Parallel: Example

MapReduce: Fault Tolerance
- Handled via re-execution of tasks
  - Task completion is committed through the master
- What happens if a mapper fails?
  - Re-execute completed and in-progress map tasks
- What happens if a reducer fails?
  - Re-execute in-progress reduce tasks
- What happens if the master fails?
  - Potential trouble!

MapReduce: Walk-through of one more application

MapReduce: PageRank
- PageRank models the behavior of a "random surfer"
- C(t) is the out-degree of t, and (1-d) is a damping factor (random jump)
- The "random surfer" keeps clicking on successive links at random, not taking content into consideration
- Each page distributes its rank equally among all the pages it links to
- The damping factor models the surfer "getting bored" and typing an arbitrary URL

PageRank: Key Insights
- The effect of each iteration is local: iteration i+1 depends only on iteration i
- At iteration i, PageRank for individual nodes can be computed independently

PageRank using MapReduce
- Use a sparse matrix representation (M)
- Map each row of M to a list of PageRank "credit" to assign to out-link neighbours
- These prestige scores are reduced to a single PageRank value for a page by aggregating over them

PageRank using MapReduce
- Source of image: Lin 2008

Phase 1: Process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls))
  - PRinit is the "seed" PageRank for URL
  - list-of-urls contains all pages pointed to by URL
- The reduce task is just the identity function

Phase 2: PageRank Distribution
- The reduce task gets (URL, url_list) and many (URL, val) values
  - Sum the vals and fix up with d to get the new PR
  - Emit (URL, (new_rank, url_list))
- Check for convergence using a non-parallel component

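
The two phases above can be sketched in a few lines of single-process Python. Several things here are assumptions rather than content of the slides: the update rule is taken to be the classic form PR(x) = (1 - d) + d * sum(PR(t) / C(t)) over the pages t linking to x, d is set to 0.85, the Phase 2 map step (which the slides do not spell out) is assumed to emit the link list plus an equal share of rank per out-link, HTML parsing is skipped, and every page is assumed to have at least one out-link. The function names are made up for illustration.

```python
from collections import defaultdict

D = 0.85  # assumed value of d; the slides do not give one


def phase1_map(url, out_links, pr_init=1.0):
    """Phase 1 stand-in: pair each URL with a seed rank and its out-links."""
    return url, (pr_init, list(out_links))


def phase2_map(url, value):
    """Emit the link list unchanged, plus an equal share of rank per out-link."""
    rank, out_links = value
    yield url, out_links                      # keeps the graph structure
    for target in out_links:
        yield target, rank / len(out_links)   # "credit" for each neighbour


def phase2_reduce(url, values, d=D):
    """Sum the incoming credit and fix up with d to get the new rank."""
    out_links, credit = [], 0.0
    for v in values:
        if isinstance(v, list):
            out_links = v
        else:
            credit += v
    return url, ((1 - d) + d * credit, out_links)


def iterate(state, d=D):
    """One Phase 2 pass: map, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for url, value in state.items():
        for key, v in phase2_map(url, value):
            groups[key].append(v)
    return dict(phase2_reduce(url, vals, d) for url, vals in groups.items())


def pagerank(graph, d=D, tol=1e-9, max_iter=100):
    state = dict(phase1_map(u, links) for u, links in graph.items())
    for _ in range(max_iter):
        new_state = iterate(state, d)
        # Convergence check, done outside the map/reduce passes.
        delta = max(abs(new_state[u][0] - state[u][0]) for u in state)
        state = new_state
        if delta < tol:
            break
    return {u: rank for u, (rank, _) in state.items()}


if __name__ == "__main__":
    ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
    for url in sorted(ranks, key=ranks.get, reverse=True):
        print(url, round(ranks[url], 4))
```

Each call to `iterate` depends only on the previous state, and each key is reduced independently, which is what makes the iteration a natural fit for MapReduce.
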
MapReduce: Some More Apps
- Distributed Grep
- Count of URL Access Frequency
- Clustering (K-means)
- Graph Algorithms
- Indexing Systems

MapReduce Programs In Google Source Tree

MapReduce: Extensions and similar apps
- PIG (Yahoo)
- Hadoop (Apache)
- DryadLinq (Microsoft)

Large Scale Systems Architecture using MapReduce

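
The word-count pseudo-code from the MapReduce slides and the "Count of URL Access Frequency" application share the same shape, which a minimal single-process runner can show. This is a sketch only: `run_mapreduce`, `wc_map`, `url_map` and `sum_reduce` are hypothetical names, the log format (one request per line, URL as the first field) is assumed, and counts are plain integers rather than the strings used in the pseudo-code. Nothing here is distributed or fault tolerant.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}


def wc_map(doc_name, contents):
    """Word count: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def url_map(log_name, log_text):
    """URL access frequency: emit (url, 1) per request line (assumed format)."""
    for line in log_text.splitlines():
        if line.strip():
            yield line.split()[0], 1


def sum_reduce(key, values):
    """Add up the counts collected for one key."""
    return sum(values)


if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    print(run_mapreduce(docs, wc_map, sum_reduce))   # {'a': 2, 'b': 2, 'c': 1}

    logs = [("log1", "/index 200\n/about 200\n/index 304\n")]
    print(run_mapreduce(logs, url_map, sum_reduce))  # {'/index': 2, '/about': 1}
```

Only the map function changes between the two applications; the grouping step and the summing reducer are reused unchanged.
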
BigTable: A Distributed Storage System for Structured Data

Introduction
- BigTable is a distributed storage system for managing structured data
- Designed to scale to a very large size
  - Petabytes of data across thousands of servers
- Used for many Google projects
  - Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance
- Flexible, high-performance solution for all of Google's products

Motivation
- Lots of (semi-)structured data at Google
  - URLs: contents, crawl metadata, links, anchors, pagerank
  - Per-user data: user preference settings, recent queries/search results
  - Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations
- Scale is large
  - Billions of URLs, many versions/page (20K/version)
  - Hundreds of millions of users, thousands of q/sec
  - 100TB+ of satellite image data

Why not just use commercial DB?
- Scale is too large for most commercial databases
- Even if it weren't, cost would be very high
  - Building internally means the system can be applied across many projects for low incremental cost
- Low-level storage optimizations help performance significantly
  - Much harder to do when running on top of a database layer

Goals
- Want asynchronous processes to be continuously updating different pieces of data
  - Want access to most current data at any time
- Need to support:
  - Very high read/write rates (millions of ops per second)
  - Efficient scans over all or interesting subsets of data
  - Efficient joins of large one-to-one and one-to-many datasets
- Often want to examine data changes over time
  - E.g. contents of a web page over multi