中国电信——Google核心技术初探.ppt

资源描述

《中国电信——Google核心技术初探.ppt》由会员分享，可在线阅读，更多相关《中国电信——Google核心技术初探.ppt（106页珍藏版）》请在三一办公上搜索。

1、,核心技术初探,中国电信集团股份有限公司广州研究院,提纲,1 由廉价计算机组成的计算群集Google的全球性架构2 GFSGoogle的文件系统浅析3 ChubbyGoogle的锁服务4 bigtableGoogle为结构化数据开发的分布式数据库5 map/reduceGoogle的分布式并行编程框架6 sawzallGoogle的并行编程语言7 GAEGoogle Application Engine8 开源方面的Hadoop 判断网页的重要性Pagerank算法简介及正确性浅析总结,Google是什么？,Google,Google 本是数学中的一个名词，它表示一个十分巨大的数：1后面跟

2、100个 0（即 10 的100次方）Sergey Brin 和Larry Page 使用Google 作为自己的搜索引擎和公司名的主要原因是希望自己设计的 Web 搜索引擎将来能够支持十分巨大的 Web页面检索,Google集群计算机数量的增长,万台,2001.3,0.8,2006.8,45,2003,10,2008.10,80万？,90,值得注意的Google的查询结果,约有427,000项符合google cluster ieee的查询结果，以下是第1-10项（搜索用时 0.24 秒）查询时间 2008-8-30 9:00约有426,000项符合google cluster ieee的查

3、询结果，以下是第1-10项（搜索用时 0.27 秒）查询时间 2008-8-30 15:47,Why?,What will happen when a user enters a query to Google?,Google的查询服务架构,1、DNS负载均衡（考虑全球各个群集的负载情况）2、群集内部的硬件负载均衡设备制定一个web server3、将query执行拆分成两个主要阶段：（1）index server进行反向索引查询，并输出各关键字对应文档集合的交集（docid的列表，排序）（2）根据docid的列表由docment server计算实际的title、url和摘要,http:/+

4、cluster,使用复制技术提升容量及容错,查询行为的特点：只读操作为主、更新操作（写）很少充分利用群集固有的并行计算特性将数据及处理器分割成独立的shard，这样能够使性能近似线性增加强调群集的整体吞吐量而不是单一线程的高性能依赖软件可靠性，不使用RAID使用复制以达到高吞吐量和高可用性性价比为第一考虑，只配置目前性价比最好的cpu使用廉价的普通pc,Google的数据中心,Tons of cheap x86 boxes Price/performance is king Power is a major issue,too Dataset is split up over multiple

5、 machines Index servers Document servers Spellcheck servers etc.,Google典型的硬件配置,Cheap x86 boxes,40-80/rack Modest disk space/box(250GB?;one per box)Intra-rack network is 100baseT Inter-rack network is 1000baseT They build their own servers and switches Have many patents on cable/rack/power innovation

6、s,性价比,Traditional big-iron box(circa 2003)8 2GHz Xeons 64GB RAM 8TB disk$758,000 USD Prototypical GOOG rack(circa 2003)176 2GHz Xeons 176GB RAM 7TB disk$278,000 USD,有了分布式架构，Google还需要淘汰机器吗？,集群中的跨代cpu，从单一处理器的533MHz的Intel Celeron服务器到双1.4GHz的奔腾III服务器（2003年的情况）Google的标准是计算每查询成本/性能，因此每台机器只使用2-3年另外一个因素，即使G

7、oogle开发了大规模并行算法，也需要淘汰过久的硬件，使得能够保持相对简单的并行控制,Google真的要建水电站吗？,商业IDC的能耗是70-150W每平方英尺，Google的数据中心达到400-700W每平方英尺用于冷却的空调系统、UPS系统也需要专门设计,Google对微处理器的看法,需要适度的高CPI（性能因子，Clock cycles per instruction）指令层没有太多可开发的并行计算潜力对于线程级的并发，微电路层的并行计算架构是有希望的（按照Google的说法，双核cpu能提高30%的性能）,小结：Google的做法,Replication,replication,rep

8、lication Each piece of data is available on multiple machines Literally dozens of copies of the Web across their clusters Requests are split up over the logical clusters and handled in parallel,Google的逻辑结构图,Google需要解决海量计算的需求,more queries,better results,more data,Every Google service sees continuing

9、growth in computational needsMore queriesMore users,happier usersMore dataBigger web,mailbox,blog,etc.Better resultsFind the right information,and find it faster,Google Technology Layers,Computing platform,Distributed systems infrastructure,Services and Applications,的核心组件,Distributed lockserver:chub

10、by.Distributed storage:GFS,BigTable.The workqueue.Parallel computation:MapReduce,Sawzall.,GFS:The Google File System,Reliable distributed storage system for petabyte scale filesystems.Data kept in 64-megabyte“chunks”stored on disks spread across thousands of machinesEach chunk replicated,usually 3 t

11、imes,on different machines so that GFS can recover seamlessly from disk or machine failure.A GFS cluster consists of a single master,multiple chunkservers,and is accessed by multiple clients.,Client,Client,Misc.servers,Client,Replicas,Masters,Chunkserver 1,Chunkserver N,Chunkserver 2,Master manages

12、metadataData transfers happen directly between clients/chunkserversFiles broken into chunks(typically 64 MB)Chunks triplicated across three machines for safetySee SOSP03 paper at http:/,GFS:The Google File System,GFS的系统架构,1、一个GFS集群包含一个主服务器和多个块服务器，被多个客户端访问 2、文件被分割成固定尺寸的块。在每个块创建的时候，服务器分配给它一个不变的、全球唯一的6

13、4位的块句柄对它进行标识。块服务器把块作为linux文件保存在本地硬盘上，并根据指定的块句柄和字节范围来读写块数据。3、主服务器管理文件系统所有的元数据，包括名称空间，访问控制信息，文件到块的映射信息，以及块当前所在的位置。它还管理系统范围的活动，例如块租用管理，孤儿块的垃圾回收，以及块在块服务器间的移动。主服务器用心跳信息周期地跟每个块服务器通讯，发送指令并收集块服务器状态。,GFS的接口,GFS提供了一个类似传统文件系统的接口，虽然它并没有实现类似POSIX的标准API。文件在目录中按照层次组织，用路径名来标识。我们支持常用的操作，如创建，删除，打开，关闭，读和写文件。GFS有快照和记录追

14、加操作。快照操作可以用很低的成本创建文件或者目录树的拷贝。记录追加操作可以在保证原子性的前提下，允许多个客户端同时在一个文件上追加数据。这对于实现多路结果合并以及生产者-消费者模型非常有好处，多个客户端可以同时在一个文件上追加数据，而不需要任何额外的锁定。,采用64M的块的优缺点,首先，减少了客户端和主服务器通讯的需求，因为对同一个块的读写，只需要一次用于获得块位置信息的与主服务器的通讯。对Google非常重视的工作负载来说，这种减少尤其明显，因为Google的应用程序经常连续读写巨大的文件。其次，由于块尺寸很大，所以客户端会对一个给定的块进行许多操作，这样就可以通过跟块服务器保持较长时间的T

15、CP连接来减少网络负载。第三，降低了主服务器需要保存的元数据的尺寸。这就允许把元数据放在内存中。缺点，对于小文件来说，某个块可能成为访问热点，从而产生性能瓶颈。,采用压缩存储,Google 采用 ZLIB 压缩算法先对原始 Web页面信息进行压缩，然后只存储压缩后的结果。ZLIB 压缩算法对 WEB文档的压缩比为 3：1因此一个 64MB 的大文件，实际上包含 192MB 的原始 Web 文档。,sync,length,Compressed packet,sync,length,Compressed packet,Packet的格式,docid,encode,urllen,Pagelen,ur

16、l,page,主服务器如何应对失效,主服务器保存三种主要类型的元数据：文件和块的命名空间，文件到块的映射，以及每个块副本的位置。所有的元数据都保存在主服务器的内存里。除此之外，前两种类型（命名空间和文件块映射）的元数据，还会用日志的方式保存在主服务器的硬盘上的操作日志内，并在远程的机器内复制一个副本。使用日志可以简单可靠的更新主服务器的状态，而且不用担心服务器崩溃带来的数据不一致的风险。主服务器不会持久化保存块的位置信息。主服务器在自己启动以及块服务器加入集群的时候，询问块服务器它所包含的块的信息。,GFS简化的一致性模型,文件命名空间的修改（例如，文件创建）是原子性的。他们仅受主服务器的控制

17、：命名空间锁定保证了原子性和正确性。主服务器的操作日志定义了这些操作的全局总顺序在一系列成功的操作后，被操作的数据范围被保证为已定义的，并且包含最后一次操作写入的数据。（a）GFS对块的多个副本采用一样的顺序进行操作。（b）并使用块版本号来检测副本是否因为它所在的块服务器当机而错过了某些操作，而失效了。失效的副本不会再被任何操作涉及，也不会被主服务器作为块位置告知客户端，而会优先被垃圾收集。倚靠追加而不是其他的写入、检查点、写操作的自验证、自说明的记录,GFS的名称空间管理和锁,每个主服务器操作运行之前都需要获得一系列的锁。例如，如果操作包含/d1/d2/./dn/leaf，首先获得目录/d

18、1，/d1/d2，.，/d1/d2/./dn的读取锁，以及全路径/d1/d2/./dn/leaf的读写锁。因为名称空间可以有许多节点，所以读写锁需要的时候才会被分配，一旦不再使用就会被删除。锁的获取依据一个全局一致的顺序来避免死锁：首先由名称空间的层次决定顺序，统一层次内的锁顺序由字典顺序决定。,GFS的高可用性,一个块服务器失效后，有些块的副本数量可能过低，必须被克隆以恢复它们的复制水平。恢复所有这样的块需要的时间取决于资源的总量。在实验中，停掉集群B里面的一台块服务器。这个块服务器有15000个块，包含600GB的数据。为了限制对正在运行的程序的干扰，为一些定期任务提供余地，默认参数限制集

19、群中最多有91个并行的克隆操作（块服务器数量的40%），每个克隆操作的速度可以是6.25MB（50Mbps）。所有的块会在23.2分钟内恢复，复制的速度是440MB/s。在另外的实现中，停掉两个块服务器，每个大概有16000个块和660GB数据。这个双倍的失效，造成266个块只有一个副本。这266个块被优先复制，在2分钟内所有都恢复到至少两个副本，这样就把集群推到一个状态，可以容忍另外的块服务器失效，不会造成数据丢失。,Chubby A Distributed lockservice,特点：不是编程框架，也不是单机版上传统的互斥锁，而是一种由集群系统提供的锁服务一个chubby的实例称之为一个

20、chubby cell，可以为1万台机器提供锁服务大部分采用集中式部署，并至少复制一个远程cell主要的设计目标：可靠性、可用性、大量用户、易理解的语法,chubbygoogle帝国的基石,依赖chubby运行的系统有：1.GFS-use chubby lock to appoint a GFS master server 2.Bigtable-use chubby in serveral ways:to elect a master,to allow the master to discover the server it control,and to permit clients to f

21、ind the master 3.GFS and bigtable-use chubby to store a small amount of meta-data在部署chubby之前，google的分布式系统采用ad hoc的方法来选举master或者手工指定Chubby的优势：提高可用性，在failure over的时候不需要人工干涉Chubby采用Paxos协议来解决分布式一致性的问题,为什么是锁服务而不是其它？,首先，google发现他们的程序员有时候没有按照架构设计者的要求在设计中考虑高可用性，没有使用一致性协议其次，很多的google服务需要选举mater或者需要一个机制来在不同的

22、机器间发布数据，chubby的成功在于充当了一个名字服务器并提供一致性的客户端caching而不是time-based caching第三，锁接口更容易被程序员熟悉，部分程序员在分布式环境中常常错误地使用锁最后，分布式一致性算法使用多数派来作决定，因此需要使用几个副本来获得高可用性因此，google的架构师选择了锁服务，而不是一个开发库或者是一致性服务，并且提供小文件存储功能来允许mater宣告他们自己以及相关参数,chubby锁的特点,长期锁（hours or days）而不是短期锁（seconds or less）好处：长期锁使得chubby server的事务率和client的事务率关系

23、不大，而且可以减少server当掉时client被大量锁定的风险,Chubby的系统结构,典型的场景是5台服务器组成一个cell，选择mater和数据同步均采用一致性协议,Client application,Chubby library,Client application,Chubby library,Client processes,RPCs,5 server of a chubby cell,master,Chubby server放在不同的机架上,Chubby锁的形式,Exports a filesystem interface,similar to unixDirectory ba

24、sed approach:/ls/foo/wombat/pouchCoarse-grained locks,can store small amount of data in a lock5 replicas,need a majority vote to be activeGeneric operations:create,read,write,lockCallbacks on data stored in Chubby:File contents changedChild node added,removed,changedLock expired,lock competition,.Ch

25、ubby master failure,Chubby的事件,le contents modiedoften used to monitor the lo-cation of a service advertised via the lechild node added,removed,or modiedused to im-plement mirroring.(In addition to allowing new les to be discovered,returning events for child nodes makes it possible to monitor ephemer

26、al les without affecting their reference counts.)Chubby master failed overwarns clients that other events may have been lost,so data must be rescanned.a handle(and its lock)has become invalidthis typi-cally suggests a communications problem.lock acquiredcan be used to determine when a pri-mary has b

27、een elected.conicting lock request from another clientallows the caching of locks.,Caching,长期caching而不是time-based，由服务器通知客户端cache无效，客户端自身也有一个租约，修改操作仅在服务期知道所有client都将原有的cache无效后进行Client只有是cache无效的操作，而不是update内容虚同步的方法在已经存在多种通信协议的环境中是不合适的,Session and KeepAlives,A client requests a new session on rst con

28、tacting the master of a Chubby cell.It ends the session explicitly either when it terminates,or if the session has been idle(with no open handles and no calls for a minute)每个session还有一个关联的lease timer，lease timer本质上是一个Qos承诺（保证chubby锁服务提供响应的最大等待时间），可用于服务器负载平衡（流量控制）Client端的lease timer timeout之后，进入所谓危险期

29、（此时没有断开session），而是进入所谓grace period（45s by default），如果在grace peroid能够重新连接上服务器，就继续会话的操作，否则向上层应用返回错误原则：尽量不restart，如果断开之后重新连接之后所有原操作均不能执行,Fail-over过程,除了延迟，上层应用感觉不到master的变化,清除cache，启动grace perios timer,Chubby的数据库,第一个版本用Berkeley DB，后来嫌Berkeley的开发人员对分布式复制部分功能的代码更新不够快，自己做了一个类似Birrell的简单的database，实现了原子操作，取消

30、了事务操作,备份,每几个小时，每个chubby cell的master就写一个自己数据库的snapshot到另一栋建筑的另一个GFS file server中，防止循环依赖,滥用的客户端,缺少更激进的caching，很多程序员喜欢写个循环不停地去retry一个不存在的文件，因此需要惩罚过多使用open()的应用，另外，采用negative caching（BSD的虚拟文件系统）的方法来解决缺少quotas，导入一个256kBytes的chubby文件大小限制把chubby当Publish/subscribe来用是划不来的，主要是因为chubby非常重视对数据的无效操作而不是数据更新操作,教训,

31、开发者是很少考虑可用性的细纹理的锁是不需要的不用TCP，而用UDP作为RPCs的网络层通信协议（主要是TCP本身的拥塞控制及回退机制让google无法精确控制各种timer）,BigTable,A distributed storage system for managing structured data that is designed to scale to a very large size:petabytes of data across thousands of commodity servers.Built on top of GFS.Used by more than 60 G

32、oogle products and projects including Google Earth,Google Finance,Orkut,BigTable:An Example for crawl data,A web crawling system might use BigTable that stores web pages.Each row key could represent a specific URL,with columns for the page contents,the language,the references to that page,or other m

33、etadata.The row range for a table is dynamically partitioned between servers.Rows are clustered together on machines by key,so using URLs as keys would minimize the number of machines where pages from a single domain are stored.Each cell is timestamped so there could be multiple versions of the same

34、 data in the table.,Data model:a big map,triple for key-lookup,insert,and delete APIArbitrary“columns”on a row-by-row basisColumn family:qualifier.Family is heavyweight,qualifier lightweightColumn-oriented physical store-rows are sparse!Does not support a relational modelNo table-wide integrity cons

35、traintsNo multirow transactions,SSTable,Immutable,sorted file of key-value pairsChunks of data plus an index Index is of block ranges,not values,Index,64K block,64K block,64K block,SSTable,Tablet,Contains some range of rows of the tableBuilt out of multiple SSTables,Index,64K block,64K block,64K blo

36、ck,SSTable,Index,64K block,64K block,64K block,SSTable,Tablet,Start:aardvark,End:apple,Table,Multiple tablets make up the tableSSTables can be sharedTablets do not overlap,SSTables can overlap,SSTable,SSTable,SSTable,SSTable,Tablet,aardvark,apple,Tablet,apple_two_E,boat,Finding a tablet,Servers,Tabl

37、et servers manage tablets,multiple tablets per server.Each tablet is 100-200 megsEach tablet lives at only one serverTablet server splits tablets that get too bigMaster responsible for load balancing and fault toleranceUse Chubby to monitor health of tablet servers,restart failed serversGFS replicat

38、es data.Prefer to start tablet server on same machine that the data is already at,Tablet的操作,Editing a table,Mutations are logged,then applied to an in-memory versionLogfile stored in GFS,SSTable,SSTable,Tablet,apple_two_E,boat,Insert,Insert,Delete,Insert,Delete,Insert,Memtable,Compactions,Minor comp

39、action convert the memtable into an SSTableReduce memory usage Reduce log traffic on restartMerging compactionReduce number of SSTablesGood place to apply policy“keep only N versions”Major compactionMerging compaction that results in only one SSTableNo deletion records,only live data,性能优化方法：Locality

40、 Groups,Group column families together into an SSTableAvoid mingling data,ie page contents and page metadataCan keep some groups all in memoryCan compress locality groupsBloom Filters on locality groups avoid searching SSTable,其它方面的优化,两阶段压缩模式两层次的Caching布隆过滤器单一的commit-log文件加速Tablet的移动使用不可变的数据,Microbe

41、nchmarks,Application at Google,小结：bigtable的成功之处,最成功之处，对很多技术的选择、取舍都恰到好处,Workqueue:Scheduling many jobs on many machines,A large scale time sharing system built out of an array of computers and their ram,cpus and disks.Schedules jobs,allocates resources,reports status,and collects the results.Similar

42、to other distributed systems described in the literature,such as Condor.The same machines that serve as chunkservers for a GFS cell can also contribute CPU and memory resources to a workqueue because the computational requirements of a GFS storage cell are negligible.So a machine A may contribute to

43、 storage cluster(GFS cell)B and also to compute cluster(workqueue)C-the compute and storage cells can(and will)overlap.,Workqueue:Scheduling many jobs on many machines,GFS Master,GFSChunkserver,Job 0task,Machine 1,Workqueueslave,Workqueuemaster,Bigmemory job task,GFSChunkserver,Job 2task,Workqueuesl

44、ave,Bigmemory job task,Machine N,MapReduce引言,MapReduce中最重要的两个词就是Map（映射）和Reduce（规约）。初看Map/Reduce这两个词，熟悉Function Language的人一定感觉很熟悉。FP把这样的函数称为”higher order function”（”High order function”被成为Function Programming的利器之一哦），也就是说，这些函数是编写来被与其它函数相结合（或者说被其它函数调用的）。如果说硬要比的化，可以把它想象成C里面的CallBack函数，或者STL里面的Functor。比如

45、你要对一个STL的容器进行查找，需要制定每两个元素相比较的Functor（Comparator），这个Comparator在遍历容器的时候就会被调用。,MapReduce引言（二）,对图像处理程序来说，其实大多数的图像处理操作都是对图像矩阵进行某种运算。这里的运算通常有两种，一种是映射，一种是规约。拿两种效果来说，”老照片”效果通常是强化照片的G/B值，然后对每个象素加一些随机的偏移，这些操作在二维矩阵上的每一个元素都是独立的，是Map操作。而”雕刻”效果需要提取图像边缘，就需要元素之间的运算了，是一种Reduce操作。再举个简单的例子，一个一维矩阵（数组）0,1,2,3,4可以映射为0,2,

46、3,6,8（乘2），也可以映射为1,2,3,4,5（加1）。它可以规约为0（元素求积）也可以规约为10（元素求和）。,MapReduce引言（三）,面对复杂问题，古人教导我们要“分而治之”，英文中对应的词是”Divide and Conquer“。Map/Reduce其实就是Divide/Conquer的过程，通过把问题Divide，使这些Divide后的Map运算高度并行，再将Map后的结果Reduce（根据某一个Key），得到最终的结果。Googler发现这是问题的核心，其它都是共性问题。因此，他们把MapReduce抽象分离出来。这样，Google的程序员可以只关心应用逻辑，关心根据哪些

47、Key把问题进行分解，哪些操作是Map操作，哪些操作是Reduce操作。其它并行计算中的复杂问题诸如分布、工作调度、容错、机器间通信都交给Map/Reduce Framework去做，很大程度上简化了整个编程模型。,MapReduce引言（四）,MapReduce的另一个特点是，Map和Reduce的输入和输出都是中间临时文件（MapReduce利用Google文件系统GFS来管理和访问这些文件），而不是不同进程间或者不同机器间的其它通信方式。这是Google一贯的风格，化繁为简，返璞归真。,MapReduce引言（五）,Map的定义：Map,written by the user,takes

48、 an input pair and produces a set of intermediate key/value pairs.The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.Reduce的定义：The Reduce function,also written by the user,accepts an intermediate key I and

49、a set of values for that key.It merges together these values to form a possibly smaller set of values.Typically just zero or one output value is produced per Reduce invocation.The intermediate values are supplied to the users reduce function via an iterator.This allows us to handle lists of values t

50、hat are too large to fit in memory.,MapReduce,A programming model and an associated implementation for processing and generating large data sets.A user specified map function processes a key/value pair to generate a set of intermediate key/value pairs.A user specified reduce function merges all inte

展开阅读全文