Multi-Core Parallel Program Design (基于多核的并行程序设计.ppt)


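Multi-Core Parallel Program Design
2023/10/11

Outline
- References: Baidu/Google; the textbooks 并行编程模式 (Parallel Programming Patterns, Tsinghua University Press) and 并行程序设计 (Parallel Program Design, China Machine Press); the instructor and teaching assistants
- Prerequisites: computer organization, operating systems, C/C++

Course plan
- Parallel architectures and multi-core architectures
- Operating systems for multi-core computer systems
- Software development tools for multi-core computer systems
- Software design for multi-core
- Debugging and optimization techniques for programs on multi-core platforms

Multi-core development and its challenges

Raising the curtain on the multi-core era

Basic architecture of multi-core processors

1. Multi-core chips

Background: As chip manufacturing processes keep advancing, traditional processor architecture techniques have hit a bottleneck. Transistor counts per chip already exceed 100 million, and it is hard to gain further performance by raising the clock frequency. On the application side, increasingly complex workloads in multimedia, scientific computing, virtualization and other fields call for more computing power. Against this background, the major processor vendors have shifted their product strategy from raising the clock frequency to multithreading and multiple cores.

Recent developments: Following its dual-core products, Intel launched a 4-core product in November 2006, and AMD launched a 4-core processor code-named Barcelona. Multi-core releases are accelerating: after launching the 8-core processor code-named Niagara, Sun plans to launch the Niagara 2 processor, and Intel recently stated that it will release processors with more than 10 cores next year.

Introduction to multi-core processors
- What is a multi-core processor? Two or more independently running cores integrated on a single processor.
- Dual-core processor = one processor containing 2 cores.
- Why dual core?

2. Chip multi-processor architecture

Definition: A chip multi-processor (CMP) integrates multiple compute cores on one processor chip in order to increase computing power.

Classification: Depending on whether the compute cores are peers, CMPs are either homogeneous or heterogeneous. When the cores are identical and have equal status, the design is called homogeneous multi-core; the dual-core processors that Intel and AMD currently promote are homogeneous. When the cores differ and do not have equal status, the design is called heterogeneous multi-core. Heterogeneous designs use a "main processor + coprocessor" layout; the Cell processor, jointly developed by IBM, SONY and others, is the classic example.

Hardware structure: Programs running on the different CPU cores of a CMP sometimes need to share data and synchronize, so the hardware must support inter-core communication.
- Bus-based shared-cache structure: the CPU cores share a second-level or third-level cache, which holds frequently used data, and communicate over a bus that connects the cores. Advantages: simple structure and fast communication. Disadvantage: bus-based structures scale poorly.
- On-chip interconnect structure: each CPU core has its own processing unit and cache. The cores are connected through a crossbar switch or an on-chip network and communicate by message passing.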

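Advantages of the on-chip interconnect structure: good scalability and guaranteed data bandwidth. Disadvantages: complex hardware, and substantial software changes are required.

Multi-core challenges for software development: the impact of multi-core
- Existing software is already mostly parallel. Multi-core provides a higher-performance execution platform, and what remains is to optimize for multi-core. Result: no real difficulty in adopting multi-core.
- The workload is concurrent by nature, so the application has natural concurrency. Multi-core provides a high-performance computing platform. Result: a minor challenge.
- Most existing programs are serial. Good parallel programming models and development environments are needed. Result: a major challenge.

Multi-core challenges for software development: why is parallel programming hard?
The root cause is that most computers and programming languages were originally designed around the von Neumann model, in which the CPU fetches program instructions one at a time and executes them in order. On a multi-core or multi-CPU computer, several instructions execute at the same time.

Multi-core challenges for software development: the difficulties of parallel programming
- First, communication between tasks running on different processors is a hard problem.
- Second, a parallel system lacks a clear global system state, so it is harder to understand than a serial program.
- Third, each run of a parallel program may follow a different execution path, which makes debugging and tuning much harder.

Challenges brought by multi-core
Multi-core undoubtedly gives us cheaper computing power, but whether that power is put to good use depends on the software. If software is not developed for multi-core, the extra computing power goes unused, and the machine may even perform worse than a single-core CPU. "To some extent, rising CPU clock rates were a free lunch for software developers: every existing program benefited automatically. Now that multi-core has arrived, the free lunch is gone. We must redesign software for multi-core."

Understanding parallel computing

What Is Parallel Computing?
An attempt to speed up the solution of a particular task by
1. Dividing the task into sub-tasks
2. Executing the sub-tasks simultaneously on multiple processors
Successful attempts require both
1. Understanding of where parallelism can be effective
2. Knowledge of how to design and implement good solutions

Why Parallel Computing?
- "The free lunch is over." (Herb Sutter)
- We want applications to execute faster
- Clock speeds are no longer increasing exponentially

Ways of Exploiting Parallelism
- Domain decomposition: divide the data
- Task decomposition: divide the computation
- Pipelining
- A combination of the three

Domain Decomposition
- First, decide how data elements should be divided among processors. What gets partitioned is data: the algorithm's input data, intermediate data, or output data.
- Second, decide which tasks each processor should be doing. The partition takes into account the operations performed on the data; if one task needs data held by another task, inter-task communication results.
- Example: vector addition. Add two vectors of size 100,000 using two processors. The best split is into a front half and a back half.

Domain Decomposition: find the largest element of an array
[Figure: an array, CPU 0, CPU 1, CPU 2 and CPU 3, and a shared scalar variable that will hold the global maximum.]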

[Figure sequence: Domain Decomposition, find the largest element of an array; CPU 0, CPU 1, CPU 2, CPU 3.]
The first CPU copies the maximum value it found into the shared memory location.
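A minimal sketch of this find-the-maximum example in C, under stated assumptions. The slides name no programming interface, so the OpenMP pragmas, the function name `find_max`, and the constant `NUM_CPUS` are illustrative choices, not the course's own code. Each of four workers scans its own contiguous block of the array and then merges its local maximum into the shared scalar inside a critical section. For simplicity, the shared scalar starts as `data[0]` instead of being filled in by whichever CPU finishes first.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CPUS 4  /* assumption: one worker per CPU, CPU 0 .. CPU 3 as in the figure */

/* Returns the largest element of data[0..n-1]; n must be greater than 0. */
double find_max(const double *data, size_t n)
{
    assert(n > 0);
    double global_max = data[0];  /* shared scalar that will hold the global maximum */

    /* One loop iteration per "CPU"; with OpenMP enabled the iterations run in parallel. */
    #pragma omp parallel for
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        /* Domain decomposition: worker `cpu` owns the contiguous block [lo, hi). */
        size_t lo = n * (size_t)cpu / NUM_CPUS;
        size_t hi = n * (size_t)(cpu + 1) / NUM_CPUS;
        if (lo == hi)
            continue;  /* fewer elements than workers: this block is empty */

        /* Each worker first finds the maximum of its own block. */
        double local_max = data[lo];
        for (size_t i = lo + 1; i < hi; i++)
            if (data[i] > local_max)
                local_max = data[i];

        /* Only one worker at a time may update the shared location. */
        #pragma omp critical
        {
            if (local_max > global_max)
                global_max = local_max;
        }
    }
    return global_max;
}
```

When built with OpenMP enabled (for example `-fopenmp` on GCC), the four blocks are scanned in parallel. Without it, compilers typically ignore the pragmas and the same code runs serially with the same result.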

When the last CPU is done, the shared location has the maximum value.

Task (Functional) Decomposition
- First, divide tasks among processors. What gets partitioned is the computation: it is split into distinct tasks, so the starting point differs from domain decomposition.
- Second, decide which data elements are going to be accessed (read and/or written) by which processors. After partitioning, examine the data each task needs. If the data sets are disjoint, the partition is successful; if they overlap substantially, the domain decomposition and the functional decomposition must be redone.
- Example: event handler for a GUI. One processor may be watching the keyboard and mouse while another processor performs the activity related to a previous user action.

Task Decomposition
[Figure: the functions f(), g(), h(), q(), r() and s() connected by arrows, assigned to CPU 0, CPU 1 and CPU 2.]
In a task decomposition we look for functions that can execute simultaneously. In this drawing the arrows represent the precedence constraints among the functions.
Question: Why is there no point in assigning "f", "r", and "s" to different CPUs?
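A rough sketch of such a task decomposition in C. The arrows of the figure did not survive in this text, so the precedence order used here (f first, then g, h and q independently, then r, then s) is an assumption, as are the function bodies, the name `run_task_graph`, and the use of OpenMP sections. Under that assumed order, f, r and s can never run at the same time, so spreading them across CPUs would gain nothing.

```c
/* Hypothetical stand-ins for the six functions in the figure. */
static int f(int x) { return x + 1; }
static int g(int x) { return x * 2; }
static int h(int x) { return x * 3; }
static int q(int x) { return x * 5; }
static int r(int a, int b, int c) { return a + b + c; }
static int s(int x) { return x - 1; }

/* Assumed precedence constraints: f -> {g, h, q} -> r -> s. */
int run_task_graph(int input)
{
    int fx = f(input);          /* must finish before g, h and q can start */
    int gx = 0, hx = 0, qx = 0;

    /* g, h and q depend only on fx, so they may execute simultaneously,
       for example one per CPU. */
    #pragma omp parallel sections
    {
        #pragma omp section
        gx = g(fx);
        #pragma omp section
        hx = h(fx);
        #pragma omp section
        qx = q(fx);
    }
    /* The sections construct ends with an implicit barrier, so all three
       results are ready here. r needs all of them, and s needs r. */
    return s(r(gx, hx, qx));
}
```

With these stand-in bodies, `run_task_graph(1)` evaluates f to 2, the three middle tasks to 4, 6 and 10, r to 20, and s to 19.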

[Figure sequence: the task graph executing on CPU 0, CPU 1 and CPU 2. Blue circles indicate active CPUs.]

Pipelining
- Special kind of task decomposition
- "Assembly line" parallelism
- In a pipelined application, the output of each function is the input to the next function.
- If we are only interested in processing one data set, there is no parallelism.
- The throughput is limited by the slowest stage, so if the stages don't all run at the same speed, the pipeline is inefficient.
- Example: 3D rendering in computer graphics
  Input -> Model -> Project -> Clip -> Rasterize -> Output

Processing One Data Set (Steps 1 to 4)
Here a graphics rendering computation can be divided into four stages: Model, Project, Clip and Rasterize. If we want to process only one data set, it takes one step for each stage. The pipeline processes 1 data set in 4 steps.

Processing Two Data Sets (Steps 1 to 5)
Each CPU performs one specific function: CPU 0, CPU 1, CPU 2 and CPU 3 each handle one stage. The pipeline processes 2 data sets in 5 steps.

Pipelining Five Data Sets (Steps 1 to 7)
[Figure sequence: data sets 0 to 4 moving through CPU 0, CPU 1, CPU 2 and CPU 3.]
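The step counts on these slides follow a simple pattern, which the small C sketch below makes explicit. It assumes that every stage takes exactly one step and that CPU k always runs stage k; the function names are hypothetical and are not part of the original slides. The first data set needs one step per stage, and every further data set finishes one step after the one before it.

```c
#include <assert.h>

/* Steps needed to push n data sets through a pipeline with `stages` stages,
   assuming every stage takes exactly one step. */
int pipeline_steps(int stages, int n)
{
    assert(stages > 0 && n >= 0);
    return n == 0 ? 0 : stages + n - 1;
}

/* Steps needed when each data set passes through all stages before the
   next data set is started (no overlap). */
int sequential_steps(int stages, int n)
{
    assert(stages > 0 && n >= 0);
    return stages * n;
}

/* Index of the data set that stage `stage` (0-based) works on during
   step `step` (1-based), or -1 if that stage is idle in that step. */
int dataset_at(int stage, int step, int n)
{
    int d = step - 1 - stage;  /* data set d enters stage 0 at step d + 1 */
    return (d >= 0 && d < n) ? d : -1;
}
```

For the four-stage rendering pipeline, `pipeline_steps(4, 1)` gives 4 and `pipeline_steps(4, 2)` gives 5, matching the counts stated for one and two data sets. `dataset_at` can be used to rebuild the occupancy table that the figure sequence showed.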

Pipelining Five Data Sets (Step 8)
Question: How much faster is the pipelined computation than a sequential computation?
Answer: It took 8 steps to process 5 data elements. It would have taken 20 steps for the sequential computation to process five data elements. The pipelined computation is 20/8 = 2.5 times faster.

Dependence Graph
- Graph = (nodes, arrows)
- Node for each
  - Variable assignment (except index variables)
  - Constant
  - Operator or function call
- Arrows indicate use of variables and constants
  - Data flow
  - Control flow

Dependence Graph Example #1
for (i = 0; i < 3; i++) a[i] = b[i] / 2.0;
[Figure: b[0], b[1] and b[2] each feed a "/" node together with the constant 2, producing a[0], a[1] and a[2].]
Domain decomposition possible: no iteration uses a value produced by another iteration.

Dependence Graph Example #2
for (i = 1; i < 4; i++) a[i] = a[i-1] * b[i];
[Figure: a[0] and b[1] feed a "*" node producing a[1]; a[1] and b[2] produce a[2]; a[2] and b[3] produce a[3].]
No domain decomposition: each a[i] depends on a[i-1], which is computed in the previous iteration.
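The two loops can be written out as C functions to show what the dependence graphs imply in practice. This is an illustrative sketch: the function names and the generalization from 3 elements to `n` are assumptions, and the OpenMP pragma is one possible way to split the first loop's index range, not something the slides prescribe.

```c
#include <stddef.h>

/* Example #1: a[i] = b[i] / 2.0.
   No iteration reads a value written by another iteration, so the index
   range can be divided among processors (domain decomposition). */
void halve(double *a, const double *b, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] / 2.0;
}

/* Example #2: a[i] = a[i-1] * b[i].
   Iteration i reads a[i-1], which iteration i-1 writes (a loop-carried
   dependence), so the iterations must run in order. a[0] must be set by
   the caller. */
void running_product(double *a, const double *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] * b[i];
}
```

Adding the same pragma to the second loop would be a bug: an iteration could read a[i-1] before the previous iteration has written it.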
