存储系统容错编码简介课件.ppt

资源描述

《存储系统容错编码简介课件.ppt》由会员分享，可在线阅读，更多相关《存储系统容错编码简介课件.ppt（119页珍藏版）》请在三一办公上搜索。

1、存储系统容错编码简介,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,RAID,Redundant Arrays of Inexpensive DisksRedundant Arrays of Independent Disks容量性能可靠性Chen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H., and Patterson, D. A. “RAID: high-performance, reli

2、able secondary storage.” ACM Computing Surveys 26(2), pp. 143-185, June 1994.,RAID结构,Data Striping,stripe unit,stripe,04KB-1,4KB8KB-1,8KB12KB-1,12KB16KB-1,16KB20KB-1,RAID结构,Redundancy,04KB-1,RAID结构,编码：d1 XOR d2 XOR XOR dn = p解码：di = p XOR d1 XOR XOR di-1 XOR di-1 解码：pnew = p XOR diold XOR dinew,RAID

3、结构,data unit、parity unitRAID5：更新负载均匀分布,RAID结构,RAID5的读写,RAID5的读写,RAID5布局,Edward K. Lee, Randy H. Katz, “The Performance of Parity Placements in Disk Arrays”, IEEE Transactions on Computers, vol. 42 no. 6, pp. 651-664, 1993.,RAID5布局,RAID5布局,RAID0,Hui-I Hsiao and David J. DeWitt, “Chained declustering:

4、 A new availability strategy for multiprocessor database machines”, Technical Report CS TR 854, University of Wisconsin, Madison, June 1989.,RAID0,Gang Wang, Xiaoguang Liu, Sheng Lin, Guangjun Xie, Jing Liu, “Constructing Double- and Triple-erasure-correcting Codes with High Availability Using Mirro

5、ring and Parity Approaches”, ICPADS2007.,What is an Erasure Code?,J. S. Plank, “Erasure Codes for Storage Applications”, Tutorial of the 4th Usenix Conference on File and Storage Technologies, San Francisco, CA, Dec, 2005.,When are they useful?,Anytime you need to tolerate failures.,When are they us

6、eful?,Anytime you need to tolerate failures.,When are they useful?,Anytime you need to tolerate failures.,When are they useful?,Anytime you need to tolerate failures.,When are they useful?,Anytime you need to tolerate failures.,When are they useful?,Anytime you need to tolerate failures.,When are th

7、ey useful?,Anytime you need to tolerate failures.,Terms & Definitions,Number of data disks:nNumber of coding disks:mRate of a code:R = n/(n+m)Identifiable Failure: “Erasure”,The problem, once again,Issues with Erasure Coding,PerformanceEncodingTypically O(mn), but not always.UpdateTypically O(m), bu

8、t not always.DecodingTypically O(mn), but not always.,Issues with Erasure Coding,Space UsageQuantified by two of four:Data Devices:nCoding Devices:mSum of Devices:(n+m)Rate:R = n/(n+m)Higher rates are more space efficient,but less fault-tolerant.,Issues with Erasure Coding,FlexibilityCan you arbitra

9、rily add data / coding nodes?(Can you change the rate)?How does this impact failure coverage?,Trivial Example: Replication,MDSExtremely fast encoding/decoding/update.Rate: R = 1/(m+1) - Very space inefficientThere are many replication/based systems(P2P especially).,Less Trivial Example: Simple Parit

10、y,Patterson D A, Gibson G A, Katz R H, “A case for redundant arrays of inexpensive disks (RAID)”, ACM International Conference on Management of Data, Chicago, ACM Press, 1988, pp. 109-116.P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary

11、 storage. ACM Computing Surveys, 26(2):145185, June 1994.,Evaluating Parity,MDSRate: R = n/(n+1) - Very space efficientOptimal encoding/decoding/update:n-1 XORs to encode & decode2 XORs to updateExtremely popular (RAID Level 5).Downside: m = 1 is limited.,Unfortunately,Those are the last easy things

12、 youll see.For (n 1, m 1), there is no consensus on the best coding technique.They all have tradeoffs.,Why is this such a pain?,Coding theory historically has been the purview of coding theorists.Their goals have had their roots elsewhere (noisy communication lines, byzantine memory systems, etc).Th

13、ey are not systems programmers.(They dont care),内容,RAID、容错编码Reed-Solomon编码二进制线性码,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,Reed-Solomon Codes,The only MDS coding technique for arbitrary n & m.This means that m erasures are always tolerated.Have been around for decades.Expensive.J. S. Plank.

14、A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice& Experience, 27(9):9951012, September 1997.,Reed-Solomon Codes,Operate on binary words of data, composed of w bits, where 2w n+m.,Reed-Solomon Codes,Operate on binary words of data, composed of w bits, wher

15、e 2w n+m.,Reed-Solomon Codes,This means we only have to focus on words, rather than whole devices.,Word size is an issue:If n+m 256, we can use bytes as words.If n+m 65,536, we can use shorts as words.,Reed-Solomon Codes,Codes are based on linear algebra.First, consider the data words as a column ve

16、ctor D:,Reed-Solomon Codes,Codes are based on linear algebra.Next, define an (n+m)*n “Distribution Matrix” B, whose first n rows are the identity matrix:,Reed-Solomon Codes,Codes are based on linear algebra.B*D equals an (n+m)*1 column vector composed ofD and C (the coding words):,Reed-Solomon Codes

17、,This means that each data and coding word has a corresponding row in the distribution matrix.,Reed-Solomon Codes,Suppose m nodes fail.To decode, we create B by deleting the rows of B that correspond to the failed nodes.,Reed-Solomon Codes,Suppose m nodes fail.To decode, we create B by deleting the

18、rows of B that correspond to the failed nodes.Youll note that B*D equals the survivors.,Reed-Solomon Codes,Now, invert B:,Reed-Solomon Codes,Now, invert B:And multiply both sides of the equation by B-1,Reed-Solomon Codes,Now, invert B:And multiply both sides of the equation by B-1Since B-1*B = I, Yo

19、u have just decoded D!,Reed-Solomon Codes,Now, invert B:And multiply both sides of the equation by B-1Since B-1*B = I, You have just decoded D!,Reed-Solomon Codes,To Summarize: EncodingCreate distribution matrix B.Multiply B by the data to create coding words.To Summarize: DecodingCreate B by deleti

20、ng rows of B.Invert B.Multiply B-1 by the surviving words to reconstruct the data.,Reed-Solomon Codes,Two Final Issues:1: How to create B?All square submatrices must be invertible.Derive from a Vandermonde MatrixJ. S. Plank and Y. Ding. Note: Correction to the 1997 tutorial on Reed-Solomon coding. S

21、oftware Practice & Experience, 35(2):189194,2005.#2: Will modular arithmetic work?NO! (no multiplicative inverses)Instead, you must use Galois Field arithmetic.,Reed-Solomon Codes,(n+m)n的范德蒙矩阵基本变换任意两列可交换任何一列可以乘以一个非0数任意两列可做如下变换：Ci=Ci+c*Cj，c非0,Reed-Solomon Performance,Encoding: O(mn)More specificall

22、y: mS (n-1)/BXOR + n/BGFMult S = Size of a deviceBXOR = Bandwith of XOR (3 GB/s)BGFMult = Bandwidth of Multiplication over GF(2w)GF(28): 800 MB/sGF(216): 150 MB/s,Reed-Solomon Performance,Update: O(m)More specifically: m+1 XORs and m multiplications.,Reed-Solomon Performance,Decoding: O(mn) or O(n3)

23、Large devices: dS (n-1)/BXOR + n/BGFMult Where d = number of data devices to reconstruct.Yes, theres a matrix to invert, but usually thats in the noise because dSn n3.,Reed-Solomon Bottom Line,Space Efficient: MDSFlexible:Works for any value of n and m.Easy to add/subtract coding devices.Public-doma

24、in implementations.Slow:n-way dot product for each coding device.GF multiplication slows things down.,Cauchy Reed-Solomon Codes,J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, International Computer Sc

25、ience Institute, August 1995. #1: Use a Cauchy matrix instead of a Vandermonde matrix: Invert in O(n2).#2: Use neat projection to convert Galois Field multiplications into XORs.Kind of subtle, so well go over it.,Cauchy Reed-Solomon Codes,取GF(2w)中m+n个不同元素，构成X=x1, , xm，Y=y1, , ym Cauchy矩阵：元素(i, j)为1/

26、(xi+yj)GF(2w)上运算,Cauchy Reed-Solomon Codes,例：X=1, 2，Y=0, 3, 4, 5, 6,Cauchy Reed-Solomon Codes,Cauchy Reed-Solomon Codes,n、m固定，w增长，GC的优势明显,参考文献,参考资料J. S. Plank. Enumeration of optimal and good Cauchy matrices for Reed-Solomon coding. Technical Report CS-05-570, University of Tennessee, December 2005.

27、通信协议避免重新发送L. Rizzo. Effective erasure codes for reliable computer communication protocols. ACM SIGCOMM Computer Communication Review, 27(2):2436, 1997.L. Rizzo and L. Vicisano. RMDP: an FEC-based reliable multicast protocol for wireless environments. Mobile Computer and Communication Review, 2(2), A

28、pril 1998.,参考文献,已经正在成为IETF标准M. Luby, L. Vicisano, J. Gemmell, L. Rizo, M. Handley, and J. Crowcroft. Forward error correction (FEC) building block. IETF RFC 3452 (http:/www.ietf.org/rfc/rfc3452.txt), December 2002.M. Luby, L. Vicisano, J. Gemmell, L. Rizo, M. Handley, and J. Crowcroft. The use of fo

29、rward error correction(FEC) in reliable multicast. IETF RFC 3453 (http:/www.ietf.org/rfc/rfc3453.txt), December 2002.加密C. S. Jutla. Encryption modes with almost free message integrity. Lecture Notes in Computer Science, 2045, 2001.,参考文献,分布式数据结构W. Litwin and T. Schwarz. Lh*rs: a high-availability sca

30、lable distributed data structure using Reed Solomon codes. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 237248. ACM Press, 2000.降低无线通信能耗P. J. M. Havinga. Energy efficiency of error correction on wireless systems, 1999.广域网、对等网存储系统J. Kubiatowicz, D. Binde

31、l, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS. ACM, November 2000.,参考文献,用于cache而不是冗余J. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fount

32、ain approach to reliable distribution of bulk data. In ACM SIGCOMM 98, pages 5667, Vancouver, August 1998.J.W. Byers, M. Luby, and M. Mitzenmacher. Accessing multiple mirror sites in parallel: Using tornado codes to speed up downloads. In IEEE INFOCOM, pages 275283, New York, NY, March 1999.I. T. Ro

33、wstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-topeer storage utility. In Symposium on Operating Systems Principles, pages 188201, 2001.,参考文献,Tornado码M. Luby, M. Mitzenmacher, and A. Shokrollahi. Analysis of random processes via and-or tree evaluation.

34、 In 9th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann. Practical loss-resilient codes. In 29th Annual ACM Symposium on Theory of Computing, pages 150159, 1997.M. Luby. Benchmark comparisons of erasure codes. http:

35、/www.icsi.berkeley.edu/luby/erasure.html, 2002.（性能比RS码好）,参考文献,传统单节点用于容错和提高性能S. Frolund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. A decentralized algorithm for erasure-coded virtual disks. In DSN-04: International Conference on Dependable Systems and Networks, Florence, Italy, 2004. IEEE.G. R

36、. Goodson, J. J. Wylie, G. R. Ganger, and M. K. Reiter. Efficient byzantine-tolerant erasure-coded storage. In DSN-04: International Conference on Dependable Systems and Networks, Florence, Italy, 2004. IEEE.W.Wilcke et al. The IBM intelligent brick project petabytes and beyond. IBM Journal of Resea

37、rch and Development, to appear, April 2006.归档系统S. Rhea, C. Wells, P. Eaton, D. Geels, B. Zhao, H. Weatherspoon, and J. Kubiatowicz. Maintenance-free global data storage. IEEE Internet Computing, 5(5):4049, 2001.,参考文献,广域网存储：S. Atchley, S. Soltesz, J. S. Plank, M. Beck, and T. Moore. Fault tolerance i

38、n the network storage stack. In IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Ft. Lauderdale, FL, April 2002.R. L. Collins and J. S. Plank. Assessing the performance of erasure codes in the wide-area. In DSN-05: International Conference on Dependable Systems and Networks, Yokoham

39、a, Japan, 2005.IEEE.H. Xia and A. A. Chien. RobuSTore: Robust performance for distributed storage systems. Technical Report CS2005-0838, University of California at San Diego, October 2005.,参考文献,对等网：Z. Zhang and Q. Lian. Reperasure: Replication protocol using erasure-code in peer-to-peer storage net

40、work. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS02), pages 330339, October 2002.J. Li. PeerStreaming: A practical receiver-driven peer-to-peer media streaming system. Technical Report MSR-TR-2004-101, Microsoft Research, September 2004.W. K. Lin, D. M. Chiu, and Y. B. Lee. Erasure

41、code replication revisited. In PTP04: 4th International Conference on Peer-to-Peer Computing. IEEE, 2004.L. Dairaine, J. Lacan, L. Lancerica, and J. Fimes. Content-access QoS in peer-to-peer networks using a fast MDS erasure code. Computer Communications, 28(15):17781790, September 2005.,参考文献,内容分发系统

42、：J. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A digital fountain approach to reliable distribution of bulk data. In ACM SIGCOMM 98, pages 5667, Vancouver, August 1998.M. Mitzenmacher. Digital fountains: A survey and look forward,.In 2004 IEEE Information Theory Workshop, San Antonio, October 200

43、4.,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,Binary Linear Codes,Lisa Hellerstein, Garth A Gibson, Richard M Karp, Randy H Katz, David A Patterson, “Coding techniques for handling failures in large disk arrays”, Algorithmica, vol. 12, no. 2/3

44、, pp. 182-208, 1994. 数据单元划分为重叠的校验组，每个校验组相当于一个RAID4/5,Binary Linear Codes,容错编码评价标准MTTDLcheck disk overheadupdate penaltygroup size扩展性,t维编码,磁盘排列为t维方阵，每维最后一行为校验盘1维：RAID4、RAID5，校验磁盘开销，1/G2维：2/G，但注意，G是校验组大小，不是磁盘数，G2=N，因此是2/sqrt(N)t维：t/G，t/N1/t，t大于3时，冗余率太差,t维编码,校验矩阵表示法,数据位和校验位组成一个向量码字parity check matrixH=

45、P | Ic(k + c)，k数据位数，c校验位数I为cc的单位矩阵，P为ck的0/1矩阵表示校验计算公式，对码字X，应有HX=0P的列数据盘，I的列校验盘行校验组元素：1参与校验组，0未参与,校验矩阵表示法,容错能力的判定,以下命题是等价的H能恢复任何的t擦除故障H能检测任意的t错误任意两个码字的距离distance不相同的位的数目，至少为t+1H的任意t列在GF2上线性无关因此，故障盘是否可恢复对应列是否线性无关！线性无关所有列向量的和或任意非空子集的和不为0,校验矩阵还可以表示其他指标,校验开销：c/k校验组大小：行的权重1的个数更新代价：列的权重一种常见的扩展方式新增加的磁盘全部清0无

46、需重新计算校验，而H相应的增加一列MTTDL用校验矩阵表示比较困难,重构,假定m个盘故障，重排校验矩阵H=A | B，X=d | y，其中B和y对应故障磁盘重构求解yHX=0Ad+By=0求解线性方程组Ad=By此方程组有唯一解（可正确重构）的充要条件是B的各列线性无关无需解整个方程组，抽取出故障校验组对应的行Ad=By，计算y=(B)-1Ad即可,双容错和三容错编码,最优冗余full-2和full-3：所有可能的权重为2（3）的列组成的校验矩阵bad t+1故障：一个数据单元及其t个校验单元高可靠性编码：一个t-erasure-correcting code可恢复除bad t+1故障之外的所

47、有t+1故障图表示法full-2、full-3：完全图二维：完全二部图,2擦除码对比,线性码比较,线性码比较,t3,full-t不具备t容错能力，t维码冗余率太差定理：H=P | I 为校验矩阵，若P的所有列重量为t，且对于P的任意两行，P至多有1列在两行上均为1，则编码能容t错S为P的j列的集合(j=t)，若全来自I，显然和非0至少有一列q来自P，则其他q-1列，每列与q至多有1行同时包含1q的t个1中至多有j-1个在同一行上还有1q至少有一个1在同一行上是唯一的一个1S之和重量至少为1,内容,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,内容

48、,RAID、容错编码Reed-Solomon编码二进制线性码阵列码利用组合数学工具构造容错编码,阵列码（array codes）,数据单元排列为方阵，每列数据单元放在一个盘上沿不同方向组织校验组，每个校验组相当于一个RAID4/5，每个方向的校验单元放置在一个磁盘上可看作线性码的布局形式,EVENODD,双容错水平码，第一个MDS阵列码，可能也是最重要的M. Blaum, J. Brady, J. Bruck, J. Menon, “EVENODD: an efficient scheme for tolerating double disk failures in RAID architec

49、tures”, IEEE Trans. Comput., vol. 44, no. 2, pp. 192-202, 1995.,EVENODD编码,(p-1)*p的数据阵列，两个校验盘，p为素数第一个校验盘：水平校验，RAID4/5第二校验盘：主对角线方向组织S导致计算代价非最优,EVENODD解码,S=all P XOR all Qdependency chain，zigzag decoding,EVENODD推广,容错能力2，数据方阵中选取斜率为0、1、t-1这t个方向组织校验组M. Blaum, J. Bruck, and A. Vardy, “MDS array codes with

50、independent parity symbols”, IEEE Trans. on Information Theory, Vol. 42, No. 2, Mar, 1996, pp. 529-542.并非对所有的n、m=t可保证t容错能力，上文中给出了适用的参数,RDP,双容错水平码，(p-1)*(p-1)的数据阵列，p为素数校验相关水平校验单元作为“数据”参与对角线校验,阵列码和线性码的关系,00,01,02,03,04,10,11,12,13,14,20,21,22,23,24,30,31,32,33,34,40,41,42,43,44,Q0,Q1,Q2,Q3,Q4,P0,P1,P2

展开阅读全文