《资料结构与演算法.ppt》由会员分享,可在线阅读,更多相关《资料结构与演算法.ppt(98页珍藏版)》请在三一办公上搜索。
1、1,資料結構與演算法,台大資工系 呂學一http:/www.csie.ntu.edu.tw/hil/algo/,2,Suffix Tree,字尾樹、後綴樹an extremely powerful data structure for string algorithms,3,上次:Gusfields algorithm,Input:P and S.Output:All occurrences of P in S.Time:O(|P|+|S|)Technique:Z values of PS.Z(i+|P|)|P|iff P=Sii+|P|1.,P,S,4,To P or not to P.,P
2、reprocessing PGusfieldBoyer-MooreKnuth-Morris-PrattPreprocessing SSuffix tree,5,From Suffix Trie to Suffix Tree,6,Suffixes of S,S=b b a b b a a bS18=b b a b b a a bS28=b a b b a a bS38=a b b a a bS48=b b a a bS58=b a a bS68=a a bS78=a bS88=b,1st suffix,2nd suffix,3rd suffix,4th suffix,5th suffix,6th
3、 suffix,7th suffix,8th suffix,7,KEY:P occurs in S iff P is a prefix of a suffix of S.,S=b b a b b a a bS18=b b a b b a a bS28=b a b b a a bS38=a b b a a bS48=b b a a bS58=b a a bS68=a a bS78=a bS88=b,1st suffix,2nd suffix,3rd suffix,4th suffix,5th suffix,6th suffix,7th suffix,8th suffix,8,T=Suffix T
4、rie of S,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,9,Why suffix trie?,The following statements are equivalent.P occurrs in S.P is a prefix of a suffix of S.P corresponds to a path of T starting from the root of T.,10,P=b a b b a,b b a b b a a b b a b b a a b a b b a a b
5、 b b a a b b a a b a a b a b b,P occurs in S!,11,P=b b a a b a,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,P doesnt occur in S!,12,P=a b b b a a,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,P doesnt occur in S!,13,Q:P出現在S的哪個位置?,14,P=a b b a a,8,
6、4,4,4,4,1,1,1,1,1,5,5,5,2,2,2,2,2,6,7,6,3,3,3,7,3,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,Output:3,15,用suffix trie來解題,Question:If we are given a suffix trie of S,what is the time complexity for finding an occurrence of P in S?Answer:O(|P|)time.,16,建構Suf
7、fix Trie,Q:Time complexity for constructing the suffix trie T of S?,Q(|S|),Q(|S|log|S|),Q(|S|2),Q(|S|3),17,Time=O(|S|2),8,4,4,4,4,1,1,1,1,1,5,5,5,2,2,2,2,2,6,7,6,3,3,3,7,3,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,18,Worst-case time=W(|S|2),How to establi
8、sh a lower bound?Answer:Showing an example which takes W(|S|2)time for any algorithm.,19,S=a a a a b b b b,20,Summary,Suffix trie is good in solving Substring Problem,but may require W(|S|2)time and space.Question:is there a compact representation of suffix trie that needs only O(|S|)time and space?
9、,21,Suffix Tree,A compact representation of suffix trie,22,Observations on Trie T of S,T has at most|S|leaves.Why?T has at most|S|1 branching nodes.Why?,23,Keeping leaves and branching nodes pact representation of edge labels,S=a a a a b b b b,5,8,5,8,5,8,5,8,4,8,1,1,2,2,3,3,24,S=a a a a b b b b,25,
10、S=b b a b b a a b,26,S=b b a b b a a b,1,1,2,3,4,8,7,8,4,8,7,8,4,8,7,8,3,3,3,3,27,S=b b a b b a a b,28,Question,The space complexity of suffix treeO(|S|)O(|S|log|S|)O(|S|2)O(|S|3)Why?Number of nodes=O(|S|).Number of edges=O(|S|).Space required by each edge=O(1).,29,The challenge,Constructing Suffix
11、Tree in Linear Time,30,History of Suffix Tree Algorithms,Weiner,IEEE FOCS 1973Linear time but expensive in space.D.E.Knuth:“the algorithm of 1973”.McCreight,J.ACM 1976Linear time and quadratic space.Ukkonen,Algorithmica 1995Linear time and linear space.Much better readability.,31,Esko Ukkonen,Univer
12、sity of Helsinki,Finland,32,Ukkonens Algorithm,Let T(k)denote the suffix tree of S1k.Frameworkcompute T(1);for k=2 to|S|do compute T(k)from T(k-1);,33,|S|iterations the k-th iteration has k steps,S=b b a b b a a bS18=b b a b b a a bS28=b a b b a a bS38=a b b a a bS48=b b a a bS58=b a a bS68=a a bS78
13、=a bS88=b,34,Ukkonens approach on Suffix Trie,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,35,S=,b,b,a,b,b,a,a,b,1,1,2,3,1,2,3,3,3,3,1,1,2,4,3,4,3,4,2,5,3,5,3,5,2,6,3,6,3,6,3,3,7,7,4,7,3,3,7,7,4,7,2,3,7,7,4,7,7,8,4,8,7,8,4,8,7,8,4,8,The corresponding process of growing suf
14、fix tree,Observation:The tree structure does not change very often,i.e.,only O(|S|)times.,36,Observation,At the beginning of the k-th iteration,there are exactly k“growing points”(生長點),all with different heights.The k-th iteration iteratively grows k edges with label Sk at those k growing points.,37
15、,Ukkonens approach on Suffix Trie,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,38,Growing Suffix Trie,Three cases while growing trieCase 1:growing an edge at a leaf.Case 2:growing a new branch of leaf.Case 3:does not change the tree structure.,39,Ukkonens approach on Suffi
16、x Trie,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,40,三階段定理,Those k steps in the k-th iteration have the following pattern:some(at least one)Case-1 steps,followed by some(could be zero)Case-2 steps,followed by some(could
17、be zero)Case-3 steps.,41,Ukkonens approach on Suffix Trie,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,42,Proving 三階段定理(1),在同一個iteration之中,長此以往之前(下)的step一定也是長此以往。At the end of each iteration,if a growing point is a leaf,th
18、en all previous(lower)growing points are also leaves.Why?,43,An illustration,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,44,Key:what does a leaf growing point mean?,The i-th growing point=the end of the i-th suffix Sik of the current prefix S1k.Argument:the i-th(i 1)growi
19、ng point is a leaf.Neither Sika nor Si.kb is a substring of S1k.Neither Si1ka nor Si1.kb is a substring of S1k.The(i 1)-st growing point is a leaf.,45,An illustration,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,46,What does a Case-1 Step mean?,At the beginning of the curr
20、ent iteration,its corresponding growing point is a leaf.,47,An illustration,b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,48,三階段定理(1):長此以往之前一定是長此以往,At the end of each iteration,if a growing point is a leaf,then all previous(lower)growing points are also leaves.Therefore,in
21、the next iteration,if a step is a“長此以往”,then all previous steps are also“長此以往”.,49,Proving 三階段定理(2),在同一個iteration之中,若無其事之後(上)的step一定也是若無其事。At the end of each iteration,if a growing point is an internal node,then all latter(higher)growing points are also internal.Why?,50,An illustration,b b a b b a b
22、 a b b a a b b a b b a b a a,51,Key:what does an internal growing point mean?,The i-th growing point=the end of the i-th suffix Sik of the current prefix S1k.Argument:the i-th(i k)growing point is internal.Sika or Si.kb is a substring of S1k.Si+1ka or Si+1.kb is a substring of S1k.The(i+1)-st growin
23、g point is internal.,52,An illustration,b b a b b a b a b b a a b b a b b a b a a,53,What does a Case-3 Step mean?,At the end of the current iteration,its corresponding growing point is still an internal node.,54,An illustration,b b a b b a b a b b a a b b a b b a b a a,55,三階段定理(2):若無其事之後一定是若無其事,At
24、the end of each iteration,if a growing point is an internal node,then all latter(higher)growing points are also internal.Therefore,in“this”iteration,if a step is“若無其事”,then all its following steps are also“若無其事”.,56,三階段定理得證,Those k steps in the k-th iteration have the following pattern:some(at least
25、 one)Case-1 steps,followed by some(could be zero)Case-2 steps,followed by some(could be zero)Case-3 steps.,This is about the status within the same iteration,57,The fate of growing points in different iterations,葉子的宿命 一日為葉子,終身為葉子,58,Ukkonens approach on Suffix Trie,b b a b b a a b b a b b a a b a b
26、b a a b b b a a b b a a b a a b a b b,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,59,Observation,Only O(|S|)Case-2 steps throughout the execution.Why?,60,Whats remaining directly working on suffix tree,葉子生長點之所對應的edge label 斷頭指標,61,Taking a closer look,Case-1 Step,62,Closer look at Case-1 step,Always occurs
27、at a leaf(growing point).At the beginning of iteration k,the path above a leaf has label?,k1.Each Case-1 step in iteration k simply changes the label?,k1 to?,k.,?,k1,?,k,63,An idea for Case-1 steps,Use?,-to label the path above each leaf.Thus,no need to do anything for each case-1 step.,?,-,64,Using
28、?,-for each leaf,S=,b,b,a,b,b,a,a,b,1,-,2,-,3,-,3,-,3,3,7,-,4,-,3,3,7,-,4,-,2,3,7,-,4,-,1,1,1,1,2,1,1,1,65,Saving a lot of efforts,We can simply ignore all Case-1 steps.Recall that the number of Case-2 steps is at most|S|.Q:Is this good enough?,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b
29、a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,66,Taking a closer look,Case-2 Steps:節外生枝,67,Closer look at Case-2 step,Always occurs at a growing point that is not a leaf,which is not necessarily an internal node.When the growing point is an internal node Label=k,-,where k is th
30、e current iteration index.,k,-,68,Closer look at Case-2 step,When the growing point is not an internal node,k,-,t,i,j,i+t,j,i,i+t-1,69,Taking a closer look,Case-3 Steps:若無其事,70,Closer look at Case-3 step,Always occurs at a growing point that is not a leaf,which is not necessarily an internal node.Wh
31、en the new position of the growing point is an internal node,t,i,j,0,71,Closer look at Case-3 step,When the new position of the growing point is not an internal node,t,i,j,t+1,72,Exercise,Give a sequence S such that the number of Case-3 steps throughout the process of growing its suffix trie(or suff
32、ix tree)is W(|S|2).,73,How does Ukkonen overcome the problem of too many Case-3(若無其事)steps?,Completely ignore them以若無其事的態度處理若無其事的狀況,74,Why Case-3 steps?,t,i,j,0,t,i,j,t+1,75,Why Case-3 steps?,For correctly maintaining the position of each growing point.(Why?)For correctly running Case-2 steps.(By wh
33、at?),k,-,t,i,j,i+t,j,i,i+t-1,76,What if,Saving the book keeping efforts in all Case-3 steps,77,Saving even more efforts,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,78,Rough idea,Just keep one current growing point through
34、out the execution.Deriving the new position of the current growing point from its previous position(with the helpusing suffix links(斷頭指標),79,Only one growing point,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,The challenges:How do we
35、derive the position of the current growing point?Vertically(case 2)Horizontally(case 3)Q:Which one is easier?,80,Horizontally,Moving from iteration k 1 to iteration k.The growing point does not move!This is the easier case.,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a
36、 a b b b a a b b a a b a a b a b b,81,Horizontal movement,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,82,Vertically,Moving from Step i to Step i+1 in the same iteration.The growing point moves dramatically.This is the tou
37、gher case.,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,83,Vertical movement,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,Case 1:長此以往,Case 2:節外生枝,Case 3:若無其事,84,斷頭指標(suffix link),前人種樹後人涼的哲學,85
38、,前人種樹後人涼,每次千辛萬苦找到vertical movement的目的時,把這個movement的起點與終點用一個link記錄下來.下回遇到這個起點時,就可以直接走到終點去,不用再重新找一次.這些link就叫做suffix link(斷頭指標).,86,為何稱為斷頭指標?,終點所對應的字串,是起點所對應之字串的斷頭字串(second suffix),Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,87,斷頭指標的性質(
39、1),每個斷頭指標的起點一定是個internal node,不會葉子不會是某個suffix tree edge的中間Why?,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,88,斷頭指標的性質(2),每個internal node一定是某個斷頭指標的起點Why?,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b
40、 b b a a b b a a b a a b a b b,89,運用斷頭指標,S=,b,b,a,b,b,a,a,b,1,-,2,-,3,-,3,-,3,3,7,-,4,-,3,3,7,-,4,-,2,3,7,-,4,-,1,1,1,1,2,1,1,1,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,90,Traversal with the help of suffix links:phase(1),Going up to a closest internal no
41、de(whose suffix link must be available).Suppose this upward traversal passes through t characters.Following the suffix link that starts from this internal node.,t,i,j,91,Traversal with the help of suffix links:phase(2),Going down by matching the t-character substring Si,i+t 1 of S.,t,i,j,92,Running
42、Time?,Navely:O(t).Cleverly:O(1+d),where d is the number of internal nodes being went through during phase(2).,t,i,j,93,Overall Time=O(|S|),Suppose di is the d in the i-th Case-2-step traversal.It suffices to show d1+d2+d|S|=O(|S|).,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b
43、 a b b a a b b b a a b b a a b a a b a b b,94,=the“slack”of the growing point,The slack means the distance between a position P and the closest internal node above P.,t,i,j,95,case-3 traversal,Each case-3 traversal(i.e.,horizontal movement)can only increase the value of by at most one.(It can even d
44、ecrease the value of.),Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b,96,case-2 traversal,The i-th case-2 traversal(i.e.,vertical movement)decreases the value of by at least di.,Case 2:節外生枝,Case 3:若無其事,1 2 3 4 5 6 7 8b b a b b a a b b a
45、 b b a a b a b b a a b b b a a b b a a b a a b a b b,97,d1+d2+d|S|=O(|S|),Initial=O(1).can be increased by one for at most|S|times(because there are at most|S|horizontal movements(i.e.,case-3 traversals).Since is always non-negative,the above bound is proved.,98,運用斷頭指標,S=,b,b,a,b,b,a,a,b,1,-,2,-,3,-,3,-,3,3,7,-,4,-,3,3,7,-,4,-,2,3,7,-,4,-,1,1,1,1,2,1,1,1,