Biophysics PPT courseware (生物物理ppt课件.pptx)

Uploaded by: 牧羊曲112 Document ID: 2161412 Upload date: 2023-01-22 Format: PPTX Pages: 53 Size: 2.72 MB

Gene Prediction
Xiujie Wang (王秀杰), Institute of Genetics and Developmental Biology, Chinese Academy of Sciences

[Figure: ideal case vs. real world]

What is a gene?
Wilhelm Johannsen's definition of a gene: The word "gene" was first used by Wilhelm Johannsen in 1909, based on the concept developed by Gregor Mendel in 1866.

What is a gene?
Old concept: A gene is the basic physical and functional unit of heredity. Genes, which are made up of DNA, act as instructions to make molecules called proteins.

Gene Prediction
Gene prediction: to identify all genes in a genome.
[Figure: a long run of raw genomic sequence (atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaat...), shown twice; the second copy is followed by the label "Gene"]

Gene prediction is the basis for functional studies.

Finding all the genes in a genome is hard
- Mammalian genomes are large: 8000 km of 10 pt type
- Only about 1% codes for proteins
- Non-coding RNAs are more difficult to predict

[Figures: the structure of prokaryotic genes; promoter structure of prokaryotic genes; the structure of eukaryotic genes]

Open Reading Frames (ORFs)
Protein-coding gene prediction detects potential coding regions by looking for ORFs.
Signals defining ORFs in eukaryotic genes:
- Start codon: ATG
- Stop codons: TAG, TGA, TAA
- Splicing donor sites: usually GT
- Splicing acceptor sites: usually AG
UTRs are usually defined according to expression evidence.

[Figure: types of exons]

Six Frames in a DNA Sequence
DNA replication occurs in the 5'-to-3' direction.
[Figures: six frames in a DNA sequence]

Codon usage selection in translation
[Figure: codon usage in mouse genome]
Uneven usage of codons may characterize a real gene!
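
As a rough illustration of the ORF signals and the six reading frames listed above, the following minimal Python sketch scans both strands for stretches that start with ATG and end at the first in-frame TAG, TGA or TAA. The function names (`reverse_complement`, `find_orfs`) and the `min_codons` threshold are assumptions made for this example and do not come from the slides. The sketch ignores introns and splice sites, so it only finds unspliced ORFs.

```python
from typing import List, Tuple

STOP_CODONS = {"TAG", "TGA", "TAA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, orf_sequence) tuples, where frame is 0, 1 or 2
    and orf_sequence includes the stop codon. min_codons (a hypothetical
    threshold) counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # close the ORF and look for the next ATG
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports a single ORF, `ATGAAATTTTAG`, on the forward strand in frame 2.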

Eukaryotic ORF prediction
Signals defining ORFs in eukaryotic genes:
- Start codon: ATG
- Stop codons: TAG, TGA, TAA
- Splicing donor sites: usually GT
- Splicing acceptor sites: usually AG
- Coding frame
- Codon usage

Gene syntax rules
[Figure: the common gene syntax rules for forward-strand genes]

[Figure: conceptual gene finding framework]

Methods for Eukaryotic Gene Prediction
1. Ab initio method:
- Only uses genomic sequences as input
- GENSCAN (Burge 1997; Burge and Karlin 1997)
- Fgenesh (Solovyev and Salamov 1997)
- Capable of predicting novel genes
2. Transcript-alignment-based method:
- Uses cDNA, mRNA or protein similarity as major clues
- ENSEMBL (Birney et al. 2004)
- High accuracy
- Can only find genes with transcription evidence
3. Hybrid method:
- Integrates EST, cDNA, mRNA or protein alignments into the ab initio method
- Fgenesh+ (Solovyev and Salamov 1997)
- AUGUSTUS+ (Stanke, Schöffmann et al. 2006)
4. Comparative-genomics-based method:
- Assumes coding regions are more conserved [Figure: Genome 1 vs. Genome 2]
- Capable of predicting novel genes and non-protein-coding genes
- Can use transcript data to improve prediction accuracy
- TWINSCAN and N-SCAN (do not use transcript similarity)
- TWINSCAN-EST and N-SCAN-EST (use transcript similarity)

About the ab initio gene prediction methods
Difficult to handle the following cases:
- Nested/overlapped genes
- Polycistronic genes
- Alternative splicing
- Frame-shift errors
- Split start codons
- Non-ATG triplet as the start codon
- Extremely short exons
- Extremely long introns
- Non-canonical introns
- UTR introns

Hidden Markov Model (HMM)
Hidden Markov Model is a commonly used algorithm for gene prediction.

Markov Property
The Markov Property is simply that, given the present state, future states are independent of the past.
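
In symbols, the Markov Property can be written as below. The notation $X_t$ for the state at step $t$ is an assumption introduced here, not taken from the slides.

```latex
P(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_1 = x_1)
  = P(X_{t+1} = x \mid X_t = x_t)
```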

Stochastic processes are generally considered as collections of random variables; a process in which the future depends only on the present state has the Markov Property.

Markov Chain
A Markov Chain is a system that we can use to predict the future given the present.
In a Markov Chain, the present state depends on only two things:
- Previous state
- Probability of moving from previous state to present state

Markov Chain: to estimate the status of students
Suppose graduate students have two types of moods:
- Happy
- Depressed about research
Each type of student has its own Markov chain.
Finally, there are three locations where we can find the students:
- Lab
- Canteen
- Dorm

[Figure: Markov Chain of happy students, with states Lab, Canteen, Dorm]
[Figure: Markov Chain of depressed students, with states Lab, Canteen, Dorm]

Markov Chain Probability
The probability of observing a given sequence is equal to the product of the initial-state probability and all observed transition probabilities.
P(Canteen-Dorm-Lab) = P(Canteen) P(Dorm|Canteen) P(Lab|Dorm)
P(Canteen-Lab) = P(Canteen) P(Lab|Canteen)

Markov Model
A Markov model is a stochastic model used to model a randomly changing system in which it is assumed that future states depend only on the present state.
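
The product rule for a location sequence can be checked numerically with a few lines of Python. This is a minimal sketch: the initial and transition probabilities are invented for illustration, because the slide figures with the actual values did not survive, and the names `INITIAL`, `TRANSITION` and `sequence_probability` are likewise assumptions.

```python
from typing import Dict, Sequence

# Hypothetical numbers for illustration only; they are not from the slides.
INITIAL: Dict[str, float] = {"Lab": 0.5, "Canteen": 0.25, "Dorm": 0.25}
TRANSITION: Dict[str, Dict[str, float]] = {
    "Lab":     {"Lab": 0.5,  "Canteen": 0.25, "Dorm": 0.25},
    "Canteen": {"Lab": 0.5,  "Canteen": 0.0,  "Dorm": 0.5},
    "Dorm":    {"Lab": 0.25, "Canteen": 0.25, "Dorm": 0.5},
}


def sequence_probability(states: Sequence[str]) -> float:
    """P(s1-s2-...-sn) = P(s1) * product over k of P(s_k | s_{k-1})."""
    if not states:
        raise ValueError("need at least one state")
    prob = INITIAL[states[0]]
    # Multiply in one transition probability per consecutive pair of states.
    for prev, cur in zip(states, states[1:]):
        prob *= TRANSITION[prev][cur]
    return prob
```

With these made-up numbers, `sequence_probability(["Canteen", "Dorm", "Lab"])` is 0.25 * 0.5 * 0.25 = 0.03125, and `sequence_probability(["Canteen", "Lab"])` is 0.25 * 0.5 = 0.125.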

Hidden Markov Model
Now we have the general information about the relationship between the student mood and location.
If we simply observe the locations of a student, can we tell what mood he is in?
- Mood is hidden
- Observations are the locations of the students
- Parameters of the model are the probabilities of a student being in a particular location

Hidden Markov Model (HMM): using an HMM to estimate student mood
Observations: LLLCDCLLDDLLCDLDDCDDDDLCLLLCCL
Hidden state: HHHHHHHHHHHHDDDDDDDDDHHHHHH
(In the observations, L = Lab, C = Canteen, D = Dorm; in the hidden states, H = Happy, D = Depressed.)

Application of HMM in gene prediction
What do we want?
- To find coding and non-coding regions in an unlabeled string of DNA sequence
Why are HMMs a good fit for gene prediction?
- DNA sequences are ordered, which is necessary for HMMs
- There is enough training data for what is a gene and what is not a gene
HMMs need to be trained to be truly effective.

[Figures: HMMs for gene prediction]

Cautions about HMMs
Need to be mindful of overfitting
- Need a good training set
- More training data does not always mean a better model
HMMs can be slow (they need proper decoding)
- DNA sequences can be very long, so processing them can be very time consuming
States are supposed to be independent of each other, and this is NOT always true!
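
To make the student-mood example concrete, here is a minimal sketch of decoding hidden moods from observed locations with the Viterbi algorithm, a standard HMM decoding method that the slides do not name explicitly. Every probability in `START`, `TRANS` and `EMIT` is a made-up placeholder, and the function name `viterbi` is an assumption; a gene finder would use the same idea with states such as coding and non-coding and with parameters estimated from training data.

```python
import math
from typing import Dict, List

STATES = ["H", "D"]  # hidden moods: H = Happy, D = Depressed
# All probabilities below are invented placeholders, not values from the slides.
START = {"H": 0.5, "D": 0.5}
TRANS = {"H": {"H": 0.9, "D": 0.1}, "D": {"H": 0.1, "D": 0.9}}
# Emission keys are locations: L = Lab, C = Canteen, D = Dorm.
EMIT = {
    "H": {"L": 0.6, "C": 0.3, "D": 0.1},
    "D": {"L": 0.1, "C": 0.2, "D": 0.7},
}


def viterbi(observations: str) -> str:
    """Return the most probable mood path for a string of L/C/D observations."""
    if not observations:
        return ""
    # Work in log space so that long sequences do not underflow.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    backpointers: List[Dict[str, str]] = []
    for obs in observations[1:]:
        new_score, pointer = {}, {}
        for cur in STATES:
            # Best previous state for arriving in `cur` at this step.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][cur]))
            new_score[cur] = (score[best_prev] + math.log(TRANS[best_prev][cur])
                              + math.log(EMIT[cur][obs]))
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))
```

With these placeholder parameters, `viterbi("LLLLDDDD")` returns `"HHHHDDDD"`: four lab visits are best explained by a happy mood, and the switch to the dorm by a change to depressed.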

Protein-coding genes have specific evolutionary constraints
- Gaps between homologous genes are multiples of three (preserve amino acid translation)
- Mutations are mostly at synonymous positions
- Conservation boundaries are sharp (pinpoint individual splicing signals)

Features for protein coding genes
Dmel TGTTCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT-GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG-CGGCCGTGA-GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG-GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG-GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG-GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC-TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC-GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC-TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
[Figure labels: *, Splice]

Measure of prediction accuracy: Exon Level
[Figure: REALITY vs. PREDICTION tracks, with exons labeled WRONG EXON, CORRECT EXON and MISSING EXON]

Measure of prediction accuracy: Nucleotide Level
C: correct; nc: incorrect; TP: true positive; FP: false positive; FN: false negative; TN: true negative
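
The nucleotide-level counts (TP, FP, FN, TN) can be tallied by comparing a real annotation with a prediction position by position, as in the minimal sketch below. Sensitivity TP/(TP+FN) and precision TP/(TP+FP) (the latter is often called specificity in gene-finding papers) are common summaries, but the slides' own formulas did not survive extraction, so treat these definitions as assumptions. The `E`/`-` string encoding and all function names are hypothetical.

```python
from typing import Dict


def nucleotide_counts(reality: str, prediction: str) -> Dict[str, int]:
    """Count TP/FP/FN/TN per position; 'E' marks a coding base, anything else non-coding."""
    if len(reality) != len(prediction):
        raise ValueError("annotation strings must have the same length")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for real, pred in zip(reality, prediction):
        if real == "E" and pred == "E":
            counts["TP"] += 1  # coding base predicted as coding
        elif pred == "E":
            counts["FP"] += 1  # non-coding base predicted as coding
        elif real == "E":
            counts["FN"] += 1  # coding base that was missed
        else:
            counts["TN"] += 1  # non-coding base predicted as non-coding
    return counts


def sensitivity(counts: Dict[str, int]) -> float:
    """TP / (TP + FN); returns 0.0 when there are no real coding bases."""
    denom = counts["TP"] + counts["FN"]
    return counts["TP"] / denom if denom else 0.0


def precision(counts: Dict[str, int]) -> float:
    """TP / (TP + FP); returns 0.0 when nothing was predicted as coding."""
    denom = counts["TP"] + counts["FP"]
    return counts["TP"] / denom if denom else 0.0
```

For `reality = "--EEEE----EEE-"` and `prediction = "---EEEEE--E---"`, the counts are TP = 4, FP = 2, FN = 3, TN = 5, giving a sensitivity of 4/7 and a precision of 4/6.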

Gene prediction software
[Figures: example of gene finders]

[Figure: accuracy of gene prediction]

Gene prediction in prokaryotes
Gene prediction is easier in microbial genomes.
Why?
- Smaller genomes
- Simpler gene structures
- More sequenced genomes! (for comparative approaches)
Methods?
- Previously, mostly HMM-based
- Now: similarity-based methods, because so many genomes are available

Summary
- Nothing is perfect
- Each gene identification approach has its own features and limitations
- Genome annotation is an ongoing process, and its accuracy is being improved along with the improvement of methods and the accumulation of evidence data
