Bioinformatics: Applications and Main Algorithms

Definitions from the U.S. National Institutes of Health (NIH)
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

What is Bioinformatics/Computational Biology?

Two aspects of bioinformatics
Theoretical studies and application: data analysis, algorithms, tools, applications
Discovering the laws of biology
Decoding the genetic code and understanding the nature of life
Studying the relationships among genomic data
Analyzing existing genomic data
Using mathematical models and artificial intelligence techniques
A new way of doing research in the post-genome era

I. Fundamentals of Molecular Biology

Classification of life: eukaryotes, prokaryotes, archaea
Eukaryotic cells; prokaryotic cells

THE CHEMICAL BASIS OF LIFE
Types of Biological Molecules (1): monosaccharides, disaccharides, oligosaccharides, polysaccharides
Types of Biological Molecules (2): lipids
Types of Biological Molecules (4): amino acids, peptide bond, φ (C-N) and ψ (C-C) angles
Types of Biological Molecules (4): an α-helix; a two-stranded β-sheet; β-turns are four amino acids long and are stabilized by i to i+3 H-bonds.

Protein domain
A protein domain is a part of a protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
Each domain forms a compact three-dimensional structure and often can be independently stable and folded. Many proteins consist of several structural domains. One domain may appear in a variety of different proteins.
Molecular evolution uses domains as building blocks, and these may be recombined in different arrangements to create proteins with different functions.

Types of Biological
Molecules (5)
Types of Biological Molecules (6): nucleotide, base, nucleic acid, DNA, RNA
Types of Biological Molecules (7): rRNA, tRNA, mRNA

Levels of organization of chromatin
DNA, histone, nucleosome, nucleosome filament, chromosome

An overview of the flow of genetic information
The central dogma of genetics; gene expression

Gene Structure of Prokaryotes
[Figure: organization of a bacterial operon. Genomic DNA carrying ORF1, ORF2 and ORF3 is transcribed, from the transcription initiation site to the transcription termination site, into a single transcript (mRNA); translation from separate translation start sites gives many proteins. 1 mRNA, many proteins; 1 transcript, many proteins.]

Gene Structure of Eukaryotes
[Figure: organization of a eukaryotic gene. Genomic DNA with 5'-UTR, exons 1 to 5, introns and 3'-UTR is transcribed into capped, polyadenylated mRNAs, which are translated into proteins. 1 mRNA, 1 protein; 1 gene, many proteins. Alternative products (exons 1, 2, 3, 4, 5 versus 1, 2, 3, 5) are shown with proportions 91%/9% and 18%/82%; panel labels: Liver cells, Neurons.]

II. Overview of Biological Databases

Primary databases: the data come directly from raw experimental results and undergo only simple sorting, organization and annotation.
Secondary databases: the result of curating and classifying raw biomolecular data; they are built on primary databases, experimental data and theoretical analysis for a specific application.

[Figure: bioinformatics databases and tools. Chromosomes, nucleic acids and proteins yield genome maps, DNA sequences, protein sequences and protein structures through genome mapping, sequencing and structure determination. These feed genome databases, nucleic acid sequence databases, protein sequence databases and protein structure databases, from which secondary and composite databases are derived.]

Major international bioinformatics centers
NCBI: National Center for Biotechnology Information (US)
EBI: European Bioinformatics Institute (EU)
HGMP: Human Genome Mapping Project Resource Centre (UK)
ExPASy: Expert Protein Analysis System (Switzerland)
CMBI: Centre for Molecular and Biomolecular Informatics (The Netherlands)
ANGIS: National Genome Information Service (Australia)
NIG: National Institute of Genetics (Japan)

Generally accessible through the web
Google: http://www.google./
Biohunt: http://www.expasy.org/BioHunt/
Amos' links: www.expasy.ch/alinks.html
Nucleic Acids Research: http://www.oxfordjournals.org/nar/database/c/
http://bioinformatics.ca/links_directory/

New Data

Sequence formats of different databases
The first problem met when running sequence analysis software is how to use different sequence formats with different programs. All of these formats are standard ASCII text files, but they differ in the characters or keywords used to present the annotation or the sequence itself.
1 GenBank DNA sequence format
2 EMBL sequence format
3 SwissProt sequence format
4 FASTA sequence format
5 NBRF sequence format
6 Intelligenetics sequence format
7 GCG sequence format
8 PIR/CODATA sequence format
9 Plain/ASCII.Staden sequence format
10 ASN.1 sequence format
11 GDE format

The general-purpose sequence database format: FASTA (or Pearson) format
The first line of a sequence file is free text introduced by a greater-than sign (>), used mainly to label the sequence. The sequence itself starts on the second line, written in standard nucleotide symbols or single-letter amino acid codes. Nucleotide symbols may usually be upper or lower case, whereas amino acids are generally written in upper case. No line in the file should exceed 80 characters (60 is typical).
[Figure: example records for a nucleic acid sequence and an amino acid sequence]

The most important nucleic acid sequence databases
EMBL, European Molecular Biology Laboratory: http://www.embl-heidelberg.de
GenBank, U.S. National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
DDBJ, National Institute of Genetics (Japan): http://www.ddbj.nig.ac.jp/
[Figure: the three databases and their hosts and retrieval systems. GenBank (NCBI, NIH; Entrez), EMBL (EBI; SRS) and DDBJ (NIG, CIB; getentry) each receive submissions and updates.]

Microbial genome data resource
Individual genomes: NCBI's Entrez Gene, EBI's Integr8, UCSC Genome Browser
Multigenome: CMR, ERGO, Microbes Online, PUMA2, IMG, and MBGD
Organism-specific annotation resources: PeerGAD, ASAP
Integrated annotation resources: SEED, PUMA2
Metagenome resources: GenBank, IMG/M, M

Protein sequence database
SWISS-PROT + TrEMBL
The Protein Information Resource (PIR)
UniProt

Protein Domain database
PROSITE: http://expasy.org/prosite/
ProDom: http://prodom.prabi.fr/prodom/current/html/home.php
Pfam: http://pfam.sanger.ac.uk/
SMART: http://smart.embl-heidelberg.de/
InterPro: http://www.ebi.ac.uk/interpro/

Protein Structure Database
PDB: http://www.pdb.org/
PDBsum: http://www.ebi.ac.uk/pdbsum/

Structural Classification of Proteins
SCOP (Structural Classification of Proteins): http://scop.mrc-lmb.cam.ac.uk/scop/
CATH (Class, Architecture, Topology, Homology): http://www.cathdb.info/

Protein secondary structure databases
DSSP (Definition of Secondary Structure of Proteins), a database of secondary structure conformational parameters: http://www.cmbi.kun.nl/gv/dssp/
FSSP (Families of Structurally Similar Proteins), a protein family database: http://www2.embl-ebi.ac.uk/dall/fssp/
HSSP (Homology-Derived Secondary Structure of Proteins), a database of homologous proteins: http://www.cmbi.kun.nl/gv/hssp/

Metabolism-related databases
ENZYME: http://expasy.org/enzyme/
BRENDA: http://www.brenda-enzymes.info/
KEGG: http://www.genome.jp/kegg/
BioCyc: http://biocyc.org/

Database of functional genomics
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
Gene Expression Atlas: http://www.ebi.ac.uk/gxa/
IntAct: http://www.ebi.ac.uk/intact/main.xhtml
SWISS-2DPAGE: http://expasy.org/ch2d/

III. Main Theoretical Methods of Bioinformatics

Methods based on data mining (knowledge discovery): extract the hidden patterns from huge quantities of experimental data, and form hypotheses as a result.
Methods based on simulation (simulation-based analysis): test hypotheses with in silico experiments, providing predictions to be tested by in vitro and in vivo studies.

Main algorithms in bioinformatics

IV. Bioinformatics Methods and Their Applications

Data acquisition: similarity search (BLAST)
Data analysis: multiple sequence alignment, phylogenetic analysis, motif extraction, functional site prediction
Protein structure model prediction: homology modeling

Why do Database Search?
The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function.
Does the DNA sequence contain a gene?
Is the gene a member of a known gene family?
What is the encoded protein?
What is the function of the protein?
Do other organisms have the protein or the gene?

Sequence Implications
The DNA sequence of a gene determines the amino acid sequence of a protein.
The primary sequence determines the structure and function of a gene.
Proteins with similar sequences and structures have similar functions.
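The first statement of this slide, that a gene's DNA sequence determines the protein's amino acid sequence, can be made concrete with a short sketch. It reads a record in the FASTA format described in the database section and translates it with the standard genetic code. The function names `parse_fasta` and `translate` and the toy record are illustrative assumptions, not part of the original slides, and the sketch handles only the forward strand in reading frame 1.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text (hypothetical helper)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # description line, without the '>'
            records[header] = ""
        elif header is not None:
            records[header] += line    # sequence lines are concatenated
    return records


def translate(dna):
    """Translate coding DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy record invented for illustration only.
example = """>toy_gene hypothetical example record
atggccaaa
tggtaa
"""

for name, seq in parse_fasta(example).items():
    print(name, translate(seq))
```

A real gene search would also scan the other five reading frames and, for eukaryotes, account for introns, which this sketch deliberately leaves out.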
Similar sequences should have long regions of identical or similar residues. Why?
Functional convergence without sequence or structure similarity is very rare.

Important Terms for Sequence Similarity Searching (with very different meanings)
Similarity: The
extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
Identity: The extent to which two (nucleotide or amino acid) sequences are
invariant.
Homology: Similarity attributed to descent from a common ancestor.
It is your responsibility as an informed bioinformatician to use these terms correctly: a sequence is either homologous or not. Don't use % with this term!

Some definitions
Scoring: Quantifies the goodness of an alignment. An exact match has the highest score, a substitution a lower score, and insertions and gaps may have negative scores.
Substitution matrix: A symmetrical 20 x 20 matrix (20 amino acids on each side). Each element gives a score that indicates the likelihood that the two residue types would mutate to each other
in evolutionary time.
Gap penalty: Evolutionary events that make gap insertion necessary are relatively rare, so gaps have negative scores. Three types:
1. A single gap-open penalty. This tends to stop gaps from occurring, but once they have been introduced they can grow unhindered.
2. A gap penalty proportional to the gap length. This works against larger gaps.
3. A gap penalty that combines a gap-open value with a gap-length value.

DNA Similarity Scoring: Common Method (DNA Defaults)
Terminal mismatches: 0
Match score: 10
Mismatch penalty: -9
Gap penalty: 50
Gap extension penalty: 3

Commonly Used Substitution Matrices
PAM (Dayhoff)
BLOSUM (Henikoff)

CKHVFCRVCI
CKKCFC-KCV

CKHVFCRVCI
CKKCFCK-CV

C-KHVFCRVCI
CKKC-FC-CKV

CKH-VFCRVCI
CKKC-FC-KCV

However, if we used the more realistic PAM 250 substitution matrix, these alignments would have different scores (and the NWS algorithm would have picked the alignment with the highest one).

Score
with PAM 250 and gap penalty -10
36 + 5 + 0 - 2 + 9 - 2 + 4 - 10 = 40
36 + 5 + 0 - 2 + 9 + 5 + 4 - 10 = 47
36 + 5 - 3 + 9 - 2 + 4 - 3 x 10 = 19
36 + 5 + 0 + 9 - 2 + 4 - 3 x 10 = 22
Gap penalty is important; biology does not like gaps.

BLAST Searches BioDatabase
BLAST is a database search program based on sequence similarity, developed by the U.S. National Center for Biotechnology Information (NCBI). BLAST stands for Basic Local Alignment Search Tool.
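Before going further into BLAST, the scoring slides can be illustrated with a small sketch. It scores an already gapped pairwise DNA alignment with the DNA defaults listed earlier (match 10, mismatch -9, gap 50, gap extension 3). The function name `score_alignment` is hypothetical, and the convention that a gap of length k costs gap_open + k * gap_extend is an assumption, since the slides do not say how the two gap values combine. The "terminal mismatches: 0" rule is not modeled.

```python
def score_alignment(a, b, match=10, mismatch=-9, gap_open=50, gap_extend=3):
    """Score two equal-length gapped sequences under an affine gap penalty.

    Assumption: a gap of length k costs gap_open + k * gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_row = None  # which sequence the gap in the previous column was in
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != gap_row:
                score -= gap_open      # a new gap starts here
            score -= gap_extend        # every gap position pays the extension
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


# Four matches and one gap of length 2: 4 * 10 - (50 + 2 * 3) = -16
print(score_alignment("ACGTTA", "ACG--A"))
# Three matches and one mismatch: 3 * 10 - 9 = 21
print(score_alignment("ACGT", "ACCT"))
```

With these defaults a single gap outweighs several matches, which is the point of the remark that biology does not like gaps. Dropping `gap_open` to 0 gives the length-proportional penalty, and dropping `gap_extend` to 0 gives the single gap-open penalty from the list of three types.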
BLAST is a package of sequence similarity search programs. It contains many independent programs, which are defined by the type of query and the type of database. For example, if the query is a nucleic acid sequence and the database searched is also a nucleic acid sequence database, the blastn program should be chosen.

BLAST programs

Details of the BLAST Algorithm
3 steps:
1. Compile a list of high-scoring words
2. Scan the database for "hits"
3. Extend hits

BLAST Word Matching
Break the query into words:
MEAAVKEEISVEDEAVDKNI
MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
Break database sequences into words:
Database sequence word lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI

Compare Word Lists
Query word list: MEA EAA AAV AVK VKL KEE EEI EIS ISV
Compare word lists by hashing (allow near matches)

Find locations of matching words in database sequences
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
Query word list: MEA EAA AAV AVK KLV KEE EEI EIS ISV

HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
QSVFDYIYYGCYCGWGLG_GK_PRDA
Use two word matches as anchors to build an
alignment between the query and a database sequence. Then score the alignment.
[Figure labels: Query, Seq_XYZ; E-val = 10^-13]

HSPs are Aligned Regions
The results of the word matching and the attempts to extend the alignment are segments called HSPs (High-scoring Segment Pairs).
BLAST often produces several short HSPs rather
than a single aligned region.

BLAST Parameters
Nucleotide BLAST (blastn): word size 7, 11 or 15 (analogous to ktup)
Protein-protein BLAST (blastp): word size 2 or 3; matrix PAM30 or PAM70, BLOSUM80, BLOSUM62 or BLOSUM45; gap costs 12/1, 11/1, 10/1, 9/2, 8/2, 7/2

The two values BLAST uses to assess sequence similarity
Score: the matched segment is scored with a scoring matrix, and the Score is the sum of the scores of all aligned pairs of amino acid residues (or bases). In general, the longer the matched segment and the higher its similarity, the larger the Score.
E value: the number of matches with at least this Score that would be expected by chance when random sequences of the same length are scored; for small values it is close to the probability of obtaining such a Score by chance. The smaller the E value, the less likely it is that the Score arose at random.

Sensitivity vs. Selectivity
Sensitivity: the ability to find as many sequences with some degree of similarity as possible.
Selectivity: the ability to find, as accurately as possible, the similar sequences that are useful for the purpose of the study.

[Figure: multiple sequence alignment of four sequences in two blocks, with a conservation line under each block]
chite  -ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat  -DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr  KKDSNAPKRAMTSFMFFSSDFRS-KHSDLS-IVEMSKAAGAAWKELGP
mouse  -KPKRPRSAYNIYVSESFQ-EAKDDS-AQGKLKLVNEAWKNLSP
       *.:.:.:.*.*:*

chite  AATAKQNYIRALQEYERNGG-
wheat  ANKLKGEYNKAIAAYNKGESA
trybr  AEKDKERYKREM-
mouse  AKDDRIRYDNEMKSWEEQMAE
       *:.*.:

Potential Uses of a Multiple Sequence Alignment?
Extrapolation
Motifs/Patterns
Phylogeny
Profiles
Structure prediction
Multiple alignments are CENTRAL to MOST bioinformatics techniques.
Protein sequences that are 30% identical often have the same structure and function.

Profiles and motifs
A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences.
The term motif refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or a family membership. Motifs can be detected in proteins, DNA and RNA sequences,
but they most commonly refer to protein motifs.
Motifs can be represented for computational purposes as:
Flexible patterns: [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the PROSITE database at www.expasy.org)
Position-specific scoring matrices (PSSM, see next page)
Profile hidden Markov models (HMM): a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.

Profiles and motifs
Profile: The conserved regions of a multiple sequence alignment represented as a matrix. A profile can allow for gaps in the sequence alignments. The columns of the matrix contain substitution scores for the amino acids. The rows contain scores for matching columns of the profile to a test sequence.
Profile: a table which indicates the likelihood that any sequence symbol or gap would occur at any particular position in a "consensus" sequence generated from a set of aligned sequences.

Position specific scoring matrix
This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S

Pos   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
  1 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
  2 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
  3 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
  4 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
  5 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
  6 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
  7 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
  8 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  9  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
 10  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
 11  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
 12  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 13 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 14 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 15 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 16 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 17 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 18 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0

Automated Multiple alignment methods
Multi-dimensional dynamic programming
Progressive alignment
Iterative alignment

Multi-dimensional dynamic programming
Simultaneous multiple alignment; accurate, but computationally very expensive:
2 sequences of length n: n^2 comparisons
The number of comparisons increases exponentially, i.e. n^N, where n is the length of the sequences and N is the number of sequences
It quickly becomes impractical, even for a small number of short sequences

Progressive Alignment
Devised by Feng and Doolittle in 1987.
Essentially a heuristic method, so the alignment is not guaranteed to be "optimal".
Essentially a heuristic method and as such is