Microarray and Bioinformatics
Dr Jingfu Qiu (邱景富), School of Public Health

Aims for the Microarray Bioinformatics
- Understand basic microarray technology and its use in gene expression analysis.
- Learn basic data analysis methods and how to apply them to the analysis of gene expression data:
  - Data acquisition
  - Data normalization
  - Data analysis
  - Data clustering

Vocabulary Review
- Gene: a hereditary DNA sequence at a specific location on a chromosome.
- Genetics: the study of heredity and variation in organisms.
- Genome: an organism's total DNA content (its full DNA sequence).
- Genomics: the study of organisms in terms of their genomes.
- On February 12, 2002, after 10 years and US$2 billion, the Human Genome Project was finally completed, and 99% of the human genome sequence was reported.
- Protein: a sequence of amino acids that "does something".
- Proteomics: the study of all of the proteins that can come from an organism's genome.
- Bioinformatics: the collection, organization and analysis of large-scale, complex biological data.
- Functional genomics: the study of obtaining an overall picture of genome functions, including the expression profiles at the mRNA level and the protein level.

Microarray
- A high-throughput technology that allows detection of thousands of genes simultaneously
- Also called gene chip, biochip or array
- Relies heavily on computer aids
- Central platform for functional genomics

Types of Microarrays
- DNA microarrays, such as cDNA microarrays and oligonucleotide microarrays
- MMChips, for surveillance of microRNA populations
- Protein microarrays
- Tissue microarrays
- Cellular microarrays (also called transfection microarrays)
- Chemical compound microarrays
- Antibody microarrays

Types of DNA Microarrays
1. cDNA chip (DNA microarray, two-channel array)
   - Probe cDNA (500 to 5,000 bases long) is immobilized on a solid surface such as glass by robotic spotting
   - Traditionally called a DNA microarray
   - First developed at Stanford University
2. Gene chip (DNA chip, Affymetrix chip)
   - Oligonucleotides (20- to 80-mer oligos) are synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization
   - Historically called DNA chips
   - Developed at Affymetrix, Inc., under the GeneChip trademark
   - Many companies manufacture oligonucleotide-based chips using alternative technologies

History
- HGP (Human Genome Project): suggested by Dulbecco on Mar. 7, 1986 and started in Oct. 1990; rapid and sensitive techniques were needed for human genome information analysis
- 1980s: the idea was suggested by analogy with the computer chip; W. Bains was the first to try it
- 1990s: Stephen Fodor (now President of Affymetrix) made it work
- 1995: "Quantitative monitoring of gene expression patterns with a complementary DNA microarray"
- End of 1996: the first DNA chip

Microarrays are Popular
- NYU Med Center is now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)
- A PubMed search for "microarray" returns 24,431 papers

What problems can it solve?
- Differences in gene expression over time, between tissues, and between disease states
- Identification of complex genetic diseases
- Drug discovery and toxicology studies
- Mutation/polymorphism detection (SNPs)
- Pathogen analysis

Features
- Parallelism: thousands of genes simultaneously
- Miniaturization: small chip size
- Multiplexing: multiple samples at the same time
- Automation: chip manufacturing, reagents

Differential Gene Expression
A few examples:
- Cell-type specific, e.g. skin cell vs. brain cell
- Developmental stage, e.g. embryonic skin cell vs. adult skin cell
- Disease state, e.g. normal skin cell vs. skin tumor cell
- Environment specific, e.g. skin cell untreated vs. treated (drugs, toxins)

What are its pitfalls?
- It detects the transcription (mRNA) level, not the translation (protein) level
- Many factors (sources of variation) can affect the result:
  - Chip and probe design
  - Experiment design
  - Sample preparation
  - Image acquisition
  - Data normalization
  - Data analysis
- Key to success: you know both the biological problem and the computer aids (software, statistics)

Requirements
- Array spotter
- Array scanner
- Chemistry systems (hybridization system)
- Software

Market Prediction
- 1999: 1 billion USD
- Within 5 years: 20 billion
- 2005: 5 billion (USA)
- 2010: 40 billion (USA)
- Disease diagnostics are not included
- Predicted to replace microelectronics as the largest industry

Principle
- Similar to Northern blotting: base pairing (hybridization) between nucleic acids
- Major differences from Northern blotting (microarray / Northern):
  - Detects thousands of genes simultaneously / individual genes
  - Probes fixed on a glass slide / nylon membrane
  - Target samples labeled with fluorescent / radioactive dNTPs

Designing the Probes
- Specificity: the probes need to be highly specific to avoid hybridization with the wrong target molecules.
- Readability: the probes need to generate an output that is easy to read (spots lie in defined positions and have regular size, shape and spacing).
- Sensitivity: the probes must be sensitive enough to detect the mRNA, and the spot intensity must be distinguishable from background noise.
- Reproducibility: results must be reproducible across multiple experiments.

Spotting Process
[Figures: spotting process; spotting pins; spotting robot (Cheung et al. 1999)]

Affymetrix Gene Chip
[Figures: Affymetrix gene chip; detection of differential expression]

Comparison of Probe Types

In-situ Synthesis / Oligos
Advantages:
- No need to isolate and purify cDNAs, because oligonucleotides can be synthesized.
- Short oligonucleotides are less likely to cross-react with other sequences in the target DNA.
- Chip density is higher than with cDNAs.
Limitations:
- The sequence has to be known.
- Synthesis can be expensive and time-consuming.
- The short sequences are not as specific for target DNA, so appropriate controls must be added.

PCR Products / cDNA Probes
Advantages:
- Flexibility to study cDNAs from any source.
- cDNAs do not require any a priori information about the corresponding genes.
- Longer sequences increase hybridization specificity, which reduces false positives.
Limitations:
- Isolating individual cDNAs to immobilize on each spot can be cumbersome.
- Density is lower than when oligonucleotides are synthesized on the surface of the chip.
- cDNAs are longer sequences and are more likely to contain, by chance, sequences found in target DNA, which results in cross-reactivity.

Many other variations of the technology exist, such as the use of longer oligos, the use of fibre optics, etc.

Spotted Arrays
- Homemade
- Tailored
- Cheaper?
- Maximum 24,000 features per array
- Prone to variability

Affymetrix Arrays
- Commercially available
- "Off the rack"
- More expensive?
- Maximum 500,000 features per array
- Less variability

Process of Manufacturing a Microarray
- Start with individual genes, e.g. the 4,200 genes of the genome of Y. pestis
- Amplify all of them using the polymerase chain reaction (PCR)
- "Spot" them on a medium, e.g. an ordinary glass microscope slide
- Each spot is about 100 µm in diameter
- Spotting is done by a robot
- A complex and potentially expensive task

Whole-Genome Chip Development
[Figure: gene selection (4,015 genes) -> primer design -> PCR amplification of the genes -> product purification and concentration (4,005 genes) -> chip spotting]
- Layout: 48 matrix, 1717 spot array, 8,448 spots in total
- 4,005 Y. pestis genes plus several control DNAs
- Each sample is spotted as two adjacent replicate spots

[Figure: analysis workflow. Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> 16-bit TIFF files -> Image analysis -> (Rfg, Rbg), (Gfg, Gbg) -> Normalization -> R, G -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation]

Microarray Steps
- Experiment and data acquisition
  - Chip manufacturing
  - Sampling and labeling
  - Hybridization
  - Image scanning
  - Data acquisition
- Data normalization
- Data analysis
- Biological interpretation

Reading an Array (cont.)
[Figure]

Color Coding
- Tables are difficult to read, so the data is presented with a color scale
- Coding scheme:
  - Green = gene repressed (less mRNA) in the experiment
  - Red = gene induced (more mRNA) in the experiment
  - Black = no change (1:1 ratio)
- Or:
  - Green = control condition (e.g. aerobic)
  - Red = experimental condition (e.g. anaerobic)
- We only use the ratio

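The color coding scheme above can be sketched in a few lines of Python. This is only an illustration: the function names, the `tolerance` parameter and the intensity values are hypothetical, and the log2 ratio is used because the deck's normalization formula also works on log2(Red/Green).

```python
import math


def log_ratio(red, green):
    """log2 of the experiment (red) intensity over the control (green) intensity."""
    return math.log2(red / green)


def spot_color(red, green, tolerance=0.0):
    """Name the display color of one spot from its red/green ratio.

    tolerance is a hypothetical band around a 1:1 ratio that is still
    shown as black (no change).
    """
    value = log_ratio(red, green)
    if value > tolerance:
        return "red"      # induced: more mRNA in the experiment
    if value < -tolerance:
        return "green"    # repressed: less mRNA in the experiment
    return "black"        # no change (1:1 ratio)


# Made-up intensities for three spots: induced, repressed, unchanged.
for red, green in [(800.0, 200.0), (150.0, 600.0), (500.0, 500.0)]:
    print(red, green, log_ratio(red, green), spot_color(red, green))
```

Working with the ratio rather than the raw intensities means a fourfold induction (+2) and a fourfold repression (-2) sit symmetrically around zero.
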
Noise
Noise sources:
- Sample preparation, labeling, amplification
- Reaction variations
- Environment
- Target volume
- Hybridization parameters (temperature, time, ...)
- Nonspecific hybridization
- Dust
- Scanner settings
- Quantization

Other Image Processing Problems: Spot Quality Problems
[Figure]

[Figure: two slides, P04 vs. P01 (pg2) and A1 vs. P01 (pg2)]

Noise Filtering
- Gridding: identify spot locations
- Segmentation: distinguish foreground from background
  - Fixed circle: put a circle around the foreground area
  - Seeded region growing: identify initial spot "seeds" and grow high-intensity regions
  - Edge detection algorithms
- Background cancellation: Intensity = FGintensity − BGintensity

Normalization
- The word normalization describes techniques used to suitably transform the data before they are analysed.
- The goal is to correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.

Normalization (cont.)
- Normalize the data to correct for artificial variances:
  Red = FGred − BGred
  Green = FGgreen − BGgreen
  PixelValue = log2(Red / Green) − log2(Redavg / Greenavg)
- Pixel color: green if pixel value < 0, red if pixel value > 0

[Figure: calibrated (red and green equally detected) vs. uncalibrated (red light under-detected)]

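As a minimal sketch of the background cancellation and normalization formulas above, the following Python computes PixelValue for a handful of spots. The tuple layout, the function names and the intensities are assumptions made for the example, and it assumes every background-corrected intensity is positive so that the logarithm is defined.

```python
import math


def background_corrected(foreground, background):
    """Intensity = FG intensity - BG intensity."""
    return foreground - background


def normalized_log_ratios(spots):
    """Apply PixelValue = log2(Red/Green) - log2(Redavg/Greenavg).

    spots is a list of (fg_red, bg_red, fg_green, bg_green) tuples,
    a hypothetical layout chosen for this sketch.
    """
    red = [background_corrected(fr, br) for fr, br, _, _ in spots]
    green = [background_corrected(fg, bg) for _, _, fg, bg in spots]
    red_avg = sum(red) / len(red)
    green_avg = sum(green) / len(green)
    offset = math.log2(red_avg / green_avg)  # slide-wide dye imbalance
    return [math.log2(r / g) - offset for r, g in zip(red, green)]


# Made-up slide on which red is detected at half the strength of green
# for every spot, i.e. the "uncalibrated" case with no real biology.
spots = [
    (600.0, 100.0, 1100.0, 100.0),
    (1100.0, 100.0, 2100.0, 100.0),
    (2100.0, 100.0, 4100.0, 100.0),
]
print(normalized_log_ratios(spots))
```

Before normalization every spot on this made-up slide would have a log2 ratio of -1 and be drawn green; subtracting the slide-wide average ratio moves them all to about zero, which is the intended effect of correcting a systematic dye difference.
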
The Origin of Systematic Differences
Systematic differences are due to:
- Dye biases, which vary with spot intensity
- Location on the array
- Plate origin
- Printing quality, which may vary between pins and with time of printing
- Scanning parameters

DNA Array Data Acquisition
- Image analysis software packages exist for analysing the output of custom-made chips (e.g. GenePix Pro, ArrayVision, TIGR Spotfinder)
- A chip description file (CDF) is needed for probe locations

Introduction to Software: SAM
- Significance Analysis of Microarrays
- Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001 98: 5116-5121 (Apr 24).
- Excel plugin
- Free
- Permutation based
- The most published method of microarray data analysis

[Figures: SAM screenshots]

- In the example, Δ = 0.5 was chosen, producing about 65 significant genes and about 5.9 false positives on average.
- The choice of Δ is up to the user, depending on how many false positives he/she is comfortable with.
- The False Discovery Rate (FDR) is computed as the median (or 90th percentile) of the number of falsely called genes divided by the number of genes called significant.

Handling Missing Data
There are currently two options for imputing missing values in SAM:
- Row average: each missing value is imputed with the average of the non-missing values for that gene.
- K-nearest neighbor (the default option): missing values are imputed using a k-nearest-neighbor average in gene space (default k = 10).

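The two imputation options can be illustrated with a small pure-Python sketch. This is not SAM's code: the function names are hypothetical, missing values are represented as `None`, every gene is assumed to have at least one observed value, and the distance between two genes (root mean squared difference over the columns both have observed) is an assumption made for the example.

```python
import math


def impute_row_average(matrix):
    """Replace each missing value with the mean of that gene's observed values."""
    result = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        result.append([mean if v is None else v for v in row])
    return result


def _distance(a, b):
    """Root mean squared difference over columns observed in both genes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def impute_knn(matrix, k=10):
    """Replace each missing value with the average of that column over the
    k nearest genes that have the column observed."""
    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (_distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d != math.inf]
            if nearest:
                result[i][j] = sum(nearest) / len(nearest)
    return result


# Made-up expression matrix: rows are genes, columns are arrays.
genes = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
print(impute_row_average(genes)[0])
print(impute_knn(genes, k=2)[0])
```

In this made-up matrix the row average fills the gap from the gene's own values, while the k-nearest-neighbor version borrows the value from the two genes whose profiles look most like it, which is why the two options can give quite different results.
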
Clustering
- Hypothesis: genes with similar function have similar expression profiles
- Find groups of genes with similar expression profiles
- Find groups of individuals with similar expression profiles within a population

Clustering = Group Identification
[Figure]

Clustering Steps
1. Choose a similarity metric to compare the transcriptional responses or the expression profiles (feature extraction and pattern representation):
   - Pearson correlation
   - Spearman correlation
   - Euclidean distance
2. Choose a clustering algorithm:
   - Hierarchical
   - K-means

Cluster Algorithms
- Unsupervised analysis:
  - Hierarchical
  - K-means
  - Self-organizing maps
  - Others
- Supervised analysis: classification rules

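To make the three similarity metrics named under Clustering Steps concrete, here is a small pure-Python sketch. The function names and the expression profiles are made up for illustration; ties in the Spearman ranking are given average ranks, which is the usual convention.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same shape as gene_a, ten times larger
gene_c = [1.0, 4.0, 9.0, 16.0]      # same ordering as gene_a, not linear

print(pearson(gene_a, gene_b), euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_c), spearman(gene_a, gene_c))
```

The example shows why the choice matters: the two correlations treat `gene_a` and `gene_b` as essentially identical because only the shape of the profile counts, while the Euclidean distance keeps them far apart because it also reflects magnitude.
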
Hierarchical Clustering Example
Eisen et al. (1998), PNAS, 95(25): 14863-14868
http://www.pnas.org/cgi/content/full/95/25/14863
[Figures]

Steps of Hierarchical Clustering
1. Treat each of the n samples as its own cluster.
2. Compute the pairwise distances between the n samples to form a distance matrix.
3. Merge the two closest clusters into a new cluster.
4. Compute the distances between the new cluster and each of the current clusters; keep merging and recomputing until only one cluster remains.
5. Draw the dendrogram, choose the distance cut-off and the resulting groups, and interpret them.
In SPSS: Analyze - Classify - Hierarchical

Hierarchical Clustering
- Find the largest value in the similarity matrix.
- Join the clusters together.
- Recompute the similarity matrix and iterate.

Interpreting the Results
[Figure]

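The hierarchical clustering steps above can be sketched in plain Python. This is a naive illustration rather than the procedure used by any particular package: the slides do not say how the distance between two clusters is defined, so average linkage on Euclidean distance is assumed here, and the gene names and profiles are made up.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters (assumed linkage)."""
    distances = [
        euclidean(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b
    ]
    return sum(distances) / len(distances)


def hierarchical_clustering(profiles):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]       # step 1: one cluster per sample
    merges = []
    while len(clusters) > 1:
        # steps 2-3: find and merge the two closest clusters
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        merges.append((a, b, average_linkage(a, b, profiles)))
        # step 4: replace them with the merged cluster and repeat
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up expression profiles for four genes.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [3.0, 2.0, 1.0],
    "g3": [9.0, 9.0, 9.0],
    "g4": [1.1, 2.1, 3.1],
}
for a, b, distance in hierarchical_clustering(profiles):
    print(a, "+", b, "at distance", round(distance, 3))
```

The merge history is what a dendrogram draws: each merge is a join in the tree at the height of its distance, and cutting the tree at a chosen distance (step 5) keeps only the merges below that height as groups.
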
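k-means Clustering
k-means clustering is a well-known method that assigns cluster membership by minimizing the differences among the items within a cluster while maximizing the distance between clusters. The "means" in k-means refers to the centroid of a cluster, which is an arbitrarily chosen data point that is then refined iteratively until it represents the true mean of all the data points in the cluster. The "k" refers to an arbitrary number of points that are used to seed the clustering process.

The k-means algorithm calculates the squared Euclidean distances between the data records in a cluster and the vector that represents the cluster mean, and converges on a final set of k clusters when that sum reaches its minimum value. The algorithm assigns each data point to exactly one cluster and does not allow for uncertainty in membership. Membership in a cluster is expressed as a distance from the centroid.

Typically, the k-means algorithm is used to create clusters of continuous attributes, where calculating the distance to a mean is straightforward. However, the Microsoft implementation adapts the k-means method to cluster discrete attributes by using probabilities. For discrete attributes, the distance of a data point from a particular cluster is calculated as: 1 − P(data point, cluster).
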
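A minimal sketch of the k-means procedure described above, for continuous attributes only. It is not the Microsoft implementation mentioned in the text: the function names and data points are made up, and using the first k points as seeds is an assumption chosen to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, max_iterations=100):
    """Return (assignment, centers) after iterating until assignments stop changing."""
    centers = [list(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to exactly one cluster: the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                              # converged
        assignment = new_assignment
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    return assignment, centers


# Made-up two-dimensional data with two visible groups.
points = [
    [1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
    [5.2, 4.8], [0.9, 1.1], [4.9, 5.1],
]
assignment, centers = k_means(points, k=2)
print(assignment)
print(centers)
```

Because the seeds are arbitrary, different seeds can converge to different final clusters; running the algorithm from several seeds and keeping the solution with the smallest total squared distance is a common precaution.
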
Recommended Texts
General overview of microarray data analysis:
- "Microarray Gene Expression Data Analysis: A Beginner's Guide" (Causton, Quackenbush and Brazma)
- "Microarray Bioinformatics" (Stekel)
Data mining:
- "Data Mining: Concepts and Techniques" (Han)

Some Useful Links
- Affymetrix: www.affymetrix.c…
- Michael Eisen Lab at LBL (hierarchical clustering software "Cluster" and "Tree View", Windows): rana.lbl.gov/
- Stanford MicroArray Database ("Xcluster", Linux): genome-www4.stanford.edu/MicroArray/SMD/
- Review of Currently Available Microarray Software: www.the-…
- DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

Eisen, M. B., et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95(25): 14863-8.
Wen, X., et al. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci U S A 95(1): 334-9.
Alon, U., et al. (1999). "Broad patterns of gene expression revealed by clustering analysis of tumor an