Introduction to Next-Generation Sequencing Data Analysis
Tong Chunfa (童春发)

Outline
- Principles and workflow of resequencing
- Data structure and quality assessment
- The SRA database and data retrieval
- Using Bowtie2, BWA, and SAMtools

Principles and workflow of resequencing

Data structure and quality assessment
- FASTQ format
- FastQC

FASTQ format
http:/
A FASTQ file containing a single sequence might look like this:
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Illumina sequence identifiers
    @HWUSI-EAS100R:6:73:941:1973#0/1
- Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
- With Casava 1.8 the format of the line has changed:
    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Quality
- A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
- Phred quality score: Q = -10 log10(p)
- The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used: Q = -10 log10(p / (1 - p))

Quality: Encoding
- Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126.
- Illumina's newest version (1.8) of their pipeline, CASAVA, will directly produce FASTQ in Sanger format.
- Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126.
- Starting with Illumina 1.3 and before Illumina 1.8, the
format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126.
- Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning.

American Standard Code for Information Interchange (ASCII)

FastQC
- Double-click "run_fastqc.bat" to run FastQC.
- The analysis reports results for 11 modules:
    Green tick: normal
    Orange triangle: slightly abnormal
    Red cross: very unusual

Basic Statistics
    File type: Conventional base calls
    Encoding: Sanger / Illumina 1.9
    Total Sequences: 3992798
    Filtered Sequences: 0
    Sequence length: 100
    %GC: 37

Per Base Sequence Quality
- The central red line is the median value.
- The yellow box represents the inter-quartile range (25-75%).
- The upper and lower whiskers represent the 10% and 90% points.
- The blue line represents the mean quality.

Per Sequence Quality Scores
- A warning is raised if the most frequently observed mean quality is below 27; this equates to a 0.2%
error rate.
- An error is raised if the most frequently observed mean quality is below 20; this equates to a 1% error rate.

Per Base Sequence Content
- This module issues a warning if the difference between A and T, or G and C, is greater than 10% in any position.
- This module will fail if the difference between A and T, or G and C, is greater than 20% in any position.

Per Base GC Content
- This module issues a warning if the GC content of any base strays more than 5% from the mean GC content.
- This module will fail if the GC content of any base strays more than 10% from the mean GC content.

Per Sequence GC Content
- A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads.
- This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads.

Per Base N Content
- This module raises a warning if any position shows an N content greater than 5%.
- This module will raise an error if any position shows an N content greater than 20%.

Sequence Length Distribution
- This module will raise a warning if all sequences are not the same length.
- This module will raise an error if any of the sequences have zero length.

Duplicate Sequences
- This module will issue a warning if non-unique sequences make up more than 20% of the total.
- This module will issue an error if non-unique sequences make up more than 50% of the total.

Overrepresented Sequences
Sequence  Count  Percentage  Possible Source
AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG  65311  1.636  TruSeq Adapter, Index 10 (97% over
36bp)
ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  6464  0.162  TruSeq Adapter, Index 10 (97% over 36bp)
AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4633  0.116  TruSeq Adapter, Index 10 (97% over 36bp)
AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4463  0.112  TruSeq Adapter, Index 10 (97% over 34bp)
AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC  3994  0.100  No Hit
Warning threshold: 0.1% of the total; failure threshold: 1%.

Overrepresented Kmers
- This module will issue a warning if any k-mer is enriched more than 3-fold overall, or more than 5-fold at any individual position.
- This module will issue an error if any k-mer is enriched more than 10-fold at any individual base position.

Saving a Report

The SRA database and data retrieval
- View and download SRR576183.
- fastq-dump converts SRA files to FASTQ format:
    fastq-dump --split-files -DQ "+" ./SRR576183.sra
    fastq-dump --split-files -DQ "+" --gzip ./SRR576183.sra
- Download FASTQ data directly.

Aligning reads to a reference sequence
- BWA
- Bowtie2
- SOAP
- SAMtools

BWA
- Download: https:/
    tar -xjvf
    cd
    make
- Index the reference, then align single-end or paired-end reads:
    bwa index ref.fa
    bwa mem ref.fa test_PE1.fa > aln-se.sam
    bwa mem ref.fa test_PE1.fa test_PE2.fa > aln-se.sam

Bowtie2
- Download, then set up a working directory:
    mv bowtie2-2.1.0 bowtie2
    cd bowtie2/example
    mkdir work
    cd work
- Index a reference genome:
    ../../bowtie2-build ../reference/lambda_virus.fa lambda_virus
- Aligning single-end reads:
    ../../bowtie2 -x lambda_virus -U ../reads/reads_1.fq -S eg1.sam
- Aligning paired-end reads:
    ../../bowtie2 -x lambda_virus -1 ../reads/reads_1.fq -2 ../reads/reads_2.fq -S eg2.sam
- -U: unpaired reads; -S: output file in SAM format

SAM output
- Name of read that aligned
- Sum of all applicable flags. Flags relevant to Bowtie are:
    1: The read is one of a
pair
    2: The alignment is one end of a proper paired-end alignment
    4: The read has no reported alignments
    8: The read is one of a pair and has no reported alignments
    16: The alignment is to the reverse reference strand
    32: The other mate in the paired-end alignment is aligned to the reverse reference strand
    64: The read is mate 1 in a pair
    128: The read is mate 2 in a pair
- Name of reference sequence where alignment occurs
- 1-based offset into the forward reference strand where leftmost character of the alignment occurs
- Mapping quality
- CIGAR string representation of alignment
- Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.
- 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.
- Inferred fragment size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.
- Read sequence (reverse-complemented if aligned to the reverse strand)
- ASCII-encoded read qualities (reversed if the read aligned to the reverse strand). The encoded quality values are
on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file.
- Optional fields. Fields are tab-separated. bowtie2 outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:
    AS:i: Alignment score. Only present if SAM record is for an aligned read.
    XS:i: Alignment score for second-best alignment. Only present if the SAM record is for an aligned read and more than one alignment was found for the read.
    YS:i: Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.
    XN:i: The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
    XM:i: The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
    XO:i: The number of gap opens, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
    XG:i: The number of gap extensions, for both read and reference gaps, in the alignment. Only present if SAM record is for an aligned read.
    NM:i: The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read.
    YP:i: Equals 1 if the read is part of a pair that has at least N concordant alignments, where N is the argument specified to -M plus one. Equals 0 if the read is part of a pair that has fewer than N alignments. E.g. if -M 2 is specified and 3 distinct, concordant paired-end alignments are found, YP:i:1 will be printed. If fewer than 3 are found, YP:i:0 is
printed. Only present if SAM record is for a read that aligned as part of a paired-end alignment.
    YM:i: Equals 1 if the read aligned with at least N unpaired alignments, where N is the argument specified to -M plus one. Equals 0 if the read aligned with fewer than N unpaired alignments. E.g. if -M 2 is specified and 3 distinct, valid, unpaired alignments are found, YM:i:1 is printed. If fewer than 3 are found, YM:i:0 is printed. Only present if SAM record is for a read that Bowtie 2 attempted to align in an unpaired fashion.
    YF:Z: String indicating reason why the read was filtered out. Only appears for reads that were filtered out.
    MD:Z: A string representation of the mismatched reference bases in the alignment. Only present if SAM record is for an aligned read.

SAMtools
- Install SAMtools: download, then
    tar xjvf
- Or:
    git clone git:/

SAMtools: Primer Tutorial
http://biobits.org/samtools_primer.html
- Sample Data Files
- Aligning Reads Using Bowtie2
- Converting SAM to BAM
- Sorting and Indexing
- Identifying Genomic Variants
- Understanding the VCF Format
- Visualizing Reads

Sample Data Files
    unzip samtools_primer-master.zip

Aligning Reads Using Bowtie2
    cd samtools_primer-master/
    bowtie2/bowtie2 -x indexes/e_coli -U simulated_reads/sim_reads.fq -S sim_reads_aligned.sam

Converting SAM to BAM
    samtools view -b -S -o sim_reads_aligned.bam sim_reads_aligned.sam

Sorting and Indexing (samtools 0.1.x syntax, where the last argument of sort is an output prefix; newer releases use -o)
    samtools sort sim_reads_aligned.bam sim_reads_aligned.sorted
    samtools index sim_reads_aligned.sorted.bam

Identifying Genomic Variants
    samtools mpileup -g -f genomes/NC_008253.fna sim_reads_aligned.sorted.bam > sim_variants.bcf
    bcftools view -c -v sim_variants.bcf > sim_variants.vcf

Understanding the VCF Format

Visualizing Reads
    samtools tview sim_reads_aligned.sorted.bam genomes/NC_008253.fna
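
As a supplement to the Quality slides, the relation between an ASCII quality character, its Phred score Q, and the error probability p can be sketched in Python. This is a minimal illustration, not part of the original deck: the function names `decode_phred` and `error_probability` are hypothetical, and the default offset of 33 assumes Sanger / Illumina 1.8+ encoding (pass `offset=64` for Illumina 1.3 to 1.7 data).

```python
def decode_phred(quality, offset=33):
    """Convert an ASCII-encoded quality string to a list of integer scores.

    offset=33 is the Sanger / Illumina 1.8+ convention; offset=64 applies to
    Illumina 1.3 up to (not including) 1.8.
    """
    scores = [ord(ch) - offset for ch in quality]
    if scores and min(scores) < 0:
        raise ValueError("character below the chosen ASCII offset; wrong encoding?")
    return scores


def error_probability(q):
    """Invert Q = -10 * log10(p) to recover the base-call error probability p."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Quality line of the single-record FASTQ example shown in the deck.
    quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
    print(decode_phred(quality)[:5])                    # [0, 6, 6, 9, 7]
    print(f"Q20 -> p = {error_probability(20):.4f}")    # Q20 -> p = 0.0100
    print(f"Q27 -> p = {error_probability(27):.4f}")    # Q27 -> p = 0.0020
```

The last two lines reproduce the FastQC thresholds quoted for Per Sequence Quality Scores: a mean quality of 20 corresponds to a 1% error rate and 27 to roughly 0.2%.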
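
The FLAG column described in the SAM output slides is a sum of bit values, so a single integer can be unpacked with bitwise tests. The sketch below is only illustrative: the table `SAM_FLAGS` copies the flag descriptions listed in the deck, and the helper names `flag_bits` and `describe_flag` are hypothetical.

```python
# Flag values and meanings as listed in the SAM output slides.
SAM_FLAGS = {
    1: "The read is one of a pair",
    2: "The alignment is one end of a proper paired-end alignment",
    4: "The read has no reported alignments",
    8: "The read is one of a pair and has no reported alignments",
    16: "The alignment is to the reverse reference strand",
    32: "The other mate in the paired-end alignment is aligned to the reverse reference strand",
    64: "The read is mate 1 in a pair",
    128: "The read is mate 2 in a pair",
}


def flag_bits(flag):
    """Return the individual flag values whose bits are set in the summed FLAG."""
    return [bit for bit in SAM_FLAGS if flag & bit]


def describe_flag(flag):
    """Return the human-readable meaning of every flag set in FLAG."""
    return [SAM_FLAGS[bit] for bit in flag_bits(flag)]


if __name__ == "__main__":
    # 99 = 1 + 2 + 32 + 64: paired, properly aligned, mate on the reverse strand, mate 1.
    print(flag_bits(99))  # [1, 2, 32, 64]
    for meaning in describe_flag(99):
        print("-", meaning)
```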
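
To tie the SAM output slides together, a tab-separated alignment line can be split into its eleven mandatory columns plus the optional TAG:TYPE:VALUE fields such as AS:i or MD:Z. This is a minimal sketch under stated assumptions: `SamRecord` and `parse_sam_line` are hypothetical names, the attribute names follow the SAM specification's column names rather than anything in the deck, only the `i` and `f` types are converted to numbers, and the example line is made up for illustration rather than taken from real Bowtie2 output.

```python
from typing import Dict, NamedTuple, Union

TagValue = Union[int, float, str]


class SamRecord(NamedTuple):
    qname: str    # name of the read that aligned
    flag: int     # sum of all applicable flags
    rname: str    # reference sequence name
    pos: int      # 1-based leftmost offset on the forward strand
    mapq: int     # mapping quality
    cigar: str    # CIGAR string
    rnext: str    # mate's reference name ("=" if same, "*" if no mate)
    pnext: int    # mate's 1-based offset (0 if no mate)
    tlen: int     # inferred fragment size (0 if no mate)
    seq: str      # read sequence
    qual: str     # ASCII-encoded qualities (offset 33)
    tags: Dict[str, TagValue]


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        name, kind, value = item.split(":", 2)
        if kind == "i":
            tags[name] = int(value)
        elif kind == "f":
            tags[name] = float(value)
        else:
            tags[name] = value  # Z (string) and other types kept as text
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


if __name__ == "__main__":
    # Made-up record for illustration only.
    example = "read1\t16\tref\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:-3\tXM:i:1\tMD:Z:2A2"
    record = parse_sam_line(example)
    print(record.pos, record.cigar, record.tags)
```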