Bored to death seeing public releases of sequencing runs from E. coli coming off desktop sequencers? Today, Life Technologies released through the Ion Community a sequencing run that wasn’t from E. coli !! Thank goodness. Ion Torrent released a human shotgun sequencing run from NS12911 (aka Venter published in PLoS Biology). Unfortunately, it was only two separate runs (C18-99 and C24-141) from a 318 chip so has bugger all in terms of coverage. I am extremely grateful for the release of the data set (kudos to Matt in Life Tech for the early access :D), though it would have been much nicer if they released results from a custom capture because at least it wouldn’t be totally useless for analysis. However, all is not lost as the coverage from the mitochondrial genome was sufficient to do some analysis (Figure 1).
Figure 1. Shows the coverage (determined by BEDtools) from using the two supplied BAMs. The coverage output for bwa-sw aligned BAMs produces an almost identical coverage and produces a bit of a mess when it is plotted also.
What’s made the analysis of the mitochondrial genome (chrM) a bit annoying is that the two BAM files supplied don’t appear to be aligned to a hg19 chrM reference. According to the BAM header the chrM it was aligned to was 16569 bases in length (@SQ SN:chrM LN:16569). This is two bases less than the hg19 or hg18 chrM obtained from UCSC Genome Browsers. Since I was missing this version of chrM I decided to create my own alignments using bwa-sw, in addition to using tmap and subsequent base calling performed by the Genome Analysis Toolkit (GATK). The table below shows the results (tmap alignment of C24 is missing as it was still grinding away, while writing this :???:
|C18-99 (VCF supplied)
||29 (including 3 INDELS)|
|C24-141 (VCF supplied)
||30 (including 3 INDELS)|
It is hard to determine the overlap between the variants called by the supplied VCF (i.e. by mpileup) and the ones called by GATK as the differences between the two reference chrM creates an off by 1-2 base difference in the coordinates. On inspection the majority of the mpileup calls are due to differences between the two references evident in what is marked as the ref or alt base. Below is the Venn diagram showing the variants called by GATK between the two runs. Using the Integrative Genomics Viewer (IGV), the variants outside the intersection had reads supporting it on the run which it was not called on. The only exception is the 10279C>A variant, which appears to have borderline read support from each of the sequencing runs.
Figure 2. Relationship between the variants called on chrM from the two runs (i.e. C18 and C24). Ideally all variants should be in the intersection.
One noticeable difference is that GATK although the -glm BOTH option was turned on, did not call any insertions or deletions (INDELS) on chrM. Using Integrative Genomics Viewer (IGV), there does not appear to be enough reads supporting the deletions at positions 3105 and 3108. In contrast, the deletion at position 9905 had sufficient reads to support it. However, there appears to be an unusual amount of areas surrounding it in the form of colored bars (i.e. undercalls/overcalls) and black lines (deletions). For those that haven’t used IGV before, the bars/lines running horizontally are the reads which are mostly colored grey as they usually completely match with the reference.
A public release of data would not be complete unless it included an E. coli data set. This release included 194X coverage PGM run from a 318 chip (C22-169). Despite the very high coverage, the supplied VCF file showed there were 36 INDELS, which were all deletions. There seems be a bias in undercalling G or C bases as they account for 33/36, while 4/36 were A or T undercalls. There was a deletion that involved undercalling both a G and a T and hence the appearance that I can’t add. :oops: These variants were counted manually and without a calculator so there may be a mistake anyways :) Using IGV, I had a look at the sequence context for the A/T undercalls. All three (1829754, 3545779, 4497732) have the exact same sequence context, that is AAAATTTT (click on each link to view IGV screen shot). There is a possibility that errors in mapping to low complexity or repetitive regions may also explain some of these instances.
Using the same methodology to identify the G/C undercalls, will help to identify the systematic biases that still remain in terms of base calling. This in combination with Torrent Scout and the wealth of Test Fragments data available would be a good avenue to pursue for the Accuracy challenge. I’ll insert some details on the methods a little later.
Next week I’ll post regarding the contributions signal processing and base calling make in regards to accuracy. Until then back to my PhD thesis and having no life :cry:
materials and methods
The hg19 reference file labelled ucsc.hg19.fasta was taken from the GATK 1.3 resource bundle directory.
#tmap parameters taken from the supplied BAM file header tmap mapall -R LB:hg19 -R CN:IonSoCal -R PU:PGM/318B -R ID:16A7I -R SM:polymerase -R DT:2011-11-09T09:33:45 -R DS:100KMTph755uMedta559S788Q -R PL:IONTORRENT -n 6 -f ucsc.hg19.fasta -r in.fastq -v map1 map2 map3 > out.sam #Create a sorted BAM file compatible with GATK AddOrReplaceReadGroups.jar I=out.sam O=out.bam SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 RGPU=PGM/318B RGLB=hg19
bwa bwasw -t 8 ucsc.hg19 in.fastq > out.sam #Create a sorted BAM file compatible with GATK AddOrReplaceReadGroups.jar I=in.sam O=out.bam SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 RGPU=Random RGLB=hg19 #all BAMs used as input to GATK must be indexed first samtools index in.bam #GATK indel local realignment against known indels #Mark Duplicates MarkDuplicates.jar I=in.bam O=out.bam
GATK variant calling
#Variant calling restricted to chrM GenomeAnalysisTK.jar -et NO_ET -T UnifiedGenotyper -nt 8 -glm BOTH -R ucsc.hg19.fasta -I in.bam -o out.vcf -L chrM:1-16571
Coverage Plot (Figure 1)
coverageBed -abam in.bam -b mt.bed -d > out.txt #mt.bed chrM 1 16571
Comparing two variant call sets (Figure 2)
GenomeAnalysisTK.jar -et NO_ET -T CombineVariants -R ucsc.hg19.fasta --variant:C18 C18.vcf --variant:C24 C24.vcf -o merged.vcf
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I dedicate this post to the fish and chips I had two weeks ago at some random place in New Zealand called Tauranga!