Ion Torrent Sequencing on Humans

Bored to death seeing public releases of sequencing runs from E. coli coming off desktop sequencers? Today, Life Technologies released through the Ion Community a sequencing run that wasn’t from E. coli !! Thank goodness. Ion Torrent released a human shotgun sequencing run from NS12911 (aka Venter published in PLoS Biology). Unfortunately, it was only two separate runs (C18-99 and C24-141) from a 318 chip so has bugger all in terms of coverage. I am extremely grateful for the release of the data set (kudos to Matt in Life Tech for the early access :D), though it would have been much nicer if they released results from a custom capture because at least it wouldn’t be totally useless for analysis. However, all is not lost as the coverage from the mitochondrial genome was sufficient to do some analysis (Figure 1).

Figure 1. Shows the coverage (determined by BEDtools) from using the two supplied BAMs. The coverage output for bwa-sw aligned BAMs produces an almost identical coverage and produces a bit of a mess when it is plotted also.

mitochondrial variants

What’s made the analysis of the mitochondrial genome (chrM) a bit annoying is that the two BAM files supplied don’t appear to be aligned to a hg19 chrM reference. According to the BAM header the chrM it was aligned to was 16569 bases in length (@SQ     SN:chrM LN:16569). This is two bases less than the hg19 or hg18 chrM obtained from UCSC Genome Browsers.  Since I was missing this version of chrM I decided to create my own alignments using bwa-sw, in addition to using tmap and subsequent base calling performed by the Genome Analysis Toolkit (GATK). The table below shows the results (tmap alignment of C24 is missing as it was still grinding away, while writing this :???:

Variants Called
C18-99 (VCF supplied)
29 (including 3 INDELS)
C24-141 (VCF supplied)
30 (including 3 INDELS)
C18-99 (bwa-sw/GATK)
41
C24-141 (bwa-sw/GATK)
39
C18-99 (tmap/GATK)
35

It is hard to determine the overlap between the variants called by the supplied VCF (i.e. by mpileup) and the ones called by GATK as the differences between the two reference chrM creates an off by 1-2 base difference in the coordinates. On inspection the majority of the mpileup calls are due to differences between the two references evident in what is marked as the ref or alt base. Below is the Venn diagram showing the variants called by GATK between the two runs. Using the Integrative Genomics Viewer (IGV), the variants outside the intersection had reads supporting it on the run which it was not called on. The only exception is the 10279C>A variant, which appears to have borderline read support from each of the sequencing runs.

Figure 2. Relationship between the variants called on chrM from the two runs (i.e. C18 and C24). Ideally all variants should be in the intersection.

One noticeable difference is that GATK although the -glm BOTH option was turned on, did not call any insertions or deletions (INDELS) on chrM. Using Integrative Genomics Viewer (IGV), there does not appear to be enough reads supporting the deletions at positions 3105 and 3108. In contrast, the deletion at position 9905 had sufficient reads to support it. However, there appears to be an unusual amount of areas surrounding it in the form of colored bars (i.e. undercalls/overcalls) and black lines (deletions). For those that haven’t used IGV before, the bars/lines running horizontally are the reads which are mostly colored grey as they usually completely match with the reference.

Systematic Biases?

A public release of data would not be complete unless it included an E. coli data set. This release included 194X coverage PGM run from a 318 chip (C22-169). Despite the very high coverage, the supplied VCF file showed there were 36 INDELS, which were all deletions. There seems be a bias in undercalling G or C bases as they account for 33/36, while 4/36 were A or T undercalls. There was a deletion that involved undercalling both a G and a T and hence the appearance that I can’t add. :oops: These variants were counted manually and without a calculator so there may be a mistake anyways :) Using IGV, I had a look at the sequence context for the A/T undercalls. All three (18297543545779, 4497732) have the exact same sequence context, that is AAAATTTT (click on each link to view IGV screen shot). There is a possibility that errors in mapping to low complexity or repetitive regions may also explain some of these instances.

Using the same methodology to identify the G/C undercalls, will help to identify the systematic biases that still remain in terms of base calling. This in combination with Torrent Scout and the wealth of Test Fragments data available would be a good avenue to pursue for the Accuracy challenge. I’ll insert some details on the methods a little later.

Next week I’ll post regarding the contributions signal processing and base calling make in regards to accuracy. Until then back to my PhD thesis and having no life :cry:

materials and methods

The hg19 reference file labelled ucsc.hg19.fasta was taken from the GATK 1.3 resource bundle directory.

tmap alignment

#tmap parameters taken from the supplied BAM file header
tmap mapall -R LB:hg19 -R CN:IonSoCal -R PU:PGM/318B
-R ID:16A7I -R SM:polymerase -R DT:2011-11-09T09:33:45
-R DS:100KMTph755uMedta559S788Q -R PL:IONTORRENT
-n 6 -f ucsc.hg19.fasta -r in.fastq -v map1 map2 map3 >
out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=out.sam O=out.bam
SORT_ORDER=coordinate RGPL=454 RGSM=NS12911
RGPU=PGM/318B RGLB=hg19

bwasw alignment

bwa bwasw -t 8 ucsc.hg19 in.fastq > out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=in.sam O=out.bam
SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 RGPU=Random
RGLB=hg19

#all BAMs used as input to GATK must be indexed first
samtools index in.bam
#GATK indel local realignment against known indels

#Mark Duplicates
MarkDuplicates.jar I=in.bam O=out.bam

GATK variant calling

#Variant calling restricted to chrM
GenomeAnalysisTK.jar -et NO_ET -T UnifiedGenotyper -nt 8
-glm BOTH -R ucsc.hg19.fasta -I in.bam -o out.vcf
-L chrM:1-16571

Coverage Plot (Figure 1)

coverageBed -abam in.bam -b mt.bed -d > out.txt
#mt.bed
chrM    1       16571

Comparing two variant call sets (Figure 2)

GenomeAnalysisTK.jar -et NO_ET -T CombineVariants -R
ucsc.hg19.fasta --variant:C18 C18.vcf --variant:C24 C24.vcf
-o merged.vcf

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I dedicate this post to the fish and chips I had two weeks ago at some random place in New Zealand called Tauranga!

6 responses to “Ion Torrent Sequencing on Humans

  1. Nice :-)

    Was it the G1k mitochondrial sequence? Or the one in Ensembl maybe?

  2. Hi Mick, thanks for the tips.
    According to the link regarding references used in the 1000 genomes project

    It is the G1k mitochondrial sequence (NC_012920) as you suggested, which is 16,569 bp in length.

    There should be a reference to reference all reference sequences!! :lol:

  3. Andrea who had problems posting a comment so emailed wanted to post the following:
    Hi, congratulation for the blog! we are planning to buy an ion torrent machine and so I’m very interested in what you post on your! I have a question, why are the regions of low coverage equal in the two runs and why the coverage plot in general is very similar? Considering this is a shot gun whole genome sequencing without enrichment phase, I would expect similar levels of coverage through the entire mitochondrial genome. Do you think that it could be due to the sample preparation or the sequence features itself? or something else?

  4. Andrea thanks for your email and perseverance in getting this comment up :) Your question is very interesting as I thought the exact same thing :D I have triple checked my analysis to ensure it wasn’t from incorrectly using coverageBed in BEDtools.

    Firstly, the DNA was sheared using a Covaris followed by clonal amplification. This is a probabilistic event so can not result in entirely uniform coverage. Interestingly, when you look at the coverage over the nuclear genome you will see coverage peaks in repetitive regions. This is due to the abundance of repetitive regions and the difficulty to confidently map to them.

    As the coverage peaks could be thought of as a “finger print”, the reason why the coverage from the two sequencing runs is very similar is that the same library prep was used for both runs.

    Someone please correct me if I am wrong with my interpretations.

  5. I have updated this blog post with the methods that I used to generate the data

  6. As far as I know, GATK disabled Indel calling for 454 reads due to large amounts of false positive calls because of the homopolymer problem. However, I can’t remember where I read that….

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s