Monthly Archives: December 2011

Annotate: A Plugin for the Ion Torrent Browser

The main topic of this blog post is to detail a plugin that I have developed for the Torrent Browser. There are currently two plugins which does variant calling: (1) Germ-lineVariantCaller is a general variant caller plugin and (2) AmpliSeqCancerVariantCaller is specific to the AmpliSeq Cancer Kit. The plugin “Annotate” supplements the two variant caller plugin currently available as it addresses three important questions in disease genetics.

Novel versus Common Variants

Whether a variant is novel or common in the population. This can be done by seeing if a variant exists in dbSNP (version 132). A tool that can differentiate between novel and common variants saves time as novel variants are more likely to be disease causing compared to common variants.  The Genome Analysis Toolkit (GATK) has an option to incorporate annotation from a VCF file through the -D option but I have decided against using this as the chromosome order in the dbSNP VCF file MUST match with the reference file used for variant calling.  This creates a little dilemma as the hg19 reference stored on the Torrent Server is ordered different to the dbSNP VCF file from the GATK 1.2 resouce bundle. For this plugin, I have decided to index the VCF file using tabix and call the variants outside the GATK framework.

Functional Consequence of Variant

Whether a variant lies within a gene and the functional consequence. For example, does the variant result in an amino acid change? (i.e. non-synonymous variant). Common tools used are SNPEff (Latest update on Christmas Day!!) and ANNOVAR. Although SNPEff uses Gencode annotation and therefore is more comprehensive, it is quite hard to summarize information and the majority of transcripts (ENST) are non-coding, thus for this plugin I have decided to go with ANNOVAR which uses Refseq (NM) annotations.

Functional Impact of Novel Non-Synonymous Variants

Whether a novel non-synonymous variant is likely to have a functional impact on the resulting protein. This can be achieve using functional impact prediction tools. I have decided to use PolyPhen2 and SIFT for predictions as pre-computed values are available as text files on the ANNOVAR download page. I have decided not to use ANNOVAR for calling the functional impact predictions as the implementation is unusually slow. To speed up things I sorted the SIFT and PolyPhen2 prediction text file followed by indexing using tabix. This allows variants to be more efficiently searched within the now sorted text file.

Screenshots

Figure 1. Result from the C01-288 run of the AmpliSeq kit available for download in the Ion Community. All GATK variants called are KNOWN.

Figure 2. Result from the BUT-317 run of CFTR amplicon sequencing available for download in the Ion Community. Only one variant was called by GATK which was a novel variant. As this is a screenshot, you can’t see the tool tip for Polyphen2 (PP2) and SIFT. D = Damaging and SIFT scores < 0.05 are considered damaging.

The plugin can be obtain via the Ion Community at this link. If you find this plugin useful, please vote for it in the Ion Community by clicking on “like”.

 We will be using this plugin in an up coming project using custom designed AmpliSeq primers on 10 large muscle disease causing genes across our undiagnosed patient cohort. Big thanks to Kelly and  Life Technologies for awarding an Application Grant to our lab for this project 😀

Licensing

The Annotate plugin is a shell script which calls a collection of tools. It is important for organizations using this to have a look at the licenses and conditions of use for the following tools: ANNOVAR, PolyPhen2, SIFT, GATK, samtools, Picard Tools and tabix. For instance, ANNOVAR may not be free to use for commerical organizations “ANNOVAR is open-source and free for non-profit use. If you use it for  commercial purposes, please contact Ellen Purpus, director of OTT (PURPUS@email.chop.edu) directly for license related issues.”

Thanks to David from EdgeBio for the feedback. EdgeBio created the first community developed Plugin called SNPEff, a neat plugin and you can check out more details on their blog post.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. We sit on the shoulders on giants – this plugin is a script composed of available open source tools and resources.

Advertisements

Signal Processing – Room for improvements?

During the Ion Torrent User Group Meeting at ASHG, Rothberg talked about how he would approach the Accuracy challenge. He said he would seek out people who did well at the mathematics olympiad and get them to work on the problem and pay them $5,000. What a cheap skate 😀 He also said that there should be more focus on the actual raw signal processing. Thinking on the same lines, Yanick asked in the comments of a previous post, what contribution raw signal processing made with the accuracy improvements in Torrent Suite 1.5.

At the moment with the data processing, there is a way to separate the contributions that raw signal processing makes from signal normalization, dephasing and base calling (all three abbreviated to base calling for the rest of this blog). This can be done by using the 1.wells files generated as this marks the end of raw signal processing. Therefore, we can use a version of Torrent Suite  to do the raw signal processing and a different version to do the base calling using the –from-wells command line option (Figure 1).

Figure 1. Using different versions of Torrent Suite to do signal processing and base calling. Results grouped along the x-axis by which version did the base calling and colored by which 1.wells file version was used as the input (i.e. what version did the signal processing).

Input data is from control library of DH10B sequenced on a 314 chip in our lab (features in a few previous blog posts, including the one on rapid software improvements). This input data was used for each result featured in Figure 1 to make the analysis comparable.  In this bar plot the measure of performance is the number of 100AQ23 reads that results – the current measure of accuracy in the Accuracy challenge. The general conclusion is that signal processing changes between versions makes little contribution on accuracy performance. This is evident as there is little difference in heights between bars within each of the base calling groups along the x-axis. In contrast, all 1.wells files which used v1.5 for base calling showed marked improvements regardless of what version of Torrent Suite was used for signal processing (i.e. 1.wells file). Interestingly, the signal processing from v1.4 appears to perform better than v1.5. This can be largely explained by an increase in the number of beads categorized as “live” in v1.4 compared to v1.5.

There are two possibilities to explain the small contributions made by signal processing:

  1. This year most effort was dedicated to dephasing and normalization which are the source of major improvements in Torrent Suite 1.5. Improvements in signal processing will be the next focus. OR
  2. The current signal processing model has reached it’s limit and a new model needs to be developed in order to see further improvements.

In this post and last post on software improvements, only 314 data was featured. To give a more comprehensive representation, a 316 (STO-409) and 318 (C22-169) run was used to observe accuracy improvements (Figure 2). Thanks to Matt for supplying the 1.wells file for these publicly released runs.

Figure 2. The improvements in base calling between v1.4 and v1.5 using 1.wells file from v1.5 as input. I couldn’t get v1.3 to run on either the 316 or 318 data 😕

What made the analysis featured in this blog post a little challenging was the 1.wells format changed to make use of the HDF5 standard in Torrent Suite 1.5. This has allowed for the files to be better organized, parsed and to be compressed by approximately 2-fold. Torrent Suite 1.5 is able to read in 1.wells files generated by previous versions (i.e. backward compatible) but unfortunately vice versa does not apply 😡 I had to write a small program to convert the 1.wells files (from 1.5) back to the legacy format. Kudos to Simon for the tips 🙂 What was a little concerning is the current implementation loads the whole 1.wells into memory which consumes 25-35%, 50-70% of total memory for the 316 and 318 chip.

<insert some awesome conclusion here> 😀

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

Ion Torrent Sequencing on Humans

Bored to death seeing public releases of sequencing runs from E. coli coming off desktop sequencers? Today, Life Technologies released through the Ion Community a sequencing run that wasn’t from E. coli !! Thank goodness. Ion Torrent released a human shotgun sequencing run from NS12911 (aka Venter published in PLoS Biology). Unfortunately, it was only two separate runs (C18-99 and C24-141) from a 318 chip so has bugger all in terms of coverage. I am extremely grateful for the release of the data set (kudos to Matt in Life Tech for the early access :D), though it would have been much nicer if they released results from a custom capture because at least it wouldn’t be totally useless for analysis. However, all is not lost as the coverage from the mitochondrial genome was sufficient to do some analysis (Figure 1).

Figure 1. Shows the coverage (determined by BEDtools) from using the two supplied BAMs. The coverage output for bwa-sw aligned BAMs produces an almost identical coverage and produces a bit of a mess when it is plotted also.

mitochondrial variants

What’s made the analysis of the mitochondrial genome (chrM) a bit annoying is that the two BAM files supplied don’t appear to be aligned to a hg19 chrM reference. According to the BAM header the chrM it was aligned to was 16569 bases in length (@SQ     SN:chrM LN:16569). This is two bases less than the hg19 or hg18 chrM obtained from UCSC Genome Browsers.  Since I was missing this version of chrM I decided to create my own alignments using bwa-sw, in addition to using tmap and subsequent base calling performed by the Genome Analysis Toolkit (GATK). The table below shows the results (tmap alignment of C24 is missing as it was still grinding away, while writing this 😕

Variants Called
C18-99 (VCF supplied)
29 (including 3 INDELS)
C24-141 (VCF supplied)
30 (including 3 INDELS)
C18-99 (bwa-sw/GATK)
41
C24-141 (bwa-sw/GATK)
39
C18-99 (tmap/GATK)
35

It is hard to determine the overlap between the variants called by the supplied VCF (i.e. by mpileup) and the ones called by GATK as the differences between the two reference chrM creates an off by 1-2 base difference in the coordinates. On inspection the majority of the mpileup calls are due to differences between the two references evident in what is marked as the ref or alt base. Below is the Venn diagram showing the variants called by GATK between the two runs. Using the Integrative Genomics Viewer (IGV), the variants outside the intersection had reads supporting it on the run which it was not called on. The only exception is the 10279C>A variant, which appears to have borderline read support from each of the sequencing runs.

Figure 2. Relationship between the variants called on chrM from the two runs (i.e. C18 and C24). Ideally all variants should be in the intersection.

One noticeable difference is that GATK although the -glm BOTH option was turned on, did not call any insertions or deletions (INDELS) on chrM. Using Integrative Genomics Viewer (IGV), there does not appear to be enough reads supporting the deletions at positions 3105 and 3108. In contrast, the deletion at position 9905 had sufficient reads to support it. However, there appears to be an unusual amount of areas surrounding it in the form of colored bars (i.e. undercalls/overcalls) and black lines (deletions). For those that haven’t used IGV before, the bars/lines running horizontally are the reads which are mostly colored grey as they usually completely match with the reference.

Systematic Biases?

A public release of data would not be complete unless it included an E. coli data set. This release included 194X coverage PGM run from a 318 chip (C22-169). Despite the very high coverage, the supplied VCF file showed there were 36 INDELS, which were all deletions. There seems be a bias in undercalling G or C bases as they account for 33/36, while 4/36 were A or T undercalls. There was a deletion that involved undercalling both a G and a T and hence the appearance that I can’t add. 😳 These variants were counted manually and without a calculator so there may be a mistake anyways 🙂 Using IGV, I had a look at the sequence context for the A/T undercalls. All three (18297543545779, 4497732) have the exact same sequence context, that is AAAATTTT (click on each link to view IGV screen shot). There is a possibility that errors in mapping to low complexity or repetitive regions may also explain some of these instances.

Using the same methodology to identify the G/C undercalls, will help to identify the systematic biases that still remain in terms of base calling. This in combination with Torrent Scout and the wealth of Test Fragments data available would be a good avenue to pursue for the Accuracy challenge. I’ll insert some details on the methods a little later.

Next week I’ll post regarding the contributions signal processing and base calling make in regards to accuracy. Until then back to my PhD thesis and having no life 😥

materials and methods

The hg19 reference file labelled ucsc.hg19.fasta was taken from the GATK 1.3 resource bundle directory.

tmap alignment

#tmap parameters taken from the supplied BAM file header
tmap mapall -R LB:hg19 -R CN:IonSoCal -R PU:PGM/318B
-R ID:16A7I -R SM:polymerase -R DT:2011-11-09T09:33:45
-R DS:100KMTph755uMedta559S788Q -R PL:IONTORRENT
-n 6 -f ucsc.hg19.fasta -r in.fastq -v map1 map2 map3 >
out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=out.sam O=out.bam
SORT_ORDER=coordinate RGPL=454 RGSM=NS12911
RGPU=PGM/318B RGLB=hg19

bwasw alignment

bwa bwasw -t 8 ucsc.hg19 in.fastq > out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=in.sam O=out.bam
SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 RGPU=Random
RGLB=hg19

#all BAMs used as input to GATK must be indexed first
samtools index in.bam
#GATK indel local realignment against known indels

#Mark Duplicates
MarkDuplicates.jar I=in.bam O=out.bam

GATK variant calling

#Variant calling restricted to chrM
GenomeAnalysisTK.jar -et NO_ET -T UnifiedGenotyper -nt 8
-glm BOTH -R ucsc.hg19.fasta -I in.bam -o out.vcf
-L chrM:1-16571

Coverage Plot (Figure 1)

coverageBed -abam in.bam -b mt.bed -d > out.txt
#mt.bed
chrM    1       16571

Comparing two variant call sets (Figure 2)

GenomeAnalysisTK.jar -et NO_ET -T CombineVariants -R
ucsc.hg19.fasta --variant:C18 C18.vcf --variant:C24 C24.vcf
-o merged.vcf

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I dedicate this post to the fish and chips I had two weeks ago at some random place in New Zealand called Tauranga!