Crowd Sourcing 2.0 – Smaller and Faster = Better

It has been almost 3 months since my last post as I recently came out of a cave called PhD thesis writing at the end of March, and once I stepped out, a ton of metaphoric snow (i.e. neglected paperwork) fell on top of me :cry:. I thought I would never get out of “writing hell” where words like “suggests”, “may”, “underlie”, “perhaps”, “however” and “although” were used ad nauseam. Yes, scientific writing is the art of suggesting everything but committing to nothing :lol:. Thanks to everyone who visited and supported Biolektures during the non-active periods.

During my 2012 Q1 hibernation, there have been some very exciting developments in terms of changes to the Ion Torrent crowd sourcing approach. In 2011 Q1, Life Technologies launched four $1 million grand challenges. I consider this a win/win public relations initiative:

  1. If someone achieves the goal of the challenge their solution is probably worth more than one million dollars.
  2. If no one can achieve the goals then it is good publicity because who doesn’t like reading about a competition with a one million dollar prize.

In August, I blogged a three-part series that criticized the challenge in terms of fairness, resources and motivation. The root cause of these problems is the “hugeness” of the challenge – even anti-social unemployed geniuses don’t have the time to understand the entire problem and then identify areas that could be improved. The folks at Life Technologies have finally realized this and have created smaller self-contained problems that require little background knowledge. However, solving smaller problems means smaller prize money, as Life Technologies hasn’t become more generous since 2011. The four challenges hosted on TopCoder that have been run so far are:

  1. DAT Lossless compression. DAT files are the raw voltage data that first comes off the Ion Torrent PGM.
  2. DAT Lossy compression.
  3. SFF compression. SFF files are the processed DAT files from which base-calling can be performed. These can be visualized as Ionograms.
  4. TMAP Smith-Waterman alignment optimization. TMAP is the Ion Torrent optimized sequence alignment tool that comes with the Torrent Suite. For example it maps each of the E. coli sequence reads to its corresponding position on the E. coli genome.

The above four challenges are not directly related to improving Accuracy; that is, they do not reduce the hugeness of the Accuracy challenge into smaller manageable problems. Instead they are aimed at reducing storage space and data analysis times, two extremely important improvements as sequencing throughput is further increased by the introduction of the Ion Proton. However, do smaller problems that are faster to solve actually assist the crowd sourcing community in terms of fairness, resources and motivation?
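Before diving into those three criteria, here is a toy feel for what the compression challenges involve (a generic Python sketch only – this is not the DAT format and not anyone’s winning entry): a smooth sensor trace compresses noticeably better with a general purpose compressor once it is delta-encoded, and exploiting this kind of structure, far more cleverly, is essentially what the challengers were asked to do.

import zlib
import numpy as np

# Fake "voltage" trace: a slowly drifting baseline plus small noise (int16 counts)
rng = np.random.default_rng(0)
trace = (1000 + np.cumsum(rng.normal(0, 2, size=100_000))).astype(np.int16)

raw = trace.tobytes()
delta = np.diff(trace, prepend=trace[:1]).astype(np.int16).tobytes()

print("raw bytes   :", len(zlib.compress(raw, 9)))
print("delta bytes :", len(zlib.compress(delta, 9)))   # typically much smaller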

Fairness

Last year my biggest criticism was that it was impossible to compete against Life Technologies R&D employees who are actually employed to improve the Torrent Suite software. These employees are experienced, have a broad understanding of the whole problem and have months to years of experience working on that specific problem. Having small focused challenges that are independent of context and background addresses this fairness problem. The trick is to recast an Ion Torrent specific problem as a more general problem, making it appealing to experts in that field while providing just enough information that an optimized solution can be achieved. I hinted at this in a previous post. Life Technologies has worked with TopCoder to produce such a format. The compression challenge is a general problem in which mathematical theorems can be exploited and tweaked to form a specific optimized solution. Likewise with the Smith-Waterman alignment, which involves optimization of a dynamic programming algorithm.
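For readers who have not met Smith-Waterman before, the recurrence being optimized looks roughly like the naive Python below (a minimal sketch for illustration only; TMAP’s implementation is a heavily tuned version of this O(nm) dynamic program, and speeding that up was the actual challenge).

# Naive Smith-Waterman local alignment score (illustration only).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # score matrix, first row/column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("TCAGGATTAAGGGCCC", "GGATTACGGGCC"))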

Resources

The Accuracy challenge involves providing a solution that processes REAL Ion Torrent data, and thus the person with the most Ion Torrent data to work with has an unfair advantage. This means people who collaborate or work in labs that heavily use their Ion Torrent PGM are more likely to identify systematic biases and statistical anomalies, and simply have more insight than someone who has access to a humble one or two publicly available data sets. In contrast, the TopCoder hosted challenges provide data sets derived from real data sets through an unknown process, and thus HE/SHE with the most TOYS does not necessarily WIN. Furthermore, these challenges are completely independent of the Torrent Suite framework. This resolves the issue last year where the Torrent Suite source code was available only through the Ion Community, which required registration. This annoyed open source purists such as blogger Peter Cock, but Life Technologies addressed this later in 2011 by releasing the source code on Github.

The Accuracy Grand Challenge required not just a computer and an idea but also Ubuntu, gcc, supporting GNU libraries, etc., and the patience to resolve compile errors in the source code due to subtle environment differences. Alternatively, for the rich kids, there was the option of purchasing time on a Torrent Server instance on Amazon EC2. Small stand-alone problems truly require just a computer and an idea, albeit a genius idea :D.

Motivation

The last barrier to satisfactory participation is that of motivation. Ask a young adult from this Internet generation to clean your house and IF they do a good job you might give them $100. They will most likely tell you to go and F yourself, where F stands for “Find”… yeah right :roll:. Ask them to clean one small window and you will give them $20 if they do it in one hour. The latter task (equivalent to the TopCoder challenges) is more appealing than the former (equivalent to the Accuracy Grand Challenge). Each of the smaller challenges can be done in a few days, and if you are an UBER l33t h4x0r probably in just a few hours. I was surprised by how many submissions there were for the DAT compression challenge within just a few hours. More surprisingly, the top score achieved on the first day was close to the top score at the end, highlighting that the optimal solution can be found by thinking rather than grinding away over days.

There is no moving target, as the challenge runs for two weeks and the top score wins the challenge. More importantly, there is no minimum score such as reducing error by 50%; the person with the top score at the end wins money regardless of score. Also, there are cash prizes for second, third and other places besides the win. For example, some solutions, if tweaked correctly, might perform better than the top scoring solution and thus could be offered prizes as well. Lastly, for people who did not win prizes, the publicly archived leader board can act as a sense of achievement or something challengers can reference when applying for jobs.

Is this model successful so far?

The paradigm shift from “huge” one million dollar challenges to DIRECTED, small, self-contained problems sounds like a winner, but who gives a shit if it does not increase participation or produce optimal solutions :?. The following TopCoder results were kindly provided by Matt Dyer from Life Technologies.

  1. 60% improvement in SFF compression with 10x speed improvement
  2. 20% improvement in DAT Lossless compression
  3. >90% improvement in DAT Lossy compression
  4. 4-6x speed improvement in TMAP Smith-Waterman algorithm

In total this cost $40,000 and a few weeks, money well spent in my opinion :idea:. Although this does not address the accuracy problem, it does serve as a pilot for what can be achieved if the accuracy problem is reduced to smaller problems. Who knows, maybe the problem of performing this reduction could be a challenge in itself. All four achievements listed above require advanced mathematics and computing, so what’s there for the mortals amongst us? This is where the Torrent Browser Plugin challenge comes in, and it is the focus of my next blog post.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct.

Are MiSeq miscalls influenced by preceding homopolymers?

A recent post in the Ion Community Torrent Dev section presents an analysis showing MiSeq systematic substitution errors which appear to be caused by preceding homopolymers (HPs). The Omics Omics blog post provides a very good summary and analysis of the Ion Community post. This analysis was performed on three publicly available data sets: (1) E. coli DH10B, (2) E. coli MG1655 and (3) ultra deep sequencing (UDT) of cancer genes from a recent Genome Biology publication. In the discussion thread that followed, an Ion Community member pointed out that this finding is not entirely novel, as last year a Japanese group published on Illumina sequence-specific error (SSE) profiles in Nucleic Acids Research. They surveyed sequencing data on SRA and concluded it was not an organism or preparation specific problem but a systematic one. In contrast, their SSE was not from HPs but from inverted repeats and CCG sequences. They also present a mechanism that could be causing the SSE.

In this blog post I aim to perform an independent analysis on a data set available through BaseSpace on the Illumina website. The data set is from Bacillus cereus sequencing. I’m not too sure if this is one single run multiplexed with 12 samples or 12 separate runs. Given the coverage, it’s most likely a single multiplexed run. Using a different methodology (see end), I wanted to see if this SSE influenced by preceding homopolymers (HP type SSE) also applies to the Bacillus cereus data set. I will only present the data and let readers draw their own conclusions.

Does this HP type SSE exist in Bacillus cereus?

A screenshot from IGV. This is very similar to what I observed for DH10B and MG1655. The miscalls always appear to be strand specific, and a lot of the time they are towards the end of the read and follow a HP.

How often does it occur in Bacillus cereus?

I analyzed all mapped alignments (i.e. the BAM file) to see whether this problem is isolated or data-set wide. For each miscall, I counted how often a HP of the miscalled base preceded it on the reference. The random expectation distribution was created by simulating the 3 possible mismatches at every position in the genome and counting how often each is preceded by a HP of the mismatched base. This does not take into account coverage bias, but come on peeps, this is a blog not a publication! :P

Note: A and T plots not shown as the bias is not as large as the C/G biases.
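For anyone wanting to reproduce this, the counting boils down to something like the Python sketch below (a simplified re-statement of the method, not the actual code, which lives in a modified bam_md.c; see the Methodology section).

from collections import Counter

def upstream_hp_len(ref, pos, base):
    """Length of the run of `base` on the reference immediately 5' of position `pos` (0-based)."""
    n, i = 0, pos - 1
    while i >= 0 and ref[i] == base:
        n += 1
        i -= 1
    return n

def observed_counts(ref, miscalls):
    # miscalls: iterable of (0-based reference position, miscalled base)
    return Counter(upstream_hp_len(ref, pos, base) for pos, base in miscalls)

def expected_counts(ref):
    # Random expectation: simulate the 3 possible mismatches at every position
    # and tally the upstream homopolymer length of each simulated miscalled base.
    exp = Counter()
    for pos, ref_base in enumerate(ref):
        for base in "ACGT":
            if base != ref_base:
                exp[upstream_hp_len(ref, pos, base)] += 1
    return exp

ref = "ACCCCGATTTTG"
print(observed_counts(ref, [(5, "C"), (11, "T")]))  # two toy miscalls following C and T runs
print(expected_counts(ref))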

Is there overlap between the Bacillus cereus data sets?

All sites at which HP type SSEs were occurring were identified. If this is a systematic bias, we should see overlap between data sets sequenced from the same genome. The overlap appears to be approximately 1/3 when comparing two data sets. This highlights the systematic nature of a good proportion of these HP type SSEs.

Note: Only mismatches with HP length >= 3 were included.

Where in the read do these occur?

Due to the nature of sequencing by synthesis technology, we would expect the majority of these errors to occur much later in the read. This is what we observe in the two data sets. There also seems to be a slight increase in T and C HP type SSEs <50 bases into the read.

Note: Only mismatches with HP length >= 3 were included. This explains why the C/G bias is not obvious. The upward trend is broken at ~140bp due to read trimming and/or alignment/mapping trimming and not because the errors are reduced.

Methodology

I slightly edited the samtools calmd/fillmd source file (i.e. bam_md.c) to produce the metrics required to present each of the results. I will make this available if anyone is interested. This allowed me to use the Illumina supplied BAMs in unmodified form, thus avoiding any further bias caused by using a different aligner or mapping. The annoying thing about these BAMs is that they don’t come with MD tags for each mapped alignment!! Currently the CIGAR string in the SAM/BAM specification does not distinguish between matches and mismatches. Therefore, 150M does not always mean 150 matches! The MD tag allows the positions of mismatches and the correct reference base to be determined.
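As a stand-alone illustration of the MD tag idea (my analysis reused samtools’ C code, so this Python is only a sketch of the parsing logic), the reference base and reference-relative offset of each substitution can be pulled out of the tag like this:

import re

MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def mismatches_from_md(md):
    """Yield (offset_in_aligned_reference, reference_base) for each substitution in an MD tag.

    Offsets are relative to the start of the aligned segment; insertions in the read are
    not described by the MD tag (those live only in the CIGAR string).
    """
    pos = 0
    for matches, deletion, ref_base in MD_TOKEN.findall(md):
        if matches:             # run of matching bases
            pos += int(matches)
        elif deletion:          # bases deleted from the read but present in the reference
            pos += len(deletion) - 1   # skip the leading '^'
        else:                   # a single substituted base; MD stores the reference base
            yield pos, ref_base
            pos += 1

print(list(mismatches_from_md("96C2T50")))   # two mismatches hiding inside a "150M" CIGAR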

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

Annotate: A Plugin for the Ion Torrent Browser

The main topic of this blog post is to detail a plugin that I have developed for the Torrent Browser. There are currently two plugins which do variant calling: (1) Germ-lineVariantCaller, a general variant caller plugin, and (2) AmpliSeqCancerVariantCaller, which is specific to the AmpliSeq Cancer Kit. The plugin “Annotate” supplements the two variant caller plugins currently available as it addresses three important questions in disease genetics.

Novel versus Common Variants

Whether a variant is novel or common in the population. This can be determined by seeing if a variant exists in dbSNP (version 132). A tool that can differentiate between novel and common variants saves time, as novel variants are more likely to be disease causing compared to common variants. The Genome Analysis Toolkit (GATK) has an option to incorporate annotation from a VCF file through the -D option, but I have decided against using this as the chromosome order in the dbSNP VCF file MUST match the reference file used for variant calling. This creates a little dilemma, as the hg19 reference stored on the Torrent Server is ordered differently to the dbSNP VCF file from the GATK 1.2 resource bundle. For this plugin, I have decided to index the dbSNP VCF file using tabix and perform the lookup outside the GATK framework.
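The lookup itself then becomes a simple region query. A minimal sketch of the idea using pysam (the file name and coordinates below are placeholders; the dbSNP VCF is assumed to have been bgzip-compressed and tabix-indexed beforehand, e.g. with bgzip followed by tabix -p vcf):

import pysam

dbsnp = pysam.TabixFile("dbsnp_132.hg19.vcf.gz")   # placeholder file name

def is_known(chrom, pos, ref, alt):
    """True if a matching record exists in the tabix-indexed dbSNP VCF (pos is 1-based)."""
    for line in dbsnp.fetch(chrom, pos - 1, pos):
        fields = line.split("\t")
        if int(fields[1]) == pos and fields[3] == ref and alt in fields[4].split(","):
            return True
    return False

print(is_known("chr7", 117199644, "A", "G"))   # made-up example coordinates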

Functional Consequence of Variant

Whether a variant lies within a gene and what its functional consequence is. For example, does the variant result in an amino acid change (i.e. a non-synonymous variant)? Common tools used are SNPEff (latest update on Christmas Day!!) and ANNOVAR. Although SNPEff uses Gencode annotation and is therefore more comprehensive, it is quite hard to summarize the information and the majority of transcripts (ENST) are non-coding, thus for this plugin I have decided to go with ANNOVAR, which uses RefSeq (NM) annotations.

Functional Impact of Novel Non-Synonymous Variants

Whether a novel non-synonymous variant is likely to have a functional impact on the resulting protein. This can be achieved using functional impact prediction tools. I have decided to use PolyPhen2 and SIFT for predictions, as pre-computed values are available as text files on the ANNOVAR download page. I have decided not to use ANNOVAR itself for retrieving the functional impact predictions, as that implementation is unusually slow. To speed things up, I sorted the SIFT and PolyPhen2 prediction text files and then indexed them using tabix. This allows variants to be searched much more efficiently within the now sorted text files.
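The same trick works for any sorted, tab-delimited table. A rough sketch using pysam’s tabix_index wrapper (the column layout below – chrom, pos, ref, alt, score – is purely hypothetical; the real SIFT and PolyPhen2 files have their own layouts):

import pysam

# One-off preparation: the table must already be sorted by chromosome and position,
# e.g. `sort -k1,1 -k2,2n scores.txt > scores.sorted.txt` on the shell.
compressed = pysam.tabix_index("scores.sorted.txt", seq_col=0, start_col=1, end_col=1,
                               force=True)   # bgzips the file and writes the .tbi index

scores = pysam.TabixFile(compressed)
for line in scores.fetch("chr1", 1000000, 1000001):     # records at 1-based position 1,000,001
    chrom, pos, ref, alt, score = line.split("\t")[:5]
    print(chrom, pos, ref, alt, score)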

Screenshots

Figure 1. Result from the C01-288 run of the AmpliSeq kit available for download in the Ion Community. All GATK variants called are KNOWN.

Figure 2. Result from the BUT-317 run of CFTR amplicon sequencing available for download in the Ion Community. Only one variant was called by GATK which was a novel variant. As this is a screenshot, you can’t see the tool tip for Polyphen2 (PP2) and SIFT. D = Damaging and SIFT scores < 0.05 are considered damaging.

The plugin can be obtained via the Ion Community at this link. If you find this plugin useful, please vote for it in the Ion Community by clicking on “like”.

We will be using this plugin in an upcoming project using custom-designed AmpliSeq primers on 10 large muscle disease causing genes across our undiagnosed patient cohort. Big thanks to Kelly and Life Technologies for awarding an Application Grant to our lab for this project :D

Licensing

The Annotate plugin is a shell script which calls a collection of tools. It is important for organizations using this to have a look at the licenses and conditions of use for the following tools: ANNOVAR, PolyPhen2, SIFT, GATK, samtools, Picard Tools and tabix. For instance, ANNOVAR may not be free to use for commercial organizations: “ANNOVAR is open-source and free for non-profit use. If you use it for commercial purposes, please contact Ellen Purpus, director of OTT (PURPUS@email.chop.edu) directly for license related issues.”

Thanks to David from EdgeBio for the feedback. EdgeBio created the first community developed Plugin called SNPEff, a neat plugin and you can check out more details on their blog post.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. We stand on the shoulders of giants – this plugin is a script composed of available open source tools and resources.

Signal Processing – Room for improvements?

During the Ion Torrent User Group Meeting at ASHG, Rothberg talked about how he would approach the Accuracy challenge. He said he would seek out people who did well at the mathematics olympiad, get them to work on the problem and pay them $5,000. What a cheapskate :D He also said that there should be more focus on the actual raw signal processing. Thinking along the same lines, Yanick asked in the comments of a previous post what contribution raw signal processing made to the accuracy improvements in Torrent Suite 1.5.

With the current data processing pipeline, there is a way to separate the contribution that raw signal processing makes from signal normalization, dephasing and base calling (all three abbreviated to base calling for the rest of this blog). This can be done using the 1.wells files generated, as these mark the end of raw signal processing. Therefore, we can use one version of Torrent Suite to do the raw signal processing and a different version to do the base calling using the --from-wells command line option (Figure 1).

Figure 1. Using different versions of Torrent Suite to do signal processing and base calling. Results grouped along the x-axis by which version did the base calling and colored by which 1.wells file version was used as the input (i.e. what version did the signal processing).

Input data is from a control library of DH10B sequenced on a 314 chip in our lab (featured in a few previous blog posts, including the one on rapid software improvements). This input data was used for each result featured in Figure 1 to make the analysis comparable. In this bar plot the measure of performance is the number of 100AQ23 reads that result, the current measure of accuracy in the Accuracy challenge. The general conclusion is that signal processing changes between versions make little contribution to accuracy performance. This is evident in the small difference in heights between bars within each of the base calling groups along the x-axis. In contrast, all 1.wells files which used v1.5 for base calling showed marked improvements regardless of which version of Torrent Suite was used for signal processing (i.e. to generate the 1.wells file). Interestingly, the signal processing from v1.4 appears to perform better than v1.5. This can be largely explained by an increase in the number of beads categorized as “live” in v1.4 compared to v1.5.

There are two possibilities to explain the small contributions made by signal processing:

  1. This year most effort was dedicated to dephasing and normalization which are the source of major improvements in Torrent Suite 1.5. Improvements in signal processing will be the next focus. OR
  2. The current signal processing model has reached its limit and a new model needs to be developed in order to see further improvements.

In this post and the last post on software improvements, only 314 data was featured. To give a more comprehensive representation, a 316 (STO-409) and a 318 (C22-169) run were used to observe accuracy improvements (Figure 2). Thanks to Matt for supplying the 1.wells files for these publicly released runs.

Figure 2. The improvements in base calling between v1.4 and v1.5 using 1.wells file from v1.5 as input. I couldn’t get v1.3 to run on either the 316 or 318 data :???:

What made the analysis featured in this blog post a little challenging was that the 1.wells format changed to make use of the HDF5 standard in Torrent Suite 1.5. This allows the files to be better organized, parsed and compressed by approximately 2-fold. Torrent Suite 1.5 is able to read in 1.wells files generated by previous versions (i.e. it is backward compatible) but unfortunately the reverse does not apply :mad: I had to write a small program to convert the 1.wells files (from 1.5) back to the legacy format. Kudos to Simon for the tips :) What was a little concerning is that the current implementation loads the whole 1.wells file into memory, which consumes 25-35% and 50-70% of total memory for the 316 and 318 chips respectively.
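For anyone wanting to poke at the new format themselves, h5py makes it easy to see what is inside without knowing the layout in advance (a small exploratory sketch; the dataset names and shapes printed depend entirely on the Torrent Suite version that wrote the file, so none are assumed here). Slicing an HDF5 dataset also only reads the requested chunk, which is one way around the load-everything-into-memory issue.

import h5py

# Walk the HDF5-based 1.wells file and print every dataset's name, shape and dtype.
with h5py.File("1.wells", "r") as f:
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(describe)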

<insert some awesome conclusion here> :D

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

Ion Torrent Sequencing on Humans

Bored to death of seeing public releases of sequencing runs from E. coli coming off desktop sequencers? Today, Life Technologies released through the Ion Community a sequencing run that wasn’t from E. coli!! Thank goodness. Ion Torrent released a human shotgun sequencing run from NS12911 (aka Venter, published in PLoS Biology). Unfortunately, it was only two separate runs (C18-99 and C24-141) from a 318 chip, so it has bugger all in terms of coverage. I am extremely grateful for the release of the data set (kudos to Matt in Life Tech for the early access :D), though it would have been much nicer if they had released results from a custom capture, because at least then it wouldn’t be totally useless for analysis. However, all is not lost, as the coverage of the mitochondrial genome was sufficient to do some analysis (Figure 1).

Figure 1. Coverage (determined by BEDtools) from the two supplied BAMs. The bwa-sw aligned BAMs produce almost identical coverage and also make a bit of a mess when plotted.

Mitochondrial variants

What made the analysis of the mitochondrial genome (chrM) a bit annoying is that the two supplied BAM files don’t appear to be aligned to the hg19 chrM reference. According to the BAM header, the chrM they were aligned to was 16,569 bases in length (@SQ SN:chrM LN:16569). This is two bases shorter than the hg19 or hg18 chrM obtained from the UCSC Genome Browser. Since I was missing this version of chrM, I decided to create my own alignments using bwa-sw, in addition to tmap, with subsequent variant calling performed by the Genome Analysis Toolkit (GATK). The table below shows the results (the tmap alignment of C24 is missing as it was still grinding away while I was writing this :???:).

Run (pipeline)                  Variants called
C18-99 (VCF supplied)           29 (including 3 INDELs)
C24-141 (VCF supplied)          30 (including 3 INDELs)
C18-99 (bwa-sw/GATK)            41
C24-141 (bwa-sw/GATK)           39
C18-99 (tmap/GATK)              35

It is hard to determine the overlap between the variants called in the supplied VCFs (i.e. by mpileup) and the ones called by GATK, as the difference between the two chrM references creates an offset of 1-2 bases in the coordinates. On inspection, the majority of the mpileup calls are due to differences between the two references, evident in what is marked as the ref or alt base. Below is a Venn diagram showing the variants called by GATK in the two runs. Using the Integrative Genomics Viewer (IGV), the variants outside the intersection had supporting reads in the run in which they were not called. The only exception is the 10279C>A variant, which appears to have borderline read support in each of the sequencing runs.

Figure 2. Relationship between the variants called on chrM from the two runs (i.e. C18 and C24). Ideally all variants should be in the intersection.

One noticeable difference is that GATK, although the -glm BOTH option was turned on, did not call any insertions or deletions (INDELs) on chrM. Using the Integrative Genomics Viewer (IGV), there do not appear to be enough reads supporting the deletions at positions 3105 and 3108. In contrast, the deletion at position 9905 had sufficient reads supporting it. However, there appears to be an unusual amount of noise surrounding it in the form of colored bars (i.e. undercalls/overcalls) and black lines (deletions). For those who haven’t used IGV before, the bars/lines running horizontally are the reads, which are mostly colored grey as they usually match the reference completely.

Systematic Biases?

A public release of data would not be complete unless it included an E. coli data set. This release included a 194X coverage PGM run from a 318 chip (C22-169). Despite the very high coverage, the supplied VCF file showed there were 36 INDELs, which were all deletions. There seems to be a bias towards undercalling G or C bases, as they account for 33/36, while 4/36 were A or T undercalls. There was a deletion that involved undercalling both a G and a T, hence the appearance that I can’t add :oops: These variants were counted manually and without a calculator so there may be a mistake anyway :) Using IGV, I had a look at the sequence context for the A/T undercalls. All three (1829754, 3545779, 4497732) have the exact same sequence context, that is AAAATTTT (click on each link to view the IGV screenshot). There is a possibility that errors in mapping to low complexity or repetitive regions may also explain some of these instances.

Using the same methodology to identify the G/C undercalls will help to identify the systematic biases that still remain in base calling. This, in combination with Torrent Scout and the wealth of Test Fragment data available, would be a good avenue to pursue for the Accuracy challenge. I’ll insert some details on the methods a little later.

Next week I’ll post about the contributions signal processing and base calling make to accuracy. Until then, back to my PhD thesis and having no life :cry:

Materials and methods

The hg19 reference file labelled ucsc.hg19.fasta was taken from the GATK 1.3 resource bundle directory.

tmap alignment

#tmap parameters taken from the supplied BAM file header
tmap mapall -R LB:hg19 -R CN:IonSoCal -R PU:PGM/318B \
  -R ID:16A7I -R SM:polymerase -R DT:2011-11-09T09:33:45 \
  -R DS:100KMTph755uMedta559S788Q -R PL:IONTORRENT \
  -n 6 -f ucsc.hg19.fasta -r in.fastq -v map1 map2 map3 \
  > out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=out.sam O=out.bam \
  SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 \
  RGPU=PGM/318B RGLB=hg19

bwasw alignment

bwa bwasw -t 8 ucsc.hg19 in.fastq > out.sam

#Create a sorted BAM file compatible with GATK
AddOrReplaceReadGroups.jar I=in.sam O=out.bam \
  SORT_ORDER=coordinate RGPL=454 RGSM=NS12911 RGPU=Random \
  RGLB=hg19

#all BAMs used as input to GATK must be indexed first
samtools index in.bam
#GATK indel local realignment against known indels

#Mark Duplicates
MarkDuplicates.jar I=in.bam O=out.bam

GATK variant calling

#Variant calling restricted to chrM
GenomeAnalysisTK.jar -et NO_ET -T UnifiedGenotyper -nt 8 \
  -glm BOTH -R ucsc.hg19.fasta -I in.bam -o out.vcf \
  -L chrM:1-16571

Coverage Plot (Figure 1)

coverageBed -abam in.bam -b mt.bed -d > out.txt
#mt.bed
chrM    1       16571

Comparing two variant call sets (Figure 2)

GenomeAnalysisTK.jar -et NO_ET -T CombineVariants \
  -R ucsc.hg19.fasta --variant:C18 C18.vcf --variant:C24 C24.vcf \
  -o merged.vcf

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I dedicate this post to the fish and chips I had two weeks ago at some random place in New Zealand called Tauranga!

Ion Torrent and the Democratization of Sequencing

Back from my North East American trip and still jet-lagged, so I’ll return to the blogosphere with a non-technical post. The term “democratizing sequencing” is synonymous with the Ion Torrent. This probably doesn’t mean Life Technologies are pitching to a bunch of hippie scientists trying to relive the 70s, but what does it mean instead? Definitions of “democracy” usually refer to a form of government, so this more general definition is more suitable: “The practice or principles of social equality”. This post will cover the following components of social equality: economic equality, freedom of speech and freedom of information. This month has seen a massive effort introducing initiatives to emphasize these components.

Economic equality

This world map shows where all the next generation sequencers in the world are located. It relies on facilities self-reporting, so it is not entirely accurate, but it is close because people like to brag :) There are two things you may notice on this map:

  1. The richer countries tend to have more sequencers. This is not surprising as they tend to have more of everything including obese people :P
  2. Within each country it tends to be the richer institutes and universities that have these machines. In the case of my home city, Sydney, there are three sites, with us way out in suburbia.

Given the correlation between high impact publications and next generation sequencing, why aren’t there more in Sydney? Simple answer: it costs at least 1 million dollars to build the infrastructure and then there are ongoing costs. In Australia, this would require many investigators to get together to apply for a massive grant. Too many egos involved, and that’s why it rarely happens. The other alternative is to sell 2 million dollars’ worth of charity chocolates. This would require you to sell one chocolate to approximately every adult in Sydney. If this charity model is successful, we will have an even bigger type 2 diabetes problem :cry:

What most researchers in Australia have to settle for is sending samples to sequencing centres such as the Ramaciotti Centre and the Australian Genome Research Facility (AGRF), which provide a great service for Australian researchers. Then why get a sequencer, most researchers ask? We got a sequencer as a way of controlling each step of the workflow and, more importantly, the time frames in which projects can be completed. Ever collaborated in science before? Felt disappointed by how long things take? Well you are not the only one!! Then you would understand why controlling time frames is SO important for scientists. Most have realized this but have never had the money to act upon it. The Ion Torrent, marketed at USD $50K, is the first time a lab in Australia can seriously say let’s get a sequencing machine. The Illumina MiSeq and Roche Junior are also competitively priced. A carefully planned strategy aligned with local sequencing facilities will now give everyone an equal opportunity to publish in good genetics journals, as economics is no longer a barrier.

Freedom of speech

The advent of the Internet has amplified the freedom of speech of everyone! Something we should not take for granted. In the past (i.e. early 90s), if I wanted to communicate information I would use the following:

  • Publish a book, journal article, TV or radio
  • Local newspaper, public notice boards and town hall meetings
  • Letter box drops
  • Tell my mom!

There would be no way a teenager would have had the ability to use the first option of communication if all they wanted to say was that “they had an epic World of Warcraft Raid” or to share a recording of them “owning a n00b on StarCraft 2”. Unfortunately they now can; it’s called Twitter, Facebook and YouTube :P

Life Technologies has embraced the Internet and freedom of speech through the Ion Community. This site allows members to provide feedback and report problems that they are having with the Ion Torrent. The comments made by members are NOT censored in any way. This allows people like me to say absolutely whatever we want. Most of the time I alternate between skeptical hater and annoying bug. Many are still afraid to speak their minds or even contribute, which is a shame. It is good to say stuff, but it is worthless if you cannot reach your targeted audience. In other words, the reason you complain is that you want something to be done. From my experience, Life Technologies are very fast to respond to comments and try their best to help.

In addition, Ion Torrent is providing strong support to the blogging community. This takes the form of early access to data and resources, allowing bloggers to do what they do best… review and complain :D The release of affordable sequencing technology has seen a massive explosion in technical blogging. I think there are a few reasons for this:

  1. First and foremost it’s affordable, therefore a lot of people want to know more about it and want the opinion of the wise Internet. No one nowadays goes to a restaurant, hotel or buys anything without reading a review on the Internet. Next generation sequencing is no different!
  2. It may be Science but no one can wait for a suppressed report in a journal article which usually goes something like this “we suggest perhaps maybe the Ion Torrent would be good for X, however further research will be required”.
  3. The release of publicly available data sets and, for the first time in the history of Biotech, the exact data sets used to generate the application notes and brochures! This is a gold mine for reviewing and complaining :D
  4. The support of Life Technologies, Illumina and Roche, some more than others. I think they have realized… bloggers are like good global marketers, the only difference being that they pay them absolutely nothing and people tend to believe them more!

Lastly, the greatest display of freedom of speech was allowing me to present at the Ion User Group Meeting. Putting everything in context, I am only a PhD student and quite unpredictable at times. I was given carte blanche, so I really could have said anything I felt like during the 10 minutes. Saying “I was busting to take a piss” during my talk shows I had freedom of speech.

Freedom of information

Currently Biotech companies have two types of customers: their preferred ones and the rest. The preferred customers usually get access to technology and information that other customers will only see at a later date. How do they pick these preferred customers? Who knows! But I know one thing: these customers are usually the richer ones that can afford to do field testing for them. Having this information early gives these preferred customers an unfair advantage in terms of producing preliminary data for grant applications. These are usually the institutes that DO NOT require an advantage to compete for grants. This model is extremely non-democratic and not COOL :(, although it makes economic sense to Biotech companies. There are two initiatives which Ion Torrent launched recently:

  1. Ion AmpliSeq Custom Kit Developer Access
  2. Ion 318 Chip Developer Access

In each of these initiatives, all customers are treated equally and will therefore be provided with information whether they are a preferred customer or not. The main emphasis is on giving back to the community, in other words sharing what you have learned while having early access to the technology. A huge difference from using it to benefit only yourself! This will definitely rock the boat amongst the preferred customers, but it is the only way democracy and freedom of information can be achieved. Illumina, being more established in sequencing, will have a very difficult time doing this, assuming they actually care about democratizing sequencing.

You can put a pipette (noun) in the hand of the scientist but you can’t make them pipette (verb)!

The paradigm shift in the business model implemented by Life Technologies is contingent upon Ion Torrent PGM purchases and the success of the Ion Community. In order to help with the steep learning curve required for sample preparation, Ion Torrent has an Application Grant Program. The emphasis again is on giving back to the community what you have learned. This will greatly help small labs like ours to develop successful workflows in order to produce preliminary data so we can be competitive for large government grants. The grant program is a great incentive to buy a PGM over the MiSeq or Junior.

The Ion Community, like all online forums and communities in general, suffers from the problem of participation. It’s human nature to take more than give. Due to internet lurking, forums typically follow the 1% rule, or the 90:9:1 rule: 1% contribute, 9% edit/moderate, 90% just view. The Ion Community, despite its steady increase in membership, suffers from this same problem. It is no surprise the most active thread is the one where you get to boast about how great your chip runs are, with the possibility of winning a pack of chips. Thankfully, Ion Torrent has learned from this and introduced an initiative called RecogitION, a program which aims to reward regular contributors. This reward system was extremely successful in the Sun Java forum I used to frequent to complain on. I nearly earned myself a free T-shirt :( Some people’s problems are just too difficult! Despite its extremely lame name, RecogitION will make for a more successful, active community.

Scientists have recognized Ion Torrent, through semiconductors, as revolutionizing sequencing. After everything is said and done, it may instead be recognized as the first Biotech to make a bold move in embracing Internet culture and what it stands for: DEMOCRACY.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I dedicate this post to the all you can eat mud crabs in Rock Hall, Maryland. I try to send you bankrupt by eating all the crabs… only got to number 6 :(

Ion Torrent – Rapid Software Improvements

This is the second post of what is now to be a four-part series looking at how Ion Torrent accuracy has improved over time. In this edition, I will show what a massive difference software can make with this technology. The results presented here were only possible because the software is open source. In addition, Mike and Mel have given me early access to binaries (ion-Analysis v1.61) that will be released in Torrent Suite v1.5. That’s a huge thank you to Mel and Mike!

There are three major areas that software can improve:

  1. Throughput – Identify more Ion Sphere Particles (ISPs) with library fragments therefore increasing the total number of reads.
  2. Quality – More and longer reads aligning back to the reference sequence.
  3. Speed – Reduce the computational time to analyze the data.

The way I am going to present this data is to keep the data set the same (i.e. the input DAT files) BUT perform the analysis using different versions of the software, i.e. ion-Analysis. The ion-Analysis binary is responsible for ISP finding, signal processing and base calling. I have discussed signal processing and base calling in my previous blog posts. I have also briefly touched on finding ISPs (bead finding) but will go into more detail in my Signal Processing blog series. The three versions I have used are:

  1. ion-Analysis v1.40 (from Torrent Suite v1.3) REL: 20110414
  2. ion-Analysis v1.52 (from Torrent Suite v1.4) REL: 20110712
  3. ion-Analysis v1.61 (pre-release Torrent Suite v1.5) DATED: 20110914

 Method

# The datadir contains a 314 Run of the DH10B library.
# Execution of the ion-Analysis program
# creates rawlib.sff and rawtf.sff
Analysisv1.nn datadir --no-subdir > Analysis.out \
  2> Analysis.err

# rename the files before trimming
mv rawlib.sff rawlib_untrimmed.sff
mv rawtf.sff rawtf_untrimmed.sff

# Trim the files based on quality, minimal length and
# remove 3' adapter
SFFTrim --in-sff rawlib_untrimmed.sff --out-sff rawlib.sff \
  --flow-order TACGTACGTCTGAGCATCGATCGATGTACAGC --key TCAG \
  --qual-cutoff 9 --qual-window-size 30 --adapter-cutoff 16 \
  --adapter ATCACCGACTGCCCATAGAGAGGCTGAGAC --min-read-len 5

# Create the fastq file from the SFF file
SFFRead rawlib.sff -q rawlib.fastq

# performs tmap (v0.0.28) alignment, dumps quality metrics
# Q47.histo.dat used as input for a modified libGraphs.py
# python script to produce AQ47 quality distribution
alignmentQC.pl -i rawlib.fastq -g e_coli_dh10b \
  -q 7,10,17,20,23,47

# performs read length metrics.
# readLen.txt used as input for a modified
# trimmedReadLenHisto.py python script to produce
# Read Length distribution
SFFSummary -o quality.summary --sff-file rawlib.sff \
  --read-length 50,100,150 --min-length 0,0,0 \
  --qual 0,17,20 -d readLen.txt

Throughput

The table below shows that between versions 1.40 and 1.52 there was a modest increase in the number of ISPs identified (i.e. occupied wells), resulting in an increase in final library reads. There has been a slight decrease in version 1.61, but as I will show in the next section, it is quality and not quantity which is really important. Between versions 1.52 and 1.61 there is a massive difference in the filtering metrics. The blame has been shifted from poor signal reads to mixed/clonal reads. This has massive consequences for how throughput can be increased further. The problem of poor signal reads is largely due to the quality of the raw data and the downstream computational processing, while mixed/clonal reads are due to sample preparation. There is a possibility that there is a bug in the pre-release code.

Ion Sphere Particle (ISP) Metrics

                                 v1.40      v1.52      v1.61
Library ISPs                     737,584    860,574    843,844
Filtered: Too short              < 0.1%     3.24%      0.80%
Filtered: Keypass failure        10.0%      8.78%      0.60%
Filtered: Mixed/Clonal           6.70%      9.80%      40.16%
Filtered: Poor Signal Profile    31.70%     26.19%     8.26%
Final Library Reads              381,629    447,484    423,564

Read Length Distribution

The improvements in software have allowed not only more reads but also slightly longer reads. This can be observed as a slight shift of the distribution to the right and also a widening of the peak near 100 bp. Interestingly, version 1.61 also has a small shoulder at 150 bp.

Quality

Based on Full Library alignment to Reference (Quality: AQ20)

                                 v1.40     v1.52     v1.61
Total Number of Bases [Mbp]      24.84     29.38     32.46
Mean Length [bp]                 77        77        88
Longest Alignment [bp]           119       137       144
Mean Coverage Depth              5.30×     6.30×     6.90×
Percentage of Library Covered    98.96%    99.48%    99.66%

Based on Full Library alignment to Reference (Quality: AQ47)

                                 v1.40     v1.52     v1.61
Total Number of Bases [Mbp]      23.21     27.29     29.68
Mean Length [bp]                 72        72        82
Longest Alignment [bp]           119       133       138
Mean Coverage Depth              5.00×     5.80×     6.30×
Percentage of Library Covered    98.64%    99.26%    99.48%

Quality (AQ47) length distribution

At first glance the length distribution looks slightly better in version 1.40 compared to the later version 1.52. The peak is much higher and broader at around 100 bp for version 1.61. An important thing to note is that there will be a point at which read length quality is restricted by the library fragment length. For example, if the average library fragment length is 150 bp, it would be impossible to get a 400 bp read!


Speed (Computational Time)

There is a massive reduction in computational time between versions 1.40 and 1.52. This was when the NVIDIA Tesla GPU was employed through the use of the CUDA SDK. The use of GPU computing has been highly beneficial for bioinformatics programs such as HMMER. In the case of Ion Torrent, the biggest reduction is observed within the processing of the raw flowgrams (i.e. signal processing). This requires loading data from all chip wells across 20 flows (i.e. 20 DAT files) into memory and performing parameter fitting (Levenberg–Marquardt algorithm) using matrix operations and linear algebra from the armadillo, blas and lapack/lapackpp libraries. In addition, there are modest improvements between versions 1.52 and 1.61. This may be due to the tree dephasing algorithm used for base calling, as most of the time reduction was observed in this stage. The name “tree” suggests a non-greedy algorithm was implemented. See my previous post regarding the greedy implementation.

                        v1.40    v1.52    v1.61
Bead find               0.8      0.9      0.8
Bead Categorization     <0.1     <0.1     <0.1
Raw Flowgrams           48.2     22.5     24.9
Base calling            48.3     22.4     6.6
Total time              97.3     45.8     32.3

Note: Time is in minutes. Raw Flowgrams is the signal processing stage.
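The Raw Flowgrams stage above is dominated by per-well Levenberg–Marquardt fits. As a toy illustration of what that kind of fit involves (a generic example only – the model below is made up and has nothing to do with the actual incorporation model inside ion-Analysis):

import numpy as np
from scipy.optimize import curve_fit

def model(t, amplitude, tau, baseline):
    # A generic "signal rises then decays" shape, purely for demonstration.
    return amplitude * t * np.exp(-t / tau) + baseline

t = np.linspace(0, 10, 60)
rng = np.random.default_rng(1)
noisy = model(t, 5.0, 2.0, 0.5) + rng.normal(0, 0.2, t.size)

# method="lm" selects the Levenberg-Marquardt solver (for unbounded problems).
params, _ = curve_fit(model, t, noisy, p0=(1.0, 1.0, 0.0), method="lm")
print(params)   # should land near (5.0, 2.0, 0.5)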

Besides more efficient algorithms, run time depends on the number of wells and flows to process. As Ion Torrent aims to increase overall throughput by increasing the number of reads (i.e. wells) and read lengths (i.e. flows), it is crucial to have computationally efficient algorithms that are constantly improving.

Life Grand Challenges – Accuracy

In this analysis there are two software versions that have been or will be released shortly after the quarterly close of the Accuracy challenge. This allows the unique opportunity to ask the question: would Ion Torrent software developers win the Accuracy challenge themselves? In other words, how feasible, given the time limits, is it to achieve the goals set in the Accuracy challenge? The goal is equivalent to achieving a number of 100AQ23 reads greater than or equal to the number of 100AQ20 reads from the previous software release. This matches the goals set by the challenge, as software is released approximately every quarter.

                  v1.40      v1.52      v1.61
100AQ20 Reads     111,559    130,087    189,383
100AQ23 Reads     67,426     75,051     121,884

The 75,051 100AQ23 reads achieved by version 1.52 do not come close to the 111,559 100AQ20 reads achieved by version 1.40. Interestingly, the 121,884 100AQ23 reads from version 1.61 come very close to the benchmark set by version 1.52 (i.e. 130,087 100AQ20 reads). If averaged over several input data sets, this may well have won the ONE million dollar Accuracy challenge!! This shows the feasibility of the Accuracy challenge and confirmed my initial thoughts: with a moving target, after the first two quarters it will be next to IMPOSSIBLE. There go my chances, so back to coming up with Facebook apps that may appeal to teenagers with too much time on their hands :P

Conclusion

There are several limitations to the analysis I have performed. First, the different versions of ion-Analysis may have been optimized with different goals in mind. For example, version 1.61 may have been optimized for the new longer read chemistry, new flow rates, the 316 chip and the soon to be released 318 chip. However, it does do a pretty good job of analyzing our 314 data set. Second, performance on a DH10B library may not be a good reflection of how it may perform on “real world data” or even human data, which may have different challenges. Third, this is only the result from one input data set and therefore may not be representative of average performance. Fourth, when I was supplied with the pre-release binary, the guys at Ion Torrent forgot to include the improved phred prediction table, so I instead used the one from version 1.52. Improved quality prediction may lead to different read lengths after trimming, further improving the read length metrics. Lastly, the preparation of the samples before pressing RUN places an upper limit on how good the results can be. This also includes the size selection of fragments during library preparation. In other words, don’t expect to see 400 bp reads! The girl who prepared this is experienced in molecular biology, but this was her first Next Generation Sequencing experiment! This is testimony to how simple the technology is to use and how great the lab girls in our lab are :D

Again, big thanks to Mel and Mike, who made pre-release version 1.61 available to me. In the next post, I will discuss the thing that shall not be named… HP, and that does not stand for Harry Potter :P

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. Having early access to the ion-Analysis has made me one of them :(