The periodic public release of data sets by Life Technologies and others in the scientific community has allowed me to perform a “longitudinal study” of the improvements made on the Ion Torrent. In fact, the last few months have been quite exciting, with Ion Torrent engaging the community through public data releases along with source code. This has made the whole scientific community feel, in some way, part of the action. In this three-part series, which will run in parallel with the Signal Processing series, I will look at three major developmental themes:
- Improvements in Accuracy
- Homopolymer problem – can’t call it improvements because I haven’t analyzed the data yet 😛
- Changes in the ion-Analysis source code. This binary is largely responsible for all the data analysis, that is, going from raw voltages (DAT files) to corrected incorporation signal (SFF files). Subsequent base calling from SFF files is quite trivial 🙂
The analysis was performed using Novoalign in the Novocraft package (v2.07.12) according to the instructions detailed on their Ion Torrent support page. The plots were produced using the R scripts provided in the package, with slight modifications to change the look and feel. I used the fastq files as input and did no pre-processing, so the results should be easy to reproduce. The same command line options were used throughout and are noted at the bottom of each plot. The only exception is the last plot for the long read data set, where the “-n 300” option was used to inspect quality past the default 150 bases. Kudos to Nick Loman for the help (see Comments below). I quite like the package and the fast support provided on the user forum (kudos to Colin). There is a nice gallery of figures provided on their Facebook page.
From the quality plots there are two very obvious things. First, the predicted quality is overly conservative: they are underselling themselves by an average of 10 Phred points. This was also noted in the Omics Omics, EdgeBio and Pathogenomics blog posts. Second, the predicted quality along reads from the 316 data set (Figure below) used by the Illumina MiSeq application note is an unfair and incorrect representation of what is happening.
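To make the “underselling” point concrete, here is a minimal sketch of how an observed (empirical) quality can be worked out from alignment mismatches and compared with the predicted one. The function name and the counts are made up for illustration; this is not the calibration code Novoalign uses, just the standard Phred definition applied to an error rate.

```python
import math

def empirical_quality(mismatches, total_bases):
    """Observed Phred quality from mismatch counts: Q = -10 * log10(error rate)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if mismatches == 0:
        # No observed errors, so the empirical quality is unbounded.
        return float("inf")
    return -10.0 * math.log10(mismatches / total_bases)

# Hypothetical numbers: bases the machine predicted as Q20,
# of which 10 in 10,000 turned out to be mismatches after alignment.
predicted_q = 20
observed_q = empirical_quality(mismatches=10, total_bases=10_000)
print(f"predicted Q{predicted_q}, observed Q{observed_q:.0f}, "
      f"undersold by {observed_q - predicted_q:.0f} Phred points")
```

With these invented counts the bases called Q20 behave like Q30, which is the kind of 10 point gap described above.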
Raw accuracy cheat sheet:
- Q10 = 90% accuracy
- Q20 = 99% accuracy
- Q23 = 99.5% accuracy
- Q30 = 99.9% accuracy
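If you want a value that is not on the cheat sheet, the conversion is just the Phred definition, accuracy = 1 - 10^(-Q/10). A small sketch (the function names are my own):

```python
import math

def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score."""
    return 1.0 - 10 ** (-q / 10.0)

def accuracy_to_phred(accuracy):
    """Phred quality score implied by a per-base accuracy (must be < 1)."""
    return -10.0 * math.log10(1.0 - accuracy)

# Reproduce the cheat sheet values.
for q in (10, 20, 23, 30):
    print(f"Q{q} = {phred_to_accuracy(q) * 100:.1f}% accuracy")
```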
In my opinion, actual observed accuracy is more important than predicted. For example, I predicted network marketing was going to make me a fortune and I would be financially free by now. Unfortunately, my friends didn’t want to buy my stuff 😥 Their loss !! The plots from the long read data set show the massive improvements made in just a few months. This makes me very optimistic for the future 😛
18th May 2011 (Analysis date)
Source: EdgeBio (Project: 0039010CA)
Run date: 2011-04-07
ion-Analysis version: 1.40-0
Flow cycles: 55
8th June 2011 (Analysis date)
Source: Life Technologies (316 data set)
Run date: 2011-06-06
ion-Analysis version: 1.49-3
Flow cycles: 65
21st July 2011 (Analysis date)
Source: Institute for Neuroscience and Muscle Research (my lab :))
Run date: 2011-07-21
ion-Analysis version: 1.52-11
Flow cycles: 65
28th July 2011 (Analysis date)
Source: Life Technologies (Long Read data set)
Run date: 2011-07-19
ion-Analysis version: 1.55-1
Flow cycles: 130
This wouldn’t be a blog post by me if I wasn’t complaining about something… I mean giving feedback 🙂 A slogan that is regularly used by Ion Torrent is how they are “democratizing” sequencing. In terms of releasing data sets and source code they are far ahead of their competitors Illumina and Roche. The above analysis would not be possible without the public release of data sets from Life Technologies and also EdgeBio (an Ion Torrent sequencing service provider). Illumina has provided some data sets from their MiSeq. Downloading these killed my bandwidth, as they forgot to compress the fastq files. What n00bs! When will Illumina and Roche provide more data sets for their competing desktop sequencers? In the case of Roche, when will they provide any? Also, when will they learn that people outside the major sequencing centers have brains, and that perhaps they should interact with them every now and then!
Despite the great efforts made by Life Technologies, there is still a long way to go in my mind to truly democratize sequencing. For example, early access to new products should be given to labs that are trying to make a difference in society and not just their “special customers”. What better way to promote your technology than by showing that a small lab with little experience can get it to work? I am not impressed at all if an experienced sequencing lab can get it to work. Giving these products to just special customers (aka the “big boys”) is NOT democratizing sequencing; it is maintaining the dominance these labs have over high impact publications. Our lab has requested early access to the TargetSeq enrichment system (not to be confused with the Qiagen SeqTarget junk). Having access to this enrichment would allow us to explore the possibility of diagnosing children with muscular dystrophy more efficiently and help parents, families and carers plan for the natural progression of these crippling diseases. Having early access will give us an opportunity to produce preliminary data for the next grant round. How about helping the “little guy” for a change?
In my next blog post of this series I will provide an independent analysis of homopolymers using the data sets above. This will provide further discussion in addition to the great post comparing Ion Torrent and 454 homopolymers from flxlex.
Disclaimer: For the good of all mankind! This is purely my opinion and interpretation. This is an independent analysis using Novoalign, kept simple so others can reproduce the results. Despite begging, I have never been treated to a free lunch/dinner or even a free T-shirt by Life Technologies 😥