This is the second post of what is now to be a four-part series looking at how Ion Torrent accuracy has improved over time. In this edition, I will show what a massive difference software can make with this technology. The results presented here were only possible because the software is open source. In addition, Mike and Mel have given me early access to binaries (ion-Analysis v1.61) that will be released in Torrent Suite v1.5. That’s a huge thank you to Mel and Mike!
There are three major areas that software can improve:
- Throughput – Identify more Ion Sphere Particles (ISPs) with library fragments therefore increasing the total number of reads.
- Quality – More and longer reads aligning back to the reference sequence.
- Speed – Reduce the computational time to analyze the data.
The way I am going to present this data is to keep the data set the same (i.e. input DAT files) BUT perform the analysis using the different versions of the software, i.e. ion-Analysis. The ion-Analysis binary/software is responsible for ISP finding, signal processing and base calling. I have discussed signal processing and base calling in my previous blog posts. I have also briefly touched on bead finding ISPs but will go into more detail in my Signal Processing blog series. The three versions I have used are:
- ion-Analysis v1.40 (from Torrent Suite v1.3) REL: 20110414
- ion-Analysis v1.52 (from Torrent Suite v1.4) REL: 20110712
- ion-Analysis v1.61 (pre-release Torrent Suite v1.5) DATED: 20110914
Each version was run with the following commands:

    # The datadir contains a 314 Run of the DH10B library.
    # Execution of the ion-Analysis program
    # creates rawlib.sff and rawtf.sff
    Analysisv1.nn datadir --no-subdir > Analysis.out 2> Analysis.err

    # rename the files before trimming
    mv rawlib.sff rawlib_untrimmed.sff
    mv rawtf.sff rawtf_untrimmed.sff

    # Trim the files based on quality, minimal length and
    # remove 3' adapter
    SFFTrim --in-sff rawlib_untrimmed.sff --out-sff rawlib.sff \
        --flow-order TACGTACGTCTGAGCATCGATCGATGTACAGC --key TCAG \
        --qual-cutoff 9 --qual-window-size 30 --adapter-cutoff 16 \
        --adapter ATCACCGACTGCCCATAGAGAGGCTGAGAC --min-read-len 5

    # Create the fastq file from the SFF file
    SFFRead rawlib.sff -q rawlib.fastq

    # performs tmap (v0.0.28) alignment, dumps quality metrics
    # Q47.histo.dat used as input for a modified libGraphs.py
    # python script to produce AQ47 quality distribution
    alignmentQC.pl -i rawlib.fastq -g e_coli_dh10b -q 7,10,17,20,23,47

    # performs read length metrics.
    # readLen.txt used as input for a modified
    # trimmedReadLenHisto.py python script to produce
    # Read Length distribution
    SFFSummary -o quality.summary --sff-file rawlib.sff \
        --read-length 50,100,150 --min-length 0,0,0 --qual 0,17,20 -d readLen.txt
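The trimming step above is driven by a quality cutoff (9), a window size (30) and a minimum read length (5). As a rough illustration of what a sliding-window quality trim does, here is a minimal Python sketch. It is a generic version of the idea and is not taken from the SFFTrim source: the function name `sliding_window_trim` and the exact rule (cut at the start of the first window whose mean quality drops below the cutoff) are assumptions made for this example.

```python
def sliding_window_trim(quals, window=30, cutoff=9.0, min_len=5):
    """Return how many leading bases to keep (0 means the read is dropped).

    Generic sliding-window rule, assumed for illustration only:
    cut the read at the start of the first window whose mean quality
    falls below `cutoff`.
    """
    n = len(quals)
    if n < window:
        # Read shorter than one window: judge it as a single window.
        keep = n if n and sum(quals) / n >= cutoff else 0
    else:
        keep = n
        window_sum = sum(quals[:window])
        for start in range(n - window + 1):
            if start > 0:
                # Slide by one base: add the new base, drop the old one.
                window_sum += quals[start + window - 1] - quals[start - 1]
            if window_sum / window < cutoff:
                keep = start
                break
    # Reads shorter than min_len after trimming are discarded entirely.
    return keep if keep >= min_len else 0


# 50 good bases followed by 40 poor ones.
example = [30] * 50 + [2] * 40
print(sliding_window_trim(example))  # -> 43
```

In this toy read the cut lands a few bases before the quality actually collapses, because the window average reacts before the window has fully left the good region. A larger window makes the trim smoother but less precise about where the drop happens.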
The table below shows that between versions 1.40 and 1.52 there was a modest increase in the number of ISPs identified (i.e. occupied wells), resulting in an increase in final library reads. There is a slight decrease in version 1.61, but as I will show in the next section, it is quality and not quantity that really matters. Between versions 1.52 and 1.61 there is a massive difference in the filtering metrics: the blame has shifted from Poor Signal reads to Mixed/Clonal reads. This has massive consequences for how throughput can be increased further. Poor signal reads are largely due to the quality of the raw data and the downstream computational processing, while mixed/clonal reads are due to sample preparation. It is also possible that there is a bug in the pre-release code.
Ion Sphere Particle (ISP) Metrics
|Filtered: Too short
|Filtered: Keypass failure
|Filtered: Poor Signal Profile
|Final Library Reads
Read Length Distribution
The improvements in software have allowed not only for more reads but also for slightly longer reads. This can be observed as a slight shift of the distribution to the right and also a widening of the peak near 100 bp. Interestingly, version 1.61 also has a small shoulder at 150 bp.
Based on Full Library alignment to Reference (Quality: AQ20)
|Metric||v1.40||v1.52||v1.61|
|Total Number of Bases [Mbp]||24.84||29.38||32.46|
|Mean Length [bp]||77||77||88|
|Longest Alignment [bp]||119||137||144|
|Mean Coverage Depth||5.30×||6.30×||6.90×|
|Percentage of Library Covered||98.96%||99.48%||99.66%|
Based on Full Library alignment to Reference (Quality: AQ47)
|Metric||v1.40||v1.52||v1.61|
|Total Number of Bases [Mbp]||23.21||27.29||29.68|
|Mean Length [bp]||72||72||82|
|Longest Alignment [bp]||119||133||138|
|Mean Coverage Depth||5.00×||5.80×||6.30×|
|Percentage of Library Covered||98.64%||99.26%||99.48%|
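The Mean Coverage Depth rows in the two tables above can be sanity-checked from the Total Number of Bases rows, since mean depth is simply aligned bases divided by the reference length. The sketch below assumes a DH10B reference length of 4,686,137 bp (a value not stated in this post) and takes the three table columns to be versions 1.40, 1.52 and 1.61 in that order. The function name is just for illustration.

```python
# Assumed reference length for E. coli K-12 DH10B (not stated in the post).
DH10B_GENOME_BP = 4_686_137


def mean_coverage_depth(total_aligned_mbp, genome_bp=DH10B_GENOME_BP):
    """Mean depth = total aligned bases / reference length."""
    return total_aligned_mbp * 1_000_000 / genome_bp


# Total aligned bases [Mbp] from the AQ20 and AQ47 tables.
aq20_mbp = [24.84, 29.38, 32.46]
aq47_mbp = [23.21, 27.29, 29.68]

aq20_depth = [round(mean_coverage_depth(m), 1) for m in aq20_mbp]
aq47_depth = [round(mean_coverage_depth(m), 1) for m in aq47_mbp]

print(aq20_depth)  # -> [5.3, 6.3, 6.9]
print(aq47_depth)  # -> [5.0, 5.8, 6.3]
```

Under that assumed genome size, the computed depths agree with the tabulated values to one decimal place.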
Quality (AQ47) length distribution
At first glance the length distribution looks slightly better in version 1.40 compared to the later version 1.52. The peak is much higher and broader at around 100 bp for version 1.61. An important thing to note is that there will be a point at which read length quality will be restricted by the library fragment length. For example, if the average library fragment length is 150 bp, it would be impossible to get a 400 bp read!
Speed (Computational Time)
There is a massive reduction in computational time between versions 1.40 and 1.52. This was when the NVIDIA Tesla GPU was employed through the use of the CUDA SDK. The use of GPU computing has been highly beneficial for Bioinformatics programs such as HMMER. In the case of Ion Torrent, the biggest reduction is observed within the processing of the raw flowgrams (i.e. signal processing). This requires loading data from all chip wells from 20 flows (i.e. 20 DAT files) into memory and performing parameter fitting (Levenberg–Marquardt algorithm) using matrix operations and linear algebra within the armadillo, blas and lapack/lapackpp libraries. In addition, there are modest improvements between versions 1.52 and 1.61. This may be due to the tree dephasing algorithm used for base calling, as most of the time reduction was observed in this stage. The name “tree” would suggest that a non-greedy algorithm was implemented. See my previous post regarding the greedy implementation.
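To give a feel for the kind of parameter fitting mentioned above, here is a minimal pure-Python sketch of Levenberg-Marquardt on a toy problem. It is not the ion-Analysis model: the two-parameter exponential decay, the 20 synthetic data points, the starting guess and the function name `fit_exponential_lm` are all assumptions chosen to keep the example small.

```python
import math


def fit_exponential_lm(t, y, a=1.0, b=0.1, max_iter=100):
    """Fit y ~ a * exp(-b * t) with a basic Levenberg-Marquardt loop."""

    def cost(a_, b_):
        return sum((yi - a_ * math.exp(-b_ * ti)) ** 2 for ti, yi in zip(t, y))

    lam = 1e-3                 # damping factor
    current = cost(a, b)
    for _ in range(max_iter):
        # Accumulate J^T J (2x2) and J^T r (2x1) at the current parameters.
        h11 = h12 = h22 = g1 = g2 = 0.0
        for ti, yi in zip(t, y):
            e = math.exp(-b * ti)
            r = yi - a * e                 # residual
            ja, jb = e, -a * ti * e        # d(model)/da, d(model)/db
            h11 += ja * ja
            h12 += ja * jb
            h22 += jb * jb
            g1 += ja * r
            g2 += jb * r
        # Damped normal equations, solved with Cramer's rule.
        d11, d22 = h11 * (1 + lam), h22 * (1 + lam)
        det = d11 * d22 - h12 * h12
        if det == 0.0:
            break
        da = (g1 * d22 - h12 * g2) / det
        db = (d11 * g2 - h12 * g1) / det
        trial = cost(a + da, b + db)
        if trial < current:
            # Step helped: accept it and trust the Gauss-Newton direction more.
            a, b, current = a + da, b + db, trial
            lam /= 10
            if abs(da) + abs(db) < 1e-12:
                break
        else:
            # Step hurt: reject it and lean towards gradient descent.
            lam *= 10
    return a, b


# Synthetic, noise-free signal over 20 time points (true a = 2.0, b = 0.3).
t = list(range(20))
y = [2.0 * math.exp(-0.3 * ti) for ti in t]
a_fit, b_fit = fit_exponential_lm(t, y)
print(a_fit, b_fit)
```

Every iteration rebuilds the small matrix products from all data points, and in the real pipeline this kind of fit is repeated across a very large number of wells. That is the sort of dense, repetitive linear algebra that benefits from a GPU.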
Note: Time is in minutes. Raw Flowgrams is the signal processing stage.
Besides more efficient algorithms, run time depends on the number of wells and flows to process. As Ion Torrent aims to increase overall throughput by increasing the number of reads (i.e. wells) and read lengths (i.e. flows), it is crucial to have computationally efficient algorithms which are constantly improving.
Life Grand Challenges – Accuracy
In this analysis there are two software versions that have been or will be released shortly after the quarterly close of the accuracy challenge. This allows the unique opportunity to ask the question: would Ion Torrent software developers win the accuracy challenge themselves? In other words, how feasible is it, given the time limits, to achieve the goals set in the accuracy challenge? The goal is the equivalent of achieving at least as many 100AQ23 reads as the number of 100AQ20 reads in the previous software release. This is equivalent to the goals set by the challenge, as software is released approximately every quarter.
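To put the jump from AQ20 to AQ23 in perspective, the sketch below converts AQ values to error rates. It assumes the AQ values follow the usual Phred scale (error probability of 10^(-Q/10)). The function name is made up for this example.

```python
def aq_to_error_rate(aq):
    """Phred-style score -> error probability, p = 10 ** (-AQ / 10)."""
    return 10 ** (-aq / 10)


for aq in (17, 20, 23, 47):
    rate = aq_to_error_rate(aq)
    # Express the rate as a percentage and as "1 error in N bases".
    print(f"AQ{aq}: {rate * 100:.4f}% error, about 1 in {round(1 / rate)} bases")

# Moving from AQ20 to AQ23 roughly halves the tolerated error rate.
improvement = aq_to_error_rate(20) / aq_to_error_rate(23)
print(round(improvement, 2))  # -> 2.0
```

On this scale AQ20 tolerates about 1 error in 100 aligned bases, while AQ23 tolerates about 1 in 200, so matching the previous read count at AQ23 means roughly halving the error rate without losing reads.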
The 75,051 100AQ23 reads achieved by version 1.52 do not come close to the 111,559 100AQ20 reads achieved by version 1.40. Interestingly, the 121,884 100AQ23 reads achieved by version 1.61 are very close to the benchmark set by version 1.52 (i.e. 130,087 reads). If averaged over several input data sets, this may well have won the ONE million dollar accuracy challenge!! This shows the feasibility of the accuracy challenge and confirms my initial thoughts that, with a moving target, after the first two quarters it will be next to IMPOSSIBLE. There go my chances, so back to coming up with Facebook apps that may appeal to teenagers with too much time on their hands :P
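The comparison above boils down to one ratio per release: the new version's 100AQ23 read count divided by the previous version's 100AQ20 read count. A small sketch using the numbers quoted in this section, taking the 121,884 figure to be version 1.61's count (the dictionary layout and function name are just for illustration):

```python
# Read counts quoted in the text, keyed by (version, metric).
counts = {
    ("1.40", "100AQ20"): 111_559,
    ("1.52", "100AQ23"): 75_051,
    ("1.52", "100AQ20"): 130_087,
    ("1.61", "100AQ23"): 121_884,
}


def challenge_ratio(prev_version, new_version):
    """New release's 100AQ23 reads as a fraction of the old 100AQ20 benchmark."""
    benchmark = counts[(prev_version, "100AQ20")]
    achieved = counts[(new_version, "100AQ23")]
    return achieved / benchmark


for prev, new in (("1.40", "1.52"), ("1.52", "1.61")):
    pct = round(challenge_ratio(prev, new) * 100, 1)
    print(f"v{new} vs v{prev} benchmark: {pct}%")
# -> v1.52 vs v1.40 benchmark: 67.3%
# -> v1.61 vs v1.52 benchmark: 93.7%
```

A ratio of 100% or more would meet the goal. On this single data set, version 1.52 reaches about two thirds of its target while version 1.61 falls only a few percent short.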
There are several limitations with the analysis I have performed. First, the different versions of ion-Analysis may have been optimized with different goals in mind. For example, version 1.61 may have been optimized for the new longer read chemistry, new flow rates, the 316 chip and the soon to be released 318 chip. However, it does do a pretty good job of analyzing our 314 data set. Second, performance on a DH10B library may not be a good reflection of how it may perform on “real world data” or even human data that may have different challenges. Third, this is only the result from one input data set and therefore may not be representative of the average performance. Fourth, when I was supplied with the pre-release binary the guys at Ion Torrent forgot to include the improved phred prediction table. I instead used the one from version 1.52. Improved quality prediction may lead to different read lengths after trimming, further improving the read length metrics. Lastly, the preparation of the samples before pressing RUN places an upper limit on how good the results can be. This also includes the size selection of the fragments during library preparation. In other words, don’t expect to see 400 bp reads! The girl who prepared this is experienced in molecular biology but this is her first Next Generation Sequencing experiment! This is testimony to how simple the technology is to use and how great the lab girls are in our lab :D
Again big thanks to Mel and Mike who have made a pre-release version 1.61 available to me. In the next post, I will discuss the thing that shall not be named…HP and that does not stand for Harry Potter :P
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. Having early access to the ion-Analysis has made me one of them :(