Ion Torrent – Rapid Software Improvements

This is the second post of what is now to be a four-part series looking at how Ion Torrent accuracy has improved over time. In this edition, I will show what a massive difference software can make with this technology. The results presented here were only possible because the software is open source. In addition, Mike and Mel have given me early access to binaries (ion-Analysis v1.61) that will be released in Torrent Suite v1.5. A huge thank you to Mel and Mike!

There are three major areas that software can improve:

  1. Throughput – Identify more Ion Sphere Particles (ISPs) with library fragments, thereby increasing the total number of reads.
  2. Quality – More and longer reads aligning back to the reference sequence.
  3. Speed – Reduce the computational time to analyze the data.

I will keep the data set the same (i.e. the input DAT files) BUT perform the analysis using different versions of the software, i.e. ion-Analysis. The ion-Analysis binary is responsible for ISP finding, signal processing and base calling. I have discussed signal processing and base calling in my previous blog posts. I have also briefly touched on bead finding (i.e. locating ISPs) but will go into more detail in my Signal Processing blog series. The three versions I have used are:

  1. ion-Analysis v1.40 (from Torrent Suite v1.3) REL: 20110414
  2. ion-Analysis v1.52 (from Torrent Suite v1.4) REL: 20110712
  3. ion-Analysis v1.61 (pre-release Torrent Suite v1.5) DATED: 20110914

Method

# The datadir contains a 314 run of the DH10B library.
# Execution of the ion-Analysis program (nn = version under test)
# creates rawlib.sff and rawtf.sff
Analysisv1.nn datadir --no-subdir > Analysis.out \
  2> Analysis.err

# Rename the files before trimming
mv rawlib.sff rawlib_untrimmed.sff
mv rawtf.sff rawtf_untrimmed.sff

# Trim the files based on quality and minimal length, and
# remove the 3' adapter
SFFTrim --in-sff rawlib_untrimmed.sff --out-sff rawlib.sff \
  --flow-order TACGTACGTCTGAGCATCGATCGATGTACAGC --key TCAG \
  --qual-cutoff 9 --qual-window-size 30 --adapter-cutoff 16 \
  --adapter ATCACCGACTGCCCATAGAGAGGCTGAGAC --min-read-len 5

# Create the fastq file from the SFF file
SFFRead rawlib.sff -q rawlib.fastq

# Performs tmap (v0.0.28) alignment and dumps quality metrics.
# Q47.histo.dat is used as input for a modified libGraphs.py
# python script to produce the AQ47 quality distribution
alignmentQC.pl -i rawlib.fastq -g e_coli_dh10b \
  -q 7,10,17,20,23,47

# Computes read length metrics.
# readLen.txt is used as input for a modified
# trimmedReadLenHisto.py python script to produce
# the read length distribution
SFFSummary -o quality.summary --sff-file rawlib.sff \
  --read-length 50,100,150 --min-length 0,0,0 \
  --qual 0,17,20 -d readLen.txt

Throughput

The table below shows that between versions 1.40 and 1.52 there was a modest increase in the number of ISPs identified (i.e. occupied wells), resulting in an increase in final library reads. There is a slight decrease in version 1.61, but as I will show in the following sections, it is quality and not quantity that really matters. Between versions 1.52 and 1.61 there is a massive difference in the filtering metrics. The blame has shifted from Poor Signal Profile reads to Mixed/Clonal reads. This has massive consequences for how throughput can be increased further. Poor signal reads are largely due to the quality of the raw data and the downstream computational processing, while mixed/clonal reads are due to sample preparation. There is also a possibility that there is a bug in the pre-release code.

Ion Sphere Particle (ISP) Metrics

                                 v1.40     v1.52     v1.61
Library ISPs                   737,584   860,574   843,844
Filtered: Too short             < 0.1%     3.24%     0.80%
Filtered: Keypass failure        10.0%     8.78%     0.60%
Filtered: Mixed/Clonal           6.70%     9.80%    40.16%
Filtered: Poor Signal Profile   31.70%    26.19%     8.26%
Final Library Reads            381,629   447,484   423,564
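
As a quick sanity check on the table above, here is a minimal Python sketch that computes the fraction of library ISPs surviving the filters and compares the reported read counts with what the filter percentages imply. The numbers are copied from the table; treating "< 0.1%" as 0.1% and all the names in the code are my own assumptions for illustration.

```python
# ISP metrics copied from the table above.
# Assumption: "< 0.1%" (Too short, v1.40) is treated as 0.1%.
isp_metrics = {
    "v1.40": {"library_isps": 737_584, "filtered_pct": [0.1, 10.0, 6.70, 31.70], "final_reads": 381_629},
    "v1.52": {"library_isps": 860_574, "filtered_pct": [3.24, 8.78, 9.80, 26.19], "final_reads": 447_484},
    "v1.61": {"library_isps": 843_844, "filtered_pct": [0.80, 0.60, 40.16, 8.26], "final_reads": 423_564},
}


def pass_rate(version):
    """Fraction of library ISPs that end up as final library reads."""
    m = isp_metrics[version]
    return m["final_reads"] / m["library_isps"]


def expected_reads(version):
    """Reads implied by subtracting the four filter percentages."""
    m = isp_metrics[version]
    return m["library_isps"] * (1 - sum(m["filtered_pct"]) / 100)


for version in isp_metrics:
    print(f"{version}: pass rate {pass_rate(version):.1%}, "
          f"implied reads {expected_reads(version):,.0f}")
```

Roughly half of the library ISPs survive filtering in every version, so the gain in final reads between versions 1.40 and 1.52 comes from finding more ISPs rather than from discarding fewer of them.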

Read Length Distribution

The improvements in software have allowed not only for more reads but also for slightly longer reads. This can be observed as a slight shift of the distribution to the right and a widening of the peak near 100 bp. Interestingly, version 1.61 also has a small shoulder at 150 bp.

Quality

Based on Full Library alignment to Reference (Quality: AQ20)

                                 v1.40     v1.52     v1.61
Total Number of Bases [Mbp]      24.84     29.38     32.46
Mean Length [bp]                    77        77        88
Longest Alignment [bp]             119       137       144
Mean Coverage Depth              5.30×     6.30×     6.90×
Percentage of Library Covered   98.96%    99.48%    99.66%

Based on Full Library alignment to Reference (Quality: AQ47)

                                 v1.40     v1.52     v1.61
Total Number of Bases [Mbp]      23.21     27.29     29.68
Mean Length [bp]                    72        72        82
Longest Alignment [bp]             119       133       138
Mean Coverage Depth              5.00×     5.80×     6.30×
Percentage of Library Covered   98.64%    99.26%    99.48%
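
The mean coverage depth in both tables appears to be simply the total aligned bases divided by the reference length. A minimal sketch, assuming a DH10B reference of 4,686,137 bp (that figure is my assumption and is not stated in this post; the base counts are copied from the tables):

```python
# Assumption: length of the E. coli DH10B reference in base pairs.
GENOME_SIZE_BP = 4_686_137

# Total aligned bases in Mbp, copied from the two tables above.
total_mbp = {
    "AQ20": {"v1.40": 24.84, "v1.52": 29.38, "v1.61": 32.46},
    "AQ47": {"v1.40": 23.21, "v1.52": 27.29, "v1.61": 29.68},
}


def mean_coverage(mbp, genome_bp=GENOME_SIZE_BP):
    """Average depth = aligned bases / reference length."""
    return mbp * 1e6 / genome_bp


for quality, per_version in total_mbp.items():
    for version, mbp in per_version.items():
        print(f"{quality} {version}: {mean_coverage(mbp):.1f}x")
```

Under that assumption the computed depths agree with the reported values to one decimal place.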

Quality (AQ47) length distribution

At first glance the length distribution looks slightly better in version 1.40 compared to the later version 1.52. The peak is much higher and broader at around 100 bp for version 1.61. An important thing to note is that at some point read length will be restricted by the library fragment length. For example, if the average library fragment length is 150 bp, it would be impossible to get a 400 bp read!


Speed (Computational Time)

There is a massive reduction in computational time between versions 1.40 and 1.52. This was when the NVIDIA Tesla GPU was employed through the use of the CUDA SDK. GPU computing has been highly beneficial for Bioinformatics programs such as HMMER. In the case of Ion Torrent, the biggest reduction is observed in the processing of the raw flowgrams (i.e. signal processing). This requires loading data for all chip wells from 20 flows (i.e. 20 DAT files) into memory and performing parameter fitting (Levenberg–Marquardt algorithm) using matrix operations and linear algebra from the armadillo, blas and lapack/lapackpp libraries. In addition, there is a modest improvement between versions 1.52 and 1.61. This may be due to the tree dephasing algorithm used for base calling, as most of the time reduction was observed in this stage. The name “tree” would suggest that a non-greedy algorithm was implemented. See my previous post regarding the greedy implementation.
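
To give a feel for the kind of parameter fitting mentioned above, here is a toy Levenberg–Marquardt fit in plain Python. This is only a sketch: the two-parameter exponential model, the synthetic data and every name in the code are my own assumptions for illustration, and it is not the model or the code used by ion-Analysis (which relies on armadillo/blas/lapack).

```python
import math


def model(t, a, b):
    """Toy signal model (an assumption): exponential decay a * exp(-b * t)."""
    return a * math.exp(-b * t)


def fit_lm(ts, ys, a, b, lam=1e-3, max_iter=200, tol=1e-12):
    """Fit a and b by Levenberg-Marquardt on the sum of squared residuals."""

    def cost(pa, pb):
        return sum((y - model(t, pa, pb)) ** 2 for t, y in zip(ts, ys))

    current = cost(a, b)
    for _ in range(max_iter):
        # Accumulate J^T J (2x2) and J^T r (2x1) over all data points.
        jaa = jab = jbb = ga = gb = 0.0
        for t, y in zip(ts, ys):
            e = math.exp(-b * t)
            r = y - a * e        # residual
            da = e               # d(model)/da
            db = -a * t * e      # d(model)/db
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Damped normal equations: (J^T J + lam * diag(J^T J)) step = J^T r
        m11 = jaa * (1 + lam)
        m22 = jbb * (1 + lam)
        det = m11 * m22 - jab * jab
        if det == 0:
            break
        step_a = (ga * m22 - jab * gb) / det
        step_b = (m11 * gb - jab * ga) / det
        trial = cost(a + step_a, b + step_b)
        if trial < current:
            # Step helped: accept it and trust the Gauss-Newton direction more.
            a, b, current = a + step_a, b + step_b, trial
            lam /= 10
            if abs(step_a) + abs(step_b) < tol:
                break
        else:
            # Step hurt: reject it and lean towards gradient descent.
            lam *= 10
    return a, b


# Synthetic, noise-free data generated with a = 2.0 and b = 0.5.
ts = list(range(10))
ys = [model(t, 2.0, 0.5) for t in ts]
a_fit, b_fit = fit_lm(ts, ys, a=1.0, b=1.0)
print(a_fit, b_fit)
```

The damping factor is what distinguishes this from plain Gauss-Newton: it shrinks the step when a trial makes the fit worse and relaxes again once steps start succeeding. Doing this kind of fit for every well on a chip is why the stage benefits so much from a GPU.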

                      v1.40   v1.52   v1.61
Bead find               0.8     0.9     0.8
Bead Categorization    <0.1    <0.1    <0.1
Raw Flowgrams          48.2    22.5    24.9
Base calling           48.3    22.4     6.6
Total time             97.3    45.8    32.3

Note: Time is in minutes. Raw Flowgrams is the signal processing stage.
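
The speed-ups implied by these timings can be worked out with a few lines of Python. The times are copied from the table; treating "<0.1" as 0.0 and the names in the code are my own assumptions.

```python
# Stage times in minutes, copied from the table above.
# Assumption: "<0.1" for Bead Categorization is treated as 0.0.
timings = {
    "v1.40": {"bead_find": 0.8, "bead_categorization": 0.0,
              "raw_flowgrams": 48.2, "base_calling": 48.3, "total": 97.3},
    "v1.52": {"bead_find": 0.9, "bead_categorization": 0.0,
              "raw_flowgrams": 22.5, "base_calling": 22.4, "total": 45.8},
    "v1.61": {"bead_find": 0.8, "bead_categorization": 0.0,
              "raw_flowgrams": 24.9, "base_calling": 6.6, "total": 32.3},
}


def speedup(stage, old, new):
    """How many times faster `new` is than `old` for the given stage."""
    return timings[old][stage] / timings[new][stage]


def stage_sum(version):
    """Sum of the individual stages, to cross-check the reported total."""
    return sum(v for k, v in timings[version].items() if k != "total")


print(f"Total, v1.40 -> v1.52: {speedup('total', 'v1.40', 'v1.52'):.1f}x")
print(f"Total, v1.40 -> v1.61: {speedup('total', 'v1.40', 'v1.61'):.1f}x")
print(f"Base calling, v1.52 -> v1.61: {speedup('base_calling', 'v1.52', 'v1.61'):.1f}x")
```

Overall the analysis is about twice as fast in version 1.52 and about three times as fast in version 1.61 compared to version 1.40, with the base calling stage alone more than three times faster between versions 1.52 and 1.61.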

Besides more efficient algorithms, run time depends on the number of wells and flows to process. As Ion Torrent aims to increase overall throughput by increasing the number of reads (i.e. wells) and read lengths (i.e. flows), it is crucial to have computationally efficient algorithms that are constantly improving.

Life Grand Challenges – Accuracy

In this analysis there are two software versions that have been, or will be, released shortly after the quarterly close of the accuracy challenge. This provides a unique opportunity to ask the question: would Ion Torrent software developers win the accuracy challenge themselves? In other words, how feasible is it, given the time limits, to achieve the goals set in the accuracy challenge? The goal is equivalent to achieving at 100AQ23 at least as many reads as the previous software release achieved at 100AQ20. This mirrors the goals set by the challenge, as software is released approximately every quarter.
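
To put the jump from AQ20 to AQ23 into numbers, here is a small sketch of the standard Phred conversion between a quality score and an error rate. I am assuming that the AQ values follow the usual Phred scale applied to the aligned bases; the function names are mine.

```python
def phred_to_error(q):
    """Standard Phred scale: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def accuracy_percent(q):
    """Per-base accuracy (in percent) implied by a Phred-scaled quality."""
    return 100 * (1 - phred_to_error(q))


for q in (17, 20, 23, 47):
    print(f"Q{q}: error rate {phred_to_error(q):.5f}, "
          f"accuracy {accuracy_percent(q):.3f}%")
```

Going from Q20 to Q23 means halving the error rate from 1% to roughly 0.5%, i.e. moving from 99% to about 99.5% accuracy over the first 100 bases.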

                   v1.40     v1.52     v1.61
100AQ20 Reads    111,559   130,087   189,383
100AQ23 Reads     67,426    75,051   121,884

The 75,051 100AQ23 reads achieved by version 1.52 do not come close to the 111,559 100AQ20 reads achieved by version 1.40. Interestingly, the 121,884 100AQ23 reads achieved by version 1.61 are very close to the benchmark set by version 1.52 (i.e. 130,087 reads). If averaged over several input data sets, this may well have won the ONE million dollar accuracy challenge!! This shows the feasibility of the accuracy challenge and confirms my initial thought that, with a moving target, it will be next to IMPOSSIBLE after the first two quarters. There go my chances, so back to coming up with Facebook apps that may appeal to teenagers with too much time on their hands :P
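
The comparison above can be written as a short Python check. The read counts are copied from the table; the ratio is just my own way of expressing how close each release came to the target set by the release before it.

```python
# Read counts copied from the table above.
aq20_reads = {"v1.40": 111_559, "v1.52": 130_087, "v1.61": 189_383}
aq23_reads = {"v1.40": 67_426, "v1.52": 75_051, "v1.61": 121_884}


def challenge_ratio(previous, new):
    """100AQ23 reads of the new release relative to the 100AQ20 reads
    of the previous release; a value of 1.0 or more would meet the goal."""
    return aq23_reads[new] / aq20_reads[previous]


for previous, new in (("v1.40", "v1.52"), ("v1.52", "v1.61")):
    ratio = challenge_ratio(previous, new)
    verdict = "meets" if ratio >= 1.0 else "falls short of"
    print(f"{new} vs {previous}: {ratio:.1%} of target, {verdict} the goal")
```

On this single data set, version 1.52 reaches about two thirds of the target set by version 1.40, while version 1.61 reaches about 94% of the target set by version 1.52.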

Conclusion

There are several limitations with the analysis I have performed. First, the different versions of ion-Analysis may have been optimized with different goals in mind. For example, version 1.61 may have been optimized for the new longer read chemistry, new flow rates, the 316 chip and the soon to be released 318 chip. However, it does a pretty good job of analyzing our 314 data set. Second, performance on a DH10B library may not be a good reflection of how it performs on “real world data” or even human data, which may have different challenges. Third, this is only the result from one input data set and therefore may not be representative of the average performance. Fourth, when I was supplied with the pre-release binary, the guys at Ion Torrent forgot to include the improved phred prediction table, so I used the one from version 1.52 instead. Improved quality prediction may lead to different read lengths after trimming, further improving the read length metrics. Lastly, the preparation of the samples before pressing RUN places an upper limit on how good the results can be. This includes the size selection of the fragments during library preparation. In other words, don’t expect to see 400 bp reads! The girl who prepared this is experienced in molecular biology but this is her first Next Generation Sequencing experiment! This is testimony to how simple the technology is to use and how great the lab girls are in our lab :D

Again, a big thanks to Mel and Mike, who made a pre-release of version 1.61 available to me. In the next post, I will discuss the thing that shall not be named…HP, and that does not stand for Harry Potter :P

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. Having early access to the ion-Analysis has made me one of them :(

7 responses to “Ion Torrent – Rapid Software Improvements”

  1. those god damn homopolymers

  2. Monkol,

    I really enjoyed your presentation in Montréal. Keep up the good work. I will make comments on your blog on the technical stuff when I get a chance, but not on the Britney Spears stuff! ;)

    Yanick

  3. Yanick, thanks for the kind words, support and encouragement. It was great to meet some of the active people in the Ion Community, while in Montreal. I hope to write my next technical post once I get back from this extended trip. I look forward to your comments in the future.

  4. Monkol

    Enjoyed your presentation in Montreal. I just took a look at the accuracy challenge and noted that the current goal is harder, at 200bp Q23 (99.5%) from 200Q20, rather than the 100Q23. Any thoughts on how feasible this goal is to achieve?

    James

  5. James, thanks for the kind words and feedback.
    That’s a very good and interesting question and makes for a very compact blog topic. I’m still touring North East America but will get on to that when I get back. Without looking at the data, here are my initial thoughts. The feasibility or difficulty of the challenge is a measure of how difficult it is to get the subset of sequence reads at Q20 that didn’t make it to Q23 to get there. A good question to ask is whether those Q20 reads would ever get to Q23; TorrentScout would be a good tool to use to ask that question.

    The implementation of a non-greedy approach (i.e. treephaser), dynamic programming and adaptive normalization are, to me, low-hanging fruit, as most technical people could have predicted that these techniques would eventually be implemented. Their arrival therefore increases the difficulty of the challenge. However, it should also close the gap between Q20 and Q23. In other words, smaller percentage improvements are now needed, BUT that should not be used as a measure of difficulty.

  6. Monkol,

    Have you noticed improvements in pure signal processing in 1.5? Or is it mainly improvements in filtering and phase solving?

    Yanick

    • That’s a good question. I’ll get back to you shortly on that. One thing I’ll look at is mixing and matching between 1.4 and 1.5. Let 1.5 do the signal processing to produce the 1.wells file, then use 1.4 to do the phase solving and base calling. Then perform vice versa to sort of measure the improvements each module has made to the overall performance difference.
