Ion Torrent – Rapid Software Improvements

This is the second post of what is now to be a four part series looking at how Ion Torrent accuracy has improved over time. In this edition, I will show what a massive difference software can make with this technology. The results presented here was only possible because the software is open source. In addition, Mike and Mel have given me early access to binaries (ion-Analysis v1.61) that will be released in Torrent Suite v1.5. That’s a huge thank you to Mel and Mike!

There are three major areas that software can improve

  1. Throughput – Identify more Ion Sphere Particles (ISPs) with library fragments therefore increasing the total number of reads.
  2. Quality – More and longer reads aligning back to the reference sequence.
  3. Speed – Reduce the computational time to analyze the data.

The way I am going to present this data is to keep the data set the same (i.e. input DAT files) BUT perform the analysis using the different versions of the software, i.e. ion-Analysis. The ion-Analysis binary/software is responsible for ISP finding, signal processing and base calling. I have discussed signal processing and base calling in my previous blog posts. I have also briefly touched on bead finding  ISPs but will go into more detail in my Signal Processing blog series. The three versions I have used are:

  1. ion-Analysis v1.40 (from Torrent Suite v1.3) REL: 20110414
  2. ion-Analysis v1.52 (from Torrent Suite v1.4) REL: 20110712
  3. ion-Analysis v1.61 (pre-release Torrent Suite v1.5) DATED: 20110914

 Method

// The datadir contains a 314 Run of the DH10B library.
// Execution of the ion-Analysis program
// creates rawlib.sff and rawtf.sff
Analysisv1.nn datadir --no-subdir > Analysis.out
2> Analysis.err

// rename the files before trimming
mv rawlib.sff rawlib_untrimmed.sff
mv rawtf.sff rawtf_untrimmed.sff

// Trim the files based on quality, minimal length and
// remove 3' adapter
SFFTrim --in-sff rawlib_untrimmed.sff --out-sff rawlib.sff
--flow-order TACGTACGTCTGAGCATCGATCGATGTACAGC --key TCAG
--qual-cutoff 9 --qual-window-size 30 --adapter-cutoff 16
--adapter ATCACCGACTGCCCATAGAGAGGCTGAGAC --min-read-len 5

// Create the fastq file from the SFF file
SFFRead rawlib.sff -q rawlib.fastq

// performs tmap (v0.0.28) alignment, dumps quality metrics
// Q47.histo.dat used as input for a modified libGraphs.py
// python script to produce AQ47 quality distribution
alignmentQC.pl -i rawlib.fastq -g e_coli_dh10b
-q 7,10,17,20,23,47

// performs read length metrics.
// readLen.txt used as input for a modified
// trimmedReadLenHisto.py python script to produce
// Read Length distribution
SFFSummary -o quality.summary --sff-file rawlib.sff
--read-length 50,100,150 --min-length 0,0,0
--qual 0,17,20 -d readLen.txt

Throughput

The table below shows that between versions 1.40 and 1.52 there was a modest increase in the number of ISPs identified (i.e. occupied wells), resulting in an increase in final library reads. There has been a slight decrease in version 1.61 which I will show in the next section it is quality and not quantity which is really important. Between versions 1.52 and 1.61 there is a massive difference in the filtering metrics. The blame has been shifted from Poor signal reads to Mixed/Clonal reads. This has massive consequence on how throughput can be increased further. The problem of poor signal reads is largely due to the quality of the raw data and the downstream computational processing, while mixed/clonal reads are due to sample preparation. There is a possibility that there is a bug in the pre-release code.

Ion Sphere Particle (ISP) Metrics

v1.40 v1.52 v1.61
Library ISPs
737,584 860,574 843,844
Filtered: Too short
< 0.1% 3.24% 0.80%
Filtered: Keypass failure
10.0% 8.78% 0.60%
Filtered: Mixed/Clonal
6.70% 9.80% 40.16%
Filtered: Poor Signal Profile
31.70% 26.19% 8.26%
Final Library Reads
381,629 447,484 423,564

Read Length Distribution

The improvements in software has allowed not only for more reads but slightly longer reads. This can be observed as a slight shift of the distribution to the right and also a widening of the peak near 100 bp. Interestingly, version 1.61 also has a small shoulder at 150 bp.

Quality

Based on Full Library alignment to Reference (Quality: AQ20)

v1.40 v1.52 v1.61
Total Number of Bases [Mbp] 24.84 29.38 32.46
Mean Length [bp] 77 77 88
Longest Alignment [bp] 119 137 144
Mean Coverage Depth 5.30× 6.30× 6.90×
Percentage of Library Covered 98.96% 99.48% 99.66%

Based on Full Library alignment to Reference (Quality: AQ47)

v1.40 v1.52 v1.61
Total Number of Bases [Mbp] 23.21 27.29 29.68
Mean Length [bp] 72 72 82
Longest Alignment [bp] 119 133 138
Mean Coverage Depth 5.00× 5.80× 6.30×
Percentage of Library Covered 98.64% 99.26% 99.48%

Quality (AQ47) length distribution

At first glance the length distribution looks slightly better in version 1.40 compared to the later version 1.52. The peak in much higher and broader at around 100 bp for version 1.61. An important thing to note is that there will be a point that read length quality will be restricted by the library fragment length. For example, if the average library fragment length is 150 bp, it would be impossible to get a 400 bp read!


Speed (Computational Time)

There is a massive reduction in computational time between versions 1.40 and 1.52. This was when the NVIDIA Telsa GPU was employed through the use of the CUDA SDK. The use of GPU computing has been highly beneficial for Bioinformatics programs such as HMMER. In the case of Ion Torrent, the biggest reduction is observed within the processing of the raw flowgrams (i.e. signal processing). This requires loading data from all chip wells from 20 flows (i.e. 20 DAT files) into memory and performing parameter fitting (Levenberg–Marquardt algorithm) using matrix operation and linear algebra within armadillo, blas and lapack/lapackpp libraries. In addition, there is modest improvements between version 1.61 and 1.52. This maybe due to the tree dephasing algorithm used for base calling as most of the time reduction was observed in this stage. The name “tree” would suggests a non-greedy algorithm was implemented. See my previous post regarding the greedy implementation.

v1.40 v1.52 v1.61
Bead find
0.8 0.9 0.8
Bead Categorization
<0.1 <0.1 <0.1
Raw Flowgrams
48.2 22.5 24.9
Base calling
48.3 22.4 6.6
Total time
97.3
45.8
32.3

Note: Time is in minutes. Raw Flowgrams is the signal processing stage.

Besides more efficient algorithms, run time is dependent on number of wells and flows to process. As Ion Torrent aims to increase overall throughput through increasing number of reads (i.e. wells) and read lengths (i.e. flows), it is crucial to have computationally efficient algorithms which are constantly improving.

Life Grand Challenges – Accuracy

In this analysis there are two software versions that have or will be released shortly after the quarterly close of the accuracy challenge. This allows the unique opportunity to ask the question, would Ion Torrent software developers win the accuracy challenge themselves? In other words, how feasible given the time limits is it to achieve the goals set in the accuracy challenge. The goal is the equivalent of achieving a greater or equal the number of 100AQ20 reads (in the previous software release) but at 100AQ23 . It is the equivalent to the goals set by the challenge as software is released approximately every quarter.

v1.40 v1.52 v1.61
100AQ20 Reads
111,559 130,087 189,383
100AQ23 Reads
67,426 75,051 121,884

The 75,051 100AQ23 reads achieved by version 1.52 does not come close to the 111,559 100AQ20 reads achieved by version 1.40. Interestingly, the 121,884 100AQ23 reads is very close to the benchmark set by version 1.52 (i.e. 130,087 reads). If averaged over several input data sets, this may well have won the ONE million dollar accuracy challenge!! This shows the feasibility of the accuracy challenge and confirmed my initial thoughts, that with a moving target after the first two quarters it will be next to IMPOSSIBLE. There goes my chances so back to coming up with Facebook apps that may appeal to teenagers with too much time on their hands 😛

Conclusion

There are several limitations with the analysis I have performed. First, the different versions of ion-Analysis may have been optimized for different goals in mind. For example, version 1.61 may have been optimized for the new longer read chemistry, new flow rates, 316 chip and the soon to be released 318 chip. However, it does do a pretty good job with analyzing our 314 data set. Second, performance on a DH10B library may not be a good reflection on how it may perform on “real world data” or even human data that may have different challenges. Third, this is only the result from one input data set therefore may not be representative of the average performance. Fourth, when I was supplied with the pre-release binary the guys at Ion Torrent forgot to include the improved phred prediction table. I instead used the one from version 1.52. Improved quality prediction may lead to different read lengths after trimming, further improving the read length metrics. Lastly, the preparation of the samples before pressing RUN places an upper limit on how good the results can be. This also includes the size selection on the fragments during library preparations. In other words don’t expect to see 400 bp reads! The girl who prepared this is experienced in molecular biology but this is her first Next Generation Sequencing experiment! This is testimony to how simple the technology is to use and how great the lab girls are in our lab 😀

Again big thanks to Mel and Mike who have made a pre-release version 1.61 available to me. In the next post, I will discuss the thing that shall not be named…HP and that does not stand for Harry Potter 😛

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. Having early access to the ion-Analysis has made me one of them 😦

Advertisements

The Ion Torrent Accuracy Challenge – A non-biological explanation

My computing friends have read my blog and don’t understand it but think my lame jokes are quite good. They’re my friends so they MUST think the lameness is funny 😛  Anyways, in this blog post I will attempt to explain the Accuracy challenge in a more computer science friendly albeit nerdier way. Most of the biological jargon is good to know but not absolutely necessary to be able to solve the problem. The below is aimed at people who are familiar with the ACM problem sets (Sorry I fixed the dodgy link, my bad). I used to solve these problems when I was working in IBM, it was a good way of staying awake after a big lunch. My point with this post is to show that this could be pitched to the talented people who participate in ACM challenges if approached correctly.

Problem:

King Koopa (Bowser?) – Has no respect for thesis writing. This guy is really bad!

Yoshi – I think this dude is a lizard

The pipe used to hide Yoshi’s PhD thesis!

The evil King Koopa has stolen Yoshi’s PhD thesis before he was able to submit. Yoshi being a lizard has a language composed of 4 letters {A,C,T,G} from which words are constructed. Koopa decides to shred Yoshi’s thesis into one million words and place each one of them in these green tubes and seals them.

In order to rewrite his thesis, Yoshi  must infer what word is in each tube. This can be achieved by tapping one of the four sides of the tube over a 11 second period (round). Yoshi taps at a rate of 15 taps per second. Yoshi repeats this procedure on the tube sides for 260 rounds. Therefore each tube receives a total of 15 x 11 x 260 taps spread equally across the 4 sides.

The echo caused by the tapping allows Yoshi to infer the letters (because that’s what lizards do!). The inferred letter length can be 0, 1, 2, etc depending on how loud the echos are at crucial taps in a round of tapping. Each of the 4 sides corresponds to one of the letters in Yoshi’s 4 letter language. Yoshi tests each letter and cycles in a logical fashion.

Input:

Measured echo intensities can be thought of as a two dimensional array.

int data[n][m]

0 <= n < 260 – The round number. This determines the letter currently being decoded (i.e. side of the tube that’s being tapped on).

0 <= m < 11 x 15 – The tap number. Each round of tapping goes for 11 seconds. Therefore at a tapping rate of 15 taps/sec, there are 11 x 15 taps in total for a round.

Output:

Below is the result of 4 different rounds of tapping the same side of the green tube corresponding to the letter T in Yoshi’s alphabet. Yoshi has very good ears and is able to tell the difference between the subwords T, TTT, TTTTT and when there are no T in the current position of the word. Although Yoshi has good ears, over time the echos created from previous taps persists for longer making it hard for Yoshi to hear echos within and between rounds. In addition, in later rounds Yoshi gets tired and doesn’t tap as hard so the echos produced aren’t as loud making it difficult for Yoshi to discriminate between letter lengths. Often Yoshi gives up before getting to the end of the word and takes guesses at times.

The accuracy challenge is to help Yoshi filter out the persistent echos within and between rounds so he can concentrate on the “true echo”. Also it’s to get Yoshi to HARDEN UP and not get so tired in the later rounds of tapping.

Figure 1. Yeah I didn’t have time to relabel the axis. Anyways, this is the echo intensities plotted from 4 rounds of tube tapping. Each round lasts for 11 seconds at 15 taps per second.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. If Koopa did that with my thesis, I would hit him with the red shell!

Rapid improvements may lead to comparison pitfalls

In a previous blog post, I showed the rapid improvements Ion Torrent has made over a short 3 month period. I assume that both the GS Junior and the Illumina MiSeq have experienced improvements also over this time. These desktop sequencers require less financial commitment and usually requires only one grant to fund. In contrast, the big toys (eg. HiSeq, SoLiD 5500) require a massive financial commitment to purchase and operate. The profit margin may be much higher for the big toys BUT not much volume will be sold. In Sydney, one hand is needed to count how many big toys out there. By selling small toys the profit margins can be reduced as larger volumes are expected to sell. This creates a much larger user community instead of 3 people in each city who have observed a “sequencer in the wild”. The best analogy would be the personal computer (PC) revolution largely replacing the need for mainframes. I believe that the affordable costs of all three technologies will push the biotech industry to be more innovative and cost effective. No matter what sequencer you buy all customers will enjoy the benefits of this healthy competition.

The strength of computational analysis (including bioinformatics) is the relative ease of reproducibility given a data set and method. This is made even more easier, if the method involves using software with all options/setting specified. No other science offers the same warm fuzzy feeling of reproducibility. The greatest thing that has come out of the “Sequencing Wars” is the public release of raw data sets to accompany the application notes. This is the first time EVER I have seen a biotech company say “You don’t believe us? Here’s the raw data analyze it yourself!”. This is extremely bold move that has been taken first by Ion Torrent then shortly followed by Illumina MiSeq. For once, this empowers the customers to make their own conclusions! Many kudos to Ion Torrent and Illumina for making their raw data available. I am yet to find any public raw data for the GS Junior. In fact, Roche appears to be behind Life Technologies and Illumina when it comes to embracing the information age. Come on guys, pick up your act and get rid of them dinosaurs in upper management!

There is a long list of the technicalities and nuances with comparing the competing desktop sequencers. This problem is not unique when comparing different technologies. The unique issue comes from the pitfalls resulting from the rapid improvements. This is not like comparing two different car models or laptop models where their performance is rather static before and after purchase. In contrast, sequencing technology purchased can further improve through the consumables themselves and software versions for data analysis. The point of my previous post was to highlight this fact. Therefore, it is extremely important to assess when a data set was analyzed as more rapidly evolving technologies may be extremely disadvantaged. Showing the performance of a sequencing technology from 2011 Q2 (i.e. 3 months ago) is not an accurate representation of the technologies current performance!

I will illustrate my point using the rapid evolution of the most talented music artist of all time, Britney Spears. Britney spears is a singer, dancer, actor, perfume model, fallen angel and mother all in one.

Britney the Singer

Britney the Dancer

Britney the Actor

Britney the Perfume Model

Britney the Fallen Angel

Britney the Mother

Similarly, depending on when you look at Britney Spears’ career you would be underestimating her amazing talents.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. People should just give Britney a break, she’s only human. 

Ion Torrent – QV Prediction Algorithm

In this stand alone blog post, I will attempt to detail the predicted quality value (phred scoring) algorithm that the Ion Torrent is currently using. As the quality values is one of the battlegrounds the Next Generation Sequencing wars (Clone Wars is way cooler!) are currently being fought, it would be good to explain the difficulty in using this as a benchmark. Illumina has fought this battle on the predicted quality values. This is a good ground to have a fight on considering Illumina’s prediction algorithm is mature and is quite good at predicting the empirical quality (Figure 1). Illumina has pointed this out in their recent application note. A good prediction algorithm is good if there is no reference sequence to compare your target against (aka de novo sequecing). In addition, “the point of predicted accuracy is that many tools use this in their calculations. The more accurate these estimates, the happier those tools are. Of course, you can always go and recalibrate everything , but that’s an extra step one would rather avoid.” (Thanks Keith for the input taken from the comments section). Ion Torrent has fought on the empirical quality battleground. Their argument is who cares what the predicted values are, actual values are more important. This is a great point, given Economist spend most of their time explaining why things they predicted yesterday didn’t happen today. On the rare occasion when they get it right… their ego expands faster than the rate the Universe expanded slightly after the big bang! 😀 The reason why Ion Torrent has fought this battle on the empirical battleground is mainly due to the current weakness in their quality prediction algorithms (Figure 2).

Figure 1. Illumina phred score prediction is closer to the empirically derived values. This is read 1 from the MiSeq DH10B data set.

Figure 2. The Ion Torrent prediction algorithm under predicts quality by approximately 10 phred points.

Since Ion Torrent have released the source code, I am able to interpret how per base quality values have been calculated. These quality values are determined after carry forward, incomplete extension and droop correction (aka CAFIE or Phase correction). The quality values are recorded with the corrected signal incorporation in the SFF file.

Please note all equations are MY INTERPRETATION of the source code and since I didn’t write the code, I am probably incorrect sometimes.

Big thanks to Eugene (see comments below) from Life Technologies for correcting and providing an example for Predictors 4 and 5.

There are six metrics that are used to predict the per base quality values:

  1.  Residue (float)- distance the corrected incorporation value is from the nearest integer.
  2. Local noise (float) – maximum residue amongst the previous, current and next corrected incorporation value. Radius of 3 bases.
  3. Global noise (float) – Calculated from the mean and standard deviation of all zero-mer and 1mer signals for this well/read.
  4. “The homopolymer length, but it is assigned to the last base in the homopolymer (since there is a much higher chance of being off by 1 in the homopolymer length than by 2 or more).” All other bases in the homopolymer are assigned the value 1.
  5. “The homopolymer length the last time this nucleotide was incorporated – this basically a penalty for incomplete incorporation.”
  6. Local noise (float) – Calculated similar to (2) but with a radius of 10 bases.

An example of predictors 4 and 5 is detailed below:

A A A A T A C C C
1 1 1 4 1 1 1 1 3 (Predictor 4)
0 0 0 0 0 4 0 0 0 (Predictor 5)

Note: Predictor 5 is dependent on flow order so in the above case it depends where in the 32 redundant flow cycle these bases were called.

Once these six metrics have been calculated for this flow/base call these values are compared to a empirically derived phred table (Note: each flow produces a base call, many just have a value of zero). There is currently two versions for this phred lookup table. The comparison is made from the top of the table (i.e. phred score 33) and works it’s way down until the six metrics are below the minimum criteria for that phred score. The maximum phred score is 33, while the minimum is 7 and 5 for phred table versions 1 and 2, respectively. As the Ion Torrent is quite new, it is understandable that the phred scoring algorithm still needs more calibration. Therefore, it is quite unfair to compare Illumina predicted QVs against the Ion Torrent one.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. This is an independent analysis using Novoalign kept simple so others can reproduce the results.

You are the 5,000th viewer – you won a thought of winning something really good

It’s been little over one month and Biolektures has received over 5,000 views. Thank you everyone for visiting and commenting. Now I will rant on and on.

First and foremost thank you to all my readers directed from SEQAnswers, Twitter, Ion Community, other blog posts and a mysterious company called Ion Flux. Without my readers and their need to click on all the blog posts, there would be probably less than 1,00 views. The 1,00 views would be a combination of me and spam trying to advertise the secret on how I can get “ripped in 4 weeks.” Well I got ripped in one night by eating too much now my pants need to be fixed ! 😆

I thank the support of established bloggers who have done an amazing job tweeting and mentioning my blog. This includes Nick, Keith, Justin, Lex and Daniel. Thank you for your support and encouragement.

I thank the Ion Community and Life Technologies for making the bioinformatics aspect as OPEN as their legal team will allow. This has made people like me with no life, feel like part of the LIFE action. Thank you to those in the Ion Community who have answered my many technical questions and random outbursts claiming shenenigans. This includes Mike, Simon, Mel, Matt, Jason and the dude with the cat picture as an avatar (sorry don’t know your first name). I especially like to thank Mike for his encouragement and being a regular commenter. I will keep you updated on my plans to convert the Torrent Server into a Street Fighter 4 arcade machine as a way of generating revenue to pay for our experiments 😛

I thank those who have written comments. At least I know you people have been reading the post and not just looking at the pictures 😛 This includes Mike, Peter, Andrew and Nick. Keep the comments coming but no big words please 😀

Lastly, I thank my wife Angela for putting up with my blogging. I haven’t spent quality time such as watching Gilmore Girls, One Tree Hill and Sex and the City. I will be more available this month for the quality viewing these shows have to offer 😦

I will leave with a link to a short love story of a guy trying to impress a girl with his uber l33t h4x0r skillz featuring Enrique’s Hero.

Ion Torrent Signal Processing (Part 1) – Background and Nucleotide Incorporation Models

This is the first part of a planned three part blog series on Ion Torrent signal processing. In this first part I will discuss the important aspects of the background and foreground model using key mathematical equations and pseudo code.  In the second part, I will outline the high level process of signal processing which includes the key parameters that must be fitted. In the final part, I will discuss the major assumptions and where the model breaks down.

The goal of Ion Torrent signal processing is summarize time series data (Figure 1) into just ONE value which is then stored in the 1.wells file. The 454 equivalent is the .cwf files (thanks flxlex), however the difference is that Life Technologies has made their signal processing OPEN through the release of their source code. Without the source code, I would just be speculating in this blog series. So yay to available source code and kudos goes to Ion Community contributors Mel, Mike and particularly Simon for answering all my questions in great detail.

In my opinion signal processing is the root cause of two problems:

  1. Reads that must be filtered out due to poor signal profile*. This can account up to 30% of the reads as observed in the long read data set that was released.
  2. The resulting base call particular towards the end of the reads. There is only so much signal normalization and correction (covered in Fundamentals of Base Calling series) that can be performed.

Therefore, improvements made will have the biggest effect on improving accuracy and increase the amount of reads. In other words, if you improve on this you can have ONE million dollars.

Ion Torrent – Signal Challenge

The major challenge of signal processing is that the foreground signal is not much bigger than the background signal. This is like trying to have a conversation with someone in a crowded noisy bar with loud music. This is very difficult but not impossible. Two reasons why it is possible:

  1. You start getting used to the background sound and learn to ignore it.
  2. You know how your friend sounds like and focus on only the key words in the sentence.

In reality though I refuse to try and instead nod my head away pretending to listen 😛 However, the Ion Torrent signal processing works on a similar principle.

Figure 1A. Uncorrected signal from the first 100 flows from a live well. This was from a 4 flow cycle (Q1 2001) and thus 25 flows per nucleotide.  If you look hard enough there are small bumps between 1500-2000 ms that represent nucleotide incorporation.

Figure 1B. A typical baseline corrected measurement from an occupied well (red) and an adjacent empty well (black). The tiny red bump between 1500-2000 ms represent a nucleotide incorporation.

Background Model

The background model aims to approximate how the signal will look like for a given flow if there was NO nucleotide incorporation. The problem is what to use as a point of reference.  The best and intuitive source is a zero-mer signal from the well itself as this would encapsulate all the well specific variance and parameters. A known zero-mer signal can be taken from the key flows (i.e. first 7 flows). The only draw back is that each well is a dynamic system which changes over time due to slight variance in flow parameters and changing state of the system. Another possibility is to re-estimate the zero-mer signal every N flows. The problem with this approach is that later on there will be no TRUE zero-mer signal as there will be contributions from lagging strands. The surrounding empty wells are the only candidate left.

The loading of a chip wells with Ion Sphere Particles is a probabalistic event and not all particles fall into wells. Due to the size of the particles and wells, it is physically impossible to fit two particles in a well.  Therefore, a well should either be empty or have one particle in it. The way the Ion Torrent detects whether a well is empty or not is by washing NaOH and measuring the signal delay compared to its neighboring wells (Figure 2). An empty well has less buffering capacity and therefore should respond earlier than its occupied neighbors with particles. There is sometimes a grey area in between and the Ion Torrent analysis uses clustering to best deal with this grey area.

Figure 2. The voltage response from the NaOH wash at the start to detect occupied and empty wells. I’ll explain in more detail in next blog post. The putative empty wells (colored black) respond earlier and much faster than occupied wells (rainbow colored).  The well represented as a red dotted line lies in the “grey zone”, i.e. hard to classify as either empty or occupied.

Background Signal

There are three major contributors to the background signal

  1. Measured average signal from neighboring empty wells (ve). This signal must be time shifted as it will be subtracted to leave foreground signal.
  2. Dynamic voltage change (delta v). Can’t explain it beyond that 😦
  3. Crosstalk flux (xtflux)

I will let the mathematics do all the talking below 🙂 This is a screen capture of a latex document I produced a few months a go so I don’t remember much 😥 Please note all equations are MY INTERPRETATION of the source code and since I didn’t write the code, I am probably incorrect sometimes.


Foreground Signal – Nucleotide Incorporation Model

The Foreground signal is calculated by subtracting the background signal away from the measured signal for an occupied well. By using this model, we can determine the value A which represents the nucleotide incorporation value (aka uncorrected signal) that gets stored in the 1.wells file.

During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well observed as a detectable voltage change. What I mean by “relatively” is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate which is usually the case with the homopolymers . This looks like a sudden spike followed by an exponential decay (Figure 3). This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived nucleotide base (A,C,T,G) specific parameters such as Km values (plagiarized from myself :lol:).

Figure 3. Signal produced by subtracting an empty well from an occupied live well, (i.e. subtracting the dotted black line from the red line in Figure 1). The peak of ~60. The average key flow peak in a typical Ion Torrent report is calculated in a similar way. This is from Q1 2011 DAT file so is not sampled at a more desired rate.

Nucleotide Specific Parameters

Nucleotide Incorporation Simulation

The goal is to find A that best reduces the error. I will let the mathematics below speak for itself.

In the next blog post for this series, I will list the major parameters used in signal processing. These are the mysterious unaccounted variables in all the above equations and also high level description on how parameter fitting is performed.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. The mathematical interpretation was done some time ago when I was in my “happy place”. Now I’m not in that “happy place” so don’t remember a thing!

Ion Torrent – Rapid Accuracy Improvements

The periodic public release of data sets by Life Technologies and others in the scientific community has allowed me to perform a “longitudinal study” of the improvements made on the Ion Torrent. In fact, the last few months has been quite exciting with Ion Torrent engaging the community through public data release along with source code. This has made the whole scientific community feel in some way as being part of the action. In this three part series which will run in parallel with the Signal Processing series, I will look at three major developmental themes:

  1. Improvements in Accuracy
  2. Homopolymer problem – can’t call it improvements because I haven’t analyzed the data yet 😛
  3. Changes in the ion-Analysis source code. This binary is largely responsible for all the data analysis, that is going from raw voltages (DAT files) to corrected incorporation signal (SFF files). Subsequent base calling from SFF files is quite trival 🙂

The analysis was performed using Novoalign in the Novocraft package (v2.07.12) according to the instructions detailed on their Ion Torrent support page. The plots were produced using the Rscripts provided in the package with slight modifications to change the look and feel. I used the fastq files as input and did not do any pre-processing to ensure the reproducibility of the data. The same command line options was used and is noted at the bottom of each plot. The only exception is the last plot for the long range data set where the “-n 300” option was used to inspect quality past the default 150 bases. Kudos to Nick Loman for the help (see Comments below). I quite like the package and the fast support provided on the user forum (kudos to Colin). There is a nice gallery of figures provided on their Facebook page.

From the quality plots there are two very obvious things. First, the predicted estimation is overly conservative and they are underselling themselves by an average of 10 Phred points. This was noted also on the Omics Omics, EdgeBio and Pathogenomics blog posts. Second, the predicted quality along reads from the 316 data set (Figure below) used by the Illumina MiSeq application note is an unfair and incorrect representation of what is happening.

Raw accuracy cheat sheet:

Q10 = 90% accuracy
Q20 = 99% accuracy
Q23 = 99.5% accuracy
Q30 = 99.9% accuracy

In my opinion, actual observed accuracy is more important than predicted. For example, I predicted network marketing was going to make me a fortune and I would be financially free by now. Unfortunately, my friends didn’t want to buy my stuff 😥 Their loss !! The plots from the long read data set shows the massive improvements made in just a few months. This makes me very optimistic for the future  😛

18th May 2011 (Analysis date)

Source: EdgeBio (Project: 0039010CA)
Chip: 314
Run date: 2011-04-07
Library: DH10B
ion-Analysis version: 1.40-0
Flow cycles: 55
PGM: 1EBIO
 
 
 
 

8th June 2011 (Analysis date)

Source: Life Technologies (316 data set)
Chip: 316
Run date: 2011-06-06
Library: DH10B
ion-Analysis version: 1.49-3
Flow cycles: 65
PGM: B13

 

21st July 2011 (Analysis date)

Source: Institute for Neuroscience and Muscle Research (my lab :))
Chip: 314
Run date: 2011-07-21
Library: DH10B
ion-Analysis version: 1.52-11
Flow cycles: 65
PGM: sn11c032809

Source: Institute for Neuroscience and Muscle Research (Run 2)
Chip: 314
Run date: 2011-07-21
Library: DH10B
ion-Analysis version: 1.52-11
Flow cycles: 65
PGM: sn11c032809

28th July 2011 (Analysis date)

Source: Life Technologies (Long Read data set)
Chip: 314
Run date: 2011-07-19
Library: DH10B
ion-Analysis version: 1.55-1
Flow cycles: 130
PGM: B14
 
 
 
 

Democratizing sequencing?

This wouldn’t be a blog post by me if I wasn’t complaining about something… I mean giving feedback 🙂 A slogan that is regularly used by Ion Torrent is how they are “democratizing” sequencing. In terms of releasing data sets and source code they are far ahead of their competitors Illumina and Roche. The above analysis would not be possible without public release of data sets from Life Technologies and also EdgeBio (Ion Torrent Sequencing service provider). Illumina has provided some data sets from their MiSeq. This killed my bandwidth downloading as they forgot to compress the fastq files. What n00bs! When will Illumina and Roche provide more data sets for their competing desktop sequencers? In the case of Roche when will they provide any? Also when will they learn that people outside the major sequencer centers have brains and perhaps they should interact with them every now and then!

Despite the great efforts made by Life Technologies, there is still a long way to go in my mind to truly democratize sequencing. For example, early access to new products should be given to labs that are trying to make a difference in society and not just their “special customers”. What better way to promote your technology by showing that a small lab with little experience can get it to work. I am not impressed at all if an experienced sequencing lab can get it to work. Giving these products to just special customers (aka the “big boys”) is NOT democratizing sequencing, it is maintaining the dominance these labs have over high impact publications. Our lab has requested for early access to the TargetSeq enrichment system (not to be confused with the Qiagen SeqTarget junk). Having access to this enrichment would allow us to explore the possibility of diagnosing children with muscular dystrophy more efficiently and help parents, families and carers plan the future natural progression of these crippling diseases. Having early access will give us an opportunity to produce preliminary data for the next grant round. How about helping the “little guy” for a change?

In my next blog post of this series I will provide an independent analysis of homopolymers using the data sets above. This will provide further discussion in addition to the great post comparing Ion Torrent and 454 homopolymers from flxlex.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. This is an independent analysis using Novoalign kept simple so others can reproduce the results. Despite begging, I have never been treated to free lunch/dinner or even a free T-shirt by Life Technologies 😥