In part 1, I discussed signal normalization and droop correction. Below is the amount of variation in the signal still left to be explained.
Figure 1. Signal residues calculated by | Normalized Signal – Ideal Signal |. The residues are largest at the 3mers, 4mers and towards the end of the read.
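As a minimal sketch of how the residues in Figure 1 are defined, the snippet below computes | Normalized Signal - Ideal Signal | per flow. The function name and the signal values are made up for illustration; they are not data from the run shown in the figure.

```python
def signal_residues(normalized, ideal):
    """Per-flow residue: absolute difference between normalized and ideal signal."""
    if len(normalized) != len(ideal):
        raise ValueError("both signals must cover the same flows")
    return [abs(y - a) for y, a in zip(normalized, ideal)]


# Hypothetical values: a 0mer, a 1mer, a 0mer and a 3mer.
normalized = [0.02, 1.05, 0.10, 2.80]
ideal = [0, 1, 0, 3]
residues = signal_residues(normalized, ideal)
```

In this toy example the 3mer carries the largest residue, which is the pattern the figure describes for longer homopolymers.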
There are a lot of journal articles out there that discuss the phase problem in a non-technical way, but to fully appreciate the problem, and to be of any benefit to those who want to improve Ion Torrent accuracy, it MUST be discussed technically.
The Phase problem
Firstly, the problem of phase correction is not unique to Ion Torrent but is common to all technologies that rely on sequencing by synthesis, which also includes Illumina and 454 sequencing. What does sequencing by synthesis actually mean? In a nutshell, it is observing DNA polymerase adding nucleotides one by one, creating the complementary DNA strand. The difference between the technologies is how the observation is made.
In the case of Illumina, dye-labelled nucleotides (A, C, T, G) are flowed, the polymerase incorporates a nucleotide, and a picture is taken after this event. The genius of this technology is the reversible dye used to label the nucleotides. For each flow only one nucleotide can be added, which removes the problem of detecting whether multiple nucleotides (homopolymers) were added. Unlike in Sanger sequencing, the dye can then be removed, allowing another dye-labelled nucleotide to be incorporated in the next flow.
In the case of Ion Torrent, the nucleotides are not labelled and a by-product of incorporation (hydrogen ions) is detected instead. We know which of the 4 nucleotides was incorporated as ONLY ONE nucleotide is present in each flow. In addition, since these nucleotides are unmodified, incorporating multiple nucleotides in one flow (i.e. a homopolymer) produces a proportionally larger amount of hydrogen ions. The 454 works on the same principle, but a different by-product of nucleotide incorporation is detected. In all three technologies, a large number of identical template strands are concentrated in a small area (i.e. cluster or well), allowing the combined signal from a large number of identical strands to be detected.
Finally, the problem itself. For each flow, whether a nucleotide is incorporated is not a deterministic event but rather a probabilistic one.
Carry Forward errors are analogous to false positives, while Incomplete Extension is analogous to false negatives. The remainder of the discussion will mainly focus on Incomplete Extension, as the contribution made by Carry Forward errors is negligible in comparison.
Consequence of Incomplete Extension
Before I start this discussion, I need to define the difference between the number of bases called and the number of incorporation events. I’ll illustrate with the example sequence below:
In this sequence there are 8 bases; however, the Ion Torrent would have produced this sequence read through four positive flows: the A, T, C and G flows, in that order. In other words, only 4 incorporation events are needed to produce the above sequence. For Incomplete Extension, we are only interested in what happens during incorporation events (i.e. positive flows).
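The difference between bases and incorporation events can be sketched in a few lines of Python. Both the sequence `AATTCCGG` and the `TACG` flow order are assumptions chosen for illustration (any 8-base sequence made of A, T, C and G runs in that order would behave the same way), and the function name is hypothetical.

```python
def ideal_flowgram(sequence, flow_order="TACG"):
    """Return (nucleotide, homopolymer_length) per flow until the read is complete.

    Idealised: every strand incorporates perfectly, so each homopolymer run
    is consumed in exactly one flow of the matching nucleotide.
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not present in the flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(sequence):
        nuc = flow_order[flow_index % len(flow_order)]
        run = 0
        # Consume the whole homopolymer run if it matches the flowed nucleotide.
        while pos < len(sequence) and sequence[pos] == nuc:
            run += 1
            pos += 1
        flows.append((nuc, run))
        flow_index += 1
    return flows


flows = ideal_flowgram("AATTCCGG")  # hypothetical 8-base example
positive_flows = [nuc for nuc, run in flows if run > 0]
bases_called = sum(run for _, run in flows)
```

Here `bases_called` is 8 while `positive_flows` holds only 4 entries (A, T, C, G); the remaining flows are negative flows with an ideal signal of zero.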
The above figure shows how rapidly a sequencing read can fall out of phase, to the point that the lagging fraction (or population) makes up the majority of the signal. To put this in perspective, Test Fragment A requires 72 incorporation events and the error spike mentioned earlier occurs at the 59th incorporation. The approximate Incomplete Extension value for Test Fragment A is p = 0.012, so the blue line is the best match for what is happening in Test Fragment A. To achieve longer reads (> 200 bp), the Incomplete Extension value should be much lower, closer to 0.001 (purple line).
However, all is not lost. If we can get a good approximation of the probability of Incomplete Extension, then the signal can be phase corrected (aka dephased) either by simulation (Torrent Suite v1.3) or from first principles (outlined below). The first principles method (using dynamic programming) discussed below comes from deciphering the work of Helmy Eltoukhy published in IEEE and also in his Stanford PhD thesis. Like all good publications from mathematicians, it is completely incomprehensible! 😦 This is probably why the only references IEEE publications receive from outsiders are criticisms of the efficiency of their algorithms 😮 Please appreciate the hours it took me to decipher the incomprehensible but brilliant work of Helmy Eltoukhy.
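To get a feel for these numbers, here is a rough sketch under a deliberately simple assumption: at every incorporation event each in-phase strand independently fails to extend with probability p, lagging strands never catch up, and Carry Forward is ignored. The in-phase fraction after n events is then (1 - p)^n. This is an illustrative model with hypothetical function names, not necessarily the one used to draw the figure.

```python
def in_phase_fraction(p_incomplete, n_events):
    """Fraction of strands still in phase after n_events incorporation events."""
    return (1.0 - p_incomplete) ** n_events


def events_until_majority_lagging(p_incomplete):
    """Smallest number of incorporation events after which < 50% is in phase."""
    if not 0.0 < p_incomplete < 1.0:
        raise ValueError("p_incomplete must be strictly between 0 and 1")
    n = 0
    while in_phase_fraction(p_incomplete, n) >= 0.5:
        n += 1
    return n


crossover_tf_a = events_until_majority_lagging(0.012)   # Test Fragment A estimate
crossover_target = events_until_majority_lagging(0.001)  # value suggested for long reads
in_phase_at_72 = in_phase_fraction(0.012, 72)            # end of Test Fragment A
```

Under this simplified model the lagging population becomes the majority after 58 events for p = 0.012, which is in the same region as the error spike at the 59th incorporation, whereas for p = 0.001 it takes several hundred events.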
I will use a 4 flow cycle to illustrate how this is performed as this is much easier to understand than a 32 redundant flow cycle.
For a given flow (n), there are several components required for phase correction:
- Normalized signal incorporation value, y_n (worked out in Part 1 of this series)
- Nucleotide (A, C, T or G) that corresponds to the flow
- Ideal signal (a) from incorporation events for that nucleotide in preceding flows: a_{n-4}, a_{n-8}, a_{n-12}, …
- Probability of Incomplete Extension (p)
Using the value p, we can determine the fraction of strands in the in-phase population (w_n) and in each of the lagging populations (w_{n-4}, w_{n-8}, …).
The phase corrected signal for this flow is the observed signal minus the lagging signal. This also requires normalization, as only a fraction (w_n) of the entire population makes up the in-phase population. The pseudo code looks like this:
u_n = (y_n - w_{n-4}*a_{n-4} - w_{n-8}*a_{n-8} - w_{n-12}*a_{n-12} - …) / w_n
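The pseudo code above translates almost directly into Python. The sketch below handles a single flow and takes the population weights as inputs, since how they are derived from p is not covered here; the function name and all numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def phase_correct_flow(y_n, w_n, lag_weights, lag_ideals):
    """Phase correct one flow: u_n = (y_n - sum(w_lag * a_lag)) / w_n.

    y_n         -- normalized observed signal for this flow
    w_n         -- fraction of strands in the in-phase population
    lag_weights -- fractions of the lagging populations (w_{n-4}, w_{n-8}, ...)
    lag_ideals  -- ideal signals those populations contribute (a_{n-4}, a_{n-8}, ...)
    """
    if w_n <= 0:
        raise ValueError("in-phase fraction w_n must be positive")
    if len(lag_weights) != len(lag_ideals):
        raise ValueError("need one ideal signal per lagging population")
    lagging_signal = sum(w * a for w, a in zip(lag_weights, lag_ideals))
    return (y_n - lagging_signal) / w_n


# Hypothetical flow: 80% of strands in phase, two lagging populations.
w_n = 0.80
lag_weights = [0.15, 0.05]
lag_ideals = [1, 2]
true_signal = 1.0
# Build the observed signal as in-phase contribution plus lagging contributions.
y_n = w_n * true_signal + sum(w * a for w, a in zip(lag_weights, lag_ideals))
u_n = phase_correct_flow(y_n, w_n, lag_weights, lag_ideals)
```

With these made-up inputs the observed signal is 1.05 and the corrected value `u_n` comes back to the 1mer signal of 1.0 that was put in.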
Figure 2. Signal residues calculated by | Phase Corrected Signal – Ideal Signal |. The greatest improvement compared to Figure 1 is in the 3mers and 4mers at the start of the read. The residues are still quite high for positive flows late into the read. The phase corrected signal was produced using the following fitted parameters: p = 0.988, ε = 0.0075, droop = 0.00075. Note that p here appears to be the complement (1 - 0.012) of the Incomplete Extension value quoted earlier for Test Fragment A.
The problem now remains to model and account for the remaining residue. This will be discussed in Part 3 of this series along with the dynamic programming implementation of phase correction as C/C++ code.
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I have tried my best to keep all analyses correct.