Fundamentals of base calling (Part 2)

In Part 1, I discussed signal normalization and droop correction. The figure below shows the amount of variation in the signal still left to be explained.

Figure 1. Signal residues calculated by | Normalized Signal – Ideal Signal |. The residues are largest at the 3mers, 4mers and towards the end of the read.

There are a lot of journal articles out there that discuss the phase problem in a non-technical way, but to fully appreciate the problem, and for this to be of any benefit to those who want to improve Ion Torrent accuracy, it MUST be discussed technically.

The Phase problem

Firstly, the problem of phase correction is not unique to Ion Torrent but is common to all technologies that rely on sequencing by synthesis, which also includes Illumina and 454 sequencing. What does sequencing by synthesis actually mean? In a nutshell, it is observing DNA polymerase adding nucleotides one by one, creating the complementary DNA strand. The difference between the technologies is how the observation is made.

In the case of Illumina, dye-labelled nucleotides (A, C, T, G) are flowed, the polymerase incorporates a nucleotide, and a picture is then taken. The genius of this technology is the reversible dye terminator used to label the nucleotides: in each flow only one nucleotide can be added to a strand, which removes the problem of detecting whether multiple nucleotides (homopolymers) were added. Unlike in Sanger sequencing, the dye can then be removed, allowing another dye-labelled nucleotide to be incorporated in the next flow.

In the case of Ion Torrent, the nucleotides are not labelled and a by-product of incorporation (hydrogen ions) is detected instead. We know which of the 4 nucleotides was incorporated because ONLY ONE nucleotide is present in each flow. In addition, since these nucleotides are unmodified, a run of identical bases (i.e. a homopolymer) is incorporated within a single flow and produces a proportionally larger amount of hydrogen ions. The 454 works on the same principle, but a different by-product of nucleotide incorporation (pyrophosphate) is detected. In all three technologies, a large number of identical template strands are concentrated in a small area (i.e. a cluster or well), allowing their combined signal to be detected.

Finally, the problem itself. For each flow, whether a nucleotide is incorporated is not a deterministic event but rather a probabilistic one.

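There are two ways a strand can fall out of phase. In a Carry Forward error, a strand incorporates a nucleotide when it should not have and runs ahead of the main population; in Incomplete Extension, a strand fails to incorporate a nucleotide when it should have and lags behind. Carry Forward errors are analogous to false positives, while Incomplete Extension is analogous to false negatives. The remainder of this discussion will mainly focus on Incomplete Extension, as the contribution made by Carry Forward errors is negligible in comparison.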

Consequence of Incomplete Extension

Before I start this discussion, I need to define the difference between the number of bases called and the number of incorporation events. I’ll illustrate with the example sequence below:

AATTCGGG

In this sequence there are 8 bases; however, the Ion Torrent would have produced this read through four positive flows: the A, T, C and G flows, in that order. In other words, only 4 incorporation events are needed to produce the above sequence. For Incomplete Extension, we are only interested in what happens during incorporation events (i.e. positive flows).
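To make the distinction between bases and incorporation events concrete, here is a minimal sketch that walks an ideal (perfectly in-phase) read through a repeating 4-flow cycle. The function name `flow_signals` and the "TACG" flow order are assumptions chosen for illustration; any 4-nucleotide cycle gives the same four positive flows for this sequence.

```python
def flow_signals(sequence, flow_order="TACG", max_flows=1000):
    """Return (nucleotide, homopolymer_length) for each flow of an ideal read.

    flow_order is an assumed, illustrative cycle. max_flows guards against
    looping forever if the sequence contains a base not in flow_order.
    """
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(sequence) and flow_index < max_flows:
        nuc = flow_order[flow_index % len(flow_order)]
        run = 0
        # An unmodified nucleotide extends through the whole homopolymer run.
        while pos < len(sequence) and sequence[pos] == nuc:
            run += 1
            pos += 1
        flows.append((nuc, run))
        flow_index += 1
    return flows


flows = flow_signals("AATTCGGG")
positive = [(nuc, n) for nuc, n in flows if n > 0]
print(positive)                       # [('A', 2), ('T', 2), ('C', 1), ('G', 3)]
print(sum(n for _, n in positive))    # 8 bases from 4 incorporation events
```

With this assumed flow order the read takes 8 flows in total, but only the 4 positive ones count as incorporation events.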

The above figure shows how rapidly a sequencing read can fall out of phase, to the point that the lagging fraction (or population) makes up the majority of the signal. To put this in perspective, Test Fragment A requires 72 incorporation events and the error spike mentioned earlier occurs at the 59th incorporation. The approximate Incomplete Extension probability for Test Fragment A is p = 0.012, so the blue line is the best match for what is happening in Test Fragment A. To achieve longer reads (> 200 bp), the Incomplete Extension value should be much lower, closer to 0.001 (purple line).
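A minimal sketch of the arithmetic behind this, under a simplifying assumption of mine: every incorporation event fails independently with probability p, and a strand that has fallen behind never catches up. Under that assumption the in-phase fraction after n incorporation events is (1 - p)^n. The function name `in_phase_fraction` is hypothetical.

```python
def in_phase_fraction(p, n_events):
    """Fraction of strands still in phase after n_events incorporation events.

    Assumes independent failures with probability p per event and no
    catching up by lagging strands (a simplification for illustration).
    """
    return (1.0 - p) ** n_events


for p in (0.012, 0.001):
    for n in (59, 72):
        frac = in_phase_fraction(p, n)
        print(f"p = {p}: in-phase fraction after {n} events = {frac:.3f}")
```

With p = 0.012 just under half of the strands are still in phase by the 59th incorporation event, whereas with p = 0.001 more than 90% are still in phase after all 72 events.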

However, all is not lost. If we can get a good approximation of the probability of Incomplete Extension, then the signal can be phase corrected (aka dephased) either by simulation (Torrent Suite v1.3) or from first principles (outlined below). The first principles method (using dynamic programming) discussed below comes from deciphering the work of Helmy Eltoukhy, published in IEEE and also in his Stanford PhD thesis. Like all good publications from mathematicians, it is completely incomprehensible! 😦 This is probably why the only references IEEE publications receive from outsiders are when someone is criticizing the efficiency of their algorithm 😮 Please appreciate the hours it took me to decipher the incomprehensible but brilliant work of Helmy Eltoukhy.

Phase correction

I will use a 4-flow cycle to illustrate how this is performed, as it is much easier to understand than a redundant 32-flow cycle.

For a given flow (n), there are several components required for phase correction:

  1. Normalized signal incorporation value (worked out in Part 1 of this series) (y_n)
  2. Nucleotide (A, C, T or G) that corresponds to the flow
  3. Ideal signal (a) from incorporation events for that nucleotide from preceding flows. Defined by: a_{n-4}, a_{n-8}, a_{n-12}, ...
  4. Probability of Incomplete Extension (p)

Using the value p, we can determine the fraction of strands in the in-phase population (w_n) and in each of the lagging populations (w_{n-4}, w_{n-8}, ...).
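As a rough sketch of how p translates into population fractions, assume every incorporation event fails independently with probability p. The number of events a strand has missed after n events is then binomially distributed. This binomial model is a simplification used here for illustration, not the dynamic programming method; the lag is counted in incorporation events, so mapping it onto flows n-4, n-8, ... depends on the sequence and is not shown. The function name `lag_distribution` is hypothetical.

```python
from math import comb


def lag_distribution(p, n_events, max_lag=3):
    """Fraction of strands lagging by k incorporation events, for k = 0..max_lag.

    Index 0 is the in-phase population. Assumes independent failures with
    probability p per event (illustrative binomial model only).
    """
    return [
        comb(n_events, k) * p**k * (1.0 - p) ** (n_events - k)
        for k in range(max_lag + 1)
    ]


fractions = lag_distribution(0.012, 59)
for lag, frac in enumerate(fractions):
    print(f"lagging by {lag} event(s): {frac:.3f}")
```

The first entry is the in-phase fraction, and the remaining entries show how the lagging strands are spread over populations that are one, two or three events behind.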

The phase corrected signal for this flow is the observed signal minus the lagging signal. The result also has to be normalized, as only a fraction (w_n) of the entire population makes up the in-phase population. The pseudo code looks like this:

u_n = (y_n - w_{n-4} * a_{n-4} - w_{n-8} * a_{n-8} - w_{n-12} * a_{n-12} - ...) / w_n
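The pseudo code above can be sketched as a small function. This is a minimal illustration for a single flow, assuming the population fractions and the ideal signals of the preceding flows are already known; it is not the dynamic programming implementation. The function name `phase_correct` and the numbers in the example are hypothetical.

```python
def phase_correct(y_n, w_n, lagging_weights, lagging_ideal):
    """Phase corrected signal u_n for one flow.

    y_n             normalized observed signal for flow n
    w_n             fraction of strands in the in-phase population
    lagging_weights fractions w_{n-4}, w_{n-8}, ... of the lagging populations
    lagging_ideal   ideal signals a_{n-4}, a_{n-8}, ... for the same flows
    """
    if w_n <= 0:
        raise ValueError("in-phase fraction w_n must be positive")
    if len(lagging_weights) != len(lagging_ideal):
        raise ValueError("need one ideal signal per lagging population")
    lagging_signal = sum(w * a for w, a in zip(lagging_weights, lagging_ideal))
    return (y_n - lagging_signal) / w_n


# Hypothetical example: the in-phase strands incorporate a 2-mer, 15% of
# strands lag one cycle behind (ideal signal 1) and 5% lag two cycles (0).
observed = 0.80 * 2 + 0.15 * 1 + 0.05 * 0   # what the well would report
corrected = phase_correct(observed, 0.80, [0.15, 0.05], [1, 0])
print(round(corrected, 6))
```

In this made-up case the observed signal of 1.75 is restored to the 2-mer value once the lagging contribution is subtracted and the remainder is rescaled by w_n.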

Figure 2. Signal residues calculated by | Phase Corrected Signal – Ideal Signal |. The greatest improvement compared to Figure 1 is in the 3mers and 4mers at the start of the read. The residues are still quite high for positive flows late in the read. The phase corrected signal was produced using the following fitted parameters: p = 0.988 (note that this p is the complement of the Incomplete Extension probability of 0.012 used above, i.e. 1 - 0.012), ε = 0.0075, droop = 0.00075.

The problem now remains to model and account for the remaining residue. This will be discussed in Part 3 of this series along with the dynamic programming implementation of phase correction as C/C++ code.

Disclaimer: For the good of all mankind! These are purely my opinions and interpretations. I have tried my best to keep all analyses correct.

3 responses to “Fundamentals of base calling (Part 2)”

  1. “From reading your blog I now wonder if each well has identical bases in it and all the voltages are added up to make a read. That’s not right, is it? How do they organise a million identical short sequences?” – From an Ion Community member.

  2. I’ll take a guess at what you mean by “organize”. By your “s” you are probably non-American 😛

    Each well has millions of identical short sequences/strands (aka clones) attached to an Ion Sphere Particle (ISP). During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well, observed as a detectable voltage change. What I mean by “relatively” is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers (intra-flow incorporation). This looks like a sudden spike followed by exponential decay. In the signal processing this is modeled as a Poisson distribution. I will discuss this further in my upcoming series on Signal Processing, which includes the time series data (i.e. nucleotide incorporation profiles) from the DAT files.

    It does get harder to organize the reads from these short sequences as time goes on. Two reasons:
    1. By the end of the nucleotide incorporation time window within a given flow, whether the correct number of nucleotides has been incorporated is a probabilistic event (inter-flow incorporation). Over time the strands start lagging behind, as discussed above in this Part of the series.

    2. Strands stop producing a signal resulting in a droop in the total signal (discussed in Part 1 of this series). DNA polymerase falling off the strand is the main cause of this.

    This answers your question about organizing identical short sequences if you were interested in the signals read from them. If the question was about how you produce them and get them into the well, this is achieved by emulsion PCR. There is a Nature Protocols publication on this, and it forms the major steps of Template Preparation for the Ion Torrent, which is also described in the supplementary material of the Ion Torrent Nature publication.

  3. Thanks for that, I always wondered what Americans meant by organize.
