This is the first part of a planned three-part blog series on Ion Torrent signal processing. In this first part I will discuss the important aspects of the background and foreground models using key mathematical equations and pseudocode. In the second part, I will outline the high-level process of signal processing, including the key parameters that must be fitted. In the final part, I will discuss the major assumptions and where the model breaks down.
The goal of Ion Torrent signal processing is to summarize time series data (Figure 1) into just ONE value, which is then stored in the 1.wells file. The 454 equivalent is the .cwf files (thanks flxlex); the difference, however, is that Life Technologies has made their signal processing OPEN through the release of their source code. Without the source code, I would just be speculating in this blog series. So yay to available source code, and kudos go to Ion Community contributors Mel, Mike and particularly Simon for answering all my questions in great detail.
In my opinion signal processing is the root cause of two problems:
- Reads that must be filtered out due to poor signal profile*. This can account for up to 30% of the reads, as observed in the long read data set that was released.
- Inaccurate base calls, particularly towards the end of the reads. There is only so much signal normalization and correction (covered in the Fundamentals of Base Calling series) that can be performed.
Therefore, improvements made here will have the biggest effect on improving accuracy and increasing the number of reads. In other words, if you improve on this you can have ONE million dollars.
Ion Torrent – Signal Challenge
The major challenge of signal processing is that the foreground signal is not much bigger than the background signal. This is like trying to have a conversation with someone in a crowded noisy bar with loud music. This is very difficult but not impossible. Two reasons why it is possible:
- You start getting used to the background sound and learn to ignore it.
- You know what your friend sounds like and focus on only the key words in the sentence.
In reality though I refuse to try and instead nod my head away pretending to listen 😛 However, the Ion Torrent signal processing works on a similar principle.
Figure 1A. Uncorrected signal from the first 100 flows from a live well. This was from a 4 flow cycle (Q1 2011) and thus 25 flows per nucleotide. If you look hard enough there are small bumps between 1500-2000 ms that represent nucleotide incorporation.
Figure 1B. A typical baseline corrected measurement from an occupied well (red) and an adjacent empty well (black). The tiny red bump between 1500-2000 ms represents a nucleotide incorporation.
The background model aims to approximate what the signal would look like for a given flow if there were NO nucleotide incorporation. The problem is what to use as a point of reference. The best and most intuitive source is a zero-mer signal from the well itself, as this would encapsulate all the well specific variance and parameters. A known zero-mer signal can be taken from the key flows (i.e. first 7 flows). The only drawback is that each well is a dynamic system which changes over time due to slight variance in flow parameters and the changing state of the system. Another possibility is to re-estimate the zero-mer signal every N flows. The problem with this approach is that later on there will be no TRUE zero-mer signal, as there will be contributions from lagging strands. The surrounding empty wells are the only candidate left.
The loading of chip wells with Ion Sphere Particles is a probabilistic event and not all particles fall into wells. Due to the size of the particles and wells, it is physically impossible to fit two particles in a well. Therefore, a well should either be empty or have one particle in it. The way the Ion Torrent detects whether a well is empty or not is by washing with NaOH and measuring the signal delay compared to its neighboring wells (Figure 2). An empty well has less buffering capacity and therefore should respond earlier than its occupied neighbors with particles. There is sometimes a grey area in between and the Ion Torrent analysis uses clustering to best deal with this grey area.
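To make the "respond earlier vs. later, with a grey area in between" idea concrete, here is a minimal sketch that splits wells by their NaOH response delay using a plain two-cluster k-means in one dimension. This is only an illustration: the function names, the `margin` parameter and the delay values are all hypothetical, and the clustering actually used in the Ion Torrent analysis may well be different.

```python
def two_means_1d(delays, iterations=20):
    """Split 1D values into an 'early' and a 'late' cluster (k-means with k=2).

    Returns the two cluster centres. Illustrative only, not the Ion Torrent code.
    """
    early, late = min(delays), max(delays)
    for _ in range(iterations):
        early_group = [d for d in delays if abs(d - early) <= abs(d - late)]
        late_group = [d for d in delays if abs(d - early) > abs(d - late)]
        if not late_group:  # every value is identical, nothing to separate
            break
        early = sum(early_group) / len(early_group)
        late = sum(late_group) / len(late_group)
    return early, late


def classify_wells(delays, margin=0.2):
    """Label each well 'empty', 'occupied' or 'ambiguous' from its response delay.

    `margin` (hypothetical) sets the width of the grey zone around the midpoint
    between the two cluster centres.
    """
    early, late = two_means_1d(delays)
    gap = late - early
    labels = []
    for d in delays:
        if gap == 0:
            labels.append("ambiguous")
            continue
        position = (d - early) / gap  # 0 at the early centre, 1 at the late centre
        if position < 0.5 - margin:
            labels.append("empty")      # responds early: less buffering capacity
        elif position > 0.5 + margin:
            labels.append("occupied")   # responds late: particle buffers the pH change
        else:
            labels.append("ambiguous")  # the grey zone
    return labels


# Made-up response delays (arbitrary time units) for seven wells.
example_delays = [10, 11, 12, 30, 31, 32, 21]
example_labels = classify_wells(example_delays)
```

In this toy data the last well sits roughly halfway between the two groups, so it ends up in the grey zone instead of being forced into either class.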
Figure 2. The voltage response from the NaOH wash at the start to detect occupied and empty wells. I’ll explain in more detail in the next blog post. The putative empty wells (colored black) respond earlier and much faster than occupied wells (rainbow colored). The well represented as a red dotted line lies in the “grey zone”, i.e. hard to classify as either empty or occupied.
There are three major contributors to the background signal:
- Measured average signal from neighboring empty wells (ve). This signal must be time shifted as it will be subtracted to leave foreground signal.
- Dynamic voltage change (delta v). Can’t explain it beyond that 😦
- Crosstalk flux (xtflux)
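As a rough illustration of the first contributor only, the sketch below averages the neighboring empty-well traces to get ve, delays it by a (possibly fractional) number of frames, and subtracts it from an occupied well. The dynamic voltage change and crosstalk flux terms are deliberately left out, and all names (`shift_trace`, `delay_frames`, and so on) are my own hypothetical choices, not identifiers from the Ion Torrent source.

```python
def mean_trace(traces):
    """Average several equally long traces sample by sample (this plays the role of ve)."""
    n = len(traces)
    return [sum(column) / n for column in zip(*traces)]


def shift_trace(values, delay_frames):
    """Delay a uniformly sampled trace by `delay_frames` using linear interpolation.

    Positions that fall outside the trace are clamped to the first/last sample.
    """
    last = len(values) - 1
    shifted = []
    for i in range(len(values)):
        pos = min(max(i - delay_frames, 0.0), float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        shifted.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return shifted


def subtract_empty_background(occupied, empty_traces, delay_frames):
    """Occupied-well trace minus the time-shifted average of empty neighbors.

    Simplification: ignores the delta v and xtflux contributions entirely.
    """
    ve = mean_trace(empty_traces)
    background = shift_trace(ve, delay_frames)
    return [o - b for o, b in zip(occupied, background)]


# Made-up traces: two identical empty neighbors and one occupied well.
empty_neighbors = [[0, 10, 20, 30, 40], [0, 10, 20, 30, 40]]
occupied_well = [0, 0, 10, 25, 30]
foreground = subtract_empty_background(occupied_well, empty_neighbors, delay_frames=1)
```

With a one-frame delay the empty average lines up with the occupied well everywhere except the fourth sample, which is the only place a residual "bump" is left over.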
I will let the mathematics do all the talking below 🙂 This is a screen capture of a LaTeX document I produced a few months ago so I don’t remember much 😥 Please note all equations are MY INTERPRETATION of the source code and since I didn’t write the code, I am probably incorrect sometimes.
Foreground Signal – Nucleotide Incorporation Model
The Foreground signal is calculated by subtracting the background signal away from the measured signal for an occupied well. By using this model, we can determine the value A which represents the nucleotide incorporation value (aka uncorrected signal) that gets stored in the 1.wells file.
During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well, observed as a detectable voltage change. What I mean by “relatively” is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers. This looks like a sudden spike followed by an exponential decay (Figure 3). This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived nucleotide base (A,C,T,G) specific parameters such as Km values (plagiarized from myself :lol:).
Figure 3. Signal produced by subtracting an empty well from an occupied live well (i.e. subtracting the dotted black line from the red line in Figure 1). The peak is ~60. The average key flow peak in a typical Ion Torrent report is calculated in a similar way. This is from a Q1 2011 DAT file, so it is not sampled at a more desirable rate.
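To see why "mostly synchronous, with a few stragglers" gives a spike followed by a decay, here is a toy simulation. It is NOT the Poisson/Km model from the source code: it simply assumes first-order incorporation kinetics plus an exponential loss of signal from the well, and every parameter (`k_inc`, `tau`, `dt`, `flow_start`) is a made-up illustrative value.

```python
def simulate_incorporation(n_frames=100, dt=0.05, k_inc=4.0, tau=0.5, flow_start=20):
    """Toy foreground trace: first-order incorporation plus exponential signal loss.

    remaining : fraction of strands that have not yet incorporated
    k_inc     : hypothetical incorporation rate (per unit time)
    tau       : hypothetical time constant for the signal leaving the well
    """
    remaining = 1.0
    signal = 0.0
    trace = []
    for frame in range(n_frames):
        if frame >= flow_start:
            # Most strands incorporate soon after the flow starts; the rest trail off.
            incorporated = remaining * k_inc * dt
            remaining -= incorporated
            signal += incorporated
        # The signal drains away from the well at a rate set by tau.
        signal -= signal * dt / tau
        trace.append(signal)
    return trace


trace = simulate_incorporation()
peak_frame = trace.index(max(trace))
```

The trace stays flat until the flow starts, rises quickly over a handful of frames, and then decays back towards zero, which is qualitatively the shape described above.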
Nucleotide Specific Parameters
Nucleotide Incorporation Simulation
The goal is to find A that best reduces the error. I will let the mathematics below speak for itself.
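A heavily simplified sketch of "find A that best reduces the error": if we pretend the shape of a 1-mer incorporation signal is already known and fixed, then the A that minimizes the squared error has a closed form. In the real pipeline the shape itself depends on parameters that must also be fitted, so treat `unit_shape` and both function names as hypothetical stand-ins.

```python
def fit_amplitude(foreground, unit_shape):
    """Least-squares A minimizing sum((foreground[i] - A * unit_shape[i]) ** 2).

    Setting the derivative with respect to A to zero gives
    A = sum(s * f) / sum(f * f).
    """
    numerator = sum(s * f for s, f in zip(foreground, unit_shape))
    denominator = sum(f * f for f in unit_shape)
    if denominator == 0:
        raise ValueError("unit_shape must not be all zeros")
    return numerator / denominator


def squared_error(foreground, unit_shape, amplitude):
    """Sum of squared residuals for a given amplitude."""
    return sum((s - amplitude * f) ** 2 for s, f in zip(foreground, unit_shape))


# Made-up data: a triangular unit shape and a noisy signal roughly twice as large.
unit_shape = [0.0, 1.0, 2.0, 1.0, 0.0]
noisy_foreground = [0.1, 1.9, 4.1, 2.0, -0.1]
best_a = fit_amplitude(noisy_foreground, unit_shape)
```

Here `best_a` comes out close to 2, i.e. something a base caller would later interpret as a 2-mer once normalization and correction have been applied.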
In the next blog post for this series, I will list the major parameters used in signal processing. These are the mysterious unaccounted variables in all the above equations. I will also give a high-level description of how parameter fitting is performed.
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I have tried my best to keep all analyses correct. The mathematical interpretation was done some time ago when I was in my “happy place”. Now I’m not in that “happy place” so don’t remember a thing!