This is the final part of the Fundamentals of Base Calling series. In the first part, I covered signal normalization and droop correction. In the second part, I introduced the problem of probabilistic nucleotide incorporation and its effect on the total signal (aka the phasing problem), and briefly covered a dynamic algorithm for phase correction. Here is the source code for the implementation. Sorry it's a doc file, couldn't get WordPress to cooperate 😥 In this final part, I will detail the exact challenges faced in improving the raw accuracy of the Ion Torrent. This is an important topic, as many know the overall aim is to improve accuracy but are uncertain what actually needs improving. There is also a BlogTalk Radio audio interview on this topic, but it is NOT specific about what needs to be improved.
There are three broad levels at which this problem can be approached:
- Data acquisition
- Signal processing
- Signal correction and Base calling
Data acquisition involves measuring the voltage in each well at predefined times during each flow. The results of these measurements are stored in a DAT file. Each DAT file holds the time series data for all wells for a given flow; therefore, there are 260 DAT files, as a sequencing run contains 260 nucleotide flows. The file specifications can be obtained from the Ion Community. It is important to note that this file is in big-endian byte order and must be converted to little-endian before processing on a server with an Intel processor, such as the Torrent Server. The computing people reading this would know what I mean 😛
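As a minimal sketch of that endianness conversion, assuming for illustration that the payload is a stream of unsigned 16-bit samples (the actual field layout is in the Ion Community spec):

```python
import numpy as np

# Hypothetical sketch: treat the DAT payload as big-endian unsigned 16-bit
# samples. The dtype and layout here are assumptions for illustration only;
# consult the Ion Community file specification for the real record format.
def to_native_endian(raw_bytes):
    # '>u2' = big-endian uint16; astype() converts each value into the
    # host's native (little-endian on Intel) byte order
    be = np.frombuffer(raw_bytes, dtype=">u2")
    return be.astype(np.uint16)

# Example: big-endian bytes 0x01 0x02 encode the value 0x0102 = 258
samples = to_native_endian(bytes([0x01, 0x02, 0x00, 0xFF]))
```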
Figure 1. Typical time series data extracted from a DAT file for one well. The well-specific baseline value was subtracted from all values. There is a noticeable time delay after the two flow changes due to distance from the flow inlet and the pH buffering capacity of each well.
The figure above shows a typical time series voltage measurement. A flow is composed of three components:
1. Wash buffer (well-specific baseline measurement)
2. Nucleotide flow (acidic)
3. Less acidic wash
In earlier versions, the voltage was measured at a fixed rate of 15 frames per second (i.e. every 66 ms). However, there are two major disadvantages to this approach. First, the most important portion of this time series is the nucleotide incorporation, so most of the sampling should be performed during that time window. Second, the information contained in components (1) and (3) is of little value and should be sampled less frequently. This was acknowledged by introducing a variable frame rate, sampling component (2) more frequently and components (1) and (3) less frequently. This has resulted in a reduction in raw data, that is, smaller DAT files. The challenges at this level are:
- Whether the current sampling rate during (2) is at least the Nyquist rate
- Reducing the overall flow time without compromising accuracy, so that longer reads don't necessarily mean longer run times.
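The variable-frame-rate idea can be sketched as follows; the windowing and averaging scheme here is my own illustration, not the PGM firmware's actual logic:

```python
import numpy as np

# Toy sketch of variable-frame-rate compression: keep every frame inside
# the incorporation window at full rate, and average non-overlapping groups
# of `k` frames outside it. Window positions and k are made-up parameters.
def compress_frames(frames, window_start, window_end, k=4):
    frames = np.asarray(frames, dtype=float)
    pre = frames[:window_start]
    mid = frames[window_start:window_end]   # incorporation: full rate
    post = frames[window_end:]

    def downsample(x):
        n = (len(x) // k) * k               # drop any ragged tail
        return x[:n].reshape(-1, k).mean(axis=1)

    return np.concatenate([downsample(pre), mid, downsample(post)])

# 100 frames, with the incorporation window at frames 40-60:
# 40 pre-frames -> 10 averages, 20 full-rate frames, 40 post -> 10 averages
compressed = compress_frames(np.arange(100), 40, 60, k=4)
```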
Life Technologies has provided Flow Scripts on the Ion Community so that people with PGMs can experiment with tweaking these parameters. Unfortunately, the software/firmware on the PGM itself is closed source, and thus the sampling rate cannot be tweaked by outsiders. Developers within Life Technologies would have access to sequencing runs where parameters in the Flow Scripts and sampling rates have been tweaked. This would be fine IF accuracy challenge entrants were not competing with Life Technologies.
The goal of signal processing is essentially to summarize the time series data contained in the DAT files into ONE value per well per flow. The result of this signal processing is stored in the 1.wells file, which contains, for each well and each flow, the signal incorporation value. The file specifications can be obtained from the Ion Community. The goal at this level is to subtract the background signal, leaving only the nucleotide incorporation signal. The figure below shows that the deviation caused by nucleotide incorporation is only slightly larger than the background signal.
Figure 2. The small changes in the waveform due to nucleotide incorporation. This is an uncorrected signal and thus the 5mer signal does not deviate 5 times the distance away from the non-incorporation response (i.e. 0mer curve).
In order to summarize the signal into just ONE value, a background model and a foreground model are currently implemented.
The background signal is what the signal would look like if there were no nucleotide incorporation. An obvious observation is that the signal varies between flows and with well distance from the inlet and outlet fluid valves. Therefore, signals taken from surrounding empty wells are the best available estimate of what the background signal should look like. A multi-parameter model is currently used in Torrent Suite v1.4, which uses the surrounding empty wells to estimate each of the parameters of the background model. The parameter fitting, which uses linear algebra, is the most computationally expensive task in the whole Analysis pipeline. The use of NVIDIA Tesla GPUs has greatly reduced processing time in Torrent Suite v1.4.
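A toy version of that background estimation step, assuming a drastically simplified one-parameter model (the real Torrent Suite model fits many more parameters):

```python
import numpy as np

# Toy sketch of background estimation: average the traces of neighbouring
# empty wells, then fit a single scale factor to the live well by least
# squares. A one-parameter fit is my simplification; the real model uses
# a multi-parameter fit over the surrounding empty wells.
def background_estimate(live_trace, empty_traces):
    live = np.asarray(live_trace, dtype=float)
    bg = np.mean(np.asarray(empty_traces, dtype=float), axis=0)
    # least-squares scale s minimising ||live - s * bg||^2
    s = np.dot(live, bg) / np.dot(bg, bg)
    return s * bg

t = np.linspace(0.0, 1.0, 50)
empty = [np.sin(t) * a for a in (0.9, 1.0, 1.1)]  # three empty neighbours
live = 2.0 * np.sin(t)   # no incorporation, just 2x the buffering response
residual = live - background_estimate(live, empty)  # should be ~zero
```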
A signal incorporation profile is left after subtracting the background. The figure below shows a typical signal incorporation profile.
Figure 3. Averaged incorporation signal taken from the summary report for each run. Determined by subtracting the background response.
During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner, producing a combined signal for the well observed as a detectable voltage change. By "relatively" I mean that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers. This looks like a sudden spike followed by an exponential decay. This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived nucleotide base (A, C, T, G) specific parameters such as Km values. In an upcoming series, I will cover signal processing in much greater detail.
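A simplified illustration of that spike-and-decay shape, and of summarizing a trace into ONE value; the exponential form and the tau and t0 values are my assumptions, not the actual Poisson/Km model:

```python
import numpy as np

# Simplified foreground illustration: an incorporation trace modelled as a
# sudden rise at t0 followed by an exponential decay with time constant tau.
# Amplitude scales with homopolymer length. tau and t0 are made-up values.
def incorporation_trace(t, n_mer, tau=0.2, t0=0.1):
    return np.where(t >= t0, n_mer * np.exp(-(t - t0) / tau), 0.0)

t = np.linspace(0.0, 2.0, 2000)
dt = t[1] - t[0]
# summarise each trace to ONE value: the area under the incorporation curve
one_mer_signal = incorporation_trace(t, 1).sum() * dt
three_mer_signal = incorporation_trace(t, 3).sum() * dt
# in this idealised model the 3mer area is exactly three times the 1mer area
```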
Implementing an accurate and robust background and foreground model is essential to improving accuracy at this level. Others may want to take a different angle on this signal processing, but using a background and foreground model is considered best practice for problems of this nature. The biggest limiting factor of this model is having to use surrounding empty wells to fit the parameters. It assumes that signals from occupied wells and empty wells evolve similarly over time, including properties such as variation and noise. In other words, it assumes their pH buffering capacities are correlated. The increase in signal residue over time, shown in Figure blah in Part 2, suggests this assumption may not be correct and must be adjusted.
In Parts 1 and 2 of this series, I discussed the main steps and goals of base calling, so no introduction is required. There are three main deficiencies in the current method of base calling, which I will now discuss. It is important to note that each deficiency must be addressed at the appropriate level. For example, it is very difficult to address the widening 1mer signal shown for Test Fragment C at the base calling level. This must be addressed at the signal processing level, or even at the chemistry level.
The normalization method requires estimating a 1mer signal from the T, A and C 1mer signals in the key flows. The G 1mer is not included, as it is not guaranteed to be a 1mer: it is the last base before the unknown library sequence starts. If the library sequence starts with a G, then a 2mer would be read for this flow; if it starts with GG, then a 3mer; and so on. It is also important to note that this method assumes the 1mer signal is independent of the base, which can introduce a base-specific bias in errors. The figure below shows that there is a bias when 1mer signals from each base are considered the same.
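The key-flow normalization described above might be sketched like this (the signal values are made up):

```python
# Sketch of key-flow normalisation: estimate the 1mer signal by averaging
# the T, A and C key-flow signals (G is excluded because its homopolymer
# length depends on the start of the unknown library insert), then divide
# every flow signal by that estimate.
def normalize_by_key(flow_signals, key_1mer_signals):
    one_mer = sum(key_1mer_signals) / len(key_1mer_signals)
    return [s / one_mer for s in flow_signals]

# made-up measured key 1mer signals for T, A, C -> estimated 1mer of 0.5
normalized = normalize_by_key([0.52, 1.01, 1.55], [0.49, 0.50, 0.51])
```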
Figure 4. The insertion and deletion counts were from the June cropped data set available on Innocentive. This is from Q10 reads of length greater than 50 bases (207 reads) using a modified alignStats binary. This nucleotide bias was also noted in the Mass Genomics blog post.
Another problem with using a 1mer signal to normalize the signal before base calling is that there is little room for error when normalizing large homopolymers. For example, a signal between 1.5 and 2.5 will be called as a 2mer; this requires a 1mer signal between 0.75 and 1.25. A signal between 2.5 and 3.5 will be called as a 3mer; this requires a 1mer signal between 0.83 and 1.17. And so on. This demonstrates that the estimate of the 1mer signal needs to be highly accurate for larger homopolymer calls. In other words, the % tolerance decreases as homopolymer length increases: calling an n-mer tolerates only a ±50%/n error in the 1mer estimate. This problem is compounded by the increase in signal residue deep into the sequencing run, making large homopolymers difficult to call correctly.
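The tolerance arithmetic above in code:

```python
# An n-mer is called when the normalised signal falls in [n - 0.5, n + 0.5],
# so the 1mer estimate must lie within (n -/+ 0.5)/n of its true value,
# i.e. a tolerance of +/-50%/n that shrinks as n grows.
def one_mer_tolerance(n):
    return 100 * 0.5 / n   # allowed % deviation of the 1mer estimate

for n in range(1, 6):
    lo, hi = (n - 0.5) / n, (n + 0.5) / n
    print(f"{n}mer: 1mer in [{lo:.3f}, {hi:.3f}], tolerance +/-{one_mer_tolerance(n):.1f}%")
```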
The number of bases called for a given flow is calculated by rounding the signal to the nearest integer. When the value is 2.1, 2.05 or 2.2, this is quite trivial, as 2 is the closest integer. But what happens when the value is 2.49 or 2.51? The 2.49 would be called as a 2mer, while the 2.51 would be called as a 3mer. These two values were only separated by 0.02 and are now separated by a whole integer! The threshold problem is a common problem in statistical classification, with many methods to address it. In fact, PyroBayes implements a naive Bayes classifier as an attempt to address the homopolymer problem. Interestingly, earlier versions of Torrent Suite used statistical modeling or machine learning techniques implemented in the WEKA package.
In Torrent Suite v1.4, a greedy approach is taken with base calling. There are two types of greedy in this world.
- A person that takes the remaining Peking duck when everyone has had their share 😳
- A person that makes the best decision at the time and does not think about future repercussions. For example, turning off a main road to avoid a traffic jam but then you realize the back streets are actually slower 😡
Greedy type 2 is the one applicable to the base calling currently used on the Ion Torrent. This greedy approach works quite well and is fast and efficient, BUT it does not cope well with boundary cases such as 2.49 and 2.51.
Using a greedy approach results in the classic undercall-overcall or overcall-undercall problem observed in the data, highlighted in the Mass Genomics blog post. For example, if a C 3mer is incorrectly called as a C 2mer, the lagging signal produced may be mistaken for a 1mer in the next C flow.
A non-greedy approach would take both paths, that is, call both a 2mer and a 3mer, and observe the future consequences of each path before committing to a base call. In this case, the future consequence of interest is the residue signal (i.e. the unexplained variance). The goal of a non-greedy approach is to reduce the overall residue, not the residue of any one particular base call.
The thesis of Helmy Eltoukhy demonstrated that a modified Viterbi algorithm (i.e. a non-greedy approach) can reduce the homopolymer error in 454 sequencing. This was demonstrated using Markov Chain Monte Carlo (MCMC) simulations. He purposely left out all the important details, so like the thesis marker we'll all believe him with blind faith 🙂 Besides needing to be adapted for base calling, the weakness of the Viterbi algorithm is that it has only a limited memory. A depth-first or breadth-first tree search over homopolymer possibilities, with edges weighted by the resulting residue, would perform better, but its performance is highly dependent on the signal-to-noise ratio.
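A brute-force illustration of non-greedy calling (a real implementation would prune the search depth-first rather than enumerate every path, and the carry-forward residue model here is a toy assumption of mine):

```python
from itertools import product

# Toy non-greedy caller: assume the uncalled residue of each flow carries
# into the next flow, and pick the call sequence with the smallest TOTAL
# squared residue, rather than rounding each flow independently. Exhaustive
# enumeration stands in for a pruned depth-first tree search.
def best_calls(signals, max_mer=5):
    best = (float("inf"), None)
    for calls in product(range(max_mer + 1), repeat=len(signals)):
        carry, cost = 0.0, 0.0
        for s, c in zip(signals, calls):
            r = (s + carry) - c   # unexplained signal for this flow
            cost += r * r
            carry = r             # toy model: residue carries forward
        if cost < best[0]:
            best = (cost, calls)
    return best[1]

# Boundary case: calling (3, 1) explains [2.6, 1.6] with far less total
# residue than per-flow rounding to (3, 2), because the -0.4 residue of the
# first call accounts for part of the second flow's signal.
calls = best_calls([2.6, 1.6])
```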
Variable Signal Correction Parameters
Currently, the major parameters (droop, carry forward, incomplete extension) used for base calling are constant and do not change between flows. First, these parameters may be nucleotide base specific. Second, they may vary depending on homopolymer length; it is intuitive that incomplete extension is more likely for longer homopolymers. Last, the buffering and reagent effectiveness in the well may have changed drastically by the end of a run compared to the start of sequencing.
This concludes the Fundamentals of Base Calling blog series. It is extremely important to note that the problems/improvements mentioned are entangled and highly dependent on one another. This suggests that a unified model, in which the signal processing and base calling modules can communicate, would be beneficial. Currently the modules communicate only via the 1.wells file, where most of the information is lost due to data reduction.
In my opinion, the biggest gain would come from addressing this problem at the lowest level, that is, the chemistry. Improving the chemistry makes the problem easier at all three levels. In fact, the 200 bp long read data set is mainly due to chemistry improvements. Because of this entanglement, I strongly believe that major improvements in accuracy cannot be achieved through computational methods alone. This contradicts Rothberg's claim that "homopolymer errors are pretty much due to software processing."
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I have tried my best to keep all analyses correct. Care Bears is the best show ever!