Challenges in Improving Ion Torrent Raw Accuracy (Part 3)

This is the final part of the Fundamentals of Base Calling series. In the first part of the series, I covered signal normalization and droop correction. In the second part, I introduced the problem of probabilistic nucleotide incorporation and its effect on the total signal (also known as the phasing problem), and briefly covered a dynamic programming algorithm for phase correction. Here is the source code for the implementation (sorry it's a doc file, I couldn't get WordPress to cooperate). In this final part, I will detail the exact challenges faced in improving the raw accuracy of the Ion Torrent. This is an important topic, as many know that the overall aim is to improve accuracy but are uncertain about what actually needs improving. There is also a BlogTalkRadio audio interview on this topic, but it is NOT specific about what needs to be improved.

There are three broad levels at which this problem can be approached:

  1. Data acquisition
  2. Signal processing
  3. Signal correction and Base calling

Data acquisition

Data acquisition involves measuring the voltage in each well at predefined times during each flow. The results of these measurements are stored in a DAT file. Each DAT file contains the time series data for all wells for a given flow. Therefore, there are 260 DAT files, as a sequencing run contains 260 nucleotide flows. The file specifications can be obtained from the Ion Community. It is important to note that this file is in big endian format and must be converted to little endian before processing on a server with an Intel processor, such as the Torrent Server. The computing people reading this will know what I mean.
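As a minimal sketch of the byte-order issue, the snippet below reads a block of values as big endian and returns them in native (little endian on Intel) order. The 16-bit word size and the offset are placeholders of my own, not taken from the DAT specification; consult the Ion Community documentation for the real header layout.

```python
import numpy as np

def read_big_endian_block(path, offset, n_values):
    """Read a block of 16-bit big-endian values and return them in native
    (little-endian on Intel) byte order.

    The 16-bit word size and the offset are illustrative placeholders; the
    real layout is defined in the DAT file specification on the Ion Community.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        raw = np.fromfile(f, dtype=">u2", count=n_values)  # '>u2' = big-endian uint16
    return raw.astype(np.uint16)  # astype copies the data into native byte order
```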

Figure 1. Typical time series data extracted from a DAT file for one well. The well-specific baseline value was subtracted from all values. There is a noticeable time delay after the two flow changes due to the distance from the flow inlet and the pH buffering capacity of each well.

The figure above shows a typical time series voltage measurement. A flow is composed of three components:

  1. Wash buffer (well specific baseline measurement)
  2. Nucleotide Flow (acidic)
  3. Less acidic wash

In earlier versions, the voltage was measured at a frequency of 15 frames per second (i.e. every 66 ms). However, there are two major disadvantages with this approach. First, the most important portion of this time series is the nucleotide incorporation, so most of the sampling should be performed during this time window. Second, the information contained in components (1) and (3) is of little value and should be sampled less frequently. This was acknowledged by introducing a variable frame rate, sampling component (2) more frequently and components (1) and (3) less frequently. This has reduced the amount of raw data, that is, smaller DAT files; a rough sketch of this kind of compression follows the list below. The challenges at this level are:

  1. Whether the current sampling rate during (2) meets the Nyquist rate
  2. The reduction in overall flow time without compromising accuracy. This is required so that longer reads do not necessarily mean longer run times.
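The sketch below illustrates the idea of variable frame rate compression: keep full-rate frames inside the incorporation window and average the rest in small groups. The window boundaries and grouping factor are assumptions for illustration only, not the values used by the PGM firmware or the DAT specification.

```python
import numpy as np

def compress_frames(trace, t, window_start, window_end, group=4):
    """Illustrative variable frame rate compression for one well trace.

    Frames inside the incorporation window [window_start, window_end) are kept
    at full rate; frames outside it are averaged in groups of `group` frames.
    """
    inside = (t >= window_start) & (t < window_end)
    kept_t, kept_v = list(t[inside]), list(trace[inside])
    out_t, out_v = t[~inside], trace[~inside]
    for i in range(0, len(out_t), group):
        kept_t.append(out_t[i:i + group].mean())
        kept_v.append(out_v[i:i + group].mean())
    order = np.argsort(kept_t)  # restore chronological order after mixing the two sets
    return np.array(kept_t)[order], np.array(kept_v)[order]
```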

Life Technologies has provided Flow Scripts on the Ion Community so that people with PGMs can experiment with tweaking these parameters. Unfortunately, the software/firmware on the PGM itself is closed source and thus the sampling rate cannot be tweaked by outsiders. Developers within Life Technologies would have access to sequencing runs where the parameters in the Flow Scripts and the sampling rates have been tweaked. This would be okay IF accuracy challengers were not competing with Life Technologies.

Signal Processing

The goal of signal processing is to summarize the time series data contained in the DAT files into ONE value. The result of this signal processing is stored in the 1.wells file, which contains, for each well and each flow, the incorporation signal value. The file specifications can be obtained from the Ion Community. The aim at this level is to subtract the background signal, leaving only the nucleotide incorporation signal. The figure below shows that the deviation caused by nucleotide incorporation is only slightly larger than the background signal.

Figure 2. The small changes in the waveform due to nucleotide incorporation. This is an uncorrected signal, and thus the 5mer curve does not deviate from the non-incorporation response (i.e. the 0mer curve) by five times the 1mer deviation.

In order to summarize the signal into just ONE value, a background and a foreground model are currently implemented.

Background

The background signal is what the signal would look like if there were no nucleotide incorporation. An obvious observation is that the signal varies between flows and with a well's distance from the inlet and outlet fluid valves. Therefore, signals taken from surrounding empty wells are the best available estimate of what the background signal should look like. A multi-parameter model is currently used in Torrent Suite v1.4, which uses the surrounding empty wells to estimate each of the parameters of the background model. The parameter fitting, which uses linear algebra, is the most computationally expensive task in the whole Analysis pipeline. The use of NVIDIA Tesla GPUs has greatly reduced processing time in Torrent Suite v1.4.
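As a crude illustration of the empty-well idea (and not the multi-parameter model used by Torrent Suite), the sketch below estimates a well's background as the average trace of nearby empty wells:

```python
import numpy as np

def estimate_background(traces, is_empty, row, col, radius=2):
    """Crude background estimate for one well: the average trace of
    nearby empty wells.

    traces   : array of shape (n_rows, n_cols, n_frames), one trace per well
    is_empty : boolean array of shape (n_rows, n_cols), True for empty wells

    This is only a neighbourhood average; the real pipeline instead fits a
    multi-parameter physical model to the empty-well data.
    """
    r0, r1 = max(0, row - radius), min(traces.shape[0], row + radius + 1)
    c0, c1 = max(0, col - radius), min(traces.shape[1], col + radius + 1)
    patch = traces[r0:r1, c0:c1]
    mask = is_empty[r0:r1, c0:c1]
    if not mask.any():
        raise ValueError("no empty wells in the neighbourhood")
    return patch[mask].mean(axis=0)  # average over empty wells -> (n_frames,)
```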

Foreground

A signal incorporation profile is left after subtracting the background. The figure below shows a typical signal incorporation profile.

Figure 3. Averaged incorporation signal taken from the summary report for each run. Determined by subtracting the background response.

During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well, observed as a detectable voltage change. What I mean by "relatively" is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers. This looks like a sudden spike followed by an exponential decay. This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived, nucleotide base (A, C, T, G) specific parameters such as Km values. In an upcoming series, I will cover signal processing in much greater detail.
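As a simplified stand-in for that enzyme-kinetics model, the sketch below fits a single spike-and-exponential-decay kernel to the background-subtracted trace and returns its amplitude as the per-flow incorporation value. The kernel shape, start time and decay constant are assumptions for illustration only.

```python
import numpy as np

def incorporation_amplitude(trace, background, t, t0=1.0, tau=2.0):
    """Estimate a single incorporation value for one flow.

    trace, background : voltage time series for the well and its fitted background
    t                 : frame times in seconds
    t0, tau           : illustrative kernel start time and decay constant

    A unit kernel (sharp rise at t0, exponential decay with constant tau) is
    fitted by one-parameter least squares to the background-subtracted signal.
    The fitted amplitude plays the role of the per-flow value written to the
    1.wells file; the real pipeline uses a Poisson/Km based kinetics model.
    """
    residual = trace - background
    kernel = np.where(t >= t0, np.exp(-(t - t0) / tau), 0.0)
    amplitude = np.dot(kernel, residual) / np.dot(kernel, kernel)
    return amplitude
```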

Implementing an accurate and robust background and foreground model is essential to improving accuracy at this level. Others may want to take a different angle on this signal processing, but using a background and foreground model is considered best practice for signal processing problems of this nature. The biggest limiting factor of this model is having to use surrounding empty wells to fit the parameters. It assumes that signals from occupied wells and empty wells evolve similarly over time, including properties such as variation and noise. In other words, it assumes their pH buffering capacities are correlated. The increase in signal residue over time shown in Part 2 suggests this assumption may not be correct and must be adjusted.

Base Calling

In Parts 1 and 2 of this series, I discussed the main steps and goals of base calling, so no introduction is required. There are three main deficiencies in the current method of base calling, which I will now discuss. It is important to note that each deficiency must be addressed at the level where it arises. For example, it is very difficult to address the widening 1mer signal shown for Test Fragment C at the base calling level; this must be addressed at the signal processing level or even at the chemistry level.

Normalization

The normalization method requires estimating a 1mer signal from the T, A and C 1mer signals in the key flows. The G 1mer is not included because it is not guaranteed to be a 1mer, as it is the last base before the unknown library sequence starts. If the library sequence starts with a G, then a 2mer would be read for this flow; if the library sequence starts with GG, then a 3mer would be read for this flow, and so on. It is also important to note that the method treats the 1mer signal as independent of the base, which could introduce a base-specific bias in errors. The figure below shows that there is a bias when 1mer signals from each base are considered the same.

Figure 4. The insertion and deletion counts were from the June cropped data set available on Innocentive. This is from Q10 reads of length greater than 50 bases (207 reads) using a modified alignStats binary. This nucleotide bias was also noted in the Mass Genomics blog post.
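A minimal sketch of this key normalization, assuming the three usable key 1mer signals are simply averaged (the exact combination used by Torrent Suite may differ):

```python
import numpy as np

def key_normalization_factor(key_signals):
    """Estimate the 1mer normalization factor from the known key flows.

    key_signals : raw signals for the key 1mer flows, e.g.
                  {"T": 52.1, "A": 49.8, "C": 50.6}  (made-up numbers)

    The G key flow is excluded because, as noted above, it may merge with the
    first library base(s) and so is not guaranteed to be a 1mer. Averaging the
    remaining three also bakes in the assumption that the 1mer signal is the
    same for every base, which Figure 4 suggests does not quite hold.
    """
    return np.mean([key_signals[base] for base in ("T", "A", "C")])
```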

Another problem with using a 1mer signal to normalize the signal before base calling is that there is little room for error when calling large homopolymers. For example, a normalized signal between 1.5 and 2.5 will be called as a 2mer; this requires the normalized 1mer signal to lie between 0.75 and 1.25. A signal between 2.5 and 3.5 will be called as a 3mer; this requires the normalized 1mer signal to lie between 0.83 and 1.17. And so on. This demonstrates that the estimate of the 1mer signal needs to be highly accurate for larger homopolymer calls. In other words, the % tolerance decreases as homopolymer length increases, as the table below shows. This problem is compounded by the increase in signal residue deep into the sequencing runs, making large homopolymers difficult to call correctly.

  Homopolymer length   Normalized 1mer must lie in    Tolerance
  2                    0.75 to 1.25                   +/- 25%
  3                    0.83 to 1.17                   +/- 16.7%
  4                    0.875 to 1.125                 +/- 12.5%
  5                    0.90 to 1.10                   +/- 10%
  n                    (n - 0.5)/n to (n + 0.5)/n     +/- (50/n)%
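To make the same point with toy numbers (not from any real run), the snippet below shows how an error in the 1mer estimate that is harmless for a short homopolymer can flip a long one:

```python
def call_homopolymer(raw_signal, one_mer_estimate):
    """Normalize by the estimated 1mer signal and round to the nearest
    integer, as in the simple thresholding scheme described below."""
    return round(raw_signal / one_mer_estimate)

true_1mer = 1.0
biased_1mer = 1.08  # an 8% over-estimate of the 1mer signal

print(call_homopolymer(2 * true_1mer, biased_1mer))  # 2 / 1.08 = 1.85 -> called 2, correct
print(call_homopolymer(7 * true_1mer, biased_1mer))  # 7 / 1.08 = 6.48 -> called 6, undercall
```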

Thresholding

The number of bases called for a given flow is calculated by rounding the signal to the nearest integer. When the value is 2.1, 2.05 or 2.2 this is quite trivial, as 2 is the closest integer. What happens when the value is 2.49 or 2.51? The value 2.49 would be called as a 2mer, while 2.51 would be called as a 3mer. However, these two values were only separated by 0.02 and are now separated by a whole integer! The threshold problem is a common problem in statistical classification, with many methods to address it. In fact, PyroBayes implements a naive Bayes classifier as an attempt to address the homopolymer problem. Interestingly, earlier versions of Torrent Suite used statistical modeling and machine learning techniques implemented in the WEKA package.

In Torrent Suite v1.4, a greedy approach is taken with base calling. There are two types of greedy in this world.

  1. A person that takes the remaining Peking duck when everyone has had their share
  2. A person that makes the best decision at the time and does not think about future repercussions. For example, turning off a main road to avoid a traffic jam but then realizing the back streets are actually slower

Greedy type 2 is the one that applies to the base calling currently used on the Ion Torrent. This greedy approach works quite well and is fast and efficient, BUT it does not cope well with boundary cases such as 2.49 and 2.51.

Using a greedy approach results in the classic undercall-overcall or overcall-undercall problem observed in the data, highlighted in the Mass Genomics blog post. For example, if a C 3mer is incorrectly called as a C 2mer, the lagging signal produced may be mistaken for a 1mer in the next C flow.

A non-greedy approach would take both paths, that is, call both a 2mer and a 3mer, and observe the future consequences of each path before committing to a base call. In this case, the future consequence of interest is the residue signal (i.e. unexplained variance). The goal of a non-greedy approach is to reduce the overall residue, not the residue for any one particular base call.
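A toy sketch of this idea (not the Torrent Suite base caller): enumerate candidate calls for a short window of upcoming flows, score each combination by the residue it leaves, and commit only to the first call of the best combination. The predict function here is a hypothetical stand-in for a phase model accounting for carry forward, incomplete extension and droop.

```python
from itertools import product

def lookahead_call(signals, predict, depth=3, candidates=(0, 1, 2, 3)):
    """Non-greedy caller for one flow.

    signals    : measured (normalized) signals for the next `depth` flows
    predict    : hypothetical function mapping a tuple of candidate calls to
                 the signals it would predict for those flows (this is where
                 a phasing model would live; it is a stand-in, not a real API)
    candidates : homopolymer lengths considered at each flow

    A greedy caller is the special case depth=1 with predict(calls) = calls.
    """
    best_calls, best_residue = None, float("inf")
    for calls in product(candidates, repeat=depth):
        predicted = predict(calls)
        residue = sum((s - p) ** 2 for s, p in zip(signals, predicted))
        if residue < best_residue:
            best_calls, best_residue = calls, residue
    return best_calls[0]  # commit only to the first flow, then slide the window
```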

The thesis of Helmy Eltoukhy demonstrated that a modified Viterbi algorithm (i.e. a non-greedy approach) can reduce the homopolymer error in 454 sequencing. This was demonstrated using Markov Chain Monte Carlo (MCMC) simulations. He purposely left out all the important details, so, like the thesis marker, we'll all just have to take it on blind faith. Besides needing to be adapted for base calling, the weakness of the Viterbi algorithm is that it only has a limited memory. A depth first or breadth first tree search of homopolymer possibilities, with edges weighted by the resulting residue, would perform better, but its performance is highly dependent on the signal to noise ratio.

Variable Signal Correction Parameters

Currently the major parameters (droop, carry forward, incomplete extension) used for base calling are constant and do not change between flows. First, these parameters may be nucleotide base specific. Second, they may vary depending on the homopolymer length; it is intuitive that incomplete extension is more likely for longer homopolymers. Last, the buffering and reagent effectiveness in the well may have changed drastically from what they were at the start of the sequencing run.
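One way to picture flow-varying parameters is a simple lookup like the sketch below. Every name and number in it is a made-up placeholder, purely to illustrate base-specific, homopolymer-dependent and flow-dependent terms; none of these values come from Torrent Suite.

```python
def incomplete_extension_rate(base, hp_length, flow_index):
    """Hypothetical per-flow incomplete extension rate.

    Combines an illustrative base-specific starting rate, a term that grows
    with homopolymer length, and a slow drift with flow number to mimic
    changing buffering/reagent effectiveness over the run. All numbers are
    placeholders, not measured values.
    """
    base_rates = {"A": 0.004, "C": 0.004, "G": 0.005, "T": 0.004}
    hp_scale = 0.002   # extra incomplete extension per additional base in the homopolymer
    drift = 1e-5       # slow increase per flow over the run
    return base_rates[base] + hp_scale * (hp_length - 1) + drift * flow_index
```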

This concludes the Fundamentals of Base Calling blog series. It is extremely important to note that each problem/improvement mentioned is entangled with and highly dependent on the other components. This entanglement suggests that a unified model, in which the signal processing and base calling modules can communicate, would be beneficial. Currently the modules communicate only via the 1.wells file, where most of the information is lost due to data reduction.

In my opinion, the biggest gain would be to address this problem at the lowest level, that is, the chemistry. Improving the chemistry will make the problem easier on all three levels. In fact, the 200 bp long read data set is mainly due to chemistry improvements. Because of this entanglement, I strongly believe that major improvements in accuracy cannot be achieved through computational methods alone. This contradicts Rothberg, who claims "homopolymer errors are pretty much due to software processing."

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I have tried my best to keep all analyses correct. Care Bears is the best show ever!


11 responses to "Challenges in Improving Ion Torrent Raw Accuracy (Part 3)"

  1. I agree, Care Bears IS the best show ever!!
    I really like your post; it looks like you are trying hard to get hired by Ion. The post assumes that chemistry can be improved, but software is the least risky and can yield the greatest improvement in a short time period.

    • Mike, thanks for the comments and for promoting the Care Bears. If everyone were made to watch it in high school, the world would be a better place. In fact, on the days I write the hater stuff it's because I forgot to watch Care Bears that morning.

      It is a good point that software is less risky and much faster in terms of development time. I’ve been there, my wet lab experiments tend to drag on for months/years while my bioinformatics stuff is done pretty quickly in comparison. I guess chemistry improvements will come in their due time… or it could be bought ๐Ÿ™‚

    • I'd like to get any job once I finish this PhD that drags on forever. Instant noodles are starting to lose their charm. Not sure if I want to work for the dark side. I'd much rather struggle in the public sector and complain when grant results come back, when I get the peer reviewer from hell, and, not to mention, when dealing with students with emotional problems.

  2. Being a noob on the subject I was shocked by your figure 2. Is that the raw raw data you have to work with when using Ion Torrent? And there is still usable sequence coming off the instrument? Wow! But then, I am ‘spoiled’ by the images from the 454 instrument, where wells basically look black when there is no incorporation…

    • Hi flxlex, thanks for the comment and great observation. Figure 2 is from a January 2011 sequencing run, so it was a long, long time ago. It does not have the increased sampling rate I mentioned. Also, separating a foreground signal from a comparatively large background/baseline signal is a common problem in signal processing.

      When working with the raw data (i.e. DAT files) in March-April, I was very skeptical that anything could be made out of such a signal. I was pleasantly surprised when the source code was released and I saw it for myself. The background and foreground modeling is testimony to 1-2 years of hard work, and the community is fortunate to be able to look at it.

  3. someone in the know

    Don't waste your time. The so-called Grand Challenge will never have a winner, trust me... Life Technologies is using this to take ownership of all the IP related to everyone's submissions, and this is a fact.

    • Thanks for your comment. I think many have thought what you have written at times, and it's an obvious concern with submissions, especially non-winning ones. Here is my solution.

      If you are ONLY interested in the money, you should apply for a job at Ion Torrent as I concluded in Part 1 of the Grand Challenge series.

      If you are interested in the recognition, it would be best to publish in a journal like Bioinformatics and then submit.

      If you own a PGM and just want the accuracy to improve, then just share your knowledge on the internet then submit. This is the approach I am taking. At the end of the day as a PGM user I just want the accuracy to improve.

  4. Great review! I think you see how the technology has a long way to go to fully mature. Improving the signal to noise will be critical for read lengths to increase and the base quality scores to increase. From your analysis do you think the plans to scale every six months are feasible? With smaller wells and less signal, I wonder how the buffering will impact the results. It would be interesting to fully compare the quality of 314 vs 316 vs 318. Chemistry looks to need to improve greatly to fit the trajectory they plot and this could be a big task.

  5. Captain Proton, I like the name. Thanks for the kind comment.
    I don't think I am in the best position to comment on whether the milestones/promises set by Life Technologies will be met. However, one good point Life Technologies made is that their machine is fast to run and highly flexible on many levels (e.g. Flow Scripts to modify all aspects of a flow), which means they can test many possibilities rapidly. In addition, in Part 1 of my Grand Challenge series I noted the target was twice as hard, with the number of total reads remaining relatively constant. This means there was massive improvement between quarters... albeit in the test data set released to accuracy challengers.

    There have been 4 different data sets released:
    314 – using standard 4 flow cycle
    314 – using a 32 flow cycle and variable frame rate compression
    316 – perhaps change in electronics?
    314 – long reads and a change in chemistry

    In an upcoming blog post I will cover improvements in the raw signal between these releases, if I can get my hands on some DAT files.

  6. This is a great review. I am in the process of learning about base calling in Ion Torrent. Where can I find Part I and II of this series? Thank you.
