Monthly Archives: August 2011

Ion Torrent Signal Processing (Part 1) – Background and Nucleotide Incorporation Models

This is the first part of a planned three part blog series on Ion Torrent signal processing. In this first part I will discuss the important aspects of the background and foreground model using key mathematical equations and pseudo code.  In the second part, I will outline the high level process of signal processing which includes the key parameters that must be fitted. In the final part, I will discuss the major assumptions and where the model breaks down.

The goal of Ion Torrent signal processing is to summarize time series data (Figure 1) into just ONE value which is then stored in the 1.wells file. The 454 equivalent is the .cwf file (thanks flxlex), however the difference is that Life Technologies has made their signal processing OPEN through the release of their source code. Without the source code, I would just be speculating in this blog series. So yay to available source code and kudos goes to Ion Community contributors Mel, Mike and particularly Simon for answering all my questions in great detail.

In my opinion signal processing is the root cause of two problems:

  1. Reads that must be filtered out due to poor signal profile*. This can account for up to 30% of the reads, as observed in the long read data set that was released.
  2. The resulting base calls, particularly towards the end of the reads. There is only so much signal normalization and correction (covered in the Fundamentals of Base Calling series) that can be performed.

Therefore, improvements made here will have the biggest effect on improving accuracy and increasing the number of reads. In other words, if you improve on this you can have ONE million dollars.

Ion Torrent – Signal Challenge

The major challenge of signal processing is that the foreground signal is not much bigger than the background signal. This is like trying to have a conversation with someone in a crowded noisy bar with loud music. This is very difficult but not impossible. Two reasons why it is possible:

  1. You start getting used to the background sound and learn to ignore it.
  2. You know what your friend sounds like and focus on only the key words in the sentence.

In reality though I refuse to try and instead nod my head away pretending to listen 😛 However, the Ion Torrent signal processing works on a similar principle.

Figure 1A. Uncorrected signal from the first 100 flows from a live well. This was from a 4 flow cycle (Q1 2011) and thus 25 flows per nucleotide. If you look hard enough there are small bumps between 1500-2000 ms that represent nucleotide incorporation.

Figure 1B. A typical baseline corrected measurement from an occupied well (red) and an adjacent empty well (black). The tiny red bump between 1500-2000 ms represents a nucleotide incorporation.

Background Model

The background model aims to approximate what the signal would look like for a given flow if there was NO nucleotide incorporation. The problem is what to use as a point of reference. The best and most intuitive source is a zero-mer signal from the well itself as this would encapsulate all the well specific variance and parameters. A known zero-mer signal can be taken from the key flows (i.e. first 7 flows). The only drawback is that each well is a dynamic system which changes over time due to slight variance in flow parameters and the changing state of the system. Another possibility is to re-estimate the zero-mer signal every N flows. The problem with this approach is that later on there will be no TRUE zero-mer signal as there will be contributions from lagging strands. The surrounding empty wells are the only candidate left.

The loading of chip wells with Ion Sphere Particles is a probabilistic event and not all particles fall into wells. Due to the size of the particles and wells, it is physically impossible to fit two particles in a well. Therefore, a well should either be empty or have one particle in it. The way the Ion Torrent detects whether a well is empty or not is by washing NaOH across the chip and measuring the signal delay compared to its neighboring wells (Figure 2). An empty well has less buffering capacity and therefore should respond earlier than its occupied neighbors with particles. There is sometimes a grey area in between and the Ion Torrent analysis uses clustering to best deal with this grey area.
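To make the clustering idea concrete, below is a tiny C/C++ sketch (my own toy, NOT the actual Torrent Suite code) that summarizes each well by a single response delay value and splits the wells into two groups with a simple 1-D 2-means clustering. The delay feature and the two-cluster assumption are my simplifications of what the analysis pipeline does.

// Sketch only: classify wells as "empty" (fast responders) or "occupied"
// (slow responders) using 1-D 2-means on a per-well response-delay feature.
// The feature and the two-cluster assumption are my simplifications, not
// the actual Torrent Suite algorithm.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<int> classifyWells(const std::vector<double>& delayMs) {
    // Initialize the two centroids at the min and max observed delays.
    double cEmpty = *std::min_element(delayMs.begin(), delayMs.end());
    double cOccupied = *std::max_element(delayMs.begin(), delayMs.end());
    std::vector<int> label(delayMs.size(), 0);

    for (int iter = 0; iter < 50; ++iter) {
        // Assignment step: nearest centroid wins.
        for (size_t i = 0; i < delayMs.size(); ++i)
            label[i] = (std::fabs(delayMs[i] - cEmpty) <=
                        std::fabs(delayMs[i] - cOccupied)) ? 0 : 1;

        // Update step: recompute each centroid as the mean of its members.
        double sum0 = 0, sum1 = 0;
        int n0 = 0, n1 = 0;
        for (size_t i = 0; i < delayMs.size(); ++i) {
            if (label[i] == 0) { sum0 += delayMs[i]; ++n0; }
            else               { sum1 += delayMs[i]; ++n1; }
        }
        if (n0) cEmpty = sum0 / n0;
        if (n1) cOccupied = sum1 / n1;
    }
    return label;  // 0 = putative empty, 1 = putative occupied
}

int main() {
    // Hypothetical per-well delays (ms): fast responders should cluster as empty.
    std::vector<double> delays = {12, 14, 13, 35, 38, 36, 22, 15, 37};
    std::vector<int> label = classifyWells(delays);
    for (size_t i = 0; i < delays.size(); ++i)
        std::printf("well %zu: delay %.0f ms -> %s\n", i, delays[i],
                    label[i] == 0 ? "empty" : "occupied");
    return 0;
}

The well with a 22 ms delay is the "grey zone" case from Figure 2: it simply ends up with whichever centroid it sits closest to.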

Figure 2. The voltage response from the NaOH wash at the start to detect occupied and empty wells. I’ll explain this in more detail in the next blog post. The putative empty wells (colored black) respond earlier and much faster than occupied wells (rainbow colored). The well represented as a red dotted line lies in the “grey zone”, i.e. hard to classify as either empty or occupied.

Background Signal

There are three major contributors to the background signal:

  1. Measured average signal from neighboring empty wells (ve). This signal must be time shifted as it will later be subtracted to leave the foreground signal.
  2. Dynamic voltage change (delta v). Can’t explain it beyond that 😦
  3. Crosstalk flux (xtflux)

I will let the mathematics do all the talking below 🙂 This is a screen capture of a latex document I produced a few months ago so I don’t remember much 😥 Please note all equations are MY INTERPRETATION of the source code and since I didn’t write the code, I am probably incorrect sometimes.
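The one-line version of how I read the three contributions combining is below. The simple additive form is MY placeholder; the actual functional forms in the source code are more involved.

b(t) ≈ v_e(t − τ) + Δv(t) + xtflux(t)

where b(t) is the background estimate for an occupied well, v_e(t − τ) is the averaged empty well signal time shifted by τ, Δv(t) is the dynamic voltage change term and xtflux(t) is the crosstalk flux.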


Foreground Signal – Nucleotide Incorporation Model

The foreground signal is calculated by subtracting the background signal from the measured signal for an occupied well. Using this model, we can determine the value A which represents the nucleotide incorporation value (aka uncorrected signal) that gets stored in the 1.wells file.

During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well, observed as a detectable voltage change. What I mean by “relatively” is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers. This looks like a sudden spike followed by an exponential decay (Figure 3). This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived nucleotide base (A,C,T,G) specific parameters such as Km values (plagiarized from myself :lol:).

Figure 3. Signal produced by subtracting an empty well from an occupied live well (i.e. subtracting the dotted black line from the red line in Figure 1). The peak is ~60. The average key flow peak in a typical Ion Torrent report is calculated in a similar way. This is from a Q1 2011 DAT file so it is not sampled at the more desirable rate.

Nucleotide Specific Parameters

Nucleotide Incorporation Simulation

The goal is to find A that best reduces the error. I will let the mathematics below speak for itself.
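The simplest version of the fit that I can write down is a least-squares one: if m(t) is the background-subtracted measurement and f(t) is a simulated unit (1mer) incorporation trace, then fitting m ≈ A·f has a closed form. This is my simplification of the fitting step, not necessarily the routine used in the actual ion-Analysis code.

// Sketch only: closed-form least-squares estimate of the incorporation
// amplitude A, assuming the background-subtracted measurement m(t) is
// approximately A times a simulated unit incorporation trace f(t).
// This is my simplification, not the routine in the actual ion-Analysis code.
#include <cstdio>
#include <vector>

double fitAmplitude(const std::vector<double>& m, const std::vector<double>& f) {
    // Minimize sum_t (m[t] - A*f[t])^2  =>  A = sum(m*f) / sum(f*f)
    double num = 0.0, den = 0.0;
    for (size_t t = 0; t < m.size() && t < f.size(); ++t) {
        num += m[t] * f[t];
        den += f[t] * f[t];
    }
    return den > 0.0 ? num / den : 0.0;
}

int main() {
    // Hypothetical unit incorporation trace (sharp rise, exponential decay)
    // and a measurement that is roughly twice as large plus noise.
    std::vector<double> f = {0.0, 0.5, 1.0, 0.8, 0.6, 0.4, 0.25, 0.15, 0.1};
    std::vector<double> m = {0.1, 1.1, 2.1, 1.5, 1.3, 0.7, 0.55, 0.25, 0.2};
    std::printf("fitted A = %.3f (expect roughly 2, i.e. a 2mer)\n",
                fitAmplitude(m, f));
    return 0;
}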

In the next blog post for this series, I will list the major parameters used in signal processing. These are the mysterious unaccounted variables in all the above equations, and I will also give a high level description of how parameter fitting is performed.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. The mathematical interpretation was done some time ago when I was in my “happy place”. Now I’m not in that “happy place” so don’t remember a thing!

Ion Torrent – Rapid Accuracy Improvements

The periodic public release of data sets by Life Technologies and others in the scientific community has allowed me to perform a “longitudinal study” of the improvements made on the Ion Torrent. In fact, the last few months have been quite exciting with Ion Torrent engaging the community through public data releases along with source code. This has made the whole scientific community feel in some way part of the action. In this three part series which will run in parallel with the Signal Processing series, I will look at three major developmental themes:

  1. Improvements in Accuracy
  2. Homopolymer problem – can’t call it improvements because I haven’t analyzed the data yet 😛
  3. Changes in the ion-Analysis source code. This binary is largely responsible for all the data analysis, that is going from raw voltages (DAT files) to corrected incorporation signal (SFF files). Subsequent base calling from SFF files is quite trivial 🙂

The analysis was performed using Novoalign in the Novocraft package (v2.07.12) according to the instructions detailed on their Ion Torrent support page. The plots were produced using the R scripts provided in the package with slight modifications to change the look and feel. I used the fastq files as input and did not do any pre-processing to ensure the reproducibility of the data. The same command line options were used and are noted at the bottom of each plot. The only exception is the last plot for the long read data set where the “-n 300” option was used to inspect quality past the default 150 bases. Kudos to Nick Loman for the help (see Comments below). I quite like the package and the fast support provided on the user forum (kudos to Colin). There is a nice gallery of figures provided on their Facebook page.

From the quality plots there are two very obvious things. First, the predicted quality is overly conservative and they are underselling themselves by an average of 10 Phred points. This was also noted in the Omics Omics, EdgeBio and Pathogenomics blog posts. Second, the predicted quality along reads from the 316 data set (figure below) used by the Illumina MiSeq application note is an unfair and incorrect representation of what is happening.

Raw accuracy cheat sheet:

Q10 = 90% accuracy
Q20 = 99% accuracy
Q23 = 99.5% accuracy
Q30 = 99.9% accuracy

In my opinion, actual observed accuracy is more important than predicted. For example, I predicted network marketing was going to make me a fortune and I would be financially free by now. Unfortunately, my friends didn’t want to buy my stuff 😥 Their loss !! The plots from the long read data set show the massive improvements made in just a few months. This makes me very optimistic for the future 😛

18th May 2011 (Analysis date)

Source: EdgeBio (Project: 0039010CA)
Chip: 314
Run date: 2011-04-07
Library: DH10B
ion-Analysis version: 1.40-0
Flow cycles: 55
PGM: 1EBIO

8th June 2011 (Analysis date)

Source: Life Technologies (316 data set)
Chip: 316
Run date: 2011-06-06
Library: DH10B
ion-Analysis version: 1.49-3
Flow cycles: 65
PGM: B13


21st July 2011 (Analysis date)

Source: Institute for Neuroscience and Muscle Research (my lab :))
Chip: 314
Run date: 2011-07-21
Library: DH10B
ion-Analysis version: 1.52-11
Flow cycles: 65
PGM: sn11c032809

Source: Institute for Neuroscience and Muscle Research (Run 2)
Chip: 314
Run date: 2011-07-21
Library: DH10B
ion-Analysis version: 1.52-11
Flow cycles: 65
PGM: sn11c032809

28th July 2011 (Analysis date)

Source: Life Technologies (Long Read data set)
Chip: 314
Run date: 2011-07-19
Library: DH10B
ion-Analysis version: 1.55-1
Flow cycles: 130
PGM: B14

Democratizing sequencing?

This wouldn’t be a blog post by me if I wasn’t complaining about something… I mean giving feedback 🙂 A slogan that is regularly used by Ion Torrent is how they are “democratizing” sequencing. In terms of releasing data sets and source code they are far ahead of their competitors Illumina and Roche. The above analysis would not be possible without the public release of data sets from Life Technologies and also EdgeBio (an Ion Torrent sequencing service provider). Illumina has provided some data sets from their MiSeq. Downloading these killed my bandwidth as they forgot to compress the fastq files. What n00bs! When will Illumina and Roche provide more data sets for their competing desktop sequencers? In the case of Roche, when will they provide any? Also when will they learn that people outside the major sequencing centers have brains and perhaps they should interact with them every now and then!

Despite the great efforts made by Life Technologies, there is still a long way to go in my mind to truly democratize sequencing. For example, early access to new products should be given to labs that are trying to make a difference in society and not just their “special customers”. What better way to promote your technology than by showing that a small lab with little experience can get it to work. I am not impressed at all if an experienced sequencing lab can get it to work. Giving these products to just special customers (aka the “big boys”) is NOT democratizing sequencing, it is maintaining the dominance these labs have over high impact publications. Our lab has requested early access to the TargetSeq enrichment system (not to be confused with the Qiagen SeqTarget junk). Having access to this enrichment would allow us to explore the possibility of diagnosing children with muscular dystrophy more efficiently and help parents, families and carers plan for the future natural progression of these crippling diseases. Having early access will give us an opportunity to produce preliminary data for the next grant round. How about helping the “little guy” for a change?

In my next blog post of this series I will provide an independent analysis of homopolymers using the data sets above. This will provide further discussion in addition to the great post comparing Ion Torrent and 454 homopolymers from flxlex.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. This is an independent analysis using Novoalign kept simple so others can reproduce the results. Despite begging, I have never been treated to free lunch/dinner or even a free T-shirt by Life Technologies 😥


Challenges in Improving Ion Torrent Raw Accuracy (Part 3)

This is the final part of the Fundamentals of Base Calling series. In the first part of the series, I covered signal normalization and droop correction. In the second part of the series, I introduced the problem of probabilistic nucleotide incorporation and its effects on the total signal (aka phasing problem). In addition, I briefly covered a dynamic algorithm for phase correction. Here is the source code for the implementation. Sorry it’s a doc file, couldn’t get wordpress to cooperate 😥 In this final part, I will detail the exact challenges faced with improving the raw accuracy of the Ion Torrent. This is an important topic as many know that the overall aim is to improve accuracy but are uncertain what actually needs improving. There is also a Blogtalk radio audio interview on this topic but it is NOT specific on what needs to be improved.

There are three broad levels which this problem can be approached on:

  1. Data acquisition
  2. Signal processing
  3. Signal correction and Base calling

Data acquisition

Data acquisition involves measuring the voltage in each well at predefined times, during each flow. The results of these measurements are stored in a DAT file. In each DAT file is the time series data for all wells for a given flow. Therefore, there are 260 DAT files as a sequencing run contains 260 nucleotide flows. The file specifications can be obtained from the Ion Community. It is important to note that this file is in big endian and must be converted to little endian before processing on a server with an Intel processor such as the Torrent Server. The computing people reading this would know what I mean 😛
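As a concrete illustration of the endianness point, a byte swap on read looks something like the sketch below. Treat the 16-bit frame values and the record layout as a hypothetical example; the real field sizes and offsets are in the Ion Community file spec.

// Sketch only: reading big-endian 16-bit values from a DAT file on a
// little-endian (Intel) machine. The record layout here is a hypothetical
// example; the real field sizes/offsets are in the Ion Community file spec.
#include <cstdint>
#include <cstdio>

// Swap the two bytes of a big-endian 16-bit value into host (little-endian) order.
static inline uint16_t be16_to_host(uint16_t be) {
    return static_cast<uint16_t>((be >> 8) | (be << 8));
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file.dat\n", argv[0]); return 1; }
    std::FILE* fp = std::fopen(argv[1], "rb");
    if (!fp) { std::perror("fopen"); return 1; }

    uint16_t raw;
    // Read the first few 16-bit words and print them in host byte order.
    for (int i = 0; i < 8 && std::fread(&raw, sizeof(raw), 1, fp) == 1; ++i)
        std::printf("word %d = %u\n", i, be16_to_host(raw));

    std::fclose(fp);
    return 0;
}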

Figure 1. A typical time series data extracted from a DAT file for one well. The well specific baseline value was subtracted from all values. There is a noticeable time delay after the two flow changes due to distance from the flow inlet and pH buffering capacity of each well.

The figure above shows a typical time series voltage measurement. A flow is composed of three components:

  1. Wash buffer (well specific baseline measurement)
  2. Nucleotide Flow (acidic)
  3. Less acidic wash

In earlier versions, the voltage was measured at a frequency of 15 frames per second (i.e. every 66 ms). However, there are two major disadvantages with this approach. First, the most important portion of this time series is the nucleotide incorporation, therefore most of the sampling should be performed during this time window. Second, the information contained in components (1) and (3) is worthless and should be sampled less frequently. This was acknowledged by introducing a variable frame rate, sampling component (2) more frequently and components (1) and (3) less frequently. This has resulted in a reduction of raw data, that is, smaller DAT files. The challenges at this level are:

  1. Whether the current sampling rate at (2) is at the Nyquist frequency (see the note after this list)
  2. The reduction in overall flow time without compromising accuracy. This is required so that longer reads don’t necessarily mean longer run times.
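For reference, the Nyquist criterion simply says the sampling rate caps the frequency content you can faithfully capture; the numbers below are my own arithmetic using the old fixed rate mentioned above.

f_max = f_s / 2, e.g. f_s = 15 frames/s ⇒ f_max = 7.5 Hz

So any structure in the incorporation pulse faster than ~7.5 Hz is lost at the old fixed frame rate, which is one motivation for sampling component (2) more densely.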

Life Technologies has provided Flow Scripts on the Ion Community such that people with PGMs can experiment with tweaking these parameters. Unfortunately, the software/firmware on the PGM itself is closed source and thus the sampling rate cannot be tweaked by outsiders. Developers within Life Technologies would have access to sequencing runs where parameters in the Flow Scripts and sampling rates have been tweaked. This would be okay IF accuracy challengers were not competing with Life Technologies.

Signal Processing

The goal of signal processing is to essentially summarize the time series data contained in the DAT files into ONE value. The result of this signal processing is stored in the 1.wells file. This contains, for each well and each flow, the signal incorporation value. The file specifications can be obtained from the Ion Community. The goal at this level is to subtract the background signal, leaving only the nucleotide incorporation signal. The figure below shows that the deviation caused by nucleotide incorporation is only slightly larger than the background signal.

Figure 2. The small changes in the waveform due to nucleotide incorporation. This is an uncorrected signal and thus the 5mer signal does not deviate 5 times the distance away from the non-incorporation response (i.e. 0mer curve).

In order to summarize signal to just ONE value, a background and foreground model is currently implemented.

Background

The background signal is what a signal would look like if there was no nucleotide incorporation. An obvious observation is that there is variation between flows and with well distance from the inlet and outlet fluid valves. Therefore, signals taken from surrounding empty wells are the best estimate of what the background signal should look like. A multi-parameter model is currently being used in Torrent Suite v1.4. This uses the surrounding empty wells to estimate each of the parameters for the background model. The parameter fitting, which uses linear algebra, is the most computationally expensive task in the whole Analysis pipeline. The use of NVIDIA Tesla GPUs has greatly reduced processing time in Torrent Suite v1.4.

Foreground

A signal incorporation profile is left after subtracting the background. The figure below shows a typical signal incorporation profile.

Figure 3. Averaged incorporation signal taken from the summary report for each run. Determined by subtracting the background response.

During each nucleotide flow, the polymerase adds nucleotides in a relatively synchronous manner and therefore produces a combined signal for the well, observed as a detectable voltage change. What I mean by “relatively” is that most nucleotides are incorporated soon after the nucleotide is flowed in, while some take a little longer to incorporate, which is usually the case with homopolymers. This looks like a sudden spike followed by an exponential decay. This foreground nucleotide incorporation is modeled as a Poisson distribution using empirically derived nucleotide base (A,C,T,G) specific parameters such as Km values. In an upcoming series, I will cover signal processing in much greater detail.

Implementing an accurate and robust background and foreground model is essential to improve accuracy at this level. Others may want to take a different angle on this signal processing, but using a background and foreground model is considered best practice for signal processing problems of this nature. The biggest limiting factor of this model is having to use surrounding empty wells to fit the parameters. It makes the assumption that signals from occupied wells and empty wells evolve similarly over time, which includes properties such as variation and noise. In other words, their pH buffering capacities are correlated. The increase in signal residue over time, shown in Figure blah in Part 2, suggests this assumption may not be correct and must be adjusted.

Base Calling

In Parts 1 and 2 of this series, I have discussed the main steps and goals of base calling so no introduction is required. There were three main deficiencies in the current method of base calling which I will now discuss. It is important to note that deficiencies must be addressed at the correct level. For example, it is very difficult to address the widening 1mer signal shown for Test Fragment C at the base calling level. This must be addressed at the signal processing level or even at the chemistry level.

Normalization

The normalization method requires estimating a 1mer signal from the T, A and C 1mer signals in the key flows. The G 1mer is not included as it is not guaranteed to be a 1mer, since it is the last base before the unknown library sequence starts. If the library sequence starts with a G then a 2mer would be read for this flow. If the library sequence starts with GG then a 3mer would be read for this flow, and so on. It is also important to note that this estimated 1mer signal is assumed to be independent of the base, which could lead to a bias in errors. The figure below shows that there is a bias when 1mer signals from each base are considered the same.

Figure 4. The insertion and deletion counts were from the June cropped data set available on Innocentive. This is from Q10 reads of length greater than 50 bases (207 reads) using a modified alignStats binary. This nucleotide bias was also noted in the Mass Genomics blog post.
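As a concrete sketch of the key flow normalization described above (my simplification with made-up numbers, not the actual Torrent Suite code):

// Sketch only: estimate a single 1mer scale from the known T, A and C key
// 1mer flows and use it to normalize later flow signals. This deliberately
// ignores the per-base bias discussed above and is my simplification of the
// idea, not the actual Torrent Suite normalization code.
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical raw key 1mer signals for the T, A and C key flows.
    std::vector<double> key1mers = {61.0, 58.5, 60.2};

    // Average them into a single 1mer estimate (base-independent by assumption).
    double oneMer = 0.0;
    for (double s : key1mers) oneMer += s;
    oneMer /= key1mers.size();

    // Normalize some hypothetical library-flow signals by the 1mer estimate.
    std::vector<double> raw = {0.5, 59.0, 121.0, 178.0};  // ~0mer, 1mer, 2mer, 3mer
    for (double s : raw)
        std::printf("raw %.1f -> normalized %.2f\n", s, s / oneMer);
    return 0;
}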

Another problem with using a 1mer signal to normalize a signal before base calling is that there is little room for error when normalizing large homopolymers. For example, a signal between 1.5-2.5 will be called as a 2mer. This requires a 1mer signal between 0.75-1.25. A signal between 2.5-3.5 will be called as a 3mer. This requires a 1mer signal between 0.83-1.16. And so on. This demonstrates that the estimate for the 1mer signal needs to be highly accurate for larger homopolymer calls. In other words, the % tolerance decreases as homopolymer length increases. The table below shows the % tolerance for each homopolymer length. This problem is compounded by the increase in signal residue deep into the sequencing runs, making large homopolymers difficult to call correctly.
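Written out directly (my back-of-the-envelope version of the table): calling an Nmer correctly requires the normalized signal to land within ±0.5 of N, so the relative tolerance on the 1mer estimate is roughly

tolerance(N) ≈ ±0.5/N, i.e. ±50% (1mer), ±25% (2mer), ±16.7% (3mer), ±12.5% (4mer), ±10% (5mer), …

which matches the 0.75-1.25 and 0.83-1.16 windows quoted above for the 2mer and 3mer cases.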

Thresholding

The number of bases called for a given flow is calculated by rounding the signal to the nearest integer. When the value is 2.1, 2.05 or 2.2 this is quite trivial as 2 is the closest integer. What happens when the value is 2.49 or 2.51? In the case of 2.49 this would be called as a 2mer, while 2.51 will be called as a 3mer. However, these two values were only separated by 0.02 and now are separated by whole integers!! The threshold problem is a common problem in statistical classification with many methods to address this. In fact, PyroBayes implements a naive Bayes classifier as an attempt to address the homopolymer problem. Interestingly, earlier versions of Torrent Suite have used statistical modeling or machine learning techniques implemented in the WEKA package.

In Torrent Suite v1.4, a greedy approach is taken with base calling. There are two types of greedy in this world.

  1. A person that takes the remaining Peking duck when everyone has had their share 😳
  2. A person that makes the best decision at the time and does not think about future repercussions. For example, turning off a main road to avoid a traffic jam but then you realize the back streets are actually slower 😡

Greedy type 2 is the one that is applicable to the base calling currently used on the Ion Torrent. This greedy approach works quite well and is fast and efficient BUT does not do so well with boundary cases such as 2.49 and 2.51.

Using a greedy approach results in the classic undercall-overcall or overcall-undercall problem observed in the data, highlighted in the Mass Genomics blog post. For example, if a C 3mer is incorrectly called as a C 2mer, the lagging signal produced may be mistaken for a 1mer in the next C flow.

A non-greedy approach would take both paths, that is call a 2mer and 3mer and observe the resulting future consequence in taking both paths before committing to a base call. In this case, future consequence is concerned with reducing the residue signal (i.e. unexplained variance). The goal of a non-greedy approach is to reduce the overall residue and not the residue for any one particular base call.

The thesis of Helmy Eltoukhy demonstrated that using a modified Viterbi algorithm (i.e. a non-greedy approach) can reduce the homopolymer error in 454 sequencing. This was demonstrated using Markov Chain Monte Carlo (MCMC) simulations. He purposely left out all the important details so, like the thesis marker with blind faith, we’ll all believe him 🙂 Besides needing to be adapted for base calling, the weakness of the Viterbi algorithm is that it only has a limited memory. A depth first or breadth first tree search of homopolymer possibilities, with edges weighted by the resulting residue, would perform better but its performance is highly dependent on the signal to noise ratio.
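To make the greedy vs non-greedy contrast concrete, here is a toy sketch in C/C++. It compares simple rounding against a tiny exhaustive search that tries both candidate calls at every flow and keeps the path with the smallest total residue. The spill-over model (a fixed fraction of each called homopolymer lagging into the next flow of the same base) is a deliberately crude stand-in for the real phasing model, and all the numbers are made up.

// Sketch only: greedy "round each flow" calling vs a tiny non-greedy search
// that evaluates both candidate calls per flow and keeps the path with the
// smallest total residual. The spill-over model (a fixed fraction LAG of each
// called homopolymer lags into the next flow of the same base) is a crude
// stand-in for the real phasing model.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Flow { char base; double signal; };

const double LAG = 0.17;  // toy lagging fraction per called base (assumption)

// Total residual of a complete call path under the toy spill-over model.
double pathResidual(const std::vector<Flow>& flows, const std::vector<int>& calls) {
    double lag[256] = {0.0};
    double total = 0.0;
    for (size_t i = 0; i < flows.size(); ++i) {
        unsigned char b = flows[i].base;
        double predicted = calls[i] * (1.0 - LAG) + lag[b];
        total += std::fabs(flows[i].signal - predicted);
        lag[b] = LAG * calls[i];  // this call's lag shows up at the next same-base flow
    }
    return total;
}

// Non-greedy: enumerate floor/ceil candidates at every flow, keep the best path.
void search(const std::vector<Flow>& flows, size_t i, std::vector<int>& path,
            std::vector<int>& best, double& bestScore) {
    if (i == flows.size()) {
        double score = pathResidual(flows, path);
        if (score < bestScore) { bestScore = score; best = path; }
        return;
    }
    int lo = std::max(0, static_cast<int>(std::floor(flows[i].signal)));
    for (int c = lo; c <= lo + 1; ++c) {
        path.push_back(c);
        search(flows, i + 1, path, best, bestScore);
        path.pop_back();
    }
}

int main() {
    // Hypothetical boundary case: a true C 3mer measured as 2.49 whose lagging
    // signal shows up as a 0.51 bump in the next C flow.
    std::vector<Flow> flows = {{'T', 0.84}, {'C', 2.49}, {'A', 0.84}, {'C', 0.51}};

    std::printf("greedy (simple rounding): ");
    for (const Flow& f : flows)
        std::printf("%c%d ", f.base, static_cast<int>(std::lround(f.signal)));
    std::printf("\n");

    std::vector<int> path, best;
    double bestScore = 1e30;
    search(flows, 0, path, best, bestScore);
    std::printf("non-greedy (min total residue): ");
    for (size_t i = 0; i < flows.size(); ++i)
        std::printf("%c%d ", flows[i].base, best[i]);
    std::printf("\n");
    return 0;
}

In this toy example the greedy caller makes the classic undercall-overcall mistake (C2 followed by a false C1), while the residue-minimizing search recovers C3 followed by C0.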

Variable Signal Correction Parameters

Currently the major parameters (Droop, Carry Forward, Incomplete Extension) used for base calling are constant and do not change between flows. First, these parameters may be nucleotide base specific. Second, they may vary depending on the homopolymer length. It is intuitive that incomplete extension is more likely for longer homopolymers. Last, the buffering and reagent effectiveness in the well may have changed drastically compared to what they were at the start of the sequencing run.

This concludes the Fundamentals of Base Calling blog series. It is extremely important to note that each problem/improvement mentioned is entangled and is highly dependent on the other components. This suggests that a unified model, in which the signal processing and base calling modules can communicate, would be beneficial due to this entanglement. Currently the modules only communicate via the 1.wells file, where most of the information is lost due to data reduction.

In my opinion, the biggest gain would be to address this problem at the lowest level, that is the chemistry. Improving chemistry will make the problem easier on all three levels. In fact, the 200 bp long read data set is mainly due to chemistry improvements. I strongly believe due to this entanglement that major improvements in accuracy cannot be solved through computational methods alone. This is in contradiction to Rothberg who claims “homopolymer errors are pretty much due to software processing.”

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. Care Bears is the best show ever!

Life Grand Challenges – Innocentive but not enough Incentives (Part 3)

In this series I explore three aspects of the Life Grand Challenge. In the first part, I briefly described the Life Grand Challenges, estimated active participation and lastly proposed the current unfairness in the challenge which may be what is holding back active participation. In the second part, I detailed the current resources available to accuracy challenge participants and the possibility that our current generation are only looking for quick rewards. In this last part, I will describe the successful NetFlix challenge and their crowd sourcing model.

NetFlix Challenge

Ever used IMDB or Rotten Tomatoes as an indicator of how good a movie is and whether you should watch it? I have but lately I really need to find out who has been rating crap movies as good. For example, Animal Kingdom is the biggest piece of artsy fartsy crap I’ve ever seen. It got 97% on Rotten Tomatoes, while Tekken best movie ever only got 5.0 on IMDB. This highlights the problem – just because someone else likes the movie doesn’t mean I would like the movie.

Netflix wanted to know what a customer would rate a movie, given that customer’s history of movie ratings. They have a current implementation called Cinematch but wanted people to improve the accuracy. The competition rules/specifications in point form:

  1. The competition opened in 2006 and would end in 2011
  2. A million dollar prize was given to the person/team that improved accuracy by 10%. Once a participant has hit this target, other participants are notified and have 30 days to try and beat this submission.
  3. A progress prize of $50,000 is given each year to the person/team that improved accuracy by 1% over the previous progress prize winner. Therefore, for the first year this would just be the submission with the best accuracy.

NetFlix Data Set

The challenge participants were given two data sets.

  1. “The training data set consists of more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers on nearly 18 thousand movie titles.” Customer details were withheld and a random customer ID was used.
  2. “A qualifying test set is provided containing over 2.8 million customer/movie id pairs with rating dates but with the ratings withheld. These pairs were selected from the most recent ratings from a subset of the same customers in the training data set, over a subset of the same movies.”

The movie rating is an integer score between 1 and 5 stars. The goal is to predict what the withheld rating is in the test data set. The root mean square error (RMSE) is then calculated from predicted and actual (i.e. the withheld) ratings.  The RMSE score achieved by the Cinematch program was 0.9525, therefore the grand prize winner must achieve 0.8572 (i.e. 10% accuracy improvement). This was achieved in September 2009.

How big was the crowd Netflix was sourcing from?

According to the leader board:

“There are currently 51051 contestants on 41305 teams from 186 different countries. We have received 44014 valid submissions from 5169 different teams.”  This does not make the 500 or so accuracy challenge participants with currently NO submissions seem so impressive for the Life Grand Challenge. Now I will highlight the differences between the two crowd sourcing models.

Competition Fairness

First and foremost, with the NetFlix challenge you are not competing with the company itself. There is not a team of Cinematch programmers with access to their whole database competing against you. Second, everyone has the same data set to try to figure out a better algorithm. In massive contrast, for the Ion Torrent, larger sequencing centers will have larger data sets for their employees to work with. I currently have a pathetic TWO whole 314 data sets to work with. Yes, this is better than having ONE but not much better. Meanwhile, accuracy challengers working at Sanger, Broad and BGI are swimming in this stuff. Not to mention preferred and early access customers. This clearly highlights that Life Technologies is NOT democratizing sequencing, more like a dictatorship. Don’t be surprised if someone from BGI wins this competition. This unfairness in competition is even worse for the speed and throughput challenges, which also favor the sequencing centers with the massive budgets, i.e. Sanger, Broad and BGI!

Moving Target

The target for the NetFlix challenge was an RMSE of 0.8572 set in 2006. This target did not change during the period of the competition. In contrast, the target for the accuracy challenge changes every 3 months. This quarter’s target is actually twice as hard to achieve as last quarter’s. If the active participation was next to zero last quarter, this isn’t a good way of encouraging people to take up the challenge.

Milestone rewards

The milestone reward each year was $50,000 for the NetFlix challenge. In massive contrast, the Life Grand Challenge gives you absolutely nothing for all your hard efforts. Kinda makes you wanna code to 4am each night and turn up to work as a zombie 😆 No way, I’m saving that for the release of Diablo 3 😛

There are two simple things that made the NetFlix crowd sourcing model successful. These two concepts were taken from a great blog post discussing the NetFlix Challenge. I have added my own two points, those of fairness and the value of ideas and concepts.

Reward system

Both the NetFlix challenge and the Life Grand Challenge have a 1 million dollar grand prize, however the similarities end there. The NetFlix challenge rewards the best submission for the year with $50,000. Imagine all the instant noodle packets you can buy with that money 😆 In addition, the challenge participants are publicly known and therefore it is a free advertisement to the world about their talents. This serves as a non-monetary but good reward. At the moment for the grand challenge, all participants, submissions and the leader board are kept hidden from the public.

The value of Ideas and Concepts

My suggestion is to reward ideas/concepts and not just solutions. This in a way will provide small rewards or milestones in trying to reach the best solution. These ideas and concepts may be valued at $5,000 or $10,000 and of course a solution that smashes the benchmark is still worth 1 million. The likelihood of someone in the public coming up with a solution is low but much higher for ideas/concepts that appear to have potential.

To expand a little further on the solution vs idea/concept discussion: a solution must work within a restrictive framework which computational analysis has no control over. For example, most people working on this solution do not have an Ion Torrent and therefore cannot modify flow order, chemistry, etc. If you view the accuracy challenge as a holistic problem with many factors coming together to achieve the goal, sadly a solution may be difficult to achieve. However, an idea/concept does not have to be limited to a restrictive framework. For example, an idea/concept could become a solution if things upstream were tweaked to take advantage of it.

Fairness

Besides some people having bigger brains and no life, the NetFlix challenge was quite fair. Everyone had the same data set available to them and they were not competing with NetFlix employees. While the Life Grand Challenge is as fair as an election in Zimbabwe 🙂 In the Ion Community the concept of offering a small grant to promising participants was entertained. Also a discount of 30% was offered for a PGM purchase for accuracy challenge participants. I argued that it would be impossible for someone to prove that they were active. It would go something like this “Thanks for the discounted PGM, I tried improving accuracy but it was too hard so I gave up after 5 minutes”. This offer on Innocentive was promptly removed for other reasons. I do hope they come up with some innovative initiatives to improve an obvious lack of fairness.

Motivation

The 454 homopolymer problem has existed since the technology was released. Why would Ion Torrent persist with a similar flawed design? The 454 problem still persists, so this is not going to be solved in one quarter; therefore it is important to keep the motivation of the challengers up at all times. My suggestion: a leader board that is publicly available. All programmers love to see their name in lights, particularly if it is at the top. Also a substantial quarterly prize, and I’m not talking about a free T-shirt 😛 Even the PGM users in the Ion Community showing their best Ion Torrent runs are given a reward. Developers have been offered nothing, so little surprise there haven’t been any third party plugins developed for the Torrent Browser.

This concludes my opinion based blog on the Grand Challenge and how it may be a “grand challenge” to improve the current model, which appears more like a public relations stunt rather than a true fair competition. The criticism is harsh but I believe I am voicing the concerns of the many and, through feedback, I have faith the competition will improve. My next few posts will be purely technical because there is only so much opinion I can write before I start sounding like Abe Simpson.

My next post will be Part 3 on the Fundamentals of base calling, where I will outline the EXACT computational challenges that must be overcome to improve accuracy. This post will be released at the start of next week along with the source code used to generate the data for that series.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. Opinons are like PGMs, everyone should have one 🙂

Life Grand Challenges – Mission Impossible to Improbable (Part 2)

In the first part of this three part series, I gave a brief introduction to the Life Grand Challenge, estimated the current participation then speculated why it isn’t higher. To be unbiased and not to be seen as a hater, I’ve decided to make this into a three part series. Before discussing the lessons learned from the successful NetFlix crowd sourcing model (Part 3), I will write about what Life Technologies has done right with the Grand Challenge.

One million dollar prize money

If this doesn’t motivate you then I guess nothing will. Rothberg and the PR machinery couldn’t have done any more to get it out there all over the internet. Although if the prize was a date with a girl, this would have worked equally well for Life Technologies. See Beauty and the Geek as a reference. I guess the girls working on the challenge would probably want the one million 😛

Adequate resources

Life Technologies are quite serious and have provided what I believe are adequate resources. First, the source code, which is an extremely important learning resource even if you aren’t interested in the accuracy challenge. It is a very rare opportunity for people to get a look at this valuable piece of intellectual property. In addition, you get an insight into the progress of the code as new versions are released each quarter. The code itself is very impressive, particularly in exploiting the NVIDIA Tesla GPU, although the matrix and linear algebra GNU libraries may not be fully optimized yet 😦 I can’t wait for more bioinformatics programs to exploit the power of GPU computing. Only criticism, get larger screens for the developers… I feel their pain. I can tell how big their screens are by where they break their new lines 😦

Second, a forum space on the Ion Community to engage with the actual people who have written the code. This is an extremely rare opportunity to have a developer tell you why they have decided to do things the way they have.

Third, access to computing resources which emulate the Ion Torrent server. This is done by making an Ubuntu server image available on the Amazon EC2 cloud. The cloud includes ONE full 314 data set of an E. coli run, which is approximately 50 Gb in size. Access to this cloud, however, comes at a cost depending on how much CPU time you consume. On Innocentive, a VMware Ubuntu 64 bit server image is offered as a download for the poor students. Before our Ion Torrent server arrived, I used VirtualBox, which is a freeware alternative to VMware and is compatible with VMware virtual disks. There is still some work to do with regard to making MORE raw data sets (i.e. DAT files) easily accessible. This will remove the temptation for accuracy challengers to overfit their implementation to the ONE data set they’re given.

Last, a brief introduction to the Life Grand Challenges and the major topics and themes in the accuracy challenge. This is in audio form on blog talk radio. Therefore, if you are like me and don’t like to read, you don’t have an excuse because someone reads it for you … too easy !!! And there are no tricky big words 🙂 There is also a webinar series to help those starting out, particularly with Amazon EC2. Another helpful tool is a custom R package to parse and analyze the data sets.

Why aren’t people motivated

There is one million dollars on offer and enough resources for someone who is competent in bioinformatics and programming. Besides the reasons I outlined in Part 1, what else could be holding people back?

I have been on this Earth for a while and find humans very weird and much prefer lizards because they can grow their tails back. Anyways, humans of our generation like small rewards and milestones while trying to achieve something big or hard, otherwise they lose interest extremely fast. People from the Facebook/Twitter generation think that if something can’t be done in 5 minutes then it’s just too difficult or impossible. A Nature publication – that’s easy, that will take one week’s worth of work, is the attitude. Someone in the Ion Community asked for the mathematical model of nucleotide incorporation (implemented in the source code) to be spoon fed to them. Probably after their brilliant idea of googling the C++ solution with the query – “Life Grand Challenge C++ solution” – didn’t work. But what is expected from a generation that googles their English essays.

Anyways, back on topic. The way the reward system is structured is that you get absolutely nothing if you do not produce something twice as good. Although it was hinted that any solution would be considered, this promise does not seem to be set in stone because it was not written anywhere in the Innocentive specifications, or I didn’t take notice of it.

In conclusion, it’s quite obvious that having a challenge is much better than not having a challenge at all. If anything, it has made a load of resources available to the scientific community. In contrast, Roche never shared many resources with its 454 community. It explains why, after years of struggling by scientists, the naive Bayesian approach taken by PyroBayes is the best publication on dealing with the homopolymer problem. A great publication but limited by the resources available. Roche and Illumina should loosen their grip on their intellectual property; if someone finds a better way of doing something then everyone wins. Yes, people may laugh and criticize your intellectual property but eventually someone will help find a better way of doing something you are struggling with and then everyone wins 🙂

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. Open source, funny comments 🙂

Life Grand Challenges – Mission Impossible?

I am taking a break before writing Part 3 of The Fundamentals of Base Calling, where I will discuss what technically needs to be addressed to improve the raw accuracy of the Ion Torrent. Stay tuned especially if you want the C/C++ example source code. This is the first post of a 2 part series on the Life Grand Challenge. The first part involves criticism, because there must be something they are doing wrong; you would think that every nerd and his imaginary girlfriend would be working on this problem given the 1 million dollar prize. The second part contains suggestions and discussions using the successful NetFlix crowd sourcing model. Before I start with the hater stuff, I would like to say that engaging the talent of the scientific community is a good thing. It recognizes that there exist smart and innovative people outside the Broad, Sanger, BGI, etc. and that their opinions are important. Something Illumina and Roche must learn! Being in these large sequencing centers breeds laziness and unaccountability, while labs on a shoe string budget must be very resourceful and innovative in their thinking and savvy in their spending to compete with the big boys for high impact publications. Anyways, enough ranting and on to the topic.

This idea of crowd sourcing (not to be confused with crowd surfing) is the best thing since incorporating laws in nature that allow nuclear fusion to occur, allowing a sun that sustains life, OR the best thing since sliced bread. This business model allows corporations to hire a limitless supply of talent from around the world without paying them unless they achieve the goal. This is an even WORSE idea than outsourcing to India and China as it uses greed to take advantage of the WHOLE world!

The Life Grand Challenge is a crowd sourcing initiative launched by Life Technologies just before the Easter holidays. It is an innovative way of engaging the scientific community in solving, at present, 3 Ion Torrent improvements:

  1. Throughput
  2. Speed
  3. Accuracy

The 4th challenge relates to the SOLiD sequencing platform. Not quite 7 at the moment, so there are reporters out there that really need to use their fingers to count. These challenges are hosted on the Innocentive Pavilion (i.e. website). I will now concentrate all discussion on the Accuracy challenge, which is purely computational, or so it seems. The way it works is that every quarter a new internal benchmark is released and challengers aim to better that benchmark by first demonstrating it on the data set provided. The data set is an E. coli sequencing run, surprise surprise! The performance benchmark is achieved when a challenger achieves the same number of 100Q23 reads as the number of 100Q20 reads produced by the ion-Analysis program.

  • First Quarter benchmark – 366,880 100Q23 reads (from total 753,210 reads)
  • Second Quarter benchmark – 632,468 100Q23 reads (from total 874,282 reads)

In the first quarter of the competition (mid April – mid June), despite the number of “Innocentive Active Solvers”, there were no submissions to Innocentive (to my knowledge), not even ones that did not reach the benchmark. We are currently in the second quarter of the challenge, which ends September 15th, after which Life Technologies will release a new benchmark target at the start of the third quarter of the competition. There are currently no submissions that appear on the Leaderboard. Either the technology and chemistry have improved in one quarter OR the problem just got twice as hard, as the number of total reads did not increase that much.

How many ACTIVE accuracy challengers are there?

There are two important factors in all competitions: difficulty, and the number and quality of your opposition. I have outlined above how difficult the problem is. Currently there are 541 “active solvers” according to Innocentive. This is not a measure of how many active accuracy challengers are out there but merely the number of people who have clicked on the terms and conditions so they can view the competition details. For example, I am an active manuscript writer. Every day I intend to start my manuscript but never get around to it 🙂 Also in a recent press release, the claim was that there were 500 Life Grand Challengers. Many of these would be Accuracy challengers as the other two challenges require a PGM. Again this does not distinguish between active challengers and the Innocentive count.

To be active in the grand challenge, the challenger must have done the following:

  1. Join the Ion Community – TorrentDev space – 54 active out of 2694 possible people. Less than a handful have discussed Challenge specific topics.
  2. Read the technical specs on the Raw Image Acquisition File Format (DAT file) – less than 60 views on this document and most of those were actually me returning to it!
  3. Downloaded the latest source code (optional) – 35 views for the latest version. Earlier versions have 500+ views.

You can be the judge, based on the numbers above, of how many people are actively competing; however it would be IMPOSSIBLE to do without at least joining the Ion Community to get either the file format specification or the source code. Although previous versions of the source code (i.e. Torrent Suite v1.3) had a high number of views, I don’t think many did much after viewing or even downloading. If they did you would expect the script kiddies to complain that they can’t get the code to compile, mainly because there were bits missing at each initial release. I have been the only one to correct them on this because yes, I have downloaded and unpacked AND COMPILED the source code! In the two not so important source packages there are still bits missing.

From the point of view of Life Technologies/Ion Torrent

Normally a business would hire or allocate a team of people to achieve a goal and needs to pay a salary regardless of whether they achieve the goal or not. In other words, if you are working on this problem, you are better off going for a job with Ion Torrent. Also these teams of people take sick days, take annual leave, go through divorces, have psychological problems and usually some don’t get along with each other. In other words everyone is a potential HR nightmare.

In addition, R & D staff at Ion Torrent have at least a 2 year head start on everyone, so with a moving target that resets every 3 months… come on, the only people that are going to beat Ion Torrent are crazy Russian geniuses…. who are most probably doing crazy things instead. Not to mention staff probably get a bonus every quarter every time they make the target that much harder for the community to achieve, which is what they are employed to do.

From the point of view of the Challengers

First and foremost you are already behind the eight ball as you are competing with people with a two year head start and unlimited access to massive data sets and the latest planned hardware and chemistry improvements. Having access to code is great… but code with limited comments and stuff left there for historical reasons doesn’t really help with understanding it fast. Also there is a lot of code to decipher. Some may say why bother looking at the code? My response: you need to know what paths have been taken first before you dedicate weeks of your time developing something else… because you may find you have developed something that was already in the code and, worse still, it is a crappier version. Secondly, you probably have a day job and family/friends so not much free time. Again you are competing with people in Ion Torrent who do this during the day as they are employed to do this. Lastly, say you try a new technique that takes 2 months to implement but it doesn’t work. What are you left with? A pissed off wife, because you didn’t spend time with the family in your free time OR, if you are a student, time you could have spent impressing girls on Dance Dance Revolution! For those that are thinking of taking up the challenge, I can tell you from my experience it doesn’t impress girls and doesn’t make you more attractive to them, which is a shame because in theory it should LOLZ. On a serious note, the optimal play I have analyzed is to be employed by Ion Torrent because at least if your ideas don’t work out…. you still have money in your bank account.

Now that I’m done with the harsh criticism, the next blog post (Part 2) will discuss what the Life Grand Challenge initiative can learn from the successful NetFlix crowd sourcing model, including my 2.2 cents (GST inc) worth.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. I am also off the angry pills.

Fundamentals of base calling (Part 2)

In part 1, I discussed signal normalization and droop correction. Below is the amount of variation in the signal still left to be explained.

Figure 1. Signal residues calculated by | Normalized Signal – Ideal Signal |. The residues are largest at the 3mers, 4mers and towards the end of the read.

There are a lot of journal articles out there that discuss the phase problem in a non-technical way, but in order to fully appreciate the problem and to be of any benefit to those who want to improve the Ion Torrent accuracy, it MUST be discussed technically.

The Phase problem

Firstly, the problem of phase correction is not unique to Ion Torrent but is common to all technologies based on sequencing by synthesis, which also includes Illumina and 454 sequencing. What does sequencing by synthesis actually mean? In a nutshell it is observing DNA polymerase adding nucleotides one by one, creating the complementary DNA strand. The difference between the technologies is how the observation is made.

In the case of Illumina, dye labelled nucleotides (A,C,T,G) are flowed and the polymerase incorporates a nucleotide, then a picture is taken after this event. The genius of this technology is the reversible dye used to label the nucleotides. For each flow only one nucleotide can be added, which removes the problem of detecting whether multiple nucleotides (homopolymers) were added. Unlike Sanger sequencing, the dye can be removed, allowing another dye labelled nucleotide to be incorporated in the next flow.

In the case of Ion Torrent, the nucleotides are not labelled and a by-product of incorporation (hydrogen ions) is detected instead. We know which of the 4 nucleotides was incorporated as ONLY ONE nucleotide is present in each flow. In addition, since these nucleotides are unmodified if you have multiple nucleotides to be incorporated (i.e. homopolymer) this would produce a proportionally larger amount of hydrogen ions. The 454 works on the same principle but a different by-product of nucleotide incorporation is detected. In all three technologies, a large number of identical template strands are concentrated in a small area (i.e. cluster or well) allowing the combined signal from a large number of identical strands to be detected.

Finally, the problem itself. For each flow, whether a nucleotide is incorporated is not a deterministic event but rather a probabilistic one.

Carry Forward errors are analogous to false positives, while Incomplete Extension is analogous to false negatives. The remainder of the discussion will mainly focus on Incomplete Extension as the contribution made by Carry Forward errors is negligible in comparison.

Consequence of Incomplete Extension

Before I start this discussion, I need to define the difference between the number of bases called and the number of incorporation events. I’ll illustrate with the example sequence below:

AATTCGGG

In this sequence there are 8 bases, however the Ion Torrent would have achieved this sequence read through four positive flows – A, T, C and G flows in that order. In other words, only 4 incorporation events are needed to produce the above sequence. For Incomplete Extension, we are only interested in what happens during incorporation events (i.e. positive flows).
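To make the distinction between bases and incorporation events concrete, here is a small C/C++ sketch that converts a sequence into ideal per-flow incorporation counts for a repeating TACG flow order (the flow order is my assumption for illustration; real runs use their own flow orders).

// Sketch only: convert a template sequence into ideal per-flow incorporation
// counts (a "flowgram") for a repeating flow order.
#include <cstdio>
#include <string>
#include <vector>

std::vector<int> idealFlowgram(const std::string& seq, const std::string& flowOrder,
                               int nFlows) {
    std::vector<int> counts(nFlows, 0);
    size_t pos = 0;  // next template base to incorporate
    for (int f = 0; f < nFlows && pos < seq.size(); ++f) {
        char nuc = flowOrder[f % flowOrder.size()];
        while (pos < seq.size() && seq[pos] == nuc) {  // a homopolymer extends in one flow
            ++counts[f];
            ++pos;
        }
    }
    return counts;
}

int main() {
    std::string seq = "AATTCGGG";
    std::vector<int> counts = idealFlowgram(seq, "TACG", 8);
    int positiveFlows = 0;
    for (size_t f = 0; f < counts.size(); ++f) {
        std::printf("flow %zu (%c): %d base(s)\n", f, "TACG"[f % 4], counts[f]);
        if (counts[f] > 0) ++positiveFlows;
    }
    std::printf("%zu bases called from %d incorporation events\n",
                seq.size(), positiveFlows);
    return 0;
}

Running this on AATTCGGG reports 8 bases from 4 incorporation events, matching the example above.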

The above figure shows how quickly a sequencing read can fall out of phase, to the point where the lagging fraction (or population) makes up the majority of the signal. To put this in perspective, Test Fragment A requires 72 incorporation events and the error spike mentioned earlier occurs at the 59th incorporation. The approximated Incomplete Extension for Test Fragment A is p=0.012, thus the blue line is the best match for what is happening in Test Fragment A. To achieve longer reads (> 200 bp) the Incomplete Extension value should be much lower and closer to 0.001 (purple line).
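As a back-of-the-envelope check of those numbers (my own arithmetic, assuming each incorporation event independently fails with probability p), the fraction of strands still perfectly in phase after k incorporation events is:

in-phase fraction ≈ (1 − p)^k, e.g. (1 − 0.012)^59 ≈ 0.49 while (1 − 0.001)^59 ≈ 0.94

so at p=0.012 roughly half the population is already lagging by the 59th incorporation, which is exactly where the error spike appears.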

However, all is not lost; if we can get a good approximation of the probability of Incomplete Extension then the signal can be phase corrected (aka dephased) by either simulation (Torrent Suite v1.3) or from first principles (outlined below). The first principles method (using dynamic programming) discussed below comes from deciphering the work of Helmy Eltoukhy published in IEEE and also in his Stanford PhD thesis. Like all good publications from mathematicians, it is completely incomprehensible! 😦 This is probably why the only referencing an IEEE publication receives from outsiders is when someone is criticizing the efficiency of their algorithm 😮 Please appreciate the hours it took me to decipher the incomprehensible but brilliant work of Helmy Eltoukhy.

Phase correction

I will use a 4 flow cycle to illustrate how this is performed as this is much easier to understand than a 32 redundant flow cycle.

For a given flow (n), there are several components required for phase correction:

  1. Normalized signal incorporation value (worked out in Part 1 of this series) (y_n)
  2. Nucleotide (A, C, T or G) that corresponds to the flow
  3. Ideal signal (a) from incorporation events for that nucleotide from preceding flows. Defined by: a_{n−4}, a_{n−8}, a_{n−12}, …
  4. Probability of Incomplete Extension (p)

Using the value p, we can determine the fraction of strands in the in-phase population and also in the lagging populations (w_{n−4}, w_{n−8}, …).

The phase corrected signal for this flow would be: Observed signal – Lagging signal. This also requires normalization as only a fraction (w_n) of the entire population makes up the in-phase population. The pseudo code looks like this:

u_n = (y_n − w_{n−4}·a_{n−4} − w_{n−8}·a_{n−8} − w_{n−12}·a_{n−12} − … ) / w_n
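And here is a minimal C/C++ sketch of that recurrence, assuming for simplicity that the in-phase fraction and the lagging fractions are the same at every flow (in the full treatment they evolve with n, and deriving them from p is the part I am not reproducing here).

// Sketch only: the dephasing recurrence above written out in C/C++, assuming
// the population fractions are the same at every flow. lagFrac[j] is the
// fraction of strands lagging by j flow cycles; inPhase is the in-phase
// fraction. All numbers in main() are made up for illustration.
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> phaseCorrect(const std::vector<double>& y, double inPhase,
                                 const std::vector<double>& lagFrac) {
    std::vector<double> u(y.size(), 0.0);
    std::vector<double> a(y.size(), 0.0);  // ideal (called) signal per flow
    for (size_t n = 0; n < y.size(); ++n) {
        // Subtract the expected contribution of populations lagging by one or
        // more 4-flow cycles: sum_j lagFrac[j] * a_{n-4j}.
        double lagging = 0.0;
        for (size_t j = 1; j <= lagFrac.size(); ++j) {
            if (n < 4 * j) break;
            lagging += lagFrac[j - 1] * a[n - 4 * j];
        }
        u[n] = (y[n] - lagging) / inPhase;   // normalize by the in-phase fraction
        a[n] = std::round(u[n]);             // committed call feeds later corrections
    }
    return u;
}

int main() {
    // Hypothetical normalized signals for 8 flows (4-flow cycle) and toy fractions.
    std::vector<double> y = {1.00, 0.02, 2.01, 0.01, 1.03, 0.06, 0.08, 1.97};
    std::vector<double> lagFrac = {0.04, 0.01};  // lagging by 1 and 2 cycles
    std::vector<double> u = phaseCorrect(y, 0.95, lagFrac);
    for (size_t n = 0; n < u.size(); ++n)
        std::printf("flow %zu: y = %.2f -> u = %.2f\n", n, y[n], u[n]);
    return 0;
}

Note how the small bump at flow 6 (0.08) is explained away as lagging signal from the 2mer four flows earlier instead of being left as unexplained residue.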

Figure 2. Signal residues calculated by | Phase Corrected Signal – Ideal Signal |. The greatest improvement compared to Figure 1 is the 3mers and 4mers at the start of the read. The residues are still quite high for positive flows late into the read. The phase corrected signal was produced using the following fitted parameters: p = 0.988, ε = 0.0075, droop = 0.00075.

The problem now remains to model and account for the remaining residue. This will be discussed in Part 3 of this series along with the dynamic programming implementation of phase correction as C/C++ code.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct.