Fundamentals of base calling (Part 1)

The main goal of base calling is to go from raw numerical values (Figure 1) to a sequence of bases (Figure 2). In the case of the Ion Torrent the numerical values are the signal incorporation values for each well, for each flow. This is contained in the 1.wells file which is approximately 2 Gb in size for a 314 Chip run. These values vary between wells in a chip position dependent manner. Therefore,  the values need to be first normalized. Following normalization, the values need to be corrected for signal decay observed in the later flows. Lastly, phase correction must be performed (discussed in detail in Part 2 of this series). The end result is a corrected signal for each well, for each flow which is contained in the SFF file. These values are rounded to the nearest integer to give the number of bases incorporated for that nucleotide flow, thus allowing bases to be called and the resulting sequence to be constructed. However, humans usually view this data as a graphical representation of the values, which is called an Ionogram. The 454 uses a very similar technique for base calling, which is no surprise as it is a single nucleotide flow based – sequence by synthesis technology. The SFF file format is very similar but I’m not sure if they have a raw value equivalent (1.wells file) as I have never worked with 454 data sets. The biggest innovation is that Ion Torrent have renamed the 454 Flowgram to Ionogram :o)

Figure 1. The Ionogram of a well containing Test Fragment A.These are raw values extracted from the 1.wells file. Note: The flow order does not follow a repeating 4 flow cycle of TACG but is a 32 flow cycle. I will defer the discussion of this until Part 3 of the series. All Ionograms were produced by modifying supplied python code which uses numpy, scipy and matlabplotlib packages to produce the plots.

Figure 2. The Ionogram of the same well containing Test Fragment A after signal correction. These are corrected values extracted from the SFF file.

Below are the bases called for this well. This was the 5th best Test Fragment A sequence read for this 314 Chip run. The first error occurring at the G 4mer towards the end.

TFA   ATCGTGTTTTAGGGTCCCCGGGGTTAAAAGGTTTCGAACTCAACAGCTGTCTGGCAGCTCGCTCTACGATCTGAGACTGCCAAGGCACACAGGGGATAGG
      |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||
Read  ATCGTGTTTTAGGGTCCCCGGGGTTAAAAGGTTTCGAACTCAACAGCTGTCTGGCAGCTCGCTCTACGATCTGAGACTGCCAAGGCACACAGGG-ATAG-

Normalization

For those who are observant, you may have noticed the biggest difference between the raw (Figure 1) and corrected Ionogram (Figure 2) is the y-scale. In other words, you can infer the bases called (by rounding to the nearest integer) from Figure 2 but not Figure 1. In order for normalization to occur, we need to know how a typical signal should look like. Here is where the test fragment (ATCG) and library fragment (TACG) specific tags come in. Therefore, based on the tag sequence the first 7 flows (also called the key flows) are deterministic once the well has been classified. The first 7 nucleotides flowed are TACGTAC so in the case of a test fragment, values from flow 1, 4 and 6 would be typical values for 1 mers, while flows 0, 2, 3, 5 are typical values for 0 mers (i.e. background noise). I have called the first flow 0 and not 1 as this better correlates to the source code I will release at the end of the series. Why isn’t the G 1mer included in the average? I will defer this discussion to part 3 of this series.

The method of signal normalization follows the following steps:

  1. Finding the average 0mer signal from the key flows.
  2. Subtracting the 0mer signal from all flow values
  3. Finding the average 1mer signal from the key flows
  4. Dividing each of the flow values by the 1mer signal – Assumes a linear relationship between 1mer signal and Nmer signal
  5. Use the first 50-60 flows to get a better average for a 1mer signal and perform Step 4 again. (Discussed in more detailed in Part 3)

In the Test Fragment A example we are running with it would like like this:

Step 1

Flow 0, 2, 3, 5 values (i.e. 0mer signal) = 0.11, 0.00, 0.07, 0.00

Average 0mer signal = 0.0450

Step 2 (C/C++ code)

    float zero = (val[0]+val[2]+val[3]+val[5])/4;
    for(i= 0; i< numFlows; i++){
        val[i] -= zero;
        if(val[i] < 0.0)
            val[i] = 0.0;
    }

Step 3

Flow 1, 4, 6 values (i.e. 1mer signal) = 3.88, 3.83, 3.98

Average 1mer signal (after background subtraction) = 3.8588

Step 4 (C/C++ code)

    float one = (val[1]+val[4]+val[6])/3;
    for(i= 0; i< numFlows; i++){
        val[i] /= one;
    }

The Ionogram below is the results of performing Steps 1-4.

Signal droop/decay correction

Over the number of flows, we would expect the signal to decay, which is typical of a lot of signal processing. There are many reasons for this eg. loss of active DNA polymerase, population of strands are unable to incorporate any more nucleotides, loss in sensitivity in the electrical sensors, etc. Typical fitted parameters used by the ion-Analysis is roughly 0.05% decay between flows.

    double droopEst = 1.0 - 0.0005;
    double droop = 1.0;

    for(i= 0; i< numFlows; i++){
        val[i] *= droop;
        droop /= droopEst;
    }

This correction is performed on all values before Step 3 outlined above. This ensures the 1mer estimation corrects for this signal droop. The figure below shows that a typical droop value of 0.05% has very little consequence other than eating up some CPU time. However, we may see it playing a more significant role for longer reads (i.e. 200-400 bp).

For those who are observant, you may have notice a big difference between Figure 2 and the Normalization figures particularly in the later flows. This is because the signal has not been phase corrected and starts becoming an issue after X incorporation events. In the next part of this 3 part series, I will cover phase correction ad discuss the X incorporation events.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct. Any resemblance of my code to that of the Life Technologies GPL code is pure coincidence and mainly due to poor creativity for variable names.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s