In this standalone blog post, I will attempt to detail the predicted quality value (phred scoring) algorithm that the Ion Torrent is currently using. As quality values are one of the battlegrounds on which the Next Generation Sequencing wars (Clone Wars is way cooler!) are currently being fought, it would be good to explain the difficulty in using them as a benchmark.
Illumina has fought this battle on predicted quality values. This is good ground for them to fight on, considering Illumina’s prediction algorithm is mature and quite good at predicting the empirical quality (Figure 1). Illumina has pointed this out in their recent application note. A good prediction algorithm is valuable when there is no reference sequence to compare your target against (aka de novo sequencing). In addition, “the point of predicted accuracy is that many tools use this in their calculations. The more accurate these estimates, the happier those tools are. Of course, you can always go and recalibrate everything, but that’s an extra step one would rather avoid.” (Thanks Keith for the input taken from the comments section).
Ion Torrent has fought on the empirical quality battleground. Their argument is: who cares what the predicted values are, actual values are more important. This is a great point, given economists spend most of their time explaining why things they predicted yesterday didn’t happen today. On the rare occasion when they get it right… their ego expands faster than the rate the Universe expanded slightly after the big bang! 😀 The reason why Ion Torrent has fought this battle on the empirical battleground is mainly due to the current weakness in their quality prediction algorithms (Figure 2).
Figure 1. Illumina phred score prediction is closer to the empirically derived values. This is read 1 from the MiSeq DH10B data set.
Figure 2. The Ion Torrent prediction algorithm underpredicts quality by approximately 10 phred points.
Since Ion Torrent have released the source code, I am able to interpret how per base quality values have been calculated. These quality values are determined after carry forward, incomplete extension and droop correction (aka CAFIE or Phase correction). The quality values are recorded with the corrected signal incorporation in the SFF file.
Please note all equations are MY INTERPRETATION of the source code and since I didn’t write the code, I am probably incorrect sometimes.
Big thanks to Eugene (see comments below) from Life Technologies for correcting and providing an example for Predictors 4 and 5.
There are six metrics that are used to predict the per base quality values:
1. Residue (float) – distance the corrected incorporation value is from the nearest integer.
2. Local noise (float) – maximum residue amongst the previous, current and next corrected incorporation value. Radius of 3 bases.
3. Global noise (float) – calculated from the mean and standard deviation of all zero-mer and 1-mer signals for this well/read.
4. “The homopolymer length, but it is assigned to the last base in the homopolymer (since there is a much higher chance of being off by 1 in the homopolymer length than by 2 or more).” All other bases in the homopolymer are assigned the value 1.
5. “The homopolymer length the last time this nucleotide was incorporated – this basically a penalty for incomplete incorporation.”
6. Local noise (float) – calculated similarly to (2) but with a radius of 10 bases.
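To make the residue and local noise metrics a bit more concrete, here is a minimal Python sketch of how I read them. The function names, the example signal values and the way the window is clipped at the ends of the read are my own assumptions, not taken from the Ion Torrent source. Global noise is left out because its exact formula is not described above.

```python
def residues(signals):
    """Distance of each corrected incorporation value from its nearest integer."""
    return [abs(s - round(s)) for s in signals]


def local_noise(signals, radius):
    """Maximum residue within `radius` positions either side of each position.

    Assumption: the window is simply clipped at the start and end of the read.
    """
    res = residues(signals)
    noise = []
    for i in range(len(res)):
        lo = max(0, i - radius)
        hi = min(len(res), i + radius + 1)
        noise.append(max(res[lo:hi]))
    return noise


# Made-up corrected incorporation values, for illustration only
signals = [1.04, 0.10, 2.35, 0.02, 0.96]
print(residues(signals))
print(local_noise(signals, radius=1))
```

The same `local_noise` function would cover the sixth metric by passing a larger `radius`.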
An example of predictors 4 and 5 is detailed below:
A A A A T A C C C
1 1 1 4 1 1 1 1 3 (Predictor 4)
0 0 0 0 0 4 0 0 0 (Predictor 5)
Note: Predictor 5 depends on flow order, so in the above case it depends on where in the 32 redundant flow cycle these bases were called.
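The example for Predictors 4 and 5 can be reproduced with a short Python sketch. This is a simplification under my own assumptions: it works on the called bases only and ignores flow order (which, as noted, matters in the real code), and it assigns Predictor 5 to the last base of each homopolymer run. The function name `homopolymer_predictors` is hypothetical.

```python
from itertools import groupby


def homopolymer_predictors(bases):
    """Return (predictor4, predictor5) lists for a string of called bases.

    Assumptions: flow order is ignored, and Predictor 5 is placed on the
    last base of each run, mirroring how Predictor 4 is assigned.
    """
    p4, p5 = [], []
    last_len = {}  # nucleotide -> length of its previous homopolymer run
    for nuc, run in groupby(bases):
        n = len(list(run))
        # Predictor 4: run length on the last base, 1 on all the others
        p4.extend([1] * (n - 1) + [n])
        # Predictor 5: length of the previous run of this nucleotide (0 if none)
        p5.extend([0] * (n - 1) + [last_len.get(nuc, 0)])
        last_len[nuc] = n
    return p4, p5


p4, p5 = homopolymer_predictors("AAAATACCC")
print(p4)  # [1, 1, 1, 4, 1, 1, 1, 1, 3]
print(p5)  # [0, 0, 0, 0, 0, 4, 0, 0, 0]
```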
Once these six metrics have been calculated for this flow/base call, they are compared to an empirically derived phred table (Note: each flow produces a base call, many just have a value of zero). There are currently two versions of this phred lookup table. The comparison starts at the top of the table (i.e. phred score 33) and works its way down until the six metrics are below the minimum criteria for that phred score. The maximum phred score is 33, while the minimum is 7 and 5 for phred table versions 1 and 2, respectively. As the Ion Torrent is quite new, it is understandable that the phred scoring algorithm still needs more calibration. Therefore, it is quite unfair to compare Illumina predicted QVs against the Ion Torrent one.
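The top-down table lookup described above can be sketched roughly as follows. Everything numeric in the table is MADE UP for illustration. The real thresholds are empirically derived and live in the Ion Torrent source. I am also assuming that "below the minimum criteria" means every one of the six predictors is at or under the cutoff for that row.

```python
# HYPOTHETICAL table: (phred score, cutoffs for the six predictors).
# The cutoff values are invented for illustration, not Ion Torrent's calibration.
PHRED_TABLE = [
    (33, (0.05, 0.10, 0.5, 1, 1, 0.10)),
    (20, (0.15, 0.25, 1.0, 3, 3, 0.25)),
    (10, (0.30, 0.40, 2.0, 6, 6, 0.40)),
]
MIN_PHRED = 7  # floor for table version 1 (version 2 would use 5)


def predict_phred(predictors, table=PHRED_TABLE, floor=MIN_PHRED):
    """Walk the table from the highest phred score down and return the first
    score whose cutoffs are all satisfied, falling back to the floor value."""
    for score, cutoffs in table:
        if all(p <= c for p, c in zip(predictors, cutoffs)):
            return score
    return floor


# Six predictors in the order listed earlier: residue, local noise (short
# radius), global noise, predictor 4, predictor 5, local noise (long radius)
print(predict_phred((0.02, 0.05, 0.3, 1, 0, 0.08)))  # 33
print(predict_phred((0.10, 0.20, 0.8, 2, 0, 0.20)))  # 20
print(predict_phred((0.50, 0.50, 3.0, 8, 8, 0.50)))  # 7
```

Because the walk stops at the first row that passes, a single noisy predictor is enough to push a base call down to a lower phred score.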
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. This is an independent analysis using Novoalign kept simple so others can reproduce the results.