Are MiSeq miscalls influenced by preceding homopolymers?

A recent post in the Ion Community Torrent Dev section presents an analysis showing MiSeq systematic substitution errors which appear to be caused by preceding homopolymers (HPs). The Omics Omics blog post provides a very good summary and analysis of the Ion Community post. This analysis was performed on three publicly available data sets (1) E. coli   DH10B (2) E. coli MG1655 and (3) Ultra deep sequencing (UDT) of cancer genes from a recent Genome Biology publication. In the discussion thread that followed, an Ion Community member pointed out this finding is not entirely novel, as last year a Japanese group published on Illumina Sequence-specific error (SSE) profiles in Nucleic Acids Research. They surveyed sequencing data on SRA to conclude it was not organism or preparation specific problem but a systematic problem. In contrast, their SSE was not from HPs but from inverted repeats and CCG sequences. They also present on the mechanism that could be causing the SSE.

In this blog post I aim to perform an independent analysis on a data set available through BaseSpace on the Illumina website. The data set is from Bacillus cereus sequencing. I’m not too sure if this is one single run that was multiplex with 12 samples or 12 separate runs. Given the coverage it’s most likely a multiplex single run. Using a different methodology (see end), I wanted to see if this SSE influenced by preceding homopolymers (HP type SSE) also applied to the Bacillus cereus data set. I will only present the data and let the reader conclude for themselves.

Does this HP type SSE exist in Bacillus cereus?

A screenshot from IGV. This is very similar to what I observed for DH10B and MG1655. The miscalls always appear to be strand specific and a lot of the times are towards the end of the read and follows a HP.

How often does it occur in Bacillus cereus?

I analyzed all mapped alignments (i.e. BAM file) to see if this problem is isolated or data set wide. For each miscall, I counted how many times a HP of the miscalled base preceded on the reference. The random expectation distribution was created by simulating the 3 possible mismatch at every position in the genome and counting how many times it is preceded by a HP of the mismatch. This does not take into account coverage bias but come on peeps this is a blog not a publication! 😛

Note: A and T plots not shown as the bias is not as large as the C/G biases.

Are there overlap between Bacillus cereus data sets?

All sites in which HP type SSE were occurring was identified. If this is a systematic bias we should see overlap between data sets from the same genome sequenced. The overlap appears to be approximately 1/3 when comparing two data sets. This highlights the systematic nature of a good proportion of these HP type SSE.

Note: Only mismatches with HP length >= 3 were included.

Where in the read do these occur?

Due to the nature of sequence by synthesis technology, we would expect the majority of these errors to occur much later in the read. This is what we observe in the two data sets. There seems to be a slight increase in T and C HP type SSE <50 bases into the read.

Note: Only mismatches with HP length >= 3 were included. This explains why the C/G bias is not obvious. The upward trend is broken at ~140bp due to read trimming and/or alignment/mapping trimming and not because the errors are reduced.


I slightly edited the samtools calmd/fillmd source file (i.e. bam_md.c) to produce the metrics required to present each of the results. I will make this available if anyone is interested. This allowed me to use the Illumina supplied BAM in an unmodified form and thus removing any further bias caused by using a different alignment mapping. The annoying thing  about these BAMs is they don’t come with the MD tags for each mapped alignment!! Currently the CIGAR format in the SAM/BAM specification does not distinguish between matches and mismatches. Therefore, 150M does not always means 150 matches! The MD tag allows position of mismatches and the correct call to be determined.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

2 responses to “Are MiSeq miscalls influenced by preceding homopolymers?

  1. Regarding the note at the end about CIGAR strings – for some time the CIGAR format in the SAM/BAM specification has included matches (‘=’) and mismatches (‘X’), but most tools continue to use the older operator ‘M’ which means either.

  2. Thanks for the clarification. I agree, the problem is that the more widely used ones still use ‘M’. I think TMAP puts MD tags which the subsequent “alignStats” program requires. Anyways calmd/fillmd is a very good alternative if you don’t want to realign everything again.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s