
Time stamping Steven Avery’s Blood DNA

I recently finished binge-watching all 10 episodes of Making a Murderer on Netflix. One of the concluding remarks from Steven Avery’s defense team was about whether new technologies are available to better detect EDTA levels or perform other DNA tests. There was an obsessive fixation on finding more sensitive methods of detecting EDTA in the blood. In my opinion, even if there are new technologies in 2015/2016 that detect EDTA levels more accurately, they would not introduce new evidence.

What if there was a way of time stamping DNA using technology and methods that were not available in 2005? In other words, a method that can distinguish between the DNA in the blood vial from the 1996 version of Steven Avery (aged 34) and the DNA in the blood found in the RAV4 from the 2005 version of Steven Avery (aged 43). This would hopefully strengthen or weaken the claim that blood was taken from the 1996 vial and placed in the Toyota RAV4.

The purpose of this post is not to be an insufferable fan-boy amateur-detective troll but to share with the world advances in DNA technology and analysis that were not available in 2005 and that have the potential to be applied in this case.

A primer on DNA mutations

Disclaimer: This is a very simplistic explanation and any deviation is to ensure that the general public can understand this rather complicated concept.

Each cell in your body contains DNA, which carries all the instructions to make a human. DNA is composed of 4 letters, A, C, G and T, and each cell has 3 billion of these letters (or sites) in total that make up your genome. On average, 1 in 1,000 of these sites differ (i.e. there is a different letter) between two individuals. These sites where the letters differ between two individuals (i.e. mutations) can be used to identify people. Using enough of these sites will accurately distinguish two individuals and also show whether two samples are from the same individual. This concept was used to identify the DNA from the pubic hair as Gregory Allen’s and was used to exonerate Steven Avery in 2003 for a crime committed in 1985.
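
To make the identity idea concrete, here is a minimal, purely illustrative Python sketch: it compares the letters two samples carry at a shared panel of variable sites and reports how often they agree. The site names, genotypes and the 99% threshold are invented for illustration; real forensic DNA profiling uses validated markers and proper statistics.

```python
# Toy illustration of identity testing with a panel of variable sites.
# Sites, genotypes and the 99% threshold are made up for illustration only.

def fraction_matching(profile_a, profile_b):
    """Fraction of shared sites at which the two profiles carry the same letter."""
    shared = set(profile_a) & set(profile_b)
    if not shared:
        return 0.0
    same = sum(1 for site in shared if profile_a[site] == profile_b[site])
    return same / len(shared)

# Hypothetical letters observed at each site for two samples.
sample_1 = {"site_001": "A", "site_002": "G", "site_003": "T", "site_004": "C"}
sample_2 = {"site_001": "A", "site_002": "G", "site_003": "T", "site_004": "C"}

match = fraction_matching(sample_1, sample_2)
print(f"{match:.0%} of sites match")
print("Likely same individual" if match > 0.99 else "Likely different individuals")
```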

Almost all of these mutations were inherited from your mother and father, in fact 50% from each. However, over the lifetime of each cell in your body there is a very, very small chance that a new mutation will occur in its DNA. These are called somatic mutations. The majority of them are harmless as they have no biological consequences, while others occur in non-dividing cells or cells that rarely divide and thus have no chance to spread. A subset of these mutations that have biological consequences and occur in dividing cells may result in cancer.

DNA technology

Next Generation DNA Sequencing is a technology that was not available in 2005; it became available and matured from 2007 onwards. This technology allows scientists to read all 3 billion letters (the human genome) contained in each of your cells. Prior to 2007, this was not possible in an economical and timely manner. It has revolutionized the field of human genetics, making large population-scale sequencing efforts such as the 1000 Genomes Project and the Exome Aggregation Consortium possible.

Scientific method

This blog post will concentrate on how the accumulation of somatic mutations in DNA can be used to determine the relative age of an individual. This is one of several ways that next generation DNA sequencing can be used to approximate the relative age of an individual at the time of DNA extraction. It expands upon a concept introduced in two articles published in the New England Journal of Medicine (NEJM) by researchers from Harvard: the first by Genovese et al. and the other by Jaiswal et al., both in the same December 2014 issue of NEJM. The New England Journal of Medicine is one of the most respected journals in medical research.

Time stamping DNA

The concept introduced by Genovese et al. and Jaiswal et al. is that over time very, very rare somatic mutations start accumulating, and the older the person the more of these mutations they harbor (Figure 1 from Jaiswal and Figure 2D from Genovese). The number of somatic mutations is very low in younger individuals, but it’s important to note that the search was limited to genes (~1-2% of the genome) or an even smaller subset of genes known to cause blood cancers. Expanding this search to the whole genome will likely yield more somatic mutations (the majority with no biological consequences) and will be more applicable to younger individuals, such as Steven at his earlier time points.

The DNA extracted from blood comes from cells created by hematopoietic stem cells. The somatic mutations that can be detected are from stem cells that have divided multiple times to expand in number. This expansion is necessary for the mutations to contribute a high enough fraction of the DNA to be detected by sequencing (Figure 1 from Genovese).

The results from both studies help to explain the genetic reason why the risk of getting cancer increases as you age: more time has passed for you to accumulate harmful somatic mutations that go on to cause cancer. A simple analogy: suppose I am throwing darts at a dart board. I’m pretty bad at darts, so the chance of me hitting a bulls-eye is very, very slim. However, if I stand there for a few hours throwing darts, I will eventually hit the bulls-eye. There is also a chance I will never hit the bulls-eye, but hey, I’m not that uncoordinated! Thus, as you age (hours throwing darts) you are more likely to accumulate somatic mutations that can cause cancer (hitting the bulls-eye).
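
For the numerically inclined, a tiny back-of-the-envelope sketch of the dart-board idea: if each cell division carries some tiny chance p of producing a harmful somatic mutation, then the probability of at least one such mutation after n divisions is 1 - (1 - p)^n, which creeps up with age. The value of p below is made up purely for illustration.

```python
# Back-of-the-envelope version of the dart-board analogy.
# The per-division probability p is hypothetical, chosen only to show the trend.

p = 1e-4  # hypothetical chance of a harmful somatic mutation per cell division

for divisions in (1_000, 10_000, 50_000):
    at_least_one = 1 - (1 - p) ** divisions
    print(f"{divisions:>6} divisions -> P(at least one harmful mutation) = {at_least_one:.2f}")
```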

Now let’s apply this concept to blood DNA from various time points in Steven Avery’s life: 1985, 1996, 2005 and the present (Figure Below).

[Figure: SA_DNA_Timeline – timeline of Steven Avery’s blood DNA samples]

Note: Mutations X, Y and Z are used to represent sets of somatic mutations from the respective time periods 1985-1996, 1996-2005 and 2005-present.

As Steven ages he accumulates somatic mutations relative to his 1985 DNA sample. Blood DNA from his present self should have all the somatic mutations identified in 1996 and 2005 (i.e. Mutations X and Y) in addition to any new mutations that occurred between 2005 and the present (i.e. Mutation Z). Somatic mutations (i.e. Mutation Y) that occurred between 1996 and 2005 are UNIQUE to the 2005 version and can be used to distinguish between the evidence blood vial taken in 1996 and the blood found in the Toyota RAV4 in 2005.
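
A minimal sketch of this set-difference logic, assuming somatic variant calls for each sample are available as simple sets of identifiers (the names below are placeholders, not real calls):

```python
# Sketch of the time-stamping logic described above. Variant identifiers are
# placeholders; a real analysis would use validated somatic variant calls.

blood_1996  = {"X1", "X2"}               # somatic mutations accumulated 1985-1996
blood_2005  = blood_1996 | {"Y1", "Y2"}  # plus Mutation Y events from 1996-2005
blood_today = blood_2005 | {"Z1"}        # plus Mutation Z events from 2005-present

questioned = {"X1", "X2", "Y1"}          # hypothetical calls from the RAV4 bloodstain

# Mutations that only a post-1996 sample should carry (Mutation Y).
markers_after_1996 = blood_2005 - blood_1996

if questioned & markers_after_1996:
    print("Stain carries post-1996 somatic mutations: consistent with 2005 blood")
else:
    print("No post-1996 somatic mutations detected: consistent with the 1996 vial")
```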

Assumptions made:

  • There are blood samples or DNA extracted from blood for each of the critical time points. I’m assuming during the investigations this was the case.
  • There have been some somatic mutation events (i.e. Mutation Y) in Steven’s blood between 1996 and 2005 that can be detected reliably.
  • The quality and quantity of DNA available is sufficient to perform this analysis.
  • This method and technology is robust to some degree of DNA degradation.
  • Any modification to DNA post-blood draw is distinguishable from somatic mutations (i.e. mutations that occurred within the body).

Conclusion

This method, if applicable, would require the development of a test that is robust and reproducible, with the appropriate negative and positive controls. The detection limits also need to be clearly defined. Fortunately, most of this development has already been done for clinical cancer tests. Further development of these tests for this specific application does take time, but I have a lot more faith that the time taken by my peers would be far less than that of the circus of incompetence and stupidity that is the US legal system when revisiting cases.

Disclaimer: These are my own thoughts and opinions and are completely independent of the institutions and universities that employ me. Any mistakes in science, logical reasoning, grammar/spelling and lack of eloquence, I blame on my lack of sleep from jet lag and exhaustion from binge-watching 10 episodes in 24 hours! And Santa, cuz he never gave me awesome writing skillz this year for Christmas!


Crowd Sourcing 2.0 – Smaller and Faster = Better

It has been almost 3 months since my last post. I recently came out of a cave called PhD thesis writing at the end of March, and once I stepped out, a ton of metaphoric snow (i.e. neglected paperwork) fell on top of me :cry:. I thought I would never get out of “writing hell”, where words like “suggests”, “may”, “underlie”, “perhaps”, “however” and “although” were used ad nauseam. Yes, scientific writing is the art of suggesting everything but committing to nothing :lol:. Thanks to everyone who visited and supported Biolektures during the non-active periods.

During my 2012 Q1 hibernation, there have been some very exciting developments in terms of changes to the Ion Torrent crowd sourcing approach. In 2011 Q1, Life Technologies launched four $1 million grand challenges. I consider this a win/win public relations initiative:

  1. If someone achieves the goal of the challenge their solution is probably worth more than one million dollars.
  2. If no one can achieve the goals then it is good publicity because who doesn’t like reading about a competition with a one million dollar prize.

In August, I blogged a three-part series that criticized the challenge in terms of fairness, resources and motivation. The root cause of these problems is the “hugeness” of the challenge – even anti-social unemployed geniuses don’t have the time to understand the entire problem and then identify areas that could be improved. The folks at Life Technologies have finally realized this and have created smaller, self-contained problems that require little background knowledge. However, solving smaller problems means smaller prize money, as Life Technologies didn’t get more generous since 2011. The four challenges hosted on TopCoder that have been run so far are:

  1. DAT Lossless compression. DAT files are the raw voltage data that first come off the Ion Torrent PGM (a toy sketch of this kind of compression approach follows this list).
  2. DAT Lossy compression.
  3. SFF compression. SFF files are the processed DAT files from which base-calling can be performed. These can be visualized as Ionograms.
  4. TMAP Smith-Waterman alignment optimization. TMAP is the Ion Torrent-optimized sequence alignment tool that comes with the Torrent Suite. For example, it maps each of the E. coli sequence reads to its corresponding position on the E. coli genome.
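
To give a feel for what the compression challenges involve, here is a toy sketch of one classic approach to lossless compression of slowly varying signal data: delta-encode neighbouring values, then hand the small differences to a general-purpose compressor. This is not any contestant’s solution, and the signal below is simulated rather than real DAT data.

```python
# Toy lossless compression sketch: delta encoding + zlib on simulated signal data.
import zlib
import numpy as np

# Simulated 16-bit raw signal: a slowly drifting baseline plus small steps.
rng = np.random.default_rng(0)
signal = (1000 + np.cumsum(rng.integers(-3, 4, size=100_000))).astype(np.int16)

# Neighbouring values are similar, so the differences are tiny and compress well.
deltas = np.diff(signal, prepend=0).astype(np.int16)

plain = len(zlib.compress(signal.tobytes(), 9))
delta = len(zlib.compress(deltas.tobytes(), 9))
print(f"zlib alone:   {plain} bytes")
print(f"delta + zlib: {delta} bytes")

# Lossless round trip: undo zlib, then undo the delta step with a cumulative sum.
restored = np.cumsum(
    np.frombuffer(zlib.decompress(zlib.compress(deltas.tobytes(), 9)), dtype=np.int16)
).astype(np.int16)
assert np.array_equal(restored, signal)
```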

The above four challenges are not directly aimed at improving Accuracy; that is, they do not reduce the hugeness of the Accuracy challenge into smaller manageable problems. Instead they are aimed at reducing storage space and data analysis times, two extremely important improvements when sequencing throughput is further increased by the introduction of the Ion Proton. However, do smaller problems that are faster to solve actually help the crowd sourcing community in terms of fairness, resources and motivation?

Fairness

Last year my biggest criticism was that it was impossible to compete against Life Technologies R&D employees, who are actually employed to improve the Torrent Suite software. These employees are experienced, have a broad understanding of the whole problem and have months to years of experience working on that specific problem. Having small focused challenges that are independent of context and background addresses this fairness problem. The trick is to recast an Ion Torrent-specific problem as a more general problem, making it appealing to experts in that field while still providing just enough information that an optimized solution can be achieved. I hinted at this in a previous post. Life Technologies has worked with TopCoder to produce such a format. The challenge of compression is a general problem, in which mathematical theorems can be exploited and tweaked to form a specific optimized solution. Likewise with the Smith-Waterman alignment, which involves optimizing a dynamic programming algorithm.
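
For readers unfamiliar with Smith-Waterman, here is a plain textbook version of the dynamic programming recurrence that the TMAP challenge asks contestants to speed up. It is written for clarity, not speed, and is in no way the optimized TopCoder submission.

```python
# Minimal Smith-Waterman local alignment score (textbook dynamic programming).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                        # start a fresh local alignment
                          H[i - 1][j - 1] + score,  # diagonal: match or mismatch
                          H[i - 1][j] + gap,        # gap in sequence b
                          H[i][j - 1] + gap)        # gap in sequence a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # best local alignment score
```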

Resources

The Accuracy challenge involves providing a solution that processes REAL Ion Torrent data, and thus the person with the most Ion Torrent data to work with has an unfair advantage. This means people who collaborate with or work in labs that heavily use their Ion Torrent PGM are more likely to identify systematic biases and statistical anomalies, and simply have more insight, than someone who has access to a humble one or two publicly available data sets. In contrast, the TopCoder-hosted challenges provide data sets derived from real data sets through an unknown process, and thus HE/SHE with the most TOYS does not necessarily WIN. Furthermore, these challenges are completely independent of the Torrent Suite framework. This resolves the issue from last year where the Torrent Suite source code was available only through the Ion Community, which required registration. This annoyed open source purists such as blogger Peter Cock, but Life Technologies addressed it later in 2011 by releasing the source code on GitHub.

The Accuracy Grand Challenge required not just a computer and an idea but also Ubuntu, gcc, supporting GNU libraries, etc., and the patience to resolve compile errors in the source code due to subtle environment differences. Alternatively, for the rich kids, purchasing time on a Torrent Server instance on Amazon EC2. Having small stand-alone problems truly requires just a computer and an idea, albeit a genius idea :D.

Motivation

The last barrier to satisfactory participation is that of motivation. Ask a young adult from this Internet generation to clean your house and IF they do a good job you might give them $100. They will most likely tell you to go and F yourself, where F stands for “Find”… yeah right :roll:. Ask them to clean one small window and offer them $20 if they do it in one hour. The latter task (equivalent to the TopCoder challenges) is more appealing than the former (equivalent to the Accuracy Grand Challenge). Each of the smaller challenges can be done in a few days, and if you are an UBER 133t h4x0r probably in just a few hours. I was surprised by how many submissions there were for the DAT compression challenge within just a few hours. More surprising, the top score achieved in the first day was close to the top score at the end, highlighting that the optimal solution can be found by thinking rather than grinding over days.

There is no moving target, as the challenge runs for two weeks and the top score wins. More importantly, there is no minimum score, such as reducing error by 50%; the person with the top score at the end wins money regardless of the score. There are also other cash prizes for second, third and beyond. For example, there may be some solutions that, if tweaked correctly, would perform better than the top-scoring solution and thus could be offered prizes too. Lastly, the publicly archived leader board gives people who did not win prizes a sense of achievement, or something challengers can reference when applying for jobs.

Is this model successful so far?

The paradigm shift from “hugeness” one-million-dollar challenges to DIRECTED small self-contained problems sounds like a winner, but who gives a shit if it does not increase participation or produce optimal solutions :?. The following TopCoder results were kindly provided by Matt Dyer from Life Technologies.

  1. 60% improvement in SFF compression with 10x speed improvement
  2. 20% improvement in DAT Lossless compression
  3. >90% improvement in DAT Lossy compression
  4. 4-6x speed improvement in TMAP Smith-Waterman algorithm

In total this cost $40,000 and a few weeks; money well spent in my opinion :idea:. Although this does not address the accuracy problem, it does serve as a pilot for what can be achieved if the accuracy problem is reduced to smaller problems. Who knows, maybe the problem of this reduction could be a challenge in itself. All four achievements listed above require advanced mathematics and computing, so what’s there for the mortals amongst us? This is where the Torrent Browser Plugin challenge comes in, and it is the focus of my next blog post.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.  I have tried my best to keep all analyses correct.

Are MiSeq miscalls influenced by preceding homopolymers?

A recent post in the Ion Community Torrent Dev section presents an analysis showing systematic MiSeq substitution errors which appear to be caused by preceding homopolymers (HPs). The Omics Omics blog post provides a very good summary and analysis of the Ion Community post. This analysis was performed on three publicly available data sets: (1) E. coli DH10B, (2) E. coli MG1655 and (3) ultra-deep sequencing (UDT) of cancer genes from a recent Genome Biology publication. In the discussion thread that followed, an Ion Community member pointed out that this finding is not entirely novel, as last year a Japanese group published on Illumina sequence-specific error (SSE) profiles in Nucleic Acids Research. They surveyed sequencing data on SRA to conclude it was not an organism- or preparation-specific problem but a systematic one. In contrast, their SSE was not from HPs but from inverted repeats and CCG sequences. They also presented a mechanism that could be causing the SSE.

In this blog post I aim to perform an independent analysis on a data set available through BaseSpace on the Illumina website. The data set is from Bacillus cereus sequencing. I’m not too sure if this is one single run multiplexed with 12 samples or 12 separate runs; given the coverage it’s most likely a single multiplexed run. Using a different methodology (see end), I wanted to see if this SSE influenced by preceding homopolymers (HP-type SSE) also applies to the Bacillus cereus data set. I will only present the data and let the reader draw their own conclusions.

Does this HP type SSE exist in Bacillus cereus?

A screenshot from IGV. This is very similar to what I observed for DH10B and MG1655. The miscalls always appear to be strand specific, are often towards the end of the read and follow an HP.

How often does it occur in Bacillus cereus?

I analyzed all mapped alignments (i.e. the BAM file) to see if this problem is isolated or data set wide. For each miscall, I counted how many times an HP of the miscalled base preceded it on the reference. The random expectation distribution was created by simulating the 3 possible mismatches at every position in the genome and counting how many times each is preceded by an HP of the mismatched base. This does not take into account coverage bias, but come on peeps, this is a blog not a publication! 😛
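
For anyone wanting to reproduce the idea, here is a rough sketch of the counting using pysam instead of the edited samtools C code I actually used (see Methodology). The file names are placeholders, and strand handling plus quality filtering are deliberately left out, so treat it as an outline of the approach only.

```python
# Sketch: for each mismatch, measure the length of the reference homopolymer of
# the miscalled base immediately preceding it. File names are hypothetical.
from collections import Counter
import pysam

bam   = pysam.AlignmentFile("bcereus.bam", "rb")   # placeholder input BAM (indexed)
fasta = pysam.FastaFile("bcereus_ref.fa")          # matching reference FASTA

hp_lengths = Counter()

for read in bam.fetch():
    if read.is_unmapped or read.query_sequence is None:
        continue
    ref = fasta.fetch(read.reference_name, read.reference_start, read.reference_end).upper()
    offset = read.reference_start
    for qpos, rpos in read.get_aligned_pairs(matches_only=True):
        read_base = read.query_sequence[qpos].upper()
        ref_base = ref[rpos - offset]
        if read_base == ref_base or read_base == "N":
            continue                               # not a miscall
        # Length of the reference homopolymer of the miscalled base just before it.
        hp, i = 0, rpos - offset - 1
        while i >= 0 and ref[i] == read_base:
            hp += 1
            i -= 1
        hp_lengths[hp] += 1

print(hp_lengths.most_common(10))
```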

Note: A and T plots not shown as the bias is not as large as the C/G biases.

Is there overlap between Bacillus cereus data sets?

All sites at which HP-type SSEs were occurring were identified. If this is a systematic bias we should see overlap between data sets from the same sequenced genome. The overlap appears to be approximately 1/3 when comparing two data sets. This highlights the systematic nature of a good proportion of these HP-type SSEs.

Note: Only mismatches with HP length >= 3 were included.

Where in the read do these occur?

Due to the nature of sequencing-by-synthesis technology, we would expect the majority of these errors to occur much later in the read. This is what we observe in the two data sets. There seems to be a slight increase in T and C HP-type SSEs <50 bases into the read.

Note: Only mismatches with HP length >= 3 were included, which explains why the C/G bias is not obvious. The upward trend is broken at ~140 bp due to read trimming and/or alignment/mapping trimming, not because the errors are reduced.

Methodology

I slightly edited the samtools calmd/fillmd source file (i.e. bam_md.c) to produce the metrics required to present each of the results. I will make this available if anyone is interested. This allowed me to use the Illumina-supplied BAM in an unmodified form, thus removing any further bias that could be introduced by using a different aligner/mapping. The annoying thing about these BAMs is that they don’t come with the MD tag for each mapped alignment!! Currently the CIGAR format in the SAM/BAM specification does not distinguish between matches and mismatches; therefore, 150M does not always mean 150 matches! The MD tag allows the positions of mismatches and the correct reference call to be determined.
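
For the curious, here is a small stand-alone MD-tag parser (not my bam_md.c edit) showing why the tag matters: unlike a bare 150M CIGAR, the MD field pinpoints where the mismatches sit and what the reference base should have been. The example tag is made up.

```python
# Minimal MD-tag parser: yields mismatch offsets (within the aligned reference
# span) and the reference base at each mismatch. The example tag is invented.
import re

MD_TOKEN = re.compile(r"(\d+)|(\^[A-Z]+)|([A-Z])")

def md_mismatches(md):
    """Yield (reference_offset_within_alignment, reference_base) per mismatch."""
    pos = 0
    for match_len, deletion, ref_base in MD_TOKEN.findall(md):
        if match_len:                  # run of matching bases
            pos += int(match_len)
        elif deletion:                 # reference bases deleted from the read
            pos += len(deletion) - 1   # skip the leading '^'
        else:                          # a single mismatched reference base
            yield pos, ref_base
            pos += 1

print(list(md_mismatches("10A5^AC6G20")))   # [(10, 'A'), (24, 'G')]
```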

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

Signal Processing – Room for improvements?

During the Ion Torrent User Group Meeting at ASHG, Rothberg talked about how he would approach the Accuracy challenge. He said he would seek out people who did well at the mathematics olympiad, get them to work on the problem and pay them $5,000. What a cheapskate 😀 He also said that there should be more focus on the actual raw signal processing. Thinking along the same lines, Yanick asked in the comments of a previous post what contribution raw signal processing made to the accuracy improvements in Torrent Suite 1.5.

With the current data processing pipeline, there is a way to separate the contribution that raw signal processing makes from signal normalization, dephasing and base calling (all three abbreviated to base calling for the rest of this blog). This can be done using the 1.wells files generated, as these mark the end of raw signal processing. Therefore, we can use one version of Torrent Suite to do the raw signal processing and a different version to do the base calling using the --from-wells command line option (Figure 1).

Figure 1. Using different versions of Torrent Suite to do signal processing and base calling. Results are grouped along the x-axis by which version did the base calling and colored by which 1.wells file version was used as the input (i.e. which version did the signal processing).

The input data is from a control library of DH10B sequenced on a 314 chip in our lab (featured in a few previous blog posts, including the one on rapid software improvements). This input data was used for each result featured in Figure 1 to make the analysis comparable. In this bar plot the measure of performance is the number of 100AQ23 reads produced – the current measure of accuracy in the Accuracy challenge. The general conclusion is that signal processing changes between versions make little contribution to accuracy performance. This is evident as there is little difference in heights between bars within each of the base calling groups along the x-axis. In contrast, all 1.wells files that were base called with v1.5 showed marked improvements regardless of which version of Torrent Suite was used for signal processing (i.e. to produce the 1.wells file). Interestingly, the signal processing from v1.4 appears to perform better than v1.5. This can be largely explained by an increase in the number of beads categorized as “live” in v1.4 compared to v1.5.

There are two possibilities to explain the small contributions made by signal processing:

  1. This year most effort was dedicated to dephasing and normalization, which are the source of the major improvements in Torrent Suite 1.5. Improvements in signal processing will be the next focus. OR
  2. The current signal processing model has reached its limit and a new model needs to be developed in order to see further improvements.

In this post and the last post on software improvements, only 314 data was featured. To give a more comprehensive representation, a 316 (STO-409) and a 318 (C22-169) run were used to observe accuracy improvements (Figure 2). Thanks to Matt for supplying the 1.wells files for these publicly released runs.

Figure 2. The improvements in base calling between v1.4 and v1.5 using 1.wells file from v1.5 as input. I couldn’t get v1.3 to run on either the 316 or 318 data 😕

What made the analysis featured in this blog post a little challenging was that the 1.wells format changed to make use of the HDF5 standard in Torrent Suite 1.5. This has allowed the files to be better organized, parsed and compressed by approximately 2-fold. Torrent Suite 1.5 is able to read 1.wells files generated by previous versions (i.e. it is backward compatible) but unfortunately the reverse does not apply 😡 I had to write a small program to convert the 1.wells files (from 1.5) back to the legacy format. Kudos to Simon for the tips 🙂 What was a little concerning is that the current implementation loads the whole 1.wells file into memory, which consumes 25-35% and 50-70% of total memory for the 316 and 318 chips, respectively.
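
As an aside, here is a hedged sketch of how an HDF5-based 1.wells file could be processed in slices rather than loaded whole into memory. The dataset name "wells" and its layout are assumptions for illustration only; inspect the actual file (e.g. with h5py's visit()) before relying on any of this.

```python
# Sketch of chunked processing of an HDF5 file with h5py.
# The dataset name "wells" is an assumption; check the real file layout first.
import h5py
import numpy as np

with h5py.File("1.wells", "r") as f:
    f.visit(print)                      # list what the file actually contains
    wells = f["wells"]                  # hypothetical dataset of per-well signals
    chunk = 100_000
    running_sum, count = 0.0, 0
    for start in range(0, wells.shape[0], chunk):
        block = wells[start:start + chunk]   # only this slice is read into RAM
        running_sum += float(np.sum(block))
        count += block.size
    print("mean signal:", running_sum / count)
```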

<insert some awesome conclusion here> 😀

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations.

Rapid improvements may lead to comparison pitfalls

In a previous blog post, I showed the rapid improvements Ion Torrent has made over a short 3 month period. I assume that both the GS Junior and the Illumina MiSeq have experienced improvements over this time as well. These desktop sequencers require less financial commitment and usually require only one grant to fund. In contrast, the big toys (e.g. HiSeq, SOLiD 5500) require a massive financial commitment to purchase and operate. The profit margin may be much higher for the big toys BUT not much volume will be sold. In Sydney, one hand is enough to count how many big toys are out there. By selling small toys the profit margins can be reduced, as larger volumes are expected to sell. This creates a much larger user community instead of 3 people in each city who have observed a “sequencer in the wild”. The best analogy would be the personal computer (PC) revolution largely replacing the need for mainframes. I believe that the affordable costs of all three technologies will push the biotech industry to be more innovative and cost effective. No matter what sequencer you buy, all customers will enjoy the benefits of this healthy competition.

The strength of computational analysis (including bioinformatics) is the relative ease of reproducibility given a data set and method. This is made even easier if the method involves using software with all options/settings specified. No other science offers the same warm fuzzy feeling of reproducibility. The greatest thing that has come out of the “Sequencing Wars” is the public release of raw data sets to accompany the application notes. This is the first time EVER I have seen a biotech company say “You don’t believe us? Here’s the raw data, analyze it yourself!”. This is an extremely bold move that was taken first by Ion Torrent and then shortly followed by Illumina with the MiSeq. For once, this empowers the customers to make their own conclusions! Many kudos to Ion Torrent and Illumina for making their raw data available. I am yet to find any public raw data for the GS Junior. In fact, Roche appears to be behind Life Technologies and Illumina when it comes to embracing the information age. Come on guys, pick up your act and get rid of them dinosaurs in upper management!

There is a long list of technicalities and nuances in comparing the competing desktop sequencers. This problem is not unique to comparing different technologies. The unique issue comes from the pitfalls resulting from the rapid improvements. This is not like comparing two different car models or laptop models, where performance is rather static before and after purchase. In contrast, sequencing technology can further improve after purchase through the consumables themselves and through the software versions used for data analysis. The point of my previous post was to highlight this fact. Therefore, it is extremely important to note when a data set was analyzed, as more rapidly evolving technologies may be extremely disadvantaged. Showing the performance of a sequencing technology from 2011 Q2 (i.e. 3 months ago) is not an accurate representation of the technology’s current performance!

I will illustrate my point using the rapid evolution of the most talented music artist of all time, Britney Spears. Britney Spears is a singer, dancer, actor, perfume model, fallen angel and mother all in one.

Britney the Singer

Britney the Dancer

Britney the Actor

Britney the Perfume Model

Britney the Fallen Angel

Britney the Mother

Similarly, depending on when you look at Britney Spears’ career you would be underestimating her amazing talents.

Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. People should just give Britney a break, she’s only human. 

You are the 5,000th viewer – you won a thought of winning something really good

It’s been a little over one month and Biolektures has received over 5,000 views. Thank you everyone for visiting and commenting. Now I will rant on and on.

First and foremost, thank you to all my readers directed from SEQAnswers, Twitter, the Ion Community, other blog posts and a mysterious company called Ion Flux. Without my readers and their need to click on all the blog posts, there would probably be fewer than 100 views. Those 100 views would be a combination of me and spam trying to advertise the secret of how I can get “ripped in 4 weeks.” Well, I got ripped in one night by eating too much and now my pants need to be fixed! 😆

I thank the support of established bloggers who have done an amazing job tweeting and mentioning my blog. This includes Nick, Keith, Justin, Lex and Daniel. Thank you for your support and encouragement.

I thank the Ion Community and Life Technologies for making the bioinformatics aspect as OPEN as their legal team will allow. This has made people like me, with no life, feel like part of the LIFE action. Thank you to those in the Ion Community who have answered my many technical questions and random outbursts claiming shenanigans. This includes Mike, Simon, Mel, Matt, Jason and the dude with the cat picture as an avatar (sorry, I don’t know your first name). I would especially like to thank Mike for his encouragement and for being a regular commenter. I will keep you updated on my plans to convert the Torrent Server into a Street Fighter 4 arcade machine as a way of generating revenue to pay for our experiments 😛

I thank those who have written comments. At least I know you people have been reading the post and not just looking at the pictures 😛 This includes Mike, Peter, Andrew and Nick. Keep the comments coming but no big words please 😀

Lastly, I thank my wife Angela for putting up with my blogging. I haven’t spent quality time with her, such as watching Gilmore Girls, One Tree Hill and Sex and the City. I will be more available this month for the quality viewing these shows have to offer 😦

I will leave you with a link to a short love story of a guy trying to impress a girl with his uber l33t h4x0r skillz, featuring Enrique’s Hero.