It has been almost 3 months since my last post, as I recently came out of a cave called PhD thesis writing at the end of March, and once I stepped out, a ton of metaphoric snow (i.e. neglected paperwork) fell on top of me :cry:. I thought I would never get out of “writing hell”, where words like “suggests”, “may”, “underlie”, “perhaps”, “however” and “although” were used ad nauseam. Yes, scientific writing is the art of suggesting everything but committing to nothing :lol:. Thanks to everyone who visited and supported Biolektures during the inactive period.
During my 2012 Q1 hibernation, there have been some very exciting developments in the Ion Torrent crowd sourcing approach. In 2011 Q1, Life Technologies launched four $1 million grand challenges. I consider this a win/win public relations initiative:
- If someone achieves the goal of the challenge their solution is probably worth more than one million dollars.
- If no one can achieve the goals then it is good publicity because who doesn’t like reading about a competition with a one million dollar prize.
In August, I blogged a three-part series that criticized the challenge in terms of fairness, resources and motivation. The root cause of these problems is the “hugeness” of the challenge – even anti-social unemployed geniuses don’t have the time to understand the entire problem and then identify areas that could be improved. The folks at Life Technologies have finally realized this and have created smaller self-contained problems that require little background knowledge. However, solving smaller problems means smaller prize money, as Life Technologies hasn’t become more generous since 2011. The four challenges hosted on TopCoder so far are:
- DAT Lossless compression. DAT files are the raw voltage data that first comes off the Ion Torrent PGM.
- DAT Lossy compression.
- SFF compression. SFF files are the processed DAT files from which base-calling can be performed. These can be visualized as Ionograms.
- TMAP Smith-Waterman alignment optimization. TMAP is the Ion Torrent optimized sequence alignment tool that comes with the Torrent Suite. For example, it maps each of the E. coli sequence reads to its corresponding position on the E. coli genome.
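To give a feel for why the two lossless compression challenges are tractable, here is a toy illustration (my own sketch, not the actual DAT format, and the function names are hypothetical): neighbouring signal samples in a raw trace are similar, so delta encoding followed by a generic compressor shrinks the data far more than compressing the raw samples directly.

```python
# Toy sketch of delta encoding for smooth signal data -- NOT the DAT
# format, just an illustration of the general compression idea.
import struct
import zlib

def delta_encode(samples):
    """Keep the first sample, then store successive differences."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Invert delta_encode by cumulative summation (lossless)."""
    vals = [deltas[0]]
    for d in deltas[1:]:
        vals.append(vals[-1] + d)
    return vals

# A synthetic, slowly drifting "voltage" trace.
trace = [1000 + 5 * i + (i % 3) for i in range(2000)]

# Compress the raw samples versus the delta-encoded samples.
raw = zlib.compress(struct.pack("<2000i", *trace))
enc = zlib.compress(struct.pack("<2000i", *delta_encode(trace)))
```

On this synthetic trace the deltas are nearly constant, so `enc` comes out much smaller than `raw` while the round trip remains exact – the kind of structure a real DAT compressor would presumably exploit far more aggressively.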
The above four challenges are not directly related to improving accuracy; that is, they do not reduce the hugeness of the Accuracy challenge into smaller manageable problems. Instead they aim at reducing storage space and data analysis times, two extremely important improvements when sequencing throughput is further increased by the introduction of the Ion Proton. However, do smaller problems that are faster to solve actually help the crowd sourcing community in terms of fairness, resources and motivation?
Last year my biggest criticism was that it was impossible to compete against Life Technologies R&D employees, who are actually paid to improve the Torrent Suite software. These employees are experienced, have a broad understanding of the whole problem and have months to years of experience working on that specific problem. Having small, focused challenges that are independent of context and background addresses this fairness problem. The trick is to recast an Ion Torrent-specific problem as a more general one, making it appealing to experts in that field while still providing just enough information that an optimized solution can be achieved. I hinted at this in a previous post. Life Technologies has worked with TopCoder to produce such a format. Compression is a general problem in which mathematical theorems can be exploited and tweaked to form a specific optimized solution. Likewise with the Smith-Waterman alignment, which involves optimizing a dynamic programming algorithm.
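For readers unfamiliar with it, the dynamic programme behind the TMAP challenge looks roughly like the textbook Smith-Waterman below – a minimal sketch with assumed match/mismatch/gap scores, not TMAP's own code, which is heavily optimized C:

```python
# Textbook Smith-Waterman local alignment scorer -- a sketch of the
# dynamic programme the TMAP challenge asked competitors to speed up,
# not TMAP's actual implementation.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    # H[i][j] holds the best score of an alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

For example, `smith_waterman("ACGT", "TACGTA")` finds the full `ACGT` embedded in the longer read. The O(len(a) × len(b)) table fill is exactly the hot loop that vectorization and other low-level tricks can accelerate several-fold.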
The Accuracy challenge involves providing a solution that processes REAL Ion Torrent data, and thus the person with the most Ion Torrent data to work with has an unfair advantage. This means people who collaborate with, or work in, labs that heavily use their Ion Torrent PGM are more likely to identify systematic biases and statistical anomalies, and simply have more insight than someone who has access to a humble one or two publicly available data sets. In contrast, the TopCoder-hosted challenges provide data sets derived from real data sets through an unknown process, and thus HE/SHE with the most TOYS does not necessarily WIN. Furthermore, these challenges are completely independent of the Torrent Suite framework. This resolves last year's issue where the Torrent Suite source code was available only through the Ion Community, which required registration. This annoyed open source purists such as blogger Peter Cock, but Life Technologies addressed it later in 2011 by releasing the source code on GitHub.
The Accuracy Grand Challenge required not just a computer and an idea but also Ubuntu, gcc, supporting GNU libraries, etc., and the patience to resolve compile errors in the source code due to subtle environment differences. Alternatively, the rich kids could purchase time on a Torrent Server instance on Amazon EC2. Small stand-alone problems truly require just a computer and an idea, albeit a genius idea :D.
The last barrier to satisfactory participation is motivation. Ask a young adult from this Internet generation to clean your house, and IF they do a good job you might give them $100. They will most likely tell you to go and F yourself, where F stands for “Find”… yeah right :roll:. Ask them to clean one small window and you will give them $20 if they do it in one hour. The latter task (equivalent to the TopCoder challenges) is more appealing than the former (equivalent to the Accuracy Grand Challenge). Each of the smaller challenges can be done in a few days, and if you are an UBER 133t h4x0r, probably in just a few hours. I was surprised by how many submissions there were for the DAT compression challenge within just a few hours. More surprisingly, the top score achieved on the first day was close to the top score at the end, highlighting that the optimal solution can be found by thinking, not by grinding away over days.
There is no moving target, as the challenge runs for two weeks and the top score wins. More importantly, there is no minimum score, such as reducing error by 50%; the person with the top score at the end wins money regardless of score. There are also cash prizes for second, third and other places besides the winner. For example, some solutions, if tweaked correctly, might perform better than the top-scoring solution and could thus be offered prizes as well. Lastly, the publicly archived leaderboard gives people who did not win prizes a sense of achievement, or something they can reference when applying for jobs.
Is this model successful so far?
The paradigm shift from “huge” one-million-dollar challenges to DIRECTED, small, self-contained problems sounds like a winner, but who gives a shit if it does not increase participation or yield optimal solutions :?. The following TopCoder results were kindly provided by Matt Dyer from Life Technologies.
- 60% improvement in SFF compression with 10x speed improvement
- 20% improvement in DAT Lossless compression
- >90% improvement in DAT Lossy compression
- 4-6x speed improvement in TMAP Smith-Waterman algorithm
In total this cost $40,000 and a few weeks – money well spent in my opinion :idea:. Although this does not address the accuracy problem, it does serve as a pilot for what can be achieved if the accuracy problem is reduced to smaller problems. Who knows, maybe the problem of this reduction could be a challenge in itself. All four achievements listed above required advanced mathematics and computing, so what’s there for the mortals amongst us? This is where the Torrent Browser Plugin challenge comes in, and it is the focus of my next blog post.
Disclaimer: For the good of all mankind! This is purely my opinion and interpretations. I have tried my best to keep all analyses correct.