It seems to me that the authors are unfamiliar with standard bioinformatics practice. This is not an unsolved problem, and the existing solutions are faster, more computationally elegant, and more flexible. Actually generating and searching for every string within edit distance X of the query is an incredibly poor way of solving the alignment problem (guess how long their strategy takes when you increase the edit distance to 5), and it throws out 40+ years of research on sequence alignment. The real solutions solve it in tenths of a second in vastly superior ways.

BLAT [0] is the most obvious and preferred tool for alignment against a reference sequence; a 20 bp search against the human genome should be essentially instantaneous. BLAST [1] is more versatile and a bit slower than BLAT, but would still align sequences of this size against the human genome in about 1-2 seconds; it is the traditional solution to the alignment problem and has no license restrictions. BWA [2] and Bowtie [3] could also be adapted to the task by changing their default settings (they're optimized for performing this task with larger strings, thousands of times per second).

More generally, it would not be difficult to re-implement the algorithms behind any of these tools if the authors really wanted to. This is the second post I've seen recently in which software folks now working in bioinformatics seem completely unaware of both the basic algorithms we use in computational biology and their common implementations, like Smith-Waterman and Burrows-Wheeler. These are complicated problems with 40+ years of research behind them, and the actual solutions are elegant, fast algorithms that solve the problem in a far superior way within reasonable computational time.

[0] http://genome.ucsc.edu/cgi-bin/hgBlat

[1] http://blast.ncbi.nlm.nih.gov/

[2] http://bio-bwa.sourceforge.net/

[3] http://bowtie-bio.sourceforge.net/index.shtml
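For readers who haven't seen it, the dynamic-programming core behind these tools is small. Here is a minimal Smith-Waterman local-alignment sketch in Python; the scoring values are illustrative assumptions, and real aligners add traceback, affine gap penalties, banding, and the indexing heuristics that make BLAT and BLAST fast:

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Return the best local alignment score between strings a and b."""
        # DP matrix with an extra zero row/column: alignments may start anywhere
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                # Local alignment: the score is floored at zero
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("ACGTACGT", "ACGAACGT"))  # 7 matches, 1 mismatch -> 13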
This is exciting stuff. However, the sad takeaway for me is that the broken patent system is already stifling what can be done with this innovation.

Consider this: the patent was awarded to a group that could not have invested more than a few thousand dollars of incremental time and resources (in fact, the majority of the costs were probably in the patent application and prosecution itself), and yet the license is worth billions.

Patents were created - and the US patent charter still states this - to encourage and enhance the economic stature of the nation. Instead we use patents to throttle it. Imagine if this patent were only good for a few years, or capped at a license fee commensurate with the incremental investment needed to produce and validate the research (even if that fee were 3x, 5x, or 10x the costs). Everyone could contribute to the work and the pace of innovation would accelerate. Instead we've got a couple of universities (and, inevitably, follow-on corporate licensees) locking it down for all but the publicly funded and the litigious.

There is so much opportunity out there; there are so many brilliant minds, eager innovators, and great startups. Why do we shoot ourselves in the foot with patent nonsense that hasn't been significantly rethought since its 18th-century English law origins (in the US)?
Three ideas:

First, try the bitap algorithm (a minimal sketch follows this comment).

Second, you can encode the search as a DFA - look up the Aho-Corasick algorithm. Then just run the DFA over the genome, which means you don't need to match every string at every position. If you've read AAAAA and your pattern starts with CCCCC with an edit-distance budget of 4, all five positions already mismatch, so no match can start here and you can skip ahead for a while before you need to start reading carefully again.

Third, you could build a suffix tree (O(n) preprocessing) and then run the standard suffix-tree fuzzy string matching algorithm over it.
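A sketch of that first idea: bitap extended with errors (Wu-Manber style), assuming the pattern fits in a single machine word:

    def bitap_search(text, pattern, k):
        """Yield 0-based end positions where pattern matches text
        within edit distance k."""
        m = len(pattern)
        assert 0 < m <= 63, "this sketch packs the pattern into a single word"
        # B[c]: bit i is set iff pattern[i] == c
        B = {}
        for i, c in enumerate(pattern):
            B[c] = B.get(c, 0) | (1 << i)
        accept = 1 << (m - 1)
        # R[d]: bit i set iff pattern[:i+1] matches a suffix of the text read
        # so far with at most d edits; low bits start set (leading deletions)
        R = [(1 << d) - 1 for d in range(k + 1)]
        for pos, c in enumerate(text):
            cm = B.get(c, 0)
            prev = R[0]
            R[0] = ((R[0] << 1) | 1) & cm
            for d in range(1, k + 1):
                old = R[d]
                # match | insertion | substitution and deletion | restart
                R[d] = ((old << 1) & cm) | prev | ((prev | R[d - 1]) << 1) | 1
                prev = old
            if R[k] & accept:
                yield pos

    print(list(bitap_search("TTAGTTT", "ACGT", 1)))  # [4]: "AGT" is ACGT minus one base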
Why did BLAST not work for you? In the comments here you keep mentioning memory, but we've never had that be the bottleneck (and we use BLAST for things much larger than the human genome).

I'm really concerned that the team behind a bioinformatics tool is talking about searching sequences without even a mention of BLAST. It should have been solution #1!
Nice. A bitset & Rabin-Karp search.

Both are winning strategies, but I suspect you can push it much, much further and scan much faster if you're hitting memory bandwidth (ACGT is a 2-bit space).

Of all the big-data problems I see, nothing feels as close to a personal problem as genetic data.
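To make the 2-bit point concrete, here's a hypothetical packing sketch in Python; at 2 bits per base, a scan touches a quarter of the memory of a one-byte-per-base encoding (real code would pack into fixed-width words or a numpy array rather than one big int):

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASES = "ACGT"

    def pack(seq):
        """Pack an ACGT string into one int, 2 bits per base, first base highest."""
        word = 0
        for base in seq:
            word = (word << 2) | CODE[base]
        return word

    def unpack(word, n):
        """Recover an n-base string from its 2-bit packing."""
        return "".join(BASES[(word >> (2 * (n - 1 - i))) & 3] for i in range(n))

    print(unpack(pack("GATTACA"), 7))  # GATTACA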
Nice post. What tools do you use to find which line of code is creating a bottleneck? For instance, the "if statement has too many false positives" in Solution #4. Do you locate these bottlenecks simply by manual inspection of the code?
I don't suppose anyone has a sample data set I could use to play around with various techniques for this?

I'd randomly generate it, but I don't know what the statistics should be - and that makes a huge difference for branch prediction, etc.
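Real chromosomes can be downloaded freely from UCSC or Ensembl, but as a rough stand-in: the human genome is roughly 41% GC. An i.i.d. sketch (keeping in mind that real genomes have repeats and local composition biases that i.i.d. sampling misses, which matters for branch-prediction benchmarks):

    import random

    def random_dna(n, gc=0.41, seed=0):
        """Generate n i.i.d. bases at a human-like ~41% GC content."""
        rng = random.Random(seed)
        at = (1 - gc) / 2
        return "".join(rng.choices("ACGT", weights=[at, gc / 2, gc / 2, at], k=n))

    print(random_dna(60))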
Have you thought about using the Hamming distance instead of the array?

It should give you the same answer in three CPU instructions on registers instead of two array lookups and arithmetic in RAM: XOR, Hamming weight (popcount), and an equality check.
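One wrinkle if the sequences are packed 2 bits per base (as suggested elsewhere in the thread): a single differing base can flip one or two bits, so a raw popcount of the XOR over-counts. Folding each 2-bit lane onto its low bit first fixes that. A sketch, using the same hypothetical A=0, C=1, G=2, T=3 packing:

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq):
        word = 0
        for base in seq:
            word = (word << 2) | CODE[base]
        return word

    def hamming_2bit(a, b, n):
        """Count mismatching bases between two n-base, 2-bit-packed sequences."""
        x = a ^ b                      # differing bits
        low = int("01" * n, 2)         # low bit of every 2-bit lane
        folded = (x | (x >> 1)) & low  # one bit per mismatching base
        return bin(folded).count("1")  # popcount: a single instruction in C

    print(hamming_2bit(pack("GATTACA"), pack("GATTCCA"), 7))  # 1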
PAM sites? I am working on a genome assembly now and I want to try to identify potential gRNA sites around genic regions. This post is pretty helpful.
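In case it helps, a rough sketch of that kind of scan, assuming SpCas9 (NGG PAM, with the 20-nt protospacer immediately 5' of it). Note it only checks the forward strand, so for real use you would run it on the reverse complement as well:

    import re

    def candidate_sites(seq, guide_len=20):
        """Yield (start, protospacer) for every NGG PAM on the forward strand."""
        # Lookahead so overlapping PAMs are all reported
        for m in re.finditer(r"(?=([ACGT]GG))", seq):
            pam_start = m.start()
            if pam_start >= guide_len:
                yield pam_start - guide_len, seq[pam_start - guide_len:pam_start]

    for start, spacer in candidate_sites("ACGT" * 6 + "TGG"):
        print(start, spacer)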