
Optimizing CRISPR: Sub-second searches on a 3 billion base genome

70 points by sajithw almost 10 years ago

15 comments

tstactplsignore almost 10 years ago
It seems to me as though the authors are unfamiliar with bioinformatics standard practice - it's not like this is an unsolved problem, and the existing solutions are faster, more computationally elegant, and more flexible. Actually computing and searching for all strings with an edit distance of X away from the query is an incredibly poor way of solving the alignment problem (guess how long their strategy takes when you increase the edit distance to 5?) that throws out 40+ years of research on sequence alignment. The actual solutions to this problem solve it in tenths of a second in vastly superior ways.

BLAT [0] is the most obvious and preferred solution designed for alignment against a reference sequence - a 20 bp search against the human genome should essentially be instantaneous. BLAST [1] is more versatile and a bit slower than BLAT, but would also align these sequence sizes against the human genome in ~1-2 seconds; it is a traditional solution to the alignment problem and has no license restrictions. BWA [2] and Bowtie [3] default settings could also be modified for the task (they're optimized for performing this task with larger strings, on the order of thousands of times per second).

More generally, it would not be difficult to re-implement any of the algorithms behind these software implementations if the authors really wanted to. It's weird - this is the second post I've seen recently where software folks now working in the bioinformatics space seem completely unaware of both the basic algorithms we use in computational biology and their common implementations, like Smith-Waterman and Burrows–Wheeler. These are complicated problems with 40+ years of research behind them, and the actual solutions are elegant, fast algorithms which solve the problem in a far superior way within reasonable computational time.

[0] http://genome.ucsc.edu/cgi-bin/hgBlat
[1] http://blast.ncbi.nlm.nih.gov/
[2] http://bio-bwa.sourceforge.net/
[3] http://bowtie-bio.sourceforge.net/index.shtml
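The blowup this comment alludes to is easy to put numbers on. Counting substitutions only (a lower bound - true edit distance also allows insertions and deletions), the number of 20-mers within distance d of a query grows combinatorially. An illustrative Python sketch, not taken from the article:

```python
from math import comb

def kmers_within(length, d):
    # Distinct DNA strings within Hamming distance d of a fixed string:
    # choose up to d positions, change each to one of the 3 other bases.
    # Edit distance (with indels) admits even more candidates.
    return sum(comb(length, k) * 3 ** k for k in range(d + 1))

for d in range(6):
    print(d, kmers_within(20, d))
# d=1 gives 61 candidates; d=5 already gives 4,192,468.
```

Enumerating and searching millions of candidate strings per query is exactly the strategy the comment argues against.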
joehilton almost 10 years ago
This is exciting stuff. However, the sad takeaway for me is that the broken patent system is already stifling what can be done with this innovation.

Consider this: the patent was awarded to a group that could not have invested more than a few thousand dollars of incremental time and resources (in fact, probably the majority of the costs were in the patent application and process itself). And yet the license is worth billions.

Patents were created - and the US Patent Charter still states this - to encourage and enhance the economic stature of the nation. Instead we use patents to throttle it. Imagine if this patent were only good for a few years, or up to a license fee commensurate with the incremental investment needed to produce and validate the research (even if this fee were 3x, 5x, 10x, etc. of the costs). Everyone could contribute to the work and the pace of innovation would accelerate. Instead we've got a couple of universities (and, inevitably, follow-on corporate licensors) locking it down for all but the publicly funded and litigious.

There is so much opportunity out there; there are so many brilliant minds, eager innovators, and great startups. Why do we shoot ourselves in the foot with patent nonsense that hasn't been significantly rethought since (in the US) its 18th-century English-law origins?
Ono-Sendai almost 10 years ago
3 billion pairs * 2 bits/pair = 750 MB of data. So you could put it in GPU memory and do some kind of brute-force search on the GPU.
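A rough sketch of the arithmetic and the 2-bit packing the comment assumes (Python for illustration; a real GPU kernel would operate on packed machine words):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    # Pack a DNA string into an integer, 2 bits per base,
    # first base in the most significant position.
    v = 0
    for base in seq:
        v = (v << 2) | CODE[base]
    return v

# 3 billion bases at 2 bits each:
total_bytes = 3_000_000_000 * 2 // 8
print(total_bytes)            # 750_000_000 bytes (750 MB decimal)
print(pack("ACGT") == 0b00011011)
```

At 750 MB the whole packed genome fits comfortably in the memory of most discrete GPUs.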
TheLoneWolfling almost 10 years ago
Three ideas:

First, try the bitap algorithm.

Second, you can encode the search as a DFA - look up the Aho–Corasick algorithm. Then just run the DFA over the genome. It means that you don't need to match every string at every position. If you've read AAAAA and your string starts with CCCCC with an edit distance of 4, you can skip ahead for a while before you need to start reading again.

Third, you could build a suffix tree (O(n) preprocessing), and then use the standard fuzzy string matching algorithm for suffix trees on it.
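For the first suggestion, here is a minimal bitap sketch - the Shift-And variant for exact matching, which generalizes to k mismatches by carrying k+1 state words. Python integers stand in for machine registers; this is illustrative, not the article's code:

```python
def bitap_exact(text, pattern):
    # Shift-And bitap: works for patterns up to the word size.
    # B[c] has bit i set where pattern[i] == c.
    m = len(pattern)
    B = {}
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    R, hits = 0, []
    for pos, ch in enumerate(text):
        # Extend every live partial match by one character.
        R = ((R << 1) | 1) & B.get(ch, 0)
        if R & (1 << (m - 1)):          # full pattern matched
            hits.append(pos - m + 1)     # record start position
    return hits

print(bitap_exact("ACGTACGT", "GTA"))    # [2]
```

Each text character costs one shift, one OR, and one AND regardless of pattern content, which is what makes bitap attractive for short guide-length queries.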
pbnjay almost 10 years ago
Why did BLAST not work for you? In the comments here you keep mentioning memory, but we've never had that be the bottleneck (and we use BLAST for things much larger than the human genome).

I'm really concerned that the team behind a bioinformatics tool is talking about searching sequences without even a mention of BLAST. It should have been solution #1!
dluan almost 10 years ago
This looks awesome! Just watched the brief intro video, how do you guys calculate the off-target effects?
gopalv almost 10 years ago
Nice. A bitset & Rabin-Karp search.

Both winning strategies, but I suspect you can push it much further and scan much faster if you're hitting memory bandwidth (ACGT = 2-bit space).

Of all the big-data problems I see, nothing feels as close to a personal problem as genetic data.
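For readers unfamiliar with the second half of that pairing, a minimal Rabin-Karp sketch over the 4-letter alphabet (Python for clarity; the article's actual implementation may differ) - the rolling hash updates in O(1) per position, and every hash hit is verified to rule out collisions:

```python
def rabin_karp(text, pattern):
    # Rolling polynomial hash, base 4 (one digit per base),
    # modulo a Mersenne prime to keep values bounded.
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    m, n = len(pattern), len(text)
    base, mod = 4, (1 << 61) - 1
    if m > n:
        return []
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + code[pattern[i]]) % mod
        ht = (ht * base + code[text[i]]) % mod
    top = pow(base, m - 1, mod)          # weight of the outgoing digit
    hits = []
    for i in range(n - m + 1):
        if hp == ht and text[i:i + m] == pattern:  # verify on hash hit
            hits.append(i)
        if i + m < n:
            # Slide the window: drop text[i], append text[i + m].
            ht = ((ht - code[text[i]] * top) * base + code[text[i + m]]) % mod
    return hits

print(rabin_karp("ACGTACGT", "CGTA"))    # [1]
```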
nimish almost 10 years ago
36 GB is totally fine to store in memory, but it wouldn't have led to this seriously awesome article on algorithm design.
fradg almost 10 years ago
Nice post. What tools do you use to understand which line of the code is creating a bottleneck? For instance, the "if statement has too many false positives" in Solution #4. Do you locate these bottlenecks simply by manual inspection of the code?
TheLoneWolfling almost 10 years ago
I don't suppose anyone has a sample set of data that I could use to play around with various techniques for this?

I'd randomly generate it, but I don't know what the statistics should be - and that makes a huge difference for branch prediction / etc.
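Real chromosome FASTA files are freely downloadable from UCSC or Ensembl. For a quick synthetic stand-in, i.i.d. sampling at roughly human GC content (about 41%) is a common first approximation - though it misses repeats and local structure, so branch-prediction behavior will differ from real data. A hypothetical helper in Python:

```python
import random

def synthetic_genome(n, gc=0.41, seed=0):
    # Sample bases i.i.d. with a given GC fraction. The human genome
    # is ~41% GC, but real genomes have repeats and regional GC
    # variation that i.i.d. sampling cannot reproduce.
    rng = random.Random(seed)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=n))

sample = synthetic_genome(1000)
print(sample[:60])
```

Treat results on such data as an optimistic baseline; benchmark on a real chromosome before drawing conclusions.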
ticking almost 10 years ago
Have you thought about using the Hamming distance instead of the array? It should give you the same answer in 3 CPU instructions on registers instead of 2 array lookups and arithmetic in RAM: XOR, Hamming weight/bitcount, and an equality check.
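One wrinkle when applying this trick to a 2-bit encoding: a single differing base can flip one or two of its bits, so a raw popcount of the XOR overcounts. Folding each 2-bit lane down to a flag bit first gives the true mismatch count. A sketch with Python integers standing in for registers (not the article's code):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    # 2 bits per base, first base most significant.
    v = 0
    for b in seq:
        v = (v << 2) | CODE[b]
    return v

def hamming_2bit(a, b, length):
    # XOR alone overcounts: e.g. A (00) vs T (11) flips two bits.
    # OR each lane's high bit into its low bit, mask the low bits,
    # then popcount the per-base mismatch flags.
    x = a ^ b
    lanes = (x | (x >> 1)) & int("01" * length, 2)
    return bin(lanes).count("1")

print(hamming_2bit(pack("ACGT"), pack("ACGA"), 4))  # 1
```

On hardware this is still a handful of register-only instructions (XOR, shift, OR, AND, POPCNT).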
daemonk almost 10 years ago
PAM sites? I am working on a genome assembly now and I want to try to identify potential gRNA sites around genic regions. This post is pretty helpful.
keithwhor almost 10 years ago
Nice work. Great to see more bitstrings being used in genomics. Did I give you guys some ideas? ;)
abecedarius almost 10 years ago
I'd have tried agrep on the bit-packed sequence.
imaginenore almost 10 years ago
Perfect problem for a GPU. Even a naive solution should be crazy fast.