科技回声

8 条评论

lambda大约 9 年前

I don't know if there was any strict requirement to use Python, but you mention the difficulty in finding a library to parse regular expressions.One of the many great features of the Rust regex crate (<a href="https://doc.rust-lang.org/regex/regex/index.html" rel="nofollow">https://doc.rust-lang.org/regex/regex/index.html</a>) is that there's actually a regex_syntax crate that provides this parsing support (<a href="https://doc.rust-lang.org/regex/regex_syntax/index.html" rel="nofollow">https://doc.rust-lang.org/regex/regex_syntax/index.html</a>), which can be used for purposes like this, to do alternative kinds of matching engines.Rust's regex library also has pretty good DFA and NFA matching engines, and an Aho-Corasick engine for parts of the regex that can be decomposed into just a set of simple strings to match, as well as some fairly well optimized exact substring search to speed up search for expressions with some substring that must match exactly, so it can be fairly fast for the kinds of matches like your "TAG(A|C|T|G)GG" example.Another, even more radical change, would be to build a finite state transducer of your data set, which you can then do regex matching on directly by taking a union of an automaton representing the regex and the FST representing your data (which is itself an automaton). I learned about FSTs from the author of the Rust regex crate: <a href="http://blog.burntsushi.net/transducers/" rel="nofollow">http://blog.burntsushi.net/transducers/</a>. An FST is actually used internally for certain types of searches in Elasticsearch/Lucene (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/reference/curr...</a>). I don't know if these would be suited to this problem as is by just building an FST from your set of sequences, or maybe using the overlapping subsequences idea to make FSTs from a larger number of shorter strings, but it would be an interesting project to explore.

评论 #11536952 未加载

评论 #11537470 未加载

daemonk大约 9 年前

Bioinformatican here with couple years of bench experience. Nice writeup, but what would be the use case here? I would imagine something like automatically annotating motifs/restriction sites and then searching for them would be more useful than searching for near-exact sequence?For example, finding all plasmids with a certain primer sequence, specific restriction sites, promoter motif...

评论 #11536714 未加载

评论 #11536825 未加载

评论 #11536490 未加载

tmaly大约 9 年前

I was going to say there was a faster way than regex, but I saw you were not actually using regex to parse in the way I first thought.I work in a different domain, and I find custom state machines to the preferred way to do matching over large sets of data. Of course using something like compressed bitmap indices are also helpful.

评论 #11536106 未加载

评论 #11539421 未加载

wrong_variable大约 9 年前

Could HN help me with this.I remember reading this really wonderful article on the levenshtein distance on HN 2-3 months ago.basically I have difficulty understanding why the Wagner-Fisher's algorithm works - the intuition behind it - I remember it was dynamic programming but I really want to read that article again.The format was similar to medium with wonderful diagrams :( I tried searching for it on HN but searching on HN is really difficult and it was my own fault for not saving it on Pocket.I would really appreciate the help since it was one of those "better explained" gems.

评论 #11536713 未加载

twotwotwo大约 9 年前

Part of the way down the post links to Russ Cox's excellent writeup of how Google used this general sort of technique (turn a regexp into a bunch of conditions involving three-symbol sequences, search an index for them, work from there) to do regexp searching scalably for Code Search: <a href="https://swtch.com/~rsc/regexp/regexp4.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp4.html</a>You can peruse his open-source (re)implementation of the approach at <a href="https://github.com/google/codesearch" rel="nofollow">https://github.com/google/codesearch</a> .

rurban大约 9 年前

You are still searching for char (256 values, 8 bit) when you only need 4 values (2 bit)? You know that cache misses dominate the search, so you need to compress the chars to 2 bit, and search in L1 cache size blocks. Esp. the inverted index will be much faster to search if compressed.Furthermore those search terms are so highly specialized that you can easily jit the regex.Why using the super slow python and not a fast language? I thought DNA matching is a big and important enough problem where you shouldn't toy around with dinosaurs.[Edit: 3 -> 2 bit]

评论 #11538046 未加载

评论 #11537977 未加载

评论 #11537815 未加载

评论 #11539757 未加载

alixaxel大约 9 年前

Nice article!BTW, Go has a very complete RE2 parser: <a href="https://golang.org/pkg/regexp/syntax/" rel="nofollow">https://golang.org/pkg/regexp/syntax/</a> - it even handles simplifications for you, I'm actually using it together with a RDPT in namegrep.com.And in JavaScript you also have ret.JS: <a href="https://github.com/fent/ret.js/" rel="nofollow">https://github.com/fent/ret.js/</a>.

noir_lord大约 9 年前

Absolutely fascinating write up, stuff like this is why I still love HN, as a web developer who mostly writes line-of-business stuff this kind of programming is a long way from what I do but reminds me just how big the field is.

Building a Regex Search Engine for DNA

8 条评论

Building a Regex Search Engine for DNA

8 条评论