A faster way to compute edit distance does not exist

150 点作者 dnt404-1将近 10 年前

14 条评论

a3_nm将近 10 年前

The title of the Boston Globe article and of the HN submission is misleading.The article does not prove that subquadratic edit distance algorithms do not exist. It only proves it conditional to another unproven complexity hypothesis (SETH). In other words, it shows that if such algorithms exist, then faster algorithms exist for another problem (SAT for CNF formulae with certain size bounds), where we haven't been able to find them yet. See the original article <a href="http://arxiv.org/abs/1412.0348" rel="nofollow">http://arxiv.org/abs/1412.0348</a> whose title is more accurate.This is important work, of course, and provides more evidence that faster edit distance computation is not possible, but it is not proven yet.

评论 #10041698 未加载

评论 #10040400 未加载

评论 #10042568 未加载

评论 #10040448 未加载

评论 #10040406 未加载

im3w1l将近 10 年前

>The fact that edit distance is only computable in quadratic time has big implications for genomics. The human genome contains about 3 billion base pairs. To compute the edit distance between two human genomes takes 3 billion-squared operations (to appreciate how big that is, write a 9 followed by 18 zeroes). That’s a number of operations, explains Piotr Indyk of MIT, that is “computationally infeasible.” Which is to say that our best computers chugging away for a really long time still won’t generate an answer.As was commented last time this was posted: Just because it's not possible to do better in the general case doesn't mean it isn't possible to do better in your case. For all we know, human genomes may not actually be 100% random strings, but may have some kind of structure that makes it possible to do it faster. Even when solving the problem exactly.

评论 #10039952 未加载

jongraehl将近 10 年前

The name "Wagner-Fischer" is new to me. I usually cite this as Levenshtein [1965]. Even the Wagner-Fischer wp page notes that Levenshtein was first by a long shot. <a href="https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm" rel="nofollow">https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...</a>

评论 #10041730 未加载

leni536将近 10 年前

It's a nice theoretical result. In practice there are some quirks though:- I assume this is worst case complexity. While this is useful, maybe the average time can be much lower for certain data (like the genome).- I assume the theorem doesn't restrict the input strings in any way. If you put restrictions on the input then there could be a faster algorithm than the general one. I don't know if this can be applied to the genome though.- People are interested in the constant factor too.

评论 #10039937 未加载

评论 #10041507 未加载

nomercy400将近 10 年前

As a quadratic algoritm, assuming 13 million base pairs can be solved in 7 seconds on a GTX275 (according to a paper on edit distance computation), is it correct to assume that 3 billion base pairs can be solved in (3bil * 3bil / (13mil * 13mil/7)) in 373443 seconds, or about 103 hours. Isn't that fast enough for a quadratic algorithm that samples DNA to be useful?

评论 #10042240 未加载

评论 #10042170 未加载

评论 #10040873 未加载

sanxiyn将近 10 年前

Previously here: <a href="https://news.ycombinator.com/item?id=9698785" rel="nofollow">https://news.ycombinator.com/item?id=9698785</a>

omouse将近 10 年前

That was refreshing; a regular newspaper publishing an article about computer science. And it does a fairly good job, I wish I could see something related to CS rather than to web apps and startups in the Canadian newspapers.

gtrubetskoy将近 10 年前

An article on edit distance without mentioning the person who invented it - Levenshtein. <a href="https://en.wikipedia.org/wiki/Vladimir_Levenshtein" rel="nofollow">https://en.wikipedia.org/wiki/Vladimir_Levenshtein</a>

CephalopodMD将近 10 年前

This is interesting, but it looks like this proves nothing according to the article since we don't know if P==NP. We are no closer to understanding whether or not a faster way to compute edit distance exists. All we know now is that it matches up to a class of problems that we _think_ are probably hard. It is still possible that we might find a faster solution to the problem.I'd like to request a title change - "New proof that a faster way to compute edit distance might be tied up in the P vs. NP problem"

nemo44x将近 10 年前

Hence the need and desire for approximation algorithms. Often, you can give up a tiny amount of precision in exchange for dramatically increased performance - if you can make an assumption.For instance, computing cardinality of a value over a huge dataset (distributed or otherwise) that is assumed to be evenly distributed would take memory proportional to the size of the dataset for 100% precision. Implementing a cardinality approximation algorithm like HyperLogLog++ lets you get a pretty close to accurate result for this calculation but with far fewer resources. There are many others and I think it is important to consider this trade off when it is appropriate.Of course, it makes an assumption that the data is distributed more or less evenly.

amelius将近 10 年前

Now we can start looking for a more precisely defined problem, that can be solved faster than in quadratic time.Or in other words, for certain classes of inputs, we can find a faster algorithm. We only need to find/define those classes.

wolfgke将近 10 年前

Does a similar result hold, if we don't consider inserts, deletions and substitutions (definition of edit distance), but only inserts and deletions?

评论 #10041522 未加载

评论 #10040032 未加载

erikb将近 10 年前

> Edit distance is useful, but also very time-consuming to calculate.So I guess we will soon see a lot of crypto based on calculating edit distances?

评论 #10039861 未加载

评论 #10040800 未加载

评论 #10039859 未加载

评论 #10039862 未加载

larsga将近 10 年前

This is misleading twice over. First, as a3_nm notes, this proof depends on unproven conjectures, and so this isn't proven yet.Secondly, it is possible to do better than the Wagner-Fischer algorithm in certain cases. I figured this out myself and was about to write a paper on it when I realized that other researchers had beat me to it by a couple of decades.If the two strings are equal you can establish that the edit distance is 0 in linear time, obviously. If you consider how the algorithm fills out the square down the diagonal you'll see that if there's only a one-letter difference you only actually need to fill out small parts of the grid around that difference.This is generalizable, so the complexity actually depends on the distance between the strings.

评论 #10040256 未加载