The title of the Boston Globe article and of the HN submission is misleading.<p>The article does not prove that subquadratic edit distance algorithms do not exist. It only proves this conditionally, assuming another unproven complexity hypothesis (SETH, the Strong Exponential Time Hypothesis). In other words, it shows that if such algorithms exist, then faster algorithms also exist for another problem (SAT on CNF formulae with certain size bounds) for which we haven't been able to find them yet. See the original article <a href="http://arxiv.org/abs/1412.0348" rel="nofollow">http://arxiv.org/abs/1412.0348</a>, whose title is more accurate.<p>This is important work, of course, and it provides more evidence that faster edit distance computation is not possible, but that is not proven yet.
>The fact that edit distance is only computable in quadratic time has big implications for genomics. The human genome contains about 3 billion base pairs. To compute the edit distance between two human genomes takes 3 billion-squared operations (to appreciate how big that is, write a 9 followed by 18 zeroes). That’s a number of operations, explains Piotr Indyk of MIT, that is “computationally infeasible.” Which is to say that our best computers chugging away for a really long time still won’t generate an answer.<p>As was commented last time this was posted: just because it's not possible to do better in the <i>general case</i> doesn't mean it isn't possible to do better in <i>your case</i>. For all we know, human genomes may not actually be random strings at all; they may have some kind of structure that makes the computation faster, even when solving the problem exactly.
The name "Wagner-Fischer" is new to me. I usually cite this as Levenshtein [1965]. Even the Wagner-Fischer wp page notes that Levenshtein was first by a long shot. <a href="https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm" rel="nofollow">https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...</a>
It's a nice theoretical result. In practice there are some caveats though:<p>- I assume this is worst-case complexity. While that is useful to know, the average-case time might be much lower for certain data (like the genome).<p>- I assume the theorem doesn't restrict the input strings in any way. If you put restrictions on the inputs, there could be a faster algorithm than the general one. I don't know whether that applies to the genome though.<p>- People are interested in the constant factor too.
Assuming a quadratic algorithm, and taking the figure that 13 million base pairs can be handled in 7 seconds on a GTX 275 (according to a paper on edit distance computation), is it correct to extrapolate that 3 billion base pairs would take 3bil * 3bil / (13mil * 13mil / 7) ≈ 373,000 seconds, or about 104 hours? Isn't that fast enough for a quadratic algorithm that samples DNA to be useful?
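A quick back-of-the-envelope check of that extrapolation (taking the 13M-pairs-in-7-seconds GPU figure at face value):

    # Quadratic scaling: runtime grows with the square of the input length.
    ops_per_second = 13e6 ** 2 / 7        # implied rate: ~2.4e13 cell updates/s
    seconds = 3e9 ** 2 / ops_per_second   # scale up to 3 billion base pairs
    print(seconds, seconds / 3600)        # ~3.7e5 s, i.e. roughly 104 hours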
That was refreshing: a regular newspaper publishing an article about computer science, and doing a fairly good job of it. I wish I could see something related to CS, rather than to web apps and startups, in the Canadian newspapers.
An article on edit distance without mentioning the person who invented it - Levenshtein. <a href="https://en.wikipedia.org/wiki/Vladimir_Levenshtein" rel="nofollow">https://en.wikipedia.org/wiki/Vladimir_Levenshtein</a>
This is interesting, but going by the article it doesn't prove anything unconditionally, since we don't know whether P == NP. We are no closer to knowing whether a faster way to compute edit distance exists. All we know now is that the problem is tied to a class of problems that we _think_ are probably hard. It is still possible that we might find a faster solution.<p>I'd like to request a title change - "New proof that a faster way to compute edit distance might be tied up in the P vs. NP problem"
Hence the need and desire for approximation algorithms. Often you can give up a tiny amount of precision in exchange for dramatically better performance, if you can make an assumption about the input.<p>For instance, counting the exact number of distinct values in a huge dataset (distributed or otherwise) takes memory proportional to the number of distinct values for 100% precision. A cardinality-estimation algorithm like HyperLogLog++ gets you a result that is very close to accurate with far fewer resources. There are many other such algorithms, and I think it is important to consider this trade-off where it is appropriate.<p>Of course, it relies on the assumption that the (hashed) values are distributed more or less evenly.
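To make that trade-off concrete, here is a minimal, self-contained sketch of the core HyperLogLog idea (a toy version, not HyperLogLog++, and without its bias and range corrections): the estimate lives entirely in 2^b small registers, no matter how many items stream past.

    import hashlib
    import random

    def hyperloglog_estimate(items, b=10):
        # 2^b registers; each remembers the maximum "rank" (position of the
        # leftmost 1-bit) seen among the 64-bit hashes routed to it.
        m = 1 << b
        registers = [0] * m
        for item in items:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            j = h & (m - 1)                        # low b bits pick a register
            w = h >> b                             # remaining 64-b bits
            rank = (64 - b) - w.bit_length() + 1   # leftmost 1-bit position
            registers[j] = max(registers[j], rank)
        alpha = 0.7213 / (1 + 1.079 / m)           # standard bias constant
        return alpha * m * m / sum(2.0 ** -r for r in registers)

    # ~1,000,000 distinct values estimated from ~1 KB worth of registers;
    # typical relative error is around 1.04 / sqrt(2^b), a few percent here.
    data = (random.randrange(10**12) for _ in range(1_000_000))
    print(round(hyperloglog_estimate(data)))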
Now we can start looking for more precisely defined problems that <i>can</i> be solved faster than in quadratic time.<p>Or in other words: for certain classes of inputs we can find a faster algorithm; we only need to find/define those classes.
Does a similar result hold if we allow only insertions and deletions, rather than insertions, deletions and substitutions (the definition of edit distance)?
> Edit distance is useful, but also very time-consuming to calculate.<p>So I guess we will soon see a lot of crypto based on calculating edit distances?
This is misleading twice over. First, as a3_nm notes, this proof depends on an unproven conjecture (SETH), so impossibility isn't actually proven yet.<p>Secondly, it <i>is</i> possible to do better than the Wagner-Fischer algorithm in certain cases. I figured this out myself and was about to write a paper on it when I realized that other researchers had beaten me to it by a couple of decades.<p>If the two strings are equal, you can obviously establish that the edit distance is 0 in linear time. And if you consider how the algorithm fills out the table along the diagonal, you'll see that when there is only a one-letter difference you only actually need to fill out a small part of the grid around that difference (see the sketch below).<p>This generalizes, so the complexity actually depends on the edit distance between the strings.
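That banded/diagonal technique is usually attributed to Ukkonen (1985): if you only care whether the distance is at most k, you can restrict the table to a band of about 2k+1 cells around the diagonal and finish in O(k·n) time instead of O(n·m). A minimal sketch of that bounded check (my own illustration, not code from any of the papers):

    def edit_distance_at_most(a, b, k):
        # Return the edit distance of a and b if it is <= k, else None.
        # Only cells within k of the diagonal are filled: O(k * len(a)) work.
        n, m = len(a), len(b)
        if abs(n - m) > k:
            return None                  # the length gap alone already exceeds k
        INF = k + 1                      # stand-in for any cell outside the band
        prev = {j: j for j in range(min(m, k) + 1)}      # row 0 of the DP table
        for i in range(1, n + 1):
            cur = {}
            for j in range(max(0, i - k), min(m, i + k) + 1):
                if j == 0:
                    cur[j] = i
                    continue
                cur[j] = min(
                    prev.get(j, INF) + 1,                        # delete a[i-1]
                    cur.get(j - 1, INF) + 1,                     # insert b[j-1]
                    prev.get(j - 1, INF) + (a[i - 1] != b[j - 1]),
                )
            prev = cur
        d = prev.get(m, INF)
        return d if d <= k else None

    print(edit_distance_at_most("kitten", "sitting", 3))  # 3
    print(edit_distance_at_most("kitten", "sitting", 2))  # None

Doubling k until the check succeeds recovers the exact distance d of two length-n strings in roughly O(d·n) time, which is close to linear when the strings are nearly identical.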