I believe this article may have produced the same algorithm as [1] and [2] but applied to tree diffs [3] instead of string diffs.<p>In particular, Myers' algorithm [2] is generally considered the best general-purpose string diff today, and I bet he was doing A* search without realizing it. I would be very curious to find out if you could describe his algorithm as an A* search.<p>A* search brought his algorithm's runtime down from O(N^2) to O(N•diffsize). Tree diffs are often O(N^3) or O(N^2•log(N)), and this A* search seems to have brought that down to O(N•log(N)•diffsize). The A* search seems to reduce the extra O(N or N^2) to O(diffsize)!<p>I'm excited by the general use of A* to prune the dynamic programming search space. We might be onto a more general understanding of diffing.<p>[1] <a href="https://www.google.com/patents/US7313555" rel="nofollow">https://www.google.com/patents/US7313555</a><p>[2] <a href="http://www.xmailserver.org/diff2.pdf" rel="nofollow">http://www.xmailserver.org/diff2.pdf</a><p>[3] <a href="http://www.sciencedirect.com/science/article/pii/S0304397505000174" rel="nofollow">http://www.sciencedirect.com/science/article/pii/S0304397505...</a>
Interesting problem, and useful for creating semantically meaningful diffs, i.e. using entities of the programming language. A different approach to that is implemented in gumtree [1,2].<p>[1] Fine-grained and Accurate Source Code Differencing <a href="https://hal.archives-ouvertes.fr/hal-01054552/document" rel="nofollow">https://hal.archives-ouvertes.fr/hal-01054552/document</a><p>[2] <a href="https://github.com/GumTreeDiff/gumtree" rel="nofollow">https://github.com/GumTreeDiff/gumtree</a>
Pretty nice writeup, congrats. Tree diffing is super fun to code! (I've used it to animate between two different shader programs...) Applying A* is a great idea.<p>Not sure if this will help anyone interested, but edit operations on trees have a couple more categories besides inserts & deletes that I think makes it hard to think about as a grid the same way you can with strings. I spent a lot of time boiling down the math notation in a paper on tree edit distance to a single diagram that visualizes the edit operations: <a href="https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?dl=0" rel="nofollow">https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?d...</a><p>This is a visualization of "T. Jiang, L. Wang, and K. Zhang. Alignment of trees- an alternative to tree edit. Theoretical Computer Science, 143(1):137–148, 1995."<p>The interesting part is the right side, comparing a forest of children to another forest of children, which happens when a node changes type, but the subnodes are identical.<p>Another critical piece in tree diff is whether the nodes are ordered or unordered. If you have an AST for a math expression, a "+" operation is unordered, and you may have to diff two plus nodes twice, once with the arguments swapped - which is represented on the left in my diagram.<p>ASTs with 3+ argument nodes also confound visualizing the edit search space on a grid.<p>After having implemented tree diff using careful dynamic programming with backtracking (<a href="https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/function/align.cpp" rel="nofollow">https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/func...</a>) I came to the conclusion that memoizing would be a lot easier and have the same complexity. And you get A* for free-ish if you add the cost function and skip comparing two sub-trees that are further away than the best answer you've found so far.
Similar dynamic programming idea from Zhang + Shasha 1992: <a href="http://www.cs.nyu.edu/shasha/papers/treebook3.pdf" rel="nofollow">http://www.cs.nyu.edu/shasha/papers/treebook3.pdf</a>
> When I first heard the problem, it sounded pretty easy and I thought I’d be done within a few days, but I discovered a number of catches after starting and it ended up taking 5 weeks<p>It's really refreshing to hear about this type of real world estimation error that reflects so well the problem in software of making estimates with so many unknowns.
There's also ydiff in case anyone is interested (it works well with lisp/scheme): <a href="https://github.com/JexCheng/ydiff" rel="nofollow">https://github.com/JexCheng/ydiff</a>
Some of you in here reading this might like this new game:<p><a href="https://github.com/BusFactor1Inc/explain" rel="nofollow">https://github.com/BusFactor1Inc/explain</a><p>It's about exploring language spaces. Syntax free.<p>And metaprogramming.