TechEcho

10 comments

toomimover 7 years ago

I believe this article may have produced the same algorithm as [1] and [2] but applied to tree diffs [3] instead of string diffs.In particular, Myers' algorithm [2] is generally considered the best general-purpose string diff today, and I bet he was doing A* search without realizing it. I would be very curious to find out if you could describe his algorithm as an A* search.A* search brought his algorithm's runtime down from O(N^2) to O(N•diffsize). Tree diffs are often O(N^3) or O(N^2•log(N)), and this A* search seems to have brought that down to O(N•log(N)•diffsize). The A* search seems to reduce the extra O(N or N^2) to O(diffsize)!I'm excited by the general use of A* to prune the dynamic programming search space. We might be onto a more general understanding of diffing.[1] <a href="https://www.google.com/patents/US7313555" rel="nofollow">https://www.google.com/patents/US7313555</a>[2] <a href="http://www.xmailserver.org/diff2.pdf" rel="nofollow">http://www.xmailserver.org/diff2.pdf</a>[3] <a href="http://www.sciencedirect.com/science/article/pii/S0304397505000174" rel="nofollow">http://www.sciencedirect.com/science/article/pii/S0304397505...</a>

robertkrahn01over 7 years ago

Interesting problem, and useful for creating semantically meaningful diffs, i.e. using entities of the programming language. A different approach to that is implemented in gumtree [1,2].[1] Fine-grained and Accurate Source Code Differencing <a href="https://hal.archives-ouvertes.fr/hal-01054552/document" rel="nofollow">https://hal.archives-ouvertes.fr/hal-01054552/document</a>[2] <a href="https://github.com/GumTreeDiff/gumtree" rel="nofollow">https://github.com/GumTreeDiff/gumtree</a>

dahartover 7 years ago

Pretty nice writeup, congrats. Tree diffing is super fun to code! (I've used it to animate between two different shader programs...) Applying A* is a great idea.Not sure if this will help anyone interested, but edit operations on trees have a couple more categories besides inserts & deletes that I think makes it hard to think about as a grid the same way you can with strings. I spent a lot of time boiling down the math notation in a paper on tree edit distance to a single diagram that visualizes the edit operations: <a href="https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?dl=0" rel="nofollow">https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?d...</a>This is a visualization of "T. Jiang, L. Wang, and K. Zhang. Alignment of trees- an alternative to tree edit. Theoretical Computer Science, 143(1):137–148, 1995."The interesting part is the right side, comparing a forest of children to another forest of children, which happens when a node changes type, but the subnodes are identical.Another critical piece in tree diff is whether the nodes are ordered or unordered. If you have an AST for a math expression, a "+" operation is unordered, and you may have to diff two plus nodes twice, once with the arguments swapped - which is represented on the left in my diagram.ASTs with 3+ argument nodes also confound visualizing the edit search space on a grid.After having implemented tree diff using careful dynamic programming with backtracking (<a href="https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/function/align.cpp" rel="nofollow">https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/func...</a>) I came to the conclusion that memoizing would be a lot easier and have the same complexity. And you get A* for free-ish if you add the cost function and skip comparing two sub-trees that are further away than the best answer you've found so far.

how_gaucheover 7 years ago

Similar dynamic programming idea from Zhang + Shasha 1992: <a href="http://www.cs.nyu.edu/shasha/papers/treebook3.pdf" rel="nofollow">http://www.cs.nyu.edu/shasha/papers/treebook3.pdf</a>

评论 #15104019 未加载

bhrgunathaover 7 years ago

> When I first heard the problem, it sounded pretty easy and I thought I’d be done within a few days, but I discovered a number of catches after starting and it ended up taking 5 weeksIt's really refreshing to hear about this type of real world estimation error that reflects so well the problem in software of making estimates with so many unknowns.

dawwover 7 years ago

Hey you are the guy who made syntect! Great job and thanks

评论 #15103604 未加载

kuwzeover 7 years ago

There's also ydiff in case anyone is interested (it works well with lisp/scheme): <a href="https://github.com/JexCheng/ydiff" rel="nofollow">https://github.com/JexCheng/ydiff</a>

评论 #15102359 未加载

kruhftover 7 years ago

Some of you in here reading this might like this new game:<a href="https://github.com/BusFactor1Inc/explain" rel="nofollow">https://github.com/BusFactor1Inc/explain</a>It's about exploring language spaces. Syntax free.And metaprogramming.

nerdponxover 7 years ago

Great post, thank you!But why is the config represented as a sexpr and not, say, YAML?

评论 #15104076 未加载

评论 #15104044 未加载

senatorobamaover 7 years ago

What's the state of the art of progarm-diffing (i.e based on LLVM IR)?

评论 #15104567 未加载

10 comments

toomimover 7 years ago

robertkrahn01over 7 years ago

dahartover 7 years ago

how_gaucheover 7 years ago

Similar dynamic programming idea from Zhang + Shasha 1992: <a href="http://www.cs.nyu.edu/shasha/papers/treebook3.pdf" rel="nofollow">http://www.cs.nyu.edu/shasha/papers/treebook3.pdf</a>

评论 #15104019 未加载

bhrgunathaover 7 years ago

dawwover 7 years ago

Hey you are the guy who made syntect! Great job and thanks

评论 #15103604 未加载

kuwzeover 7 years ago

There's also ydiff in case anyone is interested (it works well with lisp/scheme): <a href="https://github.com/JexCheng/ydiff" rel="nofollow">https://github.com/JexCheng/ydiff</a>

评论 #15102359 未加载

kruhftover 7 years ago

nerdponxover 7 years ago

Great post, thank you!But why is the config represented as a sexpr and not, say, YAML?

Designing a Tree Diff Algorithm Using Dynamic Programming and A*

10 comments

Designing a Tree Diff Algorithm Using Dynamic Programming and A*

10 comments