TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Designing a Tree Diff Algorithm Using Dynamic Programming and A*

230 pointsby yminskyover 7 years ago

10 comments

toomimover 7 years ago
I believe this article may have produced the same algorithm as [1] and [2] but applied to tree diffs [3] instead of string diffs.<p>In particular, Myers&#x27; algorithm [2] is generally considered the best general-purpose string diff today, and I bet he was doing A* search without realizing it. I would be very curious to find out if you could describe his algorithm as an A* search.<p>A* search brought his algorithm&#x27;s runtime down from O(N^2) to O(N•diffsize). Tree diffs are often O(N^3) or O(N^2•log(N)), and this A* search seems to have brought that down to O(N•log(N)•diffsize). The A* search seems to reduce the extra O(N or N^2) to O(diffsize)!<p>I&#x27;m excited by the general use of A* to prune the dynamic programming search space. We might be onto a more general understanding of diffing.<p>[1] <a href="https:&#x2F;&#x2F;www.google.com&#x2F;patents&#x2F;US7313555" rel="nofollow">https:&#x2F;&#x2F;www.google.com&#x2F;patents&#x2F;US7313555</a><p>[2] <a href="http:&#x2F;&#x2F;www.xmailserver.org&#x2F;diff2.pdf" rel="nofollow">http:&#x2F;&#x2F;www.xmailserver.org&#x2F;diff2.pdf</a><p>[3] <a href="http:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;pii&#x2F;S0304397505000174" rel="nofollow">http:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;pii&#x2F;S0304397505...</a>
robertkrahn01over 7 years ago
Interesting problem, and useful for creating semantically meaningful diffs, i.e. using entities of the programming language. A different approach to that is implemented in gumtree [1,2].<p>[1] Fine-grained and Accurate Source Code Differencing <a href="https:&#x2F;&#x2F;hal.archives-ouvertes.fr&#x2F;hal-01054552&#x2F;document" rel="nofollow">https:&#x2F;&#x2F;hal.archives-ouvertes.fr&#x2F;hal-01054552&#x2F;document</a><p>[2] <a href="https:&#x2F;&#x2F;github.com&#x2F;GumTreeDiff&#x2F;gumtree" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;GumTreeDiff&#x2F;gumtree</a>
dahartover 7 years ago
Pretty nice writeup, congrats. Tree diffing is super fun to code! (I&#x27;ve used it to animate between two different shader programs...) Applying A* is a great idea.<p>Not sure if this will help anyone interested, but edit operations on trees have a couple more categories besides inserts &amp; deletes that I think makes it hard to think about as a grid the same way you can with strings. I spent a lot of time boiling down the math notation in a paper on tree edit distance to a single diagram that visualizes the edit operations: <a href="https:&#x2F;&#x2F;www.dropbox.com&#x2F;s&#x2F;iubwg113mm7ey6q&#x2F;treeEditDist.png?dl=0" rel="nofollow">https:&#x2F;&#x2F;www.dropbox.com&#x2F;s&#x2F;iubwg113mm7ey6q&#x2F;treeEditDist.png?d...</a><p>This is a visualization of &quot;T. Jiang, L. Wang, and K. Zhang. Alignment of trees- an alternative to tree edit. Theoretical Computer Science, 143(1):137–148, 1995.&quot;<p>The interesting part is the right side, comparing a forest of children to another forest of children, which happens when a node changes type, but the subnodes are identical.<p>Another critical piece in tree diff is whether the nodes are ordered or unordered. If you have an AST for a math expression, a &quot;+&quot; operation is unordered, and you may have to diff two plus nodes twice, once with the arguments swapped - which is represented on the left in my diagram.<p>ASTs with 3+ argument nodes also confound visualizing the edit search space on a grid.<p>After having implemented tree diff using careful dynamic programming with backtracking (<a href="https:&#x2F;&#x2F;sourceforge.net&#x2F;p&#x2F;blot&#x2F;code&#x2F;HEAD&#x2F;tree&#x2F;trunk&#x2F;lib&#x2F;function&#x2F;align.cpp" rel="nofollow">https:&#x2F;&#x2F;sourceforge.net&#x2F;p&#x2F;blot&#x2F;code&#x2F;HEAD&#x2F;tree&#x2F;trunk&#x2F;lib&#x2F;func...</a>) I came to the conclusion that memoizing would be a lot easier and have the same complexity. And you get A* for free-ish if you add the cost function and skip comparing two sub-trees that are further away than the best answer you&#x27;ve found so far.
how_gaucheover 7 years ago
Similar dynamic programming idea from Zhang + Shasha 1992: <a href="http:&#x2F;&#x2F;www.cs.nyu.edu&#x2F;shasha&#x2F;papers&#x2F;treebook3.pdf" rel="nofollow">http:&#x2F;&#x2F;www.cs.nyu.edu&#x2F;shasha&#x2F;papers&#x2F;treebook3.pdf</a>
评论 #15104019 未加载
bhrgunathaover 7 years ago
&gt; When I first heard the problem, it sounded pretty easy and I thought I’d be done within a few days, but I discovered a number of catches after starting and it ended up taking 5 weeks<p>It&#x27;s really refreshing to hear about this type of real world estimation error that reflects so well the problem in software of making estimates with so many unknowns.
dawwover 7 years ago
Hey you are the guy who made syntect! Great job and thanks
评论 #15103604 未加载
kuwzeover 7 years ago
There&#x27;s also ydiff in case anyone is interested (it works well with lisp&#x2F;scheme): <a href="https:&#x2F;&#x2F;github.com&#x2F;JexCheng&#x2F;ydiff" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;JexCheng&#x2F;ydiff</a>
评论 #15102359 未加载
kruhftover 7 years ago
Some of you in here reading this might like this new game:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;BusFactor1Inc&#x2F;explain" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;BusFactor1Inc&#x2F;explain</a><p>It&#x27;s about exploring language spaces. Syntax free.<p>And metaprogramming.
nerdponxover 7 years ago
Great post, thank you!<p>But why is the config represented as a sexpr and not, say, YAML?
评论 #15104076 未加载
评论 #15104044 未加载
senatorobamaover 7 years ago
What&#x27;s the state of the art of progarm-diffing (i.e based on LLVM IR)?
评论 #15104567 未加载