How different are different diff algorithms in Git?

152 points by aw1621107, about 5 years ago

10 comments

wcarss, about 5 years ago

From the section about how the qualitative difference between the two algorithms was found:

"In the second step, we conducted a manual comparison between two diff outputs produced by Myers and Histogram algorithms from all files in the sample. The first two authors of this paper were involved to independently annotate the diff outputs that makes the result is expected to be more reliable. ... The comparison results between two authors from 377 files were subsequently computed to find the kappa agreement. We obtained 70.82%, which is categorized into ‘substantial agreement’ (Viera and Garrett 2005). This means, the statistic result of our manual study is acceptable."

Even though I'm inclined to agree with the example given in the paper and a lot of work clearly went into the qualitative evaluation, this feels like a very weak way to perform a qualitative analysis. Specifically:

- this is a sample size of two academic authors who chose to write a paper together about the quality of different diffing algorithms, i.e., a very skewed and small sample.
- there is no mention of any blinding in the labeling process, so any preconceptions about the quality of different diffing algorithms may have been present in the qualitative grading -- or they may not have been! We don't even know.
- there does not seem to be a clear mention of how the representative sample was chosen, or of what factors were taken into consideration for determining a representative sample of changes, so that reviewers/other researchers could potentially make different choices in the future and draw informed comparisons with this work.

To sum up: in my admittedly not at all authoritative opinion, this portion of the paper cannot conclude more than something like, "further study is warranted on this topic, with a far better controlled and far larger sample size, and clearer explications of the methodological choices".

Regardless of that, it was an interesting read and not something previously on my radar as worth experimenting with at all! Kudos to the authors for drawing attention to it and for the other more quantitative aspects of the paper (which I examined less and charitably assume are top notch).

hyperpallium, about 5 years ago

Usability of diffs, to do actual useful tasks, is an excellent research topic. But a very difficult area.

To add my subjective experiment: I just compared myers vs histogram on my latest commit.

- myers presented a function I'd 90% gutted as the same function, edited (the rest was moved in the file, so no LCS algo could find it), like word-diff often does. I thought this was clever.
- histogram presented it as one function deleted, and a completely new function added. This was cleaner.

But I'm not even sure which is more usable. Might even vary with the specific task, e.g. function evolution vs function readability. Difficult area!
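
For anyone who wants to repeat this kind of side-by-side comparison, here is a minimal Python sketch. It assumes `git` is on the PATH and that the repository has at least two commits; the helper names `diff_command` and `compare_algorithms` are made up for illustration. The only Git feature it relies on is the `--diff-algorithm` option of `git diff`.

```python
import subprocess
from typing import Dict, List

# The two algorithms compared in the thread; git also accepts
# "minimal" and "patience" for --diff-algorithm.
ALGORITHMS = ("myers", "histogram")


def diff_command(algorithm: str, old: str = "HEAD~1", new: str = "HEAD") -> List[str]:
    """Build the git invocation that diffs two revisions with one algorithm."""
    return ["git", "diff", f"--diff-algorithm={algorithm}", old, new]


def compare_algorithms(repo: str = ".") -> Dict[str, str]:
    """Run the same diff once per algorithm and return the raw patch text.

    Raises subprocess.CalledProcessError if git fails (for example when
    the repository has no parent commit to compare against).
    """
    outputs: Dict[str, str] = {}
    for algorithm in ALGORITHMS:
        result = subprocess.run(
            diff_command(algorithm),
            cwd=repo,
            capture_output=True,
            text=True,
            check=True,
        )
        outputs[algorithm] = result.stdout
    return outputs
```

Calling `compare_algorithms()` inside a repository returns both patches, which can then be read next to each other; whether they differ at all depends entirely on the commit. To make one algorithm the default rather than passing the flag each time, Git has the `diff.algorithm` configuration setting.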


zaptheimpaler, about 5 years ago

Diff is reverse engineering a many-to-one function: many possible insert/delete sequences applied to string X map to the same string Y.

What would it look like to store files natively as insert/delete sequences instead? So instead of filesystems and diffs on top, we could have DIFFsystems and files on top. Kind of like a WAL. Files would be checkpoints in the WAL for efficiency, and diffs would be 100% accurate between two checkpoints. Probably takes a hell of a lot more space & CPU though..
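
A toy Python sketch of that idea, under loud assumptions: the names `Insert`, `Delete` and `EditLog` are invented here, the log lives in memory, and edits are character-level. It only illustrates the shape of "operations as the source of truth, file contents as checkpoints", not a real storage system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Insert:
    pos: int   # character offset at which text is inserted
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int     # character offset where the deletion starts
    length: int  # number of characters removed


Op = Union[Insert, Delete]


def apply_op(content: str, op: Op) -> str:
    """Apply a single edit operation to a string."""
    if isinstance(op, Insert):
        return content[:op.pos] + op.text + content[op.pos:]
    return content[:op.pos] + content[op.pos + op.length:]


@dataclass
class EditLog:
    """Append-only log of edits; checkpoints cache the full file content."""
    ops: List[Op] = field(default_factory=list)
    # (number of ops applied, content at that point); the empty file is checkpoint 0
    checkpoints: List[Tuple[int, str]] = field(default_factory=lambda: [(0, "")])

    def append(self, op: Op) -> None:
        self.ops.append(op)

    def checkpoint(self) -> None:
        """Store the current content so later reads need not replay everything."""
        self.checkpoints.append((len(self.ops), self.materialize()))

    def materialize(self, upto: Optional[int] = None) -> str:
        """Rebuild the content after the first `upto` ops (default: all of them)."""
        upto = len(self.ops) if upto is None else upto
        # Start from the latest checkpoint that is not past `upto`.
        start, content = max(
            (c for c in self.checkpoints if c[0] <= upto), key=lambda c: c[0]
        )
        for op in self.ops[start:upto]:
            content = apply_op(content, op)
        return content

    def diff(self, a: int, b: int) -> List[Op]:
        """The exact edits between two versions: no reverse engineering needed."""
        return self.ops[a:b]
```

Here `diff(a, b)` is exact by construction because it returns the edits that were actually recorded, and the space cost mentioned above shows up as the ever-growing `ops` list plus every stored checkpoint.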


shoo, about 5 years ago

In practice, suppose you want to do a merge of files that have more structure than plain text, e.g. code belonging to a specific programming language, JSON, some data format for an application, etc. You can do a better job of diffing and merging if your difftool and mergetool are aware of the structure of your file. The text-based tooling that git offers is to some extent a lowest common denominator. In principle you can build custom tooling for your format / context that does a better job.

One extension point that git offers for this is the "merge driver". You can define an external script/application that git will call whenever it needs to merge a particular file that changed on both sides (selected by path or pattern).

Here's an older blog post describing a custom git merge driver for merging data files in a game engine: http://bitsquid.blogspot.com/2010/06/avoiding-content-locks-and-conflicts-3.html

In this gamedev context of merging game data files, it was less important to produce the "correct" merge result than it was to produce some result that had a valid file format. Less technical users could then fix up bad mergetool decisions in an editor with a UI instead of trying to resolve the merge conflicts at the level of the raw serialisation format itself (which could corrupt the data file and make it impossible to load into the editor).

In other situations where there is a large cost to automatically producing the wrong merge result, it would be a better tradeoff to "halt the line" if there is ambiguity about how a merge should be resolved, and escalate to a human to decide what to do.

Further reading:

- How to wire a custom merge driver in to git: https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver
- What values you can pass in to a merge driver on the command line: https://github.com/git/git/blob/f1d4a28250629ae469fc5dd59ab843cb2fd68e12/ll-merge.c
- Simple example of the plumbing to wire in a merge driver, with a trivial dumb driver script: https://github.com/Praqma/git-merge-driver
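
To make the merge driver idea concrete, here is a rough Python sketch of a structure-aware driver for JSON files whose top level is an object. It is not the driver from any of the links above: the driver name `jsonkeys`, the script name `json_merge_driver.py` and the key-by-key merge policy are all assumptions made for illustration. What it takes from the gitattributes documentation is the contract: git passes the ancestor, current and other versions as `%O`, `%A` and `%B`, expects the result to be written over the `%A` file, and treats a non-zero exit status as a conflict.

```python
import json
import sys
from typing import Any, Dict, List, Tuple

# Hypothetical wiring (names are illustrative, not from the thread):
#   .gitattributes:  *.json merge=jsonkeys
#   git config merge.jsonkeys.driver "python3 json_merge_driver.py %O %A %B"

_MISSING = object()  # sentinel for "key absent in this version"


def merge_objects(
    base: Dict[str, Any], ours: Dict[str, Any], theirs: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Three-way merge of top-level keys; returns (merged, conflicting keys)."""
    merged: Dict[str, Any] = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        if o == t or t == b:
            value = o            # both sides agree, or only ours changed
        elif o == b:
            value = t            # only theirs changed
        else:
            conflicts.append(key)
            value = o            # keep ours so the output is still valid JSON
        if value is not _MISSING:
            merged[key] = value
    return merged, conflicts


def main(argv: List[str]) -> int:
    base_path, ours_path, theirs_path = argv[1:4]
    versions = []
    for path in (base_path, ours_path, theirs_path):
        with open(path, encoding="utf-8") as handle:
            versions.append(json.load(handle))
    merged, conflicts = merge_objects(*versions)
    # Git expects the merge result to overwrite the %A file.
    with open(ours_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=2, sort_keys=True)
        handle.write("\n")
    for key in conflicts:
        print(f"conflict on key {key!r}: kept our value", file=sys.stderr)
    # Non-zero exit tells git the merge was not clean ("halt the line").
    return 1 if conflicts else 0


# Only act when invoked with the three paths git supplies.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

The sketch combines both tradeoffs described above: the file left behind always parses as JSON, so an editor can still load it, while the non-zero exit status on a conflicting key still escalates the decision to a human.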

DaiPlusPlus, about 5 years ago

I use git GUIs a lot though (mostly GitKraken and VS’ built-in Git UI) - they aren’t all as customizable though :/


Buetol, about 5 years ago

I also researched the best diff algorithm. Google's diff-match-patch [1] library produces very good diffs, for example. But I found that the best diffs are produced by wikidiff2 [2], the MediaWiki diff engine. Both engines produce word-by-word diffs.

[1]: https://github.com/google/diff-match-patch
[2]: https://www.mediawiki.org/wiki/Wikidiff2


_pastel, about 5 years ago

Are there any other articles answering this question? I would like to compare them in my upcoming paper: "How different are different 'How different are different diff algorithms in Git' articles?"


Hendrikto, about 5 years ago

This article feels very drawn out. Did they have some kind of length requirement they needed to fill?

modeless, about 5 years ago

The diff algorithms seem like a hacky collection of heuristics; the type that's begging to be replaced by a machine learning system. Something that understands the content could do much better than either of these algorithms. I'm not sure what the training data would look like though.


hervature, about 5 years ago

TL;DR Use the Histogram diff algorithm in Git
