An open source, generic, semantically aware diff/patch toolchain.<p>This is a fairly big undertaking, including at least:<p>- A standard for representing and installing grammar information. Perhaps a file in /usr/share/grammar/ for each programming language, packaged with each language's own tools. Since not all languages are context-free, the format must support more powerful formalisms as well, at least indexed grammars, maybe more. Perhaps grammars should be able to depend on and extend each other (e.g. HTML can contain a JavaScript node); perhaps they definitely should not.<p>- A new text-based diff/patch format that is less line-based, so that the difference can be communicated on a more semantic level while still guaranteeing exact reproduction, i.e. B = patch(A, diff(A, B)). Add and delete should probably still be the only supported operations, but maybe not.<p>- A standard for representing semantic weight. The patch operation is deterministic, but the choice of diff is not: given A and B, there are many possible patches that, applied to A, correctly produce B. The job of the diff tool is to find a patch that is guaranteed to produce B and that describes the difference as closely as possible to the way a human would. Example: in Python, removing an if condition and promoting the contained 100-line code block should not be represented as delete 101 lines and add 100 lines. Rather, it should be represented as delete 1 line and delete 100 indents, at the correct level. The semantic weight information that lets diff choose the best representation in most cases might be produced using ML. Imagine a collaborative effort to produce training data from commits in open projects, where in most cases a human would simply select the patch candidate that is most easily understood. The semantic weight representation would probably have concepts for operations other than add and delete.<p>- The actual patch and diff tools. 
Patch implements only the patch format and has no dependency on grammar or semantic weight data. Diff is where the effort lies. Some languages may not be well served by the generic approach with grammar and semantic weight, so diff should be pluggable. Perhaps /usr/share/diff can contain configuration for each language, e.g. whether it uses a certain grammar and semantic weight file or a custom binary. Diff also needs some way of detecting which language to use for a given file.<p>The vast majority of version control history (including git) is snapshot-based, meaning that among the many possible patches from A to B, none is preferred by the version control data. We are completely free to improve the way our tools select these patches, and the result will be completely backward compatible with the existing wealth of version control history. We'd just be able to look at it with more clarity.
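<p>To make the Python example above a bit more concrete, here is a rough sketch of the idea, not a proposed format. Everything in it is an assumption made for illustration: the Add/Delete records addressed by character offset, the patch() function, the two hand-built candidates, and above all cost(), which just counts characters touched as a crude stand-in for real semantic weight. It only shows that several patches can reproduce B exactly and that a weight function then picks among them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Add:
    pos: int   # character offset in the original text A to insert before
    text: str  # text to insert


@dataclass(frozen=True)
class Delete:
    pos: int   # character offset in the original text A
    text: str  # text expected at that offset; verified when applying


def patch(a, ops):
    """Apply add/delete ops (all offsets refer to A) and return the result."""
    out, cursor = [], 0
    # At equal offsets, apply Add before Delete so a plain replace works.
    for op in sorted(ops, key=lambda o: (o.pos, isinstance(o, Delete))):
        if op.pos < cursor:
            raise ValueError(f"overlapping operation at offset {op.pos}")
        out.append(a[cursor:op.pos])  # copy untouched text up to the op
        cursor = op.pos
        if isinstance(op, Delete):
            if not a.startswith(op.text, op.pos):
                raise ValueError(f"context mismatch at offset {op.pos}")
            cursor += len(op.text)    # skip the deleted text
        else:
            out.append(op.text)
    out.append(a[cursor:])
    return "".join(out)


def cost(ops):
    """Hypothetical weight: total characters added or deleted."""
    return sum(len(op.text) for op in ops)


def line_candidate(a, b):
    """Coarse patch in the line-based spirit: delete all of A, add all of B."""
    return [Delete(0, a), Add(0, b)]


def dedent_candidate(a, header, indent):
    """Hand-built patch: delete the header line, then one indent per line."""
    ops = [Delete(0, header)]
    pos = len(header)
    for line in a[pos:].splitlines(keepends=True):
        ops.append(Delete(pos, indent))
        pos += len(line)
    return ops


A = "if ready:\n    load()\n    run()\n"
B = "load()\nrun()\n"

candidates = [line_candidate(A, B), dedent_candidate(A, "if ready:\n", "    ")]
# A candidate is only eligible if it reproduces B exactly.
assert all(patch(A, ops) == B for ops in candidates)
best = min(candidates, key=cost)
```

With this toy weight, the candidate that deletes one line and two indents wins over the delete-everything/add-everything one. A real tool would generate candidates from the grammar and rank them with learned semantic weights rather than a character count.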