This is way over my head, but I was reminded of <i>The C language is purely functional</i> by Conal Elliott: <a href="http://conal.net/blog/posts/the-c-language-is-purely-functional" rel="nofollow">http://conal.net/blog/posts/the-c-language-is-purely-functio...</a>
The source code: <a href="https://github.com/appleseedlab/superc/">https://github.com/appleseedlab/superc/</a>
Figure 1 spoke to me. It's an expanded syntax tree that branches depending on the value of a preprocessor definition "CONFIG...X". I've often found myself doing the kind of code archeology that this paper seems to be trying to automate: exploring all the configuration possibilities implied by the codebase / build system. A C program that makes heavy use of the preprocessor is generally harder to grok for both humans and static analysis because 1. the C preprocessor syntax is different from C, 2. the inputs are not necessarily bounded by what appears in the source files alone ("-DCONFIG...X=foo" passed in from the build system), and 3. the resulting program and its control flow may be quite different depending on preprocessor options. As a simple example, embedded systems often define an "ASSERT(X)" macro as either a no-op, an infinite loop, a print statement, or the like.<p>This is definitely a niche space, but I see a clear use for large, portable, and configurable C codebases (e.g. Linux kernel, FreeRTOS): providing better visibility into the configuration system.
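To make the "ASSERT(X)" point concrete, here is a minimal sketch of how one macro can expand to three different programs. The `ASSERT_MODE` knob, the `assert_failures` counter, and `checked_div` are hypothetical names made up for illustration; they are not taken from the paper or from any particular codebase.

```c
#include <stdio.h>

/* Hypothetical configuration knob. A build system might pass
 * -DASSERT_MODE=0, 1 or 2; nothing in this file bounds the choice. */
#ifndef ASSERT_MODE
#define ASSERT_MODE 1
#endif

/* Counts failed assertions in the "print" configuration. */
static int assert_failures = 0;

#if ASSERT_MODE == 0
/* Release-style build: the check disappears and X is never evaluated. */
#define ASSERT(x) ((void)0)
#elif ASSERT_MODE == 1
/* Debug-style build: report the failed expression and keep going. */
#define ASSERT(x)                                          \
    do {                                                   \
        if (!(x)) {                                        \
            assert_failures++;                             \
            fprintf(stderr, "assert failed: %s\n", #x);    \
        }                                                  \
    } while (0)
#else
/* Embedded-style build: hang so a debugger or watchdog can take over. */
#define ASSERT(x) do { if (!(x)) { for (;;) { } } } while (0)
#endif

/* The same source line has three different control flows
 * depending on which ASSERT definition was selected. */
int checked_div(int a, int b)
{
    ASSERT(b != 0);
    if (b == 0)
        return 0;
    return a / b;
}
```

With the default `ASSERT_MODE` of 1, `checked_div(1, 0)` logs to stderr and returns 0; with 2 the same call never returns. A tool that only sees one preprocessed variant cannot tell you about the other two.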
Fwiw, ~20 years ago my experience was that preprocessor use in open-source C code was <i>very</i> idiomatic, and iirc, a simple backtracking parser that recognized those idioms was sufficient to parse everything I tried it against, including the Linux kernel.
By the way, GNU Bison implements general LR (GLR) parsing by something that can be called "fork merge LR". The documentation states that Bison's GLR algorithm resolves ambiguities by forking parallel parses, which then merge. It's not the same as forking due to a preprocessor conditional, but worth mentioning.
I am obviously not able to understand what specific problem this is solving, based on the title of "parsing all of C", when the preprocessor is apparently left intact by design<p><pre><code> static int mousedev_open(struct inode *inode, struct file *file)
 {
     int i;
 #ifdef CONFIG_INPUT_MOUSEDEV_PSAUX
     if (imajor(inode) == 10)
         i = 31;
     else
 #endif
         i = iminor(inode) - 32;
     return 0;
 }
(b) The preprocessed source preserving all configurations
</code></pre>
and my experience with C is that there are an untold number of "unbound" tokens that are designed to be injected by -D or auto-generated config.h files, so presumably this works closer to the "ready for compilation" phase versus something one could use to make tree-sitter better (as an example)
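For what it's worth, here is a rough sketch of why the snippet above is awkward for an ordinary parser. It is a simplified stand-in, not kernel code: `imajor(inode)` and `iminor(inode)` are replaced by plain `int` parameters, the function returns `i` so the effect is visible, and the `_psaux_on` / `_psaux_off` functions are hypothetical names for the two programs hiding inside the one source text.

```c
/* One source text. Whether the "if ... else" exists depends on a macro
 * that may only ever be defined on the compiler command line
 * (e.g. -DCONFIG_INPUT_MOUSEDEV_PSAUX). */
int mousedev_index(int major, int minor)
{
    int i;
    (void)major; /* unused when the macro is not defined */
#ifdef CONFIG_INPUT_MOUSEDEV_PSAUX
    if (major == 10)
        i = 31;
    else
#endif
        i = minor - 32;
    return i;
}

/* What the compiler sees when the macro is defined. */
int mousedev_index_psaux_on(int major, int minor)
{
    int i;
    if (major == 10)
        i = 31;
    else
        i = minor - 32;
    return i;
}

/* What the compiler sees when the macro is not defined. */
int mousedev_index_psaux_off(int major, int minor)
{
    int i;
    (void)major;
    i = minor - 32;
    return i;
}
```

Note that the `#ifdef` region ends on a dangling `else`, so it is not a complete statement by itself. As far as I can tell, that is the case a configuration-preserving parser has to handle: it needs to produce both trees from the first function without picking a value for the macro.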
> <i>In exploring configuration-preserving parsing, we focus on performance.</i><p>Why, because this goose is so thoroughly cooked that all that is left is optimizing for speed?<p>There is a lot of misplaced focus on performance in CS academia, and also in software.<p>Suppose we have some accurate tool that does something useful with a C program, but it takes 5 minutes to run instead of 5 seconds. So what? Someone still wants to use it. Suppose the program is used by millions of people, and that 5-minute run only has to be repeated half a dozen times during development.<p>Getting it right and getting it into people's hands should be the priorities, and not necessarily in that order.