> Instead, we'll be single-pass: code generation happens during parsing<p>IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once
I made similar project in TypeScript[1]. Basically multipass compiler that generates x86 assembly, compiles it to binary and runs it. The worst thing were register allocator, designing IR code and assembler.<p>[1] <a href="https://github.com/Mati365/ts-c-compiler">https://github.com/Mati365/ts-c-compiler</a>
I am pretty certain the following is a valid "for"-loop translation:<p><pre><code> block
;; code for "i = 0"
loop
;; code for "i < 5"
i32.eqz
br_if 1
i32.const 1
loop
if
;; code for "i = i + 1"
br 2
else
end
;; code for "j = j * 2 + 1"
i32.const 0
end
end
end
</code></pre>
It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass:<p><pre><code> ;; code for "i = 0"
.loop_test:
;; code for "i < 5"
jz .loop_end
jmp .loop_body
.loop_incr:
;; code for "i = i + 1"
jmp .loop_test
.loop_body:
;; code for "j = j * 2 + 1"
jmp .loop_incr
.loop_end:
</code></pre>
Of course, normally you'd want to re-arrange things like so:<p><pre><code> ;; code for "i = 0"
jmp .loop_test
.loop_body:
;; code for "j = j * 2 + 1"
.loop_incr:
;; code for "i = i + 1"
.loop_test:
;; code for "i < 5"
jnz .loop_body
.loop_end:
</code></pre>
I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)
A time-honored approach!<p><a href="https://www.blackhat.com/presentations/win-usa-04/bh-win-04-aitel.pdf" rel="nofollow noreferrer">https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...</a><p>(minus directly emitting opcodes, and fitting into 500 lines, of course.)
Somewhat unrelated question, but I think one of the second most difficult things of learning C for coders who are used to scripting languages is to get your head around how the various scaler data types like short, int, long,... (and the unsigned/hex version of each) are represented and how they relate to each other and how they relate to the platform.<p>I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?
This looks a lot like the Tiny Pascal compiler that BYTE published a listing of back in 1978.<p><a href="http://www.trs-80.org/tiny-pascal/" rel="nofollow noreferrer">http://www.trs-80.org/tiny-pascal/</a><p>I figured out the basics of how a compiler works by going through it line by line.
It is interesting to think that 500 lines of code is something one can write in one or two days. But, writing a C compiler in 500 of comprehensible code (even in python) is challenge in itself that may take months after a few years of solid learning.<p>I wonder if is this a good path to becoming an extremely productive developer. If some one spends time developing projects like this, but for different areas... A kernel, a compressor, renderer, multimedia/network stack, IA/ML... Will that turn a good dev into a 0.1 Bellard?
Writing your own compiler<p>- demystifies compilers, interpreters, linkers/loaders and related systems software, which you now understand. This understanding will no doubt one day help in your debugging efforts;<p>- elevates you to become a higher level developer: you are now a tool smith who can make their own language if needed (e.g. to create domain specific languages embedded in larger systems you architect).<p>So congratulations, on top of other forms of abstraction, you have mastered meta-linguistic abstraction (see the latter part of Structure and Interpretation of Computer Programs, preferably the 1st or 2nd ed.).
> [Building parse trees] is really great, good engineering, best practices, recommended by experts, etc. But... it takes too much code, so we can't do it.<p>It takes too much code <i>in Python</i>. (Not a phrase one gets to say often, but it’s generally true for tree processing code.) In, say, SML this sort of thing is wonderfully concise.
Actually with SLY (<a href="https://sly.readthedocs.io" rel="nofollow noreferrer">https://sly.readthedocs.io</a>) now dead, what is the recommended Lexer/Parser library in Python?
Just for comparison the LOCs for some other small C or C like compilers.
It's not that far away from Ritchie's<p>C4x86 | 0.6K (very close)<p>small C (x86) | 3.1K<p>Ritchie's earliest struct compiler | 2.3K<p>v7 Unix C compiler | 10.2K<p>chibicc | 8.4K<p>Biederman's romcc | 25.0K
These kinds of posts are one of the things that keeps me coming back to HN. Right when I start thinking I'm a professional badass for implementing several features with great well tested code in record time, I stumble along posts like this that set me in my place.
This is crazy cool! Esolangs have been a hobby of mine, (more just an interest lately, since I haven't built any in a while,) so this is like a fun code golf game for compilation. Nice work, and even better, nice explanation article!
I am really confused by what people call compilers nowadays. This is now a compiler that takes input text and generates output text, which then gets read by a compiler that takes input text and generates JIT code for execution.<p>This is more of a transpiler, than an actual compiler.<p>Am I missing something?
For some value of “C”:<p>> <i>Notably, it doesn't support:</i><p>> <i>structs :-( would be possible with more code, the fundamentals were there, I just couldn't squeeze it in</i><p>> <i>enums / unions</i><p>> <i>preprocessor directives (this would probably be 500 lines by itself...)</i><p>> <i>floating point. would also be possible, the wasm_type stuff is in, again just couldn't squeeze it in</i><p>> <i>8 byte types (long/long long or double)</i><p>> <i>some other small things like pre/post cremements, in-place initialization, etc., which just didn't quite fit any sort of standard library or i/o that isn't returning an integer from main()</i><p>> <i>casting expressions</i>