These kinds of posts are my favorite part of HN. The deep dives into LLMs, state machine theory, Fourier transforms, locking data structures, and memory allocation are miles above the basic posts you find by searching the internet, but not quite a full book yet.<p>The author spent 7 months tinkering and cared enough to come back and share that with us.
A side topic: how can a company find people like the author, who can really articulate how to build a system? I've been interviewing senior engineers for years, and more than 95% of them are really box drawers. They throw terms around confidently but fail miserably the moment I ask for specifics. No, I don't ask for details like how Raft works; that level of detail would probably fail every candidate. Just simple things like how you organize the data, how you route the traffic, how you manage metadata, or how your data path keeps metadata in sync. Really basic ones, yet most candidates fail. Indeed, the more senior they are, the more likely they are to have lost touch with the details -- even those who claim to work on distributed databases.
Here is a nice interactive and visual explanation of how Raft works: <a href="http://thesecretlivesofdata.com/raft/" rel="nofollow">http://thesecretlivesofdata.com/raft/</a>
I was lucky enough to take David Beazley's "Rafting Trip," a five-day training course that guides each student through building their own Raft implementation from scratch:<p><a href="https://www.dabeaz.com/raft.html" rel="nofollow">https://www.dabeaz.com/raft.html</a><p>I'd recommend the course to anyone with development experience working with or near distributed systems. David is a fantastic instructor and facilitator, and the blend of student backgrounds led to some great learning and discussion.<p>(I have no financial or personal interest here; I just loved the course.)
Thank you for the write-up eatonphil.<p>I experimentally implemented Raft in Java, but I am not very confident that I did it correctly.<p>I wish there were a way to implement stateful programs that guarantee "forward progress" and are "steady state systems" -- essentially a state machine that cannot be trapped in a state. Debugging the absence of something, like forward progress or a missing causal link, is very difficult.<p>When there are different actors in the system that can interact with each other by communicating, each can get into a number of states. There's no guarantee that the system will converge on a state from which forward progress can be made. Maybe TLA+ is the right answer here.<p>YMMV, but I find reasoning over stateful systems rather difficult. I think there are lots of hidden states that we cannot easily detect or reason about because they're in our blind spots, especially around synchronization. I think that's part of what makes multithreading and distributed systems so hard: every component can be in a different state, and if something is not where it is expected to be, the baton doesn't get passed to the correct state. If you check for something too early, you have a race condition.<p>If we could see in slow motion what was going on -- an interaction between different actors -- we could work out why something happens the way it does. But usually the logs are too numerous to get to this level of detail. I think animation can save us, but what does a Raft animation look like?<p>How often have you seen an endless spinner? It's as if a completion event was raised but never detected, and the system is waiting for something that will never occur. I want this kind of error to be impossible.
This is one form of hidden state that prevents progress.<p>I wrote an eventually consistent mesh protocol in Python and tested it with Jepsen; it is not linearizable because its consistency level is only "eventually consistent".<p>I don't understand how Raft can scale writes or reads across multiple machines, given the round-trip time of talking to other nodes.
Cool post. Wonder if it would have helped to take a look at the MIT distributed systems course on the web and YouTube - one of the projects is exactly this (Go Raft implementation)<p><a href="https://pdos.csail.mit.edu/6.824/labs/lab-shard.html" rel="nofollow">https://pdos.csail.mit.edu/6.824/labs/lab-shard.html</a>
Your note about encoding/gob being inefficient is somewhat accurate for how you're using it, but I want to talk a bit about how you could improve your use.<p>encoding/gob is intended for streams, not stateless marshals/unmarshals. The first thing sent over the stream is the type information the receiver should expect; that's why your payload was so large. After the first value of a type is sent, subsequent messages are much smaller. You can see this by extending your example to do multiple writes; each write after the first is only 10 bytes: <a href="https://play.golang.com/p/Po_iaXrTUER" rel="nofollow">https://play.golang.com/p/Po_iaXrTUER</a><p>You have to plan differently, but you could get large improvements to transmission sizes by switching to append-only files and creating the gob encoder once per file. If you find you're creating a gob encoder/decoder very often, that's a telltale sign you're not using it as intended.
This is essentially do-it-yourself etcd.<p><a href="https://github.com/etcd-io/etcd">https://github.com/etcd-io/etcd</a>
Recently, I have been watching the MIT 6.824 course, so this article appeared very timely :)<p>Here is another Raft implementation in Go: <a href="https://github.com/eliben/raft">https://github.com/eliben/raft</a>
I think that in the specific case of a key-value store (with just a get/put API), there is a more scalable way to get consensus than putting a KV store on Raft. Raft (and multi-Paxos) presents a shared-log abstraction, which is good if you want to impose a single ordering (what a replicated state machine needs). But writes to different keys are order independent, and you don't need ordering between keys to get a linearizable KV store. By going through Raft, multi-decree Paxos, or Viewstamped Replication, you pay a penalty for ensuring that all writes are sequentialised via the log. The penalty includes having to go through a single leader for all reads and writes, and lots of jitter when a leader fails, up until the new leader becomes stable.<p>In contrast, if you simply use the equivalent of basic (single-decree) Paxos per key, the writes can be leaderless. Of course, for performance reasons, each server must batch network and disk i/o.
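To illustrate the per-key idea, here's a toy single-decree Paxos acceptor in Go, with one independent acceptor instance per key. This is only the acceptor side of the protocol, and it ignores persistence, quorums across replicas, and the versioning you'd need to support overwrites; all names are made up for the sketch:

```go
package main

import "fmt"

// acceptor holds single-decree Paxos state for one key. A real system
// would persist these fields before replying to a proposer.
type acceptor struct {
	promised  int    // highest proposal number promised
	acceptedN int    // proposal number of the accepted value (0 if none)
	acceptedV string // the accepted value
}

// prepare handles phase 1: promise to ignore proposals below n, and
// report any previously accepted value so the proposer can adopt it.
func (a *acceptor) prepare(n int) (ok bool, accN int, accV string) {
	if n > a.promised {
		a.promised = n
		return true, a.acceptedN, a.acceptedV
	}
	return false, 0, ""
}

// accept handles phase 2: accept (n, v) unless a higher number was promised.
func (a *acceptor) accept(n int, v string) bool {
	if n >= a.promised {
		a.promised, a.acceptedN, a.acceptedV = n, n, v
		return true
	}
	return false
}

func main() {
	// One independent acceptor per key: no shared log, no global order.
	store := map[string]*acceptor{"x": {}, "y": {}}
	store["x"].prepare(1)
	store["x"].accept(1, "foo")
	store["y"].prepare(7) // proposals on y never conflict with x
	store["y"].accept(7, "bar")
	fmt.Println(store["x"].acceptedV, store["y"].acceptedV)
}
```

The point of the sketch is structural: because each key has its own acceptor state, a slow or contended decision on one key never blocks progress on another, which is exactly the ordering penalty a shared log imposes.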
Thanks for the post!<p>Another great blog post series about implementing Raft in Go is this one: <a href="https://eli.thegreenplace.net/2020/implementing-raft-part-0-introduction/" rel="nofollow">https://eli.thegreenplace.net/2020/implementing-raft-part-0-...</a>
I highly recommend MIT OpenCourseWare 6.824. Incredibly valuable for learning distributed systems, and one of the lab assignments is implementing Raft in Go.
<a href="http://nil.csail.mit.edu/6.824/2022/schedule.html" rel="nofollow">http://nil.csail.mit.edu/6.824/2022/schedule.html</a><p>There are a ton of fascinating and potentially frustrating edge cases and gotchas to implementing Raft correctly. There's no better way to understand it than actually implementing it, and I probably never would have done it myself without these course materials.