(Disclosure: I’m an Antithesis employee.)<p>It’s briefly mentioned in a footnote here, but we have a <i>lot</i> of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts. My personal favorite: we expected a certain hyperproperty related to determinism to hold during a refactor of the component on the other end of the hypervisor, but it was only holding some of the time, depending on the values of some parameters that were getting randomized during our testing. We dug in and figured out that, because we were round-robining across proposers of protocol messages into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and totally failed if they were coprime! If they shared a common factor greater than 1 but neither divided the other, there would be “partial determinism.” We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
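The comment above doesn't describe the actual assignment scheme, so the following is only a toy model that happens to produce the same three regimes. It assumes message `k` comes from proposer `k % P` and lands in pipeline `k % Q`; the function name `fan_out` and that mapping are both hypothetical, not Antithesis's implementation.

```python
from math import gcd

def fan_out(num_proposers, num_pipelines):
    """Toy model (an assumption): message k is sent by proposer k % P
    and lands in pipeline k % Q.

    Returns (pipelines fed per proposer, proposers feeding each pipeline),
    taking the maximum over all proposers and pipelines respectively.
    """
    feeds = {p: set() for p in range(num_proposers)}
    fed_by = {q: set() for q in range(num_pipelines)}
    # P * Q messages is a multiple of lcm(P, Q), so the pattern has fully repeated.
    for k in range(num_proposers * num_pipelines):
        p, q = k % num_proposers, k % num_pipelines
        feeds[p].add(q)
        fed_by[q].add(p)
    return (max(len(s) for s in feeds.values()),
            max(len(s) for s in fed_by.values()))

for P, Q in [(2, 4), (4, 2), (3, 4), (4, 6)]:
    print(f"P={P} Q={Q} gcd={gcd(P, Q)} fan_out={fan_out(P, Q)}")
```

In this model a proposer feeds `Q // gcd(P, Q)` pipelines and a pipeline is fed by `P // gcd(P, Q)` proposers. When one count divides the other, one of those numbers is 1, so the pairing is fixed in at least one direction. When the counts are coprime, every proposer feeds every pipeline. A shared factor in between gives the partial case.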
If you are ever building a platform and have control over everything, one thing that can make problems like this easier to find is to not use regular intervals like 5/15/30/60 minutes everywhere.<p>At some point you'll have a weird problem, or a load spike that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you will have 2 things running every 15 minutes and 3 things running every 30 minutes, and you won't necessarily know which one causes the issue.<p>If you use (co)prime numbers, say, 5/7/11/13/17/19, as intervals, then one, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
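A quick sketch of the arithmetic behind this, assuming every task first fires at minute 0 and then repeats exactly on its interval. The helper name `candidates` is made up for illustration. Tasks all coincide every `lcm` of their intervals, and an issue seen with a given period could come from any task whose interval divides that period.

```python
from math import lcm  # multi-argument lcm needs Python 3.9+

def candidates(observed_period, intervals):
    """Intervals that could explain an issue recurring every observed_period minutes."""
    return [i for i in intervals if observed_period % i == 0]

round_intervals = [5, 15, 30]
prime_intervals = [5, 7, 11, 13, 17, 19]

# How often do *all* tasks fire in the same minute?
print(lcm(*round_intervals))   # 30
print(lcm(*prime_intervals))   # 1616615 (roughly three years of minutes)

# Which tasks are suspects for a periodic issue?
print(candidates(30, round_intervals))  # [5, 15, 30]
print(candidates(17, prime_intervals))  # [17]
```

Pairs of prime intervals still overlap now and then (5 and 7 meet every 35 minutes), but the full pile-up becomes rare and each period points at a single suspect.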
Great read!<p>But...<p>“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”<p>Isn't this "sysadmin 101"?
Like... the first thing to check on any server exhibiting weird behaviour? :-)
A message about a NIC going up and down every 30 minutes would have set off alarm bells for many here instantly.<p>Interesting journey nevertheless!
Seems like the other lesson is that each time you add a 9 to your uptime by fixing a bug, the next issue takes longer to find, in either wall-clock time or dev time.
Kudos. We have a similar unknown bug at work so we’ll see how it goes as we scale. Folks aren’t currently giving the fix too high of a priority but I suspect it will become a real problem soon enough.