TechEcho

7 comments

intuitionist 12 months ago

(Disclosure: I’m an Antithesis employee.)

It’s briefly mentioned in a footnote here, but we have a lot of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts.

My personal favorite: we expected a certain hyperproperty related to determinism to hold during a refactor of the component on the other end of the hypervisor, but it was only holding some of the time, depending on the values of some parameters that were getting randomized during our testing. We dug in and figured out that, because we were round-robining across proposers of protocol messages into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and totally failed if they were coprime! If they had a smaller common factor greater than 1, there would be “partial determinism.”

We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
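The divisibility pattern described above can be reproduced in a toy model. This is only a sketch under a loud assumption: message i is proposed by proposer i % P and lands in pipeline i % N. The function name `round_robin_fanout` and this indexing scheme are hypothetical and are not taken from the Antithesis hypervisor protocol.

```python
from math import gcd
from typing import Tuple


def round_robin_fanout(num_proposers: int, num_pipelines: int) -> Tuple[int, int]:
    """Toy model (an assumption, not the real protocol): message i comes from
    proposer i % P and is placed into pipeline i % N.

    Returns (pipelines fed by each proposer, proposers feeding each pipeline).
    """
    # The assignment pattern repeats after lcm(P, N) messages.
    period = num_proposers * num_pipelines // gcd(num_proposers, num_pipelines)
    pipelines_of = {p: set() for p in range(num_proposers)}
    proposers_of = {q: set() for q in range(num_pipelines)}
    for i in range(period):
        p, q = i % num_proposers, i % num_pipelines
        pipelines_of[p].add(q)
        proposers_of[q].add(p)
    # In this model every proposer (and every pipeline) has the same count,
    # so inspecting index 0 is enough.
    return len(pipelines_of[0]), len(proposers_of[0])


for P, N in [(3, 6), (6, 3), (4, 6), (3, 4)]:
    print(P, N, round_robin_fanout(P, N))
```

In this model one of the two counts is 1 exactly when one number divides the other: (3, 6) gives (2, 1) and (6, 3) gives (1, 2). With a smaller common factor, (4, 6) gives (3, 2), and in the coprime case (3, 4) every proposer reaches every pipeline, giving (4, 3).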

justinsaccount 12 months ago

If you are ever building a platform and have control over everything, one thing that can make problems like this easier to find is to not use regular intervals like 5/15/30/60 minutes everywhere.

At some point you'll have a weird problem, or a load spike that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you will have 2 things running every 15 minutes and 3 things running every 30 minutes, and you won't necessarily know which one causes the issue.

If you use (co)prime numbers, say, 5/7/11/13/17/19 as intervals: One, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
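The arithmetic behind this can be checked with a short sketch. It assumes every task starts at minute 0 and fires exactly on multiples of its interval; the helper names `lcm_all` and `suspects` are made up for illustration.

```python
from functools import reduce
from math import gcd


def lcm_all(intervals):
    """Minutes until every task fires at the same moment again."""
    return reduce(lambda a, b: a * b // gcd(a, b), intervals)


def suspects(observed_period, intervals):
    """Intervals whose tasks fire on every multiple of observed_period minutes."""
    return [n for n in intervals if observed_period % n == 0]


regular = [5, 15, 30]
coprime = [5, 7, 11, 13, 17, 19]

print(suspects(15, regular))   # tasks that could explain a 15-minute pattern
print(suspects(30, regular))   # tasks that could explain a 30-minute pattern
print(suspects(17, coprime))   # only one candidate remains
print(lcm_all(regular), lcm_all(coprime))
```

With the regular intervals, a 30-minute pattern has three candidate tasks and all three tasks coincide every 30 minutes. With the coprime set, a 17-minute pattern has a single candidate, and all six tasks line up only once every 1,616,615 minutes (roughly three years).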


rdg42 12 months ago

Great read! But...

“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”

Isn't this "sysadmin 101"? Like... the first thing to check on any server exhibiting weird behaviour? :-) A message about a NIC going up & down every 30 min would have triggered many here instantly.

Interesting journey nevertheless!


cbanek 12 months ago

Seems like the other lesson is every time you're adding a 9 to your uptime by fixing a bug, it's going to take longer each time to find those issues, either on wall time or dev time.


ajkjk 12 months ago

So why the 8 minute offset? I think they never said?


nusl 12 months ago

Kudos. We have a similar unknown bug at work so we’ll see how it goes as we scale. Folks aren’t currently giving the fix too high of a priority but I suspect it will become a real problem soon enough.


maherbeg 12 months ago

I'm curious what the fix was, presumably just retry?



The worst bug we faced at Antithesis
