The worst bug we faced at Antithesis

56 points by wizerno 12 months ago

7 comments

intuitionist 12 months ago
(Disclosure: I’m an Antithesis employee.)

It’s briefly mentioned in a footnote here, but we have a *lot* of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts. My personal favorite: we expected a certain hyperproperty related to determinism to hold during a refactor of the component on the other end of the hypervisor, but it was only holding some of the time, depending on the values of some parameters that were getting randomized during our testing. We dug in and figured out that, because we were round-robining across proposers of protocol messages into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and totally failed if they were coprime! If they had a smaller common factor greater than 1, there would be “partial determinism.” We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
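The divisibility pattern in this comment can be reproduced with a toy model. The sketch below is an assumption made for illustration, not Antithesis's actual protocol: message `t` is taken to come from proposer `t % n` and to land in pipeline `t % m`, and the names `routing` and `classify` are hypothetical. Under that model each proposer reaches `m / gcd(n, m)` pipelines and each pipeline hears from `n / gcd(n, m)` proposers, so the mapping is fixed when one count divides the other, fully mixed when they are coprime, and partially mixed in between.

```python
from math import gcd


def routing(num_proposers, num_pipelines):
    """Toy model: message t comes from proposer t % n and lands in pipeline t % m."""
    # Pipelines reached by each proposer, and proposers heard by each pipeline.
    fan_out = {p: set() for p in range(num_proposers)}
    fan_in = {q: set() for q in range(num_pipelines)}
    # n * m steps is a multiple of lcm(n, m), so every reachable pair appears.
    for t in range(num_proposers * num_pipelines):
        p, q = t % num_proposers, t % num_pipelines
        fan_out[p].add(q)
        fan_in[q].add(p)
    return fan_out, fan_in


def classify(num_proposers, num_pipelines):
    """Label the mixing behaviour of the toy model from the gcd alone."""
    g = gcd(num_proposers, num_pipelines)
    if g == min(num_proposers, num_pipelines):
        return "fixed"            # one count divides the other
    if g == 1:
        return "fully mixed"      # coprime counts
    return "partially mixed"      # common factor strictly between 1 and min(n, m)


if __name__ == "__main__":
    for n, m in [(4, 2), (2, 4), (3, 5), (4, 6)]:
        fan_out, fan_in = routing(n, m)
        print(n, m, classify(n, m),
              "pipelines per proposer:", len(fan_out[0]),
              "proposers per pipeline:", len(fan_in[0]))
```

With 4 proposers and 6 pipelines, for example, each proposer cycles through 3 of the 6 pipelines, which is one way to read the "partial determinism" case.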
justinsaccount 12 months ago
If you are ever building a platform and have control over everything, one thing that can make problems like this easier to find is to not use regular intervals like 5/15/30/60 minutes everywhere.

At some point you'll have a weird problem, or a load spike that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you will have 2 things running every 15 minutes and 3 things running every 30 minutes, you won't necessarily know which one causes the issue.

If you use (co)prime numbers, say, 5/7/11/13/17/19 as intervals: One, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
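A small sketch of the arithmetic behind this suggestion, assuming every task starts at minute 0 and fires on exact multiples of its interval (real schedulers add start offsets and jitter). The function names `tasks_firing`, `collisions`, and `suspects` are hypothetical helpers, not part of any scheduler API.

```python
def tasks_firing(minute, intervals):
    """Intervals (in minutes) whose task fires at this minute."""
    return [i for i in intervals if minute % i == 0]


def collisions(intervals, horizon):
    """Minutes in 1..horizon at which more than one task fires together."""
    result = {}
    for minute in range(1, horizon + 1):
        fired = tasks_firing(minute, intervals)
        if len(fired) > 1:
            result[minute] = fired
    return result


def suspects(observed_period, intervals):
    """Tasks that run at every occurrence of a problem seen every observed_period minutes."""
    return [i for i in intervals if observed_period % i == 0]


if __name__ == "__main__":
    regular = [5, 15, 30, 60]
    coprime = [5, 7, 11, 13, 17, 19]

    print(collisions(regular, 60))   # several tasks pile up at 15, 30, 45, 60
    print(collisions(coprime, 60))   # only pairwise overlaps, at 35 and 55

    print(suspects(30, regular))     # three candidates for a 30-minute problem
    print(suspects(17, coprime))     # a single candidate for a 17-minute problem
```

Pairwise coprime intervals still overlap occasionally (5 and 7 meet every 35 minutes), but each overlap involves few tasks, and an observed period points at a short list of candidates.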
rdg42 12 months ago
Great read!

But...

“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”

Isn't this "sysadmin 101"? Like... the first thing to check on any server exhibiting weird behaviour? :-) A message about a NIC going up & down every 30min would have triggered many here instantly.

Interesting journey nevertheless!
cbanek 12 months ago
Seems like the other lesson is every time you're adding a 9 to your uptime by fixing a bug, it's going to take longer each time to find those issues, either on wall time or dev time.
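One way to put rough numbers on this, as a back-of-the-envelope sketch rather than anything stated in the article: assume each run fails independently with probability 10^-k at "k nines" of reliability, so the number of runs until the first failure is geometric with mean 10^k. The function name `expected_runs_to_failure` is hypothetical.

```python
from fractions import Fraction


def expected_runs_to_failure(nines):
    """Mean runs until the first failure if each run fails with probability 10**-nines."""
    failure_probability = Fraction(1, 10 ** nines)
    # Mean of a geometric distribution with success probability p is 1 / p.
    return 1 / failure_probability


if __name__ == "__main__":
    for nines in range(2, 6):
        print(f"{nines} nines: about {expected_runs_to_failure(nines)} runs per observed failure")
```

Under that assumption each added nine multiplies the expected wait for a reproduction by ten, whether the cost is paid in wall time or in developer time.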
ajkjk 12 months ago
So why the 8 minute offset? I think they never said?
nusl 12 months ago
Kudos. We have a similar unknown bug at work so we’ll see how it goes as we scale. Folks aren’t currently giving the fix too high of a priority but I suspect it will become a real problem soon enough.
maherbeg 12 months ago
I'm curious what the fix was, presumably just retry?