
Ask HN: What's Your Worst Bug?

15 points, by hliyan, over 1 year ago
Inspired by this: https://news.ycombinator.com/item?id=37158827

Some memorable ones for me:

1. Around 2006: hand-coded replication link between primary and secondary of an HFT component, each with its own data store; primary flushes its queue when secondary acks up to the message sequence number it has done a write() to disk; primary crashes in production, secondary takes over, rebuilds state from its own data store, but one sequenced message is missing; data stream cannot continue; pandemonium; and that is how I met disk write buffers in Linux. Write doesn't mean write until flushed to disk.

2. In the aftermath of (1), to solve the above problem, called fdatasync() after each message write by the secondary; production comes to a screeching halt; latencies go from around 5 ms to 100 ms; pandemonium; and that is how I learned the cost of blocking I/O in a single-threaded environment. I ended up calling fdatasync() only once in every 100 writes.
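A minimal C sketch of the batching fix described in (2), assuming a single-threaded append loop; the interval constant, file descriptor, and function names are illustrative, not the original code:

    #include <stdio.h>
    #include <unistd.h>

    #define SYNC_INTERVAL 100   /* as in the post: fdatasync() once per 100 writes */

    static int log_fd;          /* assumed opened elsewhere for appending */
    static unsigned long seq;   /* count of messages appended so far */

    void append_message(const void *msg, size_t len) {
        ssize_t n = write(log_fd, msg, len);   /* data may still sit in the page cache */
        if (n != (ssize_t)len) {
            perror("write");                   /* real code must handle short writes */
            return;
        }
        if (++seq % SYNC_INTERVAL == 0)
            fdatasync(log_fd);                 /* block on the disk, but only rarely */
    }

The trade-off is the one the post describes: latency stays off the critical path, at the cost of a bounded window of buffered messages that a crash can lose.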

5 comments

jacquesm, over 1 year ago
Not quite a bug, but I think typing 'shutdown -h now' on the wrong tab (which I thought was a test box) of a bunch of tabs monitoring live systems wasn't my best moment. Worse still because it required an actual trip to the DC to get that box up again.

3 hours of downtime on a very busy service...

The very worst bug I've ever encountered (and that still bugs me to this day, more than 20 years later) was an occasion in a hosting facility called 'Webcity' in Hoofddorp, NL. A machine called 'chopper' would periodically register as 'down' according to the watchdog, and then it would be automatically power cycled. This happened frequently enough to be a real problem. Investigating showed that the 'ping' packets that were supposed to test whether or not the box was up would randomly disappear (the responses).

After many, many hours of head scratching, changing out all kinds of gear and cabling (coaxial ethernet), we finally had *one* part left that we had not changed out: the BNC 'T' connector at the back of the machine. We all looked at each other, said 'that can't be it', and went for another round of verifying everything. Then, upon completion and just before again concluding that it couldn't be it, my colleague J suggested that maybe we should swap it out anyway. Crazy idea, but still. Problem solved.

Long story short: the BNC connector was a near-perfect sink for a part of the bitstream that made up the ping packet. I still can't figure out *how* that could be the case, but the signal dropped low enough that there was a good chance the receiving machine didn't decode it properly. And once three packets in a row had been mangled, it would reboot the box.

Total elapsed time was many hours and the fix was $1 or so. We ritually destroyed the old splitter.
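The watchdog policy described here ("once three packets in a row had been mangled it would reboot the box") amounts to a consecutive-miss counter. A self-contained C sketch, with hypothetical names and the actual power-cycle call stubbed out:

    #include <stdio.h>

    #define MAX_MISSES 3   /* three consecutive lost pings trigger a reboot */

    static int misses;

    static void power_cycle_host(void) {   /* stub for the real relay control */
        printf("power cycling host\n");
    }

    void on_ping_result(int got_reply) {
        if (got_reply) {
            misses = 0;                     /* any reply resets the streak */
            return;
        }
        if (++misses >= MAX_MISSES) {
            power_cycle_host();
            misses = 0;
        }
    }

Note how unforgiving this policy is to a lossy physical layer: with the connector eating replies at random, three misses in a row becomes a matter of time.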
helsinkiandrew, over 1 year ago
Shipped a new version of a live trading system that was still pointing to non-live permissions. The head of trading couldn't use the system; lots of swearing and name calling ensued, but all was forgiven/forgotten a few days later.

I took the blame for a junior member of my team not understanding the differences between the dev/staging/live environments.
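One common guard against this class of mistake is a startup check that refuses to boot when the build's environment label and its credentials' environment disagree. A sketch only; the variable names are hypothetical, not from the comment:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        const char *build_env = getenv("BUILD_ENV");        /* e.g. "live" */
        const char *perms_env = getenv("PERMISSIONS_ENV");  /* e.g. "staging" */

        if (!build_env || !perms_env || strcmp(build_env, perms_env) != 0) {
            fprintf(stderr, "refusing to start: build=%s, permissions=%s\n",
                    build_env ? build_env : "(unset)",
                    perms_env ? perms_env : "(unset)");
            return EXIT_FAILURE;
        }
        puts("environment check passed");
        return EXIT_SUCCESS;
    }

Failing loudly at startup beats a head of trading discovering the mismatch at his desk.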
warrenm, over 1 year ago
In college, I took down the RS/6000 that all the CS/CIS students used.

IT hadn't enabled process isolation, and a miswritten memory allocator took all available free memory (8 GB, IIRC; it was the late '99 / early '00 timeframe) and kernel panicked the box in a couple of seconds.
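The missing isolation here is the kind of per-process cap that setrlimit() provides on Unix-like systems. A hedged sketch (the limit value is illustrative, and the original machine ran AIX, where details may differ):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap this process's address space at 64 MB before running anything risky. */
        struct rlimit cap = { .rlim_cur = 64UL << 20, .rlim_max = 64UL << 20 };
        if (setrlimit(RLIMIT_AS, &cap) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* A leaky allocator like the one described now sees malloc() fail
           instead of exhausting the whole machine's memory. */
        for (;;) {
            if (malloc(1 << 20) == NULL) {   /* 1 MB per iteration, never freed */
                fprintf(stderr, "allocation refused: limit enforced\n");
                return 0;
            }
        }
    }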
johnny99k, over 1 year ago
15 years ago, I was working on a CRUD system in the automotive industry. It had an ORM, but custom SQL was hard-coded. I was really sick that winter and had brain fog for weeks.

I accidentally left off the last half of an UPDATE statement, and whenever anyone logged into the system, it continually wrote to all records.

Luckily, we had hourly backups for production.
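The failure mode is a hand-built UPDATE whose tail, the WHERE clause, went missing. Shown below as C string constants, since the SQL was hard-coded; the table and column names are invented for illustration:

    #include <stdio.h>

    /* What shipped: with the WHERE clause accidentally dropped,
       every login rewrote every row. */
    static const char *sql_broken   = "UPDATE accounts SET last_login = NOW()";

    /* What was intended: scope the write to the one user logging in. */
    static const char *sql_intended = "UPDATE accounts SET last_login = NOW() "
                                      "WHERE account_id = ?";

    int main(void) {
        printf("broken:   %s\nintended: %s\n", sql_broken, sql_intended);
        return 0;
    }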
armchairhacker, over 1 year ago
An LLVM code generation bug. Imagine loading gdb and reproducing the crash, only to see a stack trace where every frame is `????????` and raw assembly (because the upper half is generated code and the lower half is corrupted). Then you finally determine a global variable (a field in a struct pointer) is responsible for the crash, run rr, and find a C function which modifies said variable... except it's modifying a completely unrelated local variable (of a different pointer type, even) which happens to have the same address as this field.

Also, it's hard to find a minimal example, and the range of code where the bug could be is very large.

Perhaps LLVM is compiling an allocation which somehow allocated memory in the same place as the global's allocation? Or perhaps the global is somehow getting assigned to the variable (though it seems trivially not; maybe LLVM is compiling arguments wrong)? I don't actually know; I haven't fixed the bug, but it's not currently an issue.

How do you even fix something like that? I'd probably insert logs in the generated code to narrow down where the bug could be. But I have no idea how the local variable gets the same address as the global's field...
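A hypothetical check one might splice into such a debugging session: print the addresses the compiled code actually uses for the global's field and for the suspect local, since the bug's signature is the two coinciding. All names here are invented:

    #include <stdio.h>

    struct State { long field; };

    static struct State g_state;
    static struct State *g_ptr = &g_state;   /* the global struct pointer */

    void suspect_function(void) {
        int local = 42;   /* the unrelated local of a different type */

        /* If these ever print the same address, codegen has aliased
           two objects that the source says are distinct. */
        printf("global field at %p, local at %p\n",
               (void *)&g_ptr->field, (void *)&local);
    }

    int main(void) {
        suspect_function();
        return 0;
    }

In the scenario described, the interesting version of this check lives in the JIT-generated half of the stack, where inserting it means emitting the logging call from the code generator itself.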