
Ask HN: What's Your Worst Bug?

15 points, by hliyan, over 1 year ago
Inspired by this: https://news.ycombinator.com/item?id=37158827

Some memorable ones for me:

1. Around 2006: hand-coded replication link between primary and secondary of an HFT component, each with its own data store; primary flushes its queue when secondary acks up to the message sequence number it has done a write() to disk; primary crashes in production, secondary takes over, rebuilds state from its own data store, but one sequenced message is missing; data stream cannot continue; pandemonium; and that is how I met disk write buffers in Linux. Write doesn't mean write until flushed to disk.

2. In the aftermath of (1), to solve the above problem, called fdatasync() after each message write by the secondary; production comes to a screeching halt; latencies go from around 5 ms to 100 ms; pandemonium; and that is how I learned the cost of blocking I/O in a single-threaded environment. I ended up calling fdatasync() only once in every 100 writes.
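A minimal C sketch of the batching fix described in (2), assuming a single-threaded append loop; the interval constant, file descriptor, and function names are illustrative, not the original code:

    #include <stdio.h>
    #include <unistd.h>

    #define SYNC_INTERVAL 100   /* as in the post: fdatasync() once per 100 writes */

    static int log_fd;          /* assumed opened elsewhere for appending */
    static unsigned long seq;   /* count of messages appended so far */

    void append_message(const void *msg, size_t len) {
        ssize_t n = write(log_fd, msg, len);   /* data may still sit in the page cache */
        if (n != (ssize_t)len) {
            perror("write");                   /* real code must handle short writes */
            return;
        }
        if (++seq % SYNC_INTERVAL == 0)
            fdatasync(log_fd);                 /* block on the disk, but only rarely */
    }

The trade-off is the one the post describes: latency stays off the critical path, at the cost of a bounded window of buffered messages that a crash can lose.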

5 comments

jacquesm, over 1 year ago
Not quite a bug, but I think typing 'shutdown -h now' on the wrong tab (which I thought was a test box) of a bunch of tabs monitoring live systems wasn't my best moment. Worse still because it required an actual trip to the DC to get that box up again.

3 hours of downtime on a very busy service...

The very worst bug I've ever encountered (and that still bugs me to this day, more than 20 years later) was an occasion in a hosting facility called 'Webcity' in Hoofddorp, NL. A machine called 'chopper' would periodically register as 'down' according to the watchdog, and then it would be automatically power cycled. This happened frequently enough to be a real problem. Investigating showed that the 'ping' packets that were supposed to test whether or not the box was up would randomly disappear (the responses).

After many, many hours of head scratching, changing out all kinds of gear and cabling (coaxial ethernet), we finally had *one* part left that we had not changed out: the BNC 'T' connector at the back of the machine. We all looked at each other, said 'that can't be it', and went for another round of verifying everything. Then, upon completion and just before again concluding that it couldn't be it, my colleague J suggested that maybe we should swap it out anyway. Crazy idea, but still. Problem solved.

Long story short: the BNC connector was a near-perfect sink for a part of the bitstream that made up the ping packet. I still can't figure out *how* that could be the case, but the signal dropped low enough that there was a good chance the receiving machine didn't decode it properly. And once three packets in a row had been mangled, it would reboot the box.

Total elapsed time was many hours and the fix was $1 or so. We ritually destroyed the old splitter.
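The watchdog policy described here ("once three packets in a row had been mangled it would reboot the box") amounts to a consecutive-miss counter. A self-contained C sketch, with hypothetical names and the actual power-cycle call stubbed out:

    #include <stdio.h>

    #define MAX_MISSES 3   /* three consecutive lost pings trigger a reboot */

    static int misses;

    static void power_cycle_host(void) {   /* stub for the real relay control */
        printf("power cycling host\n");
    }

    void on_ping_result(int got_reply) {
        if (got_reply) {
            misses = 0;                     /* any reply resets the streak */
            return;
        }
        if (++misses >= MAX_MISSES) {
            power_cycle_host();
            misses = 0;
        }
    }

Note how unforgiving this policy is to a lossy physical layer: with the connector eating replies at random, three misses in a row becomes a matter of time.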
helsinkiandrew, over 1 year ago
Shipped a new version of a live trading system that was still pointing to non-live permissions. The head of trading couldn't use the system; lots of swearing and name calling ensued, but all was forgiven/forgotten a few days later.

I took the blame for a junior member of my team not understanding the differences between the dev/staging/live environments.
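One common guard against this class of mistake is a startup check that refuses to boot when the build's environment label and its credentials' environment disagree. A sketch only; the variable names are hypothetical, not from the comment:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        const char *build_env = getenv("BUILD_ENV");        /* e.g. "live" */
        const char *perms_env = getenv("PERMISSIONS_ENV");  /* e.g. "staging" */

        if (!build_env || !perms_env || strcmp(build_env, perms_env) != 0) {
            fprintf(stderr, "refusing to start: build=%s, permissions=%s\n",
                    build_env ? build_env : "(unset)",
                    perms_env ? perms_env : "(unset)");
            return EXIT_FAILURE;
        }
        puts("environment check passed");
        return EXIT_SUCCESS;
    }

Failing loudly at startup beats a head of trading discovering the mismatch at his desk.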
warrenm, over 1 year ago
In college, I took down the RS/6000 that all the CS/CIS students used.

IT hadn't enabled process isolation, and a miswritten memory allocator took all available free memory (8 GB, IIRC; it was the late '99 / early '00 timeframe) and kernel panicked the box in a couple of seconds.
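The missing isolation here is the kind of per-process cap that setrlimit() provides on Unix-like systems. A hedged sketch (the limit value is illustrative, and the original machine ran AIX, where details may differ):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void) {
        /* Cap this process's address space at 64 MB before running anything risky. */
        struct rlimit cap = { .rlim_cur = 64UL << 20, .rlim_max = 64UL << 20 };
        if (setrlimit(RLIMIT_AS, &cap) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* A leaky allocator like the one described now sees malloc() fail
           instead of exhausting the whole machine's memory. */
        for (;;) {
            if (malloc(1 << 20) == NULL) {   /* 1 MB per iteration, never freed */
                fprintf(stderr, "allocation refused: limit enforced\n");
                return 0;
            }
        }
    }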
johnny99k, over 1 year ago
15 years ago, I was working on a CRUD system in the automotive industry. It had an ORM, but custom SQL was hard-coded. I was really sick that winter and had brain fog for weeks.

I accidentally left off the last half of an UPDATE statement, and whenever anyone logged into the system, it continually wrote to all records.

Luckily, we had hourly backups for production.
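The failure mode is a hand-built UPDATE whose tail, the WHERE clause, went missing. Shown below as C string constants, since the SQL was hard-coded; the table and column names are invented for illustration:

    #include <stdio.h>

    /* What shipped: with the WHERE clause accidentally dropped,
       every login rewrote every row. */
    static const char *sql_broken   = "UPDATE accounts SET last_login = NOW()";

    /* What was intended: scope the write to the one user logging in. */
    static const char *sql_intended = "UPDATE accounts SET last_login = NOW() "
                                      "WHERE account_id = ?";

    int main(void) {
        printf("broken:   %s\nintended: %s\n", sql_broken, sql_intended);
        return 0;
    }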
armchairhacker, over 1 year ago
An LLVM code generation bug. Imagine loading gdb and reproducing the crash, only to see a stack trace where every frame is `????????` and raw assembly (because the upper half is generated code and the lower half is corrupted). Then you finally determine a global variable (a field in a struct pointer) is responsible for the crash, run rr, and find a C function which modifies said variable... except it's modifying a completely unrelated local variable (of a different pointer type, even) which happens to have the same address as this field.

Also, it's hard to find a minimal example, and the range of code where the bug could be is very large.

Perhaps LLVM is compiling an allocation which somehow allocated memory in the same place as the global's allocation? Or perhaps the global is somehow getting assigned to the variable (though it seems trivially not; maybe LLVM is compiling arguments wrong)? I don't actually know; I haven't fixed the bug, but it's not currently an issue.

How do you even fix something like that? I'd probably insert logs in the generated code to narrow down where the bug could be. But I have no idea how the local variable gets the same address as the global's field...
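A hypothetical check one might splice into such a debugging session: print the addresses the compiled code actually uses for the global's field and for the suspect local, since the bug's signature is the two coinciding. All names here are invented:

    #include <stdio.h>

    struct State { long field; };

    static struct State g_state;
    static struct State *g_ptr = &g_state;   /* the global struct pointer */

    void suspect_function(void) {
        int local = 42;   /* the unrelated local of a different type */

        /* If these ever print the same address, codegen has aliased
           two objects that the source says are distinct. */
        printf("global field at %p, local at %p\n",
               (void *)&g_ptr->field, (void *)&local);
    }

    int main(void) {
        suspect_function();
        return 0;
    }

In the scenario described, the interesting version of this check lives in the JIT-generated half of the stack, where inserting it means emitting the logging call from the code generator itself.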