CrowdStrike's outage should not have happened

49 points by b-man 10 months ago

16 comments

beardedwizard 10 months ago
This outage says more to me about the state of software engineering in 2024 than it does about CrowdStrike. Start with the fact that software like CrowdStrike exists to compensate for even poorer software rife with exploitable vulnerabilities. It is certainly hard to defend CrowdStrike, but it is even harder to hear so many hot takes when the engineer emperor has no clothes.

Critical software engineering is a race to the bottom across many domains: healthcare, banking, flight systems, etc.
ethbr1 10 months ago
>> *No staged deployment {changing to} Add staged deployment*

That's the thing that amazed me.

How do you *regularly* YOLO patches worldwide to something that runs with enough permissions to crash a system?

I don't care if this was a configuration update vs a new sensor capability -- universal rollout should never have been allowed by CrowdStrike's release team.
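For readers unfamiliar with the term, a staged deployment can be sketched roughly as follows. This is a minimal illustration, not CrowdStrike's process: `Stage`, `staged_rollout`, the `deploy` callback, and the `max_failure_rate` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Stage:
    """A named group of hosts that receives the update together (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    stages: Sequence[Stage],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy stage by stage and halt at the first unhealthy stage.

    `deploy` returns True when a host is healthy after the update.
    Returns the names of the stages that completed within the failure limit.
    """
    completed: List[str] = []
    for stage in stages:
        failures = sum(1 for host in stage.hosts if not deploy(host))
        failure_rate = failures / len(stage.hosts) if stage.hosts else 0.0
        if failure_rate > max_failure_rate:
            break  # later, larger stages never receive the update
        completed.append(stage.name)
    return completed
```

With a small canary stage first, an update that breaks every host is stopped after the canary hosts rather than reaching the whole fleet.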
eugenekolo 10 months ago
The sentiment might hold some merit, but this article is 80% copy-pasted from an RCA report plus 2 sentences saying nothing more than "This shouldn't happen", while offering no alternative or deeper thought on improvement...
dwheeler 10 months ago
The CrowdStrike report explains why it crashed, but *not* how it passed final end-to-end testing. There appear to have been many tests of the individual parts (unit testing), but that's not the same as testing the full system.

I would think *any* end-to-end test of the full system would have instantly detected the problem and prevented it, because the update would have failed every one of them.

Did I miss something? Did they never test the complete system as deployed? Looks like it, but maybe I misunderstood something.
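The kind of end-to-end check described here can be sketched as a release gate that feeds the real update payload to the real consumer in an isolated process and blocks the release if that process dies. Everything below is an assumption made for illustration: `CONSUMER_SOURCE` is a toy stand-in for the production parser, the comma-separated payload format and field count are invented, and `survives_update` and `release_gate` are hypothetical names.

```python
import subprocess
import sys

# Hypothetical stand-in for the component that consumes the update in
# production: it reads a comma-separated record from stdin and uses the
# 21st field. The format and field count are invented for this sketch.
CONSUMER_SOURCE = """
import sys
fields = sys.stdin.read().strip().split(",")
print(fields[20])
"""


def survives_update(payload: str,
                    consumer_source: str = CONSUMER_SOURCE,
                    timeout_s: float = 30.0) -> bool:
    """Run the consumer in a child process and report whether it exited cleanly."""
    result = subprocess.run(
        [sys.executable, "-c", consumer_source],
        input=payload,
        text=True,
        capture_output=True,
        timeout=timeout_s,
    )
    return result.returncode == 0


def release_gate(payload: str) -> str:
    """Return "ship" only if the full consumer survives the exact payload."""
    return "ship" if survives_update(payload) else "block"
```

A payload with too few fields crashes the child process, so the gate returns "block" even if every unit test of the individual parts passed.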
greenthrow 10 months ago
Extremely low quality post by the submitter. Yes, these shouldn't happen, but software engineers -- so far -- are all human. It's more useful to talk about the ways this could be mitigated than to just post a few sentences repeating that it shouldn't happen.
siliconc0w 10 months ago
What commonly happens in these organizations is that they have a software delivery path with a lot of these best practices, but soon people figure out that it is too slow, so they invent a new, faster path. From what I can tell, CrowdStrike had a lot of the usual best practices, like canary rollouts, on their binary, but not on this configuration file, despite it having the same consequences as a bad binary push. This wasn't even an edge case; it reliably BSOD'd every Windows machine that got this update.

One strategy Google SRE uses is that the team ensuring reliability has a different reporting path than the product team, so there is always a check and balance when things like rollout policies get worked around by clever product teams.

It's a shame, because I hear it's actually a pretty good product.
zamadatix 10 months ago
What is whooshing over my head about "Figure 1"?
satisfice 10 months ago
I’m not concerned about the technical solutions. Any technical solution has to be implemented by people.

The thing not mentioned in CrowdStrike’s report is anything about people, especially management. Bad management and understaffed teams will defeat any technical solution, any day.
halayli 10 months ago
> Multiple engineers identified the issue via analysis of stack dumps as being triggered by a null pointer bug in the C++ the Crowdstrike update was written in; it appears to have tried to call an invalid region of memory that results in a process getting immediately killed by Windows, but that take looked increasingly controversial and Crowdstrike itself said that the incident was not due to "null bytes contained within Channel File 291 [the update that triggered the crashes] or any other Channel File."
randerson 10 months ago
Nation-state hackers of the world must *love* the idea of a supply chain that pushes out immediate untested updates to half the US Fortune 500, to be processed by a C++ kernel driver. If CrowdStrike's goal is to secure companies at scale, they could easily be doing the opposite.
luxuryballs 10 months ago
I still think it was intentional: someone activated the CrowdStrike feature that was purchased by the DoD.

Maybe people with inside knowledge of recent events were trying to make an exit so they had to smash the glass and hit the red button to stop air travel so they could snag them in time?

Making it a perfect update failure is clever enough, but the name of the product is the best part. Imagine a system that can stop breaches even after they occur ;)
Dwedit 10 months ago
So you have an "Index Out Of Bounds" problem. It could either directly lead to reading out-of-bounds memory and generating an Access Violation exception, or you could see the out-of-bounds array access and throw an exception.

Either way, you've got a kernel-mode exception that isn't being caught, and that's a BSOD.
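The distinction drawn here, between detecting the bad index and actually handling the resulting error, can be sketched in user-space Python. This is only an analogy under stated assumptions: `FieldIndexError`, `read_field`, and `evaluate_rule` are hypothetical names, and a kernel driver has no equivalent safety net unless the code provides one.

```python
from typing import Optional, Sequence


class FieldIndexError(LookupError):
    """Raised when a rule asks for a field the input does not contain."""


def read_field(values: Sequence[str], index: int) -> str:
    """Return values[index] after an explicit bounds check."""
    # Checked explicitly: Python would silently wrap a negative index,
    # and a raw C or C++ array read would not check the index at all.
    if not 0 <= index < len(values):
        raise FieldIndexError(
            f"field {index} requested, only {len(values)} supplied"
        )
    return values[index]


def evaluate_rule(values: Sequence[str], index: int) -> Optional[str]:
    """Catch the bounds error so one bad rule is skipped instead of fatal."""
    try:
        return read_field(values, index)
    except FieldIndexError:
        return None  # reject this rule; the caller keeps running
```

Detection alone (`read_field`) still ends in an unhandled error; it is the handler in `evaluate_rule` that turns a crash into a rejected rule.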
insane_dreamer 10 months ago
That's quite a list of problems with that update; it wasn't just a single bug that slipped through the cracks.
999900000999 10 months ago
CrowdStrike outsourced their SDET positions to save a buck.

This is what happens. Stop skimping on QA.
jokoon 10 months ago
I'm not a fan of Rust, but if Microsoft required that this sort of critical software be written in Rust, it would be a good thing.

Anything that is doing something sensitive or critical that can crash the system should be written in Rust.

If not, insurance companies should be mandated by law to run static analysis on such C++ code.
echelon 10 months ago
There is *no reason* to not use Rust for these systems anymore. It's why the US Government is pushing so much for Rust adoption.

We're going to keep seeing these horror stories until C/C++ go away.