<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Incident on tiago mendes</title>
    <link>https://tiago.mendes.im/tags/incident/</link>
    <description>Recent content in Incident on tiago mendes</description>
    <generator>Hugo -- 0.147.0</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 15 Dec 2024 10:00:00 +0000</lastBuildDate>
    <atom:link href="https://tiago.mendes.im/tags/incident/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The CrowdStrike Lesson</title>
      <link>https://tiago.mendes.im/posts/the-crowdstrike-lesson/</link>
      <pubDate>Sun, 15 Dec 2024 10:00:00 +0000</pubDate>
      <guid>https://tiago.mendes.im/posts/the-crowdstrike-lesson/</guid>
      <description>One vendor, one bad file, 8.5 million blue screens. What July 19th taught me about trusting the software that&amp;rsquo;s supposed to be protecting us.</description>
<content:encoded><![CDATA[<p>Ok so, remember that Friday in July 2024 when half the internet basically just stopped for the day? Under the hood, CrowdStrike had pushed one broken file and taken down roughly 8.5 million Windows machines worldwide. Around 4am UTC on the 19th they sent a routine update to their Falcon sensor (Falcon is their EDR, which is basically a souped-up antivirus that sits deep in the operating system and watches everything a computer is doing), and within minutes airlines couldn&rsquo;t fly, hospitals pushed back surgeries, 911 lines went quiet, TV channels went black. And nobody was attacking anything; the antivirus had tripped over itself.</p>
<h1 id="ok-but-what-actually-broke">Ok but what actually broke</h1>
<p>Falcon ships with a kernel-mode driver, which is a fancy way of saying a piece of code that runs right next to the Windows kernel with full access to memory and hardware. EDRs need to live there because attackers often operate at that level too. That driver pulls in small configuration files called &ldquo;channel files&rdquo; constantly, little rule updates about new threats, and that morning one of them (<code>C-00000291</code>) shipped with malformed data inside. The driver tried to read a memory address it shouldn&rsquo;t have (an out-of-bounds read, in the jargon), and because kernel code runs in ring 0 (the most privileged level the CPU offers) there&rsquo;s no graceful &ldquo;oh well, that thread crashed&rdquo; fallback. When a ring-0 driver faults, the whole OS faults with it. Blue screen, reboot, load the same broken file, blue screen again.</p>
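<p>This isn&rsquo;t Falcon&rsquo;s actual code or file format, obviously, but the bug class is easy to sketch. Hypothetical Python, pretending a channel file is a declared entry count followed by the rule entries themselves:</p>

```python
# Hypothetical sketch of the bug class, NOT CrowdStrike's real format or code.
# Pretend a "channel file" is a header count plus a list of rule entries.

def parse_channel_file_unsafe(header_count: int, entries: list[str]) -> list[str]:
    # Trusts the header blindly: if it claims more entries than actually
    # exist, we read past the end of the data. In Python that's a tidy
    # IndexError; in a ring-0 C driver it's an out-of-bounds read, and
    # the whole OS bug-checks along with it.
    return [entries[i] for i in range(header_count)]

def parse_channel_file_safe(header_count: int, entries: list[str]) -> list[str]:
    # Validate the input before touching the memory it describes.
    if header_count != len(entries):
        raise ValueError("malformed channel file: count/data mismatch")
    return [entries[i] for i in range(header_count)]

good = parse_channel_file_safe(3, ["rule-a", "rule-b", "rule-c"])
# parse_channel_file_unsafe(21, ["rule-a", "rule-b", "rule-c"]) reads out of bounds
```

<p>The point isn&rsquo;t that input validation is hard, it&rsquo;s that in ring 0 the unsafe version doesn&rsquo;t get a second chance.</p>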
<p>Now picture that happening on thousands of machines at once inside an airport. The &ldquo;fix&rdquo; was that someone physically had to walk up to every single one, boot into safe mode, delete the file by hand, and reboot, because the driver loaded before Windows could even reach the network for a remote patch. You saw the consequences of that for days on the news: Delta staff scribbling flight info on whiteboards, gate screens stuck on blue, passengers camping out on terminal floors in Atlanta with handwritten boarding passes, radiology departments turning people away because the scanners wouldn&rsquo;t come back up.</p>
<p><img alt="Blue screens across the monitors at LaGuardia Airport during the outage" loading="lazy" src="/i1/CrowdStrike_BSOD_at_LGA.jpg">
<em>LaGuardia Airport, July 19th 2024. Photo: Smishra1, CC BY-SA 4.0 via Wikimedia Commons.</em></p>
<h1 id="the-awkward-bit">The awkward bit</h1>
<p>None of this was exotic. CrowdStrike isn&rsquo;t some sketchy vendor; they&rsquo;re one of the biggest names in the business, and Falcon is everywhere because it actually works well. The update pipeline did exactly what it was designed to do, which was take a signed, vendor-approved file and ship it to every sensor on the planet in a few minutes. Everything worked as intended, and that&rsquo;s kind of the uncomfortable part.</p>
<p>The thing that makes modern EDR useful (push fast, push everywhere, run in ring 0 because that&rsquo;s where the interesting stuff happens) is the same thing that turned one bad file into a worldwide outage. No attacker, no zero-day, nothing clever going on, just a signed update shipped at scale with nothing meaningful in the way of a circuit breaker.</p>
<h1 id="what-i-actually-took-from-it">What I actually took from it</h1>
<h2 id="your-security-tool-is-a-supply-chain-too">Your security tool is a supply chain too</h2>
<p>We spend a lot of time worrying about supply chain attacks, meaning malicious code slipped in through some dependency like a compromised npm package. We spend way less time worrying about supply chain <em>accidents</em>, where the code is fine, signed correctly, and just behaves badly anyway. From the customer&rsquo;s seat, the outcome is pretty much identical: the kernel still panics and the plane still doesn&rsquo;t move.</p>
<p>Any vendor that can auto-push code into ring 0 on every machine in your fleet is, by definition, a single point of failure for that fleet, and it&rsquo;s worth being honest about that rather than filing it under &ldquo;boring dependency.&rdquo;</p>
<h2 id="defense-in-depth-is-doing-more-lifting-than-it-can">&ldquo;Defense in depth&rdquo; is doing more lifting than it can</h2>
<p>Everyone says &ldquo;defense in depth&rdquo; like it&rsquo;s a magic phrase, but in practice most places run one EDR on every endpoint, from laptops to servers to point-of-sale machines. When that EDR breaks, there&rsquo;s no second layer underneath; the whole defensive stack goes with it. Real depth implies heterogeneity (different vendors for different tiers, or different OSes), and heterogeneity costs money and complexity that nobody volunteers for.</p>
<p>The outage didn&rsquo;t really punish companies for having bad security, it punished them for having the <em>same</em> security everywhere, which is a genuinely different problem.</p>
<h2 id="staged-rollouts-even-for-just-config">Staged rollouts, even for &ldquo;just config&rdquo;</h2>
<p>The detail from the post-mortem that really stuck with me was that the file went out to everyone at once. No ring deployment (shipping to 1% of machines first, then 10%, then the rest), no canary fleet, nothing. The reasoning was basically &ldquo;this is threat intelligence content, not code, content is low risk, threats move fast, we gotta ship fast.&rdquo; Which is also the exact reasoning that lets a broken file land on 8.5 million machines before anything can stop it.</p>
<p>The content-vs-code distinction is convenient for the vendor but doesn&rsquo;t give the customer any real safety. If a file can crash your kernel, it&rsquo;s code in every sense that matters, and it deserves the same care: canary deployments, ring rollouts, a rollback mechanism, a pause button customers can reach.</p>
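<p>A ring rollout with a circuit breaker isn&rsquo;t exotic to implement, either. Here&rsquo;s a hypothetical Python sketch (the ring sizes, health check, and threshold are all made up for illustration):</p>

```python
# Hypothetical ring-deployment sketch with a circuit breaker. All the
# names and numbers here are invented, not any vendor's real pipeline.

def staged_rollout(fleet, rings, is_healthy, failure_threshold=0.01):
    """Push an update ring by ring; halt if any ring's failure rate
    exceeds the threshold instead of continuing to the whole fleet.
    `rings` is a list of cumulative fractions, e.g. [0.01, 0.10, 1.0]."""
    deployed = 0
    for fraction in rings:
        target = int(len(fleet) * fraction)
        batch = fleet[deployed:target]
        failures = sum(1 for host in batch if not is_healthy(host))
        deployed = target
        if batch and failures / len(batch) > failure_threshold:
            return ("halted", deployed)  # circuit breaker trips here
    return ("complete", deployed)

fleet = [f"host-{i}" for i in range(1000)]
# A broken update: every machine that receives it falls over.
status, reached = staged_rollout(fleet, [0.01, 0.10, 1.0],
                                 is_healthy=lambda host: False)
# status == "halted", reached == 10: the 1% canary ring eats the damage
# instead of all 1000 machines.
```

<p>Even this dumb version turns &ldquo;8.5 million machines&rdquo; into &ldquo;the canary ring,&rdquo; which is the whole argument for treating kernel-crashing content like code.</p>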
<h2 id="the-recovery-plan-is-kinda-the-real-product">The recovery plan is kinda the real product</h2>
<p>Every vendor has an incident response plan. Almost nobody has one for the scenario where the vendor <em>is</em> the incident, and that&rsquo;s what separated the people who handled July 19th okay from the ones who didn&rsquo;t. The ones who did okay had thought about it ahead of time: BitLocker recovery keys (which you need just to get into an encrypted Windows machine) stored somewhere reachable and not only on the servers that were also down; out-of-band management like IPMI or iLO to get into machines without the OS; runbooks written from the assumption that the security tool itself could be the thing breaking everything.</p>
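<p>None of that has to be fancy. Even a dumb preflight check would catch the circular dependency before the bad day. A hypothetical Python sketch (the asset and host names are made up):</p>

```python
# Hypothetical recovery-plan preflight: flag recovery assets that live on
# infrastructure running the same EDR they're supposed to help recover.

def preflight(recovery_assets: dict[str, str], edr_protected: set[str]) -> list[str]:
    """recovery_assets maps each asset (runbooks, BitLocker keys, ...) to
    the system it's stored on. Returns the assets that would be
    unreachable if every EDR-protected machine blue-screened at once."""
    return sorted(asset for asset, host in recovery_assets.items()
                  if host in edr_protected)

assets = {
    "bitlocker-keys": "ad-server-01",   # stored in AD... which runs the EDR
    "runbooks": "printed-binder",       # paper survives a kernel panic
    "ipmi-credentials": "vault-appliance",
}
protected = {"ad-server-01", "file-server-02"}
# preflight(assets, protected) → ["bitlocker-keys"]
```

<p>If that list comes back non-empty, the plan assumes the vendor is healthy, which is exactly the assumption July 19th broke.</p>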
<p>If your recovery plan quietly assumes your security vendor is healthy, it&rsquo;s less of a plan and more of an aspiration.</p>
<h1 id="what-i-keep-thinking-about">What I keep thinking about</h1>
<p>Security software earns its place by being trusted, privileged, and basically everywhere. You can&rsquo;t really design it any other way, an EDR that can&rsquo;t see inside the kernel can&rsquo;t catch an attacker who&rsquo;s operating in the kernel. But the flip side of &ldquo;trusted, privileged, everywhere&rdquo; is that one bad push from the vendor can become everybody&rsquo;s problem at the same moment.</p>
<p>CrowdStrike didn&rsquo;t teach me Falcon is bad, it isn&rsquo;t. What it reminded me of is that thinking about security software purely as a protector is only half the picture. It&rsquo;s also a dependency in the hard operational sense, sitting in the kernel with update access to thousands of machines, and a dependency that powerful needs to be treated as one, with some redundancy where you can afford it, some staged trust in the update pipeline, and an actual plan for the day something about it goes sideways.</p>
<p>One vendor shouldn&rsquo;t really be able to take down an airline, or a hospital, or someone&rsquo;s emergency line, and when it happens, honestly it&rsquo;s less about blaming CrowdStrike specifically and more about how we keep architecting systems in a way that lets one vendor have that kind of reach.</p>
<p>Anyway, maybe keep a BitLocker key on paper somewhere :)</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
