On November 18, thousands of websites protected by Cloudflare suffered severe and widespread outages, giving the appearance that "half the Internet" was down. On the surface it might have looked like a cyber attack or a global infrastructure problem, but the reality was much more surprising: a change to a database's permissions triggered a domino effect that repeatedly crashed one of Cloudflare's most critical services. The database in question is ClickHouse, a high-performance columnar database used for real-time analytics, chosen for its ability to execute complex queries on large volumes of data in extremely short times. It is a very powerful technology, but, like all global and shared components, it can become a point of fragility if something changes unpredictably.
What exactly happened
Cloudflare Bot Management is based on a machine learning model that analyzes each HTTP request by evaluating a set of features collected from internal systems. These features are extracted via a query on ClickHouse and are then included in a file that is regenerated every 5 minutes and distributed across the global Cloudflare network. It is therefore a highly dynamic and distributed mechanism: any error in the generation or deployment step propagates within minutes.
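The refresh cycle described above can be sketched in a few lines. This is a minimal, self-contained illustration, not Cloudflare's actual pipeline: fetch_feature_names stands in for the ClickHouse query (here it just returns fixed data), and build_feature_file stands in for the serialization step that produces the file pushed to edge nodes.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for the ClickHouse query that lists feature columns.
// In production this would hit the database; here it returns fixed names.
fn fetch_feature_names() -> Vec<String> {
    (0..60).map(|i| format!("feature_{i:02}")).collect()
}

// Build the feature-file body that would be distributed every 5 minutes.
fn build_feature_file(names: &[String]) -> String {
    // Deduplicate and sort so the output is stable across regenerations.
    let unique: BTreeSet<&String> = names.iter().collect();
    unique.iter().map(|s| s.as_str()).collect::<Vec<_>>().join("\n")
}

fn main() {
    let file = build_feature_file(&fetch_feature_names());
    println!("{} features in file", file.lines().count());
}
```

In a real deployment this generation step would run on a timer and the resulting file would be propagated to every node, which is exactly why a single bad generation spreads globally within minutes.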
Changing database permissions
The Cloudflare team was making an internal improvement to its security processes: moving away from shared accounts in favor of dedicated user accounts with explicit permissions. This step involved updating the permissions on the ClickHouse database used by Bot Management. The operation seemed completely harmless. No one imagined that an access change could alter the behavior of a query that had behaved consistently for years. And yet, that is exactly what happened.
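To see how permissions can change a query's output, consider the kind of metadata query involved. This is a hedged sketch, not Cloudflare's actual SQL: the table name and the exact filter are illustrative. A query on ClickHouse's system.columns that does not constrain the database implicitly returns one row per database the account can see, so its result set depends on the account's permissions.

```rust
// Build an illustrative metadata query. Without a database filter, the
// result set of system.columns depends on which databases are visible
// to the querying account.
fn columns_query(filter_database: bool) -> String {
    let mut q = String::from(
        "SELECT name, type FROM system.columns WHERE table = 'http_requests_features'",
    );
    if filter_database {
        // The defensive form: constrain the query to the intended database.
        q.push_str(" AND database = 'default'");
    }
    q.push_str(" ORDER BY name");
    q
}

fn main() {
    println!("fragile:   {}", columns_query(false));
    println!("explicit:  {}", columns_query(true));
}
```

The unfiltered form works fine for years as long as the account sees exactly one database; the moment a second database becomes visible, its output silently changes.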
The hidden side effect: duplicate data
Before the change, the query returned about 60 features from the default database. After the permission change, ClickHouse started reading data not only from the "default" database but also from the "r0" database. The result was an output of more than 200 features. The Bot Management module had a hard-coded limit of 200 items. When the list exceeded that threshold, the service did not handle the situation: it logged no errors and activated no fallbacks or controlled degradation; it simply crashed. A theoretically improbable condition, never seen before, suddenly became reproducible every five minutes, depending on the node generating the feature file.
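The duplication mechanism can be modeled in a few lines. The counts below are illustrative, not the incident's exact figures (the real query returned about 60 features before the change; a larger per-database count is used here so the doubled list crosses the 200-item cap): if the metadata query returns one row per visible (database, column) pair, making a second database visible duplicates every entry.

```rust
const MAX_FEATURES: usize = 200; // hard-coded cap in the consuming module

// One entry per (visible database, column) pair, mimicking an unfiltered
// metadata query. Counts are illustrative, not the real schema.
fn feature_list(visible_dbs: &[&str], columns_per_db: usize) -> Vec<String> {
    let mut out = Vec::new();
    for db in visible_dbs {
        for i in 0..columns_per_db {
            out.push(format!("{db}.f{i}"));
        }
    }
    out
}

fn main() {
    let before = feature_list(&["default"], 120);
    let after = feature_list(&["default", "r0"], 120);
    println!("before: {} (within limit: {})", before.len(), before.len() <= MAX_FEATURES);
    println!("after:  {} (within limit: {})", after.len(), after.len() <= MAX_FEATURES);
}
```

Nothing about the query text changed; only the set of visible databases did, which is why the problem never showed up in years of operation.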
Why it looked like a DDoS attack
The configuration file was regenerated and redistributed every five minutes. This meant that some instances on the Cloudflare network received a correct file, while others received a corrupted one. The result was intermittent and irregular behavior: services dropping and recovering, nodes asynchronously entering and exiting the error condition. Such a pattern is also typical of a distributed DDoS attack, especially when it affects multiple independent regions. To make matters worse, the Cloudflare status page also went down during the incident, for a completely unrelated reason. A deadly coincidence, which led engineers down the wrong track for over two hours.
Diagnosis and resolution
It wasn't until around 14:30 that the team identified the source of the problem: a configuration file containing more than 200 features, generated by queries whose behavior had changed after the ClickHouse permission review. Once the cause was identified, engineers stopped the propagation of the faulty file and deployed a version known to be good. Recovery was complete at approximately 17:06, about two and a half hours after the corrective intervention began.
Technical Lessons: What Engineers Need to Take Home
1. Unwrap() kills systems
The most serious problem was not the number of features, but the way the software reacted to the anomaly. The offending module used .unwrap() in Rust: a convenient choice for rapid development, but extremely dangerous in mission-critical systems. Calling .unwrap() assumes that the operation always succeeds; if something goes wrong, it causes a panic that interrupts the entire service, without logs, without countermeasures, without diagnosis. If the system had logged a simple error message instead of crashing, the problem would have been identified within minutes. A single .unwrap() in a global pipeline can cause an international incident.
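The contrast between the two patterns looks like this. load_features is a hypothetical parse step standing in for the feature-file load; the dangerous variant is left commented out because it would abort the process.

```rust
const MAX_FEATURES: usize = 200; // hard-coded cap, as in the incident

// Hypothetical parse step that can fail, mimicking the feature-file load.
fn load_features(raw: &[&str]) -> Result<Vec<String>, String> {
    if raw.len() > MAX_FEATURES {
        return Err(format!("feature list too long: {} > {}", raw.len(), MAX_FEATURES));
    }
    Ok(raw.iter().map(|s| s.to_string()).collect())
}

fn main() {
    let oversized: Vec<&str> = (0..250).map(|_| "f").collect();

    // Fragile pattern: panics on Err and takes the whole process down,
    // leaving no diagnostic trail.
    // let features = load_features(&oversized).unwrap();

    // Defensive pattern: surface the error and keep the last good config.
    match load_features(&oversized) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("feature reload failed, keeping last good config: {e}"),
    }
}
```

The defensive branch degrades gracefully: the node keeps serving with its previous configuration and the error is visible in the logs, which is exactly the behavior that would have shortened the diagnosis from hours to minutes.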
2. Global database changes are fragmentation grenades
A permissions change seemed harmless. In a complex and distributed system, however, even what seems impossible can generate side effects in subsystems that had no direct connection to the change. Staging environments never perfectly replicate a global infrastructure. The idea that a simple change could duplicate query results was not among the foreseen risks, yet that is exactly what happened. The lesson is simple: any global change can produce unexpected emergent behaviors.
3. Coincidences are more misleading than bugs
When two unrelated problems occur at the same time, the mind creates connections that don't exist. The Bot Management crash and the offline status page made the hypothesis of an external attack credible. Cloudflare is bombarded daily with attack attempts: starting from that hypothesis was natural. However, it cost 2.5 hours of investigation into a nonexistent problem. The lesson: data always beats narrative.
4. CDNs are a real single point of failure
The incident highlighted a difficult fact to accept: the internet is massively dependent on a few global players. When a CDN like Cloudflare fails, much of the web is immediately impacted. Most companies don't have a second CDN ready, nor do they have an autonomous infrastructure to absorb traffic. Multi-CDN redundancy exists, but it's expensive, complex, and often impractical. The dependency is real, and the November 18 incident made it clear again.
Conclusion
The November 18th outage was a textbook case of how seemingly irrelevant details can have enormous effects in global systems. A change in database permissions altered the behavior of a single query. The unexpected output exceeded a hard limit. A critical module relied on .unwrap() and reacted with a total panic rather than a managed error. The global distribution of the faulty file every five minutes amplified the impact. An external coincidence led to a prolonged misdiagnosis. And so, from a single technical detail, tens of millions of websites went offline. That is why, in complex architectures, it is not the attacks that are scariest: it is the small, unexpected changes. Or worse yet, the .unwrap() hidden in the code.