Thirty-Seven Minutes
Jonas Hultenius
2023-11-03
On October 31 a big part of what makes the modern internet function broke. For 37 minutes we were all living in the void, unable to run one of the most important technologies of the last decade, NPM.
To a non-tech person this might seem like nothing. Thirty-seven minutes is more or less nothing in the grand scheme of things. But to a tech enthusiast like myself it was far from a pleasant experience, and it is something we might need to address in the future.
So, what happened?
Well, put simply, Cloudflare broke itself. They have released a great summary of the event, what happened and who was affected, but if you summarize their summary it boils down to them relying too much on their own tech.
Their awesome and widely used key-value store, Workers KV, was used to run their own custom progressive deployment, so when an update that was rolled out failed, they had to do the rollback manually. The whole ordeal took just over half an hour, and I can only applaud the team for their quick thinking and their openness in sharing what went wrong. But the lasting effect is the scare that this kind of outage gives me as a developer.
Why should we be scared?
Well, as developers and tech companies tend to place all their eggs in one basket and rely more and more on others to handle their infrastructure, this kind of outage, with its wide-reaching implications, will only become more prevalent in the years to come. Infrastructure is hard and expensive, and when we have the opportunity to outsource it for next to nothing to other actors we tend to do so.
On-prem servers gave way to the cloud, and while the cloud is still widely used, more and more startups and forward-leaning companies and individuals (like me, full disclosure) have moved away from it to an even more stress-free Infrastructure as a Service model where everything just works out of the box. Most of the time, that is, except for the times it doesn’t work at all.
The edge has given us many advantages that the prior models did not. And, before the cloud bros out there start attacking, I need to clarify that yes, the edge is technically a part of the cloud, and most of the things done by edge providers can be done by the big three and their cloud offerings. I know and agree!
But the offerings of Vercel, Netlify and Cloudflare are just so enticing. They are themselves often customers of the big three and do not host their own datacenters, but their pricing and simplicity just make more sense to me. It just works, like Todd Howard often claims Bethesda’s products do, and that is what I like best about it. It works all the time, not counting all the time it is broken, naturally.
The problem is that we have become so used to it always working that we do not plan for contingencies when it doesn’t. We use NPM and other package managers to breathe life into our codebases and to give us something to build upon. This makes us quicker and more productive and keeps us from reinventing the wheel. And this is great, but we have no plan for what happens when that service goes down or a package is removed, compromised or deprecated. We have once again put all our eggs in one basket.
It will be fixed within a matter of hours, or a day or so, depending on the problem. And we will have learned nothing from the whole ordeal.
As infrastructure becomes more and more something you buy as a service (which I love) and software increasingly becomes an endless list of dependencies, we will need to take this into consideration. Perhaps it is time we take a step back and start planning for a more robust tomorrow.
By taking steps to distribute our risks we may keep some of the worst-case scenarios from hitting us at full force. And there are solutions out there that may help, but that is a blog post for another day.
For thirty-seven minutes, one of the pillars of the edge was down and countless solutions, websites and products ceased to work. And no one seems to have noticed. But for me this was one of the most frightful Halloweens to date.
Addendum
I find it a bit ominous that today’s post about Cloudflare’s outage earlier in the week is followed by another outage of even greater magnitude just a few hours later. This time a power outage in Oregon is causing a much more dire situation. Hopefully the awesome team at Cloudflare will get things up and running again so we can once more get back to thinking about how to alleviate these problems in the future.