Could a security questionnaire have prevented Facebook's outage?
too late, too late, for what mistake?
Like everyone else's, my internet channels were buzzing with the news that Facebook, Instagram and WhatsApp suffered an extended outage earlier today, Eastern Daylight Time.
The outage caused significant collateral damage, including slowing down lookups on Google's popular 8.8.8.8 DNS resolver and making an array of domains controlled by Facebook unavailable.
Some banana farmers might mistakenly appreciate monocultures, but those of us in security do not.
There is a lot of speculation on what happened, and while some explanations seem logical, some also prompt belly-aching laughter.
Not everyone’s pencils were at their sharpest amid this news onslaught. Sorry, Mr. Krebs, there is no such thing as a “DNS global routing table.” We think you are referring to BGP routing tables.
From my internet point of view, mostly based on various BSD and Tor Project community channels, the buzz was intense, even though there are likely fewer Facebook, WhatsApp or Instagram users in that community than in most other corners of the internet. Our community dwells on IRC, not Slack or the other more hip mediums. That should say it all.
My first action was to check the Facebook .onion site, only accessible on the Tor network.
For the uninitiated, .onion sites are hidden web sites, better known as the "Dark Net," a sphere of media-inspired panic built on the misperception that only the most misanthropic dwell there.
Actually, the Facebook .onion site is the most popular site on the "Dark Net." But maybe after last night's 60 Minutes episode there is a kernel of truth to that misperception, at least in regard to the Facebook overlords.
Since the .onion site, which is accessible only to Tor Browser users, wasn't working either, I knew there was more to the story than a simple and temporary DNS-related problem. Onion sites don’t rely on the ordinary DNS system, as they are built to bypass internet censorship, which is often implemented through DNS.
No, this Facebook outage, from current chatter, seems to be tied to someone mistakenly deleting all the BGP routing information from Facebook's routers. If that does turn out to be the case, keep your chin up, poor DevOps or sysadmin person! You're not the first, and you certainly won't be the last, to cause an outage at this scale.
Quite honestly, you’re the last to blame. What kind of resilient system rests on the keyboard input of one individual? Especially when the consequences of an error are so disastrous?
Furthermore, there is speculation that the internal system for obtaining credentials was also down, which would have blocked any restore and further delayed resolution.
The more important question is how can a single mistake take down a huge proportion of the internet, especially without a relatively quick and painless fix?
Next time I send Facebook a security questionnaire, I'll be sure to add that specific query in addition to the standard questions about backups, incident response, disaster recovery and access.
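To make that question concrete, here is a minimal sketch of the kind of pre-flight guard I would hope to hear about in an answer. It says nothing about how Facebook's tooling actually works. The `RouteChange` type, the `check_change` function and the 20% threshold are all hypothetical names and values chosen for illustration, and the prefixes come from the reserved documentation ranges. The idea is simply that a change withdrawing a large share of announced prefixes should not go through on one person's keyboard input alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteChange:
    """A proposed change to the prefixes a network announces (hypothetical model)."""
    author: str
    withdraw: frozenset            # prefixes the change would stop announcing
    approvers: frozenset = frozenset()  # people who signed off on the change


def check_change(current, change, max_withdraw_fraction=0.2):
    """Return (allowed, reason) for a proposed change.

    Small changes pass on their own. A change that withdraws more than
    max_withdraw_fraction of the currently announced prefixes needs at
    least one approver who is not the author.
    """
    if not current:
        return True, "nothing is announced, so nothing can be withdrawn"

    affected = set(current) & set(change.withdraw)
    fraction = len(affected) / len(current)
    if fraction <= max_withdraw_fraction:
        return True, f"withdraws {fraction:.0%} of prefixes, within the limit"

    # The author approving their own change does not count as a second pair of eyes.
    independent = set(change.approvers) - {change.author}
    if independent:
        return True, f"withdraws {fraction:.0%} of prefixes, approved by a second person"
    return False, f"withdraws {fraction:.0%} of prefixes and has no independent approver"


if __name__ == "__main__":
    announced = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
    wipe_everything = RouteChange(author="alice", withdraw=frozenset(announced))
    print(check_change(announced, wipe_everything))
```

A guard like this only helps if it runs outside the systems the change can break, which is the same lesson as the credential system that reportedly went down with everything else.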
But is such an outage an actual "security incident"?
Could the countless people who rely on Facebook and Instagram, beyond the actual users, turn around and sue Facebook based on alleged negligence in security operations, a single point of failure and the inability to quickly restore the deleted BGP data?
Maybe. While this doesn't appear to rise to the level of damages suffered in Delta versus 24/7.ai, where an allegedly fraudulent answer to a security review prompted an ugly lawsuit in the Goliath versus David tradition, there is speculation that it was costly.
Should we feel sorry for Goliath or for David? Hint: David usually doesn't get to skip gleefully out of court, no matter who is really at fault.
Who's going to win a street fight? The side well-armed with superior numbers, or the unarmed side that's weaker but "in the right"?
So, yes, this outage was a security incident. Investigation of current security practices should have uncovered the weaknesses. While we have seen many companies fail to practice the best security protocols, even when they demand them of others, they usually catch the gaps through audits and assessments. It is doubtful that Facebook has to respond to security questionnaires, but this may be the first time that going through that exercise might have uncovered the weaknesses that led to an outage like this one.
We commonly ask “how can we mitigate tomorrow's security threats?” Well, we can probably start by implementing the security mitigations that were supposed to be implemented yesterday.
George is a co-founder and CTO of ClearOPS. By trade, George is a systems administrator out of BSD Unix land, with long-time involvement in privacy-enhancing technologies. By nature, he thrives on creating unorthodox solutions to ordinary problems.
ClearOPS offers knowledge management for privacy and security ops data that is turned into information that can be used to respond to security questionnaires and conduct vendor monitoring. Do you know who your vendors are?