Today at 15:51 UTC, we opened an internal incident entitled "Facebook DNS lookup returning SERVFAIL" because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page we realized something else more serious was going on.
Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had "pulled the cables" from their data centers all at once and disconnected them from the Internet.
This wasn't a DNS issue itself, but failing DNS was the first symptom we'd seen of a larger Facebook outage.
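For readers who want to reproduce that first symptom, here is a minimal sketch in Python of the kind of check involved. It assumes the third-party dnspython library; the domain and resolver address are simply the ones mentioned above, and during the outage a query like this came back SERVFAIL instead of NOERROR.

```python
# Minimal sketch: query a resolver directly and report the DNS response code.
# Requires the third-party dnspython package (pip install dnspython).
import dns.message
import dns.query
import dns.rcode

def check_resolution(qname: str, resolver_ip: str = "1.1.1.1") -> str:
    """Send an A-record query over UDP and return the response code as text."""
    query = dns.message.make_query(qname, "A")
    response = dns.query.udp(query, resolver_ip, timeout=5)
    return dns.rcode.to_text(response.rcode())

if __name__ == "__main__":
    # During the outage this printed SERVFAIL rather than NOERROR, because the
    # authoritative facebook.com nameservers had become unreachable.
    print(check_resolution("facebook.com"))
```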
How's that even possible?
Facebook has now published a blog post giving some details of what happened internally. Externally, we saw the BGP and DNS problems outlined in this post, but the problem actually began with a configuration change that affected Facebook's entire internal backbone. That cascaded into Facebook and other properties disappearing, and into Facebook's own staff having difficulty getting service going again.
Facebook posted a further blog post with a lot more detail about what happened. You can read that post for the inside view and this post for the outside view.
The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, Facebook's engineers often need to take part of the backbone offline for maintenance: perhaps repairing a fibre line, adding more capacity, or updating the software on the router itself.
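As a rough illustration of what "figure out where to send the data" means, the sketch below performs the longest-prefix-match lookup a router does for every packet. The prefixes and next-hop names are invented purely for this example.

```python
# Routers keep a table of prefixes and pick the most specific (longest) one
# that matches a destination address. All prefixes and next hops here are
# made up for illustration.
import ipaddress

ROUTING_TABLE = {
    ipaddress.ip_network("0.0.0.0/0"): "transit-provider",    # default route
    ipaddress.ip_network("10.0.0.0/8"): "backbone-link-1",    # internal backbone
    ipaddress.ip_network("10.20.0.0/16"): "backbone-link-2",  # one data centre
}

def next_hop(destination: str) -> str:
    """Return the next hop for the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTING_TABLE if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTING_TABLE[best]

print(next_hop("10.20.5.7"))   # -> backbone-link-2 (most specific match)
print(next_hop("192.0.2.10"))  # -> transit-provider (falls back to default)
```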
This was the source of yesterday's outage. During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, but it unintentionally took down all the connections in Facebook's backbone network, effectively disconnecting Facebook's data centers globally. Facebook's systems are designed to audit commands like these to prevent such mistakes, but a bug in that audit tool prevented it from properly stopping the command.
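Facebook has not published the audit tool itself, but the failure mode it describes can be sketched in a few lines: a pre-flight check is supposed to veto dangerous commands, and a bug in that check lets one slip through. The command names and the specific bug below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical sketch of the described failure mode. None of these names or
# checks come from Facebook; they only illustrate how a buggy audit step can
# let a destructive command run.
DESTRUCTIVE_KEYWORDS = ("withdraw", "shutdown", "disable")

def audit(command: str) -> bool:
    """Return True if the command looks safe to run. The bug: the keyword
    comparison is case-sensitive, so 'SHUTDOWN ...' slips past it."""
    return not any(keyword in command for keyword in DESTRUCTIVE_KEYWORDS)

def run_maintenance(command: str) -> None:
    if not audit(command):
        raise RuntimeError(f"audit blocked: {command!r}")
    print(f"executing: {command}")  # in reality, pushed out to backbone routers

run_maintenance("assess backbone capacity")     # intended, harmless command
run_maintenance("SHUTDOWN all backbone links")  # the buggy audit fails to stop this
```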
This change completely severed the connections between Facebook's data centres and the internet. And that total loss of connectivity caused a second issue that made things even worse.