
The Day GitHub Almost CrowdStriked Us All (Again)

In the world of tech startups, everyone aspires to become the next unicorn. Okay, maybe not this kind of unicorn:

If you have no idea what I’m talking about, it probably means you were grabbing some coffee, taking a break, or simply procrastinating. Whatever you were doing at the time, it was definitely not interacting with GitHub on August 14, 2024. Basically, all of GitHub’s services went down for a considerable amount of time, leaving lots of people anxious given the CrowdStrike incident that took place earlier this year.

Not the First, Not the Last

It was 10:52 PM UTC on Sunday, October 21, 2018. Everything seemed to be going perfectly, with a great week ahead to look forward to. Except, as you might already have guessed, a major incident happened during routine maintenance.

For the longest 43 seconds of some GitHub employees’ lives, the situation looked dire until… the connection was restored! That would certainly have been a sweet ending to the story; however, this brief and seemingly insignificant outage triggered a chain of events resulting in 24 hours and 11 minutes of service degradation. Now imagine how many people didn’t sleep well that night: users were unable to log in to the platform, outdated files were being served, and more problems kept arising for hours.

That being said, there have been other times when GitHub’s services were degraded and you might not have even noticed, thanks to: the Orchestrator.

The Orchestrator

Of course, when you have the huge task of maintaining a platform responsible for storing code worth billions of dollars, you need to be prepared for when things go south.

GitHub’s Orchestrator is a system that helps manage MySQL clusters, but more importantly, handles automated failover.

When the primary server fails or has issues, Orchestrator steps in to promote one of the replicas to become the new primary. This ensures that the service can continue running smoothly with minimal downtime. The system is designed to detect failures, choose the best replica to promote, and make the necessary changes automatically, so the transition happens as quickly as possible.
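
To make the idea concrete, here is a minimal sketch of that failover decision in Python. This is not GitHub’s or Orchestrator’s actual code; the `Replica` fields and the lag-based selection rule are simplifying assumptions meant only to illustrate “detect the failure, pick the healthiest replica, promote it.”

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    reachable: bool
    replication_lag_s: float  # seconds of data this replica is behind the primary


def pick_new_primary(replicas: list[Replica]) -> Replica:
    """Pick the healthiest, most up-to-date replica to promote."""
    candidates = [r for r in replicas if r.reachable]
    if not candidates:
        raise RuntimeError("no replica is eligible for promotion")
    # Prefer the replica that is missing the fewest writes.
    return min(candidates, key=lambda r: r.replication_lag_s)


def on_primary_failure(replicas: list[Replica]) -> None:
    new_primary = pick_new_primary(replicas)
    # A real failover system would also repoint the remaining replicas and
    # application traffic at the newly promoted primary.
    print(f"promoting {new_primary.name} to primary")


if __name__ == "__main__":
    fleet = [
        Replica("east-1", reachable=False, replication_lag_s=0.0),
        Replica("west-1", reachable=True, replication_lag_s=2.5),
        Replica("west-2", reachable=True, replication_lag_s=0.4),
    ]
    on_primary_failure(fleet)  # prints: promoting west-2 to primary
```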

The problem in 2018 started when GitHub experienced a network issue. Despite lasting less than a minute, it was enough to make the data centers on the East and West Coasts of the U.S. lose sync with each other, leading to a situation where data written on one coast wasn’t properly replicated to the other. The system’s Orchestrator, which manages database leadership, reacted by shifting database responsibilities to the West Coast data center.

Since each database held unique data that the other didn’t, it became impossible to switch back to the East Coast data center without risking data loss. As a result, GitHub had to keep operations running on the West Coast, even though East Coast applications couldn’t handle the increased latency this change introduced. The decision caused significant service disruptions, but it was necessary to protect user data.
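
A toy way to see why failing back was off the table: after the 43-second partition, each coast had accepted writes the other never replicated, so neither copy was a strict superset of the other. The sketch below uses made-up write IDs purely to illustrate that check; it has nothing to do with how GitHub actually reconciled the data.

```python
# Hypothetical write IDs accepted on each coast during and after the partition.
east_writes = {"write-101", "write-102"}
west_writes = {"write-103", "write-104", "write-105"}

only_east = east_writes - west_writes  # data only the East Coast has
only_west = west_writes - east_writes  # data only the West Coast has

if only_east and only_west:
    # Neither side can simply be discarded: switching back to the East Coast
    # would drop the West Coast writes, and vice versa.
    print(f"diverged: {only_east} only in the East, {only_west} only in the West")
```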

Back to 2024 and the Outcome

Fortunately for us (and for them), there was no data loss or corruption, and things are apparently already back to normal. The issue was caused by a misconfiguration that disrupted traffic routing and led to critical services unexpectedly losing database connectivity. It was resolved by reverting the configuration change and restoring connectivity to the databases.

While we cannot predict every possible scenario (otherwise, bugs in code would never exist), there were real improvements to GitHub’s status reporting after the 2018 incident, especially since some users were unable to tell which services were down at the time.

Additionally, there were likely enhancements in infrastructure redundancy, the Orchestrator, and even the physical layout of their data centers.

And even though the damage this time was not as severe as in 2018, humans will always be error-prone and bound to face misfortune; failures will happen. So all we can do is learn from these experiences and work to reduce the chances of similar problems occurring in the future. After all, a pessimist is just an optimist with experience.

References

https://github.blog/news-insights/company-news/oct21-post-incident-analysis/

https://github.blog/engineering/orchestrator-github/

https://github.blog/engineering/mysql-high-availability-at-github/

https://github.blog/engineering/evolution-of-our-data-centers/

