
The Day GitHub Almost CrowdStriked Us All (Again)

In the world of tech startups, everyone aspires to become the next unicorn. Okay, maybe not this kind of unicorn:

If you have no idea what I’m talking about, it probably means you were grabbing some coffee, taking a break, or simply procrastinating. Whatever you were doing at the time, it was definitely not interacting with GitHub on August 14, 2024. Basically, all of GitHub’s services went down for a considerable amount of time, leaving lots of people anxious given the CrowdStrike incident that took place earlier this year.

Not the First, Not the Last

It was 10:52 PM UTC on Sunday, October 21, 2018. Everything seemed to be going perfectly with a great week ahead to look forward to. Except, as you might already have guessed, a major incident happened during routine maintenance.

For the longest 43 seconds of some GitHub employees’ lives, the situation looked dire until… the connection was restored! That would certainly have been a sweet ending to the story; however, this brief and seemingly insignificant outage triggered a chain of events resulting in 24 hours and 11 minutes of service degradation. Now, try to imagine how many people didn’t sleep well that night: users were unable to log in to the platform, outdated files were being served, and many more problems kept arising for hours to come.

That being said, there have been other times when GitHub’s services were degraded and you might not even have noticed, thanks to one system: the Orchestrator.

The Orchestrator

Of course, when you have the huge task of maintaining a platform responsible for storing code worth billions of dollars, you need to be prepared for when things go south.

GitHub’s Orchestrator is a system that helps manage MySQL clusters, but more importantly, handles automated failover.

When the main server fails or has issues, Orchestrator steps in to promote one of the replicas to become the new primary. This ensures that the service can continue running smoothly with minimal downtime. The system is designed to detect failures, choose the best replica to promote, and make the necessary changes automatically, so the transition happens as quickly as possible.
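GitHub’s actual Orchestrator is a full Go service with far more sophisticated recovery rules, but the promotion logic described above can be sketched roughly like this (all server names, fields, and lag values here are made up for illustration):

```python
# Hypothetical sketch of Orchestrator-style automated failover.
# Each server is modeled as a plain dict; real systems track far more state.

def choose_new_primary(replicas):
    """Pick the healthy replica with the least replication lag."""
    healthy = [r for r in replicas if r["healthy"]]
    if not healthy:
        raise RuntimeError("no replica available for promotion")
    return min(healthy, key=lambda r: r["lag_seconds"])

def failover(primary, replicas):
    """If the primary is down, promote the best replica and repoint the rest."""
    if primary["healthy"]:
        return primary  # primary is fine; nothing to do
    new_primary = choose_new_primary(replicas)
    for r in replicas:
        if r is not new_primary:
            r["replicates_from"] = new_primary["name"]  # re-parent under the new primary
    return new_primary

replicas = [
    {"name": "replica-east-1", "healthy": True, "lag_seconds": 12},
    {"name": "replica-west-1", "healthy": True, "lag_seconds": 2},
    {"name": "replica-west-2", "healthy": False, "lag_seconds": 0},
]
primary = {"name": "primary-east", "healthy": False}

print(failover(primary, replicas)["name"])  # replica-west-1: healthy and least behind
```

The key design point is visible even in this toy version: the failure is detected, the most up-to-date healthy replica wins the promotion, and the remaining replicas are automatically repointed, so clients see minimal downtime.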

The problem in 2018 started when GitHub experienced a network issue. Despite lasting less than a minute, it was enough to make the data centers on the East and West Coasts of the U.S. lose sync with each other, leading to a situation where data written on one coast wasn’t properly replicated to the other. The system’s Orchestrator, which manages database leadership, reacted by shifting database responsibilities to the West Coast data center.

Since each data center held unique data that the other didn’t, switching back to the East Coast data center without risking data loss was impossible. As a result, GitHub had to keep operations running on the West Coast, even though the East Coast applications couldn’t handle the increased latency this change caused. The decision led to significant service disruptions, but it was necessary to protect user data.
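A toy example (not GitHub’s actual data) makes it clear why failing back was unsafe. After the partition, each coast kept accepting writes the other never saw, so neither side ended up with a superset of the other:

```python
# Simplified split-brain illustration: two sites diverge after a partition.
# "Writes" are modeled as a set of made-up commit identifiers.

east_writes = {"commit-1", "commit-2", "commit-3"}  # state before the partition
west_writes = {"commit-1", "commit-2", "commit-3"}

east_writes.add("commit-4-east")  # write accepted only on the East Coast
west_writes.add("commit-4-west")  # write accepted only on the West Coast

only_east = east_writes - west_writes  # data the West never replicated
only_west = west_writes - east_writes  # data the East never replicated

# Neither site contains all of the other's writes, so promoting either one
# as the sole source of truth would silently discard the other's unique data.
print(sorted(only_east))  # ['commit-4-east']
print(sorted(only_west))  # ['commit-4-west']
```

Once both sides hold unique writes, the only safe options are to reconcile the divergent data manually or keep serving from one side, which is exactly the trade-off GitHub faced.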

Back to 2024 and the Outcome

Fortunately for us (and for them), there was no data loss or corruption, and things are apparently already back to normal. The issue was caused by a misconfiguration that disrupted traffic routing and led to critical services unexpectedly losing database connectivity. It was resolved by reverting the configuration change and restoring connectivity to the databases.

While we cannot predict every possible scenario (otherwise, bugs in the code would never exist), there were actual improvements to GitHub’s service status reports after the 2018 incident, especially since some users were unable to tell which services were down at the time.

Additionally, there were likely enhancements in infrastructure redundancy, the Orchestrator, and even the physical layout of their data centers.

And even though this time the damage was not as severe as in 2018, humans will always be error-prone and bound to face misfortune; it will happen. So all we can do is learn from these experiences and work to reduce the chances of similar problems occurring in the future. After all, a pessimist is just an optimist with experience.

References

https://github.blog/news-insights/company-news/oct21-post-incident-analysis/

https://github.blog/engineering/orchestrator-github/

https://github.blog/engineering/mysql-high-availability-at-github/

https://github.blog/engineering/evolution-of-our-data-centers/


Source: blog.codeminer42.com
