November 24, 2022 9:07 AM
As we settle into the time of year when we reflect on what we’re thankful for, we tend to focus on important basics such as health, family and friends.
But on a professional level, IT operations (ITOps) practitioners are thankful to avoid disastrous outages that can cause confusion, frustration, lost revenue and damaged reputations. The very last thing ITOps, network operations center (NOC) or site reliability engineering (SRE) teams want while eating their turkey and enjoying time with family is to get paged about an outage. Outages can be extremely costly: $12,913 per minute, in fact, and up to $1.5 million per hour for larger organizations.
To understand the peace of mind that comes with avoiding downtime, however, you have to have endured the pain and anxiety that comes with outages first-hand. Here are a handful of the horror stories ITOps pros are thankful to avoid this season.
A case of janky command structure
One longtime IT pro was on a shift with three others as 7 p.m. rolled around. The crew received an alert about a problem impacting the front-end user interface for its global traffic manager device. Thankfully, there was a runbook for it housed in a database, so it appeared the problem would be resolved quickly. One of the team members saw two things to type in: a command and a secondary input. He typed in the command and, based on the way the runbook looked, waited for the command line to ask for an input, such as “what do you want to restart?”
The way the command structure was set up, if you didn’t provide an input, the device itself would restart. He typed in what he thought was the correct command — “bigstart, restart” — and the entire front-end global traffic manager was taken down.
Just as a reminder, this took place in the early evening. The customer was a finance company, and the system went down just around the time when businesses were closing and trying to do their books and other finance-related tasks. Terrible timing, to say the least.
Five minutes into the outage, the ITOps team realized what had happened: The tool that displayed their runbook wrapped text by default, so what looked like two separate commands was actually a single command. Even though the outage was relatively short, it came at a critical time and created a chain reaction of headaches. The lesson learned? Make sure your runbooks show each command exactly as it must be typed, and know what a command does when it is given no input.
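One way to guard against this kind of slip is to check a runbook command before it runs and refuse any broad command that arrives without an explicit target. The sketch below is illustrative only: the function name, the deny-list and the "httpd" service name are assumptions, not part of the team's actual tooling, and the only behavior taken from the story is that the restart command acted on the whole device when no input was given.

```python
import shlex

# Hypothetical deny-list: commands that act on everything when no target is given.
DANGEROUS_WITHOUT_TARGET = {("bigstart", "restart")}


def check_runbook_command(line: str) -> list[str]:
    """Return the parsed arguments, or raise if a broad command has no target."""
    argv = shlex.split(line)
    if tuple(argv[:2]) in DANGEROUS_WITHOUT_TARGET and len(argv) < 3:
        raise ValueError(
            f"refusing to run {line!r}: no target given, so it would apply to everything"
        )
    return argv


# A command with an explicit (hypothetical) service name passes the check.
print(check_runbook_command("bigstart restart httpd"))
```

Run on the bare two-word command, the check raises an error instead of passing it along, which turns a silent default into a visible decision for the operator.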
When Google is your best friend in the middle of the night
For one 15-year-plus IT veteran, what seemed like a quiet overnight shift quickly devolved into an anxiety-riddled nightmare. “I never found myself panicking so fast as when the remote terminal I was in all of a sudden went blank,” he said.
What he was trying to do was restart a service while working on a remote machine, but he inadvertently disabled the network adapter in the process. Calling someone and waking them up in the middle of the night to tell them he had “nuked” a network adapter was less than ideal, so he and his teammates started doing some digging.
After what he calls “not an insignificant amount of Googling,” he was able to find his way to a Dell server and restarted the network adapter from there. It took longer than it should have to get fixed, but the issue was eventually resolved.
His pro tip: “Don’t disable the network adapter on a machine you remote into in the middle of the night.” That may sound obvious, but the underlying lesson is to have a contingency plan in place should something go terribly wrong.
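One common form of contingency plan is a change that undoes itself unless the operator confirms it still works. The following is a minimal sketch of that idea, not something the team in the story used: the function and its parameters are hypothetical, the change, rollback and confirmation steps are passed in as plain callables, and in practice the process would have to keep running on the remote machine after the session drops.

```python
import time
from typing import Callable


def apply_with_rollback(
    change: Callable[[], None],
    rollback: Callable[[], None],
    confirmed: Callable[[], bool],
    timeout_seconds: float,
    poll_seconds: float = 1.0,
) -> bool:
    """Apply a change, then undo it unless the operator confirms in time."""
    change()
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if confirmed():
            return True   # operator still has access, so keep the change
        time.sleep(poll_seconds)
    rollback()            # no confirmation arrived, so restore the old state
    return False
```

If the change cuts off the operator's own connection, the confirmation never arrives and the rollback runs on its own, with no late-night phone call required.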
ITOps: Leaning on email was great — until it wasn’t
Back when email was the main way NOC teams received alerts, one longtime IT pro recalls having a teammate whose sole job was essentially dispatch: monitoring emails and creating tickets, some for incidents that needed attention now and others for those the team could get to later. The system worked well, but at a large multinational corporation it was a time bomb.
That fear was realized when the company’s entire data center went down.
That was a serious problem in its own right, but the incident also generated so many email alerts that it crashed the corporate Outlook server. “At that point, you’re really blind,” this IT hero remembered.
The event happened to take place in the middle of the night, so the on-call team had to reluctantly start waking up fellow teammates. After the issue was eventually resolved, the team developed a sense of humor about it. As they recalled: “We used to joke that we DDoS ourselves with our own alert noise. Good times!”
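The self-inflicted flood described above is the kind of problem alert throttling is meant to prevent: send one notification per alert source within a time window and count the rest. What follows is a rough sketch under assumed names (the class, its fields and the alert keys are all hypothetical), not a description of any particular product.

```python
from collections import defaultdict


class AlertThrottle:
    """Send at most one email per alert key within a time window; count the rest."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.last_sent = {}                 # alert key -> time of last email
        self.suppressed = defaultdict(int)  # alert key -> alerts held back

    def should_send(self, key: str, now: float) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            self.suppressed[key] += 1       # repeat inside the window: hold it back
            return False
        self.last_sent[key] = now           # first alert, or window has expired
        return True
```

With a 60-second window, a thousand repeats of the same data center alert in one minute would produce a single email plus a suppressed count, rather than a thousand messages aimed at the mail server.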
In the end, the overarching moral of the story is this: Any time a hand touches a keyboard, there is a risk that something could go wrong. This is unavoidable at times, of course, but teams that are able to automate and simplify their IT operations processes as much as possible give themselves the best chance of avoiding costly outages — so they can enjoy their Thanksgiving celebrations uninterrupted.
Mohan Kompella is vice president of product marketing at BigPanda.