Microsoft services suffer downtime following failed wide-area network update - SiliconANGLE


Microsoft Corp. customers were none too pleased today after the company suffered a widespread outage that resulted in services including Azure, Teams and Outlook being unavailable for nearly three hours.

The outage resulted from a planned update to the Microsoft Wide Area Network that started at 2 a.m. EST. According to an Azure status update, “customers experienced issues with networking connectivity, manifesting as network latency and/or timeouts when attempting to connect to Azure resources in Public Azure regions, as well as other Microsoft services including Microsoft 365 and PowerBI.”

⚠️We are currently investigating a networking issue impacting connectivity to Azure for a subset of users. More information will be provided as it becomes available. For more information, please refer to https://t.co/GIfq5mC5Eb

— Azure Support (@AzureSupport) January 25, 2023

Microsoft addressed the issue by rolling back the change implemented in the WAN update. Azure services were restored by 4:35 a.m. EST, with other Microsoft cloud services restored around the same time.

Beyond attributing the outage to the scheduled update, Microsoft did not disclose the exact cause. The Microsoft 365 team and others at Microsoft described it only as a “networking issue.”

We've identified a potential networking issue and are reviewing telemetry to determine the next troubleshooting steps. You can find additional information on our status page at https://t.co/pZt32fOafR or on SHD under MO502273.

— Microsoft 365 Status (@MSFT365Status) January 25, 2023

Following the outage, Microsoft committed to a follow-up, including a preliminary “Post Incident Review.” The review will cover the initial root cause and repair items. A final review, which will include a deep dive into the incident, will be completed within 14 days.

“The Microsoft service outage is a more common event than many realize,” Alex Hoff, co-founder and chief product officer at network management software company Auvik Networks Inc., told SiliconANGLE. “For most organizations, changes to the network occur daily or weekly, and the IT team doesn’t always have complete visibility into those changes.”

Hoff noted that documentation of network changes and configurations is often incomplete or lags significantly behind the changes themselves. “This makes it far more difficult for IT teams and network managers to pinpoint and correct issues when the network goes down,” he added.

Matthew Hodgson, chief executive officer of secure messaging platform Element, highlighted that this wasn’t the first time Teams had gone down, forcing businesses to fall back on cumbersome email, except that this time Outlook also failed.

“One of the biggest problems with using centralized platforms like Teams is that when it goes down, you have put all your eggs in one basket: Your critical conversations have been held hostage in a single system, with a single point of failure,” Hodgson explained.

Photo: Georgetown University