How AppNeta Speeds up MTTI: Diagnosing a recent outage
by Amy Potvin

We work with our customers every day to provide proactive insight into the key apps supporting their business, along with best practices for addressing issues when they arise. Up this week is a great example of how a continuous baseline of your cloud applications helps you take action before your users are impacted.

On March 15th, we saw an issue with Microsoft Azure’s Active Directory reported here (filter on Azure Active Directory if necessary). Because Teams is critical infrastructure for many of our customers, we actively monitor it from many locations, and we began seeing issues just after 12 PM PT. We captured this through our Experience monitoring, which continuously tracks actual user experience through a normal business cycle: in this case, accessing, logging in to, and using apps as a user would.

The test timeline shows a simple script with three milestones: it opens the Microsoft login page, authenticates through to Teams, and confirms the page loaded as expected. This example runs from our Vancouver office and reflects the end-user experience of people in that region opening the Teams app. To rule out a location-specific anomaly, we also recommend monitoring key apps from every region where your users are.
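A scripted test like the one described above can be sketched as a sequence of named milestones, each timed individually. This is a minimal illustration, not AppNeta's actual scripting API; the milestone names and stub step functions are assumptions, with `time.sleep` standing in for real browser actions.

```python
import time

def run_synthetic_test(steps):
    """Run each (milestone_name, step_fn) pair and record its duration in seconds."""
    timings = {}
    for name, step in steps:
        start = time.monotonic()
        step()  # e.g. open the login page, authenticate, verify Teams loaded
        timings[name] = time.monotonic() - start
    return timings

# Stub steps stand in for real browser actions (names are illustrative).
timings = run_synthetic_test([
    ("open_login_page", lambda: time.sleep(0.01)),
    ("authenticate", lambda: time.sleep(0.01)),
    ("verify_teams_loaded", lambda: time.sleep(0.01)),
])
```

Timing each milestone separately is what makes it possible to see exactly which step (here, authentication) degrades during an incident.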

At around 12 PM PT on the chart, we see the timing of the “end of script” authentication step jump by almost 100 percent. Our alert profile triggered after a second consecutive test showed the same error.
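Requiring a second failing test before alerting suppresses one-off flakes. A minimal sketch of that rule, assuming a simple list of pass/fail results (the function name and threshold are illustrative, not AppNeta's alert-profile API):

```python
def should_alert(results, consecutive_failures=2):
    """True only if the most recent `consecutive_failures` tests all failed."""
    if len(results) < consecutive_failures:
        return False
    return all(r == "fail" for r in results[-consecutive_failures:])

# A single failure does not alert; a second consecutive failure does.
print(should_alert(["pass", "fail"]))          # False
print(should_alert(["pass", "fail", "fail"]))  # True
```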

Because our Monitoring Points run a full browser, each pulls a complete copy of the page being tested, which lets us dig deeply into the resources collected. The test detail page shows that the API endpoints handling the page’s authentication request were returning a 401 Client Error. From Azure’s breakdown of the issue, linked earlier, we can see this was caused by a key that was removed in error during normal key rotation.
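Spotting the 401s amounts to scanning the collected resources for authentication-related status codes. A hedged sketch, where the resource records are stand-in dictionaries for the data a full-browser test would capture:

```python
def find_auth_failures(resources):
    """Return resources whose HTTP status indicates an auth failure (401/403)."""
    return [r for r in resources if r["status"] in (401, 403)]

# Illustrative captured resources; URLs and statuses are examples, not real data.
resources = [
    {"url": "https://login.microsoftonline.com/common/oauth2/token", "status": 401},
    {"url": "https://teams.microsoft.com/", "status": 200},
]
failures = find_auth_failures(resources)
```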

Looking further into the POST response, we can see that an invalid token error is the root cause, citing the missing key that Azure described in its retrospective. Even before the Azure error report came out, this clearly indicated an authentication error on the Microsoft side.
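Pulling the root-cause detail out of the response body is a matter of parsing its error fields. The JSON shape below is an assumption modeled loosely on OAuth-style error responses, not the actual captured payload:

```python
import json

# Illustrative error body; the field values are examples, not the real response.
body = json.dumps({
    "error": "invalid_token",
    "error_description": "Signing key not found",
})

def extract_error(response_body):
    """Return (error_code, description) from a JSON error response."""
    payload = json.loads(response_body)
    return payload.get("error"), payload.get("error_description")

code, description = extract_error(body)
```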

Often in troubleshooting, the goal is to find the root cause as soon as possible, both to identify the scope of an issue and to prove the guilt or innocence of the parties involved. In this case, remediation required Microsoft to fix the problem, but as reliance on cloud providers has become commonplace, IT needs to be able to isolate what it can fix and what it needs to contact support about. IT also needs to validate that performance returned to baseline after a fix, which our Test Timeline showed was complete at 15:00 PT.
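Validating a return to baseline can be sketched as comparing recent step timings against the pre-incident average. The 20 percent tolerance and the timing values below are illustrative assumptions, not figures from this incident:

```python
from statistics import mean

def back_to_baseline(baseline_ms, recent_ms, tolerance=0.20):
    """True if the average of recent timings is within tolerance of baseline."""
    return mean(recent_ms) <= baseline_ms * (1 + tolerance)

baseline = 800                      # pre-incident average step time, ms (example)
during_outage = [1550, 1600, 1580]  # step time roughly doubled
after_fix = [790, 820, 810]         # back within tolerance of baseline
```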

When you’re armed with the evidence to understand the root cause, you can take that knowledge to impacted end users and assure them that IT is on the case. In this instance, IT can proactively alert the enterprise to the outage and offer an alternative way to host meetings in the short term. Frequent communication of these issues to the wider company bolsters confidence in IT’s ability to identify and handle problems as they arise.


Get Network KPIs without the Overhead
To learn how your IT team can speed up root cause diagnosis using AppNeta Performance Manager, read our whitepaper Get Network KPIs Without the Overhead.

