Week-long network outage drives cloud migration at Amherst College
It’s not just enterprise networks that are feeling the impact of digital transformation. From government agencies to higher education, cloud migration and SaaS adoption are changing networking from the top down and putting unprecedented strain on IT departments as a result.
One recent incident at Amherst College in Massachusetts illustrates what can happen when IT gets buried under an avalanche of network issues, at a time when colleges across the country are becoming increasingly dependent on connected technologies. Beginning on February 11, a string of network outages hit the campus, affecting everything from campus-wide WiFi to email access and leaving the school in disarray for a week.
While the college was still diagnosing exactly what had happened a week later, much of the trouble was attributed to equipment failures, including cabling issues and MAC flap incidents that saturated the network with unmetered waves of messaging traffic, resulting in a sequence of crashes. Along with a configuration issue at one of the school’s central servers, there was a third, as-yet-unidentified issue that IT was working to uncover, according to the school’s newspaper, The Amherst Student.
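For readers unfamiliar with the term, a MAC flap occurs when a switch sees the same MAC address move back and forth between ports in rapid succession, which often points to a loop or faulty cabling. The sketch below shows one way such flapping could be flagged from a stream of MAC-learning events. It is only an illustration: the function name, the event format, and the thresholds are assumptions, not anything Amherst College IT is known to have used.

```python
from collections import defaultdict, deque


def detect_mac_flaps(events, window_seconds=10.0, max_moves=3):
    """Return the set of MAC addresses that changed ports too often.

    `events` is an iterable of (timestamp_seconds, mac, port) tuples,
    sorted by timestamp. The format and thresholds are hypothetical.
    """
    last_port = {}                 # most recent port seen for each MAC
    moves = defaultdict(deque)     # timestamps of recent port changes per MAC
    flapping = set()

    for timestamp, mac, port in events:
        previous = last_port.get(mac)
        last_port[mac] = port
        if previous is None or previous == port:
            continue  # first sighting, or no port change

        recent = moves[mac]
        recent.append(timestamp)
        # Drop port changes that fall outside the sliding window.
        while recent and timestamp - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= max_moves:
            flapping.add(mac)

    return flapping


if __name__ == "__main__":
    sample = [
        (0.0, "aa:aa:aa:aa:aa:aa", "port1"),
        (1.0, "aa:aa:aa:aa:aa:aa", "port2"),
        (2.0, "aa:aa:aa:aa:aa:aa", "port1"),
        (3.0, "aa:aa:aa:aa:aa:aa", "port2"),
    ]
    print(detect_mac_flaps(sample))
```

A single move between ports is normal (a laptop roaming between access points, for example), which is why the sketch only flags addresses that move several times within a short window.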
In a nutshell, after almost a week of near-complete connectivity outage, campus IT was still struggling to pinpoint what was plaguing its network, even though Amherst College appears to own and control the majority of its network hardware. That sets it apart from organizations that have moved at least part of their network to the cloud or that leverage direct internet access (DIA) or SD-WAN (which is admittedly more common in enterprise environments than in educational networks, at least for now).
Amherst College Chief Information Officer David Hamilton told the student newspaper that he had never been aware of similar incidents in his 12 years at the school, and that it is “a confluence of accidents that caused it.”
There are a few initial takeaways here.
First, the impact of the network being down over the course of the week was far-ranging. Not only were email and WiFi down, but the card scanning systems that keep dorms and halls across campus secure were essentially “unlocked.” Students were forced to use their own cellular data to access online materials for classes, payroll systems became inaccessible, and even campus laundry cards were frozen out of use. By week’s end, campus administrators were unable to give a clear answer on what caused the outage, and they had yet to consider how they would reimburse students and faculty for the inconvenience, according to The Amherst Student.
A February 15 alert to students across campus read:
“IT is working to restore services by moving them to the cloud. This is taking longer than expected because of the instability of the existing network.”
So what could campus IT have done to prevent this? For starters, if campus IT had really expected network conditions to stay largely unchanged over the past decade-plus, as its statements suggest, then a major outage like this was inevitable.
One question that immediately comes to mind is whether the campus had been conducting any network monitoring ahead of the incident in the first place.
For “a confluence of accidents” to all take place at once like this would require some major network blind spots on the part of IT prior to the incident. Even networks that aren’t reliant on significant cloud architectures should be using network performance monitoring and diagnostics tools to alert IT to potential pitfalls on an ongoing basis, rather than relying on networking hardware to stay performant because of a good track record.
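As a rough illustration of what alerting "on an ongoing basis" can mean, the sketch below compares each new latency sample against a rolling baseline and flags samples that deviate sharply from it. It is a minimal example under stated assumptions: the function name, the sample data, and the three-sigma threshold are hypothetical, and commercial monitoring tools use far richer signals than a single latency series.

```python
from statistics import mean, stdev


def find_latency_anomalies(samples_ms, baseline_size=5, sigmas=3.0):
    """Return (index, value) pairs for samples far above the rolling baseline.

    `samples_ms` is a list of latency measurements in milliseconds.
    All names and thresholds here are illustrative assumptions.
    """
    anomalies = []
    for i in range(baseline_size, len(samples_ms)):
        baseline = samples_ms[i - baseline_size:i]
        average = mean(baseline)
        # Floor the spread at 1 ms so a perfectly flat baseline
        # does not trigger alerts on trivial jitter.
        spread = max(stdev(baseline), 1.0)
        if samples_ms[i] > average + sigmas * spread:
            anomalies.append((i, samples_ms[i]))
    return anomalies


if __name__ == "__main__":
    latencies = [20, 21, 19, 20, 22, 21, 95, 20]
    print(find_latency_anomalies(latencies))
```

One limitation worth noting: once a spike enters the rolling baseline it inflates the threshold for the next few samples, so a sustained degradation would need a longer or more robust baseline than this sketch uses.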
These same monitoring solutions should have been deployed during the outage to aid in identifying the issue. The approach of shutting down network servers and performing hardware tests was ultimately costly to students and staff while failing to produce clear answers. Had the team been proactive about network performance monitoring, IT could very likely have reduced its mean time to resolution (MTTR) and gotten to the root of the problem more quickly.
According to Hamilton’s response to the network outage, Amherst will be retiring their hardware-centric network infrastructure and migrating their central business systems to the cloud in the hopes of delivering more secure and reliable connectivity to staff and students alike.
While things are settling down, this was a very drawn-out incident that likely could have been avoided, or at least resolved faster, with more modern network technology and dependable network performance monitoring.