A two-hour, 24-minute outage in Microsoft’s Azure cloud network on 26 July, which affected thousands of European networks, was caused by a very minor configuration error, according to Microsoft experts.
In an MSDN blog post, Microsoft’s Windows Azure General Manager Mike Neil explains how the triggering of a network ‘safety valve’ generated a torrent of network messages, which ultimately caused a cascading chain of network failures. Servers were swamped by the stream of automatically generated service messages, and networks eventually keeled over as CPU utilisation on those servers hit 100%.
Troubles like this clearly highlight that cloud computing networks require higher standards of network management than ordinary networks.
Minor IT problems go unnoticed all the time around the world, but when a network the size of Microsoft’s Azure cloud goes down, even from such a simple cause, it is a major incident with the potential to generate tens of thousands of service calls worldwide.
The Microsoft blog entry can be read here.