The Central Point of Failure: What AWS Tells Us About Risk Management

Once upon a time, the failure of a business’s servers would not be huge news. A blow to the firm in question, to be sure, but not one that necessarily rippled beyond a few days of lost business. However, this is not the world we live in anymore. The mass migration to the cloud has greatly decentralized aspects of the internet and expanded access, but crucially it has also concentrated risk in a few points of failure. Case in point: when one of the major cloud providers (Amazon, Google, Microsoft, etc.) goes down, vast swathes of the internet and other business services can become inoperable. Recently, it has been the turn of Amazon’s AWS, with a couple of major incidents disrupting service.

The Guardian’s first article looks at some of the services affected: 

“A Washington state Amazon driver said his facility had been ‘at a standstill’ since Tuesday morning, CNBC reported.

Other services, including Amazon’s Ring security cameras, mobile banking app Chime and robot vacuum cleaner maker iRobot were also facing difficulties, according to their social media pages. 

Other websites and apps affected include the Internet Movie Database (IMDb), language learning provider Duolingo and dating site Tinder, according to Downdetector.” 

For IT leaders and enterprise architects, there are plenty of potential lessons to learn from this incident. Many of the obvious responses, however, are likely to be bad moves overall. Halting cloud migration might prevent these failures, but it will also leave your level of service and internal efficiency trailing behind competitors. Switching to private cloud might sound good, but is there any reason to think it won’t also be at risk of failures?

A major problem is interconnectivity. Even if you decide that moving away from public cloud is the best balance of risk and performance, pretty much every business will still be exposed to risk through third parties. There will inevitably be applications in use that rely on AWS, Azure or Google Cloud, and if these applications fail it could still halt activity for your own company.

Does this mean you should simply give up and accept the risk? Not so fast. While you cannot hope to completely control cloud availability or software resilience, there is still much to gain from understanding which aspects of your IT landscape are at risk and how. This is one of the common tasks for enterprise architecture (EA): a fully designed architecture can highlight all interconnections between applications and technology and pinpoint issues. Many EA teams will be tasked with creating business impact analyses that deliver easily understood metrics, such as Recovery Time Objectives and Maximum Acceptable Outages, for a wide range of potential threats, not just cloud problems.
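To make the idea of tracing interconnections concrete, here is a minimal Python sketch of how a mapped landscape can reveal indirect exposure to a cloud provider. Every application and provider name in it is a hypothetical placeholder, and the dependency map is an assumption made for illustration; a real EA repository would hold far richer data than a simple dictionary.

```python
from collections import deque

# Hypothetical landscape: each key depends directly on the items listed.
# All names here are illustrative assumptions, not real systems.
DEPENDS_ON = {
    "web-shop": ["payments-api", "product-catalogue"],
    "payments-api": ["third-party-psp"],
    "third-party-psp": ["aws"],
    "product-catalogue": ["on-prem-db"],
    "hr-portal": ["saas-payroll"],
    "saas-payroll": ["azure"],
    "reporting": ["on-prem-db"],
}


def affected_by(failed, depends_on):
    """Return every component that depends, directly or indirectly, on `failed`."""
    # Invert the edges so we can walk from the failed component up to its consumers.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk over everything that consumes the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for consumer in dependents.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("aws", DEPENDS_ON)))
```

In this made-up landscape the web shop never touches AWS directly, yet it is still flagged because its payment provider does. That is the kind of third-party exposure that is easy to miss without a mapped architecture.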

Looking towards software implementation, many firms are turning to a microservices approach that utilizes containers, which can better isolate individual parts of your application landscape and enable some services to remain operational when disaster strikes.
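The principle behind that isolation can be sketched in a few lines of Python. This is only an in-process illustration under simplifying assumptions: the service names and the simulated outage are hypothetical, and in a real microservices setup each service would run in its own container and be reached over the network rather than called as a local function.

```python
def render_page(services):
    """Call each service independently; one failure yields a fallback, not a crash."""
    page = {}
    for name, call in services.items():
        try:
            page[name] = call()
        except Exception as exc:
            # Contain the failure to this one section of the page.
            page[name] = f"unavailable ({type(exc).__name__})"
    return page


def recommendations():
    # Simulated outage of a cloud-hosted dependency (hypothetical).
    raise ConnectionError("cloud region unreachable")


# Hypothetical services that together make up one page.
SERVICES = {
    "catalogue": lambda: ["vacuum", "doorbell"],
    "basket": lambda: {"items": 2},
    "recommendations": recommendations,
}

if __name__ == "__main__":
    for section, content in render_page(SERVICES).items():
        print(section, "->", content)
```

Here the catalogue and basket still render while the recommendations section degrades to a placeholder, which is the outcome the microservices approach aims for at a much larger scale.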

Neither of these solutions is necessarily simple to implement, with both requiring specialist, in-demand skills in areas such as architecture frameworks or Kubernetes, but the old adage that nothing worth having comes easy does apply here.

Fortunately, there is help available: Orbus Software are experts in enterprise architecture and can kickstart your EA practice, laying the foundations for an agile, cloud-native approach to software development. Get in touch with us today to find out more.