About a month ago, an entire AWS region went up in flames for almost an entire day. And you know what? If your business was affected by it, it is entirely your fault.
AWS Outage Reason
So what happened? Amazon reports that the entire Virginia region was unavailable for several hours. The cause of the outage was widely reported as a ‘typo’ in a configuration. There are some arguments to be had as to what ACTUALLY happened, but they are not part of this post.
Why it’s your fault
This is going to be a short paragraph, as the answer lies within every AWS design guideline you chose not to follow, especially this one: design your application for failure. AWS constantly advises its customers to build applications with resilience across at least two availability zones.
An availability zone is just like a data center. That means building your production, failover, and backup infrastructure in a single availability zone is literally just as foolish as putting your production, failover, and backup servers all on a single RAID1 ESXi host in a 1U chassis in your local colo.
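If you want to know whether you are that 1U chassis, a quick audit is enough. The following is a minimal Python sketch, not an AWS tool: the inventory list, the role names, and the function names are all made up for illustration, and it assumes zone names follow the usual AWS pattern of a region name plus a trailing letter (for example us-east-1a).

```python
# Hypothetical inventory of (role, availability zone) pairs.
# In practice this would come from your own asset records.
INVENTORY = [
    ("production", "us-east-1a"),
    ("failover", "us-east-1a"),
    ("backup", "us-east-1a"),
]


def distinct_zones(inventory):
    """Return the set of availability zones used by the inventory."""
    return {zone for _role, zone in inventory}


def distinct_regions(inventory):
    """Return the set of regions, assuming zone = region + one trailing letter."""
    return {zone[:-1] for _role, zone in inventory}


def single_zone_risk(inventory):
    """True when everything lives in fewer than two availability zones."""
    return len(distinct_zones(inventory)) < 2


def single_region_risk(inventory):
    """True when everything lives in fewer than two regions."""
    return len(distinct_regions(inventory)) < 2


if __name__ == "__main__":
    if single_zone_risk(INVENTORY):
        print("Everything is in one availability zone: one failure takes it all down.")
    elif single_region_risk(INVENTORY):
        print("Multiple zones, but a single region: a region-wide outage still hurts.")
    else:
        print("Spread across zones and regions.")
```

With the sample inventory above, both checks report a risk, which is exactly the situation described in this section.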
How the AWS outage affected PEI
Our ERP software provider runs on AWS and, as evidenced by the outage we experienced, suffers from the poor design explained above. However, PEI did not experience a show-stopping moment, simply because we do not rely on a single point of failure to run our business. In reality, this is much more basic than it sounds: 90% of the job of the ERP software that went down is to process tickets for our customers. While AWS was down, our ERP software did go down, but we did not suffer ‘business down’ consequences, because our email system (which the ERP software talks to) is hosted with a different cloud provider and our phone system is completely separate from that as well.
What this means is that for the duration of the outage, we were reduced to monitoring the good old shared mailbox as the service requests kept flowing in, all without visible impact to our customers.
Do try this
Ask yourself and your design team, “How resilient are our cloud-based services? Can we withstand an entire AWS region failure? How will we recover?”
Simple planning is all it takes to successfully run your cloud services. Do take the time to design your architecture properly, and instead of being pulled into a meeting in front of your board of directors, you’ll be writing blog posts like this one, explaining how well your business was running while everyone else was in chaos.
Jacob Rottermund, Systems Support Engineer, PEI