Amazon’s Rule #1: “Everything fails all the time.” (Werner Vogels, CTO at Amazon.com)
A common misconception among consumers of cloud services is that SLAs ensure availability. In reality, SLAs do nothing to improve availability. SLAs only provide a way for attorneys to make money arguing over compensation after a failure. In most cases, the eventual compensation is the cost of the services for the outage period, not the loss the outage caused. To be successful, cloud service providers must plan and engineer for failure.
At CloudConnect 2012, Jesse Robbins, co-founder of Opscode, gave a keynote recommending Game Day real-world testing, which has three facets:
- Preparation: Identification and mitigation of risks and impact from failure. This reduces the frequency of failure (raising MTBF, the mean time between failures) and reduces the duration of recovery (lowering MTTR, the mean time to recovery).
- Participation: Builds confidence and competence in responding to failure under stress. It also strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
- Exercises: Trigger and expose “latent defects”. This lets you choose when to discover issues, instead of letting that be determined by the next real disaster.
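The effect of the Preparation facet can be put in numbers. Steady-state availability is commonly estimated as MTBF / (MTBF + MTTR), so failing less often and recovering faster both raise it. The sketch below is only an illustration: the function names and the hour figures are assumptions chosen for the example, not numbers from the keynote.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative (assumed) scenarios: a baseline, then each improvement alone.
    scenarios = {
        "baseline (MTBF 500h, MTTR 4h)": (500, 4),
        "faster recovery (MTBF 500h, MTTR 1h)": (500, 1),
        "fewer failures (MTBF 2000h, MTTR 4h)": (2000, 4),
    }
    for label, (mtbf, mttr) in scenarios.items():
        print(
            f"{label}: availability={availability(mtbf, mttr):.5f}, "
            f"downtime={downtime_hours_per_year(mtbf, mttr):.1f} h/year"
        )
```

Under this model, cutting MTTR to a quarter yields the same availability as quadrupling MTBF, which is one reason practiced, automated recovery is worth as much attention as failure prevention.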
The lessons that come out of Game Days usually include:
- We have a bunch of manual processes that we need to automate. Jesse’s advice is to automate everything to the point that you can view your infrastructure as code, reconstructing the business from nothing but a source code repository, an application data backup, and base resources (hardware). This requires continuous integration & deployment, and automatic failover and fallback.
- We need better incident management. Development and operations need to work together (which may represent a culture change for the organization).
- Failover didn’t work for one of the cloud service tiers (e.g., load balancing, website, DNS, or database). We need to test and maintain our emergency tools & processes. Infrastructure as code can help here too. Build emergency management processes into what you do every day (e.g., deployment code).
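The last lesson, testing failover per tier, can be sketched as a small drill that fails the active node on purpose and checks whether a standby takes over. This is a toy in-memory model, not a real tool: `Node`, `Tier`, `run_drill`, and the node names are all hypothetical, and a real Game Day would inject the failure into actual infrastructure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """A hypothetical server in a service tier."""
    name: str
    healthy: bool = True


@dataclass
class Tier:
    """A hypothetical service tier (e.g., DNS or database) with ordered nodes."""
    name: str
    nodes: List[Node]

    def active(self) -> Optional[Node]:
        """Return the first healthy node, mimicking a simple failover policy."""
        return next((node for node in self.nodes if node.healthy), None)


def run_drill(tier: Tier) -> Dict[str, object]:
    """Fail the active node deliberately and report whether failover worked."""
    primary = tier.active()
    if primary is None:
        return {"tier": tier.name, "passed": False, "failed_over_to": None}

    primary.healthy = False          # injected failure
    try:
        standby = tier.active()      # who serves traffic now?
    finally:
        primary.healthy = True       # leave the tier as the drill found it

    return {
        "tier": tier.name,
        "passed": standby is not None,
        "failed_over_to": standby.name if standby else None,
    }


if __name__ == "__main__":
    tiers = [
        Tier("database", [Node("db-primary"), Node("db-standby")]),
        Tier("dns", [Node("dns-1")]),  # no standby: a latent defect
    ]
    for tier in tiers:
        print(run_drill(tier))
```

Running a check like this as part of routine deployment, rather than only during an emergency, is one way to build emergency processes into everyday work.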
If you run a cloud service (whether private or public), I’d recommend following Jesse’s advice. If you are a consumer, pay attention to SLAs, but more importantly, talk to your vendor about how they test for failure and how often they test production failover.