On February 28, 2017, thousands of websites and apps experienced performance issues after a failure of the popular cloud computing platform S3, provided by Amazon Web Services. The outage started around 9:30 AM PST and the S3 platform wasn’t fully functioning until just before 2 PM PST. The size and length of the outage was unprecedented and the economic impact has been estimated to be upwards of $150 million.
Amazon Web Services released a statement that the outage occurred during planned maintenance when “one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended”, meaning that a typo took down the S3 platform for hours. While it may be tempting to label the typo, or human error, “THE root cause” of this event, there is (of course) a bit more to the story.
Building a Cause Map, a visual root cause analysis, can intuitively show the systems of causes (plural) that contributed to this incident. Identifying multiple causes, rather than focusing on a single root cause, helps ensure that a wider range of possible solutions are considered. The first step in the Cause Mapping method is to fill in an Outline with the basic background information for an issue, which includes space to list how the issue impacted the organization’s goals. Once the Outline is completed, the Cause Map is built by starting at one of the impacted goals and asking “why” questions to identify the cause-and-effect relationships that led to an incident occurring. (Click on the thumbnail above to see an intermediate Cause Map of this issue.)
So why did a large S3 outage occur? A large number of servers were unexpectedly taken offline AND these servers were necessary to keep the system running. (When two causes both contribute to an effect, they are both listed vertically on the Cause Map and separated with an “AND”.) An S3 team member intended to take a small number of servers offline while debugging an issue with slow billing, but accidentally entered a command incorrectly and took more servers than intended offline. Once the mistake has been identified as human error, it may be tempting to end the investigation, but it is important to go further than just identifying the typo that occurred. In order to thoroughly understand the problem, the fact that design of the system did not prevent a large number of servers from dropping offline once the command was entered needs to be considered as a cause. A thorough incident investigation would include the details of exactly what safeguards are included (or not) in the S3 platform design. These details were not released to the public, but the same technique of asking “why” questions could be used to build a more detailed Cause Map.
Once a Cause Map is built, the final step in the Cause Mapping process is to brain storm solutions and select the ones that should be implemented. In this example, it would be nice to prevent a typo from occurring and Amazon web services may want to consider adding a double check into their work process or try to reduce the number of mistakes in another way, but reliability would also be greatly improved if the system was designed to minimize the impacts of a single, common error. Ideally, a robust system would be designed so that a simple typo can’t cause an $150 million outage. Ensuring that the investigation and solutions focus on more than just “THE root cause” of a command being entered incorrectly can help reduce the impact of any similar errors in the future.
Based on the changes they have already made and more they intend to make in the future, Amazon’s web services is clearly working to improve the reliability of their S3 platform and minimize the consequences of a similar mistake in the future. They have modified their system so that servers will be removed more slowly and have added safeguards so that no subsystem is reduced below its minimum required capacity level as servers are removed. Audits are also being done on their other operational tools to ensure similar safety checks are in place. Additionally, changes are being made to improve recovery time of key S3 subsystems so that the duration of future outages is reduced.
Only time will tell exactly how effective Amazon’s web services solutions have been, but they seem to be taking many steps in the right direction to prevent similar issues in the future.