Better Outage Retrospectives

Sun 15 August 2021 by Moshe Zadka

Originally published on Enable Architect.

Modern computer systems supply business-critical services everywhere -- from Amazon providing shopping services to Healthcare.gov providing enrollment in health insurance plan. We all rely on such systems. But, unfortunately, these systems are complex and can fail in surprising ways.

By now, it is a well-understood best practice that when failure happens, it's an opportunity to learn and improve. Thus, blameless retrospectives (sometimes called "post-mortems") are by now a development-cycle staple.

However, the processes by which organizations conduct the failure analysis, and make improvement recommendations, are still based on shaky foundations. It is time to do better.

Root cause analysis

It is possible to do Root Cause Analysis (RCA) as originally defined. This means looking for the initial action that started the problem (i.e. the "root") and then figuring out how to prevent it in the future. However, in recent years this method is seen to be of limited value. The root cause is hard to define in increasingly complex systems and not necessarily the right thing to change.

Most organizations that conduct RCA do not follow the original definition. Instead, they do ad-hoc modifications. They look for all contributing causes, starting with the root cause, and then offer mitigation.

In acknowledgment of the limitations of RCA, there is a new emphasis on service reliability. Reliability often focuses on the need to have services resilient to upstream failure.

Causal analysis

Acknowledging the complexity of modern systems and formalizing it, the Causal Analysis based on System Theory (CAST) process does precisely that: a way to improve service reliability. Instead of ad-hoc modifications to a fundamentally broken analysis process, CAST offers an alternative from-the-ground-up analysis method based on Professor Levenson's research into system safety theory.

CAST is a modern approach to analyze failure, as described in Professor Levenson's book. As written, it assumes a physical system. However, this process is adaptable to investigating software, and especially for service outages. It is an alternative to the so-called RCA.

Performing CAST

CAST contains five steps. Although it sometimes makes sense to go back to a previous stage as you uncover more information, in general, the analysis should follow the steps in order:

  1. Assemble basic information
  2. Model safety control structure
  3. Analyze each component in loss
  4. Identify control structure flaws
  5. Create improvement program

Assemble basic information

When assembling basic information, the first part is to define the system involved. This indicates what the boundaries of the analysis are. This part is essential: it should be clear what part is the system and the environment.

Next, describe the loss: the undesirable behavior. Explain the hazard (the original change) that led to it

From the hazard, identify the system-level safety constraints required to prevent it. Those are the system safety requirements and constraints.

The next part is to construct a timeline. Describe what happened. Avoid any conclusions, and especially avoid assigning blame. This part will usually include open questions, especially about why things happened.

Analyze the loss in terms of the system requirements and controls in place. This includes any mechanisms that were put in place to prevent such problems. Indicate what interactions happened between different parts that led to the problem. Note any contextual factors that influenced the events.

Model safety control structure

The model of underlying causality CAST treats safety as a control problem, not a failure problem. Thus, the cause is always that the control structure and controls constructed to prevent the hazard.

If a control structure for the system does not already exist, it might be helpful to start with an abstract high-level control structure.

Analyze each component in loss

Examine the components of the control structure to determine why they were not effective in preventing the loss.

Start at the bottom of the control structure. Explain each component's role in the accident and analyze its behavior and why it did what it did. As context, add the details from the original design for why these controls were deemed adequate.

Identify control structure flaws

Identify general systemic factors that contributed to the loss. These factors cut across the different control structure components. Thus, it is important to add this step explicitly to account for such cross-cutting concerns.

Create improvement program

Create recommendations for changes to the control structure to prevent a similar loss in the future. These might include a continuous improvement program as part of an overall risk management program.

Summary

The CAST process is a modern theory-inspired method that is tested by practice, improving safety and reliability. Professor Levenson has many of her books, including the CAST handbook, available from the MIT website, where you can learn more about the background, the theory, and the practice.

Now go forth, and conduct better retrospectives!