Originally published on Enable Architect.
Modern computer systems supply business-critical services everywhere -- from Amazon providing shopping services to Healthcare.gov providing enrollment in health insurance plan. We all rely on such systems. But, unfortunately, these systems are complex and can fail in surprising ways.
By now, it is a well-understood best practice that when failure happens, it's an opportunity to learn and improve. Thus, blameless retrospectives (sometimes called "post-mortems") are by now a development-cycle staple.
However, the processes by which organizations conduct the failure analysis, and make improvement recommendations, are still based on shaky foundations. It is time to do better.
Root cause analysis
It is possible to do Root Cause Analysis (RCA) as originally defined. This means looking for the initial action that started the problem (i.e. the "root") and then figuring out how to prevent it in the future. However, in recent years this method is seen to be of limited value. The root cause is hard to define in increasingly complex systems and not necessarily the right thing to change.
Most organizations that conduct RCA do not follow the original definition. Instead, they do ad-hoc modifications. They look for all contributing causes, starting with the root cause, and then offer mitigation.
In acknowledgment of the limitations of RCA, there is a new emphasis on service reliability. Reliability often focuses on the need to have services resilient to upstream failure.
Acknowledging the complexity of modern systems and formalizing it, the Causal Analysis based on System Theory (CAST) process does precisely that: a way to improve service reliability. Instead of ad-hoc modifications to a fundamentally broken analysis process, CAST offers an alternative from-the-ground-up analysis method based on Professor Levenson's research into system safety theory.
CAST is a modern approach to analyze failure, as described in Professor Levenson's book. As written, it assumes a physical system. However, this process is adaptable to investigating software, and especially for service outages. It is an alternative to the so-called RCA.
CAST contains five steps. Although it sometimes makes sense to go back to a previous stage as you uncover more information, in general, the analysis should follow the steps in order:
- Assemble basic information
- Model safety control structure
- Analyze each component in loss
- Identify control structure flaws
- Create improvement program
Assemble basic information
When assembling basic information, the first part is to define the system involved. This indicates what the boundaries of the analysis are. This part is essential: it should be clear what part is the system and the environment.
Next, describe the loss: the undesirable behavior. Explain the hazard (the original change) that led to it
From the hazard, identify the system-level safety constraints required to prevent it. Those are the system safety requirements and constraints.
The next part is to construct a timeline. Describe what happened. Avoid any conclusions, and especially avoid assigning blame. This part will usually include open questions, especially about why things happened.
Analyze the loss in terms of the system requirements and controls in place. This includes any mechanisms that were put in place to prevent such problems. Indicate what interactions happened between different parts that led to the problem. Note any contextual factors that influenced the events.
Model safety control structure
The model of underlying causality CAST treats safety as a control problem, not a failure problem. Thus, the cause is always that the control structure and controls constructed to prevent the hazard.
If a control structure for the system does not already exist, it might be helpful to start with an abstract high-level control structure.
Analyze each component in loss
Examine the components of the control structure to determine why they were not effective in preventing the loss.
Start at the bottom of the control structure. Explain each component's role in the accident and analyze its behavior and why it did what it did. As context, add the details from the original design for why these controls were deemed adequate.
Identify control structure flaws
Identify general systemic factors that contributed to the loss. These factors cut across the different control structure components. Thus, it is important to add this step explicitly to account for such cross-cutting concerns.
Create improvement program
Create recommendations for changes to the control structure to prevent a similar loss in the future. These might include a continuous improvement program as part of an overall risk management program.
The CAST process is a modern theory-inspired method that is tested by practice, improving safety and reliability. Professor Levenson has many of her books, including the CAST handbook, available from the MIT website, where you can learn more about the background, the theory, and the practice.
Now go forth, and conduct better retrospectives!
Empathy vs. sympathy for Site Reliability Engineers (SRE)
This article was originally published on Enable Architect
Many people have had the insight that DevOps is about people. Often, they will summarize it as "DevOps is about empathy". I have found, however, that idealizing empathy is just as bad as thinking that DevOps is about a single technology.
I …read more
Minimal packing list
With in-person conferences starting to open up, I need to clear the dust off of some skills that have not been used in a while. One of those is how to pack for travel.
This list works for me. It will probably not work for you as-is. Among other things …read more
Post that PR
Sometimes you will be working on hairy and complicated feature in a shared repository. Maybe it's for work. Maybe it's an open source project.
As a responsible person, you are working on a branch. The usual way of working involves a lot of "intermediate" check-ins. Those serve, if nothing else …read more
Portable Python Binary Wheels
It is possible to work with Python quite a bit and not be aware of some of the subtler details of package management. Since Python is a popular “glue” language, one of its core strengths is integrating with libraries written in other languages: from database drivers written in C, numerical …read more
So you want to create a universe
A story about looking for a universe, and finding a pi(e)
This is fine. You need not feel shame. Many want to create a universe. But it is good you are being careful. A universe with sentient beings is a big moral responsibility.
It is good to start with …read more
Virtual Buffet Line
Many people have written about the logistical challenges of food in a conference. You trade off not just, as Chris points out, expensive food versus terrible food, but also the challenges of serving the food to everyone at once.
One natural method of crowd control is the buffet line. People …read more
DRY is a Trade-Off
DRY, or Don't Repeat Yourself is frequently touted as a principle of software development. "Copy-pasta" is the derisive term applied to a violation of it, tying together the concept of copying code and pasta as description of software development bad practices (see also spaghetti code).
It is so uniformly reviled …read more
Fifty Shades of Ver
Computers work on binary code. If statements take one path: true, or false. For computers, bright lines and clear borders make sense.
Humans are more complicated. What's an adult? When are you happy? How mature are you? Humans have fuzzy feelings with no clear delination.
I was more responsible as …read more
I have written before about my Inbox Zero methodology. This is still what I practice, but there is a lot more that helps me.
The concept behind "Universal Binary" is that the only numbers that make sense asymptotically are zero, one, and infinity. Therefore, in order to prevent things from …read more