Monday, October 10, 2011

Contain First, Countermeasure Later

Root cause analysis takes time.  It is unfair and unrealistic to expect people to determine and address the root causes of a problem while they are still in the middle of it and suffering its undesirable effects.

However, you still need something to contain those effects immediately.

Therefore
Contain First, Countermeasure Later. The first response to a problem should be to work out a way to contain the issue.  This buys time to more thoroughly examine and address the root causes.

As an example: an alert tells us that a server is running out of disk space.  Containment might be to temporarily increase the disk space so that, at the current growth rate, we have a week before it fills up again.  This is obviously not a long-term solution, so the countermeasures will come from analysing the causes of the growth and addressing them.  The cause might be a misconfiguration of logging, which might in turn be caused by unfamiliarity with the tooling, which might in turn be caused by a lack of training, and so on.
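The "week of headroom" in the disk space example is simple arithmetic, and a rough sketch may make it concrete.  The function names, the gigabyte units and the assumption of a constant daily growth rate are illustrative choices here, not part of any particular monitoring tool.

```python
import math


def days_until_full(free_gb: float, growth_gb_per_day: float) -> float:
    """Estimate how many days remain before the disk fills up.

    Assumes growth is roughly linear; a flat or shrinking disk never fills.
    """
    if growth_gb_per_day <= 0:
        return math.inf
    return free_gb / growth_gb_per_day


def extra_space_needed(free_gb: float, growth_gb_per_day: float,
                       target_days: float = 7.0) -> float:
    """Extra space (in GB) to add so the disk lasts at least target_days."""
    shortfall = growth_gb_per_day * target_days - free_gb
    return max(0.0, shortfall)


if __name__ == "__main__":
    # Hypothetical figures: 10 GB free, growing by 5 GB per day.
    print(days_until_full(10, 5))
    print(extra_space_needed(10, 5))
```

With these made-up figures the disk has two days left, so containment means adding enough space to cover the remaining five days of growth.  The number only buys time; it says nothing about why the disk is growing.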

There is a danger of forgetting to get to longer-term countermeasures, especially if the containment actions remove the immediate pain.  Therefore, it is useful to create a visualisation or other reminder that root cause countermeasures are still pending.  A Problem / Countermeasure board works well for this.
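If the board is kept electronically, the reminder can be as simple as listing every problem that has been contained but has no countermeasure yet.  The sketch below is one possible shape for that; the `Problem` record and its field names are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    """One row on a hypothetical Problem / Countermeasure board."""
    title: str
    containment: Optional[str] = None      # what was done to stop the pain
    countermeasure: Optional[str] = None   # what addresses the root cause


def pending_countermeasures(board: List[Problem]) -> List[Problem]:
    """Return problems that are contained but still lack a countermeasure."""
    return [p for p in board if p.containment and not p.countermeasure]


if __name__ == "__main__":
    board = [
        Problem("Disk filling up", containment="Added temporary disk space"),
        Problem("Flaky deploy", containment="Manual retry step",
                countermeasure="Fixed race in deploy script"),
    ]
    for problem in pending_countermeasures(board):
        print(f"Still needs a countermeasure: {problem.title}")
```

The point of the filter is to keep contained problems visible after the immediate pain has gone away.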

If you are unable to preserve the quality of the output using containment actions, then it may be necessary to "stop the line" and initiate root cause analysis and countermeasures immediately.
