Thursday, August 18, 2011

Thinking about patterns to escape Pager Hell

If you've ever been in an operations role, when I say Pager Hell, do you understand what I'm referring to? And by understand, I don't just mean intellectually, but also emotionally.

I've been thinking recently about why Pager Hell occurs in terms of specific causes and root causes, and also patterns and anti-patterns for escaping it.

Here are some initial thoughts:

Page only if an action can be taken. A page should go out only when the person receiving it can do something to contain the situation or recover from it.

Understand the difference between normal and abnormal behaviour. If the "incident" is actually normal behaviour, then it's a system design issue, and you probably can't solve that by yourself at 3 am. There's nothing you can do, so you shouldn't be receiving the page.

Paging is primarily about containment. In most cases, root cause analysis and the design of longer-term countermeasures take too long to be part of initial incident response. The default behaviour of a person on pager duty should be to ensure the situation is contained, and to defer analysis and countermeasures until later, possibly handing them to other people.

A lot of paging demand is actually predictable and can therefore be levelled with appropriate scheduling. Examples include batch jobs, deployments and upgrades.
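To make the first two points a bit more concrete, here is a minimal sketch in Python of how they might be turned into a routing rule. Everything in it is hypothetical and only for illustration: the Alert class, its actionable and abnormal fields, and the route function are my own names, not part of any real paging tool. It also assumes someone has already decided, per alert, whether it is actionable and whether it is abnormal, which is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record, for illustration only.
    name: str
    actionable: bool  # can the on-call person contain or recover this right now?
    abnormal: bool    # does this deviate from the system's normal behaviour?


def route(alert: Alert) -> str:
    """Return "page" only for abnormal alerts someone can act on; otherwise "ticket"."""
    if alert.actionable and alert.abnormal:
        return "page"
    # Normal behaviour, or nothing to be done at 3 am: leave it for working hours.
    return "ticket"
```

With this rule, an alert such as Alert("disk full on db1", actionable=True, abnormal=True) would page someone, while one that is normal behaviour or not actionable would become a ticket to look at during the day.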

----
Do you have any experiences you could share about Pager Hell, and what you did to escape it?
