Downhound posted their first blog post today, and it’s an important one for Operations people to read. Having been in operations for almost 8 years now, I’ve come to realize that not everyone deals with outages in the same way. Some will take any action possible (restart all the things), and some investigate the problem but don’t take any action until they verify with someone. I’ve been in both of those camps and more.
Downhound listed the following must-dos:
- Above all, stay calm
- Establish a situation room
- Look, but don’t touch
- Pause before fixing
Stressing the need to stay calm is very important. Junior and senior operations folks alike often get a surge of panic when a crisis hits, but if you can take a deep breath and keep a cool head, figuring out the problem will be much easier.
They mention creating a new “room” for every outage. I’ve never tried that, though maybe I will someday. However, I do like having a single dedicated situation room that I can refer back to when assessing a new outage. Slack rooms are technically free and unlimited, but I like being able to search one room for keywords and commands used previously.
The third “do” is pretty important. This is the fact-gathering stage, where you figure out what is wrong with the system and assess why there is an issue. Even looking carries some risk, though: on one of these fact-finding missions, I accidentally cat’ed a large file instead of running tail.
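To illustrate why tail is the safer choice on a large file, here is a minimal Python sketch of the same idea: read backwards from the end in small blocks and stop once enough lines are in hand, rather than streaming the whole file. The function name `tail_lines` and its parameters are my own for illustration, not something from the Downhound post.

```python
import os


def tail_lines(path, n=10, block_size=4096):
    """Return the last n lines of a file without reading the whole thing.

    Hypothetical helper for illustration: reads fixed-size blocks from the
    end of the file until more than n newlines are buffered or the start
    of the file is reached.
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # More than n newlines guarantees the first of the last n lines is whole.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    # Decode leniently; log files are not always clean UTF-8.
    return [line.decode("utf-8", errors="replace") for line in data.splitlines()[-n:]]
```

On a multi-gigabyte log this only touches the last few kilobytes, which is the property you want when you are looking but not touching.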
Pausing before fixing gives you the opportunity to think through the repercussions of your fix. “Will there be a split-brain scenario?” “What happens if you rerun that script?” “How long will this take?” Oftentimes that last question forces you to weigh a long-term fix against a short-term one. You might be able to get away with turning off Puppet and manually editing a config file to get over the hurdle, and then turning around and editing the Puppet files to make the config stick.
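The short-term Puppet workaround could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `begin_manual_fix` and `finish_manual_fix` are hypothetical, the commands normally need root, and the only Puppet calls used are `puppet agent --disable <message>` and `puppet agent --enable`. The `run` parameter exists so the sketch can be exercised without Puppet installed.

```python
import shutil
import subprocess
import time


def begin_manual_fix(config_path, reason, run=subprocess.run):
    """Pause Puppet agent runs and back up a config file before hand-editing it.

    Hypothetical helper: 'reason' is stored by Puppet as the disable message,
    so the next person can see why runs are paused.
    """
    run(["puppet", "agent", "--disable", reason], check=True)
    # Keep a timestamped copy so the manual edit can be rolled back.
    backup_path = f"{config_path}.{time.strftime('%Y%m%d%H%M%S')}.bak"
    shutil.copy2(config_path, backup_path)
    return backup_path


def finish_manual_fix(run=subprocess.run):
    """Re-enable the agent.

    Call this only after the Puppet files contain the same change;
    otherwise the next run will overwrite the manual edit.
    """
    run(["puppet", "agent", "--enable"], check=True)
```

Splitting the work into two explicit steps, rather than re-enabling automatically, is deliberate: the gap between them is where the long-term fix in the Puppet files has to happen.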
It’s also very important to remember that complaining, blaming, and making fun of technical issues during the outage doesn’t help. When SHTF, fix the problem now and critique later.