Preparing for failure in IT

Introduction

Question: what does a £5 USB pen drive have in common with a multi-billion-pound IT contract?

Answer: both will fail at some time, at some level.

As IT professionals and as organisations, a strong measure of our success should be how well we prepare for such failures and how we deal with them when they happen.

Embracing failure

All too often over my career, I’ve seen individuals and companies go into panic mode when something fails, even more so when it leads to a service outage. This usually shows itself through some or all of the following:

  • People asking questions during the outage that should be reserved for the post mortem
  • Fingers being pointed and voices being raised
  • People terrified to admit what they did, which prolongs the incident
  • Any semblance of an incident management (IM) process being completely ignored
  • At the other end of the spectrum, an over-engineered IM process crippling the repair effort
  • Incessant hovering by ‘do-gooders’ over the person trying to fix the problem

These should be familiar to most IT professionals with anything more than a couple of incidents under their belt even if, like me, you are lucky enough to currently be at a company that has a culture of embracing failure.

What do I mean when I say embracing failure? If I were to list some of the behaviours associated with that mindset, it would include the following:

  • Proactive monitoring
  • Capacity planning
  • Good documentation sets in place
  • Mock incident scenarios
  • Open, no-blame culture

More important than anything else, any failure, regardless of whether it causes an incident, should be treated as an opportunity to learn: to improve individual knowledge, find the holes in your processes, firm up your monitoring, and build confidence and relationships.

Post mortem

The post mortem is perhaps the most important part of the entire process. You can get a tricky issue resolved in record time, get a pat on the back from the customer and senior management, and then see the whole thing ruined by some prat who thinks the key requirement of the post mortem is determining which poor numpty is to blame. Inevitably, people’s confidence and willingness to take on riskier tasks nose-dive.

The post mortem should be a relaxed affair where everybody’s main goal is to learn. Learn exactly what went wrong, learn how the process to deal with the issue could be improved, learn how to reduce the risk of the issue recurring, learn how to address other peripheral risks, learn where the knowledge gaps are in your team, learn what makes your colleagues tick…the list goes on.

Summary

Whether you like it or not, failure is something you will experience whilst working in IT. The key thing that should separate you from the headless chickens is how you prepare for, deal with and learn from failure when it inevitably happens.

Till the next time.

4 Replies to “Preparing for failure in IT”

  1. Great piece, and I couldn’t agree more. If people are afraid of failure, they are also afraid to take a chance and try something new. Some of my greatest Ah-Ha moments have come when something didn’t work properly.
