Incident management

Enterprise IT systems are complex, and keeping them reliable is a day-to-day fight. To overcome these difficulties, best practices such as ITIL have been deployed, introducing state-of-the-art processes.

Best practices are not only recipes

Incident management is key for complex systems, and many of those best practices have been introduced in this area, which is a very important move. Nevertheless, such processes are sometimes put in place the way one follows a recipe, without a deep understanding of why the process has been defined in that way. Best practices are certainly useful, as they provide a shortcut, but they should not prevent us from thinking about the essence of what we are doing.

IT Operation should not only defend itself

By definition, the best possible result for IT operation is to be invisible: if everything works perfectly, users face no outages and do not realize the huge effort IT operation has made to achieve this result. Therefore, every outage can only reflect negatively on IT operation. In this context, incident management and KPIs are seen as defensive tools to prove that the system is working better than one might imagine. Tools and recipes coming from ITIL are applied, giving some kind of label to this defensive strategy.

Understanding the root cause is key

Instead of sweeping dust under the carpet and taking a defensive approach, there is another possible way: implement those processes and tools in a proactive mode, as a means to better understand what is going on. What is important is not the KPI by itself but the comment that goes with it. What is important is not the incident itself but ensuring that it does not occur again because its root cause was neglected.

Real improvements come when the root cause of an incident is analyzed in depth. This is not easy, and sometimes it is even impossible. It takes time and may slow down other processes, but it pays off. Of course, IT operation teams, in order to look good, will try to restart the system as soon as possible and pay less attention to what caused the issue. Management should discourage this natural inclination and require that an analysis be performed and that an action plan be defined and executed.
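As an illustration of this rule, here is a minimal sketch of an incident record that cannot be closed on a restart alone. It is an assumption made for this article, not part of ITIL or of any ticketing tool: the `Incident` class, its field names and the `can_close` rule are all hypothetical. Because the analysis is sometimes inconclusive, the sketch also accepts a root cause explicitly recorded as unknown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; all field names are illustrative."""
    title: str
    service_restored: bool = False
    root_cause: Optional[str] = None   # None means no analysis recorded yet
    root_cause_unknown: bool = False   # analysis performed but inconclusive
    action_plan: List[str] = field(default_factory=list)

    def can_close(self) -> bool:
        """Restoring service is not enough: an analysis and a plan are required."""
        analysed = self.root_cause is not None or self.root_cause_unknown
        return self.service_restored and analysed and len(self.action_plan) > 0


# The system was restarted, but nothing else has been recorded yet.
incident = Incident(title="Order service unavailable", service_restored=True)
restart_only = incident.can_close()

# After the analysis, a root cause and a follow-up action are recorded.
incident.root_cause = "Disk full on the database host"
incident.action_plan.append("Add disk usage alerting")
after_analysis = incident.can_close()
```

In this sketch `restart_only` is false and `after_analysis` is true: the record stays open until the analysis and the action plan have been written down, which is the behaviour management would be asking for.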

Incident and KPI reports are communication means

This kind of true transparency from IT operation demonstrates to users that the IT team cares and takes responsibility. Setting up commented incident reports and monthly KPIs also gives visibility to IT operation, which is then no longer the invisible department one can easily outsource. Producing commented KPIs that mix technical and business matters demonstrates that IT operation clearly serves the business and contributes to its development.