According to a recent report by IBM, the damage caused by major IT incidents is greater than ever. An incident that results from a data breach will cost the organization an average of $3.86 million, with the average time to breach containment coming in at 280 days!
And according to the ITIC, hourly downtime costs come in at over $300,000, with some reaching $1 million per hour.
Clearly, reining in incidents, resolving them as quickly and efficiently as possible, and learning from past mistakes to optimize the resolution of future events are top priorities for anyone whose day-to-day involves major incidents.
Insights from the ITIL
Aligning with industry standards for efficient resolution has long been the strategy in focus for addressing this objective, with the ITIL serving as the preferred source for insightful methodologies and processes.
When it comes specifically to learning from past mistakes, nothing serves up knowledge and insights better than the right incident report.
It facilitates the incident review, covering how the incident unfolded and how well (or not so well) the processes were executed, and it supports the post-mortem, root cause analysis, and risk mitigation for future incidents.
According to the ITIL, the incident report should explain the following:
- What was the incident about?
- When did it occur?
- Where did it occur?
- How much time did it take to resolve?
- Who resolved it?
- Who was involved in handling the incident?
- What troubleshooting steps were taken?
Answering just these questions, though, is not enough. The report should do more than establish what, when, and who.
It should cover a much broader set of incident parameters, as follows:
This part of the report provides a holistic overview of all the incident parameters that require analysis. These are needed for the team to arrive at conclusions that will enable it to optimize resolution for forthcoming incidents.
Among these parameters are:
- Which services were impacted, and which related services were not?
- What were the symptoms, including errors and their impact on performance?
- What was the baseline performance, and what was the delta during the incident?
- Which geographies and time windows were affected?
- How consistent was the interruption (constant or intermittent)?
- How did the impact on affected processes correlate with the impact on other business processes?
- Which escalation steps were taken?
- Which steps proved helpful toward a faster, more efficient resolution?
- What documentation was created to support major incident management, and with whom was it shared?
- Which actions are mandatory for restoring specific services during such an incident?
- What were the costs involved with these actions?
For each of these parameters, it is also important to note which stakeholders were involved.
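To make these questions easier to answer consistently, a team might capture them in a structured record. The following is only a minimal sketch in Python: the `IncidentReport` class, its field names, and the use of latency as the performance metric are illustrative assumptions, not part of ITIL or any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentReport:
    """Hypothetical record for a major incident report (illustrative only)."""

    # Core questions: what, where, when, and who
    summary: str
    location: str
    started_at: datetime
    resolved_at: datetime
    resolved_by: str
    responders: list[str] = field(default_factory=list)
    troubleshooting_steps: list[str] = field(default_factory=list)

    # Broader parameters: impact, escalation, and performance
    impacted_services: list[str] = field(default_factory=list)
    escalation_steps: list[str] = field(default_factory=list)
    baseline_latency_ms: float = 0.0   # assumed metric; use whatever fits the service
    incident_latency_ms: float = 0.0

    def time_to_resolve(self) -> timedelta:
        """How much time it took to resolve the incident."""
        return self.resolved_at - self.started_at

    def latency_delta_ms(self) -> float:
        """Difference between incident performance and the baseline."""
        return self.incident_latency_ms - self.baseline_latency_ms


# Example values are invented for illustration.
report = IncidentReport(
    summary="Checkout API returning errors",
    location="eu-west",
    started_at=datetime(2021, 3, 1, 9, 0),
    resolved_at=datetime(2021, 3, 1, 11, 30),
    resolved_by="on-call engineer",
    impacted_services=["checkout", "payments"],
    baseline_latency_ms=120.0,
    incident_latency_ms=950.0,
)

print(report.time_to_resolve())   # 2:30:00
print(report.latency_delta_ms())  # 830.0
```

Deriving the resolution time and the performance delta from recorded values, rather than typing them in by hand, keeps the report consistent with the underlying timeline.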
One of the most critical pillars upon which the success or failure of major incident management rests is communications.
Accordingly, it is mandatory to document and report the effectiveness of each of the communication channels that were involved throughout the incident lifecycle. These include, for example, email, conference calls, and Slack.
Moreover, the report must note which stakeholders were or were not available and the steps that were taken to overcome the challenge of reaching them (and whether or not those steps were successful).
This is essential to understanding how to ensure seamless communication, which is one of the key capabilities required for accelerating resolution and learning for future optimization.
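One possible way to record this is a simple log of contact attempts that can be summarized per channel. The sketch below is an assumption-laden illustration: `ContactAttempt`, `reach_rate_by_channel`, and `unreached_stakeholders` are hypothetical names, and "effectiveness" is reduced here to the share of attempts that actually reached someone, which a real report may define differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ContactAttempt:
    """One attempt to reach a stakeholder during the incident (hypothetical)."""
    stakeholder: str
    channel: str   # e.g. "email", "conference call", "Slack"
    reached: bool


def reach_rate_by_channel(attempts):
    """Return the fraction of successful attempts for each channel."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for attempt in attempts:
        totals[attempt.channel] += 1
        if attempt.reached:
            successes[attempt.channel] += 1
    return {channel: successes[channel] / totals[channel] for channel in totals}


def unreached_stakeholders(attempts):
    """Return stakeholders who were never reached on any channel."""
    everyone = {a.stakeholder for a in attempts}
    reached = {a.stakeholder for a in attempts if a.reached}
    return sorted(everyone - reached)


# Example values are invented for illustration.
attempts = [
    ContactAttempt("DBA lead", "email", False),
    ContactAttempt("DBA lead", "Slack", True),
    ContactAttempt("Network owner", "email", False),
    ContactAttempt("Network owner", "conference call", False),
]

print(reach_rate_by_channel(attempts))
print(unreached_stakeholders(attempts))  # ['Network owner']
```

Keeping one row per attempt, rather than one row per stakeholder, preserves the steps taken to reach each person as well as the final outcome.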