
Major Incident Metrics...What do we do with the data?

posted Aug 4, 2015, 11:08 AM by Robert Nessler - OIT

Since the Major Incident Management Process was formalized in April of this year, we have been collecting data from each Major Incident that occurs. The goal is to recognize, analyze, and remediate the issues that disrupt service to our user base and, ultimately, to the people of Colorado.


Collection of Metrics


Following each Major Incident, we record several KPIs (Key Performance Indicators) that give a high-level view of the issues occurring across the different agencies.  Some examples include:

  1. Agency - We record which agency is impacted, or choose OIT if the issue spans multiple agencies.

  2. Master Ticket Number - This is captured so that we can go back and review the logged details of the Major Incident.  It will eventually be linked to a problem management ticket used to track Root Cause Analysis.

  3. MTTR (Mean Time To Resolution) - For each incident, we record the total time during which users experienced impact, calculated from the initial ticket reporting time to the final user validation.  Averaging these times across incidents gives the mean.

  4. Summary - A brief description of the reported issue and the restorative actions.

  5. Resolver Group - The group that actually resolved the issue, whether an internal team or an external vendor.

  6. Impacted Device - When applicable, the name of the physical device that caused the issue.
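
As a rough illustration of how the KPIs above could be held as one record per incident, here is a minimal Python sketch. The class name, field names, and helper functions are hypothetical; the post lists what is recorded, not how it is stored. The only calculation shown is the one described above: time from initial ticket report to final user validation, averaged across incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class MajorIncident:
    # Hypothetical field names chosen for illustration only.
    agency: str                  # impacted agency, or "OIT" if several are affected
    master_ticket: str           # master ticket number
    reported_at: datetime        # initial ticket reporting time
    validated_at: datetime       # final user validation
    summary: str                 # reported issue and restorative actions
    resolver_group: str          # internal team or external vendor
    impacted_device: Optional[str] = None  # only when applicable

    def time_to_resolution(self) -> timedelta:
        """Time during which users experienced impact for this incident."""
        return self.validated_at - self.reported_at


def mean_time_to_resolution(incidents: List[MajorIncident]) -> timedelta:
    """Average the per-incident resolution times."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((i.time_to_resolution() for i in incidents), timedelta())
    return total / len(incidents)
```

Calling `mean_time_to_resolution` on a month's worth of records would give the figure that goes into a monthly report; filtering the list by `agency` or `resolver_group` first would give the same figure per agency or per resolver.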


Reporting


At the beginning of each month, we submit a report to the Executive Leadership Team that graphically outlines the above KPIs.  Below are some examples of what we report:
JuneIncidents.jpg
JuneIncidents2.jpg

Actions based on trends


Once a baseline is established and trends become identifiable, actions can be taken to reduce the number of Major Incidents and, in turn, the downtime of the services we provide.  One example is the downtime caused by vendor outages.  An outage here and there may go unnoticed and be quickly forgotten until you take a high-level view of the data below.

Vendor trends.JPG

From this data we can pull everything we need to approach the vendor and start asking questions.  We can compare the MTTR to the SLAs and request paybacks, ask what is being done to prevent repeat issues, or even congratulate them on a job well done!
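
The MTTR-to-SLA comparison mentioned above could look something like the following Python sketch. It assumes the SLA is expressed as a single target resolution time per vendor, which may not match how any particular contract is written; the function name and the returned keys are hypothetical.

```python
from datetime import timedelta
from typing import Dict, List


def summarize_vendor(durations: List[timedelta], sla_target: timedelta) -> Dict[str, object]:
    """Summarize one vendor's incidents against an assumed SLA target.

    durations: per-incident time to resolution for incidents the vendor resolved.
    sla_target: assumed maximum resolution time allowed by the SLA.
    """
    if not durations:
        raise ValueError("no incidents recorded for this vendor")
    mean = sum(durations, timedelta()) / len(durations)
    # Count the individual incidents that ran longer than the target.
    breaches = sum(1 for d in durations if d > sla_target)
    return {
        "incidents": len(durations),
        "mttr": mean,
        "breaches": breaches,
        "mttr_within_sla": mean <= sla_target,
    }
```

Reporting both the mean and the count of individual breaches is a deliberate choice in this sketch: a vendor's average can sit inside the target even when one long outage exceeded it, and that single outage may be the one worth raising.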


In time, other data points that can be collected during a Major Incident may come to light and provide more granular data than what you see here.  Please feel free to contact our team if you have ideas or further questions.