Since the Major Incident Management Process was formalized in April of this year, we have been collecting data from each Major Incident that occurs. The goal is to recognize, analyze, and remediate the issues that disrupt service to our user base and, ultimately, to the people of Colorado.
Collection of Metrics
Following each Major Incident, we record several KPIs (Key Performance Indicators) that give a high-level view of the issues occurring across the different agencies. Some examples of these include:
At the beginning of each month, we submit a report to the Executive Leadership Team that graphically outlines the above KPIs. Below are some examples of what we report:
Actions Based on Trends
Once a baseline is established and trends become identifiable, actions can be taken to reduce the number of Major Incidents and, in turn, the downtime of the services we provide. One example is the downtime caused by vendor outages. An outage here and there may go unnoticed and be quickly forgotten until you take a high-level view of the data below.
From this data we can draw what we need to approach the vendor and start asking questions. We can compare the MTTR to the SLAs and request paybacks, ask what is being done to alleviate repeat issues, or even congratulate them on a job well done!
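To make the MTTR-to-SLA comparison concrete, here is a minimal sketch in Python. It assumes MTTR means the mean time to restore service, measured from outage start to restoration. The vendor names, timestamps, and SLA targets are made up for illustration and are not OIT data, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident records: (vendor, outage start, service restored).
INCIDENTS = [
    ("Vendor A", datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),
    ("Vendor A", datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 18, 0)),
    ("Vendor B", datetime(2024, 1, 9, 8, 30), datetime(2024, 1, 9, 9, 0)),
]

# Hypothetical contractual restore targets, in hours.
SLA_HOURS = {"Vendor A": 2.0, "Vendor B": 1.0}


def mttr_by_vendor(incidents):
    """Return the mean time to restore, in hours, for each vendor."""
    durations = defaultdict(list)
    for vendor, start, restored in incidents:
        durations[vendor].append((restored - start).total_seconds() / 3600)
    return {vendor: sum(d) / len(d) for vendor, d in durations.items()}


def sla_report(incidents, sla_hours):
    """Pair each vendor's MTTR with its SLA target and flag whether it was met."""
    report = {}
    for vendor, mttr in mttr_by_vendor(incidents).items():
        target = sla_hours[vendor]
        report[vendor] = {
            "mttr_hours": mttr,
            "sla_hours": target,
            "within_sla": mttr <= target,
        }
    return report


if __name__ == "__main__":
    for vendor, row in sla_report(INCIDENTS, SLA_HOURS).items():
        status = "within SLA" if row["within_sla"] else "exceeds SLA"
        print(f"{vendor}: MTTR {row['mttr_hours']:.1f} h "
              f"(target {row['sla_hours']:.1f} h), {status}")
```

In this made-up sample, one vendor averages longer than its target and the other restores service well inside it, which is the kind of contrast that would prompt either a payback request or a thank-you.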
In time, other data points that can be collected during a Major Incident may come to light that will provide more granular data than what you see here. Please feel free to contact our team if you have ideas or have further questions.
Having put time in at multiple levels during my career, I often questioned policy and procedure changes that were implemented across IT organizations without adequate information as to why they were being made. OIT personnel are very aware of many of our procedural changes from the past six months, though the overall reasoning may be less clear, especially to employees who are not a part of OIT. In this blog, I hope to answer the “why” as it applies to our new approaches to Change, Incident, and Problem Management (coming soon) and to help everyone understand how these things work together to allow for better IT service. First, I would like to list some definitions to help you understand what each of these things is, at least in the context in which OIT uses them.
Change - The addition, modification or removal of anything that could have an effect on IT services.
Change Management - A process used for managing, tracking, and communicating any changes in our live production environment. This applies both to changes that the user will not see, as well as those that are obvious to anyone accessing an IT service.
Major Incident - Any disruption of a production service, whether it is lack of access, performance degradation, or any other issue that prevents the reasonable use of a production service, and that impacts all or most of the users of that service. Many people refer to these as “outages,” which is often accurate, but the term also applies to slowness issues, intermittent issues, etc.
Problem - The unknown root cause of one or more existing or potential Incidents.
For a real-world example of how the three processes intersect, I will offer the following step-by-step scenario:
1. A monitoring tool alerts OIT that an office has lost network communications, causing users to lose the ability to serve customers.
a. This issue is reported as a Major Incident, and the Major Incident Management process is initiated.
b. The proper support personnel are contacted, and they discover that a network device at that office is not responding, and the backup device has not engaged.
c. Network engineers bring the backup device online and restore connectivity. The Major Incident is resolved.
2. A Problem ticket is initiated, and investigation into the cause of the failure begins.
a. Network engineers discover that the primary network device has failed, and needs to be replaced. A case is opened with our supply vendor to acquire a new device.
b. Engineers further discover an issue with how the primary and backup devices are talking to each other, which caused the backup device to not take over when the primary failed.
3. Once the root cause(s) has been identified, OIT begins the Change Management process to fix the underlying problems.
a. When the replacement device arrives, a change is entered to replace the failed device during an approved after-hours maintenance window.
b. A second change is entered, under the same window, to make the programming changes required to allow the primary and backup devices to properly talk to each other.
To recap, this is how the processes work together to reach a better state. The issue was identified when an outage occurred. Through the Major Incident Management process, resources were able to restore service. Once service was restored, the Problem Management process was initiated to help identify why the outage happened to begin with. When the cause was determined, the Change Management process was used to make the necessary changes to prevent any further occurrences.
In the end, this adds up to services being unavailable less often, thus helping all of our users do the work that they are here to do.
It is ever the aim of OIT to make things easier, more accessible, and more consistent for all of the agencies we serve, and then through agency work, the public at large. These new processes have helped OIT staff communicate better with each other and better serve the needs of our customers. Stay tuned for more enhancements and process improvements!
As part of OIT's efforts to better provide the services used by state agencies and as part of a major incident management initiative, we talked with our Service Desk personnel -- the frontline resources who interact on a daily basis with our customers. We wanted to gain insights into what they are hearing from our customers as well as their own opinions about where we are meeting needs and how we could better serve those who have the incredible mission of serving Coloradans. We learned that one of the most common criticisms concerns the way we communicate service disruptions and system outages: primarily, customers receive alerts that do not apply to them, or they receive too many alerts in general.
In response, we added an action item to the existing initiative: provide more focused and clearer (i.e., less “techy”) information. From this was born the Major Incident Management Dashboard you see today. This 24x7 site is now the hub for all Major Incident information. (A major incident is one that prevents reasonable use of a service such as the network or a specific application and that has a significant impact on productivity.)
Customers will no longer receive the Service Impact notifications with the red, yellow, or green banners across the top. Instead, you now have the option to receive notifications generated by this site and to customize what you receive and how often you receive it!
It’s easy -- all you need to do to receive these messages is to join the appropriate group! Just click on the agency acronym for which you wish to receive notifications and click on the blue “Join Group” button. If you wish to receive a notification every time a new incident is posted, click on the “Join this group” button and you’re done. But if you’d rather receive a daily or weekly summary, first change the “Email delivery preference” (you can always modify this later). That’s it -- you’re done.
If you don’t wish to receive notifications, that’s easy too. You won’t receive any unless you subscribe. Regardless of how often you wish to be notified, or even if you don’t subscribe to this service, you can always catch up on the latest status on this dashboard.