Having worked at multiple levels during my career, I have often questioned policy and procedure changes that were implemented across IT organizations without adequate explanation of why they were being made. OIT personnel are well aware of many of our procedural changes from the past six months, though the overall reasoning may be less clear, especially to employees who are not part of OIT. In this blog, I hope to answer the “why” behind our new approaches to Change, Incident, and Problem Management (coming soon) and to help everyone understand how they work together to allow for better IT service. First, I would like to list some definitions to help you understand what each of these terms means, at least in the context in which OIT uses them.
Change - The addition, modification or removal of anything that could have an effect on IT services.
Change Management - A process used for managing, tracking, and communicating any changes in our live production environment. This applies both to changes that the user will not see and to those that are obvious to anyone accessing an IT service.
Major Incident - Any disruption of a production service that impacts all or most of the users of that service, whether it is a lack of access, performance degradation, or any other issue that prevents the reasonable use of the service. Many people refer to these as “outages,” which is often accurate, but the term also covers slowness issues, intermittent issues, etc.
Problem - The unknown root cause of one or more existing or potential Incidents.
For a real-world example of how the three processes intersect, I will offer the following step-by-step scenario:
1. A monitoring tool alerts OIT that an office has lost network communications, causing users to lose the ability to serve customers.
a. This issue is reported as a Major Incident, and the Major Incident Management process is initiated.
b. The proper support personnel are contacted, and they discover that a network device at that office is not responding, and the backup device has not engaged.
c. Network engineers bring the backup device online and restore connectivity. The Major Incident is resolved.
2. A Problem ticket is opened, and an investigation into the cause of the failure begins.
a. Network engineers discover that the primary network device has failed and needs to be replaced. A case is opened with our supply vendor to acquire a new device.
b. Engineers further discover an issue with how the primary and backup devices are talking to each other, which caused the backup device to not take over when the primary failed.
3. Once the root causes have been identified, OIT begins the Change Management process to fix the underlying problems.
a. When the replacement device arrives, a change is entered to replace the failed device during an approved after-hours maintenance window.
b. A second change is entered, under the same window, to make the programming changes required to allow the primary and backup devices to properly talk to each other.
To recap, this is how the processes work together to reach a better state. The issue was identified when an outage occurred. Through the Major Incident Management process, resources were able to restore service. Once service was restored, the Problem Management process was initiated to identify why the outage happened in the first place. When the cause was determined, the Change Management process was used to make the changes necessary to prevent further occurrences.
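For readers who think in code, the handoff between the three processes can be sketched roughly as follows. This is only an illustration of the scenario above, not a description of any tool OIT actually uses; the class names, fields, and the `plan_changes` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    """A tracked change to production (hypothetical model)."""
    description: str
    window: str  # the approved maintenance window


@dataclass
class Problem:
    """The not-yet-known root cause(s) behind an incident (hypothetical model)."""
    incident_summary: str
    root_causes: List[str] = field(default_factory=list)
    changes: List[Change] = field(default_factory=list)


@dataclass
class MajorIncident:
    """A disruption affecting all or most users of a service (hypothetical model)."""
    summary: str
    resolved: bool = False

    def resolve(self) -> Problem:
        # Service is restored first; the root cause is investigated afterwards.
        self.resolved = True
        return Problem(incident_summary=self.summary)


def plan_changes(problem: Problem, window: str) -> List[Change]:
    # One change per identified root cause, all scheduled in the same window.
    problem.changes = [Change(f"Fix: {cause}", window) for cause in problem.root_causes]
    return problem.changes


# Walk through the scenario: incident, then problem, then changes.
incident = MajorIncident("Office lost network communications")
problem = incident.resolve()
problem.root_causes += [
    "primary network device failed",
    "backup device did not take over",
]
changes = plan_changes(problem, "after-hours maintenance window")
```

The point of the sketch is the ordering: the incident is resolved before the problem is investigated, and changes are only planned once root causes are known.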
In the end, this adds up to services being unavailable less often, which helps all of our users do the work that they are here to do.
It is ever the aim of OIT to make things easier, more accessible, and more consistent for all of the agencies we serve and, through their work, for the public at large. These new processes have helped OIT staff communicate better with one another and better serve the needs of our customers. Stay tuned for more enhancements and process improvements!