Learn how the University of Michigan IT department improved response time, incident handling, and communication.
When I first started working at the University of Michigan, ITIL was just another acronym; few people knew what it meant or how it applied to us. There was one person leading the ITIL implementation, and I was the help desk manager at the time. We started to look at ITIL processes in depth, and as many teams attended ITIL Foundations training, we decided as an organization to move forward with service management. I started by creating an incident management process, service restoration targets, and a Tier 2 Operating Level Agreement (OLA). By 2010, the service management team was gaining momentum, and I moved from the help desk into a new dual role as incident/knowledge manager, which is when we started to formulate the major incident process. By 2016, we had established an efficient process that is still in use today; it faces a few new challenges and always has room for improvement.
When I was the help desk manager, my team was responsible for supporting the PeopleSoft administrative systems: Human Resources, Student, and Financials. That is when we experienced a major crisis. During the first week back to school, the phones were ringing and the queue kept growing. Callers were reporting that they were unable to do anything in Student Administration, which included curriculum, financial aid, and student records. This meant that, during the busiest week of the school year, no rosters could be accessed, so professors did not know who was in their classes; students could not print class schedules, so they did not know where to go; and students were unable to drop or add classes or get financial information. The outage lasted a full day, and service was not restored until the following day.
This outage was complete chaos for our users as well as for us, and it was reported in the media; we realized we needed to do better. We wanted to create a major incident process that would accomplish the following:
- Restore service as quickly as possible
- Identify the right people necessary to fix the issue
- Communicate across teams and to leadership
- Categorize action items to produce an organized resolution plan
- Accurately document the outage
For the initial release, I first focused on how to identify when a critical outage needed to follow the major incident guidelines. Based on the incident management process, everyone was familiar with categorizing critical incidents, identified as Priority One. Outages with High Impact (widespread: the majority of the service's users are affected) and High Urgency (work stoppage) were categorized as Critical (there is significant risk to the business in not restoring users' ability to perform a vital business function), and we set a service target of resolution within four hours. But a Critical incident is not automatically a Major one, and the distinction is not always easy to make.
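To make that categorization concrete, here is a minimal sketch of the impact/urgency lookup a priority matrix encodes. Only the High/High-to-Critical mapping and the four-hour target come from our definitions above; every other cell, name, and target is an illustrative assumption, not our actual tool's configuration.

```python
# Minimal sketch of an ITIL-style priority matrix, assuming three levels
# each of impact and urgency. Only the High/High -> Critical cell and its
# four-hour target come from the text; the rest is illustrative.
from enum import Enum

class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# (impact, urgency) -> (priority label, resolution target in hours)
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): ("Critical (P1)", 4),   # widespread work stoppage
    (Level.HIGH, Level.MEDIUM): ("High (P2)", 8),     # illustrative
    (Level.MEDIUM, Level.HIGH): ("High (P2)", 8),     # illustrative
    (Level.MEDIUM, Level.MEDIUM): ("Medium (P3)", 24),
}

def categorize(impact: Level, urgency: Level) -> tuple[str, int]:
    """Return the priority label and resolution target for an incident."""
    return PRIORITY_MATRIX.get((impact, urgency), ("Low (P4)", 72))

# The PeopleSoft outage: majority of users affected, complete work stoppage.
print(categorize(Level.HIGH, Level.HIGH))  # ('Critical (P1)', 4)
```

The point is that Critical is a mechanical lookup; deciding whether a Critical incident is also a Major one still takes human judgment.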
I began streamlining the process by creating process guidelines for handling these incidents (referred to as significant incidents [SIs]) and an incident template for documentation. Roles were limited, and the guidelines instructed managers to contact me, as the incident manager, to request a significant incident. I would coordinate stand-up meetings and document the outage and resolution in an incident template to distribute. We formulated five high-level steps for the process (sketched as a simple lifecycle after the list):
- Determine and Notify: when a manager identified a critical outage and did not have a plan to resolve within one hour, they would contact me to start the process.
- Initial Stand-Up Meeting: meetings were generally held in the lunchroom with a sign-in sheet and these goals:
  - Roll Call
  - Overview of Issue
  - Next Steps and/or Contingencies
  - Action Items and Assignment
  - Establish Next Check-In Date and Time
  - Communication Plan
- Coordination: I would act as facilitator and the main point of contact during outages and would be responsible for documenting the incident and configuring an email group for updates.
- Ongoing Meetings: meetings would be scheduled as necessary; I would track those involved and keep them in the loop.
- Major Incident Closure: once the issue was resolved, I would draft a summary to distribute to leadership.
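For readers who think in state machines, here is the lifecycle I read into those five steps. It is a sketch under my own assumptions (a strictly ordered progression, with ongoing meetings repeating until closure), not a formal specification we published.

```python
# Hedged sketch of the SI lifecycle implied by the five steps above.
# The transition rules (coordination can loop through ongoing meetings
# until closure) are one reading of the process, not an official spec.
from enum import Enum, auto

class SIState(Enum):
    DETERMINE_AND_NOTIFY = auto()
    INITIAL_STANDUP = auto()
    COORDINATION = auto()
    ONGOING_MEETINGS = auto()
    CLOSURE = auto()

TRANSITIONS = {
    SIState.DETERMINE_AND_NOTIFY: {SIState.INITIAL_STANDUP},
    SIState.INITIAL_STANDUP: {SIState.COORDINATION},
    SIState.COORDINATION: {SIState.ONGOING_MEETINGS, SIState.CLOSURE},
    SIState.ONGOING_MEETINGS: {SIState.ONGOING_MEETINGS, SIState.CLOSURE},
    SIState.CLOSURE: set(),  # terminal: summary drafted and distributed
}

def advance(state: SIState, nxt: SIState) -> SIState:
    """Move the SI to its next state, rejecting out-of-order jumps."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"cannot move from {state.name} to {nxt.name}")
    return nxt

state = advance(SIState.DETERMINE_AND_NOTIFY, SIState.INITIAL_STANDUP)
print(state.name)  # INITIAL_STANDUP
```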
That first year we handled nine SIs, and the improvements were noticeable. Outages were resolved more quickly, incidents were documented, and both internal and external communication was consistent. This was partly because *I* was running the process, but I realized I had created a single point of failure! It was difficult to get the right people notified in time, there was confusion about when to use the process, and adoption was inconsistent. People were skeptical and hesitant to declare an issue and felt the process was extra work. Over the next several years, we focused on process improvements in four areas, based on the lessons learned that first year:
- Staff Training: Confusion about when to use the process was addressed by distributing one-pagers to simplify it. We distributed documentation through team meetings, all-staff meetings, a newsletter article, intranet documentation, a KBA article for support reference, and a MyLINC job aid for configuring notifications.
- Process Adoption: Adoption was inconsistent across the organization; some people were skeptical and hesitant to declare an issue, while others saw the benefits. We constructed a list of key contacts to include during each outage and drafted work-plan descriptions of the process for managers to use to communicate expectations.
- Consistency: The key to consistency is documentation. The incident template was enhanced with structure and a resolution template was developed.
- Ease of Use: Ease was key. We had a few technical challenges:
- Conference line capacity: we expanded the conference line from 30 to 100 participants.
- Notifications: we established a 24/7 protocol using a shared calendar with subscriptions that send alerts to email, phone, and cell phone (see the sketch after this list).
- Scheduling meetings: using the same calendar, all meetings send notifications to everyone, and each person determines when to participate.
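As a rough illustration of the calendar-based notification idea, the sketch below emits a standard iCalendar (RFC 5545) event for a stand-up meeting; subscribers' calendar clients then handle the fan-out to email and phone. The 30-minute stand-up window matches the process described later in this article, while the identifiers and domain are hypothetical.

```python
# Minimal sketch: generating an iCalendar (.ics) event for an SI stand-up,
# assuming staff subscribe to a shared calendar whose events reach them via
# email and phone per their own notification preferences. The ICS fields are
# standard (RFC 5545); the IDs and domain below are hypothetical.
from datetime import datetime, timedelta, timezone

def si_standup_event(incident_id: str, summary: str,
                     minutes_from_now: int = 30) -> str:
    start = datetime.now(timezone.utc) + timedelta(minutes=minutes_from_now)
    end = start + timedelta(minutes=30)
    fmt = "%Y%m%dT%H%M%SZ"
    return "\r\n".join([
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//si-process//EN",   # hypothetical identifier
        "BEGIN:VEVENT",
        f"UID:{incident_id}@example.edu",       # hypothetical domain
        f"DTSTAMP:{datetime.now(timezone.utc).strftime(fmt)}",
        f"DTSTART:{start.strftime(fmt)}",
        f"DTEND:{end.strftime(fmt)}",
        f"SUMMARY:SI {incident_id} stand-up: {summary}",
        "BEGIN:VALARM",                         # reminder for subscribers
        "TRIGGER:-PT15M",
        "ACTION:DISPLAY",
        f"DESCRIPTION:SI {incident_id} stand-up starting soon",
        "END:VALARM",
        "END:VEVENT",
        "END:VCALENDAR",
    ])

print(si_standup_event("SI-0042", "Student Administration outage"))
```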
To alleviate the manual process, we transitioned from pushing information to letting people pull it. Instead of one person documenting and sharing the information, we moved the process to Google Docs for collaboration, so everyone could contribute to and access the information.
The process improvements we implemented were based on feedback from staff, and we learned that not all feedback is good feedback. One thing we tried was major incident levels, developed for incidents that lasted an extended period of time: once resolution was in progress, the main focus shifted to communication needs, and technical staff wanted to concentrate on fixing the issue rather than on the communication piece. We tried this for a while and found it did not work. It was just as easy to identify who was needed after the initial call, and although we added the levels to help people understand when they were needed, they actually caused more confusion. A major incident is a major incident, with all hands on deck. By 2013, SIs had become an adaptable routine, and leadership noticed the improvements and relied on the significant incident process. While our progress was commendable, we sought to refine the process further and focused on four key areas:
- Participation: We wanted to increase participation across the organization by updating SI procedures to include a broader audience as well as pushing decision-making out to the entire organization.
- Facilitation: We needed to facilitate communication and coordination so that the organization could address significant service interruptions, challenges, expectations, and issues and minimize customer impact.
- On-call: We created a facilitator checklist for training and directions and a facilitator guide to use during an outage.
- Roles: We structured the roles and responsibilities by defining the specifics of each role and clarifying expectations. Using RACI, we identified seven roles: all staff, scheduler, facilitator, support, service owner, communications representative, and on-call teams (a sample matrix is sketched below).
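To show what a RACI structure might look like against the five process steps, here is a hedged sketch. The seven roles and five steps come from this article, but the individual R/A/C/I assignments are hypothetical placeholders, not our published matrix.

```python
# Hedged sketch of a RACI matrix for the SI process. The roles and steps
# come from the text; the R/A/C/I assignments are hypothetical placeholders.
RACI = {  # step -> {role: "R"esponsible, "A"ccountable, "C"onsulted, "I"nformed}
    "determine and notify": {"scheduler": "R", "facilitator": "A",
                             "service owner": "C", "all staff": "I"},
    "initial stand-up":     {"facilitator": "A", "on-call teams": "R",
                             "communications representative": "C",
                             "all staff": "I"},
    "coordination":         {"facilitator": "A", "support": "R",
                             "service owner": "C"},
    "ongoing meetings":     {"facilitator": "A", "on-call teams": "R"},
    "closure":              {"facilitator": "R", "service owner": "A",
                             "communications representative": "C",
                             "all staff": "I"},
}

def responsibilities(role: str) -> dict[str, str]:
    """List what a given role does at each step (hypothetical data)."""
    return {step: raci[role] for step, raci in RACI.items() if role in raci}

print(responsibilities("facilitator"))
```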
Currently, our process for major incidents follows the original five steps, and as the process has matured, it focuses on calling, scheduling, facilitating, and communicating. In the first scenario I gave, when the PeopleSoft student system went down, the process now works like this: within the first hour, the service desk initiates a significant incident and notifies the SI scheduler; a parent incident is established to record all details; a service status announcement is posted so all users are aware of the issue; and an all-hands stand-up meeting is scheduled within 30 minutes of the report, with calendar notifications going out to all support staff. During the initial stand-up meeting, an incident document is created to track all meeting notes. It is available for anyone to view and includes the agenda items we identified early on: verify on-call group representation, overview of issue, next steps and/or contingencies, action items and assignment, establish next check-in date and time, and communication plan.
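Summarized as a checklist, that first hour looks roughly like the sketch below. The steps and the 30-minute stand-up window come straight from the description above, while the structure itself is illustrative rather than a real integration with our ITSM tooling.

```python
# Minimal sketch of the first-hour SI runbook described above. The steps and
# the 30-minute stand-up window come from the text; the structure is
# illustrative, not a real integration with any ticketing system.
from datetime import datetime, timedelta

def first_hour_runbook(reported_at: datetime) -> list[str]:
    standup = reported_at + timedelta(minutes=30)
    return [
        "Service desk initiates the significant incident",
        "Notify the SI scheduler",
        "Open a parent incident to record all details",
        "Post a service status announcement for all users",
        f"Schedule all-hands stand-up for {standup:%H:%M} "
        "and send calendar notifications to all support staff",
        "At stand-up: create the shared incident document and walk the agenda",
    ]

for step in first_hour_runbook(datetime(2016, 9, 6, 8, 0)):
    print("-", step)
```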
Today, we have an efficient process. In 2016 we handled 27 significant incidents, and we rely on the major incident process. We are also now focused on the four As:
- Avoidance: avoid major incidents through problem investigation; every SI gets a problem record and investigation.
- Automation: automate and simplify by investigating new technologies and enhancements.
- Accountability: place more emphasis on service owners and their responsibility, since they have the most at stake in keeping the service operational.
- Action: make it possible for anyone in the organization to run the process by moving from a centrally managed model to empowered groups across the organization.
When attempting to set up a major incident process in your organization, you will face many challenges. Don't be discouraged by early struggles. Our greatest challenge was adoption, which took a long time because it is a culture change.
Aside from the information I've already shared, I'll add a few more pointers. Get the appropriate level of management to champion the effort (shop the process around and get buy-in before rolling it out). Talk to your boss about what value you are trying to bring (depending on your organization's size, you might not need all of it; find the right size for your organization). If everyone participates in a major incident process, these questions will all be answered:
- What is the issue? Categorize the issue and document it in the incident template for the stand-up meeting.
- Who is needed? All hands on deck; identify individuals during meetings, and assign action items as needed.
- Is there a workaround? Discuss during the stand-up meeting, with the service desk manager and communications lead on the call to distribute information to users.
- When will it be fixed? Technical teams provide an estimated time to repair.
- What can I tell customers? Technical teams, service owners, and communications collaborate.
- Does leadership know? Have a communication plan.
- What was the resolution? Create a resolution template.
I hope this information, both the mistakes and the triumphs, helps you in a smooth and effective transition to a major incident process for your organization.