What Are the Primary ITIL Major Incident Management Roles and Responsibilities?
*This post originally appeared on the Cherwell blog, prior to the acquisition by Ivanti.
The ITIL® framework is the leading global standard for IT Service Management (ITSM). Most recently, ITIL has contained 26 separate and distinct processes and four functions that are organized into the five stages of the IT service lifecycle. There are ITIL processes to help organizations strategize about what services they will offer, effectively design services, build and deploy services, operate services, and, finally, to facilitate the continual improvement of services the organization has chosen to deploy.
While we're anxiously awaiting ITIL4, ITIL v3 and the subsequent 2011 version contained five volumes that each correspond to a single phase of the service lifecycle:
- Service Strategy
- Service Design
- Service Transition
- Service Operation
- Continual Service Improvement
Within the Service Operation manual, ITIL organizations can find information about the four functions of ITIL, including the all-important Service Desk that exists to facilitate the Incident Management process. ITIL defines an incident as an unplanned interruption to, or reduction in the quality of, an IT service, and all incidents are typically reported to and managed by the IT organization through a service desk.
In this guide, we're focusing in on one of the most important sub-processes of Incident Management: the management of major incidents, or Major Incident Management. We'll explain how they're defined in ITIL and how IT organizations work to resolve them, as well as reviewing the most important ITIL Major Incident Management roles and responsibilities.
What Is Major Incident Management?
The goal of the overall Incident Management process is to effectively manage the lifecycle of all incidents and to restore IT services for users or customers as quickly as possible when an interruption takes place. Incident Management comprises nine sub-processes that work together so that the IT organization handles incidents efficiently. While our present focus is on Major Incident Management, let's take a look at how these sub-processes work together within the Incident Management process:
- Incident Management Support aims to provide and maintain the tools, processes, skills, and rules that support technicians need to handle incidents efficiently.
- Incidents that are reported to the Service Desk pass through an Incident Logging and Categorization step that is typically conducted by a 1st-level technician. Incidents must be recorded and prioritized according to their urgency to ensure that they are resolved in a timely manner. Major incidents represent the highest priority incidents that must be resolved by the service desk.
- Immediate Incident Resolution by 1st-Level Support happens when a reported incident can be resolved on the first call. First-level technicians should aim to recover services as quickly as possible using a workaround.
- When an incident cannot be immediately resolved, the next step is Incident Resolution by 2nd-Level Support, with the goal of resolving the incident as quickly as possible (within the agreed time schedule).
- Outstanding incidents are continuously monitored through a process known as Incident Monitoring and Escalation, ensuring that the IT organization can allocate additional resources toward a high-priority incident that must be resolved to maintain service level agreements.
- When a major interruption occurs, ITIL organizations can follow the Handling of Major Incidents sub-process to guide their actions and decisions in resolving the incident as quickly as possible. A major incident is one that causes a serious interruption to business activities and must be resolved with the utmost urgency. For large organizations, a major interruption could result in hundreds of thousands or even millions of dollars in lost revenue. When an incident is escalated to a "Major Incident," Incident Managers do everything they can to resolve the issue promptly, including leveraging special support groups or third-party suppliers with more advanced or specific technical knowledge.
- The Incident Closure and Evaluation process ensures that resolved incidents are reviewed for quality and that all information about incidents is accurately recorded.
- The Incident Management team plays a role in supplying Proactive User Information about planned service outages.
- Incident-related information and data is supplied to the other service management processes through Incident Management Reporting.
Major Incidents challenge Incident Managers to effectively notify and coordinate resources and then deploy them to resolve a problem within an extremely short time frame. While the majority of reported incidents are resolved by 1st- or 2nd-level tech support, major incidents often require additional resources to ensure a timely resolution.
How Does ITIL Qualify a Major Incident?
Based on our examination of the sub-processes that make up Incident Management, we can make some simple inferences about Major Incident Management and how IT organizations handle their highest-priority tickets. We know that incidents are logged and categorized based on their urgency, so IT organizations regularly rely on 1st-level technicians to correctly identify high priority incidents. We also know that incident monitoring and escalation are ongoing processes, so a 1st-level technician has the capacity to escalate issues that can't be resolved on the first call or may require additional resources.
For the IT organization to initiate its Major Incident Management process, there must be some criteria for designating an incident as "major." In fact, the ITIL framework includes an incident priority matrix that Incident Managers can use to organize and prioritize how the IT organization responds to incidents. The incident priority matrix assigns a rating of high, medium, or low to each incident across two separate dimensions: urgency and impact.
High urgency incidents are those for which the damage caused can increase rapidly, or which prevent staff from completing time-sensitive work. Situations where immediate action can prevent a minor incident from becoming a major incident are also considered urgent, as are outages that affect one or more VIP users. Here, the idea of urgency means that the organization can derive significant benefits from addressing the issue sooner rather than later.
Incidents are also assessed for their impact on the organization. A high impact service outage is one that affects a large number of staff and may actually prevent some staff from doing their jobs. High impact incidents have the capacity to cost the company thousands or even tens of thousands of dollars (or more) and the reputation of the business itself could be damaged by the outage.
Ratings of impact and urgency are combined to assign each incident a priority level, commonly between one and five. Priority 1 incidents are considered critical: the IT organization aims to respond immediately to such events and rectify them within one hour. In contrast, priority 5 incidents are a very low priority: the IT organization will act on them within 24 hours and aim for a resolution within one week. Three-level priority schemes are also common.
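One way to picture how two ratings become a single priority is to score each rating numerically and add the scores. The sketch below is only an illustration: the `Level` enum, the `priority` function, and the additive rule are assumptions, since the framework description above only says that impact and urgency together yield a priority between one and five. The targets listed are the two stated above; targets for priorities 2 through 4 are deliberately left out because they are not given here.

```python
from datetime import timedelta
from enum import IntEnum


class Level(IntEnum):
    """Rating used for both impact and urgency (lower number = more severe)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Combine impact and urgency into a priority from 1 (critical) to 5 (very low).

    Assumption: a simple additive matrix. Real matrices are often tuned by hand.
    """
    return int(impact) + int(urgency) - 1


# Response and resolution targets for the two priorities described in the text.
TARGETS = {
    1: {"respond_within": timedelta(0), "resolve_within": timedelta(hours=1)},
    5: {"respond_within": timedelta(hours=24), "resolve_within": timedelta(weeks=1)},
}

if __name__ == "__main__":
    # Print the full 3x3 matrix, one row per impact rating.
    for impact in Level:
        row = [priority(impact, urgency) for urgency in Level]
        print(f"{impact.name:<6} {row}")
```

With this rule, an incident rated high on both dimensions lands at priority 1, and one rated low on both lands at priority 5.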
Many IT organizations define additional criteria for identifying major incidents and responding appropriately. It is useful to designate certain groups of services, applications, or infrastructure components as business-critical and to trigger the Major Incident Handling process when one of these components becomes unavailable and the estimated time to recover the service is exceedingly long or even unknown.
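A rule like this is straightforward to express as a check. The following is a minimal sketch, assuming a hypothetical list of business-critical components and a hypothetical four-hour recovery threshold; neither value comes from ITIL, and each organization would set its own.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Hypothetical examples; each organization defines its own critical services.
BUSINESS_CRITICAL = {"payments", "email", "erp"}

# Hypothetical threshold for an "exceedingly long" recovery time.
MAX_ACCEPTABLE_RECOVERY = timedelta(hours=4)


@dataclass
class Outage:
    component: str
    # None means the recovery time cannot be estimated yet.
    estimated_recovery: Optional[timedelta]


def triggers_major_incident(outage: Outage) -> bool:
    """Return True when the outage should start Major Incident Handling."""
    if outage.component not in BUSINESS_CRITICAL:
        return False
    if outage.estimated_recovery is None:
        # An unknown recovery time on a critical component is treated as major.
        return True
    return outage.estimated_recovery > MAX_ACCEPTABLE_RECOVERY


if __name__ == "__main__":
    print(triggers_major_incident(Outage("payments", None)))                  # True
    print(triggers_major_incident(Outage("payments", timedelta(minutes=30)))) # False
    print(triggers_major_incident(Outage("intranet-wiki", None)))             # False
```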
Major incidents often share the same characteristics as the priority 1 critical incidents described above. They typically affect many customers at once, often affect several VIP customers, are costly to customers or to the business, and may damage the company's reputation. In addition, major incidents are characterized by the large amount of time and effort that is likely to be required to manage and resolve them.
What Is the ITIL Major Incident Process Flow?
ITIL suggests a relatively simple process flow for diagnosing and managing major incidents within the IT organization.
- The incident is first reported.
- Incident Logging and Categorization takes place—if the incident is a major incident, it will likely be assigned a high rating for both urgency and impact on the organization.
- The incident is escalated to 2nd-level support.
- The Incident Manager is notified that technical support staff believe a major incident has taken place.
- The Incident Manager forms a Major Incident Team (MIT), made up of IT managers and technical experts, many from within the company but some potentially from outside. The team will work together to resolve the incident as quickly as possible.
- Once a workaround is discovered, the incident may be reported to problem management for future investigation and to develop a permanent solution.
- Data is captured from the Major Incident Management process and used to drive continuous improvement throughout the organization's Incident Management practices.
This simple process flow helps to ensure that major incidents are diagnosed early, escalated quickly to the top of the IT organizational chart, and acted on to ensure a prompt resolution. For this to happen, it is important that 1st-level technical staff diagnose and escalate major incidents quickly and don't waste valuable time trying to resolve large and complex incidents themselves.
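The steps of the flow can be sketched as an ordered sequence that a ticketing tool might track. This is an illustrative model only: the `Step` names and the `next_step` helper are assumptions that paraphrase the list above, and a real workflow would allow branches, such as skipping the hand-off to problem management.

```python
from enum import Enum, auto
from typing import Optional


class Step(Enum):
    """Stages of the major incident flow, in the order listed above."""
    REPORTED = auto()
    LOGGED_AND_CATEGORIZED = auto()
    ESCALATED_TO_2ND_LEVEL = auto()
    INCIDENT_MANAGER_NOTIFIED = auto()
    MAJOR_INCIDENT_TEAM_FORMED = auto()
    REPORTED_TO_PROBLEM_MANAGEMENT = auto()
    DATA_CAPTURED_FOR_IMPROVEMENT = auto()


# Enum members iterate in definition order, which gives us the flow.
FLOW = list(Step)


def next_step(current: Step) -> Optional[Step]:
    """Return the step that follows `current`, or None at the end of the flow."""
    index = FLOW.index(current)
    return FLOW[index + 1] if index + 1 < len(FLOW) else None


if __name__ == "__main__":
    step: Optional[Step] = Step.REPORTED
    while step is not None:
        print(step.name)
        step = next_step(step)
```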
In a major incident, service level breaches are highly probable. IT organizations must demonstrate their ability to efficiently resolve major incidents and maintain service level agreements.
What Are the ITIL Major Incident Management Roles and Responsibilities?
Under ITIL, four separate roles are allocated accountability and responsibility during the major incident handling process. Below, we detail the ITIL Major Incident Management roles and responsibilities associated with each of these job titles.
Role of 1st-Level Technical Support
First-level support technicians are the primary point of contact for incident reports within the IT organization. Typically, they staff the IT Service Desk, taking incident reports from users and customers, registering and categorizing them, and undertaking an immediate effort to restore the affected service as quickly as possible.
When 1st-level support cannot rectify a service outage within an acceptable time frame, the incident is escalated to expert technical support groups (2nd-level support). First-level support technicians may be responsible for doing the actual work of restoring an IT service when a major incident occurs, but they aren't the ones responsible for coordinating the major incident team.
Role of an Incident Manager
The Incident Manager takes full ownership and accountability for the Incident Management process within the IT organization, including all major incidents that are reported and must be resolved. Once a major incident is escalated by 1st- or 2nd-level technical staff, the Incident Manager should determine what resources and expertise are required to resolve the incident and set about forming a Major Incident Team that can resolve the issue as quickly as possible.
Role of a Major Incident Team
The role of the MIT in addressing major IT outages is to restore service as quickly as possible using all available resources. The size and composition of the team will depend on the magnitude and nature of the service outage and the specific expertise and action steps required to restore service.
The team can include IT managers from other departments outside the Service Desk, including staff normally responsible for other processes like Change Management. In addition, 1st- and 2nd-level technical support staff, IT operators within the organization, and even third-party technical specialists from outside the company are typically involved. Together, the team develops and implements a strategy to restore services as quickly as possible.
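As a rough illustration of how team composition follows from the expertise an outage demands, the sketch below picks candidates until every required skill is covered. Everything in it is hypothetical: the `Candidate` type, the skill labels, and the simple first-come selection are assumptions made for the example, not a method prescribed by ITIL.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    role: str  # e.g. "2nd-level support", "IT operator", "third-party specialist"
    skills: FrozenSet[str]


def form_team(
    required: Iterable[str], candidates: Iterable[Candidate]
) -> Tuple[List[Candidate], Set[str]]:
    """Select candidates, in the order given, who cover a still-missing skill.

    Returns the chosen team and any required skills nobody could cover.
    """
    uncovered = set(required)
    team: List[Candidate] = []
    for candidate in candidates:
        if candidate.skills & uncovered:
            team.append(candidate)
            uncovered -= candidate.skills
    return team, uncovered


if __name__ == "__main__":
    pool = [
        Candidate("Ana", "2nd-level support", frozenset({"database"})),
        Candidate("Ben", "IT operator", frozenset({"backups", "servers"})),
        Candidate("Cy", "third-party specialist", frozenset({"storage-array"})),
    ]
    team, gaps = form_team({"database", "servers", "network"}, pool)
    print([member.name for member in team])  # ['Ana', 'Ben']
    print(gaps)                              # {'network'}
```

Any skills left in the second return value signal that the Incident Manager needs to look further afield, for example to an outside supplier.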
Role of an IT Operator
IT operators perform daily operational activities within the IT organization, such as installing equipment in the data center, backing up data and maintaining servers, and ensuring that scheduled tasks are performed. IT operators are valued for their familiarity with the company's IT infrastructure and operations, and they may be used as a source of extra labor when the Incident Manager forms a Major Incident Team to address a major service outage.
ITSM Software an Asset for Major Incident Management
IT organizations can increase their efficiency of service delivery by adopting a software-based ITSM solution that supports ITIL best practices.