Incident Management
This document provides practical guidance for the incident management practice.
the practice’s processes and activities and their roles in the service value chain
2. General information
Key message
The purpose of the incident management practice is to minimize the negative impact of incidents by restoring normal
service operation as quickly as possible.
The definition refers to a ‘normal service operation’. Conditions of normal service operation are typically defined within service level agreements
(SLAs), or other forms of service quality specification, either agreed with the customer or defined by the service provider. In some cases, the
service provider’s internal specification can include more quality criteria than were initially agreed with the customers (see more on this in the service
level management practice guide). The incident management practice is not limited to the service quality perceived by users. It includes
restoration of the normal operation of services and resources, even when their failure or deviation is not visible to the service consumers. In this
case, normal operation can be defined in the technical specifications of services or configuration items (CIs). Finally, if there is no documented
specification of a normal operation, an expert opinion may be used to assess the status of the resources and services.
Tips
If users perceive the situation as abnormal, it is recommended to register an incident and work on making users happy as quickly as
possible, regardless of whether an SLA has been breached. If users have not reported anything, but a service level agreement is breached,
register an incident and work to restore the agreed level of service before it affects users. If a service or configuration item is not working
as defined in a technical specification, register an incident and work to restore normal performance before it affects the SLA and users. If
there are no formal specifications of normal operation for a service or component, or if the service works within the specifications but a specialist
thinks that it is not operating normally, register an incident and restore normal operation as quickly as reasonably possible.
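The four rules in these tips can be read as an ordered check. The following is a minimal sketch of that reading, not a prescribed implementation; the `Observation` fields and the reason strings are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical summary of what is known about a service right now."""
    user_reports_abnormal: bool = False   # users perceive the situation as abnormal
    sla_breached: bool = False            # an agreed service level is breached
    spec_violated: bool = False           # deviation from a technical specification
    specialist_concern: bool = False      # expert opinion: not operating normally


def registration_reason(obs: Observation) -> Optional[str]:
    """Return why an incident should be registered, or None if nothing applies."""
    if obs.user_reports_abnormal:
        return "user perception"
    if obs.sla_breached:
        return "SLA breach"
    if obs.spec_violated:
        return "technical specification deviation"
    if obs.specialist_concern:
        return "expert opinion"
    return None
```

Any non-`None` result means an incident is registered; the order only determines which reason is recorded first when several conditions hold at once.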
The incident management practice is a fundamental element of service management. This practice is beneficial for both IT service providers
and their service consumers.
The quick restoration of a service is a key factor in user and customer satisfaction, the credibility of the service provider, and the value the
service provider creates in the service relationships.
Definition: Incident
An unplanned interruption to a service or reduction in the quality of a service.
The incident management practice ensures that periods of unplanned service unavailability or degradation are minimized, thus reducing
negative impacts on users. There are two main factors enabling this: early incident detection and the quick restoration of normal operation.
The quick detection and resolution of incidents is made possible with effective and efficient processes, automation, and supplier relationships
alongside skilled and motivated specialist teams. Resources from the four dimensions of service management are combined to form the
incident management practice.
Some systems and services demonstrate patterns of operations that include so-called typical incidents. These may be associated with known
errors, such as a lack of compatibility or patterns of incorrect user behaviour. Service providers benefit from defining incident models to
optimize the handling and resolution of repeating or similar incidents. Incident models help to resolve incidents quickly and efficiently, and
often with better results, due to the application of proven and tested solutions.
The creation and use of incident models are important activities in the incident management practice. They are described further in section 3.
Although some incidents have a relatively low impact on service operation and on the work of users, others may lead to dramatic consequences for
service consumers and the service provider. These are called major incidents and require special attention.
A significant business impact is not the only characteristic of a major incident. Major incidents are often associated with a higher level of
complexity. Many systems and services are designed for high availability, and single failures are unlikely to cause a significant business impact.
Failures in these systems are quickly, and often automatically, detected and fixed. However, if multiple seemingly trivial events coincide, they
may lead to a major disruption of multiple services and have a high impact on service consumers. Complex incidents such as this require a
special approach to management and resolution.
It is recommended to implement a model to manage all major incidents, even though major incidents rarely recur and usually differ in nature.
A model for major incidents typically includes:
- clear criteria to distinguish major incidents from disasters and other incidents
- a special accountable coordinator, sometimes referred to as the major incident manager (MIM)
- other dedicated resources (including budget); for example, for urgent consultations with third parties
- an agreed model of communications with users, customers, regulators, media, and other stakeholders
2.2.3 Workarounds
Definition: Workaround
A solution that reduces or eliminates the impact of an incident or problem for which a full resolution is not yet available.
Some workarounds reduce the likelihood of incidents.
Sometimes, it may be impossible to find a systemic solution for an incident. In these situations, service providers may apply a workaround.
Workarounds promptly restore the service to an acceptable quality. However, workarounds can increase technical debt and may lead to new
incidents in the future. The problem management practice can be used to reduce the technical debt created by incident workarounds. In
many cases, understanding the cause or causes of an incident can help find an optimal solution.
Definition: Technical debt
The total rework backlog accumulated by choosing workarounds instead of systemic solutions that would take longer.
2.3 Scope
reviewing incidents and initiating improvements to services and to the incident management
There are a number of activities and areas of responsibility that are not included in the incident management practice, although they are
closely related to it. These activities are listed in Table 2.1, along with references to the practice guides in which they can be found. Management
practices should be combined to form service value streams, as described in section 3.2.
Table 2.1 Activities related to the incident management practice described in other practice guides
Activity: Implementation of changes to products and services
Practice guides: Change enablement; deployment management; infrastructure and platform; project management; release management; software development management
Definition: Practice success factor
A complex functional component of a practice that is required for the practice to fulfil its purpose.
A practice success factor (PSF) is more than a task or activity; it includes components from all four dimensions of service management. The
nature of the activities and resources of PSFs within a practice may differ, but together they ensure that the practice is effective.
Previously, it was a common practice to register most incidents based on information from end users and IT specialists. This method of
sourcing information is still widely used, but good practice currently suggests detecting and registering incidents automatically wherever
possible. This can be done immediately after incidents occur and before they start affecting users. This approach has multiple benefits:
- Earlier incident detection decreases the time of the service unavailability or degradation, which in turn decreases the losses and other
negative business impact caused by incidents.
- The higher quality of the initially collected data supports the correct response to and resolution of incidents, including automated
resolution, also known as self-healing.
- Some incidents remain invisible to users, improving user satisfaction and customer satisfaction.
- Some incidents may be resolved before they affect the service quality agreed with customers, improving the perceived service and the
reported service quality.
Early detection of incidents is enabled by the monitoring and event management practice. This includes tools and processes for event
categorization that distinguish incidents from other types of events. Automatically detected incidents can be classified either automatically,
manually, or with partial automation. A partially automated categorization is made manually but is based on suggestions made by the system.
Automated incident detection and categorization may benefit from machine learning solutions, using the data available from past incidents,
events, known errors, and other sources. See section 3.1.1 for more details on incident classification.
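The difference between automatic and partially automated categorization can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Event` fields, the thresholds, the rule names, and the category labels (`incident`, `warning`, `informational`) are all hypothetical and are not taken from this guide.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    source: str    # hypothetical: name of the monitored CI
    metric: str    # hypothetical: name of the measured value
    value: float


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]
    category: str


# Assumed rules, checked in order; the first match wins.
RULES: List[Rule] = [
    Rule("disk almost full", lambda e: e.metric == "disk_used_pct" and e.value >= 95, "incident"),
    Rule("disk filling up", lambda e: e.metric == "disk_used_pct" and e.value >= 80, "warning"),
]


def suggest_category(event: Event, rules: List[Rule] = RULES) -> str:
    """System suggestion based on pre-defined rules."""
    for rule in rules:
        if rule.matches(event):
            return rule.category
    return "informational"


def categorize(event: Event,
               confirm: Optional[Callable[[Event, str], str]] = None) -> str:
    """Fully automatic when confirm is None; otherwise a person makes the
    final decision based on the system's suggestion (partial automation)."""
    suggestion = suggest_category(event)
    if confirm is None:
        return suggestion
    return confirm(event, suggestion)
```

In the partially automated case, `confirm` stands in for a person who sees the suggestion and either accepts or overrides it.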
When automated incident detection is not possible, incidents are usually detected when they have already impacted users and their work.
Even then, the earlier an incident is reported and registered, the better. This can be achieved by promoting a culture of responsible service
consumption among users that includes encouraging reporting of suspicious events and behaviour, and tolerating false reports, within reason.
This PSF is vital for the success of the incident management practice and for general service quality. After incidents are detected, they should
be handled effectively and efficiently, considering the complexity of the environment:
- In clear situations, such as recurring and well-known incidents, pre-defined resolution procedures are likely to be effective. These may
include automated resolution or standardized routing and handling (according to an appropriate pre-agreed incident model).
- In complicated situations, where the exact nature of the incident is unknown but the systems and components are familiar to the support
teams and the organization has access to expert knowledge, incidents are usually routed to a specialist group or groups for diagnosis and
resolution. Sometimes this can assist in identifying patterns and lead to a model and/or a solution which can be applied to similar
incidents in the future.
- In complex situations, where it is difficult or impossible to define an expert area and group, or where defined groups of experts fail to find a
solution, a collective approach may be useful. This technique is known as swarming.
Definition: Swarming
A technique for solving various complex tasks. In swarming, multiple people with different areas of expertise work together
on a task until it becomes clear which competencies are the most relevant and needed.
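The choice between the three handling approaches described above can be sketched as a small decision function. This is a deliberately simplified illustration: the two boolean inputs and the approach labels are assumptions made for the example, and a real assessment would rely on far more context.

```python
from enum import Enum


class Situation(Enum):
    CLEAR = "clear"
    COMPLICATED = "complicated"
    COMPLEX = "complex"


def assess(has_incident_model: bool, expert_group_known: bool) -> Situation:
    """Hypothetical assessment based on two simplified questions."""
    if has_incident_model:
        return Situation.CLEAR
    if expert_group_known:
        return Situation.COMPLICATED
    return Situation.COMPLEX


def handling_approach(situation: Situation) -> str:
    """Map each situation to the approach described in the text."""
    return {
        Situation.CLEAR: "apply pre-defined incident model",
        Situation.COMPLICATED: "route to specialist group",
        Situation.COMPLEX: "swarm",
    }[situation]
```

If a specialist group fails to find a solution, the incident can be reassessed with `expert_group_known=False`, which moves it to swarming.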
Usually, swarming assists in decreasing the level of complexity and makes it possible to switch to the techniques used in complicated or clear
situations. One example where swarming is particularly relevant is major incidents of an unknown nature. In these situations, pulling
together numerous specialized resources is cost-effective compared to the losses resulting from the incident remaining unsolved.
Physical meetings are not required when swarming. When a plan is established, experts may work alone to run experiments, perform analysis,
and use other tools to discover what is happening. To engage with the incident, swarming utilizes the correct people rather than a large
number of people. It is usual to involve people from different teams in swarming; this requires organizational solutions which allow involving
team members at very short notice.
Other techniques can be used in complex situations. For example, expert analysis may be replaced or combined with a series of safe-to-fail
experiments which aim to improve the understanding of the nature of the incident. Adopting and utilizing a complexity-based framework for
decision-making [1] is useful for dealing with incidents in situations of high and changing complexity.
As mentioned in section 2.2.1, some incidents recur and can be handled in a well-known, repeatable way. Ideally, such recurrences should be
analysed and further repetition prevented (this usually involves the problem management practice). However, problem management may take
significant time, and some incidents, even if well understood, cannot be effectively prevented. Their occurrence and nature are clear, and their
handling can often follow a well-defined incident model. To optimize the time and resources for resolution of such incidents, the shift-left
approach can be used.
Definition: Shift-left approach
An approach to managing work that focuses on moving activities closer to the source of the work, in order to avoid
potentially expensive delays or escalations. In a software development context, a shift-left approach might be characterized
by moving testing activities closer to (or integrated with) development activities. In a support context, a shift-left approach
might be characterized by providing self-help tools to end-users.
In incident management, shift-left can be used to delegate more activities to users: not only reporting an incident, but also self-help using chat
bots, FAQ pages, and other resources. Another form of shift-left is training service desk agents to diagnose and solve more types
of incidents. Any opportunity to solve incidents without transferring them to other teams should be used, especially as the transfer is likely to
take extra time and cost extra money. This should not, however, create unacceptable delays; the speed of incident resolution remains the most
important requirement. The shift-left approach works best in clear, well-known situations, where less experienced people can successfully
follow well-tested and safe instructions.
Regardless of the complexity, it is important to review and confirm the high quality of the incident data from the first steps of incident
handling. This has a strong influence on the:
Incidents should be resolved as soon as possible. However, the resources of the teams involved in incident resolution are limited and these
teams are often simultaneously involved in other types of work. Some incidents should be prioritized over others to minimize negative impacts
on users and optimize the use of resources.
Definitions
Prioritization
An action of selecting tasks to work on first when it is impossible to assign resources to all tasks in the backlog.
Task priority
The importance of a task relative to other tasks. Tasks with a higher priority should be worked on first. Priority is defined in
the context of all the tasks in a backlog.
There are a number of simple guidelines for prioritization which apply to all types of tasks, including incidents:
- Prioritization is a tool for assigning tasks to people in the context of a team. If an incident is handled by multiple teams, it will be prioritized
within each team depending on resource availability, target resolution time, and estimated processing time. If resolution of an incident
requires several tasks to be performed by different teams working in parallel, each team prioritizes its own task.
- Prioritization is needed only when there is a resource conflict. Where there are sufficient resources to process every task within the time
constraints, prioritization is unnecessary.
- In each team, all types of tasks (including incidents) should await prioritization and assignment in a single backlog, together with other
tasks (planned and unplanned).
- Visualization tools, such as Kanban, and Lean principles, such as the limiting of work in progress, are useful for effective prioritization.
These rules apply to all types of work, whether planned or unplanned, performed by the service provider’s specialist teams. It is important that
they are agreed and followed by everyone involved in the organization’s service management activities, across all practices. Specific to incident
management, the following additional recommendations should be considered:
- Evaluation of the impact and urgency of an incident is performed during incident classification (see section 3.1.1). This evaluation and
the related time constraints for investigation and resolution (often guided by a service level agreement) are not prioritization. However,
the evaluation provides important input for prioritization.
- Resource availability and estimated processing time are defined by each team. For well-known repeating operations, the processing time
may be standardized. The target resolution time may be defined by SLAs and/or the internal service specifications of the service provider.
- The impact assessment and completion (resolution) time may change as support teams discover new information.
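One way to picture prioritization within a single team backlog is to order tasks by how little slack they have: the time left until the target resolution time minus the team's own processing estimate. The sketch below is an assumption-laden illustration, not a method prescribed by this guide; the `Task` fields, the slack rule, and the capacity check are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    hours_to_target: float   # time left until the target resolution time
    estimated_hours: float   # the team's own processing-time estimate

    @property
    def slack(self) -> float:
        """Spare time before the target is at risk."""
        return self.hours_to_target - self.estimated_hours


def needs_prioritization(backlog: List[Task], available_hours: float) -> bool:
    """Simplified resource-conflict check: is there more work than capacity?"""
    return sum(t.estimated_hours for t in backlog) > available_hours


def prioritize(backlog: List[Task]) -> List[Task]:
    """Order one team's single backlog (incidents and other tasks) by slack."""
    return sorted(backlog, key=lambda t: t.slack)
```

Impact and urgency from classification feed into `hours_to_target` through the agreed target resolution time; they are inputs to the ordering rather than the ordering itself. Because estimates change as teams learn more, the order would be recomputed whenever a task is updated.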
Periodic reviews of incidents should be conducted to improve the effectiveness and efficiency of the incident management practice. Some
incidents will require an individual review upon resolution. This usually applies to major incidents, new types of incidents, and incidents that
were not resolved on time. Most incidents, however, do not require an individual review beyond confirming their successful resolution.
Nonetheless, an overview of the incident management records at certain intervals will help to identify positive experiences and room for
improvement; share knowledge between specialist teams; identify new types of incidents; and improve or introduce incident models.
Periodic reviews provide an opportunity to analyse the stakeholders’ satisfaction with the incident management practice. Periodic incident
review is also key for the continual improvement of the practice and the organization’s products and services.
Key message
Effective reviews will always need data; therefore, it is important to agree the requirements for documenting it. Data should
be:
- Concurrent: It is useful to know exactly what was done when, to assist in continual improvement. This requires
stakeholders to update incident records during, not after, the event. Also, an accurate timeline may be useful for
investigating the problem.
- Complete: A considerable amount of activity can be hidden behind a simple statement. For example, a statement such as
‘We restarted the cluster and normal function was observed after 45 minutes’ may hide useful detail. It could mean: ‘We
restarted Server 1, then 2, then 3 and found that Server 4, which was operating normally, stopped. We checked the manual
and restarted Servers 2 and 4, then 1 and 3. All were processing data correctly after 10 minutes.’
- Comprehensive: Describing why an action was taken can be just as important as describing the action itself.
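These documentation requirements can be supported by the structure of the incident record itself. The following sketch assumes a hypothetical `IncidentRecord` with a timestamped timeline in which every entry must state both the action and the reason for it; the class and field names are illustrative and do not describe any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime       # when the action was taken
    action: str        # what was done, step by step
    reason: str        # why it was done
    result: str = ""   # what was observed afterwards


@dataclass
class IncidentRecord:
    incident_id: str
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, action: str, reason: str, result: str = "",
            at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry during the event; reject entries without a reason."""
        if not action.strip() or not reason.strip():
            raise ValueError("both the action and its reason must be recorded")
        entry = TimelineEntry(at or datetime.now(timezone.utc), action, reason, result)
        self.timeline.append(entry)
        return entry
```

Logging each restart from the cluster example as a separate entry, rather than as one summary statement, preserves the detail that a later review would need.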
The practice metrics should be applied to a specific context such as type of incident, services, specialist groups, or periods of time.
The effectiveness and performance of the ITIL practices should be assessed within the context of the value streams to which the practices
contribute. The context of the business and the value streams is important to define what is considered good or not so good performance of a
practice. This is why this practice guide cannot recommend universal key performance indicators for incident management: the target values
for each metric can only be defined in the organization’s context.
Resolving incidents quickly and efficiently
- Time between incident detection and acceptance for diagnosis
- Time of diagnosis
- Number of reassignments
- Percentage of waiting time in the overall incident handling time
- First-time resolution rate
- Meeting the agreed resolution time
- User satisfaction with incident handling and resolution
- Percentage of incidents resolved automatically
- Percentage of incidents resolved before being reported by users
Continually improving incident management
- Percentage of incident resolutions using previously identified and recorded solutions
- Percentage of incidents resolved using incident models
- Improvement of the key practice indicators over time
- Balance between the speed and effectiveness metrics for incident resolution
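Several of these metrics are simple ratios over closed incident records. The sketch below shows how a few of them might be computed; the `ClosedIncident` fields and the metric keys are hypothetical, and target values are intentionally absent because they can only be set in the organization's own context.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClosedIncident:
    handling_minutes: float        # total time from detection to closure
    waiting_minutes: float         # part of the handling time spent waiting
    reassignments: int
    resolved_first_time: bool
    resolved_automatically: bool
    met_target: bool               # resolved within the agreed resolution time


def pct(part: float, whole: float) -> float:
    """Percentage that is safe for an empty data set."""
    return 100.0 * part / whole if whole else 0.0


def practice_metrics(incidents: List[ClosedIncident]) -> Dict[str, float]:
    n = len(incidents)
    return {
        "first_time_resolution_rate_pct": pct(sum(i.resolved_first_time for i in incidents), n),
        "resolved_automatically_pct": pct(sum(i.resolved_automatically for i in incidents), n),
        "met_agreed_resolution_time_pct": pct(sum(i.met_target for i in incidents), n),
        "waiting_time_pct": pct(sum(i.waiting_minutes for i in incidents),
                                sum(i.handling_minutes for i in incidents)),
        "avg_reassignments": sum(i.reassignments for i in incidents) / n if n else 0.0,
    }
```

To apply the metrics to a specific context, the list passed in would first be filtered by incident type, service, specialist group, or period.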
Definition: Process
A set of interrelated or interacting activities that transform inputs into outputs. A process takes one or more defined inputs
and turns them into defined outputs. Processes define the sequence of actions and their dependencies.
- Incident handling and resolution: This process is focused on the handling and resolution of individual incidents, from detection to
closure.
- Periodic incident review: This process ensures that the lessons from incident handling and resolution are learned and that approaches to
incident management are continually improved.
This process includes the activities listed in Table 3.1, and transforms the inputs into outputs.
Table 3.1 Inputs, activities, and outputs of the incident handling and resolution process
SLAs with consumers and suppliers/partners Incident closure Updates to the knowledge base
Problem records
Knowledge base
Throughout the process, ownership over each incident should be ensured. The ownership may be transferred via the handling and resolution
process, but each incident should have a person responsible for it at any time. Also, stakeholder communications should be updated whenever
there are changes in the status of the incident.
The process may vary significantly, depending on the incident model. Table 3.2 provides descriptions of the activities in two incident models
(manual and automatic), which are just two of many options. They are meant to illustrate the difference between incident models.
Incident detection
Manual model: A user detects a malfunction in service operation and contacts the service provider’s service desk through the agreed channel(s). A service desk agent performs the initial triage of the user query, confirming that the query does indeed refer to an incident.
Automatic model: An event is detected by a monitoring system and identified as an incident based on a pre-defined classification.

Incident registration
Manual model: The service desk agent performs incident registration, adding the available data to the incident record.
Automatic model: An incident record is registered and associated with the CI where the event has been detected. Pre-defined technical data is registered. If needed, a notification is sent to the relevant technical specialists.

Incident classification
Manual model: The service desk agent performs initial classification of the incident; this helps to qualify incident impact, identify the team responsible for the failed CIs and/or services, and link the incident to other past and ongoing events, incidents, and/or problems. In some cases, classification helps to reveal a previously defined solution for this type of incident.
Automatic model: Based on pre-defined rules, the following is automatically discovered:
- the incident's impact on services and users
- the solutions available
- the technical team(s) responsible for the incident resolution if automated solutions are ineffective or unavailable.

Incident diagnosis
Manual model: If classification does not provide an understanding of a solution, technical specialist teams perform incident diagnosis. This may involve transfer of the incident between the teams (also known as functional escalation), or joint techniques, such as swarming. If classification is wrong because of an incorrect CI assignment, this information should be communicated to those responsible for configuration control (see the service configuration practice guide).
Automatic model: If the automated solution is ineffective or unavailable, the incident is escalated to the responsible technical team for manual diagnosis. It may involve transfer of the incident between the teams, or joint techniques, such as swarming. If an automated solution failed because of an incorrect CI association, this information should be communicated to those responsible for the configuration control (see the service configuration practice guide).

Incident resolution
Manual model: When a solution is found, the relevant specialist teams attempt to apply it, working sequentially or in parallel. It may require the initiation of a change. If the solution does not work, additional diagnosis is performed.
Automatic model: If there is an automated solution available, it is applied, tested, and confirmed. If a manual intervention is required, a relevant specialist team attempts to apply it. It may require the initiation of a change. If the solution proves not to work, additional diagnosis is performed.

Incident closure
Manual model: After the incident is successfully resolved, several formal closure procedures may be needed:
- user confirmation of service restoration
- resolution costs calculation and reporting
- resolution price calculation and invoicing
- problem investigation initiation
- incident review.
Automatic model: If the automated solution proves effective, incident records are automatically updated and closed. A report is sent to the responsible technical team. If information about the incident has been communicated to other stakeholders at any of the previous steps, the closure of the incident should also be communicated.
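The automatic model described above can be sketched as a short pipeline: register against the CI, classify by pre-defined rules, try the automated solution, then close or escalate. This is a minimal illustration under stated assumptions; the `IncidentModel` structure, the status strings, and the default team name are hypothetical and do not represent any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Incident:
    ci: str                              # CI where the event was detected
    event_type: str
    status: str = "registered"
    assigned_team: Optional[str] = None
    log: List[str] = field(default_factory=list)


@dataclass
class IncidentModel:
    # Hypothetical automated fix; returns True only if applied and confirmed.
    auto_fix: Optional[Callable[[str], bool]]
    fallback_team: str                   # team for manual diagnosis


def handle_event(ci: str, event_type: str, models: Dict[str, IncidentModel],
                 default_team: str = "service desk") -> Incident:
    incident = Incident(ci, event_type)          # registration, linked to the CI
    incident.log.append("registered")
    model = models.get(event_type)               # classification by pre-defined rules
    if model is not None and model.auto_fix is not None and model.auto_fix(ci):
        incident.status = "closed"               # automated resolution confirmed
        incident.log.append("resolved automatically; report sent to responsible team")
        return incident
    # Automated solution ineffective or unavailable: escalate for manual diagnosis.
    incident.status = "escalated"
    incident.assigned_team = model.fallback_team if model else default_team
    incident.log.append(f"escalated to {incident.assigned_team}")
    return incident
```

The manual model follows the same stages, with a service desk agent performing registration and classification instead of the rule lookup.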
This process is focused on the continual improvement of the incident management practice, incident models, and incident handling
procedures. It is either performed regularly or triggered by incident reports highlighting inefficiencies and other improvement opportunities.
Regular reviews may take place every two to three months or more frequently, depending on the effectiveness of the existing models and
procedures.
This process includes the activities listed in Table 3.3 and transforms the inputs into outputs.
Table 3.3 Inputs, activities, and outputs of the periodic incident review process
Inputs: Current incident models and procedures; incident records; incident reports; policies and regulatory requirements
Activities: Incident review and incident records analysis; incident model improvement initiation
Outputs: Updated incident models; updated incident handling procedures
Activity: Incident review and incident records analysis
Description: The incident manager, together with service owners and other relevant stakeholders, performs a review of selected incidents such as major incidents, those not resolved in time, or all incidents over a certain period. They identify opportunities for incident model and incident handling procedures optimization, including the automation of incident processing and resolution.

Activity: Incident model improvement initiation
Description: The incident manager registers the improvement initiatives to be processed with the involvement of the continual improvement practice or initiates a change request (if incident models, procedures, and automation are included within the scope of the change enablement practice).

Activity: Incident model update communication
Description: If the incident model is successfully updated, it is communicated to the relevant stakeholders. This is usually done by the incident manager and/or the service or resource owner.
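Selecting incidents for individual review can be sketched as a simple filter over resolved incident records. The criteria below follow the ones named earlier in this guide (major incidents, new types of incidents, and incidents not resolved on time); the `ResolvedIncident` fields are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResolvedIncident:
    incident_id: str
    major: bool
    resolved_on_time: bool
    new_type: bool = False   # no existing incident model matched


def select_for_individual_review(incidents: List[ResolvedIncident]) -> List[ResolvedIncident]:
    """Keep only incidents that warrant an individual review."""
    return [i for i in incidents
            if i.major or i.new_type or not i.resolved_on_time]
```

Incidents that are filtered out are still covered by the periodic overview of all incident records.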
To perform certain tasks or respond to particular situations, organizations create service value streams. These are specific combinations of
activities and practices, and each one is designed for a particular scenario. Once designed, value streams should be subject to continual
improvement.
Definition: Value stream
A series of steps an organization undertakes to create and deliver products and services to consumers.
In practice, however, many organizations come to use the value stream concept after having worked for a while (sometimes for years)
without the value streams being managed, mapped, or understood. This means that when the importance of the concept becomes clear, the
first step is to understand and map the ‘As Is’ situation, the de-facto flows of work, and to analyse them in order to identify and eliminate the
non-value-adding activities and other forms of waste.
Identifying and understanding the existing value streams is critical to improving an organization’s performance. Structuring the organization’s
activities in the form of value streams allows it to have a clear picture of what it delivers and how, and to make continual improvements to its
services.
Combined, organizations’ value streams form an operating model which can be used to understand and improve how the organization creates
value for the stakeholders.
Many organizations have been following best practice recommendations for various service management practices, such as incident
management, change enablement, software development, and many others. Incident management is one of the most adopted and mature
practices; organizations often start their ITSM journey with incident management.
However, the practices have often been adopted and organized in a siloed, isolated manner, just as they were presented in the service
management bodies of knowledge. In reality, a flow of work required to create or restore value, for a customer or another stakeholder, is almost
never limited to one practice.
The incident management practice is not enough to restore normal service after it has been interrupted. The real-life workflow may include the
activities outlined in table 3.5, which are described as parts of different practices.
Activity Practice
The incident management practice is core for this value stream, but it is not enough to complete the value stream and restore value co-creation.
ITIL 4 recommends that organizations examine how they perform work and map all the value streams they can identify. This will enable them to
analyse their current state and identify any barriers to workflow and non-value-adding activities (waste). Wasteful activities should be
eliminated to increase productivity.
Opportunities to increase value-adding activities can be found across the service value chain. These may be new activities or modifications to
existing ones, which can make the organization more productive. Value stream optimization may include process automation or adoption of
emerging technologies and ways of working to gain efficiencies or enhance user experience.
Value streams should be defined by organizations for all their products and services. Depending on the organization’s strategy, value streams
can be redefined to react to changing demand and other circumstances, or remain stable for a significant amount of time. In any case, they
should be continually reviewed and improved to ensure that the organization achieves its objectives in an optimal way.
The main and most obvious value stream involving incident management is described in section 3.2.2.1. Unlike most other practices, incident
management is rarely involved in other value streams. Incidents occurring in other value streams trigger the value stream to restore normal
operation, rather than involve the incident management practice in their own context. For example, if an incident occurs during a new product
release, it triggers the value stream to restore normal operation, while the release-related value stream continues, most likely, rolling back the
unsuccessful changes. Similarly, if an incident occurs during fulfilment of a service request, it does not involve incident management in the
ongoing request fulfilment workflow; instead, it triggers the value stream to restore the normal operation, while the request-related value
stream continues or restarts.
However, some organizations come up with operating models where incident management is involved in other value streams. Examples
include:
- Involving the incident management practice to deal with unplanned events in development, testing, and other pre-live environments.
Although these events do not impact live services and do not have a direct business impact, they can be processed using the same or similar
processes, competencies, tools, and third parties: in other words, the same practice. In most cases, people involved in the related workflows are
different from those involved in management of incidents in the live environment.
- Separating the restoration value streams for incidents detected by users and incidents detected by monitoring. The former value stream would
be initiated by users contacting the service desk and focused on restoring the services to an agreed level and to the users’ expectations. The latter
value stream would be triggered by events captured by the monitoring systems and focused on restoring the components and services to an
agreed technical specification, preventing any negative impact on the live services and their users.
There is no single operating model fitting all organizations. Different solutions work for different organizations, involving different value streams
which in turn involve different management practices.
The following are some simple and practical recommendations for service value stream analysis and mapping:
1. Identify the scope of the value stream analysis. It can be mapped to a particular product or service or applied to most or all of them.
Similarly, service value streams may differ for different consumers; for example, incidents can be solved and communicated differently for
internal and external customers, or for B2B and B2C products, or for services based on products developed in-house or sourced externally.
2. Define the purpose of the value stream from the business standpoint. Make sure the stakeholders’ concerns are clearly understood,
since they are the ones defining value. In the case of incident management, it is usually the user who needs to return to normal work as soon as
possible; however, there are usually other interested parties. For example, internal users may be unable to provide normal service to a
business customer because of the incident, and the value of the value stream should be considered from the business perspective, not
solely from the user perspective.
3. Do the service value stream walk. Walk through or directly experience the steps and information flow as they go in practice (consider the
Lean technique of Gemba walk):
c. Evaluate the workflow steps. Typically, the criteria for evaluation are:
value for the stakeholder (does the step add value for the business stakeholder?)
d. Map the activities and the information flows. In an ideal situation, the flow goes smoothly without delays and pauses, there are no
disconnections between the steps, and the workload is level with minimal (and agreed) variation.
e. Create and review the timeline and resource level. Map out process times and lead times for resources and workload through the workflow
steps.
4. Reflect on the value stream map (VSM). Identify factors that might not have been entirely apparent at first. The information collected is
used at this step to find the waste.
5. Create a ‘to be’ VSM. This informs and drives improvement. The value stream should be considered holistically to ensure end-to-end
efficiency and value creation, not just local improvements.
6. Using the ‘to be’ VSM, plan improvements. Refer to the continual improvement practice guide for a practical improvement model.
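The timeline step (3e) comes down to simple arithmetic on process times and lead times. The following is a minimal sketch of that arithmetic, not a prescribed ITIL method. The step names, the minute values, and the `Step`, `summarize`, and `biggest_delay` names are all hypothetical, and 'flow efficiency' is used here in the common Lean sense of process time divided by lead time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    process_minutes: float  # time someone actively works on the incident
    wait_minutes: float     # queueing or hand-off delay before the step starts


def summarize(steps: list[Step]) -> dict[str, float]:
    """Return total process time, total lead time, and flow efficiency."""
    process = sum(s.process_minutes for s in steps)
    lead = process + sum(s.wait_minutes for s in steps)
    efficiency = process / lead if lead else 0.0
    return {
        "process_minutes": process,
        "lead_minutes": lead,
        "flow_efficiency": efficiency,
    }


def biggest_delay(steps: list[Step]) -> Step:
    """The step with the longest wait is a first candidate when looking for waste."""
    return max(steps, key=lambda s: s.wait_minutes)


# Hypothetical 'as is' measurements for one type of incident.
as_is = [
    Step("detection and registration", process_minutes=5, wait_minutes=10),
    Step("classification", process_minutes=5, wait_minutes=20),
    Step("diagnosis", process_minutes=30, wait_minutes=60),
    Step("resolution", process_minutes=20, wait_minutes=30),
    Step("closure", process_minutes=5, wait_minutes=15),
]

summary = summarize(as_is)
print(summary)
print("Longest wait before:", biggest_delay(as_is).name)
```

Running the same calculation on the 'to be' map gives a like-for-like comparison of the planned improvement against the current state.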
To ensure that relevant incident management activities are included in service value streams, the following steps can be added to the above
recommendations.
At the scoping step (1), identify the IT and business services related to the value stream and the involved business stakeholders. For
example, when an IT service provider delivers IT services consumed by business users who in turn provide services to the business clients,
should the incident-related service value stream involve restoration of normal business services for the clients, or should it be limited to
the restoration of normal IT services for the business users?
Make sure the value stream is understood (step 2) from the standpoint of the business, not only of the service provider.
During the service value stream walk (3a), identify other practices involved in dealing with incidents at every step. Which practices provide
required information (configuration data, asset data, previously identified solutions, agreed timeline for the service restoration…)? What if
the incident resolution requires changes? What if incident diagnosis and/or resolution involves third parties?
During the workflow steps evaluation (3c), evaluate the step’s impact on the value restoration. Special attention should be paid to steps
with low business value, low performance, and availability or capacity issues. It is not unusual to find steps which serve some internal
control or bureaucratic purposes but delay the incident resolution.
At the reflection and planning steps (4-5), ensure that the incident management flow is optimized for business value throughout the
stream, not only at the incident management practice activities.
Include creation or update of incident models (see sections 2.2.1 and 3.1.2) in the value stream improvement plans (step 6).
The practice guides do not describe the practice management roles such as practice owner, practice lead, or practice coach. They focus instead
on the specialist roles that are specific to each practice. The structure and naming of each role may differ from organization to organization, so
any roles defined in ITIL should not be treated as mandatory, or even recommended.
Remember, roles are not job titles. One person can take on multiple roles and one role can be assigned to multiple people.
Roles are described in the context of processes and activities. Each role is characterized with a competency profile based on the model shown
in Table 4.1.
A: Administrator. Assigning and prioritizing tasks, record-keeping, ongoing reporting, and initiating basic improvements
M: Methods and techniques expert. Designing and implementing work techniques, documenting procedures, consulting on processes, work analysis, and continual improvement
T: Technical expert. Providing technical (subject matter) expertise and conducting expertise-based assignments
In many organizations, the incident manager role is performed by a dedicated person, sometimes under the incident manager job title. In
other organizations, the responsibilities of an incident manager are taken by the person or team responsible for the CI, service, or product with
which the incident is associated; this may be the resource owner, service owner, or product owner.
The responsibilities of the incident manager typically include:
the coordination of incident handling in the organization or in a specific area, such as territory, product, or technology, depending on the
organizational design
coordinating manual work with incidents, especially those involving multiple teams
monitoring and reviewing the work of teams that handle and resolve incidents
ensuring sufficient awareness of the incidents and their status across the organization
conducting regular incident reviews and initiating improvements of the incident management practice, the incident models, and the
incident handling procedures
developing the organization’s expertise in the processes and methods of the incident management practice.
In some cases, organizations may introduce an additional role of the major incident manager (MIM). This role has similar responsibilities to the
incident manager but focuses exclusively on major incidents. This role becomes the main point of contact and coordination during major
incidents. The MIM usually has wider authority and may have dedicated resources for major incident management.
The competency profile for these roles is CMAT, though the importance of each of these competencies varies from activity to activity.
Examples of other roles which can be involved in incident management activities are listed in Table 4.2, together with the associated
competency profiles and specific skills.
Table 4.2 Examples of roles with responsibility for incident management activities
Activity: Incident registration
Responsible roles: Incident manager; Service desk agent; Technical specialist
Competency profile: AT
Specific skills: Good knowledge of IT service management (ITSM) tools and procedures

Activity: Incident closure
Responsible roles: Incident manager; Service desk agent; Technical specialist
Competency profile: ACT
Specific skills: Understanding of the service design, resource configuration, and business impact; good knowledge of the requirements and commitments for incident resolution

Activity: Incident review and incident records analysis
Responsible roles: Incident manager; Product owner; Service owner; Supplier
Competency profile: TCL
Specific skills: Understanding of the service design, resource configuration, and business impact; good knowledge of the requirements and commitments for incident resolution; knowledge of incident models, diagnostic tools, methods, and analytic skills

Activity: Incident model improvement initiation
Responsible roles: Incident manager; Product owner; Service owner
Competency profile: TMC
Specific skills: Understanding of the service design, resource configuration, and business impact; good knowledge of the requirements and commitments for incident resolution; knowledge of incident models, diagnostic tools, and methods; knowledge of the organization's continual improvement and change enablement practices
Organizational structure and the size of the organization influence how the incident management practice is performed and how it is integrated
in the organization’s value streams. Incident management involves specialists with different areas and levels of expertise; these specialists may
belong to different organizational teams. Typical methods of grouping specialists include, among others:
technical domain
product/service
territory
consumer types.
The method of organization will vary, depending on the organization’s needs and resources. The incident management practice should take a
flexible approach to its organization, involving resources from various internal and external teams as necessary. Either way, it is crucial to ensure
effective cooperation between members of different teams involved in handling and resolution of incidents.
Historically, teams working on incidents had a tiered or levelled structure in which competency, expertise, and specialization increased with
each level. It aimed to resolve most of the incidents at the lowest level possible to reduce costs. Incidents were transferred to the upper level, or
escalated, if they could not be resolved in the current level. In such teams, there were clear boundaries between levels and clear procedures for
the escalation of incidents. Unfortunately, such structures can restrain collaboration and information flow, resulting in prolonged resolution
time. So, for high-priority incidents, teams collaborate to facilitate speedy resolution.
The expansion of Agile methods and evolution of IT systems (such as self-healing systems) call for the wider use of horizontal team structures,
rather than hierarchical team structures. Flatter structures and respective collaboration methods, such as swarming, replace tiered ones to
facilitate cooperation and the free flow of information. The main driver of such change is the rejection of rigid tiering and its replacement by a
more dynamic, self-organized collaboration.
Team dynamics are a foundation of the incident management practice, because they affect the functioning of the support operation. The
following issues regularly recur:
team members experience a lack of autonomy and report being blocked by others
a culture prevails where lone ‘heroes’ are rewarded when incidents are solved
a decrease in morale
a lack of motivation.
Furthermore, trust between team members breaks down. Approaches such as DevOps and techniques such as swarming show some of the
characteristics needed to encourage a positive culture, although it is not necessary to follow these approaches to achieve the correct team
dynamic. The following three main areas need to be addressed.
If resolving incidents is the primary responsibility, that is what individuals within the teams will focus on. Team dynamics will then come second
to achieving the SLA or meeting a deadline. The first step in changing this is to build a culture where team members share successes and
failures. Teams that share responsibility may have a single person who sees an incident through to resolution, but they should be encouraged
to engage other experienced people in the process. When this occurs, the organization will benefit from a fast restoration of normal service as
well as knowledge-sharing.
There should be a no-blame culture within teams; without it, trust between individuals, teams, and suppliers deteriorates.
Incident investigations and reviews need to address incident resolution and service restoration. Incident teams must be encouraged
to act without fear of retribution if their idea fails to work. This requires transparency and positive leadership. Mistakes should be treated as
shared learning opportunities rather than personal failures.
Team members need to share the lessons that they have learned from experimenting so they can learn and improve. This can prove to be a
significant cultural leap in many environments, particularly those with a large percentage of outsourcing.
The effectiveness of the incident management practice is based on the quality of the information used. This includes, but is not limited to,
information about:
partners and suppliers, including contract and SLA information on the services they provide
This information may take various forms, depending on the incident models in use. The key inputs and outputs of the practice are listed in
chapter 3.
Details of incidents are the most important pieces of information. These usually include:
sources of information
the last known time of correct operation before the symptoms began
similar systems which might be affected by the poor performance and are currently operating normally
Additional information that will be exchanged and recorded during the incident management practice should include details of:
the investigation
Any actions taken should be documented to produce an accurate timeline. If it is not practical to document actions in real time, the
documentation should specify when the action was started and completed to avoid the creation of a false history log. It is preferable, however,
to capture real-time actions if the customer can see the information through a portal. Where possible, the registration of actions should be
automated.
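The point about false history logs can be illustrated with a small record structure. This is only a sketch under stated assumptions: the `ActionEntry` layout, its field names, and the sample actions are hypothetical and do not come from any particular ITSM tool. Each entry keeps the actual start and completion times separate from the time the entry was written, so the timeline can be ordered by what happened, not by when it was logged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class ActionEntry:
    """One action taken on an incident (hypothetical record layout)."""
    description: str
    started_at: datetime    # when the work actually began
    completed_at: datetime  # when the work actually ended
    recorded_at: datetime = field(default_factory=_now)  # when it was written down

    def __post_init__(self) -> None:
        if self.completed_at < self.started_at:
            raise ValueError("completed_at must not precede started_at")


def timeline(entries: list[ActionEntry]) -> list[ActionEntry]:
    """Order actions by when they happened, not by when they were logged."""
    return sorted(entries, key=lambda e: e.started_at)


def utc(hour: int, minute: int) -> datetime:
    """Helper for the example: a time on an arbitrary illustrative day."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Two actions documented after the fact, and in the wrong order.
log = [
    ActionEntry("Restarted application service", utc(10, 20), utc(10, 25),
                recorded_at=utc(11, 0)),
    ActionEntry("Checked database connectivity", utc(10, 5), utc(10, 15),
                recorded_at=utc(11, 2)),
]

for entry in timeline(log):
    print(entry.started_at.strftime("%H:%M"), entry.description)
```

When registration is automated, `recorded_at` and `completed_at` should be nearly identical; a large gap between them marks an entry that was reconstructed later.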
The incident management practice can significantly benefit from automation. The term automation is used in this and other ITIL publications
to refer to the use of digital technology to enable, support, or enhance various activities. This includes, but is not limited to, the full automation
of activities where technology solutions remove the need for human intervention. Table 5.1 provides a list of the key automation tools supporting the
practice and their most common applications.
Automation tools: Workflow management and collaboration tools (including user query (‘ticket’) management tools)
Application in incident management: Management of incident lifecycle; support and automation of incident models; communications between specialists involved in incident handling and resolution; integration of the practices into service value systems

Automation tools: Classification and analysis tools, including ML-enhanced
Application in incident management: Incident classification and analysis

Automation tools: Remote administration, diagnosis, deployment, and other infrastructure and software management tools
Application in incident management: Incident diagnosis and resolution

Automation tools: Work planning and prioritization tools
Application in incident management: Planning and tracking of improvement initiatives
Detailed descriptions of how these tools support the practice’s activities are outlined in Table 5.2.
In some cases, all activities after a particular activity in the incident handling and resolution process can be fully automated using pre-defined
scripts and scenarios for specific types of incidents.
Note that automation tools used in the incident management practice could include not only organization-wide tools, which are valid for all
incidents, but also some local custom tools and scripts created as a result of a periodic incident review process for specific incident models.
Both should be used to drive automation efforts.
Incident handling and resolution process

Activity: Incident classification
Means of automation: Workflow management and collaboration tools, including user query ('ticket') management tools; knowledge management tools; service configuration management tools; classification and analysis tools
Key functionality: Fast and correct classification and assignment of the incidents, identification of known solutions, identification of major incidents
Impact on the effectiveness of the practice: Very high, especially when the number of incidents is high

Activity: Incident diagnosis
Means of automation: Workflow management and collaboration tools, including user query ('ticket') management tools; knowledge management tools; service configuration management tools
Key functionality: Fast and correct definition and testing of hypotheses, effective collaboration of multiple specialists/teams
Impact on the effectiveness of the practice: High, especially when the number of complex incidents requiring manual collaboration efforts is high

Activity: Incident resolution
Means of automation: Remote administration, diagnosis, deployment, and other infrastructure and software management tools
Key functionality: Fast correction of the faulty CIs and restoration of the services
Impact on the effectiveness of the practice: High, especially when services are provided in remote locations

Periodic incident review process

Activity: Incident review and incident records analysis
Means of automation: Analysis and reporting tools; workflow management and collaboration tools; survey tools
Key functionality: Remote collaboration, incident data analysis, and user survey data analysis and reports
Impact on the effectiveness of the practice: Medium to high, especially for high volumes of incidents

Activity: Incident model update communications
Means of automation: Workflow management and collaboration tools
Key functionality: Communicating updates to the relevant teams
Impact on the effectiveness of the practice: Medium to high, especially when the organization is large and the number of updates is high
The following recommendations can help when applying automation to incident management:
Automate the value stream. Incident management is often one of the first practices to be developed by a service provider, and the
implementation of ITSM automation systems also often starts with the incident management processes. Even if other practices may not
be mature at this stage, it is important to define requirements and design workflows that will support the full value stream, from
detection to resolution of incidents. For incident resolution that requires changes, the automation tool should allow for a simple change
tracking workflow; for recurring incidents, it should be possible to capture and reuse proven solutions. Think and work holistically.
Allow different workflows for user- and event-initiated incidents. Detection, classification, communications, and conditions for closing
a record are all handled differently for user-initiated and event-initiated incidents, even if the latter are handled manually. Attempts to fit
both types of incidents in one workflow with the same forms and business logic are unlikely to be successful. The handling of
event-generated incidents can and should be automated.
Do not overcomplicate the workflows and business rules. Forms filled in manually should be user-friendly and should not take much
time to fill in. When designing user journeys and interfaces, treat IT support teams as you would treat external users whose expectations
are based on their experience with mobile apps and modern web sites.
Pay attention to measurement and reporting from the beginning. Incident management is a high-load practice, and it is not possible to
monitor the status of incidents and the performance of the practice without a convenient dashboard; it is impossible to understand the
trends and to analyse the work of teams without a flexible reporting engine. The popular statement ‘you cannot manage what you don’t
measure’ is not always true, but it certainly applies to large amounts of data, and the incident management practice generates large
amounts of data.
Allow for swarming and other forms of cross-team collaboration. Some incident management tools are designed for a linear flow and
transfer of incident records between the teams. When a joint action is required, it is often unsupported; specialists meet and work
together, but the incident records do not reflect it. Design the tool for collaborative and non-linear workflows.
Communications are important. Informing people about incidents, both on the service consumer side and within the service provider, is
a crucial part of incident management. Relevant and proactive communications significantly reduce work duplication and optimize the
resources of the incident management and service desk practices.
Leverage machine learning capabilities. Incident detection, matching, classification, and prioritization can be enhanced or fully
automated using machine learning. Effective use of machine learning requires high-quality data and effective integration with various
sources of information. If used properly, it can significantly improve the incident management practice.
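As an illustration of the recommendation to keep separate workflows for user-initiated and event-initiated incidents, the sketch below routes an incident record by its source and by whether a matching incident model is known. It is a minimal example under assumptions, not a description of any ITSM tool: the `Source` and `Incident` types, the workflow names, and the sample incidents are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    USER = "user"    # reported through the service desk
    EVENT = "event"  # raised by a monitoring system


@dataclass(frozen=True)
class Incident:
    summary: str
    source: Source
    incident_model: Optional[str] = None  # name of a matching model, if one is known


def select_workflow(incident: Incident) -> str:
    """Pick a workflow name; the names are placeholders for real workflow definitions."""
    if incident.source is Source.EVENT:
        # Event-initiated incidents with a known model are candidates for
        # fully automated handling; the rest go to technical triage.
        return "automated_resolution" if incident.incident_model else "event_triage"
    # User-initiated incidents always keep the user informed via the service desk.
    return "service_desk_model" if incident.incident_model else "service_desk_triage"


samples = [
    Incident("Disk usage above threshold", Source.EVENT, incident_model="disk-cleanup"),
    Incident("Unknown latency spike", Source.EVENT),
    Incident("Cannot print invoices", Source.USER, incident_model="printer-queue-reset"),
    Incident("Application behaves oddly", Source.USER),
]

for incident in samples:
    print(f"{incident.summary}: {select_workflow(incident)}")
```

Keeping the branch on `source` at the top makes it straightforward to give each path its own forms, communications, and closure conditions.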
Very few services are delivered using only an organization’s own resources. Most, if not all, depend on other services, often provided by third
parties outside the organization (see section 2.4 of ITIL Foundation: ITIL 4 Edition for a model of a service relationship). Relationships and
dependencies introduced by supporting services are described in the practice guides for service design, architecture management, and
supplier management.
Partners and suppliers may support the development, management, and execution of the incident management practice. The forms of
support include the following:
Performing incident management activities. Some incident management activities can be largely or completely performed by a
specialized supplier. Third parties are often involved in incident diagnosis and resolution, and sometimes in other activities. It is important
to ensure effective integration of the third parties in the incident-related workflows and information exchange, as well as their adherence
to relevant policies. Incident models should define how third parties are involved in incident resolution and how the organization ensures
effective collaboration. This will depend on the architecture and design solutions for products, services, and value streams. Nonetheless,
the optimization of incident models supporting these solutions will involve the incident management practice. Generally, after the correct
model is selected for an incident, further consideration of third-party dependencies is needed during incident diagnosis, resolution, and
review. Defined standard interfaces may become an easy way to communicate the necessary conditions and requirements for a supplier
to become a part of the organization’s ecosystem. Such interface description may include rules of data exchange, tools, and processes that
will create a common language in the multi-vendor environment. Where organizations aim to ensure fast and effective incident
resolution, they usually try to agree close cooperation with their partners and suppliers, removing formal bureaucratic barriers in
communication, collaboration, and decision-making (see the supplier management practice guide for more information).
Provision of software tools. Most software tools used for incident management are shared with other practices. However, implementation
and use of integrated service management information systems often starts with automating incident management (and service desk)
activities. In this case, the owner of the incident management practice and the managers of the teams involved in incident management
should define requirements and interact with other teams and practices of the service provider to ensure that the required tools are
procured, implemented, and used in an optimal way.
Consulting and advisory. Specialized suppliers who have developed expertise in incident management can help establish and develop
practices, adopt methods and techniques (such as swarming), and initially develop incident models.
The practice success factors described in section 2.4 cannot be developed overnight. The ITIL maturity model defines the following capability levels
applicable to any management practice:
Level 1: The practice is not well organized; it’s performed as initial or intuitive. It may occasionally or partially achieve its purpose through an
incomplete set of activities.
Level 2: The practice systematically achieves its purpose through a basic set of activities supported by specialized resources.
Level 3: The practice is well defined and achieves its purpose in an organized way, using dedicated resources and relying on inputs from other
practices that are integrated into a service management system.
Level 4: The practice achieves its purpose in a highly organized way, and its performance is continually measured and assessed in the context of
the service management system.
Level 5: The practice is continually improving organizational capabilities associated with its purpose.
For each practice, the ITIL maturity model defines criteria for every capability level from level two to level five. These criteria can be used to
assess the practice’s ability to fulfil its purpose and to contribute to the organization’s service value system.
Each criterion is mapped to one of the four dimensions of service management and to the supported capability level. The higher the capability
level, the more comprehensive realization of the practice is expected. For example, criteria related to the practice automation are typically
defined at level 3 or higher because effective automation is only possible if the practice is well defined and organized.
Figure 7.1 Design of the capability criteria
This approach results in every practice having up to 30 capability criteria based on the practice PSFs and mapped to the four dimensions of
service management. The number of criteria at each level differs; the four dimensions are comprehensively covered starting from level 3, so this
level typically has more criteria than others.
Table 7.1 outlines the capability criteria that are defined in the ITIL maturity model for the incident management practice.
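PSF: Detecting incidents early
Criterion: Incidents are usually detected immediately after they occur
Dimension: Value streams and processes
Capability level: 2

PSF: Detecting incidents early
Criterion: The users and other relevant stakeholders know how to report incidents and report them as soon as possible
Dimension: Organizations and people
Capability level: 2

PSF: Resolving incidents quickly
Criterion: Incidents are usually resolved in the quickest possible way
Dimension: Value streams and processes
Capability level: 2

PSF: Resolving incidents quickly
Criterion: Incidents are usually resolved within the agreed target resolution times
Dimension: Value streams and processes
Capability level: 2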
These capability criteria can be used by organizations for self-assessment and improvement of the practice.
The self-assessment can be conducted by the service provider’s internal audit team, if the service provider has one, or by the respective team of
the parent organization. If there is no specialized team in the organization, the assessment can be done by a team of practice owners and
managers responsible for other management practices of the service provider, or a mixed team of the service provider’s executive leaders and
managers.
To perform a quick self-assessment using the capability criteria, the following rules should be followed.
1. Start with the level 2 criteria. Based on the knowledge of your organization, answer the question, ‘Is this a valid description of our
organization in MOST cases?’
2. If the answer to the question above is ‘yes’, make a list of at least three types of material evidence that could prove the answer.
These can be records, documents, interviews with business stakeholders, or service provider’s employees.
3. If the answer is ‘yes’ to all criteria of level 2, this level is considered achieved. Proceed to the criteria of level 3.
4. If not all criteria of level 2 are met, the practice is considered to be at level 1. Focus on the criteria that are not met; what is missing
in the organization? Why? How can it affect the service consumer and the quality of the IT services? What can be done to meet
the criteria that are currently missed?
5. The same approach is applied at every subsequent level; the practice is considered to be at the highest level for which all criteria, and all
criteria of the levels below it, are met. It is important to focus on the missing capabilities and improvement opportunities, rather than on a
formal achievement of a high capability level.
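The rules above amount to a simple procedure, sketched below. This is an illustration and not part of the ITIL maturity model: the `assess_capability` and `unmet_criteria` functions and the yes/no answers are hypothetical, and only the four level 2 criteria listed in Table 7.1 are used as sample data. A level with no recorded answers is treated as not yet achieved.

```python
LEVELS = (2, 3, 4, 5)


def assess_capability(answers: dict[int, dict[str, bool]]) -> int:
    """Return the highest level whose criteria, and all lower levels' criteria, are met.

    `answers` maps a capability level to {criterion text: True/False}, where
    True means 'this is a valid description of our organization in most cases'.
    """
    level = 1
    for candidate in LEVELS:
        criteria = answers.get(candidate)
        if not criteria or not all(criteria.values()):
            break  # a gap at this level stops the assessment here
        level = candidate
    return level


def unmet_criteria(answers: dict[int, dict[str, bool]], level: int) -> list[str]:
    """List the criteria to focus on at the given level."""
    return [text for text, met in answers.get(level, {}).items() if not met]


# Hypothetical self-assessment answers for the level 2 criteria.
answers = {
    2: {
        "Incidents are usually detected immediately after they occur": True,
        "The users and other relevant stakeholders know how to report incidents "
        "and report them as soon as possible": True,
        "Incidents are usually resolved in the quickest possible way": True,
        "Incidents are usually resolved within the agreed target resolution times": False,
    },
}

print("Capability level:", assess_capability(answers))
print("Focus on:", unmet_criteria(answers, 2))
```

The list of unmet criteria is the more useful output, since the guidance is to focus on missing capabilities and not on the level itself.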
Management practices should support achievement of the organization’s objectives and enable creation of value for the stakeholders.
Depending on the service provider’s strategy, positioning, and business and operating models, some practices may be more important and
therefore require a higher level of capability. There is no organization that requires all management practices to be at the capability level 5.
A higher capability level provides higher assurance of the fulfilment of the practice’s purpose, but it comes at a cost: the cost of management,
automation, and training, for example. To achieve optimal performance with a sufficient level of assurance, organizations should define a target
capability level for each management practice.
Figure 7.2 and Table 7.2 show the capability development model, which can be applied to every management practice. The structure of this
publication is aligned with the development steps.
Scope: section 2.3
Tools and procedures: section 5
Most of the content of the practice guides should be taken as a suggestion of areas that an organization might consider when establishing and
nurturing their own practices. When using the content of the practice guides, organizations should always follow the ITIL guiding principles:
focus on value
More information on the guiding principles and their application can be found in section 4.3 of ITIL Foundation: ITIL 4 Edition.
Table 8.1 outlines recommendations for the success of the incident management practice, linked to the relevant guiding principles.
Recommendation: Look at the incidents from the service consumer perspective
Comments: For user-reported incidents, do not hide behind SLAs; aim to restore a level of service which satisfies the users. For monitoring-based incidents, assess business impact even if there are no directly affected users yet. Prioritize incidents according to their business impact.
ITIL guiding principles: Focus on value; Collaborate and promote visibility

Recommendation: Gather and reuse data
Comments: Many incidents recur. Significant time and resources can be saved by developing incident models and reusing known resolutions. Do not rely on individuals' experience; motivate team members to document and share their knowledge. Leverage automation tools to manage knowledge and automate solutions, where possible.
ITIL guiding principles: Collaborate and promote visibility; Optimize and automate

Recommendation: Understand, manage, and improve the incident resolution value stream, not only the incident management practice
Comments: The incident lifecycle spans beyond one practice. Ensure effective integration with service desk, change enablement, problem management, and other relevant practices.
ITIL guiding principles: Think and work holistically; Focus on value

Recommendation: Develop the practice continually but don't overcomplicate it
Comments: Start with the most critical products and services and with a basic workflow from detection to resolution. Gradually increase both the scope and the capability level based on the business requirements and stakeholder feedback. Use the capability criteria and continual improvement model as guidance.
ITIL guiding principles: Start where you are; Progress iteratively with feedback; Keep it simple and practical

Recommendation: Adjust for complexity
Comments: Shift left and automate handling and resolution of repeating clear incidents. Use swarming to optimize resolution of unusual, complex, and major incidents.
ITIL guiding principles: Optimize and automate; Collaborate and promote visibility

Recommendation: Demonstrate business value
Comments: Measure the practice and produce regular reports and dashboards for internal (within the service provider) and external (service consumer) stakeholders. Use dashboards for the current state and regular reports for analysis and highlights.
ITIL guiding principles: Focus on value; Collaborate and promote visibility
9. Acknowledgements
PeopleCert is grateful to everyone who has contributed to the development of this guidance. These practice guides incorporate an
unprecedented level of enthusiasm and feedback from across the ITIL community. In particular, PeopleCert would like to thank the following
people.
Authors
Reviewers
Akshay Anand, Sofi Fahlberg, Michael G. Hall, Steve Harrop, Piia Karvonen, Anton Lykov, Paula Määttänen, Christian F. Nissen, Mark O’Loughlin,
Tatiana Orlova, Elina Pirjanti, Stuart Rance
2023 Revision
David Cannon, Antonina Douannes, Peter Farenden, Adam Griffith, Roman Jouravlev, Kaimar Karu, Barclay Rae, Stuart Rance, Nicola Reeves