Incident Management Process Guide
University of Washington
Original version: September 23, 2013
Last update: January 4, 2014
This document will continuously improve and is subject to Change Management. If referencing a printed version, be sure you are
referencing the most current version at https://wiki.cac.washington.edu/x/36mKAw
This process document was developed in partnership with Covestic, Inc. and UW-IT.
All trade names, trademarks, or registered trademarks are trade names, trademarks, or registered trademarks of their respective companies.
Process Description – The primary goal of the Incident Management process is to restore
normal service operation as quickly as possible and minimize the
adverse impact on business operations. The Service Desk is the
primary customer-facing operations center tasked with documenting
and processing incidents through to resolution. The Service Desk is
the critical liaison between the support organization and the
customer. All levels or tiers of support utilize the incident
management process to restore service to their customers.
1. Process Objectives
2. Scope
2.1. Scope
3. Process Overview
3.1. Description & Overview
3.2. Process Inputs
3.3. Incident Ownership
3.4. Incident State Definitions
4. Incident Management Process Activities
4.1. Incident Logging & Identification Sub-Processes
4.1.1. Caller Identification
4.1.2. Output
4.2. Incident Logging
4.2.1. Output
4.3. Classification & Prioritization Sub-Processes
4.3.1. Incident Classification & Prioritization
4.3.2. Impact
4.3.3. Urgency
4.3.4. Priority
4.3.5. Incident Investigation & Diagnosis
4.3.6. Output
4.4. Investigation & Diagnosis Sub-Processes
4.4.1. Service Desk Resolution Guidance
4.4.2. Managing Staff Resources
4.5. Incident Resolution & Recovery Sub-Processes
4.6. Incident Closure
5. Incident Communications
6. Roles & Responsibilities
6.1. RACI Model
6.1.1. RACI Model Legend:
7. Process Interfaces and Dependencies
7.1. Service Transition
7.1.1. Change Management
7.1.2. Configuration Management
7.1.3. Release and Deployment Management
7.1.4. Knowledge Management
7.2. Service Design
7.2.1. Service Continuity
7.2.2. Service Level Management
7.3. Service Strategy
7.3.1. Financial Management
8. Process Measurements and Metrics
8.1. Critical Success Factors
8.2. Incident Management Operational Metrics
9. Terms of reference
Assignment Groups
Caller
Contact Record
Duty Manager
First Time Resolution (FTR)
Incident
Incident Manager
IT Manager
Major Incident
NBD
Parent Incident - Master Incident
Priority
Problem
Record
RFU
Service desk
Service Owner
Support Tier
URC Decision Support Group
UW-IT Unit Response Center (URC)
1. Process Objectives
The objective of the Incident Management process is to restore IT service operations to users as quickly as possible and to minimize the adverse
impact on business operations.
2. Scope
2.1. Scope
The scope of Incident Management includes all managed services as defined by the UW-IT Service Catalog.
3. Process Overview
Figure 1 – Incident Management Lifecycle & Sub-Processes
Telephone – User contacts the Service Desk using a centralized phone number.
Web Form – User completes a web form that is assigned to Service Desk support for resolution.
Instant Messenger – User initiates an Instant Messenger (IM) conversation with a Service Desk support analyst.
E-Mail – User sends a properly addressed e-mail message to a centralized e-mail address.
Walk-Up – A user walks up to an on-duty Service Desk analyst to report an Incident.
Events/Automated Alerts – Automated monitoring systems may generate Incidents.
Social Media/Community – Users may post issues on Facebook and Twitter that IT transforms into Incidents.
3rd Party – Third-Party partners and suppliers may report Incidents to UW
IT – Incidents may be generated by IT team members for unmonitored devices.
With regard to the Incident inputs, the following rules should be communicated to the user community in order to expedite Incident resolution:
Incidents that could potentially involve complete service failure or degradation should be reported to the Service Desk.
Incidents reported by web form, e-mail, or walk-up are excluded from FTR reporting.
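The FTR exclusion rule above can be expressed as a small check. This is a minimal sketch, not part of any UW-IT tooling: the `Channel` enum and the `counts_toward_ftr` function are hypothetical names, and the channel list simply mirrors the inputs listed in this section. The guide names only the excluded channels, so treating every other channel as included is an assumption.

```python
from enum import Enum


class Channel(Enum):
    """Incident input channels listed in this guide (names are illustrative)."""
    TELEPHONE = "telephone"
    WEB_FORM = "web form"
    INSTANT_MESSENGER = "instant messenger"
    EMAIL = "e-mail"
    WALK_UP = "walk-up"
    AUTOMATED_ALERT = "automated alert"
    SOCIAL_MEDIA = "social media"
    THIRD_PARTY = "third party"
    IT = "it"


# Channels the guide excludes from First Time Resolution (FTR) reporting.
FTR_EXCLUDED = {Channel.WEB_FORM, Channel.EMAIL, Channel.WALK_UP}


def counts_toward_ftr(channel: Channel) -> bool:
    """Return True if Incidents from this channel are included in FTR reporting.

    Assumption: any channel not explicitly excluded is included.
    """
    return channel not in FTR_EXCLUDED


print(counts_toward_ftr(Channel.TELEPHONE))  # True
print(counts_toward_ftr(Channel.EMAIL))      # False
```

Keeping the excluded channels in one set means a later policy change only touches `FTR_EXCLUDED`.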
New – Incidents start out in a New Incident state and remain in that state until they are assigned to a named resource (Assigned to).
Awaiting Problem – When the Incident is placed in the Awaiting Problem state, the SLA clock is paused while the parent Problem is being worked. Once the Problem has been resolved, all associated Incidents may be resolved by using the "Close Incidents" UI Action on the Problem form.
Awaiting User Info – The Awaiting User Info state pauses the SLA clock while support is waiting on additional information from a user and/or 3rd party. When the Incident is placed in the Awaiting User Info state, many customers have elected to send the Caller a system-generated e-mail notifying them that more information has been requested. A common customization is to set a rule/script that automatically resolves Incidents that are awaiting user info and have not been updated in 10 days. A reminder notification is sent to the Caller after 5 days, warning them that their issue will be placed in the Resolved state if they do not respond. If the Caller responds, the 10-day clock restarts.
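The reminder and auto-resolve rule for the Awaiting User Info state can be sketched as a single decision function. This is an illustrative sketch only, not a ServiceNow script: the function name, its parameters, and the returned action strings are hypothetical, and treating the 5-day and 10-day thresholds as inclusive calendar-day limits is an assumption.

```python
from datetime import datetime, timedelta

REMINDER_AFTER = timedelta(days=5)   # reminder notification to the Caller
RESOLVE_AFTER = timedelta(days=10)   # Incident is automatically resolved


def awaiting_info_action(last_update: datetime, now: datetime,
                         reminder_sent: bool) -> str:
    """Decide what to do with an Incident in the Awaiting User Info state.

    last_update is the time of the most recent update. A Caller response
    counts as an update, which restarts the 10-day clock.
    """
    idle = now - last_update
    if idle >= RESOLVE_AFTER:
        return "resolve"
    if idle >= REMINDER_AFTER and not reminder_sent:
        return "send_reminder"
    return "wait"


placed = datetime(2014, 1, 6, 9, 0)
print(awaiting_info_action(placed, placed + timedelta(days=3), False))   # wait
print(awaiting_info_action(placed, placed + timedelta(days=5), False))   # send_reminder
print(awaiting_info_action(placed, placed + timedelta(days=10), True))   # resolve
```

Because the function depends only on the last update time, a Caller response restarts the clock without any extra state: the caller of this function records the new update time and clears the reminder flag.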
4.1.2. Output
Upon completion of the Caller Identification activities, the user of record should be recorded and verified with a means to contact the user again, if
necessary. The Caller Identification output will be used as a process input for the next set of activities, Incident Logging.
Name Activities
Name           Description
Title          A brief description of the Incident
Description    Detailed information regarding the Incident
3.0 Classification & Prioritization – The output for the Incident Logging activity is meta-data for consumption by the Classification & Prioritization process.
4.2.1. Output
Upon completion of the Incident Logging activities, all information collected up to this point has been documented and can be easily referenced, if
necessary, and used by other groups for other Incident Management activities. The Incident Logging output will be used as a process input for
the next set of activities, Classification & Prioritization.
Incident Impact
Incident Urgency
Incident Prioritization
4.3.2. Impact
The effect that an Incident has on the business, rated from 1 to 3, where 1 is High Impact. Outages impacting key services, systems, or business
processes are High Impact.
4.3.3. Urgency
The extent to which the Incident's resolution can bear delay, rated from 1 to 3, where 1 is High Urgency. A user with a critical deadline could
have a Low Impact (single user) but a High Urgency.
4.3.4. Priority
The order in which the support organization should address the Incident or Service Request – between one and three. One is Critical Priority.
Setting Impact and Urgency to 1 for any Incident immediately sets the Priority level to 1.
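Deriving Priority from Impact and Urgency is typically implemented as a lookup matrix. The sketch below is illustrative only: this guide states just one cell (Impact 1 and Urgency 1 give Priority 1), so every other cell in `PRIORITY_MATRIX` is an assumed example value, chosen to span the priority labels 1 through 5 that appear in the engagement table later in this guide. The names `PRIORITY_MATRIX` and `derive_priority` are hypothetical.

```python
# (impact, urgency) -> priority.
# Only the (1, 1) -> 1 cell comes from this guide; all other cells are
# assumed example values and should be replaced with the agreed matrix.
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}


def derive_priority(impact: int, urgency: int) -> int:
    """Look up the Priority for an Impact/Urgency pair (each rated 1 to 3)."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(
            f"impact and urgency must each be 1, 2 or 3; got {impact}, {urgency}"
        ) from None


print(derive_priority(1, 1))  # 1, the one cell stated in this guide
print(derive_priority(3, 1))  # 3 under the assumed matrix (single user, critical deadline)
```

A table-driven lookup keeps the policy in data rather than in branching logic, so the matrix can be reviewed and changed without touching the function.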
4.3.6. Output
Upon completion of the Classification & Prioritization activities, the Incident Record is now ready to be processed for appropriate action. For the
Incident Process, most actions (or non-actions) are driven by the Incident Priority. The Priority will be key in determining how the Incident is
handled from this point forward.
The Incident Priority determines who must perform Investigation & Diagnosis and when it must occur.
Note: Objects in red are activities exclusively executed by Tier 2 or Tier 3 resources. Any Incident that requires activities in red
is not considered for FTR.
Name Activities
4.3 Attach to Parent Incident Record – When a duplicate Incident has been identified that was reported by different users, all Incidents reported should be associated to the Parent Incident Record. The Parent Incident Record will have the following attributes:
4.10 Update Incident Record – If step 4.9 determines that enough time or information is not available to resolve the Incident, the assigned analyst should update the Incident Record with the steps taken so far. This is to ensure that other support teams are not duplicating resolution steps.
4.11 Functional (Tier 2) Escalation – If for any reason, at any time, Service Desk resources are unable to resolve an Incident, a child work task should be assigned to Tier 2 for further investigation.
4.12 Advanced Diagnostics – Tier 2 will often utilize resources not readily available to Service Desk resources (e.g. system or application logs, Change logs, vendor knowledge bases).
Active columns: Engage On Call Resources, Engage Duty Manager, Engage Other Staff Resources
Passive columns: Notify Service Manager, Notify Division Heads, Notify CIO
Priority/Type    Actions marked (left to right; empty cells not shown)    Effort to Resolve
Major            X X X X X X                                              Continuous (24 hours)
1 – Critical     X X X X                                                  Continuous (24 hours)
2 – High         X X                                                      Business Hours Continual
3 – Medium       X                                                        Can Be Suspended for up to 5 days
4 – Low                                                                   Can Be Suspended Indefinitely
5 – Planning                                                              N/A
Engage On Call Resources – This authorizes Incident Managers to engage on-call IT resources for assistance in the Investigation & Diagnosis
of an applicable Incident. Group On Call Rotation schedules should be referenced as to which particular resource is on call. This authorization also
extends to secondary, tertiary, or on call manager escalation, if the primary on call resource is not available.
Engage Duty Manager – A UW Duty Manager (based on rotation schedule) is engaged for major and critical issues. The Duty Manager
coordinates and engages all resources needed to resolve major and critical issues.
Engage Other Staff Resources – This authorizes Incident Managers to engage any IT resource for assistance in the Investigation & Diagnosis
of an applicable Incident. This authorization includes engaging resources that are not currently on the on-call list.
Notify Service Manager – This mandates that the Incident Manager must at a minimum alert the Service Manager for the affected Service(s) of
the current Incident.
Notify Division Heads – This mandates that the Incident Manager must at a minimum alert the Division Heads of the affected Service(s) of the
current Incident.
Notify CIO – This mandates that the Incident Manager must at a minimum alert the University of Washington IT Chief Information Officer (CIO) of
the affected Service(s) of the current Incident.
Effort to Resolve – This describes the mandated minimum effort levels required for Incident Investigation & Diagnosis.
Continuous – UW-IT must work continuously on the Incident until resolution or re-prioritization. These activities extend into non-core operating hours, including weeknights, weekends, and holidays. If applicable, user resources must also be available to assist with Investigation & Diagnosis (e.g. a user cannot report a Priority 1 Incident and then go home with no alternative resource to assist from the user perspective).
Business Hours Continual – UW-IT must work continuously on the Incident until resolution or re-prioritization during core operating business hours. Investigation & Diagnosis activities can be suspended by IT resources or user resources during off-peak hours. Outside of core operating hours, no active Incident Management activities, except for system logging or system monitoring, are assumed to be occurring. Activities must resume during business hours or the Incident must be re-prioritized.
Incident Suspension – An Incident can be suspended by UW-IT and/or the reporting user for a period as indicated in this table. During the suspension, no active Incident Management activities, except for system logging or system monitoring, are assumed to be occurring. Activities must resume within the time frame indicated.
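The Effort to Resolve column and the suspension limits can be captured as a lookup plus a deadline calculation. This is a minimal sketch under stated assumptions: the dictionary keys are shortened priority names, `resume_deadline` is a hypothetical helper, and the 5-day limit is counted in calendar days because the guide does not say whether business days are meant.

```python
from datetime import datetime, timedelta
from typing import Optional

# Effort to Resolve per Priority/Type, transcribed from the table above.
EFFORT_TO_RESOLVE = {
    "Major": "Continuous",
    "Critical": "Continuous",
    "High": "Business Hours Continual",
    "Medium": "Can Be Suspended for up to 5 days",
    "Low": "Can Be Suspended Indefinitely",
    "Planning": "N/A",
}

# Maximum suspension length where the table gives one; None means no limit.
MAX_SUSPENSION = {
    "Medium": timedelta(days=5),  # assumption: calendar days
    "Low": None,
}


def resume_deadline(priority: str, suspended_at: datetime) -> Optional[datetime]:
    """Return when work must resume, or None if the suspension is open-ended.

    Raises ValueError for priorities with no multi-day suspension entry.
    """
    if priority not in MAX_SUSPENSION:
        raise ValueError(f"{priority!r} has no multi-day suspension entry in the table")
    limit = MAX_SUSPENSION[priority]
    return None if limit is None else suspended_at + limit


print(resume_deadline("Medium", datetime(2014, 1, 6, 9, 0)))  # 2014-01-11 09:00:00
print(resume_deadline("Low", datetime(2014, 1, 6, 9, 0)))     # None
```

Raising an error for Major, Critical, High, and Planning makes an attempted multi-day suspension of those Incidents visible instead of silently returning a deadline.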
Name Activities
4.0 Investigation & Diagnosis – The input into the Resolution & Recovery activities is a completed Investigation & Diagnosis where a remediation for the Incident is available. The remediation can be in the form of a:
Permanent Fix
Work Around
Known Error
Name Activities
5.0 Recovery & Resolution – The input into the Incident Closure activities is a resolved Incident record. The service and user are no longer experiencing a disruption of service, and there are no lingering effects of the Incident that need immediate action.
Affected CIs
Impact
Close Code
5. Incident Communications
This section describes the approach to communicating Incidents to the user community. The communication method is dictated by the Incident
Priority and describes the following communication types:
Bulletin – A bulletin is a general communication (usually through e-mail) that alerts the user community to IT services that are currently
experiencing an Incident.
Communication Method – The types of communication methods are listed as follows:
User Updates – Based on the communication method, this describes how often a user receives updates on the Incident.
Priority        Bulletin      Communication Method    User Updates
Major           < 15 mins     Push                    Determined per outage
1 – Critical    < 30 mins     Push                    Every 30 mins
Review comments recorded in this table: the update frequency, and whether it is a set frequency at all, was not settled. According to workshop feedback, frequency is defined on a case-by-case basis. In addition, it was decided that updates are published but not pushed to users.
Roles, in column order: Incident Process Owner; Service Desk; Tier 2/3; IT Managers; Caller; Service Owners/Managers; Service Specific Other Interested Parties
Activity                           Assignments (left to right; empty cells not shown)
Contact Management                 A R C
Incident Logging                   A R C
Classification & Prioritization    A R I I I
Investigation & Diagnosis          A R R C I C
Resolution & Recovery              A R R C C C I
Incident Closure                   A R I I I
The Business Service Map, which can help isolate Problems caused by Problems in related items
The CMDB Baseline, which can help track planned and unplanned Changes
ServiceNow allows you to create Knowledge Articles from Problem records. The Knowledge Base may have information gathered from
Incidents, and may also have useful Workarounds for Problems.
MTTR (Mean Time to Resolve) – Priority 1 & Priority 2; All, By Service. The mean time it takes to resolve an Incident to the point where the user(s) are no longer experiencing any symptoms of the Incident.
Number of Unique Incidents* – N/A. Measures total number of Incidents experienced. Does not include child incidents and duplicates.
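Both metrics above can be computed from a simple list of Incident records. The sketch below is illustrative: the record fields (`service`, `priority`, `opened`, `resolved`, `parent`, `duplicate`) and the sample data are hypothetical, and MTTR is measured as plain elapsed time, which ignores any SLA clock pauses described elsewhere in this guide.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample records; field names are assumptions for illustration.
incidents = [
    {"service": "Email", "priority": 1, "opened": datetime(2014, 1, 6, 9),
     "resolved": datetime(2014, 1, 6, 11), "parent": None, "duplicate": False},
    {"service": "Email", "priority": 2, "opened": datetime(2014, 1, 7, 9),
     "resolved": datetime(2014, 1, 7, 13), "parent": None, "duplicate": False},
    {"service": "Network", "priority": 1, "opened": datetime(2014, 1, 8, 8),
     "resolved": datetime(2014, 1, 8, 9), "parent": None, "duplicate": False},
    {"service": "Email", "priority": 3, "opened": datetime(2014, 1, 6, 10),
     "resolved": datetime(2014, 1, 6, 11), "parent": "INC0001", "duplicate": False},
]


def mttr_hours_by_service(records, priorities=(1, 2)):
    """Mean elapsed hours from open to resolve, per service, for the given priorities."""
    durations = defaultdict(list)
    for r in records:
        if r["priority"] in priorities and r["resolved"] is not None:
            hours = (r["resolved"] - r["opened"]).total_seconds() / 3600
            durations[r["service"]].append(hours)
    return {service: mean(hours) for service, hours in durations.items()}


def unique_incident_count(records):
    """Count Incidents that are neither child Incidents nor duplicates."""
    return sum(1 for r in records if r["parent"] is None and not r["duplicate"])


print(mttr_hours_by_service(incidents))  # {'Email': 3.0, 'Network': 1.0}
print(unique_incident_count(incidents))  # 3
```

Unresolved Incidents are skipped in the MTTR calculation so that open records do not distort the mean.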
9. Terms of reference
Assignment Groups
Also known as support groups, resolver groups or queues. Groups of team members who actively work records in the system to provide service
to customers.
See Ymir for a list of assignment groups with information about the business services, owner, and manager associated with each.
Use the UW Groups Service to manage assignment groups.
Caller
The individual reporting the issue, regardless of the reporting mechanism.
Contact Record
A structured user record that is used to store information about a valid user. Contact records are also used to correlate a user to other record
types, including Incident records.
Duty Manager
A role which provides 24x7 on-call support for incident escalation from a UW-IT Service/Help Desk. The role of the Duty Manager is to facilitate
communication, coordination of resources, and escalation to the UW-IT Unit Response Center (URC) Decision Support Group.
Incident
An unplanned interruption to an IT Service, or reduction in the Quality of an IT Service. This can include failure to meet agreed-upon service
levels (i.e., breach of an SLA).
Failure of a Configuration Item that has not yet impacted service -- i.e., failure of one disk from a mirrored set -- is also an Incident.
However, some teams may choose to treat such items as normal operations instead of incidents.
Incident resolution does not necessarily mean problem resolution; see Problem.
The goal of incident management is rapid restoration of service.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
Incident Manager
Role-based support staff who are responsible for coordinating all aspects of incident response. The Incident Manager role assignment can
change within an active incident from Tier-1 -> Duty Manager -> URC. The Incident Manager can be anyone in a UW-IT Service/Help Desk,
especially during the early stages of an incident, and is handed off as the incident escalates.
Note: There is another version of this definition in Incident Manager - Responsibilities and Concepts.
IT Manager
In Incident Management, IT Managers are responsible for ensuring that their IT groups remain engaged and available. IT Managers will allocate
both on-call and other resources as appropriate to assist with Incident Management activities.
Major Incident
An incident which exceeds the ability of the normal incident management process to deliver the desired results in a timely manner. Major
Incidents are declared by the Duty Manager, and confirmed by the URC Decision Support Group.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
NBD
Priority
The level of response and resources given to an incident. Priority is derived from Impact and Urgency.
See the Incident Severity Scale Guidelines, which currently reflect mostly the Impact part of Priority.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
Problem
The cause of one or more incidents. Problem determination is separate from incident resolution.
An incident is usually resolved when service is restored, but root cause investigation may continue after the incident is closed.
The goal of problem resolution is to prevent incidents.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
Record
A Document containing the results or other output from a Process or Activity. Records are evidence of the fact that an activity took place and may
be paper or electronic. For example, an Audit report, an Incident Record, or the minutes of a meeting.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
RFU
Request For Update. This is the periodic reminder from Service Desk that a status report is requested on an open Child Incident record.
Service desk
Any first-level support group. These groups are typically 24x7 operations centers or Service Center help desks and are primarily responsible for
First Time Resolution (FTR) and Incident Management.
Service Owner
In Incident Management, Service Owners are subject matter experts for their Service and need to be consulted when impactful Incidents have
been raised for their respective service. Service Owners should be consulted on any Incident fixes that could potentially alter their service
availability or the overall functionality of the Service.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.
Support Tier
Role-based support staff performing work within the ServiceNow toolset using ITIL processes.