
Incident Management Process Guide

The document provides guidelines for an incident management process used by the University of Washington. The primary goal of the process is to restore normal IT service operations as quickly as possible while minimizing impact on business operations. The service desk is responsible for documenting, processing, and resolving incidents, or escalating them to higher tiers as needed. The process includes steps for logging, classifying, investigating, resolving, and closing incidents. Roles and responsibilities are defined, and the process interfaces with other IT service management areas like change management and service level management. Metrics are collected to measure critical success factors and process performance.

Uploaded by

vijay kumar
Incident Management
Process Guide
University of Washington
Original version: September 23, 2013
Last update: January 4, 2014

This document will continuously improve and is subject to Change Management. If referencing a printed version, be sure you are referencing the most current version at https://wiki.cac.washington.edu/x/36mKAw

This process document was developed in partnership with Covestic, Inc. and UW-IT.
All trade names, trademarks, or registered trademarks are trade names, trademarks, or registered trademarks of their respective companies.

Process Name Incident Management

Process Description The primary goal of the Incident Management process is to restore
normal service operation as quickly as possible and minimize the
adverse impact on business operations. The Service Desk is the
primary customer-facing operations center tasked with documenting
and processing incidents through to resolution. The Service Desk is
the critical liaison between the support organization and the
customer. All levels or tiers of support utilize the incident
management process to restore service to their customers.

1. Process Objectives
2. Scope
2.1. Scope
3. Process Overview
3.1. Description & Overview
3.2. Process Inputs
3.3. Incident Ownership
3.4. Incident State Definitions
4. Incident Management Process Activities
4.1. Incident Logging & Identification Sub-Processes
4.1.1. Caller Identification
4.1.2. Output
4.2. Incident Logging
4.2.1. Output
4.3. Classification & Prioritization Sub-Processes
4.3.1. Incident Classification & Prioritization
4.3.2. Impact
4.3.3. Urgency
4.3.4. Priority
4.3.5. Incident Investigation & Diagnosis
4.3.6. Output
4.4. Investigation & Diagnosis Sub-Processes
4.4.1. Service Desk Resolution Guidance
4.4.2. Managing Staff Resources
4.5. Incident Resolution & Recovery Sub-Processes
4.6. Incident Closure
5. Incident Communications
6. Roles & Responsibilities
6.1. RACI Model
6.1.1. RACI Model Legend:
7. Process Interfaces and Dependencies
7.1. Service Transition
7.1.1. Change Management
7.1.2. Configuration Management
7.1.3. Release and Deployment Management
7.1.4. Knowledge Management
7.2. Service Design
7.2.1. Service Continuity
7.2.2. Service Level Management
7.3. Service Strategy
7.3.1. Financial Management
8. Process Measurements and Metrics
8.1. Critical Success Factors
8.2. Incident Management Operational Metrics
9. Terms of reference
Assignment Groups
Caller
Contact Record
Duty Manager
First Time Resolution (FTR)
Incident
Incident Manager
IT Manager
Major Incident
NBD
Parent Incident - Master Incident
Priority
Problem
Record
RFU
Service desk
Service Owner
Support Tier
URC Decision Support Group
UW-IT Unit Response Center (URC)

1. Process Objectives

The objective of the Incident Management process is to restore IT service operations to users as quickly as possible and minimize the adverse impact on business operations.

2. Scope
2.1. Scope
The scope of Incident Management will include all managed services as defined by the UWIT Service Catalog.

3. Process Overview
Figure 1 – Incident Management Lifecycle & Sub-Processes

3.1. Description & Overview


Incident Management is a process employed by UWIT to restore normal service operation as quickly as possible and minimize the adverse impact on business operations. The Service Desk is the primary customer-facing operations center tasked with documenting and processing incidents through to resolution. The goal of the Service Desk is to resolve the Incident through first time resolution (FTR) or, once it is established that FTR is not practical, to escalate the Incident as rapidly as possible to the appropriate Tier 2 group.
The Service Desk is the critical liaison between the support organization and the customer. All levels or tiers of support utilize the incident
management process to restore service to their customers.

3.2. Process Inputs


The Incident management process has the following inputs:

Telephone – User contacts the Service Desk using a centralized phone number.
Web Form – User completes a web form that is assigned to Service Desk support for resolution.
Instant Messenger – User initiates an Instant Messenger (IM) conversation with a Service Desk support analyst.
E-Mail – User sends a properly addressed e-mail message to a centralized e-mail address.
Walk-Up – A user walks up to an on-duty Service Desk analyst to report an Incident.
Events/Automated Alerts – Automated monitoring systems may generate Incidents.
Social Media/Community – Users may post issues on Facebook and Twitter that IT transforms into Incidents.
3rd Party – Third-Party partners and suppliers may report Incidents to UW
IT – Incidents may be generated by IT team members for unmonitored devices.

With regard to the Incident inputs, the following rules should be communicated to the user community in order to expedite Incident resolution:

Incidents that could potentially involve complete service failure or degradation should be reported to the Service Desk.
Incidents reported by web form, e-mail, or walk-up are excluded from FTR reporting.

3.3. Incident Ownership


In general, all newly created Incidents should first be assigned to the Service Desk group for triage. The Service Desk team members should be
individually assigned to Incidents and will retain ownership of the Incident until the Incident is closed or cancelled. The Incident record should not
change ownership, even in the case where assistance from other groups is required. Child work tasks should be created to delegate any work
activities to other groups.
The exceptions to this are cases where:

Ownership is being transferred to a different Service Desk group
Ownership is being transferred internally within a Service Desk group in order to maintain resolution continuity between work shifts
3.4. Incident State Definitions
Incident State Definition

New – Incidents start out in a New Incident state and remain in that state until they are assigned to a named resource (Assigned to).

Active – After the Assigned to field is no longer blank, the Incident moves to an Active Incident state.

Awaiting Problem – When the Incident is placed in the Awaiting Problem state, the SLA clock is paused while the parent Problem is being worked. Once the Problem has been resolved, all associated Incidents may be resolved by using the "Close Incidents" UI Action on the Problem form.

Awaiting User Info – The Awaiting User Info state pauses the SLA clock while support is waiting on additional information from a user and/or 3rd party. When the Incident is placed in the Awaiting User Info state, many customers have elected to send the Caller a system-generated e-mail notifying them that more information has been requested. A common customization is to set a rule/script that automatically resolves Incidents that are awaiting user info and have not been updated in 10 days. A reminder notification is sent to the caller after 5 days, warning them that their issue will be placed in the Resolved state if they do not respond. If the caller responds, the 10-day clock restarts.

Resolved – When the support team resolves an Incident, certain mandatory fields appear on the form that must be completed before the Incident may be put into a Resolved Incident state (more details on this later in the document).

Closed – The Incident is automatically moved from a Resolved to a Closed Incident state after three business days. The user may request that the Incident be reopened on their behalf during this period. After the Incident has been moved to a Closed Incident state, no additional updates or edits may be made to it.
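
The timing rules in the state definitions (reminder after 5 days, automatic resolution after 10 days without an update, automatic closure three business days after resolution) can be sketched in Python. This is only an illustration under stated assumptions, not the actual rule/script used by the tool: the function names are hypothetical, the 5 and 10 day limits are treated as calendar days, and business days are taken to be Monday through Friday with holidays ignored.

```python
from datetime import date, timedelta

REMINDER_AFTER_DAYS = 5             # reminder sent to the caller
AUTO_RESOLVE_AFTER_DAYS = 10        # Incident placed in Resolved state
AUTO_CLOSE_AFTER_BUSINESS_DAYS = 3  # Resolved -> Closed


def awaiting_user_info_action(last_update: date, today: date) -> str:
    """Return 'resolve', 'remind', or 'wait' for an Incident in Awaiting User Info.

    A caller response counts as an update, so passing the response date as
    `last_update` restarts the 10-day clock. A real rule would also record
    that the reminder was sent so it is not repeated every day.
    """
    idle_days = (today - last_update).days
    if idle_days >= AUTO_RESOLVE_AFTER_DAYS:
        return "resolve"
    if idle_days >= REMINDER_AFTER_DAYS:
        return "remind"
    return "wait"


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (assumption: Mon-Fri, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def auto_close_date(resolved_on: date) -> date:
    """Date on which a Resolved Incident moves to Closed."""
    return add_business_days(resolved_on, AUTO_CLOSE_AFTER_BUSINESS_DAYS)
```

Under these assumptions, an Incident resolved on a Friday would close on the following Wednesday.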

4. Incident Management Process Activities


4.1. Incident Logging & Identification Sub-Processes
The first task is to establish the identity of the user. This is achieved through the Caller Identification activities.

4.1.1. Caller Identification


The first activity of the Incident Management process is to definitively establish the identity and contact information of the caller. For inputs of web form, instant messenger, and e-mail, verification of identity by electronic means can be assumed. In cases where identity cannot be initially established, the following activities will assist with achieving Caller Identification.

Figure 2- Caller Identification Activities


Name Activities

1.1 Input – A Service Desk support analyst processes an incoming Incident submission from one of the methods described in Process Inputs.

1.2 Verify Identification – The analyst will attempt to establish the user's identity by electronic means or by manual inquiry. This will not require the user to provide any additional information. If identification can be verified to an existing user, the analyst will proceed with Incident Logging (2.0). If the analyst cannot verify identity, they must begin to establish user identity in step 1.3.

1.3 Existing User – In this step, the Service Desk support analyst will attempt to identify the user by asking the user for additional information (tokens). Using these tokens, the analyst will attempt to correlate the user to an existing user record. If an existing user record is found, the analyst will proceed with Incident Logging (2.0). If the analyst cannot find an existing user record, the analyst will proceed to step 1.4 to create a user Contact Record.

1.4 Create Contact Record – In this step, a user record will be created on behalf of the reporting user. Minimal information will be populated in favor of expediting other Incident management activities. At a later time, the user will be notified that they have been added to the system, and will have the ability to add/modify their contact information.

2.0 Incident Logging – The output of the Caller Identification activity is meta-data for consumption by the Incident Logging process.

4.1.2. Output
Upon completion of the Caller Identification activities, the user of record should be recorded and verified, with a means to contact the user again if necessary. The Caller Identification output will be used as a process input for the next set of activities, Incident Logging.
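
The Caller Identification steps (1.1 to 1.4) amount to a short decision chain. The Python sketch below illustrates that chain under stated assumptions: `ContactDirectory` is a hypothetical in-memory stand-in for the user records, tokens are simple strings such as an e-mail address, and the channel names are illustrative. It is not the Service Desk tool's actual logic.

```python
from dataclasses import dataclass, field

# Channels for which the guide says electronic verification can be assumed.
ELECTRONIC_CHANNELS = {"web form", "instant messenger", "e-mail"}


@dataclass
class ContactDirectory:
    """Hypothetical stand-in for existing user records, keyed by token."""
    records: dict = field(default_factory=dict)  # token -> user name

    def find(self, tokens):
        """Return the first user whose record matches one of the tokens."""
        for token in tokens:
            if token in self.records:
                return self.records[token]
        return None

    def create_minimal(self, name, token):
        """Step 1.4: create a Contact Record with minimal information."""
        self.records[token] = name
        return name


def identify_caller(directory, channel, name, tokens):
    """Return (caller, step), where step is where identity was settled."""
    # Step 1.2: identity verified without asking the user for more information.
    if channel in ELECTRONIC_CHANNELS:
        return name, "1.2"
    # Step 1.3: correlate tokens supplied by the user to an existing record.
    existing = directory.find(tokens)
    if existing is not None:
        return existing, "1.3"
    # Step 1.4: no record found, so create a minimal Contact Record.
    if not tokens:
        raise ValueError("at least one token is needed to create a Contact Record")
    return directory.create_minimal(name, tokens[0]), "1.4"
```

In each case the returned caller is what gets passed on to Incident Logging (2.0).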

4.2. Incident Logging


Incident logging includes activities that are required to accurately record an Incident or Request. During this stage in the process, the analyst has
only determined caller identification, and has not made the determination of whether the call is Incident related or Request related. The purpose of
this activity is to establish the following:

Incident Related or Service Catalog Related or Request Related
Collect Incident Record meta-data
Create Incident Record

Figure 3- Incident Logging Activities

Name Activities

1.0 Caller Identification – Process input from the Caller Identification process (1.0).

2.1 Collect Data Elements – In this step, essential data elements will be collected. The data elements collected must include the following minimum elements:

Name Description
Caller Name – Name of the person reporting the Incident
Title – A brief description of the Incident
Description – Detailed information regarding the Incident

For interactive systems, this information is typically supplied by the user.

Note: Information collected in this step is only used to support activities in step 2.2 and is not intended to include all information regarding the Incident.

2.2 Call Type – Based on the information collected in step 2.1, a sufficient amount of information has been collected to determine whether the call is Incident related or Request related. The analyst will make the determination based on the information provided.

2.3 Request Process – If the call is determined to be Request or Service Catalog related, the call will then defer to the Request Process. The Request Process is not covered in this process guide, but in a dedicated guide focused on Request Management.

2.4 Create Incident Record – Additional information, beyond the elements collected in step 2.1, is documented and recorded in a formal Incident record. The analyst should collect enough information to support activities in the Classification & Prioritization processes. These data elements are typically completed by an IT team member, and not the user. During this step, a unique record identifier is created and can be communicated to the user, if desired.

3.0 Classification & Prioritization – The output of the Incident Logging activity is meta-data for consumption by the Classification & Prioritization process.

4.2.1. Output
Upon completion of the Incident Logging activities, all information collected up to this point has been documented and can be easily referenced, if necessary, and used by other groups for other Incident Management activities. The Incident Logging output will be used as a process input for the next set of activities, Classification & Prioritization.
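
The Incident Logging steps can be illustrated with a short Python sketch: check the minimum data elements from step 2.1, route Request or Service Catalog calls to the Request Process (2.3), and otherwise create an Incident record with a unique identifier (2.4). The names `CallData` and `log_call`, the call type strings, and the `INC0000001` number format are all assumptions made for illustration; the guide itself only states that a unique record identifier is created.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical record numbering; the real identifier format is not specified.
_record_numbers = count(1)


@dataclass
class CallData:
    """Minimum data elements collected in step 2.1."""
    caller_name: str
    title: str
    description: str


def log_call(data: CallData, call_type: str) -> dict:
    """Steps 2.2 to 2.4: route a Request, or create an Incident record."""
    for field_name in ("caller_name", "title", "description"):
        if not getattr(data, field_name).strip():
            raise ValueError(f"missing minimum data element: {field_name}")
    if call_type in ("request", "service catalog"):
        # Step 2.3: handled by the separate Request Process.
        return {"next": "Request Process"}
    if call_type != "incident":
        raise ValueError(f"unknown call type: {call_type}")
    # Step 2.4: create the Incident record with a unique identifier.
    return {
        "next": "Classification & Prioritization",
        "number": f"INC{next(_record_numbers):07d}",
        "caller_name": data.caller_name,
        "title": data.title,
        "description": data.description,
    }
```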

4.3. Classification & Prioritization Sub-Processes


The next set of activities is to take the available information in the Incident Record and make a determination on the following:

Incident Impact
Incident Urgency
Incident Prioritization

4.3.1. Incident Classification & Prioritization


Incident Classification first involves identifying the Business Services, Applications or Assets affected by the issue. Business and IT Services are often represented as Configuration Items (CIs).
Based on the information collected up to this point, the analyst will then assign an initial Impact and Urgency. These two variables will be used to calculate Priority.

Figure 4- Incident Classification & Prioritization Activities

Affected Services (CIs)


In this step the analyst will select the Business or IT service that is affected by this Incident. Keep in mind that the population of this field will be
based on "best guess" with the information provided from the caller. The value of this field may change several times over the course of the
lifecycle.
Ideally, only a single Affected Service should be identified for each Incident.

4.3.2. Impact
The effect that an Incident has on the business, rated from 1 to 3, where 1 is High Impact. Outages impacting key services, systems or business processes are High Impact.

Impact Level Description

High – A vital service or system is unavailable, or performance or functionality is highly degraded, or users in multiple facilities are affected. Business operations are disrupted.

Medium – A vital service or system performance or functionality is degraded or a non-vital service or system performance or functionality is highly degraded, or all users in a single facility are affected.

Low – A non-vital service or system performance or functionality is degraded, or an individual or single user is affected. Requests and Inquiries are typically low impact.

Figure 5 - Impact Field Definitions

4.3.3. Urgency
The extent to which the Incident's resolution can bear delay – between one and three. One is high Urgency. A user with a critical deadline could
have an Impact of Low (single user) but a High Urgency.

Urgency Level Description


High – No workaround is available or resolution (or request fulfillment) cannot be delayed.

Medium – A workaround exists or work can be shifted to other activities, permitting customer productivity to be maintained for at least the next 8 business hours.

Low – A workaround exists or work can be shifted to other activities, permitting customer productivity to be maintained for at least the next two business days.

Figure 6- Urgency Field Definitions

4.3.4. Priority
The order in which the support organization should address the Incident or Service Request, rated from one to five. One is Critical Priority. Setting both Impact and Urgency to 1 for any Incident immediately sets the Priority level to 1.

4.3.4.1 Incident Prioritization Matrix

                 High Urgency        Medium Urgency      Low Urgency
High Impact      Critical Priority   High Priority       Moderate Priority
Medium Impact    High Priority       Moderate Priority   Low Priority
Low Impact       Moderate Priority   Low Priority        Planning Priority

Figure 7 - Incident Prioritization Matrix
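
The matrix in Figure 7 can be expressed as a simple lookup. The following is a minimal Python sketch, not part of the UW-IT tooling; the names `PRIORITY_MATRIX`, `priority`, and `priority_number` are hypothetical. The numeric form assumes that Impact and Urgency use 1 for High through 3 for Low, and that the matrix's "Moderate" is the same level as Priority 3 "Medium" in the Service Desk Resolution Guidance table.

```python
# Impact and Urgency levels: 1 is High, 3 is Low.
LEVELS = {"High": 1, "Medium": 2, "Low": 3}

# Priority names as they appear in the matrix, keyed by priority number.
PRIORITY_NAMES = {1: "Critical", 2: "High", 3: "Moderate", 4: "Low", 5: "Planning"}

# PRIORITY_MATRIX[impact][urgency], transcribed from Figure 7.
PRIORITY_MATRIX = {
    "High":   {"High": "Critical", "Medium": "High",     "Low": "Moderate"},
    "Medium": {"High": "High",     "Medium": "Moderate", "Low": "Low"},
    "Low":    {"High": "Moderate", "Medium": "Low",      "Low": "Planning"},
}


def priority(impact: str, urgency: str) -> str:
    """Look up the Priority name for an Impact/Urgency pair."""
    return PRIORITY_MATRIX[impact][urgency]


def priority_number(impact: str, urgency: str) -> int:
    """Numeric Priority (1-5); every cell of the matrix equals impact + urgency - 1."""
    return LEVELS[impact] + LEVELS[urgency] - 1
```

With these definitions, `priority("Low", "High")` returns `"Moderate"`, which corresponds to the case of a single user with a critical deadline described under Urgency.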

4.3.5. Incident Investigation & Diagnosis


The output for the Classification & Prioritization activity is meta-data for consumption by the Investigation & Diagnosis process.

4.3.6. Output
Upon completion of the Classification & Prioritization activities, the Incident Record is now ready to be processed for appropriate action. For the
Incident Process, most actions (or non-actions) are driven by the Incident Priority. The Priority will be key in determining how the Incident is
handled from this point forward.

4.4. Investigation & Diagnosis Sub-Processes


Now that a Priority has been determined by the Classification & Prioritization process, the next set of activities is to quickly investigate and diagnose the reported Incident. In summary, the activities involved in this step include:

Check for duplicate records
Check for Parent Incident
Check Priority for Escalation
Duplicate (reproduce) Incident symptoms

The Incident priority will drive who performs Investigation & Diagnosis and when it must occur.

Figure 8 - Incident Investigation & Diagnosis Workflow

Note: Objects in red are activities exclusively executed by Tier 2 or Tier 3 resources. Any Incident that requires activities in red is not considered for FTR.
Name Activities

3.0 Input – Before Investigation & Diagnosis the Incident record must be assigned a Priority, which is determined in the Classification & Prioritization Activities.

4.1 Duplicate Record – When an Incident is worked on by an IT team member, they should first check to ensure that the Incident is not a duplicate record. For this step, a duplicate record is a record that is:

Reporting the same issue or similar issue

The analyst should make a determination if this issue has already been previously reported. The most common way to correlate duplicate Incident Records is to look at other open Incidents against the same Business Service (CI) that was identified in step 3.1.

4.2 Recurring Incident – If a duplicate Incident record is suspected in step 4.1, the analyst must then make the determination if the Incident Record is:

Duplicate issue reported by the same user
Duplicate issue reported by a different user

A duplicate issue reported by the same user should be closed as a duplicate issue with reference to the original Incident record reported.

A duplicate issue reported by a different user is indicative of a more widespread issue. The duplicate Incident should then be associated to a Parent Incident record for tracking.

4.3 Attach to Parent Incident Record – When a duplicate Incident has been identified that has been reported by different users, all Incidents reported should be associated to the Parent Incident Record. The Parent Incident Record will have the following attributes:

When the Parent Incident is resolved, all Child Incidents will be updated with appropriate close notes (see Incident Closure Process).
When the Parent Incident is resolved, all Child Incidents will be subsequently closed.

The purpose of this step is to ensure that IT is working off of one Incident Record, but still maintains contact with each user through each individual report. IT team members should be cognizant of open Incident Records with an associated Parent Incident. Many internal IT progress work notes will be published in the Parent Incident, but not in each individual Child Record. Tasks for work needed to resolve the Incident should be added to the Parent Incident.
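
The duplicate and Parent Incident handling in steps 4.2 and 4.3 can be sketched as follows. This Python illustration makes several assumptions that the guide does not state: the first report acts as the Parent Incident, records are plain in-memory objects, and the `Incident`, `handle_duplicate`, and `resolve_parent` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    number: str
    caller: str
    state: str = "Active"
    close_notes: str = ""
    parent_number: Optional[str] = None
    children: List["Incident"] = field(default_factory=list)


def handle_duplicate(original: Incident, duplicate: Incident) -> None:
    """Step 4.2: decide what to do with a suspected duplicate record."""
    if duplicate.caller == original.caller:
        # Same user: close as a duplicate, referencing the original record.
        duplicate.state = "Closed"
        duplicate.close_notes = f"Duplicate of {original.number}"
    else:
        # Different user: likely a wider issue, so attach to the Parent (step 4.3).
        duplicate.parent_number = original.number
        original.children.append(duplicate)


def resolve_parent(parent: Incident, close_notes: str) -> None:
    """When the Parent is resolved, update and close every Child Incident."""
    parent.state = "Resolved"
    parent.close_notes = close_notes
    for child in parent.children:
        child.close_notes = f"{close_notes} (resolved via {parent.number})"
        child.state = "Closed"
```

Keeping the work notes on the Parent and only copying close notes to each Child mirrors the intent that IT works from one record while each user still has an individual report.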

4.4 Service Desk Resolution – This determines whether it is appropriate for a Service Desk resource to attempt to resolve the Incident. Guidance for this decision is listed in the table below.

4.5 Can Be Duplicated – The analyst will attempt to duplicate the Incident to ensure that the Incident is still current and/or repeatable. In general, this step should be able to isolate the issue to:

Can Not Be Duplicated
Can Only Be Duplicated By Single User, Computer, or Browser
Can Only Be Duplicated by Users in the same location
Can Be Duplicated by All Users

4.6 Knowledge Base – Service Desk resources should attempt to resolve Incidents using known repeatable Incident resolution processes. These processes are often stored in either a formal (or informal) Knowledge base. A simple query against available Knowledge sources should be performed to determine if there is a documented resolution to a particular issue. In some cases, it may be appropriate to provide the caller with a link (or attachment) of the Knowledge article for resolution.

4.7 Known Error Database – In addition to checking the available Knowledge Bases, the Incident should be checked against the Known Error Database. Typically, the Business Services identified in step 3.1 should be compared to Known Errors for the associated Business Service.

4.8 Attach to Problem Record – If the Incident is suspected to be associated with a Known Error, the Incident Record should be associated with the Known Error. This will assist Problem Management towards understanding the overall impact and assist with a Long Term Fix. In addition to associating the Incident to the Known Error, the Known Error Record should also be searched for an available workaround. Any workaround information should be communicated to the caller as a means of resolving the Incident.

The advantage of associating the Problem record to the Incident Record is that when a workaround or long-term fix is available, the updated Problem record should update any associated Incident records.

4.9 Can Resolve? – At this point, a determination should be made if:

Enough information is available to attempt resolution
Enough allocated time (by priority) is available to attempt resolution

4.10 Update Incident Record – If step 4.9 determines that enough time or information is not available to resolve the Incident, the assigned analyst should update the Incident Record with the steps taken so far. This is to ensure that other support teams are not duplicating resolution steps.

4.11 Functional (Tier 2) Escalation – If for any reason, at any time, Service Desk resources are unable to resolve an Incident, a child work task should be assigned to Tier 2 for further Investigation.

4.12 Advanced Diagnostics – Tier 2 will often utilize resources not readily available to Service Desk resources (e.g. system or application logs, Change logs, vendor knowledge bases).

Tier 2 resources will also adjust any Incident Record fields that may have been incorrectly diagnosed or represented up to this point. This may include Priority, Affected CIs, Urgency, and Impact.

If additional Tier 2 resources are required, a separate child work task should be opened and assigned to the Tier 2 group. It is not uncommon for an Incident record to have one or more child work tasks open at the same time.

If assistance from the vendor is required, the Tier 2 resource should retain ownership of its task, and make notes that vendor resources are being consulted.

Any work performed should be recorded in the child work tasks and/or Incident work notes.

4.13 Management Escalation – Tier 2 should request that Tier 1 engage the Duty Manager if at any time:

Progress toward resolution is not satisfactory or timely.
More resources are required than Tier 1 can ordinarily engage.
Higher-level coordination or communication is required.
Management visibility is desired for business reasons.
A business decision needs to be made before work can proceed.
The incident is or may become a Major Incident. (Only the Duty Manager can declare an incident Major and activate the URC decision tree.)
Illegal or unauthorized activity has occurred or is suspected (in this case, recovery activity should be stopped until the Duty Manager responds).

5.0 Output – Once enough information has been analyzed that a resolution can be performed, the Incident record can continue to the next process steps, Incident Resolution & Recovery.

6.0 Incident Closure – This is in reference to the Incident Closure sub-process covered later in this process guide.

4.4.1. Service Desk Resolution Guidance

Priority Description Action

1 Critical – Work available help text. If Incident is not resolved within 30 minutes, assign work task to appropriate Tier 2 resource for immediate resolution.

2 High – Work available help text. If Incident is not resolved within 60 minutes, assign work task to appropriate Tier 2 resource for immediate resolution.

3 Medium – Work available help text. If Incident is not resolved within 120 minutes, assign work task to appropriate Tier 2 resource for resolution within 1 business day.

4 Low – Work available help text. If Incident is not resolved within 1 business day, assign work task to appropriate Tier 2 resource for resolution as needed.

5 Planning – Work available help text. If Incident is not resolved, hold in queue until resolution is required. Do not assign work task to Tier 2 resources until Incident has been assigned a higher priority.
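
The Service Desk Resolution Guidance table reduces to a per-priority time limit before a work task is assigned to Tier 2. The Python sketch below is illustrative only; the function and parameter names are hypothetical, and elapsed time is assumed to be tracked in minutes for Priorities 1 to 3 and in business days for Priority 4.

```python
# Minutes the Service Desk may work an unresolved Incident before
# assigning a work task to Tier 2 (Priorities 1 to 3).
ESCALATE_AFTER_MINUTES = {1: 30, 2: 60, 3: 120}


def should_assign_to_tier2(priority: int, minutes_worked: int,
                           business_days_worked: int = 0) -> bool:
    """Return True when an unresolved Incident should get a Tier 2 work task."""
    if priority in ESCALATE_AFTER_MINUTES:
        return minutes_worked >= ESCALATE_AFTER_MINUTES[priority]
    if priority == 4:
        # Low: assign after 1 business day without resolution.
        return business_days_worked >= 1
    if priority == 5:
        # Planning: held in queue until the Incident is given a higher priority.
        return False
    raise ValueError(f"unknown priority: {priority}")
```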

4.4.2. Managing Staff Resources


This section describes how staff should be managed during Incidents. The table below provides guidance on when appropriate resources can be
utilized, as well as when key personnel should be alerted to the Incident.

Active Passive

Columns: Priority/Type; Engage On Call Resources; Engage Duty Manager; Engage Other Staff Resources; Notify Service Manager; Notify Division Heads; Notify CIO; Effort to Resolve

Major: X X X X X X; Continuous (24 hrs)
1 – Critical: X X X X; Continuous (24 hours)
2 – High: X X; Business Hours Continual
3 – Medium: X; Can Be Suspended for up to 5 days
4 – Low: Can Be Suspended Indefinitely
5 - Planning: N/A

Engage On Call Resources – This authorizes Incident Managers to engage on-call IT resources for assistance in the Investigation & Diagnosis of an applicable Incident. Group On Call Rotation schedules should be referenced as to which particular resource is on call. This authorization also extends to secondary, tertiary, or on call manager escalation, if the primary on call resource is not available.
Engage Duty Manager – A UW Duty Manager (based on rotation schedule) is engaged for major and critical issues. The Duty Manager coordinates and engages all resources needed to resolve major and critical issues.
Engage Other Staff Resources – This authorizes Incident Managers to engage any IT resource for assistance in the Investigation & Diagnosis of an applicable Incident. This authorization includes engaging resources that are not currently on the on-call list.
Notify Service Manager – This mandates that the Incident Manager must at a minimum alert the Service Manager for the affected Service(s) of the current Incident.
Notify Division Heads – This mandates that the Incident Manager must at a minimum alert the Division Heads of the affected Service(s) of the current Incident.
Notify CIO – This mandates that the Incident Manager must at a minimum alert the University of Washington IT Chief Information Officer (CIO) of the affected Service(s) of the current Incident.
Effort to Resolve – This describes the mandated minimum effort levels required for Incident Investigation & Diagnosis.

Continuous – UWIT must work continuously on the Incident until resolution or re-prioritization. These activities extend into non-core operating hours including weeknights, weekends, and holidays. If applicable, user resources must also be available to assist with Investigation & Diagnosis. (e.g. A user cannot report a Priority 1 Incident and then go home with no alternative resource to assist from the user perspective.)
Business Hours Continual – UWIT must work continuously on the Incident until resolution or re-prioritization during core operating business hours. Investigation & Diagnosis activities can be suspended by IT resources or user resources during off-peak hours. Outside of core operating hours, no active Incident Management activities, except for system logging or system monitoring, are assumed to be occurring. Activities must resume during business hours or the Incident must be re-prioritized.
Incident Suspension – An Incident can be suspended by UWIT and/or the reporting user for a period as indicated in this table. During the time of the suspension, it is assumed that no active Incident Management activities, except for system logging or system monitoring, are occurring. Activities must resume within the time frame indicated.
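
The Effort to Resolve levels can be read as a rule for whether active work on an Incident is mandated at a given moment. The Python sketch below encodes that reading; the function name and parameters are hypothetical, and the 5-day suspension limit for Priority 3 is counted in plain days because the guide does not say whether business days are meant.

```python
def active_work_mandated(priority, in_business_hours: bool,
                         days_suspended: int = 0) -> bool:
    """Return True when the minimum effort level requires active work now.

    `priority` is "Major" or a Priority number from 1 to 5.
    """
    if priority in ("Major", 1):
        # Continuous: includes weeknights, weekends, and holidays.
        return True
    if priority == 2:
        # Business Hours Continual: work may pause outside core hours.
        return in_business_hours
    if priority == 3:
        # May be suspended, but activities must resume within 5 days.
        return days_suspended >= 5
    if priority in (4, 5):
        # Low can be suspended indefinitely; Planning has no mandated effort.
        return False
    raise ValueError(f"unknown priority: {priority}")
```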

4.5. Incident Resolution & Recovery Sub-Processes


At this point in the process, the Incident is ready for resolution and recovery. Appropriate steps should be taken to ensure efficient Incident
resolution. This includes:

Communication of workarounds or fixes (if required and available)
Timely Resolution
Non-impact service Incident resolution
Request & Schedule Change
Management of resolution through repair
Management of resolution through resolution
Efficient documentation for Problem Management

Figure 9- Incident Resolution & Recovery Activities

Name Activities

4.0 Investigation & Diagnosis The input into the Resolution &
Recovery activities is a completed Inves
tigation & Diagnosis where a
remediation for the Incident is available.
The remediation can be in the form of a:

Permanent Fix
Work Around
Known Error

5.1 Change Record Remediation for the Incident should be


evaluated as to whether a Change (as
defined by Change Management) is
required. If a Change is required, the
Change should be requested and
processed through the Change
Management process.
5.2 Schedule/Execute Repair If no change is required, it should be
determined when is the proper time to
execute the remediation steps. Most
remediation steps should occur
immediately, however, IT staff should
be cognizant of executing repairs that
have the potential of causing service
interruption and/or degradation. Those
remediation activities should occur
when it will be least impactful to the
environment, while still timely resolving
the Incident.

5.3 Confirm Recovery This step involves confirming that the issue has been completely resolved. In some cases, full service recovery
may not occur immediately (e.g., message queues usually take several minutes to several hours to recover; failed backups could take
several hours). While no additional action is required at this point, an Incident cannot be considered Resolved until service has been
completely restored.

5.4 Problem After confirmation of service recovery, a determination is made as to whether a new Problem record should be created or
an existing one updated for this Incident. If a Problem record is created, it will be managed and processed by the Problem Management
process.

6.0 Incident Closure The output of the Resolution & Recovery activities feeds the final set of activities, Incident Closure.
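
The decision points in steps 5.1 through 5.3 can be sketched as a small routine. This is an illustrative sketch only: the function names, parameter names and step strings are hypothetical and are not part of ServiceNow or of this process definition.

```python
from typing import List


def plan_resolution(change_required: bool, may_disrupt_service: bool) -> List[str]:
    """Return the ordered Resolution & Recovery steps for one Incident.

    Both flags are assumed to come out of Investigation & Diagnosis (4.0).
    """
    steps = []
    if change_required:
        # 5.1: the remediation needs a Change, so hand off to Change Management
        steps.append("5.1 Request Change through Change Management")
    elif may_disrupt_service:
        # 5.2: repairs that may interrupt or degrade service wait for a quiet time
        steps.append("5.2 Schedule repair for the least impactful time")
    else:
        # 5.2: most remediation steps should occur immediately
        steps.append("5.2 Execute repair immediately")
    steps.append("5.3 Confirm Recovery")
    steps.append("5.4 Create or update Problem record if warranted")
    steps.append("6.0 Incident Closure")
    return steps


def can_mark_resolved(service_fully_restored: bool) -> bool:
    """5.3: an Incident is not Resolved until service is completely restored."""
    return service_fully_restored
```

For example, `plan_resolution(False, True)` starts with the scheduled-repair step rather than an immediate repair.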

4.6. Incident Closure


At this point in the process, the Incident is ready for closure. Appropriate steps should be taken to ensure that the Incident record is complete and accurate before it is closed.

Figure 10- Incident Closure Activities

Name Activities
5.0 Recovery & Resolution The input into the Incident Closure activities is a resolved Incident record. The service and user are no
longer experiencing a disruption of service, and there are no lingering effects of the Incident that need immediate action.

6.1 Update Incident Record Updating the Incident record requires the Incident Owner to ensure that all child tasks have been updated
and closed. In addition, any work notes for each task should be documented.

In addition to work notes, the Incident record should also be updated with Close Notes, which include a Close Code. As a best practice,
upper tiers of support should tell lower tiers how an incident was resolved, for continual improvement and knowledge transfer.

Close Code – Definition

Solved (Work Around) – Issue is believed to be resolved by a temporary fix or workaround while a long-term solution is being researched,
tested, purchased or planned.

Solved (Permanently) – Issue is believed to be completely resolved and should not reoccur.

Not Solved (Not Reproducible) – Issue was not able to be reproduced or observed again.

Not Solved (Not Feasible) – Issue is not resolved and resolution is not planned. Permanent fix is too expensive or otherwise unfeasible.

Closed/Resolved by Caller – Issue is believed to be completely resolved by the client and should not reoccur.

Additionally, Incident Record data points that will need to be completed include:

Time worked (for each individual involved; optional based on assignment group)

6.2 Reconcile Incident Data At this point, the Incident is almost ready for its resolution to be communicated. The individuals
contributing to the resolution should review the Incident Record and verify that all values in it accurately reflect the Incident. The most
common fields that should be reconciled include:

Affected CIs
Impact
Close Code

6.3 Resolve The Incident is marked as resolved.


Appropriate notifications (based on
priority) are sent out to alert of Incident
resolution.

6.4 Close Incident After being marked as Resolved, the


Incident will remain in this state for 3
business days. If no further updates are
provided, the Incident will automatically
be marked as Closed. This rule
provides the user with the ability to
respond to the Resolve notification in
step 6.3 for a period of up to 3 business
days, where they can escalate the
Incident if they feel the Incident is not
fully resolved.

6.5 Survey Closed Incidents should be randomly surveyed for customer satisfaction. A survey request will be sent out for 1 out of
every 5 (20%) Incidents that are closed.

In addition to the 20% random sampling, any caller who responds to a survey will not be surveyed again for a period of 30 days.
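
The timing rules in steps 6.4 and 6.5 can be sketched as follows. This is a minimal sketch under stated assumptions, not the ServiceNow implementation: business days are taken to be Monday through Friday with no holiday calendar, and all function and parameter names are hypothetical.

```python
import random
from datetime import datetime, timedelta
from typing import Dict, Optional

AUTO_CLOSE_BUSINESS_DAYS = 3              # step 6.4
SURVEY_RATE = 0.20                        # step 6.5: 1 out of every 5
SURVEY_QUIET_PERIOD = timedelta(days=30)  # step 6.5


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance start by the given number of weekdays (assumption: Mon-Fri, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def should_auto_close(resolved_at: datetime, now: datetime,
                      updated_since_resolve: bool = False) -> bool:
    """True once 3 business days have passed in Resolved with no further updates."""
    if updated_since_resolve:
        return False
    return now >= add_business_days(resolved_at, AUTO_CLOSE_BUSINESS_DAYS)


def should_survey(caller: str, now: datetime,
                  last_response: Dict[str, datetime],
                  draw: Optional[float] = None) -> bool:
    """Sample 20% of closed Incidents, skipping callers who responded within 30 days.

    draw is a number in [0, 1); it is drawn at random unless supplied by the caller.
    """
    if draw is None:
        draw = random.random()
    responded_at = last_response.get(caller)
    if responded_at is not None and now - responded_at < SURVEY_QUIET_PERIOD:
        return False
    return draw < SURVEY_RATE
```

Under these assumptions, an Incident resolved on a Friday morning becomes eligible for automatic closure on the following Wednesday morning.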

5. Incident Communications
This section describes the approach to communicating Incidents to the user community. The communication approach is dictated by the Incident
Priority and is described in terms of the following:
Bulletin – A bulletin is a general communication (usually through e-mail) that alerts the user community to IT services that are currently
experiencing an Incident.
Communication Method – The types of communication methods are listed as follows:

Push Communications – The user is actively sent (pushed) a communication (typically by e-mail).

Pull Communications – Users are alerted to an Incident by communications that they are already actively engaged in (e.g., news feeds,
RSS feeds, Live Feed).

User Updates – Based on the communication method, this describes how often a user receives updates on the Incident.

Priority/Type    Bulletin     Communication Method    User Updates

Major            < 15 mins    Push                    Determined per outage
1 – Critical     < 30 mins    Push                    Every 30 mins
2 – High         < 1 hour     Pull                    Every business day
3 – Medium       None         Pull                    When Resolved
4 – Low          None         Pull                    When Resolved
5 – Planning     None         Pull                    When Re-Prioritized

Note: The bulletin frequency for Major and 1 – Critical Incidents has not been settled as a set frequency. According to workshop feedback,
frequency is defined on a case-by-case basis. In addition, it was decided that updates are published but not pushed to users.
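
The matrix above lends itself to a simple lookup. The sketch below assumes the bulletin window is measured from the time the Incident is opened, which this guide does not state; the class, dictionary and function names are hypothetical, and the priority keys use a plain hyphen for convenience.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CommunicationRule:
    bulletin_within: Optional[timedelta]  # None means no bulletin is sent
    method: str                           # "Push" or "Pull"
    user_updates: str                     # update frequency as worded in the table


# Values copied from the communications table in this section.
COMMUNICATION_RULES = {
    "Major": CommunicationRule(timedelta(minutes=15), "Push", "Determined per outage"),
    "1 - Critical": CommunicationRule(timedelta(minutes=30), "Push", "Every 30 mins"),
    "2 - High": CommunicationRule(timedelta(hours=1), "Pull", "Every business day"),
    "3 - Medium": CommunicationRule(None, "Pull", "When Resolved"),
    "4 - Low": CommunicationRule(None, "Pull", "When Resolved"),
    "5 - Planning": CommunicationRule(None, "Pull", "When Re-Prioritized"),
}


def bulletin_due_by(priority: str, opened_at: datetime) -> Optional[datetime]:
    """Return the latest time a bulletin should go out, or None if none is required.

    Assumption: the window starts when the Incident is opened.
    """
    rule = COMMUNICATION_RULES[priority]
    if rule.bulletin_within is None:
        return None
    return opened_at + rule.bulletin_within
```

Keeping the table as data rather than as branching logic means a change to a window or a method is a one-line edit.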

6. Roles & Responsibilities


6.1. RACI Model
For role definitions, see the Incident Management glossary below.

Activity | Incident Process Owner | Service Desk | Tier 2/3 | IT Managers | Caller | Service Owners/Managers | Service Specific Other Interested Parties

Contact Management                 A R C
Incident Logging                   A R C
Classification & Prioritization    A R I I I
Investigation & Diagnosis          A R R C I C
Resolution & Recovery              A R R C C C I
Incident Closure                   A R I I I

6.1.1. RACI Model Legend:


R (Responsible) – The individual(s) responsible for executing the day-to-day procedures related to the activity.
A (Accountable) – The individual who holds overall accountability for the activity. For each activity, there can only be one individual who is
accountable.
C (Consulted) – The individual(s) who are consulted for specific procedures of the activity. This is often seen in the form of an approval
before a specific work procedure can be executed.
I (Informed) – The individual(s) who are informed or notified of specific procedures of the activity. This is often seen in the form of either an
active (e.g. email) or passive (record update) notification.
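
The single-accountability rule in the legend can be checked mechanically. The sketch below uses the RACI codes from the table in 6.1 in the order they appear, without assigning them to role columns; the variable and function names are hypothetical.

```python
from typing import Dict, List

# RACI codes per activity, in the order listed in the RACI table.
RACI_ROWS: Dict[str, List[str]] = {
    "Contact Management": ["A", "R", "C"],
    "Incident Logging": ["A", "R", "C"],
    "Classification & Prioritization": ["A", "R", "I", "I", "I"],
    "Investigation & Diagnosis": ["A", "R", "R", "C", "I", "C"],
    "Resolution & Recovery": ["A", "R", "R", "C", "C", "C", "I"],
    "Incident Closure": ["A", "R", "I", "I", "I"],
}


def accountability_errors(rows: Dict[str, List[str]]) -> List[str]:
    """Return one message per activity that does not have exactly one 'A'."""
    errors = []
    for activity, codes in rows.items():
        count = codes.count("A")
        if count != 1:
            errors.append(f"{activity}: expected 1 Accountable, found {count}")
    return errors
```

Run against the rows above, the check returns an empty list, meaning every activity has exactly one Accountable role.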

7. Process Interfaces and Dependencies


7.1. Service Transition
7.1.1. Change Management
Incident Management ensures that all resolutions or workarounds that require a change to a CI are submitted to Change Management through
an RFC. Change Management will monitor the progress of these changes and keep Incident Management advised. Incident Management is also
involved in rectifying the situation caused by failed changes. ServiceNow allows you to initiate an RFC from a Problem (see detailed steps)
via a UI Action. To modify how the Create Change UI Action works, navigate to System Definition > UI Actions, and select the Create Change
UI Action that specifies Problem in the table column.

7.1.2. Configuration Management


Incident Management uses the CMS to identify faulty CIs and also to determine the impact of problems and resolutions. The CMS can also be
used to form the basis for the KEDB and hold or integrate with the Problem Records. Strong partnerships with suppliers (hardware and software
vendors) can also uncover key defects with existing technology that can be remediated through an update or patch. (bug tracking, bug reporting,
defect tracking/reporting, acceptable level of defects, etc.)
Using the CMDB
The Configuration Management Database stores information on all of the configuration items and their relationships. In addition to providing
basic information about the configuration item to serve as a reference, there are two tools within the CMDB that can provide important information
on Problems:

The Business Service Map, which can help isolate Problems caused by Problems in related items
The CMDB Baseline, which can help track planned and unplanned Changes

7.1.3. Release and Deployment Management


Release and Deployment Management is responsible for rolling problem fixes out into the live environment. It also assists in ensuring that
the associated known errors are transferred from the development Known Error Database into the live Known Error Database. Incident
Management will assist in resolving problems caused by faults during the release process.

7.1.4. Knowledge Management

ServiceNow allows you to create Knowledge Articles from Problem records. The Knowledge Base may have information gathered from Incidents,
and may also have useful Workarounds for Problems.

7.2. Service Design


7.2.1. Service Continuity
Incident Management acts as an entry point into Service Continuity Management where a significant Problem is not resolved before it starts
to have a major impact on the business.

7.2.2. Service Level Management


The occurrence of incidents and problems affects the level of service delivery measured by SLM. Incident Management contributes to
improvements in service levels, and its management information is used as the basis of some of the SLA review components. SLM also provides
parameters within which Incident Management works, such as impact information and the effect on services of proposed resolutions and
proactive measures.

7.3. Service Strategy


7.3.1. Financial Management
Financial Management assists in assessing the impact of proposed resolutions or Workarounds, as well as Pain Value Analysis. Incident
Management provides management information about the cost of resolving and preventing Problems, which is used as input into the budgeting
and accounting systems and Total Cost of Ownership calculations.

8. Process Measurements and Metrics


8.1. Critical Success Factors
Metric Name: MTBF (Mean Time Between Failure)
Type: All; By Service
Description: The mean time between service failures.

Metric Name: MTTR (Mean Time to Resolve)
Type: Priority 1 & Priority 2; All; By Service
Description: The mean time it takes to resolve an Incident to the point where the user(s) are no longer experiencing any symptoms of the
Incident.

Metric Name: MTTRS (Mean Time to Repair Service)
Type: All; By Service
Description: The mean time it takes to repair a service. The Incident has been repaired, but the user(s) may still be experiencing
lingering symptoms.

Metric Name: Customer Satisfaction
Description: Measures overall survey results from Incident closures.
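
As a worked illustration of the first two metrics, the sketch below computes MTTR and MTBF from a list of Incident timestamps for one service. It assumes each Incident is represented as an (opened, resolved) pair and that the time between failures runs from one Incident's resolution to the next Incident's opening; this guide does not define the calculation at that level, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (opened_at, resolved_at)


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average of resolved_at minus opened_at; needs at least one Incident."""
    seconds = mean((resolved - opened).total_seconds() for opened, resolved in incidents)
    return timedelta(seconds=seconds)


def mean_time_between_failures(incidents: List[Incident]) -> timedelta:
    """Average gap between one resolution and the next opening; needs at least two Incidents."""
    ordered = sorted(incidents)
    gaps = [
        (nxt[0] - prev[1]).total_seconds()
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=mean(gaps))
```

With one Incident lasting 1 hour and a second lasting 3 hours that opens 24 hours after the first is resolved, MTTR is 2 hours and MTBF is 24 hours.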

8.2. Incident Management Operational Metrics


Metric Name: Number of Incidents*
Threshold (%): N/A
Description: Measures total number of Incidents reported

Metric Name: Number of Unique Incidents*
Threshold (%): N/A
Description: Measures total number of Incidents experienced
Notes: Does not include child incidents and duplicates

Metric Name: Number of Priority 1 Incidents*
Threshold (%): <5%
Description: Measures total number of Priority 1 Incidents

Metric Name: Number of High Impact Incidents*
Threshold (%): <>5%
Description: Measures total number of Incidents that caused a Service outage

Metric Name: Number of Incidents Reopened*
Threshold (%): <5%
Description: Measures the efficiency of closing Incidents

Metric Name: First Time Resolution Rate
Threshold (%): 75%
Description: Measures the number of calls that achieve FTR/% resolved within Service Desk (no dispatch)

Metric Name: Number of Child Tasks*
Threshold (%): N/A
Description: Measures the overall number of work tasks required to close Incidents
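
The percentage thresholds can be evaluated with a short report function. The sketch below makes two assumptions that this guide does not spell out: each percentage is taken as a share of the total number of Incidents, and the 75% First Time Resolution figure is treated as a minimum. The function names and the report layout are hypothetical.

```python
from typing import Dict, Tuple


def percentage(part: int, total: int) -> float:
    """Share of total as a percentage; 0.0 when there are no Incidents."""
    return 100.0 * part / total if total else 0.0


def threshold_report(total: int, priority_1: int, reopened: int,
                     first_time_resolved: int) -> Dict[str, Tuple[float, bool]]:
    """Map each metric name to (observed percentage, whether the threshold is met)."""
    p1 = percentage(priority_1, total)
    reopened_pct = percentage(reopened, total)
    ftr = percentage(first_time_resolved, total)
    return {
        "Number of Priority 1 Incidents": (p1, p1 < 5.0),               # threshold <5%
        "Number of Incidents Reopened": (reopened_pct, reopened_pct < 5.0),  # threshold <5%
        "First Time Resolution Rate": (ftr, ftr >= 75.0),               # threshold 75% (assumed minimum)
    }
```

For 200 Incidents with 8 at Priority 1, 12 reopened and 150 resolved on first contact, the Priority 1 and First Time Resolution thresholds are met and the reopened threshold is missed.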

9. Terms of reference

Assignment Groups
Also known as support groups, resolver groups or queues. Groups of team members who actively work records in the system to provide service
to customers.

For more information:

See Ymir for a list of assignment groups with information about the business services, owner, and manager associated with each.
Use the UW Groups Service to manage assignment groups.

Caller
The individual reporting the issue, regardless of the reporting mechanism.

Contact Record
A structured user record that is used to store information about a valid user. Contact records are also used to correlate a user to other
record types, including Incident records.

Duty Manager
A role which provides 24x7 on-call support for incident escalation from a UW-IT Service/Help Desk. The role of the Duty Manager is to facilitate
communication, coordination of resources, and escalation to the UW-IT Unit Response Center (URC) Decision Support Group.

First Time Resolution (FTR)


First time resolution is properly addressing the user's need the first time they call, thereby eliminating the need for the customer to follow up with a
second call or e-mail from a different support group or support tier. FTR does not require the analyst to involve other support groups; however, it
may require follow-up with the user at a later time for complete resolution.

Incident
An unplanned interruption to an IT Service, or reduction in the Quality of an IT Service. This can include failure to meet agreed-upon service
levels (i.e., breach of an SLA).

Failure of a Configuration Item that has not yet impacted service (e.g., failure of one disk from a mirrored set) is also an Incident.
However, some teams may choose to treat such items as normal operations instead of incidents.
Incident resolution does not necessarily mean problem resolution; see Problem.
The goal of incident management is rapid restoration of service.

For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

For the ServiceNow definition, see ServiceNow Glossary.

Incident Manager
Role based support staff who are responsible for coordinating all aspects of incident response. The Incident Manager role assignment can
change within an active incident from Tier-1 -> Duty Manager -> URC. The Incident Manager can be anyone in a UW-IT Service/Help Desk,
especially during the early stages of an incident, and is handed off as the incident escalates.

Note: There is another version of this definition in Incident Manager - Responsibilities and Concepts.

IT Manager
In Incident Management, IT Managers are responsible for ensuring that their IT groups remain engaged and available. IT Managers will allocate
both on-call and other resources as appropriate to assist with Incident Management activities.

Major Incident
An incident which exceeds the ability of the normal incident management process to deliver the desired results in a timely manner. Major
Incidents are declared by the Duty Manager, and confirmed by the URC Decision Support Group.

For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

NBD
Category: Incident Management Process

Next Business Day.

Parent Incident - Master Incident


An Incident record that is associated with one or more Child Incidents.

Priority
The level of response and resources given to an incident. Priority is derived from Impact and Urgency.

See the Incident Severity Scale Guidelines, which currently reflect mostly the Impact part of Priority.
For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

For the ServiceNow definition, see ServiceNow Glossary.

Problem
The cause of one or more incidents. Problem determination is separate from incident resolution.

An incident is usually resolved when service is restored, but root cause investigation may continue after the incident is closed.
The goal of problem resolution is to prevent incidents.

For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

For the ServiceNow definition, see ServiceNow Glossary.

Record
A Document containing the results or other output from a Process or Activity. Records are evidence of the fact that an activity took place and may
be paper or electronic. For example, an Audit report, an Incident Record, or the minutes of a meeting.

For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

For the ServiceNow definition, see ServiceNow Glossary.

RFU
Category: Incident Management Process

Request For Update. This is the periodic reminder from Service Desk that a status report is requested on an open Child Incident record.

Service desk
Any first-level support group. These groups are typically 24x7 operations centers or Service Center help desks, and are primarily responsible for
First Time Resolution (FTR) and Incident Management.

Service Owner
In Incident Management, Service Owners are subject matter experts for their Service and need to be consulted when impactful Incidents have
been raised for their respective service. Service Owners should be consulted on any Incident fixes that could potentially alter their service
availability or the overall functionality of the Service.

For the official ITIL definition, see ITIL 2011 English Glossary v1.0.

Support Tier
Role-based support staff performing work within the ServiceNow toolset using ITIL processes.

Tier-0 Customer self-service.


Tier-1 First level support groups within any UW-IT Service/Help Desk.
Tier-2 Any escalation from Tier-1. These are specialized IT groups that are engaged when it is determined that Service Desk resources
cannot resolve an Incident. Tier 2 is also responsible for escalating to even more specialized resources, such as vendor support or
development groups, which may be considered Tier 3.
Tier-3 Any escalation from Tier-1 or Tier-2.

URC Decision Support Group


A subset of the Unit Response Center (URC) including UW-IT senior management that are contacted for major incident notification and
assessment. The Decision Support Group can decide to activate the full URC or a subset of the URC.

UW-IT Unit Response Center (URC)


Departmental staff helping to coordinate the actions of field teams and facilitate communication to and from the larger UW Emergency Operations
Center (EOC).
