ITIL Problem Management


CHAPTER 10

ITIL PROBLEM MANAGEMENT

In their ITIL 4 framework, Axelos Ltd define the practice of problem
management as being distinct from the incident management practice. Reactive
problem management involves responding to incidents which have already
occurred in order to understand the underlying causes and address these.
Proactive problem management is about identifying risks and responding
to those risks before they manifest themselves in incidents.

PROACTIVE PROBLEM MANAGEMENT


A key component of proactive problem management is a well-defined
patching policy. Security risks may be reduced by routinely
deploying security patches issued by vendors in a timely manner. Many
organisations understand the need for security patches but fail to
take seriously the need to deploy other patches. Patches and hotfixes
are issued for two reasons: one is feature enhancements; the other is
addressing defects in the design of the product. If defect patches are not
deployed, then by definition there are unresolved problems within your
product. Patching policies are needed not just for software applications
but also for the firmware which comes with hardware. There was a
recent case where a hardware vendor, HPE, identified a fault with the
firmware within some of their hard drive products.1 A particular model
of SSD would fail with total loss of data after 32,768 hours (less
than four years) unless the firmware was updated. This is an extreme
case, and one where the vendor was proactive in informing their customers
of the need to upgrade the firmware. Hardware vendors produce firmware
updates on a regular basis, and it is important that each organisation has
a patching policy setting out how frequently it will apply these updates.
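
The "less than four years" figure can be checked with a little arithmetic. The sketch below assumes the drive runs continuously and uses an average year of 365.25 days; both are assumptions made for illustration rather than details taken from the vendor notice.

```python
# Convert the failure threshold quoted in the text into years.
FAILURE_HOURS = 32_768   # figure quoted in the text
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25   # assumption: average year length, allowing for leap years


def hours_to_years(hours: float) -> float:
    """Return the number of years covered by `hours` of continuous running."""
    return hours / HOURS_PER_DAY / DAYS_PER_YEAR


failure_years = hours_to_years(FAILURE_HOURS)
print(f"{FAILURE_HOURS} hours is roughly {failure_years:.2f} years")
```

Under these assumptions the threshold falls between three and a half and four years of continuous use, which is consistent with the figure given in the text.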
DOI: 10.1201/9781003119975-11

One of the best ways of doing proactive problem management is to
learn from other people’s incidents. Following the industry news can
be useful for alerting you to major, widespread issues. As an example,
there are regular reports on the effects of Microsoft’s Patch Tuesdays
(the date each month when new Windows updates are issued) on the
stability of the computers receiving the patches. However, being part
of a support community can add greater value than this. Sharing your
experiences and then learning from other people’s experiences is useful
in its own right, but also provides greater leverage with vendors.
Vendors are more likely to address an underlying issue if multiple
clients are pursuing them, and if those clients are working together this
adds extra weight to their individual voices. I have known vendors
who claimed that teething issues with a new software application were
local to our organisation, but when I spoke to other organisations using
the same product it was apparent that the issues were ubiquitous. We
were able to apply greater pressure on the vendor when we combined
to speak with a common voice.
Various techniques mentioned earlier in this book, such as failure
mode analysis, are important tools in proactive problem management.
There is value in conducting an independent audit of an end-to-end
service in order to assess the risks to that service.

PROBLEM CONTROL
ITIL 4 identifies the process developed for controlling and managing
problems as a key aspect of problem management. Each problem which is
identified (through either reactive or proactive problem management)
should be recorded in a problem record within an ITSM tool or similar
system. Problem records should be linked to related resources. In
reactive problem management, the related incidents should be linked to
the problem record. Configuration Items (CIs) such as desktops, servers,
printers and software assets should also be linked to the problem record
as required. The problem record is a way to:
• collate the information
• prioritise the effort
• coordinate who is involved
• spin off tasks for people to carry out in order to progress the
diagnosis and resolution of the problem
• keep a historical record which may be referred to should a
similar problem occur in the future
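
The purposes listed above can be pictured as a simple data structure. The following is a minimal sketch only: the class name, field names and identifier formats are hypothetical and are not taken from ITIL 4 or from any particular ITSM tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProblemRecord:
    """Hypothetical problem record; all field names are illustrative."""
    problem_id: str
    summary: str
    opened_on: date
    incident_ids: set[str] = field(default_factory=set)  # related incidents
    ci_ids: set[str] = field(default_factory=set)        # related Configuration Items
    task_ids: list[str] = field(default_factory=list)    # tasks spun off to teams
    notes: list[str] = field(default_factory=list)       # running historical record

    def link_incident(self, incident_id: str) -> None:
        """Link an incident; linking the same incident twice has no effect."""
        self.incident_ids.add(incident_id)

    def link_ci(self, ci_id: str) -> None:
        """Link a Configuration Item to this problem."""
        self.ci_ids.add(ci_id)

    def add_note(self, text: str) -> None:
        """Append an entry to the historical record."""
        self.notes.append(text)


record = ProblemRecord("PRB-001", "Web application login failures", date(2024, 1, 15))
record.link_incident("INC-1001")
record.link_incident("INC-1001")  # duplicate link is ignored by the set
record.link_ci("CI-WEB-01")
record.add_note("Several incidents reported against the same web server.")
```

Using sets for the linked incidents and CIs is a design choice for this sketch: it keeps each link unique however many times it is reported.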
Each problem record will have a lifecycle. Note that different stages in
the lifecycle may overlap. Some organisations prefer greater granularity
in the lifecycle, whilst others will take a coarser approach, but
the following stages may be helpful:
• Logged: a problem record is created because a problem is
suspected. At this stage, the problem has not been confirmed.
In reactive problem management, a logged problem record
indicates that there is a suspicion that a group of incidents may
be related and that the problem has not been seen before. In
proactive problem management, it may be that an issue has been
identified in another organisation but it is not clear at this stage
whether that issue will affect your organisation.
• Identification: this is a confirmation stage where a consistent
problem is confirmed and ideally is reproducible. Data is
collated at this stage. A trawl through recent incidents may
surface further incidents which were not initially identified as
being related. A prioritisation process needs to happen at this point
to determine how much effort will be devoted to this problem
record. This is typically scored according to both impact and
urgency. Some problem records will be left at this stage because
either the impact or the urgency is low. They will be reviewed
periodically to see whether new data (e.g. additional incidents)
warrants a change to the priority.
• Investigation: The problem-solving techniques outlined in this
book may be employed in order to identify one or more root
causes or other possible means of progressing the problem.
• Known Error: ITIL defines a Known Error to be a problem which
has been analysed but not resolved. From a problem resolution
point of view, this is not an important stage. However, it is
useful for the Service Desk to have a list of current Known
Errors, together with an explanation on how to identify whether
an incident is related to them and also what action should
be taken if one is encountered. It is worth reflecting on the
frustration experienced by Service Desk analysts if they spend
time trying to resolve an issue for a customer, fail, refer it to
second line and are only then told that this is a known error.
• Workaround available: The role of incident management is to
get users/customers back up and working as quickly as possible.
It is often possible to identify a workaround which will achieve
this as an interim solution whilst the permanent solution is
sought. In an earlier example, I noted that clearing the web
browser cookie cache before visiting a web application provided
a viable workaround (as did accessing it from an incognito
window). Whilst this was not a desirable action to have to take
for any prolonged period of time, it did allow users to carry on
working whilst the IT teams identified the right solution and
implemented it. Where there are outstanding incidents, the
workaround needs to be communicated to those users. Some
workarounds become permanent workarounds. It should be
noted that these increase the organisation’s technical debt and
need to be added to a Continual Improvement register.
• Root Cause Identified: Whilst not all problem records get to this
point, it is hoped that for significant problems (problems with a
high impact or a high urgency), the root cause will be identified
within a reasonable timeframe.
• Evaluation: It is tempting to jump straight from identifying the
root cause to fixing it. It is important to include an evaluation
step first. Chapter 9 looks at resolution evaluation methods to
discern the best way of addressing a root cause. It should be
remembered that not all root causes should be fixed. In some
cases, a workaround may be deemed to be adequate. In the
case of the cookie clash previously mentioned, two root causes
were identified. An evaluation needed to be made to determine
whether one or both would be fixed. The evaluation decided
that the cookie needed to be fixed because it might impact
other web-based applications as well, either now (but not yet
identified) or in the future. The corporate application was also
patched, because a patch was available and recommended by the
vendor. Whether this was essential or not was subject to a risk
assessment. It was decided that it was easier to apply the patch
than to run with the risk of this happening again. It should be
noted that applying the software patch took a number of days of
staff effort and if the impact of this problem had been less, this
might not have been considered cost-effective.
• Resolving the root cause: Having evaluated the optimum means of
fixing the root cause, this needs to be added to the work queue
for the relevant teams, appropriately prioritised alongside their
other work. Adequate testing of any changes to the system needs
to be done before the fix is implemented, and normal change
enablement processes followed. Once the fix is in place, the
result on users who have been affected needs to be evaluated.
Sometimes the fix at the server end will not resolve the issue
for the end users, who may also need to make a change on
their desktops (e.g. clearing the cache). If users are still using a
workaround, they need to be notified that the permanent fix is
now in place. The Known Error may be removed from the Service
Desk list of current problems once this has been completed.
• Long-term monitoring: Unlike incidents, which should be marked
as resolved as soon after resolution as possible, a problem record
will typically be left in a semi-open state for a period of time in
order to assess whether the fix which has been applied has been
effective. Not all fixes address all issues. If the incidents recur,
then the problem record should be re-activated and moved back to
the identification stage. However, it should be noted that it is often
the case that the incidents for two related problems will all be
linked to the first problem record. If there is evidence that the first
problem has been successfully fixed, but that a second problem
exists with a different root cause, then a new problem record
should be created and the relevant incidents moved across. As a
general rule of thumb, an incident should not be linked to two
problem records as there should not be two independent problems
causing it (as distinct from one problem with multiple root causes).
• Closed: a problem record which has been monitored for a reasonable
length of time, with no recurrences, may be marked as closed.
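
The lifecycle above, together with the impact and urgency scoring mentioned under Identification, can be sketched in a few lines of code. This is a simplification under stated assumptions: it treats the stages as strictly sequential (the text notes that in practice they may overlap), and the 1 to 3 rating scale and the multiplication rule are illustrative choices, not something prescribed by ITIL 4. All names are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Lifecycle stages in the order listed in the text."""
    LOGGED = 1
    IDENTIFICATION = 2
    INVESTIGATION = 3
    KNOWN_ERROR = 4
    WORKAROUND_AVAILABLE = 5
    ROOT_CAUSE_IDENTIFIED = 6
    EVALUATION = 7
    RESOLVING_ROOT_CAUSE = 8
    LONG_TERM_MONITORING = 9
    CLOSED = 10


def next_stage(current: Stage, incidents_recurred: bool = False) -> Stage:
    """Return the stage that follows `current` in this simplified model."""
    # Recurrence during monitoring re-activates the record at identification.
    if current is Stage.LONG_TERM_MONITORING and incidents_recurred:
        return Stage.IDENTIFICATION
    # A closed record stays closed.
    if current is Stage.CLOSED:
        return Stage.CLOSED
    return Stage(current + 1)


def priority_score(impact: int, urgency: int) -> int:
    """Assumed scoring: impact and urgency rated 1 (low) to 3 (high), multiplied."""
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2 or 3")
    return impact * urgency
```

With this scoring, a problem with high impact but low urgency scores 3 out of a possible 9, which matches the idea that a record may be parked at the identification stage when either rating is low.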

KNOWLEDGE MANAGEMENT
One key aspect of both proactive problem management and reactive
problem management is knowing how data is meant to flow between
systems. It is common practice in large organisations for integration
platforms to be used as midpoints between different corporate
applications. Data is not shared on a point-to-point basis, but is shared to
the integration platform, which then passes the data on. Whilst there
are many technical and operational benefits to this approach, it can
obscure how the data is used. Periodically, changes are made to the
metadata for corporate applications: in some cases this will be the
addition of a new field, in some cases it will be the change in format of
a field (e.g. extending the field length to allow for longer surnames or
changing the encoding for a field from 8-bit ASCII to 16-bit Unicode).
In other cases it will just be a change in the contents of the dataset such
as agreeing that invoice codes can now be 6 digits rather than 5 digits
or adding new country codes to reflect a changing political horizon. It
is important to recognise the knock-on consequences of changes to the
data in one corporate application on the other corporate applications
which are downstream consumers of that data. If change enablement
does not adequately consider the implications of these types of change,
then problems can arise sometime later. Tracking these problems back
to the change concerned can prove time-consuming if records are not
kept with sufficient detail of how the data is used.
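
One way of keeping such records is a register of which applications consume each field, which change enablement can query before a change is approved. The sketch below is illustrative only: the application names, field names and length limits are invented, and a real register would more likely live in a CMDB or knowledge base than in code.

```python
# Hypothetical register: (source application, field) -> downstream consumers
# that receive the field via the integration platform.
FIELD_CONSUMERS = {
    ("hr_system", "surname"): ["payroll", "directory"],
    ("finance", "invoice_code"): ["reporting", "procurement"],
}

# Hypothetical maximum field length that each consumer can store.
CONSUMER_MAX_LENGTH = {
    ("payroll", "surname"): 40,
    ("directory", "surname"): 30,
    ("reporting", "invoice_code"): 5,
    ("procurement", "invoice_code"): 6,
}


def consumers_at_risk(source: str, field_name: str, new_length: int) -> list[str]:
    """List downstream consumers that may not cope with the proposed length."""
    at_risk = []
    for consumer in FIELD_CONSUMERS.get((source, field_name), []):
        limit = CONSUMER_MAX_LENGTH.get((consumer, field_name))
        # A consumer with no recorded limit is flagged for review as well.
        if limit is None or limit < new_length:
            at_risk.append(consumer)
    return at_risk


# Proposed change: invoice codes grow from 5 digits to 6.
print(consumers_at_risk("finance", "invoice_code", 6))
```

In this invented data set the reporting application is the only consumer flagged, because its recorded limit is still 5 characters.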
Knowledge Management may be used both for keeping track of
shared data about the services and systems available and for providing
the Service Desk analysts with checklists for drill-down and other
means of resolving incidents.
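
As an illustration of the kind of aid this can give the Service Desk, the sketch below checks an incident description against a list of current Known Errors and returns the recommended action. The keyword matching is deliberately naive, and the reference number, keywords and structure are all hypothetical; the recommended action is the cookie workaround described earlier in the chapter.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    """Hypothetical Known Error entry for the Service Desk list."""
    reference: str
    keywords: frozenset[str]   # every keyword must appear in the description
    recommended_action: str


KNOWN_ERRORS = [
    KnownError(
        reference="KE-001",
        keywords=frozenset({"login", "cookie"}),
        recommended_action=(
            "Clear the browser cookie cache or use an incognito window."
        ),
    ),
]


def match_known_errors(
    description: str, known_errors: list[KnownError] = KNOWN_ERRORS
) -> list[KnownError]:
    """Return the Known Errors whose keywords all occur in the description."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    return [ke for ke in known_errors if ke.keywords <= words]


for hit in match_known_errors("User cannot login; cookie error shown."):
    print(hit.reference, "->", hit.recommended_action)
```

A check of this kind at first line would spare an analyst the experience described under the Known Error stage, where time is spent on an incident before anyone mentions that the problem is already known.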

SUMMARY
A formal practice and process for problem management, such as
the ITIL 4 practice, is a good way of methodically keeping track of
problems.

Note
1 https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us
