Infrastructure and Platform Management: Practice Guide
This document provides practical guidance for the infrastructure and platform
management practice.
Table of Contents
1. About this document
2. General information
7. Important reminder
8. Acknowledgements
the practice’s processes and activities and their roles in the service value chain
2. General information
Key message
The purpose of the infrastructure and platform management practice is to oversee the
infrastructure and platforms used by an organization. When carried out properly, this
practice enables the monitoring of technology solutions available to the organization,
including the technology and external service providers.
The infrastructure and platform management practice ensures that the organization has
a high-quality IT infrastructure that efficiently meets its current and anticipated needs. ‘IT
infrastructure’ as a concept includes all of the hardware, software, networks, and facilities
that are required to develop, test, deliver, monitor, manage, and support IT services.
This practice covers all stages of the infrastructure solutions lifecycle, from ideation and
gathering requirements to delivery and support. At every stage, it is used in conjunction
with other practices (including the business analysis, architecture management, service
design, availability management, capacity and performance management, service
continuity management, information security management, risk management practices,
and others). The importance of high-quality infrastructure and platforms for service
delivery cannot be overstated; this practice is vital for the success of the organization’s
digital services and digitized business processes.
Definition: IT infrastructure
All of the hardware, software, networks, and facilities that are required to develop, test,
deliver, monitor, manage, and support IT services.
A wide range of activities are used to run and manage IT infrastructure effectively. These
activities range from understanding the organization’s requirements and developing and
planning infrastructure and platforms, to performing routine maintenance and
overseeing infrastructure performance.
Definition: Operation
A large portion of the operational activities can be automated. Automation tools can
monitor the environment, identify changes, distribute patches and other updates,
provide asset inventory, and schedule and automate jobs.
In integration with the architecture management practice, the infrastructure and
platform management practice should ensure the development or outsourcing, and the
cost-efficient operation, of flexible and compatible core infrastructure and platform
solutions. These solutions should be easy to deploy, configure, and combine to support
the organization’s services or products, serving as building blocks for complex solutions,
products, and services. One example of this approach is the use of microservices, which
are “small in size, messaging-enabled, bounded by contexts, autonomously developed,
independently deployable, decentralized and built and released with automated
processes” [1].
When the standard solution does not align with the business, a tailored or customized
solution must be developed. The selection of a non-standard service delays the delivery
of the solution and increases the ongoing effort and cost to the business for support for
the solution. These non-standard solutions should be deployed and managed as
exceptions due to the additional overhead they require.
In cases where the technology is not currently in place, the solution must be designed
together with the architecture management and service design practices for conceptual
and detailed design. During design, infrastructure and platform management
practice, business, and technical requirements are aligned, and the recommended
infrastructure and platform solutions are determined. As the solution is not currently
available within the environment, additional steps are taken to address the procurement,
build, sourcing, and support of the solution. The solution should be evaluated by
infrastructure and enterprise architecture to determine whether it should be offered to
additional consumers or remain an exception to the existing documented
standards.
There are many models for providing infrastructure and platform solutions, ranging from
in-house dedicated data centres to fully outsourced cloud environments. Many
organizations continue to provide and support infrastructure residing in their internal
data centres. They can also use solutions external to their organization. Cloud solutions
provide offerings that allow systems and applications to run in internal and external data
centres. Most enterprises use public cloud providers for at least part of their
infrastructure. Cloud providers offer many solutions based on the expected needs of the
business. An application may be accessed through the cloud, leaving infrastructure
management activities beyond connecting to the cloud to be done externally by the
application provider. Cloud offerings can include platforms for application development
and infrastructure-specific services like storage or backup as a service.
There is usually a mix of public and private cloud services in any organization. Both cloud
services and outsourcing can provide infrastructure and platform services. Cloud services
provide technical capabilities whereas outsourcing performs IT functions in a similar
manner to internal teams. The contract defines the outsourcing scope and service levels.
Instead of managing technology directly, internal IT teams focus on managing the
contractual obligations and interactions with internal teams in an outsourced
environment.
Along with a focus on development from a system perspective, many organizations have
also moved into models that blend development and infrastructure capabilities on one
team to provide coverage throughout the lifecycle. DevOps and site reliability
engineering (SRE) are examples of these models.
By accounting for the end-to-end development and management of the solution, this
approach allows for operational improvements to be included in the development
releases. Machine learning and AIOps leverage data collected on solutions to automate,
address issues, or manage requests without development. Through operational visibility
and development capabilities, the overall system is managed in a more comprehensive
and consistent manner through automation.
When using DevOps for infrastructure and platform management, special attention
must be paid to obsolete systems and monolithic solutions that require manual
operation and, therefore, slow down all management processes and changes. There
should be a clear roadmap for decommissioning and replacing such solutions or
replacing the manual activities with automation. One way to do this is to have an
SRE team run operations.
SRE is a discipline that incorporates aspects of software engineering and applies them to
infrastructure and operations problems with the goal of creating ultra-scalable and
highly reliable solutions. SRE is an approach that tries to bridge the gap between
development and operations and to reconcile their opposing objectives: developing and
releasing solutions fast, and having a stable solution to support. SRE teams
usually have software developers who must support the solutions they develop, and this
stimulates them to automate most of the manual support and management tasks (in
the course of reducing toil: manual, repetitive, automatable, non-creative work). With
this, infrastructure and platform solutions become more manageable, require less
manual work, and gain agility in changes, delivery, and support. Probably one of the
most important gains of SRE operations is that infrastructure scale-out does not lead to
a corresponding linear growth in team size, as often happens in classical operations.
Key message
The practice is involved throughout the lifecycle of products and services. Figure 2.1,
from “The Site Reliability Workbook” by Google, illustrates how SRE teams are involved
during the lifecycle. With minor variations, this illustration is applicable to other
approaches to infrastructure and platform management.
Figure 2.1 Infrastructure and platform management during product and service
lifecycle
Reliability is designed with the system. Reliability requirements are aligned to the uptime
and performance requirements, defined by the capacity and performance management
practice. These requirements ensure the solutions are built to support the
organization’s requirements. For example, this may include high availability or redundant
network connectivity.
Definition: Reliability
The ability of a product, service, or other configuration item to perform its intended
function for a specified period of time or number of cycles.
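The effect of redundancy on availability can be illustrated with a short calculation. The sketch below is not part of the ITIL guidance: it assumes that component failures are independent, and the function names and the 99% figure are hypothetical values chosen for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies when one working copy is enough.

    Assumes the copies fail independently of each other.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must work."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Hypothetical figures: one network link at 99% versus two redundant links.
single_link = parallel_availability(0.99, 1)
dual_link = parallel_availability(0.99, 2)
print(f"single link: {single_link:.4%}, dual link: {dual_link:.4%}")
```

Under these assumptions, adding a second link reduces expected unavailability from 1% to 0.01%, which is why redundant connectivity is a common response to strict uptime requirements.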
Maintainability of a system should be addressed during the design of a new system and
tested before being transitioned to production. There could be rules agreed for an
infrastructure and platform solution, ensuring maintainability based on the
organization’s requirements and industry practices. One example is the existence of a
monitoring tool to identify issues, or general monitorability of the solution planned at the
design phase. Other examples could be the existence of tools used to configure, deploy,
and provision the solutions. These rules could also be used to manage partners and
suppliers responsible for infrastructure and platform service components.
Definition: Maintainability
The ease with which a service or other entity can be repaired or modified.
If maintainability is not addressed during the initial design and as part of daily
operations, higher support costs, extended outages, and negative impacts to
performance will affect the production environment. Maintainability is improved through
appropriate monitoring configurations, automation, and utilization of standards.
2.3 Scope
The scope of the infrastructure and platform management practice includes:
activities used to plan, design, develop, deliver, maintain, and support infrastructure
and platform technology
hardware (servers, desktops, routers, switches, storage, cabling, and data centre)
web hosting
There are many activities and areas of responsibility that are not included in the
infrastructure and platform management practice, although they are still closely related
to infrastructure and platform management. These are listed in Table 2.1, along with
references to the practices in which they can be found. It is important to remember that
ITIL practices combine value chain activities through value streams to deliver value.
Monitoring, event management, and log management for infrastructure and platform technologies | Monitoring and event management
A complex functional component of a practice that is required for the practice to fulfil
its purpose.
A practice success factor (PSF) is more than a task or activity; it includes components
from all four dimensions of service management. The nature of the activities and
resources of PSFs within a practice may differ, but together they ensure that the practice
is effective.
ensuring that the infrastructure and platform solutions meet the organization’s
current and anticipated needs.
2.4.1 Establishing an infrastructure and platform management approach to meet evolving organizational needs
The needs of organizations and their customers are continually changing, which drives
continual transformation in the technology industry. The changes may result from industry
trends, changes within organizations, business process innovation, or changes to
business volumes. The infrastructure and platform management practice ensures that
infrastructure and platform solutions are flexible and scalable so that they are aligned
with demand. Organizational infrastructure and platforms meet this demand through
optimized solutions that are designed for and used by all parts of the organization.
To properly design these solutions, teams delivering infrastructure and platform change
must be aware of new technologies and techniques. The evolution of technology can be
seen in examples like email, virtual server farms, storage arrays, single sign-on, and cloud
platforms. When solutions are identified based on requirements, requests are promptly
fulfilled. With virtual server technology that is used both internally and for cloud
offerings, the turnaround time for requests can be reduced to minutes. Technological
progress, such as virtualization, containers, continuous integration/continuous delivery
(CI/CD), and infrastructure as code (IaC), significantly impacts the rate of change and innovation.
Organizations that deliver and support infrastructure and platform solutions have
evolved through models such as DevOps and SRE, which replace traditional
waterfall techniques with end-to-end development and management within one
team. Crucially, the organization’s structure and technology components must align with
its overall strategic direction in order to ensure the consistent delivery and support of
infrastructure and platform solutions.
It is important to plan how infrastructure and platform teams will identify, design, and
introduce innovation into the environment at the solution and strategic levels.
Depending on the current needs, infrastructure and platform management might need
initial research and testing so that, when the need is presented, there is a clear plan of
action. If the need is pressing, the technology may be selected, purchased, designed, and
configured before any official requests are received.
The infrastructure and platform management practice should ensure that the
infrastructure and platforms are built to promote experimentation, quick technology
adoption, the ability to test theories and hypotheses, change the infrastructure and
platform iteratively with feedback, fail fast, and learn from experience and errors in a safe
environment. Each organization should define its innovation and risk appetite and
consider its financial constraints for innovation in the infrastructure and platform
areas.
When the organization needs a technical solution, requirements are defined in order to
ensure that the solution meets the organization’s needs. The solution design should
include technical and business requirements. The infrastructure and platform
management practice is involved in analysing requirements to create a high-level design
(in conjunction with the architecture management, business analysis, and service design
practices, and others).
The requirements for infrastructure and platform solutions may come from different
sources, including:
Where possible, the infrastructure and platform management practice ensures that
standards can be defined and utilized in order to simplify the management of
infrastructure and platform solutions. The enforcement of these standards ensures the
reliability and maintainability of solutions. Standards enable efficient and effective
operations and may include the hardware and software versions, configuration settings,
management and monitoring tools, and support structures. Through standards,
solutions are easier to operate, monitor, and upgrade.
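A standard of this kind can be checked automatically. The following is a minimal sketch, assuming a standard expressed as expected key/value settings; the setting names, values, and server names are hypothetical and do not come from this guide.

```python
# Hypothetical standard: keys and values are illustrative only.
STANDARD = {
    "os_version": "22.04",
    "monitoring_agent": "installed",
    "ntp_enabled": True,
}


def find_deviations(config: dict, standard: dict = STANDARD) -> dict:
    """Return {setting: (expected, actual)} for each setting that differs."""
    deviations = {}
    for key, expected in standard.items():
        actual = config.get(key)  # a missing setting is reported as None
        if actual != expected:
            deviations[key] = (expected, actual)
    return deviations


# Hypothetical inventory of two servers and their reported settings.
servers = {
    "web-01": {"os_version": "22.04", "monitoring_agent": "installed", "ntp_enabled": True},
    "db-01": {"os_version": "20.04", "monitoring_agent": "installed"},
}

for name, config in servers.items():
    deviations = find_deviations(config)
    status = "compliant" if not deviations else f"deviates: {deviations}"
    print(f"{name}: {status}")
```

Reporting deviations rather than silently correcting them keeps the decision about exceptions with the people responsible for the standard.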
Designs should be assessed against current and planned standards and validated
against the current and anticipated levels of availability, performance, capacity,
information security, and so on. Management practices supporting these should have
active involvement.
Part of the practice’s focus is to manage risk to the organization throughout the
infrastructure and platform. As part of this effort, input from practices such as
information security, service continuity, and risk management is taken to ensure that
risks are managed throughout the lifecycle of the solution. This ongoing management
includes, for example, ensuring that network devices are configured based on defined
security policies, controls are tested periodically, and risks are identified and effectively
managed. Requirements are handled on an ongoing basis to prevent adverse impacts,
such as extended service downtime or a security breach of confidential information.
The overall management of infrastructure and platform solutions often includes internal
and third-party solutions and components. Understanding the overall structure of these
solutions and ensuring that the overall level of service provided meets customer
expectations is critical.
Management needs visibility to validate that solutions are performing at acceptable levels
and to highlight opportunities. These may include addressing any issues and identifying
areas that could be improved. The infrastructure and platform management practice
should give stakeholders visibility of performance and improvement plans. This
practice interacts with other practices to ensure that any issues or requests on solutions
are resolved promptly. For this reason, the practice participates in agreeing targets for
incident response, restoration, and request fulfilment times to align with customer
expectations. This practice may include managing and reporting on the ability of
solutions to meet targets. This visibility also provides an opportunity to improve
performance in this area through automation or process refinement.
This practice also helps to ensure that the agreed levels of service are met. The
scope of this effort includes any internal or external components used in the solution.
Third-party services must align with customer expectations, or the expectations must be
reset. External providers must meet the service levels in their contracts. By managing
performance levels across internal and external services, the practice is able to report
performance and other outcomes to the business.
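Reporting against agreed targets across internal and external providers can be sketched as follows. This is an illustrative example only: the `Incident` record, the provider names, and the 60-minute restoration target are assumptions, not values from this guide.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    provider: str               # hypothetical: "internal" or a supplier name
    restoration_minutes: float  # time taken to restore service


def target_attainment(incidents: list[Incident], target_minutes: float) -> dict[str, float]:
    """Return, per provider, the share of incidents restored within the target."""
    totals: dict[str, int] = {}
    met: dict[str, int] = {}
    for incident in incidents:
        totals[incident.provider] = totals.get(incident.provider, 0) + 1
        if incident.restoration_minutes <= target_minutes:
            met[incident.provider] = met.get(incident.provider, 0) + 1
    return {provider: met.get(provider, 0) / totals[provider] for provider in totals}


# Hypothetical incident records for one reporting period.
incidents = [
    Incident("internal", 30),
    Incident("internal", 50),
    Incident("internal", 90),
    Incident("cloud-vendor", 20),
    Incident("cloud-vendor", 45),
]

attainment = target_attainment(incidents, target_minutes=60)
for provider, share in attainment.items():
    print(f"{provider}: {share:.0%} of incidents restored within target")
```

Grouping by provider makes it visible whether a missed target originates in an internal component or in a third-party service covered by a contract.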
The infrastructure and platform management practice ensures that solutions within its
scope effectively contribute to overall financial targets. Infrastructure and platform
solutions should be benchmarked against cloud offerings and external provider
solutions. From a technology perspective, automation, consolidation, and
standardization simplify the infrastructure and platforms and release resources, which
can then be used to drive value. The current and potential partnerships with external
providers can also be evaluated and existing agreements optimized.
Key metrics for infrastructure and platform management are mapped to its PSFs. They
can be used as KPIs in the context of value streams to assess the contribution of the
practice to the effectiveness and efficiency of those value streams. Some examples of key
metrics are given in Table 2.3.
The correct aggregation of metrics into complex indicators will make them easier to use
for the ongoing management of value streams and for the periodic assessment and
continual improvement of the infrastructure and platform management practice. There
is no single best solution. Metrics will be based on the overall service strategy and
priorities of an organization, as well as on the goals of the value streams to which the
practice contributes.
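One simple way to aggregate metrics into a complex indicator is a weighted average. The sketch below assumes each metric has already been normalized to a score between 0 and 1; the metric names and weights are hypothetical, and an organization would choose its own according to its strategy and priorities.

```python
def composite_indicator(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores already normalized to the 0..1 range."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Hypothetical normalized scores and weights for three metrics.
scores = {"availability": 0.9, "patch_compliance": 0.8, "request_timeliness": 0.5}
weights = {"availability": 0.5, "patch_compliance": 0.3, "request_timeliness": 0.2}

indicator = composite_indicator(scores, weights)
print(f"composite indicator: {indicator:.2f}")
```

Because the weights encode priorities, two organizations with identical raw metrics may reasonably arrive at different indicator values.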
obtain/build
plan.
The contribution of the infrastructure and platform management practice to the service
value chain is shown in Figure 3.1.
Figure 3.1 Heat map of the contribution of the infrastructure and platform
management practice to value chain activities
3.2 Processes
Each practice may include one or more processes and activities that may be necessary to
fulfil the purpose of that practice.
Definition: Process
There are numerous models to structure activities of the infrastructure and platform
management practice. These span several decades and range from waterfall and manual,
to iterative and incremental.
This practice is one of the two ITIL practices (the other is the software development and
management practice) where activities do not always form processes that could be
described as sequences at the level of detail appropriate to this guide. This is because the
infrastructure and platform management activities are always performed in a context of
one or another value stream, and always in conjunction with other practices. However,
activities of this practice can be categorized in three groups:
technology planning
product development
technology operations.
Business analysis
records and review
reports
Audit reports
Activity: Develop and agree the infrastructure and platform management approach
Example: Business analysts, architects, product owners, and infrastructure experts agree
and communicate an infrastructure and platform approach, including scope, sourcing
strategy, methods and techniques, procedures, and responsibilities.
This group includes the activities outlined in Table 3.3 and transforms the inputs into
outputs.
Table 3.3 Inputs, activities, and outputs of product development
Success criteria
Project structure (schedule, assignment, methods)
The focus of technology delivery and engineering is on designing, building, and
transitioning infrastructure and platform services. These activities may vary, depending
on how the services will be delivered and how the organization applies these steps, as is
outlined in Table 3.4.
Product development activities ensure the delivery of a supportable solution that meets
the organization’s needs and agreed SLOs. Even if an external provider provides a
solution, steps are taken to ensure it fits into the overall delivery and support model.
This group includes the following activities, and transforms the following inputs into
outputs:
Automation
Improvements
Table 3.6 provides example descriptions of the technology operation activities.
Activity Example
solving incidents
analysing problems
conducting post-mortems.
training users
Activity: Patch and update the system
Example: Patches and system updates are released to the environment in a structured
manner. Typically, patches are deployed to the lower environments for testing and then
deployed to production. Despite this structure, there are exceptions where systems are
not patched as part of this scheduled release due to an application incompatibility,
business usage of the solution, or issues identified through testing. It is important to
track the solutions that are not at current levels. These updates should be rolled out
promptly to maintain overall supportability. Up-to-date solutions reduce the risk of
downtime or security breaches.
There are also situations where system updates or patches are installed to resolve an
incident and then need to be rolled out to the rest of the organization. Applying patches
and updates reactively creates a non-standard environment.
The infrastructure specialist manages these exceptions and identifies a plan to address
them. Understanding and addressing these deviations is a vital part of technology
management.
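Tracking systems that are not at the current patch level can be sketched as a simple report. This is an illustrative example under stated assumptions: the `System` record, the integer patch level, and the system names are hypothetical simplifications, not a description of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    name: str
    patch_level: int                        # hypothetical: an increasing release number
    exception_reason: Optional[str] = None  # e.g. "application incompatibility"


def patch_report(systems: list[System], current_level: int) -> dict[str, list[str]]:
    """Split systems behind the current level into documented and undocumented exceptions."""
    report: dict[str, list[str]] = {"documented": [], "undocumented": []}
    for system in systems:
        if system.patch_level >= current_level:
            continue  # up to date, nothing to track
        key = "documented" if system.exception_reason else "undocumented"
        report[key].append(system.name)
    return report


# Hypothetical inventory: the current release level is assumed to be 12.
inventory = [
    System("web-01", 12),
    System("erp-01", 10, exception_reason="application incompatibility"),
    System("file-02", 11),
]

report = patch_report(inventory, current_level=12)
print(report)
```

Separating documented exceptions from undocumented ones highlights the systems that still need a plan, which is the part of the work the infrastructure specialist owns.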
The technology operation activities ensure that solutions are available and functioning as
designed from acceptance into the live environment through to retirement. Technical
experts and technical coordinators perform the activities in this process.
Roles are described in the context of processes and activities. Each role is characterized
with a competency profile based on the model shown in Table 4.1.
Technology planning
Understanding of the current infrastructure architecture and architecture roadmap
Analytical skills
Good knowledge of current and available technology
Excellent knowledge of current and available infrastructure and platform solutions
Good knowledge of infrastructure and technology services suppliers and market
Understanding of the current infrastructure architecture and architecture roadmap
Analytical skills
Good knowledge of current and available technology
Product development
Create a basic solution design | Solution architects, infrastructure specialists, site reliability engineers, product owners | TA | Understanding of the requirements; Good knowledge of the infrastructure and platform management approach; Expertise in the available technology
Expertise in the available technology and services
Source/develop/configure the components | Infrastructure specialists, site reliability engineers, product owners, and suppliers | TC | Technical expertise; Communication and collaboration skills
Source/build/configure the solution | Infrastructure specialists, site reliability engineers, product owners, and suppliers | TC | Technical expertise; Communication and collaboration skills
Technical expertise
Good knowledge of the organization and its environment, portfolios, products, resources, and customers
Technology operations
Understanding of business and customer context
Communication and coordination skills
Technical knowledge
4.1.1 Infrastructure specialist
The key role for this practice is infrastructure specialist. This is a generic term to describe
roles that can be specified either by the technology, such as network or SRE (for
example, network specialist, site reliability engineer, or virtualization specialist), or by the
phase in the product lifecycle, such as design, testing, or operations (for example, infrastructure
designer/development specialist, testing specialist, or operations administrator).
Those distinctions are defined by the organization’s size and structure, but the general
set of competencies is similar and usually includes:
service mindset
Key message
“Rigid boundaries between “application development” and “production” (sometimes
called programmers and operators) are counterproductive. This is especially true if the
segregation of responsibilities and classification of ops as a cost centre leads to power
imbalances or discrepancies in esteem or pay.
(…) Ideally, both product development and SRE teams should have a holistic view of the
stack—the frontend, backend, libraries, storage, kernels, and physical machine—and no
team should jealously own single components. It turns out that you can get a lot more
done if you “blur the lines” and have SREs instrument JavaScript, or product
developers qualify kernels: knowledge of how to make changes and the authority to do
so are much more widespread, and incentives to jealously guard any particular
function are removed.”
This quote from “The Site Reliability Workbook” by Google refers specifically to SRE
teams. However, it is valid for any other approach to infrastructure and platform
management.
The infrastructure and platform management practice needs to allow for organization
variations while ensuring some level of consistency across infrastructure teams. The
teams may be split by geography, type of technology, or business service. Having an
overall structure to manage practice changes and communication is important to keep
the overall service functioning in an optimal manner. This may be done with an overall
governance group or through representation in an infrastructure committee.
SLAs
change records
incident records
request records
problem records
release records
financial information
Analytical systems
Knowledge management tools
Reporting engines
Knowledge management tools
Dashboard systems
Product development
Create a basic solution design | Workflow tools including task assignment, routing, approvals, tracking, and notifications | Ability to assign design tasks and approval for planning activities, including status tracking, notifications, and reporting to ensure actions are on task and the design is approved | High
System health monitoring and reporting tools
Technology operation
Task assignment, routing, approvals, tracking and notifications
Automated report consolidation and generation, customer feedback surveys
Workflow tools including task assignment, routing, approvals, tracking, and notifications
Very few services are delivered using only an organization’s own resources. Most, if not all,
depend on other services, often provided by third parties outside the organization (see
section 2.4 of ITIL Foundation: ITIL 4 Edition for a model of a service relationship).
The infrastructure and platform management practice allows for many outsourcing
options from both an activity perspective and a technology perspective. Table
6.1 provides examples of areas that are candidates for outsourcing.
With so many opportunities in this space, understanding and managing
outsourcing risks is an important activity to ensure that services meet customer
expectations. This should be done in close conjunction with other practices, such as the
risk management and supplier management practices.
loss of internal talent as roles move from performing activities to oversight of those
activities
lack of visibility.
7. Important reminder
Most of the content of the practice guides should be taken as a suggestion of areas that
an organization might consider when establishing and nurturing its own practices.
The practice guides are catalogues of topics that organizations might think about, not a
list of answers. When using the content of the ITIL practice guides, organizations should
always follow the ITIL guiding principles:
focus on value
More information on the guiding principles and their application can be found in section
4.3 of ITIL Foundation: ITIL 4 Edition.
8. Acknowledgements
AXELOS Ltd is grateful to everyone who has contributed to the development of this
guidance. These practice guides incorporate an unprecedented level of enthusiasm and
feedback from across the ITIL community. In particular, AXELOS would like to thank the
following people.
8.1 Authors
Angie Pederson.
8.2 Reviewers
Dinara Adyrbayeva, Akshay Anand, Peter Farenden, Roman Jouravlev, Vernon Lloyd.
References
1. Nadareishvili, I., Mitra, R., McLarty, M., Amundsen, M., Microservice
Architecture: Aligning Principles, Practices, and Culture, O’Reilly 2016