Ewerson Palacio
Brian Hugenbruch
Wilhelm Mild
Filipe Miranda
Livio Sousa
Anderson Augusto Silveira
Bill White
Redbooks
IBM Redbooks
January 2024
SG24-8544-00
Note: Before using this information and the product it supports, read the information in “Notices” on page 9.
This edition applies to IBM LinuxONE Emperor, LinuxONE Rockhopper 4 and IBM z16.
Contents
Notices  9
Trademarks  10
Preface  11
Now you can become a published author, too!  12
Comments welcome  12
Stay connected to IBM Redbooks  12
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at https://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
Db2®, DS8000®, FICON®, FlashCopy®, GDPS®, Guardium®, HyperSwap®, IBM®, IBM Cloud®, IBM Spectrum®, IBM Z®, IBM z16™, Interconnect®, OMEGAMON®, Parallel Sysplex®, PIN®, Redbooks®, Redbooks (logo)®, Resilient®, Resource Link®, System z®, Tivoli®, WebSphere®, z Systems®, z/Architecture®, z/OS®, z/VM®, z13®, z15®, z16™, zEnterprise®, zSystems™
Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Windows and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Ansible, OpenShift, and Red Hat are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication explores the concepts of resiliency as they relate to information technology systems. Focus is given to a LinuxONE server as a foundation model; however, understanding why we care about uptime, beyond vague notions that “time equals money”, will help readers appreciate the importance of availability for a business, institution, or governmental agency.

Authors
This book was produced by a team of specialists from around the world working at IBM
Redbooks, Poughkeepsie Center.
Brian Hugenbruch CISSP, is a Senior Software Engineer with IBM Z and LinuxONE.
Officially, he is the z/VM® Cryptography and Security Development lead. He has also served
as the LinuxONE Resiliency Technical Lead since 2020; in this capacity, he drove the platform
to its “Eight 9's” calculation. He combines these two sides of his work on the topic of Cyber
Resiliency and Digital Forensics. He writes and speaks on all these topics in numerous
forums and conferences, with the goal of making complicated topics as easy to understand as
possible.
Wilhelm Mild is an IBM Executive IT Architect and Open Group Distinguished Architect who has been passionate about new technologies and their adoption in enterprise solutions for more than three decades. He architects complex IBM LinuxONE and IBM zSystems™ solutions featuring containerization with Red Hat OpenShift Container Platform, end-to-end security, resiliency (HA/DR), and network topologies for complex application landscapes in worldwide customer engagements. He is a speaker at international customer events and education classes and is active in teaching hands-on labs. In his free time, he enjoys nature, hiking, and traveling to interesting natural places.
Filipe Miranda is a Principal Technical Specialist on the Red Hat Synergy team at IBM, with more than 15 years of experience using open source technologies. He is the author of several IBM Redbooks publications and of technical articles published on LinkedIn and IBM Developer. He is a member of the zAcceleration team, where his key skills include IBM Cloud Paks, Red Hat OpenShift, DevOps, and many other technologies that are part of the hybrid cloud portfolio.
Livio Sousa is an IT Specialist with over 20 years of experience with high-end platforms, encompassing servers, storage, networking equipment, and enterprise architecture integration. He is assigned to support hybrid cloud solutions. Throughout his career, he has had the opportunity to work with several different software platforms, such as z/OS®, UNIX, Linux, and Windows, as well as different hardware processor architectures, such as z/Architecture®, Power, and the IA-32 architecture and its extensions.
Anderson Augusto Silveira is a Senior Technical Specialist for LinuxONE, Linux on Z, and Red Hat OpenShift within IBM. With more than 12 years of experience in virtualization and open source technologies, he has participated in mission-critical projects around the world, trained teams of IBM Business Partners and clients, and helped support the adoption of new technologies and products.
Bill White is a Project Leader and Senior Infrastructure Specialist at IBM Redbooks, Poughkeepsie Center.
Thanks to the following people for their contributions to this project:

Steven Cook, IBM US
Justin VanSlocum, IBM US
Now you can become a published author, too!

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Stay connected to IBM Redbooks

Subscribe to the IBM Redbooks weekly newsletter:
https://www.redbooks.ibm.com/subscribe

Stay current on recent Redbooks publications with RSS feeds:
https://www.redbooks.ibm.com/rss.html
In this chapter, we explore the concepts of resiliency as they relate to IT systems. Focus is given to the IBM LinuxONE platform as a foundation model; however, we also discuss why we care about uptime beyond vague notions that “time is money”, and the importance of high availability, business continuity, and disaster recovery.
There is a cost when those expectations are not met, and that cost can be significant: legal
consequences, lost revenue, damaged reputation, stalled productivity, or failure to meet
regulatory and compliance standards, to name a few.
The impact of downtime can vary greatly depending on the business. Some might be able to
survive an outage of a few hours, while for others, even consumers experiencing poor
performance for several seconds can impact the bottom line. Not being able to deliver
services as expected or at the agreed level can potentially turn into loss of business or hefty
fines.
IT resiliency must protect your business from downtime. IT infrastructure solutions must work
together to optimize availability, keep your systems running, detect problems in advance and
recover your critical data.
The best approach to remedying any type of outage is to eliminate single points of failure by
creating an environment that has redundant resources with an automated and seamless
takeover process. Without such measures, you must first identify and fix the problem, and
then manually restart the affected hardware and software components.
Even planned outages for businesses with 24x7 service offerings can come at a significant
cost. Businesses need to be ready and able to recover as quickly and seamlessly as possible
from any type of event to minimize cost.
Refer to The Real Costs of Planned and Unplanned Downtime for more information about the cost of downtime.

1 The global average total cost of a data breach is $4.45 million, according to Cost of a data breach 2023: https://www.ibm.com/reports/data-breach
There can also be a requirement to adhere to regulatory guidelines and standards for the entire IT infrastructure that also impact system resiliency. Examples include:
NIST SP 800-160 and NIST SP 800-193
PCI DSS 9.5.1
US White House Executive Order 14028, “Improving the Nation's Cybersecurity”
European Union Digital Operational Resilience Act (DORA)2
ISO 27001 (Information Security Management)

2 Refer to The Digital Operational Resilience Act (DORA) - Regulation (EU) 2022/2554
Because of its checklist-oriented nature, many businesses and organizations use the Payment Card Industry Data Security Standard (PCI DSS) as a starting point for building system resilience. They might then apply stronger standards when needed for the client experience or when required by law.
Regulations and standards that apply to various industries are frequently updated to reflect
what is happening in the cyber resiliency aspect of their industries. For multinational
businesses, laws, policies, and regulations might differ from one geographic location to the
next, causing additional stress on the requirements for IT infrastructure resiliency.
In the end, the design of systems supporting critical IT services depends on the interaction
between the criticality of the service and its business profile—regarding it as a journey toward
reducing risk and impact on service delivery. Consequently, it becomes important to address
downtime from a business viewpoint that is directed by service level agreements.
Service-level agreements (SLAs) are used between a service provider and consumers to describe the service and define the level and quality that can be expected. An SLA includes the duties and responsibilities of each party, the metrics for measuring service quality, and the course of action or penalties should the agreed-upon service levels not be met. SLAs also specify terms and conditions, such as:
Effective start dates, end dates, and review dates.
Key performance indicators that track performance over time, for example (a minimal compliance check is sketched in the example after this list):
– Service ABC must be able to run at least 10,000 transactions per second, with 95% finishing under 0.1 seconds response time, as measured by application records
– Service ABC must be available 24/7, except for a 1-hour planned outage, once per
quarter. Warning of this planned outage must be given no later than 2 weeks before the
outage. Availability level is based on the value that is indicated in the outage record and
agreed to by all parties
Commitments that are related to the type of record that the SLA applies to. For example,
problem tickets will specify target dates and times for response, resolution, delivery,
availability, and other values. Work orders will specify target start, finish, and delivery
times
Other criteria that may be measured include defect rates, technical quality, security
controls, and compliance standards
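The following minimal Python sketch illustrates how such a throughput and response-time KPI might be checked against application records. The target values (10,000 transactions per second, 95% under 0.1 seconds) come from the example above; the record format, function name, and synthetic data are hypothetical and only for illustration.

# Minimal sketch of an SLA key-performance-indicator check.
# Assumes a list of (timestamp_seconds, response_time_seconds) records
# collected by the application; the record source is hypothetical.

def check_sla(records, min_tps=10_000, pct_target=0.95, resp_limit=0.1):
    """Return True if both the throughput and response-time KPIs are met."""
    if not records:
        return False
    timestamps = [t for t, _ in records]
    duration = max(timestamps) - min(timestamps) or 1  # avoid divide-by-zero
    tps = len(records) / duration
    within_limit = sum(1 for _, r in records if r < resp_limit) / len(records)
    return tps >= min_tps and within_limit >= pct_target

# Example with synthetic records: 20,000 transactions over about 1 second,
# 96% of them finishing in 0.05 s and 4% in 0.2 s.
records = [(i / 20_000, 0.05 if i % 25 else 0.2) for i in range(20_000)]
print(check_sla(records))  # True: both KPIs are met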
Recovery events are typically measured against the targets that are defined for the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).
Looking backward from an event, the RPO represents how much data is lost due to a failure.
This value can vary from zero or near-zero with synchronous remote copy, to hours or days
with recovery from tape. Looking forward from the event, the RTO is the goal for how much
time can pass before service is restored. It includes bringing up the operating system on the
recovery system, the workloads, management tools, enabling the network, and bringing the
services online. This time span can be anywhere from minutes to days, depending on the
recovery solution.
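As a minimal illustration of these two measurements, the following Python sketch computes the achieved RPO and RTO for a single recovery event from three timestamps and compares them with hypothetical targets. All timestamps and target values are assumptions for the example only.

# Sketch: measuring achieved RPO and RTO for one recovery event.
# All timestamps and targets are hypothetical example values.
from datetime import datetime, timedelta

last_consistent_copy = datetime(2024, 1, 15, 8, 55)   # last replicated, consistent data
failure_time         = datetime(2024, 1, 15, 9, 0)    # when the outage occurred
service_restored     = datetime(2024, 1, 15, 9, 42)   # when the service was back online

achieved_rpo = failure_time - last_consistent_copy    # data exposed to loss
achieved_rto = service_restored - failure_time        # elapsed recovery time

rpo_target = timedelta(minutes=0)   # zero data loss, for example with synchronous remote copy
rto_target = timedelta(hours=1)

print(f"Achieved RPO: {achieved_rpo} (target {rpo_target}, met: {achieved_rpo <= rpo_target})")
print(f"Achieved RTO: {achieved_rto} (target {rto_target}, met: {achieved_rto <= rto_target})")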
It seems obvious to say that you must choose the replication techniques that achieve the
recovery objectives. However, business realities often complicate the choices, for example:
An existing building is owned at a remote location so it must be used, even though it is too
far away to support synchronous replication to get RPO=0
Tape-to-tape replication is chosen because it is cheaper, even though the amount of potential data loss might impact business results
The SLA, RPO, and RTO were defined when the environment was different and are now outdated. Regular reviews and updates of existing SLAs, RPOs, and RTOs are needed to capture changes to the business environment
Also, in case of problems with the recovery process there will be dependencies on the
infrastructure support staff. Automation can help in greatly reducing the time that is needed to
recover the environment by optimizing the recovery process.
IT resiliency is the ability to ensure continuity of service and support for its consumers and to maintain viability before, during, and after an event. This means that a business must make use of people, processes, and technology:
People who understand their roles-and-responsibilities during an outage or disruption.
They must design the strategies and solutions to enable the continuity of business
operations. Operators must practice recovery procedures to ensure a smooth recovery
after an outage or disruption
Processes include crisis management and business continuity, but can be expanded to include problem, change, and configuration management, and business controls
Technologies are often emphasized too much. However, they include backup, replication,
mirroring, and failover solutions
Several terms are used for different levels or types of availability. They include high availability, continuous operations, continuous availability, business continuity, and disaster recovery. These are described in the subsequent sections.
High availability
A high availability (HA) environment has a fault-tolerant, failure-resistant infrastructure that
supports continuous workload processing. The goal is to provide service during defined
periods, at acceptable or agreed upon levels, and mask unplanned outages from consumers.
Many hardware and software components have built-in redundancy and recovery routines to
mask possible events. The extent of this support greatly varies.
HA refers to systems and services that typically operate 24x7 and usually require redundancy
and rapid failover capabilities. These capabilities minimize the risk of downtime due to failures
in the IT infrastructure.
HA is distinguished from disaster recovery (DR) primarily through the duration and scope of
outage that the technology is designed to mitigate.
Continuous operations
A continuous operations (CO) environment enables non-disruptive backups and system
maintenance that is coupled with continuous availability of services. The goal is to mitigate or mask the effect of planned outages from consumers. It employs
non-disruptive hardware and software changes, non-disruptive configuration, and software
coexistence. z/VM Live Guest Relocation, for example, enables CO because workloads can
be moved from one system to another system at will.
Continuous availability
Continuous availability (CA) is a combination of CO and HA. It is designed to deliver
non-disruptive service to consumers seven days a week, 24 hours a day. There is no impact
from planned or unplanned outages. Red Hat OpenShift Container Platform (OCP) is an
example of a CA solution. It enables HA by removing single points of failure (SPOFs) within
the hardware and its software environment. It enables CO by allowing “rolling system
reboots”, non-disruptive updates and workload relocation to mask planned events.
Disaster recovery
Disaster recovery (DR) is protection against unplanned outages through a reliable,
predictable recovery process. Examples include inability to access storage or a site disaster.
Business continuity
Business continuity is the capability of a business to withstand outages and to operate
important services normally and without interruption in accordance with service-level
agreements.
Business continuity includes disaster recovery (DR) and high availability (HA) and can be
defined as the ability to withstand all outages (planned, unplanned, and disasters) and to
provide continuous processing for all critical workloads. The goal is for the outage time to be
less than 0.001% of total service time. A high availability environment typically includes more
demanding recovery time objectives (seconds to minutes) and more demanding recovery
point objectives (zero user disruption) than a disaster recovery scenario.
High availability solutions provide fully automated failover to a backup system so that users
and applications can continue working without disruption. HA solutions must have the ability
to provide an immediate recovery point. At the same time, they must provide a recovery time
capability that is significantly better than the recovery time that you experience in a non-HA
solution topology.
First, consider points of failure. Any component of your system that does not have a backup can lead to reduced throughput, reduced processing power, or a service outage.
To solve these challenges, invest in technologies that help reduce all downtime.
While such calculations are rough estimates, they can provide an investment baseline for weighing the risk of investing in resiliency technologies against trusting to luck or fate.
Such calculations are done using relative percentiles of availability versus the cost of an outage. Having a 99% available system means that, in a given year, one might anticipate 3.65 days of downtime. If one measures the cost of downtime by the hour, this can still be expensive, even if the outage is planned.
In comparison, a system with seven 9's of availability (99.99999% uptime) is, in a given year,
unavailable for approximately 3.16 seconds, and the cost of this outage is significantly lower
(see Table 1-1).
It is worth noting that outages are never precisely as short or as long as the table above suggests; these estimates are based on mean time between failures: how often, in instances per year or per decade, outages are likely to occur.
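The downtime figures quoted above can be reproduced with a short calculation. The following Python sketch converts an availability percentage into expected downtime per year and, using a purely hypothetical cost per hour of downtime, into an expected annual cost.

# Sketch: expected yearly downtime (and its cost) for a given availability level.
# The cost per hour of downtime is a hypothetical example value.
SECONDS_PER_YEAR = 365 * 24 * 3600
COST_PER_HOUR = 100_000  # hypothetical cost of one hour of downtime, in dollars

for availability in (99.0, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    downtime_s = SECONDS_PER_YEAR * (1 - availability / 100)
    cost = downtime_s / 3600 * COST_PER_HOUR
    print(f"{availability:>9}% -> {downtime_s:12,.2f} s/year "
          f"({downtime_s / 86400:7.3f} days), est. cost ${cost:,.0f}")

Running the sketch confirms, for example, that 99% availability corresponds to roughly 3.65 days of downtime per year, while seven 9's corresponds to only a few seconds.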
Balancing risk and costs of mitigation is essential to achieving business goals. By balancing
risk of a type of failure and cost of a solution, you can determine how much redundancy and
availability management you can design into your IT infrastructure. By understanding the real
cost of service unavailability, you can calculate the costs that you can sustain to provide this
availability. These factors also determine when you can do maintenance and upgrades to the
IT infrastructure.
As illustrated in Figure 1-1, you must evaluate the risk and impact of the outage and decide
whether the design needs to be changed to meet the cost implications of service outages. Or
maybe these cost implications should be adjusted?
Figure 1-1 Resilience optimization: balancing the costs from risk events against the costs of mitigation solutions to find the optimum resilience risk balance
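The balance that Figure 1-1 illustrates can also be expressed numerically. The following Python sketch, using entirely hypothetical failure rates, outage durations, solution costs, and scenario names, compares the expected annual cost of risk events with the annual cost of a mitigation solution.

# Sketch of the risk-versus-mitigation balance from Figure 1-1.
# All rates, durations, costs, and scenario names below are hypothetical example values.

cost_per_outage_hour = 100_000          # dollars lost per hour of downtime

scenarios = {
    # name: (expected outages per year, expected hours per outage,
    #        annual cost of the mitigation solution)
    "no redundancy":     (2.0, 8.0, 0),
    "second system":     (0.5, 2.0, 250_000),
    "multi-site + GDPS": (0.1, 0.2, 900_000),
}

for name, (outages, hours, mitigation_cost) in scenarios.items():
    risk_cost = outages * hours * cost_per_outage_hour
    total = risk_cost + mitigation_cost
    print(f"{name:20s} expected outage cost ${risk_cost:>9,.0f} "
          f"+ mitigation ${mitigation_cost:>9,.0f} = ${total:>9,.0f}/year")

In this made-up example, the middle scenario has the lowest combined cost, which is exactly the kind of optimum that the resilience risk balance in Figure 1-1 represents.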
This is not to say that one server is enough, or that LinuxONE hardware solves all problems.
Again, system resiliency is a lifestyle choice, and this means that every single point of
failure—in storage, compute, networking, virtual infrastructure, physical site maintenance,
and so forth—needs to be addressed as part of an overall resiliency and DR plan.
For the purposes of this book, a deployment model is a framework for planning and implementing resiliency in an IBM LinuxONE environment. The focus of such models is less about workload configuration and more about the level of resources needed in a particular part of your IT environment.
The IBM LinuxONE deployment models are discussed briefly in this section and described in
Chapter 3, “Deployment Scenarios for Resiliency” on page 75 using different use case
scenarios with architectural patterns.
The goal of an architectural pattern is to provide vertical scaling for resiliency - in a given
stack, how can the parts help offset one another? Architectural patterns are constructed of
the components discussed in Chapter 2, “Technical View on LinuxONE Resiliency” on
page 29, and an illustration of both a deployment model and architectural pattern is provided
in Chapter 4, “Resiliency for Red Hat OpenShift Container Platform” on page 95.
Table 1-2 provides an overview of the four IBM LinuxONE resiliency deployment models.

* Recovery Time Objective (RTO) is the target time for system restoration, while Recovery Point Objective (RPO) is the amount of data loss that is acceptable (RPO could be zero data loss).
Each of the four models supports different capabilities of the hardware and software technologies that are represented by the layers that are listed in Table 1-3 on page 24. The table also emphasizes that the layers relate to each of the four resiliency deployment models: all deployment models in Table 1-2 on page 23 have a relationship with aspects of all the layers shown in Table 1-3.
In the relationship of the layers to the four resiliency deployment models, the question is this:
What are the essential features of the layers for each of the four models? This will be
discussed in Chapter 2, “Technical View on LinuxONE Resiliency” on page 29.
Here are some example issues regarding layers in relation to the four models:
Model 1 to Model 2: In the transition from resiliency model 1 to 2, the hardware layer typically grows from a single system to a multi-system environment
Reference architectures: Various IBM solutions can be a part of your implementation of a
model. Each solution has a ‘reference architecture,’ which can include hardware
components and software
For descriptions of the capabilities of each layer, see Chapter 2, “Technical View on
LinuxONE Resiliency” on page 29.
The selection of a deployment model should be based on a combination of risk and cost. If a smaller business, for example, has only one data center, then that business is constrained by the limits of its physical site as to the number of threats that can be defeated.
It would be best never to have failures. You can get closer to this by starting with quality base
technology together with regular maintenance. Application code that you write on top of that
base should have extensive problem detection and near-instantaneous correction. You take
these steps with IBM LinuxONE hardware and the operating system.
By design, the IBM LinuxONE platform is a highly resilient system. Its architecture has built-in
self-detection, error correction, and redundancy. The IBM LinuxONE reduces single points of
failure to deliver the best reliability of any enterprise system in the industry. Transparent processor sparing and dynamic memory sparing enable concurrent maintenance and
seamless scaling without downtime. Workloads that fail can be restarted in place.
If the operating system or logical partition (LPAR)3 is down for a planned or unplanned
outage, another instance of the workload (in another LPAR) can absorb the work. An
automatic recovery process can reduce the workload impact by avoiding the delays
associated with a manual intervention.
Resiliency for a single IBM LinuxONE platform can be enhanced by using a clustering
technology with data sharing across two or more Linux images. Although this does not
provide for additional hardware redundancy, it can keep workloads available and accessible
across software upgrades by allowing rolling restarts.
For more information on this model, see 3.1, “Reliable base” on page 76.
Despite the best base technology, failures can still occur. You should anticipate failures with
the right technology implementation. The Flexible site model with the ability to restart
workloads quickly on a backup component, can significantly reduce the impact of an outage.
With the installation of a second IBM LinuxONE system in the IT infrastructure, your component resource sharing begins the road to full system redundancy. Additionally, single points of failure for data can be mitigated by setting up Peer-to-Peer Remote Copy (PPRC): “Metro” (synchronous) data mirroring if the systems are installed within metro distance of each other (up to 300 km4), and “Global” (asynchronous) data mirroring for longer distances. IBM Copy Services Manager (CSM) can enable fast data replication with little or no data loss. For more information, see 2.8.8, “Infrastructure placement” on page 71.
For more information on this deployment model, see 3.2, “Flexible site” on page 77.
Even the best fault tolerant designs can fail. When failure happens and service is impacted,
be prepared to restore service quickly. One method to reduce the impact of an outage is to
avoid outages whenever possible. Anticipate failures with the right technology
implementation.
Most hardware and software components in the IBM LinuxONE system can be cloned to a
second IBM LinuxONE system with dynamic workload balancing spreading the work. For
planned or unplanned events on one LPAR, the workload flows to another LPAR seamlessly,
without affecting availability.
By adding GDPS® capabilities to the disk replication, you can automate management of
system and site recovery actions. In failover situations, automation can reduce the business
impact by minutes or even hours.
For more information on this deployment model, see 3.3, “Multi-site resiliency” on page 82.
3 A logical partition (LPAR) is a subset of the IBM LinuxONE hardware that is defined to support an operating system. An LPAR contains resources (such as processing units, memory, and input/output interfaces) and operates as an independent system. Multiple LPARs can exist in an IBM LinuxONE platform. See 2.8.3, “Automation with Linux HA components” on page 63 for more information.
4 IBM Storage DS8870 supports up to 300 km (186 mi) with Metro Mirror. Delays in response times are proportional to the distance between volumes.
With multiple systems, multiple storage units, and automation software at multiple data centers, every possible risk can be addressed before it happens. The distance at which physical sites are set also becomes a defining factor, because two data centers located in metro proximity may fall subject to the same disasters.
Each step of the journey involves time and money to procure, install, configure, and test each of the components involved. And testing is a requirement: investing in all of the above, without validating that it works despite the pressures of ongoing business, could lead to unwanted surprises in the event of an actual disaster.
When you have Fault Tolerance within the Primary Data Center, plus GDPS xDR, Metro and
Global Mirror, plus Continuous Availability for key workloads, Disaster Recovery looks like this
architecturally:
GDPS Global Mirror consists of two sites, which are separated by virtually unlimited
distances
The sites run the same applications with the same data sources, to provide cross-site
workload balancing, continuous availability, and Disaster Recovery (DR)
As a result, workloads can fail over to another site for planned or unplanned workload outages within seconds
GDPS allows a recovery time of 1 - 2 minutes with 3 - 5 seconds of data loss, with full end-to-end automation
You can combine Metro-Mirror and Global-Mirror solutions into 3- and 4-site solutions to
provide IBM HyperSwap® for very rapid recovery for disk outages. You also enable zero data
loss across long distances, without affecting user response time.
GDPS is the technology used to provide continuous availability and disaster recovery for IBM
Z, including LinuxONE. It orchestrates workload routing and facilitates failover in the event of
disruptions, ensuring uninterrupted operations.
GDPS Metro provides a function called “GDPS Metro Multiplatform Resiliency for IBM Z,” also
referred to as cross-platform disaster recovery, or xDR. This function is especially valuable for
customers who share data and storage subsystems between z/OS and Linux z/VM guests on
IBM Z. For example, an application server running on Linux on IBM Z and a database server
running on z/OS.
For site failures, GDPS invokes the Freeze function for data consistency and rapid application
restart, without the need for data recovery. HyperSwap can also be helpful in data migration
scenarios to allow applications to migrate to new disk volumes without requiring them to be
quiesced.
When using ECKD formatted disk, GDPS Metro can provide the reconfiguration capabilities
for the Linux servers and data in the same manner as for z/OS systems and data. To support
planned and unplanned outages these functions have been extended to KVM on LinuxONE
and IBM Z with GDPS V4.1. GDPS provides the recovery actions listed below:
Re-IPL in place of failing operating system images
z/VM Live Guest Relocation management
Manage z/VM LPARs and z/VM guests, including Linux on Z
Heartbeat checking of Linux guests
Disk error detection
Data consistency with freeze functions across z/OS and Linux
Site takeover/failover of a complete production site
Single point of control to manage disk mirroring configurations
Coordinated recovery for planned and unplanned events
Additional support is available for Linux running as a guest under z/VM. This includes:
Re-IPL in place of failing operating system images
Ordered Linux node or cluster start-up and shut-down
Coordinated planned and unplanned HyperSwap of disk subsystems, transparent to the
operating system images and applications using the disks
Transparent disk maintenance and failure recovery with HyperSwap across z/OS and
Linux applications
It describes the various aspects of a virtualized environment, including the IBM LinuxONE
processor Reliability, Availability, and Serviceability (RAS) characteristics, Processor Resource/Systems Manager (PR/SM) and Dynamic Partition Manager (DPM) operation
modes and Virtualization and Hypervisors such as z/VM and KVM. Resiliency automation for
the infrastructure is also covered. It also covers the LinuxONE Input/Output (I/O) capabilities
with FICON® and FCP SCSI, as well as Networking with the use of internal network
capabilities and network cards. This chapter includes the following topics:
We also need to consider which layers could build the most reliable environment and the best availability of the application services, regardless of planned or unplanned outages of components on any of the layers that are part of the environment. The figure below illustrates these layers, with options on the right side that can contribute to building an environment with the highest resiliency in the IT industry with IBM LinuxONE.
Physical Network: RoCE, OSA
Applications: containerized workloads, middleware, ISV applications, databases
Management: hypervisor management (z/VM, KVM, and PR/SM), container management (Kubernetes / RHOCP)
2.2 Infrastructure
This section describes the infrastructure layer, which includes the IBM LinuxONE built-in
hardware resiliency components and its RAS platform enhancements. It also covers the IBM
LinuxONE CPC components and the infrastructure layer's key capabilities, including
compute, physical storage and physical network.
The IBM LinuxONE system has its heritage in decades of extensive research and
development by IBM into hardware solutions to support mission-critical applications for the
most diverse industries and delivers exceptional overall resiliency.
Resilience capabilities vary according to equipment type and model; several IBM LinuxONE 4 models are available.
Note: Redundancy, by itself, does not necessarily provide higher availability. It is essential
to design and implement your IT infrastructure by using technologies such as system
automation and specific features. These technologies can take advantage of the
redundancy and respond to failures with minimal impact on application availability.
From a redundancy and resiliency perspective, the IBM LinuxONE platform design (hardware
and software) includes RAS principles that are driven by a set of high-level program
objectives that move toward continuous reliable operation (CRO) at the system level. The key
objectives of IBM LinuxONE are to ensure data integrity and computational integrity, to reduce or eliminate planned and unplanned outages, and to reduce the number of repair actions.
The RAS strategy is to manage change by learning from previous generations of IBM Z and
LinuxONE and investing in new RAS functions to eliminate or minimize all sources of
outages. The RAS strategy employs a building-block approach that is designed to meet
stringent requirements for achieving CRO. These are the RAS building blocks:
Error prevention
Error detection
Recovery
Problem determination
Service structure
Change management
Measurement
Analysis
Enhancements to IBM LinuxONE current RAS designs are implemented in the next IBM
LinuxONE platform through the introduction of new technology, structure, and requirements.
Continuous improvements in RAS are associated with new features and functions to ensure
that the IBM LinuxONE platforms deliver exceptional resiliency.
IBM LinuxONE RAS is accomplished with concurrent replace, repair, and upgrade functions
for processing units, memory, CPC and I/O drawers, as well as I/O features for storage,
network, and clustering connectivity.
The IBM LinuxONE hardware and firmware are a physical implementation of the
z/Architecture. The key capabilities that are featured in the infrastructure layer include:
Compute (CPU, Memory and Firmware).
Storage
Network
Compute
The IBM LinuxONE hardware platform can have one or more frames based on the model.
The LinuxONE Rockhopper 4 models will always be just one frame whereas the LinuxONE
Emperor 4 models can be configured with one to four frames. Even the model with just one
frame will have many redundant components and features, because resiliency is a design
point of these machines. The frames contain the following components:
CPC drawers, which contain processing units (PUs), also known as cores or general-purpose processors, memory, and connectivity to I/O drawers.
I/O drawers with special cores for I/O features
Special purpose features, such as on-chip crypto feature and on-chip AI accelerator
Cooling units for either air or water cooling
Power supplies
Oscillator cards for system clock
The choice of LinuxONE model depends on the number of processing units (PUs), the amount of memory, and how much I/O bandwidth you require to run your workloads. The PUs, memory, and I/O features have built-in resiliency, and the power, cooling, and system clocking are redundant components in each system.
Figure 2-3 IBM z16™ LinuxONE Rockhopper 4 under the covers (front and rear views)
LinuxONE CPCs
For the highest resiliency, the IBM LinuxONE CPC has almost all of its components installed redundantly, to avoid any type of outage, or it uses the redundancy for self-healing.
Each CPC drawer in a LinuxONE hosts the processing units, memory, and I/O interconnects.
The CPC drawer design aims to reduce, or in some cases even eliminate, planned, and
unplanned outages. The design does so by offering concurrent repair, replace, and upgrade
functions for the CPC drawer. Figure 2-3 shows front and rear views of a LinuxONE Rockhopper 4 under the covers.
The process through which a CPC drawer can take over for a failed CPC drawer is called
Enhanced (CPC) Drawer Availability (EDA). EDA allows a single CPC drawer in a
multi-drawer
configuration to be removed and reinstalled concurrently for an upgrade or a repair.
– System assist processor (SAP) is used for offload and to manage I/O operations
Several SAPs are standard with the platform. More SAPs can be configured if
increased I/O processing capacity is needed
– Integrated Firmware Processor (IFP) - Two cores dedicated to supporting native PCIe
features (for example, RoCE Express, zEDC Express, zHyperLink Express, and
Coupling Express), and other firmware functions
– Central processor (CP) - In LinuxONE, a CP is permitted for exclusive use of Virtual
Appliance, which is a fully integrated software solution to provide Continuous
Availability / Disaster Recovery protection and fail-over automation of workloads on
LinuxONE. Refer to “Virtual Appliance (VA)” on page 58
In the unlikely event of a permanent core failure, each core can be individually replaced by
one of the available spares. Core sparing is transparent to the operating system and
applications. The resiliency capabilities for the PUs include:
– Transparent core sparing
– Concurrent processor drawer repair/add, including Processor/Cache Chips, Memory
and other internal components
– Transparent SAP Sparing
System clocking
LinuxONE has two oscillator cards (OSCs) for system clocking purposes: one primary and
one secondary. If the primary OSC fails, the secondary detects the failure, takes over
transparently, and continues to provide the clock signal to the system.
Power
The resiliency capabilities for power include transparent fail-over and concurrent repair of all
power parts and redundant AC inputs. The power supplies for LinuxONE are also based on
the N+1 design. The additional power supply can maintain operations and avoid an
unplanned outage of the system.
Cooling
LinuxONE can provide N+1 cooling function for the radiator-based, air cooled model, which
suits the needs of typical business computing. The N+1 (redundant) cooling function for the
fluid-cooled model suits the needs of enterprise computing. The resiliency capabilities for
cooling include transparent fail-over and concurrent repair of cooling pumps, blowers, fans,
and so on. The single frame models do not have radiators. The cooling is accomplished by
forced air using redundant fans.
The SEs are stand-alone 1U rack-mounted servers and closed systems that run a set of
LinuxONE platform management applications. When tasks are performed at the HMA, the
commands are routed to the primary SE of the platform. The primary SE then issues those
commands to the LinuxONE.
Two rack-mounted SEs (one is the primary and the other is the alternate) are implemented to
manage the LinuxONE platform. SEs include N+1 redundant power supplies. Information is mirrored once per day between the primary and the alternate SE. The Remote Support
Facility (RSF) provides communication with the IBM Support network for hardware problem
reporting and service.
Memory
LinuxONE platforms implement Redundant Array of Independent Memory (RAIM), which
detects and recovers from failures of dynamic random access memory (DRAM), sockets,
memory channels, or Dual Inline Memory Module (DIMM). LinuxONE memory includes these
resiliency capabilities:
– DIMM-level failure protection based on RAIM technology
– Memory channel and bus protection based on CRC and RAIM technology
– Concurrent memory repair/add through the concurrent drawer repair process
– Concurrent memory upgrades
To help minimize planned outages, the following tasks are also possible:
PCIe fanout
The PCIe fanout in the CPC drawer provides the redundant paths for data between memory
and the I/O drawers, which house the I/O features. The PCIe fanout is hot-pluggable. If a
PCIe fanout fails, a redundant I/O interconnect allows a PCIe fanout to be concurrently
repaired without loss of access to its associated I/O domains within the I/O drawer.
I/O features are supported in any combination and can be concurrently added and removed.
The resiliency capabilities for I/O include:
– Multiple channel path support
– Concurrent repair/add of all features in an I/O drawer
– Concurrent repair/add of I/O drawer
– Concurrent upgrade of any I/O feature type
– Domain fail-over based on Redundant I/O interconnect
– Dynamic activation of I/O configuration changes
This is the basic principle underlying the Capacity on Demand offerings for IBM Z and
LinuxONE. The Capacity on Demand offerings allow you to get the resources you need, when
you need them.
The Capacity on Demand offerings provide permanent and temporary upgrades by activating
one or more LICCC records. These upgrades occur without disruption to the operation of the
server. Depending on the type of upgrade, you can order upgrades yourself using the
Customer Initiated Upgrade (CIU) application on Resource Link® or you can call your IBM
sales representative to order the upgrades.
Various dynamic “capacity on demand” capabilities for resiliency and capacity change on
demand are available for IBM LinuxONE:
– Flexible Capacity for Cyber Resiliency records allow you to shift production capacity between participating IBM z16™ servers at different sites. This offering is available for IBM Z beginning with IBM z16 servers and LinuxONE 4
For resiliency, these cards can be coupled and act as a unit that can run the entire workload if one card malfunctions.
IBM LinuxONE systems have dedicated processors (SAPs) to handle I/O, which are independent of the application cores. They do not require any capacity planning and are not considered in licensing models for the software stack.
– Several I/Os can be issued for a LUN at the same time (asynchronous I/O)
– Disk blocks are 512 bytes
– No ECKD emulation overhead of fixed block devices
– I/O queues occur in the FCP card or in the storage server
– No disk size restrictions
– High availability and load balancing are provided by Linux multipathing (type fail-over or multibus), or by the z/VM hypervisor exploiting the EDEVICE feature
– Exploits N-Port ID Virtualization (NPIV)2
1 ECKD (Extended Count Key Data) and CKD (Count Key Data) are used interchangeably in this chapter.
2 N-Port ID Virtualization (NPIV) is a Fibre Channel (FC) standard that makes it possible to create multiple virtual ports on a single physical node port (N-Port), with each virtual port appearing as a unique entity to the FC network.
For resiliency purposes, the storage servers can be used through a SAN Volume Controller (SVC), which can build a stretched storage cluster with high consistency and resiliency based on the storage controller itself. The SVC can handle storage fail-over between different storage servers.
Effectively, the OSA integrates the control unit and device into the same hardware. It does so
by placing it on a single card that directly connects to the central processor complex I/O bus.
2.3.1 Virtualization
IBM LinuxONE is a completely virtualized environment, in which ‘bare metal’ means running under the Processor Resource/Systems Manager (PR/SM) virtualization mode. This level is already a virtualization layer that is implemented in the hardware and is highly efficient for resiliency and fail-over capabilities between logical partitions (LPARs) and the resources attached to them.
An LPAR is a subset of the processor hardware that is defined to support an operating
system. An LPAR contains resources (processors, memory, and input/output devices) and
operates as an independent system. Multiple logical partitions can exist within a LinuxONE hardware system.
3 IBM LinuxONE 4 is planned to be the last server to support OSE networking channels. (This refers to IBM support for the Systems Network Architecture (SNA) protocol being transported natively out of the server by using OSA-Express 100BASE-T adapters.)
Partitioning control specifications are partly contained in the Input/Output Configuration Data
Set (IOCDS) and are partly contained in a system profile. The IOCDS and profile both reside
in the Support Element (SE). An IOCDS contains information to define the I/O configuration to
the processor channel subsystem and is created by a program called I/O Configuration
Program (IOCP).
The two hypervisors that can be installed in a LinuxONE LPAR are z/VM and KVM.
With z/VM, you get IBM's premier hypervisor, matured over decades, which exploits very granular resource sharing and resource shifting between the virtualized Linux guests that it controls.
With KVM, you can enable on LinuxONE a hypervisor developed by the open source community and adapted by our premier Linux distribution partners: Red Hat, SUSE, and Canonical.
Platform virtualization
Platform Virtualization is a principal strength of the IBM LinuxONE. It is embedded in the
architecture and built into the hardware, firmware, and operating systems. For decades, the
IBM LinuxONE platforms have been designed based on the concept of partitioning resources
(such as CPU, memory, storage, and network resources). So, each set of features can be
used independently with its own operating environment.
Every LinuxONE platform is highly virtualized, with the goal of maximizing utilization of
computing resources, while lowering the total number of resources and cost needed to run
critical workloads and solutions.
Virtualization can help secure and isolate application workloads and data within virtual
servers and storage devices for easier replication and restoration. This added resiliency can
provide you with greater flexibility to maintain a highly available infrastructure, while
performing planned maintenance, and to configure low-cost disaster-recovery solutions.
IBM PR/SM is a technology used in IBM LinuxONE to provide logical partitioning and
resource management capabilities. PR/SM allows a LinuxONE system to be divided into
multiple logical partitions, each running its own operating system instance and set of
applications with highest isolation.
PR/SM logically partitions the platform across the various LPARs to share resources, such as
processor units and I/O (for networks and storage), allowing for a high degree of virtualization
and highest isolation at the same time.
The main goal of PR/SM is to maximize the utilization of system resources and improve
overall system performance. It allows up to 85 logical partitions to run concurrently on a single
physical LinuxONE Emperor and up to 40 on a single LinuxONE Rockhopper, enabling
organizations to consolidate workloads and at the same time isolate them for Resiliency and
therefore reduce hardware costs.
Overall, IBM PR/SM is a powerful virtualization technology that enables efficient resource
management and workload consolidation on IBM LinuxONE systems. It helps organizations
maximize the utilization of their LinuxONE hardware and achieve better performance,
flexibility, and reliability in their computing environments.
Figure 2-4 PR/SM virtualization: physical PUs (CPs, zIIPs, and IFLs) assigned as logical PUs to LPARs running z/OS, Linux (distribution specific), KVM, and z/VM (which can host all distributions), with virtual PUs defined to the guests
Figure 2-4 shows the various options for exploiting the IBM PR/SM virtualization capabilities, such as z/VM and KVM, as well as native z/OS and native Linux distributions. IBM LinuxONE supports only z/VM, KVM, and native Linux LPARs.
Resiliency in workloads necessitates the utilization of one or more LPARs, along with various
modes of software-level control to ensure availability. It is essential to predetermine the
desired availability levels for each workload to understand the corresponding requirements.
To provide application fail-over capabilities for multi-LPAR application workloads, it is
recommended to implement high availability clustering solutions such as Pacemaker, Red
Hat High Availability Add-on for Red Hat Enterprise Linux, or SUSE High Availability. These
solutions facilitate automatic fail-over and workload recovery in the event of hardware or
software failures.
Regular testing of LPAR application workloads' resiliency through simulation and controlled
failure scenarios is vital. Conducting disaster recovery drills, fail-over testing, and load testing
will validate the effectiveness of resilience measures and identify areas for improvement.
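The clustering products named above implement heartbeat monitoring and automatic fail-over for you. The following Python sketch only illustrates the underlying idea, for example as part of a controlled failure test; the host name, port, thresholds, and the fail-over action are hypothetical placeholders and not part of any product.

# Conceptual sketch of heartbeat-based fail-over monitoring; real deployments
# should use clustering solutions such as Pacemaker rather than this sketch.
# Host name, port, thresholds, and the fail-over action are hypothetical placeholders.
import socket
import time

PRIMARY = ("linux-guest-a.example.com", 8080)    # hypothetical service endpoint
FAILURE_THRESHOLD = 3                            # consecutive missed heartbeats
HEARTBEAT_INTERVAL = 5                           # seconds between checks

def heartbeat(host, port, timeout=2.0):
    """Return True if a TCP connection to the service endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def trigger_failover():
    # Placeholder: a real cluster manager would restart the workload on
    # another LPAR or guest and reroute the network traffic.
    print("Fail-over triggered: restart workload on the standby node")

missed = 0
while missed < FAILURE_THRESHOLD:
    if heartbeat(*PRIMARY):
        missed = 0
    else:
        missed += 1
        print(f"Missed heartbeat {missed}/{FAILURE_THRESHOLD}")
    time.sleep(HEARTBEAT_INTERVAL)

trigger_failover()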
LinuxONE can be configured in either PR/SM or Dynamic Partition Manager (DPM) mode. Resources in a LinuxONE system can be shared between partitions, which reduces the number of adapters that might be required to handle a specific workload.
Figure 2-5 IBM LinuxONE in DPM mode: multiple Linux on Z guests, with their binaries and libraries, hosted by the z/VM and KVM hypervisors
DPM
DPM is a feature that is enabled by the systems’ initialization process and is mutually
exclusive with PR/SM. Once enabled, the DPM mode will be ready when the system is
powered on.
With DPM you will use the HMC to configure the running environment for your LinuxONE
server. DPM automatically discovers and displays the system resources that are available for
you to use, and indicates how your selections might affect other servers and applications that
are already defined or running on the same system.
After your system is up and running in DPM mode, you can use DPM to:
– Modify system resources without disrupting running workloads
– Monitor sources of system failure incidents and conditions or events that might lead to
workload degradation
– Create alarms so that you can be notified of specific events, conditions, and changes to
the state of system resources
– Update individual partition resources to adjust capacity, redundancy, availability, or
isolation
For additional DPM mode information, refer to Dynamic Partition Manager (DPM) Guide,
SB10-7176.
Two LPARs can be used as separate entities with shared computing resources; if one LPAR is not functioning, the other LPAR can run the entire workload because capacity shifts to it. This is built-in resiliency by hardware design.
As shown in Figure 2-5 on page 43, DPM does not cover all features of PR/SM, but it enhances the user experience for IBM LinuxONE, especially in a cloud-based environment where software-defined hardware is important and hardware resiliency is controlled and managed with external tools.
IBM z/VM is a highly secure and scalable virtualization technology. It is a hypervisor that runs
on LinuxONE systems, allowing multiple virtual machines to run on a single or multiple logical
partitions on a physical machine. Each virtual machine can run its own operating system,
such as Linux. With z/VM, organizations can consolidate workloads and improve resource
utilization and resiliency, while maintaining high levels of security and performance.
z/VM is a Type-2 hypervisor, that allows sharing the LinuxONE platform’s physical resources,
such as disk, memory, network adapters, and cores (called Integrated Facility for Linux -
IFLs). These resources are managed by the z/VM hypervisor, which runs in an LPAR, and
manages the virtual machines (VMs) that run under the control of the z/VM hypervisor.
Typically, the z/VM hypervisor is used to run Linux virtual servers, but it can also be nested, which means that a second layer of virtualization is possible. This enables granular testing capabilities and verification for resiliency.
With z/VM, workloads can scale efficiently both horizontally, by enabling more virtual servers in the same z/VM system, and vertically, which means that applications can scale without modification simply by dynamically assigning them more resources through the z/VM hypervisor.
You can combine horizontal and vertical scalability to reach the best virtualization capability in an IT environment. That is unique to the IBM LinuxONE environment.
z/VM can host Linux guests from all distribution partners in the same LPAR, because its EAL4+ certified isolation for virtual servers guarantees the isolation of each server in a safe, secure, and scalable environment.
Within z/VM, the network topology can be simplified using z/VM VSWITCH technology, a
virtual software defined network that can run in Layer 2 or Layer 3 connecting to one or
multiple OSA network cards for resiliency.
z/VM VSWITCH connected to multiple OSA cards can implement port aggregation, which
enables a resilient fail-over and enlarged network bandwidth at the same time.
The direct connection to a LinuxONE internal network topology implemented with HiperSockets across LPARs is another unique characteristic, which uses the resiliency of the hardware features built into LinuxONE and the z/VM hypervisor.
z/VM Resiliency
IBM z/VM delivers resiliency for the virtual machines and workloads in several levels:
– Linux multipathing of ECKD and FBA emulated devices (EDEVs or EDEVICEs)
• For ECKD devices, the comparable function is provided by HyperPAV
– Capability of attaching NPIV FCP devices directly to the guests for Linux multipathing
– Network redundancy with two or more Open Systems Adapters (OSAs) through the z/VM Virtual
Switch (VSWITCH)
• z/VM VSWITCH supports various configurations, including Layer 2 and Layer 3
networking, Active/Backup, IEEE 802.3ad Link Aggregation, Virtual Edge Port
Aggregator (VEPA) IEEE 802.1Qbg, and VLAN aware and unaware modes, among others.
• Unlike OSA-Express, RoCE Express does not support z/VM VSWITCH technology to
provide path redundancy
– z/VM Single System Image (SSI) and Live Guest Relocation:
• SSI is a concept in z/VM that enables up to eight z/VM instances to function as a
single system. With SSI, these instances are interconnected and collectively
manage a shared set of resources, such as storage and network infrastructure. The
goal is to provide a cohesive and transparent environment for running VMs across
the interconnected z/VM instances. See Figure 2-7 on page 46
• Live Guest Relocation is a feature of z/VM SSI that allows for the movement of
running VMs from one z/VM instance to another while the VMs continue to operate
without interruption. These features provide flexibility and high availability by
enabling workload balancing, disaster recovery, and maintenance operations
without affecting the availability of applications running in the VMs
For effective workload balancing, prioritization between virtual guests in the LPAR is helpful.
It guarantees that certain VMs always have enough capacity and get as much as is available
when possible. This prioritization is defined with the z/VM SHARE setting at the VM level. For
details, see:
https://www.ibm.com/docs/en/zvm/7.3?topic=resources-set-share-command
Another concept for effective workload definition is CPU pooling with z/VM, which uses a pool
of CPU capacity within an LPAR and distributes it to the workloads on a per-VM basis. CPU
pooling is also used for effective licensing within an LPAR. See: Using CPU Pools.
A z/VM SSI cluster consists of up to eight z/VM systems in an Inter-System Facility for
Communications (ISFC) collection. ISFC is a function of z/VM Control Program (CP) that
provides communication services between transaction programs on interconnected z/VM
systems. A group of interconnected domains consisting of z/VM systems that use ISFC to
communicate with each other is known as an ISFC collection.
Each z/VM system is a member of the SSI cluster. Figure 2-7 shows the basic structure of a
cluster with four members. The cluster is self-managed by CP using ISFC messages that flow
across channel-to-channel devices between the members. All members can access shared
DASD volumes, the same Ethernet LAN segments, and the same storage area networks
(SANs).
Single System Image (SSI) enhances the z/VM systems management, communications, disk
management, device mapping, virtual machine definition management, installation, and
service functions to enable multiple z/VM systems to share and coordinate resources within a
Single System Image structure. This combination of enhanced functions provides the
foundation that enables Live Guest Relocations (LGR), which is the ability for a Linux guest to
be moved from one z/VM system to another within the SSI cluster.
A running virtual server (guest virtual machine) can be relocated from one member to
another. Relocating virtual servers can be useful for load balancing and for moving workload
off of a physical server or member system that requires maintenance. After maintenance is
applied to a member, guests can be relocated back to that member, thereby allowing you to
maintain z/VM and to keep your Linux virtual servers highly available. When a relocation is
initiated, the guest continues to run on the source member until the destination environment is
completely prepared. At that point the guest is briefly quiesced on the source member and
then resumed on the destination member. Refer to: Planning for a Single System Image and
Live Guest Relocations.
Note: A z/VM SSI cluster can have up to eight members.
KVM is a hypervisor that is developed by the open source community and then adapted and
integrated into every Linux distribution for IBM LinuxONE. It provides server virtualization
for Linux workloads that run on the IBM LinuxONE platform, leveraging the existing skill set
used on other hardware architectures.
It is also a Type-2 hypervisor that runs in an LPAR and enables sharing CPUs (IFLs),
memory, and I/O resources through platform virtualization.
KVM allows the creation and execution of multiple Virtual Machines (VMs) on a Linux host,
providing full hardware virtualization capabilities. It leverages the hardware virtualization
extensions available on IBM LinuxONE systems for efficient and secure virtualization.
It can coexist with z/VM environments and is optimized for scalability, performance, security,
and resiliency, and provides standard Linux and KVM interfaces for simplified virtualization
and operational control.
The KVM hypervisor is integrated in the major supported Linux distributions:
– Red Hat Enterprise Linux Server
– SUSE Linux Enterprise Server
– Canonical Ubuntu
KVM on LinuxONE relies on hardware, LPAR and Linux host redundancy and resiliency as a
starting point. As such it implements features of high availability, redundancy, and resiliency
at hypervisor level, such as:
– Multipathing of virtual disks with Linux Device Mapper Multipathing (DM-MP)
– Network redundancy based on Linux bonding and teaming. Active-backup,
round-robin, and IEEE 802.3ad Link Aggregation are supported
– Virtual Machine Live Migration, which enables you to move a running VM from one
physical host to another without disrupting its operation. The VM continues to run
without noticeable downtime while it is transferred from the source host to the
destination host, which allows you to achieve load balancing, perform hardware
maintenance, or migrate VMs between physical hosts for reasons such as workload
balancing or planned maintenance. A minimal libvirt-based sketch follows this list.
– Because RoCE Express does not provide a promiscuous mode, you cannot use Open
vSwitch in a KVM host to provide path redundancy for its guests; MacVTap can be
used instead for internal communication of KVM guests
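The following is a minimal sketch, not taken from the product documentation, showing how a
live migration could be triggered through the libvirt Python bindings. The host names, the
guest name, and the connection URIs are hypothetical; the required flags, authentication, and
shared-storage prerequisites depend on the actual KVM setup.

   import libvirt

   # Connect to the source and destination KVM hosts (hypothetical URIs).
   src = libvirt.open("qemu+ssh://kvmhost1/system")
   dst = libvirt.open("qemu+ssh://kvmhost2/system")

   # Look up the running guest to be relocated (hypothetical guest name).
   dom = src.lookupByName("linuxguest1")

   # Request a live migration; the guest keeps running during the transfer.
   dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

   src.close()
   dst.close()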
KVM Cluster
A KVM cluster is a virtualization environment that leverages the KVM hypervisor to create and
manage virtual machines on Linux systems. KVM clusters connect multiple nodes to work
collectively and manage virtual resources, and they implement high availability for the
virtual machines that they host.
Linux Applications
IBM LinuxONE facilitates the containerization of Linux-based applications through
technologies such as Open Container Initiative (OCI) compliant containers. Containers
provide a lightweight and isolated runtime environment, ensuring consistency and portability
across different environments. Leveraging LinuxONE's powerful hardware resources and
scalability, it becomes an optimal choice for running large-scale containerized workloads.
Microservices Architecture
IBM LinuxONE is highly suitable for deploying microservices-based architectures using
containers. By breaking down complex applications into smaller, decoupled services,
developers can independently develop, deploy, and scale these services. Containers enable
efficient resource utilization and rapid deployment, making LinuxONE an ideal platform for
managing microservices workloads.
Cloud-Native Applications
IBM LinuxONE fully supports the development and deployment patterns of cloud-native
applications. These applications are designed to harness the benefits of containerization,
scalability, and elasticity. Leveraging technologies such as Kubernetes, IBM LinuxONE
enables automated container orchestration, scaling, and load balancing, providing a reliable
and high-performance foundation for running cloud-native applications.
Container Platforms
IBM LinuxONE supports a wide range of container platforms that
provide capabilities for managing containerized applications. An example is the Red Hat
OpenShift Container Platform (RHOCP), which is available for various hardware
architectures, including IBM LinuxONE. RHOCP, built upon Kubernetes, offers a
comprehensive environment for developing, deploying, and managing containerized
applications. It includes features like automated scaling, load balancing, and container
lifecycle management.
The Parallel Access Volume, or PAV, facility allows a controller to offer multiple device
numbers that resolve to the same DASD, which allows I/O to the same DASD to happen
concurrently.
With this concept, the device addressed by the operating system to perform the I/O operation
is the “Base” device. When the Base device is busy working and another I/O operation to the
same Base address is started, it can be executed by one of the “Alias” devices, if PAV or
HyperPAV is used.
If there is no aliasing of disk devices then only one I/O transfer can be in progress to a device
at a time, regardless of the actual capability of the storage server to handle concurrent access
to devices. Parallel access volume exists for Linux on System z® in the form of PAV and
HyperPAV. Compared to PAV, HyperPAV is much easier to administer and provides greater
flexibility. PAV and HyperPAV are optional features that are available on the DS8000® Storage
Subsystems series.
(Figure: I/O requests with and without PAV. Without alias devices, only one I/O request at a
time is processed for a volume. With PAV alias devices, two or more I/O operations from the
same operating system can access the same volume at the same time, when no extent conflict
exists, using an alias-to-base unit address mapping.)
HyperPAV
The PAV capability is extended dynamically with IBM Hyper-Parallel Access Volume
(HyperPAV). HyperPAV allows multiple I/O operations to a DASD through a pool of base and
alias subchannels (devices) that are shared within a logical subsystem (LSS). HyperPAV
eliminates the need for users to map volumes to aliases and takes care of the aliases and
I/Os automatically. See Figure 2-9 on page 51.
Using ECKD devices with HyperPAV (see Figure 2-11 on page 54) can greatly improve
performance and increase availability. The following are some characteristics of ECKD with
HyperPAV (a small inspection sketch follows this list):
– The DASD driver sees the real disk and all aliases
– Load balancing with HyperPAV is done in the DASD driver
– It uses fewer processor cycles than Linux multipathing
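As an illustration only, the following Python sketch lists the DASD devices on a Linux guest
and reports which of them are PAV or HyperPAV alias devices. It assumes the alias sysfs
attribute that the Linux DASD device driver exposes under /sys/bus/ccw/devices/<bus-id>/;
verify the attribute on your distribution before relying on it.

   import os

   CCW_PATH = "/sys/bus/ccw/devices"

   # Walk all CCW devices and report DASDs that expose the 'alias' attribute.
   for bus_id in sorted(os.listdir(CCW_PATH)):
       alias_file = os.path.join(CCW_PATH, bus_id, "alias")
       if not os.path.isfile(alias_file):
           continue  # not a DASD device
       with open(alias_file) as f:
           is_alias = f.read().strip() == "1"
       print(f"{bus_id}: {'HyperPAV alias' if is_alias else 'base device'}")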
For resiliency, storage servers with ECKD disks are typically mirrored by using the storage
server function of Metro Mirror for synchronous replication over metro distances, or with
Global Mirror if the distances are larger. The swapping between the mirrored disks needs to be
managed, for example with the HyperSwap function (see 2.5.4, “HyperSwap”).
As shown in Figure 2-9 on page 51, in the Storage Subsystem logical subsystem (LSS) of the
DS8000, aliases exist as a global resource and are assigned on an I/O-by-I/O basis rather
than being bound to a base device.
An IBM LinuxONE system can make use of multiple network cards and different network
topologies. The HiperSockets function, for example, eliminates the need for I/O subsystem
operations and for external network connections when LPARs in the same system communicate
with each other.
For more information about the available OSA Express and RoCE Express features, see:
IBM Z Connectivity Handbook, SG24-5444:
Depending on the hypervisor you can use virtual switches in combination with the network
capabilities of IBM LinuxONE mentioned above.
Figure 2-10 on page 52 shows examples of the most used network capabilities with OSA, RoCE,
and HiperSockets network interfaces. There are three Linux instances: two of them run as z/VM
guests in one LPAR, and a third Linux instance runs in another LPAR.
Within z/VM, Linux instances can be connected through a guest LAN or VSWITCH. Within
and between LPARs, you can connect Linux instances through HiperSockets. OSA-Express
cards running in either non-QDIO mode or in QDIO mode can connect the LinuxONE to an
external network.
Note: For information about QDIO and Non-QDIO modes, please refer to IBM Z
Connectivity Handbook, SG24-5444.
For resiliency, it is recommended to use multiple network interfaces, even for different
workloads. It can be beneficial to use multiple networks with different characteristics to
support the application requirements regarding network bandwidth and resiliency.
(Figure 2-10: connectivity examples with guest LAN or VSWITCH, HiperSockets, SMC-D, QDIO,
RoCE, and SMC-R attachments to the LANs.)
A MacVTap endpoint is a character device that follows the tun/tap ioctl interface and can be
used directly by KVM/QEMU and other hypervisors that support the tun/tap interface.
MacVTap can be configured in any of three different modes, which determine how the
MacVTap device communicates with the lower device in the KVM host. The three possible
modes are VEPA, Bridge, and Private. See “Configuring a MacVTap interface” for
additional setup information.
VEPA
Virtual Ethernet Port Aggregator (VEPA) is the default mode. Data flows from one endpoint
down through the source device in the KVM host out to the external switch. If the switch
supports hairpin mode, the data is sent back to the source device in the KVM host and from
there sent to the destination endpoint.
Bridge
Connects all endpoints directly to each other. Two endpoints that are both in bridge mode can
exchange frames directly, without the round trip through the external bridge. This is the most
useful mode for setups with classic switches, and when inter-guest communication is
performance critical.
Private
Private mode behaves like a VEPA mode endpoint in the absence of a hairpin aware switch.
Even when the switch is in hairpin mode, a private endpoint can never communicate to any
other endpoint on the same lowerdev.
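As a hedged illustration of how such an interface might be defined, the following Python
sketch uses the libvirt bindings to attach a MacVTap interface in bridge mode to a running
guest. The guest name (linuxguest1) and the host interface name (enc1000) are assumptions;
adjust them, and the mode attribute (bridge, vepa, or private), to match your environment.

   import libvirt

   # MacVTap interface definition in libvirt domain XML (type='direct').
   # 'mode' selects bridge, vepa, or private behavior, as described above.
   IFACE_XML = """
   <interface type='direct'>
     <source dev='enc1000' mode='bridge'/>
     <model type='virtio'/>
   </interface>
   """

   conn = libvirt.open("qemu:///system")
   dom = conn.lookupByName("linuxguest1")

   # Attach to the running guest and persist the change in its configuration.
   dom.attachDeviceFlags(
       IFACE_XML,
       libvirt.VIR_DOMAIN_AFFECT_LIVE | libvirt.VIR_DOMAIN_AFFECT_CONFIG)
   conn.close()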
VSWITCH is a z/VM CP system-owned switch (a virtual switch) to which virtual machines can
connect. Each switch is identified by a switchname. A z/VM user can create the appropriate
QDIO network interface card (NIC) and connect it to this switch with the NICDEF directory
statement. Alternatively, a z/VM user can create a virtual network adapter (with the CP
DEFINE NIC command) and connect it to this LAN (with the COUPLE command). See z/VM:
CP Commands and Utilities Reference for more information on these commands.
The z/VM Virtual Switch can be configured in Bridge or VEPA mode, depending upon local
requirements for network separation; this allows for either the performance boost of the
"hairpin turn" or separation out to the physical switch. Additionally, the Virtual Switch supports
Link Aggregation — the combination of multiple OSA ports into a single logical pipe. This
allows for greater throughput and network load-balancing at the Layer 2 level when
communicating with a physical switch. This technology extends to multiple systems in a z/VM
Single System Image through Inter-VSWITCH Link Aggregation Groups.
For more information about virtual networking options in z/VM, see z/VM: Connectivity.
2.5.4 HyperSwap
HyperSwap is a storage high availability solution for IBM Storage Subsystems, such as
DS8000, that enables switching logical units between two Storage Subsystems to ensure
continuous operation. IBM HyperSwap is a high availability feature that provides single site or
dual-site, active-active access to a volume. This function ensures continuous data availability
in case of hardware failure, power failure, connectivity issues, or other unplanned outages. It
is designed to offer a robust disaster recovery solution with minimal Recovery Point Objective
(RPO) and Recovery Time Objective (RTO).
Additional details about HyperSwap function can be found in section 3.3.8, “HyperSwap
Function” on page 88.
2.5.5 Multipathing
Multipath I/O provides failover and might improve performance. You can configure multiple
physical I/O paths between server nodes and storage arrays into a single multipath device.
Multipathing thus aggregates the physical I/O paths, creating a new device that consists of
the aggregated paths.
Multipathing provides I/O failover and path load sharing for multipathed block devices. In
Linux, multipathing is implemented with multi-path tools that provide a user-space daemon for
monitoring and an interface to the device mapper. The device-mapper provides a container
for configurations, and maps block devices to each other.
A single SCSI device (or a single zFCP5 unit) constitutes one physical path to the storage.
The multipath user-space configuration tool scans sysfs for SCSI devices and then groups
the paths into multipath devices. This mechanism that automatically puts each detected SCSI
device underneath the correct multipath device is called coalescing.
Use a multipath setup to access SCSI storage in a Fibre Channel Storage Area Network (FC
SAN). The multipath device automatically switches to an alternate path in case of an
interruption on the storage system controllers or due to maintenance on one path.
The multipath daemon has default configuration entries for most storage systems, so you only
need to do basic configuration for these systems. A small sketch for checking path states
follows.
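The following Python sketch is only an illustration of how path health could be checked on a
Linux guest. It assumes that multipath-tools is installed, that the multipath -ll command is
available, and that failed paths are flagged with the word "failed" in its output; it is a
monitoring aid, not a replacement for the multipath daemon's own failover handling.

   import subprocess

   # Ask the multipath tools for the current topology and path states.
   output = subprocess.run(
       ["multipath", "-ll"], capture_output=True, text=True, check=True
   ).stdout

   # Report any line that mentions a failed path so an operator can react.
   failed = [line.strip() for line in output.splitlines() if "failed" in line]
   if failed:
       print("Degraded multipath paths detected:")
       for line in failed:
           print(" ", line)
   else:
       print("All multipath paths appear healthy.")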
5 The zfcp device driver is a low-level driver or host-bus adapter driver that supplements the Linux SCSI stack. See:
https://www.ibm.com/docs/en/linux-on-z?topic=channel-what-you-should-knowok
For more information about how to access, configure and use FCP Multipathing with Linux
kernel, please access the link below:
IBM Documentation.
2.6 Applications
This section covers application exploitation in the IBM LinuxONE platform and introduces
container technology, containerization, and containerized workloads. Also covered are the
following topics: Middleware, Middleware types, Independent Software Vendors (ISVs)
applications, and Databases supported.
2.6.1 Containers
Container images become containers at runtime. Available for Linux or other platforms such
as Windows or Cloud, containerized software will always run the same, regardless of the
infrastructure. Containers isolate software from its environment and ensure that it works
uniformly despite differences, for instance between development and staging.
Containers are standard units of software that package up code and all its dependencies so
that the application runs quickly and reliably from one computing environment to another.
A container image is a lightweight, standalone, executable package of software that includes
everything needed to run an application:
code
runtime
system tools
system libraries
settings
Containers are a way to bundle and run applications. In a production environment, it is
required to manage the containers that run the applications and to ensure that there is no
downtime. For example, if a container goes down, another container needs to start.
This is where Kubernetes assists. Kubernetes provides a framework to run distributed
systems resiliently. It manages application scaling and failover, as well as deployment
patterns and much more. A small sketch of a resilient deployment definition follows.
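As a hedged illustration of this resiliency model, the following Python sketch uses the
official Kubernetes Python client to create a Deployment with three replicas and a liveness
probe, so that the platform restarts or reschedules failed containers. The image name,
namespace, probe path, and port are assumptions made for the example.

   from kubernetes import client, config

   # Load cluster credentials from the local kubeconfig (assumed to exist).
   config.load_kube_config()
   apps = client.AppsV1Api()

   # A container with a liveness probe so failed instances are restarted.
   container = client.V1Container(
       name="web",
       image="registry.example.com/web:1.0",  # hypothetical image
       liveness_probe=client.V1Probe(
           http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
           initial_delay_seconds=10,
           period_seconds=15,
       ),
   )

   # Three replicas let the platform reschedule work if a pod or node fails.
   deployment = client.V1Deployment(
       metadata=client.V1ObjectMeta(name="web"),
       spec=client.V1DeploymentSpec(
           replicas=3,
           selector=client.V1LabelSelector(match_labels={"app": "web"}),
           template=client.V1PodTemplateSpec(
               metadata=client.V1ObjectMeta(labels={"app": "web"}),
               spec=client.V1PodSpec(containers=[container]),
           ),
       ),
   )

   apps.create_namespaced_deployment(namespace="default", body=deployment)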
2.6.2 Containerization
Containerization is the packaging of software code with just the operating system (OS)
libraries and dependencies required to run the code to create a single lightweight
executable—called a container—that runs consistently on any infrastructure. More portable
and resource-efficient than virtual machines (VMs), containers have become the de facto
compute units of modern cloud-native applications.
Containerization allows developers to create and deploy applications faster and more
securely. With traditional methods, code is developed in a specific computing environment
which, when transferred to a new location, often results in bugs and errors. For example,
when a developer transfers code from a desktop computer to a Virtual Machine (VM).
Containerization eliminates this problem by bundling the application code together with the
related configuration files, libraries, and dependencies required for it to run. This single
package of software or “container” is abstracted away from the host operating system, and
hence, it stands alone and becomes portable—able to run across any platform or cloud, free
of issues.
Containers are often referred to as “lightweight,” meaning they share the machine’s operating
system kernel and do not require the overhead of associating an operating system within
each application. Containers are inherently smaller in capacity than a VM and require less
start-up time, allowing far more containers to run on the same compute capacity as a single
VM. This drives higher server efficiencies and, in turn, reduces server and licensing costs.
Containerized workloads on IBM LinuxONE benefit from the platform's reliability, security,
and scalability. With advanced virtualization capabilities and integration with container
orchestration platforms, LinuxONE provides an excellent solution for running diverse
containerized applications, including Linux-based workloads, microservices, cloud-native
applications, and hybrid cloud deployments. Refer to 2.3.7, “Container Platforms” on page 48.
2.6.4 Middleware
Middleware is software that enables one or more kinds of communication or connectivity
between applications or application components in a distributed network. By making it easier
to connect applications that weren't designed to connect with one another, and providing
functionality to connect them in intelligent ways, middleware streamlines application
development and speeds time to market.
It does this by providing services that enable different applications and services to
communicate using common messaging frameworks such as JSON (JavaScript Object Notation),
Representational State Transfer (REST), Extensible Markup Language (XML), Simple Object
Access Protocol (SOAP), or web services. Typically, middleware also provides services that
enable components written in multiple languages, such as Java, C++, PHP, and Python, to talk
with each other.
Middleware Services
In addition to providing this work-saving interoperability, middleware also includes services
that help developers:
– Configure and control connections and integrations
• Based on information in a client or front-end application request, middleware can
customize the response from the back-end application or service. In a retailer's
e-commerce application, middleware application logic can sort product search
results from a back-end inventory database by nearest store location, based on the
IP address or location information in the HTTP request header
– Secure connections and data transfer
• Middleware typically establishes a secure connection from the front-end application
to back-end data sources using Transport Layer Security (TLS) or another network
security protocol. It can also provide authentication capabilities, challenging
front-end application requests for credentials (username and password) or digital
certificates
– Manage traffic dynamically across distributed systems
• When application traffic spikes, enterprise middleware can scale to distribute client
requests across multiple servers, on premises or in the cloud. And concurrent
processing capabilities can prevent problems when multiple clients try to access the
same back-end data source simultaneously
Types of middleware
To explore the list of Independent Software Vendors (ISVs) for IBM LinuxONE, you can check
out the IBM LinuxONE Partner Network (LPN) program. This program helps ISVs easily port,
certify, and deploy applications on IBM LinuxONE. ISVs can also gain access to go-to-market
resources and a rich set of learning tools for skill development.
2.6.6 Databases
IBM LinuxONE supports a variety of databases. IBM LinuxONE servers are designed for
secure data serving and can run multiple Linux-based workloads that include Oracle
Database 19c, Oracle WebLogic Server, open source, blockchain, and other Linux-based
commercial software.
This article gives an overview of IBM LinuxONE support for open source databases such as
MongoDB, PostgreSQL, and MariaDB.
2.7 Management
Managing an IBM LinuxONE environment can be accomplished using several tools that
depend on the virtualization environment. This section covers IBM Operations Manager, z/VM
Centralized Manager, and IBM Infrastructure Suite for z/VM and Linux. The available KVM
and Container Management tools are also discussed.
When using Centralized Service Management, one system is designated as the principal
system. This system uses the Shared File System (SFS) to manage service levels for a set of
defined managed systems, regardless of their geographic location. The new SERVMGR
command uses Virtual Machine Serviceability Enhancements Staged/Extended (VMSES/E)
commands to apply service and local modifications, to build serviced content, and to drive the
transport of the packaged service to the managed systems.
The capabilities of IBM Infrastructure Suite for z/VM and Linux provide you with
comprehensive insight to efficiently control and support your IBM z/VM and Linux on IBM Z
systems environments.
Products included with IBM Infrastructure Suite for z/VM and Linux:
– IBM Tivoli® OMEGAMON® XE on z/VM and Linux
– IBM Spectrum® Protect Extended Edition
– IBM Operations Manager for z/VM
– IBM Backup and Restore Manager for z/VM
– IBM Tape Manager for z/VM (Optional)
– ICIC - IBM Cloud® Infrastructure Center (Optional)
Libvirt:
Libvirt is a toolkit that provides a consistent and stable API for managing various virtualization
solutions, including KVM. It offers command-line utilities (virsh) and APIs for management; a
small example of the Python API follows. See: KVM Management Tools.
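The following Python sketch shows, for illustration only, how the libvirt API can be used to
inventory the guests that a KVM host manages and whether each one is running; the connection
URI is an assumption and can be replaced with a remote URI for centralized management.

   import libvirt

   # Connect to the local KVM host (adjust the URI for remote management).
   conn = libvirt.open("qemu:///system")

   # List every defined guest and report whether it is currently running.
   for dom in conn.listAllDomains():
       state = "running" if dom.isActive() else "shut off"
       print(f"{dom.name()}: {state}")

   conn.close()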
Virt-manager:
Virt-manager is a desktop application for managing KVM virtual machines. It provides a
user-friendly graphical interface to create, view, modify, and manage virtual machines. See:
Good GUI for KVM.
WebVirtMgr:
WebVirtMgr is a web-based interface for managing KVM virtual machines. It provides a
browser-accessible platform to create and control VMs. See: Reddit
handle. With Container management, IT teams can keep their environment more secure, and
developers can explore its flexibility to create and deploy new apps and services.
2.7.4 Kubernetes
Kubernetes is an open source platform for managing containerized workloads and services,
that facilitates both declarative configuration and automation. Kubernetes is portable,
extensible, and has a large and rapidly growing ecosystem. Kubernetes services, support,
and tools are widely available.
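For illustration, the following Python sketch uses the Kubernetes client to check the Ready
condition of each cluster node, one of the basic health signals that an operator or an
automation tool would watch; it assumes that a kubeconfig file is available on the management
workstation.

   from kubernetes import client, config

   # Load credentials from the local kubeconfig (assumed to exist).
   config.load_kube_config()
   v1 = client.CoreV1Api()

   # Print the Ready condition of every node in the cluster.
   for node in v1.list_node().items:
       ready = next(
           (c.status for c in node.status.conditions if c.type == "Ready"),
           "Unknown")
       print(f"{node.metadata.name}: Ready={ready}")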
RHOCP provides an abstraction layer with the same experience regardless of the cloud
deployment model (on-prem, public cloud, private cloud and hybrid cloud) and hardware
architectures (x86, s390x, ppc64le and arm) and it is used as the foundation for how IBM
distributes software in a containerized format.
Figure 2-13 shows the deployment options for RHOCP Container Platform on IBM LinuxONE.
Figure 2-13 Deployment options for Red Hat OpenShift Container Platform on IBM LinuxONE
2.8 Automation
The IBM LinuxONE platform can exploit several Automation tools and processes depending
on the chosen virtualization method. This section discusses Data Replication, Copy Services
Manager, Automation with Linux HA, Ansible, IBM Cloud Infrastructure Center, and GDPS.
You can use Copy Services Manager to complete data replication tasks and help reduce the
downtime of critical applications.
Starting with DS8000 Version 8.1, Copy Services Manager also comes preinstalled on the
Hardware Management Console (HMC). Therefore, you can enable the Copy Services
Manager software that is already on the hardware system. Doing so results in less setup time;
and eliminates the need to maintain a separate server for Copy Services functions.
Copy Services Manager can also run on Linux on IBM Z and uses the Fiber Channel
connection (FICON) to connect and to manage Storage Systems' count-key data (CKD)
volumes.
Pacemaker
Pacemaker is an open source high-availability cluster resource manager software that runs
on a set of nodes. Together with Corosync, an open source group communication system that
provides ordered communication delivery, cluster membership, quorum enforcement, and
other features among the nodes, it helps detect component failures and orchestrate
necessary failover procedures to minimize interruptions to applications.
Pacemaker can supervise and recover from failures within a cluster. The components of
Pacemaker architecture are:
Cluster Information Base (CIB)
• The CIB is the Pacemaker information daemon. It uses XML to distribute and
synchronize the current configuration and status from the Designated Coordinator (DC).
The DC node is assigned by Pacemaker to store and report cluster state and
actions to the other nodes using the CIB. There is a Cluster Information Base in
each host instance
• The XML list of behaviors, directed by the resource manager, informs the policy engine
Cluster Resource Management daemon (CRMd)
• Pacemaker uses this daemon to route cluster resource actions. Resources
managed by this daemon can be queried, moved, instantiated and changed when
needed
• Communicates with the Local Resource Manager daemon on each cluster node
• The Local Resource Manager receives instructions from the CRMd and passes
requests along to local resource agents (VirtualDomain, FileSystem, MailTo) for
general operations
• Defines one Designated Focal Point for the cluster
• The other hosts receive data from the CRMd via corosync
Shoot the Other Node in the Head (STONITH)
• STONITH is the Pacemaker fencing agent; it detects when contact with one of the
nodes in the cluster is lost
• If Pacemaker "thinks" a node is down, STONITH forces it offline
• STONITH forcibly shuts down and fences nodes, removing them from the cluster to
maintain data integrity
corosync
• corosync is a component/daemon that handles the core membership enrollment
and the required communication among the members for cluster high availability for
any Linux instance enrolled in an HA cluster
• corosync is a required component and daemon in a Linux HA cluster
• It manages quorum rules and quorum determination, and starts/stops the virtual machines
Policy Engine
The Policy Engine is a software component that helps manage policies for clusters. It
defines what end users can do on a cluster and ensures that clusters can communicate.
Any time a Kubernetes object is created, a policy evaluates and validates or mutates the
request. Policies can apply across a namespace or to different pods with a specific label
in the cluster. Kubernetes policy engines block objects that could harm or affect the
cluster if they do not meet the policy’s requirements
• Takes the list of behaviors from the CIB and maps it to the cluster’s current state
Heartbeat
The heartbeat of a node in a cluster is a signal that is sent between nodes to indicate
that they are still alive and functioning properly. It is used to detect when a node has failed
and to initiate failover procedures. In some cases, the heartbeat is also used to monitor the
health of a node and to initiate maintenance actions
– Uses messaging between nodes to make sure they are alive and available
– Determines whether an action is required when the heartbeat stops after a certain number of tries
Cluster-glue
Cluster Glue is a set of libraries, tools, and utilities used in the Heartbeat / Pacemaker
cluster stack. In essence, Glue comprises the parts of the cluster stack that do not fit
anywhere else: everything that is neither the messaging layer nor the resource manager
Resource-agents
Resource agents are scripts that allow Pacemaker to manage any service it knows
nothing about. They contain the logic for what to do when the cluster wishes to start, stop,
or check the health of a service (a minimal sketch follows this list)
– A resource agent is an executable that manages a cluster resource. No formal
definition of a cluster resource exists, other than "anything a cluster manages is a
resource". Cluster resources can be as diverse as IP addresses, file systems,
database services, and entire virtual machines, to name a few examples
– Resource agents run on clustered systems or remotely
– Resource agents are able to start, stop, or restart services
– Quorum ensures that the majority of a given cluster agrees on the state of its resources
– Votequorum provides an interface for members to reach that agreement
– It also provides messaging for applications coordinating and operating across multiple
members of a cluster
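The following Python sketch illustrates, under stated assumptions, the shape of an OCF-style
resource agent: it accepts an action argument (start, stop, or monitor) and returns the
standard OCF exit codes (0 for success, 7 for not running). The state-file approach and the
file path are purely illustrative; production agents also implement the meta-data and
validate-all actions and take their parameters from OCF_RESKEY_* environment variables.

   import os
   import sys

   OCF_SUCCESS = 0
   OCF_ERR_GENERIC = 1
   OCF_NOT_RUNNING = 7

   STATE_FILE = "/var/run/demo-resource.state"  # illustrative only

   def start():
       # "Starting" the dummy resource just records that it is active.
       open(STATE_FILE, "w").close()
       return OCF_SUCCESS

   def stop():
       # Stopping removes the state file; stop must be idempotent.
       if os.path.exists(STATE_FILE):
           os.remove(STATE_FILE)
       return OCF_SUCCESS

   def monitor():
       # Pacemaker calls monitor to learn whether the resource is running.
       return OCF_SUCCESS if os.path.exists(STATE_FILE) else OCF_NOT_RUNNING

   if __name__ == "__main__":
       action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
       handlers = {"start": start, "stop": stop, "monitor": monitor}
       sys.exit(handlers.get(action, lambda: OCF_ERR_GENERIC)())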
IBM Cloud Infrastructure Center (ICIC) is a software platform for managing the infrastructure
of private clouds on IBM LinuxONE. It is an IaaS offering that delivers an industry-standard
user experience for the IaaS management of non-containerized and containerized workloads on
IBM LinuxONE.
The final goal is to provide continuous availability. That means it is very important to have
capabilities for problem and failure detection, as well as recovery automation processes, in
case one of the layers or components malfunctions and needs recovery actions.
Multi-site environment
IBM GDPS technology provides a total business continuity solution for multi-site
environments. It also provides a collection of end-to-end automated disaster-recovery
solutions on the IBM LinuxONE platform, each addressing a different set of IT resiliency
goals that can be tailored to meet the recovery objectives for your business.
GDPS is a collection of several offerings, each addressing a different set of IT resiliency goals
that can be tailored to meet the RPO and RTO for your business.
Each GDPS offering uses a combination of system and storage hardware or software-based
replication and automation, and clustering software technologies:
• Metro
• Metro HyperSwap Manager
• IBM Virtual Appliance
• Extended Disaster Recovery (xDR)
• Global (also known as GM)
• Metro Global (also known as MGM)
• Continuous Availability
GDPS Metro
A near-continuous availability (CA) and disaster recovery (DR) solution across two sites
separated by metropolitan distances. The solution is based on the IBM Metro Mirror synchronous
disk mirroring technology and provides continuous availability, disaster recovery, and
production system/sysplex resource management capabilities. Because it is based on synchronous
mirroring, it can achieve an RPO of zero. Typically, the recovery time is less than one hour
(RTO < 1 hour) following a complete site failure. GDPS Metro also supports HyperSwap to
provide near-continuous disk availability after a disk failure.
GDPS VA supports both planned and unplanned situations, which helps to maximize
application availability and provide business continuity. In particular, a Virtual Appliance
solution can deliver the following capabilities:
– Near-continuous availability solution
– Disaster recovery (DR) solution across metropolitan distances
– Recovery time objective (RTO) less than an hour
– Recovery point objective (RPO) of zero
The Virtual Appliance can manage starting and stopping z/VM or KVM hypervisors, and thus
all Linux guests running on the hypervisors. It is a fully integrated software solution that
provides continuous availability and disaster recovery (CA/DR) protection to Linux on Z, z/VM,
and KVM. If your Linux on Z production workloads have CA or DR requirements, the Virtual
Appliance can help you meet those requirements.
A Virtual Appliance environment is typically spread across two data centers (Site1 and Site2)6
where the primary copy of the production disk is normally in Site1. The Appliance must have
connectivity to all the Site1 and Site2 primary and secondary devices that it will manage. For
availability reasons, the Virtual Appliance runs in Site2 on local disk that is not mirrored with
Metro Mirror. This provides failure isolation for the appliance system to ensure that it is not
impacted by failures that affect the production systems and remains available to automate
any recovery action.
VA HyperSwap function
The Virtual Appliance delivers a powerful function known as HyperSwap. HyperSwap
provides the ability to swap from using the primary devices in a mirrored configuration to
using what had been the secondary devices, transparent to the production systems and
applications using these devices.
Without HyperSwap, a transparent disk swap is not possible. All systems using the primary
disk would need to be shut down (or might have failed, depending on the nature and scope of
the failure) and would have to be re-IPLed using the secondary disks. Disk failures are often a
single point of failure for the entire production environment.
xDR is the product that allows GDPS to communicate with and manage the following
environments: z/OS Proxy (z/OS monoplex systems outside of the sysplex), z/VM, KVM,
SSC (IDAA), and Linux in an LPAR. To provide these capabilities, they must run System
Automation for Multiplatforms (SAMP) with the separately licensed xDR feature.
The proxy nodes communicate commands to and from z/VM, monitor the z/VM environment, and
communicate status and failure information back to the Virtual Appliance.
The xDR KVM Proxy is delivered as a Linux RPM for SLES or RHEL, or as a DEB package for
Ubuntu. The proxy guest serves as the middleware between the Metro controlling system and
KVM. It communicates commands to and from KVM, monitors the KVM environment, and
communicates status information back to the Metro controlling system.
KVM does not provide a HyperSwap function. However, GDPS Metro coordinates planned and
unplanned HyperSwap for the CKD disks of Linux under z/VM and Linux under KVM to maintain
data integrity, and it controls the shutdown and restart in place of the KVM LPARs. For disk
or site failures, GDPS Metro provides a coordinated Freeze for data consistency on CKD disks
across KVM, z/VM, and Linux LPARs.
This function is especially valuable for customers who share data and storage subsystems
between z/OS and Linux z/VM guests on IBM Z or SUSE Linux running native on IBM Z
LPARs. For example, an application server running on Linux on IBM Z and a database server
running on z/OS.
GDPS Metro can provide this capability when Linux is running as a z/VM guest or native.
Using the HyperSwap function, the virtual device associated with one real disk can be
swapped transparently to another disk, so HyperSwap can be used to switch to secondary disk
storage subsystems that are mirrored with Metro Mirror. If there is a hard failure of a
storage device, GDPS coordinates the HyperSwap with z/OS for continuous availability spanning
the multi-tiered application. HyperSwap is supported for ECKD and xDR-managed FB disk.
For site failures, GDPS invokes the Freeze function for data consistency and rapid application
restart, without the need for data recovery. HyperSwap can also be helpful in data migration
scenarios to allow applications to migrate to new disk volumes without requiring them to be
quiesced.
When using ECKD formatted disk, GDPS Metro can provide the reconfiguration capabilities
for the Linux on IBM Z servers and data in the same manner as for z/OS systems and data.
To support planned and unplanned outages, these functions have been extended to KVM with
GDPS V4.5 and above, which provides the recovery actions such as the following examples:
Re-IPL in place of failing operating system images.
Heartbeat checking of Linux guests.
Disk error detection.
Data consistency with freeze functions across z/OS and Linux.
Site takeover/failover of a complete production site.
Single point of control to manage disk mirroring configurations.
Coordinated recovery for planned and unplanned events.
Additional support is available for Linux running as a guest under z/VM. This includes:
Re-IPL in place of failing operating system images.
Ordered Linux node or cluster start-up and shut-down.
Coordinated planned and unplanned HyperSwap of disk subsystems, transparent to the
operating system images and applications using the disks.
Transparent disk maintenance and failure recovery with HyperSwap across z/OS and
Linux applications.
For more information about GDPS offerings, see: IBM GDPS: An Introduction to Concepts
and Capabilities, SG24-6374.
Table 2-9 shows the capabilities of the various GDPS components that are applicable to the
LinuxONE resiliency scenarios discussed in detail in the next chapter.
By design, the IBM LinuxONE platform is a highly resilient system. Through RAS design
principles, the LinuxONE Architecture has built-in self-detection, error correction, and
redundancy. The LinuxONE platform reduces single points of failure to deliver the best
reliability of any enterprise system in the industry. Transparent processor sparing, and
dynamic memory sparing enable concurrent maintenance and seamless scaling without
downtime.
Such a single-system environment has redundancy in the hardware and defines resiliency
based on application and service availability.
For resiliency reasons, sensitive enterprises deploy IBM LinuxONE systems redundantly, in
different topologies. The naming that was historically used for these deployments now maps to
different terms in the distributed world.
A first deployment option with IBM LinuxONE and its corresponding distributed naming is:
– Hot – Cold, which corresponds to the distributed naming Active – Passive
A second deployment option with IBM LinuxONE and its corresponding distributed naming is:
– Hot – Warm, which is known in the distributed world as Active – Idle
The most effective deployment for resiliency with IBM LinuxONE and its corresponding
distributed naming is:
– Hot – Hot, which is known in the distributed world as Active – Active
Such a deployment involves two data centers, each with at least one LinuxONE server and its
associated storage servers, network infrastructure, and services. The main data center, also
named herein the PROD data center, is the one where the main production runs, and the DR
(Disaster Recovery) data center is the second data center.
The specialty of such an active-passive environment is that, for cost efficiency, clients
keep the system in the DR data center shut down, with active storage servers in PROD and DR
for permanent data replication. In case of a disaster, for resiliency, the system in the DR
data center is activated and PROD is started to run in the DR data center.
The storage mirroring can be implemented on Storage server level and can be synchronous
replication called Metro Mirror or it can be asynchronous, called Global Mirror.
In such an environment, the PROD environment runs the entire production workload, but the DR
data center has a system that is started and has some basic setup active, such as
hypervisors, replication software for data, or special software for automatic failover such
as IBM GDPS. The storage servers in PROD and DR are set up for replication and mirroring of
all data from PROD.
Similar to the first case, storage mirroring can be implemented on Storage server level and
can be synchronous replication called Metro Mirror or it can be asynchronous, called Global
Mirror.
For container workloads, the mirroring can also be implemented on a logical level if storage
replication software, such as IBM Storage Fusion Data Foundation or IBM Storage Scale
servers, is actively running on the DR site.
The most effective resiliency can be reached in an environment where two sites are active at
the same time and are set up to take over the entire PROD workload whenever one site fails or
suffers a disaster.
For this environment, it is necessary to have a storage infrastructure that can share the
storage between the two sites if the same applications are running at the same time in both
sites.
Such a scenario can be realized, and it depends on the type of workload that is used, whether
it is traditional VM-based workload, container-based workload, or a combination of both.
2.9.2 Once storage, hardware, network and virtual infrastructure are covered,
reflect on the workload availability
– Fail-over / DR partitions:
• for when your entire partition needs to reappear in another location
– Mind volume access and crypto / key access
– VM duplication – having Virtual Machines (VM) or containers available to run workload
in an emergency
• Doesn’t avoid an outage, but having the spare VM may shorten the Mean Time to
Repair (MTTR)
– “Dead guest relocation”
• (shutdown and bring-up may be faster than guest mobility)
Next, we expand the implementation to two-site and three-site scenarios. We also explore each
scenario's capabilities, as well as possible variations in terms of their implemented resources.
As shown in Figure 3-1, in this proposed scenario we have one IBM LinuxONE system
running on a single physical site. The LinuxONE CPC is running z/VM with multiple Linux
guests to support the workload. The LinuxONE CPC is configured to use Single System Image
(SSI) and IBM Operations Manager for z/VM.
Benefits:
• Relies only on LinuxONE platform RAS (Reliability, Availability, and Serviceability)
Improvements:
• Resiliency for a single IBM LinuxONE platform can be enhanced by using a
clustering technology with data sharing across two or more Linux images
• Introduce an additional Storage Subsystem to extend access to data in case of a
hardware failure
• Synchronous copy with Peer-to-Peer Remote Copy (PPRC) and HyperSwap would
improve recovery time and resiliency
• Develop and implement processes for handling problem and change management
• Review or plan application and process upgrades to ensure that rapid failover can be
accomplished
• Implement and exploit z/VM SSI and IBM Operations Manager for z/VM
As shown in Figure 3-2 on page 78, in this proposed scenario we have two IBM LinuxONE
systems running on a single physical site. The LinuxONE CPCs are running z/VM with
multiple Linux guests to support the workload. LinuxONE CPCs are configured to use z/VM
Single System Image (SSI) and IBM Operations Manager for z/VM.
(Figure 3-2: Site I with two IBM LinuxONE systems running four z/VM LPARs in an SSI cluster,
Storage Systems I and II with ECKD volumes and replication, and network IP links.)
Note that running your workload on a single site constitutes a single point of failure. Any
disturbance in the electrical facilities or in any other infrastructure capabilities at the
site will cause an unplanned outage.
The benefits this scenario brings are related to the fact that your data is fully replicated
between the Storage Subsystems using synchronous data replication, which covers you in case
of a Storage Subsystem malfunction (see 3.2.3, “Synchronous data replication” on page 80),
and z/VM SSI, which allows live migration of Linux guests between the four z/VM LPARs.
Also, having two LinuxONE systems running the workload allows for extended resiliency.
In case of a hardware or software failure in one of the footprints, the remaining active
system can potentially run the entire workload. The impact of a system failure depends on how
the workloads are split between both systems and on the ability of the surviving system to
run them. Clients can use Capacity on Demand or Flexible Capacity for Cyber Resiliency
capabilities to bring the system up to the capacity that is required to run the workload in
one of the footprints.
In a more elaborate example, shown in Figure 3-3 on page 79, we have four z/VM LPARs and one
Linux on Z LPAR defined, and three workload groups: one Kubernetes cluster group formed by
partitions I, II, and III, while LPAR IV holds a web server and partition V runs a database
workload. The Storage Subsystems data is being replicated with synchronous replication,
eliminating the storage single point of failure. In the example, z/VM is running on Logical
Partitions I, II, III, and IV, while Logical Partition V is running Linux on Z.
Note that the LPARs configuration is “mirrored” on both LinuxONE systems to share the
workload and to facilitate the recovery process in a rare case of a system hardware or
software failure.
(Figure 3-3: Site 1 with Workloads I, II, and III, for example a Kubernetes cluster, a web
server, and a database, distributed across Linux virtual machines under z/VM, plus a GDPS
Virtual Appliance partition and Storage Systems with synchronous replication.)
For performance reasons, in this scenario we opted to use HyperPAV technology which
allows concurrent I/O operations to a single disk volume.
Figure 3-4 on page 80 shows how HyperPAV is exploited by z/VM. Multiple I/O operations are
initiated by the z/VM guests and handled by the z/VM I/O Subsystem, which is responsible for
the selection of the base device address and the available aliases, allowing parallel
simultaneous operations to the same volume to take place. The z/VM Control Program (CP) I/O
Subsystem drives real concurrent Start Subchannel (SSCH) I/O operations to the alias devices.
Both the base and the alias device addresses must be in the same subchannel set (SS) to
be properly recognized by z/VM. Refer to the documentation at:
https://www.ibm.com/docs/en/zvm/7.3?topic=administration-multiple-subchannel-set-support
(Figure 3-4: guest I/O requests queued by the z/VM CP I/O Subsystem, with one I/O queue per
volume, driving concurrent real SSCHs to the base device and its aliases on Storage System 1
for a volume holding guest minidisks.)
PPRC is a hardware solution which provides rapid and accurate disaster recovery as well as
a solution to workload movement and device migration. Updates made on the primary DASD
volumes are synchronously shadowed to the secondary DASD volumes. The local storage
subsystem and the remote storage subsystem are connected through a communications link
called a PPRC path. The protocol used to copy data using PPRC is Fibre Channel Protocol.
PPRC is configured in the Storage Subsystems by using local administration capabilities. In
the case of the IBM DS8K, one way of performing the configuration is by using its Hardware
Management Console (HMC) or its command-line interface (CLI).
2 In this publication the terms FB (Fixed Block) and FBA (Fixed Block Architecture) are used interchangeably.
Fixed-block architecture (FBA) is an IBM term for the hard disk drive layout in which each
addressable block on the disk has the same size, utilizing 4 byte block numbers and a new set
of command codes.
Besides supporting the emulation of CKD volumes, the DS8000 series Storage Server (for
instance, and other vendors’ storage servers) support the definition of FBA volumes. These
are known as Logical Unit Numbers (LUNs) in a Storage Area Network (SAN).
Extended Count Key Data (ECKD) is a direct-access storage device (DASD) data recording
format introduced in 1964, by IBM with its IBM System/360 and still being emulated today. It is
a self-defining format with each data record represented by a Count Area that identifies the
record and provides the number of bytes in an optional Key Area and an optional Data Area.
ECKD blocks are usually addressed by CCHHR (CC=Cylinder; HH=Head <i.e. Track>; R =
Record <i.e. Block>). This is in contrast to devices using fixed sector size or a separate format
track.
As mentioned above, note that internally all modern storage servers use FBA. For example,
the DS8000 storage servers emulate CKD DASD volumes, but the underlying technology is all
FBA.
If you use KVM with FBA disks, then IBM Storage Scale can be an alternative for clustering.
IBM Storage Scale, based on technology from IBM General Parallel File System (herein referred
to as IBM Storage Scale or GPFS), is a high-performance shared-disk file management
solution that provides fast, reliable access to data from multiple servers.
Applications can readily access files using standard file system interfaces, and the same file
can be accessed concurrently from multiple servers and protocols. IBM Storage Scale is
designed to provide high availability through advanced clustering technologies, dynamic file
system management, and data replication. IBM Storage Scale can continue to provide data
access even when the cluster experiences storage or server malfunctions.
Network
Network capabilities of the LinuxONE platform are discussed in 2.5, “LinuxONE Virtual
Network” on page 51.
• Review, plan and test applications and processes for changes and upgrades to
ensure a rapid failover can be accomplished.
• Implement / Exploit z/VM SSI and z/VM Operations Manager.
(Figure: Site I and Site II, each with multiple workloads running on z/VM and Linux on Z, with
synchronous replication between the Storage Systems.)
GDPS VA
GDPS VA is exclusive to IBM LinuxONE. It includes GDPS Metro and GDPS xDR capabilities
which are used in Scenario 2 Hot / Warm and Hot / Hot examples. See “GDPS Virtual
Appliance (VA)” on page 67.
GDPS Metro
GDPS Metro also has the capability to manage the Multi-Target Metro Mirror configuration,
extending PPRC management and HyperSwap capabilities to support two synchronous
Metro Mirror relationships from a single primary volume. Each leg is tracked and managed
independently. This provides additional data protection in the event of a disk subsystem
failure or local disaster scenario. When using ECKD formatted disk, GDPS Metro can provide
the reconfiguration capabilities for z/VM and its guests as well as Linux on IBM Z servers and
data. To support planned and unplanned outages these functions have been extended to
KVM on LinuxONE starting with GDPS V4.1.
Additional support is available for Linux running as a guest under z/VM. This includes:
Re-IPL in place of failing operating system images.
Ordered Linux node or cluster start-up and shut-down.
Coordinated planned and unplanned HyperSwap of disk subsystems, transparent to the
operating system images and applications using the disks.
Transparent disk maintenance and failure recovery with HyperSwap across Linux
applications.
GDPS xDR
GDPS Metro provides management extensions for heterogeneous platforms (GDPS xDR) to
be able to fully manage either z/VM systems and their guests, KVM and its guests, and Linux
native running on LinuxONE environments, providing full end-to-end support including disk
management with freeze and planned/unplanned HyperSwap support, systems management
and monitoring. This support also applies to z/VM and KVM on FBA formatted disk. See
GDPS xDR topics starting with “xDR for z/VM with VA” on page 69.
There are three configuration possibilities for setting up this two-site scenario:
1. Hot / Cold - (active - passive):
The workload is not split, and each site is configured to handle all operations. Manual
intervention is normally required to activate additional resources and the partitions in the
cold (passive) site. Because of that, a cold environment, which is often used in a DR
situation, needs longer to be activated.
The LinuxONE CPC on the secondary site might have a temporary capacity record installed,
such as On/Off Capacity on Demand (OOCoD), Capacity BackUp (CBU), or Flexible Capacity for
Cyber Resiliency, ready to be activated manually by the operations staff. See “Capacity on
Demand - (Temporary Upgrades)” on page 36.
The objective of the temporary capacity records listed above is to bring the secondary site
capacity to a level that would allow the partitions and workloads that were previously running
on the primary site to be executed on the secondary site without any impact to the end-user
or overall system performance. For more information about temporary capacity, refer to
Capacity on Demand User’s Guide.
Figure 3-6 on page 85 shows a more detailed example of an active - passive two-site
implementation. Note that in this scenario the LPARs on Site 2 are defined but not activated.
This is because the total resources required by the defined LPARs, which are the same as the
resources used by the primary site, might not be available until a temporary capacity record
is activated. Activating a temporary capacity record in this scenario requires manual
intervention using the LinuxONE HMC. Because of these manual interventions, the Recovery Time
Objective (RTO) can increase significantly.
Figure 3-6 Hot - Cold (Active - Passive) two site implementation example
In the event of a Site 1 failure, Site 2 needs to be activated manually, through GDPS VA
automation, or by using user-defined scripts.
Figure 3-7 on page 86 shows an implementation of a Hot / Warm (active - idle) environment.
Figure 3-7 Hot - Warm (Active - Idle) two site implementation example
Figure 3-8 on page 87 shows a more detailed two-site (active-active) implementation. The two storage controllers are synchronously replicated, which allows either site to become the production site without compromising the information that is written (and mirrored) to the disks in both controllers if one of the sites goes down. In a normal situation (both sites up), the workload can be shared between the CPCs in both sites.
Besides showing the workloads running in both sites, we added one partition (Logical Partition XII) that runs GDPS VA. For details about GDPS VA, see “GDPS Virtual Appliance (VA)” on page 67. Also note that in each z/VM image in Site 1 we added an xDR proxy. The xDR proxies are guests that are dedicated to providing communication and coordination between z/VM and the GDPS Virtual Appliance.
With active/active configurations you might have a router and a load balancer in front of the
cluster to balance and split the incoming requests among the active nodes in the cluster.
When a system fails, its service workload is migrated to an active node. When one active
member fails, the resources are still running on the other active members, and the new
incoming service is uninterrupted. Systems must have sufficient hardware resources to
handle extra work in case of an outage of one system; or work must be prioritized and service
restricted in case of a system failure.
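As a hedged illustration of this dispatch-and-failover behavior (the Dispatcher class and node names below are invented for this sketch and are not part of GDPS, z/VM, or any IBM product), the following Python snippet round-robins incoming requests across the configured nodes and skips members that are marked as failed:

from itertools import cycle

class Dispatcher:
    """Illustrative round-robin dispatcher that skips failed nodes."""

    def __init__(self, nodes):
        self.healthy = set(nodes)   # nodes currently considered active
        self._ring = cycle(nodes)   # fixed rotation over the configured nodes

    def mark_failed(self, node):
        self.healthy.discard(node)  # new requests no longer reach this node

    def mark_recovered(self, node):
        self.healthy.add(node)      # node must be one of the configured nodes

    def route(self, request):
        if not self.healthy:
            raise RuntimeError("no active nodes available")
        while True:
            node = next(self._ring)
            if node in self.healthy:
                return f"request {request!r} routed to {node}"

dispatcher = Dispatcher(["site1-node", "site2-node"])
print(dispatcher.route("tx-1"))
dispatcher.mark_failed("site1-node")   # surviving node keeps serving new work
print(dispatcher.route("tx-2"))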
Figure 3-8 Two-site (active-active) implementation example
Also, as shown in Figure 3-8, it is recommended that the GDPS VA partition be placed in the secondary (or DR) site. The xDR proxies are present in both sites, in each z/VM partition.
Figure: xDR proxies in each z/VM LPAR and the GDPS Virtual Appliance communicating between Site 1 and Site 2 over IP links (OSA)
With two sites, GDPS Virtual Appliance (VA) and GDPS Metro Multiplatform Resiliency for IBM Z (GDPS xDR) can be configured to automate the process of transferring workloads, swapping storage subsystem disk access, activating temporary capacity records, and restarting hypervisors such as z/VM and KVM and their respective guests, to keep production running in case of an unexpected site failure, a disaster, or a planned or unplanned outage.
z/VM SSI is applicable to both scenarios presented so far: a single site with one or more CPCs (Scenario 1), and a dual site within metro distance (Scenario 2). In either case, SSI requires communication between the z/VM images, which is normally provided by Channel-to-Channel (CTC) connections.
This is one of the offerings in the GDPS family, along with GDPS Metro, that offers the
potential of zero data loss, and that can achieve the shortest recovery time objective, typically
less than one hour following a complete site failure.
HyperSwap is also one of the only members of the GDPS family, again along with GDPS
Metro, that is based on hardware replication and that provides the capability to manage the
production LPARs.
There is a trade-off between overall DASD I/O subsystem performance and the ability to perform a quick, dynamic, and automated recovery process after an eventual storage subsystem failure.
PAV and HyperPAV benefit performance by allowing multiple I/Os to be performed concurrently against one disk volume.
HyperSwap improves recovery time in case of a failure in the data replication process or even a storage subsystem hardware incident.
With z/VM, the use of alias devices (PAV or HyperPAV) in a guest, whether dedicated or virtual, is considered an unsupported configuration for HyperSwap.
• Active-active requires that the machines at both sites have extra capacity, or temporary capacity available to be activated, allowing them to support all workloads.
• Exploits GDPS VA, GDPS xDR, GDPS Metro, and HyperSwap.
– Benefits:
• Provides near-continuous availability and continuous operations (active-active).
• Minimal impact due to system and data outages - Redundancy in software,
hardware and data.
• Full data replication between sites and storage subsystems.
• GDPS VA GUI allows configuration and monitoring of the environment without
requiring additional operations skills.
• With GDPS-VA and the selected scenario implementation, active-cold, active-warm
or active-active, the failover process can be automated using service management
processes.
• In a failover scenario, impact can be minutes to hours depending on the level of
detection and automation implemented.
– Improvements:
• Proper workload distribution and routing between sites and resources
(active-active).
• Test and exercise all possible DR failure situations.
• Determine the levels and frequency of maintenance of all installed equipment.
• Improve processes for handling change and automated problem management.
• Review or plan application updates to ensure rapid failover can be accomplished.
• Continuous Availability.
Figure 3-10 shows an environment where IBM Z is running z/OS in a Parallel Sysplex environment, along with distributed Linux in LPAR, native Linux, and z/VM and KVM Linux guests running on LinuxONE servers. (LinuxONE servers do not support running z/OS.)
Figure 3-10 Example of two IBM Z systems running z/OS Sysplex, Linux on Z, KVM and z/VM
GDPS Metro
GDPS Metro also has the capability to manage the Multi-Target Metro Mirror configuration,
extending PPRC management and HyperSwap capabilities to support two synchronous
Metro Mirror relationships from a single primary volume. Note that HyperSwap is supported
for ECKD and xDR managed FB disk. Each leg is tracked and managed independently. This
provides additional data protection in the event of a disk subsystem failure or local disaster
scenario.
GDPS Metro provides management extensions for heterogeneous platforms (xDR) to fully manage z/VM systems and their guests, Linux in LPAR, and KVM on LinuxONE environments, providing full end-to-end support that includes disk management with freeze and planned/unplanned HyperSwap support, systems management, and monitoring. This support also applies to z/VM and KVM on FBA-formatted disks.
As shown in Figure 3-10, GDPS Global requires two z/OS system images (K1 and K2) to manage, monitor, and execute recovery scripts.
A typical configuration has the secondary disk of a Metro Mirror Remote Copy configuration become, in turn, the primary disk of a Global Mirror Remote Copy pair. Data is replicated in a “cascading” fashion.
Metro Global Mirror is a method of continuous, remote data replication that operates among three sites that are varying distances apart. Metro Global Mirror combines Metro Mirror synchronous copy and Global Mirror asynchronous copy into a single session, where the Metro Mirror target is the Global Mirror source. Using Metro Global Mirror and Metro Global Mirror with HyperSwap, your data exists on a second site that is less than 300 km away and a third site that is more than 300 km away. Metro Global Mirror uses both Metro Mirror and Global Mirror Failover/Failback to switch the direction of the data flow. This ability enables you to run your business from the secondary or tertiary sites.
Note: If the client has zFBA in a shared ECKD Consistency Group and GDPS does a
HyperSwap, GDPS resets all non-HyperSwap capable systems ("inhibited", Linux on Z,
KVM, SSC) in case of an Unplanned HyperSwap (UHS). A Planned HyperSwap (PHS)
will complain about those systems being up and running and a PHS will not be
performed until all such systems are down.
Using Fixed Block disk management support allows GDPS to be a single point of control to
manage business resiliency across multiple tiers in the infrastructure, improving
cross-platform system management and business processes.
Figure 3-11 Multi-site fault tolerant option - continuous availability implementation example
To support this complex configuration, two z/OS partitions are required. Because LinuxONE cannot run z/OS, another IBM Z system footprint is required at each site. The IBM Z systems hold the z/OS LPARs and GDPS (with xDR), as well as the K1 and K2 controlling systems. The operating systems must run on servers that are connected to the same Hardware Management Console (HMC) local area network (LAN) as the GDPS control system.
– Benefits:
• Rapid restart.
• Multiple LPARs cloned.
• Data sharing (GDPS Metro - Global Mirror).
• GDPS full automation.
• Provisioning of spare capacity - auto activation by GDPS.
– Improvements:
• GDPS and SafeGuarded Copy for snapshots of system contents.
• Restoration of previous copies of workload, dependent upon forensic investigation.
Chapter 4 explores a configuration that is designed to allow this extended availability.
Disclaimer: IBM internal data based on measurements and projections was used in
calculating the expected value. Necessary components include IBM LinuxONE Emperor 4;
IBM z/VM V7.3 systems collected in a single system image, each running RHOCP 4.14 or
above; IBM Operations Manager; GDPS 4.6 for management of data recovery and virtual
machine recovery across metro distance systems and storage, including Metro multi-site
workload and GDPS Global; and IBM DS8000 series storage with IBM HyperSwap. A
MongoDB v4.2 workload was used. Necessary resiliency technology must be enabled,
including z/VM single system image clustering, GDPS xDR Proxy for z/VM, and Red Hat
OpenShift Data Foundation (ODF) 4.14 for management of local storage devices.
Application-induced outages are not included in the above measurements. Other
configurations (hardware or software) may provide different availability characteristics.
This chapter also discusses the means of availability calculation, IBM’s differentiation and
value-add to RHOCP workload, the required Resiliency configuration, potential points of
planned and unplanned outage, technical mitigations and means around service interruption,
disaster recovery, and cyber resiliency.
OpenShift Container Platform (OCP) provides continuous operation and service resiliency when running on IBM LinuxONE. By combining the Red Hat OpenShift Container Platform clustering technology with enterprise-level compute (IBM LinuxONE) and storage (IBM DS8K), along with an appropriate virtual infrastructure (IBM PR/SM, IBM z/VM and associated automation, and IBM GDPS), client workload can achieve over 99.999999% (eight nines) of uptime despite planned or unplanned outages.
This uptime can be achieved for data and enterprise services within the bounds of this
configuration.
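To put these percentages in perspective, the following Python snippet is an illustrative calculation only; it is not an IBM tool, nor the method behind the measured claim. It simply converts an availability percentage into the corresponding allowed downtime per year:

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime (minutes per year) that still meets the availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for label, pct in [("five nines", 99.999), ("eight nines", 99.999999)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.4f} minutes/year ({minutes * 60:.2f} seconds/year)")

Eight nines of uptime corresponds to roughly a third of a second of downtime per year, which is why the redundancy and automation assumptions that follow matter.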
Assumptions
The claims mentioned above are based on an optimum configuration for the calculation of availability for OpenShift Container Platform workloads running on LinuxONE hardware.
We assume proper configuration, whereby all components (hardware and software) have appropriate backup and redundancy to allow for outage mitigation and to ensure minimal performance impact, within client requirements:
– This includes ”free space” on systems (CPU, storage, memory) in case of planned outages and workload redistribution.
– This includes configuring z/VM and RHOCP for high availability and redundancy; virtual resources are resources, too.
We assume that the greater datacenter ecosystem is configured for redundancy and high
availability:
– This includes services such as DNS and load balancing for applications running in
RHOCP
– This includes physical cabling for networking.
– This includes power and electricity, fire suppression, and so on.
We assume a competent and non-malicious set of system administrators:
– This includes testing fixes on a development system that is isolated from production workloads.
We assume a properly configured z/OS system running GDPS and xDR exists
somewhere pertinent in the enterprise:
– Parallel Sysplex Best Practices:
http://www.redbooks.ibm.com/abstracts/sg247817.html?Open
– GDPS configuration:
For GDPS and xDR configuration guidance, refer to the IBM Redbooks publications listed in “Related publications”.
Threats
The resiliency model has been designed to withstand the following types of threats to service
availability:
Natural disasters
– The multi-site configuration is meant to support this, but flexibility depends on the size and scope of the disaster.
Planned outages to systems
Power outages
– Having backup generators and power failover should already be a physical availability
consideration.
Replication failures (non-malicious, non-user-error)
– Restoration of previous copies of workload, dependent upon forensic investigation
– Data integrity check following, for instance, a system upgrade
While security threats (cyber attacks, malicious users, and ransomware) are acknowledged as potential threats, they are outside the scope of this particular calculation. Refer to “The Value of Virtualization Security” (https://www.vm.ibm.com/devpages/hugenbru/L1SECV22.PDF) for more information about IBM LinuxONE security and cyber resiliency.
4.1.2 Differentiators
IBM LinuxONE hardware
The LinuxONE hardware is designed for maximal resiliency by using its RAS characteristics, such as core sparing, Redundant Array of Independent Memory (RAIM), and critical parts redundancy, for instance power supplies, cooling components, and the service network. LinuxONE provides vertical scalability of systems (more servers hosted means less hardware that can fail), and at given performance points, processor cores and memory can be added concurrently. Transaction security can be achieved through on-chip symmetric encryption or shared Crypto Express domains.
IBM GDPS
GDPS is used in support of data replication, mirroring, and HyperSwap of disk volumes.
No claims are made specifically about zCX and z/OS-hosted workloads in this document.
As in all cases, this guidance is not a client mandate; clients may optimize their environments to meet their needs, including modifications to achieve even higher redundancy.
1 Previous versions of z/VM supported a maximum of four members in a Single System Image.
Figure 4-1 Structural Diagram
Figure 4-2 expands one of the z/VM partitions from the diagram shown in Figure 4-1. Some important considerations about the single-partition view are:
Reminder: RHOCP does not run under Linux on IBM Z; it runs under an IBM Z hypervisor (z/VM or KVM).
RHEL-hosted workloads (or SLES, and so on) are not under discussion in this example.
The IBM hardware must be in PR/SM mode, not DPM mode.
z/VM is a requirement because of the necessity of HyperSwap. Each z/VM system contains two xDR proxies.
Figure 4-2 Single z/VM partition view: OCP control plane, compute, and infrastructure nodes with highly available applications, GDPS xDR, and Operations Manager
4.2.2 Network
Network implementation is shown in Figure 4-3 on page 100.
The implementation has multiple (three) virtual devices at the CoreOS level connected to a
z/VM Layer 2 Virtual Switch. Link aggregation is used to bind multiple OSA Ports across
multiple adapters on a CPC.
Separate connectivity is used for Metro Mirror (GDPS) replication. This implementation allows for network traffic separation at the virtual, logical, and physical layers in accordance with security requirements.
Figure 4-3 Network implementation: application containers on RHOCP control plane, compute, and infrastructure nodes connected through a z/VM Layer 2 VSWITCH
The heartbeat interval is the frequency with which the leader notifies followers that it is still the leader. As a best practice, the parameter should be set to approximately the round-trip time between members. By default, etcd uses a 100 ms heartbeat interval. The election timeout is how long a follower node will go without hearing a heartbeat before attempting to become the leader itself. By default, etcd uses a 1000 ms election timeout.
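As a hedged illustration of that sizing guidance (this is not an etcd or OpenShift utility; the helper, its defaults, and the 10x rule of thumb are assumptions based only on the values quoted above), the following Python sketch suggests heartbeat and election settings from a measured member-to-member round-trip time, never going below the documented defaults:

def suggest_etcd_timings(rtt_ms: float) -> dict:
    """Suggest etcd heartbeat interval and election timeout (ms) for a measured RTT."""
    heartbeat = max(100, round(rtt_ms))    # never go below the 100 ms default
    election = max(1000, heartbeat * 10)   # roughly 10x the heartbeat, at least 1000 ms
    return {"heartbeat-interval": heartbeat, "election-timeout": election}

# Members with a 12 ms round trip keep the defaults; a 150 ms round trip
# would suggest 150 ms / 1500 ms instead.
print(suggest_etcd_timings(12))
print(suggest_etcd_timings(150))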
z/VM considerations
– Analyze site-to-site latency.
– VSWITCH congestion: add network prioritization to give control plane traffic first access.
– Measure link aggregation to determine optimal port access.
Physical
– GDPS communication is IP-based; an outage there represents a catastrophic network problem (arrange z/OS cross-site connectivity accordingly).
– Network separation:
• Separate VSWITCHes, or direct-attached OSAs, for sensitive guests.
• VSWITCH VEPA mode to force traffic separation to the physical switch.
– Internet latency is not solvable by IBM, but local traffic can be optimized to give sensitive communications a running head start.
Figure 4-4 Red Hat OpenShift Data Foundation (ODF) providing storage to the OCP control plane, compute, and infrastructure nodes
HyperSwap
Refer to “VA HyperSwap function” on page 68.
– HyperSwap is IBM’s high availability solution that provides dual-site, active-active access to a volume that is shared across two physical sites. This means that data is continuously available at both sites, improving the availability of your business. HyperSwap is based on synchronous Peer-to-Peer Remote Copy (PPRC) technology: data is copied from the primary storage device to a secondary storage device.
Figure 4-5 shows DS8K storage being mirrored between Site A and Site B by using PPRC / Metro Mirror. Storage synchronous data replication (mirroring) is the foundational prerequisite for the HyperSwap solution to take place.
The blue lines connecting the partitions to the DS8K disks represent the “in-use” data access paths, while the PPRC / Metro Mirror green arrow represents the required synchronous data replication capability between sites.
The gray dotted lines represent the alternate access paths to storage in the event of a HyperSwap. HyperSwap can be monitored and controlled by GDPS. See Figure 4-5.
Figure 4-5 PPRC / Metro Mirror, GDPS xDR and HyperSwap between sites
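To make the swap behavior shown in Figure 4-5 concrete, the following Python sketch is a toy model only; it is not GDPS, z/VM, or DS8000 code, and the class and device names are invented for illustration. It shows the core idea: when the in-use copy becomes unusable, I/O is redirected to the synchronously mirrored copy.

class Device:
    """Stand-in for one storage copy; real mirroring is done by the DS8K/GDPS layer."""
    def __init__(self, name, failed=False):
        self.name, self.failed, self.blocks = name, failed, []

    def write(self, data):
        if self.failed:
            raise IOError(f"{self.name} unavailable")
        self.blocks.append(data)

class MirroredVolume:
    def __init__(self, primary, secondary):
        self.copies = {"primary": primary, "secondary": secondary}
        self.active = "primary"   # the blue "in-use" path in Figure 4-5

    def write(self, data):
        # Simplified: real Metro Mirror writes both copies synchronously before
        # the I/O completes; here we only model the access-path switch.
        try:
            self.copies[self.active].write(data)
        except IOError:
            # HyperSwap: switch to the mirrored copy (the gray alternate path).
            # GDPS drives this for the whole consistency group within seconds.
            self.active = "secondary" if self.active == "primary" else "primary"
            self.copies[self.active].write(data)

volume = MirroredVolume(Device("SiteA-DS8K"), Device("SiteB-DS8K"))
volume.write("record 1")
volume.copies["primary"].failed = True   # simulate the Site A storage failure
volume.write("record 2")                 # transparently lands on the Site B copy
print(volume.active, volume.copies["secondary"].blocks)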
Point of outage: Storage unit fails: one DS8K physical device (not a volume, not a link). See Figure 4-6.
Effect: All data on that storage unit is unavailable.
Mitigations: RAID and data striping to avoid integrity issues; GDPS mirroring to a second local storage unit, PPRC to a remote storage unit, and cyber vaulting via Safeguarded Copy.
Impact: negligible. With HyperSwap, storage failover happens within 2-3 seconds. (The z/VM VSWITCH is quiesced at this time as well, so there may be a momentary network blip.)
Time to recovery of the failing component: 2 hours to restore the physical unit, 4 hours to uptime. Additional time to mirror current data to the restored volume is required. (Business impact is not included in the calculation.)
Point of outage: Hardware unit outage (one LinuxONE 4 CPC in its entirety). See Figure 4-7.
Effect: Compute for an entire system (1-n logical partitions and hosted workload) is not available.
Mitigations: Core sparing for general workload bypass; RAIM to avoid memory corruption; physical redundancy inside a CPC to mitigate problems.
Time to recovery of the failing component: 4 hours typical for CPC repair. No data rebuild in this case (it is compute only).
Point of outage: Logical partition takes a wait state or has an outage. See Figure 4-8.
Effect: Compute for one partition and its hosted workload is gone.
Mitigations: SIE isolation to prevent impact to other workloads; workload is rescheduled on other compute nodes (OpenShift).
Impact: negligible. As with a hardware outage, z/VM and RHOCP workloads have the capacity to move to other partitions in the SSI and other systems in the cluster.
Time to recovery of the failing component: n/a. LPAR instantiation is near-immediate, and failure at the LPAR level almost always means a workload failure; see the following points of outage.
Impact: negligible. SSI allows guest mobility to relocate workloads; RHOCP reschedules work onto other compute nodes.
Time to recovery of the failing component: five minutes to reactivate the partition and IPL the z/VM system. Recovery time past that varies based upon the number of guests starting and the size of the z/VM partition (which affects the time to take a dump that is used in problem determination).
IPL of a z/VM system includes guests, virtual storage, GDPS xDR proxies, virtual networking, and RHOCP for that system. It does not include resynchronization operations at the various layers.
Point of outage: z/VM virtual networking encounters a problem. See Figure 4-10.
Effect: Traffic for z/VM or its guests is disrupted, leading to problems with heartbeats and system assessment, and disruption to the client workload.
Mitigations: Link aggregation to collect multiple OSA ports into a common logical channel; VSWITCH automatic failover (four controller nodes) in case of disrupted network operation; Inter-VSWITCH Link to coordinate across multiple z/VM systems.
Impact: negligible, as link aggregation collates resources and failover prevents an outage.
Time to recovery of the failing component: if physical networking, it varies. If virtual networking, 1-2 hours to debug and rebuild.
Point of outage: z/VM I/O links or channels take an outage. See Figure 4-10 on page 107.
Effect: Access to storage through the virtual infrastructure is disrupted; data cannot be read in or written out, potentially disrupting business.
Mitigations: Multiple channel paths for access to storage units; Operations Manager for local scripting to mitigate failures.
Impact: negligible, as data remains available even with a failing virtual or logical device. (If the last channel path standing fails, a HyperSwap is triggered.)
Time to recovery of the failing component: varies depending on whether the failure is physical or logical (2-3 seconds if it is a reboot of a channel path; hours if there is a physical problem or cabling issue).
Point of outage: z/VM Single System Image has a one-node failure. See Figure 4-10 on page 107.
Effect: Individual systems keep running, but guest mobility is no longer feasible; potential for split-brain problems in zoned workloads running under z/VM (RHOCP).
Mitigations: Systems in SAFE mode continue operating (manual intervention); SSI members do not STONITH; RHOCP should STONITH; SSI rejoins the missing member(s) after repairs are made and state is reaffirmed.
Impact: potential disruption to RHOCP quorum and workload dispatch if the SAFE-mode system remains unavailable, but workload is soon rebalanced.
Time to recovery of the failing component: see “Planned Outages for z/VM.” Recovery of one system varies based upon the damage; time to IPL remains constant.
Point of outage: z/VM Single System Image takes a hit to the Persistent Data Record (PDR). See Figure 4-11 on page 109.
Effect: The PDR volume maintains the state of the overall cluster and heartbeat tracking; if it is damaged, the whole cluster moves into recovery mode.
Mitigations: GDPS; the PDR is not handled separately from the rest of the GDPS consistency group.
Impact: minimal; HyperSwap takes 2-3 seconds. Note: the PDR becomes a single point of failure until the original PDR volume is restored.
Time to recovery of the failing component: once the storage device is repaired (see Point of Outage 1), Metro Mirroring resumes active backup.
Point of outage: xDR proxy fails. See Figure 4-12 on page 110.
Effect: The xDR satellite crashes on a given z/VM system, inhibiting GDPS coordination for that system.
Mitigations: Automatic failover to the backup xDR satellite on the z/VM system.
Time to recovery of the failing component: 2-3 seconds, then 2-3 minutes to re-IPL the xDR proxy guest.
Impact: n/a. Since GDPS is IP-based, this only happens if an entire network is destroyed.
Point of outage: CoreOS system crashes. See Figure 4-14 on page 112.
Effect: The z/VM system is running fine, but a Red Hat guest has taken a kernel panic, disrupting workload.
Mitigations: RHOCP checks control status and reschedules workload if pertinent; z/VM re-IPLs the guest as pertinent (automatable).
Time to recovery of the failing component: n/a. There is not really a CoreOS instantiation independent of RHOCP operations, so this is not a distinct point of outage from an RHOCP crash.
Point of outage: CoreOS loses storage devices. See Figure 4-15.
Effect: The Linux instance loses access to the storage that is needed for the hosted applications.
Mitigations: Linux stays up and can save to the page cache; when the device is recovered, the data is stored on disk. If the guest crashed, reboot the guest after the storage connection is re-established. If the storage is invalid, reinstall the node.
Time to recovery of the failing component: n/a. There is not really a CoreOS instantiation independent of RHOCP operations, so this is not a distinct point of outage from an RHOCP crash.
Impact: temporary loss of compute related to Control for certain programs; workload is redistributed or restarted on other nodes as pertinent.
Time to recovery of the failing component: 30-60 minutes, assuming automation and the presence of backups. (The system administrator needs to reinstall from scratch and have the new instantiation of the node rejoin the cluster. No data recovery per se.)
Point of outage: Planned outage for a CPC or storage unit. See Figure 4-17 on page 116.
Effect: Hardware down, or firmware down for MCL application.
Mitigations: z/VM relocates work to alternate hardware and RHOCP redistributes compute to available nodes (no outage); IBM HyperSwap points the workload to the alternate storage volume (no outage).
Point of outage: Planned outage for z/VM. See Figure 4-17.
Effect: One z/VM system is down for PTF application.
Mitigations: The z/VM administrator (or GDPS) relocates work to the alternate z/VM system(s); RHOCP redistributes compute to available nodes. No outage.
For more information, see:
• https://www.ibm.com/docs/en/zvm/7.3?topic=members-removing-service
• https://www.ibm.com/docs/en/zvm/7.3?topic=summaries-vmfrem-exec
• https://www.youtube.com/watch?v=ISN39CWEk7k
Generally, physical data centers (and full recovery thereof) are measured at five nines of availability (99.999%).
Data center providers will have SLAs around availability and resources for the same.
Calculating availability based upon data centers around the world is less reliable than for hardware and software.
Standards vary in adoption and locality; see, for example, ANSI/BICSI 002-2019.
A tiered approach is used, based upon how much a client can tolerate (or afford).
For less extreme problems (water, electricity, air conditioning), it is understood that data centers are meant to provide duplication of resources, backup generators, alternate lines, and disparate power grids (for maintenance and planned outages).
The rate of failure is understood to be low.
However, in a state of disaster, restoring the physical assets may be extremely difficult or impossible.
Having geographically dispersed data centers minimizes the chance of a single disaster hitting both data centers simultaneously (see the sketch after this list).
Having data replication via GDPS matters, whether or not the presented framework is configured active-active.
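To illustrate the dispersed-site point above, the following Python snippet is a back-of-the-envelope calculation only, not an IBM availability model. It assumes site failures are statistically independent, which is what geographic dispersion is meant to approximate, and uses the five-nines per-site figure quoted in this section.

def all_sites_down(site_availability: float, sites: int = 2) -> float:
    """Probability that every site is down at the same time, assuming independence."""
    return (1 - site_availability) ** sites

one_site = 1 - all_sites_down(0.99999, sites=1)   # 99.999% (five nines)
two_sites = 1 - all_sites_down(0.99999, sites=2)  # roughly ten nines
print(f"one site : {one_site:.5%} available")
print(f"two sites: {two_sites:.8%} available "
      f"(both down only {all_sites_down(0.99999):.1e} of the time)")

The independence assumption is exactly what a regional disaster can break, which is why distance between sites and replication via GDPS both matter.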
Point of outage: One entire data center is lost. See Figure 4-18 and Figure 4-19 on page 119.
Effect: Loss of all compute, network, and storage in that building.
Mitigations: As described previously, redundant electrical, water, and power infrastructure to keep the data center running; GDPS restarts virtual machines on surviving cluster members, or z/VM itself in new partitions; GDPS points compute to surviving storage; z/VM can restart virtual machines on surviving cluster members; RHOCP redirects workload onto surviving compute nodes when it detects the loss of Compute or Control virtual machines.
Impact: reduced white-space overhead for future failover, reduced redundancy opportunities for compute and storage, and a change in network latency patterns. Because this is an unplanned outage, restarting multiple systems’ worth of workload may take time. RHOCP restarts applications (or fails over to backups) as pertinent, and z/VM Operations Manager needs to trigger the virtual machine IPLs (minutes).
Time to recovery of the failing component: depends on the nature of the disaster; somewhere between days, weeks, and years. Workload can continue, but at greater risk if other problems occur.
The framework covered in this chapter illustrates continuous operation of RHOCP workload in the context of approximately three partitions, backed by GDPS xDR:
– A robust disaster recovery plan will often have reserve logical partitions available.
The framework does not consume three full CPCs:
– Other partitions will be in use for other work, such as development and test work.
– Given the right system configuration, a z/VM SSI member can be IPL’d in a new location.
Based upon availability and DR requirements, consider having a failover plan for z/VM SSI member nodes:
– This mitigates the damage that another outage might incur.
– This lessens the time an SSI may be running closer to a memory ‘ceiling’.
– Partition failover allows RHOCP to run at “full strength” while the physical site is under repair.
The Solutions Assurance team has whitepapers and Redpapers (around z/VM, Linux HA, and OCP):
https://www.ibm.com/docs/en/linux-on-systems?topic=assurance-solution-papers
ECDSA Elliptic Curve Digital Signature Algorithm
GBIC German Banking Industry Commission
HADR high availability and disaster recovery
HCD hardware configuration definition
HCM Hardware Configuration Manager
HDD hard disk drive
IML initial machine load
IMPP Installation Manual for Physical Planning
IMS IBM Information Management System
LAG Link Aggregation Port Group
LAN local area network
MIDAW Modified Indirect Data Address Word
RAIM redundant array of independent memory
RaP Report a Problem
RAS reliability, availability, and serviceability
RAs Repair Actions
RCL Remote Code Load
RDMA Remote Direct Memory Access
RDP Read Diagnostic Parameters
RG Resource Group
RG3 resource group 3
RGs resource groups
RHEL Red Hat Enterprise Linux
RI Runtime Instrumentation
RII Redundant I/O Interconnect
RMF Resource Measurement Facility
RNG Random Number Generator
RNI Relative Nest Intensity
RNID request node identification data
RoCE RDMA over CEE
RSF Remote Support Facility
RSM Real Storage Manager
RSU reconfigurable storage unit
RTM Recovery Termination Manager
RU recovery unit
Rx Receive
SA System Automation
SADMP standalone dump
SAN storage area network
SAP system assist processor
SDSF System Display and Search Facility
SE Support Element
SEs Support Elements
SFP Small Form-Factor Pluggable
SFP Special Function Processors
SHA Secure Hash Algorithm
SHR shared
SIGA Signal Adapter
SIMD Single Instruction Multiple Data
SLES SUSE Linux Enterprise Server
SMC Shared Memory Communication
SMCv2 SMC Version 2
SMF System Management Facilities
SMP symmetric multiprocessing
SMP symmetric multiprocessor
SMT simultaneous multithreading
SN serial number
SNMP Simple Network Management Protocol
SOO Single Object Operations
SOOs Single Object Operations
SORTL SORT LISTS
SR short reach
SRAM static random access memory
SRB Service Request Block
SRB System Recovery Boost
SS subchannel set
SS1 subchannel set 1
SS2 subchannel set 2
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
IBM Redbooks
The following IBM Redbooks publications provide additional information about the topics in
this document. Note that some publications referenced in this list might be available in
softcopy only.
IBM GDPS: An Introduction to Concepts and Capabilities, SG24-6374
IBM z16 Configuration Setup, SG24-8960-01
IBM Z Connectivity Handbook, SG24-5444
Getting Started with IBM Z Resiliency, SG24-8446
You can search for, view, download or order these documents and other Redbooks,
Redpapers, Web Docs, draft and additional materials, at the following website:
ibm.com/redbooks
Other publications
These publications are also relevant as further information sources:
Hardware Management Console (HMC) Operations Guide Version 2.16.0.
See IBM Resource Link (requires IBM ID authentication).
IBM Hardware Management Console Help
Online resources
The IBM Resource Link for documentation and tools website is also relevant as another
information source:
http://www.ibm.com/servers/resourcelink
SG24-8544-00
ISBN DocISBN
Printed in U.S.A.
®
ibm.com/redbooks