IBM Storage Ceph Concepts and Architecture Guide
Vasfi Gucer
Jussi Lehtinen
Jean-Charles (JC) Lopez
Christopher Maestas
Franck Malterre
Suha Ondokuzmayis
Daniel Parkes
John Shubeck
Redpaper
IBM Redbooks
June 2024
REDP-5721-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page vii.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Ceph and storage challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Ceph approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Ceph storage types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 What is new with IBM Storage Ceph 7.0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Write Once, Read Many compliance certification . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Multi-site replication with bucket granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Object archive zone (technology preview) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.4 RGW policy-based data archive and migration capability. . . . . . . . . . . . . . . . . . . . 9
1.3.5 IBM Storage Ceph Object S3 Lifecycle Management. . . . . . . . . . . . . . . . . . . . . . 10
1.3.6 Dashboard UI enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.7 NFS support for CephFS for non-native Ceph clients. . . . . . . . . . . . . . . . . . . . . . 11
1.3.8 NVMe over Fabrics (technology preview) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.9 Object storage for machine learning and analytics: S3 Select . . . . . . . . . . . . . . . 12
1.3.10 RGW multi-site performance improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.11 Erasure code EC2+2 with four nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
IBM®, IBM Cloud®, IBM Spectrum®, Redbooks®, and Redbooks (logo)®
The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Red Hat, Ansible, Ceph, and OpenShift are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.
VMware and the VMware logo are registered trademarks or trademarks of VMware, Inc. or its subsidiaries in
the United States and/or other jurisdictions.
Other company, product, or service names may be trademarks or service marks of others.
IBM Storage Ceph is an IBM® supported distribution of the open-source Ceph platform that
provides massively scalable object, block, and file storage in a single system.
IBM Storage Ceph is designed to infuse AI with enterprise resiliency, consolidate data with
software simplicity, and run on multiple hardware platforms to provide flexibility and lower
costs.
This IBM Redpaper publication explains the concepts and architecture of IBM Storage Ceph
in a clear and concise way. For more information about how to implement IBM Storage Ceph
for real-life solutions, see IBM Storage Ceph Solutions Guide, REDP-5715.
The target audience for this publication is IBM Storage Ceph architects, IT specialists, and
technologists.
Authors
This paper was produced by a team of specialists from around the world.
Kenneth David Hartsoe, Elias Luna, Henry Vo, Wade Wallace, William West
IBM US
Marcel Hergaarden
IBM Netherlands
The team extends its gratitude to the Upstream Community, IBM, and Red Hat Ceph
Documentation teams for their contributions to continuously improve Ceph documentation.
The authors would like to express their sincere gratitude to Anthony D'Atri of Red Hat Inc for
his valuable contributions to this publication through his thorough review of the book.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Introduction
This chapter introduces the origins of Ceph and the basic architectural concepts that are used
by this software-defined storage solution.
To address the inherent scalability limitations of the Lustre architecture, Sage Weil developed Ceph, a
distributed storage system that uses a pseudo-random placement algorithm, Controlled
Replication Under Scalable Hashing (CRUSH), to distribute data across a heterogeneous
cluster. The algorithm uses a deterministic calculation to optimally place and
redistribute data across the cluster, which minimizes data movement when the cluster's state
changes. While Ceph employs a distributed approach to metadata management, it does
use a metadata pool to store essential file system information.
CRUSH is designed to distribute data across all devices in the cluster and to avoid the classic bias of
favoring empty devices when writing new data. During cluster expansion, that bias is likely to
generate a bottleneck because the new, empty devices receive all the new writes, and it
produces an unbalanced data distribution because the old data is not redistributed across
all the devices in the storage cluster.
As CRUSH is designed to distribute and maintain the distribution of the data throughout the
lifecycle of the storage cluster (expansion, reduction, or failure), it favors an equivalent
mixture of old and new data on each physical disk of the cluster, which leads to a more even
distribution of the I/Os across all the physical disk devices.
To enhance data distribution, the solution was designed to break down large elements (for
example, a 100 GiB file) into smaller elements, each assigned a specific placement through
the CRUSH algorithm. Therefore, reading a large file leverages multiple physical disk drives
rather than a single disk drive as if the file had been kept as a single element.
Sage Weil prototyped the new algorithm and, in doing so, created a new distributed storage
software solution: Ceph. The name Ceph was chosen as a reference to the ocean and the life
that it harbors: Ceph is short for cephalopod.
Note: Cephalopods have multiple arms or tentacles that can operate independently yet in
a coordinated manner. This reflects Ceph's ability to distribute data and operations across
numerous storage devices in a cluster, working in parallel for high performance and
redundancy.
In January 2023, all Ceph developers and product managers moved from Red Hat to
IBM to provide greater resources for the future of the project. The Ceph project at IBM
remains an open-source project. Code changes still follow the upstream-first rule, and Ceph is
the base for the IBM Storage Ceph software-defined storage product.
Figure 1-1 on page 3 represents the milestones of the Ceph project over the past two
decades.
All Ceph community versions are named after a member of the class Cephalopoda, and the
first letter of the name identifies the version.
Table 1-1 lists all Ceph community version names with the matching Inktank, Red Hat
Ceph Storage, or IBM Storage Ceph version that leverages it.
Table 1-1  Ceph community version names and matching downstream products

Name     Upstream version   Release      End of life   Downstream product
Pacific  16.2               2021-03-31   2023-10-01    IBM Storage Ceph 5 (start with 5.3); Red Hat Ceph Storage 5
Quincy   17.2               2022-04-19   2024-06-01    IBM Storage Ceph 6 (start with 6.1); Red Hat Ceph Storage 6
Reef     18.2               2023-08-07   2025-08-01    IBM Storage Ceph 7 (start with 7.0); Red Hat Ceph Storage 7
Ceph releases marked with an asterisk in Table 1-1 indicate development versions with short
lifespans. These releases, lacking a corresponding downstream product, were not used by
Inktank, Red Hat, or IBM to create long-term supported products. Starting with the Pacific
release, the Ceph project adopted a new release cycle: yearly Long Term Support (LTS)
versions without intermediate development releases. Consequently, Red Hat/IBM major
releases now directly mirror community major releases.
Following the transfer of the Ceph project to IBM, Red Hat Ceph Storage is an OEM version of
IBM Storage Ceph, starting with Red Hat Ceph Storage 6.
Figure 1-2 represents the different IBM Storage Ceph versions at the time of writing.
Technology changes
The rapid advancement of the IT lifecycle has rendered the once-dominant, centralized data
center obsolete, giving rise to a more decentralized architecture that hinges on
unprecedented levels of application and server communication. This shift marks a stark
departure from the era of centralized processing units that are housed within a single room,
where passive terminals served as mere conduits for accessing computational power.
Ceph is no exception to these challenges, and was designed from the ground up to be highly
available (HA), with no single point of failure, and highly scalable with limited day-to-day
operational requirements other than the replacement of failed physical resources, such as
nodes or drives. Ceph provides a complete, software-defined storage solution as an
alternative to proprietary storage arrays.
Runs as a software-defined storage solution on commodity hardware.
Open source to avoid vendor lock-in.
All the different types of storage are segregated and do not share data between them, although
they might share the physical storage devices where the data is stored. However, a custom
CRUSH configuration enables you to separate the physical nodes and disks that are used by
each of them.
All the different types of storage are identified and implemented as Ceph access methods on
top of the native RADOS API. This API is known as librados.
Ceph is written entirely in C and C++ except for some language-specific API wrappers
(for example, Python wrappers for librados and librbd).
IBM Storage Ceph is positioned for the use cases that are shown in Figure 1-3. IBM is
committed to supporting additional use cases in upcoming versions, such as NVMe over Fabrics
for full integration with VMware.
Note: The US Securities and Exchange Commission (SEC) stipulates record keeping
requirements, including retention periods. FINRA rules regulate member brokerage firms
and exchange member markets.
Figure 1-4 shows the multi-site replication with bucket granularity feature.
Previously, replication was limited to full zone replication. This new feature grants clients
enhanced flexibility by enabling the replication of individual buckets with or against different
IBM Storage Ceph clusters. This granular approach enables selective replication, which can
be beneficial for edge computing, colocations, or branch offices. Bidirectional replication is
also supported.
The archive zone selectively replicates data from designated buckets within the production
zones. System administrators can control which buckets undergo replication to the archive
zone, enabling them to optimize storage capacity usage and prevent the accumulation of
irrelevant content.
This feature enables clients to seamlessly migrate data that adheres to policy criteria to the
public cloud archive. This function is available for both Amazon Web Services (AWS) and
Microsoft Azure public cloud users.
The primary benefit of this feature is its ability to liberate on-premises storage space that is
occupied by inactive, rarely accessed data. This reclaimed storage can be repurposed for
active datasets, enhancing overall storage efficiency.
Figure 1-7 shows the IBM Storage Ceph Object S3 Lifecycle Management feature.
IBM Storage Ceph Linux clients can seamlessly mount CephFS without additional driver
installations because CephFS is embedded in the Linux kernel by default. This capability
extends CephFS accessibility to non-Linux clients through the NFS protocol. In IBM Storage
Ceph 7, the NFS Ganesha service expands compatibility by supporting NFS v4, empowering
a broader spectrum of clients to seamlessly access CephFS resources.
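As a sketch of this workflow, the following commands show how an NFS Ganesha cluster and a CephFS export might be created with the Ceph CLI. The cluster name mynfs, the file system name myfs, the placement hosts, and the mount paths are placeholders, and the exact syntax should be verified against the IBM Storage Ceph documentation for your release.
# Deploy an NFS Ganesha service on two hosts
ceph nfs cluster create mynfs "2 host1,host2"
# Expose a CephFS file system through an NFS v4 export
ceph nfs export create cephfs --cluster-id mynfs --pseudo-path /cephfs --fsname myfs
# A non-native Ceph client can then mount the export
mount -t nfs -o nfsvers=4.1 host1:/cephfs /mnt/cephfs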
The newly introduced IBM Storage Ceph NVMe-oF gateway bridges the gap for non-Linux
clients, enabling them to seamlessly interact with NVMe-oF initiators. These initiators
establish connections with the gateway, which connects to the RADOS block storage system.
The performance of NVMe-oF block storage through the gateway is comparable to native
RBD block storage, ensuring a consistent and efficient data access experience for both Linux
and non-Linux clients.
This feature empowers clients to employ straightforward SQL statements to filter the contents
of S3 objects and retrieve only the specific data that they require. By leveraging S3 Select for
data filtering, clients can minimize the amount of data that is transferred by S3, which reduces
both retrieval costs and latency. The following data formats are supported:
Comma-separated value (CSV)
JSON
Parquet
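For example, a client might filter a CSV object server-side with the AWS CLI, as sketched below. The endpoint, profile, bucket, and object names are illustrative (following the conventions of the examples later in this paper), and the serialization options depend on your data layout.
aws --endpoint-url=http://ceph-node3:80 --profile=ceph s3api select-object-content \
  --bucket s3-bucket-1 --key sales.csv \
  --expression-type SQL \
  --expression "SELECT * FROM S3Object s WHERE s._3 > '100'" \
  --input-serialization '{"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "NONE"}' \
  --output-serialization '{"CSV": {}}' \
  /tmp/filtered.csv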
With IBM Storage Ceph 7, erasure code EC 2+2 can be used with just four nodes, making it
more efficient and cost-effective to deploy erasure coding (EC) for data protection.
IBM Storage Ready Nodes can be deployed with a minimum of four nodes and use EC for the
RADOS back end in this basic configuration. This scalable solution can be expanded to
accommodate up to 400 nodes.
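As a minimal sketch, an EC 2+2 profile and pool might be created as follows. The profile and pool names are placeholders; with four nodes, the failure domain is typically the host.
# Define an erasure code profile with two data chunks and two coding chunks
ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
# Create an erasure-coded pool that uses the profile and tag it for object storage
ceph osd pool create ec22pool erasure ec22
ceph osd pool application enable ec22pool rgw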
Monitors
MONs are responsible for managing the state of the cluster. Like with any distributed storage
system, the challenge is tracking the status of each cluster component (MONs, MGRs, OSDs,
and others).
Ceph maintains its cluster state through a set of specialized maps, collectively referred to as
the cluster map. Each map is assigned a unique version number (epoch), which starts at 1
and increments by 1 on every state change for the corresponding set of components.
To prevent split-brain scenarios, the number of MONs that are deployed in a Ceph cluster
must always be an odd number greater than two to ensure that most MONs can validate map
updates. More than half of the MONs that are present in the Monitor Map (MONMap) must
agree on the change that is proposed by the PAXOS quorum leader for the map to be
updated.
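For example, you can verify the monitor quorum and the current MONMap state from the CLI:
ceph mon stat
ceph quorum_status --format json-pretty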
Note: The MONs are not part of the data path, meaning that they do not directly handle
data storage or retrieval requests. They exist primarily to maintain cluster metadata and
keep all components synchronized.
Managers
The MGRs are integrated with the MONs, and collect the statistics within the cluster. The
MGRs provide a pluggable Python framework to extend the capabilities of the cluster. As
such, the developer or the user can leverage or create MGR modules that are loaded into the
MGR framework.
The following list provides some of the existing MGR modules that are available:
Balancer module (dynamically reassign PGs to OSDs for better data distribution).
Auto-scaler module (dynamically adjust the number of PGs that are assigned to a pool).
Dashboard module (provide a UI to monitor and manage the Ceph cluster).
RESTful module (provide a RESTful API for cluster management).
Prometheus module (provide metrics support for the Ceph cluster).
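Modules can be listed, enabled, and checked with the Ceph CLI, for example:
# Show which MGR modules are enabled, disabled, or always on
ceph mgr module ls
# Enable the dashboard module
ceph mgr module enable dashboard
# Check the status of the balancer module
ceph balancer status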
Note: A cluster node where OSDs are deployed is called an OSD node.
FileStore
In its early days, the Ceph OSD used an object store implementation that was known as
FileStore (Figure 2-2). This object store solution leverages an XFS-formatted partition to store
the actual data and a raw partition to store the object store journal. The journal is written
sequentially to the raw device in a wrap-around fashion.
When flash-based drives arrived on the market, it became a best practice to use a solid-state
drive (SSD) to host the journal to enhance the performance of write operations in the Ceph
cluster.
However, the complexity of the solution and the write amplification due to two writes for each
write operation led the Ceph project to consider an improved solution for the future.
BlueStore
BlueStore has been the default OSD object store format since the upstream Luminous release
(Red Hat Ceph Storage 3.x). With BlueStore, data is written directly to the disk device, and a
separate RocksDB key-value store contains all the metadata.
When the data is written to the raw data block device, the RocksDB is updated with the
metadata that is related to the new data blobs that were written.
RocksDB uses a DB portion and a write-ahead log (WAL) portion. Depending on the size of
the I/O, RocksDB writes the data directly to the raw block device through BlueFS or to the
WAL so that the data can be later committed to the raw block device. The latter process is
known as a deferred write.
Note: A best practice is to use a device faster than the data device for the RocksDB
metadata device and a device faster than the RocksDB metadata device for the WAL
device.
When a separate device is configured for the metadata, it might overflow to the data device if
the metadata device becomes full. Although this situation is not a problem if both devices are
of the same type, it leads to performance degradation if the data device is slower than the
metadata device. This situation is known as BlueStore spillover.
These parameters are expressed as a percentage of the cache size that is assigned to the
OSD.
With BlueStore, you can configure more features to align best with your workload:
block.db sharding
Minimum allocation size on the data device
Pools
The cluster is divided into logical storage partitions that are called pools. The pools have the
following characteristics:
Group data of a specific type.
Group data that is to be protected by using the same mechanism (replication or EC).
Group data to control access from Ceph clients.
Have a single CRUSH rule assigned that determines the PG mapping to OSDs.
Note: At the time of writing, pools support compression, but do not support deduplication.
The compression can be activated on a per pool basis.
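For example, a pool can be created and per-pool compression enabled with the following commands. The pool name, algorithm, and mode are example values.
ceph osd pool create mypool replicated
ceph osd pool set mypool compression_algorithm snappy
ceph osd pool set mypool compression_mode aggressive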
Data protection
The data protection scheme is assigned individually to each pool. The data protection
schemes that IBM Storage Ceph supports are as follows:
Replicated, which makes a full copy of each byte that is stored in the pool (the default is
three copies):
– Two or more replicas are supported on flash devices.
– Three or more replicas are supported on HDDs.
Note: EC, although supported for all types of storage (block, file, and object), is not
recommended for block and file because it delivers lower performance.
Figure 2-4 Replicated data protection versus erasure coding data protection
EC provides a more cost-efficient data protection mechanism. It offers greater resiliency and
durability as you increase the number of coding chunks, which enables the pool to survive the
loss of many OSDs or servers before the data becomes unrecoverable. However, it delivers
lower performance because of the computation and network traffic that are required to split
the data and calculate the coding chunks.
Table 2-1 lists the main parameters for pools and details whether each parameter can be
dynamically modified (columns: Name, Function, and Dynamic update).
Placement groups
The pool is divided into hash buckets that are called placement groups. The PG does the
following tasks:
Stores the objects as calculated by the CRUSH algorithm.
Ensures that the storage of the data is abstracted from the physical devices.
active+stale The PG is functional, but one or more replicas are stale and must be updated.
stale The PG is in an unknown state because the MONs did not receive an update
after the PG placement changed.
recovery_toofull The PG is waiting for recovery because the target OSD is full.
backfill_toofull The PG is waiting for backfill because the target OSD is full.
incomplete The PG is missing information about some writes that might have occurred or
do not have healthy copies.
degraded The PG has not replicated some objects to the correct number of OSDs.
repair The PG is being checked and repaired for any inconsistencies that it finds.
inconsistent The PG has inconsistencies between its different copies that are stored on
different OSDs.
HEALTH_OK The cluster is fully operational, and all components are operating as
expected.
HEALTH_WARN Some issues exist in the cluster, but the data is available to clients.
HEALTH_ERR Serious issues exist in the cluster, and some data is unavailable to clients.
HEALTH_FAILED The cluster is not operational, and data integrity might be at risk.
Figure 2-5 illustrates the general component layout within a RADOS cluster.
Note: With the latest version of IBM Storage Ceph, multiple components can be deployed
on one node by referencing the support matrix. Log in with your IBMid to access this link.
Because the placement remains the same for a cluster state, imagine the following scenario:
A RADOS object is hosted in PG 3.23.
A copy of the object is written on OSDs 24, 3, and 12 (state A of the cluster).
OSD 24 is stopped (state B of the cluster).
An hour later, OSD 24 is restarted.
When OSD 24 stops, the PG that contains the RADOS object is re-created on another OSD
so that the cluster satisfies the number of copies that must be maintained for the PGs that
belong to the specific pool.
Now, state B of the cluster differs from the original state A of the cluster. PG 3.23
is now protected by OSDs 3, 12, and 20.
From an architectural point of view, CRUSH provides the mechanism that is shown in
Figure 2-6.
To determine the location of a specific object, the mechanism that is shown in Figure 2-7 is
used.
To make sure that a client or a Ceph cluster component finds the correct location of an object
that enables the client-to-cluster communication model, all maps that are maintained by the
MONs have versions, and the version of the map that is used to find an object or a PG is
checked by the recipient of a request.
The specific exchange determines whether the MONs or one of the OSDs updates the map
version that is used by the sender. Usually, these updates are differentials, meaning that only
the changes to the map are transmitted after the initial connection to the cluster.
Figure 2-8 Full data placement picture (objects on the left and OSDs on the right)
In the Ceph clusters, the following mechanisms exist to track the status of the different
components of the cluster:
MON-to-MON heartbeat
OSD-to-MON heartbeat
OSD-to-OSD heartbeat
On detecting an unavailable peer OSD (because it works with other OSDs to protect PGs), an
OSD relays this information to the MONs, which enables the MONs to update the OSDMap to
reflect the status of the unavailable OSD.
Backfill is the process of moving a PG after the addition or the removal of an OSD to or from
the Ceph cluster.
When a Ceph client connects to the Ceph cluster, it must contact the MONs of the cluster so
that the client can be authenticated. Once authenticated, the Ceph client is provided with a
copy of the different maps that are maintained by the MONs.
When a failure occurs within the cluster as the Ceph client tries to access a specific OSD, the
following cases can occur:
The OSD that it is trying to contact is unavailable.
The OSD that it is trying to contact is available.
In the first case, the Ceph client falls back to the MONs to obtain an updated copy of the
cluster map.
Figure 2-10 on page 27 shows the high-level sequence of steps that occur when a Ceph client
connects to the cluster and then accesses data:
4. As the target OSD has become unavailable, the client contacts the MONs to obtain the
latest version of the cluster map.
5. The data placement for object name xxxxx is recalculated.
6. The Ceph client initiates a connection with the new primary OSD that protects the PG.
In the second case, the OSD detects that the client has an outdated map version and
provides the necessary map updates that took place since the map version that was used by
the Ceph client and the map version that was used by the OSD. On receiving these updates,
the Ceph client recalculates data placement and retries the operation, ensuring that it is
aware of the latest cluster configuration and can interact with the OSDs.
CephX is enabled by default during deployment, and it is a best practice to keep it enabled.
However, some benchmark results that are available online might have been run with CephX
disabled to eliminate protocol overhead during testing.
The installation process enables CephX by default, so that the cluster requires user
authentication and authorization by all client applications.
The usernames that are used by Ceph clients are expressed as client.{id}. For example,
when connecting OpenStack Cinder to a Ceph cluster, it is common to create the
client.cinder username.
The usernames that are used by the RADOS Gateways (RGWs) connecting to a Ceph cluster
follow the client.rgw.{hostname} structure.
By default, if no argument is passed to the librados API by code or the Ceph CLI, the
connection to the cluster is attempted with the client.admin username.
The default behavior can be configured by using the CEPH_ARGS environment variable. To
specify a specific username, use export CEPH_ARGS="--name client.myid". To specify a
specific user ID, use export CEPH_ARGS="--id myid".
Key rings
On creating a new username, a corresponding keyring file is generated in the Microsoft
Windows INI file format. This file contains a section that is named [client.myid] that holds
the username and its associated unique secret key. When a Ceph client application running
on a different node must connect to the cluster by using this username, the keyring file must
be copied to that node.
When the application starts, librados, which is used regardless of the access
method, searches for a valid keyring file in /etc/ceph. The keyring file name is
generated as {clustername}.{username}.keyring.
You can override the location of the keyring file by inserting
keyring={keyring_file_location} in the local Ceph configuration file (/etc/ceph/ceph.conf) in
a section that is named after your username:
[client.testclient]
keyring=/mnt/homedir/myceph.keyring
All Ceph MONs can authenticate users so that the cluster does not present any single point of
failure in authentication. The process follows these steps:
1. The Ceph client contacts the MONs with a username and a keyring.
2. The Ceph MONs return an authentication data structure, like a Kerberos ticket, that includes a
session key.
3. The Ceph client uses the session key to request specific services.
4. A Ceph MON provides the client with a ticket so that it can authenticate with the OSDs.
5. The ticket expires after a period so that it cannot be reused indefinitely, which limits spoofing risks.
Capabilities
Each username is assigned a set of capabilities that enables specific actions against the
cluster.
The cluster offers an equivalent of the Linux root user, which is known as client.admin. By
default, a user has no capabilities and must be allowed specific rights. To accomplish this
goal, use the allow keyword, which precedes the access type that is granted by the capability.
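For example, a user with read access to the MONs and read/write access to a single pool might be created as follows. The username and pool name are placeholders.
ceph auth get-or-create client.myid mon 'allow r' osd 'allow rw pool=mypool'
# Display the generated key and capabilities
ceph auth get client.myid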
Profiles
To simplify capability creation, CephX profiles can be leveraged:
profile osd: A user can connect as an OSD and communicate with OSDs and MONs.
profile mds: A user can connect as an MDS and communicate with MDSs and MONs.
profile crash: Read-only access to MONs for crash dump collection.
profile rbd: A user can manipulate and access RADOS Block Device (RBD) images.
profile rbd-read-only: A user can access RBD images in read-only mode.
Other deployment-reserved profiles exist and are not listed for clarity.
All access methods except librados automatically stripe the data that is stored in RADOS into
4 MiB objects by default. This setting can be customized. For example, when using CephFS,
storing a 1 GiB file from a client perspective results in 256 × 4 MiB RADOS objects, each
assigned to a PG within the pool that is used by CephFS.
For more information about each access method, see Chapter 3, “IBM Storage Ceph main
features and capabilities” on page 31.
2.3 Deployment
There are many Ceph deployment tools, each of which became available at a certain release:
mkcephfs was the first tool.
ceph-deploy started with Cuttlefish.
ceph-ansible started with Jewel.
cephadm started with Octopus.
For more information about how to use cephadm to deploy your Ceph cluster,
see IBM Storage Ceph documentation. A cephadm-based deployment follows these steps:
For beginners:
a. Bootstrap your Ceph cluster (create one initial MON and MGR).
b. Add services to your cluster (OSDs, MDSs, RGWs, and others).
For advanced users, bootstrap your cluster with a complete service file to deploy
everything.
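A minimal sketch of the beginner flow follows; the host names and IP addresses are placeholders.
# Bootstrap the cluster with one initial MON and MGR
cephadm bootstrap --mon-ip 10.0.0.10
# Add more hosts to the cluster
ceph orch host add node2 10.0.0.11
ceph orch host add node3 10.0.0.12
# Deploy OSDs on all available, unused devices
ceph orch apply osd --all-available-devices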
For more information about the guidelines and best practices for cluster deployment, see
Chapter 6, “Day 1 and Day 2 operations” on page 125.
Block storage devices are thin-provisioned, resizable volumes that store data striped over
multiple object storage daemons (OSDs). Ceph block devices leverage RADOS capabilities,
such as snapshots, replication, and data reduction. Ceph block storage clients communicate
with Ceph clusters through kernel modules or the librbd library.
Ceph File System (CephFS) is a file system service that is compatible with POSIX standards
and built on the Ceph distributed object store. CephFS provides file access to internal and
external clients by using POSIX semantics wherever possible. CephFS maintains strong
cache coherency across clients. The goal is for processes that use the file system to behave
the same when they are on different hosts as when they are on the same host. It is simple to
manage and yet meets the growing demands of enterprises for a broad set of applications
and workloads. CephFS can be additionally accessed by using industry-standard file sharing
protocols such as NFS and SMB/CIFS.
Object storage is a storage solution used in the cloud and by on-premises applications as a
central storage platform for large quantities of unstructured data. Object storage continues to
increase in popularity due to its ability to address the needs of the world’s largest data
repositories. IBM Storage Ceph supports object storage operations through the S3 API with a
high emphasis on what is commonly referred to as S3 API fidelity.
Object storage can serve many use cases, including archival, backup and DR, media and
entertainment, and big data analytics.
The Ceph Object Gateway provides interfaces that are compatible with OpenStack Swift and
Amazon Web Services (AWS) S3. The Ceph Object Gateway has its own user management
system and can also be interfaced with external systems including OpenStack Keystone. The
Ceph Object Gateway can store data in the same Ceph storage cluster that is used to store
data from Ceph Block Device and CephFS client, with separate pools and optionally different
storage drive media or a different Controlled Replication Under Scalable Hashing (CRUSH)
hierarchy.
Industry-specific use cases for the Ceph Object Gateway include the following examples:
Healthcare and Life Sciences:
– Medical imaging, such as picture archiving and communication system (PACS) and
MRI.
– Genomics research data.
– Health Insurance Portability and Accountability Act (HIPAA) of 1996 regulated data.
Media and entertainment (for example, audio, video, images, and rich media content).
Financial services (for example, regulated data that requires long-term retention or
immutability).
Object Storage as a Service (SaaS) as a catalog offering (cloud or on-premises).
The RGW stores its data, including user data and metadata, in a dedicated set of Ceph
Storage pools. This ensures efficient data management and organization within the Ceph
cluster.
RGW Data pools are often built on hard disk drives (HDDs) with EC for cost-effective and
high-capacity configurations.
Data pools can also use solid-state drives (SSDs) with replication or EC to cater to
performance-sensitive object storage workloads. QLC SSDs with lower cost and dense
capacities are especially suited to object storage.
The bucket index pool, which stores bucket index key/value entries (one per object), is a
critical component of Ceph Object Gateway performance. Due to its
performance-sensitive nature and the potential for millions of object entries, the bucket
index pool should exclusively use SSDs to ensure optimal performance and
responsiveness.
The RGW offers interfaces that are compatible with AWS S3 and OpenStack Swift, which
provide seamless integration with existing cloud environments. Also, it features an admin
operations RESTful API for automating day-to-day operations, which streamlines
management tasks.
The Ceph Object Gateway constructs of Realm, Zone Groups, and Zones are used to define
the organization of a storage network for purposes of replication and site protection.
A deployment in a single data center is simple to install and manage. In recent feature
updates, the Dashboard UI was enhanced to provide a single point of control for the
deployment and ongoing management of the Ceph Object Gateway in a single data center. In
this simple case, there is no need to define Zones and Zone Groups, and a minimized
topology is automatically created.
For deployments that involve multiple data centers and multiple IBM Storage Ceph clusters, a
more detailed configuration is required with granular control by using the cephadm CLI or the
dashboard UI. In these scenarios, the Realm, Zone Group, and Zones must be defined by the
storage administrator and configured. For more information about these constructs, see the
IBM Storage Ceph documentation.
To avoid single points of failure in a Ceph RGW deployment, one can present to clients an
S3/RGW endpoint that can tolerate the failure of one or more back-end RGW services. RGW
is a RESTful HTTP endpoint that can be load-balanced for HA and increased performance.
There are some great examples of different RGW load-balancing mechanisms in this
repository.
Starting with IBM Storage Ceph 5.3, Ceph provides an HA and load-balancing service named
ingress, which is based on keepalived and haproxy. With the ingress service, one may easily
deploy an HA endpoint for RGW with a minimal set of configuration options.
The orchestrator deploys and manages a combination of haproxy and keepalived to balance
the load through a floating virtual IP address. If SSL is used, then SSL must be configured
and terminated by the ingress service, not RGW itself.
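A sketch of an ingress service specification that could be applied with the orchestrator follows. The service ID, virtual IP, and ports are example values that must match an existing RGW service and your network.
cat <<EOF > ingress.yaml
service_type: ingress
service_id: rgw.myrgw
placement:
  count: 2
spec:
  backend_service: rgw.myrgw
  virtual_ip: 192.168.1.100/24
  frontend_port: 8080
  monitor_port: 1967
EOF
ceph orch apply -i ingress.yaml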
Swift compatibility
Provides object storage functions with an interface that is compatible with a large subset of
the OpenStack Swift API.
The S3 and Swift APIs share a common namespace, so you can write data with one API and
retrieve it with the other. The S3 namespace can also be shared with NFS clients to offer a
true multiprotocol experience for unstructured data use cases.
Administrative API
Provides an administrative restful API interface for managing the Ceph Object Gateways.
Administrative API requests are sent via a URI that starts with the admin resource endpoint.
Authorization for the administrative API mimics the S3 authorization convention. Some
operations require the user to have special administrative capabilities. XML or JSON format
responses can be specified by the format option in the request. JSON format is the default.
Management
The Ceph Object Gateway can be managed by using the Ceph Dashboard UI, the Ceph CLI
(cephadm), the Administrative API, and through service specification files.
IBM Storage Ceph object storage further enhances security by incorporating IAM
compatibility for authorization. This feature introduces IAM Role policies, empowering users
to request and assume specific roles during STS authentication. By assuming a role, users
inherit the S3 permissions that are configured for that role by an RGW administrator. This
role-based access control (RBAC) or attribute-based access control (ABAC) approach
enables granular control over user access, ensuring that users access only the resources that
they need.
Archive zone
The archive zone uses multi-site replication and S3 object versioning features. The archive
zone retains all versions of all objects, which remain available even when they are deleted
from a production zone.
An archive zone provides S3 object version history that can be eliminated only through the
RGWs associated with the archive zone. Including an archive zone in your multisite zone
replication setup provides the convenience of S3 object history while saving space that
replicas of the versioned S3 objects would consume in production zones.
One can manage the storage space usage of the archive zone through bucket lifecycle
policies, where one can define the number of versions to keep for each object.
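As an illustrative sketch, a bucket lifecycle rule that keeps a limited number of noncurrent versions might be applied with the AWS CLI against the archive zone endpoint. The endpoint, profile, bucket name, and retention values are hypothetical, and the NewerNoncurrentVersions field must be supported by your release.
aws --endpoint-url=http://archive-zone-rgw:80 --profile=ceph s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration '{"Rules": [{"ID": "limit-versions", "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "NoncurrentVersionExpiration": {"NoncurrentDays": 30, "NewerNoncurrentVersions": 5}}]}'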
An archive zone helps protect your data against logical or physical errors. It can save users
from logical failures, such as accidentally deleting a bucket in the production zone. It can also
save your data from massive hardware failures, like a complete production site failure. Also, it
provides an immutable copy, which can serve as a key component of a ransomware
protection strategy.
Security
IBM Storage Ceph provides the following security features for data access and management:
Comprehensive auditing: Track user activity and access attempts with granular detail by
enabling the RGW operations (ops) log, which provides a valuable audit trail for
data access accountability.
Multi-factor authentication (MFA) for deletes: Prevent accidental or unauthorized data
deletion with an extra layer of security. MFA requires an additional verification step beyond
a traditional password.
Secure Token Service (STS) integration: Enhance security by using short-lived,
limited-privilege credentials that are obtained through STS. This approach eliminates the
need for long-lived S3 credentials, reducing the risk that is associated with compromised
keys.
Secure communication with TLS/SSL: Protect data in transit by securing the S3 HTTP
endpoint that is provided by RGW services with TLS/SSL encryption. Both external and
self-signed SSL certificates are supported for flexibility.
Replication
IBM Storage Ceph Object Storage provides enterprise-grade, highly mature geo-replication
capabilities. The RGW multi-site replication feature facilitates asynchronous object replication
across single-zone or multi-zone deployments. By leveraging asynchronous replication with
eventual consistency, Ceph Object Storage operates efficiently over LAN or WAN connections
between replicating sites.
Figure 3-8 shows the IBM Ceph Object Storage replication feature.
Storage policies
Ceph Object Gateway supports multiple storage classes for the placement of RGW data. For
example, SSD storage pool devices are a best practice for index data and a default storage
class to house small client objects and HEAD RADOS objects, and erasure-coded HDD or
QLC storage pool devices may be targeted for high-capacity bucket data. Ceph Object
Gateway also supports storage lifecycle (LC) policies and migrations for data placement
across tiers of storage as content ages.
Migrations across storage classes and protection policies (for example, replicated and EC
pools) are supported. Ceph Object Gateway also supports policy-based data archiving to
AWS S3, Azure or IBM Cloud Object Storage.
Multiprotocol
Ceph Object Gateway supports a single unified namespace for S3 client operations (S3 API)
and NFS client operations (NFS Export) to the same bucket. This support provides a true
multiprotocol experience for many use cases, particularly in situations where applications are
being modernized from traditional file sharing access to native object storage access. As a
best practice, limit the usage of S3 and NFS to the same namespace in use cases such as
data archives, rich content repositories, or document management stores, that is, use cases
where the files and objects are unchanging by design. Multiprotocol is not recommended for
live collaboration use cases where multiuser modifications to the same content are required.
IBM watsonx.data
IBM Storage Ceph is the perfect candidate for a data lake or data lakehouse. For example,
watsonx.data includes an IBM Storage Ceph license. The integration and level of testing
between watsonx.data and IBM Storage Ceph is first rate. Some of the features that make
Ceph a great solution for watsonx.data are S3 Select, Table Encryption, IDP authentication
with STS, and Datacenter Caching with Datacenter-Data-Delivery Network (D3N).
S3 Select is a recent innovation that extends object storage to semi-structured use cases. An
example of a semi-structured object is one that contains comma-separated values (CSV),
JSON, or Parquet data formats. S3 Select enables a client to retrieve a subset of an object’s
content by using SQL-like arguments to filter the resulting payload. At the time of writing,
Ceph Object Gateway supports S3 Select for alignment with data lakehouse storage by IBM
watsonx.data clients, and S3 Select supports the CSV, JSON, and Parquet formats.
Bucket features
Ceph supports advanced bucket features like S3 bucket policy, S3 object versioning, S3
object lock, rate limiting, bucket object count quotas, and bucket capacity quotas. In addition
to these advanced bucket features, Ceph Object Gateway boasts impressive scalability that
empowers organizations to efficiently store and access massive amounts of data.
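For example, bucket-scope quotas and a per-user rate limit might be configured with radosgw-admin, as sketched below. The user ID and limit values are placeholders, and the ratelimit subcommand requires a release that includes RGW rate limiting.
radosgw-admin quota set --quota-scope=bucket --uid=john --max-objects=1000000 --max-size=1T
radosgw-admin quota enable --quota-scope=bucket --uid=john
radosgw-admin ratelimit set --ratelimit-scope=user --uid=john --max-read-ops=1024 --max-write-ops=256
radosgw-admin ratelimit enable --ratelimit-scope=user --uid=john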
Note: In a 6-server and 3-enclosure configuration (see Evaluator Group tests performance
for 10 billion objects with Red Hat Ceph Storage), Ceph Object Gateway has supported
250 million objects in a single bucket and 10 billion objects overall.
Bucket notifications
Ceph Object Gateway supports bucket event notifications, which is a crucial feature for
event-driven architectures that is widely used when integrating with Red Hat OpenShift Data
Foundation. Notifications enable real-time event monitoring and triggering of downstream
actions that include data replication, alerting, and workflow automation.
Note: For more information about event-driven architectures, see Chapter 4, “S3 bucket
notifications for event-driven architectures”, in IBM Storage Ceph Solutions Guide,
REDP-5715.
2. In the Create Service dialog box, enter values similar to the ones that are shown in
Figure 3-11 on page 47. Then, click Create Service.
To observe the running RGW service, use any of the following dashboard locations:
Ceph Dashboard home page → Object Gateways section
Ceph Dashboard → Cluster → Services
Ceph Dashboard → Object Gateway - Daemons
2. In the Create Bucket window, enter the required values that are shown in Figure 3-15 on
page 51. When finished, click Create Bucket.
For this example, use the following Ceph Object gateway bucket configuration values:
Name: Any bucket name that you prefer.
Owner: The user that you created in “Creating an Object Gateway user” on page 48.
Placement: Accept the default value.
Figure 3-19 Displaying the S3 access key and secret key for a selected user
Alternatively, retrieve the S3 access keys by using the Ceph CLI. The radosgw-admin
command can be run from the shell, or within the cephadm CLI (Example 3-1).
Note: Substitute the RGW username that you created for “john” in “Creating an Object
Gateway user” on page 48. The access key and secret key will differ and be unique for
each Ceph cluster.
Example 3-1 Ceph Object Gateway S3 access key and secret key
[root@node1 ~]# radosgw-admin user info --uid="john"
{
"user_id": "john",
"display_name": "Shubeck, John",
"email": "[email protected]",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [
{
"user": "john",
"access_key": "LGDR3IJB94XZIV4DM7PZ",
"secret_key": "qHAW3wdLGgGh78pyz8pigjxVeoM1sz1HT6lIdYD3"
}
],
There is nothing remarkable about an object. If a unit of data, for example, a JPEG/JFIF
image, is stored in a file system, then we refer to it as a file. If that file is uploaded or PUT into
an object storage bucket, we are likely to refer to it as an object. Regardless of where it is
stored, nothing changes the nature of that image. The binary data within the file
or object, and the image that can be rendered from it, are unchanged.
Note: The access key and secret key will differ and be specific to each Ceph cluster.
3. Create an S3 bucket from the AWS CLI client (optional) (Example 3-4).
Note: The endpoint value should follow the local hostname of the Ceph Object Gateway
daemon host.
4. Create a 10 MB file named 10MB.bin. Upload the file to one of the S3 buckets
(Example 3-5).
5. Get a bucket listing to view the test object. Download the object to a local file
(Example 3-6).
Example 3-6 AWS CLI LIST object and GET object to file
[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 ls s3://s3-bucket-1
2023-07-05 16:55:39 10485760 10MB.bin
Example 3-7 Verifying the file against the object MD5 checksum
[root@client ~]# diff /tmp/10MB.bin /tmp/GET-10MB.bin
[root@client ~]#
[root@client ~]# openssl dgst -md5 /tmp/10MB.bin
MD5(/tmp/10MB.bin)= f1c9645dbc14efddc7d8a322685f26eb
[root@client ~]# openssl dgst -md5 /tmp/GET-10MB.bin
MD5(/tmp/GET-10MB.bin)= f1c9645dbc14efddc7d8a322685f26eb
[root@client ~]#
S3 API compatibility
Developers can use a RESTful API that is compatible with the Amazon S3 data access
model. It is through the S3 API that object storage clients and applications store, retrieve, and
manage buckets and objects stored by an IBM Storage Ceph cluster. IBM Storage Ceph and
the Ceph community are invested heavily in a design goal known as S3 Fidelity, which
means clients and independent software vendors (ISVs) can realize independence and
transportability for their applications across S3 vendors in a hybrid multi-cloud.
At a high level, here are the supported S3 API features in IBM Storage Ceph Object Storage:
Basic bucket operations (PUT, GET, LIST, HEAD, and DELETE).
Advanced bucket operations (Bucket policies, website, lifecycle, ACLs, and versions).
Basic object operations (PUT, GET, LIST, and POST).
Advanced object operations (object lock, legal hold, tagging, multipart, and retention).
S3 select operations (CSV, JSON, and Parquet formats).
Support for both virtual hostname and path name bucket addressing formats.
At the time of writing, Figure 3-21 on page 59 shows the S3 object operations.
3.2.11 Conclusion
IBM Storage Ceph Object storage provides a scale-out, high-capacity object store for S3 API
and Swift API client operations.
The Ceph service at the core of object storage is the RGW. The Ceph Object Gateway
services its clients through S3 endpoints, which are Ceph nodes where instances of RGW
operate and service S3 API requests on well-known TCP ports through HTTP and HTTPS.
Note: This section provides a brief overview of the block storage feature in Ceph. All the
following points and many others are documented in the IBM Storage Ceph
documentation.
IBM Storage Ceph block storage, also commonly referred to as RBD, is a distributed block
storage system that allows for the management and provisioning of block storage volumes
(also known as images), akin to traditional storage area networks (SANs) or direct-attached
storage (DAS) but with increased flexibility and scalability.
RBD volumes can be accessed either through a kernel module (for Linux and Kubernetes) or
through the librbd API (for OpenStack and Proxmox). In the Kubernetes world, RBD volumes
are well suited for read/write once (RWO) persistent volume claims (PVCs). IBM Storage
Ceph RBD by default employs Thin Provisioning, which means that underlying storage is
allocated when client data is actually written, not when the volume itself is created. For
example, one might create a 1 TiB RBD volume but initially write only 10 GiB of data to it. In
this case, IBM Storage Ceph consumes only 10 GiB of physical storage media capacity
(modulo replication or erasure coding for data protection).
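A minimal sketch of creating and mapping a thin-provisioned RBD image with the rbd CLI follows; the pool and image names are placeholders.
rbd create rbdpool/vol01 --size 1T
rbd map rbdpool/vol01
# Report provisioned versus actually used capacity
rbd du rbdpool/vol01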
The virtual machine (VM), through the virtio-blk driver and the Ceph library, accesses RBD
images as though they are physical drives that are directly attached to the VM.
RBD cache is enabled by default in write-back mode on a Ceph client machine, but it can be
set to write-through mode.
Snapshots
RBD images, like many storage modalities, support snapshots, which are convenient for data
protection, testing and development, and VM templating. RBD snapshots capture the state of
an RBD image at a specific point in time by using copy-on-write (COW) technology.
IBM Storage Ceph supports up to 512 snapshots per RBD image (the number is technically
unlimited, but the volume's performance is negatively affected beyond a handful).
Snapshots are read-only and track changes that are made to the original RBD image so that
it is possible to access or roll back to the state of the RBD image at snapshot creation time.
This means that snapshots cannot be modified. To make modifications, snapshots must be
used in conjunction with clones.
Because clones rely on a parent snapshot, losing the parent snapshot causes all child clones
to be lost. Therefore, the parent snapshot must be protected before creating clones
(Figure 3-24).
Clones are new images, so they can be snapshotted, resized, or renamed. They can also be
created on a separate pool for performance or cost reasons. Finally, clones are
storage-efficient because only modifications made relative to the parent’s data are stored on
the cluster.
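The snapshot, protect, and clone workflow can be sketched with the rbd CLI as follows; the pool, image, and snapshot names are placeholders.
rbd snap create rbdpool/vol01@snap1
rbd snap protect rbdpool/vol01@snap1
rbd clone rbdpool/vol01@snap1 rbdpool/vol01-clone
# Optionally detach the clone from its parent by copying all data
rbd flatten rbdpool/vol01-clone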
Journaling mode can also be enabled to store data. In this mode, all writes are sent to a
journal before being stored as an object. The journal contains all recent writes and metadata
changes (device resizing, snapshots, clones, and others). Journaling mode is used for
efficient RBD mirroring.
RBD mirroring also supports a full lifecycle, meaning that after a failure, the synchronization
can be reversed and the original site can be made the primary site again when it is back
online. Mirroring can be set at the image level, so not all cluster images must be replicated.
RBD mirroring can be one-way or two-way. With one-way replication, multiple secondary sites
can be configured. Figure 3-25 on page 65 shows one-way RBD mirroring.
Two-way mirroring limits replication to two storage clusters. Figure 3-26 shows two-way RBD
mirroring.
Another strategy is to use snapshot-based mirroring. With this method, the remote cluster site
monitors the data and metadata differences between two snapshots before copying the
changes locally. Unlike the journal-based method, the snapshot-based method must be
scheduled or launched manually, so it is less incremental and current, but might also be
faster.
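Assuming that rbd-mirror daemons are deployed and the peer clusters are already bootstrapped, per-image snapshot-based mirroring might be enabled and scheduled as sketched below. The pool, image, and interval values are placeholders.
# Enable per-image mirroring on the pool
rbd mirror pool enable rbdpool image
# Enable snapshot-based mirroring for one image
rbd mirror image enable rbdpool/vol01 snapshot
# Take a mirror snapshot automatically every hour
rbd mirror snapshot schedule add --pool rbdpool --image vol01 1h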
Note: For more information about mirroring, see the IBM Storage Ceph documentation.
Encryption
Using the Linux Unified Key Setup (LUKS) 1 or 2 format, RBD images can be encrypted if
they are used with librbd (at the time of writing, KRBD is not supported). The format
operation persists the encryption metadata to the RBD image. The encryption key is secured
by a passphrase that is provided by the user to create and access the device when it is
encrypted. Data at rest, which means the data that Ceph writes to physical storage drives,
may alternately be encrypted at the underlying RADOS layer by specifying the dmcrypt option
when OSDs are deployed.
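The following sketch shows how an RBD image might be formatted for LUKS2 encryption with librbd; the pool, image, and passphrase file names are placeholders:
rbd encryption format mypool/myimage luks2 /root/passphrase.txt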
Note: For more information about encryption, see the IBM Storage Ceph documentation.
Quality of service
By using librbd, it is possible to limit per-image I/O by using parameters that are disabled
by default and that operate independently of each other. For example, write input/output operations
per second (IOPS) can be limited while read IOPS is not. Throughput and IOPS may be similarly
throttled independently of each other. Here are these parameters:
IOPS The number of IOPS (read + write)
read IOPS The number of read IOPS
write IOPS The number of write IOPS
bps The bytes per second (read + write)
read bps The bytes per second read
write bps The bytes per second written
These settings can be configured at the image creation time or any time after by using either
the rbd CLI tool or the Dashboard.
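For example (pool, image, and limit values are illustrative), a per-image write IOPS limit and an aggregate bandwidth limit can be set with the rbd CLI:
rbd config image set mypool/myimage rbd_qos_write_iops_limit 2000
rbd config image set mypool/myimage rbd_qos_bps_limit 104857600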
Namespace isolation
Namespace isolation can restrict client access to private namespaces by using authentication
keys. In a private namespace, clients can see only their own RBD images and cannot access
the images of other clients in the same RADOS pool but in a different namespace.
You can create namespaces and configure images at image creation by using the rbd CLI
tool or the Dashboard.
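A minimal sketch of namespace isolation, with placeholder pool, namespace, and client names, creates a namespace, a CephX user restricted to it, and an image inside it:
rbd namespace create --pool mypool --namespace project1
ceph auth get-or-create client.project1 mon 'profile rbd' osd 'profile rbd pool=mypool namespace=project1'
rbd create --size 10G mypool/project1/image1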
Live migration
When live migration is initiated, the source image is deep-copied to the destination image,
preserving all snapshot history and sparse allocation of data where possible. Live migration
requires creating a target image; clients maintain read/write access to the data through the
target while the linked source image is migrated in the background.
Trash
When you move an RBD image to trash, you may keep it for a specified time before it is
permanently deleted and the underlying storage capacity freed. If the retention time has not
been exceeded and the trash has not been purged, a trashed image can be restored.
Trash management is available from both the CLI tool and the Dashboard.
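For illustration (names, dates, and the image ID are placeholders), an image can be moved to the trash with an expiration date, listed, and restored with the rbd CLI:
rbd trash mv mypool/myimage --expires-at "2025-01-01"
rbd trash ls mypool
rbd trash restore mypool/{image_id}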
Such a script is useful for automatically mapping and mounting RBD images at boot time and
unmapping them at shutdown on a Ceph client, by listing one RBD device path (/dev/rbdX)
per line together with the credentials used to manage it.
In summary, RBD image features make them well suited for various workloads, including both
virtualization (VMs and containers) and high-performance applications (such as databases
and applications that require high IOPS and use small I/O size). RBD images provide
capabilities including snapshots and layering, and stripe data across multiple servers in the
cluster to improve performance and availability.
Although the shared file system feature was the original use case for IBM Storage Ceph, the
demand for block and object storage by cloud providers and companies implementing cloud
infrastructures meant that those access methods matured first. CephFS relies on one or more
Metadata Server (MDS) daemons to manage the file system namespace and serve metadata to clients.
The MDS also maintains a cache to improve metadata access performance while managing
the cache of CephFS clients to ensure that the clients have the proper cache data and to
prevent deadlocks on metadata access.
When a new inode or dentry is created, updated, or deleted, the cache is updated and the
inode or dentry is recorded, updated, or removed in a RADOS pool that is dedicated to
CephFS metadata.
If the active MDS becomes inactive, a standby MDS becomes the active MDS.
An MDS can be marked as standby-replay for one or more ranks so that journal changes are
continuously applied to its cache. This feature enables faster failover between two MDS
daemons. To use this feature, each CephFS file system rank must be assigned to a
standby-replay MDS.
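For example, assuming a file system named myfs, standby-replay can be enabled with a single command:
ceph fs set myfs allow_standby_replay true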
This journal is used to maintain file system consistency during an MDS failover operation.
Events can be replayed by the standby MDS to reach a file system state that is consistent
with the state last reached by the now defunct, previously active MDS.
The journal comprises multiple RADOS objects in the metadata pool, and journals are striped
across multiple objects for performance. Each active MDS maintains its own journal for
performance and resiliency reasons. Old journal entries are automatically trimmed by the
active MDS.
The file system configuration can be paired with an MDS deployment and configuration to
serve each file system with the required level of metadata performance:
A hot directory requires a dedicated set of MDSs.
An archival file system is read-only, and metadata updates occur in large batches.
Increasing the number of MDSs can remediate a metadata access bottleneck.
Each file system is assigned a parameter named max_mds that indicates the maximum
number of MDSs that can be active. By default, this parameter is set to 1 for each file system
that is created in the Ceph storage cluster.
A best practice when creating file systems is to deploy max_mds+1 MDSs for each file system,
with one MDS in the standby role to ensure file system accessibility and resilience.
The number of active MDSs can be dynamically increased or decreased for a file system
without client disruption.
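For example, again assuming a file system named myfs, the number of active MDS ranks can be raised from one to two (or lowered again) at any time:
ceph fs set myfs max_mds 2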
Note: A Damaged rank will not be assigned to any MDS until it is fixed and after the Ceph
administrator issues a ceph mds repaired command.
Regarding the number of MDSs that are assigned to a file system, the Ceph Manager (MGR)
provides the MDS Autoscaler module that monitors the max_mds and the
standby_count_wanted parameters for a file system.
Because a client may malfunction and cause some metadata to be held in the cache longer
than expected, the Ceph cluster has a built-in warning mechanism, which uses the
mds_health_cache_threshold parameter. A warning is raised when the actual cache usage is
150% of the mds_cache_memory_limit parameter.
File system affinity is achieved by assigning the mds_join_fs parameter for a specific MDS instance.
When a rank enters the Failed state, the Monitors (mons) in the cluster assign to the Failed
rank a standby MDS for which the mds_join_fs is set to the name of the associated file
system.
If no standby MDS has this parameter set, a non-dedicated standby MDS is assigned to the
Failed rank.
Ephemeral pinning
Dynamic tree partitioning of directory subtree can be configured by using policies assigned to
directories within a file system, which influences the distribution of the MDS workload. The
following extended attributes of a file system directory can be used to control the balancing
strategy across active MDSs:
ceph.dir.pin.distributed All immediate children are ephemerally pinned and distributed across the active MDS ranks.
ceph.dir.pin.random The fraction of children that will be ephemerally pinned to a random rank.
Ephemeral pinning does not persist after the inode is dropped from the MDS cache.
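As an illustrative sketch (the directory path is a placeholder and the file system must be mounted), a distributed ephemeral pinning policy is set through an extended attribute:
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/shared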
Manual pinning
If necessary, when Dynamic Tree Partitioning is not sufficient, with or without subtree
partitioning policies (for example, one directory is hot), the Ceph administrator can pin a
directory to a particular MDS. This goal is achieved by setting an extended attribute
(ceph.dir.pin) of a specific directory to indicate which MDS rank oversees metadata
requests for it.
Pinning a directory to a specific MDS rank does not dedicate that rank to this directory
because Dynamic Tree Partitioning never stops between active MDSs.
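A minimal sketch of manual pinning, with a placeholder path, pins a directory subtree to MDS rank 2:
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/hotdir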
RADOS pools
Each file system requires a RADOS metadata pool to store metadata and the file system
journal, and at least one data pool to store file data.
Data layout
This section describes metadata and data pools.
Metadata pool
The journaling mechanism uses dedicated RADOS objects, and the journal is striped across
multiple RADOS objects for performance.
Each inode in the file system is stored in a set of RADOS objects that are named as follows:
{inode_number}.{inode_extent}
The {inode_extent} starts as 00000000 and is incremented as additional data extents are
added.
Data pool
The organization of the data pool is driven by a set of extended attributes assigned to files
and directories. By default, a CephFS file system is created with one metadata pool and one
data pool. An extra data pool can be attached to an existing file system through the set of
extended attributes that is managed by the MDS. To add an additional data pool, run a
command of the following form:
$ ceph fs add_data_pool {filesystem_name} {data_pool_name}
The placement and organization of data follow these rules:
A file, when created, inherits the layout attributes of its parent directory.
A file attribute can be changed only if the file contains no data.
Existing files are not affected by later attribute changes on the parent directory; only files created afterward inherit the new layout.
All attributes at the file level have similar names prefixed with ceph.file.layout.
These attributes can be displayed by using the following commands when the file system is
mounted:
getfattr -n ceph.dir.layout {directory_path}
getfattr -n ceph.file.layout {file_path}
These attributes can be set by using the following commands when the file system is
mounted:
setfattr -n ceph.dir.layout.attribute -v {value} {dir_path}
setfattr -n ceph.file.layout.attribute -v {value} {file_path}
For example, let us visualize the RADOS physical layout of a CephFS file with the following
attributes, set explicitly or inherited from its parent directory (Figure 3-30 on page 75):
The file size is 8 MiB.
The object size is 4 MiB.
The stripe unit is 1 MiB.
The stripe count is 4.
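With these assumed values, the 8 MiB file comprises eight 1 MiB stripe units that are written round-robin across an object set of four RADOS objects, named {inode_number}.00000000 through {inode_number}.00000003, so each of the four objects holds 2 MiB of the file's data.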
We can display the names of the RADOS objects that store the file’s data (Example 3-9).
Two inodes were created (10000000001 and 100000001f6), and each one comprises
25 objects. Because the file system is not pre-configured or customized, the default layout of
stripe_unit=4 MiB, stripe_count=1, and object_size=4 MiB applies, so each file maps to 25 * 4 MiB = 100 MiB.
How do we know which file is which inode number? To answer this question, see
Example 3-11.
Volumes are used to manage exports through a Ceph Manager module that provides shared
file system capabilities via OpenStack Manila and Ceph CSI in Kubernetes / Red Hat
OpenShift environments:
Volumes represent an abstraction for CephFS filesystems.
Subvolumes represent an abstraction for directory trees.
Subvolume groups aggregate subvolumes to apply specific common policies across
multiple subvolumes.
This feature streamlines creating a CephFS file system with a single command that
automatically creates the underlying pools, then deploys the MDSs. Here is the command:
ceph fs volume create {volume_name}
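For illustration (volume, subvolume group, and subvolume names are placeholders), subvolume groups and subvolumes are created with similar one-line commands:
ceph fs subvolumegroup create myvol csi
ceph fs subvolume create myvol subvol1 --group_name csi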
Quotas
IBM Storage Ceph CephFS supports quotas to restrict the number of bytes or the number of
files within a directory. This quota mechanism is enforced by both Linux kernel and userspace
FUSE clients.
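As a sketch with placeholder values, quotas are set on a mounted directory through extended attributes (a value of 0 removes the quota):
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/projects
setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/projects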
Note: These commands are managed directly by the OpenStack Ceph Manila driver and
the Ceph CSI driver at the subvolume level.
The Ceph FUSE client was subsequently created to allow non-Linux clients to access a
CephFS filesystem (Figure 3-31).
IBM Storage Ceph supports NFS v3 and NFS v4 with Version 6.x and later.
The NFS implementation requires enabling the Ceph Manager nfs module. This can be
accomplished by running the following command:
ceph mgr module enable nfs
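Building on that module, a hedged sketch of exposing a CephFS path over NFS might look like the following; the cluster, file system, and path names are placeholders, and the export syntax varies between releases, so check ceph nfs export create cephfs --help on your cluster:
ceph nfs cluster create mynfs "ceph-node01,ceph-node02"
ceph nfs export create cephfs --cluster-id mynfs --pseudo-path /exports/data --fsname myfs --path=/data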
You can assign specific permissions to a directory for a specific Ceph client username or user
ID by running the following command:
$ ceph fs authorize {filesystem_name} {client_id|client_name} {path} {permissions}
For example, imagine a Ceph client CephX definition like Example 3-12.
If we try to modify an attribute of the root directory (/) of a CephFS file system mounted at
the /mnt mount point, the request is denied because the set of capabilities does not include 'p':
# setfattr -n ceph.dir.layout.stripe_count -v 2 /mnt
setfattr: /mnt: Permission denied
Example 3-13 Creating another user with the correct capabilities and remounting the CephFS file
system
# mkdir /mnt/dir4
# ceph fs authorize myfs client.4 / rw /dir4 rwp
[client.4]
key = AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ==
# umount /mnt
# mount -t ceph ceph-node01.example.com,ceph-node02.example.com:/ /mnt -o
name=4,secret="AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ=="
# touch /mnt/dir4/file1
# setfattr -n ceph.file.layout.stripe_count -v 2 /mnt/dir4/file1
# getfattr -n ceph.file.layout /mnt/dir4/file1
# file: mnt/dir4/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304
pool=cephfs.fs_name.data"
Snapshots can be created at any level in the directory tree structure, including the root level of
a CephFS file system.
Snapshot capabilities for specific users can be enabled at the individual client level by
granting the 's' flag.
Snapshot capabilities can also be enabled or disabled for an entire file system by running a
command of the following form:
# ceph fs set {filesystem_name} allow_new_snaps true|false
To create a snapshot, the user that mounted the CephFS creates a unique subdirectory within
the '.snap' directory with the desired name by running a command of the following form:
# mkdir /mnt/.snap/mynewsnapshot
A user with sufficient privileges can create regular snapshots of a specific directory by
running the snap-schedule command (Example 3-14).
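A representative invocation, with a placeholder path and interval, creates an hourly schedule and lists it:
ceph fs snap-schedule add / 1h
ceph fs snap-schedule list /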
To set the retention of snapshots, run the snap-schedule command with a retention argument:
# ceph fs snap-schedule retention add / h 24
Retention added to path /
This feature requires both clusters to run an identical Ceph release, IBM Storage Ceph 5.3 or
later.
The feature is based on CephFS snapshots taken at regular intervals. The first snapshot
requires a full transfer of the data, and later snapshots require transferring only the data that
was updated (delta) since the last snapshot was applied on the remote cluster.
CephFS mirroring is disabled by default, and must be enabled separately by making the
following changes:
Enable the MGR mirroring module.
Deploy a cephfs-mirror component with cephadm.
Authorize the mirroring daemons in both clusters (source and target).
Set up peering among the source and the target clusters.
Configure the CephFS file system path to mirror.
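A condensed sketch of these steps, with placeholder file system and path names, might look like the following; the peer authorization and bootstrap steps are release-specific, so consult the mirroring documentation for those details:
ceph mgr module enable mirroring            # on both clusters
ceph orch apply cephfs-mirror               # on the source cluster
ceph fs snapshot mirror enable myfs         # on the source cluster
ceph fs snapshot mirror add myfs /projects  # directory path to mirror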
Note: For more information about sizing an IBM Storage Ceph environment, see the
Planning section of the IBM Storage Ceph documentation.
4.1.1 IOPS-optimized
IOPS optimization deployments are suitable for cloud computing operations, such as running
MySQL or MariaDB instances as virtual machines (VMs) on OpenStack or as containers on
Red Hat OpenShift. IOPS-optimized deployments require higher performance SSD storage
devices to achieve high IOPS and throughput.
4.1.2 Throughput-optimized
Throughput-optimized deployments are ideal for serving large amounts of data, such as
graphics, audio, and video content. They require high-bandwidth networking hardware,
controllers, and HDDs or SSDs with fast sequential read/write performance. If fast data
access is required, use a throughput-optimized storage strategy. If fast write performance is
required, consider using SSDs for metadata and optionally payload data. A
throughput-optimized storage cluster has the following properties:
Lowest cost per MB/s (throughput)
Highest MB/s per TB
97th percentile latency consistency
4.1.3 Capacity-optimized
Capacity-optimized deployments are ideal for storing large amounts of data at the lowest
possible cost. They typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive large capacity SATA or
NL-SAS HDDs or QLC SSDs, and can host many object storage daemon (OSD) drives per
server. A cost and capacity-optimized storage cluster has the lowest cost per TB. Note that
IBM Storage Ceph RGW object storage and CephFS deployments ideally use fast SSD
media for their metadata pools.
A cost and capacity-optimized storage cluster or pool has the following properties:
Often object storage
EC for maximizing usable capacity
Artifact archival
Video, audio, and image object repositories
By default, Ceph uses replicated pools, which means that data is copied from a primary OSD
node to one or more secondary OSDs. Erasure-coded pools reduce the disk space that is required to protect data.
EC is a method of durably storing a RADOS object in the Ceph storage cluster by breaking it
into data chunks (k) and coding chunks (m). These chunks are stored on different OSDs in
unique hosts, racks, or other failure domains. When an OSD fails, Ceph retrieves the
remaining data (k) and coding (m) chunks from the surviving OSDs and uses the erasure
code algorithm to restore the object from those chunks.
EC uses storage capacity more efficiently than replication. The n-replication approach
maintains n copies of an object (3x by default in Ceph), but EC maintains k + m chunks. For
example, four data and two coding chunks protect data while consuming only 1.5x the
physical storage space of the original object’s size.
Although common EC profiles use less physical storage capacity than replication (lower
space amplification), EC requires more memory and CPU than replication to access or recover
objects. EC is advantageous when data storage must be durable and fault-tolerant, but does
not require fast write performance (for example, cold storage, historical records, and others).
Ceph supports two networks: a public network and a storage cluster network. The public
network handles client traffic and communication with Ceph Monitors (mons). The cluster
(also known as the replication, back-end, or private) network, if deployed, carries Ceph OSD heartbeat,
replication, and recovery traffic. For production clusters, it is a best practice to use at least two
bonded 10 Gbps links for each logical network interface.
Link Aggregation Control Protocol (LACP) mode 4 (with xmit_hash_policy layer3+4 or layer2+3 according
to local topology) can be used to bond network interfaces. Use jumbo frames with a maximum
transmission unit (MTU) of 9000, especially on the back-end or cluster network.
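As a hedged sketch (interface names, bond options, and the MTU placement depend on local topology), an LACP bond with jumbo frames can be created on RHEL with NetworkManager:
nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"
nmcli con add type ethernet con-name bond0-port1 ifname ens1f0 master bond0
nmcli con add type ethernet con-name bond0-port2 ifname ens1f1 master bond0
nmcli con mod bond0 802-3-ethernet.mtu 9000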
Clusters that provide object storage also deploy RADOS Gateway (RGW) daemons
(radosgw). If CephFS shared file services are required, Ceph Metadata Servers (MDSs)
(ceph-mds) will be configured.
Colocating multiple services within a single cluster node has the following benefits:
Significant improvement in total cost of ownership (TCO) versus smaller clusters
Reduction from six hosts to four for the minimum configuration
Simpler upgrade
Better resource isolation
Potential to share underlying physical storage device capacity
Modern servers have 16 or more CPU cores and large amounts of RAM, which supports
colocation of services. Running each service type in dedicated cluster nodes that are
separated from other types (non-colocated) would require many smaller servers or
underutilize resources.
There are considerations that constrain deploying every type of service on a single node.
For any host that has ceph-mon/ceph-mgr and OSD, only one of the following should be
added to the same host:
RGW
Ceph MDSs
Grafana
RGW and Ceph Metadata services are best deployed on different nodes.
Daemon resource requirements might change between releases. Check the latest (or the
release that you want) requirements from IBM Documentation.
Figure 4-6 on page 91 shows the aggregate server CPU and memory resources that are
required for an OSD node with various numbers of HDD, SATA SSD OSDs, or NVMe SSD
OSDs.
It is best practice for many workloads to add SSD devices with capacity at least 4% of
aggregate HDD capacity to each server for metadata when deploying IBM Storage Ceph
RGW for object storage service. RGW index pools experience a high rate of small random
operations and do not perform well on slower media.
Figure 4-7 Performance scaling serving large objects as more RGW instances are deployed
Scaling from one to seven RGW daemons in a seven-node cluster by spreading daemons
across all nodes improves performance 5 - 6x. Adding additional RGW daemons to nodes
that already host an RGW daemon does not scale performance as much.
Figure 4-8 shows the performance impact of adding a second RGW daemon to each of the
seven cluster nodes.
Figure 4-8 Performance effect of adding a second RGW daemon in each of the seven cluster nodes
Small objects also scale well in IBM Storage Ceph RGW deployments, and performance is
typically measured in operations per second (OPS). Figure 4-9 on page 93 shows the small
object performance of the same cluster as above.
IBM Storage Ceph includes an ingress service for RGW that deploys haproxy and keepalived
daemons for convenient provisioning of a single virtual IP service endpoint.
A recovery calculator tool to estimate node recovery time can be found at Red Hat Storage
Recovery Calculator.
You must be registered as a Red Hat customer to access the tool. Go to this site to create a
Red Hat login.
Select Host as the OSD Failure Domain. The number to observe is MTTR in Hours (Host
Failure), which must be less than 8 hours.
Using the recovery calculator is especially important for small clusters, where recovery times
with large drives often exceed 8 hours.
IBM Storage Ready Nodes for Ceph enable customers to receive both hardware maintenance
and software support from IBM.
At the time of writing, there is one server model available with various drive configurations, as
shown in Figure 4-11 on page 95.
Figure 4-11 IBM Storage Ceph with Storage Ready Node specifications
Each node comes with two 3.84 TB SSD acceleration drives, which are used as metadata
drives. Available data drives are 8 TB, 12 TB, 16 TB, and 20 TB SATA HDDs. Every node
must be fully populated with 12 drives of the same size.
Figure 4-12 Capacity examples for IBM Storage Ready Nodes for Ceph
IBM Storage Ready Nodes for Ceph are well suited for throughput-optimized workloads, such
as backup storage.
Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes are
shown in Figure 4-13.
Figure 4-13 Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes
Example server models and estimated IOPS per host using NVMe SSDs are shown in
Figure 4-14.
Figure 4-14 Example server models and estimated IOPS per host using NVMe SSDs
It is a best practice to have at least 10 Gbps of network bandwidth for every 12 HDDs in an
OSD node for both cluster network and client network. This approach ensures that the
network does not become a bottleneck.
4.10.2 Throughput-optimized
A throughput-optimized Ceph cluster can be designed with SSDs or hybrid using both SSDs
and HDDs. The objective of a throughput-optimized cluster is to provide required throughput
performance. The workload is typically large files, and the most common use cases are
backups, artifact repositories, and content distribution via RGW object access.
Figure 4-15 Example server models and estimated MBps per OSD drive by HDDs as data drives
4.10.3 Capacity-optimized
Capacity-optimized Ceph clusters use large-capacity HDDs, and are often designed to be
narrower (fewer servers) but deeper (more HDDs per server) than throughput-optimized
clusters. The primary objective is to achieve the lowest cost per GB.
Capacity-optimized Ceph clusters are typically used for archive storage of large files, with
RGW object access as the most common use case. Although it is possible to colocate
metadata on the same HDDs as data, it is still a best practice to separate metadata and place
it on SSDs, especially when using EC protection for the data. This approach can provide a
performance benefit.
Many of the typical server models connect to an external Just a Bunch of Disks (JBOD) drive
enclosure where the data drives reside. External JBOD enclosures range from 12-drive models
to models with more than 100 drives.
Example server models for capacity-optimized clusters are shown in Figure 4-16.
In this sizing example, the IOPS requirement is high relative to a small capacity requirement. Therefore, plan to use NVMe
SSDs, which provide the highest IOPS per drive. Refer to the performance chart in
Figure 4-14 on page 97.
Performance planning
Figure 4-14 on page 97 shows that a single server with six NVMe drives can serve 50 K write
IOPS and 200 K read IOPS with a 4 KB block size.
Plan for ten servers to accommodate growth and server failure and maintenance.
CPU requirement
A best practice is to have 4 - 10 CPU cores per NVMe OSD. Our example solution has six
NVMe OSDs per server, so the CPU requirement for the OSD daemons is 24 - 60 CPU cores
(threads).
Memory requirement
A best practice is to have 12 - 15 GB of memory per NVMe drive. Our solution has six NVMe
drives per server, so the memory requirement for the OSD daemons is 72 - 90 GB.
Networking requirement
We know that six NVMe drives in one host serve about 200 K read IOPS (the write IOPS is
lower). The block size is 4 KB. The network bandwidth that is required per server is
200,000 * 4 KB/s = 800 MBps = 6.4 Gbps for both the client and cluster networks.
Figure 4-18 shows our possible solution for an IOPS-optimized Ceph cluster.
Sizing example
Throughput-optimized systems often combine low-cost drives for data and high-performance
drives for metadata. IBM Storage Ready Nodes for Ceph offer this combination of drives.
The Ceph cluster must be designed in a way that allows data to be rebuilt on the remaining
nodes if one of the servers fails. This headroom must be added to the requested 1 PB.
Also, a Ceph cluster stops accepting writes when an OSD reaches 95% capacity utilization
(the full_ratio), and OSDs that reach 90% (the backfillfull_ratio, by default) cannot accept backfill or recovery data.
Ceph S3 backup pools typically use EC to protect data. For a 1 PB usable capacity, EC 4+2 is
often a better choice than EC 8+3 or EC 8+4 because erasure coding profiles with large
values of (K+M) result in lower write performance and much longer elapsed time to recover
from component loss.
The minimum number of nodes to support an EC pool is (K+M+1). In our 4+2 case, this
means seven nodes. Investigate whether a good solution for the required capacity can be found
with seven nodes, or whether more are needed.
The largest available drive size in IBM Storage Ready Nodes for Ceph is 20 TB (7 nodes * 12
drives per node * 20 TB drives = 1680 TB raw).
This configuration seems to be a low-cost alternative for 1 PB of usable space. But, can we
use 20 TB drives in a 7-node cluster, as shown in Figure 4-19? What does the recovery
calculator show?
The capacity math for an 8-node cluster with 20 TB drives is shown in the following list:
1. Eight nodes, each with 12 pcs of 20 TB drives = 8*12*20 TB = 1920 TB raw capacity.
2. EC 4+2 has four data chunks and two EC chunks. Each chunk is the same size.
3. One chunk of 1920 TB is 1920 TB / 6 = 320 TB.
4. There are four data chunks in EC 4+2, so the usable capacity is 4*320 TB = 1280 TB.
5. This scenario would satisfy the customer's 1 PB capacity requirement, and the cluster has
enough capacity to rebuild data if one node fails and still stay below 95% capacity
utilization.
a. Each server has 12x 20 TB = 240 TB raw capacity.
b. Total cluster raw capacity is 1920 TB.
c. The cluster raw capacity if one node breaks is 1920 TB - 240 TB = 1680 TB.
d. Raw 1680 TB equals 1120 TB usable.
e. A customer has 1 PB (1000 TB) of data.
f. The capacity utilization if one host is broken is 1000 TB / 1120 TB = 0.89 = 89%.
6. The conclusion is that an 8-node cluster with 20 TB drives has enough capacity to stay
below 95% capacity utilization even if one of the servers fails and its data is rebuilt on the
remaining servers.
7. Now that the raw capacity in each OSD node is calculated, add SSD drives for metadata
with capacity at least 4% of the raw capacity on the host.
– 4% of 240 TB is 9.6 TB.
– A best practice is to have one metadata SSD for every four or six HDDs. Therefore,
provision two or three 3.84 TB SSDs in each host.
CPU requirement
A best practice is to have at least one CPU core (thread) per HDD. This solution has 12 HDDs
per server, so the CPU requirement for the OSD daemons is 12 CPU cores.
Memory requirement
A best practice is to have 10 GB of memory per HDD for a throughput-optimized use case.
Our solution has 12 HDDs per server, so the memory requirement for the OSD daemons is
120 GB.
Networking requirement
One HDD delivers 90 MBps read performance. Our solution has 12 HDDs per server, so the
network bandwidth requirement per network is 12 * 90 MBps = 1080 MBps = 8.64 Gbps.
A possible solution
An object storage archive for cold data is a use case where you can design a cluster that is
narrow (fewer servers) and deep (many OSD drives per host). The result is a low-cost but not
high-performing cluster. Dense servers benefit greatly from using SSDs for metadata, and it is
a best practice to include them even for a cold archive use case.
Because the capacity requirement is high, consider using EC 8+3, which has the lowest
overhead (raw capacity versus usable capacity). An IBM Storage Ceph cluster that uses EC
8+3 requires a minimum of 12 nodes.
First, check whether a 12-node cluster with large drives and dense chassis can recover
from host failure in less than 8 hours.
Bring up the Recovery Calculator tool and enter the parameters of the cluster that you want to
check:
Twelve OSD hosts
Eighty-four OSD drives at 20 TB each per host
Two 25 GbE network ports
Figure 4-22 shows that the calculated recovery time for such a cluster is less than 8 hours,
which means that this scenario is a supported design.
CPU requirement
For a cold archive use case, you can provision as few as 0.5 CPU cores (threads) per HDD.
Our solution has 84 HDDs per server, so the CPU requirement for OSD daemons is 42 CPU
cores.
Memory requirement
A best practice is to have 5 GB of memory per HDD for capacity-optimized use case. Our
solution has 84 HDDs per server, so the memory requirement for OSD daemons is 420 GB.
Networking requirement
One HDD has 90 MBps read performance. Our solution has 84 HDDs per server, so the
network bandwidth requirement per network is 84 * 90 MBps = 7560 MBps, or about 60 Gbps.
Here are the three technical skills that are needed to manage and monitor a Ceph cluster:
Operating system: The ability to install, configure, tune, and manage the operating system
running on Ceph nodes (Red Hat Enterprise Linux).
Networking: The ability to configure TCP/IP networking in support of Ceph cluster
operations for both client access (public network) and Ceph node-to-node
communications (cluster network).
Storage: The ability to design and implement a Ceph storage architecture tailored to the
performance, data availability, and data durability requirements of client applications and
data.
Administrators must perform these roles in managing a Ceph cluster by using the correct tool
at the correct time. This chapter provides an overview of Ceph cluster monitoring and the
tools that are available to perform monitoring tasks.
Ceph Monitor
Ceph Monitors (mons) are the daemons that are responsible for maintaining the cluster map,
which is a collection of data that provides comprehensive information about the cluster's
topology, state, and configuration. Ceph proactively handles cluster events, updates the
relevant map, and distributes the updated map to all Monitor daemons. A typical Ceph cluster
includes three or five Monitors, each running on a separate host for high availability. To ensure
data integrity and consistency, Monitors employ a consensus mechanism that requires a
majority of the configured Monitors to be available and agree on the map update before it is
applied. This mechanism is why a Ceph cluster must be configured with an odd number of
Monitors to establish a quorum and prevent potential conflicts.
Ceph Manager
The Ceph Manager (mgr) is a critical component of any Ceph cluster that is responsible for
collecting and aggregating cluster-wide statistics. The first Manager daemon in a cluster
becomes the active Manager and all others are on standby. If the active Manager does not
send a beacon within the configured time interval, a standby takes over. Client I/O operations
continue normally if no Manager is active, but queries for cluster statistics fail. A best practice
is to deploy at least two Managers in a Ceph cluster to provide high availability (HA). Ceph
Managers are typically run on the same hosts as the Monitor daemons, but this association is
not required.
Ceph daemons
Each daemon within a Ceph cluster maintains a log of events, and the Monitors maintain a
cluster log that records high-level events and status regarding the cluster as a whole. These
events are logged to disk on Monitor servers (in the default location /var/log/ceph/ceph.log),
and can be monitored with CLI commands or passed to log management solutions like
Splunk or Grafana Loki.
Monitoring services
A Ceph cluster is a collection of daemons and services that perform their respective roles in
the operation of a Ceph cluster. The first step in Ceph monitoring or troubleshooting is to
inspect these Ceph components and discover whether they are running on the nodes they
are designated to run, and if they report a healthy state. For example, if the cluster design
specifies two Ceph Object Gateway (RGW) instances for handling S3 API client requests on
distinct nodes, a single CLI command or a glance at the Ceph Dashboard provides insights
into the operating status of the RGW daemons and their respective nodes.
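For example, a few CLI commands provide this at-a-glance view (the daemon type filter is illustrative):
ceph -s
ceph orch ls
ceph orch ps --daemon_type rgw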
Monitoring resources
Ceph resources encompass a cluster’s entities and constructs that define its characteristics.
These resources include networking infrastructure, storage devices (SSDs or HDDs), logical
storage pools and their capacities, and data protection mechanisms. As with monitoring other
resources, understanding the health of Ceph storage resources is crucial for effective cluster
management. At the core, administrators must ensure sufficient capacity and expansion
capabilities to provide the appropriate data protection for the applications that they support.
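Capacity and device health can be inspected with commands such as the following:
ceph df
ceph osd df tree
ceph device ls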
Monitoring performance
Ensuring that both physical and software-defined resources operate at required levels is the
responsibility of the Ceph administrator. System alerts and notifications serve as valuable
tools, alerting the administrator to anomalies without requiring constant hands-on
surveillance. Key metrics include node utilization (CPU, memory, and disk), network utilization
(interfaces, bandwidth, latency, and routing), storage device performance, and daemon
workload.
Ceph Dashboard
The Ceph Dashboard UI provides an HTTP web browser interface that is accessible on port
8443. Dashboard navigation menus provide real-time health and basic statistics, which are
context-sensitive, for cluster resources. For example, the Dashboard can display cluster read
and write throughput and IOPS. When you bootstrap your cluster with the cephadm bootstrap
command, the Dashboard is enabled by default.
The Dashboard also provides a convenient access point to observe the state and the health
of resources and services in the Ceph cluster. The following Dashboard UI navigation
provides at-a-glance views into the health of the cluster.
Cluster → Services (View Ceph services and daemon node placement and instances.)
Cluster → Logs (View and search within the cluster logs, audit logs, and daemon logs.)
Cluster → Monitoring (Management of active alerts, alert history, and alert silences.)
Prometheus
The Prometheus plug-in to the Dashboard facilitates the collection and visualization of Ceph
performance metrics by enabling the export of performance counters directly from ceph-mgr.
ceph-mgr gathers MMgrReport messages from all MgrClient processes, including Monitors
and OSDs, containing performance counter schema and data. These messages are stored in
a circular buffer, maintaining a record of the last N samples for analysis.
This plug-in establishes an HTTP endpoint, akin to other Prometheus exporters, and retrieves
the most recent sample of each counter on polling, or scraping in Prometheus parlance. The
HTTP path and query parameters are disregarded, and all available counters for all reporting
entities are returned in the Prometheus text exposition format (for more information, see the
Prometheus documentation). By default, the module accepts HTTP requests on TCP port
9283 on all IPv4 and IPv6 addresses on the host.
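For illustration, the exporter is enabled through the Manager, and its endpoint can then be scraped or inspected directly (the host name is a placeholder):
ceph mgr module enable prometheus
curl http://ceph-mgr-node:9283/metrics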
The Prometheus metrics can be viewed in Grafana from an HTTP web browser on TCP port
3000 (for example, https://ceph-mgr-node:3000).
cluster:
id: 899f61d6-5ae5-11ee-a228-005056b286b1
health: HEALTH_OK
services:
mon: 3 daemons, quorum
techzone-ceph6-node1,techzone-ceph6-node2,techzone-ceph6-node3 (age 16h)
mgr: techzone-ceph6-node1.jqdquv(active, since 2w), standbys:
techzone-ceph6-node2.akqefd
mds: 1/1 daemons up
osd: 56 osds: 56 up (since 2w), 56 in (since 4w)
rgw: 2 daemons active (2 hosts, 1 zone)
data:
volumes: 1/1 healthy
pools: 10 pools, 273 pgs
objects: 283 objects, 1.7 MiB
usage: 8.4 GiB used, 823 GiB / 832 GiB avail
pgs: 273 active+clean
. . . output omitted . . .
Example 5-8 Checking the current OSD fullness ratios for your cluster
[ceph: root@ceph-node1 /]# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Tip: When a Ceph OSD reaches the full ratio, it indicates that the device is nearly full and
can no longer accommodate additional data. This condition can have several negative
consequences for the Ceph cluster:
Reduced performance: As an OSD fills up, its performance degrades. This situation can
lead to slower read/write speeds, increased latency, and longer response times for
client applications.
Cluster rebalancing challenges: When an OSD reaches the full ratio, it can make it
more difficult to rebalance the cluster. Rebalancing is the process of evenly distributing
data across all OSDs to optimize performance and improve fault tolerance.
To avoid these consequences, monitor OSD usage and take proactive measures to prevent
OSDs from reaching the full ratio. This task might involve adding OSDs to the cluster,
increasing the capacity of existing OSDs, or deleting old or unused data.
Optional Ceph Manager modules include InfluxDB, Insights, Telegraf, Alerts, Disk
Prediction, and iostat, and can help you integrate your IBM Storage Ceph cluster into existing
monitoring and alerting systems. You can also use the SNMP Gateway service to enhance
monitoring capabilities.
Administrators can leverage the Ceph Dashboard for a straightforward and intuitive view into
the health of the Ceph cluster. The built-in Grafana dashboard presents detailed information,
context-sensitive performance counters, and performance data for specific resources and
services.
Administrators can also use cephadm, the stand-alone Grafana dashboards, and other
third-party tools to visualize and record detailed metrics on cluster utilization and
performance.
In conclusion, the overall performance of a software-defined storage solution like IBM Storage
Ceph is heavily dependent on the network. Therefore, it is crucial to integrate the nodes and
switches serving your IBM Storage Ceph cluster with existing network monitoring
infrastructure. This includes tracking packet drops and other network errors and the health of
network interfaces. The SNMP subsystem included in Red Hat Enterprise Linux can facilitate
comprehensive monitoring.
6.1.1 Prerequisites
This section describes IBM Ceph prerequisites.
For more information about the server sizing and configuration, see Chapter 4, “Sizing IBM
Storage Ceph” on page 83.
cephadm requires password-less SSH capabilities. DNS resolution must be configured for all
nodes. OSD nodes must have all the tools in the preceding list plus LVM2.
Firewall requirements
Table 6-1 on page 127 lists the TCP ports that IBM Storage Ceph uses.
Miscellaneous requirements
If you configure SSL for your Dashboard access and your object storage endpoint (RGW),
ensure that you obtain the correct certificate files from your security team and deploy them
where needed.
Ensure that your server network and configuration are adequate for the workload served by
your IBM Storage Ceph cluster:
Size your network cards, interfaces, and switches for the throughput that you must deliver.
Evaluate the network bandwidth required for data protection because it must be carried by
the network for all write operations.
Ensure that your network configuration does not present single points of failure.
If your cluster spans multiple subnets, ensure that each server can communicate with the
other servers.
6.1.2 Deployment
The IBM Storage Ceph documentation describes how to use cephadm to deploy your Ceph
cluster. A cephadm based deployment has the following steps:
For beginners:
a. Bootstrap your Ceph cluster (create one initial Monitor and one Manager).
b. Add services to your cluster (OSDs, MDSs, RGWs, and other services).
For advanced users: Bootstrap your cluster with a complete service file to deploy
everything.
Cephadm
The only supported tool for IBM Storage Ceph, cephadm has been available since the Ceph
upstream Octopus release, and has been the default deployment tool since Pacific.
You can assign labels to hosts by using the labels: field of a host service file (Example 6-2).
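A minimal host specification of the following form, with placeholder host name, address, and labels, illustrates the labels: field:
service_type: host
hostname: ceph-node02
addr: 10.0.0.12
labels:
  - mon
  - rgw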
You can assign specific placement for a service by using the placement: field in a service file.
For more information, see “Placement” on page 130.
An OSD service file offers many specific options to specify how the OSDs should be
deployed:
block_db_size to specify the RocksDB database size on separate devices.
block_wal_size to specify the write-ahead log (WAL) size on separate devices.
data_devices to specify which devices receive file data.
db_devices to specify which devices receive the RocksDB data.
wal_devices to specify which devices receive the write-ahead log (WAL).
The data_devices, db_devices, and wal_devices parameters accept the following arguments:
all to specify that all devices are to be consumed (true or false).
limit to specify how many OSD to deploy per node.
rotational to specify the type of devices to select (0 for SSDs, 1 for HDDs).
size to specify the size of the devices to select:
– xTB to select a specific device size.
– xTB:yTB to select devices between the two capacities.
– :xTB to select any device up to this size.
– xTB: to select any device at least this size.
path to specify the device path to use.
model to select devices by the drive model name.
vendor to select devices by the drive vendor name.
encrypted to specify whether the data will be encrypted at rest (data_devices only).
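As a hedged sketch combining several of these options (the service ID and device selectors are illustrative), an OSD service file might look like the following:
service_type: osd
service_id: osd_hdd_with_ssd_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0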
Container parameters
You can customize the parameters that are used by the Ceph containers by using a special
section of your service file that is known as extra_container_args. To add extra parameters,
use the template that is shown in Example 6-4 in the appropriate service files.
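A representative form, assuming you want to cap container CPU usage for a service, looks like the following:
service_type: mon
service_name: mon
placement:
  hosts:
    - ceph-node01
extra_container_args:
  - "--cpus=2"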
Placement can use explicit naming: --placement="host1 host2 …". In such a configuration,
the daemons are deployed on the nodes that are listed.
Using a service file, you encode the count as shown in Example 6-5.
Using a service file, you encode the label as shown in Example 6-6.
Using a service file, you encode the host list as shown in Example 6-7.
Using a service file, you encode the pattern as shown in Example 6-8.
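Representative placement: sections of the corresponding forms, with illustrative values, look like the following:
placement:
  count: 3              # by count
placement:
  label: rgw            # by label
placement:
  hosts:                # by explicit host list
    - ceph-node01
    - ceph-node02
placement:
  host_pattern: 'ceph-node0[1-3]'   # by pattern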
You can pass an initial Ceph configuration file to the bootstrap command by using the
--config {path_to_config_file} option.
You can override the SSH user that is used by cephadm by using the --ssh-user {user_name}
option.
You can pass a specific set of container registry parameters in a valid registry JSON file by
using the --registry-json {path_to_registry_json} option.
You can choose the Ceph container image to deploy by using the --image
{registry}[:{port}]/{imagename}:{imagetag} option.
You can specify the network configuration that the cluster uses. A Ceph cluster uses up to two
networks:
Public network:
– Used by clients (including RGWs) to connect to Ceph daemons.
– Used by Monitors to converse with other daemons.
– Used by Managers and MDSs to communicate with other daemons.
Cluster network: Used by OSDs to perform OSD operations, including heartbeats,
replication, and recovery. If a cluster network is not defined, these operations use the
public network.
The public network is extrapolated from the --mon-ip parameter provided to the bootstrap
command. The cluster network can be specified with the --cluster-network parameter. If
the --cluster-network parameter is not specified, it is set to the same value as the public
network.
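Putting these bootstrap options together, an invocation might look like the following sketch; the addresses are placeholders:
cephadm bootstrap --mon-ip 10.0.0.11 --cluster-network 192.168.10.0/24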
To connect to the Dashboard GUI, look for the following lines in the cephadm bootstrap output
and point your HTTP browser to the URL that is displayed. (Example 6-9 on page 133).
This document is not designed to provide extensive details about the deployment and
configuration of your cluster. For more information, see IBM Storage Ceph documentation.
IBM Storage Ceph Solutions Guide, REDP-5715 provides GUI-based deployment methods
for users who want to use a GUI instead of the CLI.
After the cluster is successfully bootstrapped you can deploy additional cluster services. A
production cluster requires the following elements to be deployed in appropriate numbers for
a reliable cluster that does not present any single point of failure:
Monitors
Managers
OSDs
Adding nodes
When the cluster is bootstrapped, the Ceph administrator must add each node where
services will be deployed. Copy the cluster public SSH key to each node that will be part of
the cluster. When the nodes are prepared, add them to the cluster (Example 6-10).
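A typical sequence, with placeholder host name and address, is the following:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph-node02
ceph orch host add ceph-node02 10.0.0.12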
Tip: To add multiple hosts, use a service file that describes all your nodes.
Removing nodes
If you must remove a node that was added by mistake, run the commands in Example 6-11. If
the node does not have OSD already deployed, skip the second command of the example.
Tip: You can also clean up the SSH keys that are copied to the host.
Assigning labels
You can assign labels to hosts after they are added to the cluster (Example 6-12).
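For illustration (the host and label names are placeholders):
ceph orch host label add ceph-node02 rgw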
After bootstrapping your cluster, you have only a single Monitor, a single Manager, no OSDs,
no MDSs, and no RGWs. These services are deployed in the following order:
1. Deploy at least another two Monitors for a total of three or five.
2. Deploy at least one more Manager.
3. Deploy the OSDs.
4. If needed, deploy CephFS MDS daemons.
5. If needed, deploy RGWs.
Monitors
To deploy a total of three Monitors, deploy two more (Example 6-13).
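One common form of this command, with placeholder host names, is:
ceph orch apply mon --placement="ceph-node01,ceph-node02,ceph-node03"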
Tip: This task can also be achieved by using a service file and running the following
command: ceph orch apply -i {service_file.yaml}
Tip: If you want your MON to bind to a specific IP address or subnet, use
{hostname}:{ip_addr} or {hostname}:{cidr} to specify the host.
Managers
To deploy a total of two Managers, deploy one more (Example 6-14).
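For example, with placeholder host names:
ceph orch apply mgr --placement="ceph-node01,ceph-node02"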
List the devices that are available on all nodes by using the command in Example 6-15 after
you add all the nodes to your cluster.
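The device listing command takes the following form:
ceph orch device ls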
cephadm scans all nodes for the available devices (free of partitions, formatting, LVM
configuration) (Example 6-16).
To visualize what devices will be consumed by this command, run the following command:
ceph orch apply osd --all-available-devices --dry-run
An OSD service file, using the information that is provided in 6.1.2, “Deployment” on
page 127, can be tailored to each node’s needs. As such, an OSD service file can contain
multiple specifications (Example 6-17).
You can also add an OSD that uses a specific device (Example 6-18).
Tip: You can pass many parameters through the CLI, including data_devices, db_devices,
and wal_devices. For example:
{hostname}:data_devices={dev1},{dev2},db_devices={db1}
If an OSD does not deploy on a given device, it might be necessary to initialize (zap) the device
so that it can be consumed by an OSD (Example 6-19).
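Representative forms of these commands, with placeholder host and device names, are:
ceph orch daemon add osd ceph-node02:/dev/sdb
ceph orch device zap ceph-node02 /dev/sdc --force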
Arguments can be passed through CLI arguments or provided by a service file containing a
detailed configuration.
Tip: You can pass many parameters through the CLI, including --realm={realm_name},
--zone={zone_name} --placement={placement_specs}, --rgw_frontend_port={port}, or
count-per-host:{n}.
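Combining these options, one possible CLI invocation (the service and host names are placeholders, and the flags follow the parameters listed above) is:
ceph orch apply rgw myrgw --placement="2 ceph-node01 ceph-node02" --rgw_frontend_port=8080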
Ingress service
If your cluster deploys an RGW service, you need a load balancer in front of the RGW
service to ensure the distribution of the traffic between multiple RGWs and provide a highly
available (HA) service endpoint with no single point of failure.
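A minimal ingress specification, with placeholder service name, virtual IP, and ports, might look like the following service file:
service_type: ingress
service_id: rgw.myrgw
placement:
  count: 2
spec:
  backend_service: rgw.myrgw
  virtual_ip: 10.0.0.100/24
  frontend_port: 443
  monitor_port: 1967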
When the MDSs are active, manually create the file system by running the following
command:
ceph fs new {fs_name} {meta_pool} {data_pool}
When your cluster is fully deployed, a best practice is to export the cluster configuration and
back up the configuration. Use git to manage the cluster configuration (Example 6-22).
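The export itself is typically a single command whose output can then be committed to git:
ceph orch ls --export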
Tip: Redirect the command output to a file to create a full cluster deployment service file.
In a production cluster, it is best for each OSD to serve 100 - 200 PG replicas.
The Placement Group Autoscaler automatically adjusts the number of PGs within each pool
based on the target size ratio that is assigned to each pool.
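For example, the autoscaler status can be reviewed and a target size ratio assigned per pool (the pool name and ratio are placeholders):
ceph osd pool autoscale-status
ceph osd pool set mypool target_size_ratio 0.2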
The failure of a node or the failure of multiple devices can endanger the resiliency of the data
and expose it to a double failure scenario that can degrade the availability or durability of data.
Ceph clusters must be monitored and action taken swiftly to replace failed nodes and failed
devices to limit the risk presented by multiple failure scenarios.
Data rebalancing or recovery within a Ceph cluster may impact client performance. As Ceph matured, the
default recovery and backfill mechanisms and parameters were adjusted to minimize this
negative impact.
ceph pg dump_stuck {pgstate} Lists PGs that are stuck in stale, inactive, or unclean
states.
For example, to view the daemon mon.foo logs for a cluster with ID
5c5a50ae-272a-455d-99e9-32c6a013e694, run the command that is shown in Example 6-24.
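A likely form of that command, assuming systemd journal logging on the daemon's host, is:
journalctl -u [email protected]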
Tip: You can set a single daemon to log to a file by using osd.0 instead of global.
By default, cephadm enables log rotation on each host. You can adjust the retention schedule
and other specifics by modifying /etc/logrotate.d/ceph.<CLUSTER FSID>.
Because a few Ceph daemons (notably, Monitors and Prometheus) store a large amount of
data in /var/lib/ceph, you may find it helpful to arrange for this directory to reside on a
dedicated disk, partition, or logical volume so that it does not fill up the / or /var file system.
Cephadm logs
cephadm writes logs to the cephadm cluster log channel. You can monitor the Ceph activity in
real time by reading the logs as they fill up. To see the logs in real time, run the command in
Example 6-27.
By default, this command shows info-level events and above. To see debug-level messages
as well, run the commands in Example 6-28.
You can see recent events by running the command in Example 6-29.
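The commands behind these operations are typically of the following form:
ceph -W cephadm
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
ceph log last cephadm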
If your Ceph cluster is configured to log events to files, these will be found in
ceph.cephadm.log on all Monitor hosts (for more information, see the cephadm utility).
For more information, see Monitor cephadm log messages in IBM Documentation.
A logging setting can take a single value for the log and memory levels, which sets them both
to the same value. For example, if you specify debug ms = 5, Ceph treats it as a log and
memory level of 5. You may also specify them separately. The first setting is the log level, and
the second is the memory level. Separate them with a forward slash (/). For example, if you
want to set the ms subsystem's debug logging level to 1 and its memory level to 5, you specify
it as debug ms = 1/5. You can adjust log verbosity at runtime in different ways. Note that
runtime adjustments will be reverted when each daemon is eventually restarted.
Example 6-31 shows how to configure the logging level for a specific subsystem.
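A representative runtime adjustment, with a placeholder daemon ID and level, uses ceph tell:
ceph tell osd.0 config set debug_osd 20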
Example 6-32 shows how to use the admin socket from inside a daemon's container.
Example 6-32 Configuring the logging level by using the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set
debug_rgw 20
Example 6-33 shows how to make the verbose/debug change permanent so that it persists
after a restart.
To do so, use the cephadm logs to get logs from the containers running Ceph services
(Example 6-34).
Example 6-34 Checking the startup logs from a Ceph service container
# cephadm ls | grep mgr
"name": "mgr.ceph-mon01.ndicbs",
"systemd_unit":
"[email protected]",
"service_name": "mgr",
# cephadm logs --name mgr.ceph-mon01.ndicbs
Inferring fsid 3c6182ba-9b1d-11ed-87b3-2cc260754989
-- Logs begin at Tue 2023-01-24 04:05:12 EST, end at Tue 2023-01-24 05:34:07 EST.
--
Jan 24 04:05:21 ceph-mon01 systemd[1]: Starting Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989...
Jan 24 04:05:25 ceph-mon01 podman[1637]:
Jan 24 04:05:26 ceph-mon01 bash[1637]:
36f6ae35866d0001688643b6332ba0c986645c7fba90d60062e6a4abcd6c8123
Jan 24 04:05:26 ceph-mon01 systemd[1]: Started Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989.
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>
Major releases include more disruptive changes, such as Dashboard or Manager API
deprecation, than minor releases. Major IBM Storage Ceph releases correlate with an
upstream major release, for example from 5.X to 6.X or from 6.X to 7.X. Major upgrades may
also require upgrading the operating system. To see the matrix of OS-supported versions
depending on the IBM Storage Ceph release, see What are the Red Hat and IBM Storage
Ceph releases and corresponding Ceph package Versions?
Minor upgrades generally correlate with the same upstream release and usually avoid
disruptive changes. For IBM Storage Ceph, minor releases are IBM Storage Ceph 7.1, 7.2,
and so forth.
For each minor release there are periodic maintenance releases. Maintenance releases bring
security and bug fixes. New features are rarely introduced in maintenance releases.
Maintenance releases are named 6.1z1, 6.1z2, and so on.
If you are upgrading from Red Hat Ceph Storage to IBM Storage Ceph, and the Red Hat
Ceph Storage cluster is at version 3.X or 4.X, the upgrade from 3.X to 4.X to 5.X is done with
ceph-ansible. Before cephadm, the ceph-ansible tool was used for upgrading minor
and major versions of Red Hat Ceph Storage.
Both approaches are valid, and have pros and cons depending on the use case and
infrastructure resources.
Example 6-35 Preventing OSDs from getting marked out during an upgrade and avoiding unnecessary
load on the cluster
# ceph osd set noout
# ceph osd set noscrub
# ceph osd set nodeep-scrub
The conservative approach is to rebuild (recover) all OSDs when upgrading the OS, especially if
you are doing a clean OS installation to upgrade RHEL. This approach can take hours or
days to complete in a large cluster because the upgrade must pause between hosts while the
cluster is in a degraded state with only two valid copies of the data (assuming replica 3 is
used).
Example 6-36 shows how to determine whether an upgrade is in process and the version to
which the cluster is upgrading.
To get detailed information about all the steps that the upgrade is taking, query the cephadm
logs (Example 6-37).
You can pause, resume, or stop an upgrade while it is running (Example 6-38).
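Representative commands for these operations (the image name is a placeholder) are:
ceph orch upgrade start --image {registry}/{imagename}:{imagetag}
ceph orch upgrade status
ceph orch upgrade pause
ceph orch upgrade resume
ceph orch upgrade stop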
Cluster expansion
One of the outstanding features of Ceph is its ability to add or remove Ceph OSD nodes at
run time. You can resize cluster capacity without taking the cluster down, and add, remove, or
replace hardware during regular business hours. Note that adding or removing OSD nodes
can impact performance.
Before adding Ceph OSD nodes, consider the effects on storage cluster performance. Adding
or removing Ceph OSD nodes results in backfill or recovery as the storage cluster
rebalances. During this rebalancing period, Ceph uses additional network and storage drive
resources, which can impact client performance.
Because a Ceph OSD host is part of the CRUSH hierarchy, adding or removing a node can
affect the performance of clients that use pools with OSDs placed on the affected node.
3. Ensure that the appropriate RHEL repositories are enabled (Example 6-40).
4. Add the IBM Storage Ceph repository for the release the cluster currently runs
(Example 6-41).
5. From the IBM Storage Ceph Cluster bootstrap node, enter the cephadm shell
(Example 6-42).
7. Copy the Ceph cluster public SSH keys to the root user's authorized_keys file on the new
host (Example 6-44).
8. Add the new host to the Ansible inventory file. The default location for the file is
/usr/share/cephadm-ansible/hosts. Example 6-45 shows the structure of a typical
inventory file.
[admin]
host00
Example 6-46 Copying the SSH key to the new node that you are adding
$ ansible-playbook -i INVENTORY_FILE cephadm-preflight.yml --extra-vars
"ceph_origin=ibm" --limit NEWHOST
10.From the Ceph administration node, log in to the cephadm shell (Example 6-47).
11.Use the cephadm orchestrator to add hosts to the storage cluster (Example 6-48).
When the node is added to the cluster, you should see the node listed in the output of the
ceph orch host ls command or the ceph orch device ls command.
If any disks on the newly added host pass the filter that you configured in your cephadm OSD
service specification, new OSDs are automatically created on them. The cluster rebalances, moving PGs from other OSDs
to the new OSDs to distribute data evenly across the cluster.
The best practices described in “Cluster expansion” on page 146 also apply to this section, so
read them carefully before starting a node replacement.
For more information about disk replacement, see Replacing the OSDs.
All the best practices in this section on cluster expansion also apply to disk replacement, so
read them carefully before starting a disk replacement. They apply at a smaller scale
because, in this case, a single OSD is replaced rather than a full node.
The IBM Storage Ceph Observability Stack provides management and monitoring capabilities
to administer and configure the cluster and visualize related information and performance
statistics. The Dashboard uses a web server that is hosted by the ceph-mgr daemon.
Management features
The Dashboard has the following management features:
View the cluster hierarchy: You can view, for example, the CRUSH map to determine the
host on which a specific OSD is located. This is helpful if there is an issue with an OSD.
Configure Manager modules: You can view and change parameters for Ceph Manager
modules.
Embedded Grafana Dashboards: Grafana Dashboards may be embedded in external
applications and web pages to surface information and performance metrics that are
gathered by the Prometheus module.
View and filter logs: You can view event and cluster logs and filter them based on priority,
keyword, date, or time range.
Toggle Dashboard components: You can enable and disable Dashboard components so
only the features that you need are available.
Viewing Alerts: The alerts page shows details of the current alerts.
Quality of service for RBD volumes: You can set performance limits on volumes, limiting
IOPS or throughput.
Monitoring features
The Dashboard has the following monitoring features:
Username and password protection: You can access the Dashboard only by providing a
configurable username and password.
Overall cluster health: Displays overall cluster status as well as performance and capacity
metrics. Storage utilization is also shown including number of RADOS objects, raw
capacity, usage per pool, and a list of pools and their status and usage statistics.
Hosts: Provides a list of all hosts in the cluster along with their running services and the
installed Ceph version.
Performance counters: Displays detailed statistics for each running service.
Monitors: Lists Monitor daemons, their quorum status, and open sessions.
Configuration editor: Displays available configuration options, their descriptions, types,
defaults, and current values. These values are editable.
Cluster logs: Displays and filters the cluster's event and audit log files by priority, date, or
keyword.
Device management: Lists hosts known by the Orchestrator. Lists all drives attached to
each host and their properties. Displays drive health predictions, SMART data, and
manages enclosure LEDs.
View storage cluster capacity: You can view the raw storage capacity of the IBM Storage
Ceph cluster in the Capacity pane of the Ceph Dashboard.
Pools: Lists and manages all Ceph pools and their details. For example, application tags,
PGs, replication size, EC profile, quotas, CRUSH ruleset, and others.
Dashboard access
You can access the Dashboard with the credentials that are provided on bootstrapping the
cluster.
Cephadm installs the Dashboard by default. Example 6-50 is an example of the Dashboard
URL.
To find the Ceph Dashboard credentials, search the /var/log/ceph/cephadm.log file for the
string "Ceph Dashboard is now available at".
Unless the --dashboard-password-noupdate option was used while bootstrapping, you must
change the password the first time that you log in to the Dashboard with the credentials that
were provided at bootstrap.
Ceph OSDs send heartbeat liveness probes to each other to test daemon availability and
network performance. If a single delayed response is detected, it might indicate nothing more
than a busy OSD. But if recurring delays between OSDs are detected, it might indicate a
failed network switch, a NIC failure, or a layer 1 issue.
In the output of the ceph health detail command, you can see which OSDs are
experiencing delays and how long the delays are. The output of ceph health detail is limited
to ten lines. Example 6-51 shows an example of the output of the ceph health detail
command.
To see more details and collect a complete dump of network performance information, use
the dump_osd_network command.
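As a hedged illustration, this command is typically sent through the admin socket of the active Manager daemon on the node where it runs, or to a specific OSD to limit the scope; the daemon names are placeholders:
# ceph daemon mgr.host01.abcdef dump_osd_network
# ceph daemon osd.0 dump_osd_network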
The Alertmanager that is part of the Observability stack offers a comprehensive set of
preconfigured alerts related to networking. Example 6-52 shows an example.
For more information about the preconfigured alerts, see this GitHub repository.
The publications that are listed in this section are considered suitable for a more detailed
description of the topics that are covered in this paper.
IBM Redbooks
The following IBM Redbooks publications provide more information about the topics in this
document. Some publications that are referenced in this list might be available in softcopy
only.
IBM Storage Ceph Solutions Guide, REDP-5715
Unlocking Data Insights and AI: IBM Storage Ceph as a Data Lakehouse Platform for IBM
watsonx.data and Beyond, SG24-8563 (in draft at the time of editing)
You can search for, view, download, or order these documents and other Redbooks,
Redpapers, web docs, drafts, and additional materials, at the following website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
Amazon Web Services (AWS) CLI documentation:
https://docs.aws.amazon.com/cli/index.html
Community Ceph Documentation:
https://docs.ceph.com/en/latest/monitoring/
IBM Storage Ceph Documentation:
https://www.ibm.com/docs/en/storage-ceph/6?topic=dashboard-monitoring-cluster
IP load balancer documentation:
https://github.ibm.com/dparkes/ceph-top-gun-enablement/blob/main/training/modules/ROOT/pages/radosgw_ha.adoc
REDP-5721-00
ISBN 0738461520
Printed in U.S.A.
ibm.com/redbooks