Ceph
Redpaper
IBM Redbooks
November 2023
REDP-5721-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page vii.
Contents
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Ceph and storage challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Data keeps growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Technology changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Data organization, access, and costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Data added value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Ceph approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.6 Ceph storage types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 What is new with IBM Storage Ceph V 7.0? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 WORM compliance certification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Multi-site replication with bucket granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Object archive zone (Tech preview) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.4 RGW policy-based data archive and migration capability. . . . . . . . . . . . . . . . . . . . 9
1.3.5 IBM Storage Ceph Object S3 Lifecycle Management. . . . . . . . . . . . . . . . . . . . . . 10
1.3.6 Dashboard UI enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.7 NFS support for CephFS for non-native Ceph clients. . . . . . . . . . . . . . . . . . . . . . 12
1.3.8 NVMe over Fabrics (Tech preview). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.9 Object storage for ML/analytics: S3 Select . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.10 RGW multi-site performance improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.11 Erasure code EC2+2 with 4 nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at https://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
Redbooks (logo)®, IBM®, IBM Cloud®, IBM Spectrum®, Redbooks®
The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Ansible, Ceph, OpenShift, and Red Hat are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.
VMware, and the VMware logo are registered trademarks or trademarks of VMware, Inc. or its subsidiaries in
the United States and/or other jurisdictions.
Other company, product, or service names may be trademarks or service marks of others.
Preface
IBM® Storage Ceph is an IBM-supported distribution of the open-source Ceph platform that
provides massively scalable object, block, and file storage in a single system.
IBM Storage Ceph is designed to operationalize AI with enterprise resiliency, consolidate data with
software simplicity, and run on multiple hardware platforms to provide flexibility and lower costs.
It is engineered to be self-healing and self-managing, with no single point of failure, and includes
storage analytics for critical insights into growing amounts of data. IBM Storage Ceph can be used
as an easy and efficient way to build a data lakehouse for IBM watsonx.data and for next-generation
AI workloads.
This IBM Redpaper publication explains the concepts and architecture of IBM Storage Ceph
in a clear and concise way. For detailed instructions on how to implement IBM Storage Ceph
for real life solutions, see the IBM Storage Ceph Solutions Guide, REDP-5715 IBM Redpaper.
The target audience for this publication is IBM Storage Ceph architects, IT specialists, and
technologists.
Authors
This paper was produced by a team of specialists from around the world.
Marcel Hergaarden
IBM Netherlands
The team extends its gratitude to the Upstream Community, IBM and Red Hat Ceph
Documentation teams for their contributions to continuously improve Ceph documentation.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Introduction
This chapter introduces the origins of Ceph and the basic architectural concepts that are used
by this software-defined storage solution.
1.1 History
The Ceph project emerged from a critical observation: the Lustre architecture was inherently
limited by its metadata lookup mechanism. In Lustre, locating a specific file requires querying
a dedicated software component called the Metadata Server. This centralized approach
proved to be a bottleneck under heavy workloads, hindering the overall performance and
scalability of the storage system.
To solve this inherent Lustre architecture problem, Sage Weil envisioned a new mechanism to
distribute and locate the data in a distributed and heterogeneous structured storage cluster.
This new concept is metadata-less and relies on a pseudo-random placement algorithm to do
so.
The novel algorithm, named CRUSH (Controlled Replication Under Scalable Hashing),
leverages a sophisticated calculation to optimally place and redistribute data across the
cluster, minimizing data movement when the cluster's state changes, all without the need for
any additional metadata.
CRUSH is designed to distribute the data across all devices in the cluster and to avoid the classic
bias of favoring empty devices for new writes. With that bias, cluster expansion tends to create
bottlenecks because the new, empty devices receive all the new writes, and the data distribution
becomes unbalanced because the old data is not redistributed across all the devices in the
storage cluster.
Because CRUSH is designed to distribute and maintain the distribution of the data throughout the
lifecycle of the storage cluster (expansion, reduction, or failure), it favors an equivalent mix of
old and new data on each physical disk of the cluster and therefore leads to a more even
distribution of the I/Os across all the physical disk devices.
To enhance the data distribution, the solution was designed to allow the breakdown of large
elements (for example, a 100 GiB file) into smaller elements, each assigned a specific
placement via the CRUSH algorithm. Therefore, reading a large file leverages multiple
physical disk drives rather than the single disk drive that would be used if the file were kept
as a single element.
Sage Weil prototyped the new algorithm and, in doing so, created a new distributed storage
software solution: Ceph. The name Ceph was chosen as a reference to the ocean and the life
it harbors, given that Santa Cruz, CA is a Pacific Ocean coastal town. Ceph is short for
cephalopod.
In January 2023, all Ceph developers and product managers were moved from Red Hat to
IBM to provide greater resources for the future of the project. The Ceph project at IBM
remains an open-source project, code changes still follow the upstream-first rule, and the
project is the base for the IBM Storage Ceph software-defined storage product.
Figure 1-1 on page 3 represents the milestones of the Ceph project over the past two
decades.
All Ceph community versions were assigned the name of a member of the Cephalopoda
natural sciences family. The first letter of the name helps identify the version.
Table 1-1 represents all Ceph community version names with the matching Inktank, Red Hat
Ceph Storage or IBM Storage Ceph version leveraging it.
Following the transfer of the Ceph project to IBM, Red Hat OEMs Red Hat Ceph Storage from IBM,
starting with Red Hat Ceph Storage 6.
Figure 1-2 represents the different IBM Storage Ceph versions as of today.
enabling real-time data accessibility from multiple locations without the need for human
intervention.
Ceph is no exception to these challenges, and was designed from the ground up to be highly
available, with no single point of failure, and highly scalable with limited day-to-day operational
requirements other than the replacement of failed physical resources, such as nodes or
drives. Ceph provides a complete software-defined storage solution as an alternative to
proprietary storage arrays.
All the different types of storage are segregated and do not share data between them, although
they may share the same physical storage devices. A custom CRUSH configuration, however,
allows you to separate the physical nodes and disks used by each of them.
All the different types of storage are identified and implemented as Ceph access methods on
top of the native RADOS API. This API is known as librados.
Ceph is written entirely in C and C++, except for some language-specific API wrappers
(for example, the Python wrappers for librados and librbd).
IBM Storage Ceph is positioned for the following use cases (Figure 1-3 on page 7), and IBM is
committed to supporting additional ones in upcoming versions, such as NVMe over Fabrics for
full and easy integration with VMware.
Note: The U.S. Securities and Exchange Commission (SEC) stipulates record-keeping
requirements, including retention periods. Financial Industry Regulatory Authority (FINRA)
rules regulate member brokerage firms and exchange member markets.
The Cohasset Associates IBM Storage Ceph certification assessment page can be found
here.
Figure 1-4 on page 8 shows the multi-site replication with bucket granularity feature.
Previously, replication was limited to full zone replication. This new feature grants clients
enhanced flexibility by enabling the replication of individual buckets to or from different
IBM Storage Ceph clusters. This granular approach allows for selective replication, which can
be beneficial for edge computing, co-locations, or branch offices. Bidirectional replication is
also supported.
The archive zone selectively replicates data from designated buckets within the production
zones. System administrators can control which buckets undergo replication to the archive
zone, enabling them to optimize storage capacity usage and prevent the accumulation of
irrelevant content.
The primary benefit of this feature lies in its ability to liberate on-premises storage space that
is currently occupied by inactive, rarely accessed data. This reclaimed storage can then be
repurposed for active datasets, enhancing overall storage efficiency.
With these new UI functionalities, a Ceph cluster administrator has the ability to manage the
whole lifecycle of CephFS filesystems, volumes and subvolumes via the graphical dashboard
UI.
RGW multi-site UI configuration:
– RGW multi-site setup and configuration from the dashboard UI.
IBM Storage Ceph Linux clients can seamlessly mount CephFS without additional driver
installations, as CephFS is embedded in the Linux kernel by default. This capability extends
CephFS accessibility to non-Linux clients through the NFS protocol. In IBM Storage Ceph V7,
the NFS Ganesha service expands compatibility by supporting NFS v4, empowering a
broader spectrum of clients to seamlessly access CephFS resources.
The newly introduced IBM Storage Ceph NVMe-oF gateway bridges the gap for non-Linux
clients, enabling them to seamlessly interact with NVMe-oF initiators. These initiators
establish connections with the gateway, which in turn connects to the RADOS block storage
system.
The performance of NVMe-oF block storage through the gateway is comparable to native
RBD block storage, ensuring a consistent and efficient data access experience for both Linux
and non-Linux clients.
This feature empowers clients to employ straightforward SQL statements to filter the contents
of S3 objects and retrieve only the specific data they require. By leveraging S3 Select for data
filtering, clients can significantly minimize the amount of data transferred by S3, thereby
reducing both retrieval costs and latency. The following data formats are supported:
CSV
JSON
Parquet
With IBM Storage Ceph 7, erasure code 2+2 (EC2+2) can be used with just four nodes, making it
more efficient and cost-effective to deploy erasure coding for data protection.
IBM Storage Ready Nodes can be deployed with a minimum of four nodes and utilize erasure
coding for the RADOS backend in this basic configuration. This scalable solution can be
expanded to accommodate up to 400 nodes.
2.1 Architecture
Figure 2-1 represents the structure and layout of a Ceph cluster, starting from RADOS at the
bottom, up to the various access methods provided by the cluster.
Monitors
The Monitors, known as MONs, are responsible for managing the state of the cluster. Like
with any distributed storage system, the challenge is to keep track of the status of each
cluster component (Monitors, Managers, Object Storage Devices and so forth).
Ceph maintains its cluster state through a set of specialized maps, collectively referred to as
the cluster map. Each map is assigned a unique version number, called an epoch, which
starts at 1 and increments by 1 upon every state change for the corresponding set of
components.
To ensure the integrity and consistency of map updates, the Monitors employ the PAXOS
algorithm, enabling them to reach consensus among multiple Monitors before validating and
implementing any map changes.
To prevent split-brain scenarios, the number of Monitors deployed in a Ceph cluster must
always be an odd number greater than two to ensure that a majority of Monitors can validate
map updates. This means that more than half of the Monitors present in the Monitor Map
(MONMap) must agree on the change proposed by the PAXOS quorum leader for the map to
be updated.
Note: The Monitors are not part of the data path, meaning they do not directly handle data
storage or retrieval requests. They exist primarily to maintain cluster metadata and keep all
components synchronized.
Managers
The Managers, abbreviated as MGRs, are integrated with the Monitors, and collect the
statistics within the cluster. The Managers provide a pluggable Python framework to extend
the capabilities of the cluster. As such the developer or the end-user can leverage or create
Manager modules that will be loaded into the Manager framework.
The list below provides some of the existing Manager modules that are available (a command sketch for enabling modules follows the list):
Balancer module (dynamically reassign placement groups to OSDs for better data
distribution).
Auto-scaler module (dynamically adjust the number of placement groups assigned to a
pool).
Dashboard module (provide a UI to monitor and manage the Ceph cluster).
RESTful module (provide RESTful API for cluster management).
Prometheus module (provide metrics support for the Ceph cluster).
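As an illustration, Manager modules can be listed and enabled from any node that holds an admin keyring. The commands below are a minimal sketch; the dashboard module is used only as an example.

# List the available Manager modules and their current state
ceph mgr module ls

# Enable an optional module, for example the dashboard, and check its endpoint
ceph mgr module enable dashboard
ceph mgr services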
Each OSD can be assigned one role for a given placement group:
Primary OSD.
Secondary OSD.
The primary OSD performs all the above functions, while a secondary OSD always acts under
the control of a primary OSD. For example, if a write operation lands on the primary OSD for
a given placement group, the primary OSD sends a copy of the IO to one or more secondary
OSDs, and each secondary OSD is solely responsible for writing the data onto the physical
media and, when done, acknowledging to the primary OSD.
Note: A cluster node where OSDs are deployed is called an OSD node.
In most cases, you deploy one Object Storage Device per physical drive.
When flash-based drives arrived on the market, it became best practice to use a Solid State
Drive to host the journal to enhance the performance of write operations in the Ceph cluster.
However, the complexity of the solution and the write amplification due to 2 writes for each
write operation led the Ceph project to consider an improved solution for the future.
BlueStore
BlueStore is the new default OSD object store format since upstream Luminous (Red Hat
Ceph Storage 3.x). With BlueStore, data is written directly to the disk device, while a separate
RocksDB key-value store contains all the metadata.
Once the data is written to the raw data block device, the RocksDB is updated with the
metadata related to the new data blobs that just got written.
RocksDB utilizes a DB portion and a write-ahead log (WAL) portion. Depending on the size of
the IO, RocksDB will write the data directly to the raw block device through BlueFS or to the
WAL so it can be later committed to the raw block device. The latter process is known as a
deferred write.
Note: The best practice is to use a device faster than the data device for the RocksDB
metadata device and a device faster than the RocksDB metadata device for the WAL
device.
When a separate device is configured for the metadata, metadata might overflow to the data
device if the metadata device becomes full. While this is not a problem if both devices are of
the same type, it leads to performance degradation if the data device is slower than the
metadata device. This situation is known as the BlueStore spillover.
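As a hedged illustration, a cephadm-managed cluster can place the RocksDB metadata on faster devices through an OSD service specification. The specification below is only a sketch: the service_id is hypothetical, and the rotational filters assume that the data devices are HDDs and the faster devices are flash.

ceph orch apply -i - <<'EOF'
service_type: osd
service_id: hdd_data_flash_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
EOF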
When cache autotuning is enabled, BlueStore sizes its caches to keep the overall OSD memory
consumption below the value assigned to the osd_memory_target parameter and not under the
osd_memory_cache_min value.
If need be, the cache sizing can be adjusted manually by setting the parameter
bluestore_cache_autotune to 0, and the following parameters can be adjusted to allocate
specific portions of the BlueStore cache:
cache_meta: BlueStore onode and associated data.
cache_kv: RocksDB block cache including indexes and filters.
data_cache: BlueStore cache for data buffers.
The above parameters are expressed as a percentage of the cache size assigned to the
OSD.
BlueStore allows you to configure more features to align best with your workload:
block.db sharding.
Minimum allocation size on the data device.
Pools
The cluster is divided into logical storage partitions called pools. Pools have the following
characteristics (a creation sketch follows the note after this list):
Group data of a specific type.
Group data that is to be protected using the same mechanism (replication or erasure
coding).
Group data to control access from Ceph clients.
Assigned one and only one CRUSH rule to determine placement group mapping to OSDs.
Note: Pools support compression but do not support deduplication for now. The
compression can be activated on a per pool basis.
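For illustration, the following commands sketch the creation of a replicated pool and the per-pool activation of compression. The pool name, placement group count, and compression settings are hypothetical values.

# Create a replicated pool with 32 placement groups and tag it for RBD usage
ceph osd pool create mypool 32 32 replicated
ceph osd pool application enable mypool rbd

# Activate compression on the pool (compression is enabled on a per-pool basis)
ceph osd pool set mypool compression_mode aggressive
ceph osd pool set mypool compression_algorithm snappy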
Data protection
The data protection scheme is assigned individually to each pool. The data protection schemes
that IBM Storage Ceph supports are:
Replication, which makes a full copy of each byte stored in the pool (three copies by default):
– 2 replicas and higher are supported with underlying flash devices.
– 3 replicas and higher are supported with rotational hard drives.
Erasure coding, which functions in a similar way to the parity RAID mechanism:
– 4+2 erasure coding is supported with jerasure plugin.
– 8+3 erasure coding is supported with jerasure plugin.
Note: Erasure coding, although supported for all types of storage (block, file and object), is
not recommended for block and file as it delivers lower performance.
The difference between replicated and erasure coding pools when it comes to data protection
is summarized in Figure 2-4.
Figure 2-4 Replicated data protection versus erasure coding data protection
Erasure coding provides a more cost-efficient data protection mechanism, and it offers greater
resiliency and durability as you increase the number of coding chunks, allowing the pool to
survive the loss of many OSDs or servers before the data becomes irrecoverable. However, it
offers lower performance because of the computation and network traffic required to split the
data and calculate the coding chunks.
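The following commands are a minimal sketch of creating a replicated pool and a 4+2 erasure-coded pool with the jerasure plugin; the pool names, placement group counts, and profile name are hypothetical.

# Replicated pool with three copies (the default)
ceph osd pool create block_pool 64 64 replicated
ceph osd pool set block_pool size 3

# Erasure-coded pool that uses a 4+2 profile
ceph osd erasure-code-profile set ec42 k=4 m=2 plugin=jerasure
ceph osd pool create object_data 64 64 erasure ec42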
Each pool is assigned a set of parameters that can be changed on the fly. Table 2-1 lists the
main parameters used for pools and details if the parameter can be dynamically modified.
Table 2-1 Main parameters used for pools
Placement groups
The pool is divided into hash buckets called placement groups or PGs. The role of the PG is
to:
Store the objects as calculated by the CRUSH algorithm.
Guarantee that the storage of the data is abstracted from the physical devices.
Note: The mapping of a placement group will always be the same for a given cluster state.
The placement group states include the following:
active+stale: The PG is fully functional, but one or more replicas is stale and needs to be updated.
stale: The PG is in an unknown state because the Monitors have not received an update after the PG placement was changed.
incomplete: The PG is missing information about some writes that may have occurred or does not have healthy copies.
degraded: The PG has not replicated some objects to the correct number of OSDs.
inconsistent: The PG has inconsistencies between its different copies stored on different OSDs.
Cluster status
While placement groups have their own distinct states, the Ceph cluster has a global status
that can be checked with the ceph status, ceph health, or ceph health detail commands (a brief
usage sketch follows the status list).
HEALTH_OK: The cluster is fully operational, and all components are operating as expected.
HEALTH_WARN: Some issues exist in the cluster, but data is available to clients.
HEALTH_ERR: Serious issues exist in the cluster, and some data has become unavailable to clients.
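The status commands can be run from any node that holds an admin keyring, as sketched below.

ceph status          # overall cluster status, including health, MONs, OSDs, and PGs
ceph health          # one-line health summary (HEALTH_OK, HEALTH_WARN, or HEALTH_ERR)
ceph health detail   # detailed list of the checks behind a WARN or ERR status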
Figure 2-5 illustrates the general component layout within a RADOS cluster.
Note: With the latest version of IBM Storage Ceph multiple components can be deployed
on one node following the support matrix. You need to log in with your IBMid to access this
link.
As the placement remains the same for a given cluster state, imagine the following scenario:
A RADOS object is hosted in placement group 3.23.
A copy of the object is written on OSD 24, 3 and 12 (state A of the cluster).
OSD 24 is stopped (state B of the cluster).
An hour later OSD 24 is restarted.
Upon the stoppage of OSD 24, the placement group that contains the RADOS object is re-created
on another OSD so that the cluster satisfies the number of copies that must be maintained for the
placement groups that belong to the specific pool.
At this point, state B of the cluster becomes different from the original state A of the cluster.
Placement group 3.23 is now protected by OSD 3, 12 and 20.
When OSD 24 is restarted, assuming no changes were made to the cluster in the meantime
(such as cluster expansion, reduction, CRUSH customization, or another OSD failing), the copy
of placement group 3.23 is resynchronized on OSD 24 to provide the same mapping for the
placement group, because the cluster state is identical to state A.
From an architecture point of view, CRUSH provides the following mechanism (see
Figure 2-6).
To determine the location of a specific object, the following mechanism is used (see
Figure 2-7).
To make sure that a client or a Ceph cluster component locates the correct location of an object,
which enables the client-to-cluster communication model, all maps maintained by the Monitors
are versioned, and the version of the map used to locate an object or a placement group is
checked by the recipient of a request.
The specific exchange determines whether the Monitors or one of the OSDs updates the map
version used by the sender. In most cases, these updates are differentials, meaning only the
changes to the map are transmitted after the initial connection to the cluster.
If you now look at the whole Ceph cluster, with its many pools, each pool with its own
placement groups, the picture will look like Figure 2-8 on page 26.
Figure 2-8 Full data placement picture (Objects on the left, OSDs on the right)
In the Ceph clusters, the following mechanisms exist to track the status of the different
components of the cluster:
Monitor to Monitor heartbeat.
OSD to Monitor heartbeat.
OSD to OSD heartbeat.
Upon detecting an unavailable peer OSD (because it works with other OSDs to protect
placement groups), an OSD relays this information to the Monitors. This enables the Monitors
to update the OSDMap accordingly, reflecting the status of the unavailable OSD.
OSD failures
When an OSD becomes unavailable, the following statements become true:
The total capacity of the cluster is reduced.
The total throughput that can be delivered by the cluster is reduced.
The cluster will enter a recovery process that generates disk IOs (to read the data that
must be recovered) and network traffic (to send a copy of the data to another OSD and
recreate the missing data).
Recovery is the process of moving or synchronizing a placement group following the failure of
an OSD.
Backfill is the process of moving a placement group following the addition or the removal of an
OSD to or from the Ceph cluster.
When a Ceph client connects to the Ceph cluster, it needs to contact the Monitors of the
cluster so that it can be authenticated. Once authenticated the Ceph client will be provided
with a copy of the different maps maintained by the Monitors.
In Figure 2-9 on page 28, the steps that occur when a Ceph client connects to the
cluster and then accesses data follow this high-level sequence:
1. Upon successful authentication, the client is provided with the cluster map.
2. The data placement for object name xxxxx is calculated.
3. The Ceph client initiates a connection with the primary OSD that protects the PG.
When a failure has occurred within the cluster, as the Ceph client tries to access a specific
OSD, the following cases can occur:
The OSD it is trying to contact is unavailable.
The OSD it is trying to contact is available.
In the first case, the Ceph client will fall back to the Monitors to obtain an updated copy of the
cluster map.
In Figure 2-10 on page 29, the steps that occur when a Ceph client connects to the
cluster and then accesses data follow this high-level sequence:
4. As the target OSD has become unavailable, the client contacts the Monitors to obtain the
latest version of the cluster map.
5. The data placement for object name xxxxx is recalculated.
6. The Ceph client initiates a connection with the new primary OSD that protects the PG.
In the second case, the OSD will detect that the client has an outdated map version and will
provide the map updates that took place between the map version used by the Ceph client and
the map version used by the OSD. Upon receiving these updates, the Ceph client will recalculate
the data placement and retry the operation, ensuring that it is aware of the latest cluster
configuration and can interact with OSDs accordingly.
CephX is enabled by default during deployment, and it is recommended to keep it enabled to
secure the cluster. However, some benchmark results available online may show CephX
disabled to eliminate protocol overhead during testing.
The installation process enables cephx by default, so that the cluster requires user
authentication and authorization by all client applications.
The usernames used by the Ceph daemons are expressed as {type}.{id}. For example
mgr.servera, osd.0 or mds.0. They are created during the deployment process.
The usernames used by Ceph clients are expressed as client.{id}. For example, when
connecting OpenStack Cinder to a Ceph cluster it is common to create the client.cinder
username.
The username used by the RADOS Gateways connecting to a Ceph cluster follows the
client.rgw.hostname structure.
By default, if no argument is passed to the librados API via code or via the Ceph command
line interface, the connection to the cluster will be attempted with the client.admin username.
The default behavior can be configured via the CEPH_ARGS environment variable. To specify a
specific username, use export CEPH_ARGS="--name client.myid". To specify a specific
userid, use export CEPH_ARGS="--id myid".
Keyrings
Upon creating a new username, a corresponding keyring file is generated in the Microsoft
Windows ini file format. This file contains a section named [client.myid] that holds the
username and its associated unique secret key. When a Ceph client application running on a
different node needs to connect to the cluster using this username, the keyring file must be
copied to that node.
When the application starts, librados, which is ultimately called regardless of the access
method used, searches for a valid keyring file in /etc/ceph. The keyring file name is
generated as {clustername}.{username}.keyring.
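As a hedged illustration, the commands below create a client username and show the resulting keyring file; the username and the key value are hypothetical.

# Create a client user and write its keyring to /etc/ceph
ceph auth get-or-create client.myid -o /etc/ceph/ceph.client.myid.keyring

cat /etc/ceph/ceph.client.myid.keyring
[client.myid]
        key = AQB3...illustrative-key...==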
All Ceph Monitors can authenticate users so that the cluster does not present any single point
of failure when it comes to authentication. The process follows these steps:
1. Ceph client contacts the Monitors with a username and a keyring.
2. Ceph monitors return an authentication data structure like a Kerberos ticket that includes a
session key.
3. Ceph client uses the session key to request specific services.
4. Ceph Monitor provides the client with a ticket so it can authenticate with the OSDs.
5. The ticket expires, so it cannot be reused, which prevents spoofing risks.
Capabilities or Caps
Each username is assigned a set of capabilities that enables specific actions against the
cluster.
The cluster offers the equivalent of the Linux root user known as client.admin. By default, a
user has no capabilities and must be allowed specific rights. This is achieved via the allow
keyword that precedes the access type granted by the capability.
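For example, the following sketch grants a hypothetical client.cinder user read access to the Monitors and read/write access to a single pool named volumes:

ceph auth get-or-create client.cinder mon 'allow r' osd 'allow rw pool=volumes'

# Review the capabilities assigned to an existing user
ceph auth get client.cinder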
Profiles
To simplify capability creation, cephx profiles can be leveraged (a usage sketch follows the list):
profile osd: User can connect as an OSD and communicate with OSDs and Monitors.
profile mds: User can connect as an MDS and communicate with MDSs and Monitors.
profile crash: Read only access to Monitors for crash dump collection.
profile rbd: User can manipulate and access RBD images.
profile rbd-read-only: User can access RBD images in read only mode.
Other deployment reserved profiles exist and are not listed for clarity.
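As a sketch, a user restricted to RBD operations on a hypothetical vms pool can be created with the rbd profiles:

ceph auth get-or-create client.vmuser mon 'profile rbd' osd 'profile rbd pool=vms'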
All access methods except librados automatically stripe the data stored in RADOS into 4 MiB
objects by default, which can be customized. For example, when using CephFS, storing a 1
GiB file from a client perspective results in creating 256 × 4 MiB RADOS objects, each
assigned to a placement group within the pool used by CephFS.
As another example, Figure 2-11 on page 32 represents the RADOS object layout of a 32
MiB RBD Image created in a pool with ID 0 on top of OSDs.
Chapter 3, “IBM Storage Ceph main features and capabilities ” on page 33 will provide you
with more details regarding each access method.
2.3 Deployment
Many Ceph deployment tools have existed throughout the Ceph timeline:
mkcephfs, historically the first tool.
ceph-deploy, starting with Cuttlefish.
ceph-ansible, starting with Jewel.
cephadm, starting with Octopus and later.
The IBM Storage Ceph documentation details how to use cephadm to deploy your Ceph cluster. A
cephadm-based deployment follows these steps (a minimal command sketch follows the list):
For beginners:
– Bootstrap your Ceph cluster (create one initial Monitor and Manager).
– Add services to your cluster (OSDs, MDSs, RADOS Gateways and so forth).
For advanced users:
– Bootstrap your cluster with a complete service file to deploy everything.
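The commands below are a minimal sketch of a cephadm-based deployment; the IP addresses, hostnames, and service names are hypothetical.

# Bootstrap the first node (creates one initial Monitor and Manager on this host)
cephadm bootstrap --mon-ip 192.0.2.10

# Add more hosts, then deploy OSDs on all available devices
ceph orch host add node2 192.0.2.11
ceph orch apply osd --all-available-devices

# Add other services, such as a CephFS metadata server and RADOS Gateways
ceph orch apply mds mycephfs
ceph orch apply rgw myrgw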
For the guidelines and recommendations about cluster deployment, refer to Chapter 6, “Day 1
and Day 2 operations” on page 121.
Block storage devices are thin-provisioned, resizable volumes that store data striped over
multiple OSDs. Ceph block devices leverage RADOS capabilities, such as snapshots,
replication, and data reduction. Ceph block storage clients communicate with Ceph clusters
through kernel modules or the librbd library.
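As an illustration, the following sketch creates a pool for RBD, creates a thin-provisioned image, and maps it through the kernel module; the pool and image names are hypothetical.

# Create and initialize a pool for RBD, then create a 10 GiB thin-provisioned image
ceph osd pool create rbdpool 64 64 replicated
rbd pool init rbdpool
rbd create rbdpool/vm-disk-1 --size 10G

# Map the image through the kernel module and use it like any block device
rbd map rbdpool/vm-disk-1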
Ceph File System (CephFS) is a file system service compatible with POSIX standards and
built on top of Ceph’s distributed object store. CephFS provides file access to internal and
external clients, using POSIX semantics wherever possible. CephFS maintains strong cache
coherency across clients. The goal is for processes that use the file system to behave the
same when they are on different hosts as when they are on the same host. It is easy to
manage and yet meets the growing demands of enterprises for a broad set of applications
and workloads. CephFS can be further extended using industry standard file sharing
protocols such as NFS.
Object Storage is the primary storage solution that is used in the cloud and by on-premises
applications as a central storage platform for large quantities of unstructured data. Object
Storage continues to increase in popularity due to its ability to address the needs of the
world’s largest data repositories. IBM Storage Ceph provides support for object storage
operations via the S3 API with a high emphasis on what is commonly referred to as S3 API
fidelity.
that data is stored redundantly and can be recovered in the event of a failure. This makes
it especially suitable for use cases such as archiving, backup and disaster recovery.
Object storage is also cost-effective: it uses commodity hardware, which is less expensive
than the specialized storage hardware used in proprietary storage systems.
Object storage systems usually have built-in features such as data versioning, data tiering,
and data lifecycle management.
Object storage can be used for a variety of use cases, including archiving, backup and
disaster recovery, media and entertainment, and big data analytics.
The Ceph Object Gateway provides interfaces compatible with OpenStack Swift and AWS S3,
and it has its own user management system. The Ceph Object Gateway can store data in the
same Ceph storage cluster that is used to store data from Ceph Block Device and CephFS
clients; however, it involves separate pools and likely a different CRUSH hierarchy.
The Ceph Object Gateway is a separate service that runs containerized on a Ceph node
and provides object storage access to its clients. In a production environment, we
recommend running more than one instance of the object gateway service on different
Ceph nodes to provide high availability, which the clients access through an IP Load
Balancer. See Figure 3-1.
Industry-specific use cases for the Ceph Object Gateway include the following examples:
Healthcare and Life Sciences:
– Medical imaging, such as picture archiving and communication system (PACS) and
magnetic resonance imaging (MRI).
– Genomics research data.
– Health Insurance Portability and Accountability Act (HIPAA) of 1996 regulated data.
Media and entertainment (for example, audio, video, images, and rich media content).
Financial services (for example, regulated data that requires long-term retention or
immutability).
Object Storage as a Service (SaaS) as a catalogue offering (cloud or on-premises).
The RADOS Gateway stores its data, including user data and metadata, in a dedicated set
of Ceph Storage pools. This specialized storage mechanism ensures efficient data
management and organization within the Ceph cluster.
RADOS Gateway Data pools are typically stored on hard disk drives (HDDs) with erasure
coding for cost-effective and high-capacity configurations.
Data pools can also utilize solid-state drives (SSDs) in conjunction with erasure coding
(EC) to cater to performance-sensitive object storage workloads.
The bucket index pool, which can store millions of bucket index key/value entries (one per
object), is a critical component of Ceph Object Gateway's performance. Due to its
performance-sensitive nature, the bucket index pool should exclusively utilize flash media,
such as SSDs, to ensure optimal performance and responsiveness.
Instances of non-collocated RGW daemons on nodes that are dedicated to the RGW
service. See Figure 3-5 on page 40.
The RADOS Gateway offers interfaces compatible with both AWS S3 and OpenStack Swift,
providing seamless integration with existing cloud environments. Additionally, it features an
Admin operations RESTful API for automating day-to-day operations, streamlining
management tasks.
The Ceph Object Gateway constructs of a Realm, Zone Groups, and Zones are used to
define the organization of a storage network for purposes of replication and site protection.
A deployment in a single data center can be very simple to install and easy to manage. In
recent feature updates, the Dashboard UI has been enhanced to provide a single point of
control for the startup and ongoing management of the Ceph Object Gateway in a single data
center. In this case, there is no need to define Zones and Zone Groups and a minimized
organization is automatically created.
For deployments that involve multiple data centers and multiple IBM Storage Ceph clusters, a
more detailed configuration is required with granular control using the cephadm CLI or the
dashboard UI. In these scenarios, the Realm, Zone Group and Zones must be defined by the
storage administrator and configured accordingly. The IBM Storage Ceph documentation fully
describes these constructs in great detail.
Figure 3-7 on page 42 shows a design supporting a multiple data center implementation for
replication and disaster recovery.
To avoid single points of failure in our Ceph RGW deployment, we need to provide an
S3/RGW endpoint that can tolerate the failure of one or more RGW services. RGW is a RESTful
HTTP endpoint that can be load-balanced for high availability and increased performance. There
are some great examples of different RadosGW load-balancing mechanisms in this repo.
Starting with IBM Storage Ceph 5.3, Ceph provides an HA and load-balancing stack called
ingress based on keepalived and haproxy. The ingress service allows you to create a
high-availability endpoint for RGW with a minimum set of configuration options.
The orchestrator will deploy and manage a combination of haproxy and keepalived to balance
the load on a floating virtual IP. If SSL is used, then SSL must be configured and terminated
by the ingress service, not RGW itself.
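The following service specification is a minimal sketch of an ingress deployment in front of an existing RGW service; the service names, virtual IP, and ports are hypothetical.

ceph orch apply -i - <<'EOF'
service_type: ingress
service_id: rgw.myrgw
placement:
  count: 2
spec:
  backend_service: rgw.myrgw
  virtual_ip: 192.0.2.100/24
  frontend_port: 8080
  monitor_port: 1967
EOF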
S3 compatibility
S3 compatibility provides object storage functionality with an interface that is compatible with
a large subset of the AWS S3 RESTful API.
Swift compatibility
Provides object storage functionality with an interface that is compatible with a large subset of
the OpenStack Swift API.
The S3 and Swift APIs share a common namespace, so you can write data with one API and
retrieve it with the other. The S3 namespace can also be shared with NFS clients to offer a
true multiprotocol experience for unstructured data use cases.
Administrative API
Provides an administrative restful API interface for managing the Ceph Object Gateways.
Administrative API requests are done on a URI that starts with the admin resource end point.
Authorization for the administrative API mimics the S3 authorization convention. Some
operations require the user to have special administrative capabilities. The response type can
be either XML or JSON by specifying the format option in the request, but defaults to the
JSON format.
Management
The Ceph Object Gateway can be managed using the Ceph Dashboard UI, the Ceph
command line (cephadm), the Administrative API mentioned above, and through service
specification files.
IBM Storage Ceph object storage further enhances security by incorporating IAM
compatibility for authorization. This feature introduces IAM Role policies, empowering users
to request and assume specific roles during STS authentication. By assuming a role, users
inherit the S3 permissions configured for that role by an RGW administrator. This role-based
access control (RBAC) or attribute-based access control (ABAC) approach enables granular
control over user access, ensuring that users only access the resources they need.
Immutability
IBM Storage Ceph Object storage also supports the S3 Object Lock API in both Compliance
and Governance modes. Ceph has been certified or passed the compliance assessments for
SEC 17a-4(f), SEC 18a-6(e), FINRA 4511(c) and CFTC 1.31(c)-(d). Certification documents
will be available upon their publication.
Archive zone
The archive zone uses the multi-site replication and S3 object versioning features; it keeps
all versions of all the objects available even when they are deleted in the production site.
An archive zone provides you with a history of versions of S3 objects that can only be
eliminated through the gateways associated with the archive zone. Including an archive zone
in your multisite zone replication setup gives you the convenience of an S3 object history
while saving space that replicas of the versioned S3 objects would consume in the production
zones.
You can control the storage space usage of the archive zone through bucket lifecycle policies,
where you can define the number of versions you would like to keep for each object.
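As a hedged illustration, the following AWS CLI call applies a lifecycle rule that keeps a limited number of noncurrent versions; the endpoint, bucket name, and retention values are hypothetical, and the NewerNoncurrentVersions field assumes a Ceph release that supports it.

aws --endpoint-url=http://archive-rgw:80 --profile=ceph \
    s3api put-bucket-lifecycle-configuration --bucket s3-bucket-1 \
    --lifecycle-configuration '{
      "Rules": [
        {
          "ID": "trim-noncurrent-versions",
          "Status": "Enabled",
          "Filter": {"Prefix": ""},
          "NoncurrentVersionExpiration": {"NoncurrentDays": 30, "NewerNoncurrentVersions": 5}
        }
      ]
    }'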
An archive zone helps protect your data against logical or physical errors. It can save users
from logical failures, such as accidentally deleting a bucket in the production zone. It can also
save your data from massive hardware failures, like a complete production site failure.
Additionally, it provides an immutable copy, which can help build a ransomware protection
strategy.
Security
The following security capabilities are available:
Data access auditing by enabling the RadosGW OPS logs feature.
Multi-Factor Authentication for Delete.
Support for Secure Token Service (STS), helping to avoid the use of long-lived S3 keys. STS provides temporary and limited-privilege credentials.
TLS/SSL protection of the S3 HTTP endpoint provided by the RGW services; both external and self-signed SSL certificates are supported.
Replication
IBM Ceph Object Storage provides enterprise-grade, highly mature object geo-replication
capabilities. The RGW multi-site replication feature facilitates asynchronous object replication
across single or multi-zone deployments. Leveraging asynchronous replication with eventual
consistency, Ceph Object Storage operates efficiently over WAN connections between
replicating sites.
With the latest 6.1 release, Ceph Object Storage introduces granular bucket-level replication,
unlocking a plethora of valuable features. Users can now enable or disable sync per individual
bucket, enabling precise control over replication workflows. This empowers full-zone
replication while opting out specific buckets, replicating a single source bucket to
multi-destination buckets, and implementing both symmetrical and directional data flow
configurations.
Figure 3-8 on page 45 shows the IBM Ceph Object Storage replication feature.
Storage policies
Ceph Object Gateway supports different storage classes for the placement of internal RGW
data structures. For example, SSD storage pool devices are recommended for index data,
while HDD storage pool devices can be targeted for high capacity bucket data. Ceph Object
Gateway also supports storage lifecycle policies and transitions for data placement on tiers of
storage depending on the age of the content.
Transitions across storage classes, as well as protection policies (for example, replicas,
erasure coding), are supported. Ceph Object Gateway also supports policy-based data
archiving; as of the date of this publication, archiving to Amazon S3 and Microsoft Azure is
supported. Archiving to IBM Cloud Object Storage is also under consideration as a roadmap
item.
Multiprotocol
Ceph Object Gateway supports a single unified namespace for S3 client operations (S3 API)
and NFS client operations (NFS Export) to the same bucket. This provides a true
multiprotocol experience for a variety of use cases, particularly in situations where
applications are being modernized from traditional file sharing access to native object storage
access. IBM recommends limiting the use of S3 and NFS to the same namespace in use
cases such as data archives, rich content repositories, or document management stores; that
is, use cases where the files and objects are unchanging by design. Multiprotocol is not
recommended for live collaboration use cases where multiuser modifications to the same
content is required.
IBM watsonx.data
IBM Storage Ceph is the perfect candidate for a data lake or data lakehouse. As an example,
watsonx.data includes an IBM Storage Ceph license, so it can be used out of the box when
deploying watsonx.data; as a result, the integration and level of testing between watsonx.data
and IBM Storage Ceph is first class. Some of the features that make Ceph a great match for
watsonx.data are S3 Select, table encryption, IDP authentication with STS, and data center
caching with D3N. S3 Select is a recent innovation that extends object storage to
semi-structured use cases. An example of a semi-structured object is one that contains
comma-separated values (CSV), JSON, or Parquet file formats. S3 Select allows a client to
GET a subset of the object content by using SQL-like arguments to filter the resulting
payload. Ceph Object Gateway currently supports S3 Select for alignment with data lakehouse
storage with IBM watsonx.data clients. At the time of publication, S3 Select supports
the CSV, JSON, and Parquet formats.
Bucket features
Ceph supports advanced bucket features such as S3 bucket policy, S3 object versioning, S3
object lock, rate limiting, bucket object count quotas, and bucket capacity quotas. In addition to
these advanced bucket features, Ceph Object Gateway boasts impressive scalability that
empowers organizations to store massive amounts of data with ease and efficiency.
Bucket notifications
Ceph Object Gateway supports bucket event notifications, a crucial feature for event-driven
architectures and widely used when integrating with OpenShift Data Foundation (ODF)
externally. Notifications enable real-time event monitoring and triggering of downstream
actions, such as data replication, alerting, and workflow automation.
Note: You can refer to “Chapter 4 - S3 bucket notifications for event-driven architectures”
in the IBM Redpaper IBM Storage Ceph Solutions Guide, REDP-5715 for a detailed
discussion of the event-driven architectures.
2. In the Create Service dialog box, enter values similar to those as shown below and click
Create Service. See Figure 3-11.
The running RGW service can be observed in any of the following dashboard locations:
Ceph Dashboard home page → Object Gateways section
Ceph Dashboard → Cluster → Services
Ceph Dashboard → Object Gateway - Daemons
2. In the Create User dialogue, enter the required values using Figure 3-13 as a guide.
When finished, click the Create User button.
2. In the Create Bucket dialogue, enter the required values using Figure 3-15 on page 51 as
a guide. When finished, click the Create Bucket button.
Figure 3-16 on page 52 shows the Ceph Object gateway services listing.
Figure 3-17 on page 53 shows the Ceph Object Gateway user listing.
Figure 3-18 on page 53 shows the Ceph Object Gateway bucket listing.
Figure 3-19 Display S3 access key and secret key for a selected user
2. Alternatively, obtain the S3 access keys using the Ceph command line. The
radosgw-admin command can be run from the shell, or within the cephadm CLI. See
Example 3-1.
Note: Substitute the RGW username you created for “john” in the previous section. The
access key and secret key will differ and be unique for each Ceph cluster.
Example 3-1 Ceph Object Gateway S3 access key and secret key
[root@node1 ~]# radosgw-admin user info --uid="john"
{
"user_id": "john",
"display_name": "Shubeck, John",
"email": "[email protected]",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [
{
"user": "john",
"access_key": "LGDR3IJB94XZIV4DM7PZ",
"secret_key": "qHAW3wdLGgGh78pyz8pigjxVeoM1sz1HT6lIdYD3"
}
],
. . . output omitted . . .
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}
It is worth noting here that there is nothing remarkable about an object. If a unit of data, for
example a .JPG image, is stored in a file system, then we refer to it as a file. If at some point
that file is uploaded or PUT into an object storage bucket, we are likely to refer to it as an
object. Regardless of where it is stored, nothing changes the nature of that image. The binary
data within the file or object, and the image that can be rendered from it, are unchanged.
1. Configure the AWS CLI tool to use the client credentials. Enter the access key and secret
key that the system generated for you in the previous step. See Example 3-2.
Note: The access key and secret key will differ and be specific to each Ceph cluster.
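Although Example 3-2 is not reproduced here, a minimal sketch of the interactive configuration might look like the following; the keys shown are the illustrative values from Example 3-1.

[root@client ~]# aws configure --profile=ceph
AWS Access Key ID [None]: LGDR3IJB94XZIV4DM7PZ
AWS Secret Access Key [None]: qHAW3wdLGgGh78pyz8pigjxVeoM1sz1HT6lIdYD3
Default region name [None]: default
Default output format [None]: json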
2. The AWS CLI client uses the S3 bucket as the repository to write and read objects. The
action of uploading a local file to an S3 bucket is called a PUT, and the action of
downloading an object from an S3 bucket to a local file is called a GET. The syntax of the
various S3 API clients might use different commands, but the underlying S3 API operation is
always a PUT or a GET. See Example 3-3.
3. Create a new S3 bucket from the AWS CLI client (optional). See Example 3-4 on page 56.
Note: The endpoint value should follow the local hostname of the Ceph Object Gateway
daemon host.
4. Create a 10 MB file called 10MB.bin. Upload the file to one of the S3 buckets. See
Example 3-5.
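A minimal sketch of this step, assuming the same endpoint and bucket as in Example 3-6, might look like the following:

[root@client ~]# dd if=/dev/urandom of=/tmp/10MB.bin bs=1M count=10
[root@client ~]# aws --endpoint-url=http://ceph-node3:80 --profile=ceph \
    s3 cp /tmp/10MB.bin s3://s3-bucket-1/10MB.bin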
5. Get a bucket listing to view the test object. Download the object to a local file. See
Example 3-6.
Example 3-6 AWS CLI list objects and get object to file
[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 ls s3://s3-bucket-1
2023-07-05 16:55:39 10485760 10MB.bin
6. Verify the data integrity of the uploaded and downloaded files. See Example 3-7.
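As a sketch of this step, the object can be downloaded and compared against the original with a checksum; the file names assume the upload sketched earlier.

[root@client ~]# aws --endpoint-url=http://ceph-node3:80 --profile=ceph \
    s3 cp s3://s3-bucket-1/10MB.bin /tmp/10MB-downloaded.bin
[root@client ~]# md5sum /tmp/10MB.bin /tmp/10MB-downloaded.bin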
S3 API compatibility
As a developer, you can use a RESTful application programming interface (API) that is
compatible with the Amazon S3 data access model. It is through the S3 API that object
storage clients and applications store, retrieve, and manage the buckets and objects stored in
an IBM Storage Ceph cluster. IBM Storage Ceph, and more broadly the Ceph community,
continue to invest heavily in a design goal referred to as “S3 Fidelity”. This means clients,
and in particular Independent Software Vendors (ISVs), can enjoy independence and
transportability for their applications across S3 vendors in a hybrid multi-cloud.
At a high level, the supported S3 API features in IBM Storage Ceph Object Storage are:
Basic bucket operations (PUT, GET, LIST, HEAD, DELETE).
Advanced bucket operations (Bucket policies, website, lifecycle, ACLs, versions).
Basic object operations (PUT, GET, LIST, POST).
Advanced object operations (object lock, legal hold, tagging, multipart, retention).
S3 select operations (CSV, JSON, Parquet formats).
Support for both virtual hostname and pathname bucket addressing formats.
At the time of publication, the following list shows the S3 API support for bucket and object
operations.
3.2.11 Conclusion
IBM Storage Ceph Object storage provides a scale-out high capacity object store for S3 API
and Swift API client operations.
The Ceph service at the core of object storage is the RADOS Gateway (RGW). The Ceph
Object Gateway services its clients through S3 endpoints, which are Ceph nodes where
instances of RGW operate and, in turn, service S3 API requests on well-known TCP ports
via HTTP and HTTPS.
Because the Ceph OSD nodes and the Ceph Object Gateway nodes can be deployed
separately, the cluster offers the ability to independently scale bandwidth and capacity across
a broad range of object storage workloads and use cases.
Note: This section provides a brief overview of the block storage feature in Ceph. All the
following points, and many others, are documented in detail in the IBM Storage
Ceph documentation.
IBM Storage Ceph block storage, also commonly referred to as RBD (RADOS Block Device)
images, is a distributed block storage system that allows for the management and
provisioning of block storage volumes, like traditional storage area networks (SANs) or
direct-attach storage (DAS).
RBD images can be accessed either through a kernel module (for Linux and Kubernetes) or
through the librbd API (for OpenStack and Proxmox). In the Kubernetes world, RBD images
are well-suited for Read Write Once (RWO) Persistent Volume Claims (PVCs).
The virtual machine (VM), through the virtio-blk driver and the Ceph library, accesses an RBD
image as if it were a physical drive directly attached to the VM.
The RBD cache is enabled by default in write-back mode on the Ceph client machine, but it can be
set to write-through mode.
The following parameters can be used to control each librbd client caching:
rbd_cache - true or false (defaults to true)
rbd_cache_size - Cache size in bytes (defaults to 32MiB per RBD Image)
Snapshots
RBD images, like many storage solutions, can have snapshots, which are very convenient for
data protection, testing and development, and virtual machine replication. RBD snapshots
capture the state of an RBD image at a specific point in time using the Copy-On-Write (COW)
technology and IBM Storage Ceph supports up to 512 snapshots per RBD image (the
number is technically unlimited, but the volume's performance is negatively affected).
They are read-only and are used to keep track of changes made to the original RBD image,
so it is possible to roll back to the state of the RBD image at the snapshot creation time. This
also means that snapshots cannot be modified. To make modifications, snapshots must be
used in conjunction with clones.
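For illustration, a snapshot of an image can be created, listed, and rolled back with the rbd command line; the pool and image names below are hypothetical:
rbd snap create rbd/vm-disk-1@before-upgrade    # point-in-time, read-only snapshot
rbd snap ls rbd/vm-disk-1                       # list the snapshots of the image
rbd snap rollback rbd/vm-disk-1@before-upgrade  # return the image to the snapshot state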
Clones
Snapshots support Copy-On-Write (COW) clones, also known as snapshot layering, which
are new RBD images that share data with the original image or snapshot (parent). Using this
feature, many writable (child) clones can be created from one single snapshot (theoretically
no limit), allowing for rapid provisioning of new block devices that are initially identical to the
source data. For example, you might create a block device image with a Linux VM written to it.
Then, snapshot the image, protect the snapshot, and create as many clones as you like. A
snapshot is read-only, so cloning a snapshot simplifies semantics—making it possible to
create clones rapidly. See Figure 3-23 on page 63.
Because clones rely on a parent snapshot, losing the parent snapshot will cause all child
clones to be lost. Therefore, the parent snapshot must be protected before creating clones.
See Figure 3-24.
Clones are essentially new images, so they can be snapshotted, resized, or renamed. They
can also be created on a separate pool for performance or cost reasons. Finally, clones are
storage-efficient because only the modifications made to them are stored on the cluster.
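The protect, clone, and flatten workflow can be sketched as follows, using hypothetical pool and image names:
rbd snap create rbd/golden-image@v1
rbd snap protect rbd/golden-image@v1            # a parent snapshot must be protected before cloning
rbd clone rbd/golden-image@v1 rbd/vm-clone-01   # writable COW child of the snapshot
rbd children rbd/golden-image@v1                # list the clones that depend on the snapshot
rbd flatten rbd/vm-clone-01                     # optionally copy all data and detach the clone from its parent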
Journaling mode can also be enabled to store data. In this mode, all writes are sent to a
journal before being stored as an object. The journal contains all recent writes and metadata
changes (device resizing, snapshots, clones, and so on). Journaling mode is intended to be
used for RBD mirroring.
RBD mirroring also supports full lifecycle, meaning that after a failure, the synchronization
can be reversed and the original site can be made the primary site again once it is back
online. Mirroring can be set at the image level, so not all cluster images have to be replicated.
RBD mirroring can be one-way or two-way. With one-way replication, multiple secondary sites
can be configured. Figure 3-25 shows one-way RBD mirroring.
Two-way mirroring limits replication to two storage clusters. Figure 3-26 shows two-way RBD
mirroring.
Another way of mirroring RBD images is to use snapshot-based mirroring. In this method, the
remote cluster site monitors the data and metadata differences between two snapshots
before copying the changes locally. Unlike the journal-based method, the snapshot-based
method must be scheduled or launched manually, so it provides a coarser, point-in-time
recovery point, but it might also be faster.
Note: For more details about mirroring, refer to the IBM Storage Ceph documentation.
Striping spreads data across multiple objects. Striping helps with parallelism for sequential
read and write workloads.
Exclusive locks prevent multiple processes from accessing RBD images at the same time
in an uncoordinated fashion. This helps to address the write conflict situation that can
occur when multiple clients try to write to the same object, which can be the case in
virtualization environments or when using RBD mirroring to avoid simultaneous access to
the journal.
This feature is enabled by default on new RBD devices. It can be disabled, but other
features that rely on it may be affected.
This feature is mostly transparent to the user. When a client attempts to access an RBD
device, it requests an exclusive lock on the device. If another client already has a lock on
the device, the lock request is denied. The client holding the lock is requested to release it
when its write is done, allowing the other client to access the device.
Object map support depends on exclusive lock support. Block devices are thin
provisioned, meaning they only store data that actually exists. Object map support tracks
which objects actually exist (have data stored on a drive). Enabling object map support
speeds up I/O operations for cloning, importing and exporting sparsely populated images,
and deleting.
Fast-diff support depends on object map support and exclusive lock support. It adds
another property to the object map, making it much faster to generate diffs between
snapshots of an image and determine the actual data usage of a snapshot.
Deep-flatten enables RBD flatten to work on all snapshots of an image, in addition to the
image itself. Without deep-flatten, snapshots of an image rely on the parent, so the parent
cannot be deleted until the snapshots are deleted. Deep-flatten makes a parent
independent of its clones, even if they have snapshots.
Journaling support depends on exclusive lock support. Journaling records all
modifications to an image in the order they occur. RBD mirroring utilizes the journal to
replicate a crash consistent image to a remote cluster.
Encryption
Using the Linux Unified Key Setup (LUKS) 1 or 2 format, an RBD image can be encrypted when
accessed through librbd (krbd is not supported yet). The format operation persists the encryption
metadata to the RBD image. The encryption key is secured by a passphrase provided by the user
to create and to access the device once it is encrypted.
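A minimal sketch, assuming a pool named rbd, an image named secure-disk, and a passphrase stored in the file passphrase.txt:
rbd create rbd/secure-disk --size 10G
rbd encryption format rbd/secure-disk luks2 passphrase.txt   # persist the LUKS2 metadata and encrypt the image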
Note: For more information about encryption, refer to the IBM Storage Ceph
documentation.
Quality of service
Using librbd, it is possible to limit per-image I/O using parameters that are disabled by default and
operate independently of each other, meaning that write IOPS can be limited while read
IOPS is not. These parameters are:
IOPS: number of I/Os per second (any type of I/O)
read IOPS: number of read I/Os per second
write IOPS: number of write I/Os per second
bps: bytes per second (any type of I/O)
read bps: bytes per second read
write bps: bytes per second written
These settings can be configured at the image creation time or any time after, using either the
rbd command line tool or the Dashboard.
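For example, per-image limits might be applied and reviewed with the rbd config command; the pool and image names are hypothetical:
rbd config image set rbd/vm-disk-1 rbd_qos_write_iops_limit 500   # cap writes at 500 IOPS
rbd config image set rbd/vm-disk-1 rbd_qos_bps_limit 104857600    # cap total throughput at 100 MiB/s
rbd config image ls rbd/vm-disk-1 | grep qos                      # review the effective QoS settings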
Namespace isolation
Namespace isolation allows you to restrict client access to different private namespaces
using their authentication keys. In a private namespace, clients can only see their own RBD
images and cannot access the images of other clients in the same RADOS pool but located in
a different namespace.
You can create namespaces and configure images at image creation using the rbd
command-line tool or the Dashboard.
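A minimal sketch of creating a namespace and a client key restricted to it (all names are hypothetical):
rbd namespace create --pool rbd --namespace project1
ceph auth get-or-create client.project1 mon 'profile rbd' osd 'profile rbd pool=rbd namespace=project1'
rbd create --pool rbd --namespace project1 --size 10G disk1
rbd ls --pool rbd --namespace project1    # only images in the project1 namespace are visible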
Live migration
When live migration is initiated, the source image is deep-copied to the destination image,
preserving all snapshot history and sparse allocation of data where possible. The live
migration requires creating a target image that can maintain read and write access to the data
while the linked source image is marked read-only. Once the background migration is
complete, you can commit the migration, removing the cross-link and deleting the source
(unless import-only mode is used). You can also cancel the migration, removing the cross-link
and deleting the target image.
Trash
Moving an RBD image to the trash keeps it for a specified time before it is actually deleted. As
long as the retention time has not expired and the trash has not been purged, a trashed image
can be restored.
Trash management is available from both the command line tool and the Dashboard.
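For example, using the rbd command line (the image name and retention date are hypothetical):
rbd trash mv rbd/old-disk --expires-at "2025-01-31"   # move the image to the trash with a retention date
rbd trash ls rbd                                      # list trashed images and their IDs
rbd trash restore rbd/{image-id}                      # restore a trashed image by its ID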
Rbdmap service
The systemd unit file, rbdmap.service, is included with the ceph-common package. The
rbdmap.service unit executes the rbdmap shell script.
This script is useful for automatically mapping (and mounting) RBD images at boot time and
unmapping them at shutdown on a Ceph client: you simply add one entry per line describing each
RBD device (/dev/rbdX) and the credentials used to manage it.
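A minimal sketch of an /etc/ceph/rbdmap entry and the service activation (the pool, image, and keyring path are examples):
# /etc/ceph/rbdmap: one RBD image per line with the credentials used to map it
rbd/vm-disk-1    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

systemctl enable rbdmap.service    # map the listed images at boot and unmap them at shutdown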
RBD-NBD
RBD Network Block Device (RBD-NBD) is an alternative to Kernel RBD (KRBD) for mapping
Ceph RBD images. Unlike KRBD, which works in kernel space, RBD-NBD relies on the NBD
kernel module and works in user space. This allows access to librbd features such as
exclusive lock and fast-diff. NBD exposes mapped devices as local devices in paths like
/dev/nbdX.
In summary, RBD image features make them well-suited for a variety of workloads, including
both virtualization (virtual machines and containers) and high-performance applications (such
as databases and applications that require high IOPS and use small I/O size). RBD images
provide functionality like snapshots and layering, and stripe data across multiple servers in
the cluster to improve performance.
Although the shared file system feature was the original use case for IBM Storage Ceph, because
of the demand for block and object storage from cloud providers and companies implementing
OpenStack, this feature was put on the back burner and was the last of the current Ceph core
features to become generally available, preceded by virtual block devices (RADOS Block
Device) and the OpenStack Swift and Amazon S3 gateway (RADOS Gateway or RGW).
The first line of code of a prototype file system was written in 2004 during Sage Weil's
internship at Lawrence Livermore National Laboratory, and Sage continued working on the
project during another summer project at the University of California Santa Cruz in 2005 to
create a fully functional file system, named Ceph, which was presented at Supercomputing (SC)
and USENIX in 2006.
The Metadata Server also maintains a cache to improve metadata access performance while
managing the cache of CephFS clients to ensure the clients have the proper cache data and
to prevent deadlocks on metadata access.
When a new inode or dentry is created, updated, or deleted, the cache is updated and the
inode or dentry is recorded, updated, or removed in a RADOS pool dedicated to CephFS
metadata.
A Metadata Server can be marked as standby-replay for a given rank to apply journal
changes to its cache continuously. It allows for the failover between two Metadata Servers to
occur faster. If using this feature, each rank must be assigned to a standby-replay Metadata
Server.
This journal is used to maintain the consistency of the file system during a Metadata Server
failover operation as events can be replayed by the standby Metadata Server to reach a file
system state consistent with the state last reached by the now defunct previously active
Metadata Server.
The benefit of using a journal to record the changes is that most journal updates are
sequential. They are handled faster by all types of physical disk drives, and sequential
consecutive writes can be merged for even better performance.
The performance of the metadata pool where the journal is maintained is of vital importance
to the level of performance delivered by a CephFS subsystem. This is why it is a best practice
to use flash device based OSDs to host the placement groups of the metadata pool.
The journal comprises multiple RADOS objects in the metadata pool, and journals are striped
across the multiple objects for better performance. Each active Metadata Server maintains its
journal for performance and resiliency reasons. Old journal entries are automatically trimmed
by the active Metadata Server.
The file system configuration can then be paired with specific Metadata Server deployment
and configuration to serve each file system with the appropriate level of performance when it
comes to file system metadata access requirements:
A very hot directory requires a dedicated set of Metadata Servers.
An archival file system is essentially read-only, and metadata updates occur in large batches.
Increase the number of Metadata Servers to remediate a metadata access bottleneck.
Each file system is assigned a parameter named max_mds to set the maximum number of
Metadata Servers that can be active for a given file system. By default, this parameter is set to
1 for each file system created in the Ceph storage cluster.
The best practice when creating file systems is to deploy max_mds+1 Metadata Servers for a
single file system to keep one Metadata Server with a standby role to preserve the
accessibility and the resilience of the access to the metadata for the file system.
The number of active Metadata Servers can be dynamically increased or decreased for a
given file system without traffic disruption.
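For example, for a file system named myfs, the number of active Metadata Servers might be raised as follows:
ceph fs set myfs max_mds 2                 # allow up to two active MDS ranks
ceph fs set myfs standby_count_wanted 1    # warn if fewer than one standby MDS is available
ceph fs status myfs                        # verify the active and standby daemons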
Figure 3-28 provides a visual representation of a multi MDS configuration within a Ceph
cluster. The shaded nodes represent directories, and the unshaded nodes represent files.
The MDSs at the bottom of the picture illustrate which specific MDS rank is in charge of a
specific subtree.
Note: A Damaged rank will not be assigned to any MDS until it is fixed and after the Ceph
administrator issues a ceph mds repaired command.
To make the life of the Ceph administrator easier when it comes to the number of Metadata
Servers assigned to a file system, the Ceph Manager provides the MDS Autoscaler module
that will monitor the max_mds and the standby_count_wanted parameters for a file system.
Because a client may misbehave and cause some metadata to be held in the cache longer
than expected, the Ceph cluster has a built-in warning mechanism, controlled by the
mds_health_cache_threshold parameter, that triggers when the actual cache usage reaches
150% of the mds_cache_memory_limit parameter.
This is achieved by assigning the mds_join_fs parameter to a specific MDS instance. If a
rank becomes Failed, the Monitors in the cluster will assign to the Failed rank a standby
MDS whose mds_join_fs is set to the name of the file system to which the Failed rank
is assigned.
If no standby Metadata Server has the parameter set, an existing standby Metadata Server is
assigned to the Failed rank.
Ephemeral pinning
The dynamic tree partitioning of an existing directory subtree can be configured via policies
assigned to directories within a file system to influence the distribution of the Metadata Server
workload. The following extended attributes of a file system directory can be used to control
the balancing method across the active Metadata Servers:
ceph.dir.pin.distributed - All child directories are ephemerally pinned, distributed across the active ranks.
ceph.dir.pin.random - Percentage of child directories to be ephemerally pinned to a random rank.
The ephemeral pinning does not persist once the inode is dropped from the Metadata Server
cache.
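A sketch of setting these policies on directories of a mounted file system (the paths are hypothetical):
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/myfs/home    # distribute child directories across the active ranks
setfattr -n ceph.dir.pin.random -v 0.1 /mnt/myfs/tmp        # ephemerally pin about 10% of children to random ranks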
Manual pinning
If needed when the Dynamic Tree Partitioning is not satisfying, with or without subtree
partitioning policies (for example, one directory is hot), the Ceph administrator can pin a
directory to a particular Metadata Server. This is achieved by setting an extended attribute
(ceph.dir.pin) of a specific directory to indicate which Metadata Server rank will oversee
metadata requests for it.
Pinning a directory to a specific Metadata Server rank does not dedicate that rank to this
directory, as Dynamic Tree Partitioning never stops between active Metadata Servers.
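For example, a hot directory might be pinned to rank 2, and later unpinned, as follows (the path is hypothetical):
setfattr -n ceph.dir.pin -v 2 /mnt/myfs/hotdir    # pin the subtree to MDS rank 2
setfattr -n ceph.dir.pin -v -1 /mnt/myfs/hotdir   # a value of -1 removes the manual pin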
RADOS pools
Each file system requires one (1) pool to store the metadata managed by the Metadata
Servers and the file system journal and at least one (1) data pool to store the data itself.
The overall architecture for the Ceph File System can be represented in Figure 3-29.
Data layout
In this section we discuss metadata and data pools.
Metadata pool
The journaling mechanism uses dedicated objects and the journal is striped across multiple
objects for performance reasons.
Each inode in the file system is stored using a separate set of objects that will be named
{inode_number}.{inode_extent} starting with {inode_extent} as 00000000.
Data pool
The organization of the pool that contains the data is driven by a set of extended attributes
assigned to files and directories in the Ceph File System. By default, a Ceph File System is
created with one metadata pool and one data pool. An additional data pool can be attached to
the existing file system so it can be leveraged via the set of extended attributes managed by
the Metadata Server:
The placement and the organization of the data follow these rules:
A file, when created, inherits the layout attributes of its parent directory.
A file's attributes can only be changed if the file contains no data.
Files that already exist are not affected by attribute changes on the parent directory; only
files created afterward inherit the new layout.
All the attributes used at the file level have the same names but are prefixed with
ceph.file.layout.
The attributes can be visualized using the following commands with the file system mounted:
getfattr -n ceph.dir.layout {directory_path}
getfattr -n ceph.file.layout {file_path}
The attributes can be modified using the following commands with the file system mounted:
setfattr -n ceph.dir.layout.attribute -v {value} {dir_path}
setfattr -n ceph.file.layout.attribute -v {value} {file_path}
Let us look at the RADOS physical layout of a file that would have the following attributes,
inherited or not from its parent directory (Figure 3-30 on page 74):
File size is 8 MiB.
Object size is 4 MiB.
Stripe unit is 1 MiB.
Stripe count is 4.
We can dump the RADOS objects that are created to support the actual data. See
Example 3-9.
Example 3-9 RADOS objects that are created to support the actual data
$ mount.ceph 10.0.1.100:/ /mnt/myfs -o name=admin
$ df | grep myfs
10.0.1.100:/ 181141504 204800 180936704 1% /mnt/myf
$ mkdir /mnt/myfs/testdir
$ dd if=/dev/zero of=/mnt/myfs/emptyfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.194878 s, 538 MB/s
$ dd if=/dev/zero of=/mnt/myfs/testdir/emptyfileindir bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0782966 s, 1.3 GB/s
$ rados -p cephfs_data ls | cut -f1 -d. | sort -u
10000000001
100000001f6
$ for inode in $(rados -p cephfs_data ls | cut -f1 -d. | sort -u); do echo
"Processing INODE=${inode}";echo "----------------------------";rados -p
cephfs_data ls | grep $inode; done
Processing INODE=10000000001
----------------------------
10000000001.00000009
10000000001.0000000f
10000000001.00000012
10000000001.00000018
10000000001.00000015
10000000001.00000003
10000000001.0000000e
10000000001.00000002
10000000001.00000004
10000000001.0000000b
10000000001.00000010
10000000001.00000008
10000000001.0000000a
10000000001.00000005
10000000001.00000007
10000000001.00000014
10000000001.00000006
10000000001.00000000
10000000001.00000013
10000000001.0000000d
10000000001.00000017
10000000001.00000001
10000000001.00000011
10000000001.0000000c
10000000001.00000016
Processing INODE=100000001f6
----------------------------
100000001f6.00000017
100000001f6.00000018
100000001f6.00000016
100000001f6.0000000e
100000001f6.0000000d
100000001f6.0000000b
100000001f6.0000000c
100000001f6.00000004
100000001f6.00000015
100000001f6.00000005
100000001f6.00000001
100000001f6.00000007
100000001f6.00000012
100000001f6.00000000
100000001f6.00000003
100000001f6.00000008
100000001f6.00000006
100000001f6.00000013
100000001f6.00000011
100000001f6.00000009
100000001f6.0000000f
100000001f6.00000014
100000001f6.0000000a
100000001f6.00000002
100000001f6.00000010
We can see that two inodes were created, 10000000001 and 100000001f6, and each
actually counts 25 objects. As the file system is empty and has not been pre-configured or
customized, we use stripe_unit=4 MiB, stripe_count=1 and object_size=4 MiB; therefore,
25*4=100 MiB.
How can we know which file corresponds to which inode number? There is a simple way to
do that. See Example 3-11.
Example 3-11 printf command to find out which file is which inode number
$ printf '%x\n' $(stat -c %i /mnt/myfs/emptyfile)
10000000001
$ printf '%x\n' $(stat -c %i /mnt/myfs/testdir/emptyfileindir)
100000001f6
Volumes are used to manage exports through a Ceph Manager module and are currently
used to provide shared file system capabilities for OpenStack Manila and Ceph CSI in the
Kubernetes and Red Hat OpenShift environments:
Volumes represent an abstraction for Ceph file systems.
Sub-volumes represent an abstraction for directory trees.
Sub-volume-groups aggregate sub-volumes to apply specific common policies across
multiple sub-volumes.
The introduction of this feature enabled an easier creation of a Ceph file system through a
single command that automatically creates the underlying pools required by the file system
and then deploys the Metadata Servers to serve the file system: ceph fs volume create
{filesystem_name} [options].
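As an illustration of the abstraction, with hypothetical names and sizes:
ceph fs volume create myfs                                   # creates the pools and deploys the MDS daemons
ceph fs subvolumegroup create myfs csi
ceph fs subvolume create myfs pvc-0001 --group_name csi --size 10737418240
ceph fs subvolume getpath myfs pvc-0001 --group_name csi     # directory path backing the sub-volume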
Quotas
The Ceph File System supports quotas to restrict the number of bytes or the number of files
within a directory. The quota mechanism is enforced by both the Ceph kernel and the Ceph
FUSE clients.
Quotas are managed via extended attributes of a directory and can be set via the setfattr
command for a specific directory:
ceph.quota.max_bytes
ceph.quota.max_files
Note: The above parameters are managed directly by the OpenStack Ceph Manila driver
and the Ceph CSI driver at the sub-volume level.
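For example, to cap a directory at 100 GiB and 10,000 files on a mounted file system (the path is hypothetical):
setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/myfs/projects   # 100 GiB
setfattr -n ceph.quota.max_files -v 10000 /mnt/myfs/projects
getfattr -n ceph.quota.max_bytes /mnt/myfs/projects                   # verify the quota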
Later on, the Ceph FUSE client was created to allow non-Linux-based clients to access a Ceph
file system. See Figure 3-31.
Ganesha supports multiple protocols such as NFS v3 and NFS v4 and does so through a File
System Abstraction Layer also known as FSAL.
IBM Storage Ceph only supports NFS v4 with version 6.x but NFS v3 support is expected in a
future IBM Storage Ceph version.
The NFS implementation requires a specific Ceph Manager module to be enabled to leverage
this feature. The name of the module is nfs and can be enabled with the ceph mgr module
enable nfs command.
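A hedged sketch of enabling the module, deploying an NFS cluster, and exporting a file system (the identifiers are examples, and the export syntax can vary slightly between releases):
ceph mgr module enable nfs
ceph nfs cluster create nfs1 "2 ceph-node1,ceph-node2"    # deploy two NFS Ganesha daemons
ceph nfs export create cephfs --cluster-id nfs1 --pseudo-path /myfs --fsname myfs
ceph nfs cluster info nfs1                                # show the endpoints of the NFS service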
You can assign specific permissions to a directory for a specific Ceph client user name or
user id:
$ ceph fs authorize {filesystem_name} {client_id|client_name} {path} {permissions}
As an example, imagine a Ceph client cephx definition like this one (Example 3-12).
If we try to modify an attribute of the root directory (/) of the Ceph file system mounted on the
/mnt mountpoint, it is denied as the set of capabilities do not include 'p'.
# setfattr -n ceph.dir.layout.stripe_count -v 2 /mnt
setfattr: /mnt: Permission denied
If we create another user with the correct capabilities and then remount the Ceph file system
on the same /mnt mountpoint, the denial disappears. See Example 3-13.
Example 3-13 Create another user with the correct capabilities and then remount the Ceph file system
# mkdir /mnt/dir4
# ceph fs authorize myfs client.4 / rw /dir4 rwp
[client.4]
key = AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ==
# umount /mnt
# mount -t ceph ceph-node01.example.com,ceph-node02.example.com:/ /mnt -o
name=4,secret="AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ=="
# touch /mnt/dir4/file1
# setfattr -n ceph.file.layout.stripe_count -v 2 /mnt/dir4/file1
# getfattr -n ceph.file.layout /mnt/dir4/file1
# file: mnt/dir4/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304
pool=cephfs.fs_name.data"
Snapshots can be created at any level in the directory tree structure, including the root level of a
given file system.
Snapshot capabilities for specific users, as we have seen earlier, can be granted at the
individual client level via the 's' flag.
Snapshot capabilities can also be enabled or disabled as a whole feature for an entire file
system.
# ceph fs set {filesystem_name} allow_new_snaps true|false
To create a snapshot, the end user that has the Ceph file system mounted will simply create a
subdirectory within the '.snap' directory with the name of his or her choice.
# mkdir /mnt/.snap/mynewsnapshot
An end user with sufficient privileges can schedule regular snapshots of a specific directory
via the snap-schedule command. See Example 3-14.
This feature requires both clusters to run identical versions for support reasons, and at least
version 5.3 of IBM Storage Ceph.
The feature is based on CephFS snapshot, taken at regular intervals. The first snapshot will
require a complete transfer of the data while subsequent snapshot will only require
transferring the data that has been updated since the last snapshot was applied on the
remote cluster.
Ceph file system mirroring is disabled by default and must be enabled separately to make the
following changes to the Ceph clusters:
Enable the Manager mirroring module.
Deploy a cephfs-mirror component via cephadm.
Authorize the mirroring daemons in both clusters (source and target).
Peer the source and the target clusters.
Configure the path to mirror.
The sequence, shown in Example 3-15, highlights the sequence of changes required.
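As a hedged sketch of that sequence (the file system name myfs, the peer name, and the mirrored path are examples; the bootstrap token is abbreviated):
ceph mgr module enable mirroring                                       # on both clusters
ceph orch apply cephfs-mirror                                          # on the source cluster
ceph fs authorize myfs client.mirror_remote / rwps                     # on the target cluster
ceph fs snapshot mirror peer_bootstrap create myfs client.mirror_remote remote-site
ceph fs snapshot mirror enable myfs                                    # on the source cluster
ceph fs snapshot mirror peer_bootstrap import myfs {token}
ceph fs snapshot mirror add myfs /projects                             # path to mirror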
Note: For more information on sizing an IBM Storage Ceph environment, you can refer to
the Planning section of IBM Storage Ceph documentation.
4.1.1 IOPS-optimized
Input/output operations per second (IOPS) optimized deployments are suitable for cloud computing
operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack
or as containers on OpenShift. IOPS-optimized deployments require higher-performance
storage, such as flash storage, to improve IOPS and total throughput.
4.1.2 Throughput-optimized
Throughput-optimized deployments are ideal for serving large amounts of data, such as
graphic, audio, and video content. They require high-bandwidth networking hardware,
controllers, and hard disk drives with fast sequential read and write performance. If fast data
access is required, use a throughput-optimized storage strategy. If fast write performance is
required, consider using SSDs for metadata. A throughput-optimized storage cluster has the
following properties:
Lowest cost per MBps (throughput).
Highest MBps per TB.
97th percentile latency consistency.
4.1.3 Capacity-optimized
Capacity-optimized deployments are ideal for storing large amounts of data at the lowest
possible cost. They typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive large capacity SATA or
NL-SAS drives and can use a large number of OSD drives per server.
By default, Ceph uses replicated pools, which means that each object is copied from a
primary OSD node to one or more secondary OSDs. Erasure-coded pools reduce the disk
space required to ensure data durability, but are computationally more expensive than
replication.
Erasure coding is a method of storing an object in the Ceph storage cluster durably by
breaking it into data chunks (k) and coding chunks (m). These chunks are then stored in
different OSDs. In the event of an OSD failure, Ceph retrieves the remaining data (k) and
coding (m) chunks from the other OSDs and uses the erasure code algorithm to restore the
object from those chunks.
Erasure coding uses storage capacity more efficiently than replication. The n-replication
approach maintains n copies of an object (3x by default in Ceph), whereas erasure coding
maintains only k + m chunks. For example, 3 data and 2 coding chunks use 1.5x the storage
space of the original object.
Figure 4-2 shows the comparison between replication and erasure coding.
While erasure coding uses less storage overhead than replication, it requires more RAM and
CPU than replication to access or recover objects. Erasure coding is advantageous when
data storage must be durable and fault tolerant, but does not require fast read performance
(for example, cold storage, historical records, and so on).
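For example, a 4+2 erasure-coded pool with a host failure domain might be created as follows (the profile and pool names are examples):
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool erasure ec42
ceph osd pool set ecpool allow_ec_overwrites true    # required when RBD or CephFS data is placed on the pool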
Ceph supports two networks: a public network and a storage cluster network. The public
network handles client traffic and communication with Ceph monitors. The storage cluster
network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic. For
production clusters, it is recommended to use at least two 10 Gbps networks for each network
type.
Link Aggregation Control Protocol (LACP) mode 4 (with xmit_hash_policy layer3+4) can be used to
bond network interfaces. Use jumbo frames with a maximum transmission unit (MTU) of
9000, especially on the backend or cluster network.
Clusters that are used as object storage also deploy RADOS Gateway daemons (radosgw),
and if shared file services are required, additional Ceph Metadata Servers (ceph-mds) would
be configured.
Co-locating multiple services to a single cluster node has the following benefits:
Significant improvement in total cost of ownership (TCO) in smaller clusters.
Reduction from six hosts to four for the minimum configuration.
Easier upgrade.
Better resource isolation.
Modern x86 servers have 16 or more CPU cores and large amounts of RAM, which
justifies the co-location of services. Running each service type on dedicated cluster nodes
separated from the other types (non-collocated) would require a larger number of smaller servers,
which may even be difficult to find on the market.
There are rules to be considered that restrict deploying every type of service on a single
node.
For any host that has ceph-mon/ceph-mgr and OSD, only one of the following can be added
to the same host:
RADOS Gateway
Ceph Metadata Servers
Grafana
RADOS Gateway and Ceph Metadata services are always deployed on different nodes.
Grafana is typically deployed on a node, which does not host Monitor or Manager services, as
shown in Figure 4-3.
Smallest supported cluster size for IBM Storage Ceph is four nodes. An example of services
co-location for a four-node cluster is shown in Figure 4-4.
Daemon resource requirements may change between releases. Remember to check the
latest (or the release you are after) requirements from IBM Documentation.
Supported configurations: Check the Supported configurations page in IBM Documentation.
Note that you need to log in with your IBM Internet ID to access it.
Figure 4-6 shows aggregate server CPU and RAM resources required for an OSD node with
x number of HDD drives, SSD devices or NVMe devices.
Figure 4-6 Aggregate server CPU and RAM resources required for an OSD node
Note that it is strongly recommended to add SSD or NVMe devices equaling 4% of the
HDD capacity per server as RADOS Gateway metadata storage. This allows Ceph to perform
much better compared to not having such capacity.
Figure 4-7 shows performance scaling using large objects as more RGW instances are
deployed in a 7-node cluster.
Figure 4-7 Performance scaling using large objects as more RGW instances are deployed
Scaling from 1 to 7 RGW daemons in a 7-node cluster by spreading daemons across all
nodes improves performance 5-6x. Adding RGW daemons to nodes that already have an
RGW daemon does not scale performance as much.
Figure 4-8 shows the performance impact of adding a second RGW daemon to each of the 7
cluster nodes.
Figure 4-8 Performance effect of adding a second RGW daemon in each of the 7 cluster nodes
Small objects also scale well in RGW, and performance is typically measured in OPS
(operations per second). Figure 4-9 on page 91 shows the small object performance of
the same cluster.
IBM Storage Ceph includes an ingress service for RGW that deploys HAProxy and
keepalived daemons on hosts running RGW and allows for an easy creation of a single virtual
IP for the service.
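A minimal sketch of such an ingress specification applied with the orchestrator, assuming an existing RGW service named rgw.myrgw and an example virtual IP:
cat <<EOF | ceph orch apply -i -
service_type: ingress
service_id: rgw.myrgw
placement:
  count: 2
spec:
  backend_service: rgw.myrgw      # the RGW service to load balance
  virtual_ip: 192.168.10.100/24   # single virtual IP managed by keepalived
  frontend_port: 8080             # port exposed by HAProxy
  monitor_port: 1967
EOF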
Official recovery calculator tool to be used to determine node recovery time can be found
here: https://access.redhat.com/labs/rhsrc/
You must be registered as a Red Hat customer to be able to log in to the tool. Here is a link to
the site where you can create a Red Hat login:
https://www.redhat.com/wapps/ugc/register.html?_flowId=register-flow&_flowExecutio
nKey=e1s1
It is important to choose Host as the OSD failure domain. The number to observe is MTTR in
Hours (Host Failure), which needs to be less than 8 hours.
The use of the recovery calculator is especially important for small clusters, where recovery times
with large drives are often higher than 8 hours.
By using IBM Storage Ready Nodes for Ceph, customers get both hardware maintenance
and software support from IBM.
Currently there is one server model available with various drive configurations, as shown in
Figure 4-11 on page 93.
Figure 4-11 IBM Storage Ceph with Storage Ready Node specifications
Each node comes with two 3.84TB SSD acceleration drives, which are used as metadata
drives. Available data drives are 8TB, 12TB, 16TB and 20TB SATA drives. Every node must
be fully populated with 12 drives of the same size.
Possible capacity examples using IBM Storage Ready Nodes for Ceph are shown in
Figure 4-12.
Figure 4-12 Capacity examples using IBM Storage Ready Nodes for Ceph
IBM Storage Ready Nodes for Ceph are well suited for throughput optimized workloads, such
as backup storage.
Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes can be
seen in Figure 4-13.
Figure 4-13 Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes
Once the servers are correctly sized, we can estimate the actual cluster
performance.
4.10.1 IOPS-optimized
IOPS-optimized Ceph clusters use SSD or NVMe drives for the data. The main objective of the
cluster is to provide a large number of OPS for RBD block devices or CephFS. The workload is
typically a database application requiring high IOPS with a small block size.
Example server models and estimated IOPS per host using NVMe drives are shown in
Figure 4-14.
Figure 4-14 Example server models and estimated IOPS per host using NVMe drives
It is a best practice to have at least 10 Gbps of network bandwidth for every 12 HDDs in an OSD
node, for both the cluster network and the client network. By doing this, we make sure that the
network does not become a bottleneck.
4.10.2 Throughput-optimized
A throughput-optimized Ceph cluster can be all flash or hybrid, using both SSDs and HDDs.
The main objective of the cluster is to achieve a required throughput performance. The workload
typically consists of large files, and the most common use case is backup storage where Ceph
capacity is used via RADOS Gateway object access.
Example server models and estimated MB/s per OSD drive using HDDs as data drives are
shown in Figure 4-15.
Figure 4-15 Example server models and estimated MB/s per OSD drive using HDDs as data drives
4.10.3 Capacity-optimized
Capacity-optimized Ceph clusters use large-capacity HDDs and are often designed to be
narrower (fewer servers) but deeper (more HDDs per server) than throughput-optimized
clusters. The primary objective is to achieve the lowest cost per GB.
Capacity-optimized Ceph clusters are typically used for archive storage of large files, with
RADOS Gateway object access being the most common use case. While it is possible to
co-locate metadata on the same HDDs as data, it is still recommended to separate metadata
and place it on SSDs, especially when using erasure coding protection for the data. This can
provide a significant performance benefit.
Many of the typical server models connect to an external JBOD (Just a Bunch of Disks) drive
enclosure where data disks reside. External JBODs can range from 12-drive to more than a
hundred drive models.
Example server models for capacity-optimized clusters are shown in Figure 4-16.
Figure 4-16 Example server models to build a capacity optimized Ceph cluster
Note that clusters that are too narrow cannot be used with the largest drives, because the
calculated recovery time would be greater than 8 hours. For a cold archive type of use case, it is
possible to have less than one CPU core per OSD drive.
The IOPS requirement is quite high for a small capacity requirement. Therefore, we should
plan to use NVMe drives which provide highest IOPS per drive. Refer to the performance
chart in Figure 4-14 on page 95.
Performance planning
Figure 4-14 on page 95 shows that a single server with six NVMe drives can do 50 K write
IOPS and 200 K read IOPS with 4 KB block size.
We should plan for 10 servers to have a little headroom and to be prepared for a server
failure.
Capacity planning
The following are considerations for capacity planning:
It is recommended to use 3-way replication with block data.
Total raw storage requirement is 3x 50 TB = 150 TB.
Our performance calculation is based on having six NVMe drives per server.
10 servers in the cluster means 60 OSD drives.
Capacity requirement per NVMe drive is 150 TB (total raw capacity) / 60 (OSD
drives) = 2.5 TB.
Closest larger capacity NVMe is 3.84 TB.
CPU requirement
Best practice is to have 4 to 10 CPU cores per NVMe drive. Our solution has 6 NVMe drives
per server, therefore CPU requirement for OSD daemons is 24 to 60 CPU cores.
RAM requirement
Best practice is to have 12 to 15 GB of RAM per NVMe drive. Our solution has 6 NVMe drives
per server, therefore RAM memory requirement for OSD daemons is 72 to 90 GB.
Networking requirement
We know that 6 NVMe drives in one host deliver about 200 K read IOPS (write IOPS is lower). The
block size is 4 KB. The bandwidth required per server is 200000 * 4 KB/s = 800 MB/s = 6.4 Gbps
for both the client and cluster networks.
Sizing example
Throughput optimized systems often combine lower cost drives for data and higher
performance drives for RadosGW metadata. IBM Storage Ready Nodes for Ceph offer this
combination of drives.
The Ceph cluster needs to be designed in a way that allows data to be rebuilt on the remaining
nodes if one of the servers fails. This needs to be considered as extra capacity on top of the
requested 1 PB.
Also, a Ceph cluster goes into read-only mode once it reaches 95% capacity utilization.
Ceph S3 backup targets typically use erasure coding to protect data. For a 1 PB usable
capacity, EC 4+2 is a better choice than EC 8+3 or EC 8+4.
Minimum number of nodes to support EC 4+2 is 7. We should investigate if a good match for
customer capacity requirement can be found with 7 nodes or if more are needed.
Largest available drive size in IBM Storage Ready Nodes for Ceph is 20 TB.
7 nodes * 12 drives per node * 20TB drives = 1680 TB raw.
This would seem like a low cost alternative for 1 PB usable. But can we use 20 TB drives in a
7-node cluster, as shown in Figure 4-19 on page 99? What does the recovery calculator
show?
Since the calculated recovery time for a host failure is longer than 8 hours, this cluster would not
be supported.
First, we increase the number of hosts in recovery calculator from 7 to 8. See Figure 4-20.
Now let us do the capacity math for an 8-node cluster with 20 TB drives:
1. 8 nodes each with 12pcs of 20 TB drives = 8*12*20 TB = 1920 TB raw capacity.
2. EC 4+2 has 4 data chunks and 2 EC chunks. Each chunk is the same size.
3. One chunk of 1920 TB is therefore 1920 TB / 6 = 320 TB.
4. There are 4 data chunks in EC 4+2 and therefore usable capacity is 4*320 TB = 1280 TB.
5. This would satisfy customer's 1 PB capacity requirement and cluster has enough capacity
to rebuild data if one node fails and still stay below 95% capacity utilization.
a. Each server has 12x 20TB = 240 TB raw capacity.
b. Total cluster raw capacity is 1920 TB.
c. Cluster raw capacity if one node breaks is 1920 TB - 240 TB = 1680 TB.
d. 1680 TB raw equals 1120 TB usable.
e. Customer has 1 PB (1000 TB) of data.
f. Capacity utilization if one host is broken is 1000 TB / 1120 TB = 0.89 = 89%.
6. The conclusion is that an 8-node cluster with 20 TB drives has enough capacity to stay below
95% capacity utilization even if one of the servers breaks and its data is rebuilt on the remaining
servers.
7. Now that the raw capacity in each OSD node is calculated, we add the recommended
SSD drives for metadata that equal 4% of the raw capacity on the host.
– 4% of 240 TB is 9.6 TB.
– Best practice is to have one metadata SSD for every four or six HDDs.
– Therefore, it would be recommended to add two or three 3.84 TB SSDs in each host.
– Two 3.84 TB SSDs (non-raided) amounts to 7.68 TB which is 3.2% of raw capacity in
the host.
– Three 3.84 TB SSDs (non-raided) amounts to 11.52 TB which is 4.8% of raw capacity
in the host.
8. Since the customer performance requirement (2 GB/s ingest) is not very high compared to
what an 8-node cluster can achieve (over 4 GB/s) we can settle for the lower SSD
metadata capacity.
CPU requirement
Best practice is to have 1 CPU core per HDD. Our solution has 12 HDDs per server, therefore
CPU requirement for OSD daemons is 12 CPU cores.
RAM requirement
Best practice is to have 10 GB of RAM per HDD for throughput optimized use case. Our
solution has 12 HDDs per server, therefore RAM memory requirement for OSD daemons is
120 GB.
Networking requirement
One HDD has 90MB/s read performance. Our solution has 12 HDDs per server, therefore
network bandwidth requirement per network is 12* 90MB/s = 1080MB/s = 8.64 Gbps.
An example use case scenario for a capacity optimized Ceph cluster is described below.
A possible solution
An object storage archive for cold data is a type of use case where we could design a cluster
that is narrow (few servers) and deep (large number of OSD drives per host). The end result
is a low-cost but not high-performing cluster. Very large servers benefit greatly from SSDs for
metadata, and it is recommended to include them even for a cold archive use case.
Since capacity requirement is quite high, we can consider EC 8+3 which has the lowest
overhead (raw capacity vs. usable capacity). An IBM Storage Ceph cluster using EC 8+3
requires a minimum of 12 nodes.
First, we want to check if a 12-node cluster with large drives and large JBODs can recover
from host failure in less than 8 hours.
We log in to the Recovery Calculator tool and enter the parameters of the cluster we want to
check:
12 OSD hosts
84pcs of 20 TB OSD drives per host
2x 25 GbE network ports
We can see from the Recovery Calculator (Figure 4-22) that the calculated recovery time for such
a cluster is less than 8 hours, meaning it is a supported design.
CPU requirement
For a cold archive use case we can use 0.5 CPU cores per HDD. Our solution has 84 HDDs
per server, therefore CPU requirement for OSD daemons is 42 CPU cores.
RAM requirement
Best practice is to have 5 GB of RAM per HDD for capacity optimized use case. Our solution
has 84 HDDs per server, therefore RAM memory requirement for OSD daemons is 420 GB.
Networking requirement
One HDD has 90MB/s read performance. Our solution has 84 HDDs per server, therefore
network bandwidth requirement per network is 84* 90MB/s = 7560MB/s ~60 Gbps.
The three technical skills needed to manage and monitor a Ceph cluster are:
Operating system: The ability to install, configure, and operate the operating system that
hosts the Ceph nodes (Red Hat Enterprise Linux).
Networking: The ability to configure TCP/IP networking in support of Ceph cluster
operations for both client access (public network) and Ceph node to node communications
(cluster network).
Storage: The ability to design and implement a Ceph storage design to provide the
performance, data availability, and data durability that aligns with the requirements of the
client applications and the value of the data.
Regardless of the number of individuals who perform the work, whether it be one person or
several, these administrators will need to perform their roles in managing a Ceph cluster
using the right tool at the right time. This chapter provides an overview of Ceph cluster
monitoring and the tools that are available to perform monitoring tasks.
The supporting hardware of a Ceph cluster is subject to failure over time. The Ceph
administrator is responsible for monitoring and, if necessary, troubleshooting the cluster to
keep it in a healthy state. The remainder of this chapter provides an overview of the Ceph
services, metrics, and tools that are in scope for end-to-end monitoring of cluster health and
performance.
Ceph Monitor
Ceph Monitors (MONs) are the daemons responsible for maintaining the cluster map, a
collection of five maps that provide comprehensive information about the cluster's state and
configuration. Ceph proactively handles each cluster event, updates the relevant map, and
replicates the updated map to all MON daemons. A typical Ceph cluster comprises three
MON instances, each running on a separate host. To ensure data integrity and consistency,
MONs adopt a consensus mechanism, requiring a voting majority of the configured monitors
to be available and agree on the map update before it is applied. This is why a Ceph cluster
must be configured with an odd number of monitors (for example, 3 or 5) to establish a
quorum and prevent potential conflicts.
Ceph Manager
The Ceph Manager (MGR) is a critical component of the Ceph cluster responsible for
collecting and aggregating cluster-wide statistics. The first MGR daemon that is started in a
cluster becomes the active MGR and all other MGRs are on standby. If the active MGR does
not send a beacon within the configured time interval, a standby MGR takes over. Client I/O
operations continue normally while MGR nodes are down, but queries for cluster statistics fail.
The best practice is to deploy at least two MGRs in the Ceph cluster to provide high
availability. Ceph MGRs are typically run on the same hosts as the MON daemons, but it is
not required.
The existence of the first Ceph Manager, and the first Ceph Monitor, defines the existence of
a Ceph cluster. This means that when cephadm bootstraps the first node, a Monitor and a
Manager are started on that node, and the Ceph cluster is considered operational. In the
Ceph single node experience, that first node will also have one or several OSD daemons
further extending the concept of an operational and fully functioning Ceph cluster capable of
storing data.
Ceph daemons
Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster itself
maintains a cluster log that records high-level events about the entire Ceph cluster. These
events are logged to disk on monitor servers (in the default location
/var/log/ceph/ceph.log), and they can be monitored via the command line.
Monitoring services
The Ceph cluster is a collection of daemons and services that perform their respective roles
in the operation of a Ceph cluster. The first step in Ceph monitoring or troubleshooting is to
inspect these Ceph components and observe whether they are up and running, whether they are
on the nodes they were designated to run on, and whether they are reporting a healthy state. For
instance, if the cluster design specifies two Ceph Object Gateway (RGW) instances for
handling S3 API client requests from distinct nodes, a single cephadm command or a glance at
the Ceph Dashboard will provide insights into the operating status of the RGW daemons and
their respective nodes.
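For example, the orchestrator can report the placement and state of every service and daemon:
ceph orch ls                     # one line per service with running versus expected daemon counts
ceph orch ps --daemon-type rgw   # per-daemon status, host placement, and version for the RGW daemons
ceph health detail               # details behind any health warning or error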
Monitoring resources
Ceph resources encompass the cluster entities and constructs that define its characteristics.
These resources include networking infrastructure, storage devices (for example, SSDs,
HDDs), storage pools and their capacity, and data protection mechanisms. As with monitoring
other resources, understanding the health of Ceph storage resources is crucial for effective
cluster management. At the core, administrators must ensure sufficient capacity with
adequate expansion capabilities to provide the appropriate data protection for the
applications they support.
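Capacity and data protection can be reviewed from the command line, for example:
ceph df                   # raw and per-pool capacity usage
ceph osd df tree          # per-OSD utilization and placement group counts, grouped by the CRUSH tree
ceph osd pool ls detail   # replica count or erasure-code profile of each pool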
Monitoring performance
Ensuring that both physical and software-defined resources operate within their defined
performance service levels is the responsibility of the Ceph administrator. System alerts and
notifications serve as valuable tools, alerting the administrator to anomalies without requiring
constant monitoring. Key resources and constructs for monitoring Ceph performance include
node utilization (CPU, memory, and disk), network utilization (interfaces, bandwidth, latency,
and routing), storage device performance, and daemon workload.
Ceph Dashboard
The Ceph Dashboard UI exposes an HTTP web browser interface accessible on port 8443.
The various Dashboard navigation menus provide real-time health and basic statistics, which
are context sensitive, for the cluster resources. For example, the Dashboard can display the
number of OSD read bytes, write bytes, read operations, and write operations. If you
bootstrap your cluster with the cephadm bootstrap command, then the Dashboard is enabled
by default.
The Dashboard plug-in provides context sensitive metrics for the following services:
Hosts
OSDs
Pools
Block devices
File systems
Object storage gateways
The Dashboard also provides a convenient access point to observe the state and the health
of resources and services in the Ceph cluster. The following Dashboard UI navigation
provides at-a-glance views into the health of the cluster.
Cluster → Services (View Ceph services and daemon node placement and instances.)
Cluster → Logs (View and search within the Cluster logs, Audit logs, Daemon logs.)
Cluster → Monitoring (Control for active alerts, alert history, and alert silences.)
Prometheus
The Prometheus plug-in to the Dashboard facilitates the collection and visualization of Ceph
performance metrics by enabling the export of performance counters directly from ceph-mgr.
Ceph-mgr gathers MMgrReport messages from all MgrClient processes, including monitors
(mons) and object storage devices (OSDs), containing performance counter schema and
data. These messages are stored in a circular buffer, maintaining a record of the last N
samples for analysis.
This plugin establishes an HTTP endpoint, akin to other Prometheus exporters, and retrieves
the most recent sample of each counter upon polling, or scraping in Prometheus parlance.
The HTTP path and query parameters are disregarded, and all available counters for all
reporting entities are returned in the Prometheus text exposition format (Refer to the
Prometheus documentation for further details.). By default the module will accept HTTP
requests on TCP port 9283 on all IPv4 and IPv6 addresses on the host.
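As an illustration, the module can be enabled (if the deployment has not already done so) and scraped manually; the hostname is an example:
ceph mgr module enable prometheus
curl http://ceph-mgr-node:9283/metrics | head   # counters in the Prometheus text exposition format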
The Prometheus metrics can be viewed in a Grafana from an HTTP web browser on TCP port
3000 (for example, https://ceph-mgr-node:3000).
HEALTH_OK
cluster:
id: 899f61d6-5ae5-11ee-a228-005056b286b1
health: HEALTH_OK
services:
mon: 3 daemons, quorum
techzone-ceph6-node1,techzone-ceph6-node2,techzone-ceph6-node3 (age 16h)
mgr: techzone-ceph6-node1.jqdquv(active, since 2w), standbys:
techzone-ceph6-node2.akqefd
mds: 1/1 daemons up
osd: 56 osds: 56 up (since 2w), 56 in (since 4w)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 10 pools, 273 pgs
objects: 283 objects, 1.7 MiB
usage: 8.4 GiB used, 823 GiB / 832 GiB avail
pgs: 273 active+clean
Figure 5-4 Ceph Dashboard plug-in for Grafana context sensitive performance charts
The Grafana stand-alone Dashboard can be accessed from a web browser that addresses
TCP port 3000 on the Ceph MGR node (for example, https://ceph-node1:3000). See
Figure 5-5 on page 114.
Check the version of the Ceph client on a Ceph client machine. See Example 5-6.
Check the software versions on each of the Ceph cluster nodes. See Example 5-7.
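A quick sketch of both checks (the output depends on your release):
ceph --version   # version of the locally installed Ceph client packages
ceph versions    # versions reported by each daemon type across the cluster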
Example 5-8 Check the current OSD capacity ratio for your cluster
[ceph: root@ceph-node1 /]# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Change one of the ratios currently set for your cluster. See Example 5-9.
Example 5-9 Change one of the ratios currently set for your cluster
[ceph: root@ceph-node1 /]# ceph osd set-full-ratio 0.9
osd set-full-ratio 0.9
Tip: When a Ceph OSD reaches the full ratio, it indicates that the device is full and
can no longer accommodate additional data. This condition can have several negative
consequences for the Ceph cluster, including:
Reduced performance: As an OSD fills up, its performance degrades. This can lead to
slower read and write speeds, increased latency, and longer response times for client
applications.
Increased risk of data loss: When an OSD is full, it becomes more susceptible to data
loss. If the OSD fails, it can cause data corruption or loss of access to stored data.
Cluster rebalancing challenges: When an OSD reaches the full ratio, it can make it
more difficult to rebalance the cluster. Rebalancing is the process of evenly distributing
data across all OSDs to optimize performance and improve fault tolerance.
Cluster outage: If a full OSD fails, it can cause the entire cluster to become
unavailable. This can lead to downtime for critical applications and data loss.
To avoid these consequences, it is important to monitor OSD usage and take proactive
measures to prevent OSDs from reaching the full ratio. This may involve adding new OSDs
to the cluster, increasing the capacity of existing OSDs, or deleting old or unused data.
5.3 Conclusion
The Ceph cluster maintains centralized cluster logging, capturing high-level events pertaining
to the entire cluster. These events are stored on disk on monitor servers and can be accessed
and monitored through various administrative tools.
The many optional Ceph Manager modules, such as Zabbix, Influx, Insights, Telegraf,
Alerts, Disk Prediction, and iostat, can help you better integrate the monitoring of your IBM Storage
Ceph cluster in your existing monitoring and alerting solution while refining the granularity of
your monitoring. Additionally, you can utilize the SNMP Gateway service to further enhance
monitoring capabilities.
Administrators can leverage the Ceph Dashboard for a straightforward and intuitive view into
the health of the Ceph cluster. The built-in Grafana dashboard enables them to examine
detailed information, context-sensitive performance counters, and performance data for
specific resources and services.
Administrators can further use cephadm, the stand-alone Grafana dashboard, and other
supporting 3rd party tools to visualize and record detailed metrics on cluster utilization and
performance.
In conclusion, the overall performance of a software-defined storage solution like IBM Storage
Ceph is heavily influenced by the network. Therefore, it is crucial to integrate the monitoring of
your IBM Storage Ceph cluster with the existing network monitoring infrastructure. This
includes tracking packet drops and other network errors alongside the health of network
interfaces. The SNMP subsystem included in Red Hat Enterprise Linux is a built-in tool that
can facilitate this comprehensive monitoring.
For more information, see:
https://www.ibm.com/docs/en/storage-ceph/6?topic=dashboard-monitoring-cluster
https://docs.ceph.com/en/latest/monitoring/
6.1.1 Prerequisites
This section discusses the IBM Storage Ceph prerequisites.
See the sizing section in Chapter 2 of this book for details about server sizing and
configuration.
Firewall requirements
Table 6-1 lists the various TCP ports IBM Storage Ceph uses.
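As a sketch only, assuming the default Monitor, daemon, and Dashboard ports (refer to Table 6-1 for the authoritative list), the ports could be opened with firewalld as follows:
# firewall-cmd --permanent --add-port=3300/tcp --add-port=6789/tcp   # Monitors (msgr2 and msgr1)
# firewall-cmd --permanent --add-port=6800-7300/tcp                  # OSD, MGR, and MDS daemons
# firewall-cmd --permanent --add-port=8443/tcp                       # Dashboard
# firewall-cmd --reload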
Miscellaneous requirements
If configuring SSL for your Dashboard access and your object storage endpoint (RADOS
Gateway), ensure you obtain the correct certificate files from your security team and deploy
them where needed.
Make sure your server network cards and configuration are adequate for the workload served
by your IBM Storage Ceph cluster:
Size your server's network cards appropriately for the throughput you need to deliver.
Evaluate the network bandwidth generated by the data protection, as it will need to be
carried by the network for all write operations.
Make sure your network configuration does not present any single point of failure.
If your cluster spans multiple subnets, make sure each server can communicate with the
other servers.
6.1.2 Deployment
IBM Storage Ceph documentation details how to use cephadm to deploy your Ceph cluster. A
cephadm-based deployment follows these steps:
For beginners:
• Bootstrap your Ceph cluster (create one initial Monitor and Manager).
• Add services to your cluster (OSDs, MDSs, RADOS Gateways and so forth).
For advanced users:
• Bootstrap your cluster with a complete service file to deploy everything.
Cephadm
Cephadm, the only supported deployment tool for IBM Storage Ceph, has been available since
the upstream Ceph Octopus release and has been the default deployment tool since Pacific.
Service files
An entire cluster can be deployed using a single service file. The service file uses the
following syntax. See Example 6-1.
You can assign labels to hosts using the labels: field of a host service file. See
Example 6-2.
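A minimal sketch of such a host entry, using a hypothetical hostname, address, and labels:
service_type: host
hostname: ceph-node2
addr: 10.0.0.11
labels:
  - mon
  - rgw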
You can assign a specific placement for any service using the placement: field in a service
file. You can refer to “Placement” on page 126 for details on placement.
The OSD service file offers many specific options to specify how the OSDs are to be deployed
on the nodes.
block_db_size to specify the RocksDB DB size on separate devices.
block_wal_size to specify the RocksDB WAL size on separate devices.
data_devices to specify which devices will receive the data.
db_devices to specify which devices will receive RocksDB DB portion.
wal_devices to specify which devices will receive the RocksDB WAL portion.
db_slots to specify how many RocksDB DB partitions per db_device.
wal_slots to specify how many RocksDB WAL partitions per wal_device.
data_directories to specify a list of device paths to be used.
filter_logic to specify OR or AND between filters. Default is AND.
The data_devices, db_devices and wal_devices parameters accept the following arguments:
all to specify all devices are to be consumed (true or false).
limit to specify how many OSDs are to be deployed per node.
rotational to specify the type of devices to select (0 or 1).
size to specify the size of the devices to select:
• xTB to select a specific device size.
• xTB:yTB to select devices between the two capacities.
• :xTB to select any device up to this size.
• xTB: to select any device at least this size.
path to specify the device path to use.
model to specify the disk model name.
vendor to specify the device vendor name.
encrypted to specify if the data is to be encrypted at rest (data_devices only).
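Putting several of these options together, a hedged sketch of an OSD specification might look like the following, where the service_id, label, and size filter are hypothetical: rotating drives of at least 2 TB receive the data, and non-rotating devices host the RocksDB DB portion.
service_type: osd
service_id: osd_hdd_with_ssd_db
placement:
  label: osd
spec:
  data_devices:
    rotational: 1
    size: '2TB:'
  db_devices:
    rotational: 0
  filter_logic: AND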
The RADOS Gateway service file accepts the following specific arguments:
networks to specify which CIDR the gateway will bind to.
spec:
• rgw_frontend_port to specify which TCP port the gateway will bind.
• rgw_realm to specify the realm for this gateway.
• rgw_zone to specify the zone for this gateway.
• ssl to specify if this gateway uses SSL (true or false).
• rgw_frontend_ssl_certificate to specify the certificate to use.
• rgw_frontend_ssl_key to specify the key to use.
• rgw_frontend_type to specify the frontend to use (default is beast).
placement.count_per_host to specify how many RADOS Gateways per node.
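A hedged sketch combining these arguments (the service name, realm, zone, network, and port values are hypothetical):
service_type: rgw
service_id: myrgw
placement:
  label: rgw
  count_per_host: 2
networks:
  - 10.0.0.0/24
spec:
  rgw_realm: myrealm
  rgw_zone: myzone
  rgw_frontend_port: 8080
  ssl: false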
Container parameters
You can customize the parameters used by the Ceph containers using a special section of
your service file known as extra_container_args. To add extra parameters, use the template
shown in Example 6-4 in the appropriate service files.
- "--cpus=2"
Placement
Placement can be a simple count to indicate the number of daemons to deploy. In such a
configuration, cephadm will choose where to deploy the daemons.
Placement can use explicit naming: --placement="host1 host2 …". In such a configuration,
the daemons will be deployed on the nodes listed.
Using a service file, you would encode the following for count. See Example 6-5.
Using a service file, you would encode the following for label. See Example 6-6.
Using a service file, you would encode the following for the host list. See Example 6-7.
Using a service file, you would encode the following for pattern. See Example 6-8.
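The following condensed sketch illustrates the four placement forms for a hypothetical Monitor service; it is a summary, not a reproduction of Examples 6-5 through 6-8:
service_type: mon
placement:
  count: 3
---
service_type: mon
placement:
  label: mon
---
service_type: mon
placement:
  hosts:
    - ceph-node1
    - ceph-node2
    - ceph-node3
---
service_type: mon
placement:
  host_pattern: "ceph-node*"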
You can pass an initial Ceph configuration file to the bootstrap command through the
--config {path_to_config_file} command line option.
You can override the SSH user that will be used by cephadm through the --ssh-user
{user_name} command line option.
You can pass a specific set of registry parameters through a valid registry JSON file via the
--registry-json {path_to_registry_json} command line option.
You can choose the Ceph container image you want to deploy via the --image
{registry}[:{port}]/{imagename}:{imagetag} command line option.
You can specify the network configuration to be used by the cluster. A Ceph cluster uses two
networks:
Public network:
• Used by clients, including RGWs, to connect to all Ceph daemons.
• Used by Monitors to converse with other daemons.
• Used by MGRs and MDSs to communicate with other daemons.
Cluster network:
• Used by OSDs to perform OSD operations such as replication and recovery.
The public network is derived from the --mon-ip parameter provided to the bootstrap
command. The cluster network can be provided during the bootstrap operation via the
--cluster-network parameter. If the --cluster-network parameter is not specified, it will
be set to the same value as the public network.
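Combining these options, a bootstrap invocation might look like the following sketch, in which the IP addresses, file paths, and SSH user name are hypothetical:
# cephadm --image {registry}[:{port}]/{imagename}:{imagetag} bootstrap \
    --mon-ip 10.0.0.10 \
    --cluster-network 192.168.0.0/24 \
    --config /root/initial-ceph.conf \
    --ssh-user cephadm \
    --registry-json /root/registry.json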
To open an interactive shell that provides the Ceph CLI tools, run the cephadm shell command:
$ cephadm shell
To connect to the graphical user interface of your initial cluster, look for the following lines in
the cephadm bootstrap output and point your HTTP browser to the URL being displayed. See
Example 6-9 on page 129.
This document is not designed to provide extensive details about the deployment and
configuration of your cluster. Refer to the IBM Storage Ceph documentation for more details.
IBM Storage Ceph Solutions Guide, REDP-5715 provides GUI-based deployment methods for
those who feel more comfortable using a graphical user interface.
Once the cluster is bootstrapped, you can deploy the appropriate services of your cluster. A
production cluster requires the following elements to be deployed in appropriate numbers for a
reliable cluster that does not present any single point of failure.
Adding nodes
Once the cluster is bootstrapped, the Ceph administrator must add all the nodes where the
services will be deployed. You must copy the cluster public SSH key to each node that will be
part of the cluster. Once the nodes are prepared, add them to the cluster. See Example 6-10.
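As a sketch (the hostname and IP address are hypothetical), adding a single node typically looks like this:
[ceph: root@ceph-node1 /]# ceph orch host add ceph-node2 10.0.0.11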
Tip: To add multiple hosts, use a service file describing all your nodes.
Removing nodes
If you need to remove a node that was added by mistake, use the following commands. If the
node does not have OSDs deployed, skip the second command of the example. See
Example 6-11.
Tip: You can also clean up the SSH keys copied to the host.
Assigning labels
You can assign labels to hosts after they were added to the cluster. See Example 6-12.
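As a sketch, with a hypothetical hostname and label:
[ceph: root@ceph-node1 /]# ceph orch host label add ceph-node2 rgw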
Adding services
Once the cluster is bootstrapped, the Ceph administrator must add the required services for
the cluster to become fully operational.
After bootstrapping your cluster, you only have a single Monitor, a single Manager, no OSDs,
no MDSs, no RGWs. The services will be deployed in the following order:
7. Deploy at least another 2 Monitors.
8. Deploy at least 2 more Managers.
9. Deploy the OSDs.
10.If needed, deploy your Ceph file system.
11.If needed, deploy your RADOS Gateways.
Monitors
To deploy a total of 3 Monitors, simply deploy 2 additional Monitors. See Example 6-13.
Tip: This can also be achieved using a service file via the command ceph orch apply -i
{path_to_mon_service_file}.
Tip: If you want your Monitor to bind to a specific IP address or subnet, use the
{hostname}:{ip_addr} or {hostname}:{cidr} syntax to specify the host.
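A sketch of such a deployment, using hypothetical hostnames, requests three Monitors on three named hosts:
[ceph: root@ceph-node1 /]# ceph orch apply mon --placement="3 ceph-node1 ceph-node2 ceph-node3"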
Managers
To deploy a total of 2 Managers, simply deploy 1 additional Manager. See Example 6-14.
You can list the devices available on all nodes using the following command after you have
added all the nodes to your cluster. See Example 6-15.
cephadm will scan all nodes for available devices (free of partitions, free of formatting, free of
LVM configuration). See Example 6-16.
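A sketch of these two operations: listing candidate devices, then letting cephadm consume every available device.
[ceph: root@ceph-node1 /]# ceph orch device ls
[ceph: root@ceph-node1 /]# ceph orch apply osd --all-available-devices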
Tip: For a production cluster with a specific configuration and strict deployment scenarios, it
is recommended to use a service file with the ceph orch apply -i {path_to_osd_service_file}
command.
Tip: To visualize which devices will be consumed by the command above, use the ceph orch
apply osd --all-available-devices --dry-run command.
An OSD service file, using the information provided in 6.1.2, “Deployment” on page 123, can
be tailored to each node's needs. As such, an OSD service file can contain multiple
specifications. See Example 6-17.
You can also add OSD targeting a specific device. See Example 6-18.
Tip: You can pass many parameters through the command line, including data_devices,
db_devices, wal_devices. For example,
{hostname}:data_devices={dev1},{dev2},db_devices={db1}.
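A sketch of both forms, with hypothetical hostnames and device paths:
[ceph: root@ceph-node1 /]# ceph orch daemon add osd ceph-node2:/dev/sdb
[ceph: root@ceph-node1 /]# ceph orch daemon add osd ceph-node2:data_devices=/dev/sdb,/dev/sdc,db_devices=/dev/nvme0n1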
In some cases, it may be necessary to initialize or clean a local device so that it can be
consumed by the OSD. See Example 6-19.
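A sketch of such a cleanup, with a hypothetical hostname and device path; the --force flag wipes the device even if it contains data:
[ceph: root@ceph-node1 /]# ceph orch device zap ceph-node2 /dev/sdb --force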
Just like other services, arguments can be passed on the command line or be provided in a
service file with a detailed configuration.
Tip: You can pass many parameters through the command line, including
--realm={realm_name}, --zone={zone_name} --placement={placement_specs},
--rgw_frontend_port={port} or count-per-host:{n}.
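A sketch of the command-line form, with hypothetical service, realm, zone, and host names:
[ceph: root@ceph-node1 /]# ceph orch apply rgw myrgw --realm=myrealm --zone=myzone --placement="2 ceph-node1 ceph-node2"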
Ingress service
If your cluster has a RADOS Gateway service deployed, you will likely require a load balancer
in front of the gateways, both to distribute the traffic between multiple RADOS Gateways and
to provide a highly available object service with no single point of failure.
These parameters can be used in an ingress service file. See Example 6-21.
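A hedged sketch of an ingress specification (the backend service name, virtual IP, and ports are hypothetical and should be adapted to your RADOS Gateway deployment):
service_type: ingress
service_id: rgw.myrgw
placement:
  count: 2
  label: rgw
spec:
  backend_service: rgw.myrgw
  virtual_ip: 10.0.0.100/24
  frontend_port: 443
  monitor_port: 1967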
Tip: For a production cluster with a specific configuration and strict deployment scenarios, it
is recommended to run the ceph orch apply -i {path_to_mds_service_file} command. Once
the MDSs are active, you must manually create the file system with the following command:
ceph fs new {fs_name} {meta_pool} {data_pool}.
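A sketch of that flow, with hypothetical pool, file system, and host names:
[ceph: root@ceph-node1 /]# ceph osd pool create myfs_metadata
[ceph: root@ceph-node1 /]# ceph osd pool create myfs_data
[ceph: root@ceph-node1 /]# ceph orch apply mds myfs --placement="2 ceph-node1 ceph-node2"
[ceph: root@ceph-node1 /]# ceph fs new myfs myfs_metadata myfs_data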
Once your cluster is fully deployed, the best practice is to export the cluster configuration and
back up the configuration. Using git to manage the cluster configuration is recommended.
See Example 6-22.
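A sketch of such an export: the ceph orch ls --export command prints the currently applied service specifications in YAML, similar to the partial fragment that follows, and its output can be redirected to a file kept under version control.
[ceph: root@ceph-node1 /]# ceph orch ls --export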
spec:
  data_devices:
    all: true
  filter_logic: AND
  objectstore: bluestore
---
service_type: prometheus
service_name: prometheus
placement:
  count: 1
---
service_type: rgw
service_id: myrgw
service_name: rgw.myrgw
placement:
  count: 1
  hosts:
  - ceph01
Tip: Simply redirect the command output to a file. This is an easy way to create a full
cluster deployment service file once you have learned cephadm.
In a production cluster, it is best for each OSD to serve between 100 and 200 placement
groups.
Best practice is never to have more than 300 placement groups managed by an OSD.
The Placement Group Auto-Scaler module automatically adjusts the number of PGs assigned
to each pool based on the target size ratio assigned to each pool.
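A sketch of how to inspect and influence the auto-scaler (the pool name and ratio are hypothetical):
[ceph: root@ceph-node1 /]# ceph osd pool autoscale-status
[ceph: root@ceph-node1 /]# ceph osd pool set mypool target_size_ratio 0.2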
This chapter provides insights about each topic and highlights some of the best practices and
necessary steps when applicable.
Nonetheless, the failure of a node or the failure of multiple devices can endanger the
resiliency of the data and expose it to a double failure scenario that can alter the availability of
the data.
As such, Ceph clusters must be monitored, and action must be taken swiftly to replace failed
nodes and failed devices before exposure to a double-failure scenario increases.
As the data is being rebalanced or recovered in a Ceph cluster, it may impact the
performance of the client traffic entering the cluster. As Ceph matured, the default recovery
and backfill parameters have been adapted to minimize this negative impact.
For example, to view the daemon mon.foo logs for a cluster with ID
5c5a50ae-272a-455d-99e9-32c6a013e694, the command would be similar to the one shown
in Example 6-24.
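As a hedged sketch only (not a reproduction of Example 6-24), and assuming the daemon runs under the systemd units that cephadm creates on the host, such a command could be:
# journalctl -u [email protected]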
Tip: You can set a single daemon to log to a file by using osd.0 instead of global.
By default, cephadm sets up log rotation on each host. You can configure the logging
retention schedule by modifying /etc/logrotate.d/ceph.<CLUSTER FSID>.
Because a few Ceph daemons (notably, the monitors and Prometheus) store a large amount
of data in /var/lib/ceph, we recommend moving this directory to its own disk, partition, or
logical volume so that it does not fill up the root file system.
For more information, refer to Ceph daemon logs section in IBM Documentation.
Cephadm logs
Cephadm writes logs to the cephadm cluster log channel. You can monitor Ceph's activity in
real time by reading the logs as they fill up. Run the command in Example 6-27 to follow the
logs in real time.
By default, this command shows info-level events and above. Run the commands in
Example 6-28 to also see debug-level messages.
You can see recent events by running the command in Example 6-29.
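For reference, a sketch of the commands these operations correspond to (your output will differ):
# ceph -W cephadm                                              # follow info-level cephadm events
# ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # enable debug-level cluster logging
# ceph -W cephadm --watch-debug                                # follow debug-level events as well
# ceph log last cephadm                                        # print the most recent cephadm events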
If your Ceph cluster has been configured to log events to files, a cephadm log file called
ceph.cephadm.log will exist on all monitor hosts (see the Ceph daemon control section for a
more complete explanation of this).
For more information, refer to Monitor cephadm log messages section in IBM Documentation.
A debug logging setting can take a single value for the log and memory levels, which sets
them as the same value. For example, if you specify debug ms = 5, Ceph will treat it as a log
and memory level of 5. You may also specify them separately. The first setting is the log level,
and the second is the memory level. You must separate them with a forward slash (/). For
example, if you want to set the ms subsystem's debug logging level to 1 and its memory level
to 5, you specify it as debug ms = 1/5. You can increase log verbosity at runtime in several
ways.
Example 6-31 shows how to configure the logging level for a specific subsystem.
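A minimal sketch of such a runtime change (not Example 6-31 itself), assuming osd.0 is the target daemon:
# ceph tell osd.0 config set debug_ms 1/5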
Example 6-32 shows how to use the admin socket from inside the affected service container.
Example 6-32 Configure the logging level using the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set
debug_rgw 20
Example 6-33 shows how to make the verbose/debug change permanent, so that it persists
after a restart.
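One way to achieve this, sketched here with the RGW client mask as an assumption, is to store the setting in the centralized configuration database so that it survives daemon restarts:
# ceph config set client.rgw debug_rgw 20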
We can use the cephadm logs to get logs from the containers running ceph services. See
Example 6-34.
Example 6-34 Check the startup logs from a Ceph service container
# cephadm ls | grep mgr
"name": "mgr.ceph-mon01.ndicbs",
"systemd_unit":
"[email protected]",
"service_name": "mgr",
# cephadm logs --name mgr.ceph-mon01.ndicbs
Inferring fsid 3c6182ba-9b1d-11ed-87b3-2cc260754989
-- Logs begin at Tue 2023-01-24 04:05:12 EST, end at Tue 2023-01-24 05:34:07 EST.
--
Jan 24 04:05:21 ceph-mon01 systemd[1]: Starting Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989...
Jan 24 04:05:25 ceph-mon01 podman[1637]:
Jan 24 04:05:26 ceph-mon01 bash[1637]:
36f6ae35866d0001688643b6332ba0c986645c7fba90d60062e6a4abcd6c8123
Jan 24 04:05:26 ceph-mon01 systemd[1]: Started Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989.
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>
Major upgrade releases include more disruptive changes, such as Dashboard or mgr API
deprecation, than minor releases. Major IBM Storage Ceph releases typically use a new
major upstream version, such as moving from Pacific to Quincy or from Quincy to Reef. This
would be represented as moving from 5.X to 6.X or from 6.X to 7.X on the IBM Storage Ceph
side. Major upgrades may also require upgrading the operating system. Here is a link with the
matrix of OS-supported versions depending on the IBM Storage Ceph release.
Minor upgrades generally use the same upstream release as the major release they belong to
and try to avoid disruptive changes. In IBM Storage Ceph, minor releases would be IBM
Storage Ceph 7.1, 7.2, and so forth.
Within a minor release, there are periodic maintenance releases. Maintenance releases
bring security and bug fixes; new features are very rarely introduced in maintenance
releases. Maintenance releases are represented as 6.1z1, 6.1z2, and so forth.
The first version of IBM Storage Ceph was 5.X, so cephadm is the only orchestrator tool you
would need to work with. If you are upgrading from Red Hat Ceph Storage (RHCS) to
IBM Storage Ceph and the RHCS cluster is at version 3.X or 4.X, the upgrade from 3.X to 4.X
to 5.X is done with ceph-ansible; before cephadm, the ceph-ansible repository was used for
upgrading minor and major versions of Red Hat Ceph Storage.
Before starting the actual Ceph upgrade, there are some prerequisites that we need to
double-check:
1. Read the release notes of the version you are upgrading to; you may be affected by a
known issue, like, for example, an API deprecation that you may be using.
2. Open a proactive case with the IBM support team to inform them of the planned upgrade.
3. Upgrade to the latest maintenance release of the latest minor version before doing a major
upgrade; for example, before upgrading from 5.3 to 6.1, ensure you are running the latest
5.3 maintenance release, in this case, 5.3z5.
4. Label a second node as the admin in the cluster to manage the cluster when the admin
node is down during the OS upgrade.
5. Check your Podman version. Podman and IBM Storage Ceph have different end-of-life
strategies, which might make it challenging to find compatible versions. There is a matrix of
supported Podman versions for each release; the following link provides an example.
6. Check if the current RHEL version is supported by the IBM Storage Ceph version you are
upgrading to. If the current version of RHEL is not supported on the new IBM Storage
Ceph release, you need to upgrade the OS before you upgrade to the new IBM Storage
Ceph version. There are two recommended ways to upgrade the OS:
– RHEL Leapp upgrade. You can find the instructions here. The Leapp upgrade is an
in-place upgrade of the OS. All your OSD data will be available once the upgrade of the
OS has finished, so the recovery time for the OSDs on these nodes will be shorter than
a full reinstall.
– RHEL full OS reinstall. With this approach, we do a clean OS install with the new
version. The data on the OSDs will be lost, so when we add the updated OS node to
the Ceph cluster, it will need to recover all the data in the OSDs fully.
Both approaches are valid and have pros and cons depending on the use case and
infrastructure resources.
Upgrades can also be done in disconnected environments with limited internet connectivity.
Here is the documentation link with the step-by-step guide for disconnected upgrades.
There are two main approaches. The most common one is setting the following parameters to
avoid any data movement, even if it takes longer than 10 minutes for an OS upgrade of an
OSD node.
Example 6-35 Prevent OSDs from getting marked out during an upgrade and avoid unnecessary load
on the cluster
# ceph osd set noout
# ceph osd set noscrub
# ceph osd set nodeep-scrub
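Remember to clear these flags once the maintenance is finished; otherwise scrubbing remains disabled and failed OSDs are never marked out:
# ceph osd unset noout
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub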
The conservative approach is to recover all OSDs when upgrading the OS, especially if doing
a clean OS installation to upgrade RHEL. This can take hours, as the running Ceph cluster
will be in a degraded state with only two valid copies of the data (assuming replica 3 is used).
To avoid this scenario, you can take a node out of the cluster by enabling maintenance mode
in Cephadm. Once the monitor OSD timeout has expired (default: 10 minutes), recovery will
start. Once recovery is finished, the cluster will be in a fully protected state with 3 replicas.
Once the Red Hat OS has been updated, zap the OSD drives and add the node back to the
cluster. This will trigger a data rebalance, which will finish once the cluster is HEALTH_OK.
You can then upgrade the next node and repeat the process.
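A sketch of placing a node into and out of cephadm maintenance mode (the hostname is hypothetical):
# ceph orch host maintenance enter ceph-node2
# ceph orch host maintenance exit ceph-node2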
Example 6-36 shows how to determine whether an upgrade is in process and the version to
which the cluster is upgrading.
If you want to get detailed information on all the steps the upgrade is taking, you can query
the cephadm logs. See Example 6-37.
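As a sketch (not a reproduction of Examples 6-36 and 6-37), the upgrade status and the detailed cephadm log can be queried as follows:
# ceph orch upgrade status
# ceph -W cephadm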
Cluster expansion
One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at
run time. It allows you to resize the storage cluster capacity without taking it down, so you can
add, remove, or replace hardware during regular business hours. However, adding and
removing OSD nodes can significantly impact performance.
Before adding Ceph OSD nodes, consider the effects on storage cluster performance. Adding
or removing Ceph OSD nodes causes backfilling as the storage cluster rebalances,
regardless of whether you are expanding or reducing capacity. During this rebalancing
period, Ceph uses additional resources, which can impact performance.
Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or
removing a node typically affects the performance of pools that use the affected CRUSH ruleset.
3. Ensure the correct RHEL repositories are enabled. See Example 6-40 on page 143.
5. From the IBM Storage Ceph Cluster bootstrap node, enter the cephadm shell. See
Example 6-42.
6. Extract the cluster public ssh key to a folder. See Example 6-43.
7. Copy the Ceph cluster's public SSH key to the root user's authorized_keys file on the new
host. See Example 6-44.
Example 6-44 Copy ssh key to new node you are adding
$ ssh-copy-id -f -i ~/PATH root@HOSTNAME
8. Add the new host to the Ansible inventory file. The default location for the file is
/usr/share/cephadm-ansible/hosts. Example 6-45 shows the structure of a typical
inventory file.
[admin]
host00
9. Run the preflight playbook with the --limit option. See Example 6-46.
Example 6-46 Run the preflight playbook limited to the new node
$ ansible-playbook -i INVENTORY_FILE cephadm-preflight.yml --extra-vars
"ceph_origin=ibm" --limit NEWHOST
10.From the Ceph administration node, log into the Cephadm shell. See Example 6-47.
11.Use the cephadm orchestrator to add hosts to the storage cluster. See Example 6-48.
Once the node is added to the cluster, you should see the node listed in the output of the
ceph orch host ls or ceph orch device ls commands.
If the disks on the newly added host pass the filter you configured in your cephadm OSD
service spec, new OSDs will be created using the drives on the new host. The cluster will
then rebalance the data, moving PGs from other OSDs to the new OSDs to distribute the data
evenly across the cluster.
All the recommendations we suggested in “Cluster expansion” on page 142 also apply to this
section, so read them carefully before starting a node replacement.
a. Disable backfilling.
b. Create a backup of the Ceph configuration.
c. Replace the node and add the Ceph OSD disks from the failed node.
d. Configure the disks as JBOD.
e. Install the operating system.
f. Restore the Ceph configuration.
Add the new node to the storage cluster, and the Ceph daemons are placed
automatically on the respective node.
a. Enable backfilling.
b. Replace the node, reinstall the operating system, and use all new Ceph OSD disks.
c. Disable backfilling.
d. Remove all OSDs on the failed node from the storage cluster.
e. Create a backup of the Ceph configuration.
f. Replace the node and add the Ceph OSD disks from the failed node.
g. Configure the disks as JBOD.
h. Install the operating system.
i. Add the new node to the storage cluster, and the Ceph daemons are placed
automatically on the respective node.
j. Enable backfilling.
In the official documentation, you can find detailed steps that take you through a disk
replacement. Here is the link.
Disk replacement happens at a smaller scale because it involves a single OSD rather than a
full node, but all the recommendations we made in “Cluster expansion” on page 142 also
apply to disk replacement, so read them carefully before starting a disk replacement.
The IBM Storage Ceph Dashboard Observability Stack provides management and monitoring
capabilities, allowing you to administer and configure the cluster and visualize related
information and performance statistics. The Dashboard uses a web server hosted by the
ceph-mgr daemon.
– Auditing: The Dashboard backend can be configured to log all PUT, POST and
DELETE API requests in the Ceph manager log.
Management features
– View cluster hierarchy: You can view the CRUSH map, for example, to determine which
host a specific OSD ID is running on. This is helpful if there is an issue with an OSD.
– Configure manager modules: You can view and change parameters for Ceph manager
modules.
– Embedded Grafana Dashboards: Ceph Dashboard Grafana Dashboards might be
embedded in external applications and web pages to surface information and
performance metrics gathered by the Prometheus module.
– View and filter logs: You can view event and audit cluster logs and filter them based on
priority, keyword, date, or time range.
– Toggle Dashboard components: You can enable and disable Dashboard components
so only the features you need are available.
– Manage OSD settings: You can set cluster-wide OSD flags using the Dashboard. You
can also mark OSDs up, down, or out, purge and reweight OSDs, perform scrub
operations, modify various scrub-related configuration options, and select profiles to
adjust the level of backfilling activity. You can set and change the device class of an OSD,
and display and sort OSDs by device class. You can deploy OSDs on new drives and hosts.
– Viewing Alerts: The alerts page allows you to see details of current alerts.
– Quality of Service for images: You can set performance limits on images, for example
limiting IOPS or read BPS burst rates.
Monitoring features
– Username and password protection: You can access the Dashboard only by providing
a configurable user name and password.
– Overall cluster health: Displays performance and capacity metrics. This also displays
the overall cluster status, storage utilization, for example, number of objects, raw
capacity, usage per pool, a list of pools and their status and usage statistics.
– Hosts: Provides a list of all hosts associated with the cluster along with the running
services and the installed Ceph version.
– Performance counters: Displays detailed statistics for each running service.
– Monitors: Lists all Monitors, their quorum status and open sessions.
– Configuration editor: Displays all the available configuration options, their descriptions,
types, default, and currently set values. These values are editable.
– Cluster logs: Displays and filters the latest updates to the cluster's event and audit log
files by priority, date, or keyword.
– Device management: Lists all hosts known by the Orchestrator. Lists all drives
attached to a host and their properties. Displays drive health predictions, SMART data,
and blink enclosure LEDs.
– View storage cluster capacity: You can view raw storage capacity of the IBM Storage
Ceph cluster in the Capacity panels of the Ceph Dashboard.
– Pools: Lists and manages all Ceph pools and their details. For example: applications,
placement groups, replication size, EC profile, quotas, CRUSH ruleset, etc.
– OSDs: Lists and manages all OSDs, their status and usage statistics as well as
detailed information like attributes, like OSD map, metadata, and performance
counters for read and write operations. Lists all drives associated with an OSD.
– Images: Lists all RBD images and their properties such as size, objects, and features.
Create, copy, modify and delete RBD images. Create, delete, and rollback snapshots
of selected images, protect or unprotect these snapshots against modification. Copy or
clone snapshots, flatten cloned images.
– RBD Mirroring: Enables and configures RBD mirroring to a remote Ceph server. Lists
all active sync daemons and their status, pools and RBD images including their
synchronization state.
– Ceph File Systems: Lists all active Ceph file system (CephFS) clients and associated
pools, including their usage statistics. Evict active CephFS clients, manage CephFS
quotas and snapshots, and browse a CephFS directory structure.
– Object Gateway (RGW): Lists all active Object Gateways and their performance
counters. Displays and manages, including add, edit, delete, Object Gateway users
and their details, for example quotas, as well as the users' buckets and their details, for
example, owner or quotas.
Security features
– SSL and TLS support: All HTTP communication between the web browser and the
Dashboard is secured via SSL. A self-signed certificate can be created with a built-in
command, but it is also possible to import custom certificates signed and issued by a
Certificate Authority (CA).
Dashboard access
You can access the Dashboard with the credentials provided on bootstrapping the cluster.
Cephadm installs the Dashboard by default. Example 6-50 is an example of the Dashboard
URL:
Example 6-50 Dashboard credentials example during bootstrap of the Ceph cluster
URL: https://ceph-mon01:8443/
User: admin
Password: XXXXXXXXX
To find the Ceph Dashboard credentials, search the /var/log/ceph/cephadm.log file for the
string "Ceph Dashboard is now available at".
Unless the --dashboard-password-noupdate option was used during bootstrap, you must
change the password the first time you log in to the Dashboard with the credentials provided
at bootstrap.
IBM Storage Ceph has some built-in network warnings at the CLI level and also in the
observability stack through Alert Manager.
Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
availability and network performance. If a single delayed response is detected, this might
indicate nothing more than a busy OSD. But if multiple delays between distinct pairs of OSDs
are detected, this might indicate a failed network switch, a NIC failure, or a layer 1 failure.
In the output of the ceph health detail command, you can see which OSDs are experiencing
delays and how long the delays are. The output of ceph health detail is limited to ten lines.
Example 6-51 shows an example of the output you can expect from the ceph health detail
command.
To see more detail and to collect a complete dump of network performance information, use
the dump_osd_network command.
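As a sketch, assuming the command is issued on the node hosting the active Manager (the Manager name is a placeholder):
# ceph daemon mgr.{active_mgr_name} dump_osd_network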
The Alert Manager that is part of the out-of-the-box observability stack includes several
preconfigured alerts related to networking issues. Example 6-52 shows one of them.
    ) >= 0.0001 or (
      rate(node_network_receive_errs_total{device!="lo"}[1m]) +
      rate(node_network_transmit_errs_total{device!="lo"}[1m])
    ) >= 10
  labels:
    oid: "1.3.6.1.4.1.50495.1.2.1.8.3"
    severity: "warning"
    type: "ceph_default"
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this paper.
IBM Redbooks
The following IBM Redbooks publications provide additional information about the topic in this
document. Note that some publications referenced in this list might be available in softcopy
only.
IBM Storage Ceph Solutions Guide, REDP-5715
You can search for, view, download or order these documents and other Redbooks,
Redpapers, Web Docs, draft and additional materials, at the following website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
IBM Storage Ceph Documentation:
https://www.ibm.com/docs/en/storage-ceph/6?topic=dashboard-monitoring-cluster
Community Ceph Documentation
https://docs.ceph.com/en/latest/monitoring/
AWS CLI documentation:
https://docs.aws.amazon.com/cli/index.html
IP load balancer documentation:
https://github.ibm.com/dparkes/ceph-top-gun-enablement/blob/main/training/modules/ROOT/pages/radosgw_ha.adoc
REDP-5721-00
Printed in U.S.A.
ibm.com/redbooks