
Front cover


IBM Storage Ceph

Concepts and Architecture Guide
Christopher Maestas
Daniel Parkes
Franck Malterre
Jean-Charles (JC) Lopez
John Shubeck
Jussi Lehtinen
Suha Ondokuzmayis
Vasfi Gucer

Redpaper

IBM Redbooks

IBM Storage Ceph Concepts and Architecture Guide

November 2023

REDP-5721-00

Note: Before using this information and the product it supports, read the information in “Notices” on
page vii.

First Edition (November 2023)

This edition applies to IBM Storage Ceph Version 6.

This document was created or updated on November 28, 2023.

© Copyright International Business Machines Corporation 2023. All rights reserved.


Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.

Contents

Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

Chapter 1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Ceph and storage challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Data keeps growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Technology changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Data organization, access, and costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Data added value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Ceph approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.6 Ceph storage types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 What is new with IBM Storage Ceph V 7.0? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 WORM compliance certification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Multi-site replication with bucket granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Object archive zone (Tech preview) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.4 RGW policy-based data archive and migration capability. . . . . . . . . . . . . . . . . . . . 9
1.3.5 IBM Storage Ceph Object S3 Lifecycle Management. . . . . . . . . . . . . . . . . . . . . . 10
1.3.6 Dashboard UI enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.7 NFS support for CephFS for non-native Ceph clients. . . . . . . . . . . . . . . . . . . . . . 12
1.3.8 NVMe over Fabrics (Tech preview). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.9 Object storage for ML/analytics: S3 Select . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.10 RGW multi-site performance improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.11 Erasure code EC2+2 with 4 nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Chapter 2. IBM Storage Ceph architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 RADOS components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Ceph cluster partitioning and data distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Controlled Replication Under Scalable Hashing (CRUSH) . . . . . . . . . . . . . . . . . . 24
2.1.4 OSD failure and recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.1.5 Cephx authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Access methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Chapter 3. IBM Storage Ceph main features and capabilities . . . . . . . . . . . . . . . . . . . 33


3.1 IBM Storage Ceph access methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 IBM Storage Ceph object storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 IBM Storage Ceph object storage overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Use cases for IBM Storage Ceph object storage . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 RADOS Gateway architecture within the Ceph cluster . . . . . . . . . . . . . . . . . . . . . 37
3.2.4 Deployment options for IBM Storage Ceph object storage . . . . . . . . . . . . . . . . . . 39
3.2.5 IBM Storage Ceph object storage topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2.6 IBM Storage Ceph object storage key features . . . . . . . . . . . . . . . . . . . . . . . . . . 42


3.2.7 Ceph Object Gateway deployment step-by-step. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.8 Ceph Object Gateway client properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.9 Configuring the AWS CLI client. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.10 The S3 API interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.12 References on the World Wide Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 IBM Storage Ceph block storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.1 Managing RBD images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.2 Snapshots and clones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.3 RBD write modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.4 RBD mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.5 Other RBD features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 IBM Storage Ceph file storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.1 Metadata Server (MDS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.2 File System configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4.3 File System layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4.4 Data layout in action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.5 File System clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.6 File System NFS Gateway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.7 Ceph File System permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.8 IBM Storage Ceph File System snapshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.9 IBM Storage Ceph File System asynchronous replication . . . . . . . . . . . . . . . . . . 80

Chapter 4. Sizing IBM Storage Ceph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


4.1 Workload considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.1 IOPS-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.2 Throughput-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.3 Capacity-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Performance domains and storage pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3 Network considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4 Collocation versus non-collocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Minimum hardware requirements for daemons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6 OSD node CPU and RAM requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.7 Scaling RGWs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.8 Recovery calculator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.9 IBM Storage Ready Nodes for Ceph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.10 Performance guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.10.1 IOPS-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.10.2 Throughput-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.10.3 Capacity-optimized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.11 Sizing examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.1 IOPS-optimized scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.2 Throughput-optimized scenario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.11.3 Capacity-optimized scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Chapter 5. Monitoring your IBM Storage Ceph environment . . . . . . . . . . . . . . . . . . . 105


5.1 Monitoring overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1.1 IBM Storage Ceph monitoring components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1.2 IBM Storage Ceph monitoring categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1.3 IBM Storage Ceph monitoring tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.2 IBM Storage Ceph monitoring examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.1 Ceph Dashboard health . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


5.2.2 Ceph command line health check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


5.2.3 Ceph Dashboard alerts and logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.4 Ceph Dashboard plug-in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2.5 Grafana stand-alone dashboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.6 Ceph command line deeper dive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2.7 Ceph monitoring Day 2 operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4 References on the World Wide Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Chapter 6. Day 1 and Day 2 operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


6.1 Day 1 operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.2 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1.3 Initial cluster configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2 Day 2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.2.1 General considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2.2 Maintenance mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2.3 Configure logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.4 Cluster upgrade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2.5 Node replacement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2.6 Disk replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2.7 Cluster monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2.8 Network monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151


IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151


Notices

This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.


Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at https://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
IBM®, IBM Cloud®, IBM Spectrum®, Redbooks®, and Redbooks (logo)®

The following terms are trademarks of other companies:

ITIL is a Registered Trade Mark of AXELOS Limited.

The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.

Ansible, Ceph, OpenShift, Red Hat, are trademarks or registered trademarks of Red Hat, Inc. or its
subsidiaries in the United States and other countries.

VMware, and the VMware logo are registered trademarks or trademarks of VMware, Inc. or its subsidiaries in
the United States and/or other jurisdictions.

Other company, product, or service names may be trademarks or service marks of others.


Preface

IBM® Storage Ceph is an IBM-supported distribution of the open-source Ceph platform that
provides massively scalable object, block, and file storage in a single system.

IBM Storage Ceph is designed to operationalize AI with enterprise resiliency, consolidate data with software simplicity, and run on multiple hardware platforms to provide flexibility and lower costs.

IBM Storage Ceph is engineered to be self-healing and self-managing with no single point of failure, and it includes storage analytics for critical insights into growing amounts of data. IBM Storage Ceph can be used as an easy and efficient way to build a data lakehouse for IBM watsonx.data and for next-generation AI workloads.

This IBM Redpaper publication explains the concepts and architecture of IBM Storage Ceph in a clear and concise way. For detailed instructions on how to implement IBM Storage Ceph for real-life solutions, see the IBM Redpaper publication IBM Storage Ceph Solutions Guide, REDP-5715.

The target audience for this publication is IBM Storage Ceph architects, IT specialists, and
technologists.

Authors
This paper was produced by a team of specialists from around the world.

Christopher Maestas is the Chief Executive Architect for IBM File and Object Storage Solutions with over 25 years of experience deploying and designing IT systems for clients in various spaces. He has experience scaling performance and availability with various file systems technologies. He has developed benchmark frameworks to test out systems for reliability and to validate research performance data. He also has led global enablement sessions online and face to face, where he described how best to position mature technologies like IBM Spectrum® Scale with emerging technologies in cloud, object, container, or AI spaces.

Daniel Parkes has been a die-hard infrastructure enthusiast for many years, with a huge passion for open-source technologies and a keen eye for innovation. Daniel works in the IBM Storage Ceph Product Management team, focusing on the IBM Storage Ceph Object Storage offering. He is always eager to get involved in Ceph technical discussions and to help customers succeed with Ceph on their journey to digital excellence.


Franck Malterre is an information technology professional with a background of over 25 years designing, implementing, and maintaining large x86 physical and virtualized environments. For the last 10 years, he has specialized in IBM File and Object Storage solutions, developing IBM Cloud® Object Storage, IBM Storage Scale, and IBM Storage Ceph live demonstrations and proofs of concept for IBMers and Business Partners.

Jean-Charles (JC) Lopez has spent eight-tenths of his professional career in storage and has worked on Software Defined Storage since 2013. JC likes to think he can help people understand how they can move to a containerized environment when it comes to data persistence and protection. His go-to solutions are Fusion, Ceph, and OpenShift. He works with both the upstream community and downstream versions of the latter two.

John Shubeck is an information technology professional with over 42 years of industry experience across both the customer and technology provider perspectives. John is currently serving in the IBM Advanced Technology Group as a Senior Storage Technical Specialist on IBM Object Storage platforms across all of the Americas. He has participated in many technology conferences during the past twenty years as a facilitator, presenter, teacher, and panelist.

Jussi Lehtinen is a Principal Storage SME for IBM Object Storage working for IBM Infrastructure in the Nordics, Benelux, and Eastern Europe. He has over 35 years of experience working in IT, with the last 25 years in storage. He holds a bachelor’s degree in Management and Computer Studies from Webster University in Geneva, Switzerland.

Suha Ondokuzmayis is a cloud-native and open source enthusiast, lifelong learner, and passionate technologist. He began his career in 2001 as an assembler developer intern at Turkish Airlines, followed by Turkcell, Turkish Telekom, IBM, and Veritas Technologies within different departments, all with infrastructure-related responsibilities. He joined İşbank in 2013, where he has led IT infrastructure, IaaS and PaaS platforms, and block, file, and object storage product offerings, as well as backup operations. He holds a Master of Science in Management Information Systems from the University of Istanbul.


Vasfi Gucer works as the Storage Team Leader on the IBM Redbooks® team. He has more than 30 years of experience in the areas of systems management, networking hardware, and software. He writes extensively and teaches IBM classes worldwide about IBM products. For the past decade, his primary focus has been on storage, cloud computing, and cloud storage technologies. Vasfi is also an IBM Certified Senior IT Specialist, Project Management Professional (PMP), IT Infrastructure Library (ITIL) V2 Manager, and ITIL V3 Expert.

Thanks to the following people for their contributions to this project:

Elias Luna, Kenneth David Hartsoe, William West, Henry Vo
IBM USA

Marcel Hergaarden
IBM Netherlands

The team extends its gratitude to the Upstream Community, IBM and Red Hat Ceph
Documentation teams for their contributions to continuously improve Ceph documentation.

Now you can become a published author, too!


Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an IBM Redbooks residency project and help write a book
in your area of expertise, while honing your experience using leading-edge technologies. Your
efforts will help to increase product acceptance and customer satisfaction, as you expand
your network of technical contacts and relationships. Residencies run from two to six weeks
in length, and you can participate either in person or as a remote resident working from your
home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us!

We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks publications in one of the following ways:
򐂰 Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
򐂰 Send your comments in an email to:
[email protected]
򐂰 Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400


Stay connected to IBM Redbooks


򐂰 Find us on LinkedIn:
https://www.linkedin.com/groups/2130806
򐂰 Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/subscribe
򐂰 Stay current on recent Redbooks publications with RSS Feeds:
https://www.redbooks.ibm.com/rss.html


Chapter 1. Introduction
This chapter introduces the origins of Ceph and the basic architectural concepts that are used by this software-defined storage solution.

This chapter has the following sections:


򐂰 “History” on page 2
򐂰 “Ceph and storage challenges” on page 5
򐂰 “What is new with IBM Storage Ceph V 7.0?” on page 7


1.1 History
The Ceph project emerged from a critical observation: the Lustre architecture was inherently
limited by its metadata lookup mechanism. In Lustre, locating a specific file requires querying
a dedicated software component called the Metadata Server. This centralized approach
proved to be a bottleneck under heavy workloads, hindering the overall performance and
scalability of the storage system.

To solve this inherent Lustre architecture problem, Sage Weil envisioned a new mechanism to distribute and locate the data in a distributed, heterogeneous storage cluster. This new concept is metadata-less and relies on a pseudo-random placement algorithm to determine where each piece of data is stored.

The novel algorithm, named CRUSH (Controlled Replication Under Scalable Hashing),
leverages a sophisticated calculation to optimally place and redistribute data across the
cluster, minimizing data movement when the cluster's state changes, all without the need for
any additional metadata.

CRUSH is designed to distribute the data across all devices in the cluster to avoid the classic bias of favoring empty devices when writing new data. During cluster expansion, that bias would likely create a bottleneck, because the new empty devices would be favored to receive all the new writes, and it would leave the data distribution unbalanced, because the old data would not be redistributed across all the devices in the storage cluster.

Because CRUSH is designed to distribute and maintain the distribution of the data throughout the lifecycle of the storage cluster (expansion, reduction, or failure), it favors an equivalent mix of old and new data on each physical disk of the cluster and therefore leads to a more even distribution of the I/Os across all the physical disk devices.

To enhance the data distribution, the solution was designed to allow the breakdown of large elements (for example, a 100 GiB file) into smaller elements, each assigned a specific placement via the CRUSH algorithm. Therefore, reading a large file leverages multiple physical disk drives rather than the single drive that would be used if the file were kept as a single element.
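
To make the idea concrete, the following minimal Python sketch illustrates metadata-less, pseudo-random placement: any client can compute, from nothing more than an object name and the list of devices, which devices hold that object. It is a toy rendezvous-hashing example, not the actual CRUSH implementation, and the device names, replica count, and chunk names are invented for illustration.

# Simplified illustration of metadata-less, pseudo-random placement.
# This is NOT the real CRUSH algorithm: it ignores failure domains,
# device weights, and placement groups, and the device names are invented.
import hashlib

DEVICES = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
REPLICAS = 3

def place(object_name: str, devices=DEVICES, replicas=REPLICAS):
    """Rank every device by a hash of (object, device) and keep the top N.

    Because the ranking is a pure function of its inputs, any client can
    recompute the same placement without asking a metadata server.
    """
    scores = []
    for dev in devices:
        digest = hashlib.sha256(f"{object_name}/{dev}".encode()).hexdigest()
        scores.append((int(digest, 16), dev))
    return [dev for _, dev in sorted(scores, reverse=True)[:replicas]]

# A large file can be split into smaller chunks, each placed independently,
# so reads and writes are spread over many drives.
for chunk in ("bigfile.part0", "bigfile.part1", "bigfile.part2"):
    print(chunk, "->", place(chunk))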

Sage Weil prototyped the new algorithm and, in doing so, created a new distributed storage software solution, Ceph. The name Ceph, short for cephalopod, was chosen as a reference to the ocean and the life it harbors, given that Santa Cruz, CA is a Pacific Ocean coastal town.
In January 2023, all Ceph developers and product managers moved from Red Hat to IBM to provide greater resources for the future of the project. The Ceph project at IBM remains an open-source project, code changes still follow the upstream-first rule, and the project remains the base for the IBM Storage Ceph software-defined storage product.

Figure 1-1 on page 3 represents the milestones of the Ceph project over the past two
decades.


Figure 1-1 Ceph project timeline

All Ceph community versions are assigned the name of a member of the cephalopod family. The first letter of the name helps identify the version.

Table 1-1 lists all Ceph community version names with the matching Inktank Ceph Enterprise, Red Hat Ceph Storage, or IBM Storage Ceph version that leverages them.

Table 1-1   Ceph versions

Name          Upstream version   Release      EOL          Downstream product
Argonaut      0.48               2012-07-03   N/A          -
Bobtail       0.56               2013-01-01   N/A          Inktank Ceph Enterprise 1.0
Cuttlefish    0.61               2013-05-07   N/A          -
Dumpling      0.67               2013-08-01   2015-05-01   Inktank Ceph Enterprise 1.1
Emperor*      0.72               2013-11-01   2014-05-01   -
Firefly       0.80               2014-05-01   2016-04-01   Red Hat Ceph Storage 1.2, Inktank Ceph Enterprise 1.2
Giant*        0.87               2014-10-01   2015-04-01   -
Hammer        0.94               2015-04-01   2017-08-01   Red Hat Ceph Storage 1.3
Infernalis*   9.2.1              2015-11-01   2016-04-01   -
Jewel         10.2               2016-04-01   2018-07-01   Red Hat Ceph Storage 2
Kraken*       11.2               2017-01-01   2017-08-01   -
Luminous      12.2               2017-08-01   2020-03-01   Red Hat Ceph Storage 3
Mimic*        13.2               2018-06-01   2020-07-22   -
Nautilus      14.2               2019-03-19   2021-06-30   Red Hat Ceph Storage 4
Octopus*      15.2               2020-03-23   2022-08-09   -
Pacific       16.2               2021-03-31   2023-10-01   IBM Storage Ceph 5 (starting with 5.3), Red Hat Ceph Storage 5
Quincy        17.2               2022-04-19   2024-06-01   IBM Storage Ceph 6 (starting with 6.1), Red Hat Ceph Storage 6
Reef          18.2               2023-08-07   2025-08-01   Future IBM Storage Ceph 7
Squid         ...                ...          ...          ...

The versions identified by an asterisk (*), which have no corresponding downstream product, are Ceph development versions with a short lifespan; they were not used by Inktank, Red Hat, or IBM to create a durable product. Table 1-2 shows the downstream product lifecycle.

Table 1-2   Downstream product lifecycle

Product                                                  EOL          Extended support available until
Inktank Ceph Enterprise 1.1                              2015-07-31   N/A
Inktank Ceph Enterprise 1.2 / Red Hat Ceph Storage 1.2   2016-05-31   N/A
Red Hat Ceph Storage 1.3                                 2018-06-30   N/A
Red Hat Ceph Storage 2                                   2019-12-16   2021-12-17
Red Hat Ceph Storage 3                                   2021-02-28   2023-06-27
Red Hat Ceph Storage 4                                   2023-03-31   2025-04-30
Red Hat Ceph Storage 5 / IBM Storage Ceph 5              2024-08-31   2027-07-31
Red Hat Ceph Storage 6 / IBM Storage Ceph 6              2026-03-20   2028-03-20
Red Hat Ceph Storage 7 / IBM Storage Ceph 7              TBD          TBD

Following the transfer of the Ceph project to IBM, Red Hat will OEM Red Hat Ceph Storage from IBM, starting with Red Hat Ceph Storage 6.

Figure 1-2 represents the different IBM Storage Ceph versions as of today.

Figure 1-2 IBM Storage Ceph versions

1.2 Ceph and storage challenges


Over the many years persistent storage has existed, it has always faced the same basic
challenges:

1.2.1 Data keeps growing


The evolution of Information Technology has transformed data from character-based to
multimedia, leading to an exponential increase in data volume and a decentralized data
creation process across diverse terminals, effectively shattering the traditional notion of
structured data.

1.2.2 Technology changes


The rapid advancement of the Information Technology lifecycle has rendered the
once-dominant centralized data center obsolete, giving rise to a more decentralized
architecture that hinges on unprecedented levels of application and server communication.
This shift marks a stark departure from the era of centralized processing units housed within
a single room, where passive terminals served as mere conduits for accessing computational
power.

1.2.3 Data organization, access, and costs


Historically, data was centralized due to the limitations of early storage technologies like
magnetic disks, tapes, and punch cards. Access to this data was often restricted and required
physical handling of the storage medium. Preserving the availability and integrity of this
diverse data over time was a crucial responsibility of the IT department.

1.2.4 Data added value


The dispersion of data across diverse infrastructure, end-users, and terminal types has made
the generation of value from data increasingly dependent on the ability to process large
volumes of data from multiple sources and in multiple formats, employing advanced
techniques like Artificial Intelligence and Machine Learning. The key to unlocking value lies in

enabling real-time data accessibility from multiple locations without the need for human
intervention.

Ceph is no exception to these challenges, and was designed from the ground up to be highly
available, with no single point of failure, and highly scalable with limited day-to-day operational
requirements other than the replacement of failed physical resources, such as nodes or
drives. Ceph provides a complete software-defined storage solution as an alternative to
proprietary storage arrays.

1.2.5 Ceph approach


To address all the above challenges, Ceph relies on a layered architecture with an object store at its core. This core, known as the Reliable Autonomic Distributed Object Store, or RADOS, provides the following features:
򐂰 Stores the data with data protection and consistency as goal number one.
򐂰 Follows a scale-out model to cope with growth and changes.
򐂰 Presents no single point of failure.
򐂰 Provides CRUSH as a customizable, metadata-less placement algorithm.
򐂰 Runs as a software-defined storage solution on commodity hardware.
򐂰 Is open source to avoid vendor lock-in.

1.2.6 Ceph storage types


Ceph is known as a unified storage solution and supports the following types of storage:
򐂰 Block storage via a virtual block device solution known as RBD (RADOS Block Device).
򐂰 File storage via a POSIX-compliant shared file system solution known as CephFS, or the Ceph File System.
򐂰 Object storage via an OpenStack Swift- and Amazon S3-compatible gateway known as the RADOS Gateway, or RGW.

All the different types of storage are segregated and do not share data between them, while potentially sharing the same physical storage devices. A custom CRUSH configuration, however, allows you to separate the physical nodes and disks used by each of them.

All the different types of storage are identified and implemented as Ceph access methods on top of the native RADOS API. This API is known as librados.

Ceph is written entirely in C and C++, except for some language-specific API wrappers (for example, the Python wrappers for librados and librbd).
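
As a small illustration of the librados layer, the following Python sketch uses the rados binding that ships with Ceph to store and read back a single object. The configuration file path and the pool name are assumptions and must match an existing cluster and pool.

# A minimal librados sketch using the Python "rados" binding that ships
# with Ceph. The pool name "rbd" and the path to ceph.conf are assumptions;
# adjust them to your cluster.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # reads cluster and keyring settings
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                   # I/O context bound to one pool
    ioctx.write_full("hello-object", b"hello RADOS")    # store an object
    print(ioctx.read("hello-object"))                   # read it back: b'hello RADOS'
    ioctx.close()
finally:
    cluster.shutdown()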

IBM Storage Ceph is positioned for the following use cases (Figure 1-3 on page 7), and IBM is committed to supporting additional ones in upcoming versions, such as NVMe over Fabrics for a full and easy integration with VMware.


Figure 1-3 IBM Storage Ceph use cases

1.3 What is new with IBM Storage Ceph V 7.0?


IBM Storage Ceph V 7.0 is anticipated for general availability (GA) in December 2023. Overall, IBM Storage Ceph V 7.0 is a major release that introduces a number of new features and enhancements that can significantly improve the performance, scalability, and security of your Ceph storage cluster. This section delves into some of the new features introduced with this upcoming version. Note that these features are subject to change prior to the GA date.

1.3.1 WORM compliance certification


Cohasset Associates, Inc. evaluated IBM Storage Ceph Object Lock against electronic records requirements mandated by various regulatory bodies. They concluded that IBM Storage Ceph, when configured and utilized with Object Lock, adheres to the electronic recordkeeping stipulations of the following SEC, FINRA, and CFTC rules:
򐂰 SEC 17a-4(f)
򐂰 SEC 18a-6(e)
򐂰 FINRA 4511(c)
򐂰 CFTC 1.31(c)-(d)

Note: U.S. Securities and Exchange Commission (SEC) stipulates record keeping
requirements, including retention periods. Financial Industry Regulatory Authority (FINRA)
rules regulate member brokerage firms and exchange member markets.

The Cohasset Associates IBM Storage Ceph certification assessment page can be found
here.
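
For illustration, the following boto3 sketch shows how Object Lock is typically enabled through the S3 API against an RGW endpoint: the bucket is created with Object Lock enabled and a default COMPLIANCE retention rule is applied. The endpoint URL, credentials, bucket name, and 365-day retention period are assumptions, not values taken from this paper.

# Hedged sketch: enabling S3 Object Lock (WORM) on a bucket through the
# S3-compatible RGW endpoint using boto3. The endpoint URL, credentials,
# bucket name, and 365-day COMPLIANCE retention are illustrative assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Object Lock can only be enabled when the bucket is created.
s3.create_bucket(Bucket="records", ObjectLockEnabledForBucket=True)

# Apply a default retention rule so every new object version is WORM-protected.
s3.put_object_lock_configuration(
    Bucket="records",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)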

1.3.2 Multi-site replication with bucket granularity


This feature enables the replication of individual buckets or groups of buckets to a separate
IBM Storage Ceph cluster. This functionality is analogous to RGW multi-site, but with the
added capability of replicating at the bucket level.

Figure 1-4 on page 8 shows the multi-site replication with bucket granularity feature.


Figure 1-4 Multi-site replication with bucket granularity feature

Previously, replication was limited to full zone replication. This new feature grants clients enhanced flexibility by enabling the replication of individual buckets to or from different IBM Storage Ceph clusters. This granular approach allows for selective replication, which can be beneficial for edge computing, co-locations, or branch offices. Bidirectional replication is also supported.
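
As a hedged sketch of what bucket-granular replication can look like from the client side, the following boto3 call requests replication for a single bucket through the S3 ReplicationConfiguration API. Whether this API path is available depends on how the administrator has configured the multi-site sync policy between zones; the endpoint, credentials, bucket name, and role ARN are assumptions.

# Hedged sketch: one way to request per-bucket replication through the S3
# ReplicationConfiguration API against an RGW endpoint. Availability depends
# on the multi-site sync policy configured by the storage administrator;
# endpoint, credentials, and names below are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.zone-a.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.put_bucket_replication(
    Bucket="edge-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam:::role/unused",  # required by the API schema; typically not used by RGW
        "Rules": [
            {
                "ID": "replicate-edge-bucket",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::edge-bucket"},
            }
        ],
    },
)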

1.3.3 Object archive zone (Tech preview)


The archive zone serves as a repository for all object versions from the production zones. It
maintains a complete history of each object, providing users with a comprehensive object
catalog. The data stored in the archive zone is rendered immutable through the
implementation of WORM (Write Once, Read Many) with Ceph Object Lock. This immutability
safeguard renders data in the archive zone impervious to modifications or deletions caused
by ransomware, computer viruses, or other external threats.

Figure 1-5 on page 9 shows the object archive zone feature.


Figure 1-5 Object archive zone

The archive zone selectively replicates data from designated buckets within the production
zones. System administrators can control which buckets undergo replication to the archive
zone, enabling them to optimize storage capacity usage and prevent the accumulation of
irrelevant content.

1.3.4 RGW policy-based data archive and migration capability


Figure 1-6 on page 10 highlights the policy-driven data archive and migration capability. This
feature enables clients to seamlessly migrate data that adheres to policy criteria to the public
cloud archive. This functionality is available for both Amazon Web Services (AWS) and
Microsoft Azure public cloud users.

The primary benefit of this feature lies in its ability to liberate on-premises storage space that
is currently occupied by inactive, rarely accessed data. This reclaimed storage can then be
repurposed for active datasets, enhancing overall storage efficiency.


Figure 1-6 RGW policy-based data archive and migration capability

1.3.5 IBM Storage Ceph Object S3 Lifecycle Management


IBM Storage Ceph lifecycle management provides clients with the flexibility to combine object
transition and expiration within a single policy. For instance, you can configure a policy to
transition objects to the cold tier after 30 days and subsequently delete them after 365 days.
Figure 1-7 shows the IBM Storage Ceph Object S3 Lifecycle Management feature.


Figure 1-7 IBM Storage Ceph Object S3 Lifecycle Management
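
As an illustration of the combined transition and expiration policy described above, the following boto3 sketch applies a lifecycle rule that transitions objects after 30 days and deletes them after 365 days. The endpoint URL, credentials, bucket name, and the COLD storage class name are assumptions; the storage class must exist in the zone placement.

# Hedged sketch: a lifecycle rule that transitions objects after 30 days and
# expires them after 365 days, applied with boto3 against an RGW endpoint.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.put_bucket_lifecycle_configuration(
    Bucket="mybucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},                            # apply to every object
                "Transitions": [{"Days": 30, "StorageClass": "COLD"}],  # assumed cold-tier class name
                "Expiration": {"Days": 365},
            }
        ]
    },
)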

1.3.6 Dashboard UI enhancements


With these new UI functionalities, a Ceph cluster administrator has the ability to manage the
whole lifecycle of CephFS filesystems, volumes and subvolumes via the graphical dashboard
UI.
򐂰 CephFS UI interaction:
– Creation, listing, changing options, and deletion of CephFS filesystem(s), volumes,
subvolume groups and subvolumes.
򐂰 CephFS access and encryption:
– Access management for CephFS resources.
– Encryption management options for CephFS.
򐂰 CephFS snapshots management:
– List all snapshots for a particular FS, volume subvolume or directory.
– Create or delete a one-time snapshot.
– Display occupied capacity by a snapshot.
򐂰 CephFS monitoring:
– Health status, basic read/write throughput, and capacity utilization for filesystems, volumes, and subvolumes, including historical graphs.

The dashboard also gains new functionality for managing and monitoring IBM Storage Ceph object storage (RGW):
򐂰 RGW multi-site UI configuration:
– RGW multi-site setup and configuration from the dashboard UI.

– Easy setup without having to go through manual setup steps.


򐂰 RGW bucket level view and management:
– ACL status (private, public).
– Bucket tags (add or remove tags)
– On a per-bucket basis.
򐂰 Labeled performance counters per user or bucket:
– Reporting into Prometheus monitoring stack.
– Bucket operations and statistics.
– User operations and statistics.
򐂰 Multisite synchronization status:
– Synchronization status.
– RGW operation metrics.
– Client and replica traffic.
– Dashboard visibility of a RGW latest sync status.

1.3.7 NFS support for CephFS for non-native Ceph clients


After configuring the Ceph File System, users can easily manage NFS exports on the Ceph
dashboard, creating, editing, and deleting them as needed.

IBM Storage Ceph Linux clients can seamlessly mount CephFS without additional driver
installations, as CephFS is embedded in the Linux kernel by default. This capability extends
CephFS accessibility to non-Linux clients through the NFS protocol. In IBM Storage Ceph V7,
the NFS Ganesha service expands compatibility by supporting NFS v4, empowering a
broader spectrum of clients to seamlessly access CephFS resources.

1.3.8 NVMe over Fabrics (Tech preview)


IBM Storage Ceph leverages NVMe over Fabrics (NVMe-oF) to deliver block storage access to non-Linux clients. While Linux clients can directly consume Ceph block storage using the RBD protocol, due to its native integration with the Linux kernel, non-Linux clients require an additional gateway.

The newly introduced IBM Storage Ceph NVMe-oF gateway bridges the gap for non-Linux clients, enabling them to access Ceph block storage through standard NVMe-oF initiators. These initiators establish connections with the gateway, which in turn connects to the RADOS block storage system.

The performance of NVMe-oF block storage through the gateway is comparable to native
RBD block storage, ensuring a consistent and efficient data access experience for both Linux
and non-Linux clients.

1.3.9 Object storage for ML/analytics: S3 Select


S3 Select is a feature of Amazon S3 that allows you to filter the contents of an S3 object
without having to download the entire object. IBM Storage Ceph V 7.0 adds support for S3
Select, which can reduce the amount of data that needs to be transferred for certain
workloads.


This feature empowers clients to employ straightforward SQL statements to filter the contents
of S3 objects and retrieve only the specific data they require. By leveraging S3 Select for data
filtering, clients can significantly minimize the amount of data transferred by S3, thereby
reducing both retrieval costs and latency. The following data formats are supported:
򐂰 CSV
򐂰 JSON
򐂰 Parquet
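
The following boto3 sketch shows the kind of request involved: a SQL expression is pushed down to the object store so that only the matching rows of a CSV object are returned. The endpoint, credentials, bucket, object key, and column layout are assumptions.

# Hedged sketch: using S3 Select through boto3 to filter a CSV object server
# side, so only matching rows are transferred to the client.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

resp = s3.select_object_content(
    Bucket="analytics",
    Key="sales.csv",
    ExpressionType="SQL",
    Expression="SELECT s._1, s._3 FROM S3Object s WHERE CAST(s._3 AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; matching records arrive in one or more chunks.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())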

1.3.10 RGW multi-site performance improvements


IBM Storage Ceph V7 boosts performance for data replication and metadata operations by
enabling parallel processing of operations as the number of RadosGW daemons increases.
This capability enables horizontal scalability, empowering you to efficiently handle growing
workloads.

1.3.11 Erasure code EC2+2 with 4 nodes


Erasure coding is a data protection technique that splits data into data chunks and additional coding (parity) chunks to protect against data loss. The EC2+2 scheme (two data chunks plus two coding chunks) provides resilient data protection at a lower storage cost than replication.

With IBM Storage Ceph 7, erasure code EC2+2 can be used with just four nodes, making it more efficient and cost-effective to deploy erasure coding for data protection.

Previously, a larger cluster of nodes was required to establish a supported configuration. However, this latest release reduces the minimum configuration requirement to four nodes, enabling users to leverage the benefits of erasure code EC2+2 with a smaller and more cost-effective setup.

This enhancement is particularly beneficial for smaller deployments or organizations seeking to optimize their storage infrastructure costs. By reducing the minimum node requirement, IBM Storage Ceph 7 makes erasure code EC2+2 accessible to a wider range of users and use cases.

IBM Storage Ready Nodes can be deployed with a minimum of four nodes and utilize erasure
coding for the RADOS backend in this basic configuration. This scalable solution can be
expanded to accommodate up to 400 nodes.
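
To make the capacity trade-off concrete, the short sketch below compares the usable capacity of an EC 2+2 pool with 3-way replication on the same raw capacity. The four-node cluster size and the per-node raw capacity are illustrative assumptions.

# Simple capacity arithmetic comparing EC 2+2 with 3-way replication.
# The 4-node cluster size and per-node raw capacity are illustrative assumptions.
def usable_capacity(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity of an erasure-coded pool: raw * k / (k + m)."""
    return raw_tb * k / (k + m)

raw = 4 * 100.0                            # four nodes with 100 TB of raw capacity each

ec_2_2 = usable_capacity(raw, k=2, m=2)    # 50% efficiency, tolerates the loss of 2 chunks
replica3 = raw / 3                         # 3-way replication, ~33% efficiency

print(f"EC 2+2 usable:         {ec_2_2:.0f} TB")    # 200 TB
print(f"3x replication usable: {replica3:.0f} TB")  # ~133 TB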


Chapter 2. IBM Storage Ceph architecture


This chapter introduces the architecture of IBM Storage Ceph. It details the various components of a Ceph cluster, namely the components used to build the RADOS cluster around the CRUSH concept.

This chapter has the following sections:


򐂰 “Architecture” on page 16
򐂰 “Access methods” on page 31
򐂰 “Deployment” on page 32


2.1 Architecture
Figure 2-1 represents the structure and layout of a Ceph cluster, starting from RADOS at the
bottom, up to the various access methods provided by the cluster.

Figure 2-1 Ceph general architecture

2.1.1 RADOS components


RADOS is built using the following components:
򐂰 Monitors
򐂰 Object Storage Devices
򐂰 Managers
򐂰 Metadata Servers

Monitors
The Monitors, known as MONs, are responsible for managing the state of the cluster. As with any distributed storage system, the challenge is to keep track of the status of each cluster component (Monitors, Managers, Object Storage Devices, and so forth).

Ceph maintains its cluster state through a set of specialized maps, collectively referred to as
the cluster map. Each map is assigned a unique version number, called an epoch, which
starts at 1 and increments by 1 upon every state change for the corresponding set of
components.

The Monitors maintain the following maps:


򐂰 MON map
򐂰 MGR map
򐂰 OSD map
򐂰 MDS map
򐂰 PG map
򐂰 CRUSH map


To ensure the integrity and consistency of map updates, the Monitors employ the PAXOS
algorithm, enabling them to reach consensus among multiple Monitors before validating and
implementing any map changes.

To prevent split-brain scenarios, the number of Monitors deployed in a Ceph cluster must
always be an odd number greater than two to ensure that a majority of Monitors can validate
map updates. This means that more than half of the Monitors present in the Monitor Map
(MONMap) must agree on the change proposed by the PAXOS quorum leader for the map to
be updated.

Note: The Monitors are not part of the data path, meaning they do not directly handle data
storage or retrieval requests. They exist primarily to maintain cluster metadata and keep all
components synchronized.
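
Although the Monitors are not in the data path, any authorized client can query the maps they maintain. The following hedged Python sketch uses the mon_command() call of the rados binding to dump two of the maps and print their epochs; the configuration file path is an assumption.

# Hedged sketch: querying the cluster maps that the Monitors maintain, using
# the mon_command() call of the Python rados binding.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for cmd in ({"prefix": "mon dump", "format": "json"},
                {"prefix": "osd dump", "format": "json"}):
        ret, outbuf, errs = cluster.mon_command(json.dumps(cmd), b"")
        if ret == 0:
            data = json.loads(outbuf)
            # every map carries an epoch that increments on each state change
            print(cmd["prefix"], "epoch:", data.get("epoch"))
finally:
    cluster.shutdown()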

Managers
The Managers, abbreviated as MGRs, are integrated with the Monitors and collect the statistics within the cluster. The Managers provide a pluggable Python framework to extend the capabilities of the cluster. As such, developers or end users can leverage or create Manager modules that are loaded into the Manager framework.

The list below provides you with some of the existing Manager modules that are available:
򐂰 Balancer module (dynamically reassign placement groups to OSDs for better data
distribution).
򐂰 Auto-scaler module (dynamically adjust the number of placement groups assigned to a
pool).
򐂰 Dashboard module (provide a UI to monitor and manage the Ceph cluster).
򐂰 RESTful module (provide RESTful API for cluster management).
򐂰 Prometheus module (provide metrics supports for the Ceph cluster).

Object Storage Devices


Sometimes referred to as Object Storage Daemons, but always abbreviated as OSDs, this component of the RADOS cluster is responsible for storing the data in RADOS and for serving the I/O requests originating from the Ceph cluster clients.

OSDs are responsible for the following:


򐂰 Serve IO requests.
򐂰 Protect the data (replication or erasure coding model).
򐂰 Recover the data after failure.
򐂰 Rebalance the data on cluster expansion or reduction.
򐂰 Check the consistency of the data (scrubbing).
򐂰 Check for bit rot detection (deep scrubbing).

Each OSD can be assigned one role for a given placement group:
򐂰 Primary OSD.
򐂰 Secondary OSD.

The primary OSD performs all the above functions, while a secondary OSD always acts under the control of a primary OSD. For example, if a write operation lands on the primary OSD for a given placement group, the primary OSD sends a copy of the I/O to one or more secondary OSDs, and each secondary OSD is solely responsible for writing the data onto its physical media and, when done, acknowledging the write to the primary OSD.

Note: A cluster node where OSDs are deployed is called an OSD node.

In most cases, you deploy one Object Storage Device per physical drive.

OSD Object Store


FileStore
In its early days, the Ceph OSD used an object store implementation known as FileStore (Figure 2-2). This object store solution leverages an XFS-formatted partition to store the actual data and a raw partition to store the object store journal. The journal is written sequentially to the raw device in a wrap-around fashion. The roles of the journal are:
򐂰 Guarantee transaction atomicity across multiple OSDs.
򐂰 Leverage sequential performance of hard drives.

When flash-based drives arrived on the market, it became best practice to use a Solid State
Drive to host the journal to enhance the performance of write operations in the Ceph cluster.

Figure 2-2 FileStore architecture

However, the complexity of the solution and the write amplification due to 2 writes for each
write operation led the Ceph project to consider an improved solution for the future.

BlueStore
BlueStore is the new default OSD object store format since upstream Luminous (Red Hat
Ceph Storage 3.x). With BlueStore, data is written directly to the disk device, while a separate
RocksDB key-value store contains all the metadata.

Once the data is written to the raw data block device, RocksDB is updated with the metadata related to the new data blobs that were just written.

RocksDB is a high-performance key-value store developed in 2012 at Facebook that performs
very well with flash-based devices. As RocksDB cannot write directly to a raw block device, a
virtual file system (VFS) layer was developed for RocksDB to be able to store its .sst files on
the raw block device. The VFS layer is named BlueFS.

RocksDB utilizes a DB portion and a write-ahead log (WAL) portion. Depending on the size of
the IO, RocksDB will write the data directly to the raw block device through BlueFS or to the
WAL so it can be later committed to the raw block device. The latter process is known as a
deferred write.


As BlueStore is introduced, additional functionalities are added to the Ceph cluster:


򐂰 Data compression at the OSD level independent of the access method used.
򐂰 Checksums are verified on each read request.

Figure 2-3 shows the BlueStore architecture.

Figure 2-3 BlueStore architecture

BlueStore provides the following improvements over the FileStore:


򐂰 Removal of the extra file system layer for direct access to the raw device to store data.
򐂰 RocksDB performance to store and retrieve metadata for faster physical object location.
򐂰 Full data and metadata checksum.
򐂰 Snapshot and clone benefit of a more efficient copy-on-write on the raw block device.
򐂰 Minimizes write amplification for many workloads.
򐂰 Deferred write threshold configurable per device type.

BlueStore operates with the following layout:


򐂰 One device for the data (aka block).
򐂰 One optional RocksDB device for the metadata (aka block.db).
򐂰 One optional RocksDB device for the WAL (aka block.wal).

Note: The best practice is to use a device faster than the data device for the RocksDB
metadata device and a device faster than the RocksDB metadata device for the WAL
device.

When a separate device is configured for the metadata, metadata might overflow to the data
device if the metadata device becomes full. While this is not a problem if both devices are of
the same type, it leads to performance degradation if the data device is slower than the
metadata device. This situation is known as the BlueStore spillover.

When deploying BlueStore OSDs, follow these best practices (a sample OSD service specification follows the list):


򐂰 If using the same device type, all rotational drives for example, specify the block device
only and do not separate block.db or block.wal.
򐂰 If using a mix of fast and slow devices (SSD / NVMe and rotational), choose a fast device
for block.db while block uses the slower rotational drive.
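The following is a minimal sketch of a cephadm OSD service specification that implements this
best practice; the service_id, host pattern, and device selection criteria are illustrative and must
be adapted to your environment. The file can be applied with ceph orch apply -i <file>.yaml.

   service_type: osd
   service_id: osd_hdd_data_ssd_db
   placement:
     host_pattern: '*'
   spec:
     data_devices:
       rotational: 1    # rotational drives hold the data (block)
     db_devices:
       rotational: 0    # flash devices hold the RocksDB metadata (block.db)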

BlueStore uses a cache mechanism that dynamically adjusts itself (bluestore_cache_autotune =
1). When cache autotuning is enabled, the OSD tries to keep the OSD heap memory usage
below the value assigned to the osd_memory_target parameter and not below the
osd_memory_cache_min value.

If need be, the cache sizing can be adjusted manually by setting the parameter
bluestore_cache_autotune to 0, and the following parameters can be adjusted to allocate
specific portions of the BlueStore cache:
򐂰 cache_meta: BlueStore onode and associated data.
򐂰 cache_kv: RocksDB block cache including indexes and filters.
򐂰 data_cache: BlueStore cache for data buffers.

The above parameters are expressed as a percentage of the cache size assigned to the
OSD.

BlueStore allows you to configure more features to align best with your workload:
򐂰 block.db sharding.
򐂰 Minimum allocation size on the data device.

2.1.2 Ceph cluster partitioning and data distribution


In this section we discuss Ceph cluster partitioning and data distribution.

Pools
The cluster is divided into logical storage partitions called pools. The pools have the following
characteristics:
򐂰 Group data of a specific type.
򐂰 Group data that is to be protected using the same mechanism (replication or erasure
coding).
򐂰 Group data to control access from Ceph clients.
򐂰 Assigned one and only one CRUSH rule to determine placement group mapping to OSDs.

Note: Pools support compression but do not support deduplication for now. The
compression can be activated on a per pool basis.

Data protection
The data protection scheme is assigned individually to each pool. IBM Storage Ceph supports
the following data protection schemes (a short pool creation sketch follows the note after this list):
򐂰 Replicated that makes a full copy of each byte stored in the pool (default 3 copies):
– 2 replicas and higher are supported with underlying flash devices.
– 3 replicas and higher are supported with rotational hard drives.
򐂰 Erasure coding that functions in a similar way as the parity RAID mechanism:
– 4+2 erasure coding is supported with jerasure plugin.
– 8+3 erasure coding is supported with jerasure plugin.

– 8+4 erasure coding is supported with jerasure plugin.

Note: Erasure coding, although supported for all types of storage (block, file and object), is
not recommended for block and file as it delivers lower performance.
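As a minimal sketch, the following commands create one pool of each type from the command
line; the pool names, placement group counts, and erasure coding profile name are illustrative:

   # Create a replicated pool (3 copies by default)
   ceph osd pool create rep_pool 128 128 replicated
   # Create a 4+2 erasure coding profile, then an erasure coded pool that uses it
   ceph osd erasure-code-profile set ec42 k=4 m=2
   ceph osd pool create ec_pool 128 128 erasure ec42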

The difference between replicated and erasure coding pools when it comes to data protection
is summarized in Figure 2-4.

Figure 2-4 Replicated data protection versus erasure coding data protection

Erasure coding provides a more cost-efficient data protection mechanism, and greater
resiliency and durability as you increase the number of coding chunks, allowing the pool to
survive the loss of many OSDs or servers before the data becomes irrecoverable. However, it
offers lower performance because of the computation and network traffic required to split the
data and calculate the coding chunks.

The benefits of the replicated model are:


򐂰 Very high durability with 3 copies.
򐂰 Quicker recovery.
򐂰 Performance optimized.

The characteristics of the erasure coding model are:


򐂰 Cost-effective durability with the multiple coding chunks.
򐂰 More expensive recovery.
򐂰 Capacity optimized.

Each pool is assigned a set of parameters that can be changed on the fly. Table 2-1 lists the
main parameters used for pools and details if the parameter can be dynamically modified.

Table 2-1 Main parameters used for pools and whether each parameter can be dynamically modified

Name                 Function                                                                  Dynamic update
id                   Unique ID for the pool.                                                   No
pg_num               Total number of placement groups.                                         Yes
pgp_num              Effective number of placement groups.                                     Yes
size                 Number of replicas or chunks for the pool.                                Yes (replicated), No (erasure coding)
min_size             Minimum number of replicas for a placement group to be active.            Yes
crush_rule           CRUSH rule to use for the placement group placement.                      Yes
nodelete             Prevent the pool from being deleted.                                      Yes
nopgchange           Prevent the number of placement groups from being modified.               Yes
nosizechange         Prevent size and min_size from being modified.                            Yes
noscrub              Prevent scrubbing for the placement groups of the pool.                   Yes
nodeep-scrub         Prevent deep-scrubbing for the placement groups of this pool.             Yes
pg_autoscale_mode    Enable or disable placement group autoscaling for this pool.              Yes
target_size_ratio    Expected capacity for the pool used by the placement group auto-scaler.   Yes
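As a sketch of changing one of the parameters in Table 2-1 on the fly, the following commands
adjust an existing pool from the command line; the pool name and values are illustrative:

   # Change the number of placement groups for an existing pool
   ceph osd pool set rep_pool pg_num 256
   # Protect the pool against accidental deletion
   ceph osd pool set rep_pool nodelete true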

Placement groups
The pool is divided into hash buckets called placement groups or PGs. The role of the PG is
to:
򐂰 Store the objects as calculated by the CRUSH algorithm.
򐂰 Guarantee that the storage of the data is abstracted from the physical devices.

As such, each PG is assigned to a set of OSDs, and the following rules apply:


򐂰 A RADOS object is assigned to one and only one placement group.
򐂰 A placement group is assigned multiple objects.
򐂰 Each placement group may occupy different capacity on disk.
򐂰 Each placement group exists on multiple OSDs (assigned by CRUSH).

Note: The mapping of a placement group will always be the same for a given cluster state.

Placement group states


Each placement group has a state that represents the status of the PG and helps you diagnose
whether the PG is healthy and whether it can serve IO requests. Table 2-2 lists the placement
group states.

Table 2-2 Placement group states


State Description

creating PG is being created and is not ready for use.

active+clean PG is fully functional, and all data is replicated and available.

undersized PG is fully functional, but has fewer copies than configured.

active+stale PG is fully functional, but one or more replicas is stale and needs to be updated.

stale PG is in an unknown state as the Monitors have not received an update after
the PG placement was changed.

down A replica with necessary data is down so the PG is offline.

peering PG is in the process of adjusting its placement and replication settings.

activating PG is peered but not yet active.

recovering PG is migrating or synchronizing objects and their replicas.

recovery_wait PG is waiting to start recovery.

recovery_toofull PG is waiting for recovery as target OSD is too full.

backfilling PG is adding new replicas to its data.

backfill_wait PG is waiting to start backfilling.

backfill_toofull PG is waiting for backfill as target OSD is too full.

incomplete PG is missing information about some writes that may have occurred or does
not have healthy copies.

scrubbing PG is checking group metadata inconsistencies.

deep PG is checking stored checksum.

degraded PG has not replicated some objects to the correct number of OSDs.

repair PG is being checked and repaired for any inconsistencies it finds.

inconsistent PG has inconsistencies between its different copies stored on different OSDs.

snaptrim PG is trimming snapshot data.

snaptrim_wait PG is waiting for snap trimming to start.

snaptrim_error PG encountered error during the snap trimming process.

Cluster status
While placement groups have their distinct state, the Ceph cluster has its own global status
that can be checked with the ceph status, ceph health or ceph health detail commands.

Table 2-3 Cluster status


Cluster Status Description

HEALTH_OK Cluster is fully operational, and all components are operating as expected.

HEALTH_WARN Some issues exist in the cluster, but data is available to clients.

HEALTH_ERR Serious issues exist in the cluster and some data has become unavailable
to clients.

HEALTH_FAILED Cluster is not operational and data integrity may be at risk.

HEALTH_UNKNOWN The status of the cluster is unknown.
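As a minimal sketch, the cluster status can be checked from the command line; the conditional
is illustrative of how a monitoring script might react to a status other than HEALTH_OK:

   # Show the overall cluster status and health summary
   ceph status
   # Print the detailed health report only when the cluster is not HEALTH_OK
   if ! ceph health | grep -q "HEALTH_OK"; then
       ceph health detail
   fi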

Figure 2-5 illustrates the general component layout within a RADOS cluster.

Figure 2-5 Ceph cluster layout

Note: With the latest version of IBM Storage Ceph multiple components can be deployed
on one node following the support matrix. You need to log in with your IBMid to access this
link.

2.1.3 Controlled Replication Under Scalable Hashing (CRUSH)


To locate the data in the cluster, Ceph implements a unique algorithm known as CRUSH. It
allows the clients and the members of a Ceph cluster to locate the data via a pseudo random
method that provides the same result for a given cluster state.

As the placement remains the same for a given cluster state, imagine the following scenario:
򐂰 A RADOS object is hosted in placement group 3.23.
򐂰 A copy of the object is written on OSD 24, 3 and 12 (state A of the cluster).
򐂰 OSD 24 is stopped (state B of the cluster).
򐂰 An hour later OSD 24 is restarted.

Upon the stoppage of OSD 24, the placement group that contains the RADOS object will be
recreated on another OSD so that the cluster satisfies the number of copies that must be
maintained for the placement groups that belong to the specific pool.

At this point, state B of the cluster becomes different from the original state A of the cluster.
Placement group 3.23 is now protected by OSD 3, 12 and 20.

When OSD 24 is restarted, assuming no changes were made to the cluster in the meantime
such as cluster expansion, reduction or CRUSH customization or another OSD failing, the
copy of the placement group 3.23 will be resynchronized on OSD 24 to provide the same
mapping for the placement group as the cluster state is identical to state A.

From an architecture point of view, CRUSH provides the following mechanism (see
Figure 2-6).


Figure 2-6 CRUSH from object to OSD architecture

To determine the location of a specific object, the following mechanism is used (see
Figure 2-7).

Figure 2-7 CRUSH calculation detail
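As a sketch of this calculation in practice, the following command asks the cluster where a
given object would be placed; the pool and object names are illustrative:

   # Report the placement group and the acting set of OSDs for an object
   ceph osd map rep_pool myobject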

To make sure a client or a Ceph cluster component locates the correct placement of an object,
which is what enables the client-to-cluster communication model, all maps maintained by the
Monitors are versioned, and the version of the map used to locate an object or a placement
group is checked by the recipient of a request.

The specific exchange determines whether the Monitors or one of the OSDs updates the map
version used by the sender. In most cases, these updates are differentials, meaning only the
changes to the map are transmitted after the initial connection to the cluster.

If you now look at the whole Ceph cluster, with its many pools, each pool with its own
placement groups, the picture will look like Figure 2-8 on page 26.


Figure 2-8 Full data placement picture (Objects on the left, OSDs on the right)

2.1.4 OSD failure and recovery


Object Storage Devices operate with two separate sets of statuses:
򐂰 up / down: Tracks whether the OSD is running and responsive.
򐂰 in / out: Tracks whether the OSD participates in data placement.

In the Ceph clusters, the following mechanisms exist to track the status of the different
components of the cluster:
򐂰 Monitor to Monitor heartbeat.
򐂰 OSD to Monitor heartbeat.
򐂰 OSD to OSD heartbeat.

The OSD-to-OSD heartbeat mechanism enables OSDs to detect the failure of their peers
without having to rely solely on the Monitors, while the OSD-to-Monitor heartbeat ensures
that the Monitors keep the OSDMap up to date as the state of the OSDs changes.

Upon detecting an unavailable peer OSD (because it works with other OSDs to protect
placement groups), an OSD relays this information to the Monitors. This enables the Monitors
to update the OSDMap accordingly, reflecting the status of the unavailable OSD.

OSD failures
When an OSD becomes unavailable, the following statements become true:
򐂰 The total capacity of the cluster is reduced.
򐂰 The total throughput that can be delivered by the cluster is reduced.

򐂰 The cluster will enter a recovery process that generates disk IOs (to read the data that
must be recovered) and network traffic (to send a copy of the data to another OSD and
recreate the missing data).

When an OSD fails or becomes unresponsive, it is marked as down. If, after a configurable
amount of time set via the mon_osd_down_out_interval parameter (default 600 seconds), the
OSD does not return to an up state, it is automatically marked as out of the cluster and the
cluster starts recovering the data for all the placement groups the failed OSD was hosting.
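As a minimal sketch, the OSD statuses and the down/out interval can be inspected and tuned
from the command line; the value shown is illustrative:

   # Show the up/down and in/out status of every OSD in the CRUSH hierarchy
   ceph osd tree
   # Change how long a down OSD may remain down before being marked out (seconds)
   ceph config set mon mon_osd_down_out_interval 900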

OSD recovery and backfill


While the recovery and backfill processes can be customized by Ceph administrators, the
default parameters are carefully selected to minimize performance impact on client traffic.
Modifying these parameters requires expertise and a thorough understanding of the potential
performance implications.

Recovery is the process of moving or synchronizing a placement group following the failure of
an OSD.

Backfill is the process of moving a placement group following the addition or the removal of an
OSD to or from the Ceph cluster.

Client impact on failures


The cluster provides different maps so that every client or component of the Ceph cluster can
be kept aware of the status of every single component within the Ceph cluster.

When a Ceph client connects to the Ceph cluster, it needs to contact the Monitors of the
cluster so that it can be authenticated. Once authenticated the Ceph client will be provided
with a copy of the different maps maintained by the Monitors.

In Figure 2-9 on page 28, the different steps that occur when a Ceph client connects to the
cluster and then accesses data follow this high-level sequence:
1. Upon successful authentication, the client is provided with the cluster map.
2. The data placement for object name xxxxx is calculated.
3. The Ceph client initiates a connection with the primary OSD that protects the PG.


Figure 2-9 Client Cluster interaction

When a failure has occurred within the cluster, as the Ceph client tries to access a specific
OSD, the following cases can occur:
򐂰 The OSD it is trying to contact is unavailable.
򐂰 The OSD it is trying to contact is available.

In the first case, the Ceph client will fall back to the Monitors to obtain an updated copy of the
cluster map.

In Figure 2-10 on page 29, the different steps that occur when a Ceph client connects to the
cluster and then accesses data follow this high-level sequence:
4. As the target OSD has become unavailable, the client contacts the Monitors to obtain the
latest version of the cluster map.
5. The data placement for object name xxxxx is recalculated.
6. The Ceph client initiates a connection with the new primary OSD that protects the PG.


Figure 2-10 Client Cluster interaction on OSD failure (case 1)

In the second case, the OSD will detect that the client has an outdated map version and will
provide the map updates that took place between the map version used by the Ceph client
and the map version used by the OSD. Upon receiving these updates, the Ceph client will
recalculate data placement and retry the operation, ensuring that it is aware of the latest
cluster configuration and can interact with OSDs accordingly.

2.1.5 Cephx authentication


Ceph uses the cephx protocol to manage the authorizations and the authentication between
client applications and the Ceph cluster and between Ceph cluster components. This protocol
uses shared secret keys.

CephX is enabled by default during deployment, and it is recommended to keep it enabled.
However, some benchmark results available online may have been obtained with CephX
disabled to eliminate protocol overhead during testing.

The installation process enables cephx by default, so that the cluster requires user
authentication and authorization by all client applications.

Usernames and userids


The following user account types exist in a Ceph cluster:
򐂰 Accounts for communication between Ceph daemons.
򐂰 Accounts for Ceph client applications accessing the Ceph cluster.
򐂰 Account for the Ceph cluster administrator.


The usernames used by the Ceph daemons are expressed as {type}.{id}. For example
mgr.servera, osd.0 or mds.0. They are created during the deployment process.

The usernames used by Ceph clients are expressed as client.{id}. For example, when
connecting OpenStack Cinder to a Ceph cluster it is common to create the client.cinder
username.

The username used by the RADOS Gateways connecting to a Ceph cluster follows the
client.rgw.hostname structure.

By default, if no argument is passed to the librados API via code or via the Ceph command
line interface, the connection to the cluster will be attempted with the client.admin username.

The default behavior can be configured via the CEPH_ARGS environment variable. To specify a
specific username, use export CEPH_ARGS="--name client.myid". To specify a specific
userid, use export CEPH_ARGS="--id myid".

Keyrings
Upon creating a new username, a corresponding keyring file is generated in the Microsoft
Windows ini file format. This file contains a section named [client.myid] that holds the
username and its associated unique secret key. When a Ceph client application running on a
different node needs to connect to the cluster using this username, the keyring file must be
copied to that node.

When the application starts, librados, which ends up being called regardless of the access
method used, searches for a valid keyring file in /etc/ceph. The keyring file name is
generated as {clustername}.{username}.keyring.

The default {clustername} is ceph. For example, if the username is client.testclient,
librados will search for /etc/ceph/ceph.client.testclient.keyring.

You can override the location of the keyring file by inserting a
keyring={keyring_file_location} entry in the local Ceph configuration file
(/etc/ceph/ceph.conf) in a section named after your username:
[client.testclient]
keyring=/mnt/homedir/myceph.keyring

All Ceph Monitors can authenticate users so that the cluster does not present any single point
of failure when it comes to authentication. The process follows these steps:
1. Ceph client contacts the Monitors with a username and a keyring.
2. Ceph monitors return an authentication data structure like a Kerberos ticket that includes a
session key.
3. Ceph client uses the session key to request specific services.
4. Ceph Monitor provides the client with a ticket so it can authenticate with the OSDs.
5. The ticket expires, which prevents it from being reused indefinitely and protects against spoofing risks.

Capabilities or Caps
Each username is assigned a set of capabilities that enables specific actions against the
cluster.

The cluster offers the equivalent of the Linux root user known as client.admin. By default, a
user has no capabilities and must be allowed specific rights. This is achieved via the allow
keyword that precedes the access type granted by the capability.


The existing access types common to all Ceph daemons are:


򐂰 r: Provides a read permission.
򐂰 w: Provides a write permission.
򐂰 x: Provides an execute permission on class methods that exist for the daemon.
򐂰 *: Provides rwx permission.
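As a minimal sketch, the following command creates a username with explicit capabilities
rather than a profile; the username and pool name are illustrative:

   ceph auth get-or-create client.app1 mon 'allow r' osd 'allow rw pool=app1pool'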

Profiles
To simplify capability creation, cephx profiles can be leveraged:
򐂰 profile osd: User can connect as an OSD and communicate with OSDs and Monitors.
򐂰 profile mds: User can connect as an MDS and communicate with MDSs and Monitors.
򐂰 profile crash: Read only access to Monitors for crash dump collection.
򐂰 profile rbd: User can manipulate and access RBD images.
򐂰 profile rbd-read-only: User can access RBD images in read only mode.

Other deployment reserved profiles exist and are not listed for clarity.

To create a username use the following command:


ceph auth get-or-create client.forrbd mon 'profile rbd' osd 'profile rbd'

2.2 Access methods


Ceph supports the following access methods:
򐂰 librados the native object protocol for direct RADOS access.
򐂰 Block through RADOS Block Devices also known as RBD:
– Mount support via a Linux kernel module (krbd).
– Mount support and boot for QEMU, KVM and OpenStack Cinder via librbd.
򐂰 File through the Ceph File System also known as CephFS:
– Mount support via a Linux kernel module (kceph).
– Mount support via FUSE for non-Linux environments (ceph-fuse).
– NFS support via Ganesha.
򐂰 HTTP(s)/Object through the RADOS Gateway also known as RGW:
– S3 protocol support.
– OpenStack Object Storage Swift support.
– Static web site support.
– Bucket access through NFS via Ganesha.

All access methods except librados automatically stripe the data stored in RADOS into 4 MiB
objects by default, which can be customized. For example, when using CephFS, storing a 1
GiB file from a client perspective results in creating 256 x 4 MiB RADOS objects, each
assigned to a placement group within the pool used by CephFS.

As another example, Figure 2-11 on page 32 represents the RADOS object layout of a 32
MiB RBD Image created in a pool with ID 0 on top of OSDs.


Figure 2-11 Ceph Virtual Block Device layout in a Ceph cluster

Chapter 3, “IBM Storage Ceph main features and capabilities ” on page 33 will provide you
with more details regarding each access method.

2.3 Deployment
Many Ceph deployment tools have existed throughout the Ceph timeline:
򐂰 mkcephfs, historically the first tool.
򐂰 ceph-deploy, starting with Cuttlefish.
򐂰 ceph-ansible, starting with Jewel.
򐂰 cephadm, starting with Octopus.

IBM Storage Ceph documentation details how to use cephadm to deploy your Ceph cluster. A
cephadm-based deployment follows these steps (a minimal bootstrap sketch follows the list):
򐂰 For beginners:
– Bootstrap your Ceph cluster (create one initial Monitor and Manager).
– Add services to your cluster (OSDs, MDSs, RADOS Gateways and so forth).
򐂰 For advanced users:
– Bootstrap your cluster with a complete service file to deploy everything.
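The following is a minimal sketch of bootstrapping a cluster and adding services from the
command line; the IP addresses and hostnames are illustrative:

   # Bootstrap the cluster on the first node (creates one initial Monitor and Manager)
   cephadm bootstrap --mon-ip 192.168.122.10
   # Add more hosts, then deploy OSDs on all available devices
   ceph orch host add ceph-node2 192.168.122.11
   ceph orch apply osd --all-available-devices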

For the guidelines and recommendations about cluster deployment, refer to Chapter 6, “Day 1
and Day 2 operations” on page 121.


Chapter 3. IBM Storage Ceph main features and capabilities
In this chapter we cover the main features and capabilities of IBM Storage Ceph. This chapter
has the following sections:
򐂰 “IBM Storage Ceph access methods” on page 34
򐂰 “IBM Storage Ceph object storage ” on page 34
򐂰 “IBM watsonx.data” on page 46
򐂰 “IBM Storage Ceph block storage” on page 60
򐂰 “IBM Storage Ceph file storage” on page 68


3.1 IBM Storage Ceph access methods


The IBM Storage Ceph cluster is a distributed data object store designed to provide excellent
performance, reliability and scalability. The key capability of IBM Storage Ceph is that it is a
multiprotocol storage system that supports block storage operations, file systems and file
sharing, and object storage APIs. The attributes of these access methods follow:
򐂰 Block storage is implemented within the RADOS Block Device (RBD) to support native
block volume operations by Ceph clients.
򐂰 File services are enabled by the Ceph Metadata Service (ceph-mds) to support POSIX file
operations as well as file sharing via standard protocols (for example, NFS).
򐂰 Object services are provided by the RADOS Gateway (RGW) to support standard
RESTful APIs for object storage operations (for example, S3).

Block storage devices are thin-provisioned, resizable volumes that store data striped over
multiple OSDs. Ceph block devices leverage RADOS capabilities, such as snapshots,
replication, and data reduction. Ceph block storage clients communicate with Ceph clusters
through kernel modules or the librbd library.

Ceph File System (CephFS) is a file system service compatible with POSIX standards and
built on top of Ceph’s distributed object store. CephFS provides file access to internal and
external clients, using POSIX semantics wherever possible. CephFS maintains strong cache
coherency across clients. The goal is for processes that use the file system to behave the
same when they are on different hosts as when they are on the same host. It is easy to
manage and yet meets the growing demands of enterprises for a broad set of applications
and workloads. CephFS can be further extended using industry standard file sharing
protocols such as NFS.

Object Storage is the primary storage solution that is used in the cloud and by on-premises
applications as a central storage platform for large quantities of unstructured data. Object
Storage continues to increase in popularity due to its ability to address the needs of the
world’s largest data repositories. IBM Storage Ceph provides support for object storage
operations via the S3 API with a high emphasis on what is commonly referred to as S3 API
fidelity.

3.2 IBM Storage Ceph object storage


Object storage is a type of data storage that stores data as objects, with each object having a
unique identifier and metadata associated with it. The objects are typically stored in what is
referred to as a bucket and a placement target that maps to a storage pool in the Ceph cluster
and is accessed through a RESTful API. Object storage is different from traditional file and
block storage that store data in a hierarchical file system or as blocks on a storage device.
Object storage stores data in a single flat namespace that is addressed by metadata (the
object name).

The key attributes of object storage are:


򐂰 Object storage enables applications to store, retrieve and delete large amounts of
unstructured data. It is highly scalable and flexible, meaning it can easily handle large
amounts of data and accommodate growth as needed. Objects can be accessed over a
routed IP network from anywhere by any client that supports the RESTful API.
򐂰 Object storage is also designed to be highly durable and fault-tolerant. It inherently uses
the Ceph data protection features such as data replication and erasure coding to ensure
that data is stored redundantly and can be recovered in the event of a failure. This makes
it especially suitable for use cases such as archiving, backup and disaster recovery.
򐂰 Object storage is also cost-effective: it uses commodity hardware, which is less expensive
than the specialized storage hardware used in proprietary storage systems.
򐂰 Object storage systems usually have built-in features such as data versioning, data tiering,
and data lifecycle management.

Object storage can be used for a variety of use cases, including archiving, backup and
disaster recovery, media and entertainment, and big data analytics.

3.2.1 IBM Storage Ceph object storage overview


IBM Storage Ceph uses the Ceph Object Gateway daemon (radosgw) to provide object
storage client access into the cluster. The Ceph Object Gateway is an HTTP server designed
to support a wide variety of client applications and use cases. Also referred to as the RADOS
Gateway (RGW), it provides interfaces that are compatible with both Amazon S3 and
OpenStack Swift, and it has its own user management.

The Ceph Object Gateway can store data in the same Ceph storage cluster that is used to
store data from Ceph Block Device and CephFS clients; however, it would involve separate
pools and likely a different CRUSH hierarchy.
򐂰 The Ceph Object Gateway is a separate service that runs containerized on a Ceph node
and provides object storage access to its clients. In a production environment, we
recommend running more than one instance of the object gateway service on different
Ceph nodes to provide high availability, which the clients access through an IP Load
Balancer. See Figure 3-1.


Figure 3-1 Conceptual view of the Ceph Object Gateway

3.2.2 Use cases for IBM Storage Ceph object storage


Typical application use cases for the Ceph Object Gateway across industries and use cases
include the following examples:
򐂰 Analytics, artificial intelligence, and machine learning data repository. For example,
Hadoop, Spark, and IBM watsonx.data lakehouse.
򐂰 IoT data repository; for example, Sensor data collection for autonomous driving.
򐂰 Secondary storage; for example:
– Active archive: Tiering of inactive data from primary NAS filers.
– Storage for backup data: Leading backup applications have native integration with
Object Storage for longer term retention purposes.
򐂰 Storage for cloud native applications: Object storage is the de-facto standard for cloud
native applications.
򐂰 Event-driven data pipelines.


Industry-specific use cases for the Ceph Object Gateway include the following examples:
򐂰 Healthcare and Life Sciences:
– Medical imaging, such as picture archiving and communication system (PACS) and
magnetic resonance imaging (MRI).
– Genomics research data.
– Health Insurance Portability and Accountability Act (HIPAA) of 1996 regulated data.
򐂰 Media and entertainment (for example, audio, video, images, and rich media content).
򐂰 Financial services (for example, regulated data that requires long-term retention or
immutability).
򐂰 Object Storage as a Service (SaaS) as a catalogue offering (cloud or on-premises).

3.2.3 RADOS Gateway architecture within the Ceph cluster


The IBM Storage Ceph RADOS Gateway (RGW) is ultimately a client of the RADOS system.
It interfaces with RADOS through the LIBRADOS API. The relationship can be implemented
using a few different strategies, which we will discuss in 3.2.4, “Deployment options for IBM
Storage Ceph object storage” on page 39.
򐂰 The RADOS Gateway serves as an internal client for Ceph Storage, facilitating object
access for external client applications. Client applications interact with the RADOS
Gateway through standardized S3 or Swift APIs, while the RADOS Gateway utilizes
librados module calls to communicate with the Ceph cluster.
Figure 3-2 shows the RADOS Gateway within the Ceph cluster.


Figure 3-2 RADOS Gateway within the Ceph cluster

򐂰 The RADOS Gateway stores its data, including user data and metadata, in a dedicated set
of Ceph Storage pools. This specialized storage mechanism ensures efficient data
management and organization within the Ceph cluster.
򐂰 RADOS Gateway Data pools are typically stored on hard disk drives (HDDs) with erasure
coding for cost-effective and high-capacity configurations.
򐂰 Data pools can also utilize solid-state drives (SSDs) in conjunction with erasure coding
(EC) to cater to performance-sensitive object storage workloads.
򐂰 The bucket index pool, which can store millions of bucket index key/value entries (one per
object), is a critical component of Ceph Object Gateway's performance. Due to its
performance-sensitive nature, the bucket index pool should exclusively utilize flash media,
such as SSDs, to ensure optimal performance and responsiveness.

Figure 3-3 on page 39 shows the interaction of these components.


Figure 3-3 Interaction of the Ceph Object Gateway components

3.2.4 Deployment options for IBM Storage Ceph object storage


The IBM Storage Ceph RADOS Gateway can be deployed using several strategies. For
example:
򐂰 Instances of collocated RGW daemons on nodes that are shared with other Ceph
services. See Figure 3-4.

Figure 3-4 RADOS Gateway collocated daemons


򐂰 Instances of non-collocated RGW daemons on nodes that are dedicated to the RGW
service. See Figure 3-5 on page 40.

Figure 3-5 RADOS Gateway non-collocated daemons

򐂰 Multiple instances of the RGW daemons on a node, either collocated or non-collocated
with other Ceph services. See Figure 3-6.


Figure 3-6 Multiple RADOS Gateway daemons on the same node

3.2.5 IBM Storage Ceph object storage topology


IBM Storage Ceph leverages the Ceph Object Gateway daemon (radosgw) to enable object
storage client access into the cluster. This HTTP server, also known as the RADOS Gateway
(RGW), caters to a diverse range of client applications and use cases.

The RADOS Gateway offers interfaces compatible with both AWS S3 and OpenStack Swift,
providing seamless integration with existing cloud environments. Additionally, it features an
Admin operations RESTful API for automating day-to-day operations, streamlining
management tasks.

The Ceph Object Gateway constructs of a Realm, Zone Groups, and Zones are used to
define the organization of a storage network for purposes of replication and site protection.

A deployment in a single data center can be very simple to install and easy to manage. In
recent feature updates, the Dashboard UI has been enhanced to provide a single point of
control for the startup and ongoing management of the Ceph Object Gateway in a single data
center. In this case, there is no need to define Zones and Zone Groups and a minimized
organization is automatically created.

For deployments that involve multiple data centers and multiple IBM Storage Ceph clusters, a
more detailed configuration is required with granular control using the cephadm CLI or the
dashboard UI. In these scenarios, the Realm, Zone Group and Zones must be defined by the
storage administrator and configured accordingly. The IBM Storage Ceph documentation fully
describes these constructs in great detail.


Figure 3-7 on page 42 shows a design supporting a multiple data center implementation for
replication and disaster recovery.

Figure 3-7 Ceph Object Gateway multi-site constructs

To avoid single points of failure in our Ceph RGW deployment, we need to provide an
S3/RGW endpoint that can tolerate the failure of one or more RGW services. RGW is a RESTful
HTTP endpoint that can be load-balanced for HA and increased performance. There are some
great examples of different RadosGW load-balancing mechanisms in this repo.

Starting with IBM Storage Ceph 5.3, Ceph provides an HA and load-balancing stack called
ingress based on keepalived and haproxy. The ingress service allows you to create a
high-availability endpoint for RGW with a minimum set of configuration options.

The orchestrator will deploy and manage a combination of haproxy and keepalived to balance
the load on a floating virtual IP. If SSL is used, then SSL must be configured and terminated
by the ingress service, not RGW itself.
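The following is a minimal sketch of an ingress service specification for an existing RGW
service; the service names, virtual IP, and ports are illustrative and must be adapted to your
environment. The file can be applied with ceph orch apply -i <file>.yaml.

   service_type: ingress
   service_id: rgw.s3service
   placement:
     count: 2
   spec:
     backend_service: rgw.s3service     # the RGW service to load balance
     virtual_ip: 192.168.122.100/24     # floating virtual IP managed by keepalived
     frontend_port: 8080                # port exposed to S3 clients by haproxy
     monitor_port: 1967                 # haproxy status page port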

3.2.6 IBM Storage Ceph object storage key features


The RADOS Gateway supports a wide range of advanced features specific to object storage
and the S3 API, while taking advantage of the capabilities of the IBM Storage Ceph cluster
itself.

Ceph Object Gateway client interfaces


The Ceph Object Gateway supports three interfaces:

S3 compatibility
S3 compatibility provides object storage functionality with an interface that is compatible with
a large subset of the AWS S3 RESTful API.


Swift compatibility
Provides object storage functionality with an interface that is compatible with a large subset of
the OpenStack Swift API.

The S3 and Swift APIs share a common namespace, so you can write data with one API and
retrieve it with the other. The S3 namespace can also be shared with NFS clients to offer a
true multiprotocol experience for unstructured data use cases.

Administrative API
Provides an administrative restful API interface for managing the Ceph Object Gateways.

Administrative API requests are done on a URI that starts with the admin resource end point.
Authorization for the administrative API mimics the S3 authorization convention. Some
operations require the user to have special administrative capabilities. The response type can
be either XML or JSON by specifying the format option in the request, but defaults to the
JSON format.

Ceph Object Gateway features


The Ceph Object Gateway supports the following key features. It is not an exhaustive list but
rather addresses the interests expressed by customers, business partners, and Ceph
champions. The key features are listed below.

Management
The Ceph Object Gateway can be managed using the Ceph Dashboard UI, the Ceph
command line (cephadm), the Administrative API mentioned above, and through service
specification files.

Authentication and authorization


IBM Storage Ceph seamlessly integrates with Security Token Service (STS) to enable
authentication of object storage users against your enterprise identity provider (IDP), such as
LDAP or Active Directory, through an OpenID Connect (OIDC) provider. This integration
leverages the STS to issue temporary security credentials, ensuring secure access to your
object storage resources.

IBM Storage Ceph object storage further enhances security by incorporating IAM
compatibility for authorization. This feature introduces IAM Role policies, empowering users
to request and assume specific roles during STS authentication. By assuming a role, users
inherit the S3 permissions configured for that role by an RGW administrator. This role-based
access control (RBAC) or attribute-based access control (ABAC) approach enables granular
control over user access, ensuring that users only access the resources they need.

Data at rest encryption


IBM Storage Ceph provides support for HTTP Secure (HTTPS), data at rest encryption, FIPS
140-2 compliance, and over-the-wire encryption with Messenger v2.1. Data at rest encryption
is further supported by server-side encryption (SSE), including the SSE-C, SSE-KMS, and
SSE-S3 variants.

Immutability
IBM Storage Ceph Object storage also supports the S3 Object Lock API in both Compliance
and Governance modes. Ceph has been certified or passed the compliance assessments for
SEC 17a-4(f), SEC 18a-6(e), FINRA 4511(c) and CFTC 1.31(c)-(d). Certification documents
will be available upon their publication.


Archive zone
The archive zone uses the multi-site replication and S3 object versioning features; the archive
zone keeps all versions of all the objects available even when they are deleted in the
production site.

An archive zone provides you with a history of versions of S3 objects that can only be
eliminated through the gateways associated with the archive zone. Including an archive zone
in your multisite zone replication setup gives you the convenience of an S3 object history
while saving space that replicas of the versioned S3 objects would consume in the production
zones.

You can control the storage space usage of the archive zone through bucket lifecycle policies,
where you can define the number of versions you would like to keep for each object.

An archive zone helps protect your data against logical or physical errors. It can save users
from logical failures, such as accidentally deleting a bucket in the production zone. It can also
save your data from massive hardware failures, like a complete production site failure.
Additionally, it provides an immutable copy, which can help build a ransomware protection
strategy.

Security
Data access auditing is available by enabling the RadosGW OPS logs feature, and Multi-Factor
Authentication for Delete is supported. Support for the Secure Token Service (STS) helps
avoid the use of long-lived S3 keys by providing temporary and limited-privilege credentials.
The S3 HTTP endpoint provided by the RGW services can be secured with TLS/SSL; both
external SSL certificates and self-signed SSL certificates are supported.

Replication
IBM Ceph Object Storage provides enterprise-grade, highly mature object geo-replication
capabilities. The RGW multi-site replication feature facilitates asynchronous object replication
across single or multi-zone deployments. Leveraging asynchronous replication with eventual
consistency, Ceph Object Storage operates efficiently over WAN connections between
replicating sites.

With the latest 6.1 release, Ceph Object Storage introduces granular bucket-level replication,
unlocking a plethora of valuable features. Users can now enable or disable sync per individual
bucket, enabling precise control over replication workflows. This empowers full-zone
replication while opting out specific buckets, replicating a single source bucket to
multi-destination buckets, and implementing both symmetrical and directional data flow
configurations.

Figure 3-8 on page 45 shows the IBM Ceph Object Storage replication feature.


Figure 3-8 IBM Ceph Object Storage replication feature

Storage policies
Ceph Object Gateway supports different storage classes for the placement of internal RGW
data structures. For example, SSD storage pool devices are recommended for index data,
while HDD storage pool devices can be targeted for high capacity bucket data. Ceph Object
Gateway also supports storage lifecycle policies and transitions for data placement on tiers of
storage depending on the age of the content.

Transitions across storage classes as well as protection policies (for example, replicas,
erasure coding) are supported. Ceph Object Gateway also supports policy-based data
archiving to public cloud object storage; as of the date of this publication, archiving to Amazon
S3 and Microsoft Azure is supported. Archiving to IBM Cloud Object Storage is also under
consideration as a roadmap item.

Figure 3-9 shows the Ceph Object Gateway storage policies.


Figure 3-9 Storage policies

Multiprotocol
Ceph Object Gateway supports a single unified namespace for S3 client operations (S3 API)
and NFS client operations (NFS Export) to the same bucket. This provides a true
multiprotocol experience for a variety of use cases, particularly in situations where
applications are being modernized from traditional file sharing access to native object storage
access. IBM recommends limiting the use of S3 and NFS to the same namespace in use
cases such as data archives, rich content repositories, or document management stores; that
is, use cases where the files and objects are unchanging by design. Multiprotocol is not
recommended for live collaboration use cases where multiuser modifications to the same
content is required.

IBM watsonx.data
IBM Storage Ceph is a perfect candidate for a data lake or data lakehouse. As an example,
watsonx.data includes an IBM Storage Ceph license so it can be used out of the box when
deploying watsonx.data; as a result, the integration and level of testing between watsonx.data
and IBM Storage Ceph is first class. Some of the features that make Ceph a great match for
watsonx.data are S3 Select, table encryption, IDP authentication with STS, and datacenter
caching with D3N.

S3 Select is a recent innovation that extends object storage to semi-structured use cases. An
example of a semi-structured object is one that contains comma-separated values (CSV),
JSON, or Parquet file formats. S3 Select allows a client to GET a subset of the object content
by using SQL-like arguments to filter the resulting payload. Ceph Object Gateway currently
supports S3 Select for alignment with data lakehouse storage with IBM watsonx.data clients.
At the time of publication, S3 Select supports the CSV, JSON, and Parquet formats.
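As a minimal sketch of S3 Select against the RADOS Gateway using the AWS CLI, the following
command filters a CSV object server-side; the endpoint, profile, bucket, and object key are
illustrative:

   aws --endpoint-url=http://ceph-node1:80 --profile=ceph \
       s3api select-object-content \
       --bucket s3-bucket-1 \
       --key data.csv \
       --expression "SELECT s._1 FROM S3Object s" \
       --expression-type SQL \
       --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
       --output-serialization '{"CSV": {}}' \
       /tmp/select-output.csv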

Datacenter-Data-Delivery Network (D3N) uses high-speed storage, such as NVMe flash or
DRAM, to cache datasets on the access side. D3N improves the performance of big-data jobs
running in analysis clusters by speeding up recurring reads from the data lake.


Bucket features
Ceph supports advanced bucket features such as S3 bucket policy, S3 object versioning, S3
object lock, rate limiting, bucket object count quotas, and bucket capacity quotas. In addition to
these advanced bucket features, Ceph Object Gateway boasts impressive scalability that
empowers organizations to store massive amounts of data with ease and efficiency.

Note: In a 6-server, 3-enclosure configuration (see
https://www.redhat.com/en/resources/data-solutions-overview), Ceph Object
Gateway has been demonstrated to support 250 million objects in a single bucket and 10
billion objects overall.

Bucket notifications
Ceph Object Gateway supports bucket event notifications, a crucial feature for event-driven
architectures and widely used when integrating with OpenShift Data Foundation (ODF)
externally. Notifications enable real-time event monitoring and triggering of downstream
actions, such as data replication, alerting, and workflow automation.

Note: You can refer to “Chapter 4 - S3 bucket notifications for event-driven architectures”
in the IBM Redpaper IBM Storage Ceph Solutions Guide, REDP-5715 for a detailed
discussion of the event-driven architectures.

3.2.7 Ceph Object Gateway deployment step-by-step


In this section, we will demonstrate how to configure the Ceph Object Gateway, also referred
to as the RADOS Gateway, and explore the S3 object storage client experience. We present
these pages to emphasize the simplicity of presenting Ceph storage for S3 object access in a
matter of minutes.

Start the RGW service


Perform the following steps to start the RGW service:
1. From the Ceph Dashboard, in the navigation panel, expand the Cluster group and select
Services. The Services page is displayed along with the currently running Ceph services
(daemons). Click the Create button to configure and start an instance of the RGW service.
See Figure 3-10.


Figure 3-10 Ceph cluster services screen

2. In the Create Service dialog box, enter values similar to those as shown below and click
Create Service. See Figure 3-11.

Figure 3-11 Ceph Object gateway service startup screen


Recommended RGW configuration values for a getting started experience:


򐂰 Type: RGW
򐂰 Id: s3service
򐂰 Placement: Hosts
򐂰 Host: <ceph node hostname>
򐂰 Count:1
򐂰 Port: 80

The running RGW service can be observed in any of the following dashboard locations:
򐂰 Ceph Dashboard home page → Object Gateways section
򐂰 Ceph Dashboard → Cluster → Services
򐂰 Ceph Dashboard → Object Gateway - Daemons
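As an alternative sketch, the same RGW service can be created from the command line with
the orchestrator; the service id, hostname, and port mirror the values above and are
illustrative:

   ceph orch apply rgw s3service --placement="1 ceph-node1" --port=80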

Create an Object Gateway user


Perform the following steps to create an Object Gateway user.
1. From the Ceph Dashboard, in the navigation panel, expand the Object Gateway group and
select Users. The Object Gateway Users page is displayed as below. Click Create and
proceed to the create user dialogue. See Figure 3-12 on page 49.

Figure 3-12 Ceph Object Gateway user listing

2. In the Create User dialogue, enter the required values using Figure 3-13 as a guide.
When finished, click the Create User button.


Figure 3-13 Ceph Object Gateway create user screen
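The same user can alternatively be created from the command line; the uid, display name,
and email values are illustrative:

   radosgw-admin user create --uid=john --display-name="Shubeck, John" [email protected]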

Create an Object Gateway S3 bucket


Perform the following steps to create an Object Gateway S3 bucket.
1. From the Ceph Dashboard, in the navigation panel, expand the Object Gateway group and
select Buckets. The Object Gateway Buckets page is displayed as shown in Figure 3-14.
Click Create and proceed to the create bucket dialogue.


Figure 3-14 Ceph Object Gateway bucket listing

2. In the Create Bucket dialogue, enter the required values using Figure 3-15 on page 51 as
a guide. When finished, click the Create Bucket button.

Figure 3-15 Ceph Object Gateway create bucket screen

Recommended Ceph Object gateway bucket configuration values:


򐂰 Name:[ Any bucket name you prefer]


򐂰 Owner:[ The User you created in the previous step]
򐂰 Placement: Accept the default value

3.2.8 Ceph Object Gateway client properties


In this section, we will explore the S3 API client experience to PUT and GET a file into the
Object Gateway bucket we created in the previous step.

Verify the Ceph Object Gateway configuration


We will first navigate through the three Object Gateway configuration pages as shown in the
following figures. If any of the settings are not configured, then return to 3.2.7, “Ceph Object
Gateway deployment step-by-step” on page 47 to finish the RADOS Gateway configuration
tasks.

Figure 3-16 on page 52 shows the Ceph Object gateway services listing.

Figure 3-16 Ceph Object Gateway services listing

Figure 3-17 on page 53 shows the Ceph Object Gateway user listing.


Figure 3-17 Ceph Object Gateway user listing

Figure 3-18 on page 53 shows the Ceph Object Gateway bucket listing.

Figure 3-18 Ceph Object Gateway bucket listing


Obtain the S3 API access keys


The S3 client access needs a set of credentials that are called the Access Key and Secret
Access Key. There are two ways in Ceph to view or generate these keys. We will show each of them
below.
1. From the Ceph Dashboard, navigate to the Object Gateway and then Users section.
Select a user to display and select Show Keys. If multiple keys exist, select one key to
show. The Access Key and Secret Key can be hidden or shown as readable. See
Figure 3-19 on page 54.

Figure 3-19 Display S3 access key and secret key for a selected user

2. Alternatively, obtain the S3 access keys using the Ceph command line. The
radosgw-admin command can be run from the shell, or within the cephadm CLI. See
Example 3-1.

Note: Substitute the RGW username you created for “john” in the previous section. The
access key and secret key will differ and be unique for each Ceph cluster.

Example 3-1 Ceph Object Gateway S3 access key and secret key
[root@node1 ~]# radosgw-admin user info --uid="john"
{
"user_id": "john",
"display_name": "Shubeck, John",
"email": "[email protected]",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [
{
"user": "john",

"access_key": "LGDR3IJB94XZIV4DM7PZ",
"secret_key": "qHAW3wdLGgGh78pyz8pigjxVeoM1sz1HT6lIdYD3"
}
],

. . . output omitted . . .
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}

3.2.9 Configuring the AWS CLI client


The S3 client access needs a set of credentials that are called the Access Key and Secret
Access Key. In this next step, we will copy/paste the keys that we obtained in the previous
step to the AWS CLI client configuration tool.

It is worth noting here that there is nothing remarkable about an object. If a unit of data, for
example a .JPG image, is stored in a file system, then we refer to it as a file. If at some point
that file is uploaded or PUT into an object storage bucket, we are likely to refer to it as an
object. Regardless of where it is stored, there is nothing that changes the nature of that
image. The binary data within the file or object, and the image that can be rendered from it,
are unchanged.
1. Configure the AWS CLI tool to use the client credentials. Enter the access key and secret
key that the system generated for you in the previous step. See Example 3-2.

Note: The access key and secret key will differ and be specific to each Ceph cluster.

Example 3-2 - AWS CLI configuration dialogue


[root@client ~]#

[root@client ~]# aws configure --profile=ceph


AWS Access Key ID [None]: LGDR3IJB94XZIV4DM7PZ
AWS Secret Access Key [None]: qHAW3wdLGgGh78pyz8pigjxVeoM1sz1HT6lIdYD3
Default region name [None]: <enter>
Default output format [None]: <enter>
[root@client ~]#

2. The AWS CLI client uses the S3 bucket as the repository to write and read objects. The
action of uploading a local file to an S3 bucket is called a PUT, and the action of
downloading an object from an S3 bucket to a local file is called a GET. The syntax of the
various S3 API clients might use different commands, but at the underlying S3 API layer is
always a PUT and a GET. See Example 3-3.


Example 3-3 AWS CLI list buckets


[root@client ~]# aws --endpoint-url=http://ceph-node1:80 \
--profile=ceph s3 ls
2023-07-05 09:09:45 s3-bucket-1
[root@client ~]#

3. Create a new S3 bucket from the AWS CLI client (optional). See Example 3-4 on page 56.

Note: The endpoint value should use the hostname of a node where a Ceph Object Gateway
daemon is running.

Example 3-4 AWS CLI put bucket and list buckets


[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3api create-bucket --bucket s3-bucket-2

[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 ls
2023-07-05 09:09:45 s3-bucket-1
2023-07-05 11:47:26 s3-bucket-2
[root@client ~]#

4. Create a 10 MB file called 10MB.bin. Upload the file to one of the S3 buckets. See
Example 3-5.

Example 3-5 AWS CLI put object from file


[root@client ~]# dd if=/dev/zero of=/tmp/10MB.bin bs=1024K count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0115086 s, 911 MB/s

[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 cp /tmp/10MB.bin s3://s3-bucket-1/10MB.bin
upload: ../tmp/10MB.bin to s3://s3-bucket-1/10MB.bin

5. Get a bucket listing to view the test object. Download the object to a local file. See
Example 3-6.

Example 3-6 AWS CLI list objects and get object to file
[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 ls s3://s3-bucket-1
2023-07-05 16:55:39 10485760 10MB.bin

[root@client ~]# aws --endpoint-url=http://ceph-node3:80 \
--profile=ceph s3 cp s3://s3-bucket-1/10MB.bin /tmp/GET-10MB.bin
download: s3://s3-bucket-1/10MB.bin to ../tmp/GET-10MB.bin

6. Verify the data integrity of the uploaded and downloaded files. See Example 3-7.

Example 3-7 Verify the file vs. object MD5 checksum


[root@client ~]# diff /tmp/10MB.bin /tmp/GET-10MB.bin
[root@client ~]#


[root@client ~]# openssl dgst -md5 /tmp/10MB.bin


MD5(/tmp/10MB.bin)= f1c9645dbc14efddc7d8a322685f26eb
[root@client ~]# openssl dgst -md5 /tmp/GET-10MB.bin
MD5(/tmp/GET-10MB.bin)= f1c9645dbc14efddc7d8a322685f26eb
[root@client ~]#

3.2.10 The S3 API interface


IBM Storage Ceph Object Storage S3 compatibility provides object storage functionality with
an interface that is highly compatible with the Amazon S3 RESTful API.

S3 API compatibility
As a developer, you can use a RESTful application programming interface (API) that is
compatible with the Amazon S3 data access model. It is through the S3 API that object
storage clients and applications store, retrieve, and manage the buckets and objects stored in
an IBM Storage Ceph cluster. IBM Storage Ceph, and the broader Ceph community,
continue to invest heavily in a design goal referred to as “S3 Fidelity”. This means that clients,
and in particular Independent Software Vendors (ISVs), retain independence and
portability for their applications across S3 vendors in a hybrid multi-cloud.

At a high level, the supported S3 API features in IBM Storage Ceph Object Storage are:
򐂰 Basic bucket operations (PUT, GET, LIST, HEAD, DELETE).
򐂰 Advanced bucket operations (Bucket policies, website, lifecycle, ACLs, versions).
򐂰 Basic object operations (PUT, GET, LIST, POST).
򐂰 Advanced object operations (object lock, legal hold, tagging, multipart, retention).
򐂰 S3 select operations (CSV, JSON, Parquet formats).
򐂰 Support for both virtual hostname and pathname bucket addressing formats.

At the time of publication, the following figures show the supported S3 API bucket and object
operations.

For more information see IBM Documentation.
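
As a brief illustration of one of the advanced object operations listed above, the following sketch applies and then reads object tags with the AWS CLI. The endpoint, profile, bucket, and object names follow the earlier examples in this chapter and are assumptions; adapt them to your environment.

[root@client ~]# aws --endpoint-url=http://ceph-node3:80 --profile=ceph \
    s3api put-object-tagging --bucket s3-bucket-1 --key 10MB.bin \
    --tagging 'TagSet=[{Key=project,Value=redbook}]'
[root@client ~]# aws --endpoint-url=http://ceph-node3:80 --profile=ceph \
    s3api get-object-tagging --bucket s3-bucket-1 --key 10MB.bin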

Figure 3-20 on page 58 shows the S3 bucket operations.


Figure 3-20 S3 API bucket operations

Figure 3-21 on page 59 shows the S3 object operations.


Figure 3-21 S3 API object operations

3.2.11 Conclusion
IBM Storage Ceph Object Storage provides a scale-out, high-capacity object store for S3 API
and Swift API client operations.

The Ceph service at the core of object storage is the RADOS Gateway (RGW). The Ceph
Object Gateway serves its clients through S3 endpoints, which are Ceph nodes where
RGW instances run and in turn service S3 API requests on well-known TCP ports
via HTTP and HTTPS.


Because the Ceph OSD nodes and the Ceph Object Gateway nodes can be deployed
separately, the cluster offers the ability to independently scale bandwidth and capacity across
a broad range of object storage workloads and use cases.

3.2.12 References on the World Wide Web


You are invited to explore the wealth of Ceph documentation and resources that can be found
on the World Wide Web. A brief list of recommended reading follows.
򐂰 IBM Storage Ceph documentation:
https://www.ibm.com/docs/en/storage-ceph/6?topic=developer-ceph-object-gateway-s3-api
򐂰 Community Ceph documentation:
https://docs.ceph.com/en/quincy/radosgw/
򐂰 AWS CLI documentation:
https://docs.aws.amazon.com/cli/index.html

3.3 IBM Storage Ceph block storage

Note: This section provides a brief overview of the block storage feature in Ceph. All the
following points, and many others, are documented in more detail in the IBM Storage
Ceph documentation.

IBM Storage Ceph block storage, also commonly referred to as RBD (RADOS Block Device)
images, is a distributed block storage system that allows for the management and
provisioning of block storage volumes, similar to traditional storage area networks (SANs) or
direct-attached storage (DAS).

RBD images can be accessed either through a kernel module (for Linux and Kubernetes) or
through the librbd API (for OpenStack and Proxmox). In the Kubernetes world, RBD images
are well-suited for Read Write Once (RWO) Persistent Volume Claims (PVCs).

RBD access architecture is shown in Figure 3-22 on page 61.


Figure 3-22 RBD access

The virtual machine (VM), through the virtio-blk driver and the Ceph library, accesses an RBD
image as if it were a physical drive directly attached to the VM.

Many virtualization solutions are supported:


򐂰 Libvirt
򐂰 OpenStack (Cinder, Nova, Glance)
򐂰 Proxmox, CloudStack, Nebula and so forth

Librbd access method


The librbd access method runs in user space and therefore can leverage all existing RBD
features, such as RBD mirroring. However, running in user space means that the Linux page
cache cannot be used; librbd instead uses its own in-memory caching, known as RBD
caching. librbd honors O_SYNC and O_DSYNC flags on I/O requests as well as flush requests
initiated by RBD clients. This means that using write-back caching is just as safe as using
physical hard disk caching with a VM that properly sends flushes (for example, Linux kernel
>= 2.6.32). The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode
it can coalesce contiguous requests for better throughput.

The RBD cache is enabled in write-back mode by default on the Ceph client machine, but it can be
set to write-through mode.

The following parameters can be used to control each librbd client caching:
򐂰 rbd_cache - true or false (defaults to true)
򐂰 rbd_cache_size - Cache size in bytes (defaults to 32MiB per RBD Image)


򐂰 rbd_cache_max_dirty - Max dirty bytes (defaults to 24 MiB; set to 0 for write-through mode)
򐂰 rbd_cache_target_dirty - Dirty bytes to start preemptive flush (defaults to 16MiB)

Krbd access method


On a Linux client machine, the kernel module maps the RBD block device to a kernel block
device, typically represented by the device file /dev/rbdX, where X is a number assigned to
the RBD image. The kernel block device appears as a regular block device in the Linux
system. The kernel RBD driver offers superior performance, but not the same level of
functionality. For example, it does not support RBD mirroring with journaling mode.

3.3.1 Managing RBD images


Ceph RBD images can be managed either through the user-friendly Ceph Dashboard or
using the powerful Ceph rbd command-line tool. While the Dashboard simplifies basic
operations like image creation, including striping settings, QoS, and namespace or trash
management, the rbd command provides extensive control and flexibility for in-depth RBD
image management.

The following list provides a summary of the main rbd commands; a brief usage sketch follows the list.

򐂰 rbd create - Create an RBD image
򐂰 rbd rm - Delete an RBD image
򐂰 rbd ls - List the RBD images in a pool
򐂰 rbd info - View RBD image parameters
򐂰 rbd du - View the space used by an RBD image
򐂰 rbd snap - Create a snapshot of an RBD image
򐂰 rbd clone - Create a clone based on an RBD image snapshot
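
The following is a minimal sketch of these commands in use; the pool name (rbd_pool) and image name (vm-disk-1) are assumptions:

rbd create rbd_pool/vm-disk-1 --size 10G   # create a 10 GiB thin-provisioned image
rbd ls rbd_pool                            # list the images in the pool
rbd info rbd_pool/vm-disk-1                # show size, object size, and enabled features
rbd du rbd_pool/vm-disk-1                  # show provisioned versus actually used space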

3.3.2 Snapshots and clones


In this section we discuss snapshots and clones.

Snapshots
RBD images, like many storage solutions, can have snapshots, which are very convenient for
data protection, testing and development, and virtual machine replication. RBD snapshots
capture the state of an RBD image at a specific point in time using the Copy-On-Write (COW)
technology and IBM Storage Ceph supports up to 512 snapshots per RBD image (the
number is technically unlimited, but the volume's performance is negatively affected).

They are read-only and are used to keep track of changes made to the original RBD image,
so it is possible to roll back to the state of the RBD image at the snapshot creation time. This
also means that snapshots cannot be modified. To make modifications, snapshots must be
used in conjunction with clones.

Clones
Snapshots support Copy-On-Write (COW) clones, also known as snapshot layering, which
are new RBD images that share data with the original image or snapshot (parent). Using this
feature, many writable (child) clones can be created from one single snapshot (theoretically
no limit), allowing for rapid provisioning of new block devices that are initially identical to the
source data. For example, you might create a block device image with a Linux VM written to it.


Then, snapshot the image, protect the snapshot, and create as many clones as you like. A
snapshot is read-only, so cloning a snapshot simplifies semantics—making it possible to
create clones rapidly. See Figure 3-23 on page 63.

Figure 3-23 Ceph Block Device layering

Because clones rely on a parent snapshot, losing the parent snapshot will cause all child
clones to be lost. Therefore, the parent snapshot must be protected before creating clones.
See Figure 3-24.

Figure 3-24 Create a clone from protected snapshot

Clones are essentially new images, so they can be snapshotted, resized, or renamed. They
can also be created on a separate pool for performance or cost reasons. Finally, clones are
storage-efficient because only the modifications made to them are stored on the cluster.
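
As a hedged sketch of the workflow described above, the following rbd commands take a snapshot, protect it, and clone it; the pool, image, snapshot, and clone names are assumptions:

rbd snap create rbd_pool/golden-image@base               # capture a point-in-time snapshot
rbd snap protect rbd_pool/golden-image@base              # protect the parent so clones cannot lose it
rbd clone rbd_pool/golden-image@base rbd_pool/vm-clone-01   # create a writable COW child image
rbd flatten rbd_pool/vm-clone-01                         # optional: copy parent data to detach the clone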

3.3.3 RBD write modes


The typical write mode uses two types of objects:
򐂰 A header, which is a small object containing RBD image metadata:
– Image Name - A character string that supports Unicode.
– Image Order - Also known as Image Object Size (defaults to 22, which corresponds to a 4 MiB object size).
– Image Size - The size of the block device.
– Stripe Unit - To configure librbd striping (defaults to 4 MiB).
– Stripe Count - To configure librbd striping (defaults to 1).
– Format - To control the image format (defaults to 2).
– Image features - Documented in “Features of RBD images” on page 65.
򐂰 The content of an RBD block image is an array of blocks that are striped into RADOS
objects stored in 4 MB chunks by default, distributed across the Ceph cluster. These
objects are only created when data is written, so no space is used until the block image is
written to. Either replicated or erasure-coded pools can be used to store these objects.

Journaling mode can also be enabled to store data. In this mode, all writes are sent to a
journal before being stored as an object. The journal contains all recent writes and metadata


changes (device resizing, snapshots, clones, and so on). Journaling mode is intended to be
used for RBD mirroring.

3.3.4 RBD mirroring


RBD mirroring is an asynchronous process of replicating Ceph Block Device images between
two or more Ceph storage clusters. Journal-based mirroring leverages the journaling write
mode. It is managed by a RBD mirroring daemon that runs on each Ceph cluster. The
daemon on the mirrored Ceph cluster monitors the changes in the journal and sends them to
the daemon on the mirror Ceph cluster, which replicates the changes to the copy image
stored on that cluster.

This copy is an asynchronous, point-in-time, crash-consistent copy, meaning that if the


mirrored cluster crashes, the copy will only miss the last few operations that were not yet
replicated. The copy will behave like a clone of the original RBD image, including not only the
data, but also all snapshots and clones that were created.

RBD mirroring also supports full lifecycle, meaning that after a failure, the synchronization
can be reversed and the original site can be made the primary site again once it is back
online. Mirroring can be set at the image level, so not all cluster images have to be replicated.

RBD mirroring can be one-way or two-way. With one-way replication, multiple secondary sites
can be configured. Figure 3-25 shows one-way RDB mirroring.

Figure 3-25 One-way RBD mirroring

Two-way mirroring limits replication to two storage clusters. Figure 3-26 shows two-way RBD
mirroring.


Figure 3-26 Two-way RBD mirroring

Another way of mirroring RBD images is to use snapshot-based mirroring. In this method, the
remote cluster site monitors the data and metadata differences between two snapshots
before copying the changes locally. Unlike the journal-based method, the snapshot-based
method must be scheduled or launched manually, so its recovery point is less granular, but
replication might also be faster.

Figure 3-27 shows snapshot-based RBD mirroring.

Figure 3-27 Snapshot RBD mirroring
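
As a minimal sketch, assuming a pool named rbd_pool and an image named vm-image, snapshot-based mirroring might be enabled and scheduled as follows (an rbd-mirror daemon and cluster peering are also required, as described above):

rbd mirror pool enable rbd_pool image                     # enable per-image mirroring on the pool
rbd mirror image enable rbd_pool/vm-image snapshot        # select snapshot-based mirroring for the image
rbd mirror snapshot schedule add --pool rbd_pool --image vm-image 1h   # take a mirror snapshot every hour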

Note: For more details about mirroring, refer to the IBM Storage Ceph documentation.

3.3.5 Other RBD features


RBD offers many other notable and useful features.

Features of RBD images


The following lists the features of RBD images:
򐂰 Layering refers to the copy-on-write clones of block devices.


򐂰 Striping spreads data across multiple objects. Striping helps with parallelism for sequential
read and write workloads.
򐂰 Exclusive locks prevent multiple processes from accessing RBD images at the same time
in an uncoordinated fashion. This helps to address the write conflict situation that can
occur when multiple clients try to write to the same object, which can be the case in
virtualization environments or when using RBD mirroring to avoid simultaneous access to
the journal.
This feature is enabled by default on new RBD devices. It can be disabled, but other
features that rely on it may be affected.
This feature is mostly transparent to the user. When a client attempts to access an RBD
device, it requests an exclusive lock on the device. If another client already has a lock on
the device, the lock request is denied. The client holding the lock is requested to release it
when its write is done, allowing the other client to access the device.
򐂰 Object map support depends on exclusive lock support. Block devices are thin
provisioned, meaning they only store data that actually exists. Object map support tracks
which objects actually exist (have data stored on a drive). Enabling object map support
speeds up I/O operations for cloning, importing and exporting sparsely populated images,
and deleting.
򐂰 Fast-diff support depends on object map support and exclusive lock support. It adds
another property to the object map, making it much faster to generate diffs between
snapshots of an image and determine the actual data usage of a snapshot.
򐂰 Deep-flatten enables RBD flatten to work on all snapshots of an image, in addition to the
image itself. Without deep-flatten, snapshots of an image rely on the parent, so the parent
cannot be deleted until the snapshots are deleted. Deep-flatten makes a parent
independent of its clones, even if they have snapshots.
򐂰 Journaling support depends on exclusive lock support. Journaling records all
modifications to an image in the order they occur. RBD mirroring utilizes the journal to
replicate a crash consistent image to a remote cluster.

Encryption
Using the Linux Unified Key Setup (LUKS) 1 or 2 format, RBD images can be encrypted if they are used
with librbd (krbd is not supported yet). The format operation persists the encryption metadata
to the RBD image. The encryption key is secured by a passphrase provided by the user to
create and to access the device once it is encrypted.

Note: For more information about encryption, refer to the IBM Storage Ceph
documentation.

Quality of service
Using librbd, it is possible to limit per-image I/O using parameters that are disabled by default
and operate independently of each other, meaning that write IOPS can be limited while read
IOPS are not. These parameters are:
򐂰 IOPS: number of I/Os per second (any type of I/O)
򐂰 read IOPS: number of read I/Os per second
򐂰 write IOPS: number of write I/Os per second
򐂰 bps: bytes per second (any type of I/O)
򐂰 read bps: bytes per second read
򐂰 write bps: bytes per second written


These settings can be configured at the image creation time or any time after, using either the
rbd command line tool or the Dashboard.
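
For example, the following is a hedged sketch of setting per-image limits with the rbd command line; the pool name, image name, and limit values are assumptions:

rbd config image set rbd_pool/vm-disk-1 rbd_qos_iops_limit 2000            # cap total IOPS for the image
rbd config image set rbd_pool/vm-disk-1 rbd_qos_write_bps_limit 104857600  # cap writes at 100 MiB/s
rbd config image list rbd_pool/vm-disk-1 | grep qos                        # review the effective QoS settings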

Namespace isolation
Namespace isolation allows you to restrict client access to different private namespaces
using their authentication keys. In a private namespace, clients can only see their own RBD
images and cannot access the images of other clients in the same RADOS pool but located in
a different namespace.

You can create namespaces and configure images at image creation using the rbd
command-line tool or the Dashboard.
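
A hedged sketch of namespace isolation follows; the pool, namespace, and client names are assumptions:

rbd namespace create --pool rbd_pool --namespace project-a       # create a private namespace in the pool
rbd create --size 10G rbd_pool/project-a/app-disk-1              # create an image inside that namespace
ceph auth get-or-create client.project-a \
    mon 'profile rbd' \
    osd 'profile rbd pool=rbd_pool namespace=project-a'          # key restricted to the project-a namespace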

Live migration of images


You can live-migrate RBD images between different pools, or even within the same pool, on
the same storage cluster. You can also migrate between different image formats and layouts,
and even from external data sources.

When live migration is initiated, the source image is deep-copied to the destination image,
preserving all snapshot history and sparse allocation of data where possible. The live
migration requires creating a target image that can maintain read and write access to the data
while the linked source image is marked read-only. Once the background migration is
complete, you can commit the migration, removing the cross-link and deleting the source
(unless import-only mode is used). You can also cancel the migration, removing the cross-link
and deleting the target image.
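
A minimal sketch of the live migration steps with the rbd command line, assuming source and destination pools named pool-a and pool-b:

rbd migration prepare pool-a/image1 pool-b/image1   # create the target and cross-link it to the source
rbd migration execute pool-b/image1                 # deep-copy the data in the background
rbd migration commit pool-b/image1                  # remove the cross-link once the copy is complete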

Import and export images


Using the rbd export and rbd import commands, it is possible to copy images from one cluster to
another over an SSH connection. Import and export operations can also leverage snapshots
using the export-diff, import-diff, and merge-diff commands. The export-diff command
generates a new snapshot file containing the changes between two snapshots. This snapshot
file can then be used with the import-diff command to apply the changes to an image on a
remote Ceph cluster, or with the merge-diff command to merge two continuous snapshots
into a single one.
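
The following hedged sketch shows a full export/import and an incremental transfer between two snapshots; the image, snapshot, and file names are assumptions:

rbd export rbd_pool/image1 /tmp/image1.img                                  # full export to a local file
rbd import /tmp/image1.img rbd_pool/image1-copy                             # import the file as a new image
rbd export-diff --from-snap snap1 rbd_pool/image1@snap2 /tmp/image1.diff    # changes between snap1 and snap2
rbd import-diff /tmp/image1.diff rbd_pool/image1-copy                       # apply the changes to the copy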

Trash
Moving an RBD image to the trash keeps it for a specified time before it is permanently deleted.
As long as the retention time has not expired and the trash has not been purged, trashed
images can be restored.

Trash management is available from both the command line tool and the Dashboard.
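
A short sketch of the trash workflow with the rbd command line follows; the names are assumptions, and the image ID shown by rbd trash ls is required for the restore:

rbd trash mv rbd_pool/vm-disk-1         # move the image to the trash instead of deleting it
rbd trash ls rbd_pool                   # list trashed images and their IDs
rbd trash restore rbd_pool/<image-id>   # restore a trashed image by its ID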

Rbdmap service
The systemd unit file, rbdmap.service, is included with the ceph-common package. The
rbdmap.service unit executes the rbdmap shell script.

This script is useful for automatically mapping and mounting RBD images on a Ceph client at
boot time and unmounting them at shutdown, by simply adding one line per RBD device
(/dev/rbdX) with the associated credentials to be managed.
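
A hedged sketch of an rbdmap configuration entry follows; the pool, image, user, and keyring path are assumptions (the configuration file is commonly /etc/ceph/rbdmap, but check your distribution):

# one image per line: <pool>/<image>  <map options>
rbd_pool/vm-disk-1  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

Then enable the unit so the listed images are mapped at boot:

systemctl enable rbdmap.service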

RBD-NBD
RBD Network Block Device (RBD-NBD) is an alternative to Kernel RBD (KRBD) for mapping
Ceph RBD images. Unlike KRBD, which works in kernel space, RBD-NBD relies on the NBD
kernel module and works in user space. This allows access to librbd features such as


exclusive lock and fast-diff. NBD exposes mapped devices as local devices in paths like
/dev/nbdX.

In summary, RBD image features make them well-suited for a variety of workloads, including
both virtualization (virtual machines and containers) and high-performance applications (such
as databases and applications that require high IOPS and use small I/O size). RBD images
provide functionality like snapshots and layering, and stripe data across multiple servers in
the cluster to improve performance.

3.4 IBM Storage Ceph file storage


IBM Storage Ceph provides a file system compatible with POSIX standards that uses a Ceph
storage cluster to store and retrieve data. As the data within Ceph is internally stored as
objects in RADOS, a specific component is required to maintain ACL, ownership and
permissions for files and directories while allowing the mapping of POSIX inode numbers to
RADOS object names. This specific CephFS component is the Metadata Server or MDS.

Although the shared file system feature was the original use case for Ceph, because of the
demand for block and object storage from cloud providers and companies implementing
OpenStack, this feature was put on the back burner and was the last core Ceph
feature to become generally available, preceded by virtual block devices (RADOS Block
Device) and the OpenStack Swift and Amazon S3 gateway (RADOS Gateway or RGW).

The first line of code of a prototype file system was written in 2004 during Sage Weil's
internship at Lawrence Livermore National Laboratory. Sage continued working on the
project during another summer project at the University of California, Santa Cruz in 2005 to
create a fully functional file system, baptized Ceph, which was presented at Supercomputing and
USENIX in 2006.

3.4.1 Metadata Server (MDS)


CephFS stores both the data and the metadata (inodes and dentries) for the file system within
RADOS pools. Using RADOS as a storage layer allows Ceph to leverage built-in RADOS
object features such as watch and notify to easily implement an active-active or
active-passive mechanism and a highly efficient online file system check mechanism.

The Metadata Server also maintains a cache to improve metadata access performance while
managing the cache of CephFS clients to ensure the clients have the proper cache data and
to prevent deadlocks on metadata access.

A Metadata Server can have two roles:


򐂰 Active
򐂰 Standby

Active Metadata Server


An active Metadata Server will respond to CephFS client requests to provide the client with
the exact RADOS object name and its metadata for a given i-node number while maintaining
a cache of all accesses to the metadata.

When a new inode or dentry is created, updated, or deleted, the cache is updated and the
inode or dentry is recorded, updated, or removed in a RADOS pool dedicated to CephFS
metadata.


Standby Metadata Server


A standby Metadata Server will be assigned a rank if that rank becomes Failed. Once the standby
MDS becomes active, it replays the journal to reach a consistent state.

If the active Metadata Server becomes inactive or upon an administrative command, a


standby Metadata Server will become the active one for a given Ceph File System.

A Metadata Server can be marked as standby-replay for a given rank to apply journal
changes to its cache continuously. It allows for the failover between two Metadata Servers to
occur faster. If using this feature, each rank must be assigned to a standby-replay Metadata
Server.

Metadata Server journaling


Every action on the Ceph File System metadata is streamed into a journal shared between
Metadata Servers and located in the CephFS metadata pool before the actual file system
operation is committed.

This journal is used to maintain the consistency of the file system during a Metadata Server
failover operation as events can be replayed by the standby Metadata Server to reach a file
system state consistent with the state last reached by the now defunct previously active
Metadata Server.

The benefit of using a journal to record the changes is that most journal updates are
sequential. They are handled faster by all type of physical disk drives, and sequential
consecutive writes can be merged for even better performance.

The performance of the metadata pool where the journal is maintained is of vital importance
to the level of performance delivered by a CephFS subsystem. This is why it is a best practice
to use flash device based OSDs to host the placement groups of the metadata pool.

The journal comprises multiple RADOS objects in the metadata pool, and journal entries are striped
across these objects for better performance. Each active Metadata Server maintains its own
journal for performance and resiliency reasons. Old journal entries are automatically trimmed
by the active Metadata Server.

3.4.2 File System configuration


Ceph supports one or more file systems in a single storage cluster. This allows the
administrator to adapt the shared file system capabilities to the actual use cases that are to
be served using a single Ceph storage cluster. Creating multiple file systems is aligned with
the CRUSH customization mapping specific hardware capabilities or layout:
򐂰 Lightning-fast file system.
򐂰 Fast file system.
򐂰 Home directory file system.
򐂰 Archival file system.

The file system configuration can then be paired with specific Metadata Server deployment
and configuration to serve each file system with the appropriate level of performance when it
comes to file system metadata access requirements:
򐂰 A very hot directory requires a dedicated set of Metadata Servers.
򐂰 Archival file system is essentially read-only, and metadata update occurs in large batches.
򐂰 Increase the number of Metadata Servers to remediate a metadata access bottleneck.


Each file system is assigned a parameter named max_mds to set the maximum number of
Metadata Servers that can be active for a given file system. By default, this parameter is set to
1 for each file system created in the Ceph storage cluster.

The best practice when creating file systems is to deploy max_mds+1 Metadata Servers for a
single file system to keep one Metadata Server with a standby role to preserve the
accessibility and the resilience of the access to the metadata for the file system.

The number of active Metadata Servers can be dynamically increased or decreased for a
given file system without traffic disruption.
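
For example, a hedged sketch of changing the active MDS count for a file system named myfs (the file system name is an assumption):

ceph fs set myfs max_mds 2          # allow two active Metadata Servers
ceph fs get myfs | grep max_mds     # confirm the setting
ceph fs set myfs max_mds 1          # shrink back to a single active Metadata Server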

Figure 3-28 provides a visual representation of a multi MDS configuration within a Ceph
cluster. The shaded nodes represent directories, and the unshaded nodes represent files.
The MDSs at the bottom of the picture illustrate which specific MDS rank is in charge of a
specific subtree.

Figure 3-28 CephFS MFS subtree partitioning

Metadata Server rank


Each active Metadata Server is assigned a rank for a file system. The rank will be used when
creating subtree partitioning policies, covered below, or when pinning the directory to a
specific Metadata Server. The first rank assigned to an active Metadata Server for a file
system is 0.

Each rank has a state:


򐂰 Up - the rank is assigned to an MDS,
򐂰 Failed - the rank is not assigned to any MDS,
򐂰 Damaged - metadata for the rank is damaged/corrupted.

Note: A Damaged rank will not be assigned to any MDS until it is fixed and after the Ceph
administrator issues a ceph mds repaired command.

To make the life of the Ceph administrator easier when it comes to the number of Metadata
Servers assigned to a file system, the Ceph Manager provides the MDS Autoscaler module
that will monitor the max_mds and the standby_count_wanted parameters for a file system.


Metadata Server caching


To control the performance of the Metadata Server, the Ceph administrator can modify the
amount of memory used by an MDS for caching or the number of inodes maintained in cache
by an MDS:
򐂰 mds_cache_memory_limit - to control the amount of memory used (8-64GiB)
򐂰 mds_cache_size - to control inode count in cache (disabled by default)

Note: IBM recommends using the mds_cache_memory_limit parameter.

As the client may be misbehaving and causing some metadata to be held in the cache more
than expected, the Ceph cluster has a built-in warning mechanism, set through the
mds_health_cache_threshold parameter, when the actual cache usage is 150% of the
mds_cache_memory_limit parameter.
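
As a hedged example, the cache limit can be raised cluster-wide or for a single daemon with the ceph config command; the 16 GiB value and the daemon name are assumptions:

ceph config set mds mds_cache_memory_limit 17179869184                     # 16 GiB for all MDS daemons
ceph config set mds.myfs.node2.xyzabc mds_cache_memory_limit 17179869184   # or for one specific daemon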

Metadata Server file system affinity


If your Ceph cluster has heterogeneous hardware that includes fast and slow hardware, you
can assign a file system affinity to favor specific Metadata Servers running on specific
hardware assigned to a specific file system.

This is achieved by assigning the mds_join_fs parameter for a specific MDs instance. If a
rank becomes Failed, the Monitors in the cluster will assign to the Failed rank a standby
MDS for which the mds_join_fs is set to the name of the file system to which the Failed rank
is assigned.

If no standby Metadata Server has the parameter set, an existing standby Metadata Server is
assigned to the Failed rank.
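
A minimal sketch, assuming an MDS daemon named mds.b and a file system named fastfs:

ceph config set mds.b mds_join_fs fastfs    # prefer this MDS for the fastfs file system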

Dynamic tree partitioning


When multiple active Metadata Servers are assigned to a file system, the distribution of the
metadata serving will be performed automatically so that each Metadata Server serves an
equivalent number of metadata to guarantee the performance of the file system.

Ephemeral pinning
The dynamic tree partitioning of an existing directory subtree can be configured via policies
assigned to directories within a file system to influence the distribution of the Metadata Server
workload. The following extended attributes of a file system directory can be used to control
the balancing method across the active Metadata Servers:
򐂰 ceph.dir.pin.distributed - All children are to be pinned to a rank,
򐂰 ceph.dir.pin.random - Percentage of children to be pinned to a rank.

The ephemeral pinning does not persist once the inode is dropped from the Metadata Server
cache.

Manual pinning
If needed when the Dynamic Tree Partitioning is not satisfying, with or without subtree
partitioning policies (for example, one directory is hot), the Ceph administrator can pin a
directory to a particular Metadata Server. This is achieved by setting an extended attribute
(ceph.dir.pin) of a specific directory to indicate which Metadata Server rank will oversee
metadata requests for it.

Pinning a directory to a specific Metadata Server rank does not dedicate that rank to this
directory, as Dynamic Tree Partitioning never stops between active Metadata Servers.
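
As a hedged example of both policies, assuming the file system is mounted under /mnt/myfs:

setfattr -n ceph.dir.pin.distributed -v 1 /mnt/myfs/home    # spread immediate children across the active ranks
setfattr -n ceph.dir.pin -v 1 /mnt/myfs/hotdir              # manually pin this directory to rank 1
setfattr -n ceph.dir.pin -v -1 /mnt/myfs/hotdir             # remove the manual pin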


3.4.3 File System layout


In this section we will discuss Ceph File System layout.

RADOS pools
Each file system requires one (1) pool to store the metadata managed by the Metadata
Servers and the file system journal and at least one (1) data pool to store the data itself.

The overall architecture for the Ceph File System can be represented in Figure 3-29.

Figure 3-29 Ceph File System pools and components

Data layout
In this section we discuss metadata and data pools.

Metadata pool
The journaling mechanism uses dedicated objects and the journal is striped across multiple
objects for performance reasons.

Each inode in the file system is stored using a separate set of objects that will be named
{inode_number}.{inode_extent} starting with {inode_extent} as 00000000.

Data pool
The organization of the pool that contains the data is driven by a set of extended attributes
assigned to files and directories in the Ceph File System. By default, a Ceph File System is
created with one metadata pool and one data pool. An additional data pool can be attached to
the existing file system so it can be leveraged via the set of extended attributes managed by
the Metadata Server:


$ ceph fs add_data_pool {filesystem_name} {data_pool_name}

The placement and the organization of the data follow these rules:
򐂰 A file, when created, inherits the layout attributes of its parent directory.
򐂰 A file's layout attributes can only be changed while the file contains no data.
򐂰 Files that were already created are not affected by later attribute changes on the parent directory.

Table 3-1 shows the directory attributes.

Table 3-1 Directory attributes


Name Controls

ceph.dir.layout.pool What pool the data is written to

ceph.dir.layout.stripe_unit What is the size of a stripe unit

ceph.dir.layout.stripe_count How many stripe units to build a full stripe

ceph.dir.layout.object_size What object size to use to support striping

ceph.dir.layout.pool_namespace What RADOS namespace to use

The default values for these parameters are:

򐂰 ceph.dir.layout.pool: original file system data pool
򐂰 ceph.dir.layout.stripe_unit: 4 MiB
򐂰 ceph.dir.layout.stripe_count: 1
򐂰 ceph.dir.layout.object_size: 4 MiB
򐂰 ceph.dir.layout.pool_namespace: default namespace (name="")

All the attributes used at the file level have the same name but are prefixed with
ceph.file.layout.

The attributes can be visualized using the following commands with the file system mounted:
򐂰 getfattr -n ceph.dir.layout {directory_path}
򐂰 getfattr -n ceph.file.layout {file_path}

The attributes can be modified using the following commands with the file system mounted:
򐂰 setfattr -n ceph.dir.layout.attribute -v {value} {dir_path}
򐂰 setfattr -n ceph.file.layout.attribute -v {value} {file_path}

Let us look at the RADOS physical layout of a file that would have the following attributes,
inherited or not from its parent directory (Figure 3-30 on page 74):
򐂰 File size is 8 MiB.
򐂰 Object size is 4 MiB.
򐂰 Stripe unit is 1 MiB.
򐂰 Stripe count is 4.


Figure 3-30 Ceph File System layout over RADOS objects

3.4.4 Data layout in action


As an example, imagine we create the following structure on an empty file system. See
Example 3-8.

Example 3-8 Create the following structure on an empty file system


/mnt/myfs/testdir
/mnt/myfs/emptyfile (100MB)
/mnt/myfs/testdir/emptyfileindir (100MB)

We can dump the RADOS objects that are created to support the actual data. See
Example 3-9.

Example 3-9 RADOS objects that are created to support the actual data
$ mount.ceph 10.0.1.100:/ /mnt/myfs -o name=admin
$ df | grep myfs
10.0.1.100:/ 181141504 204800 180936704 1% /mnt/myf
$ mkdir /mnt/myfs/testdir
$ dd if=/dev/zero of=/mnt/myfs/emptyfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.194878 s, 538 MB/s
$ dd if=/dev/zero of=/mnt/myfs/testdir/emptyfileindir bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0782966 s, 1.3 GB/s
$ rados -p cephfs_data ls | cut -f1 -d. | sort -u


10000000001
100000001f6
$ for inode in $(rados -p cephfs_data ls | cut -f1 -d. | sort -u); do echo
"Processing INODE=${inode}";echo "----------------------------";rados -p
cephfs_data ls | grep $inode; done
Processing INODE=10000000001
----------------------------
10000000001.00000009
10000000001.0000000f
10000000001.00000012
10000000001.00000018
10000000001.00000015
10000000001.00000003
10000000001.0000000e
10000000001.00000002
10000000001.00000004
10000000001.0000000b
10000000001.00000010
10000000001.00000008
10000000001.0000000a
10000000001.00000005
10000000001.00000007
10000000001.00000014
10000000001.00000006
10000000001.00000000
10000000001.00000013
10000000001.0000000d
10000000001.00000017
10000000001.00000001
10000000001.00000011
10000000001.0000000c
10000000001.00000016
Processing INODE=100000001f6
----------------------------
100000001f6.00000017
100000001f6.00000018
100000001f6.00000016
100000001f6.0000000e
100000001f6.0000000d
100000001f6.0000000b
100000001f6.0000000c
100000001f6.00000004
100000001f6.00000015
100000001f6.00000005
100000001f6.00000001
100000001f6.00000007
100000001f6.00000012
100000001f6.00000000
100000001f6.00000003
100000001f6.00000008
100000001f6.00000006
100000001f6.00000013
100000001f6.00000011
100000001f6.00000009
100000001f6.0000000f


100000001f6.00000014
100000001f6.0000000a
100000001f6.00000002
100000001f6.00000010

We can see that two inodes were created, 10000000001 and 100000001f6, and each one
has 25 RADOS objects. As the file system is empty and has not been pre-configured or
customized, the defaults of stripe_unit=4 MiB, stripe_count=1, and object_size=4 MiB are used; therefore,
25*4 MiB=100 MiB. See Example 3-10.

Example 3-10 Two inodes are created


$ for inode in $(rados -p cephfs_data ls | cut -f1 -d. | sort -u); do echo
"Processing INODE=${inode}";echo "----------------------------";echo "Found
$(rados -p cephfs_data ls | grep $inode | wc -l) RADOS objects"; done
Processing INODE=10000000001
----------------------------
Found 25 RADOS objects
Processing INODE=100000001f6
----------------------------
Found 25 RADOS objects

How can we know which file corresponds to which inode number? There is a simple way to do that. See
Example 3-11.

Example 3-11 printf command to find out which file is which inode number
$ printf '%x\n' $(stat -c %i /mnt/myfs/emptyfile)
10000000001
$ printf '%x\n' $(stat -c %i /mnt/myfs/testdir/emptyfileindir)
100000001f6

Volumes and sub-volumes


Later during the Ceph File System project (Nautilus cycle 14.2.x), the concept of volumes and
sub-volumes was added.

Volumes are used to manage exports through a Ceph Manager module and are currently
used to provide shared file system capabilities for OpenStack Manila and Ceph CSI in
Kubernetes and Red Hat OpenShift environments:
򐂰 Volumes represent an abstraction for Ceph file systems.
򐂰 Sub-volumes represent an abstraction for directory trees.
򐂰 Sub-volume-groups aggregate sub-volumes to apply specific common policies across
multiple sub-volumes.

The introduction of this feature enabled an easier creation of a Ceph file system through a
single command that automatically creates the underlying pools required by the file system
and then deploys the Metadata Servers to serve the file system: ceph fs volume create
{filesystem_name} [options].

Quotas
The Ceph File System supports quotas to restrict the number of bytes or the number of files
within a directory. The quota mechanism is enforced by both the Ceph kernel and the Ceph
FUSE clients.


Quotas are managed via extended attributes of a directory and can be set via the setfattr
command for a specific directory:
򐂰 ceph.quota.max_bytes
򐂰 ceph.quota.max_files

Note: The above parameters are managed directly by the OpenStack Ceph Manila driver
and the Ceph CSI driver at the sub-volume level.
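
For example, a hedged sketch of setting and inspecting quotas on a mounted directory (the mount point and values are assumptions):

setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/myfs/project1   # limit the directory to 100 GiB
setfattr -n ceph.quota.max_files -v 100000 /mnt/myfs/project1         # limit the directory to 100,000 files
getfattr -n ceph.quota.max_bytes /mnt/myfs/project1                   # verify the byte quota
setfattr -n ceph.quota.max_bytes -v 0 /mnt/myfs/project1              # a value of 0 removes the quota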

3.4.5 File System clients


The Ceph File System kernel client was merged into the mainline kernel with Linux 2.6.34 in 2010. It was the
first Ceph kernel client and remains to this day the most performant way to access a Ceph file
system.

Later on, the Ceph FUSE client was created to allow non-Linux based clients to access a Ceph
file system. See Figure 3-31.

Figure 3-31 Ceph File System client layering

3.4.6 File System NFS Gateway


IBM Storage Ceph 5 added the ability to export a Ceph file system directory through NFS via
a specific gateway leveraging the Ganesha open-source project. This feature is completely
based on FUSE.


Ganesha supports multiple protocols such as NFS v3 and NFS v4 and does so through a File
System Abstraction Layer also known as FSAL.

IBM Storage Ceph only supports NFS v4 with version 6.x but NFS v3 support is expected in a
future IBM Storage Ceph version.

The NFS implementation requires a specific Ceph Manager module to be enabled to leverage
this feature. The name of the module is nfs and can be enabled with the ceph mgr module
enable nfs command.
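
A hedged sketch of creating an NFS cluster and an export through the nfs module follows; the cluster ID, placement, file system name, and paths are assumptions, and the exact export syntax can vary between releases:

ceph mgr module enable nfs                             # enable the nfs Manager module
ceph nfs cluster create mynfs "2 node1,node2"          # deploy two NFS Ganesha daemons
ceph nfs export create cephfs --cluster-id mynfs \
    --pseudo-path /shared --fsname myfs --path /       # export the root of the myfs file system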

3.4.7 Ceph File System permissions


Ceph has a concept of capabilities to control what cluster components and clients are allowed
to perform within or against a cluster. This capability concept is also used to control
permissions when accessing the Ceph file system. Chapter 2 of this document describes the
Ceph capability feature (client ID, client name, cephx, and so forth).

You can assign specific permissions to a directory for a specific Ceph client user name or
user id:
$ ceph fs authorize {filesystem_name} {client_id|client_name} {path} {permissions}

Where {client_name} is client.admin for example.

Where {permissions} can be:


򐂰 r for read permission
򐂰 w for write permission
򐂰 p to let the user manipulate layout and quota attributes
򐂰 s to let user manipulate snapshots

As an example, imagine a Ceph client cephx definition like this one (Example 3-12).

Example 3-12 Ceph client cephx definition


# ceph auth get client.0
[client.0]
key = AQA+KrxjWUovChAACWGj0YUbUEZHSKmNtYxriw==
caps mds = "allow rw fsname=myfs"
caps mon = "allow r fsname=myfs"
caps osd = "allow rw tag cephfs data=myfs"

If we try to modify an attribute of the root directory (/) of the Ceph file system mounted on the
/mnt mountpoint, it is denied because the set of capabilities does not include 'p'.
# setfattr -n ceph.dir.layout.stripe_count -v 2 /mnt
setfattr: /mnt: Permission denied

If we create another user with the correct capabilities and then remount the Ceph file system
on the same /mnt mountpoint, the denial disappears. See Example 3-13.

Example 3-13 Create another user with the correct capabilities and then remount the Ceph file system
# mkdir /mnt/dir4
# ceph fs authorize myfs client.4 / rw /dir4 rwp
[client.4]
key = AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ==


# umount /mnt
# mount -t ceph ceph-node01.example.com,ceph-node02.example.com:/ /mnt -o
name=4,secret="AQBmK71j0FcKERAAJqwhXOHoucR+iY0nzGV9BQ=="
# touch /mnt/dir4/file1
# setfattr -n ceph.file.layout.stripe_count -v 2 /mnt/dir4/file1
# getfattr -n ceph.file.layout /mnt/dir4/file1
# file: mnt/dir4/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304
pool=cephfs.fs_name.data"

3.4.8 IBM Storage Ceph File System snapshots


The Ceph file system provides an asynchronous snapshot capability. The snapshots are
accessible via a virtual directory named .snap.

Snapshots can be created at any level in the directory tree structure, including the root level of a
given file system.

As we have seen earlier, snapshot capabilities can be granted to specific users at the
individual client level via the 's' flag.

Snapshot capabilities can also be enabled or disabled as a whole feature for an entire file
system.
# ceph fs set {filesystem_name} allow_new_snaps true|false

To create a snapshot, the end user that has the Ceph file system mounted simply creates a
subdirectory within the '.snap' directory with a name of his or her choice.
# mkdir /mnt/.snap/mynewsnapshot

An end user with sufficient privileges can schedule regular snapshots of a specific directory
via the snap-schedule command. See Example 3-14.

Example 3-14 snap-schedule command


# ceph fs snap-schedule add / 1h
Schedule set for path /
# ceph fs snap-schedule list /
/ 1h
# ceph fs snap-schedule status / | jq .
{
"fs": "fs_name",
"subvol": null,
"path": "/",
"rel_path": "/",
"schedule": "1h",
"retention": {},
"start": "2023-01-11T00:00:00",
"created": "2023-01-11T09:38:07",
"first": null,
"last": null,
"last_pruned": null,
"created_count": 0,
"pruned_count": 0,
"active": true


Tip: It is recommended to keep snapshots at least one hour apart.

To control the retention of snapshots, the snap-schedule command has a retention argument.


# ceph fs snap-schedule retention add / h 24
Retention added to path /

3.4.9 IBM Storage Ceph File System asynchronous replication


Ceph file system asynchronous replication between two Ceph clusters has been added to the
set of Ceph capabilities to satisfy geo-replication needs for disaster recovery purposes.

This feature requires both clusters to run an identical version for support reasons, and at least
version 5.3 of IBM Storage Ceph.

The feature is based on CephFS snapshots, taken at regular intervals. The first snapshot
requires a complete transfer of the data, while subsequent snapshots only require
transferring the data that has been updated since the last snapshot was applied on the
remote cluster.

Ceph file system mirroring is disabled by default. Enabling it requires the following changes
to the Ceph clusters:
򐂰 Enable the Manager mirroring module.
򐂰 Deploy a cephfs-mirror component via cephadm.
򐂰 Authorize the mirroring daemons in both clusters (source and target).
򐂰 Peer the source and the target clusters.
򐂰 Configure the path to mirror.

The sequence, shown in Example 3-15, highlights the sequence of changes required.

Example 3-15 Sequence of changes required


# ceph mgr module enable mirroring
# ceph orch apply cephfs-mirror [node-name]
# ceph fs authorize fs-name client_ / rwps
# ceph fs snapshot mirror enable fs-name
# ceph fs snapshot mirror peer_bootstrap create fs-name peer-name site-name
# ceph fs snapshot mirror peer_bootstrap import fs-name bootstrap-token
# ceph fs snapshot mirror add fs-name path


Chapter 4. Sizing IBM Storage Ceph


An IBM Storage Ceph cluster can be designed to serve different types of capacity and
workload requirements. There is no one-size-fits-all Ceph system; each cluster needs to be
carefully architected. This chapter provides guidance on designing and sizing an IBM Storage
Ceph cluster and has the following sections:
򐂰 “Workload considerations” on page 82
򐂰 “Performance domains and storage pools” on page 84
򐂰 “Network considerations” on page 85
򐂰 “Collocation versus non-collocation” on page 85
򐂰 “Minimum hardware requirements for daemons” on page 87
򐂰 “OSD node CPU and RAM requirements” on page 88
򐂰 “Scaling RGWs” on page 89
򐂰 “Recovery calculator” on page 91
򐂰 “IBM Storage Ready Nodes for Ceph” on page 92
򐂰 “Performance guidelines” on page 94
򐂰 “Sizing examples” on page 97

Note: For more information on sizing an IBM Storage Ceph environment, you can refer to
the Planning section of IBM Storage Ceph documentation.


4.1 Workload considerations


One of the key benefits of a Ceph storage cluster is the ability to support different types of
workloads within the same storage cluster using performance domains. Different hardware
configurations can be associated with each performance domain. For example, the following
performance domains can coexist in the same IBM Storage Ceph cluster:
򐂰 IOPS-intensive workloads, such as MySQL and MariaDB, often use SSDs.
򐂰 Throughput-sensitive workloads typically use HDDs with Ceph metadata on solid-state
drives (SSDs).
򐂰 Hard disk drives (HDDs) are typically appropriate for cost and capacity-focused
workloads.

See an example of performance domains in Figure 4-1.

Figure 4-1 A cluster with multiple performance domains

4.1.1 IOPS-optimized
Input/output operations per second (IOPS)-optimized deployments are suitable for cloud computing
operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack
or as containers on OpenShift. IOPS-optimized deployments require higher performance
storage, such as flash storage, to improve IOPS and total throughput.

An IOPS-optimized storage cluster has the following properties:


– Lowest cost per IOPS.


– Highest IOPS per GB.


– 99th percentile latency consistency.

Use cases for an IOPS-optimized storage cluster are:


򐂰 Typically block storage.
򐂰 2x replication or 3x replication using solid-state drives (SSDs).
򐂰 MySQL on OpenStack clouds.

4.1.2 Throughput-optimized
Throughput-optimized deployments are ideal for serving large amounts of data, such as
graphic, audio, and video content. They require high-bandwidth networking hardware,
controllers, and hard disk drives with fast sequential read and write performance. If fast data
access is required, use a throughput-optimized storage strategy. If fast write performance is
required, consider using SSDs for metadata. A throughput-optimized storage cluster has the
following properties:
򐂰 Lowest cost per MBps (throughput).
򐂰 Highest MBps per TB.
򐂰 97th percentile latency consistency.

Uses for a throughput-optimized storage cluster are:


򐂰 Block or object storage.
򐂰 3x replication for block data or erasure coding (EC) for object data
򐂰 Storage for backups
򐂰 Active performance storage for video, audio, and images.

4.1.3 Capacity-optimized
Capacity-optimized deployments are ideal for storing large amounts of data at the lowest
possible cost. They typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive large capacity SATA or
NL-SAS drives and can use a large number of OSD drives per server.

A cost and capacity-optimized storage cluster has the following properties:


򐂰 Lowest cost per TB.

Use cases for a cost and capacity-optimized storage cluster are:


򐂰 Typically object storage.
򐂰 Erasure coding for maximizing usable capacity.
򐂰 Object archive.
򐂰 Video, audio, and image object repositories.


4.2 Performance domains and storage pools


Different performance domains are implemented by creating multiple storage pools. Ceph
clients store data in pools. When you create pools, you are creating an I/O interface for clients
to store data.

There are two types of pools:


򐂰 replicated
򐂰 erasure-coded

By default, Ceph uses replicated pools, which means that each object is copied from a
primary OSD node to one or more secondary OSDs. Erasure-coded pools reduce the disk
space required to ensure data durability, but are computationally more expensive than
replication.

Erasure coding is a method of storing an object in the Ceph storage cluster durably by
breaking it into data chunks (k) and coding chunks (m). These chunks are then stored in
different OSDs. In the event of an OSD failure, Ceph retrieves the remaining data (k) and
coding (m) chunks from the other OSDs and uses the erasure code algorithm to restore the
object from those chunks.

Erasure coding uses storage capacity more efficiently than replication. The n-replication
approach maintains n copies of an object (3x by default in Ceph), whereas erasure coding
maintains only k + m chunks. For example, 4 data and 2 coding chunks use 1.5x the storage
space of the original object.
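
As a hedged worked example of the overhead arithmetic, for 100 TB of usable data (the capacities are illustrative):

Replicated pool, size=3:       100 TB x 3        = 300 TB of raw capacity (3.0x overhead)
Erasure-coded pool, k=4, m=2:  100 TB x (4+2)/4  = 150 TB of raw capacity (1.5x overhead)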

Figure 4-2 shows the comparison between replication and erasure coding.

Figure 4-2 Comparison between replication and erasure coding


While erasure coding uses less storage overhead than replication, it requires more RAM and
CPU than replication to access or recover objects. Erasure coding is advantageous when
data storage must be durable and fault tolerant, but does not require fast read performance
(for example, cold storage, historical records, and so on).

4.3 Network considerations


Cloud storage solutions are susceptible to IOPS exhaustion due to network latency,
bandwidth constraints, and other factors, long before the storage clusters reach their capacity
limits. Therefore, network hardware configurations must be chosen carefully to support the
intended workloads and meet price/performance requirements.

Ceph supports two networks: a public network and a storage cluster network. The public
network handles client traffic and communication with Ceph monitors. The storage cluster
network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic. For
production clusters, it is recommended to use at least two 10 Gbps networks for each network
type.

Link Aggregation Control Protocol (LACP) mode 4 (xmit_hash_policy layer3+4) can be used to
bond network interfaces. Use jumbo frames with a maximum transmission unit (MTU) of
9000, especially on the backend or cluster network.

4.4 Collocation versus non-collocation


Every Ceph cluster consists of Monitor daemons (ceph-mon), Manager daemons (ceph-mgr)
and Object Storage Daemons (ceph-osd).

Clusters that are used as object storage also deploy RADOS Gateway daemons (radosgw),
and if shared file services are required, additional Ceph Metadata Servers (ceph-mds) would
be configured.

Co-locating multiple services to a single cluster node has the following benefits:
򐂰 Significant improvement in total cost of ownership (TCO) in smaller clusters.
򐂰 Reduction from six hosts to four for the minimum configuration.
򐂰 Easier upgrade.
򐂰 Better resource isolation.

Modern x86 servers have 16 or more CPU cores and large amounts of RAM, which
justifies the co-location of services. Running each service type on dedicated cluster nodes
separated from other types (non-collocated) would require a larger number of smaller servers,
which may even be difficult to find on the market.

There are rules to be considered that restrict deploying every type of service on a single
node.

For any host that has ceph-mon/ceph-mgr and OSD, only one of the following can be added
to the same host:
򐂰 RADOS Gateway
򐂰 Ceph Metadata Servers
򐂰 Grafana


RADOS Gateway and Ceph Metadata services are always deployed on different nodes.

Grafana is typically deployed on a node that does not host Monitor or Manager services, as shown in Figure 4-3.

Figure 4-3 Grafana service co-location rules

The smallest supported cluster size for IBM Storage Ceph is four nodes. An example of service co-location for a four-node cluster is shown in Figure 4-4.


Figure 4-4 Example of services co-location for a four-node cluster

4.5 Minimum hardware requirements for daemons


Each service daemon requires CPU and RAM resources, and some daemons need disk capacity as well. The recommended minimums for each daemon are shown in Figure 4-5.


Figure 4-5 Recommended minimums for each daemon

Daemon resource requirements may change between releases. Remember to check the requirements for the release you are deploying in IBM Documentation.

Supported configurations: Check out the following link for supported configurations: Supported configurations. Note that you need to log in with your IBM Internet ID to access this link.

4.6 OSD node CPU and RAM requirements


Adding up individual daemon CPU and RAM requirements for each server can be time consuming. Alternatively, you can use the table in Figure 4-6 as a guideline for server sizing. This table calculates enough CPU and RAM per server to deploy common services, such as ceph-mon, ceph-mgr, radosgw and Grafana, in addition to the OSD requirements.

Figure 4-6 shows aggregate server CPU and RAM resources required for an OSD node with
x number of HDD drives, SSD devices or NVMe devices.


Figure 4-6 Aggregate server CPU and RAM resources required for an OSD node

Note that it is strongly recommended to add SSD or NVMe devices equal to 4% of the HDD capacity per server for RADOS Gateway metadata storage. This allows Ceph to perform much better than it would without such capacity.

4.7 Scaling RGWs


To reach high throughput for object data, multiple RGWs are required. Consider these general
principles:
򐂰 RGW daemons can provide 1-2GB/s throughput each with large object size.
򐂰 1-8 RGW daemons can be placed on a single node.
򐂰 Minimum configuration is 2 RGW daemons deployed on separate nodes.
򐂰 Up to 80 RGW daemons in a single cluster have been tested.
򐂰 For optimal performance, allocate 4 CPU cores and 16-32GB RAM per RGW daemon.
򐂰 1 RGW daemon per node in a 7-node cluster will perform better than 7 RGW daemons on
a single node. Spread RGW daemons wide.
򐂰 Multi-colocated RGW on OSD hosts can significantly improve read performance for
read-intensive workloads, enabling near-zero-latency reads.
򐂰 While not optimal for performance, running RGW daemons on standalone VMs may be
necessary for networking or security reasons.

Figure 4-7 shows performance scaling using large objects as more RGW instances are
deployed in a 7-node cluster.


Figure 4-7 Performance scaling using large objects as more RGW instances are deployed

Scaling from 1 to 7 RGW daemons in a 7-node cluster by spreading daemons across all nodes improves performance 5-6x. Adding RGW daemons to nodes that already have an RGW daemon does not scale performance as much.

Figure 4-8 shows the performance impact of adding a second RGW daemon to each of the 7
cluster nodes.

Figure 4-8 Performance effect of adding a second RGW daemon in each of the 7 cluster nodes

Small objects also scale well in RGW, and their performance is typically measured in OPS (operations per second). Figure 4-9 on page 91 shows the small object performance of the same cluster.


Figure 4-9 Small object performance of the same cluster

IBM Storage Ceph includes an ingress service for RGW that deploys HAProxy and keepalived daemons on the hosts running RGW and allows for the easy creation of a single virtual IP address for the service.
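
As an illustrative sketch of these principles, the following service specifications deploy two RGW daemons on labeled hosts and an ingress service in front of them. The service names, label, virtual IP, and ports are assumptions chosen for the example, not values from an existing cluster:

service_type: rgw
service_id: s3service
placement:
  count: 2
  label: rgw
---
service_type: ingress
service_id: rgw.s3service
placement:
  count: 2
  label: rgw
spec:
  backend_service: rgw.s3service
  virtual_ip: 10.10.0.200/24
  frontend_port: 8080
  monitor_port: 1967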

4.8 Recovery calculator


IBM Storage Ceph does not restrict the OSD drive sizes to be used or the number of OSD
drives per node. However, it is required that the cluster can recover from a node failure in less
than 8 hours.

The official recovery calculator tool for determining node recovery time can be found here: https://access.redhat.com/labs/rhsrc/

You must be registered as a Red Hat customer to be able to log in to the tool. Here is a link to
the site where you can create a Red Hat login:

https://www.redhat.com/wapps/ugc/register.html?_flowId=register-flow&_flowExecutionKey=e1s1

It is important to choose Host as the OSD Failure Domain. The number to observe is MTTR in Hours (Host Failure), which needs to be less than 8 hours.

Figure 4-10 on page 92 shows the recovery calculator user interface.


Figure 4-10 Ceph Recovery Calculator GUI

The use of the recovery calculator is especially important for small clusters, where recovery times with large drives are often higher than 8 hours.

4.9 IBM Storage Ready Nodes for Ceph


IBM has recently introduced IBM Storage Ready Nodes for Ceph, which are pre-configured and pre-tested servers that can be used with IBM Storage Ceph software.

By using IBM Storage Ready Nodes for Ceph, customers get both hardware maintenance and software support from IBM.

Currently there is one server model available with various drive configurations, as shown in
Figure 4-11 on page 93.


Figure 4-11 IBM Storage Ceph with Storage Ready Node specifications

Each node comes with two 3.84TB SSD acceleration drives, which are used as metadata
drives. Available data drives are 8TB, 12TB, 16TB and 20TB SATA drives. Every node must
be fully populated with 12 drives of the same size.

Possible capacity examples using IBM Storage Ready Nodes for Ceph are shown in
Figure 4-12.


Figure 4-12 Capacity examples using IBM Storage Ready Nodes for Ceph

IBM Storage Ready Nodes for Ceph are well suited for throughput optimized workloads, such
as backup storage.

Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes can be
seen in Figure 4-13.

Figure 4-13 Examples of an 8-node IBM Storage Ceph cluster's performance using Ready Nodes

4.10 Performance guidelines


Daemon deployment on each cluster node needs to be considered, and the recommendations followed, so that the servers have enough CPU and RAM resources to support the daemons.

Once the servers are correctly sized, we can estimate the actual cluster performance.


4.10.1 IOPS-optimized
IOPS-optimized Ceph clusters use SSD or NVMe drives for the data. The main objective of the cluster is to provide a large number of IOPS for RBD block devices or CephFS. The workload is typically a database application requiring high IOPS with a small block size.

Example server models and estimated IOPS per host using NVMe drives are shown in
Figure 4-14.

Figure 4-14 Example server models and estimated IOPS per host using NVMe drives

It is a best practice to have at least 10 Gbps of network bandwidth for every 12 HDDs in an OSD node, for both the cluster network and the client network. This ensures that the network does not become a bottleneck.

4.10.2 Throughput-optimized
A throughput-optimized Ceph cluster can be all-flash or hybrid, using both SSDs and HDDs. The main objective of the cluster is to achieve a required throughput. The workload typically consists of large files, and the most common use case is backup storage where Ceph capacity is accessed through the RADOS Gateway object interface.

Example server models and estimated MB/s per OSD drive using HDDs as data drives are
shown in Figure 4-15.


Figure 4-15 Example server models and estimated MB/s per OSD drive using HDDs as data drives

4.10.3 Capacity-optimized
Capacity-optimized Ceph clusters use large-capacity HDDs and are often designed to be
narrower (fewer servers) but deeper (more HDDs per server) than throughput-optimized
clusters. The primary objective is to achieve the lowest cost per GB.

Capacity-optimized Ceph clusters are typically used for archive storage of large files, with
RADOS Gateway object access being the most common use case. While it is possible to
co-locate metadata on the same HDDs as data, it is still recommended to separate metadata
and place it on SSDs, especially when using erasure coding protection for the data. This can
provide a significant performance benefit.

Many of the typical server models connect to an external JBOD (Just a Bunch of Disks) drive
enclosure where data disks reside. External JBODs can range from 12-drive to more than a
hundred drive models.

Example server models for capacity-optimized clusters are shown in Figure 4-16.

Figure 4-16 Example server models to build a capacity optimized Ceph cluster

Note that clusters that are too narrow cannot be used with the largest drives, because the calculated recovery time would be greater than 8 hours. For a cold archive type of use case, it is possible to have less than one CPU core per OSD drive.


4.11 Sizing examples


In this section we cover some scenarios for IBM Storage Ceph sizing.

4.11.1 IOPS-optimized scenario


An example use case scenario is described in Figure 4-17 on page 97.

Figure 4-17 Scenario: IOPS-optimized

The IOPS requirement is quite high for a small capacity requirement. Therefore, we should plan to use NVMe drives, which provide the highest IOPS per drive. Refer to the performance chart in Figure 4-14 on page 95.

Performance planning
Figure 4-14 on page 95 shows that a single server with six NVMe drives can do 50 K write
IOPS and 200 K read IOPS with 4 KB block size.

The following are the customer requirements and the resulting server counts:
򐂰 400 K write IOPS: 400 K / 50 K (single-server write performance) = 8 servers required.
򐂰 1.8 M read IOPS: 1.8 M / 200 K (single-server read performance) = 9 servers required.

We should plan for 10 servers to have a little headroom and to be prepared for a server
failure.

Capacity planning
The following are considerations for capacity planning:
򐂰 It is recommended to use 3-way replication with block data.
򐂰 Total raw storage requirement is 3x 50 TB = 150 TB.
򐂰 Our performance calculation is based on having six NVMe drives per server.
򐂰 10 servers in the cluster means 60 OSD drives.
򐂰 Capacity requirement per NVMe drive is 150 TB (total raw capacity) / 60 (OSD
drives) = 2.5 TB.
򐂰 Closest larger capacity NVMe is 3.84 TB.


CPU, RAM and networking requirements


It is critical to have enough CPU and memory resources and networking bandwidth when using fast NVMe drives as data drives.

CPU requirement
Best practice is to have 4 to 10 CPU cores per NVMe drive. Our solution has 6 NVMe drives
per server, therefore CPU requirement for OSD daemons is 24 to 60 CPU cores.

RAM requirement
Best practice is to have 12 to 15 GB of RAM per NVMe drive. Our solution has 6 NVMe drives
per server, therefore RAM memory requirement for OSD daemons is 72 to 90 GB.

Networking requirement
We know that 6 NVMe drives in one host deliver about 200 K read IOPS (write IOPS is lower). The block size is 4 KB. The bandwidth required per server is 200,000 x 4 KB/s = 800 MB/s = 6.4 Gbps for both the client and cluster networks.

Figure 4-18 A possible solution for an IOPS optimized Ceph cluster

4.11.2 Throughput-optimized scenario


The customer is looking for 1 PB of usable S3 capacity for their long-term backups. The backup application is IBM Storage Defender Data Protect.

Sizing example
Throughput-optimized systems often combine lower-cost drives for data with higher-performance drives for RADOS Gateway metadata. IBM Storage Ready Nodes for Ceph offer this combination of drives.

The Ceph cluster needs to be designed in a way that allows data to be rebuilt on the remaining nodes if one of the servers fails. This needs to be considered as extra capacity on top of the requested 1 PB.

Also, a Ceph cluster goes into read-only mode once it reaches 95% capacity utilization.


Ceph S3 backup targets typically use erasure coding to protect data. For a 1 PB usable
capacity, EC 4+2 is a better choice than EC 8+3 or EC 8+4.

The minimum number of nodes to support EC 4+2 is 7. We should investigate whether a good match for the customer capacity requirement can be found with 7 nodes, or whether more are needed.

The largest available drive size in IBM Storage Ready Nodes for Ceph is 20 TB:
7 nodes * 12 drives per node * 20 TB per drive = 1680 TB raw.

This would seem like a low cost alternative for 1 PB usable. But can we use 20 TB drives in a
7-node cluster, as shown in Figure 4-19 on page 99? What does the recovery calculator
show?

Figure 4-19 7 nodes 20 TB drives

Since the calculated recovery time for a host failure is longer than 8 hours, this cluster would not be supported.

Next, we increase the number of hosts in the recovery calculator from 7 to 8. See Figure 4-20.

Figure 4-20 8-nodes 20 TB drives

Recovery time is just under 8 hours.


Now let us do the capacity math for a 8-node cluster with 20 TB drives:
1. 8 nodes each with 12pcs of 20 TB drives = 8*12*20 TB = 1920 TB raw capacity.
2. EC 4+2 has 4 data chunks and 2 EC chunks. Each chunk is the same size.
3. One chunk of 1920 TB is therefore 1920 TB / 6 = 320 TB.
4. There are 4 data chunks in EC 4+2 and therefore usable capacity is 4*320 TB = 1280 TB.
5. This would satisfy customer's 1 PB capacity requirement and cluster has enough capacity
to rebuild data if one node fails and still stay below 95% capacity utilization.
a. Each server has 12x 20TB = 240 TB raw capacity.
b. Total cluster raw capacity is 1920 TB.
c. Cluster raw capacity if one node breaks is 1920 TB - 240 TB = 1680 TB.
d. 1680 TB raw equals 1120 TB usable.
e. Customer has 1 PB (1000 TB) of data.
f. Capacity utilization if one host is broken is 1000 TB / 1120 TB = 0.89 = 89%.
6. The conclusion is that an 8-node cluster with 20 TB drives has enough capacity to stay below 95% capacity utilization even if one of the servers breaks and its data is rebuilt on the remaining servers.
7. Now that the raw capacity in each OSD node is calculated, we add the recommended
SSD drives for metadata that equal 4% of the raw capacity on the host.
– 4% of 240 TB is 9.6 TB.
– Best practice is to have one metadata SSD for every four or six HDDs.
– Therefore, it would be recommended to add two or three 3.84 TB SSDs in each host.
– Two 3.84 TB SSDs (non-raided) amounts to 7.68 TB which is 3.2% of raw capacity in
the host.
– Three 3.84 TB SSDs (non-raided) amounts to 11.52 TB which is 4.8% of raw capacity
in the host.
8. Since the customer performance requirement (2 GB/s ingest) is not very high compared to
what an 8-node cluster can achieve (over 4 GB/s) we can settle for the lower SSD
metadata capacity.

CPU, RAM and networking requirements


The following lists the CPU, RAM and networking requirements.

CPU requirement
Best practice is to have 1 CPU core per HDD. Our solution has 12 HDDs per server, therefore
CPU requirement for OSD daemons is 12 CPU cores.

RAM requirement
Best practice is to have 10 GB of RAM per HDD for throughput optimized use case. Our
solution has 12 HDDs per server, therefore RAM memory requirement for OSD daemons is
120 GB.

Networking requirement
One HDD has 90 MB/s read performance. Our solution has 12 HDDs per server, therefore the network bandwidth requirement per network is 12 x 90 MB/s = 1080 MB/s = 8.64 Gbps.


See Figure 4-21 on page 101.

Figure 4-21 A possible solution for a 1 PB throughput optimized Ceph cluster

An example use case scenario for a capacity optimized Ceph cluster is described below.

4.11.3 Capacity-optimized scenario


The following are the requirements:
򐂰 Low cost storage for cold data archive
򐂰 S3 compatibility
򐂰 12 PB usable capacity

A possible solution
An object storage archive for cold data is a type of use case where we could design a cluster
that is narrow (few servers) and deep (large number of OSD drives per host). The end result
is a low-cost but not high performing cluster. Very large servers benefit greatly of SSDs for
metadata and it is recommended to include them even for a cold archive use case.

Since the capacity requirement is quite high, we can consider EC 8+3, which has the lowest overhead (raw capacity versus usable capacity). An IBM Storage Ceph cluster using EC 8+3 requires a minimum of 12 nodes.

First, we want to check if a 12-node cluster with large drives and large JBODs can recover
from host failure in less than 8 hours.

We log in to the Recovery Calculator tool and enter the parameters of the cluster that we want to check:
򐂰 12 OSD hosts
򐂰 84pcs of 20 TB OSD drives per host
򐂰 2x 25 GbE network ports


We can see from the Recovery Calculator (Figure 4-22) that the calculated recovery time for such a cluster is less than 8 hours, meaning it is a supported design.

Figure 4-22 Checking the calculated recovery time of a 12-node cluster

Next, we want to calculate the capacity:


1. 12 nodes each with 84pcs of 20 TB drives = 12*84*20 TB = 20160 TB raw capacity.
2. EC 8+3 has 8 data chunks and 3 EC chunks. Each chunk is the same size.
3. One chunk of 20160 TB is therefore 20160 TB / 11 = 1832 TB.
4. There are 8 data chunks in EC 8+3 and therefore usable capacity is 8*1832 TB = 14656
TB.
5. This would satisfy customer's 12 PB capacity requirement and cluster has enough
capacity to rebuild data if one node fails and still stay below 95% capacity utilization.
– Each server has 84x 20 TB = 1680 TB raw capacity.
– Total cluster raw capacity is 20160 TB.
– Cluster raw capacity if one node breaks is 20160 TB - 1680 TB = 18480 TB.
– 18480 TB raw equals 13440 TB usable.
– Customer has 12 PB (12000 TB) of data.
– Capacity utilization if one host is broken = 12000 TB / 13440 TB = 0.89 = 89%.
6. The conclusion is that a 12-node cluster with 20 TB drives has enough capacity to stay below 95% capacity utilization even if one of the servers breaks and its data is rebuilt on the remaining servers.
Even for a cold archive it is recommended to have SSD metadata devices totaling 4% of the HDD capacity per server. This results in 4% of 84 * 20 TB = 67 TB. Eight to ten large SSDs (7.68 TB) would be required for metadata, more if they are to be protected with server RAID.

CPU, RAM and networking requirements


The following lists the CPU, RAM and networking requirements.


CPU requirement
For a cold archive use case we can use 0.5 CPU cores per HDD. Our solution has 84 HDDs
per server, therefore CPU requirement for OSD daemons is 42 CPU cores.

RAM requirement
Best practice is to have 5 GB of RAM per HDD for capacity optimized use case. Our solution
has 84 HDDs per server, therefore RAM memory requirement for OSD daemons is 420 GB.

Networking requirement
One HDD has 90 MB/s read performance. Our solution has 84 HDDs per server, therefore the network bandwidth requirement per network is 84 x 90 MB/s = 7560 MB/s, which is approximately 60 Gbps.

See Figure 4-23 on page 103.

Figure 4-23 A possible solution for a 12 PB capacity optimized Ceph cluster


Chapter 5. Monitoring your IBM Storage Ceph environment
The supporting hardware of an IBM Storage Ceph cluster is subject to failure over time. The
Ceph administrator is responsible for monitoring and, if necessary, troubleshooting the cluster
to keep it in a healthy state. This chapter provides an overview of the Ceph services, metrics,
and tools that are in scope for end-to-end monitoring of cluster health and performance.

This chapter has the following sections:


򐂰 “Monitoring overview” on page 106
򐂰 “IBM Storage Ceph monitoring examples” on page 109
򐂰 “Conclusion” on page 118
򐂰 “References on the World Wide Web” on page 119


5.1 Monitoring overview


Managing and monitoring a Ceph storage network demands a diverse range of technical
skills. Organizations have devised various strategies for allocating administrative
responsibilities in accordance with their unique requirements. In large Ceph deployments,
organizations often assign technical specialists to specific technical domains. Conversely,
organizations with smaller Ceph clusters typically employ IT generalists who oversee multiple
disciplines.

The three technical skills needed to manage and monitor a Ceph cluster are:
򐂰 Operating system: The ability to install, configure, and operate the operating system that
hosts the Ceph nodes (Red Hat Enterprise Linux).
򐂰 Networking: The ability to configure TCP/IP networking in support of Ceph cluster
operations for both client access (public network) and Ceph node to node communications
(cluster network).
򐂰 Storage: The ability to design and implement a Ceph storage design to provide the
performance, data availability, and data durability that aligns with the requirements of the
client applications and the value of the data.

Regardless of the number of individuals who perform the work, whether it be one person or
several, these administrators will need to perform their roles in managing a Ceph cluster
using the right tool at the right time. This chapter provides an overview of Ceph cluster
monitoring and the tools that are available to perform monitoring tasks.

The remainder of this chapter provides an overview of the Ceph services, metrics, and tools that are in scope for end-to-end monitoring of cluster health and performance.

5.1.1 IBM Storage Ceph monitoring components


In this section we discuss the Ceph monitoring components.

Ceph Monitor
Ceph Monitors (MONs) are the daemons responsible for maintaining the cluster map, a
collection of five maps that provide comprehensive information about the cluster's state and
configuration. Ceph proactively handles each cluster event, updates the relevant map, and
replicates the updated map to all MON daemons. A typical Ceph cluster comprises three
MON instances, each running on a separate host. To ensure data integrity and consistency,
MONs adopt a consensus mechanism, requiring a voting majority of the configured monitors
to be available and agree on the map update before it is applied. This is why a Ceph cluster
must be configured with an odd number of monitors (for example, 3 or 5) to establish a
quorum and prevent potential conflicts.
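
For example, the monitor map and the current quorum can be inspected from the cephadm shell with commands such as the following (output varies by cluster):

# ceph mon stat
# ceph quorum_status --format json-pretty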

Ceph Manager
The Ceph Manager (MGR) is a critical component of the Ceph cluster responsible for
collecting and aggregating cluster-wide statistics. The first MGR daemon that is started in a
cluster becomes the active MGR and all other MGRs are on standby. If the active MGR does
not send a beacon within the configured time interval, a standby MGR takes over. Client I/O
operations continue normally while MGR nodes are down, but queries for cluster statistics fail.
The best practice is to deploy at least two MGRs in the Ceph cluster to provide high


availability. Ceph MGRs are typically run on the same hosts as the MON daemons, but it is
not required.

The existence of the first Ceph Manager, and the first Ceph Monitor, defines the existence of
a Ceph cluster. This means that when cephadm bootstraps the first node, a Monitor and a
Manager are started on that node, and the Ceph cluster is considered operational. In the
Ceph single node experience, that first node will also have one or several OSD daemons
further extending the concept of an operational and fully functioning Ceph cluster capable of
storing data.
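
The active and standby Managers, and the modules they are running, can be checked with commands such as these (a minimal sketch):

# ceph mgr stat
# ceph mgr module ls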

Ceph daemons
Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster itself
maintains a cluster log that records high-level events about the entire Ceph cluster. These
events are logged to disk on monitor servers (in the default location
/var/log/ceph/ceph.log), and they can be monitored via the command line.
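
For instance, recent cluster log entries can be reviewed from the command line, or the log file mentioned above can be followed directly (the number of entries shown is arbitrary):

# ceph log last 20
# tail -f /var/log/ceph/ceph.log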

5.1.2 IBM Storage Ceph monitoring categories


In this section we discuss the Ceph monitoring categories.

Monitoring services
The Ceph cluster is a collection of daemons and services that perform their respective roles
in the operation of a Ceph cluster. The first step in Ceph monitoring or troubleshooting is to
inspect these Ceph components and make observations as to whether they are up and
running, on the nodes they were designated to run, and are they reporting a healthy state. For
instance, if the cluster design specifies two Ceph Object Gateway (RGW) instances for
handling S3 API client requests from distinct nodes, a single cephadm command or a glance at
the Ceph Dashboard will provide insights into the operating status of the RGW daemons and
their respective nodes.

Monitoring resources
Ceph resources encompass the cluster entities and constructs that define its characteristics.
These resources include networking infrastructure, storage devices (for example, SSDs,
HDDs), storage pools and their capacity, and data protection mechanisms. As with monitoring
other resources, understanding the health of Ceph storage resources is crucial for effective
cluster management. At the core, administrators must ensure sufficient capacity with
adequate expansion capabilities to provide the appropriate data protection for the
applications they support.

Monitoring performance
Ensuring that both physical and software-defined resources operate within their defined
performance service levels is the responsibility of the Ceph administrator. System alerts and
notifications serve as valuable tools, alerting the administrator to anomalies without requiring
constant monitoring. Key resources and constructs for monitoring Ceph performance include
node utilization (CPU, memory, and disk), network utilization (interfaces, bandwidth, latency,
and routing), storage device performance, and daemon workload.

5.1.3 IBM Storage Ceph monitoring tools


In this section we discuss the Ceph monitoring tools.


Ceph Dashboard
The Ceph Dashboard UI exposes an HTTP web browser interface accessible on port 8443.
The various Dashboard navigation menus provide real-time health and basic statistics, which
are context sensitive, for the cluster resources. For example, the Dashboard can display the
number of OSD read bytes, write bytes, read operations, and write operations. If you
bootstrap your cluster with the cephadm bootstrap command, then the Dashboard is enabled
by default.

The Dashboard plug-in provides context sensitive metrics for the following services:
򐂰 Hosts
򐂰 OSDs
򐂰 Pools
򐂰 Block devices
򐂰 File systems
򐂰 Object storage gateways

The Dashboard also provides a convenient access point to observe the state and the health
of resources and services in the Ceph cluster. The following Dashboard UI navigation
provides at-a-glance views into the health of the cluster.
򐂰 Cluster → Services (View Ceph services and daemon node placement and instances.)
򐂰 Cluster → Logs (View and search within the Cluster logs, Audit logs, Daemon logs.)
򐂰 Cluster → Monitoring (Control for active alerts, alert history, and alert silences.)

Prometheus
The Prometheus plug-in to the Dashboard facilitates the collection and visualization of Ceph
performance metrics by enabling the export of performance counters directly from ceph-mgr.
Ceph-mgr gathers MMgrReport messages from all MgrClient processes, including monitors
(mons) and object storage devices (OSDs), containing performance counter schema and
data. These messages are stored in a circular buffer, maintaining a record of the last N
samples for analysis.

This plugin establishes an HTTP endpoint, akin to other Prometheus exporters, and retrieves
the most recent sample of each counter upon polling, or scraping in Prometheus parlance.
The HTTP path and query parameters are disregarded, and all available counters for all
reporting entities are returned in the Prometheus text exposition format (Refer to the
Prometheus documentation for further details.). By default the module will accept HTTP
requests on TCP port 9283 on all IPv4 and IPv6 addresses on the host.
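
As an illustration, the module can be enabled (if it is not already) and its endpoint queried as follows; the host name is a placeholder:

# ceph mgr module enable prometheus
# curl http://ceph-mgr-node:9283/metrics | head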

The Prometheus metrics can be viewed in Grafana from a web browser on TCP port 3000 (for example, https://ceph-mgr-node:3000).

Ceph command line tool


The Ceph command line tool ceph offers the deepest levels of inspection and granularity to
monitor all areas of the cluster. Besides viewing performance metrics, it further offers the
ability to control daemons and adjust tuning parameters to optimize Ceph performance for the
client workload.


5.2 IBM Storage Ceph monitoring examples


The following section offers specific examples of Ceph monitoring tools for the most common
operational tasks.

5.2.1 Ceph Dashboard health


The Dashboard offers a convenient at-a-glance view into the configuration and the health of
the cluster. In addition to the Dashboard home page, deeper inspection is available by
navigating to the relevant Cluster sub-categories (for example, Cluster → Monitoring).

Figure 5-1 Ceph Dashboard home page and health condition

5.2.2 Ceph command line health check


While the Ceph Dashboard provides a convenient, at-a-glance overview of the cluster's
configuration and health, the Ceph command-line interface, accessed through the cephadm shell, empowers administrators to delve into the intricate details of all Ceph services and resources. The ceph
status and ceph health commands will report the health of the cluster. If the cluster health
status is HEALTH_WARN or HEALTH_ERR, the operator can use the ceph health detail
command to view the health check message to begin troubleshooting the issue. See
Example 5-1.

Example 5-1 Ceph command line health and status examples


[root@node1 ~]# cephadm shell
[ceph: root@node1 /]# ceph health


HEALTH_OK

[ceph: root@node1 /]# ceph health detail


HEALTH_WARN failed to probe daemons or devices; 1/3 mons down
. . . output omitted . . .

[ceph: root@node1 /]# ceph status

cluster:
id: 899f61d6-5ae5-11ee-a228-005056b286b1
health: HEALTH_OK

services:
mon: 3 daemons, quorum
techzone-ceph6-node1,techzone-ceph6-node2,techzone-ceph6-node3 (age 16h)
mgr: techzone-ceph6-node1.jqdquv(active, since 2w), standbys:
techzone-ceph6-node2.akqefd
mds: 1/1 daemons up
osd: 56 osds: 56 up (since 2w), 56 in (since 4w)
rgw: 2 daemons active (2 hosts, 1 zones)

data:
volumes: 1/1 healthy
pools: 10 pools, 273 pgs
objects: 283 objects, 1.7 MiB
usage: 8.4 GiB used, 823 GiB / 832 GiB avail
pgs: 273 active+clean

5.2.3 Ceph Dashboard alerts and logs


The Dashboard also offers views into the cluster alerts and logs for a point and click
inspection into the active alerts, alerts history, and alert silences. (For example, Cluster →
Monitoring → Alerts and Cluster → Logs → Audit Logs).

Figure 5-2 shows Ceph Dashboard alerts history.


Figure 5-2 Ceph Dashboard alerts history

Figure 5-3 on page 112 shows Ceph Dashboard audit logs.


Figure 5-3 Ceph Dashboard audit logs

5.2.4 Ceph Dashboard plug-in


The Dashboard offers a convenient at-a-glance view into context sensitive performance
metrics. If a Grafana view is available for a property page, the Dashboard will display an
“Overall Performance” tab on that selected page. Clicking on this tab displays a context
sensitive view into the current performance for the chosen service or resource. See
Figure 5-4 on page 113.


Figure 5-4 Ceph Dashboard plug-in for Grafana context sensitive performance charts

5.2.5 Grafana stand-alone dashboard


The Grafana stand-alone Dashboard offers a more in-depth view into Ceph cluster
performance metrics. The stand-alone Grafana browser offers more than a dozen pre-defined
dashboards that can be viewed in a dedicated full screen experience.

The Grafana stand-alone Dashboard can be accessed from a web browser that addresses
TCP port 3000 on the Ceph MGR node (for example, https://ceph-node1:3000). See
Figure 5-5 on page 114.


Figure 5-5 The Grafana stand-alone Dashboard cluster view

5.2.6 Ceph command line deeper dive


The following examples illustrate some of the most frequently used cephadm commands for
monitoring or querying the configuration and health of Ceph cluster constructs and resources.

Ceph service status


The ceph orch ls command displays the services currently running in the Ceph cluster at a summary level. To view details about a particular service type, use the command ceph orch ls <service-type>. Selected examples follow. Example 5-2 shows the command run from the cephadm shell.

Example 5-2 Ceph service status


[root@node1 ~]# cephadm shell
[ceph: root@node1 /]# ceph orch ls
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
alertmanager ?:9093,9094 1/1 8m ago 4w count:1
ceph-exporter 4/4 8m ago 4w *
crash 4/4 8m ago 4w *
grafana ?:3000 1/1 8m ago 4w count:1
mds.mycephfs 1/1 6m ago 2w
techzone-ceph6-node3;count:1
mgr 2/2 8m ago 4w count:2
mon 3/3 8m ago 3w
techzone-ceph6-node1;techzone-ceph6-node2;techzone-ceph6-node3
node-exporter ?:9100 4/4 8m ago 4w *


osd 1 8m ago - <unmanaged>


osd.all-available-devices 55 8m ago 4w *
prometheus ?:9095 1/1 8m ago 4w count:1
rgw.s3service ?:80 2/2 7m ago 2w
techzone-ceph6-node2;techzone-ceph6-node3;count:2

Ceph host status


The ceph orch host ls command lists all hosts in the cluster along with their addresses, labels, and status. See Example 5-3.

Example 5-3 Ceph host status


[ceph: root@node1 /]# ceph orch host ls
HOST ADDR LABELS STATUS
techzone-ceph6-node1 192.168.65.161 _admin
techzone-ceph6-node2 192.168.65.162
techzone-ceph6-node3 192.168.65.163
techzone-ceph6-node4 192.168.65.164
4 hosts in cluster

Ceph device status


This command lists all OSD devices on each host in the system, along with their size and
availability status. In Example 5-4, the available status of the devices is "No," which indicates
that these are normal devices that have already been integrated into the cluster.

Example 5-4 Ceph device status


[ceph: root@node1 /]# ceph orch device ls
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED
REJECT REASONS
techzone-ceph6-node1 /dev/sdb hdd 8589M No 29m ago
Insufficient space (<10 extents) on vgs, LVM detected, locked
techzone-ceph6-node1 /dev/sdc hdd 8589M No 29m ago
Insufficient space (<10 extents) on vgs, LVM detected, locked
locked

. . . output omitted . . .

techzone-ceph6-node4 /dev/sdn hdd 17.1G No 104s ago


Insufficient space (<10 extents) on vgs, LVM detected, locked
techzone-ceph6-node4 /dev/sdo hdd 17.1G No 104s ago
Insufficient space (<10 extents) on vgs, LVM detected, locked

Ceph OSD status


This command displays all OSD devices in the system with their organization in the cluster
OSD nodes and the device status. See Example 5-5 on page 115.

Example 5-5 Ceph OSD tree listing


[ceph: root@node1 /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.81091 root default
-3 0.20273 host techzone-ceph6-node1
9 hdd 0.01559 osd.9 up 1.00000 1.00000


14 hdd 0.01559 osd.14 up 1.00000 1.00000


20 hdd 0.01559 osd.20 up 1.00000 1.00000
25 hdd 0.01559 osd.25 up 1.00000 1.00000

. . . output omitted . . .

-9 0.20273 host techzone-ceph6-node2


1 ssd 0.00780 osd.1 up 1.00000 1.00000
5 ssd 0.00780 osd.5 up 1.00000 1.00000
2 ssd 0.00780 osd.2 up 1.00000 1.00000
7 ssd 0.00780 osd.7 up 1.00000 1.00000

5.2.7 Ceph monitoring Day 2 operations


In this final section, we explore a few additional commands that will be useful to start learning
about Ceph Day 2 operations.

Ceph versions and status


Earlier versions of Ceph clients might not benefit from features provided by the installed
version of the Ceph cluster. When upgrading a Ceph cluster, it is recommended to also
update the clients. The RADOS Gateway, the FUSE client for CephFS, the librbd, and the
command-line tools are examples of Ceph clients.

Check the version of the Ceph client on a Ceph client machine. See Example 5-6.

Example 5-6 Ceph client version report


[ceph: root@node1 /]# ceph -v
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy
(stable)

[ceph: root@node1 /]# ceph version


ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy
(stable)

Check the software versions on each of the Ceph cluster nodes. See Example 5-7.

Example 5-7 Ceph cluster versions report


[ceph: root@node1 /]# ceph versions
{
"mon": {
"ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
quincy (stable)": 3
},
"mgr": {
"ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
quincy (stable)": 2
},
"osd": {
"ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
quincy (stable)": 56
},
"mds": {


"ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)


quincy (stable)": 1
},
"overall": {
"ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343)
quincy (stable)": 64
}
}

Ceph cluster safety check


Check the current OSD capacity ratio for your cluster. See Example 5-8.

Example 5-8 Check the current OSD capacity ratio for your cluster
[ceph: root@ceph-node1 /]# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

Change one of the ratios currently set for your cluster. See Example 5-9.

Example 5-9 Change one of the ratios currently set for your cluster
[ceph: root@ceph-node1 /]# ceph osd set-full-ratio 0.9
osd set-full-ratio 0.9

[ceph: root@ceph-node1 /]# ceph osd dump | grep ratio


full_ratio 0.9
backfillfull_ratio 0.9
nearfull_ratio 0.85

[ceph: root@ceph-node1 /]# ceph osd set-full-ratio 0.95


osd set-full-ratio 0.95

[ceph: root@ceph-node1 /]# ceph osd dump | grep ratio


full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

Check the current cluster space usage. See Example 5-10.

Example 5-10 Check the current cluster space usage


[ceph: root@ceph-node1 /]# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 384 GiB 384 GiB 15 MiB 15 MiB 0
TOTAL 384 GiB 384 GiB 15 MiB 15 MiB 0

--- POOLS ---


POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 0 B 0 0 B 0 173 GiB

Check each OSD usage. See Example 5-11.


Example 5-11 Check each OSD usage


[ceph: root@ceph01 /]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL
%USE VAR PGS STATUS
0 hdd 0.12500 1.00000 128 GiB 5.1 MiB 184 KiB 0 B 4.9 MiB 128 GiB
0.00 1.00 1 up
1 hdd 0.12500 1.00000 128 GiB 5.1 MiB 184 KiB 0 B 4.9 MiB 128 GiB
0.00 1.00 1 up
2 hdd 0.12500 1.00000 128 GiB 5 MiB 184 KiB 0 B 4.8 MiB 128 GiB
0.00 0.99 0 up
TOTAL 384 GiB 15 MiB 552 KiB 0 B 15 MiB 384 GiB
0.00
MIN/MAX VAR: 0.99/1.00 STDDEV: 0

Tip: When a Ceph OSD reaches the full ratio, it indicates that the device is nearly full and
can no longer accommodate additional data. This condition can have several negative
consequences for the Ceph cluster, including:
򐂰 Reduced performance: As an OSD fills up, its performance degrades. This can lead to
slower read and write speeds, increased latency, and longer response times for client
applications.
򐂰 Increased risk of data loss: When an OSD is full, it becomes more susceptible to data
loss. If the OSD fails, it can cause data corruption or loss of access to stored data.
򐂰 Cluster rebalancing challenges: When an OSD reaches the full ratio, it can make it
more difficult to rebalance the cluster. Rebalancing is the process of evenly distributing
data across all OSDs to optimize performance and improve fault tolerance.
򐂰 Cluster outage: If a full OSD fails, it can cause the entire cluster to become
unavailable. This can lead to downtime for critical applications and data loss.

To avoid these consequences, it is important to monitor OSD usage and take proactive
measures to prevent OSDs from reaching the full ratio. This may involve adding new OSDs
to the cluster, increasing the capacity of existing OSDs, or deleting old or unused data.

5.3 Conclusion
The Ceph cluster maintains centralized cluster logging, capturing high-level events pertaining
to the entire cluster. These events are stored on disk on monitor servers and can be accessed
and monitored through various administrative tools.

The many optional Ceph Manager modules, such as Zabbix, Influx, Insights, Telegraf, Alerts, Disk Prediction, and iostat, can help you better integrate the monitoring of your IBM Storage Ceph cluster into your existing monitoring and alerting solution while refining the granularity of your monitoring. Additionally, you can utilize the SNMP Gateway service to further enhance monitoring capabilities.

Administrators can leverage the Ceph Dashboard for a straightforward and intuitive view into
the health of the Ceph cluster. The built-in Grafana dashboard enables them to examine
detailed information, context-sensitive performance counters, and performance data for
specific resources and services.


Administrators can further use cephadm, the stand-alone Grafana dashboard, and other
supporting 3rd party tools to visualize and record detailed metrics on cluster utilization and
performance.

In conclusion, the overall performance of a software-defined storage solution like IBM Storage
Ceph is heavily influenced by the network. Therefore, it is crucial to integrate the monitoring of
your IBM Storage Ceph cluster with the existing network monitoring infrastructure. This
includes tracking packet drops and other network errors alongside the health of network
interfaces. The SNMP subsystem included in Red Hat Enterprise Linux is a built-in tool that
can facilitate this comprehensive monitoring.

5.4 References on the World Wide Web


You are invited to explore the wealth of Ceph documentation and resources that can be found
on the World Wide Web. A concise selection of recommended reading is provided below:

IBM Storage Ceph Documentation:

https://www.ibm.com/docs/en/storage-ceph/6?topic=dashboard-monitoring-cluster

Community Ceph Documentation

https://docs.ceph.com/en/latest/monitoring/


Chapter 6. Day 1 and Day 2 operations


In this chapter we will cover Day 1 and Day 2 operations of an IBM Storage Ceph
environment. This chapter has the following sections:
򐂰 “Day 1 operations” on page 122
򐂰 “Day 2 Operations” on page 134


6.1 Day 1 operations


The initial operations, also known as Day-1 operations, include the following elements:
򐂰 Check deployment pre-requisites.
򐂰 Bootstrap the IBM Storage Ceph cluster.
򐂰 Deploy the IBM Storage Ceph cluster services.
򐂰 Configure the cluster (pools, placement groups, adjust parameters and so forth).

6.1.1 Prerequisites
In this section, we discuss the IBM Storage Ceph prerequisites.

Server node requirements


While basic proof-of-concept architectures can be implemented using a single node or virtual
machine, production environments require a more robust setup. For production deployments
using cephadm, IBM Storage Ceph mandates a minimum architecture comprising the
following components:
򐂰 At least three distinct hosts running Monitor (MON) daemons.
򐂰 At least three OSD hosts using directly attached storage (SAN not supported).

In addition, you should configure:


򐂰 At least two distinct manager (MGR) nodes.
򐂰 At least as many OSDs as the number of replicas configured.
򐂰 At least two distinct identically configured MDS nodes, if you are using CephFS.
򐂰 At least two distinct RADOSGW nodes, if you are using Ceph Object Gateway.

Check the sizing chapter of this book (Chapter 4, "Sizing IBM Storage Ceph") for details about server sizing and configuration.

The servers must be configured appropriately:


򐂰 Python 3
򐂰 Systemd
򐂰 Podman
򐂰 Time synchronization (Chronyd or NTP)
򐂰 Supported operating system
򐂰 cephadm requires password-less SSH capabilities
򐂰 DNS resolution must be configured for all nodes
򐂰 The OSD nodes must have the above tools plus LVM2

Firewall requirements
Table 6-1 lists the various TCP ports IBM Storage Ceph uses.

Table 6-1 TCP ports that IBM Storage Ceph uses


Service     Ports                   Description
MON         6789/TCP & 3300/TCP     Monitor daemon ports
MGR         8443/TCP                Default Dashboard port (SSL)
            8080/TCP                Dashboard port with SSL disabled
            8003/TCP                Default RESTful port
            9283/TCP                Default Prometheus port
OSD/MDS     6800-7300/TCP           OSD and MDS ports
RGW         7480/TCP                Default RGW port (Civetweb)
            80/TCP                  Default RGW port (Beast)
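
If firewalld is used on the nodes, the required ports can be opened accordingly. The following is a sketch for the Monitor ports only; repeat for the other services you deploy on each host:

# firewall-cmd --zone=public --add-port=3300/tcp --add-port=6789/tcp --permanent
# firewall-cmd --reload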

Miscellaneous requirements
If configuring SSL for your Dashboard access and your object storage endpoint (RADOS
Gateway), ensure you obtain the correct certificate files from your security team and deploy
them where needed.

Make sure your server network cards and configuration are adequate for the workload served by your IBM Storage Ceph cluster:
򐂰 Size your server's network cards appropriately for the throughput you need to deliver.
Evaluate the network bandwidth generated by the data protection, as it will need to be
carried by the network for all write operations.
򐂰 Make sure your network configuration does not present any single point of failure.
򐂰 If your cluster spans multiple subnets, make sure each server can communicate with the
other servers.

6.1.2 Deployment
The IBM Storage Ceph documentation details how to use cephadm to deploy your Ceph cluster. A cephadm-based deployment follows these steps:
򐂰 For beginners:
• Bootstrap your Ceph cluster (create one initial Monitor and Manager).
• Add services to your cluster (OSDs, MDSs, RADOS Gateways and so forth).
򐂰 For advanced users:
• Bootstrap your cluster with a complete service file to deploy everything.

Cephadm
cephadm, the only deployment tool supported for IBM Storage Ceph, has been available since the upstream Ceph Octopus release and is the default deployment tool since Pacific.

Service files
An entire cluster can be deployed using a single service file. The service file uses the following syntax. See Example 6-1.

Example 6-1 Service file format


service_type: {type_value}
service_name: {name_value}
addr: {address_value}
hostname: {hostname}
{options}
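
Once written, a specification file can be applied to the cluster with the orchestrator; cluster.yaml is a placeholder file name:

# ceph orch apply -i cluster.yaml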


The service_type accepts the following values:


򐂰 host to declare hosts.
򐂰 crash to place the Ceph crash collection daemon.
򐂰 grafana to place the Grafana component.
򐂰 node-exporter to place the exporter component.
򐂰 prometheus to place the Prometheus component.
򐂰 mgr to place the Manager daemon.
򐂰 mon to place the Monitor daemons.
򐂰 osd to place the Object Storage daemons.
򐂰 rgw to place the RADOS Gateway daemons.
򐂰 mds to place the MDS daemons.
򐂰 ingress to place the ingress loadbalancer (haproxy).

You can assign labels to hosts using the labels: field of a host service file. See
Example 6-2.

Example 6-2 Service file with label


service_type: host
addr: {host_address}
hostname: {host_name}
labels:
- xxx
- yyy
...

You can assign a specific placement for any service using the placement: field in a service
file. You can refer to “Placement” on page 126 for details on placement.

Example 6-3 Service file with placement


service_type: {service_type}
service_name: {service_name}
placement:
host_pattern: '*'

The OSD service file offers many options that control how the OSDs are deployed on the nodes; a sample specification is shown after the parameter lists below.
򐂰 block_db_size to specify the RocksDB DB size on separate devices.
򐂰 block_wal_size to specify the RocksDB WAL size on separate devices.
򐂰 data_devices to specify which devices will receive the data.
򐂰 db_devices to specify which devices will receive RocksDB DB portion.
򐂰 wal_devices to specify which devices will receive the RocksDB WAL portion.
򐂰 db_slots to specify how many RocksDB DB partition per db_device.
򐂰 wal_slots to specify how many RocksDB WAL partition per wal_device.
򐂰 data_directories to specify a list of device paths to be used.
򐂰 filter_logic to specify OR or AND between filters. Default is AND.


򐂰 objectstore to specify the OSD backend type (bluestore or filestore).


򐂰 crush_device_class to specify the CRUSH device class.
򐂰 data_allocate_fraction to specify portion of drive for data devices.
򐂰 osds_per_device to specify how many OSDs to deploy per device (default is 1).
򐂰 osd_id_claims to specify OSD ids should be preserved per node (true or false).

The data_devices, db_devices and wal_devices parameters accept the following arguments:
򐂰 all to specify all devices are to be consumed (true or false).
򐂰 limit to specify how many OSD are to be deployed per node.
򐂰 rotational to specify the type of devices to select (0 or 1).
򐂰 size to specify the size of the devices to select:
• xTB to select a specific device size.
• xTB:yTB to select devices between the two capacities.
• :xTB to select any device up to this size.
• xTB: to select any device at least this size.
򐂰 path to specify the device path to use.
򐂰 model to specify the disk model name.
򐂰 vendor to specify the vendor model name.
򐂰 encrypted to specify if the data is to be encrypted at rest (data_devices only).
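
As an example of these options, the following sketch deploys OSDs on all rotational devices of every host and places the RocksDB DB on the solid-state devices; the service_id is illustrative:

service_type: osd
service_id: hdd_with_ssd_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0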

The RADOS Gateway service file accepts the following specific arguments:
򐂰 networks to specify which CIDR the gateway will bind to.
򐂰 spec:
• rgw_frontend_port to specify which TCP port the gateway will bind.
• rgw_realm to specify the realm for this gateway.
• rgw_zone to specify the zone for this gateway.
• ssl to specify if this gateway uses SSL (true or false).
• rgw_frontend_ssl_certificate to specify the certificate to use.
• rgw_frontend_ssl_key to specify the key to use.
• rgw_frontend_type to specify the frontend to use (default is beast).
򐂰 placement.count_per_host to specify how many RADOS Gateways per node.

Container parameters
You can customize the parameters used by the Ceph containers using a special section of
your service file known as extra_container_args. To add extra parameters, use template as
shown in Example 6-4, in the appropriate service files.

Example 6-4 Passing container extra parameters


service_type: {service_type}
service_name: {service_name}
placement:
host_pattern: '*'
extra_container_args:


- "--cpus=2"

Placement
Placement can be a simple count to indicate the number of daemons to deploy. In such a
configuration, cephadm will choose where to deploy the daemons.

Placement can use explicit naming: --placement="host1 host2 …". In such a configuration,
the daemons will be deployed on the nodes listed.

Placement can use labels: --placement="label:mylabel". In such a configuration, the daemons will be deployed on the nodes that match the provided label.

Placement can use expressions: --placement="host[1-5]". In such a configuration, the daemons will be deployed on the nodes that match the provided expression.

Using a service file, you would encode the following for count. See Example 6-5.

Example 6-5 Service file with placement count


service_type: rgw
placement:
count: 3

Using a service file, you would encode the following for label. See Example 6-6.

Example 6-6 Service file with placement label


service_type: rgw
placement:
label: "mylabel"

Using a service file, you would encode the following for the host list. See Example 6-7.

Example 6-7 Service file with placement host list


service_type: rgw
placement:
hosts:
- host1
- host2

Using a service file, you would encode the following for pattern. See Example 6-8.

Example 6-8 Service file with placement pattern


service_type: rgw
placement:
host_pattern: "ceph-server-[0-5]"

Minimal cluster (bootstrapping)


The initial cephadm deployment always starts with what is known as a cluster bootstrapping.
To do so, install the cephadm binary on a node and run the following command:
$ cephadm bootstrap --mon-ip {monitor_ip_address}

This command performs the following actions:


򐂰 Creates an initial Monitor daemon.
򐂰 Creates an initial Manager daemon.
򐂰 Generates a cephadm SSH key.
򐂰 Adds the cephadm SSH key to ~/.ssh/authorized_keys.
򐂰 Writes a copy of the public key to /etc/ceph.
򐂰 Generates a minimal /etc/ceph/ceph.conf file.
򐂰 Creates the /etc/ceph/ceph.client.admin.keyring file.

You can pass an initial Ceph configuration file to the bootstrap command through the
--config {path_to_config_file} command line option.

You can override the SSH user that will be used by cephadm through the --ssh-user
{user_name} command line option.

You can pass a specific set of registry parameters through a valid registry JSON file via the
--registry-json {path_to_registry_json} command line option.

You can choose the Ceph container image you want to deploy via the --image
{registry}[:{port}]/{imagename}:{imagetag} command line option.

You can specify the network configuration to be used by the cluster. A Ceph cluster uses two
networks:
򐂰 Public network:
• Used by clients, including RGWs, to connect to all Ceph daemons.
• Used by Monitors to converse with other daemons.
• Used by MGRs and MDSs to communicate with other daemons.
򐂰 Cluster network:
• Used by OSDs to perform OSD operations such as replication and recovery.

See Figure 6-1 on page 128.

Figure 6-1 Ceph network layout

The public network is derived from the --mon-ip parameter provided to the bootstrap
command. The cluster network can be provided during the bootstrap operation via the
--cluster-network parameter. If the --cluster-network parameter is not specified, it is
set to the same value as the public network.
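
As an illustration, the following bootstrap command combines several of the options described above; the IP address, subnet, file names, and container image reference are placeholders:

$ cephadm bootstrap --mon-ip 10.10.0.101 \
    --cluster-network 192.168.10.0/24 \
    --config initial-ceph.conf \
    --registry-json registry.json \
    --image {registry}[:{port}]/{imagename}:{imagetag}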

Tip: Add the --single-host-defaults argument to your bootstrap command to bootstrap
a test cluster using a single node.

To connect to your initial cluster run the following command:

$ cephadm shell

To connect to the graphical user interface of your initial cluster, look for the following lines in
the cephadm bootstrap output and point your HTTP browser to the URL being displayed. See
Example 6-9 on page 129.

Example 6-9 Lines in the cephadm bootstrap output


Ceph Dashboard is now available at:
URL: https://{initial_node_name}:8443/
User: admin
Password: {generated_password}

This document is not designed to provide extensive details about the deployment and
configuration of your cluster. See the IBM Storage Ceph documentation for more details.

IBM Storage Ceph Solutions Guide, REDP-5715 provides GUI-based deployment methods for
those who feel more comfortable using a graphical user interface.

Once the cluster is bootstrapped, you can deploy the appropriate services for your cluster. A
production cluster requires the following elements to be deployed in appropriate numbers to
form a reliable cluster that does not present any single point of failure.

Adding nodes
Once the cluster is bootstrapped, the Ceph administrator must add all the nodes where the
services will be deployed. You must copy the cluster public SSH key to each node that will be
part of the cluster. Once the nodes are prepared, add them to the cluster. See Example 6-10.

Example 6-10 Adding nodes


# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph02
# ssh-copy-id -f -i /etc/ceph/ceph.pub root@ceph03
# ceph orch host add ceph02 10.10.0.102
# ceph orch host add ceph03 10.10.0.103 --labels=label1,label2

Tip: To add multiple hosts, use a service file describing all your nodes.

Removing nodes
If you need to remove a node that was added by mistake, use the following commands. If the
node does not have any OSDs deployed, skip the second command in the example. See
Example 6-11.

Example 6-11 Removing nodes


# ceph orch host drain {hostname}
# ceph orch osd rm status
# ceph orch ps {hostname}
# ceph orch host rm {hostname}
# ssh root@{hostname} rm -R /etc/ceph

Tip: You can also clean up the SSH keys copied to the host.

Assigning labels
You can assign labels to hosts after they were added to the cluster. See Example 6-12.

Example 6-12 Assigning label post node addition


# ceph orch host label add {hostname} {label}


Adding services
Once the cluster is bootstrapped, the Ceph administrator must add the required services for
the cluster to become fully operational.

After bootstrapping your cluster, you only have a single Monitor, a single Manager, no OSDs,
no MDSs, and no RGWs. The services are deployed in the following order:
1. Deploy at least two additional Monitors.
2. Deploy at least one additional Manager.
3. Deploy the OSDs.
4. If needed, deploy your Ceph file system.
5. If needed, deploy your RADOS Gateways.

Monitors
To deploy a total of 3 Monitors, simply deploy 2 additional Monitors. See Example 6-13.

Example 6-13 Deploy full Monitor quorum


# ceph orch daemon add mon --placement="{ceph02},{ceph03}"

Tip: This can also be achieved using a service file via the command ceph orch apply -i
{path_to_mon_service_file}.

Tip: If you want your Monitor to bind to a specific IP address or subnet use the
{hostname}:{ip_addr} or {hostname}:{cidr} to specify the host.

Managers
To deploy a total of 2 Managers, simply deploy 1 additional Manager. See Example 6-14.

Example 6-14 Deploy high available Managers


# ceph orch daemon add mgr --placement="{ceph02}"

Object Storage daemons


The only requirement for a fully operational cluster is to deploy OSDs on at least 3 separate
nodes. This requires that at least 3 of the nodes that you have added to the cluster have at
least one free local device that can be used to deploy Object Storage daemons.

You can list the devices available on all nodes by using the following command after you have
added all the nodes to your cluster. See Example 6-15.

Example 6-15 List devices


# ceph orch device ls

cephadm will scan all nodes for available devices (free of partitions, free of formatting, free of
LVM configuration). See Example 6-16.

Example 6-16 Deploy as many OSDs as available devices


# ceph orch apply osd --all-available-devices


Tip: For a production cluster with a specific configuration and strict deployment scenarios, it
is recommended to use a service file with the ceph orch apply -i
{path_to_osd_service_file} command.

Tip: To visualize what devices will be consumed by the command above, use ceph orch
apply osd --all-available-devices --dry-run command.

An OSD service file, using the information provided in 6.1.2, “Deployment” on page 123, can
be tailored to each node's needs. As such, an OSD service file can contain multiple
specifications. See Example 6-17.

Example 6-17 Multiple specification OSD service file


service_type: osd
service_id: osd_spec_hdd
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    model: MC-55-44-XZ # This model is identified as a flash device
    limit: 2
---
service_type: osd
service_id: osd_spec_ssd
placement:
  host_pattern: '*'
spec:
  data_devices:
    model: MC-55-44-XZ # This model is identified as a flash device

You can also add OSD targeting a specific device. See Example 6-18.

Example 6-18 Deploy OSD using specific device


# ceph orch daemon add osd {hostname}:data_devices={dev1}

Tip: You can pass many parameters through the command line, including data_devices,
db_devices, wal_devices. For example,
{hostname}:data_devices={dev1},{dev2},db_devices={db1}.

In some cases, it may be necessary to initialize or clean a local device so that it can be
consumed by the OSD. See Example 6-19.

Example 6-19 Initializing disk devices


# ceph orch device zap {hostname} {device_path}

6.1.3 Initial cluster configuration


In this section we cover initial cluster configuration.


Ceph RADOS Gateway


The deployment of the RADOS Gateway service, which provides support for the Swift and S3
protocols on top of the Ceph cluster, follows the same model as the other components thanks
to the cephadm standardized processes. See Example 6-20.

Example 6-20 Deploying RADOS Gateway service


# ceph orch apply rgw {service_name}

Just like for other services, arguments can be passed on the command line or provided in a
service file with a detailed configuration.

Tip: You can pass many parameters through the command line, including
--realm={realm_name}, --zone={zone_name} --placement={placement_specs},
--rgw_frontend_port={port} or count-per-host:{n}.

Ingress service
If your cluster has a RADOS Gateway service deployed, you will likely require a load balancer
in front of it to distribute the traffic between multiple RADOS Gateways and to provide a
highly available object service with no single point of failure.

The ingress service provides the following parameters:


򐂰 backend_service to specify the RGW service name to frontend.
򐂰 virtual_ip to specify the CIDR the ingress service binds to.
򐂰 frontend_port to specify the port the ingress service binds to.
򐂰 monitor_port to specify the port where the LB status is maintained.
򐂰 virtual_interface_networks to specify a list of CIDRs.
򐂰 ssl_cert to specify the SSL certificate and key.

These parameters can be used in an ingress service file. See Example 6-21.

Example 6-21 Create ingress service file for RGW


service_type: ingress
service_id: rgw.default
placement:
  hosts:
  - ceph01
  - ceph02
  - ceph03
spec:
  backend_service: rgw.default
  virtual_ip: 10.0.1.0/24
  frontend_port: 8080
  monitor_port: 1900
  ssl_cert: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----


Tip: For a production cluster with a specific configuration and strict deployment scenarios, it
is recommended to run the ceph orch apply -i {path_to_mds_service_file} command.
Once the MDSs are active, you must manually create the file system with the following
command: ceph fs new {fs_name} {meta_pool} {data_pool}.

Once your cluster is fully deployed, the best practice is to export the cluster configuration and
back it up. Using git to manage the cluster configuration is recommended.
See Example 6-22.

Example 6-22 Exporting cluster configuration


# ceph orch ls --export
service_type: alertmanager
service_name: alertmanager
placement:
  count: 1
---
service_type: crash
service_name: crash
placement:
  host_pattern: '*'
---
service_type: grafana
service_name: grafana
placement:
  count: 1
---
service_type: mds
service_id: myfs
service_name: mds.myfs
placement:
  count: 1
  hosts:
  - ceph01
---
service_type: mgr
service_name: mgr
placement:
  count: 2
---
service_type: mon
service_name: mon
placement:
  count: 5
---
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
  filter_logic: AND
  objectstore: bluestore
---
service_type: prometheus
service_name: prometheus
placement:
  count: 1
---
service_type: rgw
service_id: myrgw
service_name: rgw.myrgw
placement:
  count: 1
  hosts:
  - ceph01

Tip: Simply redirect the command output to a file. This is an easy way to create a full
cluster deployment service file once you have learned cephadm.
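
A minimal sketch of this approach, assuming a git repository has already been initialized and using a placeholder file name:

# ceph orch ls --export > cluster-specs.yaml
# git add cluster-specs.yaml
# git commit -m "Backup of Ceph service specifications"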

Pools and placement groups


Once the cluster is deployed, you may need to create additional pools or customize the pools
that were created automatically by some of the services deployed via cephadm.

In a production cluster, it is best for each OSD to serve between 100 and 200 placement
groups.

Best practice is never to have more than 300 placement groups managed by a single OSD.
The Placement Group Auto-Scaler module automatically adjusts the number of PGs assigned
to each pool, based on the target size ratio assigned to each pool.
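
For example, the following commands create a pool, enable the autoscaler on it, and assign a target size ratio; the pool name and ratio value are placeholders:

# ceph osd pool create mypool
# ceph osd pool set mypool pg_autoscale_mode on
# ceph osd pool set mypool target_size_ratio 0.2
# ceph osd pool autoscale-status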

Controlled Replication Under Scalable Hashing (CRUSH)


After the cluster is deployed, you may need to customize the CRUSH configuration to align it
with your physical infrastructure. By default, the failure domain is the host, but you may prefer
to set it to rack.
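
As a sketch, the following commands create a replicated CRUSH rule that uses rack as the failure domain and assign it to an existing pool; the rule and pool names are placeholders:

# ceph osd crush rule create-replicated replicated_rack default rack
# ceph osd pool set mypool crush_rule replicated_rack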

6.2 Day 2 Operations


Following the deployment and configuration of your cluster, some operations will be required.
They are known as Day-2 operations and include the following elements:
򐂰 Configure logging to file.
򐂰 Upgrade your cluster.
򐂰 Expand your cluster for more performance or capacity.
򐂰 Replace a failed node.
򐂰 Replace a failed disk.
򐂰 Redeploy a failed daemon.

򐂰 Monitor the cluster.
򐂰 Monitor the network.

This chapter provides insights about each topic and highlights some of the best practices and
necessary steps when applicable.

6.2.1 General considerations


Ceph is designed to automatically protect the data stored in the cluster, so a node or device
failure is handled according to the data protection scheme applied, replicated or erasure coded.

Nonetheless, the failure of a node or the failure of multiple devices can endanger the
resiliency of the data and expose it to a double failure scenario that can alter the availability of
the data.

As such, Ceph clusters must be monitored, and action must be taken swiftly to replace failed
nodes and failed devices before the cluster gets dangerously close to a double failure scenario.

As the data is being rebalanced or recovered in a Ceph cluster, it may impact the
performance of the client traffic entering the cluster. As Ceph matured, the default recovery
and backfill parameters have been adapted to minimize this negative impact.

6.2.2 Maintenance mode


cephadm allows you to easily stop all the daemons on a node. This feature is used for node
maintenance, such as an operating system upgrade. See Example 6-23.

Example 6-23 Node Maintenance mode


# ceph orch host maintenance enter {hostname}
# ceph orch host maintenance exit {hostname}

General Ceph commands


This section of the document is intended to provide some basic Ceph commands that will
prove helpful for Day-2 operations. Table 6-2 shows the general Ceph commands.

Table 6-2 General Ceph commands


Command Description

ceph status|-s Display cluster status

ceph health [detail] Display cluster health information

ceph df Display cluster usage information

ceph osd df Display OSD usage information

ceph osd tree Display CRUSH topology

ceph osd pool ls [detail] Display Ceph pool information

ceph mon dump Dump MONMap

ceph mgr dump Dump MGRMap

ceph osd dump Dump OSDMap

ceph fs dump Dump MDSMap


ceph osd pool set {pool} {parm} {value} Change a pool attribute

ceph osd pool get {pool} {parm} Display a pool attribute

ceph mgr module ls Display Manager module information

ceph mgr module enable {module} Enable a Manager module

ceph mgr module disable {module} Disable a Manager module

ceph version|-v Display Ceph binary version used

ceph versions Display daemon versions across the cluster

ceph auth list Dump cephx entries

ceph auth get-or-create Create a cephx entry

ceph auth get {username} Display a specific cephx entry

ceph auth del {username} Delete a specific cephx entry

rados -p {poolname} ls List objects in a pool

rados list-inconsistent-pg {poolname} List inconsistent PGs for a pool

rados list-inconsistent-obj {pgid} List inconsistent objects for a PG

ceph pg dump_stuck {pgstate} List PGs stuck in a given state (stale, inactive, unclean)

6.2.3 Configure logging


In this section we discuss configuring logging.

Ceph cluster logging


Ceph daemons log to journald by default, and Ceph logs are captured by the container
runtime environment. They are accessible via journalctl.

For example, to view the logs of the daemon mon.foo for a cluster with ID
5c5a50ae-272a-455d-99e9-32c6a013e694, the command would be as shown in Example 6-24.

Example 6-24 Check Ceph cluster logs


# journalctl -u [email protected]

Ceph logging to file


You can also configure Ceph daemons to log to files instead of to journald if you prefer logs to
appear in files. When Ceph logs to files, the logs appear in /var/log/ceph/<cluster-fsid>.
See Example 6-25 on page 136.

Example 6-25 Configure logs to file


# ceph config set global log_to_file true
# ceph config set global mon_cluster_log_to_file true


Tip: You can set a single daemon to log to a file by using osd.0 instead of global.

By default, cephadm sets up log rotation on each host. You can configure the logging
retention schedule by modifying /etc/logrotate.d/ceph.<CLUSTER FSID>.

All cluster logs are stored in /var/log/ceph/CLUSTER_FSID.

Because a few Ceph daemons (notably, the monitors and Prometheus) store a large amount
of data in /var/lib/ceph, we recommend moving this directory to its own disk, partition, or
logical volume so that it does not fill up the root file system.

For more information, refer to Ceph daemon logs section in IBM Documentation.

Disable logging to journald


If you choose to log to files, we recommend disabling logging to journald, or everything will be
logged twice. Run the commands in Example 6-26 to disable logging to stderr and to journald.

Example 6-26 Disable logging to stderr and journald

# ceph config set global log_to_stderr false
# ceph config set global mon_cluster_log_to_stderr false
# ceph config set global log_to_journald false
# ceph config set global mon_cluster_log_to_journald false

Cephadm logs
Cephadm writes logs to the cephadm cluster log channel. You can monitor Ceph's activity in
real-time by reading the logs as they fill up. Run the following command in Example 6-27 to
see the logs in real-time.

Example 6-27 cephadm watch logs


# ceph -W cephadm

By default, this command shows info-level events and above. Run the following commands in
Example 6-28 to see debug-level messages and info-level events.

Example 6-28 Enable debug logs on cephadm


# ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# ceph -W cephadm --watch-debug

You can see recent events by running the following command in Example 6-29.

Example 6-29 List recent cephadm events


# ceph log last cephadm

If your Ceph cluster has been configured to log events to files, a cephadm log file called
ceph.cephadm.log exists on all monitor hosts (see the Ceph daemon control section for a more
complete explanation).


Cephadm per-service event logs


To simplify debugging failed daemon deployments, cephadm stores events on a per-service
and per-daemon basis. To view the list of events for a specific service, run the following
command. See Example 6-30.

Example 6-30 cephadm per-service logs


# ceph orch ls --service_name=alertmanager --format yaml
service_type: alertmanager
service_name: alertmanager
placement:
hosts:
- unknown_host
[...]
events:
- 2022-02-01T12:09:25.264584 service:alertmanager [ERROR] "Failed to apply: \
Cannot place <AlertManagerSpec for service_name=alertmanager> on \
unknown_host: Unknown hosts"'

For more information, refer to Monitor cephadm log messages section in IBM Documentation.

Ceph subsystem logs


Each subsystem has a logging level for its output logs and for its logs in memory. You may
set different values for each subsystem by setting a log file level and a memory level for debug
logging. Ceph's logging levels operate on a scale of 1 to 20, where 1 is terse and 20 is
verbose. The in-memory logs are generally not sent to the output log unless a fatal signal is
raised, an assertion in the source code is triggered, or upon request.

A debug logging setting can take a single value for the log and memory levels, which sets
them both to the same value. For example, if you specify debug ms = 5, Ceph treats it as a log
and memory level of 5. You may also specify them separately. The first setting is the log level,
and the second is the memory level. You must separate them with a forward slash (/). For
example, if you want to set the ms subsystem's debug logging level to 1 and its memory level
to 5, you specify it as debug ms = 1/5. You can increase log verbosity at runtime in different
ways, as the following examples show.

Example 6-31 shows how to configure the logging level for a specific subsystem.

Example 6-31 Configure the logging level for a specific subsystem


# ceph tell osd.0 config set debug_osd 20

Example 6-32 shows how to use the admin socket from inside the affected service container.

Example 6-32 Configure the logging level using the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set
debug_rgw 20

Example 6-33 shows how to make the verbose/debug change permanent, so that it persists
after a restart.

Example 6-33 Make the logging change persistent


# ceph config set client.rgw debug_rgw 20


For more information, refer to Ceph subsystems section in IBM Documentation.

Getting logs from container startup


You might need to investigate why a cephadm command failed or why a certain service no
longer runs properly.

We can use the cephadm logs command to get logs from the containers running Ceph
services. See Example 6-34.

Example 6-34 Check the startup logs from a Ceph service container
# cephadm ls | grep mgr
"name": "mgr.ceph-mon01.ndicbs",
"systemd_unit":
"ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989@mgr.ceph-mon01.ndicbs",
"service_name": "mgr",
# cephadm logs --name mgr.ceph-mon01.ndicbs
Inferring fsid 3c6182ba-9b1d-11ed-87b3-2cc260754989
-- Logs begin at Tue 2023-01-24 04:05:12 EST, end at Tue 2023-01-24 05:34:07 EST.
--
Jan 24 04:05:21 ceph-mon01 systemd[1]: Starting Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989...
Jan 24 04:05:25 ceph-mon01 podman[1637]:
Jan 24 04:05:26 ceph-mon01 bash[1637]:
36f6ae35866d0001688643b6332ba0c986645c7fba90d60062e6a4abcd6c8123
Jan 24 04:05:26 ceph-mon01 systemd[1]: Started Ceph mgr.ceph-mon01.ndicbs for
3c6182ba-9b1d-11ed-87b3-2cc260754989.
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>
Jan 24 04:05:27 ceph-mon01
ceph-3c6182ba-9b1d-11ed-87b3-2cc260754989-mgr-ceph-mon01-ndicbs[1686]: debug
2023-01-24T09:05:27.272+0000 7fe90710d>

6.2.4 Cluster upgrade


As part of the lifecycle of an IBM Storage Ceph cluster, we need to do periodic updates to
keep our cluster up to date so we can take advantage of the new features and security
patches.

There are two types of upgrades: minor and major upgrades.

Major upgrade releases include more disruptive changes, such as Dashboard or mgr API
deprecation, than minor releases. Major IBM Storage Ceph releases typically use a new
major upstream version, such as moving from Pacific to Quincy or from Quincy to Reef. This
would be represented as moving from 5.X to 6.X or from 6.X to 7.X on the IBM Storage Ceph
side. Major upgrades may also require upgrading the operating system. Here is a link with the
matrix of OS-supported versions depending on the IBM Storage Ceph release.

Minor upgrades generally use the same upstream release as the major release they belong to
and try to avoid disruptive changes. In IBM Storage Ceph, minor releases would be IBM Storage
Ceph 7.1, 7.2, and so forth.


Within a minor release, there are periodic maintenance releases. Maintenance releases
bring security and bug fixes; new features are only very rarely introduced in maintenance
releases. Maintenance releases are represented as 6.1z1, 6.1z2, and so forth.

Ceph upgrade orchestration tools


For all versions of IBM Storage Ceph 5 and higher, the only supported orchestration tool that
will take care of the updates is cephadm.

The first version of IBM Storage Ceph was 5.X, so cephadm is the only orchestrator tool you
need to work with. If you are upgrading from a Red Hat Ceph Storage (RHCS) cluster at
version 3.X or 4.X to IBM Storage Ceph, the upgrade from 3.X to 4.X to 5.X is done with
Ceph-ansible; before cephadm, the Ceph-ansible repository was used for upgrading minor and
major versions of Red Hat Ceph Storage.

Upgrading Ceph with cephadm: Prerequisites


For IBM Storage Ceph, you must use the cephadm orchestrator for minor and major
upgrades. An example of an IBM Storage Ceph major upgrade would be upgrading from
version 5.3 to 6.1. Cephadm fully automates the process, making Ceph upgrades
straightforward and safe for storage administrators. The automated upgrade process
follows all the Ceph upgrade best practices. For example:
򐂰 The upgrade order starts with Ceph Managers, Ceph Monitors, and then other daemons.
򐂰 Each daemon is restarted only after Ceph indicates that the cluster will remain available.

Before starting the actual Ceph upgrade, there are some prerequisites that we need to
double-check:
1. Read the release notes of the version you are upgrading to; you may be affected by a
known issue, like, for example, an API deprecation that you may be using.
2. Open a proactive case with the IBM support team; a support ticket informs the IBM Ceph
support team of the planned upgrade.
3. Upgrade to the latest maintenance release of the latest minor version before doing a major
upgrade; for example, before upgrading from 5.3 to 6.1, ensure you are running the latest
5.3 maintenance release, in this case, 5.3z5.
4. Label a second node as the admin in the cluster to manage the cluster when the admin
node is down during the OS upgrade.
5. Check your podman version. podman and IBM Storage Ceph have different end-of-life
strategies that might make it challenging to find compatible versions. There is a matrix of
supported podman versions for each release; the following link provides an example.
6. Check if the current RHEL version is supported by the IBM Storage Ceph version you are
upgrading to. If the current version of RHEL is not supported on the new IBM Storage
Ceph release, you need to upgrade the OS before you upgrade to the new IBM Storage
Ceph version. There are two recommended ways to upgrade the OS:
– RHEL Leapp upgrade. You can find the instructions here. The Leapp upgrade is an
in-place upgrade of the OS. All your OSD data will be available once the upgrade of the
OS has finished, so the recovery time for the OSDs on these nodes will be shorter than
a full reinstall.
– RHEL full OS reinstall. With this approach, we do a clean OS install with the new
version. The data on the OSDs will be lost, so when we add the updated OS node to
the Ceph cluster, it will need to recover all the data in the OSDs fully.


Both approaches are valid and have pros and cons depending on the use case and
infrastructure resources.

Upgrading Ceph with Cephadm: Upgrade strategy


Once we have checked and fulfilled the prerequisites listed in the previous section, we are
ready to start the actual ceph upgrade with the help of cephadm.

We have two possible ways of updating Ceph with cephadm; a command sketch of both
approaches follows this list.

• A full upgrade, in which all components are updated in one go and cephadm upgrades
each of our services individually. See the link for the step-by-step guide in the
documentation.
• Staggered upgrade: This approach allows you to upgrade IBM Storage Ceph
components in phases, rather than all at once. The Ceph Orch upgrade command
enables you to specify options to limit which daemons are upgraded by a single
upgrade command. This option offers fine-grained control of what gets updated, as
you can limit per service, daemon, or host. Cephadm strictly enforces an order for
upgrading daemons, even in staggered upgrade scenarios. The current upgrade
order is: Manager (mgr) → Monitors (mons) → Crash daemons → OSDs →
MDSs → RGWs → RBD mirrors → CephFS mirrors → NFS. If you specify
parameters that would upgrade daemons out of the recommended order, the
upgrade command will block and notify you which daemons will be missed if you
proceed.
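
The following commands sketch both approaches; the container image reference, daemon types, and host names are placeholders:

To upgrade all components in one go:
# ceph orch upgrade start --image {registry}/{imagename}:{imagetag}
To run a staggered upgrade limited, for example, to the Managers and Monitors on two hosts:
# ceph orch upgrade start --image {registry}/{imagename}:{imagetag} --daemon-types mgr,mon --hosts host01,host02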

Upgrades can also be done in disconnected environments with limited internet connectivity.
Here is the documentation link with the step-by-step guide for disconnected upgrades.

Data rebalancing during the upgrade


During an upgrade, we need to restart the OSD services running on our Ceph nodes. The
time it takes to restart an OSD or an OSD node determines whether the cluster declares those
OSDs out of the cluster and starts a recovery process to fulfill the replication schema selected
for the pools/PGs on the missing OSDs. For example, if we have a replica count of 3 and lose
one replica, the cluster declares the OSD out after 10 minutes and starts a recovery process
that finds a new available OSD on which to create a new copy of the data, bringing the
replica count back to 3.

There are two main approaches. The most common one is setting the following parameters to
avoid any data movement, even if it takes longer than 10 minutes for an OS upgrade of an
OSD node.

Example 6-35 Prevent OSDs from getting marked out during an upgrade and avoid unnecessary load
on the cluster
# ceph osd set noout
# ceph osd set noscrub
# ceph osd set nodeep-scrub
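
Once the node maintenance is finished and its OSDs are back up, remember to clear these flags so that normal rebalancing and scrubbing resume:

# ceph osd unset noout
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub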

The conservative approach is to recover all OSDs when upgrading the OS, especially if doing
a clean OS installation to upgrade RHEL. This can take hours, as the running Ceph cluster
will be in a degraded state with only two valid copies of the data (assuming replica 3 is used).

To avoid this scenario, you can take a node out of the cluster by enabling maintenance mode
in cephadm. Once the monitor OSD timeout has expired (default: 10 minutes), recovery will
start. Once recovery is finished, the cluster will be in a fully protected state with 3 replicas.
Once the Red Hat OS has been updated, zap the OSD drives and add the node back to the
cluster. This will trigger a data rebalance, which will finish once the cluster is HEALTH_OK.
You can then upgrade the next node and repeat the process.

For more information, refer to the Dashboard section in IBM Documentation.

Cluster upgrade monitoring


After running the ceph orch upgrade start command to upgrade the IBM Storage Ceph
cluster, you can check the status of the upgrade process and pause, resume, or stop it. The
cluster's health changes to HEALTH_WARNING during an upgrade.

Example 6-36 shows how to determine whether an upgrade is in process and the version to
which the cluster is upgrading.

Example 6-36 Get upgrade status


# ceph orch upgrade status

If you want to get detailed information on all the steps the upgrade is taking, you can query
the cephadm logs. See Example 6-37.

Example 6-37 Get detail upgrade logs


# ceph -W cephadm

You can pause, resume, or stop an upgrade while it is running. See Example 6-38.

Example 6-38 Pause, stop or resume a running upgrade


# ceph orch upgrade [pause|resume|stop]

Cluster expansion
One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at
run time. It allows you to resize the storage cluster capacity without taking it down, so you can
add, remove, or replace hardware during regular business hours. However, adding and
removing OSD nodes can significantly impact performance.

Before adding Ceph OSD nodes, consider the effects on storage cluster performance. Adding
or removing Ceph OSD nodes causes backfilling as the storage cluster rebalances,
regardless of whether you are expanding or reducing capacity. During this rebalancing
period, Ceph uses additional resources, which can impact performance.

Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or
removing a node typically affects the performance of pools that use the CRUSH ruleset.

Adding a new IBM Storage Ceph node example


Before deploying any Ceph service into the cluster, we must fulfill the cephadm requirements.
The following is an example of the steps we need to follow to add a new host to our cluster:
1. Install the RHEL operating system.
2. Register the node and subscribe to a valid pool. See Example 6-39.

Example 6-39 Register the RHEL node to the RH CDN


$ subscription-manager register
$ subscription-manager subscribe --pool=8a8XXXXXXXX9ff

3. Ensure the correct RHEL repositories are enabled. See Example 6-40 on page 143.


Example 6-40 Enable RHEL repos


$ subscription-manager repos --disable="*"
--enable="rhel-9-for-x86_64-appstream-rpms"
--enable="rhel-9-for-x86_64-baseos-rpms"

4. Add the IBM Storage Ceph repository. See Example 6-41.

Example 6-41 Add the IBM Storage Ceph repository


$ curl
https://public.dhe.ibm.com/ibmdl/export/pub/storage/ceph/ibm-storage-ceph-6-rhel-9
.repo | sudo tee /etc/yum.repos.d/ibm-storage-ceph-6-rhel-9.repo

5. From the IBM Storage Ceph Cluster bootstrap node, enter the cephadm shell. See
Example 6-42.

Example 6-42 Enter the cephadm shell


$ cephadm shell

6. Extract the cluster public ssh key to a folder. See Example 6-43.

Example 6-43 Extract clusters public ssh key


$ ceph cephadm get-pub-key > ~/PATH

7. Copy Ceph cluster's public SSH keys to the root user's authorized_keys file on the new
host. See Example 6-44.

Example 6-44 Copy ssh key to new node you are adding
$ ssh-copy-id -f -i ~/PATH root@HOSTNAME

8. Add the new host to the Ansible inventory file. The default location for the file is
/usr/share/cephadm-ansible/hosts. Example 6-45 shows the structure of a typical
inventory file.

Example 6-45 Add new node to the ansible inventory


$ cat /usr/share/cephadm-ansible/hosts
host01
host02
host03

[admin]
host00

9. Run the preflight playbook with the --limit option. See Example 6-46.

Example 6-46 Run the preflight playbook limited to the new node
$ ansible-playbook -i INVENTORY_FILE cephadm-preflight.yml --extra-vars
"ceph_origin=ibm" --limit NEWHOST

10.From the Ceph administration node, log into the Cephadm shell. See Example 6-47.

Example 6-47 Enter the cephadm shell


$ cephadm shell


11.Use the cephadm orchestrator to add hosts to the storage cluster. See Example 6-48.

Example 6-48 Add the new host to the cluster


$ ceph orch host add HOST_NAME IP_ADDRESS_OF_HOST

Once the node is added to the cluster, you should see the node listed in the output of the
ceph orch host ls or ceph orch device ls commands.

If the disks on the newly added host pass the filter you configured in your cephadm OSD
service spec, new OSDs will be created using the drives on the new host. The cluster will
then rebalance the data, moving PGs from other OSDs to the new OSDs to distribute the data
evenly across the cluster.

Limit backfill and recovery


Tuning the storage cluster for the fastest possible recovery time can significantly impact Ceph
client I/O performance. To maintain the highest Ceph client I/O performance, limit backfill and
recovery operations and allow them to take longer by modifying the following Ceph
configuration parameters. See Example 6-49.

Example 6-49 Ceph parameters to favor client IO during backfill


osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
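
One way to apply these settings at runtime is through the central configuration database; a sketch:

# ceph config set osd osd_max_backfills 1
# ceph config set osd osd_recovery_max_active 1
# ceph config set osd osd_recovery_op_priority 1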

Increase the number of placement groups


If you are expanding the size of your storage cluster, you may need to increase the number of
placement groups. IBM recommends making incremental increases, as increasing the
number of placement groups significantly can cause a noticeable degradation in
performance.

6.2.5 Node replacement


Eventually, a cluster will experience a whole node failure. Handling a node failure is similar to
handling a disk failure, but instead of recovering placement groups for only one disk, Ceph
must recover all placement groups on the disks within that node. Ceph will automatically
detect that the OSDs are down and start the recovery process, known as self-healing.

All the recommendations we suggested in “Cluster expansion” on page 142 also apply to this
section, so read them carefully before starting a node replacement.

Node replacement strategies


There are three node failure scenarios. Here is the high-level workflow for each scenario
when replacing a node.
򐂰 Replacing the node using the failed node's root and Ceph OSD disks:
a. Disable backfilling.
b. Replace the node, taking the disks from the old node and adding them to the new
node.
c. Enable backfilling.
򐂰 Replacing the node, reinstalling the operating system, and using the Ceph OSD disks
from the failed node:
a. Disable backfilling.
b. Create a backup of the Ceph configuration.
c. Replace the node and add the Ceph OSD disks from the failed node.
d. Configure the disks as JBOD.
e. Install the operating system.
f. Restore the Ceph configuration.
g. Add the new node to the storage cluster using the command-line interface; Ceph
daemons are placed automatically on the respective node.
h. Enable backfilling.
򐂰 Replacing the node, reinstalling the operating system, and using all new Ceph OSD disks:
a. Disable backfilling.
b. Remove all OSDs on the failed node from the storage cluster.
c. Create a backup of the Ceph configuration.
d. Replace the node and add the new Ceph OSD disks.
e. Configure the disks as JBOD.
f. Install the operating system.
g. Add the new node to the storage cluster using the command-line interface; Ceph
daemons are placed automatically on the respective node.
h. Enable backfilling.

Node replacement steps


The official documentation provides detailed steps for removing and adding nodes.

6.2.6 Disk replacement


In case of disk failure, you can replace the faulty storage device and reuse the same OSD
ID, so that you do not have to reconfigure the CRUSH map. You can remove the faulty OSD
from the cluster while preserving its OSD ID by using the ceph orch osd rm command with
the --replace option. The OSD is not permanently removed from the CRUSH hierarchy;
instead, it is assigned a "destroyed" flag. This flag is used to determine which OSD IDs can be
reused in the next OSD deployment. If you use an OSD specification for deployment, your
newly added disk gets the OSD ID of the replaced disk.
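
A minimal sketch of such a replacement, assuming osd.7 on host ceph03 is the failed OSD and /dev/sdd is the new drive; the OSD ID, host name, and device path are placeholders:

# ceph orch osd rm 7 --replace
# ceph orch device zap ceph03 /dev/sdd --force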

In the official documentation, you can find detailed steps that take you through a disk
replacement. Here is the link.

Although a disk replacement is smaller in scale because it involves a single OSD rather than a
full node, all the recommendations we made in the cluster expansion section also apply, so
read them carefully before starting a disk replacement.

6.2.7 Cluster monitoring


We have three complementary tools to monitor an IBM Storage Ceph cluster: the Dashboard
RESTful API, the Ceph CLI commands, and the IBM Storage Ceph Dashboard Observability
Stack. Each tool provides different insights into cluster health.


The IBM Storage Ceph Dashboard Observability Stack provides management and monitoring
capabilities, allowing you to administer and configure the cluster and visualize related
information and performance statistics. The Dashboard uses a web server hosted by the
ceph-mgr daemon.

Dashboard Observability Stack components


Multiple components provide the Dashboard's functionality.
– The Cephadm application for deployment.
– The embedded Dashboard ceph-mgr module.
– The embedded Prometheus ceph-mgr module.
– The Prometheus time-series database.
– The Prometheus node-exporter daemon runs on each storage cluster host.
– The Grafana platform to provide monitoring user interface and alerting.
– The Alert Manager daemon for alerting.

See Figure 6-2.

Figure 6-2 Dashboard Observability Stack architecture

Dashboard Observability Stack features


The Ceph Dashboard provides the following features:
– Multi-user and role management: The Dashboard supports multiple user accounts with
different permissions and roles. User accounts and roles can be managed using both
the command line and the web user interface. The Dashboard supports various
methods to enhance password security. Password complexity rules may be configured,
requiring users to change their password after the first login or after a configurable time
period.
– Single Sign-On (SSO): The Dashboard supports authentication with an external
identity provider using the SAML 2.0 protocol.


– Auditing: The Dashboard backend can be configured to log all PUT, POST and
DELETE API requests in the Ceph manager log.

Management features
– View cluster hierarchy: You can view the CRUSH map, for example, to determine which
host a specific OSD ID is running on. This is helpful if there is an issue with an OSD.
– Configure manager modules: You can view and change parameters for Ceph manager
modules.
– Embedded Grafana Dashboards: Ceph Dashboard Grafana Dashboards might be
embedded in external applications and web pages to surface information and
performance metrics gathered by the Prometheus module.
– View and filter logs: You can view event and audit cluster logs and filter them based on
priority, keyword, date, or time range.
– Toggle Dashboard components: You can enable and disable Dashboard components
so only the features you need are available.
– Manage OSD settings: You can set cluster-wide OSD flags using the Dashboard. You
can also mark OSDs up, down, or out, purge and reweight OSDs, perform scrub
operations, modify various scrub-related configuration options, and select profiles to
adjust the level of backfilling activity. You can set and change the device class of an
OSD, and display and sort OSDs by device class. You can deploy OSDs on new drives
and hosts.
– Viewing Alerts: The alerts page allows you to see details of current alerts.
– Quality of Service for images: You can set performance limits on images, for example
limiting IOPS or read BPS burst rates.

Monitoring features
– Username and password protection: You can access the Dashboard only by providing
a configurable user name and password.
– Overall cluster health: Displays performance and capacity metrics. This also displays
the overall cluster status, storage utilization, for example, number of objects, raw
capacity, usage per pool, a list of pools and their status and usage statistics.
– Hosts: Provides a list of all hosts associated with the cluster along with the running
services and the installed Ceph version.
– Performance counters: Displays detailed statistics for each running service.
– Monitors: Lists all Monitors, their quorum status and open sessions.
– Configuration editor: Displays all the available configuration options, their descriptions,
types, default, and currently set values. These values are editable.
– Cluster logs: Displays and filters the latest updates to the cluster's event and audit log
files by priority, date, or keyword.
– Device management: Lists all hosts known by the Orchestrator. Lists all drives
attached to a host and their properties. Displays drive health predictions, SMART data,
and blink enclosure LEDs.
– View storage cluster capacity: You can view raw storage capacity of the IBM Storage
Ceph cluster in the Capacity panels of the Ceph Dashboard.
– Pools: Lists and manages all Ceph pools and their details. For example: applications,
placement groups, replication size, EC profile, quotas, CRUSH ruleset, etc.
– OSDs: Lists and manages all OSDs, their status and usage statistics as well as
detailed information like attributes, like OSD map, metadata, and performance
counters for read and write operations. Lists all drives associated with an OSD.


– Images: Lists all RBD images and their properties such as size, objects, and features.
Create, copy, modify and delete RBD images. Create, delete, and rollback snapshots
of selected images, protect or unprotect these snapshots against modification. Copy or
clone snapshots, flatten cloned images.
– RBD Mirroring: Enables and configures RBD mirroring to a remote Ceph server. Lists
all active sync daemons and their status, pools and RBD images including their
synchronization state.
– Ceph File Systems: Lists all active Ceph file system (CephFS) clients and associated
pools, including their usage statistics. Evict active CephFS clients, manage CephFS
quotas and snapshots, and browse a CephFS directory structure.
– Object Gateway (RGW): Lists all active Object Gateways and their performance
counters. Displays and manages, including add, edit, delete, Object Gateway users
and their details, for example quotas, as well as the users' buckets and their details, for
example, owner or quotas.

Security features
– SSL and TLS support: All HTTP communication between the web browser and the
Dashboard is secured via SSL. A self-signed certificate can be created with a built-in
command, but it is also possible to import custom certificates signed and issued by a
Certificate Authority (CA).

Dashboard access
You can access the Dashboard with the credentials provided on bootstrapping the cluster.

Cephadm installs the Dashboard by default. Example 6-50 is an example of the Dashboard
URL:

Example 6-50 Dashboard credentials example during bootstrap of the Ceph cluster
URL: https://ceph-mon01:8443/
User: admin
Password: XXXXXXXXX

To find the Ceph Dashboard credentials, search the /var/log/ceph/cephadm.log file for the
string "Ceph Dashboard is now available at".

You have to change the password the first time you log in to the Dashboard with the
credentials provided at bootstrapping, unless the --dashboard-password-noupdate option
was used while bootstrapping.

6.2.8 Network monitoring


Network configuration and health are critical for building a high-performance IBM Storage
Ceph cluster. As a software-defined storage solution, Ceph is very sensitive to network
fluctuations, flapping, and other health issues. Because OSDs replicate client writes to other
OSDs in the cluster over the network, any increase in network latency can significantly hinder
performance. Given these considerations, it is critical to have a network monitoring stack that
allows you to take immediate action when an issue is identified.

IBM Storage Ceph has built in some network warnings at the CLI level and also in the
observability stack in Alert Manager.

Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
availability and network performance. If a single delayed response is detected, this might
indicate nothing more than a busy OSD. But if multiple delays between distinct pairs of OSDs
are detected, this might indicate a failed network switch, a NIC failure, or a layer 1 failure.

In the output of the ceph health detail command, you can see which OSDs are experiencing
delays and how long the delays are. The output of ceph health detail is limited to ten lines.
Example 6-51 shows an example of the output you can expect from the ceph health detail
command.

Example 6-51 Output example of the ceph health detail command


[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
Slow OSD heartbeats on back from osd.0 [rack2,host1] to osd.1 [rack2,host2]
1118.001 msec possibly improving
Slow OSD heartbeats on back from osd.0 [rack2,host1] to osd.2 [rack2,host3]
1030.123 msec

To see more detail and to collect a complete dump of network performance information, use
the dump_osd_network command.
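
For example, the following command, run on the node hosting the active Manager, dumps the heartbeat entries that exceed an optional threshold in milliseconds; the Manager name and threshold value are placeholders:

# ceph daemon mgr.{mgr_name} dump_osd_network 1000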

The Alert Manager that is part of the out-of-the-box observability stack includes several
preconfigured alarms related to networking issues. See Example 6-52.

Example 6-52 Alert Manager pre-configured network alarm example


- alert: "CephNodeNetworkPacketDrops"
  annotations:
    description: "Node {{ $labels.instance }} experiences packet drop > 0.5% or > 10 packets/s on interface {{ $labels.device }}."
    summary: "One or more NICs reports packet drops"
  expr: |
    (
      rate(node_network_receive_drop_total{device!="lo"}[1m]) +
      rate(node_network_transmit_drop_total{device!="lo"}[1m])
    ) / (
      rate(node_network_receive_packets_total{device!="lo"}[1m]) +
      rate(node_network_transmit_packets_total{device!="lo"}[1m])
    ) >= 0.0050000000000000001 and (
      rate(node_network_receive_drop_total{device!="lo"}[1m]) +
      rate(node_network_transmit_drop_total{device!="lo"}[1m])
    ) >= 10
  labels:
    oid: "1.3.6.1.4.1.50495.1.2.1.8.2"
    severity: "warning"
    type: "ceph_default"
- alert: "CephNodeNetworkPacketErrors"
  annotations:
    description: "Node {{ $labels.instance }} experiences packet errors > 0.01% or > 10 packets/s on interface {{ $labels.device }}."
    summary: "One or more NICs reports packet errors"
  expr: |
    (
      rate(node_network_receive_errs_total{device!="lo"}[1m]) +
      rate(node_network_transmit_errs_total{device!="lo"}[1m])
    ) / (
      rate(node_network_receive_packets_total{device!="lo"}[1m]) +
      rate(node_network_transmit_packets_total{device!="lo"}[1m])
    ) >= 0.0001 or (
      rate(node_network_receive_errs_total{device!="lo"}[1m]) +
      rate(node_network_transmit_errs_total{device!="lo"}[1m])
    ) >= 10
  labels:
    oid: "1.3.6.1.4.1.50495.1.2.1.8.3"
    severity: "warning"
    type: "ceph_default"

You can check all of the preconfigured alarms at this link.


Related publications

The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this paper.

IBM Redbooks
The following IBM Redbooks publications provide additional information about the topic in this
document. Note that some publications referenced in this list might be available in softcopy
only.
򐂰 IBM Storage Ceph Solutions Guide, REDP-5715

You can search for, view, download or order these documents and other Redbooks,
Redpapers, Web Docs, draft and additional materials, at the following website:
ibm.com/redbooks

Online resources
These websites are also relevant as further information sources:
򐂰 IBM Storage Ceph Documentation:
https://www.ibm.com/docs/en/storage-ceph/6?topic=dashboard-monitoring-cluster
򐂰 Community Ceph documentation:
https://docs.ceph.com/en/latest/monitoring/
򐂰 AWS CLI documentation:
https://docs.aws.amazon.com/cli/index.html
򐂰 IP load balancer documentation:
https://github.ibm.com/dparkes/ceph-top-gun-enablement/blob/main/training/modul
es/ROOT/pages/radosgw_ha.adoc

Help from IBM


IBM Support and downloads
ibm.com/support

IBM Global Services


ibm.com/services
