IBM Storage Ceph for Beginners
Abstract
This document is intended as an introduction to Ceph, and to IBM Storage Ceph in particular, covering everything from installation to deployment of all unified services including block, file and object storage. It is ideally suited to help customers evaluate the benefits of IBM Storage Ceph in a test or POC environment.
NISHAAN DOCRAT
IBM Systems Hardware
Trademarks
IBM, IBM Storage Ceph, IBM Storage Scale are trademarks or registered trademarks of the
International Business Machines Corporation in the United States and other countries.
Ceph is a trademark or registered trademark of Red Hat, Inc. or its subsidiaries in the United States
and other countries.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora,
the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other
countries.
OpenStack, OpenStack Swift are trademarks or registered trademarks of the OpenStack Foundation.
Amazon Web Services, AWS, Amazon EC2, EC2, Amazon S3 are trademarks or registered trademarks
of Amazon.com, Inc. or its affiliates in the United States and other countries.
KUBERNETES is a registered trademark of the Linux Foundation in the United States and other countries.
Linux is the registered trademark of Linus Torvalds in the United States and other countries.
"SUSE" and the SUSE logo are trademarks of SUSE LLC or its subsidiaries or affiliates.
All other trademarks, trade names, or company names referenced herein are used for identification
only and are the property of their respective owners.
Table of Contents
Executive Summary ......................................................................................... 5
Ceph Introduction ........................................................................................... 5
IBM Storage Ceph Architecture and Key Components ......................................... 6
Client/Cluster versus Traditional Client/Server Architectures ................................. 7
Ceph Calculated Data Placement........................................................................... 8
Putting it all together – how does Ceph store or retrieve data .................................. 9
Who Uses Ceph and What for? ........................................................................ 12
Sizing an IBM Storage Ceph Cluster ................................................................ 14
Using IBM Storage Modeller (StorM) to Size an IBM Storage Ceph Cluster .............. 16
Obtaining a 60-day Trial License for IBM Storage Ceph Pro Edition ................... 18
Deploying an IBM Storage Ceph 4-node Cluster ............................................... 18
Firewall Rules Required for IBM Storage Ceph ..................................................... 19
Storage Configuration ......................................................................................... 23
Initial Installation of your IBM Storage Ceph cluster ............................................. 24
Register your IBM Storage Ceph Cluster Nodes with Red Hat ................................ 24
Configure the Ansible Inventory Location ............................................................. 29
Enabling SSH login as root user on Red Hat Enterprise Linux ................................. 29
Run cephadm pre-flight playbook to install all pre-requisites on all cluster nodes .. 30
Bootstrapping a new storage cluster.................................................................... 31
Distributing Ceph Cluster SSH keys to all nodes ................................................... 35
Verifying the cluster installation .......................................................................... 36
Logging into the Ceph Dashboard and Adding Cluster Nodes ................................. 37
IBM Storage Ceph RADOS Gateway (RGW) Deployment .................................... 47
Using Virtual-hosted Style Bucket Addressing with IBM Storage Ceph RGW ........... 64
Setting up IBM Storage Ceph RGW static web hosting .......................................... 71
Setting up IBM Storage Ceph RGW Presigned URL ................................................ 73
IBM Storage Ceph RADOS Block Device (RBD) Deployment ............................... 73
Accessing a Ceph RBD Image on a Windows Host ................................................. 77
Accessing a Ceph RBD Image on a Linux Host ....................................................... 81
Ceph RBD Images and Thin Provisioning .............................................................. 84
Testing RBD client access during a failure ............................................................ 85
IBM Storage Ceph Grafana Dashboards .......................................................... 89
Executive Summary
IBM Storage Ceph is the latest addition to IBM's software-defined storage portfolio. IBM has always had market-leading storage software offerings, including IBM Storage Scale (formerly GPFS) and IBM Cloud Object Storage (COS). Whilst the introduction of IBM Storage Ceph does overlap to an extent with these existing offerings, IBM Storage Ceph still offers a strong value proposition to our clients: very strong S3 API compatibility, the ability to serve out block storage, and tight integration with Kubernetes. Open source Ceph is widely used worldwide across a multitude of industries and for varied applications. IBM Storage Ceph is built on open source Ceph and gives customers the ability to purchase and implement Ceph in mission-critical environments with the full backing of IBM support and development.
This document is primarily meant to help IBM Technical Sellers, Business Partners and IBM customers deploy and evaluate all of IBM Storage Ceph's features. Whilst the publicly available documentation is comprehensive, it is primarily targeted at using the command line. This is the same issue IBM has had with IBM Storage Scale, where our customers' perception of the product's ease of use is determined by the effort and knowledge required to install, configure and administer the product. This document therefore makes use of the IBM Storage Ceph Dashboard wherever possible to specifically address this concern. The full implementation of IBM Storage Ceph, including all protocols (file, block and object), is covered in detail along with advanced features like replication, compression and encryption. Integration with native Kubernetes is also demonstrated.
AUDIENCE: This document is intended for anyone involved in evaluating, acquiring, managing,
operating, or designing a software defined storage solution based on IBM Storage Ceph.
Ceph Introduction
Ceph is a popular open source software defined distributed storage solution that is highly reliable and
extremely scalable. It is built on commodity hardware and provides file, block and object storage from
a single unified storage cluster. Ceph is designed to have no single point of failure and includes self-
healing and self-managing capabilities to reduce administrative overhead and costs. It favors
consistency and correctness above performance. It was initially developed by Sage Weil in 2004 with
the primary goal being to resolve issues with existing storage solutions at the time that struggled with
scalability and were prone to performance degradation at scale. These legacy storage solutions
centralised their metadata service, and this became a bottleneck as the solution scaled. Ceph, on the other hand, is based on a distributed storage architecture with no centralised metadata server, so it can scale to exabytes of data without any noticeable performance degradation. It achieves this through the CRUSH (Controlled Replication Under Scalable Hashing) algorithm, which calculates where data needs to be stored and retrieved from without requiring any central lookup or access to dedicated metadata servers. The beauty of this approach is that the calculation is done on the client side, so there is no single point of failure or bottleneck to limit scalability.
The first prototype of Ceph was released in 2006 and in 2007 Ceph was released under the LGPL.
Ceph was later incorporated into the Linux kernel in 2010 by Linus Torvalds and Sage Weil formed a
company called Inktank Storage to commercialise and promote Ceph in 2011. The first stable release
of Ceph (Argonaut) was released in 2012. In 2014 Red Hat purchased Inktank Storage which was a
major milestone for Ceph, as it brought significant investment into Ceph's development, an enterprise focus, and exposure to a wider audience. Red Hat Ceph Storage was paired with Red Hat's OpenStack
offering and would later form the basis for their OpenShift Data Foundation product. In 2015, the Ceph
Community Advisory Board was formed (which was later replaced in 2018 by the Ceph Foundation).
In the years following, Ceph had a string of new releases that further improved performance and
introduced new features with contributions from users, developers and companies across a broad
spectrum of industries. In January 2023, IBM acquired the entire storage development team from Red Hat and rebranded Red Hat Ceph Storage as IBM Storage Ceph. Ceph remains open source, and IBM is committed to ensuring that it stays that way. IBM is a diamond member of the Ceph Foundation, all new Ceph code changes are published upstream, and open source Ceph still forms the basis for IBM Storage Ceph. What IBM brings to Ceph is enterprise hardening: a fully tested and supported product ready for large-scale deployments in mission-critical environments.
IBM Storage Ceph Architecture and Key Components
https://docs.ceph.com/en/latest/architecture/
https://www.redbooks.ibm.com/abstracts/redp5721.html
Ceph’s architecture1 is designed primarily for reliability, scalability and performance. It makes use of
distributed computing to handle exabytes of data efficiently and each of its key components can be
individually scaled depending on its intended use case. Figure 1 illustrates the key components of a
Ceph cluster starting with the data access services, through the librados client API library and
underpinned by RADOS.
RADOS (Reliable Autonomic Distributed Object Store) is the foundation of the Ceph architecture. It is
the underlying storage layer that supports its block, file and object services. RADOS provides a low-
level object storage service that is both reliable and scalable and it provides strong consistency.
RADOS manages data placement, data protection (either using replication or erasure coding),
rebalancing, repair and recovery. A RADOS cluster comprises the following components:
1 Source: https://docs.ceph.com/en/latest/architecture/
MON (Monitor) – The Ceph Monitor's primary function is to maintain a master copy of the cluster map. Monitors also keep a detailed record of the cluster state, including all OSDs, their status, and other critical metadata. Monitors ensure the cluster achieves consensus on the state of the system using the Paxos algorithm, providing a reliable and consistent view of the cluster to all clients and OSDs. A Ceph cluster will typically have anywhere from 3 to 7 monitors.
MGR (Manager) – The Ceph Manager aggregates real-time metrics (throughput, disk usage, etc.) and tracks the current state of the cluster. It provides essential management and monitoring capabilities for the cluster. A Ceph cluster typically has one active and one or more standby Managers (3 is recommended).
OSD (Object Storage Daemon) – The Ceph Object Storage Daemon is responsible for storing data on disk and serving client I/O requests. The OSDs are also responsible for data replication, recovery and rebalancing, and they communicate with each other to ensure data is replicated and distributed across the cluster. Each OSD is mapped to a single disk. A Ceph cluster would typically have tens to thousands of OSDs.
LIBRADOS - The Ceph storage cluster provides the basic storage services that allow Ceph to uniquely deliver object, block and file storage in one unified system. However, you are not limited to using the RESTful, block or POSIX interfaces. Based on RADOS, the librados API enables you to create your own interface to the Ceph storage cluster. Librados is a C library, commonly referred to as the Ceph base library. The functions of RADOS are abstracted and encapsulated in this library; it essentially provides low-level access to the RADOS service.
RADOS BLOCK DEVICE (RBD) – A reliable and fully distributed block device with a Linux kernel client
and a QEMU/KVM driver.
RADOS GATEWAY (RGW) – This interface provides object storage. Ceph RGW is a bucket-based REST
API gateway compatible with S3 and Swift.
CEPH FILESYSTEM (CEPHFS) – A POSIX compliant distributed filesystem with a Linux kernel client
and support for FUSE (Filesystem in Userspace).
Client/Cluster versus Traditional Client/Server Architectures
One of the primary goals of the Ceph architecture is to avoid the pitfalls of traditional client/server models2. Whilst traditional client/server architectures work well, as services scale it becomes increasingly difficult to maintain the illusion of a single server when there could be hundreds or thousands of servers that make up the storage cluster. Traditional architectures used technologies such as virtual IP addresses, failover pairs or gateway nodes to hide the layout of the data from the client, though these technologies have their own limitations that affect the overall design of the system, its performance, and its ability to maintain consistency and predictable cluster behavior.
Ceph is designed around a client/cluster architecture. This basically means that there is an intelligent
client library that sits on the application clients. This library understands that it is not talking to a single
server but to a cluster of co-operating servers. This client library enables intelligent access to the
storage cluster by enabling smart routing where I/O requests can be routed to the server that has the
2 Source: Sage Weil, Ceph Tech Talk – Intro to Ceph 27/06/19 - https://www.youtube.com/watch?v=PmLPbrf-x9g&list=WL&index=48
actual data in question. It also allows for flexible addressing where it can address all the nodes in the
storage cluster and manage the fact that data can be moving around in the background whilst
providing a seamless experience for the client. Lastly, the application is abstracted from the
underlying complexity of the storage cluster by using this client library that handles the intricacies of
where exactly data is being written to or retrieved from.
Ceph Calculated Data Placement
Considering a Ceph cluster can store hundreds of millions of objects, the overhead of keeping a centralised record of where each object is stored or retrieved from would be computationally expensive and quickly become a bottleneck. In fact, one of Ceph's primary goals is to eliminate the need to "work out" where to store or retrieve data from a centralised metadata service and instead to "calculate" it. The best part of this approach is that the calculation is done on the client side. Figure 3 depicts this process. Initially, the client library requests a cluster map from one of the Ceph monitor daemons (1). As mentioned previously, the cluster map reflects the current state of the cluster, including the structure of the cluster and how to lay out data across the servers that comprise it. When the client application wants to read or write data, the client library performs a calculation based on the state of the cluster and the name of the object (2). The result of this calculation is the location in the cluster where that data needs to be stored. The client library can then contact the appropriate node in the cluster to read or write the object (3).
If the state of the cluster changes (e.g. a server node is added or removed or a device fails), an updated
cluster map is provided to the application so that when it needs to retrieve an object it had previously
stored, it will redo the calculation which might produce a different result. The application can then go
and retrieve the object from the new location.
Putting it all together – how does Ceph store or retrieve data
An object is the fundamental unit of storage in RADOS. A RADOS object consists of a unique name, associated attributes or metadata, and the actual byte data of the object, which can range from a few bytes to tens of megabytes. Most objects in Ceph default to 4MB in size. RADOS also supports a special type of object called an OMAP object; these objects store key/value map data instead of byte data. All objects in Ceph reside in storage pools. Ceph storage pools represent a high-level grouping of similar objects and are typically created based on their use case. For example, you might have pools to store RBD images or pools to store S3 objects. Ceph storage pools are logical constructs and are thin provisioned. Ceph storage pools usually share devices (unless Ceph's CRUSH algorithm has a placement policy that specifies a specific class of device, e.g. hdd vs ssd).
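Once a cluster is up and running, you can see pools and RADOS objects directly from the command line. The short sketch below assumes a working cluster and uses arbitrary names (testpool, hello-object); the ceph and rados commands shown are the standard client tools.
ceph osd pool create testpool                        # create a thin-provisioned storage pool
echo "hello ceph" > /tmp/hello.txt
rados -p testpool put hello-object /tmp/hello.txt    # store the file as a RADOS object
rados -p testpool ls                                 # list the objects in the pool
ceph df                                              # show per-pool and cluster-wide usage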
How does Ceph know where to store these RADOS objects? Figure 4 illustrates exactly how this is
achieved.
The way Ceph stores objects can be categorized into three functions: mapping files to objects, mapping objects to placement groups, and mapping placement groups to OSDs.
Let's assume we need to store a file (e.g. a large MP4 video). The first thing Ceph does is break up
the video file into multiple 4MB objects. Each object has an ino (filename and metadata) and an ono
(sequence number determined by the 4MB fragmentation algorithm). Ceph then calculates and
assigns this object an object ID (oid) by using the ino and ono. All of these objects are then mapped to
a pool. Remember, a pool is a logical construct and could contain petabytes of data and billions of
objects. It would be computationally expensive to manage placement of objects individually. To cope
with this number of objects, Ceph breaks down the pool into fragments or shards and groups them
into placement groups. Objects are mapped to placement groups within a pool by applying a static hash function to the oid, which maps the oid to an approximately uniformly distributed pseudo-random value; a bitwise AND with a mask is then performed to obtain the placement group ID, or pgid. The mask is typically the number of placement groups in the pool less 1. Once the placement group ID is calculated, Ceph then uses the CRUSH algorithm, substituting the pgid into it, to get a set of OSDs on which to store the object. CRUSH is essentially a pseudo-random data distribution and replication algorithm that always produces a consistent and repeatable calculation. Each Ceph cluster has a CRUSH map. The
CRUSH map is basically a hierarchy describing the physical topology of the cluster and a set of rules
defining policy about how to place data on those devices that make up the topology. The hierarchy
has devices (OSDs) at the leaves and internal nodes corresponding to other physical features or
groupings (e.g. hosts, racks, rows, datacenters, etc.). The rules describe how replicas are placed in
terms of that hierarchy (e.g. three replicas in different racks). Figure 5 depicts a simple CRUSH map.
In the above CRUSH map, having a 3-way replica policy will cause Ceph to distribute each replica of
an object across 3 separate rack buckets or failure domains. When you deploy a Ceph cluster, a default
CRUSH map is generated. This might be fine for a POC or test environment, but for a large cluster you
should give careful consideration to creating a custom CRUSH map that will ensure optimal
performance and maximum availability. You can also create your own bucket types to suit your
environment. The default bucket types are:
0 OSD
1 HOST
2 CHASSIS
3 RACK
4 ROW
5 PDU
6 POD
7 ROOM
8 DATACENTER
9 REGION
10 ROOT
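To relate the pgid calculation and the CRUSH hierarchy to a live cluster, the commands below can be used as a sketch. The pool and object names follow the earlier hypothetical example, and the rack-level rule assumes that rack buckets have actually been defined in your CRUSH map.
ceph osd map testpool hello-object        # shows the pgid and the set of OSDs CRUSH selected for this object
ceph osd crush tree                       # displays the CRUSH hierarchy (root, hosts/racks, OSDs)
ceph osd crush rule create-replicated replicated_racks default rack   # replicated rule with a rack failure domain
ceph osd pool set testpool crush_rule replicated_racks                # apply the rule to an existing pool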
In terms of data durability, Ceph supports both replication and erasure coding. The default replication
factor is 3 though this can be dynamically changed. Ceph also supports erasure coding using the Reed-
Solomon algorithm. In erasure coding, data is broken into fragments of two kinds: data chunks (k) and parity or coding chunks (m), and Ceph stores those chunks on different OSDs. If a drive fails or becomes
corrupted, Ceph retrieves the remaining data (k) and coding (m) chunks from the other OSDs and the
erasure code algorithm restores the object from those chunks. Erasure coding uses storage capacity
more efficiently than replication. The n-replication approach maintains n copies of an object (3x by
default in Ceph), whereas erasure coding maintains only k + m chunks. For example, 3 data and 2
coding chunks use 1.5x the storage space of the original object.
While erasure coding uses less storage overhead than replication, the erasure code algorithm uses
more RAM and CPU than replication when it accesses or recovers objects. Erasure coding is advantageous when data storage must be durable and fault tolerant but does not require fast read performance (for example, cold storage, historical records, and so on).
Ceph defines an erasure-coded pool with a profile3. Ceph uses a profile when creating an erasure-
coded pool and the associated CRUSH rule. Ceph creates a default erasure code profile when
initializing a cluster with k=2 and m=2. This means that Ceph will spread the object data over four OSDs
(k+m = 4) and Ceph can lose one of those OSDs without losing data. You can create a new profile to
improve redundancy without increasing raw storage requirements. For instance, a profile with k=8
and m=4 can sustain the loss of four (m=4) OSDs by distributing an object on 12 (k+m=12) OSDs. Ceph
divides the object into 8 chunks and computes 4 coding chunks for recovery. For example, if the object
size is 8 MB, each data chunk is 1 MB and each coding chunk has the same size as the data chunk,
that is also 1 MB. The object is not lost even if four OSDs fail simultaneously.
For instance, if the desired architecture must sustain the loss of two racks with a 40% storage
overhead, the following profile can be defined:
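As a sketch, a profile matching the parameters described below (k=4, m=2, failure domain of rack) could be created as follows; nyan_profile and nyan_pool are illustrative names only.
ceph osd erasure-code-profile set nyan_profile k=4 m=2 crush-failure-domain=rack
ceph osd erasure-code-profile get nyan_profile           # verify the profile settings
ceph osd pool create nyan_pool erasure nyan_profile      # create an erasure-coded pool using the profile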
The primary OSD will divide the NYAN object into four (k=4) data chunks and create two additional
chunks (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any
data. The crush-failure-domain=rack will create a CRUSH rule that ensures no two chunks are stored
in the same rack.
3 Source: https://docs.redhat.com/en/documentation/red_hat_ceph_storage/7/html/storage_strategies_guide/erasure-code-pools-overview_strategy#erasure-code-profiles_strategy
Who Uses Ceph and What for?
Source: https://ceph.io/en/news/blog/2022/ceph-user-survey-results-2022/
More importantly, now that we understand some of the basics of the Ceph architecture, we can see
that it is used exactly because it is open source and because of its availability, reliability and scalability
characteristics.
The primary use cases for Ceph are virtualization, containers and backup. This implies that Ceph RBD and Ceph RGW are popular, and this is also where IBM is advocating its use: for VMware environments (using Ceph block storage over NVMe/TCP) and as an S3-compatible object store for a wide range of applications, including backup.
Perhaps the biggest reason for paying for Ceph (i.e. IBM Storage Ceph) becomes clear when analyzing the graph below.
Getting help from these sources is probably acceptable when deploying Ceph into a test or
development environment. For mission critical environments however, being reliant on
documentation and the wider Ceph community for help is not practical. When you have a critical issue, you want to know that you are backed by a vendor with the support capability of IBM, with over 200 dedicated Ceph developers and the support personnel to assist you at a moment's notice.
With the release of IBM Storage Ready Nodes for IBM Storage Ceph, IBM is able to provide both
hardware and software support for your mission critical environments.
Figure 11: IBM Storage Ready Nodes for IBM Storage Ceph – Value Proposition
Apart from 3rd party organizations that provide Ceph consulting and support services, SUSE was the
only other major vendor to offer a commercial Ceph product (SUSE Enterprise Storage or SES).
However, they discontinued this offering in 2020 leaving only IBM and Red Hat as the major vendors
that offer a commercial Ceph product along with its associated support, services and continued
software maintenance.
Sizing an IBM Storage Ceph Cluster
https://www.redbooks.ibm.com/abstracts/redp5721.html
IBM Storage Ceph is deployed as containers. Containerization makes IBM Storage Ceph services easy to deploy, manage and scale. The only caveat is that troubleshooting and tuning become more complex than they would be with a native bare-metal deployment.
The minimum hardware requirements for IBM Storage Ceph are listed here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=hardware-minimum-recommendations-containerized-ceph
The above link also discusses the rules around Ceph daemon colocation. Whilst it is possible to deploy
all services on a single node (which is described later in this paper), for a POC or test environment the
minimum supported configuration is a 4-node cluster. Technically, a 3-node cluster would also work
though a failure of a single node would result in no rebuild space and affect cluster performance and
data integrity. A 4-node cluster can tolerate a single node failure without affecting data redundancy and with little impact on performance. The minimum recommended cluster configuration
is depicted below:
Figure 12: Minimum supported IBM Storage Ceph Cluster configuration with service collocation
(Source: https://www.redbooks.ibm.com/abstracts/redp5721.html)
The best-practice minimum hardware requirements for each of the IBM Storage Ceph daemons are listed below:
Figure 13: Best Practice Minimum hardware requirements for each Ceph daemon
(Source: https://www.redbooks.ibm.com/abstracts/redp5721.html)
The full software requirements for IBM Storage Ceph including supported Operating Systems and ISV
applications are documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=compatibility-matrix
There are many other factors to take into account when sizing an IBM Storage Ceph cluster. Separating internal OSD traffic onto a dedicated private network is also good practice. The number and size of your disk devices per node also play a big role. Ideally, an IBM Storage Ceph cluster should be able to recover from a complete node failure in under 8 hours. Red Hat offers
an official Recovery Calculator that can be found here:
https://access.redhat.com/labs/rhsrc/
It is also important to keep the same size disks per server node. Whilst it is possible to calculate the
optimal amount of placement groups per storage pool, using similar sized drives will ensure data is
evenly distributed across them and performance is uniform. Having mixed sized disks would cause a
performance imbalance, and IBM Storage Ceph will also complain if there is a placement group imbalance across OSDs. The Red Hat placement group calculator can be found here:
https://access.redhat.com/labs/cephpgc/manual/
The placement group autoscaler is an excellent way to automatically manage placement groups in
your Ceph cluster. The autoscaler can make recommendations and adjust the number of placement groups in a cluster based on expected pool usage and tunings set by the user.
For a POC or test environment, with a limited number of OSDs per storage node, you will most likely hit the limit of the optimal number of PGs per OSD. The default maximum is set to 300 (though this
can be adjusted). When using multiple data pools for storing objects, you need to ensure that you
balance the number of placement groups per pool with the number of placement groups per OSD so
that you arrive at a reasonable total number of placement groups. The aim is to achieve reasonably
low variance per OSD without taxing system resources or making the peering process too slow. It is
recommended to use the PG calculator to work out the optimal number of PGs per pool (as opposed
to using the autoscaler) so that you don’t over-burden your Ceph cluster with too many PGs per OSD.
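If you take this approach, a minimal sketch of applying a calculated PG count looks like the following; the pool name and the value of 128 are placeholders that should come from the PG calculator for your own pool and OSD count.
ceph osd pool autoscale-status                     # review the autoscaler's current recommendations
ceph osd pool set mypool pg_autoscale_mode off     # stop the autoscaler from overriding your value
ceph osd pool set mypool pg_num 128                # apply the calculated placement group count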
Using IBM Storage Modeller (StorM) to Size an IBM Storage Ceph Cluster
IBMers and Business Partners should be familiar with IBM StorM. This is the primary tool that is used
to size most of our storage solutions. StorM has support for IBM Storage Ceph and can be used to size
a Ceph solution based on IBM Storage Ready nodes.
A detailed description of the IBM Storage Ready Nodes for IBM Storage Ceph can be found here:
https://www.ibm.com/downloads/documents/us-en/107a02e95bc8f6bd
https://www.ibm.com/tools/storage-modeller
As a quick example, let’s size an Entry configuration with a required usable capacity of 125TB on
StorM. First, you need to add IBM Storage Ceph to your project.
Next, we choose the Solution Group and Site (refer to IBM StorM Help for more information on these
constructs).
Figure 15: IBM StorM Product Selection – Solution Group and Site
IBM StorM allows you to choose the type of Storage Ready Node and also specify some details about
the expected use case. You can see from the drop-down list below, there are a few pre-defined use
cases (each with recommended options for data redundancy).
In this example, we want an Entry configuration with a usable capacity of 125TB using a 2+2 erasure
coding data protection scheme (refer to the IBM StorM help for more information on the different use
cases). As you can see, using a 2+2 EC data protection scheme, StorM recommends 5 nodes (1 node
is added to ensure node redundancy and optimal performance even with a single node failure). IBM
StorM calculates the cluster usable capacity and, depending on the use case, will also calculate the expected performance throughput. It is also useful to get the RAW capacity values for the proposed
solution so that you can determine the cluster’s licensing requirement. Note, IBM Storage Ceph is
licensed based on RAW capacity.
If you don’t have access to IBM StorM, you can also refer to these public sources to help you calculate
the usable capacity for your Ceph cluster.
https://access.redhat.com/solutions/6980916
https://www.virtualizationhowto.com/2024/09/ceph-storage-calculator-to-find-capacity-and-cost/
https://bennetgallein.de/tools/ceph-calculator
Obtaining a 60-day Trial License for IBM Storage Ceph Pro Edition
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installing-pro-edition-free
Deploying an IBM Storage Ceph 4-node Cluster
It helps a great deal to have command output to compare against when troubleshooting issues or validating the proper execution of commands. This document therefore includes, where possible, the output of the commands executed in the lab environment.
For the initial cluster deployment, we will use 4 storage nodes deployed as virtual machines with 2
OSDs per node. For some of the advanced functions like replication and multi-cluster management,
single node IBM Storage Ceph clusters were deployed to save on lab resources.
All the IBM Storage Ceph cluster nodes were installed using RHEL 9.4. Whilst not strictly required in
a test or POC environment, it is good practice to separate internal OSD traffic to a dedicated network
for performance and security reasons (see https://www.ibm.com/docs/en/storage-ceph/7.1?topic=configuration-network-ceph). It is also important to ensure that time synchronization is enabled on, at the very least, all Ceph Monitor hosts.
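A quick way to confirm time synchronization on each node is shown below; RHEL 9 uses chrony by default and these are standard commands rather than anything specific to this lab.
systemctl enable --now chronyd    # ensure the chrony NTP daemon is running
chronyc sources -v                # list the configured time sources and their state
timedatectl                       # confirm the NTP service is active and the clock is synchronized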
Try to size your nodes symmetrically, from resources to the number of drives to the size of the drives (ideally 3-5 nodes and 12 OSDs per node). Whilst Ceph can cope with unbalanced nodes, optimal performance is easier to achieve with a symmetrical configuration (e.g. Ceph weights each disk, so larger disks get a higher weighting and therefore more I/O will be directed to them than to smaller ones). Also, whilst a 3-node cluster is viable for a POC, consider performance and capacity when a single node fails. With 4 nodes, Ceph can rebuild and recover onto the free space of the remaining 3 nodes (using a 3x replica) and be ready to deal with another failure. With 3 nodes, there is nowhere to recover to and another failure will cause an outage. Also consider cluster quorum: if you want to survive the failure of 2 nodes then you need at least a 5-node cluster, and so forth.
Firewall Rules Required for IBM Storage Ceph
IBM Storage Ceph container images are obtained via the IBM Container Registry (ICR). For the initial deployment, you need internet access from your cluster nodes to pull the required container images. Most organizations would require firewall rules to be in place to allow access to icr.io on port 443. If this is not
possible, you can create a podman private container registry with the required IBM Storage Ceph
container images to perform a disconnected install. You would need at least one node however to be
able to access the IBM Container Registry to pull the required images. Instructions on how to setup a
private container registry can be found at the link below:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installation-configuring-private-registry-disconnected
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installation-performing-disconnected
Since we need to run similar commands on all the IBM Storage Ceph cluster nodes, it is a good idea
to use tools like ParallelSSH to save time during the installation. A good reference on how to install and set up ParallelSSH can be found here: https://www.cyberciti.biz/cloud-computing/how-to-use-pssh-parallel-ssh-program-on-linux-unix/. In our lab environment, we have created an alias called cephdsh which basically issues the pssh command with the required options.
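As a sketch, such an alias could be defined as follows, assuming a hosts file that lists all four cluster nodes (the /root/ceph-nodes.txt path is hypothetical):
# /root/ceph-nodes.txt contains one node per line: cephnode1 ... cephnode4
alias cephdsh='pssh -i -h /root/ceph-nodes.txt -l root'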
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=hardening-threat-vulnerability-management
The official recommendation to harden your IBM Storage Ceph cluster is to use 4 distinct security zones, as described in the link above. Since this would be overkill for a test or POC environment, the following commands simply demonstrate the use of two zones (public and internal).
First, list the available firewall zones on each cluster node (cephdsh firewall-cmd --get-zones):
block dmz drop external home internal nm-shared public trusted work
[4] 23:32:20 [SUCCESS] root@cephnode2
block dmz drop external home internal nm-shared public trusted work
[root@cephnode1 ~]#
Add the interface designated for internal traffic to the internal zone, with persistence:
[root@cephnode1 ~]# cephdsh firewall-cmd --zone=internal --add-interface=ens19 --permanent
[1] 23:36:03 [SUCCESS] root@cephnode1
The interface is under control of NetworkManager, setting zone to 'internal'.
success
[2] 23:36:03 [SUCCESS] root@cephnode3
The interface is under control of NetworkManager, setting zone to 'internal'.
success
[3] 23:36:03 [SUCCESS] root@cephnode4
The interface is under control of NetworkManager, setting zone to 'internal'.
success
[4] 23:36:03 [SUCCESS] root@cephnode2
The interface is under control of NetworkManager, setting zone to 'internal'.
success
[root@cephnode1 ~]#
Confirm our interfaces to be used for internal traffic are assigned to the internal zone:
[root@cephnode1 ~]# cephdsh firewall-cmd --list-interfaces --zone internal
[1] 23:37:17 [SUCCESS] root@cephnode1
ens19
[2] 23:37:17 [SUCCESS] root@cephnode2
ens19
[3] 23:37:17 [SUCCESS] root@cephnode4
ens19
[4] 23:37:17 [SUCCESS] root@cephnode3
ens19
[root@cephnode1 ~]#
Check that Network Manager has these interfaces set to the correct security zone (if they are not, use
nmcli con modify to change them):
Now that we have set up two security zones, let us check which services are already enabled on each cluster node:
IBM Storage Ceph requires ports 3300 and 6789 (Ceph Monitor) and the range 6800-7300 for the OSDs. If you want to make use of the Ceph iSCSI gateway (this functionality is deprecated), then ports 3260 and 5000 are also required.
The iSCSI gateway is in maintenance as of November 2022. This means that it is no longer in active
development and will not be updated to add new features.
https://docs.ceph.com/en/reef/rbd/iscsi-overview/
The IBM Storage Ceph services are already part of firewalld's pre-defined list of available services.
[root@cephnode1 ~]# firewall-cmd --info-service ceph-mon
ceph-mon
ports: 3300/tcp 6789/tcp
protocols:
source-ports:
modules:
destination:
includes:
helpers:
[root@cephnode1 ~]# firewall-cmd --info-service ceph-exporter
ceph-exporter
ports: 9283/tcp
protocols:
source-ports:
modules:
destination:
includes:
helpers:
[root@cephnode1 ~]#
In our setup, Monitors and MDS operate on the public network, while OSDs operate on both the public and cluster networks. We will now add the required services to the appropriate zones.
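A sketch of the commands, using the pre-defined firewalld services and the cephdsh alias, might look like the following; adjust the zone assignments to match your own network layout.
cephdsh firewall-cmd --zone=public --add-service=ceph-mon --permanent    # Monitor ports (3300, 6789)
cephdsh firewall-cmd --zone=public --add-service=ceph --permanent        # OSD/MDS port range (6800-7300)
cephdsh firewall-cmd --zone=internal --add-service=ceph --permanent      # OSD replication traffic on the cluster network
cephdsh firewall-cmd --reload                                            # activate the permanent rules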
Storage Configuration
IBM Storage Ceph is meant to run on commodity hardware with internal disks (JBOD). Using external storage (e.g. SAN-attached storage) increases the cost of a Ceph deployment and defeats the purpose of Ceph, as the array most likely already provides data protection via RAID (which means data is effectively protected twice: with Ceph replication or erasure coding, and again at the block level with RAID). A failure at the array level which leads to a degraded RAID array will adversely affect
performance. Another consideration is that capacity will further be reduced (e.g. Ceph 3x replication
and RAID6 data protection). Lastly, the external SAN array is a single point of failure and would negate
Ceph’s effort to ensure data durability and maintain separate fault domains. IBM does not officially
support the use of SAN as backend storage for OSDs (see https://www.ibm.com/docs/en/storage-ceph/7.1?topic=hardware-avoid-using-raid-san-solutions).
In our example lab setup, and most likely in a test or POC environment, you will deploy your IBM Storage Ceph cluster on virtualized infrastructure (e.g. VMware, KVM or Proxmox). The backing store
for your OSDs will most likely be sourced from a single SAN array. In our lab setup, each of the nodes
is initially configured with 2 x 25GB disks for use by Ceph (more will be added later in this document).
Make sure the disks are free (you can use dd to overwrite the partition table if they already contain
one).
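For example, something like the following can be used to wipe the start of each Ceph disk; /dev/sdb and /dev/sdc are assumptions for the two 25GB lab disks, so double-check the device names with lsblk before running anything destructive.
lsblk                                                      # confirm which devices are the Ceph disks
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct   # destroy any existing partition table or signatures
dd if=/dev/zero of=/dev/sdc bs=1M count=100 oflag=direct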
Initial Installation of your IBM Storage Ceph cluster
We will make use of the cephadm utility to perform the initial cluster installation. The cephadm utility
deploys and manages a Ceph storage cluster. It is tightly integrated with both the command-line
interface (CLI) and the IBM Storage Ceph Dashboard web interface so that you can manage storage
clusters from either environment. Cephadm uses SSH to connect to hosts from the manager daemon
to add, remove, or update Ceph daemon containers. It does not rely on external configuration or
orchestration tools such as Ansible or Rook. The following is a high-level summary of the installation
steps:
Register your IBM Storage Ceph Cluster Nodes with Red Hat
Register your cluster nodes with subscription-manager register (enter your Red Hat customer portal
credentials when prompted).
[root@cephnode1 ~]#
Enable the Red Hat Enterprise Linux BaseOS and AppStream repositories on all cluster nodes.
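A sketch of the repository commands for RHEL 9, run on all nodes via the cephdsh alias, is shown below; the repository IDs are the standard RHEL 9 BaseOS and AppStream IDs.
cephdsh 'subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms --enable=rhel-9-for-x86_64-appstream-rpms'
cephdsh 'dnf update -y'    # bring all nodes up to the latest package levels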
Check that the IBM Storage Ceph (ceph-tools) repository has been successfully added to the list of repositories on all cluster nodes.
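Assuming the IBM Storage Ceph repository file has been added to /etc/yum.repos.d as described in the IBM documentation, a simple check is:
cephdsh 'dnf repolist --enabled | grep -i ceph'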
Add and accept the IBM Storage Ceph license on all cluster nodes.
Transaction Summary
================================================================================
Install 1 Package
Preparing : 1/1
Installing : ibm-storage-ceph-license-7-2.el9cp.noarch 1/1
Running scriptlet: ibm-storage-ceph-license-7-2.el9cp.noarch 1/1
Your licenses have been installed in /usr/share/ibm-storage-ceph-license/L-XSHK-LPQLHG/UTF8/
System locale: en
NOTICE
This document includes License Information documents below for multiple Programs. Each
License Information document identifies the Program(s) to which it applies. Only those
License Information documents for the Program(s) for which Licensee has acquired entitlements
apply.
.
.
.
You can read this license in another language at /usr/share/ibm-storage-ceph-license/L-XSHK-
LPQLHG/UTF8/
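The install and accept steps boil down to the following two commands; the accept file path is taken from the IBM Storage Ceph documentation, so verify it against your release before relying on it.
cephdsh 'dnf install -y ibm-storage-ceph-license'
cephdsh 'touch /usr/share/ibm-storage-ceph-license/accept'   # accepting the license is required before installing Ceph packages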
Install cephadm-ansible on the Ansible admin node (in our case, this is cephnode1).
python3-cryptography-36.0.1-4.el9.x86_64 python3-packaging-20.9-5.el9.noarch
python3-ply-3.11-14.el9.noarch
python3-pycparser-2.20-6.el9.noarch python3-pyparsing-2.4.7-9.el9.noarch
python3-resolvelib-0.5.4-5.el9.noarch
sshpass-1.09-4.el9.x86_64
Complete!
[root@cephnode1 ~]#
Configure the Ansible Inventory Location
Create an Ansible inventory hosts file containing your cluster nodes, with the admin node placed in the [admin] group:
[admin]
cephnode1
[root@cephnode1 cephadm-ansible]#
Now we need to edit the ansible.cfg file to add the location of our inventory hosts file.
forks = 20
host_key_checking = False
gathering = smart
fact_caching = jsonfile
fact_caching_connection = $HOME/ansible/facts
fact_caching_timeout = 7200
nocows = 1
callback_whitelist = profile_tasks
stdout_callback = yaml
force_valid_group_names = ignore
inject_facts_as_vars = False
retry_files_enabled = False
timeout = 60
[ssh_connection]
control_path = %(directory)s/%%h-%%r-%%p
ssh_args = -o ControlMaster=auto -o ControlPersist=600s
pipelining = True
retries = 10
[root@cephnode1 cephadm-ansible]#
Enabling SSH login as root user on Red Hat Enterprise Linux
Test to see if you can SSH to the other cluster nodes without being prompted for a password.
If you don't want to use root for the installation, then you need to create an Ansible user with sudo root access on all the cluster nodes and set up password-less SSH access for this user (see https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installation-creating-ansible-user-sudo-access and https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installation-enabling-password-less-ssh-ansible).
Run cephadm pre-flight playbook to install all pre-requisites on all cluster nodes
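From the /usr/share/cephadm-ansible directory on the admin node, the pre-flight playbook is run against the inventory with ceph_origin set to ibm; the inventory file name hosts is an assumption based on the earlier setup.
cd /usr/share/cephadm-ansible
ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=ibm"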
PLAY [insecure_registries]
*********************************************************************************************
**************************************
PLAY [preflight]
*********************************************************************************************
************************************************
.
.
.
PLAY RECAP
*********************************************************************************************
******************************************************
cephnode1 : ok=8 changed=3 unreachable=0 failed=0 skipped=25
rescued=0 ignored=0
cephnode2 : ok=8 changed=3 unreachable=0 failed=0 skipped=29
rescued=0 ignored=0
cephnode3 : ok=8 changed=3 unreachable=0 failed=0 skipped=25
rescued=0 ignored=0
cephnode4 : ok=8 changed=3 unreachable=0 failed=0 skipped=25
rescued=0 ignored=0
Make sure that none of the tasks in the Play Recap failed. Resolve any issues and re-run the playbook as required.
Bootstrapping a new storage cluster
We are now ready to bootstrap our IBM Storage Ceph cluster using the cephadm utility. There are some important pre-requisites that you need to be aware of, which are documented here: https://www.ibm.com/docs/en/storage-ceph/7.1?topic=installation-bootstrapping-new-storage-cluster.
Access to container images is necessary for the successful deployment of IBM Storage Ceph. Container images are hosted on the IBM Cloud Container Registry (ICR), and pulling them requires an entitlement key. To obtain one, navigate to the following URL and log in with your IBM ID and password: https://myibm.ibm.com/products-services/containerlibrary.
Once you login, generate a new entitlement key by clicking on “Add new key”:
Once you have obtained the entitlement key, you need to create a JSON file with this key (see https://www.ibm.com/docs/en/storage-ceph/7.1?topic=cluster-using-json-file-protect-login-information). Note, the login username is always "cp", not your IBM ID. The format of the JSON file is as follows:
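A minimal example of the registry JSON file (the password value is a placeholder for your own entitlement key):
{
  "url": "cp.icr.io",
  "username": "cp",
  "password": "<your entitlement key>"
}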
You can test access to the IBM Cloud Container Registry by issuing podman login cp.icr.io. Log in using cp as the username and your entitlement key as the password. You can also test access to the IBM Storage Ceph container images using the following command: skopeo list-tags docker://cp.icr.io/cp/ibm-ceph/ceph-7-rhel9.
For experienced Ceph users, you can bootstrap your cluster using a service configuration file. The
service configuration file is a YAML file that contains the service type, placement, and designated
nodes for the services that you want to deploy in your cluster. Since we want to take the easy route, bootstrapping a single node and then using the Ceph Dashboard wizard to add additional cluster nodes and deploy services, we won't use a service configuration file.
Remember, in our lab setup we want to separate OSD traffic onto a private network. The public network is derived from the --mon-ip parameter that is provided to the bootstrap command. The cluster network can be provided during the bootstrap operation by using the --cluster-network parameter. If the --cluster-network parameter is not specified, it is set to the same value as the public network.
The full list of bootstrap command options is documented here: https://www.ibm.com/docs/en/storage-ceph/7.1?topic=cluster-bootstrap-command-options.
In our example, we want to use FQDNs, and since we saved our ICR login details in a JSON file we need to specify it as well. If you don't have a private network for OSD traffic, then you don't need to specify --cluster-network.
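A bootstrap invocation for this kind of setup might look like the following; the cluster network subnet (192.168.10.0/24) and the JSON file path are assumptions for this lab, so substitute your own values. At the end of the bootstrap, the dashboard URL and login credentials are printed, as shown below.
cephadm bootstrap --mon-ip 10.0.0.240 \
  --cluster-network 192.168.10.0/24 \
  --registry-json /root/icr-login.json \
  --allow-fqdn-hostname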
URL: https://cephnode1.local:8443/
User: admin
Password: 3cb37m2xt6
ceph telemetry on
https://docs.ceph.com/en/latest/mgr/telemetry/
Bootstrap complete.
[root@cephnode1 ~]#
Note the URL, port and username/password that are output by the bootstrap command.
Consider contributing to the Ceph project by enabling telemetry. The telemetry module sends anonymous data about the cluster back to the Ceph developers to help them understand how Ceph is used and what problems users may be experiencing.
This data is visualized on public dashboards that allow the community to quickly see summary
statistics on how many clusters are reporting, their total capacity and OSD count, and version
distribution trends (see https://telemetry-public.ceph.com/).
Distributing Ceph Cluster SSH keys to all nodes
We need to distribute the Ceph cluster public SSH key to all nodes in the cluster. We will use the cephadm-distribute-ssh-key.yml playbook to distribute the key instead of creating and distributing the keys manually.
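A typical invocation from the cephadm-ansible directory is shown below; check the playbook in your release for the exact variable names (cephadm_ssh_user and admin_node are used here, with values matching our lab).
ansible-playbook -i hosts cephadm-distribute-ssh-key.yml \
  -e cephadm_ssh_user=root -e admin_node=cephnode1.local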
PLAY [all]
*********************************************************************************************
****************************************************************************
Verifying the cluster installation
From the cephadm shell (cephadm shell), check the cluster status with ceph -s. At this point only the bootstrap daemons are running:
services:
mon: 1 daemons, quorum cephnode1 (age 7m)
mgr: cephnode1.tbqyke(active, since 4m)
osd: 0 osds: 0 up, 0 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs:
We only have our admin node added to the cluster for now.
[ceph: root@cephnode1 /]# ceph orch host ls
HOST ADDR LABELS STATUS
cephnode1.local 10.0.0.240 _admin
1 hosts in cluster
[ceph: root@cephnode1 /]#
If you want to add node labels, you can specify them from the command line as follows:
[ceph: root@cephnode1 /]# ceph orch host label add cephnode1.local mon
Added label mon to host cephnode1.local
[ceph: root@cephnode1 /]# ceph orch host label add cephnode1.local mgr
Added label mgr to host cephnode1.local
[ceph: root@cephnode1 /]# ceph orch host label add cephnode1.local osd
Added label osd to host cephnode1.local
[ceph: root@cephnode1 /]# ceph orch host ls
HOST ADDR LABELS STATUS
cephnode1.local 10.0.0.240 _admin,mon,mgr,osd
1 hosts in cluster
[ceph: root@cephnode1 /]#
We can also label nodes in the Ceph Dashboard GUI. Labelling nodes makes it easier to deploy Ceph
services (e.g. deploy RGW to all nodes labelled with RGW).
We can also configure all available disks as OSDs from the command line (or we can do this via the
Ceph Dashboard GUI when adding additional cluster nodes). If you prefer to do this from the command
line, you can do so as follows:
[ceph: root@cephnode1 /]# ceph orch apply osd --all-available-devices
Scheduled osd.all-available-devices update...
[ceph: root@cephnode1 /]#
You can see that the two available disks were automatically configured for us.
[ceph: root@cephnode1 /]# ceph -s
cluster:
id: e7fcc1ac-42ec-11ef-a58f-bc241172f341
health: HEALTH_WARN
OSD count 2 < osd_pool_default_size 3
services:
mon: 1 daemons, quorum cephnode1 (age 37m)
mgr: cephnode1.tbqyke(active, since 35m)
osd: 2 osds: 2 up (since 18s), 2 in (since 32s)
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 453 MiB used, 50 GiB / 50 GiB avail
pgs:
Logging into the Ceph Dashboard and Adding Cluster Nodes
A Ceph cluster typically has at least three Manager (MGR) daemons that are responsible, amongst other things, for providing the Ceph Dashboard web-based GUI (they also collect and distribute statistics and perform rebalancing and other cluster tasks). You would ideally deploy MGR and MON daemons on the same nodes to achieve the same level of availability. A Ceph cluster will typically have one active and two standby Managers. Unlike the MON daemons, there is no requirement for MGR daemons to maintain quorum, and the cluster can tolerate the loss of all 3 Managers (you can just start up a MGR daemon on one of the remaining nodes, for example). When you bootstrap your cluster, the dashboard URL (the admin node on port 8443) and the login username and password are provided. The first time you connect to the Ceph Dashboard, you need to accept the security risk and continue (due to Ceph using a self-signed certificate). Log in with the admin username and the password you previously recorded from the bootstrap process.
You can reset the dashboard password from the command line:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=ia-changing-ceph-dashboard-password-using-command-line-interface
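For reference, a sketch of the reset from inside the cephadm shell; the password file path and the new password are placeholders.
echo -n 'MyNewPassw0rd!' > /tmp/dashboard_password.txt
ceph dashboard ac-user-set-password admin -i /tmp/dashboard_password.txt
rm /tmp/dashboard_password.txt    # do not leave the password file lying around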
Before we add additional cluster nodes, remember that for the minimum recommended 4-node cluster we want to deploy the following Ceph daemons, keeping in mind the service collocation guidelines. (Note: Ceph services are logical groups of Ceph daemons that are configured to run together in a Ceph cluster. Ceph daemons are individual processes that run in the background when a Ceph service starts.)
You will initially be prompted to change the current password. Once you have changed the password, you need to add the additional nodes to your Ceph cluster via the "Expand Cluster" wizard. Note, there are alerts prompting you to enable public telemetry and IBM Call Home (again, Call Home is one of the reasons why you would choose IBM Storage Ceph over the publicly available version).
Figure 25: Ceph expand your cluster wizard after initial login
As you can see, all the Ceph core services (blue) are deployed on our admin/install node. You can also see the node labels (black). As mentioned earlier, when you deploy Ceph services, you can specify deployment based on labels instead of specifying the exact nodes. Lastly, you can see a summary of the resources for each node.
We need to add the 3 other cluster nodes to the cluster. Click on "+ Add" to add your nodes. You need to enter each node's hostname (we are using FQDNs, as specified during bootstrap). You can see that some default labels are automatically added to the node.
Figure 27: Ceph expand your cluster wizard – Adding your cluster nodes
Add the rest of the nodes and, before you click "Next", check that the labels match the desired service collocation for a 4-node cluster. If they do not, select the node and click "Edit" to change its labels.
Figure 28: Ceph expand your cluster wizard – Checking Node Labels
Once you have completed the node labels, you can click “Next”.
Figure 29: Ceph expand your cluster wizard – Recommended service collocation for a 4-node cluster
On the next screen you can create OSDs and enable data-at-rest encryption (DAE, discussed later in this document). For now, since we already set OSDs to all available devices on the CLI (equivalent to Cost/Capacity Optimized), we don't have to choose anything. For a detailed explanation of all the
available options for OSD creation (including the Advanced Mode options) see here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=osds-managing.
Clicking on “Next” takes us to the “Create Services” step. We will deploy protocols separately so for
now just accept the default mandatory cluster services and their respective counts. As an example,
the Grafana service will be deployed on a cluster node that is tagged with the correct label (or if no
label is specified it will be deployed on any of the manager nodes).
Select “Next” and you will have a chance to review your configuration and accept it.
Depending on the size of your cluster, the GUI will display some warnings as the additional nodes and services are added. Once completed, you should have a stable cluster. Check your cluster inventory to make sure you have the correct number of deployed services. In our example, the Expand Cluster wizard deployed 4 monitors.
We only want 3 monitors. To correct this, navigate to Administration -> Services and select and edit
the mon service. Change the count to 3.
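The same change can be made from the cephadm shell; the placement string below assumes the mon label has been applied to the three intended nodes.
ceph orch apply mon --placement="3 label:mon"
ceph orch ls mon     # confirm the service now shows 3 running daemons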
If you expand the service, you will see which cluster nodes the service is deployed on.
As a reference, our example 4-node cluster has the following services and node labels deployed.
If you navigate to Cluster -> Pools, you will see that a default .mgr pool is created with a 3x replica
protection scheme.
If you navigate to Cluster -> Hosts, you can check the status and get information on all your cluster
nodes.
If you navigate to Cluster -> OSDs, you can check the status and get information on all your cluster
OSDs.
If you navigate to Cluster -> Physical Disks, you can get information on all your cluster physical disks.
If you navigate to Cluster -> CRUSH map, you can check the current CRUSH map hierarchy. As
explained earlier, our CRUSH map failure domains in our example are hosts (each replica will reside
on a separate host).
Lastly, we can view the status of the MONITOR daemons. Click on Cluster -> Monitors. The Ceph
cluster quorum is displayed here. We need at least 2 MONs for the cluster to remain active.
Now that you have a working IBM Storage Ceph cluster, we will now start to deploy the different
protocols.
IBM Storage Ceph RADOS Gateway (RGW) Deployment
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=gateway-basic-configuration
Before we start, there are a few concepts that we need to understand. The Ceph Object Gateway
typically supports a single site or multi-site deployment. In order to support a multi-site configuration,
Ceph makes use of realms, zonegroups (formerly called regions) and zones. A realm represents a
globally unique namespace consisting of one or more zonegroups containing one or more zones with
each zone supported by one or more rgw instances and backed by a single Ceph storage cluster. A
single zone contains buckets, which in turn contain objects. A realm enables the Ceph Object Gateway
to support multiple namespaces and their configuration on the same hardware.
For our purposes, we want to first deploy a single site configuration. To do this, we need a single
zonegroup which contains one zone with one or more Ceph RGW instances. You can specify the realm,
zonegroup and zone when creating an RGW service, or accept the defaults. Navigate to Administration -> Services -> Create and select RGW.
You have to specify the service id (in our case it’s rgw_default) and also the placement of the service
(we will use label since we already added the required rgw labels to each Ceph cluster node). To
support a multi-site configuration, you can create your own realm, zonegroup and zone names (or
accept the default). If you are not planning to test a multi-site configuration then you can just accept
the defaults.
Figure 46: Ceph RGW service – Create Realm, Zonegroup and Zone
Finally, you need to specify the port for the RGW service to run on, which in our case is port 8080.
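The same service can also be created from the command line with the orchestrator. A minimal sketch using the label and port from our example (realm, zonegroup and zone options can be added when testing multi-site):
ceph orch apply rgw rgw_default --placement="label:rgw" --port=8080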
You can verify if the RGW instances are deployed after clicking on “Create Service”.
You will notice that a set of default pools are created when you deploy the RGW service. The .rgw.root
pool is where the configuration for the Ceph Object Gateway (RGW) is stored. This includes
information such as realms, zone groups, and zones. Then 3 pools are created per zone as illustrated
below.
One of the advantages of containerized Ceph is that firewall rules are automatically updated on the
relevant Ceph cluster nodes for each service that is deployed. As an example, since our RGW service uses port 8080, we can see that this port has already been opened in the firewall on our RGW nodes.
You can also check the status of the RGWs from the command line as follows:
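For example (a sketch; output omitted):
ceph orch ls rgw
ceph orch ps --daemon-type rgw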
Now that we have deployed the RGWs, we can manage this service from the Ceph dashboard.
You can use the radosgw-admin command to query the current RGW service configuration from the
command line. As an example, we will query the zonegroup and zone configuration for our
configuration as depicted below:
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "a9c5b73e-66bd-4e10-9a2f-5df2f6fc515a",
"sync_policy": {
"groups": []
},
"enabled_features": [
"resharding"
]
}
[ceph: root@cephnode1 /]# radosgw-admin zone get
{
"id": "7109acc2-883b-4502-b863-4e7097d7d83c",
"name": "DC1_ZONE",
"domain_root": "DC1_ZONE.rgw.meta:root",
"control_pool": "DC1_ZONE.rgw.control",
"gc_pool": "DC1_ZONE.rgw.log:gc",
"lc_pool": "DC1_ZONE.rgw.log:lc",
"log_pool": "DC1_ZONE.rgw.log",
"intent_log_pool": "DC1_ZONE.rgw.log:intent",
"usage_log_pool": "DC1_ZONE.rgw.log:usage",
"roles_pool": "DC1_ZONE.rgw.meta:roles",
"reshard_pool": "DC1_ZONE.rgw.log:reshard",
"user_keys_pool": "DC1_ZONE.rgw.meta:users.keys",
"user_email_pool": "DC1_ZONE.rgw.meta:users.email",
"user_swift_pool": "DC1_ZONE.rgw.meta:users.swift",
"user_uid_pool": "DC1_ZONE.rgw.meta:users.uid",
"otp_pool": "DC1_ZONE.rgw.otp",
"system_key": {
"access_key": "",
"secret_key": ""
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "DC1_ZONE.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "DC1_ZONE.rgw.buckets.data"
}
},
"data_extra_pool": "DC1_ZONE.rgw.buckets.non-ec",
"index_type": 0,
"inline_data": true
}
}
],
"realm_id": "a9c5b73e-66bd-4e10-9a2f-5df2f6fc515a",
"notif_pool": "DC1_ZONE.rgw.log:notif"
}
[ceph: root@cephnode1 /]#
Note that the placement target for our zonegroup is set to default-placement. Placement targets control
which pools are associated with a bucket and cannot be modified once a bucket is created. Storage
classes specify the placement of object data. S3 Bucket Lifecycle (LC) rules can automate the
transition of objects between storage classes. Storage classes are defined in terms of placement
targets. Each zonegroup placement target lists its available storage classes with an initial class named
STANDARD. The zone configuration is responsible for providing a data_pool pool name for each of the
zone group’s storage classes. Demonstrating Bucket Lifecycle policy is outside the scope of this
document but it’s important to understand how this works.
Before we start using the object service, let us configure high availability for the Ceph object gateways
we deployed. Even though we have two RGWs running, a failure of one instance or cluster node will
result in all clients using that gateway failing to connect. Also, we have to statically configure the
clients across the two RGWs in order to distribute the workload, which is not ideal. Fortunately, Ceph
includes a built-in load balancer which is referred to as an ingress service. The ingress service allows
you to create a high availability endpoint for RGW with a minimum set of configuration options. The
orchestrator will deploy and manage a combination of haproxy and keepalived to provide load
balancing on a floating virtual IP. We also want to use SSL for our object gateway service as most
commercial applications require a secure connection to an object store and won’t work over standard
http. This requires SSL termination by the ingress service (and not on the object gateways
themselves).
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=gateway-high-availability-service
Firstly, let us create a self-signed certificate for use by the ingress service. We will use cephs3.local
(10.0.0.244) as the virtual IP address for our ingress service. All object clients will connect to the
virtual IP and will be load balanced across our two RGW instances automatically. If you are not planning on using SSL, you can skip this step.
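The cert.txt excerpt below starts at the distinguished-name section; a complete file would begin with a [req] header along these lines (a sketch, not captured from the lab):
[req]
default_bits = 2048
distinguished_name = req_distinguished_name
x509_extensions = v3_req
prompt = no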
[req_distinguished_name]
C = ZA
ST = Gauteng
L = Johannesburg
O = Acme Ltd
OU = Storage Management
CN = cephs3.local
[v3_req]
keyUsage = digitalSignature, keyEncipherment, nonRepudiation, keyCertSign
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = *.local
[root@cephnode2 cert]#
[root@cephnode2 cert]# openssl req -new -nodes -x509 -days 365 -keyout cephs3.key -out
cephs3.crt -config ./cert.txt -addext 'basicConstraints = critical,CA:TRUE'
....+...+...........+......+......+....+.....+......+...+....+...++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++*.........+..+............+...+.......+.....+.+.....+...
....+...+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++*...+.....+...+....+
.....+....+............+........+.........+...+...+......+.+...+..+.........+.+.........+...+
.....................+.....+..........+......+.....+.+...+..+....+......+++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
....+...+..+.+........+..........+..+.........+.............+........++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++*......+.+...+.........+++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++*.+......+....+.....+..........+......+.....+......+...+.
......+...........+....+...........+.............+.....+..........+............+............+
..+.+...+...........+.+.....+....+..................+.....+.+.....+.+...+..+...+.......+.....
.+..+......+.......+.....+....+.....+.+...+.....+....+..+......+............+.+..+...........
.+.......+.....+..........+.........+..+.......+..+.+.........+...+......+.....+....+...+...+
.........+..+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----
[root@cephnode2 cert]# openssl x509 -noout -text -in cephs3.pem
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
09:ac:c8:39:8e:5d:25:15:4c:e4:bc:2d:ca:0b:e2:0e:a5:f4:4d:83
Signature Algorithm: sha256WithRSAEncryption
Issuer: C = ZA, ST = Gauteng, L = Johannesburg, O = Acme Ltd, OU = Storage
Management, CN = cephs3.local
Validity
Not Before: Jul 19 19:25:43 2024 GMT
Not After : Jul 19 19:25:43 2025 GMT
Subject: C = ZA, ST = Gauteng, L = Johannesburg, O = Acme Ltd, OU = Storage
Management, CN = cephs3.local
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
.
.
.
91:f7:75:ea:cd:1f:68:2a:a6:fa:37:34:3a:b1:34:4a:28:46:
e0:cb:f2:1a:03:b0:4b:a0:43:14:c3:2b:6b:43:1e:33:8a:80:
f7:33:2c:61:21:70:7d:96:ea:ef:02:74:f8:1f:25:38:15:47:
f3:bf:5b:32:69:a5:84:20:0e:6d:5b:cb:52:eb:42:21:e9:34:
6f:98:95:1a:71:29:33:d7:8f:6e:ab:3d:1c:ce:8d:7b:42:cd:
a9:1a:d0:e6
[root@cephnode2 cert]#
[root@cephnode2 cert]# openssl verify cephs3.crt
C = ZA, ST = Gauteng, L = Johannesburg, O = Acme Ltd, OU = Storage Management, CN =
cephs3.local
error 18 at 0 depth lookup: self-signed certificate
error cephs3.crt: verification failed
[root@cephnode2 cert]# cp cephs3.crt /etc/pki/ca-trust/source/anchors/
[root@cephnode2 cert]# update-ca-trust enable; update-ca-trust; update-ca-trust extract
pkcs11:id=%42%3D%2B%24%A6%C1%45%CE;type=cert
type: certificate
label: A-Trust-Qual-02
trust: anchor
category: authority
.
.
.
[root@cephnode2 cert]# openssl verify cephs3.crt
cephs3.crt: OK
[root@cephnode2 cert]# openssl verify cephs3.pem
cephs3.pem: OK
[root@cephnode2 cert]#
You can deploy the ingress service using the Ceph orchestrator or via the dashboard. To use the Ceph orchestrator, we need to create an input ingress.yaml file.
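A minimal specification would look something like the following sketch; the service name rgw.rgw_default, the rgw label, port 443 and the virtual IP 10.0.0.244 come from our example, and ssl_cert holds the concatenated certificate and key we generated above:
service_type: ingress
service_id: rgw.rgw_default
placement:
  label: rgw
spec:
  backend_service: rgw.rgw_default
  virtual_ip: 10.0.0.244/24
  frontend_port: 443
  monitor_port: 1967
  ssl_cert: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
    -----BEGIN PRIVATE KEY-----
    ...
    -----END PRIVATE KEY-----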
You can use a host of web-based tools to validate your input yaml file syntax. As an example,
https://www.yamllint.com/.
kdzSpfZ2xKXoy5g2sBtRRhALTkgPktpx3CbnNFGM8yq50GvTj0leuy8G3NgA/8gb
lyURRXAx1QKBgQCjWYgU/nt+05NsQry5hzFsBueHiPOCxyJM9MqL9DXjYqu6wn4g
KAFQj4OgYu1B6/We8dii1vHmUdVCiKYeZ6/pRzRyM2FXMR/BfA3dE/YuZqkuGoRx
U5goEGOrrtVBPseYrLzQDOuxMbW07ETdpHcHMxkbqLV7A+PFwCs+Mjx7lQKBgAtC
nFjkseghcuKmxEUvUklikxYvObQy7la1dCDPTgzujH1VBXIA6f86+T7TAkC4hHFC
8zH+oGmnIAlly2l8u4N6ZoIj6TQWY010JI+nfb+3v+79ATIgDaQ9yYgTThRb6/9A
XDBB7nnyVzDlgGmx/jr61sX5txUNXTpid25It7XlAoGBAJwBTX86b4HMzHUDghVA
oTbH4kjK0msYr+9Hbsc+iaaoAfrY8pUkP1krcGhj3S5LjHEAYUTWZU+xW54wUFWe
o6CWB+cQ7RQww7vITE0B0F24iS24IoTA1OXmN8bit/KW4+PkXNFv0uP3d0b9R7Qr
pJaeAgy19eTaPnUSdCWrywhw
-----END PRIVATE KEY-----
[root@cephnode1 ~]#
Note, we want to deploy the ingress service on the same nodes as our RGWs using the placement set
to label. Before we deploy the ingress service, we need to make sure we have the latest ingress and
haproxy container images.
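A sketch of checking the configured images and applying the specification (the config option names are the cephadm module settings, shown here as an assumption):
ceph config get mgr mgr/cephadm/container_image_haproxy
ceph config get mgr mgr/cephadm/container_image_keepalived
ceph orch apply -i ingress.yaml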
We can check the status of the deployment. We should have two haproxy and two keepalived daemons
running.
To validate that the HA configuration for the Ceph Object Gateway is working, use wget or curl. Both should return an index.html.
index.html [ <=> ] 214 --.-KB/s in 0s
[root@cephnode1 ~]#
[root@cephnode1 ~]# curl -k https://cephs3.local
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult
xmlns="http://s3.amazonaws.com/doc/2006-03-
01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner> [root@cephnode1 ~]#
As mentioned earlier, you can deploy the ingress service via the Ceph dashboard as follows:
Figure 52: Defining the Ceph Ingress Service using the Ceph Dashboard
Now that we have a highly available object gateway, we can create an object user and test access.
Navigate to Object -> Users on the Ceph Dashboard and create a new user. Make sure to check the
box to generate a S3 access key ID and secret access key.
If you get an error “The Object Gateway Service is not configured” or “Error Connecting to Object
Gateway” when navigating to the Object service in the dashboard then check that the server port is
set correctly (in our case 443 after deploying the ingress service). You can query the current port by
issuing “ceph config dump | grep -i -e rgw -e dash”. You can also get this error when configuring
multi-site. See here https://bugzilla.redhat.com/show_bug.cgi?id=2231072 to resolve it.
You can also display the S3 access key ID and secret access key by clicking on the user and choosing
“Show” under Keys. Note, a default dashboard user is also created by default.
Figure 54: Displaying an object user access key ID and secret access key
We will use AWSCLI for testing access to our object service (https://aws.amazon.com/cli/). We need
the PEM file in order to use SSL via HTTPS with AWSCLI.
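A typical test sequence with the AWS CLI looks like this (a sketch; the bucket name is illustrative, the keys are those of the object user created above, and --ca-bundle points at the self-signed certificate we generated earlier):
aws configure
aws --endpoint-url https://cephs3.local --ca-bundle ./cephs3.pem s3 mb s3://awsbucket
aws --endpoint-url https://cephs3.local --ca-bundle ./cephs3.pem s3 ls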
To test high availability, you can check which node has the virtual IP (10.0.0.244 or cephs3.local). In our case, cephnode2 currently holds the virtual IP. We will abruptly shut down this node and test access to the virtual IP with AWSCLI as follows:
With cephnode2 down, we need to test access to the virtual IP, which has now moved to cephnode3 as per the above output. Firstly, we verify the node is down and its RGW is not available.
id: e7fcc1ac-42ec-11ef-a58f-bc241172f341
health: HEALTH_WARN
1/3 mons down, quorum cephnode1,cephnode3
2 osds down
1 host (2 osds) down
Degraded data redundancy: 192/663 objects degraded (28.959%), 56 pgs degraded,
327 pgs undersized
services:
mon: 3 daemons, quorum cephnode1,cephnode3 (age 108s), out of quorum: cephnode2
mgr: cephnode1.tbqyke(active, since 77s)
osd: 8 osds: 6 up (since 2m), 8 in (since 27h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 7 pools, 417 pgs
objects: 221 objects, 586 KiB
usage: 363 MiB used, 150 GiB / 150 GiB avail
pgs: 192/663 objects degraded (28.959%)
271 active+undersized
90 active+clean
56 active+undersized+degraded
We have just demonstrated the high availability of the Ceph RGW service. We need to reboot
cephnode2 and wait for it to rejoin the cluster.
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 2m)
mgr: cephnode1.tbqyke(active, since 11m), standbys: cephnode2.iaecpr
osd: 8 osds: 8 up (since 2m), 8 in (since 27h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 7 pools, 417 pgs
objects: 260 objects, 586 KiB
usage: 431 MiB used, 200 GiB / 200 GiB avail
pgs: 44/780 objects degraded (5.641%)
354 active+clean
51 active+undersized
10 active+undersized+degraded
2 active+clean+scrubbing
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 16m)
mgr: cephnode1.tbqyke(active, since 25m), standbys: cephnode2.iaecpr
osd: 8 osds: 8 up (since 16m), 8 in (since 27h)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 7 pools, 417 pgs
objects: 259 objects, 586 KiB
usage: 427 MiB used, 200 GiB / 200 GiB avail
pgs: 417 active+clean
When performing failure testing during a POC, as we just did by simulating a node failure, you might want to prevent CRUSH from automatically rebalancing the cluster. To avoid this rebalancing behaviour, set the noout flag by running the command "ceph osd set noout".
The virtual IP will return to cephnode2 (the original node) by default. You can verify this by using ip a. As you can see from the output below, the virtual IP has moved back to cephnode2.
[root@cephnode2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:e9:1f:f5 brd ff:ff:ff:ff:ff:ff
altname enp0s18
inet 10.0.0.241/8 brd 10.255.255.255 scope global noprefixroute ens18
valid_lft forever preferred_lft forever
inet 10.0.0.244/24 scope global ens18
valid_lft forever preferred_lft forever
inet6 fe80::be24:11ff:fee9:1ff5/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:2b:c4:ee brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 192.168.1.11/24 brd 192.168.1.255 scope global noprefixroute ens19
valid_lft forever preferred_lft forever
inet6 fe80::f3b8:12a1:3e2c:7db/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@cephnode2 ~]#
[root@cephnode3 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:5e:07:89 brd ff:ff:ff:ff:ff:ff
altname enp0s18
inet 10.0.0.242/8 brd 10.255.255.255 scope global noprefixroute ens18
valid_lft forever preferred_lft forever
inet6 fe80::be24:11ff:fe5e:789/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:bb:d5:f6 brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 192.168.1.12/24 brd 192.168.1.255 scope global noprefixroute ens19
valid_lft forever preferred_lft forever
inet6 fe80::4860:35fe:9978:e3a6/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@cephnode3 ~]#
With Virtual-hosted style addressing, the bucket name forms part of the DNS name in the URL. For example:
https://bucket.s3.amazonaws.com/object
https://bucket.s3-aws-region.amazonaws.com/object
For Path-style addressing, the bucket name is not part of the DNS name in the URL. For example:
https://s3.amazonaws.com/bucket/object
https://s3-aws-region.amazonaws.com/bucket/object
Most commercial applications that make use of object storage will require virtual-hosted style bucket
addressing. To support Virtual-hosted style addressing requires the use of a Wildcard DNS. Amazon
intends to deprecate Path-style API requests.
“Amazon S3 currently supports two request URI styles in all regions: path-style (also known as V1) that
includes bucket name in the path of the URI (example: //s3.amazonaws.com/<bucketname>/key), and
virtual-hosted style (also known as V2) which uses the bucket name as part of the domain name
(example: //<bucketname>.s3.amazonaws.com/key). In our effort to continuously improve customer
experience, the path-style naming convention is being retired in favor of virtual-hosted style request
format. Customers should update their applications to use the virtual-hosted style request format when
making S3 API requests before September 30th, 2020 to avoid any service disruptions. Customers using
the AWS SDK can upgrade to the most recent version of the SDK to ensure their applications are using
the virtual-hosted style request format.
Virtual-hosted style requests are supported for all S3 endpoints in all AWS regions. S3 will stop
accepting requests made using the path-style request format in all regions starting September 30th,
2020. Any requests using the path-style request format made after this time will fail.” Source:
https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/ and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html.
The deprecation of Path-style requests was originally targeted for September 2020. As of the time of
writing, this deadline has been extended.
“Update (September 23, 2020) – Over the last year, we’ve heard feedback from many customers who
have asked us to extend the deprecation date. Based on this feedback we have decided to delay the
deprecation of path-style URLs to ensure that customers have the time that they need to transition to
virtual hosted-style URLs.
We have also heard feedback from customers that virtual hosted-style URLs should support buckets
that have dots in their names for compatibility reasons, so we’re working on developing that support.
Once we do, we will provide at least one full year prior to deprecating support for path-style URLs for
new buckets.” Source: https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-
rest-of-the-story/.
Without Virtual-hosted style bucket addressing configured in IBM Storage Ceph, we risk not being able to support applications built to use this addressing scheme exclusively. The procedure to add a wildcard hostname to the DNS record of the DNS server is documented
here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=configuration-add-wildcard-dns
For a POC environment, you can setup a DNS server using dnsmasq which is an open source,
lightweight, easy to configure DNS forwarder and DHCP server. Our lab dnsmasq configuration file
looks as follows:
.
.
.
#mods
server=8.8.8.8
address=/cephs3.local/10.0.0.244
#address=/local/127.0.0.1
domain=local
.
.
.
Check that name resolution is working. Any name matching *.cephs3.local that does not have an explicit entry in the /etc/hosts file of the dnsmasq server should resolve to the wildcard address (10.0.0.244).
Name: cephnode1.local
Address: 10.0.0.240
Name: cephnode1
Address: 10.0.0.240
Name: cephs3.local
Address: 10.0.0.244
Name: cephs3
Address: 10.0.0.244
Name: bucket1.cephs3.local
Address: 10.0.0.244
root@labserver:/etc#
From one of our ceph cluster nodes we can also verify wildcard DNS is working.
Name: bucket.cephs3.local
Address: 10.0.0.244
[root@cephnode1 ~]#
Before following the procedure from the documentation, we can test whether virtual-hosted style bucket addressing is already working. For this, we will use s3cmd (https://s3tools.org/s3cmd). When issuing s3cmd --configure to add your S3 credentials and endpoint URL, the configuration should look something like this:
New settings:
Access Key: Z3VKKAHG9W9WLRN30FQF
Secret Key: zBDMygZBnEizrLrY3ioP7wpePeSy69AJBRGhMG1z
Default Region: US
S3 Endpoint: cephs3.local
DNS-style bucket+hostname:port template for accessing a bucket: %(bucket)s.cephs3.local
Encryption password: Nlp345gp
Path to GPG program: /usr/bin/gpg
Use HTTPS protocol: True
HTTP Proxy server name:
HTTP Proxy server port: 0
Since we have SSL enabled with a self-signed certificate, we will just disable validating the SSL
certificate in the .s3cfg file with “check_ssl_certificate = False”. If we test the creation of a DNS style
bucket with s3cmd we get:
Now we will follow the procedure as per the documentation. Firstly, we need to get the current
zonegroup information and output it to a json file.
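For example (a sketch using our zonegroup name):
radosgw-admin zonegroup get --rgw-zonegroup=LAB_ZONE_GROUP1 > zonegroup.json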
Next, we need to modify the file to include cephs3.local in the hostnames field.
.
.
"api_name": "LAB_ZONE_GROUP1",
"is_master": true,
"endpoints": [
"10.0.0.241",
"10.0.0.242"
],
"hostnames": ["cephs3.local","cephnode2.local","cephnode3.local"],
"hostnames_s3website": [],
"master_zone": "7109acc2-883b-4502-b863-4e7097d7d83c",
"zones": [
{
"id": "7109acc2-883b-4502-b863-4e7097d7d83c",
"name": "DC1_ZONE",
"endpoints": [
"10.0.0.241",
"10.0.0.242"
],
"log_meta": false,
"log_data": false,
.
.
We need to now upload the new zonegroup information back to the Ceph RGWs and update the period.
Each realm is associated with a “period”. A period represents the state of the zonegroup and zone
configuration in time. Each time you make a change to a zonegroup or zone, you should update and
commit the period.
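For example (a sketch, run from a node with admin access):
radosgw-admin zonegroup set --rgw-zonegroup=LAB_ZONE_GROUP1 --infile=zonegroup.json
radosgw-admin period update --commit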
"10.0.0.241",
"10.0.0.242"
],
"hostnames": [
"cephs3.local",
"cephnode2.local",
"cephnode3.local"
],
.
.
.
"realm_id": "a9c5b73e-66bd-4e10-9a2f-5df2f6fc515a",
"realm_epoch": 2
}
[root@cephnode1 ~]#
Lastly, we need to recycle the Ceph Object Gateways (do this for all RGWs).
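For example, restarting all daemons of the RGW service with the orchestrator (a sketch):
ceph orch restart rgw.rgw_default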
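The script excerpt that follows begins partway through; a preamble along these lines (a sketch with illustrative values, matching the variable names used in the body) would define the required variables:
#!/bin/bash
# Assumed preamble (illustrative values); variable names match those used below
S3_ACCESS_KEY="<access key id>"
S3_SECRET_KEY="<secret access key>"
BUCKET="awsbucket"
FILE="1gfile"
CONTENTTYPE="application/octet-stream"
FILEPATH="/${BUCKET}/${FILE}"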
DATEVALUE=`date -R`
SIGNATURE_STRING="PUT\n\n${CONTENTTYPE}\n${DATEVALUE}\n${FILEPATH}"
# Create signature hash to be sent in Authorization header
SIGNATURE_HASH=`echo -en ${SIGNATURE_STRING} | openssl sha1 -hmac ${S3_SECRET_KEY} -binary |
base64`
# Clean out awsbucket
s3cmd rb --recursive --force s3://awsbucket >/dev/null 2>&1
s3cmd mb s3://awsbucket >/dev/null 2>&1
# Output to screen
echo
echo "Script to demonstrate Virtual-hosted style bucket addressing"
echo
echo "Listing $BUCKET bucket prior to PUT"
echo
echo " -> Running s3cmd ls s3://awsbucket.."
s3cmd ls s3://awsbucket
echo
echo " -> List bucket completed"
echo
echo "Performing a PUT to the following URL https://${BUCKET}.cephs3.local/${FILE}"
echo
echo " -> curl -k -X PUT -T ${FILE} -H Host: ${BUCKET}.cephs3.local -H Date: ${DATEVALUE} -H
Content-Type: ${CONTENTTYPE} -H Authorization: AWS ${S3_ACCESS_KEY}:${SIGNATURE_HASH}
https://${BUCKET}.cephs3.local/${FILE}"
# curl command to do PUT operation to our S3 endpoint
curl -k -X PUT -T "${FILE}" \
-H "Host: ${BUCKET}.cephs3.local" \
-H "Date: ${DATEVALUE}" \
-H "Content-Type: ${CONTENTTYPE}" \
-H "Authorization: AWS ${S3_ACCESS_KEY}:${SIGNATURE_HASH}" \
https://${BUCKET}.cephs3.local/${FILE}
echo
echo "Listing bucket after PUT"
echo
echo " -> Running s3cmd ls s3://awsbucket.."
echo
s3cmd ls s3://awsbucket
echo
echo " -> List bucket completed"
echo
echo "End of script.. Exiting"
root@labserver:~#
root@labserver:~# ./virtual.sh
For a POC or test environment, you can use s3bench to test the performance of the Ceph Object
Gateway (https://github.com/igneous-systems/s3bench ). You can monitor the RGW performance
using the Grafana dashboards.
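An invocation along the following lines produced the run summarised below (a sketch; the access keys are those of the object user we created, and the flag names follow the s3bench README):
./s3bench -accessKey=<access key> -accessSecret=<secret key> -bucket=s3bench -endpoint=http://cephnode2.local:8080 -numClients=2 -numSamples=100 -objectNamePrefix=s3bench -objectSize=10240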
Test parameters
endpoint(s): [http://cephnode2.local:8080]
bucket: s3bench
objectNamePrefix: s3bench
objectSize: 0.0096 MB
numClients: 2
numSamples: 100
verbose: %!d(bool=false)
Note, some applications require you to specify a REGION. With Ceph, the REGION equates to the zone
name.
Another useful tool for a POC environment is s3tests https://github.com/ceph/s3-tests. This tool tests
S3 API compatibility. If you are comparing different object stores (e.g. Ceph, MinIO, Dell EMC Isilon,
Storage Scale etc.) you can use this to see which solution offers the highest level of S3 API
compatibility.
s3tests_boto3/functional/test_headers.py .........F...FFFF........FF.FFF....F...F....F...
[ 6%]
s3tests_boto3/functional/test_iam.py
FFFFFFFFFFFFFFFFFFFFFFFFFFFssssssssssssssssssssssssssssssssssssssssssssssssssssssss
[ 17%]
s3tests_boto3/functional/test_s3.py
.............................................................................................
...............F.......... [ 32%]
...........................F..F.....................................................F........
..........................F................................... [ 53%]
........
.
.
.
.
.
======================================== 138 failed, 552 passed, 69 skipped, 14776 warnings,
1 error in 1071.85s (0:17:51) ========================================
You can configure the Ceph Object Gateway to host static websites in S3 buckets. Traditional website hosting involves configuring a web server for each website, which can use resources inefficiently when content does not change dynamically, for example on sites that do not use server-side services like PHP, servlets, databases, NodeJS and the like. Hosting such sites in S3 buckets is substantially more economical than setting up virtual machines with web servers for each site. The procedure is documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=configuration-static-web-hosting
We will use s3cmd to create our static website. First, we need to create a bucket to host it.
Next, we will upload the index.html, error.html (required) and any other files referenced in the
index.html and grant all of these objects public access.
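The s3cmd steps are along these lines (a sketch; the bucket name testwebsite matches the bucket shown later in the radosgw-admin output):
s3cmd mb s3://testwebsite
s3cmd put index.html error.html s3://testwebsite --acl-public
s3cmd ws-create --ws-index=index.html --ws-error=error.html s3://testwebsite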
Figure 56: IBM Storage Ceph RGW Static Web hosting Example
Since this document assumes you are trying to evaluate IBM Storage Ceph, we will configure and
provision RBDs to Linux and Windows hosts and also to a native Kubernetes cluster for persistent
volume storage. Ceph RBDs require just monitor, manager and OSD daemons and can be hosted in
the same Ceph storage cluster as RGWs and CephFS. More information on RADOS Block Devices can
be found here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=ceph-block-devices
Deleting pools on the Ceph dashboard is disabled by default. You can delete the
pools on Ceph Storage Dashboard by ensuring that value of
mon_allow_pool_delete is set to True in Manager modules. You can also enable
this in a running Ceph cluster from the command line by issuing "ceph tell mon.*
injectargs --mon_allow_pool_delete true".
If you navigate to the Block storage protocol in the Ceph dashboard you will notice that you need to
first create a RBD pool.
Navigate to Cluster -> Pools -> Create and create a new pool called RBD. You need to specify the
application as RBD. Notice the default is a replicated pool with 3 replicas.
Navigate back to Block Storage and we can now create RBD images (similar to LUNs). Notice that there is an option to create namespaces. A namespace allows you to segregate RBD images (similar to the LUN masking we usually do on block storage arrays). Users granted access to one namespace won't be able to see RBD images that reside in a namespace to which they don't have access. See https://access.redhat.com/solutions/4872331.
For a POC environment, we want to create images for a Windows and a Linux host. We don't want them to be able to access each other's images, so we will create two namespaces called windows and linux. Both will reside in the same storage pool we created earlier.
We can now create an image in each namespace. We will call them winlun and linuxlun and set them
to 3GB in size (we will accept the default options for now).
Unless otherwise specified, the client rbd command uses the Ceph user ID admin to access the Ceph
cluster. The admin Ceph user ID allows full administrative access to the cluster. It is recommended
that you access the Ceph cluster with a Ceph user ID that has fewer permissions than the admin Ceph
user ID does. We call this non-admin Ceph user ID a “block device user” or “Ceph user”. We will create
two users called client.windows and client.linux (note, all ceph users should have the prefix client).
You can use the CLI command ceph auth get-or-create to create the required users and assign them
the correct MONITOR and OSD capabilities. Since we are using namespaces, we also have to specify
the namespaces for which they have access. We use the -o flag to create a keyring file that we will
provide to each of the clients with their corresponding keys.
[root@cephnode1 ~]# ceph auth get-or-create client.windows mon 'profile rbd' osd 'profile
rbd pool=rbd namespace=windows' -o /etc/ceph/ceph.client.windows.keyring
[root@cephnode1 ~]# ceph auth get-or-create client.linux mon 'profile rbd' osd 'profile rbd
pool=rbd namespace=linux' -o /etc/ceph/ceph.client.linux.keyring
[root@cephnode1 ~]# ceph auth get client.windows
[client.windows]
key = AQApVZxmHkGOFRAAvpJODJPdmeU1pzuN7OAx5g==
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd namespace=windows"
[root@cephnode1 ~]# ceph auth get client.linux
[client.linux]
key = AQA/VZxmTnFvEhAA+IOSfUHZig/HrW/RkGSKsw==
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd namespace=linux"
[root@cephnode1 ~]#
We can test access for each user to ensure that they are only able to see images in their own
namespace. Since we have both keyring files already in /etc/ceph we can issue the following
commands:
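For example (a sketch):
rbd ls --pool rbd --namespace windows --id windows    # lists winlun
rbd ls --pool rbd --namespace linux --id linux        # lists linuxlun
rbd ls --pool rbd --namespace windows --id linux      # fails with a permissions error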
As you can see, the linux user can’t access the windows namespace and vice-versa.
https://cloudbase.it/ceph-for-windows/
Ceph for Windows is released under the GNU LGPL (the same as Ceph). Be sure to choose the defaults
as we want both the CLI tools and RBD driver to be installed.
Once completed, you will be prompted to reboot your Windows server. After reboot, we need to modify
the ceph.conf file. The default location for the ceph.conf file on Windows is
%ProgramData%\ceph\ceph.conf. You can just take the MON host information from one of your Ceph
cluster nodes.
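A minimal ceph.conf for the Windows client could look like this (a sketch; the mon host addresses are our lab cluster nodes, and the keyring path assumes the client.windows keyring was copied to the default Ceph directory on Windows):
[global]
    mon host = 10.0.0.240,10.0.0.241,10.0.0.242
[client]
    keyring = C:/ProgramData/ceph/keyring/ceph.client.windows.keyring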
Figure 67: Ceph for Windows Configuration file and client keyring
You should be able to list the RBD images that you have access to using the rbd command. Note, if we try the wrong namespace we get an error.
If you navigate to Windows Disk Management you should see the RBD image. Right click to mark it as
online.
Right Click on the RBD image and initialize and format it.
You can use the rbd command to list the configuration as well. You can find a list of rbd command
options here:
https://docs.ceph.com/en/reef/rbd/rados-rbd-cmds/
Figure 72: Ceph for Windows rbd command to show mapped images
Copy the client keyring file from the cluster node where we created the rbd users (or you can re-export
the keyring file again from the CLI or use the GUI to copy and paste it). Also copy the ceph.conf from
one of your cluster nodes.
Like we did on Windows, we need to map the RBD image. As you can see, a new device /dev/rbd0 is
created.
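The commands were along these lines (a sketch using the linux user and namespace from our example):
rbd map rbd/linux/linuxlun --id linux
rbd showmapped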
A handy tool for POCs is the rbd bench option, which tests the performance of an RBD image. We will assign a new RBD image and then run rbd bench with a read/write I/O pattern against it. Note, we are using the raw device, but you could also create a filesystem and run the benchmark against that instead. You can also perform different types of tests and use different block sizes (see the rbd command syntax or the output below).
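A benchmark invocation along these lines was used (a sketch; the image spec and sizes are illustrative):
rbd bench --io-type readwrite --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand rbd/linux/linuxlun --id linux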
You can monitor performance of the benchmark from the Ceph Grafana dashboard.
You can also make use of the rbd perf image iotop and rbd perf image iostat commands. You first need to make sure that the rbd_support Ceph manager module is enabled (you can check with "ceph mgr module ls") and that RBD stats are enabled in the Prometheus manager module (they are not by default).
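For example (a sketch; the pool name rbd is ours):
ceph mgr module ls | grep rbd_support
ceph config set mgr mgr/prometheus/rbd_stats_pools rbd
rbd perf image iotop --pool rbd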
Next, we can check on the Ceph dashboard for the current capacity utilization. In our case, the
Windows RBD image has consumed ~42% of its provisioned size of 3GB.
If we check on the Ceph dashboard, we should immediately see the freed-up space. This is one of the
advantages of using Ceph. On traditional block storage arrays, thin provisioned space is not freed-up
by default and is only possible if the array and host operating system support the SCSI unmap
function.
For the Windows RBD client, we will use AJA System Test Utility (https://www.aja.com/products/aja-
system-test).
Figure 79: Using AJA System Test to generate load on Windows RBD client
On Linux, you can just use a simple dd command to generate load as follows:
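For example (a sketch; adjust the output path to wherever the RBD image is mounted):
dd if=/dev/zero of=/mnt/rbd/testfile bs=1M count=1000 oflag=direct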
On the IBM Storage Ceph Dashboard we can check the load on our four cluster nodes.
We can verify from the Linux client which nodes the client is connecting to. Sure enough, the client is
load-balancing across the OSD nodes (.240-.243).
Let us now shut down one of the cluster nodes and monitor the behavior.
services:
mon: 3 daemons, quorum cephnode1,cephnode3 (age 17s), out of quorum: cephnode2
mgr: cephnode2.iaecpr(active, since 78m), standbys: cephnode1.tbqyke
mds: 1/1 daemons up, 1 standby
osd: 8 osds: 6 up (since 16s), 8 in (since 3d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 11 pools, 625 pgs
objects: 1.40k objects, 3.2 GiB
usage: 17 GiB used, 182 GiB / 200 GiB avail
pgs: 625 active+clean
io:
client: 1.2 KiB/s rd, 15 MiB/s wr, 1 op/s rd, 8 op/s wr
[root@cephnode1 ~]#
From the Linux host we can see a slight delay in writing our test file. However, access was not lost and I/O continued as normal despite the failure, albeit with some performance degradation (this is to be expected as we shut down an OSD node with 2 active OSDs). The same behavior is observed on the Windows client.
.
.
.
1048576000 bytes (1.0 GB, 1000 MiB) copied, 28.8649 s, 36.3 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 36.7548 s, 28.5 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 38.0272 s, 27.6 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 30.8565 s, 34.0 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 25.0541 s, 41.9 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 25.4571 s, 41.2 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 23.8836 s, 43.9 MB/s
.
.
.
We can see the drop in performance when cephnode2 was shut down and the subsequent resumption
of I/O load after a brief pause.
The Ceph Grafana dashboard RBD Details also shows the failure and the continuation of I/O afterwards.
Reboot the failed node and wait for the cluster to recover.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=access-setting-admin-user-password-
grafana
Essentially, we need to create a Grafana YAML file with the desired admin password and then use the
Ceph orchestrator to apply the new specification as per below:
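A sketch of the specification and apply step (the password is illustrative):
service_type: grafana
spec:
  initial_admin_password: <new admin password>

ceph orch apply -i grafana.yaml
ceph orch redeploy grafana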
You can then connect to the IP of the node running the Grafana service (check under services in the
Ceph Dashboard) and login as admin with the new password.
If you navigate to Dashboard, you will see the available ones to be displayed.
If you get NO DATA on any of the dashboards, check that the data source definitions are correctly
defined. Also, if you get errors when accessing the embedded Grafana pages on the Ceph
dashboard due to using a self-signed certificate, be sure to add an exception on your browser (e.g.
On Firefox, under Settings -> Privacy and Security, navigate to Certificate Manager and add an
exception manually).
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=upgrading
To check for a software upgrade on the Ceph dashboard, navigate to “Administration -> Upgrade”.
As you can see, we can’t do an upgrade check from the GUI. This issue is documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=access-upgrading-cluster
You can check the current version of IBM Storage Ceph from the command line as follows:
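For example (a sketch; output omitted, and the target image for the upgrade check is an assumption to be replaced with the image documented for your release):
ceph --version        # version of the local ceph client packages
ceph versions         # versions of every daemon in the cluster
ceph orch upgrade check --image <target image>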
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=cephadm-upgrading-storage-ceph-cluster
You will also want to check what the options are to perform a staggered upgrade which is documented
here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=upgrading-staggered-upgrade
Since we can’t demonstrate the upgrade from the Ceph Dashboard, we will go through the steps for a
command line upgrade using the Ceph orchestrator. The automated upgrade process follows Ceph
best practices. It starts with the MGRs, then the MONs, and then the other daemons. Each daemon is restarted only
after Ceph determines that the cluster will remain available.
As per the link provided above, the procedure is demonstrated below. First, we will ensure we have a
valid subscription and apply the latest OS updates (similar to when we bootstrapped a new cluster).
Nothing to do.
Complete!
[3] 21:38:09 [SUCCESS] root@cephnode1
Updating Subscription Management repositories.
Last metadata expiration check: 1:26:15 ago on Sun 21 Jul 2024 20:11:53.
Dependencies resolved.
Nothing to do.
Complete!
[4] 21:38:09 [SUCCESS] root@cephnode2
Updating Subscription Management repositories.
Last metadata expiration check: 2:09:19 ago on Sun 21 Jul 2024 19:28:49.
Dependencies resolved.
Nothing to do.
Complete!
[root@cephnode1 ~]#
We used PSSH as explained earlier. If you chose not to use it, you can use the Ceph preflight
playbook to upgrade the Ceph packages on the other cluster nodes by specifying
upgrade_ceph_packages=true as follows:
[root@cephnode1 ~]# cd /usr/share/cephadm-ansible
[root@cephnode1 cephadm-ansible]# ansible-playbook -i ./inventory/production/hosts cephadm-
preflight.yml --extra-vars "ceph_origin=ibm upgrade_ceph_packages=true"
[DEPRECATION WARNING]: [defaults]callback_whitelist option, normalizing names to new
standard, use callbacks_enabled instead. This feature will be removed from ansible-core in
version
2.15. Deprecation warnings can be disabled by setting deprecation_warnings=False in
ansible.cfg.
PLAY [insecure_registries]
*********************************************************************************************
*********************************************************************
PLAY [preflight]
*********************************************************************************************
*******************************************************************************
Before we start the upgrade, we need to check that all cluster nodes are online and that our cluster is
healthy.
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 57m)
mgr: cephnode2.iaecpr(active, since 57m), standbys: cephnode1.tbqyke
osd: 8 osds: 8 up (since 57m), 8 in (since 2d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 9 pools, 481 pgs
objects: 632 objects, 716 MiB
usage: 2.6 GiB used, 197 GiB / 200 GiB avail
pgs: 481 active+clean
lvcreate is present
Unit chronyd.service is enabled and running
Hostname "cephnode4.local" matches what is expected.
Host looks OK
[root@cephnode1 ~]#
In order to ensure that no recovery actions are performed during the upgrade, we want to set the
noout, noscrub and nodeep-scrub flags. This will also prevent any unnecessary load on the cluster
during the upgrade.
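For example (a sketch):
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub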
Now we need to login to the IBM Container Registry to check service versions and available target
versions.
"osd.4",
"osd.7"
]
}
[root@cephnode1 ~]#
You can see that we have some services that need an update and others that do not. We can now start
the upgrade for our Ceph cluster.
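The upgrade is started by pointing the orchestrator at the target image (a sketch; the image name shown is an assumption, so use the image documented for your IBM Storage Ceph release):
ceph orch upgrade start --image cp.icr.io/cp/ibm-ceph/ceph-7-rhel9:latest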
We can monitor the progress using the “ceph orch upgrade status” command. When the in_progress
status changes to false the upgrade is completed.
You can also check the status using the “ceph status” command (sample output provided).
After the upgrade is completed, you can check the daemon versions using "ceph orch ps" or "ceph versions" and check the local client version using the "ceph --version" command.
We need to unset the noout, noscrub, and nodeep-scrub flags we set before we started the upgrade.
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 5m)
mgr: cephnode2.iaecpr(active, since 75m), standbys: cephnode1.tbqyke
osd: 8 osds: 8 up (since 75m), 8 in (since 2d)
flags noout,noscrub,nodeep-scrub
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 9 pools, 481 pgs
objects: 632 objects, 716 MiB
usage: 2.6 GiB used, 197 GiB / 200 GiB avail
pgs: 481 active+clean
[root@cephnode1 ~]#
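The flags are cleared as follows (a sketch), after which the status no longer reports them:
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub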
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 12m)
mgr: cephnode2.iaecpr(active, since 83m), standbys: cephnode1.tbqyke
osd: 8 osds: 8 up (since 83m), 8 in (since 2d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 9 pools, 481 pgs
objects: 632 objects, 716 MiB
usage: 2.6 GiB used, 197 GiB / 200 GiB avail
pgs: 481 active+clean
[root@cephnode1 ~]#
The final step is to upgrade ceph-tools on all client nodes that connect to the cluster and to check that
they are on the latest version. You can run the “dnf update ceph-common” on RHEL clients and “ceph
--version” commands to do this.
As part of a POC, you would want to demonstrate the use of CephFS as a clustered filesystem similar
to IBM Storage Scale, GlusterFS or Lustre. The same use cases would typically apply to CephFS as
with any of the others mentioned.
Let’s start by navigating to File Systems on the Ceph Dashboard and clicking on “Create”. Note, we
are choosing to deploy our MDS servers on any cluster nodes labelled mds. Ceph will create two pools
for each filesystem, a metadata and a data pool. We do not specify a size of the filesystem at creation
time as the pools are thin provisioned.
By default, a Ceph File System uses only one active MDS daemon. However, systems with many clients
benefit from multiple active MDS daemons. For a POC, we just need one active and one standby
daemon. Navigate to Administration -> Services and edit the service called mds.labserver (our filesystem
name) and change the count to 2.
We should have at least two MDS daemons running, one active and one standby.
You can also query this from the command line as follows:
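For example (a sketch; output omitted):
ceph fs status labserver
ceph orch ps --daemon-type mds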
On the Ceph Dashboard, navigate back to File System and get details for the newly created filesystem.
We can see that we have one active and one standby daemon for our filesystem.
We need to now add client access to the filesystem. You can click on “Authorize” to add a client. We
want to grant access for the client lab to the entire labserver filesystem and also give this client read
and write access. We can untick “Root Squash”.
We need the client keyring, so we must now navigate to Administration -> Ceph Users -> client.lab and click on "Edit" to see the client key. You can also choose to export it.
Figure 96: Ceph Dashboard Ceph Users – Edit User to display key
You can also generate the client keyring file from the command line as follows:
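For example (a sketch):
ceph auth get client.lab -o /etc/ceph/ceph.client.lab.keyring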
Always make sure your keyring files have the correct permissions set. They should be set to 600.
We need to copy this keyring file to our CephFS client server and mount the Ceph filesystem we just
created. We want to mount the filesystem using the Ceph Linux kernel driver so that it mounts as a
regular filesystem and we get native kernel performance. You can also mount it using fuse. The syntax
for the mount command is as follows:
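With recent kernels and the mount.ceph helper, the device string takes the form user@fsid.fsname=/path, so for our lab the mount would look something like this (a sketch; substitute your own fsid, and note that the helper picks up the key from the keyring in /etc/ceph):
mount -t ceph lab@<fsid>.labserver=/ /mnt/cephfs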
The fsid is a unique identifier for the Ceph cluster, and stands for File System ID from the days when
the Ceph Storage Cluster was principally for the Ceph File System. Ceph now supports block devices
and object storage gateway interfaces too, so fsid is a bit of a misnomer. You can obtain this from
running “ceph -s” on one of the Ceph cluster nodes.
You can also get the CephFS mount command syntax from the Ceph Dashboard by clicking on the
filesystem and then selecting “Attach” as the action.
Let us copy the client keyring file we exported and mount the filesystem.
If you want to monitor performance from the command line, you can install the cephfs-top utility. You
need to enable the stats plugin on your MGR module as it is disabled by default. You also need to
create a client.fstop user for the cephfs-top utility to function.
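For example (a sketch):
ceph mgr module enable stats
dnf install cephfs-top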
[root@cephnode1 ~]# ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd
'allow r' mgr 'allow r'
[client.fstop]
key = AQC+dZ5mhqpkAhAAohC9SDLOSj8f10QxzLiDBg==
[root@cephnode1 ~]#
Complete!
[root@cephnode1 ~]#
Now you can get detailed metrics from the command line using cephfs-top.
You can also view the performance through the Ceph Dashboard or Grafana Dashboard.
A quick test of client access during a node failure can be done as follows. First, we generate some
client load from our CephFS client.
Check the cluster status to make sure the node is down and we should only have 1 MDS daemon
running.
services:
mon: 3 daemons, quorum cephnode3,cephnode2 (age 2m), out of quorum: cephnode1
mgr: cephnode2.iaecpr(active, since 42m)
mds: 1/1 daemons up
osd: 8 osds: 6 up (since 2m), 8 in (since 3d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 11 pools, 625 pgs
objects: 1.31k objects, 2.5 GiB
usage: 15 GiB used, 185 GiB / 200 GiB avail
pgs: 902/3924 objects degraded (22.987%)
261 active+undersized
194 active+undersized+degraded
170 active+clean
io:
client: 4.3 KiB/s rd, 121 MiB/s wr, 8 op/s rd, 136 op/s wr
[root@cephnode4 ~]#
On the CephFS client, our simple script to create a file with dd is still running fine.
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.81071 s, 579 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.62853 s, 644 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.58135 s, 663 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.55161 s, 411 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 8.83065 s, 119 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.6657 s, 185 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.7842 s, 588 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.72276 s, 609 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.78127 s, 589 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.70911 s, 614 MB/s
You can reboot the failed node and make sure the cluster goes back to a healthy state.
As with pools, deleting a CephFS filesystem is disabled in the Dashboard by default. You can follow
the procedure below to delete a File System.
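Alternatively, a filesystem can be removed from the command line (a sketch; this is destructive, so only do it on a test cluster):
ceph fs fail labserver
ceph fs rm labserver --yes-i-really-mean-it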
We will demonstrate deploying a single instance NFS service. As we did with the RGW service, we will
also demonstrate the deployment of a highly-available NFS server using the Ceph ingress service
which deploys a virtual IP along with HAProxy and keepalived. If you don’t need to test NFS high-
availability then you can just deploy one or more NFS gateways with no failover.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=operations-managing-nfs-ganesha-
gateway-using-ceph-orchestrator
We can create the NFS cluster using the Dashboard. Navigate to Administration -> Services.
We can query our NFS cluster information from the command line.
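For example (a sketch; output omitted):
ceph nfs cluster ls
ceph nfs cluster info <cluster_id>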
Next, we will create a separate Ceph File system to use for our NFS export.
Now we will create an NFS export. Navigate to File -> NFS -> Create. IBM Storage Ceph supports both NFSv3 and NFSv4. Open-source Ceph (Reef) only supports NFSv4. Pseudo path is the export position
within the NFSv4 pseudo filesystem where the export will be available on the server. It must be an
absolute path and be unique.
We can specify the clients to which we want to export the CephFS filesystem, as well as specify NFS options.
You can now mount your NFS export on your intended NFS client. Since we just have a single NFS
server running on cephnode4, you would specify the NFS server IP as cephnode4 and the port as
12049.
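For example (a sketch; substitute the pseudo path defined on the export):
mount -t nfs -o nfsvers=4.1,port=12049 cephnode4.local:/<pseudo path> /mnt/nfs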
You can deploy more than one NFS daemon (we only used 1 when we created our NFS service via
Administration -> Services). If you deployed more than one NFS daemon, you can use any cluster node
that the NFS service is running on. However, if that node fails, the NFS client will lose connectivity to
the NFS mount.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=orchestrator-implementing-ha-cephfsnfs-
service
The above link also explains how to convert an existing NFS cluster for high availability. First, we need
to label our nodes that we want to use as NFS servers. We will use cephnode1 and cephnode4.
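For example (a sketch):
ceph orch host label add cephnode1 nfs
ceph orch host label add cephnode4 nfs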
Next, we will use the Ceph Orchestrator to deploy our NFS server with the ingress service. Our virtual
IP for the NFS server will be 10.0.0.245 (cephnfs.local).
[root@cephnode1 ganesha]# ceph nfs cluster create mynfs "2 label:nfs" --ingress --ingress-
mode haproxy-protocol --virtual-ip 10.0.0.245/24
[root@cephnode1 ganesha]#
You can query your NFS cluster deployment and check where our NFS server virtual IP is active (in our
case our NFS server virtual IP is aliased onto the public interface on cephnode1).
[root@cephnode4 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:04:de:52 brd ff:ff:ff:ff:ff:ff
altname enp0s18
inet 10.0.0.243/8 brd 10.255.255.255 scope global noprefixroute ens18
valid_lft forever preferred_lft forever
inet6 fe80::be24:11ff:fe04:de52/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:d8:68:7b brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 192.168.1.13/24 brd 192.168.1.255 scope global noprefixroute ens19
valid_lft forever preferred_lft forever
inet6 fe80::bd5d:7ac4:eaf0:adb9/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@cephnode4 ~]#
[root@cephnode1 e7fcc1ac-42ec-11ef-a58f-bc241172f341]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:72:f3:41 brd ff:ff:ff:ff:ff:ff
altname enp0s18
inet 10.0.0.240/8 brd 10.255.255.255 scope global noprefixroute ens18
valid_lft forever preferred_lft forever
inet 10.0.0.245/24 scope global ens18
valid_lft forever preferred_lft forever
inet6 fe80::be24:11ff:fe72:f341/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: ens19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default
qlen 1000
link/ether bc:24:11:05:3f:27 brd ff:ff:ff:ff:ff:ff
altname enp0s19
inet 192.168.1.10/24 brd 192.168.1.255 scope global noprefixroute ens19
valid_lft forever preferred_lft forever
inet6 fe80::3524:3a85:d8dd:2a79/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@cephnode1 e7fcc1ac-42ec-11ef-a58f-bc241172f341]#
As demonstrated previously, we need to now create an NFS export on the Ceph Dashboard.
You can now mount your NFS export on the NFS client as follows:
You can now perform failover testing if required (similar to how we did for CephFS). Run a dd command
on the client to generate some load and shutdown the cluster node with the NFS server virtual IP. You
can check the I/O load via the Ceph Dashboard or using cephfs-top.
In the current IBM Storage Ceph version, NFS failover testing didn’t work as expected with the client
being disconnected when the NFS server IP fails over. After doing some additional testing and
research, it was discovered that the ingress service deploys haproxy without the health check
option. Haproxy health checks automatically detect when a server becomes unresponsive or begins
to return errors; HAProxy can then temporarily remove that server from the pool until it begins to act
normally again. Without health checks, HAProxy has no way of knowing when a server has become
dysfunctional. Credit to this Reddit post that eventually helped resolve this issue:
https://www.reddit.com/r/ceph/comments/10bcwra/nfs_cluster_ha_not_working_what_am_i_mis
sing/
Since IBM Storage Ceph is containerized, the configuration files for HAProxy are deployed inside
containers. The only way to make changes to any of the configuration files (e.g. NFS-Ganesha Exports
file or HAProxy configuration file) is to use the method described below. For NFS-Ganesha exports,
you can download a sample file from here:
https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/export.txt
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=orchestrator-setting-custom-nfs-ganesha-
configuration
%url {{ url }}
[root@cephnode1 ~]#
For HAProxy, to add the health check option, perform the following steps:
haproxy.cfg.j2
100%[================================================================>] 2.51K --.-KB/s
in 0.008s
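A minimal sketch of that method follows, assuming the ingress service for our NFS cluster is named ingress.nfs.mynfs (check the actual service name with ceph orch ls): edit the downloaded haproxy.cfg.j2 template so that the backend server lines include the check keyword, store it as a custom cephadm template, and redeploy the ingress service.
# after adding "check" to the server lines in the backend section of haproxy.cfg.j2:
ceph config-key set mgr/cephadm/services/ingress/haproxy.cfg -i haproxy.cfg.j2
ceph orch redeploy ingress.nfs.mynfs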
You can verify that the new file has taken effect after redeploying the haproxy and keepalived services
as follows:
defaults
mode tcp
log global
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 10s
maxconn 8000
frontend stats
mode http
bind 10.0.0.245:9000
bind 10.0.0.240:9000
stats enable
stats uri /stats
stats refresh 10s
stats auth admin:wokqvaej
http-request use-service prometheus-exporter if { path /metrics }
monitor-uri /health
frontend frontend
bind 10.0.0.245:2049
default_backend backend
backend backend
mode tcp
balance source
hash-type consistent
server nfs.mynfs.0 10.0.0.243:12049 check
server nfs.mynfs.1 10.0.0.240:12049 check
[root@cephnode1 haproxy]#
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=ganesha-nfs-ceph-object-storage
Let’s enable this via the Ceph Dashboard. Navigate to Object -> NFS -> Create Export. It is
recommended to enable RO access for compatibility reasons (see limitations in the above link). We
will however enable RW access. We will use a bucket called testbucket. Note, the Storage Backend is
set to Object Gateway (unlike a normal NFS export which uses CephFS as the backend).
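If you prefer the command line, an equivalent export can be created with the ceph nfs command. This is a minimal sketch assuming our NFS cluster is called mynfs and the pseudo path /testbucket is free:
ceph nfs export create rgw --cluster-id mynfs --pseudo-path /testbucket --bucket testbucket
ceph nfs export ls mynfs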
After you create the Export, you should see it in the list of cluster NFS exports.
Let’s navigate to the newly mounted filesystem and list the contents of testbucket. As we can see, we
already have 2 files (these are actually objects).
root@labserver:~# cd /mnt/testbucket/
root@labserver:/mnt/testbucket# ls -al
total 4026
drwxrwxrwx 1 root root 0 Jul 24 20:53 .
drwxr-xr-x 5 root root 4096 Jul 24 20:54 ..
-rw-rw-rw- 1 root root 4118141 Jul 24 20:49 1gfile
-rw-rw-rw- 1 root root 318 Jul 20 22:22 hosts
root@labserver:/mnt/testbucket#
Let’s now use s3cmd to write an object to our testbucket called can_u_see_me_from_nfs. Thereafter,
we should be able to immediately see this new object via NFS with a simple ls command.
Sure enough, we can see the file we wrote to the RGW via s3cmd. Now let’s test it the other way
around. We will create a file via NFS called can_u_see_me_from_s3 and then query the RGW via
object protocol and list the contents of testbucket to see if we can see it.
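The round trip looks roughly like this (a sketch assuming s3cmd is already configured against our RGW endpoint and the export is mounted at /mnt/testbucket):
# an object written via S3 is visible over NFS
s3cmd put /etc/services s3://testbucket/can_u_see_me_from_nfs
ls -l /mnt/testbucket
# a file written via NFS is visible over S3
cp /etc/services /mnt/testbucket/can_u_see_me_from_s3
s3cmd ls s3://testbucket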
We have just demonstrated that we can expose our object storage via NFS easily. You can query the
contents of the testbucket in Ceph using the radosgw-admin command as well.
"testwebsite",
"awsbucket",
"testbucket"
]
[root@cephnode1 ~]# radosgw-admin bucket list --bucket=testbucket
[
{
"name": "1gfile",
"instance": "",
"ver": {
"pool": 7,
"epoch": 12
},
"locator": "",
"exists": true,
"meta": {
"category": 1,
"size": 4118141,
"mtime": "2024-07-24T18:49:40.747228Z",
"etag": "94b829092828287fd4d714e5adfd0481-67",
"storage_class": "",
"owner": "ndocrat",
"owner_display_name": "Nishaan Docrat",
"content_type": "application/octet-stream",
"accounted_size": 1048576000,
"user_data": "",
"appendable": false
},
"tag": "7109acc2-883b-4502-b863-4e7097d7d83c.574704.15486274694923450889",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
},
{
"name": "can_u_see_me_from_nfs",
"instance": "",
"ver": {
"pool": 7,
"epoch": 17
},
"locator": "",
"exists": true,
"meta": {
"category": 1,
"size": 332,
"mtime": "2024-07-24T18:58:15.274165Z",
"etag": "87d7315472cc2b6e89fa6a7ab85236c5",
"storage_class": "STANDARD",
"owner": "ndocrat",
"owner_display_name": "Nishaan Docrat",
"content_type": "text/plain",
"accounted_size": 552,
"user_data": "",
"appendable": false
},
"tag": "7109acc2-883b-4502-b863-4e7097d7d83c.584431.17680027879669500773",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
},
{
"name": "can_you_see_me_from_s3",
"instance": "",
"ver": {
"pool": 7,
"epoch": 28
},
"locator": "",
"exists": true,
"meta": {
"category": 1,
"size": 552,
"mtime": "2024-07-25T05:57:20.986184Z",
"etag": "87d7315472cc2b6e89fa6a7ab85236c5",
"storage_class": "",
"owner": "ndocrat",
"owner_display_name": "",
"content_type": "",
"accounted_size": 552,
"user_data": "",
"appendable": false
},
"tag": "7109acc2-883b-4502-b863-4e7097d7d83c.594593.5626747190077097157",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
},
{
"name": "hosts",
"instance": "",
"ver": {
"pool": 7,
"epoch": 6
},
"locator": "",
"exists": true,
"meta": {
"category": 1,
"size": 318,
"mtime": "2024-07-20T20:22:18.288298Z",
"etag": "b04bc021f3faea3a141ee55e39e5bfbf",
"storage_class": "STANDARD",
"owner": "ndocrat",
"owner_display_name": "Nishaan Docrat",
"content_type": "text/plain",
"accounted_size": 518,
"user_data": "",
"appendable": false
},
"tag": "7109acc2-883b-4502-b863-4e7097d7d83c.112302.4596205271780943401",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
]
[root@cephnode1 ~]#
Open-source Ceph still offers the iSCSI Gateway, whereas IBM Storage Ceph has this functionality
disabled by default in the Ceph Dashboard. It is possible to enable it, but it was not possible to get it
working on IBM Storage Ceph 7.1: deploying the iSCSI service appeared to succeed, but the service
failed to start (this was tested on 3 separate IBM Storage Ceph clusters).
It is important to note that whilst IBM Storage Ceph does not support the use of the iSCSI Gateway,
open source Ceph still allows use of this feature. For the sake of completeness, below find details of
the iSCSI Gateway deployment using open source Ceph (Reef) 18.2.1.
Figure 118: Open-source Ceph iSCSI – Windows iSCSI Software Initiator Configuration
IBM is targeting the use of Ceph NVMe-oF for high-performance workloads like VMware. The NVMe-oF
Gateway presents an NVMe-oF target that exports RADOS Block Device (RBD) images as NVMe
namespaces. The NVMe-oF protocol allows clients (initiators) to send NVMe commands to storage
devices (targets) over a TCP/IP network, enabling clients without native Ceph client support to access
Ceph block storage.
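Deployment follows the same cephadm service-spec pattern as the other gateways. The sketch below is only indicative (the exact spec fields for NVMe-oF should be checked against the IBM Storage Ceph documentation for your release); it assumes an RBD pool named rbd and a gateway host called cephnode1:
# nvmeof.yaml
service_type: nvmeof
service_id: rbd
placement:
  hosts:
    - cephnode1
spec:
  pool: rbd

ceph orch apply -i nvmeof.yaml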
As you can see, the CRUSH rule is set to replicate data or placement groups across OSDs instead of
the default hosts. Also, the default pool replica is set to 2x instead of the default 3x. Lastly, we only
have a single host so no need for a standby MGR.
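If you ever need to apply these settings by hand (for example on a cluster bootstrapped without --single-node-defaults), a rough sketch looks like this, assuming a pool named rbd:
# replicate across OSDs instead of hosts and keep 2 copies
ceph osd crush rule create-replicated replicated_osd default osd
ceph osd pool set rbd crush_rule replicated_osd
ceph osd pool set rbd size 2
ceph config set global osd_pool_default_size 2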
First, we will make sure we have a valid Red Hat subscription, ensure we only have the BaseOS and
AppStream repositories enabled, and apply the latest OS updates.
We need to add the IBM Storage Ceph repository and install the license and accept it.
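A condensed sketch of those preparation steps (the IBM repository file URL is published in the IBM Storage Ceph installation documentation, so it is not repeated here):
subscription-manager register
subscription-manager repos --disable='*' \
  --enable=rhel-9-for-x86_64-baseos-rpms \
  --enable=rhel-9-for-x86_64-appstream-rpms
dnf update -y
# place the IBM Storage Ceph repo file under /etc/yum.repos.d/, then:
dnf install -y ibm-storage-ceph-license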
This document includes License Information documents below for multiple Programs. Each
License Information document identifies the Program(s) to which it applies. Only those
License Information documents for the Program(s) for which Licensee has acquired entitlements
apply.
==============================================
IMPORTANT: READ CAREFULLY
.
.
.
.
.
.
You can read this license in another language at /usr/share/ibm-storage-ceph-license/L-XSHK-
LPQLHG/UTF8/
Installed:
ibm-storage-ceph-license-7-2.el9cp.noarch
Complete!
[root@rceph ~]#
[root@rceph ~]# touch /usr/share/ibm-storage-ceph-license/accept
[root@rceph ~]# ls -al /usr/share/ibm-storage-ceph-license/accept
-rw-r--r--. 1 root root 0 Aug 1 20:47 /usr/share/ibm-storage-ceph-license/accept
[root@rceph ~]#
Complete!
[root@rceph ~]#
We want to use Ansible as the root user, so we don't need to create a separate ansible user with sudo
access. We need to ensure that root SSH login is allowed and working.
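For a lab system this can be as simple as the following sketch (do not leave root SSH open like this in production):
# allow root logins over SSH
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
systemctl restart sshd
# generate a key if one does not exist and copy it to the node
ssh-keygen -t ed25519 -N '' -f /root/.ssh/id_ed25519
ssh-copy-id root@rceph.local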
[root@rceph .ssh]#
/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote
system.
(if you think this is a mistake, you may want to use -f option)
[root@rceph .ssh]#
We need to create an ansible inventory file and run the preflight playbook.
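A minimal sketch, assuming the single node is rceph.local and cephadm-ansible is installed under /usr/share/cephadm-ansible:
cd /usr/share/cephadm-ansible
echo "rceph.local" > hosts
ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=ibm"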
[WARNING]: log file at /root/ansible/ansible.log is not writeable and we cannot create it,
aborting
PLAY [insecure_registries]
*********************************************************************************************
************
.
.
.
fail if baseurl is not defined for ceph_custom_repositories ---------------------------------
-------------------------------- 0.02s
install prerequisites packages on clients ---------------------------------------------------
-------------------------------- 0.02s
set_fact ceph_custom_repositories -----------------------------------------------------------
-------------------------------- 0.02s
configure Ceph community repository ---------------------------------------------------------
-------------------------------- 0.02s
[root@rceph cephadm-ansible]#
Make sure the login details for the IBM Container Registry (ICR) are stored in a file; your generated
entitlement key should be the password.
Lastly, we need to bootstrap our single node cluster with cephadm. We have to specify --single-node-
defaults as explained earlier.
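A sketch of those two steps, assuming the ICR credentials file is /root/icr.json (username cp, password set to your entitlement key) and <mon-ip> is a placeholder for the node's cluster IP:
cat > /root/icr.json <<EOF
{
  "url": "cp.icr.io",
  "username": "cp",
  "password": "<your-entitlement-key>"
}
EOF
cephadm bootstrap --mon-ip <mon-ip> \
  --registry-json /root/icr.json \
  --single-node-defaults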
URL: https://rceph.local:8443/
User: admin
Password: rhmj2s8egg
ceph telemetry on
https://docs.ceph.com/en/latest/mgr/telemetry/
Bootstrap complete.
[root@rceph cephadm-ansible]#
You can check the status of your cluster from the command line.
services:
mon: 1 daemons, quorum rceph (age 106s)
mgr: rceph.zigotn(active, since 67s)
osd: 0 osds: 0 up, 0 in
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs:
progress:
Updating grafana deployment (+1 -> 1) (0s)
[............................]
[root@rceph cephadm-ansible]#
[root@rceph ~]#
We can also just check what daemons and services are deployed and what disks are available for OSDs
before connecting to the Ceph Dashboard to complete the installation.
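A few useful checks at this point (run from within cephadm shell, or prefix each with cephadm shell --):
ceph -s
ceph orch ls          # deployed services
ceph orch ps          # running daemons
ceph orch device ls   # disks available for OSDs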
We can now connect to the IBM Storage Ceph Dashboard (credentials were provided during the
bootstrap process).
Click on “Expand Cluster” to complete the installation. On the Add Hosts page we can edit or add node
labels as we did when we created a 4-node cluster.
We do not need to modify anything, so we can click "Next". On the "Create OSDs" page,
"Cost/Capacity-optimized" is selected by default, so we can click "Next".
On the Create Services page you can deploy additional services as you require. For now, though, we
will just select the defaults and click “Next”.
On the "Cluster Review" page we get a summary of our configuration. We can click "Expand
Cluster" to complete the installation.
After a few minutes we should see a healthy cluster with a raw storage capacity of 90GB.
We can also check we have a healthy single node cluster from the command line.
services:
mon: 1 daemons, quorum rceph (age 17m)
mgr: rceph.zigotn(active, since 14m), standbys: rceph.kumbrt
osd: 3 osds: 3 up (since 4m), 3 in (since 4m)
data:
pools: 2 pools, 33 pgs
objects: 2 objects, 449 KiB
usage: 80 MiB used, 90 GiB / 90 GiB avail
pgs: 33 active+clean
Most OpenShift customers would be using ODF/FDF or a CSI driver based on their existing storage
deployment. Alternatively, they have the option of deploying the rook-ceph operator via the OpenShift
OperatorHub, with the option of configuring an external Ceph cluster. So, for OpenShift, the use of Ceph
is largely automated (unless you are deploying the rook-ceph operator to use with an external Ceph
cluster).
For our purposes, we will demonstrate the use of the IBM Storage Ceph CSI driver for a native
Kubernetes cluster. Like OpenShift, you can deploy the rook-ceph operator to deploy a storage cluster
inside your Kubernetes cluster or make use of an external Ceph cluster which is what we will
demonstrate. This deployment type supports providing storage to multiple Kubernetes clusters as
depicted below.
Figure 133: Centralised Ceph Storage Cluster serving multiple k8s Clusters
For the test setup, we have a 4-node Kubernetes cluster (1 master and 3 workers). We will present
RBD, CephFS and also configure Object Bucket Claims using the Ceph RGW.
The instructions to configure an external Ceph cluster for the Rook operator can be found here:
https://rook.io/docs/rook/latest-release/CRDs/Cluster/external-cluster/external-cluster/
Rook is an open-source cloud-native storage orchestrator for Kubernetes, providing the platform,
framework, and support for Ceph storage to natively integrate with Kubernetes. Rook automates
deployment and management of Ceph to provide self-managing, self-scaling, and self-healing storage
services. The Rook operator does this by building on Kubernetes resources to deploy, configure,
provision, scale, upgrade, and monitor Ceph (https://github.com/rook/rook).
We need to export the configuration from the provider Ceph cluster and import it into the Rook
consumer cluster. To do this we need to run the python script create-external-cluster-resources.py in
the provider Ceph cluster cephadm shell, to create the necessary users and keys.
https://rook.io/docs/rook/latest-release/CRDs/Cluster/external-cluster/provider-export/
Let’s execute it on our Ceph storage cluster with the --dry-run option to see what commands it is going
to run. We need to populate the options to match our cluster configuration.
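For reference, a dry-run invocation looks roughly like this; the pool, filesystem and RGW endpoint values are those used in this lab and must be adjusted to match your cluster:
python3 create-external-cluster-resources.py \
  --rbd-data-pool-name rbd \
  --cephfs-filesystem-name cephfs \
  --rgw-endpoint <rgw-endpoint-ip>:8080 \
  --namespace rook-ceph \
  --format bash --dry-run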
[root@cephnode1 k8s]#
Next, we can run the command with the --format bash option as we need to copy the environment
variables to our consumer cluster.
[root@cephnode1 k8s]#
On our consumer cluster we will clone the same GitHub repository as we will need the example YAML
files in the rook/deploy/example directory to define our Kubernetes storage resources.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -
root@k8smaster:~/ceph#
Copy the environment variables from the provider cluster to a file on the consumer cluster to make it
easy to export all the required variables. We need to source the file before we run the import.
We need to install the rook operator either using Helm or using manifests. We will use manifests. Note
the following:
# If Rook is not managing any existing cluster in the 'rook-ceph' namespace do:
# kubectl create -f ../../examples/crds.yaml -f ../../examples/common.yaml -f
../../examples/operator.yaml
# kubectl create -f common-external.yaml -f cluster-external.yaml
#
# If there is already a cluster managed by Rook in 'rook-ceph' then do:
# kubectl create -f common-external.yaml
Since we do not already have any existing cluster, we have to do the following:
customresourcedefinition.apiextensions.k8s.io/cephobjectstores.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectstoreusers.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectzonegroups.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephobjectzones.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/cephrbdmirrors.ceph.rook.io created
customresourcedefinition.apiextensions.k8s.io/objectbucketclaims.objectbucket.io created
customresourcedefinition.apiextensions.k8s.io/objectbuckets.objectbucket.io created
clusterrole.rbac.authorization.k8s.io/cephfs-csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/cephfs-external-provisioner-runner created
clusterrole.rbac.authorization.k8s.io/objectstorage-provisioner-role created
clusterrole.rbac.authorization.k8s.io/rbd-csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/rbd-external-provisioner-runner created
clusterrole.rbac.authorization.k8s.io/rook-ceph-cluster-mgmt created
clusterrole.rbac.authorization.k8s.io/rook-ceph-global created
clusterrole.rbac.authorization.k8s.io/rook-ceph-mgr-cluster created
clusterrole.rbac.authorization.k8s.io/rook-ceph-mgr-system created
clusterrole.rbac.authorization.k8s.io/rook-ceph-object-bucket created
clusterrole.rbac.authorization.k8s.io/rook-ceph-osd created
clusterrole.rbac.authorization.k8s.io/rook-ceph-system created
clusterrolebinding.rbac.authorization.k8s.io/cephfs-csi-nodeplugin-role created
clusterrolebinding.rbac.authorization.k8s.io/cephfs-csi-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/objectstorage-provisioner-role-binding created
clusterrolebinding.rbac.authorization.k8s.io/rbd-csi-nodeplugin created
clusterrolebinding.rbac.authorization.k8s.io/rbd-csi-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-global created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-mgr-cluster created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-object-bucket created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-osd created
clusterrolebinding.rbac.authorization.k8s.io/rook-ceph-system created
role.rbac.authorization.k8s.io/cephfs-external-provisioner-cfg created
role.rbac.authorization.k8s.io/rbd-csi-nodeplugin created
role.rbac.authorization.k8s.io/rbd-external-provisioner-cfg created
role.rbac.authorization.k8s.io/rook-ceph-cmd-reporter created
role.rbac.authorization.k8s.io/rook-ceph-mgr created
role.rbac.authorization.k8s.io/rook-ceph-osd created
role.rbac.authorization.k8s.io/rook-ceph-purge-osd created
role.rbac.authorization.k8s.io/rook-ceph-system created
rolebinding.rbac.authorization.k8s.io/cephfs-csi-provisioner-role-cfg created
rolebinding.rbac.authorization.k8s.io/rbd-csi-nodeplugin-role-cfg created
rolebinding.rbac.authorization.k8s.io/rbd-csi-provisioner-role-cfg created
rolebinding.rbac.authorization.k8s.io/rook-ceph-cluster-mgmt created
rolebinding.rbac.authorization.k8s.io/rook-ceph-cmd-reporter created
rolebinding.rbac.authorization.k8s.io/rook-ceph-mgr created
rolebinding.rbac.authorization.k8s.io/rook-ceph-mgr-system created
rolebinding.rbac.authorization.k8s.io/rook-ceph-osd created
rolebinding.rbac.authorization.k8s.io/rook-ceph-purge-osd created
rolebinding.rbac.authorization.k8s.io/rook-ceph-system created
serviceaccount/objectstorage-provisioner created
serviceaccount/rook-ceph-cmd-reporter created
serviceaccount/rook-ceph-default created
serviceaccount/rook-ceph-mgr created
serviceaccount/rook-ceph-osd created
serviceaccount/rook-ceph-purge-osd created
serviceaccount/rook-ceph-rgw created
serviceaccount/rook-ceph-system created
serviceaccount/rook-csi-cephfs-plugin-sa created
serviceaccount/rook-csi-cephfs-provisioner-sa created
serviceaccount/rook-csi-rbd-plugin-sa created
serviceaccount/rook-csi-rbd-provisioner-sa created
configmap/rook-ceph-operator-config created
deployment.apps/rook-ceph-operator created
Error from server (AlreadyExists): error when creating "../../examples/common.yaml":
namespaces "rook-ceph" already exists
root@k8smaster:~/ceph/rook/deploy/examples/external#
I mistakenly ran the import-external-cluster.sh script before this command, hence the resource
"already exists" errors. This didn't cause any issues, though you should follow the steps in the correct order.
root@k8smaster:~/ceph/rook/deploy/examples/external# kubectl create -f common-external.yaml -
f cluster-external.yaml
cephcluster.ceph.rook.io/rook-ceph-external created
Error from server (AlreadyExists): error when creating "common-external.yaml": namespaces
"rook-ceph" already exists
Error from server (AlreadyExists): error when creating "common-external.yaml":
rolebindings.rbac.authorization.k8s.io "rook-ceph-cluster-mgmt" already exists
We can now run the import-external-cluster.sh script to define the external cluster.
After issuing the above commands we can query the rook operator deployment. Note, k is aliased to
kubectl.
After a few minutes, we should see all pods running and check if our external cluster is connected.
We can also query our storage classes and set the default storage class.
root@k8smaster:~/ceph/rook/deploy/examples# k get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
ALLOWVOLUMEEXPANSION AGE
ceph-rbd rook-ceph.rbd.csi.ceph.com Delete Immediate true
15m
cephfs rook-ceph.cephfs.csi.ceph.com Delete Immediate true
15m
root@k8smaster:~/ceph/rook/deploy/examples#
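Marking cephfs as the default storage class is a standard Kubernetes annotation patch, for example:
kubectl patch storageclass cephfs \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'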
root@k8smaster:~/ceph/rook/deploy/examples# k get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
ALLOWVOLUMEEXPANSION AGE
ceph-rbd rook-ceph.rbd.csi.ceph.com Delete Immediate true
16m
cephfs (default) rook-ceph.cephfs.csi.ceph.com Delete Immediate true
16m
root@k8smaster:~/ceph/rook/deploy/examples#
We can also check the storage class definitions from the Kubernetes dashboard and make sure they
correspond to the correct RBD pool and CephFS filesystem.
We can use the example YAML files from the rook Git repository we cloned to test PVC creation. First
let’s try using CephFS storage class.
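The example PVC is essentially the following (a sketch matching the cephfs-pvc shown later; apply it with kubectl apply -f pvc.yaml):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: cephfs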
root@k8smaster:~/ceph/rook/deploy/examples/csi/cephfs# k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS
CLAIM STORAGECLASS REASON AGE
We can also check the status of our PVCs on the Kubernetes Dashboard.
PVCs only support block or file. However, through the use of Object Bucket Claims (OBC) you can
define storage classes from where buckets can be provisioned. The goal of the Object Bucket Claim
(OBC) is to “provide a generic, dynamic bucket provision API similar to Persistent Volumes and
Claims”. This will look a lot like the PVC model used for block and file: An administrator defines a
storage class that points to underlying object storage, and when users create an object bucket claim
the bucket is automatically provisioned and is guaranteed to be available by the time the pod is
started. The procedure to configure an OBC is detailed here.
https://rook.io/docs/rook/latest/Storage-Configuration/Object-Storage-RGW/object-
storage/#connect-to-an-external-object-store
A good reference on object storage for Kubernetes concepts and the new COSI standard can be found
here:
https://archive.fosdem.org/2021/schedule/event/sds_object_storage_for_k8s/attachments/slides/
4507/export/events/attachments/sds_object_storage_for_k8s/slides/4507/cosi_slides.pdf
Let’s first define an external object store to the rook operator. We will use our highly-available RGW
endpoint.
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
name: external-store
namespace: rook-ceph # namespace:cluster
spec:
gateway:
# The port on which **ALL** the gateway(s) are listening on.
# Passing a single IP from a load-balancer is also valid.
port: 8080
externalRgwEndpoints:
- ip: cephs3.local
# hostname: example.com
root@k8smaster:~/ceph/rook/deploy/examples/external#
Now we can use the example YAML files to define our Object Storage Class using the rook external
object store we created above.
Provisioner: rook-ceph.ceph.rook.io/bucket
Parameters: objectStoreName=external-store,objectStoreNamespace=rook-ceph
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
root@k8smaster:~/ceph/rook/deploy/examples/external#
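With the bucket storage class in place, a bucket is requested with an ObjectBucketClaim. The sketch below assumes the rook-ceph-delete-bucket storage class shown later in the output; the claim name ceph-bucket is arbitrary. Rook then creates a ConfigMap and Secret with the same name as the claim, containing the bucket endpoint and S3 access keys for the application to consume.
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ceph-bucket
spec:
  generateBucketName: ceph-bkt
  storageClassName: rook-ceph-delete-bucket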
First, we need to create a deployment and apply it. We will use CephFS storage class for the PVC.
metadata:
name: nginx-pvc-claim
spec:
storageClassName: cephfs
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: clear-nginx-deployment
spec:
selector:
matchLabels:
app: clear-nginx
template:
metadata:
labels:
app: clear-nginx
spec:
containers:
- name: clear-nginx
image: nginx:1.14.2
volumeMounts:
- mountPath: /var/www/html
name: site-data
ports:
- containerPort: 80
volumes:
- name: site-data
persistentVolumeClaim:
claimName: nginx-pvc-claim
---
apiVersion: v1
kind: Service
metadata:
name: clear-nginx-service
spec:
ports:
- name: http
port: 80
protocol: TCP
targetPort: 80
selector:
app: clear-nginx
type: NodePort
root@k8smaster:~/ceph#
root@k8smaster:~/ceph# k apply -f nginx.yaml
persistentvolume/nginx-pvc created
persistentvolumeclaim/nginx-pvc-claim created
deployment.apps/clear-nginx-deployment created
service/clear-nginx-service created
root@k8smaster:~/ceph#
https://github.com/kubernetes-csi/external-snapshotter
To support snapshots, we need to clone the Git repository above and then follow the documented
instructions. First, we need to install the snapshot and volume group snapshot CRDs. Note the
location of the files in the cloned repository.
Then we need to deploy the common snapshot controller. Make sure to change the namespace to
match the namespace where your rook operator is deployed.
root@k8smaster:~/ceph/snap/external-snapshotter/deploy/kubernetes/snapshot-controller# vi
rbac-snapshot-controller.yaml
root@k8smaster:~/ceph/snap/external-snapshotter/deploy/kubernetes/snapshot-controller# grep
namespace: *
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
setup-snapshot-controller.yaml: namespace: kube-system
root@k8smaster:~/ceph/snap/external-snapshotter/deploy/kubernetes/snapshot-controller# vi
setup-snapshot-controller.yaml
root@k8smaster:~/ceph/snap/external-snapshotter/deploy/kubernetes/snapshot-controller# grep
namespace: *
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
rbac-snapshot-controller.yaml: namespace: rook-ceph
setup-snapshot-controller.yaml: namespace: rook-ceph
root@k8smaster:~/ceph/snap/external-snapshotter/deploy/kubernetes/snapshot-controller#
Once you have modified the namespaces to match your Kubernetes cluster, you can apply the
common snapshot controller CRD.
You should see new snapshot controller pods deployed. You can check that they are running.
Let's create an RBD snapshot class, using the example YAML files from the rook Git repository we cloned
earlier.
We can now take a snapshot of an RBD PVC. Modify the snapshot.yaml file to match your snapshot
name and source volume.
You can double check the snapshot creation from the Ceph Dashboard as well.
We can also test restoring the snapshot to a new PVC. Modify the storageClassName to match your
RBD storage class and the name of the snapshot you want to restore.
root@k8smaster:~/ceph/rook/deploy/examples/csi/rbd# k get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
ALLOWVOLUMEEXPANSION AGE
ceph-rbd rook-ceph.rbd.csi.ceph.com Delete Immediate
true 2d8h
cephfs (default) rook-ceph.cephfs.csi.ceph.com Delete Immediate
true 2d8h
rook-ceph-delete-bucket rook-ceph.ceph.rook.io/bucket Delete Immediate
false 2d5h
root@k8smaster:~/ceph/rook/deploy/examples/csi/rbd# cat pvc-restore.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: rbd-pvc-restore
spec:
storageClassName: ceph-rbd
dataSource:
name: rbd-pvc-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 2Gi
root@k8smaster:~/ceph/rook/deploy/examples/csi/rbd# k apply -f pvc-restore.yaml
persistentvolumeclaim/rbd-pvc-restore created
root@k8smaster:~/ceph/rook/deploy/examples/csi/rbd# k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES
STORAGECLASS AGE
cephfs-pvc Bound pvc-a441b483-0921-4735-8cc8-202f82d96cc7 2Gi RWO
cephfs 2d7h
rbd-pvc Bound pvc-a20cda16-332f-4f01-af1e-dc319d61c60c 2Gi RWO
ceph-rbd 2d7h
rbd-pvc-restore Bound pvc-2715d0fe-1b88-40a3-958d-6cd2b4b99bb0 2Gi RWO
ceph-rbd 5s
root@k8smaster:~/ceph/rook/deploy/examples/csi/rbd#
Now that we have created a snapshot class for RBD, we can do the same for CephFS.
root@k8smaster:~/ceph/rook/deploy/examples/csi/cephfs# cat snapshotclass.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-cephfsplugin-snapclass
driver: rook-ceph.cephfs.csi.ceph.com # csi-provisioner-name
parameters:
# Specify a string that identifies your cluster. Ceph CSI supports any
# unique string. When Ceph CSI is deployed by Rook use the Rook namespace,
And take a snapshot of an existing CephFS PVC. Modify the snapshot.yaml file to specify the name of
the snapshot and source CephFS PVC.
You can check the creation of the CephFS snapshot on the Ceph Dashboard.
Finally, we can restore the CephFS snapshot to a new PVC. Modify the storageClassName to match
your CephFS storage class and the name of the snapshot you want to restore.
root@k8smaster:~/ceph/rook/deploy/examples/csi/cephfs# k get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
ALLOWVOLUMEEXPANSION AGE
https://github.com/vmware-tanzu/velero/releases/tag/v1.15.1
https://velero.io
Download Velero and unpack it on a node in your Kubernetes cluster. Create a text file with the Velero
user S3 credentials.
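The credentials file (s3.txt in our install command) uses the standard AWS format; the access and secret keys are those of an RGW user created for Velero. A sketch, assuming the user is called velero:
# create the S3 user on the Ceph cluster
radosgw-admin user create --uid=velero --display-name="velero"
# s3.txt on the Kubernetes node
[default]
aws_access_key_id = <velero user access key>
aws_secret_access_key = <velero user secret key>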
Next, we can deploy Velero. Note how we specify the Ceph RGW endpoint in the --backup-location-
config option. We also have to specify the location of our SSL certificate since we configured our RGW
to use SSL termination.
root@k8smaster:~/velero# velero install --provider aws --bucket velero --plugins
velero/velero-plugin-for-aws --secret-file ./s3.txt --use-volume-snapshots=false --cacert
./cephs3.pem --backup-location-config
region=default,s3Url=http://cephs3.local:8080,insecureSkipTLSVerify=true
We can query the Velero deployment on our Kubernetes cluster and also check if the backup location
is available.
We can now use Velero to backup our Kubernetes applications. We will use the nginx webserver we
deployed earlier.
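A minimal backup run looks like this (the backup name nginx-backup is arbitrary):
velero backup create nginx-backup
velero backup get
velero backup describe nginx-backup --details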
Phase: Completed
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Volumes:
Velero-Native Snapshots: <none included>
HooksAttempted: 0
HooksFailed: 0
root@k8smaster:~/velero#
You can check the backup objects store on our Ceph cluster.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=dashboard-managing-multi-clusters-
technology-preview
To setup multi-cluster management, navigate to Multi-Cluster and click on “Connect” to add another
Ceph Storage Cluster. You need to specify the cluster API URL (this is the same as the dashboard URL)
and also the admin username and password. You can also set the login expiration in days.
You can see a list of all connected clusters on the Ceph Dashboard. You can launch the Ceph
dashboard of any connected cluster by clicking on the URL.
If you navigate to Multi-Cluster on one of the connected clusters, you will see a message stating that
the cluster is already managed by another cluster with the cluster ID of the other cluster.
This feature is useful to demonstrate during a POC if you are deploying multiple clusters (e.g. for
replication). You can see rolled up information from the management cluster by navigating to Multi-
Cluster -> Overview (scroll down to see information).
Open source Ceph Reef (18.2.4) did not include this feature at the time of creating this document.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=images-creating-snapshots
Below is the procedure to showcase RBD image snapshots. We will first map the image on our RBD
client and write a file to it so that we can later demonstrate how to roll back the source image from a
snapshot.
Create an RBD image and map it to a client. You will need to create a Ceph user for the RBD client as
we did before, e.g. ceph auth get-or-create client.ndcephrbd mon 'profile rbd' osd 'profile rbd
pool=rbd' -o /etc/ceph/ceph.client.ndcephrbd.keyring. Copy the client keyring and the ceph.conf file
from one of your cluster nodes to the RBD client, then map the volume, create a filesystem if necessary
and mount it. Once mounted, we will create a file to demonstrate how we can use snapshots to
roll back the RBD image.
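A condensed sketch of those steps, assuming a hypothetical image called rbdtest in the rbd pool and the client user client.ndcephrbd created above:
# on the cluster
rbd create rbd/rbdtest --size 5G
# on the client (keyring and ceph.conf copied to /etc/ceph)
rbd map rbd/rbdtest --id ndcephrbd
mkfs.ext4 /dev/rbd0
mkdir -p /mnt/rbd
mount /dev/rbd0 /mnt/rbd
cp /etc/hosts /mnt/rbd/file_before_snap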
root@labserver:/mnt/rbd# ls -alt
total 8
drwxr-xr-x 2 root root 30 Dec 31 12:53 .
-rw-r--r-- 1 root root 398 Dec 31 12:53 file_before_snap
drwxr-xr-x 13 root root 4096 Aug 4 21:49 ..
root@labserver:/mnt/rbd#
Next, on the Ceph Dashboard, navigate to Block, Images and on the selected RBD image click on the
Snapshots tab and Create a new snapshot.
Once the snapshot is created, note the options we have available. We will demonstrate the rollback
and clone functionality.
Next, we will create a new file after taking the snapshot. Then we can unmount the filesystem and
unmap the RBD image before restoring the source image from the snapshot we just created. This
is done to ensure consistency of the filesystem and is the typical procedure employed on all of our
IBM storage arrays.
root@labserver:/mnt/rbd# ls -alt
total 12
drwxr-xr-x 2 root root 53 Dec 31 12:57 .
-rw-r--r-- 1 root root 398 Dec 31 12:57 file_after_snap
-rw-r--r-- 1 root root 398 Dec 31 12:53 file_before_snap
drwxr-xr-x 13 root root 4096 Aug 4 21:49 ..
root@labserver:/mnt/rbd#
Now we can rollback the source image from the snapshot from the Ceph Dashboard.
Depending on the size of the RBD image, it might take some time to complete.
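The command-line equivalent of the Dashboard rollback is roughly the following (continuing the hypothetical rbdtest image, with mysnap as a placeholder snapshot name):
umount /mnt/rbd
rbd unmap /dev/rbd0
rbd snap rollback rbd/rbdtest@mysnap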
To confirm the rollback was successful, we can remap the RBD image and remount the filesystem and
we should only see the file(s) we had prior to taking the snapshot.
The above procedure would suffice to test the creation and recovery from a block snapshot. Let us
now clone the snapshot to a new RBD image. First, we need to protect the source snapshot image.
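From the command line, the protect and clone steps would look roughly like this (again using the hypothetical rbdtest image and mysnap snapshot):
rbd snap protect rbd/rbdtest@mysnap
rbd clone rbd/rbdtest@mysnap rbd/rbdtest-clone
rbd ls rbd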
After the clone completes, we should have a new RBD image that would be visible to RBD clients who
have the correct permissions on the RBD pool it resides in.
To demonstrate that the clone is available for use, we can map it to our desired RBD client, mount the
filesystem and list the contents as follows:
The above two use cases should be sufficient to demonstrate block snapshots for your POC.
The use of subvolumes and subvolume groups is outside the scope of this document (they are
typically used with OpenStack, for example). Snapshots are currently supported on volumes and
subvolumes (as of IBM Storage Ceph 7.1). As with RBD snapshots, you can clone a subvolume
snapshot to a new subvolume.
For our purposes we will demonstrate creating a snapshot of the CephFS filesystem (CephFS volume).
This process is documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=snapshots-creating-snapshot-ceph-file-
system
On the Ceph Dashboard, if you navigate to File Systems and select a filesystem, under the Snapshots
tab you will notice that it requires the use of subvolumes.
One useful feature of CephFS snapshots is the ability to use the snapshot scheduler to automate
snapshot creation and retention. Unfortunately, this feature is supported on CephFS directories or
subvolumes only.
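For completeness, a snapshot schedule on a CephFS directory looks roughly like this (the path /somedir, the 1h interval and the 24-hourly retention are example values):
ceph mgr module enable snap_schedule
ceph fs snap-schedule add /somedir 1h --fs cephfs
ceph fs snap-schedule retention add /somedir h 24 --fs cephfs
ceph fs snap-schedule list /somedir --fs cephfs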
In order to demonstrate an entire CephFS volume (filesystem) snapshot, we will use an existing
filesystem called cephfs and create a filesystem snapshot on the command line and the Ceph
Dashboard. First, let us authorize a new Ceph user called client.ndceph who has access to create
snapshots (for the command line use case). We need to allow snapshots, so we need to tick this box. We
should also untick Root Squash.
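From the command line, the equivalent authorization would be roughly as follows (the s capability grants snapshot permissions):
ceph fs authorize cephfs client.ndceph / rws
ceph auth get client.ndceph -o /etc/ceph/ceph.client.ndceph.keyring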
As we did in the section describing how to deploy CephFS, we will mount the filesystem on a client.
We need the client keyring obtained as below after authorizing a new filesystem user on the Ceph
Dashboard.
We need to copy the client keyring and /etc/ceph/ceph.conf from the cluster node where we ran the above
command. Then we can mount the filesystem as follows:
On one of our Ceph cluster nodes, we need to enable new snapshots on all existing filesystems.
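A sketch of the client mount and the snapshot enablement (<mon-ip> is a placeholder for one of your MON addresses, and the keyring is assumed to be in /etc/ceph on the client):
# on the client
mount -t ceph <mon-ip>:6789:/ /mnt/cephfs -o name=ndceph,fs=cephfs
# on a cluster node
ceph fs set cephfs allow_new_snaps true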
Next, we can copy an arbitrary file to the new filesystem prior to creating a command line snapshot.
root@labserver:/mnt/cephfs# cp /etc/hosts .
root@labserver:/mnt/cephfs# ls -alt
total 5
-rw-r--r-- 1 root root 398 Dec 31 13:53 hosts
drwxr-xr-x 2 root root 1 Dec 31 13:53 .
drwxr-xr-x 15 root root 4096 Dec 31 13:49 ..
root@labserver:/mnt/cephfs#
To create a CephFS snapshot from the command line, we need to navigate to the hidden .snap
directory in the root of our Ceph filesystem.
root@labserver:/mnt/cephfs# cd .snap
root@labserver:/mnt/cephfs/.snap#
root@labserver:/mnt/cephfs/.snap# pwd
/mnt/cephfs/.snap
root@labserver:/mnt/cephfs/.snap#
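Creating the snapshot is simply a matter of making a new directory inside .snap, for example:
mkdir mysnap
cd mysnap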
root@labserver:/mnt/cephfs/.snap/mysnap# ls -alt
total 1
-rw-r--r-- 1 root root 398 Dec 31 13:53 hosts
drwxr-xr-x 2 root root 1 Dec 31 13:53 .
drwxr-xr-x 2 root root 1 Dec 31 13:49 ..
root@labserver:/mnt/cephfs/.snap/mysnap#
If you don’t want to use the command line, you can use the Ceph Dashboard to create a filesystem
snapshot. Navigate to File systems -> cephfs -> Directories and click on Create Snapshot.
We can verify this from the active filesystem as well on a CephFS client.
root@labserver:/mnt/cephfs/.snap# ls -alt
total 0
drwxr-xr-x 2 root root 1 Dec 31 13:53 ..
drwxr-xr-x 2 root root 1 Dec 31 13:53 2024-12-31T14:00:02.978+02:00
drwxr-xr-x 2 root root 1 Dec 31 13:53 mysnap
drwxr-xr-x 2 root root 2 Dec 31 13:49 .
root@labserver:/mnt/cephfs/.snap#
root@labserver:/mnt/cephfs/.snap# ls -alt 2024-12-31T14:00:02.978+02:00
total 1
-rw-r--r-- 1 root root 398 Dec 31 13:53 hosts
drwxr-xr-x 2 root root 1 Dec 31 13:53 .
drwxr-xr-x 2 root root 2 Dec 31 13:49 ..
root@labserver:/mnt/cephfs/.snap#
Two-way replication supports failover and failback (demote the primary image and promote the non-
primary image). The rbd-mirror daemon runs on both clusters, so after a failover, changes made to the
promoted images on the secondary cluster are replicated back to the original primary. Note that only
the image that is currently primary accepts writes, and two-way mirroring supports only two sites.
RBD mirroring is configured per pool, and Ceph supports two mirroring modes: pool mode (all RBD
images in the pool are mirrored) and image mode (only a specific subset of images is mirrored). A full
explanation of RBD mirroring can be found here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=devices-mirroring-ceph-block
Important: The CRUSH hierarchies supporting primary and secondary pools that mirror block
device images must have the same capacity and performance characteristics, and must have
adequate bandwidth to ensure mirroring without excess latency. For example, if you have X MB/s
average write throughput to images in the primary storage cluster, the network must support N * X
throughput in the network connection to the secondary site plus a safety factor of Y% to mirror N
images. (https://www.ibm.com/docs/en/storage-ceph/7.1?topic=devices-ceph-block-device-
mirroring)
For a POC environment, you most likely would need to demonstrate this functionality. We will setup
two-way replication to showcase how we can tolerate a site failure for Ceph RBD clients. Our test
environment is making use of two single node IBM Storage Ceph clusters (for a POC, you could use a
single node cluster for the secondary cluster). As per the diagram below, we will mirror an RBD image
from RCEPH to LCEPH. We have a Ceph RBD client called labserver to test client access to the RBD
image.
The first thing we need to do for two-way replication is to deploy the rbd-mirror service on each of our
clusters.
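You can deploy the service from the Dashboard as shown below, or from the command line with something like:
# on RCEPH
ceph orch apply rbd-mirror --placement="rceph.local"
# on LCEPH
ceph orch apply rbd-mirror --placement="lceph.local"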
Figure 169: Ceph Dashboard – Administration – Services – Create RBD Mirror Service on RCEPH
Figure 170: Ceph Dashboard – Administration – Services – Create RBD Mirror Service on LCEPH
If we navigate to Block -> Mirroring, we should see our rbd-mirror daemon running and the current
mirroring mode that is enabled for our rbd pool. Note, you can edit the Site Name to make it easier to
understand the configuration.
We want to enable RBD mirroring for all newly created images. We can enable this as follows:
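A command-line sketch of what the Dashboard does here: enable pool-mode mirroring on the rbd pool and, optionally, make journaling a default image feature so that new images are mirrored automatically. The rbd_default_features value is an assumption covering the usual defaults plus journaling; verify it for your environment.
rbd mirror pool enable rbd pool
# optional: add journaling (64) to the default features (1+4+8+16+32+64)
ceph config set global rbd_default_features 125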
If you already have RBD images created in your RBD pool, you can enable mirroring as follows:
rbd feature enable <POOL_NAME>/<IMAGE_NAME> exclusive-lock, journaling
If we query the mirror info for our RBD pool called rbd, we will see that we don’t have any peer sites
defined yet.
You might see a warning for the rbd-mirror daemon health status. This will disappear as soon as we
define the peer clusters at both sites.
DAEMONS
service 124170:
instance_id:
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: false
health: WARNING
callouts: not reporting status
service 134099:
instance_id:
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: false
health: OK
IMAGES
[root@rceph ~]#
We now need to bootstrap the peer clusters. You can navigate to Block -> Mirroring and click on the
top right option “Create Bootstrap Token” on the primary site which is RCEPH.
Figure 173: Ceph Dashboard – Block – Mirroring – Create Bootstrap Token on RCEPH
We now need to import the bootstrap token on secondary site cluster. Navigate to Block -> Mirroring
and click on the down icon next to the Create Bootstrap Token to Import a Bootstrap token. On the
Import Bootstrap Token page make sure to select bi-directional for two-way replication, insert the
local site’s name (LCEPH), choose the pool we want mirroring enabled on (we only have one RBD pool)
and paste the peer’s bootstrap token.
Figure 174: Ceph Dashboard – Block – Mirroring – Import Bootstrap Token on LCEPH
You can do the bootstrap process from the command line as well as follows:
[root@rceph ~]# rbd mirror pool peer bootstrap create --site-name RCEPH rbd > rceph.rbd
[root@rceph ~]#
[root@lceph ~]# rbd mirror pool peer bootstrap import --site-name LCEPH --direction rx-tx
rbd /tmp/rceph.rbd
[root@lceph ~]#
On the Ceph dashboard for each cluster, we should see both set to leaders with no images being
mirrored yet.
On RCEPH we can create an RBD image to mirror. We will name it linuxlun. Mirroring should be auto-
selected as we enabled mirroring for all new images.
In the current version of Ceph, you cannot mirror an RBD image that belongs to a namespace. See
https://bugzilla.redhat.com/show_bug.cgi?id=2024444.
[root@cephnode1 ~]# rbd mirror image enable rbd/winlun --namespace=windows
2024-08-03T21:18:28.068+0200 7f060e20dc00 -1 librbd::api::Mirror: image_enable: cannot
enable mirroring: mirroring is not enabled on a namespace
[root@cephnode1 ~]#
This should be fixed in the next release of IBM Storage Ceph.
Make sure to create a corresponding client RBD user with the ceph auth get-or-create command as
discussed in the RBD deployment section (with MON and OSD capabilities) on both Ceph clusters. For
our test we will use a Ceph user called drlinux. Once the image is created, we should see it set as
primary on RCEPH and secondary on LCEPH.
On RCEPH, the mirroring status should be STOPPED (since we are the primary for this image).
To get more details of the mirroring status (including the transfer speed and last update) we can use
the command line.
1 replaying
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: b5bc5fff-824e-42e9-9afb-b9a1a553bf4d
state: up+stopped
description: local image is primary
service: rceph.qxvdwv on rceph.local
last_update: 2024-08-04 10:12:18
peer_sites:
name: LCEPH
state: up+replaying
description: replaying,
{"bytes_per_second":0.0,"entries_behind_primary":0,"entries_per_second":0.0,"non_primary_posi
tion":{"entry_tid":3,"object_number":3,"tag_tid":15},"primary_position":{"entry_tid":3,"objec
t_number":3,"tag_tid":15}}
last_update: 2024-08-04 10:12:19
[root@rceph ~]#
The next step in a POC would be to simulate a DR scenario. To do this we will mount our RBD image
on our RBD client and write some data to it and then failover to the secondary cluster and access the
same RBD image.
Trying to map the RBD image, however, fails on the client (Ubuntu Server 22.04). See below:
The Ubuntu 22.04 kernel does not support one of the RBD image features (0x40). After some research,
this feature was found to correspond to the journaling flag that was set on the RBD image at creation.
Later Linux kernels have support for all the advanced RBD features, as explained here:
https://access.redhat.com/solutions/4270092. For the purposes of this document, instead of trying
a later Linux kernel on our RBD client, it is easier to switch from journal-based mirroring to snapshot-based
mirroring, since our goal is to demonstrate the failover/failback process. If your RBD client runs a
kernel newer than ours (kernel 6.5) that supports RBD journaling, you don't have to change to using
snapshots.
If you have existing RBD images you can convert from journal based mirroring to snapshot based
mirroring without having to delete any images.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=devices-converting-journal-based-
mirroring-snapshot-based-mirroring
On both RCEPH and LCEPH, delete the RBD image and then change the mirroring mode to image (from
pool).
Figure 182: Ceph Dashboard – Block – Mirroring – Pool Mirroring Mode on RCEPH
Do the same on LCEPH. Next, recreate the RBD image specifying Snapshot mirroring on RCEPH and
choose a suitable schedule interval (we have chosen every 3 minutes).
A maximum of 5 snapshots are retained by default. If required, the limit can be overridden through
the rbd_mirroring_max_mirroring_snapshots configuration option. All snapshots are automatically
removed when the RBD image is deleted or when mirroring is disabled.
Figure 183: Ceph Dashboard – Block – Images – Pool Mirroring Mode on RCEPH
Soon after creating the new RBD image, you should see automatic snapshots being taken for the
snapshot mirroring process.
If we query the pool mirroring status now, we can see the details for the last snapshot etc.
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: up+stopped
description: local image is primary
service: rceph.qxvdwv on rceph.local
last_update: 2024-08-04 10:35:18
peer_sites:
name: LCEPH
state: up+replaying
description: replaying,
{"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"last_snapshot_bytes":0,"last_snapshot_sync_
seconds":0,"local_snapshot_timestamp":1722760410,"remote_snapshot_timestamp":1722760410,"repl
ay_state":"idle"}
last_update: 2024-08-04 10:35:19
[root@rceph ~]#
Next, we can mount this RBD image on our client, create a filesystem and write some data to it.
We can query the pool mirroring status again to see if data is being transferred.
And we can query the mirroring status for the actual RBD image on the secondary cluster to check the
last snapshot sync time.
Let us now simulate DR (with a planned failover). First we will write a new file prior to stopping client
RBD access to the primary image.
We can check the mirror status to make sure the last snapshot update is after we created the test file
and that the health status is good.
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: up+stopped
description: local image is primary
service: rceph.qxvdwv on rceph.local
last_update: 2024-08-04 11:07:18
peer_sites:
name: LCEPH
state: up+replaying
description: replaying,
{"bytes_per_second":0.0,"bytes_per_snapshot":1114368.0,"last_snapshot_bytes":2097152,"last_sn
apshot_sync_seconds":0,"local_snapshot_timestamp":1722762360,"remote_snapshot_timestamp":1722
762360,"replay_state":"idle"}
last_update: 2024-08-04 11:07:19
[root@rceph ~]#
And on LCEPH, we can PROMOTE the RBD image (if this was an unplanned failover, you can use the --force option).
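The command-line equivalent of a planned failover is roughly:
# on RCEPH (current primary)
rbd mirror image demote rbd/linuxlun
# on LCEPH
rbd mirror image promote rbd/linuxlun
# unplanned failover only (primary site unreachable):
# rbd mirror image promote rbd/linuxlun --force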
We can query the mirroring status to make sure RCEPH is now set to replaying.
DAEMONS
service 104106:
instance_id: 104159
client_id: lceph.quprif
hostname: lceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: up+stopped
description: local image is primary
service: lceph.quprif on lceph.local
last_update: 2024-08-04 11:09:49
peer_sites:
name: RCEPH
state: up+replaying
description: replaying,
{"bytes_per_second":0.0,"bytes_per_snapshot":370176.0,"last_snapshot_bytes":370176,"last_snap
shot_sync_seconds":0,"local_snapshot_timestamp":1722762520,"remote_snapshot_timestamp":172276
2520,"replay_state":"idle"}
last_update: 2024-08-04 11:09:48
[root@lceph ~]#
Let us map the RBD image from LCEPH on our RBD client, mount the filesystem and check to see if we
have the latest data (the file we created prior to the planned failover in our example).
To simulate a running workload at the secondary site (which is now the primary), we will create a file.
Now let's fail back to the old primary (RCEPH). First, we will unmap the RBD image on our client.
Next, we need to check the current mirroring status and make sure it is healthy on LCEPH.
DAEMONS
service 104106:
instance_id: 104159
client_id: lceph.quprif
hostname: lceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: up+stopped
description: local image is primary
service: lceph.quprif on lceph.local
last_update: 2024-08-04 11:23:49
peer_sites:
name: RCEPH
state: up+replaying
description: replaying,
{"bytes_per_second":0.0,"bytes_per_snapshot":370176.0,"last_snapshot_bytes":370176,"last_snap
shot_sync_seconds":0,"local_snapshot_timestamp":1722762520,"remote_snapshot_timestamp":172276
2520,"replay_state":"idle"}
last_update: 2024-08-04 11:23:48
[root@lceph ~]#
Before we fail back, we need to double-check that the image on RCEPH is not set as primary (in the case
of an unplanned outage it might still be set as primary). Make sure "mirroring primary" is set to false.
For our example (planned failover/failback), we do not have to manually resynchronize the image on
the old primary, since the changes made while we were failed over to LCEPH were replicated back to
RCEPH. However, if this was an unplanned outage, before failing back to the old primary (RCEPH) we
would need to manually resynchronize the image (after making sure RCEPH is demoted as per above)
from LCEPH to ensure we have the latest updates. The syntax of the resync command at site-A (the old
primary) is as follows:
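With our image and pool names, that would be roughly:
rbd mirror image resync rbd/linuxlun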
For the sake of completeness, we will run this even though in our example it is not needed.
We need to wait for the mirroring status at RCEPH (old primary) to transition from down+unknown to
up+replaying.
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: down+unknown
description: status not found
last_update:
peer_sites:
name: LCEPH
state: up+stopped
description: local image is primary
last_update: 2024-08-04 11:27:21
[root@rceph ~]# rbd mirror pool status rbd --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
1 stopped
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: down+unknown
description: status not found
last_update:
peer_sites:
name: LCEPH
state: up+stopped
description: local image is primary
last_update: 2024-08-04 11:27:21
[root@rceph ~]#
DAEMONS
service 134099:
instance_id: 134159
client_id: rceph.qxvdwv
hostname: rceph.local
version: 18.2.1-194.el9cp
leader: true
health: OK
IMAGES
linuxlun:
global_id: 0ee454a8-cf71-48e2-95ea-96f6a737f5ac
state: up+replaying
description: replaying,
{"bytes_per_second":36165939.2,"bytes_per_snapshot":723318784.0,"last_snapshot_bytes":7233187
84,"last_snapshot_sync_seconds":7,"local_snapshot_timestamp":1722762520,"remote_snapshot_time
stamp":1722762520,"replay_state":"idle"}
service: rceph.qxvdwv on rceph.local
last_update: 2024-08-04 11:27:48
peer_sites:
name: LCEPH
state: up+stopped
description: local image is primary
last_update: 2024-08-04 11:27:21
[root@rceph ~]#
Now that we are back in sync (remember, this manual resync was only required if we had an unplanned
outage), we can demote the RBD image on LCEPH and promote the RBD image at RCEPH back to
primary to complete the failback.
We can check the mirroring status on RCEPH to confirm the failback was successful. The status should
be up+stopped on RCEPH and up+replaying on LCEPH.
As a final step, we can map our RBD image on our RBD client, mount the filesystem and check to make
sure we can see the file we created on the secondary after the initial failover.
That concludes the test case for demonstrating RBD mirroring with failover/failback that you can use
for your POC.
Both source and target Ceph clusters must be running IBM Storage Ceph 7.0 or later:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=systems-ceph-file-system-snapshot-
mirroring
For our lab environment, we will replicate our source filesystem called rcephfs on Ceph cluster RCEPH
to a target filesystem called lcephfs on Ceph cluster LCEPH, as depicted below. We have an NFS client
called labserver that we will use to test access to the primary and replicated CephFS filesystems via
NFS, as opposed to using the native CephFS kernel driver.
Let us start by creating a Ceph filesystem called rcephfs on our source Ceph cluster RCEPH and
deploying at least one cephfs-mirror daemon.
Remember, for high availability you need to deploy more than one cephfs-mirror daemon. We just
need a single daemon for our test as we are not concerned about replication performance or high
availability.
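A command-line sketch of those two steps on RCEPH:
ceph fs volume create rcephfs
ceph orch apply cephfs-mirror --placement="rceph.local"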
We need to create a Ceph user for each CephFS peer on our target cluster LCEPH. We just have a single
peer so just one user is required. You can create it from the command line and verify it from the Ceph
Dashboard.
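On LCEPH, the peer user is created with fs authorize; the user name mirror_remote follows the example used in the documentation:
ceph fs authorize lcephfs client.mirror_remote / rwps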
On both source and target clusters, we need to make sure the CephFS mirroring module is enabled (it is
disabled by default).
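On both clusters:
ceph mgr module enable mirroring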
On the target cluster LCEPH we need to create a cluster bootstrap token. The format of the command
is as follows:
We need to specify the target Ceph filesystem, the Ceph client we created earlier and specify a site
name which will be LCEPH. We need to copy the bootstrap token which is output inside the double
quotes.
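A sketch of the bootstrap token creation, assuming the peer user we created earlier was named client.mirror_remote:
# On the target cluster LCEPH
ceph fs snapshot mirror peer_bootstrap create lcephfs client.mirror_remote LCEPH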
On our source cluster RCEPH, we need to import the target bootstrap token we just generated. The
format of the command is as follows:
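The import looks roughly like this (paste the token copied from LCEPH in place of the placeholder):
# On the source cluster RCEPH
ceph fs snapshot mirror peer_bootstrap import rcephfs <bootstrap-token>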
On our source cluster RCEPH, we can now list the mirror peers for our filesystem rcephfs.
On the source cluster RCEPH, we need to configure the directory path for CephFS mirroring. We want
to mirror the entire rcephfs filesystem so will specify / as the directory path.
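A sketch of these two steps on RCEPH:
# List the configured mirror peers for rcephfs
ceph fs snapshot mirror peer_list rcephfs
# Mirror the entire filesystem by adding / as the directory path
ceph fs snapshot mirror add rcephfs /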
If we query our cephfs-mirror daemon we should have 1 directory configured. Also note the peer
cluster UUID.
To get the actual mirror status of the snapshot mirroring, we have to run a few obscure commands.
First, we need to get the FSID of rcephfs on the source cluster RCEPH. ASOK refers to administrative
socket (refer to https://www.ibm.com/docs/en/storage-ceph/7?topic=monitoring-using-ceph-
administration-socket for more information on the use of ASOKs).
The ASOK files are located in /var/run/ceph. On the node where the cephfs-mirror daemon is
running, navigate to that directory and list the files. You have to be within the cephadm shell.
We have two ASOK files for our mirroring daemon. To get the FSID, we have to run the following
command:
From within the cephadm shell, issue the above command against the two files we found. We are
looking for fs mirror peer status; the first file doesn't have this information.
The second file does have it. Our FSID is therefore rcephfs@1. Also, the PEER UUID is c1014a51-
4f65-4e64-a2fd-5d980648d963.
Once we have the FSID, we can run the following command from within the cephadm-shell to get the
mirroring status.
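The command format is along these lines (the ASOK file name is illustrative; use the file identified above):
ceph --admin-daemon /var/run/ceph/ceph-client.cephfs-mirror.<id>.asok fs mirror status rcephfs@1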
If we run the command against the second file we identified with the correct filesystem name and
FSID we get:
To view the detailed peer status, we need to run the following command:
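A sketch of the command, using the FSID and peer UUID we identified (ASOK file name is illustrative):
ceph --admin-daemon /var/run/ceph/ceph-client.cephfs-mirror.<id>.asok fs mirror peer status rcephfs@1 c1014a51-4f65-4e64-a2fd-5d980648d963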
Issuing the above command with our FSID and PEER UUID we get:
Our state is idle as we haven’t created any snapshots yet. If you want to query detailed metrics for the
snapshot mirroring, we need to issue the following command from within the cephadm-shell:
If we create a snapshot on RCEPH, after some time (depending on the size of the snapshot and
bandwidth of the network between clusters), we should see the snapshot propagated to the target
cluster LCEPH.
Figure 195: Ceph Dashboard – File Systems – Directories – Create Snapshot on RCEPH
Figure 197: Ceph Dashboard – File Systems – Directories – Create Snapshot on RCEPH
If we query the peer mirroring status again, we should see 2 snapshots being transferred.
And on LCEPH we can check to see if we can see the snapshot we created on RCEPH.
As a final test, let’s mount both rcephfs and lcephfs on our NFS client labserver. First, we need to
create an NFS export on both clusters.
Now we will create a file in rcephfs and then take a snapshot and see if it propagates to lcephfs while
the target filesystem is mounted.
root@labserver:~# ls -l /mnt/rceph
total 0
root@labserver:~# ls -l /mnt/lceph
total 0
root@labserver:~#
root@labserver:~# touch /mnt/rceph/NEW_FILE_TO_REPLICATE
root@labserver:~# ls /mnt/rceph
NEW_FILE_TO_REPLICATE
root@labserver:~# ls /mnt/lceph
root@labserver:~#
Figure 201: Ceph Dashboard – File Systems – Directories – Snapshots – Create on RCEPH
And if we do a directory listing on our NFS client we can see that the file is replicated to LCEPH and
accessible on the client with the filesystem mounted.
As a final test, let’s see what happens if we create a file on the target filesystem and then a new
snapshot on the source to check whether the target file gets overwritten after the latest source
snapshot is replicated.
So, we have a new file to replicate to the target and the target has a new file created (that of course is
not replicated to the source). We now create a new rcephfs snapshot on the source cluster RCEPH.
Figure 203: Ceph Dashboard – File Systems – Directories – Snapshots – Create on RCEPH
And now we can check if we picked up the latest file created on our source rcephfs and if our file we
created prior to the latest snapshot has been overwritten or not.
root@labserver:/mnt/lcephfs# ls -alt
total 1228805
drwxr-xr-x 2 root root 1258291200 Aug 4 19:57 .
-rw-r--r-- 1 root root 0 Aug 4 19:56 LCEPH_AFTER_LATEST_SNAP
-rw-r--r-- 1 root root 0 Aug 4 19:55 TEST_TO_SEE_TARGET_OVERWRITE_RCEPH
-rw-r--r-- 1 root root 0 Aug 4 19:50 NEW_FILE_TO_REPLICATE
drwxr-xr-x 8 root root 4096 Aug 4 19:47 ..
root@labserver:/mnt/lcephfs#
So, we didn’t lose the file we created prior to the snapshot and can see the last file we generated on
the source. This functionality is similar to IBM Storage Scale AFM-DR.
Since we can't use the CephFS snapshot scheduler (we are not making use of subvolumes, as explained earlier), you
can schedule client-side snapshots via cron, for example, to meet a specific RPO (e.g. every 15 minutes or
every hour).
If the source site experiences a failure, you can point clients to the target site and continue processing.
When the source site is recovered, you would need to disable snapshot mirroring, set up
replication in the opposite direction and, once it is in sync, cut over clients back to the original
primary site.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=gateway-multi-site-configuration-
administration
Ceph supports several multi-site configuration options for the Ceph Object Gateway:
• Multi-zone: A more advanced configuration consists of one zone group and multiple zones,
each zone with one or more ceph-radosgw instances. Each zone is backed by its own Ceph
Storage Cluster. Multiple zones in a zone group provide disaster recovery for the zone group
should one of the zones experience a significant failure. Each zone is active and may receive
write operations. In addition to disaster recovery, multiple active zones may also serve as a
foundation for content delivery networks.
• Multi-zone-group: Formerly called 'regions', the Ceph Object Gateway can also support
multiple zone groups, each zone group with one or more zones. Objects stored to zone groups
within the same realm share a global namespace, ensuring unique object IDs across zone
groups and zones.
• Multiple Realms: The Ceph Object Gateway supports the notion of realms, which can be a
single zone group or multiple zone groups and a globally unique namespace for the realm.
Multiple realms provide the ability to support numerous configurations and namespaces.
A multi-site configuration requires a master zone group and a master zone. Each zone group requires
a master zone. Zone groups may have one or more secondary or non-master zones. A single site
deployment would typically consist of a single zone group with a single zone and one or more RGW
instances (like we covered for the section earlier on RGW Deployment). For a multi-site configuration,
we require at least two Ceph storage clusters and at least two RGW instances, one per cluster. To
demonstrate RGW replication, we will create a multi-site configuration based on the following:
Please refer to the following URL for a more detailed explanation of how to setup a Ceph Object
Gateway Multi-Site configuration:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=gateway-configuring-multi-site
You can also migrate a single site deployment with a default zone configuration to a multi-site
deployment. This process is documented here: https://www.ibm.com/docs/en/storage-ceph/7.1?topic=administration-migrating-single-site-system-multi-site.
Let us start on RCEPH and deploy a RGW service with the following configuration to match our multi-
site deployment:
Next, we need to create a replication user on the primary cluster RCEPH as follows. Note the S3 access
and secret keys for this new replication user.
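The command typically looks like the following (the uid and display name are illustrative; the --system flag is what marks it as a replication/system user):
radosgw-admin user create --uid="repl-user" --display-name="Replication User" --system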
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}
[root@rceph ~]#
On the primary cluster RCEPH, navigate to Object -> Multi-site and edit the Zone called PRIMARY.
Enter the replication user access and secret keys. Ensure the endpoint is set to the RGW we just
deployed (where we chose to place the RGW daemon which in our case is on cluster node rceph.local
or 10.0.0.239).
Now edit the Zone Group called REPLICATED_ZONE on RCEPH and ensure our endpoint is correctly
set to rceph.local or 10.0.0.239.
Figure 209: Ceph Dashboard – Object – Multi-site - Edit Zone Group on RCEPH
On our primary cluster RCEPH, edit the Realm called MYLAB and copy the multi-site token.
Make sure the new Realm, Zone Group and Zone are set to be the Default.
When you next go to Object -> Gateways, you get an error that the Object Gateway Service is not
configured. This bug is documented here: https://bugzilla.redhat.com/show_bug.cgi?id=2231072. As
a workaround, set the Ceph Object Gateway credentials on the command-line interface with the
ceph dashboard set-rgw-credentials command.
We can now query the RGW sync status to confirm our configuration.
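From the command line (on the RGW node or within the cephadm shell), this is simply:
radosgw-admin sync status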
On the secondary cluster LCEPH, deploy the RGW service as follows. Note, we will import the Realm
Token from our primary cluster RCEPH later so will accept the default Zone Group and Zone Name.
This step is required for us to get access on the Dashboard to the Object protocol in order to configure
Multi-site (otherwise navigating to Object will give us a message that the Object Gateway service is
not configured).
We need to now configure the connection between our primary cluster RCEPH to our secondary cluster
LCEPH for object replication. On the secondary cluster LCEPH, you should now be able to navigate to
Object -> Multi-site -> Import. Enter the information as follows. Note, we have to specify the
secondary Zone Name to match the configuration we want which is SECONDARY_ZONE. We can also
choose how many RGW daemons to deploy (one will do for our test).
Once you click on Import to complete this step, your configuration should look similar to this. Note
that there is a warning for us to restart all RGW instances on our secondary cluster to ensure a consistent
multi-site configuration.
On our secondary cluster, navigate to Administration -> Services and restart the newly created RGW
service (which was deployed when we did the Import). Note that we also still have the initial RGW service we
deployed, which is no longer needed.
If you get the following error when you navigate to Object on the Ceph Dashboard, then you need to
issue the ceph dashboard set-rgw-credentials as detailed earlier.
We can now query the RGW replication status on our secondary cluster LCEPH. We can check that the
data sync source is set to our primary cluster zone called PRIMARY and if the data is all caught up or
not (in our case both bucket metadata and data are already caught up).
In Ceph RGW, a shard is a part of a bucket index that is split into multiple rados objects. Sharding is
a process that breaks down a dataset into multiple parts to increase parallelism and distribute the
load.
Secondary zones accept bucket operations; however, secondary zones redirect bucket operations to
the master zone and then synchronize with the master zone to receive the result of the bucket
operations. If the master zone is down, bucket operations executed on the secondary zone will fail,
but object operations should succeed. The master zone is only important when talking about accounts
and new buckets. Writing object data will always use the latest write, regardless of where it is
ingested.
On our secondary cluster, if you navigate to Object -> Overview you can see the status of the
replication as well.
On the secondary cluster LCEPH, in the Ceph Dashboard under Object -> Multi-site you might see a
red exclamation mark next to the secondary zone. This is due to the following bug
https://bugzilla.redhat.com/show_bug.cgi?id=2242994.
To fix this, edit the secondary zone called SECONDARY on LCEPH and change the endpoint to the IP
address of LCEPH which is 10.0.0.249.
Figure 218: Ceph Dashboard – Object – Multi-site – Edit SECONDARY Zone on LCEPH
On our primary cluster, if we navigate to Object -> Multi-site, we should see the following which clearly
shows that the PRIMARY ZONE on RCEPH is the master zone for the Zone Group REPLICATED_ZONE.
Depending on how many buckets and objects need to be replicated, it might take the Ceph Dashboard
a while to report the correct status.
On our primary cluster RCEPH, the sync status should also show that we are the master zone.
You can also check the sync status on our primary cluster RCEPH from the Ceph Dashboard. If you
hover over the status you will see how many shards need to still be replicated.
Any existing Object Users on RCEPH will be propagated to LCEPH. You can verify this by navigating to
Object -> Users on each cluster.
Let us now do a simple test by creating a bucket on RCEPH and uploading a file to it and then querying
the same bucket on LCEPH to see if the data is replicated. First, we create and upload a file to our
bucket on RCEPH using AWSCLI or any S3 client of your choice.
"ID": "s3test"
}
}
[root@rceph ~]# aws s3api --endpoint-url http://lceph.local:8080 list-objects --bucket
mynewbucket
{
"Contents": [
{
"Key": "new",
"LastModified": "2024-08-03T07:39:25.488000+00:00",
"ETag": "\"7bb7cfda37a4ed26666902219c73cce2\"",
"Size": 60864458,
"StorageClass": "STANDARD",
"Owner": {
"DisplayName": "S3 Test User",
"ID": "s3test"
}
}
],
"RequestCharged": null
}
[root@rceph ~]#
Even though LCEPH does not contain the master zone, it will still accept bucket operations by
redirecting the request to the master and synchronizing afterwards. Let us test this by creating a
bucket on LCEPH, listing all buckets on RCEPH, copying a file to our bucket on LCEPH, and then
querying the contents of that bucket on RCEPH.
We should have the same Object statistics on both primary RCEPH and secondary LCEPH. We can
check this via the Ceph Dashboard.
You can also view the replication performance from the Ceph Dashboard. Navigate to Object ->
Gateways -> Sync Performance.
Let us simulate a primary gateway failure by stopping the RGW service on our primary cluster RCEPH.
As explained earlier, the loss of a master zone should only affect new bucket operations but not affect
access to existing buckets and objects or our ability to write new objects to existing buckets. This is
because the default behavior of the Ceph RGW is to run in an active-active configuration. Now that
we simulated a primary zone failure by stopping the RGW service on RCEPH, let us test the creation of
a new bucket on LCEPH and access to an existing bucket on LCEPH. You could also set up the cluster
to run as active/passive, in which case you would need to remove the read-only status on the
secondary zone before running this test.
As expected, we can’t create new buckets (or Object Users for that matter) but we can access existing
buckets and write new objects to them even with the master zone being unavailable. This proves that
there are no failover commands to run in the event of a site failure due to the default active/active
behavior of the RGW. If you implemented an external load balancer (e.g. DNS round-robin) that
included RGWs from both zones, then Object Clients would not be interrupted as the load-balancer
would just direct new requests to RGWs on the site that is available.
Let us now assume that we have an extended outage and can't get our primary site back up. We
want to be able to create new buckets and new Object users, so we can't afford to wait for the master zone
to be recovered. We will now convert the secondary zone to the master so that we can create new buckets
and new Object users even with the original master zone down.
On the secondary cluster LCEPH, we need to make the secondary zone called SECONDARY the master
and default zone for our zone group. We can do this as follows:
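A sketch of the failover commands on LCEPH:
# Make the SECONDARY zone the master and default zone
radosgw-admin zone modify --rgw-zone=SECONDARY --master --default
# Commit the change by updating the period
radosgw-admin period update --commit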
A Ceph RGW period is a time period with multiple epochs that tracks changes to a Ceph Object
Gateway (RGW) configuration. The RGW period is stored in a realm, which also contains zones and
zone groups. Updating the period changes the epoch and ensures that other zones receive the
updated configuration.
.
.
.
"period_map": {
"id": "a43d12da-2ba9-4458-a6d9-ed04a3635cb3",
"zonegroups": [
{
"id": "b0269fde-61cc-4b7e-9e34-389cf0bc0eed",
"name": "REPLICATED_ZONE",
"api_name": "REPLICATED_ZONE",
"is_master": true,
"endpoints": [
"http://10.0.0.239:8080"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "30de453d-b997-4435-9a48-feebe5245952",
"zones": [
{
"id": "30de453d-b997-4435-9a48-feebe5245952",
"name": "SECONDARY",
"endpoints": [
"http://10.0.0.249:8080"
],
"log_meta": false,
"log_data": true,
"bucket_index_max_shards": 11,
"read_only": false,
"tier_type": "",
"sync_from_all": true,
"sync_from": [],
"redirect_zone": "",
"supported_features": [
"compress-encrypted",
"resharding"
]
},
{
"id": "44d5f41c-a04f-4d1b-8e14-4622bcc7b2af",
"name": "PRIMARY",
"endpoints": [
"http://10.0.0.239:8080"
],
"log_meta": false,
"log_data": true,
"bucket_index_max_shards": 11,
"read_only": false,
"tier_type": "",
"sync_from_all": true,
"sync_from": [],
"redirect_zone": "",
"supported_features": [
"compress-encrypted",
"resharding"
]
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "bee2e14c-b965-4f6d-a010-3a0625dee923",
"sync_policy": {
"groups": []
},
"enabled_features": [
"resharding"
]
}
],
"short_zone_ids": [
{
"key": "30de453d-b997-4435-9a48-feebe5245952",
"val": 4270858657
},
{
"key": "44d5f41c-a04f-4d1b-8e14-4622bcc7b2af",
"val": 583748692
}
]
},
"master_zonegroup": "b0269fde-61cc-4b7e-9e34-389cf0bc0eed",
"master_zone": "30de453d-b997-4435-9a48-feebe5245952",
"period_config": {
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_ratelimit": {
"max_read_ops": 0,
"max_write_ops": 0,
"max_read_bytes": 0,
"max_write_bytes": 0,
"enabled": false
},
"bucket_ratelimit": {
"max_read_ops": 0,
"max_write_ops": 0,
"max_read_bytes": 0,
"max_write_bytes": 0,
"enabled": false
},
"anonymous_ratelimit": {
"max_read_ops": 0,
"max_write_ops": 0,
"max_read_bytes": 0,
"max_write_bytes": 0,
"enabled": false
}
},
"realm_id": "bee2e14c-b965-4f6d-a010-3a0625dee923",
"realm_epoch": 3
}
[root@lceph ~]#
Finally, we have to restart the RGWs on our secondary cluster to pick up the changes.
If we check the Multi-site configuration on the secondary cluster LCEPH Ceph Dashboard, we can see
that our zone SECONDARY is now the master zone for our zone group REPLICATED_ZONE.
We can also verify this from the command line on LCEPH. SECONDARY zone on LCEPH is now the
master (no sync).
We can now test bucket creation (which failed earlier as we had lost the master zone). This now works
as expected.
When the former master zone recovers, you can revert the failover configuration back to its original
state. To simulate this, let us first restart the RGW on RCEPH (which contained the former master zone).
If we query the replication sync status on RCEPH, we can see that it still thinks it is the master
zone (it doesn't have the latest RGW period).
We can confirm this as well from the Ceph Dashboard. Our PRIMARY zone on RCEPH still thinks it’s
the master zone for our zone group REPLICATED_ZONE.
To correct this issue, we need to pull the latest realm configuration from the current master zone
(which is on LCEPH).
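A sketch of the realm pull, using the LCEPH endpoint and the system (replication) user's keys:
radosgw-admin realm pull --url=http://10.0.0.249:8080 --access-key=<ACCESS_KEY> --secret=<SECRET_KEY>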
Next, we need to make the recovered zone (former master zone) the master and default for our
REPLICATED_ZONE zone group. On RCEPH, issue the following command:
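The command format is along these lines:
radosgw-admin zone modify --rgw-zone=PRIMARY --master --default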
{
"key": "default-placement",
"val": {
"index_pool": "PRIMARY.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "PRIMARY.rgw.buckets.data",
"compression_type": "lz4"
}
},
"data_extra_pool": "PRIMARY.rgw.buckets.non-ec",
"index_type": 0,
"inline_data": true
}
}
],
"realm_id": "bee2e14c-b965-4f6d-a010-3a0625dee923",
"notif_pool": "PRIMARY.rgw.log:notif"
}
[root@rceph ~]#
We now have the latest realm configuration on RCEPH and have issued the command to convert our
zone called PRIMARY to make it the master and default zone for our zone group REPLICATED_ZONE.
We need to update the RGW period to reflect the changes. On RCEPH, issue:
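The command is simply:
radosgw-admin period update --commit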
[root@rceph ~]#
As we did for the failover, after we update the period we need to restart the RGW in the recovered
zone. On RCEPH, issue:
If you had previously configured the secondary zone to be read-only, then you would set that now.
Since we chose the active/active default configuration we don’t have to. We do however need to
update the period in the secondary zone and restart the RGWs for it to pick up the latest changes to
the realm. On LCEPH, do the following:
},
"realm_id": "bee2e14c-b965-4f6d-a010-3a0625dee923",
"realm_epoch": 4
}
[root@lceph ~]#
We can query the current sync status from either the command line or via the Ceph Dashboard. As
expected, our PRIMARY zone on RCEPH is now the master for our zone group REPLICATED_ZONE and
LCEPH reflects this as well.
To complete our testing of the failback, let us create a new bucket on RCEPH.
And check that it is available on LCEPH. Depending on the number of changes you made whilst your
SECONDARY zone was the master after failover, it might take some time for the changes to propagate.
In our lab environment, the sync appeared to be stuck for a few minutes:
"CreationDate": "2024-08-03T07:41:30.284000+00:00"
}
],
"Owner": {
"DisplayName": "S3 Test User",
"ID": "s3test"
}
}
[root@rceph ~]#
Since there are no errors but we don't see any progress when issuing the sync status (the out-of-sync
shard count is not reducing), replication appears to be stuck. It is not uncommon for failover/failback
to cause replication to stall. See here for an explanation: https://www.ibm.com/docs/en/storage-ceph/7.1?topic=tmscog-synchronizing-data-in-multi-site-ceph-object-gateway-configuration. We can
manually issue a resync to resolve this and restart the RGW service on LCEPH.
We can check the sync status again to see if it has restarted and is progressing.
And list the buckets on LCEPH to see if we can see the bucket we created on RCEPH after we
performed the failback.
Lastly, we can check if new data is syncing by writing an object on RCEPH and checking to see if it is
replicated to LCEPH.
• Protecting data - Encryption protects data from unauthorized access, theft, and tampering
• Meeting compliance requirements - Encryption may be required or encouraged by laws and
regulations. For example, the Payment Card Industry Data Security Standard (PCI DSS)
requires merchants to encrypt customer payment card data.
• Reducing the risk of costly penalties - Encryption can help organizations avoid costly
penalties, lengthy lawsuits, reduced revenue, and tarnished reputations.
• Reducing the attack surface - Encryption can reduce the surface of attack by cutting out the
lower layers of the hardware and software stack.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=hardening-encryption-key-management
Encryption for all Ceph traffic over the network is enabled by default, with the introduction of the
messenger version 2 protocol. The secure mode setting for messenger v2 encrypts communication
between Ceph daemons and Ceph clients, providing end-to-end encryption.
https://docs.ceph.com/en/reef/rados/configuration/msgr2/
The messenger v2 protocol, or msgr2, is the second major revision on Ceph’s on-wire protocol. It
brings with it several key features:
• A secure mode that encrypts all data passing over the network
• Improved encapsulation of authentication payloads, enabling future integration of new
authentication modes like Kerberos
• Improved earlier feature advertisement and negotiation, enabling future protocol revisions
Ceph daemons can now bind to multiple ports, allowing both legacy Ceph clients and new v2-capable
clients to connect to the same cluster. By default, monitors now bind to the new IANA-assigned port
3300 (ce4h or 0xce4) for the new v2 protocol, while also binding to the old default port 6789 for the
legacy v1 protocol.
We can verify that our MON are using the v2 protocol as follows:
And a netstat should also confirm we are talking over the v2 protocol (port 3300).
We can also verify that the v1 protocol is not in use (i.e. we do not have any legacy clients).
Ceph data-in-flight encryption does have a slight performance overhead. An excellent article that tests the impact
of using over-the-wire encryption (on Ceph Reef or IBM Storage Ceph V7) can be found here:
https://ceph.io/en/news/blog/2023/ceph-encryption-performance/
The conclusion is that the performance impact is negligible when using the messenger v2 protocol to
perform end-to-end encryption.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=management-enabling-key-rotation
Ceph Block Storage Encryption is a feature in Ceph that enables users to encrypt data at the block
level. It encrypts data before writing it to the storage cluster and decrypts it when retrieving it. Block
storage encryption adds an extra degree of protection to sensitive data stored on Ceph. The encryption
is done per-volume, so the user may select which volumes to encrypt and which to leave unencrypted.
Block Storage Encryption does incur a performance overhead though, especially on workloads with
large writes. More information on IBM Storage Ceph’s Data-at-rest encryption (DAE) is documented
here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=management-encryption-rest
Ceph OSD encryption-at-rest relies on the Linux kernel's dm-crypt subsystem and the Linux Unified
Key Setup ("LUKS"). See here for more information:
https://docs.ceph.com/en/latest/ceph-volume/lvm/encryption/
When creating an encrypted OSD, ceph-volume creates an encrypted logical volume and saves the
corresponding dm-crypt secret key in the Ceph Monitor data store. When the OSD is to be started,
ceph-volume ensures the device is mounted, retrieves the dm-crypt secret key from the Ceph
Monitors, and decrypts the underlying device. This creates a new device, containing the unencrypted
data, and this is the device the Ceph OSD daemon is started on.
Since decrypting the data on an encrypted OSD disk requires knowledge of the corresponding dm-
crypt secret key, OSD encryption provides protection for cases when a disk drive that was used as an
OSD is decommissioned, lost, or stolen. The OSD itself does not know whether the underlying logical
volume is encrypted or not, so there is no ceph OSD command that will return this information.
To check if your cluster is using block encryption you can issue the following command on your OSD
nodes. Check for “encrypted”. A value of 0 means it is not encrypted.
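One way to do this is with ceph-volume from within the cephadm shell, for example:
cephadm shell -- ceph-volume lvm list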
[block] /dev/ceph-2a2f9223-0d70-41cb-aa07-7754b885728a/osd-block-187560c1-07a4-444d-
b55f-48a093c324f1
[block] /dev/ceph-ae4c3ffc-5d5e-473f-b324-8d5c195cc841/osd-block-34372239-a478-4bbd-
8bff-9e6704cbc35a
[block] /dev/ceph-fbe588c6-d736-4f75-be72-43ce94f21b7d/osd-block-5bd7dc13-5e2b-4340-
a2e9-8a940c0bfdca
Or you can query LUKS as follows using the device name of the logical volume used for the OSD
(obtained from the output above). As an example, for osd.10:
Encryption is done at an OSD level so you can have a mix of encrypted and non-encrypted OSDs in a
cluster as it is transparent to the upper layers. This of course is not recommended.
Since OSDs can only be encrypted at creation time, it is practical to do this when the cluster is created. Because you
can have a mix of encrypted and unencrypted OSDs in a cluster, for an existing cluster you could remove
and re-add OSDs one at a time, re-creating each one as an encrypted OSD. This would be too tedious
though, so we will just demonstrate how to enable encryption at cluster creation. Our single node Ceph cluster
has the following storage layout.
If you recall, after bootstrapping your Ceph cluster you connect to the Ceph Dashboard and go through
the Expand Cluster wizard. After adding your cluster nodes, a screen is displayed to create OSDs. Click
on Advanced and select Encryption to make use of data-at-rest or block encryption.
After you complete the Expand Cluster wizard, OSDs are created on our lab cluster as follows:
We can check if our OSDs are encrypted from the command line. If encrypted is set to 1 then the OSD
is encrypted.
[block] /dev/ceph-7513fae5-a462-4357-9bc4-7d9664815902/osd-block-10ecc366-f829-4b3f-
bc4e-f3932d7aeb83
[block] /dev/ceph-bbf9c4c4-6f4d-4c10-a34b-8704de19abd8/osd-block-a05fcdf3-3f95-485a-
b915-5b7781417fbe
[block] /dev/ceph-af9e6646-715b-44de-8a82-06fa4978ff04/osd-block-e49b668e-2398-4d4e-
85e3-f050cb7989d3
Or we can query LUKS using the device name for the OSD obtained from the above command. For
example, for osd.2:
Data segments:
0: crypt
offset: 16777216 [bytes]
length: (whole device)
cipher: aes-xts-plain64
sector: 512 [bytes]
Keyslots:
0: luks2
Key: 512 bits
Priority: normal
Cipher: aes-xts-plain64
Cipher key: 512 bits
PBKDF: argon2id
Time cost: 5
Memory: 1048576
Threads: 2
Salt: e2 73 80 30 82 a5 28 f9 f5 58 d2 46 c8 d6 68 9b
34 34 3c 4b 4e 17 66 b7 e6 b3 ee 53 10 b0 0f ee
AF stripes: 4000
AF hash: sha256
Area offset:32768 [bytes]
Area length:258048 [bytes]
Digest ID: 0
Tokens:
Digests:
0: pbkdf2
Hash: sha256
Iterations: 117028
Salt: 9a 16 a0 4a b2 15 ca 10 08 94 45 70 5e 34 b7 ff
dc 7f 36 0b 2d 6f dd 98 49 bd 81 10 40 a8 d0 ed
Digest: ce a5 f2 8b 70 27 a6 6d 16 b5 2e 78 66 af 58 78
2f ca bc 31 d5 4b dc d7 24 dc f9 91 c8 de 27 b4
[ceph: root@ndceph /]#
Note, Ceph uses LUKS version 2. IBM Storage Ceph also supports key rotation. See here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=management-enabling-key-rotation
When using HAProxy and keepalived to terminate SSL connections, the HAProxy and keepalived
components use encryption keys. Please refer to the section describing Ceph RGW Deployment where
we used the ingress service that makes use of HAProxy and keepalived. We enabled SSL termination
and generated a self-signed certificate for SSL termination.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=management-ssl-termination
If you implemented Ceph Data-at-rest encryption, then all data stored including object data will
automatically be encrypted. However, Ceph RGW also supports both server-side and client-side
encryption for S3 object data. The Ceph Object Gateway supports server-side encryption of uploaded
objects for the S3 application programming interface (API). Server-side encryption means that the S3
client sends data over HTTP in its unencrypted form, and the Ceph Object Gateway stores that data in
the IBM Storage Ceph cluster in encrypted form. For more information see here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=security-server-side-encryption
The three server-side encryption key options supported by the Ceph RGW are described below
(source: https://www.ibm.com/docs/en/storage-ceph/7.1?topic=security-server-side-encryption)
Customer-provided keys - When using customer-provided keys, the S3 client passes an encryption
key along with each request to read or write encrypted data. It is the customer’s responsibility to
manage those keys. Customers must remember which key the Ceph Object Gateway used to encrypt
each object. Ceph Object Gateway implements the customer-provided key behavior in the S3 API
according to the Amazon SSE-C specification. Since the customer handles the key management and
the S3 client passes keys to the Ceph Object Gateway, the Ceph Object Gateway requires no special
configuration to support this encryption mode.
Key management service - When using a key management service, the secure key management
service stores the keys and the Ceph Object Gateway retrieves them on demand to serve requests to
encrypt or decrypt data. Ceph Object Gateway implements the key management service behavior in
the S3 API according to the Amazon SSE-KMS specification. Important: Currently, the only tested key
management implementations are HashiCorp Vault, and OpenStack Barbican. However, OpenStack
Barbican is a Technology Preview and is not supported for use in production systems.
SSE-S3 - When using SSE-S3, the keys are stored in a vault, but they are automatically created and
deleted by Ceph and retrieved as required to serve requests to encrypt or decrypt data. Ceph Object
Gateway implements the SSE-S3 behavior in the S3 API according to the Amazon SSE-S3
specification.
Figure 238: Ceph Dashboard – Object – Bucket – Create (Specify server-side encryption method)
Using server-side S3 encryption is outside the scope of this document so we will concentrate on
demonstrating the use of client-side encryption. However, for the sake of completeness, you can set
the default server-side encryption for an existing bucket by using the S3 API. This is documented in
the link below:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=encryption-setting-default-existing-s3-
bucket
By setting the desired server-side encryption for an existing bucket, we can ensure that all new files
written to this bucket will be encrypted. First, we will query an existing bucket to see if it is encrypted
using AWSCLI.
}
}
[root@cephnode1 ~]#
Since there is no encryption on the existing bucket, we will create a JSON file with the preferred
encryption settings and apply it to the bucket.
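A sketch of what this could look like with AWSCLI, assuming a bucket named mybucket, a placeholder RGW endpoint and SSE-S3 (AES256) as the default encryption:
# sse.json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }
  ]
}
# Apply and then verify the default bucket encryption
aws --endpoint-url http://<rgw-endpoint>:8080 s3api put-bucket-encryption --bucket mybucket --server-side-encryption-configuration file://sse.json
aws --endpoint-url http://<rgw-endpoint>:8080 s3api get-bucket-encryption --bucket mybucket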
Client-side encryption is the act of encrypting your data locally (at the client) to help ensure its security
in transit and at rest. To demonstrate the use of client-side encryption we will use s3cmd. To use
s3cmd, we need to define the encryption passphrase in the .s3cfg configuration file.
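The relevant entries in ~/.s3cfg look something like this (the passphrase shown is illustrative):
gpg_command = /usr/bin/gpg
gpg_passphrase = MySecretPassphrase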
Next, we will create a new bucket called encrypted and upload a text file to it with the -e flag, which
encrypts the object.
Ownership: none
Versioning:none
Expiration rule: none
Block Public Access: none
Policy: none
CORS: none
ACL: Nishaan Docrat: FULL_CONTROL
root@labserver:~# s3cmd put /etc/hosts s3://encrypted -e
upload: '/tmp/tmpfile-7r0ikiQiALALIhkEQpqV' -> 's3://encrypted/hosts' [1 of 1]
357 of 357 100% in 0s 14.31 KB/s done
root@labserver:~#
s3cmd will automatically decrypt a file when it is retrieved. If we use another S3 API client like
AWSCLI, we should be able to retrieve the object, but it will still be encrypted. We can test this as
follows:
Using AWSCLI we are able to retrieve the object but it is encrypted. We can use s3cmd with our
passphrase to download the same file and check that it is decrypted.
Ceph supports four different compression algorithms and different compression modes for backend
BlueStore (pool-level) compression. It is almost guaranteed that you would need to test Ceph's data
reduction capability during a POC. To get an accurate representation of the expected performance
reduction capability during a POC. To get an accurate representation of the expected performance
impact and compression ratios to expect, it would be best to test with the customer’s data. Granted,
any performance testing on POC hardware is definitely not representative of a production deployment
so you would need to document this clearly. You would also need to get a reasonable variety of data
that you expect to write to the Ceph cluster of a representative size (testing with synthetically
generated files would give you inaccurate compression results for example). More details on back-
end compression can be found here:
https://www.ibm.com/docs/en/storage-ceph/7?topic=clusters-back-end-compression
For the purposes of this document, a process to test the compression is demonstrated. Due to
limitations in the lab environment storage capacity, the actual compression results should not be
taken as being representative of Ceph’s capability. For the BlueStore compression, a wide variety of
different file types were downloaded from https://filesamples.com/ to test the different Ceph
compression algorithms. The total sample size was only 2GB so definitely not representative of any
real-world workload. Five different RBD storage pools were created with 4 using one of Ceph’s
compression algorithms and one being uncompressed. It’s important to note that this test was
performed on RBD images and you could potentially get better results using CephFS.
All the compressed pools were set to use the force compression mode. Another important factor to consider is that for HDD OSDs we have
64KB allocation units and compression block sizes within the [128KB, 512KB] range. If an input block is
smaller than 128KB, it is not compressed. If it is larger than 512KB, it is split into multiple chunks and each
one is compressed independently (noting that any chunk smaller than 128KB is not compressed).
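A sketch of how one of the compressed pools could be created (the pool name is illustrative; repeat with zlib, zstd and lz4 for the other pools):
ceph osd pool create snappy_pool
ceph osd pool set snappy_pool compression_algorithm snappy
ceph osd pool set snappy_pool compression_mode force
rbd pool init snappy_pool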
On the RBD client, images from each of these pools were mounted.
root@labserver:~# df -hT
Filesystem Type Size Used Avail Use% Mounted on
tmpfs tmpfs 392M 1.1M 391M 1% /run
/dev/sda2 ext4 20G 17G 2.5G 87% /
tmpfs tmpfs 2.0G 84K 2.0G 1% /dev/shm
tmpfs tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs tmpfs 392M 16K 392M 1% /run/user/0
/dev/rbd0 xfs 5.0G 130M 4.9G 3% /mnt/snappy
/dev/rbd1 xfs 5.0G 130M 4.9G 3% /mnt/zlib
/dev/rbd2 xfs 5.0G 130M 4.9G 3% /mnt/zstd
/dev/rbd3 xfs 5.0G 130M 4.9G 3% /mnt/lz4
/dev/rbd4 xfs 5.0G 130M 4.9G 3% /mnt/uncompressed
root@labserver:~#
And our sample data was copied to each filesystem. The overall data stored on each filesystem
equated to 2GB and was the same across all filesystems.
root@labserver:/mnt# du -b .
2088985345 ./zlib/testfiles
2088985345 ./zlib
0 ./rcephfs
2088985345 ./uncompressed/testfiles
2088985345 ./uncompressed
0 ./testbucket
2088985345 ./snappy/testfiles
2088985345 ./snappy
0 ./cephnfs
0 ./rbd
2088985345 ./lz4/testfiles
2088985345 ./lz4
0 ./opennfs
2088985345 ./zstd/testfiles
2088985345 ./zstd
0 ./lcephfs
10444926725 .
root@labserver:/mnt#
Now if we check the pool capacity usage on our Ceph cluster we can see the following:
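For example, the per-pool usage and compression statistics can be queried with:
ceph df detail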
Based on the above output and taking into account the pool 3-way replication, to store 2GB of data
would require 6GB of physical capacity. We can see that the snappy and lz4 algorithms offered a 35%
capacity savings and the others about 4%. Of course using a larger data sample you would see better
results.
Like a traditional file system, Ceph does not delete the underlying objects when a file is deleted; the
objects remain on the RBD device. A new write will either overwrite these objects or
create new ones, as required. Therefore, the objects are still present in the pool, and 'ceph df' will
show the pool being occupied by those objects even though they are no longer used. See here for more
information: https://access.redhat.com/solutions/3075321
For RGW compression, compression is enabled on a storage class in the Zone’s placement target. It
is enabled by default when creating a zone and set to lz4.
You can query this for any zone from the command line as well.
As a simple test, we can create a new bucket and write our sample data to it and then query the
compression savings.
.
upload: 'sample_5184×3456.pbm' -> 's3://compression-test/sample_5184×3456.pbm' [52 of 54]
2239501 of 2239501 100% in 0s 20.56 MB/s done
upload: 'sample_5184×3456.tiff' -> 's3://compression-test/sample_5184×3456.tiff' [part 1 of
4, 15MB] [53 of 54]
15728640 of 15728640 100% in 0s 16.60 MB/s done
upload: 'sample_5184×3456.tiff' -> 's3://compression-test/sample_5184×3456.tiff' [part 2 of
4, 15MB] [53 of 54]
15728640 of 15728640 100% in 0s 17.17 MB/s done
upload: 'sample_5184×3456.tiff' -> 's3://compression-test/sample_5184×3456.tiff' [part 3 of
4, 15MB] [53 of 54]
15728640 of 15728640 100% in 0s 17.38 MB/s done
upload: 'sample_5184×3456.tiff' -> 's3://compression-test/sample_5184×3456.tiff' [part 4 of
4, 6MB] [53 of 54]
6562056 of 6562056 100% in 0s 15.86 MB/s done
upload: 'sample_960x400_ocean_with_audio.webm' -> 's3://compression-
test/sample_960x400_ocean_with_audio.webm' [part 1 of 2, 15MB] [54 of 54]
15728640 of 15728640 100% in 0s 17.73 MB/s done
upload: 'sample_960x400_ocean_with_audio.webm' -> 's3://compression-
test/sample_960x400_ocean_with_audio.webm' [part 2 of 2, 1485KB] [54 of 54]
1520959 of 1520959 100% in 0s 10.25 MB/s done
root@labserver:~/local_disk_testfiles#
Next, we can issue the radosgw-admin bucket stats command to get the bucket capacity utilisation.
We are interested in size_kb_actual and size_kb_utilized.
.
.
},
{
"bucket": "compression-test",
"num_shards": 11,
"tenant": "",
"versioning": "off",
"zonegroup": "ace3190f-c02e-40e2-abdd-344efc6ab06c",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "7109acc2-883b-4502-b863-4e7097d7d83c.1215475.1",
"marker": "7109acc2-883b-4502-b863-4e7097d7d83c.1215475.1",
"index_type": "Normal",
"versioned": false,
"versioning_enabled": false,
"object_lock_enabled": false,
"mfa_enabled": false,
"owner": "velero",
"ver": "0#4,1#13,2#10,3#17,4#14,5#11,6#22,7#17,8#96,9#10,10#6",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "2024-08-04T21:04:08.685795Z",
"creation_time": "2024-08-04T21:04:08.677361Z",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {
"rgw.main": {
"size": 2088985345,
"size_actual": 2089123840,
"size_utilized": 2073674044,
"size_kb": 2040025,
"size_kb_actual": 2040160,
"size_kb_utilized": 2025073,
"num_objects": 54
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
]
[root@lceph ~]#
Notice that for our compression-test bucket, using the same data sample we used for the RBD test,
we hardly got any compression savings. Other buckets though do have significant savings (e.g.
testwebsite bucket has a 50% capacity saving).
"size_kb_actual": 104,
"size_kb_utilized": 21,
"bucket": "awsbucket",
"size_kb_actual": 4,
"size_kb_utilized": 1,
"bucket": "testbucket",
"size_kb_actual": 20,
"size_kb_utilized": 2,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"bucket": "compression-test",
"size_kb_actual": 2040160,
"size_kb_utilized": 2025073,
"size_kb_actual": 0,
"size_kb_utilized": 0,
[root@cephnode1 ~]#
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=benchmark-benchmarking-ceph-
performance
To test RBD performance from a client, we can run the rbd command with the bench option.
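A sketch of an rbd bench invocation, assuming a test image called rbd/testimage (the image name, sizes and thread count are illustrative):
rbd bench --io-type write --io-size 4K --io-threads 16 --io-total 1G --io-pattern rand rbd/testimage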
Test parameters
endpoint(s): [http://cephnode2.local:8080]
bucket: s3bench
objectNamePrefix: s3bench
objectSize: 0.0096 MB
numClients: 2
numSamples: 100
verbose: %!d(bool=false)
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=developer-ceph-restful-api
The Ceph Dashboard has a direct link to the REST API which can be launched by clicking on the Cluster
API URL.
On the Ceph RESTful API webpage you will see a list of all the supported API calls along with their
required parameters and expected responses.
You can test any of the API calls by clicking on “Try it out”.
If you need to simulate application access, you can use the curl command. Ceph uses JSON Web
Tokens (JWT) for authentication. You first need to authenticate with the Ceph cluster and obtain a valid
token. Then you can use that token to perform REST API calls. Let us demonstrate this process. First,
we need to obtain a valid JWT token. The syntax for the auth API call is as follows (provided so that
you can easily cut and paste it).
curl -k -X 'POST' \
'https://10.0.0.235:8443/api/auth' \
-H 'accept: application/vnd.ceph.api.v1.0+json' \
-H 'Content-Type: application/json' \
-d '{
"username": "admin",
"password": "MyPassw0rd",
"ttl": null
}'
In order to get formatted results, we will pipe the output to jq. The permissions for the user are
displayed. You can create a user with only specific permission via the Ceph Dashboard (e.g. read-only
access).
],
"config-opt": [
"create",
"delete",
"read",
"update"
],
"dashboard-settings": [
"create",
"delete",
"read",
"update"
],
"grafana": [
"create",
"delete",
"read",
"update"
],
"hosts": [
"create",
"delete",
"read",
"update"
],
.
.
.
],
"prometheus": [
"create",
"delete",
"read",
"update"
],
"rbd-image": [
"create",
"delete",
"read",
"update"
],
"rbd-mirroring": [
"create",
"delete",
"read",
"update"
],
"rgw": [
"create",
"delete",
"read",
"update"
],
"user": [
"create",
"delete",
"read",
"update"
]
},
"pwdExpirationDate": null,
"sso": false,
"pwdUpdateRequired": false
}
root@labserver:~#
The authentication token we just generated needs to be passed with all future API requests. The
syntax to make requests is provided below for the /api/health/get_cluster_capacity API call.
curl -k -X 'GET' \
'https://10.0.0.235:8443/api/health/get_cluster_capacity' \
-H 'Authorization: Bearer
eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJjZXBoLWRhc2hib2FyZCIsImp0aSI6ImNiNzgwZTM0LWM1
NDItNDAzYi1hY2UwLTU2OTcxZDc3NTNiZSIsImV4cCI6MTczNTgyOTY2MiwiaWF0IjoxNzM1ODAwODYyLCJ1c2VybmFtZ
SI6ImFkbWluIn0.JHGmC3418z7IsqqX8vFKZBBmn8Ed2_6aLJFZKUCO9sQ' \
-H 'accept: application/vnd.ceph.api.v1.0+json' \
-H 'Content-Type:application/json'
You can also use curl to obtain the Ceph cluster’s metrics which are exported via Prometheus if you
need to.
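For example, assuming the manager's Prometheus module is listening on its default port 9283 on the active MGR node:
curl -s http://10.0.0.235:9283/metrics | head -20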
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=traps-configuring-snmptrapd
If you need to demonstrate this as part of a POC, the easiest way to do it is to use Net-SNMP. First, we
need to install Net-SNMP on a SNMP management host and make sure port 162 is open.
Next, we need to download the Ceph SNMP MIB and copy it to the default Net-SNMP MIB directory.
Make sure permissions on the MIB file are set to 644.
curl -o CEPH_MIB.txt -L
https://raw.githubusercontent.com/ceph/ceph/master/monitoring/snmp/CEPH-MIB.txt
cp CEPH_MIB.txt /usr/share/snmp/mibs
You can browse the CEPH SNMP MIB using snmptranslate if you need to see what traps are supported.
Since we want to use SNMPv3, we only need to create the snmptrapd_auth.conf file. We need to
specify the ENGINE_ID. IBM recommends using 8000C53F_CLUSTER_FSID_WITHOUT_DASHES for
this parameter. To obtain the Ceph FSID you can run ceph -s.
In our lab environment, what did work was 0x800007DB03 in the snmptrapd.conf file. Our snmptrapd.conf file should also
contain an SNMP_V3_AUTH_USER_NAME and SNMP_V3_AUTH_PASSWORD. We will use myceph as
the user and mycephpassword as the password.
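A sketch of what the configuration file could contain with these values (auth-only SNMPv3 using SHA authentication):
createUser -e 0x800007DB03 myceph SHA mycephpassword
authUser log,execute,net myceph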
Next, we can run the Net-SNMP daemon in the foreground on the SNMP management host. Note that the
IBM documentation states you should specify CEPH-MIB.txt, which is incorrect. You need to drop the .txt
extension.
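A sketch of running the trap receiver in the foreground with our MIB and configuration file (paths are illustrative):
snmptrapd -f -m CEPH-MIB -C -c /root/snmptrapd_auth.conf -Lo :162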
On your Ceph cluster, we can now create the SNMP gateway service.
Specify the ENGINE ID that matches our snmptrapd.conf file (exclude 0x) and the SNMPV3 username
and password we specified in the file. Also, make sure you select SNMP Version 3. Verify the service
is up and running.
Now you can monitor the SNMP management host for any SNMP traps. A simple way to generate a
trap is just to reboot one of the Ceph cluster nodes.
.iso.org.dod.internet.snmpV2.snmpModules.1.1.4.1.0 = OID: .iso.org.dod.internet.private.enterprises.ceph.cephCluster.cephNotifications.prometheus.promOsd.promOsdDownHigh
.iso.org.dod.internet.private.enterprises.ceph.cephCluster.cephNotifications.prometheus.promOsd.promOsdDownHigh.1 = STRING: "1.3.6.1.4.1.50495.1.2.1.4.1[alertname=CephOSDDownHigh]"
.iso.org.dod.internet.private.enterprises.ceph.cephCluster.cephNotifications.prometheus.promOsd.promOsdDownHigh.2 = STRING: "info"
.iso.org.dod.internet.private.enterprises.ceph.cephCluster.cephNotifications.prometheus.promOsd.promOsdDownHigh.3 = STRING: "Status: OK"
--------------
https://www.ibm.com/docs/en/storage-ceph/7?topic=daemons-using-ceph-manager-alerts-
module
If you don't have an SMTP relay to use, you can set up Postfix on a server to forward email to public
email providers like Gmail and Yahoo. See here:
https://www.linode.com/docs/guides/configure-postfix-to-send-mail-using-gmail-and-google-
workspace-on-debian-or-ubuntu/
A simple way to enable email alerts is via the Ceph Dashboard. Navigate to Administration -> Manager
Modules -> select Alerts and click Edit. Enter your SMTP relay information.
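The same can also be done from the command line via the alerts manager module (the values here are illustrative):
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host smtp.example.com
ceph config set mgr mgr/alerts/smtp_destination [email protected]
ceph config set mgr mgr/alerts/smtp_sender [email protected]
ceph config set mgr mgr/alerts/smtp_port 25
ceph config set mgr mgr/alerts/smtp_ssl false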
Now when events are triggered, you should receive email alerts. Some examples are shown below:
The latest version of IBM Storage Insights supports the logs upload feature for existing IBM Storage
Ceph support tickets. This can dramatically reduce the time to resolve issues. It also includes an AI
Chatbot which allows users to interact and chat with IBM Storage Insights in natural language form to
help in observability and monitoring. You can view the latest Insights enhancements here:
https://www.ibm.com/docs/en/storage-insights?topic=new-change-history
IBM Storage Ceph is only supported with IBM Storage Insights Pro. If you want to demonstrate this
integration as part of your POC, you can use the IBM Storage Insights 60-day trial. More information on
how to access the trial is available here:
https://www.ibm.com/docs/en/storage-insights?topic=pro-want-try-buy-storage-insights
To configure Call Home and Storage Insights in Ceph, you can follow this procedure. Make sure you
have a Storage Insights Pro license (or trial) entitlement. In the top right of the Ceph Dashboard,
select the drop-down on the user icon to set up IBM Call Home and/or Storage Insights. The full
procedure is also documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=insights-enabling-call-home-storage
https://www.ibm.com/docs/en/storage-insights?topic=pro-planning-storage-ceph-systems
IBM Storage Ceph clusters do not require the use of a Storage Insights data collector. To understand
what information is uploaded to IBM and also any firewall rules required for Insights to work see here:
https://www.ibm.com/docs/en/SSQRB8/pdf/IBM_Storage_Insights_Security_Guide.pdf
You have to configure Call Home first or you will get an error.
Next you can configure IBM Storage Insights. Select your company name as the tenant ID and Insights
will give you a list of choices to select and confirm if it finds a match.
If you click on Call Home again you should have the option of downloading your inventory or checking
when the last contact with IBM was.
https://www.ibm.com/mysupport/s/?language=en_US
You will need to generate an sos report and upload it to your support ticket. This process is
documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=troubleshooting-generating-sos-report
On one of your IBM Storage Ceph cluster nodes, you can issue the following command to generate the report.
The report will be located in /var/tmp.
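The command itself is simply (it will prompt for an optional case id, as shown below):
sos report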
The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.
Optionally, please enter the case id that you are generating this report for []: DUMMY_CASE
Size 206.69MiB
Owner root
sha256 6abd358eb6387dd0166134a0279fb65daee32d3eb8b9ab1adc48e200fa5b4eae
[root@cephnode1 ~]#
Once you have the sos report, you can upload it to your support ticket via IBM ECuRep.
https://www.ibm.com/support/pages/enhanced-customer-data-repository-ecurep-send-data-
https#secure
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=osds-replacing
In our lab environment, we have a single node cluster with 3 OSDs. We will simulate a failure by
destroying the OSD and deleting the virtual disk. Destroy removes an OSD permanently. We will then
assign a new virtual disk to replace the failed one.
First, we navigate to Cluster -> OSDs and get the details for an OSD we want to replace (since this is a
virtual environment we need to get the OS device name that corresponds to the OSD).
For a single node cluster with 3 OSDs, we set the pool replica to 2. So, choosing to delete the OSD
should not impact data durability. We also choose not to preserve the OSD's ID in the CRUSH map.
0 hdd 0.01949 1.00000 20 GiB 48 MiB 2.5 MiB 6 KiB 46 MiB 20 GiB 0.24 0.89 289 up
1 hdd 0.01949 0       0 B    0 B    0 B     0 B   0 B    0 B    0    0    0   destroyed
2 hdd 0.01949 1.00000 20 GiB 60 MiB 2.5 MiB 6 KiB 58 MiB 20 GiB 0.29 1.11 289 up
  TOTAL 40 GiB 109 MiB 5.0 MiB 13 KiB 104 MiB 40 GiB 0.27
MIN/MAX VAR: 0.89/1.11 STDDEV: 0.03
[root@ndceph ~]#
Next, we can unmap the OSD from the cluster node, create a new virtual disk and assign the new
virtual disk to the cluster node.
On the Ceph cluster node, we can verify that the new disk is available for use (/dev/vdc).
Because we set Ceph to use all devices when we created our cluster, Ceph will automatically add the
new disk back as an OSD for us.
The effect of ceph orch apply is persistent which means that the Orchestrator automatically finds the
device, adds it to the cluster, and creates new OSDs. This occurs under the following conditions:
You can disable automatic creation of OSDs on all the available devices by using the --unmanaged
parameter.
After a few minutes the new OSD will appear on the Ceph Dashboard.
For an unplanned OSD failure, we won’t delete the OSD from the Ceph Dashboard. We will just unmap
it from the cluster node, delete it, create a new one and map it back to the cluster node. The behaviour
is similar to a planned removal.
We will forcefully remove the virtual disk backing OSD.0 with device name /dev/vdb.
Without doing anything on the Ceph Dashboard, we will unmap the OSD from the cluster node and
delete the virtual disk on the Hypervisor.
vdd                                                          252:48   0   20G  0 disk
└─ceph--af9e6646--715b--44de--8a82--06fa4978ff04-osd--block--e49b668e--2398--4d4e--85e3--f050cb7989d3
                                                             253:3    0   20G  0 lvm
  └─QO4sBS-Mft4-3F6H-3rHD-JEEd-sVQx-I5xAR0                   253:5    0   20G  0 crypt
[root@ndceph ~]#
services:
mon: 1 daemons, quorum ndceph (age 10m)
mgr: ndceph.azavyo(active, since 10m), standbys: ndceph.cmmpsd
osd: 3 osds: 2 up (since 95s), 3 in (since 8m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 8 pools, 321 pgs
objects: 257 objects, 455 KiB
usage: 163 MiB used, 60 GiB / 60 GiB avail
pgs: 161/514 objects degraded (31.323%)
176 active+undersized
90 active+clean
55 active+undersized+degraded
progress:
Global Recovery Event (90s)
[=======.....................] (remaining: 3m)
We have to wait for the rebalance and purge to occur. After a while it should show as removed on the
Ceph Dashboard.
We can now create a new virtual disk and map it to the Ceph node. As before, because we set the Ceph
orchestrator to use all available devices, Ceph will automatically create a new OSD and add it to the
cluster.
  └─21TIgq-7Ych-a5ut-Oatv-wNQt-aWFd-RZVylo                   253:5    0   20G  0 crypt
vdd                                                          252:48   0   20G  0 disk
└─ceph--af9e6646--715b--44de--8a82--06fa4978ff04-osd--block--e49b668e--2398--4d4e--85e3--f050cb7989d3
                                                             253:2    0   20G  0 lvm
  └─QO4sBS-Mft4-3F6H-3rHD-JEEd-sVQx-I5xAR0                   253:4    0   20G  0 crypt
[root@ndceph ~]#
The Ceph Dashboard confirms it is back and the cluster is back to a healthy state.
The procedure for simulating a node failure is documented here:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=failure-simulating-node
In our lab environment, we will simulate the worst-case scenario of completely losing a Ceph OSD
node with its associated disks and replacing it with a brand-new node with new disks.
First, we need to understand the impact of removing a node. We are only using 4% of the raw capacity, so it should be safe to remove a node.
The node we want to remove is cephnode4, which has the following services running. Note down its labels as well, as we will need them when we add the brand-new node.
Since this is a 4-node cluster and we want to limit the amount of recovery activity during the node replacement, we will disable rebalancing and backfill.
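From the CLI, recovery and scrubbing can be held back with cluster-wide flags; the status output later in this section shows noout, noscrub and nodeep-scrub being used (norebalance and nobackfill are further options):
# keep OSDs from being marked out and pause scrubbing while the node is replaced
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub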
We want to change the node's hostname, so we need to remove the host from the Ceph CRUSH map. Before we do that, however, we will delete its associated OSDs and remove them from the CRUSH map.
Now we can remove the node from the CRUSH map and also the Ceph OSD users.
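A sketch of the equivalent CLI steps (the OSD ID shown is hypothetical; repeat the auth removal for each OSD that lived on the node):
# remove the host bucket from the CRUSH map
ceph osd crush rm cephnode4
# remove the authentication key of each OSD that was deleted
ceph auth del osd.9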
We removed the node from the CRUSH map but still need to remove it from the cluster. Specify the --force flag, as Ceph might otherwise prevent the removal if it cannot relocate daemons (or would lose daemons configured to have only a single instance).
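A sketch of the removal command, assuming the host was registered as cephnode4.local:
# remove the host from the cluster even if daemons cannot be relocated
ceph orch host rm cephnode4.local --force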
As mentioned, the Grafana service is down as we don’t have any other node labelled to take the service
over.
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 5d)
mgr: cephnode1.tbqyke(active, since 3d), standbys: cephnode2.iaecpr
mds: 3/3 daemons up, 1 standby
osd: 9 osds: 9 up (since 50m), 9 in (since 2w)
flags noout,noscrub,nodeep-scrub
rgw: 2 daemons active (2 hosts, 1 zones)
rgw-nfs: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 3/3 healthy
pools: 17 pools, 849 pgs
objects: 1.74k objects, 3.7 GiB
usage: 16 GiB used, 284 GiB / 300 GiB avail
pgs: 849 active+clean
io:
client: 115 B/s rd, 0 op/s rd, 0 op/s wr
[root@cephnode1 ~]#
We now need to create a new virtual machine and install the OS. We then need to perform the same tasks we did prior to bootstrapping our Ceph cluster.
On cephnode1, we can update the ansible inventory file to reflect the new node (newcephnode4.local)
and run the pre-flight playbook.
[admin]
cephnode1
[root@cephnode1 production]#
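A sketch of the pre-flight playbook run against the new node (the ceph_origin value and the --limit option are assumptions based on a typical cephadm-ansible setup):
# run the pre-flight playbook against the new node only
ansible-playbook -i ./inventory/production/hosts cephadm-preflight.yml -e ceph_origin=ibm --limit newcephnode4.local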
[root@cephnode1 cephadm-ansible]# ansible-playbook -i ./inventory/production/hosts cephadm-distribute-ssh-key.yml -e cephadm_ssh_user=root -e admin_node=cephnode1.local
PLAY RECAP ************************************************************************************************************************
cephnode1    : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
cephnode2    : ok=3 changed=0 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
cephnode3    : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
newcephnode4 : ok=1 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
.
.
.
[root@cephnode1 cephadm-ansible]#
Our new cluster node, newcephnode4.local has the following storage configuration (similar to the one
we removed).
On the Ceph Dashboard, we can add the new node back to the Ceph cluster. We need to add the
correct labels too.
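From the CLI, the equivalent would be along the lines of the following (the IP address and label names are placeholders; use the labels you noted down from the old node):
# add the new node to the cluster with the labels previously used on cephnode4
ceph orch host add newcephnode4.local 10.0.0.14 --labels=mon,osd,rgw,grafana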
We need to copy the Ceph admin keyring and ceph.conf file from one of the other cluster nodes to the new node in order for the cephadm shell (and the ceph CLI) to work there.
Finally, we need to unset the noout, noscrub and nodeep-scrub settings and verify the cluster health.
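A sketch of these two steps, assuming the default /etc/ceph paths:
# copy the admin keyring and configuration from an existing admin node to the new node
scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring root@newcephnode4:/etc/ceph/
# clear the flags that were set before the node replacement and check the health
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph -s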
services:
mon: 3 daemons, quorum cephnode1,cephnode3,cephnode2 (age 5d)
mgr: cephnode1.tbqyke(active, since 3d), standbys: cephnode2.iaecpr
mds: 3/3 daemons up, 1 standby
osd: 12 osds: 12 up (since 5m), 12 in (since 2w)
rgw: 2 daemons active (2 hosts, 1 zones)
rgw-nfs: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 3/3 healthy
pools: 17 pools, 849 pgs
objects: 1.74k objects, 3.7 GiB
usage: 16 GiB used, 383 GiB / 400 GiB avail
pgs: 846 active+clean
2 active+clean+scrubbing
1 active+clean+scrubbing+deep
io:
client: 204 B/s rd, 0 op/s rd, 0 op/s wr
[root@newcephnode4 ceph]#
We can double check our CRUSH map to validate the new node is added.
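For example, the CRUSH hierarchy can be listed from any admin node:
# list the CRUSH tree to confirm newcephnode4 and its OSDs are present
ceph osd tree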
IBM Storage Ceph licensing is covered in the following FAQ:
https://www.ibm.com/support/pages/ibm-storage-ceph-product-licensing-frequently-asked-questions-faq
IBM Storage Ceph is offered in two editions: Premium Edition and Pro Edition. For each edition, there is an Object part number that limits use to only object protocols, while the other parts include file, block, and object protocols. For each of the above, clients can purchase perpetual licenses, annual licenses, or monthly licenses. After the initial year, clients can purchase renewal or reinstatement parts.
IBM Storage Ceph Premium Edition includes the required Red Hat Enterprise Linux subscriptions for
the IBM Storage Ceph nodes. IBM Storage Ceph Pro Edition requires the client to acquire Red Hat
Enterprise Linux subscriptions directly from Red Hat for the IBM Storage Ceph nodes. Both editions
include entitlement for IBM Storage Insights. The license for IBM Storage Insights (entitled via IBM
Spectrum Control) is limited to use with the IBM Storage Ceph environment ONLY.
IBM Storage Ceph is licensed per TB of raw capacity. IBM defines a TB as 2^40 bytes (TiB). Customers
must purchase enough TiB entitlements to equal the total aggregate raw TiB of all OSD data devices
independent of the number of nodes, clusters, or how the underlying hardware architecture is
implemented. Once installed, the client can confirm license compliance by summing up the
"ceph_cluster_capacity_bytes" metric for all their clusters.
Summary
This document has highlighted the steps required to plan, size and deploy an IBM Storage Ceph cluster. In addition, we have covered the use of advanced Ceph functions including replication, compression and encryption. If you are undertaking a POC or testing Ceph, this document should help you create a test plan with the relevant test cases.
The Ceph Dashboard does not allow you to delete storage pools by default. You can change this by navigating to Administration -> Configuration and setting mon_allow_pool_delete to true.
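The same change can be made from the CLI, for example:
# allow pools to be deleted (revert to false once testing is complete)
ceph config set mon mon_allow_pool_delete true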
You might be required to showcase RBAC. The Ceph Dashboard ships with a set of predefined user roles (such as administrator, read-only, block-manager, rgw-manager, cluster-manager, pool-manager and cephfs-manager). You can create a new user with the specific role you need to test.
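A minimal sketch of creating a Dashboard user from the CLI (the username, password file and role here are illustrative):
# write the password to a temporary file (the dashboard commands read it with -i)
echo -n 'Sup3rSecret!' > /tmp/dashpass.txt
# create the user and assign the read-only role
ceph dashboard ac-user-create testuser -i /tmp/dashpass.txt
ceph dashboard ac-user-set-roles testuser read-only
rm /tmp/dashpass.txt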
You can use this procedure to change the Ceph Dashboard admin password.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=ia-changing-ceph-dashboard-password-using-command-line-interface
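A sketch of the CLI steps (the password value and file name are placeholders):
# store the new password in a file and apply it to the admin account
echo -n 'N3wAdminPassw0rd!' > /tmp/adminpass.txt
ceph dashboard ac-user-set-password admin -i /tmp/adminpass.txt
rm /tmp/adminpass.txt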
You can issue the following command to check the cluster health history.
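On recent releases, the health check history can be listed as follows (an assumption based on the upstream healthcheck module; verify against your version):
# list previously raised health checks and how often they occurred
ceph healthcheck history ls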
You are bound to see this warning depending on the number and size of disks you use per node. It is
better to use a large number of smaller disks than a small number of large disks. The optimal number
of PGs per OSD is 200-300.
You can change the limit with the commands shown below. Note that the PG autoscaler is not particularly helpful for small POC or test clusters.
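A hedged sketch of the relevant settings, assuming the warning in question is the "too many PGs per OSD" health check (the limit value and pool name are illustrative):
# raise the threshold that triggers the "too many PGs per OSD" warning (default is 250)
ceph config set global mon_max_pg_per_osd 500
# optionally turn the PG autoscaler off for a specific pool
ceph osd pool set testpool pg_autoscale_mode off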
Containerization makes it difficult to debug issues. Remember to check the node’s /var/log/messages
file and also use journalctl to see daemon errors. It is highly recommended to enable centralised
logging to a file on Ceph as well. See here.
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=cluster-viewing-centralized-logs-ceph
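If you also want the daemons to write traditional log files under /var/log/ceph on each node, a sketch of the settings is:
# have Ceph daemons write their logs to files as well as to journald
ceph config set global log_to_file true
# also write the cluster log to a file
ceph config set global mon_cluster_log_to_file true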
To enable single sign-on (SSO) for the Ceph Dashboard, see:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=dashboard-enabling-single-sign-ceph
To configure Active Directory authentication for the Ceph Object Gateway, see:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=configuration-configuring-active-directory-ceph-object-gateway
For the procedure to power down and reboot the cluster, see:
https://www.ibm.com/docs/en/storage-ceph/7.1?topic=management-powering-down-rebooting-cluster