CL260 5.0 Student Guide
The contents of this course and all its modules and related materials, including handouts to audience members, are
Copyright © 2021 Red Hat, Inc.
No part of this publication may be stored in a retrieval system, transmitted or reproduced in any way, including, but
not limited to, photocopy, photograph, magnetic, electronic or other record, without the prior written permission of
Red Hat, Inc.
This instructional program, including all material provided herein, is supplied without any guarantees from Red Hat,
Inc. Red Hat, Inc. assumes no liability for damages or legal action arising from the use or misuse of contents or details
contained herein.
If you believe Red Hat training materials are being used, copied, or otherwise improperly distributed, please send
email to [email protected] or phone toll-free (USA) +1 (866) 626-2994 or +1 (919) 754-3700.
Red Hat, Red Hat Enterprise Linux, the Red Hat logo, JBoss, OpenShift, Fedora, Hibernate, Ansible, CloudForms,
RHCA, RHCE, RHCSA, Ceph, and Gluster are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries
in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS® is a registered trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or
other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent
Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack Logo are either registered trademarks/service marks or trademarks/
service marks of the OpenStack Foundation, in the United States and other countries and are used with the
OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack
Foundation or the OpenStack community.
Introduction xiii
Cloud Storage with Red Hat Ceph Storage ................................................................. xiii
Orientation to the Classroom Environment ................................................................. xiv
Performing Lab Exercises ....................................................................................... xviii
Importing and Exporting RBD Images ........................................................................ 211
Guided Exercise: Importing and Exporting RBD Images ............................................... 214
Lab: Providing Block Storage Using RADOS Block Devices ......................................... 222
Summary ............................................................................................................. 229
13. Managing Cloud Platforms with Red Hat Ceph Storage 455
Introducing OpenStack Storage Architecture ............................................................ 456
Quiz: Introducing OpenStack Storage Architecture .................................................... 464
Implementing Storage in OpenStack Components .................................................... 466
Quiz: Implementing Storage in OpenStack Components ............................................ 470
Introducing OpenShift Storage Architecture ............................................................. 472
Quiz: Introducing OpenShift Storage Architecture ..................................................... 478
Implementing Storage in OpenShift Components ..................................................... 480
Quiz: Implementing Storage in OpenShift Components ............................................. 490
Summary ............................................................................................................. 492
14. Comprehensive Review 493
Comprehensive Review ......................................................................................... 494
Lab: Deploying Red Hat Ceph Storage ..................................................................... 497
Lab: Configuring Red Hat Ceph Storage .................................................................. 506
Lab: Deploying CephFS .......................................................................................... 515
Lab: Deploying and Configuring Block Storage with RBD ............................................ 523
Lab: Deploying and Configuring RADOS Gateway ..................................................... 535
Document Conventions
This section describes various conventions and practices used throughout all
Red Hat Training courses.
Admonitions
Red Hat Training courses use the following admonitions:
References
These describe where to find external documentation relevant to a
subject.
Note
These are tips, shortcuts, or alternative approaches to the task at hand.
Ignoring a note should have no negative consequences, but you might
miss out on something that makes your life easier.
Important
These provide details of information that is easily missed: configuration
changes that only apply to the current session, or services that need
restarting before an update will apply. Ignoring these admonitions will
not cause data loss, but may cause irritation and frustration.
Warning
These should not be ignored. Ignoring these admonitions will most likely
cause data loss.
Inclusive Language
Red Hat Training is currently reviewing its use of language in various areas
to help remove any potentially offensive terms. This is an ongoing process
and requires alignment with the products and services covered in Red Hat
Training courses. Red Hat appreciates your patience during this process.
Introduction
In this classroom environment, your primary system for hands-on activities is workstation. The
workstation virtual machine (VM) is the only system with a graphical desktop, which is required
for using a browser to access web-based tools. You should always log in directly to workstation
first. From workstation, use ssh for command-line access to all other VMs. Use a web browser
from workstation to access the Red Hat Ceph Storage web-based dashboard and other
graphical tools.
As seen in Figure 0.1, all VMs share an external network, 172.25.250.0/24, with a gateway of
172.25.250.254 (bastion). External network DNS and container registry services are provided by
utility.
Additional student VMs used for hands-on exercises include clienta, clientb,
serverc, serverd, servere, serverf, and serverg. All ten of these systems are in the
lab.example.com DNS domain.
All student computer systems have a standard user account, student, which has the password
student. The root password on all student systems is redhat.
Classroom Machines
The environment uses the classroom server as a NAT router to the outside network, and as a
file server using the URLs content.example.com and materials.example.com, serving
course content for certain exercises. Information on how to use these servers is provided in the
instructions for those activities.
You can reset your classroom environment to set all of your classroom nodes back to their
beginning state when the classroom was first created. Resetting allows you to clean your virtual
machines, and start exercises over again. It is also a simple method for clearing a classroom issue
which is blocking your progress and is not easily solved. Some chapters, such as chapters 02, 12,
and 14, might ask you to reset your classroom to ensure you work on a clean environment. It is
highly recommended to follow this instruction.
When your lab environment is first provisioned, and each time you restart the lab environment, the
monitor (MON) services might fail to properly initialize and can cause a cluster warning message.
On your Red Hat Online Learning cloud platform, this behavior is caused by the random order
in which complex network interfaces and services are started. This timing issue does not occur
in production Ceph storage cluster environments. To resolve the cluster warning, use the ceph
orch restart mon command to restart the monitor services, which should then result in a
HEALTH_OK cluster state.
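For example, you can run the restart through the cephadm shell and then recheck the cluster health; the node name in the prompt is illustrative:
[root@node ~]# cephadm shell -- ceph orch restart mon
[root@node ~]# cephadm shell -- ceph health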
Machine States
active: The virtual machine is running and available. If just started, it may still be starting services.
stopped: The virtual machine is completely shut down. Upon starting, the virtual machine boots into the same state as it was before it was shut down. The disk state is preserved.
Classroom Actions
CREATE: Create the ROLE classroom. Creates and starts all of the virtual machines needed for this classroom.
CREATING: The ROLE classroom virtual machines are being created. Creates and starts all of the virtual machines needed for this classroom. Creation can take several minutes to complete.
DELETE: Delete the ROLE classroom. Destroys all virtual machines in the classroom. All work saved on that system's disks is lost.
Machine Actions
OPEN CONSOLE: Connect to the system console of the virtual machine in a new browser tab. You can log in directly to the virtual machine and run commands, when required. Normally, log in to the workstation virtual machine only, and from there, use ssh to connect to the other virtual machines.
ACTION → Shutdown: Gracefully shut down the virtual machine, preserving disk contents.
ACTION → Power Off: Forcefully shut down the virtual machine, while still preserving disk contents. This is equivalent to removing the power from a physical machine.
ACTION → Reset: Forcefully shut down the virtual machine and reset the disk to its initial state. All work saved on that system's disks is lost.
At the start of an exercise, if instructed to reset a single virtual machine node, click ACTION →
Reset for only the specific virtual machine.
At the start of an exercise, if instructed to reset all virtual machines, click ACTION → Reset on
every virtual machine in the list.
If you want to return the classroom environment to its original state at the start of the course, you
can click DELETE to remove the entire classroom environment. After the lab has been deleted,
you can click CREATE to provision a new set of classroom systems.
Warning
The DELETE operation cannot be undone. All work you have completed in the
classroom environment will be lost.
To adjust the timers, locate the two + buttons at the bottom of the course management page.
Click the auto-stop + button to add another hour to the auto-stop timer. Click the auto-destroy +
button to add another day to the auto-destroy timer. There is a maximum for auto-stop at 11 hours,
and a maximum auto-destroy at 14 days. Be careful to keep the timers set while you are working,
so as to not have your environment unexpectedly shut down. Be careful not to set the timers
unnecessarily high, which could waste your subscription time allotment.
There are two types of exercises. The first type, a guided exercise, is a practice exercise that
follows a course narrative. If a narrative is followed by a quiz, it usually indicates that the topic did
not have an achievable practice exercise. The second type, an end-of-chapter lab, is a gradable
exercise to help to verify your learning. When a course includes a comprehensive review, the
review exercises are structured as gradable labs.
The action is a choice of start, grade, or finish. All exercises support start and finish.
Only end-of-chapter labs and comprehensive review labs support grade.
start
A script's start logic verifies the required resources to begin an exercise. It might include
configuring settings, creating resources, checking prerequisite services, and verifying
necessary outcomes from previous exercises. With exercise start logic, you can perform any
exercise at any time, even if you did not perform prerequisite exercises.
grade
End-of-chapter labs help to verify what you learned, after practicing with earlier guided
exercises. The grade action directs the lab command to display a list of grading criteria, with
a PASS or FAIL status for each. To achieve a PASS status for all criteria, fix the failures and
rerun the grade action.
finish
A script's finish logic deletes exercise resources that are no longer necessary. With cleanup
logic, you can repeatedly perform an exercise, and it helps course performance by ensuring
that unneeded objects release their resources.
To list the available exercises, use tab completion in the lab command with an action:
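For example, to list the exercises that support the start action, type the action and press Tab twice; the exercise names that appear depend on the course:
[student@workstation ~]$ lab start <Tab><Tab>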
All exercise scripts are stored on the workstation system in the folder /home/student/.venv/labs/lib/python3.6/site-packages/SKU/, where the SKU is the course
code. When you run the lab command with a valid action and exercise, it creates an exercise log
file in /tmp/log/labs, and captures command output and error messages into the file.
Although exercise scripts are always run from workstation, they perform tasks on other systems
in the course environment. Many course environments, including OpenStack and OpenShift, use
a command-line interface (CLI) that is invoked from workstation to communicate with server
systems and components by using REST API calls. Because script actions typically distribute tasks
to multiple systems, additional troubleshooting is necessary to determine where a failed task
occurred. Log in to those other systems and use Linux diagnostic skills to read local system log
files and determine the root cause of the lab script failure.
Chapter 1. Introducing Red Hat Ceph Storage Architecture
Objectives
After completing this section, you should be able to describe the personas in the cloud storage
ecosystem that characterize the use cases and tasks taught in this course.
The personas are built from information gathered from real Red Hat Ceph Storage users. Personas might
describe multiple job titles, and your organization's job titles might map to multiple personas. The
personas that are presented here embody the most common roles of Red Hat Ceph Storage users.
Roles might change depending on your organization's size and user ecosystem. This course uses
the storage administrator persona to define Red Hat Ceph Storage operations and use cases.
• Informs users about Ceph data presentation and methods, as choices for their data applications.
• Provides resilience and recovery, such as replication, backup, and disaster recovery methods.
• Provides access for data analytics and advanced mass data mining.
This course includes material to cover Ceph storage cluster deployment and configuration, and
incorporates automation techniques whenever appropriate.
Cloud Operator
A cloud operator administers cloud resources at their organization, such as OpenStack or
OpenShift infrastructures. The storage administrator works closely with a cloud operator to
maintain the Ceph cluster that is configured to provide storage for those platforms.
Automation Engineer
Automation engineers frequently use Ceph directly. An automation engineer is responsible for
creating playbooks for commonly repeated tasks. All user interface and end-user discussions
in this course are examples of actions that might be automated. Storage administrators would
be familiar with these same actions because they are typically the foremost Ceph subject
matter experts.
Service Administrator
Service administrators manage end-user services (as distinct from operating system
services). Service administrators have a similar role to project managers, but for an existing
production service offering.
Application Architect
A storage administrator relies on the application architect as a subject matter expert who can
correlate between Ceph infrastructure layout and resource availability, scaling, and latency.
This architecture expertise helps the storage administrator to design complex application
deployments effectively. To support the cloud users and their applications, a storage
administrator must comprehend those same aspects of resource availability, scaling, and
latency.
Infrastructure Architect
A storage administrator must master the storage cluster's architectural layout to manage
resource location, capacity, and latency. The infrastructure architect for the Ceph cluster
deployment and maintenance is a primary source of information for the storage administrator.
The infrastructure architect might be a cloud service provider employee or a vendor solutions
architect or consultant.
• At telecommunications service providers (telcos) and cloud service providers, the prevalent
personas are the cloud and storage operators, infrastructure architects, and cloud service
developers. Their customers request support with the provider's service ticketing system.
Commonly, customers only consume storage services and do not maintain them.
• At universities and smaller implementations, technical support personnel can potentially assume
all roles. A single individual might handle the storage administrator, infrastructure architect, and
cloud operator personas.
References
The Storage Administrator Role Is Evolving: Meet the Cloud Administrator
https://cloud.netapp.com/blog/meet-the-cloud-administrator
Quiz
Application architect
Automation engineer
Cloud operator
Infrastructure architect
Service administrator
Storage administrator
Storage operator
Objectives
After completing this section, you should be able to describe the Red Hat Ceph Storage
architecture, introduce the Object Storage Cluster, and describe the choices in data access
methods.
Ceph has a modular and distributed architecture that contains the following elements:
• An object storage back end that is known as RADOS (Reliable Autonomic Distributed Object
Store)
• Various access methods to interact with RADOS
• Monitors (MONs) maintain maps of the cluster state. They help the other daemons to
coordinate with each other.
• Object Storage Devices (OSDs) store data and handle data replication, recovery, and
rebalancing.
• Managers (MGRs) track runtime metrics and expose cluster information through a browser-
based dashboard and REST API.
• Metadata Servers (MDSes) store metadata that CephFS uses (but not object storage or block
storage) so that clients can run POSIX commands efficiently.
These daemons can scale to meet the requirements of a deployed storage cluster.
Ceph Monitors
Ceph Monitors (MONs) are the daemons that maintain the cluster map. The cluster map
is a collection of five maps that contain information about the state of the cluster and its
configuration. Ceph must handle each cluster event, update the appropriate map, and replicate
the updated map to each MON daemon.
To apply updates, the MONs must establish a consensus on the state of the cluster. A majority of
the configured monitors must be available and agree on the map update. Configure your Ceph
clusters with an odd number of monitors to ensure that the monitors can establish a quorum when
they vote on the state of the cluster. More than half of the configured monitors must be functional
for the Ceph storage cluster to be operational and accessible.
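To confirm which monitors are currently in quorum, you can query the monitor status; for example (the prompt and the output vary by cluster):
[ceph: root@node /]# ceph mon stat
[ceph: root@node /]# ceph quorum_status --format json-pretty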
One design goal for OSD operation is to bring computing power as close as possible to the
physical data so that the cluster can perform at peak efficiency. Both Ceph clients and OSD
daemons use the Controlled Replication Under Scalable Hashing (CRUSH) algorithm to efficiently
compute information about object location, instead of depending on a central lookup table.
CRUSH Map
CRUSH assigns every object to a Placement Group (PG), which is a single hash bucket. PGs
are an abstraction layer between the objects (application layer) and the OSDs (physical layer).
CRUSH uses a pseudo-random placement algorithm to distribute the objects across the PGs
and uses rules to determine the mapping of the PGs to the OSDs. In the event of failure, Ceph
remaps the PGs to different physical devices (OSDs) and synchronizes their content to match
the configured data protection rules.
Warning
A host that runs OSDs must not mount Ceph RBD images or CephFS file systems
by using the kernel-based client. Mounted resources can become unresponsive due
to memory deadlocks or blocked I/O that is pending on stale sessions.
Ceph Managers
Ceph Managers (MGRs) provide for the collection of cluster statistics.
If no MGRs are available in a cluster, client I/O operations are not negatively affected, but
attempts to query cluster statistics fail. To avoid this scenario, Red Hat recommends that you
deploy at least two Ceph MGRs for each cluster, each running in a separate failure domain.
The MGR daemon centralizes access to all data that is collected from the cluster and provides
a simple web dashboard to storage administrators. The MGR daemon can also export status
information to an external monitoring server, such as Zabbix.
Metadata Server
The Ceph Metadata Server (MDS) manages Ceph File System (CephFS) metadata. It provides
POSIX-compliant, shared file-system metadata management, including ownership, time stamps,
and mode. The MDS uses RADOS instead of local storage to store its metadata. It has no access
to file contents.
The MDS enables CephFS to interact with the Ceph Object Store, mapping an inode to an object
and the location where Ceph stored the data within a tree. Clients who access a CephFS file
system first send a request to an MDS, which provides the needed information to get file content
from the correct OSDs.
Cluster Map
Ceph clients and OSDs require knowledge of the cluster topology. Five maps represent the cluster topology; collectively, they are referred to as the cluster map. The Ceph Monitor daemon maintains
the cluster map. A cluster of Ceph MONs ensures high availability if a monitor daemon fails.
• The Monitor Map contains the cluster's fsid ; the position, name, address, and port of each
monitor; and map time stamps. The fsid is a unique, auto-generated identifier (UUID) that
identifies the Ceph cluster. View the Monitor Map with the ceph mon dump command.
• The OSD Map contains the cluster's fsid , a list of pools, replica sizes, placement group
numbers, a list of OSDs and their status, and map time stamps. View the OSD Map with the
ceph osd dump command.
• The Placement Group (PG) Map contains the PG version; the full ratios; details on each
placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG, data usage
statistics for each pool; and map time stamps. View the PG Map statistics with the ceph pg
dump command.
• The CRUSH Map contains a list of storage devices, the failure domain hierarchy (such as device,
host, rack, row, room), and rules for traversing the hierarchy when storing data. View the CRUSH
map with the ceph osd crush dump command.
• The Metadata Server (MDS) Map contains the pool for storing metadata, a list of metadata
servers, metadata servers status, and map time stamps. View the MDS Map with the ceph fs
dump command.
The following diagram depicts the four data access methods of a Ceph cluster, the libraries that
support the access methods, and the underlying Ceph components that manage and store the
data.
To maximize performance, write your applications to work directly with librados. This method
gives the best results to improve storage performance in a Ceph environment. For easier Ceph
storage access, instead use the higher-level access methods that are provided, such as the
RADOS Block Devices, Ceph Object Gateway (RADOSGW), and CephFS.
Note
You need to update the endpoint only to port existing applications that use the
Amazon S3 API or the OpenStack Swift API.
The Ceph Object Gateway offers scalability support by not limiting the number of deployed
gateways and by providing support for standard HTTP load balancers. The Ceph Object Gateway
includes the following use cases:
The Ceph Metadata Server (MDS) manages the metadata that is associated with files stored in
CephFS, including file access, change, and modification time stamps.
• Pool Operations
• Snapshots
• Read/Write Objects
– Create or Remove
– Entire Object or Byte Range
– Append or Truncate
• Create/Set/Get/Remove XATTRs
• Create/Set/Get/Remove Key/Value Pairs
• Compound operations and dual-ack semantics
The object map tracks the presence of backing RADOS objects when a client writes to an RBD
image. When a write occurs, it is translated to an offset within a backing RADOS object. When
the object map feature is enabled, the presence of RADOS objects is tracked to signify that the
objects exist. The object map is kept in-memory on the librbd client to avoid querying the OSDs
for objects that do not exist.
• Resize
• Export
• Copy
• Flatten
• Delete
• Read
Storage devices have throughput limitations, which impact performance and scalability. Storage
systems often support striping, which is storing sequential pieces of information across multiple
storage devices, to increase throughput and performance. Ceph clients can use data striping to
increase performance when writing data to the cluster.
• Immutable ID
• Name
• Number of PGs to distribute the objects across the OSDs
• CRUSH rule to determine the mapping of the PGs for this pool
• Protection type (replicated or erasure coding)
• Parameters that are associated with the protection type
• Various flags to influence the cluster behavior
Configure the number of placement groups that are assigned to each pool independently to fit the
type of data and the required access for the pool.
The CRUSH algorithm determines the OSDs that host the data for a pool. Each pool has a single
CRUSH rule that is assigned as its placement strategy. The CRUSH rule determines which OSDs
store the data for all the pools that are assigned that rule.
Placement Groups
A Placement Group (PG) aggregates a series of objects into a hash bucket, or group. Ceph maps
each PG to a set of OSDs. An object belongs to a single PG, and all objects that belong to the
same PG return the same hash result.
The CRUSH algorithm maps an object to its PG based on the hashing of the object's name. The
placement strategy is also called the CRUSH placement rule. The placement rule identifies the
failure domain to choose within the CRUSH topology to receive each replica or erasure code
chunk.
When a client writes an object to a pool, it uses the pool's CRUSH placement rule to determine the
object's placement group. The client then uses its copy of the cluster map, the placement group,
and the CRUSH placement rule to calculate which OSDs to write a copy of the object to (or its
erasure-coded chunks).
The layer of indirection that the placement group provides is important when new OSDs become
available to the Ceph cluster. When OSDs are added to or removed from a cluster, placement
groups are automatically rebalanced between operational OSDs.
Ceph does not provide the client with the location of the objects; the client must use CRUSH to compute the locations of objects that it needs to access.
To calculate the Placement Group ID (PG ID) for an object, the Ceph client needs the object ID
and the name of the object's storage pool. The client calculates the PG ID, which is the hash of the
object ID modulo the number of PGs. It then looks up the numeric ID for the pool, based on the
pool's name, and prepends the pool ID to the PG ID.
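You can have the cluster perform this calculation for any object name with the ceph osd map command; for example, with an illustrative pool and object name:
[ceph: root@node /]# ceph osd map mypool myobject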
The CRUSH algorithm is then used to determine which OSDs are responsible for a placement
group (the Acting Set). The OSDs in the Acting Set that are up are in the Up Set. The first OSD in
the Up Set is the current primary OSD for the object's placement group, and all other OSDs in the
Up Set are secondary OSDs.
The Ceph client can then directly work with the primary OSD to access the object.
Data Protection
Like Ceph clients, OSD daemons use the CRUSH algorithm, but the OSD daemon uses it to
compute where to store the object replicas and for rebalancing storage. In a typical write scenario,
a Ceph client uses the CRUSH algorithm to compute where to store the original object. The client
maps the object to a pool and placement group and then uses the CRUSH map to identify the
primary OSD for the mapped placement group. When creating pools, set them as either replicated
or erasure coded pools. Red Hat Ceph Storage 5 supports erasure coded pools for Ceph RBD and
CephFS.
For resilience, configure pools with the number of OSDs that can fail without losing data. For
a replicated pool, which is the default pool type, the number determines the number of copies
of an object to create and distribute across different devices. Replicated pools provide better
performance than erasure coded pools in almost all cases at the cost of a lower usable-to-raw
storage ratio.
Erasure coding provides a more cost-efficient way to store data but with lower performance. For
an erasure coded pool, the configuration values determine the number of coding chunks and
parity blocks to create.
A primary advantage of erasure coding is its ability to offer extreme resilience and durability. You
can configure the number of coding chunks (parities) to use.
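As a sketch of both protection types, the following commands create a replicated pool with three copies and an erasure coded pool that uses four data chunks and two coding chunks; the pool names, placement group counts, and profile values are illustrative:
[ceph: root@node /]# ceph osd pool create replpool 32 32 replicated
[ceph: root@node /]# ceph osd pool set replpool size 3
[ceph: root@node /]# ceph osd erasure-code-profile set ecprofile k=4 m=2
[ceph: root@node /]# ceph osd pool create ecpool 32 32 erasure ecprofile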
The following figure illustrates how data objects are stored in a Ceph cluster. Ceph maps one or
more objects in a pool to a single PG, represented by the colored boxes. Each of the PGs in this
figure is replicated and stored on separate OSDs within the Ceph cluster.
References
For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/architecture_guide
For more information, refer to the Red Hat Ceph Storage 5 Hardware Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/hardware_guide
For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide
at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/storage_strategies_guide
Guided Exercise
Outcomes
You should be able to navigate and work with services within the Ceph cluster.
This command confirms that the Ceph cluster in the classroom is operating.
Instructions
1. Log in to the admin node, clienta, and view Ceph services.
1.1. Log in to clienta as the admin user and switch to the root user.
1.2. Use the ceph orch ls command within the cephadm shell to view the running
services.
1.3. Use the cephadm shell command to launch the shell, and then use the ceph
orch ps command to view the status of all cluster daemons.
2.1. Use the ceph health command to view the health of your Ceph cluster.
Note
If the reported cluster status is not HEALTH_OK, the ceph health detail
command shows further information about the cause of the health alert.
2.2. Use the ceph status command to view the full cluster status.
services:
mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age
10m)
mgr: serverc.lab.example.com.aiqepd(active, since 19m), standbys:
clienta.nncugs, serverd.klrkci, servere.kjwyko
osd: 9 osds: 9 up (since 19m), 9 in (since 2d)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 5 pools, 105 pgs
objects: 221 objects, 4.9 KiB
usage: 156 MiB used, 90 GiB / 90 GiB avail
pgs: 105 active+clean
3.1. Use the ceph mon dump command to view the cluster MON map.
3.2. Use the ceph mgr stat command to view the cluster MGR status.
3.3. Use the ceph osd pool ls command to view the cluster pools.
3.4. Use the ceph pg stat command to view Placement Group (PG) status.
3.5. Use the ceph osd status command to view the status of all OSDs.
3.6. Use the ceph osd crush tree command to view the cluster CRUSH hierarchy.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe and compare the use cases for the
various management interfaces provided for Red Hat Ceph Storage.
Cephadm is implemented as a module in the Manager daemon (MGR), which is the first daemon
that starts when deploying a new cluster. The Ceph cluster core integrates all the management
tasks, and Cephadm is ready to use when the cluster starts.
Cephadm is provided by the cephadm package. You should install this package in the first
cluster node, which acts as the bootstrap node. As Ceph 5 is deployed in the containerized
version, the only package requirements to have a Ceph cluster up and running are cephadm,
podman, python3, and chrony. This containerized version reduces the complexity and package
dependencies to deploy a Ceph cluster.
The following diagram illustrates how Cephadm interacts with the other services.
Cephadm can log in to the container registry to pull a Ceph image and deploy services on the
nodes that use that image. This Ceph container image is necessary when bootstrapping the
cluster, because the deployed Ceph containers are based on that image.
To interact with the Ceph cluster nodes, Cephadm uses SSH connections. By using these SSH
connections, Cephadm can add new hosts to the cluster, add storage, or monitor these hosts.
You can also use cephadm shell -- command to directly run commands through the
containerized shell.
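For example, a single status command can be run without entering the interactive shell; the host names shown in the following partial output are from a sample cluster:
[root@node ~]# cephadm shell -- ceph status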
services:
mon: 2 daemons, quorum node.example.com,othernode (age 29m)
mgr: servere.zyngtp(active, since 30m), standbys: node.example.com.jbfojo
osd: 15 osds: 15 up (since 29m), 15 in (since 33m)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 5 pools, 105 pgs
objects: 189 objects, 4.9 KiB
usage: 231 MiB used, 150 GiB / 150 GiB avail
pgs: 105 active+clean
You can use the Ceph CLI for tasks to deploy, manage, and monitor a cluster. This course uses the
Ceph CLI for many cluster management operations.
Single Sign-On
The Dashboard GUI permits authentication via an external identity provider.
Auditing
You can configure the Dashboard to log all the REST API requests.
Security
The Dashboard uses SSL/TLS by default to secure all HTTP connections.
The Ceph Dashboard GUI also implements different features for managing and monitoring the
cluster. The following lists, although not exhaustive, summarize important management and monitoring
features:
• Management features
– Review the cluster hierarchy by using the CRUSH map
– Enable, edit, and disable manager modules
– Create, remove, and manage OSDs
– Manage iSCSI
– Manage pools
• Monitoring features
– Check the overall cluster health
– View the hosts in the cluster and their services
– Review the logs
– Review the cluster alerts
– Check the cluster capacity
The following image shows the status screen from the Dashboard GUI. You can quickly review
some important cluster parameters, such as the cluster status, the number of hosts in the cluster,
or the number of OSDs.
References
For more information, refer to the Red Hat Ceph Storage 5 Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/administration_guide
For more information, refer to the Red Hat Ceph Storage 5 Dashboard Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/dashboard_guide/index
Guided Exercise
Outcomes
You should be able to navigate the Dashboard GUI primary screens.
This command confirms that the Ceph cluster in the classroom is operating.
Instructions
1. Use Firefox to navigate to the Dashboard web GUI URL at
https://serverc.lab.example.com:8443. If prompted, accept the self-signed
certificates that are used in this classroom.
On the Dashboard login screen, enter your credentials.
2. The Dashboard main screen appears, with three sections: Status, Capacity, and
Performance.
2.1. The Status section shows an overview of the whole cluster status. You can see the
cluster health status, which can be HEALTH_OK, HEALTH_WARN, or HEALTH_ERR.
Check that your cluster is in the HEALTH_OK state. This section displays the number
of cluster hosts, the number of monitors and the cluster quorum that uses those
monitors, the number and status of OSDs, and other options.
2.2. The Dashboard Capacity section displays overall Ceph cluster capacity, the number
of objects, the placement groups, and the pools. Check that the capacity of your
cluster is approximately 90 GiB.
2.3. The Performance section displays throughput information, and read and write disk
speed. Because the cluster just started, the throughput and speed should be 0.
3.1. The Hosts section displays the host members of the Ceph cluster, the Ceph services
that are running on each host, and the cephadm version that is running. In your
cluster, check that the three hosts serverc, serverd, and servere are running
the same cephadm version. In this menu, you can add hosts to the cluster, and edit or
delete the existing hosts.
3.2. The Inventory section displays the physical disks that the Ceph cluster detects. You
can view physical disk attributes, such as their host, device path, and size. Verify that
the total number of physical disks on your cluster is 20. In this menu, if you select one
physical disk and press Identify, then that disk's LED starts flashing to make it easy to
physically locate disks in your cluster.
3.3. Navigate to the OSDs section to view information about the cluster OSDs. This
section displays the number of OSDs, which host they reside on, the number and
usage of placement groups, and the disk read and write speeds. Verify that serverc
contains three OSDs. You can create, edit, and delete OSDs from this menu.
3.4. Navigate to the CRUSH Map section, which displays your cluster's CRUSH map. This
map provides information about your cluster's physical hierarchy. Verify that the three
host buckets serverc, serverd, and servere are defined within the default
bucket of type root.
3.5. The Logs section displays the Ceph logs. View the Cluster Logs and the Audit Logs.
You can filter logging messages by priority, keyword, and date. View the Info log
messages by filtering by Priority.
The Pools menu displays existing pool information, including the pool name and type of
data protection, and the application. You can also create, edit, or delete the pools from this
menu. Verify that your cluster contains a pool called default.rgw.log.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• The following services provide the foundation for a Ceph storage cluster:
– Metadata Servers (MDSes) store metadata that CephFS uses to efficiently run POSIX
commands for clients.
• RADOS (Reliable Autonomic Distributed Object Store) is the back end for storage in the Ceph
cluster, a self-healing and self-managing object store.
• RADOS provides four access methods to storage: the librados native API, the object-based
RADOS Gateway, the RADOS Block Device (RBD), and the distributed file-based CephFS file
system.
• A Placement Group (PG) aggregates a set of objects into a hash bucket. The CRUSH algorithm
maps the hash buckets to a set of OSDs for storage.
• Pools are logical partitions of the Ceph storage that are used to store object data. Each pool is a
name tag for grouping objects. A pool groups objects for storage by using placement groups.
• Red Hat Ceph Storage provides two interfaces, a command line and a Dashboard GUI, for
managing clusters. Both interfaces use the same cephadm module to perform operations and to
interact with cluster services.
Chapter 2. Deploying Red Hat Ceph Storage
Objectives
After completing this section, you should be able to prepare for and perform a Red Hat Ceph
Storage cluster deployment using cephadm command-line tools.
The cephadm shell command runs a bash shell within a Ceph-supplied management container.
Use the cephadm shell to perform cluster deployment tasks initially and cluster management tasks
after the cluster is installed and running.
Launch the cephadm shell to run multiple commands interactively, or to run a single command.
To run it interactively, use the cephadm shell command to open the shell, then run Ceph
commands.
To run a single command, use the cephadm shell command followed by two dashes and the
Ceph command.
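For example, the two invocation styles look like this; the host name in the prompts is illustrative:
[root@node ~]# cephadm shell
[ceph: root@node /]# ceph status

[root@node ~]# cephadm shell -- ceph status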
The following daemons can be collocated with OSD daemons: RADOSGW, MDS, RBD-mirror,
MON, MGR, Grafana, and NFS Ganesha.
Use the following command to copy the cluster key to a cluster node:
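A typical invocation uses ssh-copy-id with the cluster's public SSH key; the target host name is illustrative:
[root@adm ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@new-node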
• Install the cephadm-ansible package on the host you have chosen as the bootstrap node,
which is the first node in the cluster.
• Run the cephadm preflight playbook. This playbook verifies that the host has the required
prerequisites.
• Use cephadm to bootstrap the cluster. The bootstrap process accomplishes the following tasks:
– Installs and starts a Ceph Monitor and a Ceph Manager daemon on the bootstrap node.
– Writes a copy of the cluster public SSH key to /etc/ceph/ceph.pub and adds the key to
the /root/.ssh/authorized_keys file.
– Writes a minimal configuration file needed to communicate with the new cluster to the /etc/
ceph/ceph.conf file.
– Deploys a basic monitoring stack with prometheus and grafana services, as well as other
tools such as node-exporter and alert-manager.
Installing Prerequisites
Install cephadm-ansible on the bootstrap node:
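For example, assuming the required repositories are already enabled:
[root@node ~]# dnf install cephadm-ansible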
Run the cephadm-preflight.yaml playbook. This playbook configures the Ceph repository
and prepares the storage cluster for bootstrapping. It also installs prerequisite packages, such as
podman, lvm2, chrony, and cephadm.
The preflight playbook uses the cephadm-ansible inventory file to identify the admin and client
nodes in the storage cluster.
[admin]
node00
[clients]
client01
client02
client03
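With the inventory in place, a typical preflight run looks like the following; the working directory and the extra variable value are illustrative:
[root@node ~]# cd /usr/share/cephadm-ansible
[root@node cephadm-ansible]# ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=rhcs"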
Expand the storage cluster by using the ceph orchestrator command or the Dashboard GUI
to add cluster nodes and services.
Note
Before bootstrapping, you must create a username and password for the
registry.redhat.io container registry. Visit https://access.redhat.com/
RegistryAuthentication for instructions.
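A minimal bootstrap invocation might look like the following; the monitor IP address, registry credentials, and dashboard password are placeholders:
[root@node ~]# cephadm bootstrap --mon-ip=MON_IP \
--registry-url=registry.redhat.io --registry-username=REGISTRY_USERNAME \
--registry-password=REGISTRY_PASSWORD \
--initial-dashboard-password=DASHBOARD_PASSWORD --dashboard-password-noupdate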
URL: https://bootstrapnode.example.com:8443/
User: admin
Password: adminpassword
ceph telemetry on
https://docs.ceph.com/docs/master/mgr/telemetry/
Bootstrap complete.
service_type: host
addr: node-00
hostname: node-00
---
service_type: host
addr: node-01
hostname: node-01
---
service_type: host
addr: node-02
hostname: node-02
---
service_type: mon
placement:
hosts:
- node-00
- node-01
- node-02
---
service_type: mgr
placement:
hosts:
- node-00
- node-01
- node-02
---
service_type: rgw
service_id: realm.zone
placement:
hosts:
- node-01
- node-02
---
service_type: osd
placement:
host_pattern: "*"
data_devices:
all: true
Labeling simplifies cluster management tasks by helping to identify the daemons running on each host. For example, you can use the ceph orch host ls command to list the hosts and their labels. You can use the Ceph orchestrator or a YAML service specification file to deploy or remove daemons on specifically labeled hosts.
Except for the _admin label, labels are free-form and have no specific meaning. You can use
labels such as mon, monitor, mycluster_monitor, or other text strings to label and group
cluster nodes. For example, assign the mon label to nodes that you deploy MON daemons to.
Assign the mgr label for nodes that you deploy MGR daemons to, and assign rgw for RADOS
gateways.
For example, the following command applies the _admin label to a host to designate it as the admin node.
[ceph: root@node /]# ceph orch host label add ADMIN_NODE _admin
References
For more information, refer to the Red Hat Ceph Storage 5 Installation Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/installation_guide/index
Guided Exercise
Outcomes
You should be able to install a containerized Ceph cluster by using a service specification file.
This command confirms that the local container registry for the classroom is running and
deletes the prebuilt Ceph cluster so it can be redeployed with the steps in this exercise.
Important
This lab start script immediately deletes the prebuilt Ceph cluster and takes a
few minutes to complete. Wait for the command to finish before continuing.
Instructions
1. Log in to serverc as the admin user and switch to the root user.
2. Install the cephadm-ansible package, create the inventory file, and run the cephadm-
preflight.yml playbook to prepare cluster hosts.
The ceph_origin variable is set to empty, which causes some playbook tasks to
be skipped because, in this classroom, the Ceph packages are installed from a local
classroom repository. For your production environment, set ceph_origin to rhcs to
enable the Red Hat Storage Tools repository for your supported deployment.
---
service_type: host
addr: 172.25.250.10
hostname: clienta.lab.example.com
---
service_type: host
addr: 172.25.250.12
hostname: serverc.lab.example.com
---
service_type: host
addr: 172.25.250.13
hostname: serverd.lab.example.com
---
service_type: host
addr: 172.25.250.14
hostname: servere.lab.example.com
---
service_type: mon
placement:
hosts:
- clienta.lab.example.com
- serverc.lab.example.com
- serverd.lab.example.com
- servere.lab.example.com
---
service_type: rgw
service_id: realm.zone
placement:
hosts:
- serverc.lab.example.com
- serverd.lab.example.com
---
service_type: mgr
placement:
hosts:
- clienta.lab.example.com
- serverc.lab.example.com
- serverd.lab.example.com
- servere.lab.example.com
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: 'server*'
data_devices:
paths:
- /dev/vdb
- /dev/vdc
- /dev/vdd
The service_type: host defines the nodes to add after the cephadm
bootstrap completes. Host clienta will be configured as an admin node.
The Ceph Orchestrator deploys one monitor daemon by default. In the file the
service_type: mon deploys a Ceph monitor daemon in the listed hosts.
The service_type: rgw deploys a Ceph Object Gateway daemon in the listed hosts.
The service_type: mgr deploys a Ceph Manager daemon in the listed hosts.
The service_type: osd deploys ceph-osd daemons on the listed hosts, backed by the /dev/vdb, /dev/vdc, and /dev/vdd devices.
4. As the root user on the serverc node, run the cephadm bootstrap command to create the Ceph cluster. Use the service specification file located at initial-config-primary-cluster.yaml.
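A possible invocation, based on the classroom values shown in this exercise; the registry options are placeholders for the local classroom registry credentials:
[root@serverc ~]# cephadm bootstrap --mon-ip=172.25.250.12 \
--apply-spec=initial-config-primary-cluster.yaml \
--initial-dashboard-password=redhat --dashboard-password-noupdate \
--allow-fqdn-hostname \
--registry-url=REGISTRY_URL --registry-username=REGISTRY_USERNAME \
--registry-password=REGISTRY_PASSWORD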
URL: https://serverc.lab.example.com:8443/
User: admin
Password: redhat
ceph telemetry on
https://docs.ceph.com/docs/pacific/mgr/telemetry/
Bootstrap complete.
services:
mon: 4 daemons, quorum serverc.lab.example.com,serverd,servere,clienta (age
10s)
mgr: serverc.lab.example.com.bypxer(active, since 119s), standbys:
serverd.lflgzj, clienta.hloibd, servere.jhegip
osd: 9 osds: 9 up (since 55s), 9 in (since 75s)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 47 MiB used, 90 GiB / 90 GiB avail
pgs: 1 active+clean
Your cluster might be in the HEALTH_WARN state for a few minutes until all services
and OSDs are ready.
6. Label clienta as the admin node. Verify that you can execute cephadm commands from
clienta.
6.1. Apply the _admin label to clienta to label it as the admin node.
[ceph: root@serverc /]# ceph orch host label add clienta.lab.example.com _admin
Added label _admin to host clienta.lab.example.com
6.3. Return to workstation as the student user, then log into clienta as the admin
user and start the cephadm shell. Verify that you can execute cephadm commands
from clienta.
Finish
On the workstation machine, use the lab command to complete this exercise. This command
does not disable or modify the Ceph cluster you just deployed. Your new cluster will be used in the
next exercise in this chapter.
Objectives
After completing this section, you should be able to expand capacity to meet application storage
requirements by adding OSDs to an existing cluster.
Use the cephadm shell -- ceph health command to verify that the cluster is in the
HEALTH_OK state before starting to deploy additional OSDs.
• As the root user, add the Ceph storage cluster public SSH key to the root user's
authorized_keys file on the new host.
• As the root user, add new nodes to the inventory file located in /usr/share/cephadm-
ansible/hosts/. Run the preflight playbook with the --limit option to restrict the
playbook's tasks to run only on the nodes specified. The Ansible Playbook verifies that the
nodes to be added meet the prerequisite package requirements.
• Choose one of the methods to add new hosts to the Ceph storage cluster:
– As the root user, in the Cephadm shell, use the ceph orch host add command to add a
new host to the storage cluster. In this example, the command also assigns host labels.
– To add multiple hosts, create a YAML file with host descriptions. Create the YAML file within the admin container where you will then run ceph orch.
service_type: host
addr:
hostname: new-osd-1
labels:
- mon
- osd
- mgr
---
service_type: host
addr:
hostname: new-osd-2
labels:
- mon
- osd
After creating the YAML file, run the ceph orch apply command to add the hosts:
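For example, assuming the file is named new-hosts.yml:
[ceph: root@adm /]# ceph orch apply -i new-hosts.yml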
Listing Hosts
Use ceph orch host ls from the cephadm shell to list the cluster nodes. The STATUS column
is blank when the host is online and operating normally.
Run the ceph orch device ls command from the cephadm shell to list the available devices.
The --wide option provides more device detail.
As the root user, run the ceph orch daemon add osd command to create an OSD using a
specific device on a specific host.
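For example, with an illustrative host and device:
[ceph: root@adm /]# ceph orch daemon add osd node-02:/dev/vdb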
Alternately, run the ceph orch apply osd --all-available-devices command to deploy
OSDs on all available and unused devices.
You can create OSDs by using only specific devices on specific hosts by including selective disk
properties. The following example creates two OSDs in the group default_drive_group
backed by /dev/vdc and /dev/vdd on each host.
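A sketch of such a specification, assuming it is saved as osd_spec.yml and that all hosts should match (the host pattern is illustrative):
service_type: osd
service_id: default_drive_group
placement:
  host_pattern: '*'
data_devices:
  paths:
    - /dev/vdc
    - /dev/vdd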
Run the ceph orch apply command to implement the configuration in the YAML file.
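For example, assuming the file is accessible inside the cephadm shell as osd_spec.yml:
[ceph: root@adm /]# ceph orch apply -i osd_spec.yml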
References
For more information, refer to the Adding hosts chapter in the Red Hat Ceph Storage
Installation Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/installation_guide/index#adding-hosts_install
For more information, refer to the Management of OSDs using the Ceph
Orchestrator chapter in the Red Hat Ceph Storage Operations Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/
html-single/operations_guide/index#management-of-osds-using-the-ceph-
orchestrator
Guided Exercise
Outcomes
You should be able to expand your cluster by adding new OSDs.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command confirms that the Ceph cluster is reachable and provides an example service
specification file.
Instructions
In this exercise, expand the amount of storage in your Ceph storage cluster.
1.1. Log in to serverc as the admin user and use sudo to run the cephadm shell.
2.1. Create the osd_spec.yml file in the /var/lib/ceph/osd/ directory with the
correct configuration. For your convenience, you can copy and paste the content from
the /root/expand-osd/osd_spec.yml file on clienta.
paths:
- /dev/vdb
- /dev/vdc
- /dev/vdd
2.2. Deploy the osd_spec.yml file, then run the ceph orch apply command to
implement the configuration.
3.1. Display the servere storage device inventory on the Ceph cluster.
4. Verify that the cluster is in a healthy state and that the OSDs were successfully added.
services:
mon: 4 daemons, quorum serverc.lab.example.com,clienta,servere,serverd (age
12m)
...output omitted...
4.2. Use the ceph osd tree command to display the CRUSH tree. Verify that the new
OSDs' location in the infrastructure is correct.
4.3. Use the ceph osd df command to verify the data usage and the number of
placement groups for each OSD.
Finish
On the workstation machine, use the lab command to complete this exercise. This command
will not delete your Ceph cluster or modify your cluster configuration, allowing you to browse your
expanded configuration before continuing with the next chapter.
Lab
Outcomes
You should be able to deploy a new Ceph cluster.
This command confirms that the local container registry for the classroom is running and
deletes the prebuilt Ceph cluster so it can be redeployed with the steps in this exercise.
Important
This lab start script immediately deletes the prebuilt Ceph cluster and takes a
few minutes to complete. Wait for the command to finish before continuing.
Instructions
Deploy a new cluster with serverc, serverd, and servere as MON, MGR, and OSD nodes. Use
serverc as the deployment bootstrap node. Add OSDs to the cluster after the cluster deploys.
1. Use serverc as the bootstrap node. Log in to serverc as the admin user and switch to the
root user. Run the cephadm-preflight.yml playbook to prepare the cluster hosts.
2. Create a services specification file called initial-cluster-config.yaml.
Using the following template, add hosts serverd.lab.example.com and
servere.lab.example.com, with their IP addresses, as service_type: host. Add
serverd and servere to the mon and mgr sections.
---
service_type: host
addr: 172.25.250.12
hostname: serverc.lab.example.com
---
service_type: mon
placement:
hosts:
- serverc.lab.example.com
---
service_type: mgr
placement:
hosts:
- serverc.lab.example.com
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: 'server*'
data_devices:
paths:
- /dev/vdb
- /dev/vdc
- /dev/vdd
service_type: osd
service_id: default_drive_group
placement:
hosts:
- serverc.lab.example.com
- serverd.lab.example.com
- servere.lab.example.com
data_devices:
paths:
- /dev/vde
- /dev/vdf
Evaluation
Grade your work by running the lab grade deploy-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
Reset your classroom environment. This restores your Ceph environment to the original, prebuilt
Ceph cluster that is expected by other course chapters.
Solution
Outcomes
You should be able to deploy a new Ceph cluster.
This command confirms that the local container registry for the classroom is running and
deletes the prebuilt Ceph cluster so it can be redeployed with the steps in this exercise.
Important
This lab start script immediately deletes the prebuilt Ceph cluster and takes a
few minutes to complete. Wait for the command to finish before continuing.
Instructions
Deploy a new cluster with serverc, serverd, and servere as MON, MGR, and OSD nodes. Use
serverc as the deployment bootstrap node. Add OSDs to the cluster after the cluster deploys.
1. Use serverc as the bootstrap node. Log in to serverc as the admin user and switch to the
root user. Run the cephadm-preflight.yml playbook to prepare the cluster hosts.
1.1. Log in to serverc as the admin user and switch to the root user.
---
service_type: host
addr: 172.25.250.12
hostname: serverc.lab.example.com
---
service_type: mon
placement:
hosts:
- serverc.lab.example.com
---
service_type: mgr
placement:
hosts:
- serverc.lab.example.com
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: 'server*'
data_devices:
paths:
- /dev/vdb
- /dev/vdc
- /dev/vdd
placement:
hosts:
- serverc.lab.example.com
- serverd.lab.example.com
- servere.lab.example.com
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: 'server*'
data_devices:
paths:
- /dev/vdb
- /dev/vdc
- /dev/vdd
3.1. Run the cephadm bootstrap command to create the Ceph cluster. Use the
initial-cluster-config.yaml service specification file that you just created.
URL: https://serverc.lab.example.com:8443/
User: admin
Password: redhat
3.2. Using the cephadm shell, verify that the cluster was successfully deployed. Wait for the
cluster to finish deploying and reach the HEALTH_OK status.
services:
mon: 3 daemons, quorum serverc.lab.example.com,servere,serverd (age 2m)
mgr: serverc.lab.example.com.blxerd(active, since 3m), standbys:
serverd.nibyts, servere.rkpsii
osd: 9 osds: 9 up (since 2m), 9 in (since 2m)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 46 MiB used, 90 GiB / 90 GiB avail
pgs: 1 active+clean
4. Expand the cluster by adding OSDs to serverc, serverd, and servere. Use the following
service specification file.
service_type: osd
service_id: default_drive_group
placement:
hosts:
- serverc.lab.example.com
- serverd.lab.example.com
- servere.lab.example.com
data_devices:
paths:
- /dev/vde
- /dev/vdf
4.2. Use the ceph orch apply command to add the OSDs to the cluster OSD nodes.
4.3. Verify that the OSDs were added. Wait for the new OSDs to display as up and in.
services:
mon: 3 daemons, quorum serverc.lab.example.com,servere,serverd (age 5m)
mgr: serverc.lab.example.com.blxerd(active, since 6m), standbys:
serverd.nibyts, servere.rkpsii
osd: 15 osds: 15 up (since 10s), 15 in (since 27s)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 83 MiB used, 150 GiB / 150 GiB avail
pgs: 1 active+clean
Evaluation
Grade your work by running the lab grade deploy-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
Reset your classroom environment. This restores your Ceph environment to the original, prebuilt
Ceph cluster that is expected by other course chapters.
Summary
In this chapter, you learned:
– The cephadm shell runs a bash shell within a specialized management container. Use the
cephadm shell to perform cluster deployment tasks and cluster management tasks after the
cluster is installed.
• As of version 5.0, all Red Hat Ceph Storage cluster services are containerized.
• Preparing for a new cluster deployment requires planning cluster service placement and
distributing SSH keys to nodes.
• The cephadm bootstrap command performs these actions on the bootstrap node:
– Installs and starts the MON and MGR daemons on the bootstrap node.
– Writes a copy of the cluster public SSH key and adds the key to the authorized_keys file.
– Writes a copy of the administrative secret key to the key ring file.
• Assign labels to the cluster hosts to identify the daemons running on each host. The _admin
label is reserved for administrative nodes.
• Expand cluster capacity by adding OSD nodes to the cluster or additional storage space to
existing OSD nodes.
Chapter 3
Configuring a Red Hat Ceph Storage Cluster
Objectives
After completing this section, you should be able to identify and configure the primary settings for
the overall Red Hat Ceph Storage cluster.
Ceph configuration settings use unique names that consist of lowercase character words
connected with underscores.
Note
Configuration settings might contain dash or space characters when using some
configuration methods. However, using underscores in configuration naming is a
consistent, recommended practice.
Every Ceph daemon, process, and library accesses its configuration from one of these sources:
Important
Later settings override those found in earlier sources when multiple setting sources
are present. The configuration file configures the daemons when they start.
Configuration file settings override those stored in the central database.
The monitor (MON) nodes manage a centralized configuration database. On startup, Ceph
daemons parse configuration options that are provided via command-line options, environment
variables, and the local cluster configuration file. The daemons then contact the MON cluster to
retrieve configuration settings that are stored in the centralized configuration database.
Red Hat Ceph Storage 5 deprecates the ceph.conf cluster configuration file, making the
centralized configuration database the preferred way to store configuration settings.
The configuration file uses an INI file format, containing several sections that include configuration
for Ceph daemons and clients. Each section has a name that is defined with the [name] header,
and one or more parameters that are defined as a key-value pair.
[name]
parameter1 = value1
parameter2 = value2
Use a hash sign (#) or semicolon (;) to disable settings or to add comments.
Bootstrap a cluster with custom settings by using a cluster configuration file. Use the cephadm
bootstrap command with the --config option to pass the configuration file.
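For example, a bootstrap invocation of this form passes a custom configuration file; the monitor IP address and file path are illustrative values.
[root@node ~]# cephadm bootstrap --mon-ip 172.25.250.12 --config /root/initial-ceph.conf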
Configuration Sections
Ceph organizes configuration settings into groups, whether stored in the configuration file or in
the configuration database, by using sections named for the daemons or clients to which they apply.
• The [global] section stores general configuration that is common to all daemons or any
process that reads the configuration, including clients. You can override [global] parameters
by creating named sections for individual daemons or clients.
• The [mon] section stores configuration for the Monitors (MON).
• The [osd] section stores configuration for the OSD daemons.
• The [mgr] section stores configuration for the Managers (MGR).
• The [mds] section stores configuration for the Metadata Servers (MDS).
• The [client] section stores configuration that applies to all the Ceph clients.
Note
A well-commented sample.ceph.conf example file can be found in
/var/lib/containers/storage/overlays/{ID}/merged/usr/share/doc/ceph/ on any
cluster node, in the area where podman manages containerized service data.
Instance Settings
Group the settings that apply to a specific daemon instance in their own section, with a name of
the form [daemon-type.instance-ID].
[mon]
# Settings for all mon daemons
[mon.serverc]
# Settings that apply to the specific MON daemon running on serverc
The same naming applies to the [osd], [mgr], [mds], and [client] sections. For OSD
daemons, the instance ID is always numeric, for example [osd.0]. For clients, the instance ID is
the active user name, such as [client.operator3].
Meta Variables
Meta variables are Ceph-defined variables. Use them to simplify the configuration.
$cluster
The name of the Red Hat Ceph Storage 5 cluster. The default cluster name is ceph.
$type
The daemon type, such as the value mon for a monitor. OSDs use osd, MDSes use mds, MGRs
use mgr, and client applications use client.
$id
The daemon instance ID. This variable has the value serverc for the Monitor on serverc.
$id is 1 for osd.1, and is the user name for a client application.
$name
The daemon name and instance ID. This variable is a shortcut for $type.$id.
$host
The host name on which the daemon is running.
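As an illustrative sketch of how meta variables expand, the following [global] entry causes the MON on serverc to log to /var/log/ceph/ceph-mon.serverc.log; the log path is an example value.
[global]
log_file = /var/log/ceph/$cluster-$name.log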
Use ceph config commands to query the database and view configuration information.
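For example, the following commands are typical queries and updates; the daemon name and the debug_ms setting are illustrative.
[ceph: root@node /]# ceph config ls
[ceph: root@node /]# ceph config dump
[ceph: root@node /]# ceph config get osd.1 debug_ms
[ceph: root@node /]# ceph config set osd.1 debug_ms 10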
Use the assimilate-conf subcommand to apply configuration from a file to a running cluster.
This process recognizes and applies the changed settings from the configuration file to the
centralized database. This command is useful to import custom settings from a previous storage
cluster to a new one. Invalid or unrecognized options display on standard output and require
manual handling. Redirect this output to a file by using the -o output-file option.
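A sketch of the assimilate-conf invocation, assuming an input file at /etc/ceph/ceph.conf and an illustrative output file for unrecognized options:
[ceph: root@node /]# ceph config assimilate-conf -i /etc/ceph/ceph.conf -o /tmp/unrecognized.conf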
The mon_host option lists the cluster monitors. This option is essential and cannot be stored in the
configuration database, because daemons and clients need it to contact the MONs before they can
retrieve settings from the database. To avoid using a cluster configuration file, Ceph clusters support
using DNS service records to provide the mon_host list.
The local cluster configuration file can contain other options to fit your requirements.
• mon_host_override, the initial list of monitors for the cluster to contact to start
communicating.
• mon_dns_serv_name, the name of the DNS SRV record to check to identify the cluster
monitors via DNS.
• mon_data, osd_data, mds_data, and mgr_data, which define each daemon's local data storage
directory.
• keyring, keyfile, and key, the authentication credentials to authenticate with the monitor.
service_type: mon
placement:
host_pattern: "mon*"
count: 3
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: "osd*"
data_devices:
all: true
• service_type defines the type of service, such as mon, mds, mgr, or rgw.
• placement defines the location and quantity of the services to deploy. You can define the
hosts, host pattern, or label to select the target servers.
• data_devices is specific to OSD services, supporting filters such as size, model, or paths.
Use the cephadm bootstrap --apply-spec command to apply the service configurations
from the specified file.
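For example, a bootstrap with a service specification file might look like the following; the monitor IP address and file path are illustrative.
[root@node ~]# cephadm bootstrap --mon-ip 172.25.250.12 --apply-spec /root/initial-cluster-config.yaml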
The ceph tell $type.$id config command temporarily overrides configuration settings,
and requires that both the MONs and the daemon being configured are running.
Run this command from any cluster host configured to run ceph commands. Settings that are
changed with this command revert to their original settings when the daemon restarts.
• ceph tell $type.$id config get gets a specific runtime setting for a daemon.
• ceph tell $type.$id config set sets a specific runtime setting for a daemon. When the
daemon restarts, these temporary settings revert to their original values.
The ceph tell $type.$id config command can also accept wildcards to get or set
the value on all daemons of the same type. For example, ceph tell osd.* config get
debug_ms displays the value of that setting for all OSD daemons in the cluster.
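For example, the following runtime queries and overrides are representative; the daemon ID and value are illustrative.
[ceph: root@node /]# ceph tell osd.1 config get debug_ms
[ceph: root@node /]# ceph tell osd.1 config set debug_ms 5
[ceph: root@node /]# ceph tell osd.* config get debug_ms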
You can use the ceph daemon $type.$id config command to temporarily override a
configuration setting. Run this command on the cluster node where the setting is required.
The ceph daemon command does not need to connect through the MONs. It still functions
even when the MONs are not running, which can be useful for troubleshooting.
• ceph daemon $type.$id config get gets a specific runtime setting for a daemon.
• ceph daemon $type.$id config set sets a specific runtime setting for a daemon.
Temporary settings revert to their original values when their daemon restarts.
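For example, run commands of this form in the cephadm shell on the node that hosts the daemon; the daemon ID and value are illustrative.
[ceph: root@node /]# ceph daemon osd.1 config get debug_ms
[ceph: root@node /]# ceph daemon osd.1 config set debug_ms 5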
References
For more information, refer to the Red Hat Ceph Storage 5 Configuration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/configuration_guide/index#the-basics-of-ceph-configuration
Guided Exercise
Outcomes
You should be able to view stored configuration settings, and view and set the value of a
specific stored setting and a specific runtime setting.
If you performed the practice cluster deployments in the Deploying Red Hat
Ceph Storage chapter exercises, but have not reset your environment back
to the default classroom cluster since that chapter, then you must reset your
environment before executing the lab start command.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command confirms that the required hosts for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
no_config_file false
override
rbd_default_features 61
default
setgroup ceph
cmdline
setuser ceph
cmdline
4. View the runtime and cluster configuration database value of the debug_ms setting
for the osd.1 daemon. Edit the setting value to 10. Verify that the runtime and cluster
configuration database value is applied. Restart the osd.1 daemon. Verify that the values for
the debug_ms setting persist after the restart.
4.2. View the debug_ms setting value that is stored in the cluster configuration database
for the osd.1 daemon.
4.4. Verify that the runtime and cluster configuration database value is applied.
4.6. Run the cephadm shell and verify that the values persist after the restart.
5. Use ceph tell to temporarily override the debug_ms setting value for the osd.1
daemon to 5. Restart the osd.1 daemon in serverc and verify that the value is reverted.
5.1. View the runtime value of the debug_ms setting for osd.1.
5.2. Temporarily change the runtime value of the debug_ms setting for osd.1 to 5. This
setting reverts when you restart the osd.1 daemon.
5.3. Verify the runtime value of the debug_ms setting for osd.1.
7. Use a web browser to access the Ceph Dashboard GUI and edit the
mon_allow_pool_delete advanced configuration setting to true in the global
section.
7.2. In the Ceph Dashboard web UI, click Cluster → Configuration to display the
Configuration Settings page.
7.3. Select the advanced option from the Level menu to view advanced configuration
settings. Type mon_allow_pool_delete in the search bar to find the setting.
7.5. Edit the mon_allow_pool_delete setting, set the global value to true, and then
click Save.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe the purpose of cluster monitors and
the quorum procedures, query the monitor map, manage the configuration database, and describe
Cephx.
MONs form a quorum and elect a leader by using a variation of the Paxos algorithm, which achieves
consensus among a distributed set of computers.
• Leader: the first MON to obtain the most recent version of the cluster map.
• Provider: a MON that has the most recent version of the cluster map, but is not the leader.
• Requester: a MON that does not have the most recent version of the cluster map and must
synchronize with a provider before it can rejoin the quorum.
Synchronization always occurs when a new MON joins the cluster. Each MON periodically checks
whether a neighboring monitor has a more recent version of the cluster map. If a MON does not
have the most recent version of the cluster map, then it must synchronize and obtain it.
A majority of the MONs in a cluster must be running to establish a quorum. For example, if five
MONs are deployed, then three must be running to establish a quorum. Deploy at least three MON
nodes in your production Ceph cluster to ensure high availability. You can add or remove MONs in
a running cluster.
The cluster configuration file defines the MON host IP addresses and ports for the cluster to
operate. The mon_host setting can contain IP addresses or DNS names. The cephadm tool does
not update the cluster configuration file. Define a strategy to keep the cluster configuration files
synchronized across cluster nodes, such as with rsync.
[global]
mon_host = [v2:172.25.250.12:3300,v1:172.25.250.12:6789],
[v2:172.25.250.13:3300,v1:172.25.250.13:6789],
[v2:172.25.250.14:3300,v1:172.25.250.14:6789]
Important
Changing MON node IP addresses is not recommended after the cluster is
deployed and running.
Alternately, use the ceph quorum_status command. Add the -f json-pretty option to
create a more readable output.
You can also view the status of MONs in the Dashboard. In the Dashboard, click Cluster →
Monitors to view the status of the Monitor nodes and quorum.
The MON map contains the cluster fsid (File System ID), and the name, IP address, and network
port to communicate with each MON node. The fsid is a unique, auto-generated identifier (UUID)
that identifies the Ceph cluster.
The MON map also keeps map version information, such as the epoch and time of the last change.
MON nodes maintain the map by synchronizing changes and agreeing on the current version.
Use the ceph mon dump command to view the current MON map.
0: [v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0] mon.serverc
1: [v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0] mon.serverd
2: [v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0] mon.servere
dumped monmap epoch 4
The database might grow large over time. Run the ceph tell mon.$id compact command to
compact the database to improve performance. Alternately, set the mon_compact_on_start
configuration to true to compact the database on each daemon start:
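A representative form of that setting change, stored in the centralized configuration database:
[ceph: root@node /]# ceph config set mon mon_compact_on_start true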
Define threshold settings that trigger a change in health status based on the database size.
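For example, the mon_data_size_warn setting raises a health warning when a MON database exceeds the configured size; the value shown here is an illustrative 20 GiB expressed in bytes.
[ceph: root@node /]# ceph config set global mon_data_size_warn 21474836480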
Cluster Authentication
Ceph uses the Cephx protocol by default for cryptographic authentication between Ceph
components, using shared secret keys for authentication. Deploying the cluster with cephadm
enables Cephx by default. You can disable Cephx if needed, but it is not recommended because
it weakens cluster security. To enable or disable the Cephx protocol, use the ceph config set
command to update the relevant authentication settings.
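As a sketch, these are the settings involved; setting all three to cephx (the default) enforces authentication, and setting them to none disables it, which is not recommended.
[ceph: root@node /]# ceph config set global auth_cluster_required cephx
[ceph: root@node /]# ceph config set global auth_service_required cephx
[ceph: root@node /]# ceph config set global auth_client_required cephx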
The /etc/ceph directory and daemon data directories contain the Cephx key-ring files. For
MONs, the data directory is /var/lib/ceph/$fsid/mon.$host/.
Note
Key-ring files store the secret key as plain text. Secure them with appropriate Linux
file permissions.
Use the ceph auth command to create, view, and manage cluster keys. Use the ceph-
authtool command to create key-ring files.
The following command creates a key-ring file for the MON nodes.
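A representative invocation is shown below; the key-ring path and capabilities are illustrative values.
[ceph: root@node /]# ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring \
--gen-key -n mon. --cap mon 'allow *'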
The cephadm tool creates a client.admin user in the /etc/ceph directory, which allows you to
run administrative commands and to create other Ceph client user accounts.
References
For more information, refer to the Monitor Configuration Reference chapter in the
Red Hat Ceph Storage 5 Configuration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/configuration_guide/index
Guided Exercise
Outcomes
You should be able to view the cluster quorum settings, quorum status, and MON map.
You should be able to analyze cluster authentication settings and to compact the monitor
configuration database.
This command confirms that the required hosts for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Use the ceph status command to view the cluster quorum status.
services:
mon: 4 daemons, quorum serverc.lab.example.com,serverd,servere,clienta (age
1h)
...output omitted...
0: [v2:172.25.250.12:3300/0,v1:172.25.250.12:6789/0] mon.serverc.lab.example.com
1: [v2:172.25.250.13:3300/0,v1:172.25.250.13:6789/0] mon.serverd
2: [v2:172.25.250.14:3300/0,v1:172.25.250.14:6789/0] mon.servere
3: [v2:172.25.250.10:3300/0,v1:172.25.250.10:6789/0] mon.clienta
dumped monmap epoch 4
4. View the value of the mon_host setting for the serverc MON.
5. View the MON IP, port information, and quorum status in the MON stats.
6. Use the ceph auth ls command to view the cluster authentication settings.
Note
The admin key ring is stored by default in the /etc/ceph/
ceph.client.admin.keyring file.
8. Verify the space that the MON database uses on serverc. Set the option to compact the
MON database on start. Use ceph orch to restart the MON daemons and wait for the
cluster to reach a healthy state. Verify again the space of the MON database on serverc.
Note
Your cluster is expected to have a different size than in the examples.
8.1. Verify the space that the MON database uses on serverc. The name of the fsid
folder inside /var/lib/ceph can be different in your environment.
8.2. Set the option to compact the MON database on start. Use ceph orch to restart
the MON daemons, then wait for the cluster to reach a healthy state. This process can
take many seconds.
8.3. Exit the cephadm shell, then verify the current space of the MON database on
serverc.
Note
The MON database compact process can take a few minutes depending on the size
of the database. If the file size is bigger than before, then wait a few seconds until
the file is compacted.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe the purpose for each of the cluster
networks, and view and modify the network configuration.
Ceph clients make requests directly to OSDs over the cluster's public network. OSD replication
and recovery traffic uses the public network unless you configure a separate cluster network
for this purpose.
Configuring a separate cluster network might improve cluster performance by decreasing the
public network traffic load and separating client traffic from back-end OSD operations traffic.
Configure the nodes for a separate cluster network by performing the following steps.
• Configure the appropriate cluster network IP addresses on the new network interface on each
node.
• Use the --cluster-network option of the cephadm bootstrap command to create the
cluster network at the cluster bootstrap.
You can use a cluster configuration file to set public and cluster networks. You can configure
more than one subnet for each network, separated by commas. Use CIDR notation for the subnets
(for example, 172.25.250.0/24).
[global]
public_network = 172.25.250.0/24,172.25.251.0/24
cluster_network = 172.25.249.0/24
Important
If you configure multiple subnets for a network, those subnets must be able to route
to each other.
The public and cluster networks can be changed with the ceph config set command or
with the ceph config assimilate-conf command.
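A sketch of such a command, assuming the subnet shown in the [mon] section that follows:
[ceph: root@node /]# ceph config set mon public_network 172.25.252.0/24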
The example command is the equivalent of the following [mon] section in a cluster configuration
file.
[mon]
public_network = 172.25.252.0/24
Use the ceph orch daemon add command to manually deploy daemons to a specific subnet or
IP address.
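For example, a MON daemon can be added on a specific host and network with a command of this form; the host name and subnet are illustrative.
[ceph: root@node /]# ceph orch daemon add mon serverd:172.25.252.0/24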
Avoid using runtime ceph orch daemon commands for configuration changes. Instead, use
service specification files, which are the recommended method for managing Ceph clusters.
Running IPv6
The default value of the ms_bind_ipv4 setting is true for the cluster and the value of the
ms_bind_ipv6 setting is false. To bind Ceph daemons to IPv6 addresses, set ms_bind_ipv6
to true and set ms_bind_ipv4 to false in a cluster configuration file.
[global]
public_network = <IPv6 public-network/netmask>
cluster_network = <IPv6 cluster-network/netmask>
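A sketch of the corresponding bind settings in the same [global] section:
ms_bind_ipv6 = true
ms_bind_ipv4 = false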
Important
All nodes and networking devices in a communication path must have the same
MTU value. For bonded network interfaces, set the MTU value on the bonded
interface; the underlying interfaces inherit the same MTU value.
Separating back-end OSD traffic onto its own network might help to prevent data breaches over
the public network. To secure the back-end cluster network, ensure that traffic is not routed
between the cluster and public networks.
The following table lists the default ports for Red Hat Ceph Storage 5.
OSD 6800-7300/TCP Each OSD uses three ports in this range: one
for communicating with clients and MONs
over the public network; one for sending data
to other OSDs over a cluster network, or over
the public network if the former does not
exist; and another for exchanging heartbeat
packets over a cluster network or over the
public network if the former does not exist.
MONs always operate on the public network. To secure MON nodes with firewall rules, configure
the rules on the public interface and public network IP address. You can do this by manually adding
the MON ports to the firewall rules.
You can also secure MON nodes by adding the ceph-mon service to the firewall rules.
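For example, either of the following firewalld approaches works on a MON node; the zone is illustrative, and the MON ports are 3300/TCP (v2) and 6789/TCP (v1).
[root@node ~]# firewall-cmd --zone=public --add-port=3300/tcp --add-port=6789/tcp --permanent
[root@node ~]# firewall-cmd --zone=public --add-service=ceph-mon --permanent
[root@node ~]# firewall-cmd --reload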
To configure a cluster network, OSDs need rules for both the public and cluster networks.
Clients connect to OSDs by using the public network and OSDs communicate with each other
over the cluster network.
To secure OSD nodes with firewall rules, configure the rules on the appropriate network interfaces and
IP addresses.
You can also secure OSD nodes by adding the ceph service to the firewall rules.
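For example, on an OSD node, the ceph firewalld service covers the 6800-7300/TCP range; the zone is illustrative.
[root@node ~]# firewall-cmd --zone=public --add-service=ceph --permanent
[root@node ~]# firewall-cmd --reload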
References
For more information, refer to the Network Configuration Reference chapter in the
Red Hat Ceph Storage 5 Configuration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/configuration_guide/index#ceph-network-configuration
Guided Exercise
Outcomes
You should be able to configure public and cluster network settings and secure the cluster
with firewall rules.
This command confirms that the required hosts for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Use the ceph health command to view the health of the cluster.
3. View the configured public_network setting for the OSD and MON services.
4. Exit the cephadm shell. Create the osd-cluster-network.conf file and add a
public_network setting with the IPv4 network address value of 172.25.250.0/24 in
the [osd] section.
5. Use the cephadm shell with the --mount option to mount the osd-cluster-network.conf
file in the default location (/mnt). Use the ceph config assimilate-conf command
with the osd-cluster-network.conf file to apply the configuration. Verify that
cluster_network is defined for the service.
5.1. Use the cephadm shell with the --mount option to mount the osd-cluster-
network.conf file and verify the integrity of the file.
5.2. Use the ceph config assimilate-conf command with the osd-cluster-network.conf
file to apply the configuration. Verify that cluster_network is defined for the service.
Note
You must restart the cluster for this setting to take effect. Omit that step for this
exercise, to save time.
7. Log in to serverc as the admin user and switch to the root user. Configure a firewall rule
to secure the MON service on serverc.
9. Increase the MTU for the cluster network interface to support jumbo frames.
[root@serverc ~]# nmcli conn modify 'Wired connection 2' 802-3-ethernet.mtu 9000
[root@serverc ~]# nmcli conn down 'Wired connection 2'
Connection 'Wired connection 2' successfully deactivated (D-Bus active path: /org/
freedesktop/NetworkManager/ActiveConnection/10)
[root@serverc ~]# nmcli conn up 'Wired connection 2'
Connection successfully activated (D-Bus active path: /org/freedesktop/
NetworkManager/ActiveConnection/11)
[root@serverc ~]# ip link show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc fq_codel state UP mode
DEFAULT group default qlen 1000
link/ether 52:54:00:01:fa:0c brd ff:ff:ff:ff:ff:ff
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to configure cluster settings.
This command confirms that the required hosts for this exercise are accessible.
Instructions
Configure Ceph cluster settings using both the command line and Ceph Dashboard GUI. View
MON settings and configure firewall rules for MON and RGW nodes.
1. Configure your Red Hat Ceph Storage cluster settings. Set mon_data_avail_warn to 15
and mon_max_pg_per_osd to 400. These changes must persist across cluster restarts.
2. Configure the mon_data_avail_crit setting to 10 by using the Ceph Dashboard GUI.
3. Display the MON map and view the cluster quorum status.
4. Configure firewall rules for the MON and RGW nodes on serverd.
5. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade configure-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to configure cluster settings.
This command confirms that the required hosts for this exercise are accessible.
Instructions
Configure Ceph cluster settings using both the command line and Ceph Dashboard GUI. View
MON settings and configure firewall rules for MON and RGW nodes.
1. Configure your Red Hat Ceph Storage cluster settings. Set mon_data_avail_warn to 15
and mon_max_pg_per_osd to 400. These changes must persist across cluster restarts.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
Configure mon_data_avail_warn to 15 and mon_max_pg_per_osd to 400.
2.3. In the Ceph Dashboard web UI, click Cluster → Configuration to display the
Configuration Settings page.
2.4. Select the advanced option from the Level menu to view advanced configuration
settings. Type mon_data_avail_crit in the search bar.
2.7. Verify that a message indicates that the configuration option is updated.
3. Display the MON map and view the cluster quorum status.
4. Configure firewall rules for the MON and RGW nodes on serverd.
4.1. Exit the cephadm shell. Log in to serverd as the admin user and switch to the root
user.
Configure a firewall rule for the MON node on serverd.
Evaluation
Grade your work by running the lab grade configure-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• Most cluster configuration settings are stored in the cluster configuration database on the MON
nodes. The database is automatically synchronized across MONs.
• Certain configuration settings, such as cluster boot settings, can be stored in the cluster
configuration file. The default file name is ceph.conf. This file must be synchronized manually
between all cluster nodes.
• Most configuration settings can be modified when the cluster is running. You can change a
setting temporarily or make it persistent across daemon restarts.
• The MON map holds the MON cluster quorum information that can be viewed with ceph
commands or with the dashboard. You can configure MON settings to ensure high cluster
availability.
• Cephx provides cluster authentication via shared secret keys. The client.admin key ring is
required for administering the cluster.
• Cluster nodes operate across the public network. You can configure an additional cluster
network to separate OSD replication, heartbeat, backfill, and recovery traffic. Cluster
performance and security might be increased by configuring a cluster network.
Chapter 4
Creating Object Storage Cluster Components
Objectives
After completing this section, you should be able to describe OSD configuration scenarios and
create BlueStore OSDs using cephadm.
Introducing BlueStore
BlueStore replaced FileStore as the default storage back end for OSDs. FileStore stores objects as
files in a file system (Red Hat recommends XFS) on top of a block device. BlueStore stores objects
directly on raw block devices and eliminates the file-system layer, which improves read and write
operation speeds.
FileStore is deprecated. Continued use of FileStore in RHCS 5 requires a Red Hat support
exception. Newly created OSDs, whether by cluster growth or disk replacement, use BlueStore by
default.
BlueStore Architecture
Objects that are stored in a Ceph cluster have a cluster-wide unique identifier, binary object data,
and object metadata. BlueStore stores the object metadata in the block database. The block
database stores metadata as key-value pairs in a RocksDB database, which is a high-performing
key-value store.
The block database resides on a small BlueFS partition on the storage device. BlueFS is a minimal
file system that is designed to hold the RocksDB files.
BlueStore writes data to block storage devices by utilizing the write-ahead log (WAL). The write-
ahead log performs a journaling function and logs all transactions.
BlueStore Performance
FileStore writes to a journal and then writes from the journal to the block device. BlueStore avoids
this double-write performance penalty by writing data directly to the block device and logging
transactions to the write-ahead log simultaneously with a separate data stream. BlueStore write
operations are approximately twice as fast as FileStore with similar workloads.
When using a mix of different cluster storage devices, customize BlueStore OSDs to improve
performance. When you create a BlueStore OSD, the default is to place the data, block database,
and write-ahead log all on the same block device. Many of the performance advantages come
from the block database and the write-ahead log, so placing those components on separate,
faster devices might improve performance.
Note
You might improve performance by moving BlueStore devices if the new device
is faster than the primary storage device. For example, if object data is on HDD
devices, then improve performance by placing the block database on SSD devices
and the write-ahead log on NVMe devices.
Use service specification files to define the location of the BlueStore data, block database, and
write-ahead log devices. The following example specifies the BlueStore devices for an OSD
service.
service_type: osd
service_id: osd_example
placement:
host_pattern: '*'
data_devices:
paths:
- /dev/vda
db_devices:
paths:
- /dev/nvme0
wal_devices:
paths:
- /dev/nvme1
• Allows use of separate devices for the data, block database, and write-ahead log (WAL).
• Supports use of virtually any combination of HDD, SSD, and NVMe devices.
• Operates over raw devices or partitions, eliminating double writes to storage devices, with
increased metadata efficiency.
• Writes all data and metadata with checksums. All read operations are verified with their
corresponding checksums before returning to the client.
The following graphs show the performance of BlueStore versus the deprecated FileStore
architecture.
BlueStore runs in user space, manages its own cache and database, and can have a lower memory
footprint than FileStore. BlueStore uses RocksDB to store key-value metadata. BlueStore is self-
tuning by default, but you can manually tune BlueStore parameters if required.
The BlueStore partition writes data in chunks of the size of the bluestore_min_alloc_size
parameter. The default value is 4 KiB. If the data to write is less than the size of the chunk,
BlueStore fills the remaining space of the chunk with zeroes. It is a recommended practice to set
the parameter to the size of the smallest typical write on the raw partition.
performance. With sharding, these operations are independent from the used space level, allowing
a more precise compaction and minimizing the effect on OSD performance.
Note
Red Hat recommends that the configured space for RocksDB is at least 4% of the
data device size.
In Red Hat Ceph Storage 5, sharding is enabled by default for new OSDs. However, sharding is not
enabled in OSDs from clusters that were migrated from earlier versions.
Use ceph config get to verify whether sharding is enabled for an OSD and to view the current
definition.
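For example, the following query is a sketch that assumes the sharding definition is held in the bluestore_rocksdb_cfs option; the daemon ID is illustrative.
[ceph: root@node /]# ceph config get osd.1 bluestore_rocksdb_cfs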
The default values result in good performance in most Ceph use cases. The optimal sharding
definition for your production cluster depends on several factors. Red Hat recommends use of
default values unless you are faced with significant performance issues. In a production-upgraded
cluster, you might want to weigh the performance benefits against the maintenance effort to
enable sharding for RocksDB in a large environment.
You can use the BlueStore administrative tool, ceph-bluestore-tool, to reshard the RocksDB
database without reprovisioning OSDs. To reshard an OSD, stop the daemon and pass the new
sharding definition with the --sharding option. The --path option refers to the OSD data
location, which defaults to /var/lib/ceph/$fsid/osd.$ID/.
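A hedged sketch of a reshard invocation follows; the sharding string is illustrative only, and the fsid and OSD ID must match your cluster.
[ceph: root@node /]# ceph-bluestore-tool --path /var/lib/ceph/$fsid/osd.$ID/ \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard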
Use the ceph orch device ls command to list devices across the hosts in the cluster.
Devices with Yes in the Available column are candidates for OSD provisioning. To view
only in-use storage devices, use the ceph device ls command.
Note
Some devices might not be eligible for OSD provisioning. Use the --wide option to
view the details of why the cluster rejects the device.
To prepare a device for provisioning, use the ceph orch device zap command. This
command removes all partitions and purges the data on the device so that it can be used for
provisioning. Use the --force option to ensure the removal of any partition that a previous OSD
might have created.
[ceph: root@node /]# ceph orch device zap node /dev/vda --force
There are multiple ways to provision OSDs with cephadm. Consider the appropriate method
according to the wanted cluster behavior.
Orchestrator-Managed Provisioning
The Orchestrator service can discover available devices among cluster hosts, add the devices, and
create the OSD daemons. The Orchestrator handles placement of the new OSDs, balancing them
between the hosts, and also handles BlueStore device selection.
Use the ceph orch apply osd --all-available-devices command to provision all
available, unused devices.
This command creates an OSD service called osd.all-available-devices and enables the
Orchestrator service to manage all OSD provisioning. The Orchestrator automatically creates
OSDs from both new disk devices in the cluster and from existing devices that are prepared with
the ceph orch device zap command.
To disable the Orchestrator from automatically provisioning OSDs, set the unmanaged flag to
true.
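For example:
[ceph: root@node /]# ceph orch apply osd --all-available-devices --unmanaged=true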
Note
You can also update the unmanaged flag with a service specification file.
To stop an OSD daemon, use the ceph orch daemon stop command with the OSD ID.
To remove an OSD daemon, use the ceph orch daemon rm command with the OSD ID.
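For example, a stop followed by a removal might look like the following; the OSD ID is illustrative, and the --force option might be needed to remove the daemon.
[ceph: root@node /]# ceph orch daemon stop osd.4
[ceph: root@node /]# ceph orch daemon rm osd.4 --force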
The following is an example service specification YAML file that defines two OSD services, each
using different filters for placement and BlueStore device location.
service_type: osd
service_id: osd_size_and_model
placement:
host_pattern: '*'
data_devices:
size: '100G:'
db_devices:
model: My-Disk
wal_devices:
size: '10G:20G'
unmanaged: true
---
service_type: osd
service_id: osd_host_and_path
placement:
host_pattern: 'node[6-10]'
data_devices:
paths:
- /dev/sdb
db_devices:
paths:
- /dev/sdc
wal_devices:
paths:
- /dev/sdd
encrypted: true
The osd_size_and_model service specifies that any host can be used for placement and that the
service is managed by the storage administrator rather than the Orchestrator (unmanaged: true).
The data device must be 100 GB or larger, the write-ahead log device must be 10 to 20 GB, and the
database device must be of the My-Disk model.
The osd_host_and_path service specifies that OSDs are provisioned on nodes node6 through
node10 and that the service is managed by the Orchestrator service. The device paths for data,
database, and write-ahead log must be /dev/sdb, /dev/sdc, and /dev/sdd, respectively. The
devices in this service are encrypted.
Run the ceph orch apply command to apply the service specification.
Use the ceph-volume lvm command to manually create and delete BlueStore OSDs. The
following command creates a new BlueStore OSD on block storage device /dev/vdc:
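A representative form of this command, with the BlueStore back end selected explicitly:
[ceph: root@node /]# ceph-volume lvm create --bluestore --data /dev/vdc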
An alternative to the create subcommand is to use the ceph-volume lvm prepare and
ceph-volume lvm activate subcommands. With this method, OSDs are gradually introduced
into the cluster. You can control when the new OSDs are in the up or in state, so you can ensure
that large amounts of data are not unexpectedly rebalanced across OSDs.
The prepare subcommand configures logical volumes for the OSD to use. You can specify
a logical volume or a device name. If you specify a device name, then a logical volume is
automatically created.
The activate subcommand enables a systemd unit for the OSD so that it starts at boot time.
You need the OSD fsid (UUID) from the output of the ceph-volume lvm list command to
use the activate subcommand. Providing the unique identifier ensures that the correct OSD is
activated, because OSD IDs can be reused.
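For example, the gradual workflow might look like the following sketch; the device path, OSD ID, and fsid are illustrative values taken from the ceph-volume lvm list output.
[ceph: root@node /]# ceph-volume lvm prepare --bluestore --data /dev/vdc
[ceph: root@node /]# ceph-volume lvm list
[ceph: root@node /]# ceph-volume lvm activate 11 bd670137-08ba-4a1c-ae21-a42a06e0e4cf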
When the OSD is created, use the systemctl start ceph-osd@$id command to start the
OSD so it has the up and in state in the cluster.
The inventory subcommand provides information about all physical storage devices on a node.
References
For more information, refer to the Red Hat Ceph Storage 5 Architecture Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/architecture_guide/index
For more information, refer to the BlueStore chapter in the Red Hat Ceph Storage 5
Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/administration_guide/osd-bluestore
For more information, refer to the Advanced service specifications and filters for
deploying OSDs chapter in the Red Hat Ceph Storage 5 Operation Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/operations_guide/index#advanced-service-specifications-and-filters-for-
deploying-osds_ops
Guided Exercise
Outcomes
You should be able to create BlueStore OSDs and place the data, write-ahead log (WAL),
and metadata (DB) onto separate storage devices.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Verify the health of the cluster. View the current cluster size. View the current OSD tree.
3. List all active disk devices in the cluster. Use grep or awk to filter available devices from the
ceph orch device ls command.
3.2. Use grep or awk to filter available devices from the ceph orch device ls
command.
[ceph: root@clienta /]# ceph orch device ls | awk /server/ | grep Yes
serverc.lab.example.com /dev/vde hdd c63...82a-b 10.7G Unknown N/A N/A Yes
serverc.lab.example.com /dev/vdf hdd 6f2...b8e-9 10.7G Unknown N/A N/A Yes
serverd.lab.example.com /dev/vde hdd f84...6f0-8 10.7G Unknown N/A N/A Yes
serverd.lab.example.com /dev/vdf hdd 297...63c-9 10.7G Unknown N/A N/A Yes
servere.lab.example.com /dev/vde hdd 2aa...c03-b 10.7G Unknown N/A N/A Yes
servere.lab.example.com /dev/vdf hdd 41c...794-b 10.7G Unknown N/A N/A Yes
4. Create two OSD daemons by using the disk devices /dev/vde and /dev/vdf on
serverc.lab.example.com. Record the ID assigned to each OSD. Verify that the
daemons are running correctly. View the cluster storage size and the new OSD tree.
[ceph: root@clienta /]# ceph orch ps | grep -ie osd.9 -ie osd.10
osd.9 serverc.lab.example.com running (6m) 6m ago 6m ...
osd.10 serverc.lab.example.com running (6m) 6m ago 6m ...
5. Enable the orchestrator service to create OSD daemons automatically from the available
cluster devices. Verify the creation of the all-available-devices OSD service, and
the existence of new OSD daemons.
5.1. Enable the orchestrator service to create OSD daemons automatically from the
available cluster devices.
6. Stop the OSD daemon associated with the /dev/vde device on servere and remove it.
Verify the removal process ends correctly. Zap the /dev/vde device on servere. Verify
that the Orchestrator service then re-adds the OSD daemon correctly.
Note
It is expected that the OSD ID might be different in your lab
environment. Review the output of the ceph device ls | grep
'servere.lab.example.com:vde' and use the ID to perform the next steps.
6.1. Stop the OSD daemon associated with the /dev/vde device on
servere.lab.example.com and remove it.
6.3. Zap the /dev/vde device on servere. Verify that the Orchestrator service re-adds
the OSD daemon correctly.
7. View the OSD services in YAML format. Copy the definition corresponding to the all-
available-devices service. Create the all-available-devices.yaml file and add
the copied service definition.
8.1. Modify the all-available-devices.yaml file to add the unmanaged: true flag.
8.3. Verify that the osd.all-available-devices service now has the unmanaged
flag.
9. Stop the OSD daemon associated with the /dev/vdf device on serverd and remove it.
Verify that the removal process ends correctly. Zap the /dev/vdf device on serverd.
Verify that the Orchestrator service does not create a new OSD daemon from the cleaned
device.
Note
It is expected that the OSD ID might be different in your lab environment. Use ceph
device ls | grep 'serverd.lab.example.com:vdf' to obtain your OSD
ID.
9.1. Stop the OSD daemon associated with the /dev/vdf device on serverd and
remove it.
9.2. Verify that the removal was successful, then remove the OSD ID.
9.3. Verify that the orchestrator service does not create a new OSD daemon from the
cleaned device.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe and compare replicated and erasure
coded pools, and create and configure each pool type.
Understanding Pools
Pools are logical partitions for storing objects. Ceph clients write objects to pools.
Ceph clients require the cluster name (ceph by default) and a monitor address to connect to the
cluster. Ceph clients usually obtain this information from a Ceph configuration file, or by being
specified as command-line parameters.
A Ceph client uses the list of pools, retrieved with the cluster map, to determine where to store new
objects.
The Ceph client creates an input/output context to a specific pool and the Ceph cluster uses the
CRUSH algorithm to map these pools to placement groups, which are then mapped to specific
OSDs.
Pools provide a layer of resilience for the cluster because pools define the number of OSDs that
can fail without losing data.
Pool Types
The available pool types are replicated and erasure coded. You decide which pool type to
use based on your production use case and the type of workload.
The default pool type is replicated, which functions by copying each object to multiple OSDs.
This pool type requires more storage because it creates multiple copies of objects; however,
redundancy increases read operation availability.
Erasure coded pools require less storage and network bandwidth but use more CPU processing
time because of parity calculations.
Erasure coded pools are recommended for infrequently accessed data that does not require low
latency. Replicated pools are recommended for frequently accessed data that requires fast read
performance. The recovery time for each pool type can vary widely and is based on the cluster
deployment, failure, and sizing characteristics.
Pool Attributes
You must specify certain attributes when you create a pool:
• The pool type, which determines the protection mechanism the pool uses to ensure data
durability. The replicated type distributes multiple copies of each object across the cluster.
The erasure coded type splits each object into chunks, and distributes them along with
additional erasure coded chunks to protect objects using an automatic error correction
mechanism.
• The number of placement groups (PGs) in the pool, which store their objects in a set of
OSDs determined by the CRUSH algorithm.
• Optionally, a CRUSH rule set that Ceph uses to identify which placement groups to use to
store objects for the pool.
Note
Change the osd_pool_default_pg_num and osd_pool_default_pgp_num
configuration settings to set the default number of PGs for a pool.
[ceph: root@node /]# ceph osd pool create pool-name pg-num pgp-num
replicated crush-rule-name
Where:
• pg_num is the total configured number of placement groups (PGs) for this pool.
• pgp_num is the effective number of placement groups for this pool. Set this equal to pg_num.
• replicated specifies that this is a replicated pool, and is the default if not included in the
command.
• crush-rule-name is the name of the CRUSH rule set you want to use for this pool. The
osd_pool_default_crush_replicated_ruleset configuration parameter sets the
default value.
The number of placement groups in a pool can be adjusted after it is initially configured. If pg_num
and pgp_num are set to the same number, then any future adjustment to pg_num automatically
adjusts the value of pgp_num. The adjustment to pgp_num triggers the movement of PGs across
OSDs, if needed, to implement the change. Define a new number of PGs in a pool by using the
following command.
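The command takes this form; the pool name and placement group count are illustrative.
[ceph: root@node /]# ceph osd pool set mypool pg_num 32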
When you create a pool with the ceph osd pool create command, you do not specify the
number of replicas (size). The osd_pool_default_size configuration parameter defines the
number of replicas, and defaults to a value of 3.
Change the size of a pool with the ceph osd pool set pool-name size number-of-
replicas command. Alternatively, update the default setting of the osd_pool_default_size
configuration setting.
Objects stored in an erasure coded pool are divided into a number of data chunks, which are stored
on separate OSDs. A number of coding chunks is calculated from the data chunks and stored on
different OSDs. The coding chunks are used to reconstruct the object's data if an OSD
fails. In erasure coded pools, the primary OSD receives the write operation, encodes the payload
into k+m chunks, and sends them to the secondary OSDs.
Erasure coded pools use this method to protect their objects and, unlike replicated pools, do not
rely on storing multiple copies of each object.
Erasure coding uses storage capacity more efficiently than replication. Replicated pools maintain
n copies of an object, whereas erasure coding maintains only k + m chunks. For example, replicated
pools with 3 copies use 3 times the storage space. Erasure coded pools with k=4 and m=2 use only
1.5 times the storage space.
Note
Red Hat supports the following k+m values which result in the corresponding
usable-to-raw ratio:
The formula for calculating the usable capacity of an erasure coded pool is nOSD * k / (k+m) *
OSD Size. For example, if you have 64 OSDs of 4 TB each (256 TB of raw storage), with
k=8 and m=4, then the usable capacity is 64 * 8 / (8+4) * 4 = 170.67 TB. Divide the raw
storage capacity by the usable capacity to get the ratio: 256 TB / 170.67 TB equals a ratio
of 1.5.
Erasure coded pools require less storage than replicated pools to obtain a similar level of data
protection, which can reduce the cost and size of the storage cluster. However, calculating coding
chunks adds CPU processing and memory overhead for erasure coded pools, reducing overall
performance.
[ceph: root@node /]# ceph osd pool create pool-name pg-num pgp-num \
erasure erasure-code-profile crush-rule-name
Where:
• pg-num is the total number of placement groups (PGs) for this pool.
• pgp-num is the effective number of placement groups for this pool. Normally, this should be
equal to the total number of placement groups.
• erasure-code-profile is the name of the profile to use. You can create new profiles with
the ceph osd erasure-code-profile set command. A profile defines the k and m values
and the erasure code plug-in to use. By default, Ceph uses the default profile.
• crush-rule-name is the name of the CRUSH rule set to use for this pool. If not set, Ceph uses
the one defined in the erasure code profile.
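For example, the following sketch creates an erasure coded pool with the default profile (the pool name and PG counts are illustrative):
[ceph: root@node /]# ceph osd pool create myecpool 32 32 erasure
pool 'myecpool' created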
You can configure placement group autoscaling on a pool. Autoscaling allows the cluster
to calculate the number of placement groups and to choose appropriate pg_num values
automatically. Autoscaling is enabled by default in Red Hat Ceph Storage 5.
Every pool in the cluster has a pg_autoscale_mode option with a value of on, off, or warn.
This example enables the pg_autoscaler module on the Ceph MGR nodes and sets the
autoscaling mode to on for a pool:
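A sketch of those commands, assuming a pool named mypool:
[ceph: root@node /]# ceph mgr module enable pg_autoscaler
[ceph: root@node /]# ceph osd pool set mypool pg_autoscale_mode on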
Erasure coded pools cannot use the Object Map feature. An object map is an index of objects that
tracks where the blocks of an rbd object are allocated. Having an object map for a pool improves
the performance of resize, export, flatten, and other operations.
Create profiles to define different sets of erasure coding parameters. Ceph automatically creates
the default profile during installation. This profile is configured to divide objects into two data
chunks and one coding chunk.
k
The number of data chunks that are split across OSDs. The default value is 2.
m
The number of OSDs that can fail before the data becomes unavailable. The default value is 1.
directory
This optional parameter is the location of the plug-in library. The default value is /usr/
lib64/ceph/erasure-code.
plugin
This optional parameter defines the erasure coding algorithm to use.
crush-failure-domain
This optional parameter defines the CRUSH failure domain, which controls chunk placement.
By default, it is set to host, which ensures that an object's chunks are placed on OSDs on
different hosts. If set to osd, then an object's chunks can be placed on OSDs on the same
host. Setting the failure domain to osd is less resilient, because if a host fails, all of its OSDs
fail at the same time. You can define and use other failure domains to ensure that chunks are
placed on OSDs on hosts in different data center racks, or to match other aspects of your
physical layout.
crush-device-class
This optional parameter selects only OSDs backed by devices of this class for the pool. Typical
classes might include hdd, ssd, or nvme.
crush-root
This optional parameter sets the root node of the CRUSH rule set.
key=value
Plug-ins might have key-value parameters unique to that plug-in.
technique
Each plug-in provides a different set of techniques that implement different algorithms.
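For example, the following sketch creates a custom profile (the profile name and values are illustrative):
[ceph: root@node /]# ceph osd erasure-code-profile set myprofile k=4 m=2 \
crush-failure-domain=rack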
Important
You cannot modify or change the erasure code profile of an existing pool.
Use the ceph osd erasure-code-profile get command to view the details of an existing
profile.
• Rename a pool by using the ceph osd pool rename command. This does not affect the data
stored in the pool. If you rename a pool and you have per-pool capabilities for an authenticated
user, you must update the user's capabilities with the new pool name.
Warning
Deleting a pool removes all data in the pool and is not reversible. You must set
mon_allow_pool_delete to true to enable pool deletion.
• Prevent pool deletion for a specific pool by using the ceph osd pool set pool_name
nodelete true command. Set nodelete back to false to allow deletion of the pool.
• View and modify pool configuration settings by using the ceph osd pool set and ceph osd
pool get commands.
• List pools and pool configuration settings by using the ceph osd lspools and ceph osd
pool ls detail commands.
• List pools usage and performance statistics by using the ceph df and ceph osd pool
stats commands.
• Enable Ceph applications for a pool by using the ceph osd pool application enable
command. Application types are cephfs for Ceph File System, rbd for Ceph Block Device, and
rgw for RADOS Gateway.
• Set pool quotas to limit the maximum number of bytes or the maximum number of objects that
can be stored in the pool by using the ceph osd pool set-quota command.
Important
When a pool reaches the configured quota, operations are blocked. You can remove
a quota by setting its value to 0.
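For example, the following sketch sets and then removes an object quota (the pool name and value are illustrative):
[ceph: root@node /]# ceph osd pool set-quota mypool max_objects 10000
[ceph: root@node /]# ceph osd pool set-quota mypool max_objects 0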
Configure these example setting values to enable protection against pool reconfiguration:
osd_pool_default_flag_nodelete
Sets the default value of the nodelete flag on pools. Set the value to true to prevent pool
deletion.
osd_pool_default_flag_nopgchange
Sets the default value of the nopgchange flag on pools. Set the value to true to prevent
changes to pg_num and pgp_num.
osd_pool_default_flag_nosizechange
Sets the default value of the nosizechange flag on pools. Set the value to true to prevent
pool size changes.
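For example, a sketch that applies these defaults with the centralized configuration database:
[ceph: root@node /]# ceph config set global osd_pool_default_flag_nodelete true
[ceph: root@node /]# ceph config set global osd_pool_default_flag_nopgchange true
[ceph: root@node /]# ceph config set global osd_pool_default_flag_nosizechange true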
Pool Namespaces
A namespace is a logical group of objects in a pool. Access to a pool can be limited so that a user
can only store or retrieve objects in a particular namespace. One advantage of namespaces is to
restrict user access to part of a pool.
Namespaces are useful for restricting storage access by an application. They allow you to logically
partition a pool and restrict applications to specific namespaces inside the pool.
You could dedicate an entire pool to each application, but having more pools means more PGs
per OSD, and PGs are computationally expensive. This might degrade OSD performance as load
increases. With namespaces, you can keep the number of pools the same and not dedicate an
entire pool to each application.
Important
Namespaces are currently only supported for applications that directly use
librados. RBD and Ceph Object Gateway clients do not currently support this
feature.
To store an object inside a namespace, the client application must provide the pool and the
namespace names. By default, each pool contains a namespace with an empty name, known as
the default namespace.
Use the rados command to store and retrieve objects from a pool. Use the -p pool-name or
--pool=pool-name option to specify the pool, and the -N name or --namespace=name option
to specify the namespace to use.
The following example stores the /etc/services file as the srv object in the mytestpool
pool, under the system namespace.
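A sketch of that command:
[ceph: root@node /]# rados -p mytestpool -N system put srv /etc/services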
List all the objects in all namespaces in a pool by using the --all option. To obtain JSON
formatted output, add the --format=json-pretty option.
The following example lists the objects in the mytestpool pool. The mytest object has an empty
namespace. The other objects belong to the system or the flowers namespaces.
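A sketch of the listing command; the JSON output that follows is truncated:
[ceph: root@node /]# rados -p mytestpool --all ls --format=json-pretty
...output omitted...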
"name": "rose",
"namespace": "flowers"
},
{
"name": "mytest",
"namespace": ""
},
{
"name": "networks",
"namespace": "system"
}
]
References
For more information, refer to the Pools and Erasure Code Pools chapters in the
Red Hat Ceph Storage 5 Storage Strategies Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/storage_strategies_guide
Guided Exercise
Outcomes
You should be able to create, delete, and rename pools as well as view and configure pool
settings.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
3. Verify that PG autoscaling is enabled for the replpool1 pool and that it is the default for
new pools.
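One way to perform this verification (a sketch; the option names follow the settings described earlier in this chapter):
[ceph: root@clienta /]# ceph osd pool get replpool1 pg_autoscale_mode
[ceph: root@clienta /]# ceph config get mon osd_pool_default_pg_autoscale_mode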
4. List the pools, verify the existence of the replpool1 pool, and view the autoscale status for
the pools.
4.1. List the pools and verify the existence of the replpool1 pool.
5. Set the number of replicas for the replpool1 pool to 4. Set the minimum number of
replicas required for I/O to 2, allowing up to two OSDs to fail without losing data. Set the
application type for the pool to rbd. Use the ceph osd pool ls detail command to
verify the pool configuration settings. Use the ceph osd pool get command to get the
value of a specific setting.
5.1. Set the number of replicas for the replpool1 pool to 4. Set the minimum number of
replicas required for I/O to two.
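A sketch of those commands:
[ceph: root@clienta /]# ceph osd pool set replpool1 size 4
[ceph: root@clienta /]# ceph osd pool set replpool1 min_size 2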
[ceph: root@clienta /]# ceph osd pool application enable replpool1 rbd
enabled application 'rbd' on pool 'replpool1'
5.3. Use the ceph osd pool ls detail command to verify the pool configuration
settings.
Note
The pool uses CRUSH rule 0. Configuring CRUSH rules and pool CRUSH rules is
covered in a later chapter.
[ceph: root@clienta /]# ceph tell mon.* config set mon_allow_pool_delete true
mon.serverc.lab.example.com: {
"success": ""
}
mon.serverd: {
"success": ""
}
mon.servere: {
"success": ""
}
mon.clienta: {
"success": ""
}
Important
When you rename a pool, you must update any associated user authentication
settings with the new pool name. User authentication and capabilities are covered in
a later chapter.
7. List the existing erasure coded profiles and view the details of the default profile. Create
an erasure code profile called ecprofile-k4-m2 with k=4 and m=2 values. These values
allow the simultaneous loss of two OSDs without losing any data and meet the minimum
requirement for Red Hat support.
7.3. Create an erasure code profile called ecprofile-k4-m2 with k=4 and m=2 values.
[ceph: root@clienta /]# ceph osd erasure-code-profile set ecprofile-k4-m2 k=4 m=2
8. Create an erasure coded pool called ecpool1 using the ecprofile-k4-m2 profile with
64 placement groups and an rgw application type. View the details of the ecpool1 pool.
Configure the ecpool1 pool to allow partial overwrites so that RBD and CephFS can use it.
Delete the ecpool1.
8.1. Create an erasure coded pool called ecpool1 by using the ecprofile-k4-m2
profile with 64 placement groups and set the application type to rgw.
[ceph: root@clienta /]# ceph osd pool create ecpool1 64 64 erasure ecprofile-k4-m2
pool 'ecpool1' created
[ceph: root@clienta /]# ceph osd pool application enable ecpool1 rgw
enabled application 'rgw' on pool 'ecpool1'
8.2. View the details of the ecpool1 pool. Your pool ID is expected to be different.
8.3. Configure the ecpool1 pool to allow partial overwrites so that RBD and CephFS can
use it.
[ceph: root@clienta /]# ceph osd pool set ecpool1 allow_ec_overwrites true
set pool 7 allow_ec_overwrites to true
10.2. Log in as admin by using redhat as the password. You should see the Dashboard
page.
10.4. Enter replpool1 in the Name field, replicated in the Pool type field, on in the
PG Autoscale field, and 3 in the Replicated size field. Leave the other values as default.
Click Create Pool.
11.2. Enter ecpool1 in the Name field, erasure in the Pool type field, off in the PG
Autoscale field, and 64 in the Placement groups field. Check the EC Overwrites box
in the Flags section, and select the ecprofile-k4-m2 profile from the Erasure
code profile field. Click Create Pool.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe Cephx and configure user
authentication and authorization for Ceph clients.
Authenticating Users
Red Hat Ceph Storage uses the cephx protocol to authorize communication between clients,
applications, and daemons in the cluster. The cephx protocol is based on shared secret keys.
The installation process enables cephx by default, so that the cluster requires user authentication
and authorization by all client applications.
Accounts used by Ceph daemons have names that match their associated daemon, such as
osd.1 or mgr.serverc, and are created during the installation.
Accounts used by client applications that use librados have names with the client. prefix.
For example, when integrating OpenStack with Ceph, it is common to create a dedicated
client.openstack user account. For the Ceph Object Gateway, the installation creates a
dedicated client.rgw.hostname user account. Developers creating custom software on top of
librados should create dedicated accounts with appropriate capabilities.
Administrator account names also have the client. prefix. These accounts are used when
running commands such as ceph and rados. The installer creates the superuser account,
client.admin, with capabilities that allow the account to access everything and to modify the
cluster configuration. Ceph uses the client.admin account to run administrative commands,
unless you explicitly specify a user name with the --name or --id options.
You can set the CEPH_ARGS environment variable to define parameters such as the cluster name
or the ID of the user.
End users of a Ceph-aware application do not have an account on the Ceph cluster. Rather, they
access the application, which then accesses Ceph on their behalf. From the Ceph point of view, the
application is the client. The application might provide its own user authentication through other
mechanisms.
Figure 4.12 provides an overview of how an application can provide its own user authentication.
The Ceph Object Gateway has its own user database to authenticate Amazon S3 and Swift users,
but uses the client.rgw.hostname account to access the cluster.
On these client systems, librados uses the keyring parameter from the /etc/ceph/
ceph.conf configuration file to locate the key-ring file. Its default value is /etc/ceph/
$cluster.$name.keyring. For example, for the client.openstack account, the key-ring
file is /etc/ceph/ceph.client.openstack.keyring.
The key-ring file stores the secret key as plain text. Protect the file with appropriate Linux file
permissions for access only by authorized Linux users. Only deploy a Ceph user's key-ring file on
systems that need it for authentication.
To authenticate, a client uses its secret key to request a ticket from a Monitor, and then uses that
ticket to authenticate to cluster daemons. This is similar to the Kerberos protocol, with a cephx
key-ring file being comparable to a Kerberos keytab file.
A more detailed discussion of the protocol is available from the upstream Ceph project's
documentation at High Availability Authentication [https://docs.ceph.com/docs/master/
architecture/#high-availability-authentication].
In this example, the ceph command authenticates as client.operator3 to list the available
pools:
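Such a command might look like the following sketch:
[ceph: root@node /]# ceph --id operator3 osd lspools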
Important
Do not include the client. prefix when using the --id option. The --id option
automatically assumes the client. prefix. Alternatively, the --name option
requires the client. prefix.
If you store the key-ring file in its default location, you do not need the --keyring option. The
cephadm shell automatically mounts the key-ring from the /etc/ceph/ directory.
Use capabilities to restrict or provide access to data in a pool, a pool's namespace, or a set of
pools based on application tags. Capabilities also allow the daemons in the cluster to interact with
each other.
Cephx Capabilities
Within cephx, for each daemon type, several capabilities are available:
• r grants read access. Each user account should have at least read access on the Monitors to be
able to retrieve the CRUSH map.
• w grants write access. Clients need write access to store and modify objects on OSDs. For
Managers (MGRs), w grants the right to enable or disable modules.
• x grants authorization to execute extended object classes. This allows clients to perform extra
operations on objects such as setting locks with rados lock get or listing RBD images with
rbd list.
• class-read and class-write are subsets of x. You typically use them on RBD pools.
This example creates the formyapp1 user account, and grants the capability to store and retrieve
objects from any pool:
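A sketch of that command:
[ceph: root@node /]# ceph auth get-or-create client.formyapp1 mon 'allow r' osd 'allow rw'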
This example utilizes the rbd profile to define the access rights for the new forrbd user account.
A client application can use this account for block-based access to Ceph storage using a RADOS
Block Device.
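One possible form of that command:
[ceph: root@node /]# ceph auth get-or-create client.forrbd mon 'profile rbd' osd 'profile rbd'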
The rbd-read-only profile works the same way but grants read-only access. Ceph utilizes
other existing profiles for internal communication between daemons. You cannot create your own
profiles, because Ceph defines them internally.
r
Gives the user read access. Required with monitors to retrieve the CRUSH map.
x
Gives the user the capability to call class methods (that is, both read and write) and to conduct authentication operations on monitors.
class-read
Gives the user the capability to call class read methods. Subset of x.
class-write
Gives the user the capability to call class write methods. Subset of x.
*
Gives the user read, write, and execute permissions for a particular daemon or pool, and the ability to execute admin commands.
profile osd
Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting.
profile rbd
Gives a user read-write access to the Ceph Block Devices.
profile rbd-read-only
Gives a user read-only access to the Ceph Block Devices.
Restricting Access
Restrict user OSD permissions such that users can only access the pools they need. This example
creates the formyapp2 user and limits their access to read and write on the myapp pool:
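A sketch of that command:
[ceph: root@node /]# ceph auth get-or-create client.formyapp2 mon 'allow r' \
osd 'allow rw pool=myapp'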
If you do not specify a pool when you configure capabilities, then Ceph sets them on all existing
pools.
• By object name prefix. The following example restricts access to only those objects whose
names start with pref in any pool.
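A sketch, using a hypothetical client.formyapp3 account name:
[ceph: root@node /]# ceph auth get-or-create client.formyapp3 mon 'allow r' \
osd 'allow rw object_prefix pref'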
• By namespace. Implement namespaces to logically group objects within a pool. You can then
restrict user accounts to objects belonging to a specific namespace:
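A sketch, using hypothetical account and namespace names:
[ceph: root@node /]# ceph auth get-or-create client.formyapp4 mon 'allow r' \
osd 'allow rw namespace=photos'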
• By path. The Ceph File System (CephFS) utilizes this method to restrict access to specific
directories. This example creates a new user account, webdesigner, that can access only the /
webcontent directory and its contents:
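A sketch of that command, assuming a file system named cephfs:
[ceph: root@node /]# ceph fs authorize cephfs client.webdesigner /webcontent rw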
• By Monitor command. This method restricts administrators to a specific list of commands. The
following example creates the operator1 user account and limits its access to two commands:
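A sketch of that command; the two allowed commands shown are illustrative:
[ceph: root@node /]# ceph auth get-or-create client.operator1 \
mon 'allow r, allow command "auth get-or-create", allow command "auth list"'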
User Management
To list existing user accounts, run the ceph auth list command:
To get the details of a specific account, use the ceph auth get command:
To export and import user accounts, run the ceph auth export and ceph auth import
commands:
This example creates the app1 user account with read and write access to all pools, and stores the
key-ring file in /etc/ceph/ceph.client.app1.keyring:
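One possible form of that command:
[ceph: root@node /]# ceph auth get-or-create client.app1 mon 'allow r' osd 'allow rw' \
-o /etc/ceph/ceph.client.app1.keyring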
Authentication requires the key-ring file, so you must copy the file to all client systems that
operate with this new user account.
This example modifies the app1 user account capabilities on OSDs to allow only read and write
access to the myapp pool:
[ceph: root@node /]# ceph auth caps client.app1 mon 'allow r' \
osd 'allow rw pool=myapp'
updated caps for client.app1
The ceph auth caps command overwrites existing capabilities. When using this command, you
must specify the full set of capabilities for all daemons, not only those you want to modify. Define
an empty string to remove all capabilities.
References
For more information, refer to the Ceph User Management chapter in the Red Hat
Ceph Storage 5 Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/administration_guide
Guided Exercise
Outcomes
You should be able to configure user authentication and capabilities to store and retrieve
objects in the cluster.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and switch to the root user.
2. Configure two users for an application with the following capabilities. The first user,
client.docedit, stores and retrieves documents in the docs namespace of the
replpool1 pool. The second user, client.docget, only retrieves documents from the
replpool1 pool.
Note
The tee command saves the output of the command, instead of using the -o
option. This technique is used because the cephadm container does not retain
standard output files after the command exits.
2.1. Use the cephadm shell to create the client.docedit user with read and write
capabilities in the docs namespace within the replpool1 pool. Save the associated
key-ring file by using the appropriate directory and file name: /etc/ceph/
ceph.client.docedit.keyring
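A sketch of that command; the capabilities match the auth entries shown in the verification output below:
[root@clienta ~]# cephadm shell -- ceph auth get-or-create client.docedit \
mon 'allow r' osd 'allow rw pool=replpool1 namespace=docs' | \
tee /etc/ceph/ceph.client.docedit.keyring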
2.2. Use the cephadm shell to create the client.docget user with read capabilities
in the docs namespace within the replpool1 pool. Save the associated
key-ring file using the appropriate directory and file name: /etc/ceph/
ceph.client.docget.keyring
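A similar sketch for the second user:
[root@clienta ~]# cephadm shell -- ceph auth get-or-create client.docget \
mon 'allow r' osd 'allow r pool=replpool1 namespace=docs' | \
tee /etc/ceph/ceph.client.docget.keyring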
[root@clienta ~]$ cephadm shell -- ceph auth ls | grep -A3 -ie docedit \
-ie docget
installed auth entries:
client.docedit
key: AQARyFNhUVqjLxAAvD/00leu3V93+e9umSTBKQ==
caps: [mon] allow r
caps: [osd] allow rw pool=replpool1 namespace=docs
client.docget
key: AQDByFNhac58MxAA/ukJXL52cpsQLw65zZ+WcQ==
caps: [mon] allow r
caps: [osd] allow r pool=replpool1 namespace=docs
installed auth entries:
3. Your application is running on serverd. Copy the users' key-ring files to that server to
allow the application to authenticate with the cluster.
4. Use the cephadm shell with the --mount option to mount the /etc/ceph directory. Store
and retrieve an object to verify that the key-rings are working correctly. The two files should
be identical as verified by the diff command showing no output.
5. Your application evolves over time and now the client.docget user also needs write
access to the docs namespace within the replpool1 pool. This user also needs to store
documents in the docarchive pool.
Confirm that the client.docget user cannot store objects yet in the docs namespace
within the replpool1 pool:
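A sketch of a command that should fail with a permission error:
[ceph: root@clienta /]# rados --id docget -p replpool1 -N docs put mywritetest /etc/hosts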
6. Grant the client.docget user rw capabilities on the docs namespace within the
replpool1 pool, and rw capabilities on the not-yet-created docarchive pool. Confirm
that the client.docget user can now store objects in the docs namespace.
[ceph: root@clienta /]# ceph auth caps client.docget mon 'allow r' \
osd 'allow rw pool=replpool1 namespace=docs, allow rw pool=docarchive'
updated caps for client.docget
[ceph: root@clienta /]# rados --id docget -p replpool1 -N docs put \
mywritetest /etc/hosts
You must define the total user capabilities with the ceph auth caps command because
it overwrites previous definitions. You can define capabilities on pools that do not exist yet,
such as the docarchive pool.
7. Exit the cephadm shell and clean up by deleting the client.docedit and the
client.docget users. Remove the associated key-ring files.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to create and configure BlueStore OSDs and pools, and set up
authentication to the cluster.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user. Create a new OSD daemon by using the /dev/vde
device on serverc. View the details of the OSD. Restart the OSD daemon and verify it
starts correctly.
2. Create a replicated pool called labpool1 with 64 PGs. Set the number of replicas to 3. Set
the application type to rbd. Set the pg_autoscale_mode to on for the pool.
3. Create an erasure code profile called k8m4 with data chunks on 8 OSDs (k=8), able to
sustain the loss of 4 OSDs (m=4), and set crush-failure-domain=rack. Create an
erasure coded pool called labpool2 with 64 PGs that uses the k8m4 profile.
4. Create the client.rwpool user account with the capabilities to read and write objects in
the labpool1 pool. This user must not be able to access the labpool2 pool in any way.
Create the client.rpool user account with the capability to only read objects with names
containing an rgb_ prefix from the labpool1 pool.
Store the key-ring files for these two accounts in the correct location on clienta.
Store the /etc/profile file as the my_profile object in the labpool1 pool.
5. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade component-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to create and configure BlueStore OSDs and pools, and set up
authentication to the cluster.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user. Create a new OSD daemon by using the /dev/vde
device on serverc. View the details of the OSD. Restart the OSD daemon and verify it
starts correctly.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
1.2. Create a new OSD daemon by using the /dev/vde device on serverc.
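A sketch of that command; the orchestrator takes a host:device argument:
[ceph: root@clienta /]# ceph orch daemon add osd serverc.lab.example.com:/dev/vde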
1.3. View the details of the OSD. The OSD ID might be different in your lab environment.
},
{
"type": "v1",
"addr": "172.25.250.12:6817",
"nonce": 2214147187
}
]
},
"osd_fsid": "eae3b333-24f3-46fb-83a5-b1de2559166b",
"host": "serverc.lab.example.com",
"crush_location": {
"host": "serverc",
"root": "default"
}
}
2. Create a replicated pool called labpool1 with 64 PGs. Set the number of replicas to 3. Set
the application type to rbd. Set the pg_autoscale_mode to on for the pool.
2.2. Set the number of replicas to 3. The pool ID might be different in your lab environment.
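A sketch of that command; its output reports the pool ID:
[ceph: root@clienta /]# ceph osd pool set labpool1 size 3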
[ceph: root@clienta /]# ceph osd pool application enable labpool1 rbd
enabled application 'rbd' on pool 'labpool1'
3. Create an erasure code profile called k8m4 with data chunks on 8 OSDs (k=8), able to
sustain the loss of 4 OSDs (m=4), and set crush-failure-domain=rack. Create an
erasure coded pool called labpool2 with 64 PGs that uses the k8m4 profile.
3.1. Create an erasure code profile called k8m4 with data chunks on 8 OSDs (k=8), able to
sustain the loss of 4 OSDs (m=4), and set crush-failure-domain=rack.
[ceph: root@clienta /]# ceph osd erasure-code-profile set k8m4 k=8 m=4 \
crush-failure-domain=rack
[ceph: root@clienta /]#
3.2. Create an erasure coded pool called labpool2 with 64 PGs that use the k8m4 profile.
[ceph: root@clienta /]# ceph osd pool create labpool2 64 64 erasure k8m4
pool 'labpool2' created
4. Create the client.rwpool user account with the capabilities to read and write objects in
the labpool1 pool. This user must not be able to access the labpool2 pool in any way.
Create the client.rpool user account with the capability to only read objects with names
containing an rgb_ prefix from the labpool1 pool.
Store the key-ring files for these two accounts in the correct location on clienta.
Store the /etc/profile file as the my_profile object in the labpool1 pool.
4.1. Exit the cephadm shell, then interactively use cephadm shell to create the two
accounts from the clienta host system. Create the client.rwpool user account
with read and write access to the labpool1 pool.
4.2. Create the client.rpool user account with read access to objects with names
containing an rgb_ prefix in the labpool1 pool. Note that there is no equals sign (=)
between object_prefix and its value.
4.3. Use sudo to run a new cephadm shell with a bind mount from the host. Use the
rados command to store the /etc/profile file as the my_profile object in
the labpool1 pool. Use the client.rwpool user account rather than the default
client.admin account to test the access rights you defined for the user.
4.4. Verify that the client.rpool user can retrieve the my_profile object from the
labpool1 pool.
Evaluation
Grade your work by running the lab grade component-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• BlueStore is the default storage back end for Red Hat Ceph Storage 5. It stores objects
directly on raw block devices and improves performance over the previous FileStore back
end.
• BlueStore OSDs use a RocksDB key-value database to manage metadata and store it on a
BlueFS partition. Red Hat Ceph Storage 5 uses sharding by default for new OSDs.
• Block.db stores object metadata and the write-ahead log (WAL) stores journals. You can
improve OSD performance by placing the block.db and WAL devices on faster storage than
the object data.
• You can provision OSDs by using service specification files, by choosing a specific host and
device, or automatically with the orchestrator service.
• Pools are logical partitions for storing objects. The available pool types are replicated and
erasure coded.
• Replicated pools are the default type of pool; they copy each object to multiple OSDs.
• Erasure coded pools function by dividing object data into chunks (k), calculating coding
chunks (m) based on the data chunks, then storing each chunk on separate OSDs. The coding
chunks are used to reconstruct object data if an OSD fails.
• A pool namespace allows you to logically partition a pool and is useful for restricting storage
access by an application.
• The cephx protocol authenticates clients and authorizes communication between clients,
applications, and daemons in the cluster. It is based on shared secret keys.
• Clients can access the cluster when they are configured with a user account name and a key-
ring file containing the user's secret key.
• Cephx capabilities provide a way to control access to pools and object data within pools.
Chapter 5
Creating and Customizing Storage Maps
Objectives
After completing this section, you should be able to administer and update the cluster CRUSH
map used by the Ceph cluster.
The CRUSH algorithm works to uniformly distribute the data in the object store, manage
replication, and respond to system growth and hardware failures. When new OSDs are added or an
existing OSD or OSD host fails, Ceph uses CRUSH to rebalance the objects in the cluster among
the active OSDs.
A CRUSH hierarchy
This lists all available OSDs and organizes them into a treelike structure of buckets.
The CRUSH hierarchy is often used to represent where OSDs are located. By default, there
is a root bucket representing the whole hierarchy, which contains a host bucket for each OSD
host.
The OSDs are the leaves of the tree, and by default all OSDs on the same OSD host are
placed in that host's bucket. You can customize the tree structure to rearrange it, add more
levels, and group OSD hosts into buckets representing their location in different server racks
or data centers.
To summarize, buckets are the containers or branches in the CRUSH hierarchy. Devices are OSDs,
and are leaves in the CRUSH hierarchy.
• The ID of the bucket. These IDs are negative numbers to distinguish them from storage device
IDs.
• The type of the bucket. The default map defines several types that you can retrieve with the
ceph osd crush dump command.
Bucket types include root, region, datacenter, room, pod, pdu, row, rack, chassis, and
host, but you can also add your own types. The bucket at the root of the hierarchy is of the
root type.
• The algorithm that Ceph uses to select items inside the bucket when mapping PG replicas to
OSDs. Several algorithms are available: uniform, list, tree, and straw2. Each algorithm
represents a trade-off between performance and reorganization efficiency. The default
algorithm is straw2.
Configuring the CRUSH map and creating separate failure domains allows OSDs and cluster nodes
to fail without any data loss occurring. The cluster simply operates in a degraded state until the
problem is fixed.
Configuring the CRUSH map and creating separate performance domains can reduce
performance bottlenecks for clients and applications that use the cluster to store and retrieve
data. For example, CRUSH can create one hierarchy for HDDs and another hierarchy for SSDs.
A typical use case for customizing the CRUSH map is to provide additional protection against
hardware failures. You can configure the CRUSH map to match the underlying physical
infrastructure, which helps mitigate the impact of hardware failures.
By default, the CRUSH algorithm places replicated objects on OSDs on different hosts. You can
customize the CRUSH map so that object replicas are placed across OSDs in different shelves, or
on hosts in different rooms, or in different racks with distinct power sources.
Another use case is to allocate OSDs with SSD drives to pools used by applications requiring very
fast storage, and OSDs with traditional HDDs to pools supporting less demanding workloads.
The CRUSH map can contain multiple hierarchies that you can select through different CRUSH
rules. By using separate CRUSH hierarchies, you can establish separate performance domains. Use
case examples for configuring separate performance domains are:
• To separate block storage used by VMs from object storage used by applications.
• To separate "cold" storage, containing infrequently accessed data, from "hot" storage,
containing frequently accessed data.
• A list of all the infrastructure buckets and the IDs of the storage devices or other buckets in
each of them. Remember that a bucket is a container, or a branch, in the infrastructure tree. For
example, it might represent a location or a piece of physical hardware.
The cluster installation process deploys a default CRUSH map. You can use the ceph osd crush
dump command to print the CRUSH map in JSON format. You can also export a binary copy of the
map and decompile it into a text file:
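A sketch of those commands (file names are illustrative):
[ceph: root@node /]# ceph osd getcrushmap -o ./map.bin
[ceph: root@node /]# crushtool -d ./map.bin -o ./map.txt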
• The weight of the storage device, normally based on its capacity in terabytes. For example, a
4 TB storage device has a weight of about 4.0. This is the relative amount of data the device can
store, which the CRUSH algorithm uses to help ensure uniform object distribution.
You can set the weight of an OSD with the ceph osd crush reweight command. CRUSH
tree bucket weights should equal the sum of their leaf weights. If you manually edit the CRUSH
map weights, then you should execute the following command to ensure that the CRUSH tree
bucket weights accurately reflect the sum of the leaf OSDs within the bucket.
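A sketch of that command:
[ceph: root@node /]# ceph osd crush reweight-all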
• The class of the storage device. Multiple types of storage devices can be used in a storage
cluster, such as HDDs, SSDs, or NVMe SSDs. A storage device's class reflects this information,
and you can use it to create pools optimized for different application workloads. OSDs
automatically detect and set their device class. You can explicitly set the device class of an OSD
with the ceph osd crush set-device-class command. Use the ceph osd crush
rm-device-class command to remove a device class from an OSD.
The ceph osd crush tree command shows the CRUSH map's current CRUSH hierarchy:
Device classes are implemented by creating a “shadow” CRUSH hierarchy for each device class in
use that contains only devices of that class. CRUSH rules can then distribute data over the shadow
hierarchy. You can view the CRUSH hierarchy with shadow items with the ceph osd crush
tree --show-shadow command.
Create a new device class by using the ceph osd crush class create command. Remove a
device class using the ceph osd crush class rm command.
List configured device classes with the ceph osd crush class ls command.
The ceph osd crush rule ls command lists the existing rules and the ceph osd crush
rule dump rule_name command prints the details of a rule.
The decompiled CRUSH map also contains the rules and might be easier to read:
The name of the rule. Use this name to select the rule when creating a pool with the ceph
osd pool create command.
The ID of the rule. Some commands use the rule ID instead of the rule name. For example, the
ceph osd pool set pool-name crush_ruleset ID command, which sets the rule for
an existing pool, uses the rule ID.
If a pool makes fewer replicas than this number, then CRUSH does not select this rule.
If a pool makes more replicas than this number, then CRUSH does not select this rule.
Takes a bucket name, and begins iterating down the tree. In this example, the iterations
start at the bucket called default, which is the root of the default CRUSH hierarchy. With
a complex hierarchy composed of multiple data centers, you could create a rule for a data
center designed to force objects in specific pools to be stored in OSDs in that data center. In
that situation, this step could start iterating at the data center bucket.
Selects a set of buckets of the given type (host) and chooses a leaf (OSD) from the subtree
of each bucket in the set. In this example, the rule selects an OSD from each host bucket in
the set, ensuring that the OSDs come from different hosts. The number of buckets in the set
is usually the same as the number of replicas in the pool (the pool size):
• If the number after firstn is 0, choose as many buckets as there are replicas in the pool.
• If the number is greater than zero, and less than the number of replicas in the pool, choose
that many buckets. In that case, the rule needs another step to draw buckets for the
remaining replicas. You can use this mechanism to force the location of a subset of the
object replicas.
• If the number is less than zero, subtract its absolute value from the number of replicas and
choose that many buckets.
For example, you could create the following rule to select as many OSDs as needed on separate
racks, but only from the DC1 data center:
rule myrackruleinDC1 {
id 2
type replicated
min_size 1
max_size 10
step take DC1
step chooseleaf firstn 0 type rack
step emit
}
Important
Adjusting CRUSH tunables will probably change how CRUSH maps placement
groups to OSDs. When that happens, the cluster needs to move objects to different
OSDs in the cluster to reflect the recalculated mappings. Cluster performance could
degrade during this process.
Rather than modifying individual tunables, you can select a predefined profile with the ceph osd
crush tunables profile command. Set the value of profile to optimal to enable the best
(optimal) values for the current version of Red Hat Ceph Storage.
Important
Red Hat recommends that all cluster daemons and clients use the same release
version.
It is usually easier to update the CRUSH map with the ceph osd crush command. However,
there are less common scenarios which can only be implemented by using the second method.
For example, these commands create three new buckets, one of the datacenter type and two of
the rack type:
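A sketch of those commands; the bucket names are illustrative and match the examples later in this section:
[ceph: root@node /]# ceph osd crush add-bucket DC1 datacenter
[ceph: root@node /]# ceph osd crush add-bucket rackA1 rack
[ceph: root@node /]# ceph osd crush add-bucket rackB1 rack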
You can then organize the new buckets in a hierarchy with the ceph osd crush move command.
You can also use this command to reorganize the tree. For example, the following commands attach
the two rack buckets from the previous example to the data center bucket, and attach the data
center bucket to the default root bucket:
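A sketch of those commands, continuing with the same bucket names:
[ceph: root@node /]# ceph osd crush move rackA1 datacenter=DC1
[ceph: root@node /]# ceph osd crush move rackB1 datacenter=DC1
[ceph: root@node /]# ceph osd crush move DC1 root=default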
When Ceph starts, it uses the ceph-crush-location utility to automatically verify that each
OSD is in the correct CRUSH location. If the OSD is not in the expected location in the CRUSH
map, it is automatically moved. By default, this is root=default host=hostname.
You can replace the ceph-crush-location utility with your own script to change where OSDs
are placed in the CRUSH map. To do this, specify the crush_location_hook parameter in the /
etc/ceph/ceph.conf configuration file.
...output omitted...
[osd]
crush_location_hook = /path/to/your/script
...output omitted...
Ceph executes the script with these arguments: --cluster cluster-name --id osd-id --
type osd. The script must print the location as a single line on its standard output. The upstream
Ceph documentation has an example of a custom script that assumes each system has an /etc/
rack file containing the name of its rack:
#!/bin/sh
echo "root=default rack=$(cat /etc/rack) host=$(hostname -s)"
[osd.0]
crush_location = root=default datacenter=DC1 rack=rackA1
[osd.1]
crush_location = root=default datacenter=DC1 rack=rackB1
[ceph: root@node /]# ceph osd crush rule create-replicated name root \
failure-domain-type [class]
The following example creates the new inDC2 rule to store replicas in the DC2 data center, and
distributes the replicas across racks:
[ceph: root@node /]# ceph osd crush rule create-replicated inDC2 DC2 rack
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
inDC2
After you have defined the rule, use it when creating a replicated pool:
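For example, a sketch that uses the inDC2 rule (the pool name is illustrative):
[ceph: root@node /]# ceph osd pool create myDC2pool 50 50 replicated inDC2
pool 'myDC2pool' created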
For erasure coding, Ceph automatically creates rules for each erasure coded pool you create. The
name of the rule is the name of the new pool. Ceph uses the rule parameters you define in the
erasure code profile that you specify when you create the pool.
The following example first creates the new myprofile erasure code profile, then creates the
myecpool pool based on this profile:
[ceph: root@node /]# ceph osd erasure-code-profile set myprofile k=2 m=1 \
crush-root=DC2 crush-failure-domain=rack crush-device-class=ssd
[ceph: root@node /]# ceph osd pool create myecpool 50 50 erasure myprofile
pool 'myecpool' created
[ceph: root@node /]# ceph osd crush rule ls
replicated_rule
erasure-code
myecpool
ceph osd getcrushmap -o binfile
Export a binary copy of the current map.
crushtool -i binfile --test
Perform dry runs on a binary CRUSH map and simulate placement group creation.
ceph osd setcrushmap -i binfile
Import a binary CRUSH map into the cluster.
Note
The ceph osd getcrushmap and ceph osd setcrushmap commands provide
a useful way to back up and restore the CRUSH map for your cluster.
During the cluster life cycle, the number of PGs must be adjusted as the cluster layout changes.
CRUSH attempts to ensure a uniform distribution of objects among OSDs in the pool, but there
are scenarios where the PGs become unbalanced. The placement group autoscaler can be used
to optimize PG distribution, and is on by default. You can also manually set the number of PGs per
pool, if required.
Objects are typically distributed uniformly, provided that there are one or two orders of magnitude
(factors of ten) more placement groups than OSDs in the pool. If there are not enough PGs, then
objects might be distributed unevenly. If there is a small number of very large objects stored in the
pool, then object distribution might become unbalanced.
Note
PGs should be configured so that there are enough to evenly distribute objects
across the cluster. If the number of PGs is set too high, then it increases CPU
and memory use significantly. Red Hat recommends approximately 100 to 200
placement groups per OSD to balance these factors.
Red Hat recommends the use of the Ceph Placement Groups per Pool Calculator, https://
access.redhat.com/labs/cephpgc/, from the Red Hat Customer Portal Labs.
The following example remaps the PG 3.25 from OSDs 2 and 0 to OSDs 1 and 0:
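A sketch of that command; only the changed mapping (OSD 2 to OSD 1) needs to be listed:
[ceph: root@node /]# ceph osd pg-upmap-items 3.25 2 1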
Remapping hundreds of PGs this way is not practical. The osdmaptool command is useful here.
It takes the actual map for a pool, analyses it, and generates the ceph osd pg-upmap-items
commands to run for an optimal distribution:
1. Export the map to a file. The following command saves the map to the ./om file:
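A sketch of that command:
[ceph: root@node /]# ceph osd getmap -o ./om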
2. Use the --test-map-pgs option of the osdmaptool command to display the actual
distribution of PGs. The following command prints the distribution for the pool with the ID of
3:
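A sketch of that command, with its output omitted:
[ceph: root@node /]# osdmaptool ./om --test-map-pgs --pool 3
...output omitted...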
This output shows that osd.2 has only 27 PGs but osd.1 has 39.
3. Generate the commands to rebalance the PGs. Use the --upmap option of the osdmaptool
command to store the commands in a file:
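A sketch of that command (the output file name and pool name are illustrative):
[ceph: root@node /]# osdmaptool ./om --upmap ./upmap-commands.txt --upmap-pool mypool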
References
For more information, refer to Red Hat Ceph Storage 5 Strategies Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/storage_strategies_guide/
Guided Exercise
Outcomes
You should be able to create data placement rules to target a specific device class, create a
pool by using a specific data placement rule, and decompile and edit the CRUSH map.
This command confirms that the hosts required for this exercise are accessible, backs up the
CRUSH map, adds the ssd device class, and sets the mon_allow_pool_delete setting to
true.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify
that the cluster returns a HEALTH_OK state.
2. Create a new CRUSH rule called onssd that uses only the OSDs backed by SSD storage.
Create a new pool called myfast with 32 placement groups that use that rule. Confirm that
the pool is using only OSDs that are backed by SSD storage.
2.2. Display the CRUSH map tree to locate the OSDs backed by SSD storage.
2.3. Add a new CRUSH map rule called onssd to target the OSDs with SSD devices.
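A sketch of that command:
[ceph: root@clienta /]# ceph osd crush rule create-replicated onssd default host ssd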
2.4. Use the ceph osd crush rule ls command to verify the successful creation of
the new rule.
2.5. Create a new replicated pool called myfast with 32 placement groups that uses the
onssd CRUSH map rule.
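A sketch of that command:
[ceph: root@clienta /]# ceph osd pool create myfast 32 32 replicated onssd
pool 'myfast' created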
2.6. Verify that the placement groups for the pool called myfast are only using the OSDs
backed by SSD storage. In a previous step, these OSDs are osd.1, osd.5, and osd.6.
Retrieve the ID of the pool called myfast.
2.7. Use the ceph pg dump pgs_brief command to list all the PGs in the cluster.
The pool ID is the first number in a PG ID. For example, the PG 6.1b belongs to the
pool whose ID is 6.
The pool called myfast, whose ID is 6, only uses osd.1, osd.5, and osd.6. These
are the only OSDs with SSD drives.
3. Create a new CRUSH hierarchy under root=default-cl260 that has three rack buckets
(rack1, rack2, and rack3), each of which contains one host bucket (hostc, hostd, and
hoste).
3.1. Create a new CRUSH map hierarchy that matches this infrastructure:
You should place the three SSDs (in this example, OSDs 1, 5, and 6) on hostc.
Because the OSD numbers in your cluster can be different, modify the CRUSH map
hierarchy accordingly.
First, create the buckets with the ceph osd crush add-bucket command.
3.2. Use the ceph osd crush move command to build the hierarchy.
3.3. Display the CRUSH map tree to verify the new hierarchy.
[ceph: root@clienta /]# ceph osd crush set osd.1 1.0 root=default-cl260 \
rack=rack1 host=hostc
set item id 1 name 'osd.1' weight 1 at location
{host=hostc,rack=rack1,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.5 1.0 root=default-cl260 \
rack=rack1 host=hostc
set item id 5 name 'osd.5' weight 1 at location
{host=hostc,rack=rack1,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.6 1.0 root=default-cl260 \
rack=rack1 host=hostc
set item id 6 name 'osd.6' weight 1 at location
{host=hostc,rack=rack1,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.0 1.0 root=default-cl260 \
rack=rack2 host=hostd
set item id 0 name 'osd.0' weight 1 at location
{host=hostd,rack=rack2,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.3 1.0 root=default-cl260 \
rack=rack2 host=hostd
set item id 3 name 'osd.3' weight 1 at location
{host=hostd,rack=rack2,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.4 1.0 root=default-cl260 \
rack=rack2 host=hostd
set item id 4 name 'osd.4' weight 1 at location
{host=hostd,rack=rack2,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.2 1.0 root=default-cl260 \
rack=rack3 host=hoste
set item id 2 name 'osd.2' weight 1 at location
{host=hoste,rack=rack3,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.7 1.0 root=default-cl260 \
rack=rack3 host=hoste
set item id 7 name 'osd.7' weight 1 at location
{host=hoste,rack=rack3,root=default-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.8 1.0 root=default-cl260 \
rack=rack3 host=hoste
set item id 8 name 'osd.8' weight 1 at location
{host=hoste,rack=rack3,root=default-cl260} to crush map
3.5. Display the CRUSH map tree to verify the new OSD locations.
All the OSDs with SSD devices are in the rack1 bucket and no OSDs are in the
default tree.
4. Add a custom CRUSH rule by decompiling the binary CRUSH map and editing the resulting
text file to add a new CRUSH rule called ssd-first. This rule always selects OSDs backed
by SSD storage as the primary OSD, and OSDs backed by HDD storage as secondary OSDs
for each placement group.
When the rule is created, compile the map and load it into your cluster. Create a new
replicated pool called testcrush that uses the rule, and verify that its placement groups
are mapped correctly.
Clients accessing the pools that are using this new rule will read data from fast drives
because clients always read and write from the primary OSDs.
4.1. Retrieve the current CRUSH map by using the ceph osd getcrushmap command.
Store the binary map in the /home/ceph/cm-org.bin file.
4.2. Use the crushtool command to decompile the binary map to the ~/cm-org.txt
text file. When successful, this command returns no output; use the echo $?
command immediately afterward to verify its return code.
4.3. Save a copy of the CRUSH map as ~/cm-new.txt, and add the following rule at the
end of the file.
type replicated
min_size 1
max_size 10
step take rack1
step chooseleaf firstn 1 type host
step emit
step take default-cl260 class hdd
step chooseleaf firstn -1 type rack
step emit
}
With this rule, the first replica uses an OSD from rack1 (backed by SSD storage),
and the remaining replicas use OSDs backed by HDD storage from different racks.
4.5. Before applying the new map to the running cluster, use the crushtool command
with the --show-mappings option to verify that the first OSD is always from rack1.
The first OSD is always 1, 5, or 6, which corresponds to the OSDs with SSD devices
from rack1.
4.6. Apply the new CRUSH map to your cluster by using the ceph osd setcrushmap
command.
4.8. Create a new replicated pool called testcrush with 32 placement groups and use
the ssd-first CRUSH map rule.
4.9. Verify that the first OSDs for the placement groups in the pool called testcrush are
the ones from rack1. These OSDs are osd.1, osd.5, and osd.6.
5. Use the pg-upmap feature to manually remap some secondary OSDs in one of the PGs in
the testcrush pool.
5.1. Use the new pg-upmap optimization feature to manually map a PG to specific OSDs.
Remap the second OSD of your PG from the previous step to another OSD of your
choosing, except 1, 5, or 6.
5.2. Use the ceph pg map command to verify the new mapping. When done, log off
from clienta.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to describe the purpose and modification of the
OSD maps.
When a change occurs in the cluster's infrastructure, such as OSDs joining or leaving the cluster,
the MONs update the corresponding map accordingly. The MONs maintain a history of map
revisions. Ceph identifies each version of each map using an ordered set of incremented integers
known as epochs.
The ceph status -f json-pretty command displays the epoch of each map. Use the dump
subcommand of the corresponding ceph command to display an individual map, for example
ceph osd dump.
Even though the cluster map as a whole is maintained by the MONs, OSDs do not use a leader
to manage the OSD map; they propagate the map among themselves. OSDs tag every message
they exchange with the OSD map epoch. When an OSD detects that it is lagging behind, it
performs a map update with its peer OSD.
In large clusters, where OSD map updates are frequent, it is not practical to always distribute the
full map. Instead, receiving OSD nodes perform incremental map updates.
Ceph also tags the messages between OSDs and clients with the epoch. Whenever a client
connects to an OSD, the OSD inspects the epoch. If the epoch does not match, then the OSD
responds with the correct increment so that the client can update its OSD map. This negates the
need for aggressive propagation, because clients learn about the updated map only at the time of
next contact.
MONs use the Paxos algorithm as a mechanism to ensure that they agree on the cluster state.
Paxos is a distributed consensus algorithm. Every time a MON modifies a map, it sends the update
to the other monitors through Paxos. Ceph only commits the new version of the map after a
majority of monitors agree on the update.
The MON submits a map update to Paxos and only writes the new version to the local key-value
store after Paxos acknowledges the update. The read operations directly access the key-value
store.
When a leader monitor learns of an OSD failure, it updates the map, increments the epoch, and
uses the Paxos update protocol to notify the other monitors, at the same time revoking their
leases. After a majority of monitors acknowledge the update, and the cluster has a quorum, the
leader monitor issues a new lease so that the monitors can distribute the updated OSD map. This
method ensures that the map epoch never goes backward anywhere in the cluster and that no
previously issued leases remain valid.
References
For more information, refer to the Red Hat Ceph Storage 5 Storage Strategies Guide
at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/storage_strategies_guide
Guided Exercise
Outcomes
You should be able to display the OSD map and modify the OSD near-full and full ratios.
This command confirms that the hosts required for this exercise are accessible. It resets the
full_ratio and nearfull_ratio settings to the default values, and installs the ceph-
base package on servera.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify
that the cluster status is HEALTH_OK.
2. Run the ceph osd dump command to display the OSD map. Record the current epoch
value in your lab environment. Record the value of the full_ratio and nearfull_ratio
settings.
Verify that the status of each OSD is up and in.
3.1. Set the full_ratio parameter to 0.97 (97%) and nearfull_ratio to 0.9 (90%).
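These values are applied with the ceph osd set-*-ratio commands referenced below:
[ceph: root@clienta /]# ceph osd set-full-ratio 0.97
[ceph: root@clienta /]# ceph osd set-nearfull-ratio 0.9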
3.2. Verify the full_ratio and nearfull_ratio values. Compare this epoch value
with the value from the previous dump of the OSD map. The epoch has incremented
two versions because each ceph osd set-*-ratio command produces a new
OSD map version.
4.1. Instead of using the ceph osd dump command, use the ceph osd getmap
command to extract a copy of the OSD map to a binary file and the osdmaptool
command to view the file.
Use the ceph osd getmap command to save a copy of the OSD map in the
map.bin file.
4.2. Use the osdmaptool --print command to display the text version of the binary
OSD map. The output is similar to the output of the ceph osd dump command.
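A minimal sketch of these two commands:
[ceph: root@clienta /]# ceph osd getmap -o map.bin
[ceph: root@clienta /]# osdmaptool map.bin --print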
5. Extract and decompile the current CRUSH map, then compile and import the CRUSH map.
You will not change any map settings, but only observe the change in the epoch.
5.1. Use the osdmaptool --export-crush command to extract a binary copy of the
CRUSH map and save it in the crush.bin file.
5.2. Use the crushtool command to decompile the binary CRUSH map.
5.3. Use the crushtool command to compile the CRUSH map using the crush.txt
file. Send the output to the crushnew.bin file.
5.4. Use the osdmaptool --import-crush command to import the new binary
CRUSH map into a copy of the binary OSD map.
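A minimal sketch of the sequence, using the file names from these substeps:
[ceph: root@clienta /]# osdmaptool map.bin --export-crush crush.bin
[ceph: root@clienta /]# crushtool -d crush.bin -o crush.txt
[ceph: root@clienta /]# crushtool -c crush.txt -o crushnew.bin
[ceph: root@clienta /]# osdmaptool map.bin --import-crush crushnew.bin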
6. Use the osdmaptool command to test the impact of changes to the CRUSH map before
applying them in production.
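For example, the --test-map-pgs option reports how placement groups would map with the
modified copy of the map (a sketch; the option can also be combined with --pool):
[ceph: root@clienta /]# osdmaptool map.bin --test-map-pgs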
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to create a new CRUSH hierarchy and move OSDs into it, create a
CRUSH rule and configure a replicated pool to use it, and set the CRUSH tunables profile.
This command confirms that the hosts required for this exercise are accessible, backs up the
CRUSH map, and sets the mon_allow_pool_delete setting to true.
Instructions
1. Create a new CRUSH hierarchy under root=review-cl260 that has two data center
buckets (dc1 and dc2), two rack buckets (rack1 and rack2), one in each data center, and
two host buckets (hostc and hostd), one in each rack.
Place osd.1 and osd.2 into dc1, rack1, hostc.
Place osd.3 and osd.4 into dc2, rack2, hostd.
2. Add a CRUSH rule called replicated1 of type replicated. Set the root to review-
cl260 and the failure domain to datacenter.
3. Create a new replicated pool called reviewpool with 64 PGs that use the new CRUSH rule
from the previous step.
4. Set CRUSH tunables to use the optimal profile.
5. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade map-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to create a new CRUSH hierarchy and move OSDs into it, create a
CRUSH rule and configure a replicated pool to use it, and set the CRUSH tunables profile.
This command confirms that the hosts required for this exercise are accessible, backs up the
CRUSH map, and sets the mon_allow_pool_delete setting to true.
Instructions
1. Create a new CRUSH hierarchy under root=review-cl260 that has two data center
buckets (dc1 and dc2), two rack buckets (rack1 and rack2), one in each data center, and
two host buckets (hostc and hostd), one in each rack.
Place osd.1 and osd.2 into dc1, rack1, hostc.
Place osd.3 and osd.4 into dc2, rack2, hostd.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
1.2. Create the buckets with the ceph osd crush add-bucket command.
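A sketch of the likely commands; repeat the pattern for dc2, rack2, and hostd, and omit the
root bucket if review-cl260 already exists in your environment:
[ceph: root@clienta /]# ceph osd crush add-bucket review-cl260 root
[ceph: root@clienta /]# ceph osd crush add-bucket dc1 datacenter
[ceph: root@clienta /]# ceph osd crush add-bucket rack1 rack
[ceph: root@clienta /]# ceph osd crush add-bucket hostc host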
1.3. Use the ceph osd crush move command to build the hierarchy.
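For example, again repeating the pattern for the dc2 branch:
[ceph: root@clienta /]# ceph osd crush move dc1 root=review-cl260
[ceph: root@clienta /]# ceph osd crush move rack1 datacenter=dc1
[ceph: root@clienta /]# ceph osd crush move hostc rack=rack1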
1.4. Place the OSDs as leaves in the new tree and set all OSD weights to 1.0.
[ceph: root@clienta /]# ceph osd crush set osd.1 1.0 root=review-cl260 \
datacenter=dc1 rack=rack1 host=hostc
set item id 1 name 'osd.1' weight 1 at location
{datacenter=dc1,host=hostc,rack=rack1,root=review-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.2 1.0 root=review-cl260 \
datacenter=dc1 rack=rack1 host=hostc
set item id 2 name 'osd.2' weight 1 at location
{datacenter=dc1,host=hostc,rack=rack1,root=review-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.3 1.0 root=review-cl260 \
datacenter=dc2 rack=rack2 host=hostd
set item id 3 name 'osd.3' weight 1 at location
{datacenter=dc2,host=hostd,rack=rack2,root=review-cl260} to crush map
[ceph: root@clienta /]# ceph osd crush set osd.4 1.0 root=review-cl260 \
datacenter=dc2 rack=rack2 host=hostd
set item id 4 name 'osd.4' weight 1 at location
{datacenter=dc2,host=hostd,rack=rack2,root=review-cl260} to crush map
1.5. Display the CRUSH map tree to verify the new hierarchy and OSD locations.
2. Add a CRUSH rule called replicated1 of type replicated. Set the root to review-
cl260 and the failure domain to datacenter.
2.1. Use the ceph osd crush rule create-replicated command to create the rule.
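Given the root and failure domain stated in the instructions, the command is likely:
[ceph: root@clienta /]# ceph osd crush rule create-replicated replicated1 review-cl260 datacenter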
2.2. Verify that the replicated1 CRUSH rule was created correctly. Record the CRUSH
rule ID; it might be different in your lab environment.
[ceph: root@clienta /]# ceph osd crush rule dump | grep -B2 -A 20 replicated1
    {
        "rule_id": 1,
        "rule_name": "replicated1",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -9,
                "item_name": "review-cl260"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "emit"
            }
        ]
    }
3. Create a new replicated pool called reviewpool with 64 PGs that use the new CRUSH rule
from the previous step.
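A likely invocation for this step:
[ceph: root@clienta /]# ceph osd pool create reviewpool 64 64 replicated replicated1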
3.2. Verify that the pool was created correctly. The pool ID and CRUSH rule ID might be
different in your lab environment. Compare the CRUSH rule ID with the output of the
previous step.
Evaluation
Grade your work by running the lab grade map-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• The CRUSH algorithm provides a decentralized way for Ceph clients to interact with the
Red Hat Ceph Storage cluster, which enables massive scalability.
• The CRUSH map contains two main components: a hierarchy of buckets that organize OSDs
into a treelike structure where the OSDs are the leaves of the tree, and at least one CRUSH rule
that determines how Ceph assigns PGs to OSDs from the CRUSH tree.
• Ceph provides various command-line tools to display, tune, modify, and use the CRUSH map.
• You can modify the CRUSH algorithm's behavior by using tunables, which disable, enable, or
adjust features of the CRUSH algorithm.
• The OSD map epoch is the map's revision number and increments whenever a change occurs.
Ceph updates the OSD map every time an OSD joins or leaves the cluster and OSDs keep the
map synchronized among themselves.
Chapter 6
Providing Block Storage Using RADOS Block Devices
Objectives
After completing this section, you should be able to provide block storage to Ceph clients using
RADOS block devices (RBDs), and manage RBDs from the command line.
The RADOS Block Device (RBD) feature provides block storage from the Red Hat Ceph Storage
cluster. RADOS provides virtual block devices stored as RBD images in pools in the Red Hat Ceph
Storage cluster.
• Ensure that the rbd pool (or custom pool) for your RBD images exists. Use the ceph osd
pool create command to create a custom pool to store RBD images. After creating the
custom pool, initialize it with the rbd pool init command.
• Although Ceph administrators can access the pool, Red Hat recommends that you create a
more restricted Cephx user for clients by using the ceph auth command. Grant the restricted
user read/write access to only the needed RBD pool instead of access to the entire cluster.
• Create the RBD image with the rbd create --size size pool-name/image-name
command. This command uses the default pool name if you do not specify a pool name.
The rbd_default_pool parameter specifies the name of the default pool used to store RBD
images. Use ceph config set osd rbd_default_pool value to set this parameter.
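A minimal end-to-end sketch of this workflow; the pool, user, and image names are illustrative:
[ceph: root@node /]# ceph osd pool create myrbdpool 32
[ceph: root@node /]# rbd pool init myrbdpool
[ceph: root@node /]# ceph auth get-or-create client.rbduser \
mon 'profile rbd' osd 'profile rbd pool=myrbdpool'
[ceph: root@node /]# rbd create --size 1024 myrbdpool/myimage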
The rbd device map command uses the krbd kernel module to map an image. The rbd map
command is an abbreviated form of the rbd device map command. The rbd device unmap,
or rbd unmap, command uses the krbd kernel module to unmap a mapped image. The following
example command maps the test RBD image in the rbd pool to the /dev/rbd0 device on the
host client machine:
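[root@node ~]# rbd device map rbd/test
/dev/rbd0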
A Ceph client system can use the mapped block device, called /dev/rbd0 in the example, like any
other block device. You can format it with a file system, mount it, and unmount it.
Warning
Two clients can map the same RBD image as a block device at the same time.
This can be useful for high availability clustering for standby servers, but Red Hat
recommends attaching a block device to one client at a time when the block device
contains a normal, single-mount file system. Mounting a RADOS block device that
contains a normal file system, such as XFS, on two or more clients at the same time
can cause file-system corruption and data loss.
The rbd device list command, abbreviated rbd showmapped, lists the RBD images mapped
on the machine.
The rbd device unmap command, abbreviated rbd unmap, unmaps the RBD image from the
client machine.
The rbd map and rbd unmap commands require root privileges.
The following steps configure rbdmap to persistently map and unmap an RBD image that already
contains a file system:
2. Create a single-line entry in the /etc/ceph/rbdmap RBD map file. This entry must specify
the name of the RBD pool and image. It must also reference the Cephx user who has read/
write permissions to access the image and the corresponding key-ring file. Ensure that the
key-ring file for the Cephx user exists on the client system.
3. Create an entry for the RBD in the /etc/fstab file on the client system. The name of the
block device has the following form:
/dev/rbd/pool_name/image_name
Specify the noauto mount option, because the rbdmap service, not the Linux fstab routines,
handles the mounting of the file system.
4. Confirm that the block device mapping works. Use the rbdmap map command to mount the
devices. Use the rbdmap unmap command to unmount them.
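As a minimal sketch, assuming an image named test in a pool named rbd, a Cephx user named
client.rbd, and a mount point of /mnt/rbd (all illustrative), the two entries might look like this:
/etc/ceph/rbdmap:
rbd/test id=rbd,keyring=/etc/ceph/ceph.client.rbd.keyring
/etc/fstab:
/dev/rbd/rbd/test /mnt/rbd xfs noauto 0 0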
Cloud and virtualization solutions, such as OpenStack and libvirt, use librbd to provide
RBD images as block devices to cloud instances and the virtual machines that they manage. For
example, RBD images can store QEMU virtual machine images. Using the RBD clone feature,
virtualized containers can boot a virtual machine without copying the boot image. The copy-on-
write (COW) mechanism copies data from the parent to the clone when it writes to an unallocated
object within the clone. The copy-on-read (COR) mechanism copies data from the parent to the
clone when it reads from an unallocated object within the clone.
Because the user space implementation of the Ceph block device (for example, librbd) cannot
take advantage of the Linux page cache, it performs its own in-memory caching, known as
RBD caching. RBD caching behaves in a similar manner to the Linux page cache. When the OS
implements a barrier mechanism or a flush request, Ceph writes all dirty data to the OSDs. This
means that using write-back caching is just as safe as using physical hard disk caching with a VM
that properly sends flushes (for example, Linux kernel >= 2.6.32). The cache uses a Least Recently
Used (LRU) algorithm, and in write-back mode it can coalesce contiguous requests for better
throughput.
Note
The RBD cache is local to the client because it uses RAM on the machine that
initiated the I/O requests. For example, if you have Nova compute nodes in your
Red Hat OpenStack Platform installation that use librbd for their virtual machines,
the OpenStack client initiating the I/O request will use local RAM for its RBD cache.
Write-through Caching
Set the maximum dirty bytes value to 0 to force write-through mode. The Ceph cluster
acknowledges the writes when the data is written and flushed on all relevant OSD journals.
If using write-back mode, then the librbd library caches and acknowledges the I/O requests
when it writes the data into the local cache of the server. Consider write-through for strategic
production servers to reduce the risk of data loss or file system corruption in case of a server
failure. Red Hat Ceph Storage offers the following set of RBD caching parameters:
Run the ceph config set client parameter value command to apply a parameter to
clients, or the ceph config set global parameter value command to apply it globally.
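For example, to enable client-side caching and force write-through mode (rbd_cache and
rbd_cache_max_dirty are two of the available parameters):
[ceph: root@node /]# ceph config set client rbd_cache true
[ceph: root@node /]# ceph config set client rbd_cache_max_dirty 0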
Note
When using librbd with Red Hat OpenStack Platform, create separate Cephx user
names for OpenStack Cinder, Nova, and Glance. By following this recommended
practice, you can create different caching strategies based on the type of RBD
images that your Red Hat OpenStack Platform environment accesses.
op_features:
flags:
create_timestamp: Thu Sep 23 18:54:35 2021
access_timestamp: Thu Sep 23 18:54:35 2021
modify_timestamp: Thu Sep 23 18:54:35 2021
[root@node ~]# rados -p rbd ls
rbd_object_map.d3d0d7d0b79e.0000000000000008
rbd_id.rbdimage
rbd_object_map.d42c1e0a1883
rbd_directory
rbd_children
rbd_info
rbd_header.d3d0d7d0b79e
rbd_header.d42c1e0a1883
rbd_object_map.d3d0d7d0b79e
rbd_trash
Ceph block devices allow storing data striped over multiple Object Storage Devices (OSD) in a
Red Hat Ceph Storage cluster.
You can specify the size of the objects used with the --object-size option. This parameter
must specify an object size between 4096 bytes (4 KiB) and 33,554,432 bytes (32 MiB),
expressed in bytes, K or M (for example, 4096, 8 K or 4 M).
image_format
The RBD image format version. The default value is 2, the most recent version. Version 1 has
been deprecated and does not support features such as cloning and mirroring.
stripe_unit
The number of consecutive bytes stored in one object, object_size by default.
stripe_count
The number of RBD image objects that a stripe spans, 1 by default.
For RBD format 2 images, you can change the value of each of those parameters. The settings
must remain consistent with one another; in particular, the object size must be a multiple of
the stripe unit. For example, an image with explicit striping parameters (the names and sizes
shown are illustrative) can be created like this:
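[ceph: root@node /]# rbd create myimage --size 1024 --pool myrbdpool \
--object-size 8M --stripe-unit 4M --stripe-count 2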
Remember that object_size must be no less than 4096 bytes and no greater than
33,554,432 bytes. Use the --object-size option to specify this value when you create the RBD
image. The default object_size is 4194304 bytes (4 MiB).
References
rbd(8) and rbdmap(8) man pages
For more information, refer to the Red Hat Ceph Storage 5 Block Device Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/block_device_guide/index
Guided Exercise
Outcomes
You should be able to create and manage RADOS block device images and use them as
regular block devices.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Verify that the Red Hat Ceph cluster is in a healthy state.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Create a replicated pool called test_pool with 32 placement groups. Set the application
type for the pool to rbd.
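A likely pair of commands for this step:
[ceph: root@clienta /]# ceph osd pool create test_pool 32 32
[ceph: root@clienta /]# ceph osd pool application enable test_pool rbd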
2.3. List the configured pools and view the usage and availability for test_pool. The ID
of test_pool might be different in your lab environment.
3.1. Create the client.test_pool.clientb user and display the new file.
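A sketch of creating the user; the capabilities shown are an assumption (read/write access
restricted to the test_pool pool):
[ceph: root@clienta /]# ceph auth get-or-create client.test_pool.clientb \
mon 'profile rbd' osd 'profile rbd pool=test_pool' \
-o /etc/ceph/ceph.client.test_pool.clientb.keyring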
4. Open a second terminal window and log in to the clientb node as the admin user. Copy
the key-ring file for the new test_pool user. Use the client.test_pool.clientb
user name when connecting to the cluster.
4.1. Log in to clientb as the admin user and switch to the root user.
4.3. Go to the first terminal and copy the Ceph configuration and the key-ring files from
the /etc/ceph/ directory on the clienta node to the /etc/ceph/ directory on
the clientb node.
4.4. Go to the second terminal window. Temporarily set the default user ID used for
connections to the cluster to test_pool.clientb.
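One way to do this is to export the CEPH_ARGS variable in the shell, as later exercises in this
guide also do:
[root@clientb ~]# export CEPH_ARGS='--id=test_pool.clientb'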
5. Create a new RADOS Block Device Image and map it to the clientb machine.
5.1. Create an RBD image called test in the test_pool pool. Specify a size of
128 megabytes.
5.3. Map the RBD image on the clientb node by using the kernel RBD client.
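A sketch of these two substeps; the image is created from the clienta admin node and mapped
on the clientb node:
[ceph: root@clienta /]# rbd create test --size 128 --pool test_pool
[root@clientb ~]# rbd map test_pool/test
/dev/rbd0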
6. Verify that you can use the RBD image mapped on the clientb node like a regular disk
block device.
6.9. Unmount the file system and unmap the RBD image on the clientb node.
7. Configure the client system so that it persistently mounts the test_pool/test RBD
image as /mnt/rbd.
7.3. Verify your RBD map configuration. Use the rbdmap command to map and unmap
configured RBD devices.
7.4. After you have verified that the RBD mapped devices work, enable the rbdmap
service. Reboot the clientb node to verify that the RBD device mounts persistently.
When the clientb node finishes rebooting, log in and verify that it has mounted the
RBD device.
8. Unmount your file system, unmap and delete the test_pool/test RBD image, and
delete the temporary objects to clean up your environment.
8.1. Unmount the /mnt/rbd file system and unmap the RBD image.
8.2. Remove the RBD entry from the /etc/fstab file. The resulting file should contain
the following:
8.3. Remove the RBD map entry for test_pool/test from the /etc/ceph/rbdmap
file. The resulting file should contain the following:
8.5. Verify that the test_pool RBD pool does not yet contain extra data. The
test_pool pool should initially contain only the three listed objects.
9. Exit and close the second terminal. In the first terminal, return to workstation as the
student user.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to create and configure RADOS block devices
snapshots and clones.
To disable the layering feature, use the rbd feature disable command:
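[ceph: root@node /]# rbd feature disable pool-name/image-name layering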
RBD Snapshots
RBD snapshots are read-only copies of an RBD image created at a particular time. RBD snapshots
use a COW technique to reduce the amount of storage needed to maintain snapshots. Before
applying a write I/O request to an RBD image that has a snapshot, the cluster copies the original data to
another area in the placement group of the object affected by the I/O operation. Snapshots do
not consume any storage space when created, but grow in size as the objects that they contain
change. RBD images support incremental snapshots.
Important
Use the fsfreeze command to suspend access to a file system before taking
a snapshot. The fsfreeze --freeze command stops access to the file system
and creates a stable image on disk. Do not take a file system snapshot when the
file system is not frozen because it will corrupt the snapshot's file system. After
taking the snapshot, use the fsfreeze --unfreeze command to resume file system
operations and access.
The snapshot COW procedure operates at the object level, regardless of the size of the write I/O
request made to the RBD image. If you write a single byte to an RBD image that has a snapshot,
then Ceph copies the entire affected object from the RBD image into the snapshot area.
Deleting an RBD image fails if snapshots exist for the image. Use the rbd snap purge
command to delete the snapshots.
Use the rbd snap create command to create a snapshot of a Ceph block device.
Use the rbd snap ls command to list the block device snapshots.
Use the rbd snap rollback command to roll back a block device snapshot, overwriting the
current version of the image with data from the snapshot.
Use the rbd snap rm command to delete a snapshot for Ceph block devices.
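The general syntax of these commands is shown below; pool-name, image-name, and snap-name
are placeholders:
[ceph: root@node /]# rbd snap create pool-name/image-name@snap-name
[ceph: root@node /]# rbd snap ls pool-name/image-name
[ceph: root@node /]# rbd snap rollback pool-name/image-name@snap-name
[ceph: root@node /]# rbd snap rm pool-name/image-name@snap-name
[ceph: root@node /]# rbd snap purge pool-name/image-name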
RBD Clones
RBD clones are read/write copies of an RBD image that use a protected RBD snapshot as a base.
An RBD clone can also be flattened, which converts it into an RBD image independent of its
source. The cloning process has three steps:
1. Create a snapshot of the source image.
2. Protect the snapshot.
3. Clone the protected snapshot.
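A minimal sketch of these steps, using placeholder names:
[ceph: root@node /]# rbd snap create pool-name/image-name@snap-name
[ceph: root@node /]# rbd snap protect pool-name/image-name@snap-name
[ceph: root@node /]# rbd clone pool-name/image-name@snap-name pool-name/clone-name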
The newly created clone behaves just like a regular RBD image. Clones support COW and COR,
with COW as the default. COW copies the parent snapshot data into the clone before applying a
write I/O request to the clone.
You can also enable COR support for RBD clones. Data that is the same for the parent RBD
snapshot and the clone is read directly from the parent. This can make reads more expensive if the
parent's OSDs have high latency relative to the client. COR copies objects to the clone when they
are first read.
If you enable COR, Ceph copies the data from the parent snapshot into the clone before
processing a read I/O request, if the data is not already present in the clone. Activate the COR
feature by running the ceph config set client rbd_clone_copy_on_read true
command or the ceph config set global rbd_clone_copy_on_read true command
for the client or global setting. The original data is not overwritten.
Important
If COR is disabled on an RBD clone, every read operation that the clone cannot
satisfy results in an I/O request to the parent of the clone.
The clone COW and COR procedures operate at the object level, regardless of the I/O request
size. To read or write a single byte of the RBD clone, Ceph copies the entire object from the parent
image or snapshot into the clone.
When flattening a clone, Ceph copies all missing data from the parent into the clone and then
removes the reference to the parent. The clone becomes an independent RBD image and is no
longer the child of a protected snapshot.
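Flattening uses the rbd flatten command, for example:
[ceph: root@node /]# rbd flatten pool-name/clone-name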
Note
You cannot delete RBD images directly from a pool. Instead, use the rbd trash
mv command to move an image from a pool to the trash. Delete objects from the
trash with the rbd trash rm command. You are allowed to move active images
that are in use by clones to the trash for later deletion.
References
For more information, refer to the Snapshot management chapter in the Red Hat
Ceph Storage 5 Block Device Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/block_device_guide/index#snapshot-management
Guided Exercise
Outcomes
You should be able to create and manage RADOS block device snapshots, and clone a
snapshot to create a child image.
This command confirms that the hosts required for this exercise are accessible. It also
creates an image called image1 within the rbd pool. Finally, this command creates a user
and associated key in the Red Hat Ceph Storage cluster and copies it to the clientb node.
Instructions
1. Use the ceph health command to verify that the primary cluster is in a healthy state.
1.1. Log in to clienta as the admin user and switch to the root user.
1.2. Use the cephadm shell to run the ceph health command to verify that the primary
cluster is in a healthy state.
2. Map the rbd/image1 image as a block device, format it with an XFS file system, and
confirm that the /dev/rbd0 device is writable.
3. Create an initial snapshot called firstsnap. Calculate the provisioned and actual disk
usage of the rbd/image1 image and its associated snapshots by using the rbd disk-
usage command.
3.1. Run the cephadm shell. Create an initial snapshot called firstsnap.
3.2. Calculate the provisioned and used size of the rbd/image1 image and its associated
snapshots.
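A sketch of these two commands:
[ceph: root@clienta /]# rbd snap create rbd/image1@firstsnap
[ceph: root@clienta /]# rbd disk-usage rbd/image1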
4. Open another terminal window. Log in to clientb as the admin user and switch to the
root user. Set the CEPH_ARGS variable to the '--id=rbd.clientb' value.
5. On the clientb node, map the image1@firstsnap snapshot and verify that the device
is writable.
6. On the clienta node, exit the cephadm shell. Mount the /dev/rbd0 device in /mnt/
image directory, copy some data into it, and then unmount it.
6.3. Check the disk space usage for the /dev/rbd0 device.
7.2. Check the disk space usage for the /dev/rbd0 device and list the directory content.
Notice that the file0 file does not display on the clientb node because the file
system of the snapshot block device is empty.
Changes to the original block device did not alter the snapshot.
7.3. Unmount the /mnt/snapshot directory, and then unmap the /dev/rbd0 device.
8. On the clienta node, protect the firstsnap snapshot and create a clone called clone1
in the rbd pool. Verify that the child image is created.
8.2. Clone the firstsnap block device snapshot to create a read or write child image
called clone1 that uses the rbd pool.
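A sketch of the protect and clone commands for this step:
[ceph: root@clienta /]# rbd snap protect rbd/image1@firstsnap
[ceph: root@clienta /]# rbd clone rbd/image1@firstsnap rbd/clone1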
9. On the clientb node, map the rbd/clone1 image as a block device, mount it, and then
copy some content to the clone.
9.2. Mount the block device in /mnt/clone directory, and then list the directory
contents.
10.1. On the clientb node, unmount the file system and unmap the RBD image.
10.2. On the clienta node, exit the cephadm shell. Unmount the file system, and then
unmap the RBD image.
11. Exit and close the second terminal. Return to workstation as the student user.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to export an RBD image from the cluster to an
external file and import it into another cluster.
The RADOS block device feature provides the ability to export and import entire RBD images or
only RBD image changes between two points in time.
The --export-format option specifies the format of the exported data, allowing you to convert
earlier RBD format 1 images to newer, format 2 images. The following example exports an RBD
image called test to the /tmp/test.dat file.
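[ceph: root@node /]# rbd export test /tmp/test.dat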
The --export-format option specifies the data format of the data to be imported. When
importing format 2 exported data, use the --stripe-unit, --stripe-count, --object-
size, and --image-feature options to create the new RBD format 2 image.
Note
The --export-format parameter value must match for the related rbd export
and the rbd import commands.
The rbd export-diff command exports the changes made to an image since a start point,
which can be either:
• The creation date and time of the RBD image, when you do not use the --from-snap option.
• A snapshot of an RBD image, such as one obtained by using the --from-snap snapname
option.
If you specify a start point snapshot, the command exports the changes after you created that
snapshot. If you do not specify a snapshot, the command exports all changes since the creation of
the RBD image, which is the same as a regular RBD image export operation.
Note
The import-diff operation performs the following validity checks:
• If the export-diff is relative to a start snapshot, this snapshot must also exist in
the target RBD image.
• If the export-diff is performed specifying an end snapshot, the same snapshot
name is created in the target RBD image after the data is imported.
You can also use the dash character (-) to specify either stdout or standard input (stdin) as the
export target or import source.
The rbd merge-diff command merges the output of two continuous incremental rbd
export-diff image operations into one single target path. The command can only process two
incremental paths at one time.
To merge more than two continuous incremental paths in a single command, pipe one rbd
export-diff output to another rbd export-diff command. Use the dash character (-) as
the target in the command before the pipe, and as the source in the command after the pipe.
For example, you can merge three incremental diffs into a single merged target on one command
line. The snapshot end time of the earlier export-diff command must be equal to the snapshot
start time of the later export-diff command.
[ceph: root@node /]# rbd merge-diff first second - | rbd merge-diff - third merged
The rbd merge-diff command only supports RBD images with stripe-count set to 1.
References
rbd(8) man pages.
Guided Exercise
Outcomes
You should be able to:
This command confirms that the hosts required for this exercise are accessible. It also
ensures that clienta has the necessary RBD client authentication keys.
Instructions
1. Open two terminals and log in to clienta and serverf as the admin user. Verify that
both clusters are reachable and have a HEALTH_OK status.
1.1. Open a terminal window. Log in to clienta as the admin user and use sudo to run
the cephadm shell. Verify that the primary cluster is in a healthy state.
1.2. Open another terminal window. Log in to serverf as the admin user and use sudo
to run the cephadm shell. Verify that the secondary cluster is in a healthy state.
2. Create a pool called rbd, and then enable the rbd client application for the Ceph Block
Device and make it usable by the RBD feature.
2.1. In the primary cluster, create a pool called rbd with 32 placement groups. Enable
the rbd client application for the Ceph Block Device and make it usable by the RBD
feature.
2.2. In the secondary cluster, create a pool called rbd with 32 placement groups. Enable
the rbd client application for the Ceph Block Device and make it usable by the RBD
feature.
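In both clusters, the commands for this step likely look like the following:
[ceph: root@clienta /]# ceph osd pool create rbd 32 32
[ceph: root@clienta /]# ceph osd pool application enable rbd rbd
[ceph: root@clienta /]# rbd pool init rbd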
3. Create an RBD image called rbd/test in your primary Ceph cluster. Map it as a block
device, format it with an XFS file system, mount it in /mnt/rbd directory, copy some data
into it, and then unmount it.
3.1. Create the RBD image. Exit the cephadm shell and switch to the root user. Map the
image, and then format it with an XFS file system.
[ceph: root@clienta /]# rbd create test --size 128 --pool rbd
[ceph: root@clienta /]# exit
exit
[admin@clienta ~]$ sudo -i
[root@clienta ~]# rbd map --pool rbd test
/dev/rbd0
[root@clienta ~]# mkfs.xfs /dev/rbd0
...output omitted...
3.2. Mount /dev/rbd0 to the /mnt/rbd directory and copy a file to it.
3.3. Unmount your file system to ensure that the system flushes all data to the Ceph
cluster.
4. Create a backup copy of the primary rbd/test block device. Export the entire rbd/test
image to a file called /mnt/export.dat. Copy the export.dat file to the secondary
cluster.
4.1. In the primary cluster, run the cephadm shell using the --mount argument to bind
mount the /home/admin/rbd-export/ directory.
4.2. Export the entire rbd/test image to a file called /mnt/export.dat. Exit the
cephadm shell.
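A sketch of substeps 4.1 and 4.2; the cephadm shell --mount option exposes the host directory
as /mnt inside the container:
[admin@clienta ~]$ sudo cephadm shell --mount /home/admin/rbd-export/
[ceph: root@clienta /]# rbd export rbd/test /mnt/export.dat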
4.3. Copy the export.dat file to the secondary cluster in the /home/admin/rbd-
import/ directory.
5. In the secondary cluster, import the /mnt/export.dat file containing the exported rbd/
test RBD image into the secondary cluster. Confirm that the import was successful by
mapping the imported image to a block device, mounting it, and inspecting its contents.
5.1. Exit the current cepdadm shell. Use sudo to run the cephadm shell with the --mount
argument to bind mount the /home/admin/rbd-import/ directory.
5.2. List the contents of the backup cluster's empty rbd pool. Use the rbd import
command to import the RBD image contained in the /mnt/export.dat file into the
backup cluster, referring to it as rbd/test.
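For example:
[ceph: root@serverf /]# rbd import /mnt/export.dat rbd/test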
5.3. Exit the cephadm shell, and then switch to the root user. Map the backup cluster's
imported RBD image and mount the file system it contains. Confirm that its contents
are the same as those originally created on the primary cluster's RBD image.
5.4. Unmount the file system and unmap the RBD image.
6. In this part of the exercise, you will create a pair of snapshots of rbd/test on your primary
cluster and export the changes between those snapshots as an incremental diff image.
You will then import the changes from the incremental diff into your copy of the rbd/test
image on your secondary cluster.
6.1. In the primary cluster, run the cephadm shell and create an initial snapshot called
rbd/test@firstsnap. Calculate the provisioned and actual disk usage of the rbd/
test image and its associated snapshots.
6.2. In the secondary cluster, run the cephadm shell, create an initial snapshot called rbd/
test@firstsnap. Calculate the provisioned and actual disk usage of the rbd/test
image and its associated snapshots.
6.3. In the primary cluster, mount the file system on the /dev/rbd0 device, mapped from
the rbd/test image, to change the RBD image. Make changes to the file system to
effect changes to the RBD image. Unmount the file system when you are finished.
6.4. In the primary cluster, run the cephadm shell and note that the amount of data used
in the image of the primary cluster increased. Create a new snapshot called rbd/
test@secondsnap to delimit the ending time window of the changes that you want
to export. Note the adjustments made to the reported used data.
6.5. In the primary cluster, exit the current cepdadm shell. Run the cephadm shell with the
--mount argument to bind mount the /home/admin/rbd-import/ directory.
6.6. Export the changes between the snapshots of the primary cluster's rbd/test
image to a file called /mnt/export-diff.dat. Exit the cephadm shell, and copy
the export-diff.dat file to the secondary cluster in the /home/admin/rbd-
import/ directory.
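A sketch of the export-diff command for this step:
[ceph: root@clienta /]# rbd export-diff --from-snap firstsnap rbd/test@secondsnap /mnt/export-diff.dat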
6.7. In the secondary cluster, run the cephadm shell using the --mount argument to
mount the /home/admin/rbd-import/ directory. Use the rbd import-diff
command to import the changes to the secondary cluster's copy of the rbd/test
image by using the /mnt/export-diff.dat file. This eliminates the need to save
the exported image to a file as an intermediate step. Inspect the information about
the remote RBD image. Exit the cephadm shell.
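A sketch of the import-diff command:
[ceph: root@serverf /]# rbd import-diff /mnt/export-diff.dat rbd/test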
Note
The end snapshot is present on the secondary cluster's RBD image. The rbd
import-diff command automatically creates it.
6.8. Verify that the backup cluster's image is identical to the primary cluster's RBD image.
7.1. In the primary cluster, unmap the RBD image. Run the cephadm shell, purge all
existing snapshots on the primary cluster's RBD image, and then delete the RBD
image.
7.2. In the secondary cluster, unmap the RBD image. Run the cephadm shell, purge all
existing snapshots on the secondary cluster's RBD image, and then delete the RBD
image.
8. Exit and close the second terminal. Return to workstation as the student user.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to:
This command verifies the status of the cluster and creates the rbd pool if it does not
already exist.
Instructions
Perform the following steps on your clienta admin node, which is a client node to the primary 3-
node Ceph storage cluster.
1. Log in to clienta as the admin user. Create a pool called rbd260, enable the rbd client
application for the Ceph block device, and make it usable by the RBD feature.
2. Create a 128 MiB RADOS block device image called prod260 in the rbd260 pool. Verify your
work.
3. Map the prod260 RBD image in the rbd260 pool to a local block device file by using the
kernel RBD client. Format the device with an XFS file system. Mount the file system on the
/mnt/prod260 directory and copy the /etc/resolv.conf file to the root of this new file
system. When done, unmount and unmap the device.
4. Create a snapshot of the prod260 RBD image in the rbd260 pool and name it
beforeprod.
5. Export the prod260 RBD image from the rbd260 pool to the /root/prod260.xfs file.
Import that image file into the rbd pool on your primary 3-node Ceph cluster, and name the
imported image img260 in that pool.
6. Configure the client system so that it persistently mounts the rbd260/prod260 RBD image
as /mnt/prod260. Authenticate as the admin Ceph user using existing keys found in the /
etc/ceph/ceph.client.admin.keyring file.
Evaluation
Grade your work by running the lab grade block-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to:
This command verifies the status of the cluster and creates the rbd pool if it does not
already exist.
Instructions
Perform the following steps on your clienta admin node, which is a client node to the primary 3-
node Ceph storage cluster.
1. Log in to clienta as the admin user. Create a pool called rbd260, enable the rbd client
application for the Ceph block device, and make it usable by the RBD feature.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
Verify that the primary cluster is in a healthy state.
1.2. Create a pool called rbd260 with 32 placement groups. Enable the rbd client
application for the Ceph Block Device and make it usable by the RBD feature.
1.3. List the rbd260 pool details to verify your work. The pool ID might be different in your
lab environment.
2. Create a 128 MiB RADOS block device image called prod260 in the rbd260 pool. Verify your
work.
2.1. Create the 128 MiB prod260 RBD image in the rbd260 pool.
[ceph: root@clienta /]# rbd create prod260 --size 128 --pool rbd260
2.2. List the images in the rbd260 pool to verify the result.
3. Map the prod260 RBD image in the rbd260 pool to a local block device file by using the
kernel RBD client. Format the device with an XFS file system. Mount the file system on the
/mnt/prod260 directory and copy the /etc/resolv.conf file to the root of this new file
system. When done, unmount and unmap the device.
3.1. Exit the cephadm shell, then switch to the root user. Install the ceph-common
package on the clienta node. Map the prod260 image from the rbd260 pool using
the kernel RBD client.
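A sketch of this substep; the package manager invocation might differ in your environment:
[root@clienta ~]# yum install -y ceph-common
[root@clienta ~]# rbd map rbd260/prod260
/dev/rbd0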
3.2. Format the /dev/rbd0 device with an XFS file system and mount the file system on
the /mnt/prod260 directory. Change the user and group ownership of the root directory of the new file
system to admin.
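For example:
[root@clienta ~]# mkfs.xfs /dev/rbd0
[root@clienta ~]# mkdir -p /mnt/prod260
[root@clienta ~]# mount /dev/rbd0 /mnt/prod260
[root@clienta ~]# chown admin:admin /mnt/prod260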
3.3. Copy the /etc/resolv.conf file to the root of the /mnt/prod260 file system, and
then list the contents to verify the copy.
4. Create a snapshot of the prod260 RBD image in the rbd260 pool and name it
beforeprod.
4.1. Run the cephadm shell. Create the beforeprod snapshot of the prod260 image in
the rbd260 pool.
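For example:
[ceph: root@clienta /]# rbd snap create rbd260/prod260@beforeprod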
4.2. List the snapshots of the prod260 RBD image in the rbd260 pool to verify your work.
Note
The snapshot ID and the time stamp are different in your lab environment.
5. Export the prod260 RBD image from the rbd260 pool to the /root/prod260.xfs file.
Import that image file into the rbd pool on your primary 3-node Ceph cluster, and name the
imported image img260 in that pool.
5.1. Export the prod260 RBD image from the rbd260 pool to a file called /root/
prod260.xfs.
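For example:
[ceph: root@clienta /]# rbd export rbd260/prod260 /root/prod260.xfs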
5.2. Retrieve the size of the /home/admin/prod260.xfs file to verify the export.
5.3. Import the /root/prod260.xfs file as the img260 RBD image in the rbd pool.
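For example:
[ceph: root@clienta /]# rbd import /root/prod260.xfs rbd/img260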
5.4. List the images in the rbd pool to verify the import. Exit from the cephadm shell.
Note
The rbd ls command might display images from previous exercises.
6. Configure the client system so that it persistently mounts the rbd260/prod260 RBD image
as /mnt/prod260. Authenticate as the admin Ceph user using existing keys found in the /
etc/ceph/ceph.client.admin.keyring file.
6.1. Create an entry for the rbd260/prod260 image in the /etc/ceph/rbdmap RBD
map file. The resulting file should have the following contents:
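rbd260/prod260 id=admin,keyring=/etc/ceph/ceph.client.admin.keyring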
6.2. Create an entry for the /dev/rbd/rbd260/prod260 image in the /etc/fstab file.
The resulting file should have the following contents:
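/dev/rbd/rbd260/prod260 /mnt/prod260 xfs noauto 0 0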
6.3. Use the rbdmap command to validate your RBD map configuration.
6.4. After you have confirmed that the RBD mapped devices work, enable the rbdmap
service. Reboot the clienta node to confirm that the RBD device mounts
persistently.
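A sketch of enabling the service and rebooting, run as the root user:
[root@clienta ~]# systemctl enable rbdmap.service
[root@clienta ~]# systemctl reboot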
6.5. After rebooting, log in to the clienta node as the admin user. Confirm that the
system has mounted the RBD device.
Evaluation
Grade your work by running the lab grade block-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• The rbd command manages RADOS block device pools, images, snapshots, and clones.
• The rbd map command uses the krbd kernel module to map RBD images to Linux block
devices. Configuring the rbdmap service can map these images persistently.
• RBD has an export and import mechanism for maintaining copies of RBD images that are fully
functional and accessible.
• The rbd export-diff and the rbd import-diff commands export and import RBD image
changes made between two points in time.
Chapter 7
Expanding Block Storage Operations
Objectives
After completing this section, you should be able to configure an RBD mirror to replicate an RBD
block device between two Ceph clusters for disaster recovery purposes.
RBD Mirroring
Red Hat Ceph Storage supports RBD mirroring between two storage clusters. This allows you to
automatically replicate RBD images from one Red Hat Ceph Storage cluster to another remote
cluster. This mechanism asynchronously replicates the source (primary) RBD image to the target
(secondary) RBD image over the network. If the cluster containing the primary
RBD image becomes unavailable, then you can fail over to the secondary RBD image from the
remote cluster and restart the applications that use it.
When failing over from the source RBD image to the mirror RBD image, you must demote the
source RBD image and promote the target RBD image. A demoted image becomes locked and
unavailable. A promoted image becomes available and accessible in read/write mode.
The RBD mirroring feature requires the rbd-mirror daemon. The rbd-mirror daemon pulls the
image updates from the remote peer cluster and applies them to the local cluster image.
Pool Mode
In pool mode, Ceph automatically enables mirroring for each RBD image created in the
mirrored pool. When you create an image in the pool on the source cluster, Ceph creates a
secondary image on the remote cluster.
Image Mode
In image mode, mirroring can be selectively enabled for individual RBD images within the
mirrored pool. In this mode, you have to explicitly select the RBD images to replicate between
the two clusters.
Journal-based mirroring
This mode uses the RBD journaling image feature to ensure point-in-time and crash-
consistent replication between two Red Hat Ceph Storage clusters. Every writes to the RBD
image is first recorded to the associated journal before modifying the actual image. The
remote cluster reads from this journal and replays the updates to its local copy of the image.
Snapshot-based mirroring
Snapshot-based mirroring uses periodically scheduled or manually created RBD image mirror
snapshots to replicate crash-consistent RBD images between two Red Hat Ceph Storage
clusters. The remote cluster determines any data or metadata updates between two mirror
snapshots and copies the deltas to the image's local copy. The RBD fast-diff image
feature enables the quick determination of updated data blocks without the need to scan
the full RBD image. The complete delta between two snapshots must be synced prior to use
during a failover scenario. Any partially applied set of deltas will be rolled back at the moment
of failover.
Managing Replication
Image resynchronization
In case of an inconsistent state between the two peer clusters, the rbd-mirror daemon
does not attempt to mirror the image that is causing the inconsistency. Use the rbd mirror
image resync command to resynchronize the image.
To achieve RBD mirroring, and enable the rbd-mirror daemon to discover its peer cluster, you
must register a peer and create a user account. Red Hat Ceph Storage 5 automates this
process by using the rbd mirror pool peer bootstrap create command.
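A minimal sketch of the bootstrap exchange; the site names, pool name, and token path follow
the exercise later in this chapter and are otherwise illustrative:
[ceph: root@prod-node /]# rbd mirror pool peer bootstrap create \
--site-name prod rbd > /root/bootstrap_token_prod
[ceph: root@backup-node /]# rbd mirror pool peer bootstrap import \
--site-name bup --direction rx-only rbd /root/bootstrap_token_prod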
Important
Each instance of the rbd-mirror daemon must connect to both the local and
remote Ceph clusters simultaneously. Also, the network must have sufficient
bandwidth between the two data centers to handle the mirroring workload.
The following list outlines the steps required to configure mirroring between two clusters, called
prod and backup:
1. Create a pool with the same name in both clusters, prod and backup.
2. Create or modify the RBD image with the exclusive-lock, and journaling features
enabled.
4. In the prod cluster, bootstrap the storage cluster peer and save the bootstrap token.
• For one-way replication, the rbd-mirror daemon runs only on the backup cluster.
The backup cluster displays the following pool information and status.
Peer Sites:
UUID: 5e2f6c8c-a7d9-4c59-8128-d5c8678f9980
Name: prod
Direction: rx-only
Client: client.rbd-mirror-peer
[ceph: root@backup-node /]# rbd mirror pool status
health: OK
daemon health: OK
image health: OK
images: 1 total
1 replaying
The prod cluster displays the following pool information and status.
Peer Sites:
UUID: 6c5f860c-b683-44b4-9592-54c8f26ac749
Name: backup
Mirror UUID: 7224d1c5-4bd5-4bc3-aa19-e3b34efd8369
Direction: tx-only
[ceph: root@prod-node /]# rbd mirror pool status
health: UNKNOWN
daemon health: UNKNOWN
image health: OK
images: 1 total
1 replaying
Note
In one-way mode, the source cluster is not aware of the state of the replication. The
RBD mirroring agent in the target cluster updates the status information.
Failover Procedure
If the primary RBD image becomes unavailable, then you can use the following steps to enable
access to the secondary RBD image:
• Stop access to the primary RBD image. This means stopping all applications and virtual
machines that are using the image.
• Use the rbd mirror image demote pool-name/image-name command to demote the
primary RBD image.
• Use the rbd mirror image promote pool-name/image-name command to promote the
secondary RBD image.
• Resume access to the RBD image. Restart the applications and virtual machines.
Note
When a failover after a non-orderly shutdown occurs, you must promote the non-
primary images from a Ceph Monitor node in the backup storage cluster. Use the
--force option because the demotion cannot propagate to the primary storage
cluster.
References
rbd(8) man page
For more information, refer to the Mirroring Ceph block devices chapter in the Block
Device Guide for Red Hat Ceph Storage 5 at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/block_device_guide/index
Guided Exercise
Outcomes
You should be able to:
This command confirms that the hosts required for this exercise are accessible. Your Ceph
clusters and configuration will not be modified by this lab start command.
Instructions
1. Open two terminals and log in to clienta and serverf as the admin user. Verify that
both clusters are reachable and have a HEALTH_OK status.
1.1. Open a terminal window and log in to clienta as the admin user and switch to the
root user. Run a cephadm shell. Verify the health of your production cluster.
services:
mon: 4 daemons, quorum serverc.lab.example.com,servere,serverd,clienta (age
15m)
mgr: serverc.lab.example.com.btgxor(active, since 15m), standbys:
servere.fmyxwv, clienta.soxncl, serverd.ufqxxk
osd: 9 osds: 9 up (since 15m), 9 in (since 47h)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
Important
Ensure that the monitor daemons displayed in the services section match those
of your 3-node production cluster plus the client.
1.2. Open another terminal window and log in to serverf as the admin user and switch
to the root user. Verify the health of your backup cluster.
services:
mon: 1 daemons, quorum serverf.lab.example.com (age 18m)
mgr: serverf.lab.example.com.qfmyuk(active, since 18m)
osd: 5 osds: 5 up (since 18m), 5 in (since 47h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
pools: 5 pools, 105 pgs
objects: 189 objects, 4.9 KiB
usage: 82 MiB used, 50 GiB / 50 GiB avail
pgs: 105 active+clean
Important
Ensure that the monitor daemon displayed in the services section matches that
of your single-node backup cluster.
2. Create a pool called rbd in the production cluster with 32 placement groups. In the backup
cluster, configure a pool to mirror the data from the rbd pool in the production cluster to
the backup cluster. Pool-mode mirroring always mirrors data between two pools that have
the same name in both clusters.
2.1. In the production cluster, create a pool called rbd with 32 placement groups. Enable the rbd client application on the pool so that it can be used by the RBD feature.
2.2. In the backup cluster, create a pool called rbd with 32 placement groups. Enable the rbd client application on the pool so that it can be used by the RBD feature.
3. In the production cluster, create a test RBD image and verify it. Enable pool-mode
mirroring on the pool.
3.1. Create an RBD image called image1 in the rbd pool in the production cluster.
Specify a size of 1024 MB. Enable the exclusive-lock and journaling RBD
image features.
3.2. List the images, and show the information about the image1 image in the rbd pool.
3.3. Enable pool-mode mirroring on the rbd pool, and verify it.
4. In the production cluster, create a /root/mirror/ directory. Run the cephadm shell by
using the --mount argument to mount the /root/mirror/ directory. Bootstrap the
storage cluster peer and create Ceph user accounts, then save the token in the /mnt/
bootstrap_token_prod file in the container. Copy the bootstrap token file to the
backup storage cluster.
4.1. On the clienta node, exit the cephadm shell. Create the /root/mirror/
directory, then run the cephadm shell to bind mount the /root/mirror directory.
4.2. Bootstrap the storage cluster peer and save the output in the /mnt/
bootstrap_token_prod file. Name the production cluster prod.
4.3. Exit the cephadm shell to the clienta host system. Copy the bootstrap token file to
the backup storage cluster in the /root directory.
5. In the backup cluster, run the cephadm shell with a bind mount of the /root/bootstrap_token_prod file. Deploy an rbd-mirror daemon on the serverf node. Import the bootstrap token. Verify that the RBD image is present.
5.1. On the serverf node, exit the cephadm shell. Run the cephadm shell again to bind
mount the /root/mirror directory.
5.2. Deploy an rbd-mirror daemon, using the --placement argument to set the serverf.lab.example.com node, and then verify it.
Important
Ignore the known error containing the following text: auth: unable to find a
keyring on …
6.1. In the production cluster, run the cephadm shell. Display the pool information and
status.
Peer Sites:
UUID: deacabfb-545f-4f53-9977-ce986d5b93b5
Name: bup
Mirror UUID: bec08767-04c7-494e-b01e-9c1a75f9aa0f
Direction: tx-only
[ceph: root@clienta /]# rbd mirror pool status
health: UNKNOWN
daemon health: UNKNOWN
image health: OK
images: 1 total
1 replaying
6.2. In the backup cluster, display the pool information and status.
Peer Sites:
UUID: 591a4f58-3ac4-47c6-a700-86408ec6d585
Name: prod
Direction: rx-only
Client: client.rbd-mirror-peer
[ceph: root@serverf /]# rbd mirror pool status
health: OK
daemon health: OK
image health: OK
images: 1 total
1 replaying
7. Clean up your environment. Delete the RBD image from the production cluster and verify
that it is absent from both clusters.
7.1. In the production cluster, remove the image1 block device from the rbd pool.
7.2. In the production cluster, list block devices in the rbd pool.
7.3. In the backup cluster, list block devices in the rbd pool.
8. Exit and close the second terminal. Return to workstation as the student user.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to configure the Ceph iSCSI Gateway to export
RADOS Block Devices using the iSCSI protocol, and configure clients to use the iSCSI Gateway.
The Linux I/O target kernel subsystem runs on every iSCSI gateway to support the iSCSI protocol. Previously called LIO, the iSCSI target subsystem is now called TCM, or the Target Core Module. The TCM subsystem uses a user-space pass-through (TCMU) to interact with the Ceph librbd library to expose RBD images to iSCSI clients.
In the cephadm shell, run the ceph tell <daemon_type>.<id> config set command to
set the timeout parameters.
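For example, the following commands lower the OSD heartbeat settings so that failed OSDs are detected more quickly; the values shown here are illustrative and should be tuned for your environment:
[ceph: root@node /]# ceph tell osd.* config set osd_heartbeat_grace 20
[ceph: root@node /]# ceph tell osd.* config set osd_heartbeat_interval 5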
• Install the iSCSI gateway nodes with Red Hat Enterprise Linux 8.3 or later.
• Have an operational cluster running Red Hat Ceph Storage 5 or later.
• Have 90 MiB of RAM available for each RBD image exposed as a target on iSCSI gateway
nodes.
• Open TCP ports 3260 and 5000 on the firewall on each Ceph iSCSI node.
• Create a new RADOS block device or use an existing, available device.
service_type: iscsi
service_id: iscsi
placement:
hosts:
- serverc.lab.example.com
- servere.lab.example.com
spec:
pool: iscsipool1
trusted_ip_list: "172.25.250.12,172.25.250.14"
api_port: 5000
api_secure: false
api_user: admin
api_password: redhat
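Assuming that this specification is saved to a file such as iscsi-gateway.yaml (the file name is arbitrary), apply it with the Ceph Orchestrator:
[ceph: root@node /]# ceph orch apply -i iscsi-gateway.yaml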
Open a web browser and log in to the Ceph Dashboard as a user with administrative privileges. In
the Ceph Dashboard web UI, click Block → iSCSI to display the iSCSI Overview page.
After the Ceph Dashboard is configured to access the iSCSI gateway APIs, use it to manage iSCSI
targets. Use the Ceph Dashboard to create, view, edit, and delete iSCSI targets.
The Using gwcli to add more iSCSI gateways section of the Block Device Guide for Red Hat Ceph
Storage 5 provides detailed instructions on how to manage iSCSI targets using the ceph-iscsi
gwcli utility.
These are example steps to configure an iSCSI target from the Ceph Dashboard.
b. Click +Add portal and select the first of at least two gateways.
c. Click +Add image and select an image for the target to export.
Note
A system might be able to access the same storage device through multiple
different communication paths, whether those are using Fibre Channel, SAS, iSCSI,
or some other technology. Multipathing allows you to configure a virtual device that
can use any of these communication paths to access your storage. If one path fails,
then the system automatically switches to use one of the other paths instead.
If deploying a single iSCSI gateway for testing, skip the multipath configuration.
These example steps configure an iSCSI initiator to use multipath support and to log in to an iSCSI
target. Configure your client's Challenge-Handshake Authentication Protocol (CHAP) user name
and password to log in to the iSCSI targets.
devices {
device {
vendor "LIO-ORG"
hardware_handler "1 alua"
path_grouping_policy "failover"
path_selector "queue-length 0"
failback 60
path_checker tur
prio alua
prio_args exclusive_pref_bit
fast_io_fail_tmo 25
no_path_retry queue
}
}
• Update the CHAP user name and password to match your iSCSI gateway configuration in
the /etc/iscsi/iscsid.conf file.
node.session.auth.authmethod = CHAP
node.session.auth.username = user
node.session.auth.password = password
4. Discover and log in to the iSCSI portal, and then view targets and their multipath
configuration.
• Use the multipath command to show devices set up in a failover configuration with a
priority group for each path.
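As a sketch, discovery, login, and multipath verification on the initiator might look like the following; the gateway IP address and target IQN are hypothetical:
[root@client ~]# iscsiadm -m discovery -t st -p 172.25.250.12:3260
[root@client ~]# iscsiadm -m node -T iqn.2001-07.com.ceph:iscsi-gw -l
[root@client ~]# multipath -ll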
Note
If logging in to an iSCSI target through a single iSCSI gateway, then the system
creates a physical device for the iSCSI target (for example, /dev/sdX). If logging in
to an iSCSI target through multiple iSCSI gateways with Device Mapper multipath,
then the Device Mapper creates a multipath device (for example, /dev/mapper/
mpatha).
References
For more information, refer to the The Ceph iSCSI Gateway chapter in the Block
Device Guide for Red Hat Ceph Storage 5 at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/block_device_guide/index#the-ceph-iscsi-gateway
For more information, refer to the Management of iSCSI functions using the Ceph dashboard section in the Dashboard Guide for Red Hat Ceph Storage 5 at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/dashboard_guide/index#management-of-iscsi-functions-on-the-ceph-
dashboard
For more information about Device Mapper Multipath configuration for Red Hat Enterprise Linux 8, refer to Configuring Device Mapper Multipath at
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-
single/configuring_device_mapper_multipath/
Quiz
2. What are two of the requirements for deploying the Ceph iSCSI Gateway? (Choose
two.)
a. Red Hat Enterprise Linux 8.3 or later.
b. At least two nodes on which to deploy the Ceph iSCSI Gateway.
c. 90 MiB of RAM per RBD image exposed as a target.
d. A dedicated network between the initiators and the iSCSI Gateways.
e. A 10 GbE network between the iSCSI Gateways and the Red Hat Ceph Storage cluster
nodes.
3. Which two of the following methods are used to expose an RBD image as an iSCSI target? (Choose two.)
a. Use the targetcli command from the targetcli package.
b. Use the Storage page in the RHEL Web Console.
c. Use the Block → iSCSI page in the Ceph Dashboard.
d. Use the mpathconf command from the mpathconf package.
e. Use the gwcli command from the ceph-iscsi package.
4. Which package must be present on iSCSI initiator systems that connect to a target provided by a Ceph iSCSI gateway?
a. ceph-iscsi
b. iscsi-initiator-utils
c. ceph-common
d. storaged-iscsi
Solution
2. What are two of the requirements for deploying the Ceph iSCSI Gateway? (Choose
two.)
a. Red Hat Enterprise Linux 8.3 or later.
b. At least two nodes on which to deploy the Ceph iSCSI Gateway.
c. 90 MiB of RAM per RBD image exposed as a target.
d. A dedicated network between the initiators and the iSCSI Gateways.
e. A 10 GbE network between the iSCSI Gateways and the Red Hat Ceph Storage cluster
nodes.
3. Which two of the following methods are used to expose an RBD image as an iSCSI target? (Choose two.)
a. Use the targetcli command from the targetcli package.
b. Use the Storage page in the RHEL Web Console.
c. Use the Block → iSCSI page in the Ceph Dashboard.
d. Use the mpathconf command from the mpathconf package.
e. Use the gwcli command from the ceph-iscsi package.
4. Which package must be present on iSCSI initiator systems that connect to a target provided by a Ceph iSCSI gateway?
a. ceph-iscsi
b. iscsi-initiator-utils
c. ceph-common
d. storaged-iscsi
Lab
Outcomes
You should be able to configure two-way pool-mode RBD mirroring between two clusters.
The lab command confirms that the hosts required for this exercise are accessible. It creates the rbd pool in the primary and secondary clusters. It also creates an image called myimage in the primary cluster, with the exclusive-lock and journaling features enabled. Finally, this command creates the /home/admin/mirror-review directory in the primary cluster.
Instructions
1. Log in to clienta as the admin user. Run the cephadm shell with a bind mount of the /
home/admin/mirror-review/ directory. Verify that the primary cluster is in a healthy
state. Verify that the rbd pool is created successfully.
2. Deploy the rbd-mirror daemon in the primary and secondary clusters.
3. Enable pool-mode mirroring on the rbd pool and verify it. Verify that the journaling
feature on the myimage image is enabled.
4. Register the storage cluster peer to the pool, and then copy the bootstrap token file to the
secondary cluster.
5. In the secondary cluster, import the bootstrap token located in the /home/admin/mirror-
review/ directory. Verify that the RBD image is present.
6. Verify the mirroring status in both clusters. Note which is the primary image.
7. Demote the primary image and promote the secondary image, and then verify the change.
8. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade mirror-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to configure two-way pool-mode RBD mirroring between two clusters.
The lab command confirms that the hosts required for this exercise are accessible. It creates the rbd pool in the primary and secondary clusters. It also creates an image called myimage in the primary cluster, with the exclusive-lock and journaling features enabled. Finally, this command creates the /home/admin/mirror-review directory in the primary cluster.
Instructions
1. Log in to clienta as the admin user. Run the cephadm shell with a bind mount of the /
home/admin/mirror-review/ directory. Verify that the primary cluster is in a healthy
state. Verify that the rbd pool is created successfully.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell with a
bind mount. Use the ceph health command to verify that the primary cluster is in a
healthy state.
1.2. Verify that the rbd pool and the myimage image are created.
2.2. Open another terminal window. Log in to serverf as the admin user and use sudo to run a cephadm shell. Use the ceph health command to verify that the secondary cluster is in a healthy state.
3. Enable pool-mode mirroring on the rbd pool and verify it. Verify that the journaling
feature on the myimage image is enabled.
3.1. On the primary cluster, enable pool-mode mirroring on the rbd pool and verify it.
3.2. On the primary cluster, verify the journaling feature on the myimage image.
4. Register the storage cluster peer to the pool, and then copy the bootstrap token file to the
secondary cluster.
4.1. Bootstrap the storage cluster peer and save the output in the /mnt/
bootstrap_token_primary file. Name the production cluster primary.
4.2. Exit the cephadm shell to the clienta host. Copy the bootstrap token file to the
backup storage cluster in the /home/admin directory.
5. In the secondary cluster, import the bootstrap token located in the /home/admin/mirror-
review/ directory. Verify that the RBD image is present.
5.1. Exit the cephadm shell to the serverf host. Use sudo to run the cephadm shell with a
bind mount for the /home/admin/mirror-review/ directory.
Important
Ignore the known error containing the following text: auth: unable to find a
keyring on …
The image could take a few minutes to replicate and display in the list.
6. Verify the mirroring status in both clusters. Note which is the primary image.
6.1. On the primary cluster, run the cephadm shell and verify the mirroring status.
7. Demote the primary image and promote the secondary image, and then verify the change.
7.1. On the primary cluster, demote the image and verify the change.
7.2. On the secondary cluster, promote the image and verify the change.
7.4. On the secondary cluster, verify the change. Note that the primary image is now in the
secondary cluster, on the serverf server.
8.1. Exit and close the second terminal. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade mirror-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• RBD mirroring supports automatic or selective mirroring of images using pool mode or image
mode.
• The RBD mirror agent can replicate pool data between two Red Hat Ceph Storage clusters, in
either one-way or two-way mode, to facilitate disaster recovery.
• Deploying an iSCSI gateway publishes RBD images as iSCSI targets for network-based block
storage provisioning.
Chapter 8
Providing Object Storage Using a RADOS Gateway
Objectives
After completing this section, you should be able to deploy a RADOS Gateway to provide clients
with access to Ceph object storage.
Applications do not use normal file-system operations to access object data. Instead, applications
access a REST API to send and receive objects. Red Hat Ceph Storage supports the two most
common object APIs, Amazon S3 (Simple Storage Service) and OpenStack Swift (OpenStack
Object Storage).
Amazon S3 calls the flat namespace for object storage a bucket while OpenStack Swift calls it
a container. Because a namespace is flat, neither buckets nor containers can be nested. Ceph
typically uses the term bucket, as does this lecture.
A single user account can be configured for access to multiple buckets on the same storage
cluster. Buckets can each have different access permissions and be used to store objects for
different use cases.
The advantage of object storage is that it is easy to use, expand, and scale. Because each object
has a unique ID, it can be stored or retrieved without the user knowing the object's location.
Without the directory hierarchy, relationships between objects are simplified.
Objects, similar to files, contain a binary data stream and can grow to arbitrarily large sizes.
Objects also contain metadata about the object data, and natively support extended metadata
information, typically in the form of key-value pairs. You can also create your own metadata keys
and store custom information in the object as key values.
The core daemon, radosgw, is built on top of the librados library. The daemon provides a web
service interface, based on the Beast HTTP, WebSocket, and networking protocol library, as a
front-end to handle API requests.
The radosgw is a client to Red Hat Ceph Storage that provides object access to other client
applications. Client applications use standard APIs to communicate with the RADOS Gateway, and
the RADOS Gateway uses librados module calls to communicate with the Ceph cluster.
The RADOS Gateway provides the radosgw-admin utility for creating users for the gateway. These users can access only the gateway, and are not cephx users with direct access to the storage cluster. RADOS Gateway clients authenticate using these gateway user
accounts when submitting Amazon S3 or OpenStack Swift API requests. After a gateway user is
authenticated by the RADOS Gateway, the gateway uses cephx credentials to authenticate to the
storage cluster to handle the object request. Gateway users can also be managed by integrating
an external LDAP-based authentication service.
The RADOS Gateway service automatically creates pools on a per-zone basis. These pools use
placement group values from the configuration database and use the default CRUSH hierarchy.
The default pool settings might not be optimal for a production environment.
The RADOS Gateway creates multiple pools for the default zone.
You can manually create pools with custom settings. Red Hat recommends using the zone name
as a prefix for manually created pools, as in .<zone-name>.rgw.control. For example, using
.us-east-1.rgw.buckets.data as a pool name when us-east-1 is the zone name.
Deploying RADOS Gateway instances for static web hosting has restrictions.
• Instances should have domain names that are different from and do not overlap those of the
standard S3 and Swift API gateway instances.
• Instances should use public-facing IP addresses that are different from the standard S3 and
Swift API gateway instances.
Use the Ceph Orchestrator to deploy or remove RADOS Gateway services. Use the Ceph
Orchestrator with either the command-line interface or a service specification file.
In this example, the Ceph Orchestrator deploys the my_rgw_service RADOS Gateway service
with two daemons in a single cluster, and presents the service on port 80.
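A minimal command for such a deployment, assuming the default realm and zone, might look like this:
[ceph: root@node /]# ceph orch apply rgw my_rgw_service --placement="2" --port=80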
The following example YAML file contains common parameters defined for a RADOS Gateway
deployment.
service_type: rgw
service_name: rgw_service_name
placement:
count: 2
hosts:
- node01
- node02
spec:
rgw_frontend_port: 8080
rgw_realm: realm_name
rgw_zone: zone_name
ssl: true
rgw_frontend_ssl_certificate: |
-----BEGIN PRIVATE KEY-----
...output omitted...
-----END PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
...output omitted...
-----END CERTIFICATE-----
networks:
- 172.25.200.0/24
In this example, an RGW service is created with parameters similar to the previous one, but by using the CLI.
Notice that in the service specification file, the parameter names for the realm, zone, and port are different from those used by the CLI. Some parameters, such as the network to be used by RGW instances or the SSL certificate content, can be defined only by using the service specification file.
The count parameter sets the number of RGW instances to be created on each server defined in the hosts parameter. If you create more than one instance, then the Ceph orchestrator sets the port of the first instance to the specified rgw_frontend_port or port value. For each subsequent instance, the port value is increased by 1. Using the previous YAML file example, the service deployment creates:
• Two RGW instances in the node01 server, one with port 8080, another with port 8081.
• Two RGW instances in the node02 server, one with port 8080, another with port 8081.
Each instance listens on its own unique port and returns the same responses to requests. Configure high availability for the RADOS Gateway by deploying a load-balancer service that presents a single service IP address and port.
Note
The Ceph orchestrator service names the daemons by using the format
rgw.<realm>.<zone>.<host>.<random-string>
When using Transport Layer Security/Secure Socket Layer (TLS/SSL), the ports are defined
using an s character at the end of the port number, such as port=443s. The port option
supports a dual-port configuration using the plus character (+), so that users can access the
RADOS Gateway on either of two different ports.
For example, a rgw_frontends configuration can enable the RADOS Gateway to listen on the
80/TCP port, and with TLS/SSL support on the 443/TCP port.
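For instance, a frontend configuration value along these lines serves both ports; the certificate path is hypothetical:
rgw_frontends = "beast port=80+443s ssl_certificate=/etc/ceph/private/rgw.pem"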
Beast configuration options are passed to the embedded web server in the Ceph configuration file
or from the configuration database. If a value is not specified, the default value is empty.
ssl_certificate
Specifies the path to the SSL certificate file used for SSL-enabled endpoints.
ssl_private_key
Specifies an SSL private key, but if a value is not provided, then the file specified by
ssl_certificate is used as the private key.
tcp_nodelay
Sets performance optimization parameters in some environments.
Important
Red Hat recommends use of HAProxy and keepalived services to configure TLS/
SSL access in production environments.
Instead, configure HAProxy and keepalived to balance the load across RADOS Gateway
servers. HAProxy presents only one IP address and it balances the requests to all RGW instances.
Keepalived ensures that the proxy nodes maintain the same presented IP address, independent
of node availability.
Note
Configure at least two separate hosts for the HAProxy and keepalived services to maintain high availability.
You can configure the HAProxy service to use HTTPS. To enable HTTPS, generate SSL keys and
certificates for the configuration. If you do not have a certificate from a Certificate Authority, then
use a self-signed certificate.
Server-side Encryption
You can enable server-side encryption to allow sending requests to the RADOS Gateway service
using unsecured HTTP when it is not possible to send encrypted requests over SSL. Currently, the
server-side encryption scenario is only supported when using the Amazon S3 API.
There are two options to configure server-side encryption for the RADOS Gateway, customer-
provided keys or a key management service.
Customer-provided Keys
This option is implemented according to the Amazon SSE-C specification. Each read or write
request to the RADOS Gateway service contains an encryption key provided by the user via
the S3 client.
An object can only be encrypted with one key and users use different keys to encrypt different
objects. It is the user's responsibility to track the keys used to encrypt each object.
Figure 8.2 demonstrates the encryption flow between RADOS Gateway and an example
HashiCorp Vault key management service.
Currently, HashiCorp Vault and OpenStack Barbican are the tested key management service
implementations for RADOS Gateway.
Note
Integration with OpenStack Barbican is in technology preview and is not yet
supported for production environments.
References
For more information, refer to Red Hat RADOS Gateway for Production Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/object_gateway_guide/
Guided Exercise
Outcomes
You should be able to deploy a Ceph RADOS Gateway by using the Ceph orchestrator.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify the
health of the cluster.
2. View the cluster services. Verify that there are no rgw services running.
3. Create the rgw_service.yaml file. Configure the service to start two RGW instances in
each of the serverd and servere hosts. The ports of the RGW instances must start from
port 8080. Your file should look like this example.
4. Use the Ceph orchestrator to create an RGW service with the rgw_service.yaml file.
View the cluster and RGW service status. Verify that there are two daemons per host.
4.1. Use Ceph orchestrator to create the RGW service with the rgw_service.yaml file.
4.2. View the cluster status and find the status of the new RGW service daemons.
services:
mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age
4m)
mgr: serverc.lab.example.com.aiqepd(active, since 10m), standbys:
clienta.nncugs, serverd.klrkci
osd: 9 osds: 9 up (since 8m), 9 in (since 9m)
rgw: 4 daemons active (2 hosts, 1 zones)
...output omitted...
4.3. Verify that the orchestrator created two running daemons per node.
5. Log in to the serverd node and view the running containers. Filter the running container
processes to find the RGW container. Verify that the Beast embedded web server is
accessible on port 8080 and also on port 8081.
5.1. Exit the cephadm shell. Log in to serverd as the admin user and switch to the root
user. List the running containers, filtered to find the RGW container.
5.2. Verify that the Beast embedded web server is accessible on port 8080, and also on
port 8081. If the gateway is working, you will receive a tagged response.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to configure the RADOS Gateway with multisite
support to allow objects to be stored in two or more geographically diverse Ceph storage clusters.
The latest multisite configuration simplifies failover and failback procedures, supports active/
active replication configuration between clusters, and incorporates new features such as a simpler
configuration and support for namespaces.
Multisite Components
The multisite components and definitions are listed below.
zone
A zone is backed by its own Red Hat Ceph Storage cluster. Each zone has one or more RADOS
Gateways associated with it.
zone group
A zone group is a set of one or more zones. Data stored in one zone in the zone group is
replicated to all other zones in the zone group. One zone in every zone group is designated as
the master zone for that group. The other zones in the zone group are secondary zones.
realm
A realm represents the global namespace for all objects and buckets in the multisite
replication space. A realm contains one or more zone groups, each of which contains one
or more zones. One zone group in the realm is designated as the master zone group, and
the others are secondary zone groups. All RADOS Gateways in the environment pull their
configuration from the RADOS Gateway in the master zone group and master zone.
Because the master zone in the master zone group handles all metadata updates, operations such
as creating users must occur in the master zone.
Important
You can execute metadata operations in a secondary zone, but it is not
recommended because the metadata will not be synchronized over the realm. This
behavior can lead to metadata fragmentation and configuration inconsistency
between zones.
• A single zone configuration has one zone group and one zone in the realm. One or more
(possibly load-balanced) RADOS Gateways are backed by one Red Hat Ceph Storage cluster.
• A multizone configuration has one zone group but multiple zones. Each zone is backed by one
or more RADOS Gateways and an independent Red Hat Ceph Storage cluster. Data stored in
one zone is replicated to all zones in the zone group. This can be used for disaster recovery if
one zone suffers a catastrophic failure.
• A multizone group configuration has multiple zone groups, each with one or more zones. You
can use a multizone group to manage the geographic location of RADOS Gateways within one
or more zones in a region.
• A multiregion configuration allows the same hardware to be used to support multiple object
namespaces that are common across zone groups and zones.
A minimal RADOS Gateway multisite deployment requires two Red Hat Ceph Storage clusters,
and a RADOS Gateway for each cluster. They exist in the same realm and are assigned to the
same master zone group. One RADOS Gateway is associated with the master zone in that zone
group. The other is associated with a separate secondary zone in that zone group. This is a basic
multizone configuration.
When you update the configuration of the master zone, the RADOS Gateway service updates the
period. This new period becomes the current period of the realm, and the epoch of this period
increases its value by one. For other configuration changes, only the epoch is incremented; the
period does not change.
When a multisite configuration is active, the RADOS Gateway performs an initial full
synchronization between the master and secondary zones. Subsequent updates are incremental.
When the RADOS Gateway writes data to any zone within a zone group, it synchronizes this data with all of the other zones in the zone group. When the RADOS Gateway synchronizes data, all active gateways update the data log and notify the other gateways. When the RADOS Gateway synchronizes metadata because of a bucket or user operation, the master zone updates the metadata log and notifies the other RADOS Gateways.
1. Create a realm.
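For example, a realm might be created on the master zone's cluster as follows; the realm name is illustrative:
[ceph: root@node /]# radosgw-admin realm create --rgw-realm=myrealm --default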
Use the --read-only option to set the zone as read-only when adding it to the zone group.
Use the radosgw-admin sync status command to verify the synchronization status.
If the master zone fails, then metadata operations are suspended and you cannot create new buckets and users. If the master zone does not recover immediately, promote one of the secondary zones as a replacement for the master zone.
To promote a secondary zone, modify the zone and zone group and commit the period update.
2. Update the master zone group after changing the role of the zone.
The following command defines the metadata-zone zone as a metadata zone managed by
Elasticsearch:
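A sketch based on the upstream Elasticsearch sync module, with a hypothetical Elasticsearch endpoint, might look like this:
[ceph: root@node /]# radosgw-admin zone modify --rgw-zone=metadata-zone \
--tier-type=elasticsearch --tier-config=endpoint=http://es-server:9200,num_shards=10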
To view details of the Ceph RADOS Gateway service, log in to the Dashboard and click Object
Gateway. You are presented with a choice of Daemons, Users, or Buckets.
References
RGW Metadata Search
http://ceph.com/rgw/new-luminous-rgw-metadata-search/
Guided Exercise
Outcomes
You should be able to deploy a Ceph RADOS Gateway and configure multisite replication by
using serverc in the primary cluster as site us-east-1 and serverf as site us-east-2.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Open two terminals and log in to both serverc and serverf as the admin user. Verify
that both clusters are reachable and have a HEALTH_OK status.
1.1. Open a terminal window. Log in to serverc as the admin user and use sudo to run
the cephadm shell. Verify that the primary cluster is in a healthy state.
1.2. Open another terminal window. Log in to serverf as the admin user and use sudo
to run the cephadm shell. Verify that the secondary cluster is in a healthy state.
2. On the serverc node, configure the us-east-1 site. Create a realm, zone group,
zone, and a replication user. Set the realm and zone as defaults for the site. Commit the
configuration and review the period id. Use the names provided in the table:
Option Name
--realm cl260
--zonegroup classroom
--zone us-east-1
--uid repl.user
2.1. Create a realm called cl260. Set the realm as the default.
2.2. Create a zone group called classroom. Configure the classroom zone group with
an endpoint pointing to the RADOS Gateway running on the serverc node. Set the
classroom zone group as the default.
2.3. Create a master zone called us-east-1. Configure the us-east-1 zone with an
endpoint pointing to http://serverc:80. Use replication as the access key
and secret as the secret key. Set the us-east-1 zone as the default.
"system_key": {
"access_key": "replication",
"secret_key": "secret"
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "us-east-1.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "us-east-1.rgw.buckets.data"
}
},
"data_extra_pool": "us-east-1.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"realm_id": "9eef2ff2-5fb1-4398-a69b-eeb3d9610638",
"notif_pool": "us-east-1.rgw.log:notif"
}
2.4. Create a system user called repl.user to access the zone pools. The keys for the
repl.user user must match the keys configured for the zone.
2.5. Every realm has an associated current period, holding the current state of zone
groups and storage policies. Commit the realm configuration changes to the period.
Note the period id associated with the current configuration.
"id": "7cdc83cf-69d8-478e-b625-d5250ac4435b",
"zonegroups": [
{
"id": "d3524ffb-8a3c-45f1-ac18-23db1bc99071",
"name": "classroom",
"api_name": "classroom",
"is_master": "true",
"endpoints": [
"http://serverc:80"
],
...output omitted...
"master_zonegroup": "d3524ffb-8a3c-45f1-ac18-23db1bc99071",
"master_zone": "4f1863ca-1fca-4c2d-a7b0-f693ddd14882",
...output omitted...
"realm_id": "9eef2ff2-5fb1-4398-a69b-eeb3d9610638",
"realm_name": "cl260",
"realm_epoch": 2
}
3. Create a new RADOS Gateway service called cl260-1 in the cl260 realm and us-
east-1 zone, and with a single RGW daemon on the serverc node. Verify that the RGW
daemon is up and running. Update the zone name in the configuration database.
4. On the serverf node, pull the realm and period configuration in from the serverc node.
Use the credentials for repl.user to authenticate. Verify the current period id is the same
as for the serverc node.
4.1. On the second terminal, pull the realm configuration from the serverc node.
5. On the serverf node, configure the us-east-2 site. Set the cl260 realm and
classroom zone group as the defaults and create the us-east-2 zone. Commit the
site configuration and review the period id. Update the zone name in the configuration
database.
5.1. Set the cl260 realm and classroom zone group as default.
5.2. Create a zone called us-east-2. Configure the us-east-2 zone with an endpoint
pointing to http://serverf:80.
"api_name": "classroom",
"is_master": "true",
"endpoints": [
"http://serverc:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "4f1863ca-1fca-4c2d-a7b0-f693ddd14882",
"zones": [
{
"id": "3879a186-cc0c-4b42-8db1-7624d74951b0",
"name": "us-east-2",
"endpoints": [
"http://serverf:80"
],
...output omitted...
},
{
"id": "4f1863ca-1fca-4c2d-a7b0-f693ddd14882",
"name": "us-east-1",
"endpoints": [
"http://serverc:80"
],
...output omitted...
"master_zonegroup": "d3524ffb-8a3c-45f1-ac18-23db1bc99071",
"master_zone": "4f1863ca-1fca-4c2d-a7b0-f693ddd14882",
...output omitted...
"realm_id": "9eef2ff2-5fb1-4398-a69b-eeb3d9610638",
"realm_name": "cl260",
"realm_epoch": 2
}
6. Create a new RADOS Gateway service called cl260-2 in the cl260 realm and us-
east-2 zone, and with a single RGW daemon on the serverf node. Verify that the RGW
daemon is up and running. View the period associated with the current configuration. Verify
the sync status of the site.
7. Exit and close the second terminal. Return to workstation as the student user.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to deploy and configure the Ceph RADOS Storage service.
This command ensures that the lab environment is created and ready for the lab exercise.
Instructions
1. Log in to serverc as the admin user. Create a realm called prod and set it as the default. Create a master zone group called us-west with the endpoint set to http://serverc:8080. Set the zone group as the default.
2. Create a master zone called us-west-1 and set the endpoint to http://serverc:8080.
Set the zone as the default. Use admin as the access key and secure as the secret key.
3. Create a system user called admin.user with admin as the access key and secure as the
secret key.
4. Commit the configuration. Save the period ID value in the /home/admin/period-id.txt
file on the serverc node. Do not include quotes, double quotes, or characters other than
the period ID in UUID format.
5. Deploy a RADOS Gateway service called prod-object with two instances running on port
8080, one on the serverc node and the second on the servere node.
6. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade object-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to deploy and configure the Ceph RADOS Storage service.
This command ensures that the lab environment is created and ready for the lab exercise.
Instructions
1. Log in to serverc as the admin user. Create a realm called prod and set it as the default. Create a master zone group called us-west with the endpoint set to http://serverc:8080. Set the zone group as the default.
1.1. Log in to serverc as the admin user and use sudo to run the cephadm shell.
2. Create a master zone called us-west-1 and set the endpoint to http://serverc:8080.
Set the zone as the default. Use admin as the access key and secure as the secret key.
3. Create a system user called admin.user with admin as the access key and secure as the
secret key.
5. Deploy a RADOS Gateway service called prod-object with two instances running on port
8080, one on the serverc node and the second on the servere node.
Evaluation
Grade your work by running the lab grade object-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• The RADOS Gateway is a service that connects to a Red Hat Ceph Storage cluster, and provides
object storage to applications using a REST API.
• You can deploy the RADOS Gateway by using the Ceph orchestrator command-line interface or
by using a service specification file.
• You can use HAProxy and keepalived services to load balance the RADOS Gateway service.
• The RADOS Gateway supports multisite configuration, which allows RADOS Gateway objects to
be replicated between separate Red Hat Ceph Storage clusters.
• Objects written to a RADOS Gateway for one zone are replicated to all other zones in the zone
group.
• Metadata and configuration updates must occur in the master zone of the master zone group.
Chapter 9
Accessing Object Storage Using a REST API
Objectives
After completing this section, you should be able to configure the RADOS Gateway to provide
access to object storage compatible with the Amazon S3 API, and manage objects stored using
that API.
The Amazon S3 interface defines the namespace in which objects are stored as a bucket. To
access and manage objects and buckets using the S3 API, applications use RADOS Gateway
users for authentication. Each user has an access key that identifies the user and a secret key that
authenticates the user.
There are object and metadata size limits to consider when using the Amazon S3 API:
When creating RADOS Gateway users, both the --uid and --display-name options are required, and specify a unique account name and a human-friendly display name. Use the --access-key and --secret options to specify a custom access key and secret key for the RADOS Gateway user.
"secret_key": "67890"
}
...output omitted...
If the access key and secret key are not specified, the radosgw-admin command automatically
generates them and displays them in the output.
Important
When the radosgw-admin command automatically generates the access key and secret, either key might include a JSON escape character (\). Clients might not handle this character correctly. Either regenerate or manually specify the keys to avoid this issue.
To add an access key to an existing user, use the --gen-access-key option. Creating additional
keys is convenient for granting the same user access to multiple applications that require different
or unique keys.
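For example, the following command (the user ID is illustrative) adds a second generated S3 key pair to an existing user and prints the updated key list:
[ceph: root@node /]# radosgw-admin key create --uid=s3user --key-type=s3 \
--gen-access-key --gen-secret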
"secret_key": "MFVxrGNMBjKOO7JscLFbEyrEmJFnLl43PHSswpLC"
},
{
"user": "s3user",
"access_key": "GPYJGPSONURDY7SG0LLO",
"secret_key": "T7jcG5YgEqqPxWMkdCTBsY0DM3rgIOmqkmtjRlCX"
}
...output omitted...
To remove an access key and related secret key from a user, use the radosgw-admin key rm
command with the --access-key option. This is useful for removing single application access
without impacting access with other keys.
Temporarily disable and enable RADOS Gateway users by using the radosgw-admin user
suspend and radosgw-admin user enable commands. When suspended, a user's subusers
are also suspended and unable to interact with the RADOS Gateway service.
You can modify user information such as email, display name, keys and access control level. The
access control levels are: read, write, readwrite, and full. The full access level includes
the readwrite level and the access control management capability.
To remove a user and also delete their objects and buckets, use the --purge-data option.
Set quotas to limit the amount of storage a user or bucket can consume. Set the quota parameters
first, then enable the quota. To disable a quota, set a negative value for the quota parameter.
Bucket quotas apply to all buckets owned by a specific UID, regardless of the user accessing or uploading to those buckets.
In this example, the quota for the app1 user is set to a maximum of 1024 objects. The user quota is
then enabled.
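A sketch of those two commands might look like this:
[ceph: root@node /]# radosgw-admin quota set --quota-scope=user --uid=app1 --max-objects=1024
[ceph: root@node /]# radosgw-admin quota enable --quota-scope=user --uid=app1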
Similarly, apply quotas to buckets by setting the --quota-scope option to bucket. In this
example, the loghistory bucket is set for a maximum size of 1024 bytes.
Storage administrators monitor usage statistics to determine the storage consumption or user
bandwidth usage. Monitoring can also help find inactive applications or inappropriate user quotas.
Use the radosgw-admin user stats and radosgw-admin user info commands to view
user information and statistics.
Use the radosgw-admin usage show command to show the usage statistics of a user between
specific dates.
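For example, the following command (the user ID and dates are illustrative) displays the usage of a single user for October 2021:
[ceph: root@node /]# radosgw-admin usage show --uid=s3user \
--start-date=2021-10-01 --end-date=2021-10-31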
View the statistics for all users by using the radosgw-admin usage show command. Use these
overall statistics to help understand object storage patterns and plan the deployment of new
instances for scaling the RADOS gateway service.
To support accessing buckets by using virtual host-style addressing, set the rgw_dns_name configuration option to a dns_suffix value, where dns_suffix is the fully qualified domain name to be used to create your bucket's name.
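For example, the following command sets the option in the configuration database; the domain name is hypothetical:
[ceph: root@node /]# ceph config set client.rgw rgw_dns_name objects.example.com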
In addition to configuring rgw_dns_name, you must configure your DNS server with a wildcard
DNS record for that domain that points to the RADOS Gateway IP address. Syntax for
implementing wildcard DNS entries varies for different DNS servers.
Upload objects to a bucket by using the aws s3 cp command. This example command uploads an object called demoobject to the demobucket bucket, using the local file /tmp/demoobject.
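A sketch of such a command, assuming a gateway endpoint such as http://serverc:8080, might be:
[admin@clienta ~]$ aws --endpoint-url=http://serverc:8080 s3 cp /tmp/demoobject s3://demobucket/demoobject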
Note
There are multiple S3 public clients available, such as awscli, cloudberry, cyberduck,
and curl, which provide access to object storage supporting the S3 API.
RADOS Gateway also supports S3 API object expiration by using rules defined for a set of bucket
objects. Each rule has a prefix, which selects the objects, and a number of days after which objects
become unavailable.
RADOS Gateway supports only a subset of the Amazon S3 API policy language applied to
buckets. No policy support is available for users, groups, or roles. Bucket policies are managed
through standard S3 operations rather than using the radosgw-admin command.
• The Resource key defines the resources whose permissions the policy modifies. The policy uses the Amazon Resource Name (ARN) associated with the resource to identify it.
• The Action key defines the operations allowed or denied for a resource. Each resource has a set of operations available.
• The Effect key indicates if the policy allows or denies the action previously defined for a
resource. By default, a policy denies the access to a resource.
• The Principal key defines the user to whom the policy allows or denies access to a resource.
{
"Version": "2021-03-10",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": ["arn:aws:iam::testaccount:user/testuser"]
},
"Action": "s3:ListBucket",
"Resource": [
"arn:aws:s3:::testbucket"
]
}
]
}
Ceph Object Gateway implements a subset of STS APIs that provide temporary credentials for
identity and access management. These temporary credentials can be used to make subsequent
S3 calls, which will be authenticated by the STS engine in RGW. Permissions of the temporary
credentials can be further restricted via an IAM policy passed as a parameter to the STS APIs.
Ceph Object Gateway supports STS AssumeRoleWithWebIdentity.
References
For more information, refer to the Static Web Hosting section in the Red Hat Ceph
Storage 5 Object Gateway Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/object_gateway_guide/basic-configuration#static-web-hosting
For more information, refer to the Security section in the Red Hat Ceph Storage 5
Object Gateway Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/object_gateway_guide/security
For more information, refer to the Ceph Object Gateway and the S3 API chapter in
the Red Hat Ceph Storage 5 Developer Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/developer_guide/ceph-object-gateway-and-the-s3-api
Guided Exercise
Outcomes
You should be able to configure the Ceph Object Gateway to allow access to Ceph object
storage via the Amazon S3 API.
Instructions
1. Log in to clienta as the admin user.
2. Create an Amazon S3 API user called operator. Use 12345 as the S3 access key and
67890 as the secret key.
3. Configure the AWS CLI tool to use operator credentials. Enter 12345 as the access key
and 67890 as the secret key.
7. Use the radosgw-admin command to view the metadata of the testbucket bucket.
"ver": {
"tag": "_2d3Y6puJve1TnYs0pwHc0Go",
"ver": 1
},
"mtime": "2021-10-06T01:51:37.514627Z",
"data": {
"bucket": {
"name": "testbucket",
"marker": "cb16a524-d938-4fa2-837f-d1f2011676e2.54360.1",
"bucket_id": "cb16a524-d938-4fa2-837f-d1f2011676e2.54360.1",
"tenant": "",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
}
},
"owner": "operator",
"creation_time": "2021-10-06T01:51:37.498002Z",
"linked": "true",
"has_bucket_info": "false"
}
}
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to configure the RADOS Gateway to provide
access to object storage compatible with the Swift API, and manage objects stored using that API.
The OpenStack Swift API is an alternative to the Amazon S3 API to access objects stored in
the Red Hat Ceph Storage cluster through a RADOS Gateway. There are important differences
between the OpenStack Swift and Amazon S3 APIs.
OpenStack Swift refers to the namespace in which objects are stored as a container.
The OpenStack Swift API has a different user model than the Amazon S3 API. To authenticate
with a RADOS Gateway using the OpenStack Swift API, you must configure subusers for your
RADOS Gateway user accounts.
The OpenStack Swift API, however, has a multi-tier design, built to accommodate tenants and
assigned users. A Swift tenant owns the storage and its containers used by a service. Swift users
are assigned to the service and have different levels of access to the storage owned by the tenant.
To accommodate the OpenStack Swift API authentication and authorization model, RADOS
Gateway has the concept of subusers. This model allows Swift API tenants to be handled as
RADOS Gateway users, and Swift API users to be handled as RADOS Gateway subusers.
The Swift API tenant:user tuple maps to RADOS Gateway authentication system as a
user:subuser. A subuser is created for each Swift user, and it is associated with a RADOS
Gateway user and an access key.
The --access option sets the user permissions (read, write, read/write, full), and --uid specifies
the existing associated RADOS Gateway user. Use the radosgw-admin key create command
with the --key-type=swift option to create a Swift authentication key associated with the
subuser.
When a Swift client communicates with a RADOS Gateway, then the latter acts as both the data
server and the Swift authentication daemon (using the /auth URL path). The RADOS Gateway
supports both Internal Swift (version 1.0) and OpenStack Keystone (version 2.0) authentication.
The secret specified using -K is the secret created with the Swift key.
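For example, a Swift client using Internal Swift (version 1.0) authentication might list containers as follows; the endpoint, subuser, and key are illustrative:
[admin@clienta ~]$ swift -A http://serverc:8080/auth/v1.0 -U operator:swift \
-K mysecretkey list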
Note
Ensure your command-line parameters are not being overridden or influenced by
any operating system environment variables. If you are using Auth version 1.0, then
use the ST_AUTH, ST_USER, and ST_KEY environment variables. If you are using
Auth version 2.0, then use the OS_AUTH_URL, OS_USERNAME, OS_PASSWORD,
OS_TENANT_NAME, and OS_TENANT_ID environment variables.
In the Swift API, a container is a collection of objects. An object in the Swift API is a binary large
object (blob) of data stored in Swift.
To verify RADOS Gateway accessibility using the Swift API, use the swift post command to
create a container.
If you use the absolute path to define the file location, then the object's name contains the path to the file, including the slash character (/). For example, the following command uploads the /etc/hosts file to the services bucket.
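A sketch of such an upload, reusing the illustrative credentials from the previous example, might be:
[admin@clienta ~]$ swift -A http://serverc:8080/auth/v1.0 -U operator:swift \
-K mysecretkey upload services /etc/hosts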
In this example, the uploaded object name is etc/hosts. You can define the object name by
using the --object-name option.
You can manage the subuser keys by using the radosgw-admin key command. This example
creates a subuser key.
The --key-type option accepts only the values swift or s3. Use the --access-key option if you want to manually specify an S3 access key, and use the --secret-key option if you want to manually specify an S3 or Swift secret key. If the access key and secret key are not specified, then the radosgw-admin command automatically generates them and displays them in the output. Alternatively, use the --gen-access-key option to generate only a random access key, or the --gen-secret option to generate only a random secret.
To enable versioning on a container, set the value of a container flag to the name of the container that stores the versions. Set the flag when creating new containers or by updating the metadata on existing containers.
Note
You should use a different archive container for each container to be versioned.
Enabling versioning on an archive container is not recommended.
The Swift API supports two header keys for this versioning flag, X-History-Location and X-Versions-Location. The key that you choose determines how the Swift API handles object DELETE operations.
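The following sketch enables versioning on a container named my-container and archives old versions to a container named my-archive; both container names are illustrative, and the archive container must already exist.
[admin@clienta ~]$ swift -A http://serverc.lab.example.com/auth/1.0 \
-U operator:swift -K opswift post my-archive
[admin@clienta ~]$ swift -A http://serverc.lab.example.com/auth/1.0 \
-U operator:swift -K opswift post my-container \
-H "X-Versions-Location: my-archive"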
With the X-History-Location flag set, deleting an object copies it to the archive container and removes the original copy from the versioned container, so later requests for the object return a 404 Not Found error. You can recover the object from the archive container.
With the X-Versions-Location flag set, Swift removes the current object version from the versioned container. Then, Swift copies the most recent object version from the archive container back to the versioned container, and deletes that version from the archive container.
To completely remove an object from a versioned container with the X-Versions-Location
flag set, you must remove the object as many times as there are object versions available in the
archive container.
Set only one of these flags at a time on an OpenStack Swift container. If the container's metadata contains both flags, then the Swift API returns a 400 Bad Request error.
RADOS Gateway supports the Swift API object versioning feature. To activate this
feature in the RADOS Gateway, set rgw_swift_versioning_enabled to true in the
[client.radosgw.radosgw-name] section in the /etc/ceph/ceph.conf configuration file.
RADOS Gateway also supports using the X-Delete-At and X-Delete-After headers when
adding objects using the Swift API. At the time specified by the header, RADOS Gateway stops
serving that object, and removes it shortly after.
Configure Swift API tenants in RADOS Gateway with the radosgw-admin command. Provide the tenant name with the --tenant option when you create the user.
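For example, this hedged sketch creates a tenant-owned user and a matching Swift subuser; the testtenant and testuser names are illustrative.
[ceph: root@clienta /]# radosgw-admin user create --tenant=testtenant \
--uid=testuser --display-name="Test User"
[ceph: root@clienta /]# radosgw-admin subuser create --tenant=testtenant \
--uid=testuser --subuser=testuser:swift --access=full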
References
For more information, refer to the Configuration Reference chapter in the Object
Gateway Configuration and Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/object_gateway_guide/index#rgw-configuration-reference-rgw
Guided Exercise
Outcomes
You should be able to configure the Ceph Object Gateway to allow access to Ceph object
storage via the Swift API.
This command confirms that the hosts required for this exercise are accessible. It installs the
awscli package on clienta and creates the operator user and the testbucket bucket.
Instructions
1. Log in to clienta as the admin user.
2. Create a Swift subuser called operator:swift on the Ceph Object Gateway for
accessing object storage using the Swift API.
You are creating a new subuser attached to the operator user that was created by the
lab start command. The operator user can currently access Ceph object storage
using the S3 API. With this additional configuration, the operator user can create objects
and buckets using the S3 API, and then use the Swift API to manage the same objects.
Use opswift as the Swift secret key. Assign full permissions to the user.
"permissions": "full-control"
}
],
"keys": [
{
"user": "operator",
"access_key": "12345",
"secret_key": "67890"
}
],
"swift_keys": [
{
"user": "operator:swift",
"secret_key": "opswift"
}
],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"default_storage_class": "",
...output omitted...
3. Install the swift client on clienta. Verify access using the operator:swift subuser
credentials. Use -K to specify the Swift secret key for the user.
3.2. Verify the bucket status. Your output might be different in your lab environment.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to:
• Create buckets and containers using the Amazon S3 and OpenStack Swift APIs.
• Upload and download objects using the Amazon S3 and OpenStack Swift APIs.
This command ensures that the lab environment is created and ready for the lab exercise.
This command confirms that the hosts required for this exercise are accessible and
configures a multisite RADOS Gateway service.
Instructions
Important
This lab runs on a multisite RADOS Gateway deployment.
To ensure that metadata operations, such as user and bucket creation, occur on
the master zone and are synced across the multisite service, perform all metadata
operations on the serverc node. Other normal operations, such as uploading or
downloading objects from the RADOS Gateway service, can be performed on any
cluster node that has access to the service endpoint.
1. On the serverc node, create a user for the S3 API and a subuser for the Swift API. Create
the S3 user with the name S3 Operator, UID operator, access key 12345, and secret key
67890. Grant full access to the operator user.
Create the Swift subuser of the operator user with the name operator:swift and the
secret opswift. Grant full access to the subuser.
2. Configure the AWS CLI tool to use the operator user credentials. Create a bucket called
log-artifacts. The RADOS Gateway service is running on the default port on the
serverc node.
3. Create a container called backup-artifacts. The RADOS Gateway service is on the
default port on the serverc node.
4. Create a 10MB file called log-object-10MB.bin in the /tmp directory. Upload the log-
object-10MB.bin file to the log-artifacts bucket. On the serverf node, download
the log-object-10MB.bin from the log-artifacts bucket.
5. On the serverf node, create a 20 MB file called backup-object-20MB.bin in the /tmp
directory. Upload the backup-object-20MB.bin file to the backup-artifacts container,
using the service default port. View the status of the backup-artifacts container and verify
that the Objects field has the value of 1.
6. On the serverc node, download the backup-object-20MB.bin file to the /home/
admin directory.
7. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade api-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to:
• Create buckets and containers using the Amazon S3 and OpenStack Swift APIs.
• Upload and download objects using the Amazon S3 and OpenStack Swift APIs.
This command ensures that the lab environment is created and ready for the lab exercise.
This command confirms that the hosts required for this exercise are accessible and
configures a multisite RADOS Gateway service.
Instructions
Important
This lab runs on a multisite RADOS Gateway deployment.
To ensure that metadata operations, such as user and bucket creation, occur on
the master zone and are synced across the multisite service, perform all metadata
operations on the serverc node. Other normal operations, such as uploading or
downloading objects from the RADOS Gateway service, can be performed on any
cluster node that has access to the service endpoint.
1. On the serverc node, create a user for the S3 API and a subuser for the Swift API. Create
the S3 user with the name S3 Operator, UID operator, access key 12345, and secret key
67890. Grant full access to the operator user.
Create the Swift subuser of the operator user with the name operator:swift and the
secret opswift. Grant full access to the subuser.
1.2. Create an Amazon S3 API user called S3 Operator with the UID of operator. Assign
an access key of 12345 and a secret of 67890, and grant the user full access.
1.3. Create a Swift subuser called operator:swift. Set opswift as the subuser secret
and grant full access.
2. Configure the AWS CLI tool to use the operator user credentials. Create a bucket called
log-artifacts. The RADOS Gateway service is running on the default port on the
serverc node.
2.1. Configure the AWS CLI tool to use operator credentials. Enter 12345 as the access key
and 67890 as the secret key.
4. Create a 10MB file called log-object-10MB.bin in the /tmp directory. Upload the log-
object-10MB.bin file to the log-artifacts bucket. On the serverf node, download
the log-object-10MB.bin from the log-artifacts bucket.
4.3. Log in to serverf as the admin user. Download the log-object-10MB.bin from
the log-artifacts bucket.
5. On the serverf node, create a 20 MB file called backup-object-20MB.bin in the /tmp
directory. Upload the backup-object-20MB.bin file to the backup-artifacts container,
using the service default port. View the status of the backup-artifacts container and verify
that the Objects field has the value of 1.
5.3. View the statistics for the backup-artifacts bucket and verify that it contains the
uploaded object.
Evaluation
Grade your work by running the lab grade api-review command from your workstation
machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• You can access the RADOS Gateway by using clients that are compatible with the Amazon S3
API or the OpenStack Swift API.
• The RADOS Gateway can be configured to use either of the bucket name formats supported by
the Amazon S3 API.
• To support authentication by using the OpenStack Swift API, Swift users are represented by
RADOS Gateway subusers.
• You can define deletion policies for Amazon S3 buckets, and object versioning for Swift
containers, to manage the behavior of deleted objects.
Chapter 10
Providing File Storage with CephFS
Objectives
After completing this section, you should be able to provide file storage on the Ceph cluster by
deploying the Ceph File System (CephFS).
Block-based storage provides a storage volume that operates similarly to a disk device, organized
into equally sized chunks. Typically, block-based storage volumes are either formatted with a file
system, or applications such as databases directly access and write to them.
With object-based storage, you can store arbitrary data and metadata as a unit that is labeled with
a unique identifier in a flat storage pool. Rather than accessing data as blocks or in a file-system
hierarchy, you use an API to store and retrieve objects. Fundamentally, the Red Hat Ceph Storage
RADOS cluster is an object store.
MDS daemons operate in two modes: active and standby. An active MDS manages the metadata
on the CephFS file system. A standby MDS serves as a backup, and switches to the active mode
if the active MDS becomes unresponsive. CephFS shared file systems require an active MDS
service. You should deploy at least one standby MDS in your cluster to ensure high availability.
If the number of running standby MDS daemons is lower than the configured wanted number of standby daemons, then the Ceph cluster displays a WARN health status. The recommended solution is to deploy more MDS daemons to match the wanted standby count. However, a temporary solution is to set the wanted number of standby daemons to 0, which disables the Ceph MDS standby check, through the ceph fs set fs-name standby_count_wanted 0 command.
CephFS clients first contact a MON to authenticate and retrieve the cluster map. Then, the client
queries an active MDS for file metadata. The client uses the metadata to access the objects that
comprise the requested file or directory by communicating directly with the OSDs.
MDS features and configuration options are described in the following list:
MDS Ranks
MDS ranks define how the metadata workload is distributed over the MDS daemons. The
number of ranks, which is defined by the max_mds configuration setting, is the maximum
number of MDS daemons that can be active at a time. MDS daemons start without a rank and
the MON daemon is responsible for assigning them a rank.
Note
You can create snapshots of subvolumes, but Red Hat Ceph Storage 5 does not
support creating snapshots of subvolume groups. You can list and remove existing
snapshots of subvolume groups.
Quotas
Configure your CephFS file system to restrict the number of bytes or files that are stored
by using quotas. Both the FUSE and kernel clients support checking quotas when mounting
a CephFS file system. These clients are also responsible for stopping writing data to the
CephFS file system when the user reaches the quota limit. Use the setfattr command's
ceph.quota.max_bytes and ceph.quota.max_files options to set the limits.
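For example, this minimal sketch limits a directory to approximately 10 GiB and 10,000 files; the mount point and values are illustrative.
[admin@clienta ~]$ setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/mycephfs/dir1
[admin@clienta ~]$ setfattr -n ceph.quota.max_files -v 10000 /mnt/mycephfs/dir1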
• Red Hat Ceph Storage 5 supports more than one active MDS in a cluster, which can increase
metadata performance. To remain highly available, you can configure additional standby MDSes
to take over from any active MDS that fails.
• Red Hat Ceph Storage 5 supports more than one CephFS file system in a cluster. Deploying
more than one CephFS file system requires running more MDS daemons.
Deploying CephFS
To implement a CephFS file system, create the required pools, create the CephFS file system,
deploy the MDS daemons, and then mount the file system. You can manually create the pools,
create the CephFS file system, and deploy the MDS daemons, or use the ceph fs volume
create command, which does all these steps automatically. The first option gives the system
administrator more control over the process, but with more steps than the simpler ceph fs
volume create command.
This example creates two pools with standard parameters. Because the metadata pool stores
file location information, consider a higher replication level for this pool to avoid data errors
that render your data inaccessible.
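A hedged reconstruction of that example follows, using the pool names from the guided exercise later in this chapter.
[ceph: root@clienta /]# ceph osd pool create mycephfs_data
[ceph: root@clienta /]# ceph osd pool create mycephfs_metadata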
By default, Ceph uses replicated data pools. However, erasure-coded data pools are now also
supported for CephFS file systems. Create an erasure-coded pool with the ceph osd pool
command:
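A minimal sketch follows; the pool name is illustrative, and CephFS erasure-coded data pools also require the allow_ec_overwrites setting.
[ceph: root@clienta /]# ceph osd pool create mycephfs_ec_data erasure
[ceph: root@clienta /]# ceph osd pool set mycephfs_ec_data allow_ec_overwrites true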
To add an existing erasure pool as a data pool in your CephFS file system, use ceph fs
add_data_pool.
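For example, assuming a file system named mycephfs and the erasure-coded pool created above:
[ceph: root@clienta /]# ceph fs add_data_pool mycephfs mycephfs_ec_data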
service_type: mds
service_id: fs-name
placement:
  hosts:
    - host-name-1
    - host-name-2
    - ...
Use the YAML service specification to deploy the MDS service with the ceph orch apply
command:
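A hedged sketch follows; mds-service.yaml is an illustrative name for a file containing the specification above.
[ceph: root@clienta /]# ceph orch apply -i mds-service.yaml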
Finally, create the CephFS file system with the ceph fs new command.
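For example, assuming the pool names used earlier in this section:
[ceph: root@clienta /]# ceph fs new mycephfs mycephfs_metadata mycephfs_data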
The kernel client requires a Linux kernel version 4 or later, which is available starting with RHEL 8.
For previous kernel versions, use the FUSE client instead.
The two clients have unique advantages and disadvantages. Not all features are supported in
both clients. For example, the kernel client does not support quotas, but can be faster. The FUSE
client supports quotas and ACLs. You must enable ACLs to use them with the CephFS file system
mounted with the FUSE client.
• Install the ceph-common package. For the FUSE client, also install the ceph-fuse package.
• Verify that the Ceph configuration file exists (/etc/ceph/ceph.conf by default).
• Authorize the client to access the CephFS file system.
• Extract the new authorization key with the ceph auth get command and copy it to the /etc/
ceph folder on the client host.
• When using the FUSE client as a non-root user, add user_allow_other in the /etc/
fuse.conf configuration file.
To provide the key ring for a specific user, use the --id option.
You must authorize the client to access the CephFS file system by using the ceph fs authorize command.
With the ceph fs authorize command, you can provide fine-grained access control for
different users and folders in the CephFS file system. You can set different options for folders in a
CephFS file system:
• r: Read access to the specified folder. Read access is also granted to the subfolders, if no other
restriction is specified.
• w: Write access to the specified folder. Write access is also granted to the subfolders, if no other
restriction is specified.
• p: Clients require the p option in addition to r and w capabilities to use layouts or quotas.
• s: Clients require the s option in addition to r and w capabilities to create snapshots.
This example allows one user to read the root folder, and also provides read, write, and snapshot
permissions to the /directory folder.
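A hedged sketch of that example follows; the client name webuser is an illustrative placeholder.
[ceph: root@clienta /]# ceph fs authorize mycephfs client.webuser / r /directory rws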
By default, the CephFS FUSE client mounts the root directory (/) of the accessed file system. You
can mount a specific directory with the ceph-fuse -r directory command.
Note
When you try to mount a specific directory, this operation fails if the directory does
not exist in the CephFS volume.
When more than one CephFS file system is configured, the CephFS FUSE client mounts the
default CephFS file system. To use a different file system, use the --client_fs option.
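For example, this minimal sketch mounts the /directory subtree of a file system named mycephfs as the webuser client; all of these names are illustrative.
[root@clienta ~]# ceph-fuse --id webuser --client_fs mycephfs -r /directory /mnt/mycephfs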
To persistently mount your CephFS file system by using the FUSE client, you can add the following
entry to the /etc/fstab file:
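A hedged example of such an entry follows, assuming the illustrative webuser client ID and mount point.
none /mnt/mycephfs fuse.ceph ceph.id=webuser,_netdev 0 0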
You must authorize the client to access the CephFS file system, with the ceph fs authorize
command. Extract the client key with the ceph auth get command, and then copy the key to
the /etc/ceph folder on the client host.
With the CephFS kernel client, you can mount a specific subdirectory from a CephFS file system.
This example mounts a directory called /dir/dir2 from the root of a CephFS file system:
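A hedged reconstruction follows, assuming the webuser client and a MON running on serverc; the host name and client name are illustrative.
[root@clienta ~]# mount -t ceph serverc.lab.example.com:/dir/dir2 /mnt/mycephfs \
-o name=webuser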
You can specify a list of several comma-separated MONs to mount the device. The standard port (6789) is the default, or you can add a colon and a nonstandard port number after the name of each MON. Recommended practice is to specify more than one MON in case some are offline when the file system is mounted.
These other options are available when using the CephFS kernel client:
secretfile=secret_key_file
The path to the file with the secret key for this client.
To persistently mount your CephFS file system by using the kernel client, you can add the
following entry to the /etc/fstab file:
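A hedged example of such an entry follows; the MON host, client name, and mount point are illustrative.
serverc.lab.example.com:/ /mnt/mycephfs ceph name=webuser,_netdev 0 0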
Removing CephFS
You can remove a CephFS if needed. However, first back up all your data, because removing your
CephFS file system destroys all the stored data on that file system.
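A minimal sketch of the removal follows; the file system name is illustrative, and the confirmation flag is required because the operation is destructive.
[ceph: root@clienta /]# ceph fs volume rm mycephfs --yes-i-really-mean-it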
In Red Hat Ceph Storage, NFS Ganesha shares files by using the NFS 4.0 or later protocol. This requirement is necessary for the CephFS client, the OpenStack Manila File Sharing service, and other Red Hat products that are configured to access the NFS Ganesha service to function correctly.
The following list outlines the advantages of a user space NFS server:
You can deploy NFS Ganesha in an active-active configuration on top of an existing CephFS file system through the ingress service. The main goals of this active-active configuration are load balancing and scaling out to multiple instances that handle higher loads. If one node fails, then the cluster redirects its workload to the remaining nodes.
System administrators can deploy the NFS Ganesha daemons via the CLI or manage them
automatically if either the Cephadm or Rook orchestrators are enabled.
The following list outlines the advantages of having an ingress service on top of an existing NFS service:
Note
The ingress implementation is not yet completely developed. It can deploy multiple
Ganesha instances and balance the load between them, but failover between
hosts is not yet fully implemented. This feature is expected to be available in future
releases.
You can use multiple active-active NFS Ganesha services with Pacemaker for high availability. The
Pacemaker component is responsible for all cluster-related activities, such as monitoring cluster
membership, managing the services and resources, and fencing cluster members.
As prerequisites, create a CephFS file system and install the nfs-ganesha, nfs-ganesha-
ceph, nfs-ganesha-rados-grace, and nfs-ganesha-rados-urls packages on the Ceph
MGR nodes.
After the prerequisites are satisfied, enable the Ceph MGR NFS module:
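A hedged sketch of enabling the module and creating an NFS Ganesha cluster follows; the mynfs cluster ID and host names are illustrative.
[ceph: root@clienta /]# ceph mgr module enable nfs
[ceph: root@clienta /]# ceph nfs cluster create mynfs \
"serverc.lab.example.com,serverd.lab.example.com"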
The node-list is a comma-separated list of the hosts where the daemon containers are deployed.
MDS Autoscaler
CephFS shared file systems require at least one active MDS service for correct operation, and
at least one standby MDS to ensure high availability. The MDS autoscaler module ensures the
availability of enough MDS daemons.
This module monitors the number of ranks and the number of standby daemons, and adjusts the
number of MDS daemons that the orchestrator spawns.
Note
Both the source and target clusters must use Red Hat Ceph Storage version 5 or
later.
The CephFS mirroring feature is snapshot-based. The first snapshot synchronization requires
bulk transfer of the data from the source cluster to the remote cluster. Then, for the following
synchronizations, the mirror daemon identifies the modified files between local snapshots and
synchronizes those files in the remote cluster. This synchronization method is faster than other
methods that require bulk transfer of the data to the remote cluster, because it does not need to
query the remote cluster (file differences are calculated between local snapshots) and needs only
to transfer the updated files to the remote cluster.
The CephFS mirroring module is disabled by default. To configure a snapshot mirror for CephFS,
you must enable the mirroring module on the source and remote clusters:
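A minimal sketch follows; run the command on both the source and the remote cluster.
[ceph: root@clienta /]# ceph mgr module enable mirroring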
Then, you can deploy the CephFS mirroring daemon on the source cluster:
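A hedged sketch follows; the node name is illustrative.
[ceph: root@clienta /]# ceph orch apply cephfs-mirror "serverc.lab.example.com"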
The previous command deploys the CephFS mirroring daemon on node-name and creates the
Ceph user cephfs-mirror. For each CephFS peer, you must create a user on the target cluster:
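A minimal sketch follows, assuming a file system named mycephfs on the target cluster and the conventional client.mirror_remote user name; target-node is a placeholder for a node on the target cluster.
[ceph: root@target-node /]# ceph fs authorize mycephfs client.mirror_remote / rwps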
Then, you can enable mirroring on the source cluster. Mirroring must be enabled for a specific file
system.
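For example, assuming the source file system is named mycephfs:
[ceph: root@clienta /]# ceph fs snapshot mirror enable mycephfs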
The next step is to prepare the target peer. You can create the peer bootstrap in the target node
with the next command:
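A hedged sketch follows, run on the target cluster; remote-site is an illustrative site name.
[ceph: root@target-node /]# ceph fs snapshot mirror peer_bootstrap create \
mycephfs client.mirror_remote remote-site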
You can use the site-name string to identify the target storage cluster. After the target peer is created, import the bootstrap token that was generated on the target cluster into the source cluster:
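A minimal sketch follows; bootstrap-token stands for the token string that the previous command returned.
[ceph: root@clienta /]# ceph fs snapshot mirror peer_bootstrap import mycephfs bootstrap-token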
Finally, configure a directory for snapshot mirroring on the source cluster with the following
command:
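For example, to mirror snapshots of an illustrative /mydirectory path:
[ceph: root@clienta /]# ceph fs snapshot mirror add mycephfs /mydirectory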
References
mount.ceph(8), ceph-fuse(8), ceph(8), rados(8), and cephfs-mirror(8)
man pages
For more information, refer to the Red Hat Ceph Storage 5 File System Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/file_system_guide/index
For more information regarding CephFS deployment, refer to the Deployment of the
Ceph File System chapter in the Red Hat Ceph Storage 5 File System Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/file_system_guide/index#deployment-of-the-ceph-file-system
For more information regarding CephFS over NFS protocol, refer to the Exporting
Ceph File System Namespaces over the NFS Protocol chapter in the
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/file_system_guide/index#exporting-ceph-file-system-namespaces-over-
the-nfs-protocol_fs
Guided Exercise
Outcomes
You should be able to deploy a Metadata Server (MDS) and mount a CephFS file system
with the kernel client and the Ceph-Fuse client. You should be able to save the file system as
persistent storage.
Instructions
• The serverc, serverd, and servere nodes are an operational 3-node Ceph cluster. All three
nodes operate as a MON, a MGR, and an OSD host with at least one colocated OSD.
• The clienta node is your admin node server and you will use it to install the MDS on serverc.
1. Log in to clienta as the admin user. Deploy the serverc node as an MDS. Verify that
the MDS is operating and that the mycephfs_data and mycephfs_metadata pools for
CephFS are created.
1.1. Log in to clienta as the admin user, and use sudo to run the cephadm shell.
1.2. Create the two required CephFS pools. Name these pools mycephfs_data and
mycephfs_metadata.
1.3. Create the CephFS file system with the name mycephfs. Your pool numbers might
differ in your lab environment.
services:
mon: 4 daemons, quorum serverc.lab.example.com,serverd,servere,clienta (age
29m)
mgr: servere.xnprpz(active, since 30m), standbys: clienta.jahhir,
serverd.qbvejy, serverc.lab.example.com.xgbgpo
mds: 1/1 daemons up
osd: 15 osds: 15 up (since 28m), 15 in (since 28m)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 7 pools, 169 pgs
objects: 212 objects, 7.5 KiB
usage: 215 MiB used, 150 GiB / 150 GiB avail
pgs: 169 active+clean
1.6. List the available pools. Verify that mycephfs_data and mycephfs_metadata are
listed.
2. Mount the new CephFS file system on the /mnt/mycephfs directory as a kernel client on
the clienta node. Verify normal operation by creating two folders dir1 and dir2 on the
file system. Create an empty file called atestfile in the dir1 directory and a 10 MB file
called ddtest in the same directory.
2.1. Exit the cephadm shell. Switch to the root user. Verify that the Ceph client key ring is
present in the /etc/ceph folder on the client node.
2.3. Create a mount point called /mnt/mycephfs and mount the new CephFS file
system.
2.5. Create two directories called dir1 and dir2, directly underneath the mount point.
Ensure that they are available.
2.6. Create an empty file called atestfile in the dir1 directory. Then, create a 10 MB
file called ddtest in the same directory.
3. Run the ceph fs status command and inspect the size of the used data in the
mycephfs_data pool. The larger size is reported because the CephFS file system
replicates across the three Ceph nodes.
4. Create a restricteduser user, which has read access to the root folder, and read and
write permissions on the dir2 folder. Use this new user to mount again the CephFS file
system on clienta and check the permissions.
4.1. Create the restricteduser user with read permission on the root folder, and
read and write permissions on the dir2 folder. Use the cephadm shell --mount
option to copy the user key-ring file to the /etc/ceph folder on clienta.
4.2. Use the kernel client to mount the mycephfs file system with this user.
4.3. Test the user permissions in the different folders and files.
3 directories, 2 files
[root@clienta ~]# touch /mnt/mycephfs/dir1/restricteduser_file1
touch: cannot touch '/mnt/mycephfs/dir1/restricteduser_file1': Permission denied
[root@clienta ~]# touch /mnt/mycephfs/dir2/restricteduser_file2
[root@clienta ~]# ls /mnt/mycephfs/dir2
restricteduser_file2
[root@clienta ~]# rm /mnt/mycephfs/dir2/restricteduser_file2
5. Install the ceph-fuse package and mount the CephFS file system on a new directory called /mnt/mycephfuse.
5.1. Create a directory called /mnt/mycephfuse to use as a mount point for the Fuse
client.
5.3. Use the installed Ceph-Fuse driver to mount the file system.
5.4. Run the tree command on the /mnt directory to see its data.
4 directories, 2 files
6. Use the FUSE client to persistently mount the CephFS in the /mnt/mycephfuse folder.
6.1. Configure the /etc/fstab file to mount the file system at startup.
6.2. Mount again the /mnt/mycephfuse folder with the mount -a command. Verify
with the df command.
Warning
Run the lab finish script on the workstation server so that the clienta
node can be safely rebooted without mount conflicts.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to configure CephFS, including snapshots,
replication, memory management, and client access.
CephFS Administration
Use the following commands to manage CephFS file systems:
Action Command
CephFS provides tools to inspect and repair MDS journals (cephfs-journal-tool) or MDS
tables (cephfs-table-tool), and to inspect and rebuild metadata (cephfs-data-scan).
This example retrieves object mapping information for a file within Ceph:
• Convert the inode number to hexadecimal. Use the %x formatting output of the printf
command.
• Search for the hexadecimal ID in the RADOS object list. A large file might return multiple
objects.
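A hedged sketch of those steps follows, using the mount point from the guided exercises and the pool and object names that the interpretation below refers to; the inode value is illustrative.
[admin@clienta ~]$ stat -c %i /mnt/mycephfs/dir1/ddtest
1099511627776
[admin@clienta ~]$ printf '%x\n' 1099511627776
10000000000
[ceph: root@clienta /]# rados -p cephfs_data ls | grep 10000000000
10000000000.00000000
[ceph: root@clienta /]# ceph osd map cephfs_data 10000000000.00000000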
Interpret this output as saying that the e95 map epoch of the OSD map for the cephfs_data
pool (ID 3) maps the 10000000000.00000000 object to placement group 3.30, which is on OSD
1 and OSD 2, and OSD 1 is primary. If the OSDs in up and acting status are not the same, then it
implies that the cluster is rebalancing or has another issue.
Layout attributes are initially set on the directory at the top of the CephFS file system. You can
manually set layout attributes on other directories or files. When you create a file, it inherits layout
attributes from its parent directory. If layout attributes are not set in its parent directory, then the
closest ancestor directory with layout attributes is used.
The layout attributes for a file, such as these examples, use the ceph.file.layout prefix.
The getfattr command displays the layout attributes for a file or directory:
Important
Layout attributes are set when data is initially saved to a file. If the parent directory's
layout attributes change after the file is created, then the file's layout attributes do
not change. Additionally, a file's layout attributes can be changed only if it is empty.
CephFS Statistics
Managing Snapshots
CephFS enables asynchronous snapshots by default when deploying Red Hat Ceph Storage 5.
These snapshots are stored in a hidden directory called .snap. In earlier Red Hat Ceph Storage
versions, snapshots were disabled by default, as they were an experimental feature.
Creating Snapshots
Use the ceph fs set command to enable snapshot creation for an existing CephFS file system.
To create a snapshot, first mount the CephFS file system on your client node. Use the -o
fs=fs-name option to mount a specific CephFS file system when you have more than one. Then, create
a subdirectory inside the .snap directory. The snapshot name is the new subdirectory name. This
snapshot contains a copy of all the current files in the CephFS file system.
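A minimal sketch follows, assuming a file system named mycephfs mounted on /mnt/mycephfs; the allow_new_snaps setting is what the ceph fs set command toggles, and the snapshot name is illustrative.
[ceph: root@clienta /]# ceph fs set mycephfs allow_new_snaps true
[admin@clienta ~]$ mkdir /mnt/mycephfs/.snap/mysnapshot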
Authorize the client to make snapshots for the CephFS file system with the s option.
To restore a file, copy it from the snapshot directory to another normal directory.
To fully restore a snapshot from the .snap directory tree, replace the normal entries with copies
from the chosen snapshot.
To discard a snapshot, remove the corresponding directory in .snap. The rmdir command
succeeds even if the snapshot directory is not empty, without needing to use a recursive rm
command.
Scheduling Snapshots
You can use CephFS to schedule snapshots. The snap_schedule module manages the
scheduled snapshots. You can use this module to create and delete snapshot schedules. Snapshot
schedule information is stored in the CephFS metadata pool.
To create a snapshot schedule, first enable the snap_schedule module on the MGR node.
If an earlier version than Python 3.7 is installed, then the start-time string must use the format
%Y-%m-%dT%H:%M:%S. For Python version 3.7 or later, you can use more flexible date parsing. For
example, to create a snapshot schedule to create a snapshot for the /volume folder every hour,
you can use the ceph fs snap-schedule add command.
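A hedged sketch of enabling the module and adding the hourly schedule follows; the /volume path matches the example in the text.
[ceph: root@clienta /]# ceph mgr module enable snap_schedule
[ceph: root@clienta /]# ceph fs snap-schedule add /volume 1h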
On the client node, review the snapshots in the .snap folder on your mounted CephFS:
You can list the snapshot schedules for a path with the list option:
Use the status option to verify the details for the snapshot schedules.
You can activate and deactivate snapshot schedules through the activate and deactivate
options. When you add a snapshot schedule, it is activated by default if the path exists. However,
if the path does not exist, then it is set as inactive, so you can activate it later when you create the
path.
References
getfattr(1), and setfattr(1) man pages
For more information, refer to the Red Hat Ceph Storage 5 File System Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/file_system_guide/index
For more information regarding snapshots, refer to the Ceph File System snapshots
chapter in the Red Hat Ceph Storage 5 File System Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/file_system_guide/index#ceph-file-system-snapshots
Guided Exercise
Outcomes
You should be able to manage CephFS and create snapshots.
Instructions
• The serverc, serverd, and servere nodes are an operational 3-node Ceph cluster. All three
nodes operate as a MON, a MGR, and an OSD host with at least one colocated OSD.
• The clienta node is set up as your admin node server and you use it to install the Metadata
Server (MDS) on serverc.
• The mycephfs CephFS file system is mounted on the /mnt/mycephfs folder with the key ring
for the admin user.
2.1. Use the getfattr command to view the layout of the /mnt/mycephfs/dir1
directory. This command does not return any layout attribute that is associated with
this directory.
Note
If a directory has no layout attributes, then it inherits the layout of its parent. You
must set the layout attributes of a directory before you can view them. By default,
layout attributes are set on only the top-level directory of the mounted CephFS file
system (on the mount point).
2.2. Use the setfattr command to change the layout of the /mnt/mycephfs/dir1
directory.
2.3. Verify the layout of the /mnt/mycephfs/dir1 directory again. Layout attributes
should now be available for this directory. Settings other than the one that you
specified are inherited from the closest parent directory where attributes were set.
2.4. Verify the layout for the /mnt/mycephfs/dir1/ddtest file and notice that the
layout does not change for this file, which existed before the layout change.
2.5. Create a file called anewfile under the /mnt/mycephfs/dir1/ directory. Notice how
the stripe_count for the layout of this file matches the new layout of the /mnt/
mycephfs/dir1/ directory. The new file inherits the layout attributes from its parent
directory at creation time.
2.7. Ensure that the /mnt/mycephfs/dir1/anewfile file is not empty. Verify that you
cannot modify a specific file's layout attributes if the file is not empty.
2.8. Clear the configured layout attributes for the /mnt/mycephfs/dir1 directory.
Verify the new layout settings (that are inherited from the top of the CephFS file
system) by creating a file called a3rdfile.
3. Create a snapshot for your CephFS file system. Mount the CephFS file system on /mnt/
mycephfs as the user restricteduser.
3.1. Unmount the /mnt/mycephfs folder and mount again the CephFS file system to
that folder as the user restricteduser.
3.3. Create a mysnapshot folder, which is the snapshot name. As the user
restricteduser does not currently have snapshot permissions, you must grant
snapshot permissions to this user. After you set the permissions, you must remount
the CephFS file system to use the new permissions.
[client.restricteduser]
key = AQBc315hI7PaBRAA9/9fdmj+wjblK+izstA0aQ==
caps mds = "allow r fsname=mycephfs, allow rw fsname=mycephfs path=/dir2"
caps mon = "allow r fsname=mycephfs"
caps osd = "allow rw tag cephfs data=mycephfs"
exported keyring for client.restricteduser
[ceph: root@clienta /]# ceph auth caps client.restricteduser \
mds 'allow rws fsname=mycephfs' \
mon 'allow r fsname=mycephfs' \
osd 'allow rw tag cephfs data=mycephfs'
updated caps for client.restricteduser
[ceph: root@clienta /]# exit
exit
[admin@clienta .snap]$ cd
[admin@clienta ~]$ sudo umount /mnt/mycephfs
[admin@clienta ~]$ sudo mount.ceph serverc.lab.example.com:/ \
/mnt/mycephfs -o name=restricteduser
[admin@clienta ~]$ cd /mnt/mycephfs/.snap
[admin@clienta .snap]$ mkdir mysnapshot
3.4. Check that the files in the snapshot are the same as the files in the mounted CephFS
file system.
1 directory, 3 files
[admin@clienta .snap]$ tree /mnt/mycephfs/.snap/mysnapshot
/mnt/mycephfs/.snap/mysnapshot
└── dir1
├── a3rdfile
├── anewfile
└── ddtest
1 directory, 3 files
4. Schedule an hourly snapshot of your CephFS file system's root folder.
4.1. Use sudo to run the cephadm shell. Enable the snapshot module.
4.3. Check that your snapshot schedule is correctly created and in an active state.
4.4. Exit from the cephadm shell. Check that your snapshot is correctly created in your
.snap folder.
4 directories, 6 files
Creating the scheduled snapshot might take time. Because you scheduled it hourly, it
might take up to one hour for the first snapshot to be triggered. You do not have to wait
until the snapshot is created.
Warning
Run the lab finish script on the workstation server so that the clienta
node can be safely rebooted without mount conflicts.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to deploy an MDS and use the kernel client to mount the CephFS file
system.
• The serverc, serverd, and servere nodes are an operational 3-node Ceph cluster. All
three nodes operate as a MON, a MGR, and an OSD host with at least one colocated OSD.
• The clienta node is set up as your admin node server and you use it to install the MDS
on serverc.
Instructions
1. Log in to clienta as the admin user. Create the cephfs_data and cephfs_metadata pools
for CephFS. Create the mycephfs CephFS file system. From clienta, deploy the MDS to
serverc. Verify that the MDS is up and active. Verify that the ceph health is OK.
2. On the clienta node, create the /mnt/cephfs-review mount point and mount the
CephFS file system as a kernel client.
3. Create a 10 MB test file called cephfs.test1. Verify that the created data is replicated
across all three nodes by showing 30 MB in the cephfs_data pool.
4. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade fileshare-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to deploy an MDS and use the kernel client to mount the CephFS file
system.
• The serverc, serverd, and servere nodes are an operational 3-node Ceph cluster. All
three nodes operate as a MON, a MGR, and an OSD host with at least one colocated OSD.
• The clienta node is set up as your admin node server and you use it to install the MDS
on serverc.
Instructions
1. Log in to clienta as the admin user. Create the cephfs_data and cephfs_metadata pools
for CephFS. Create the mycephfs CephFS file system. From clienta, deploy the MDS to
serverc. Verify that the MDS is up and active. Verify that the ceph health is OK.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
1.2. Create the two required CephFS pools. Name these pools cephfs_data and
cephfs_metadata.
1.3. Create the CephFS file system with the name mycephfs. Your pool numbers might
differ in your lab environment.
1.5. Verify that the MDS service is active. It can take some time until the MDS service is
shown.
services:
mon: 4 daemons, quorum serverc.lab.example.com,servere,serverd,clienta (age
2h)
mgr: serverc.lab.example.com.btgxor(active, since 2h), standbys:
clienta.soxncl, servere.fmyxwv, serverd.ufqxxk
mds: 1/1 daemons up
osd: 9 osds: 9 up (since 2h), 9 in (since 36h)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 7 pools, 169 pgs
objects: 212 objects, 7.5 KiB
usage: 162 MiB used, 90 GiB / 90 GiB avail
pgs: 169 active+clean
io:
client: 1.1 KiB/s wr, 0 op/s rd, 3 op/s wr
2. On the clienta node, create the /mnt/cephfs-review mount point and mount the
CephFS file system as a kernel client.
2.1. Exit the cephadm shell. Verify that the Ceph client key ring is present in the
/etc/ceph folder on the client node.
2.3. Create the /mnt/cephfs-review mount point directory. Mount the new CephFS file
system as a kernel client.
2.4. Change the ownership of the top-level directory of the mounted file system to user
and group admin.
3. Create a 10 MB test file called cephfs.test1. Verify that the created data is replicated
across all three nodes by showing 30 MB in the cephfs_data pool.
3.1. Use the dd command to create one 10 MB file, and then verify that it triples across the
OSD nodes.
Evaluation
Grade your work by running the lab grade fileshare-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• You can distinguish the different characteristics for file-based, block-based, and object-based
storage.
• CephFS is a POSIX-compliant file system that is built on top of RADOS to provide file-based
storage.
• CephFS requires at least one Metadata Server (MDS), and stores file metadata separately from file data.
– Create two pools, one for CephFS data and another for CephFS metadata.
• You can mount CephFS file systems with either of the two available clients:
– The kernel client, which does not support quotas but is faster.
• NFS Ganesha is a user space NFS file server for accessing Ceph storage.
• You can modify the RADOS layout to control how files are mapped to objects.
• CephFS enables asynchronous snapshots by creating a folder in the hidden .snap folder.
Chapter 11
Managing a Red Hat Ceph Storage Cluster
Objectives
After completing this section, you should be able to administer and monitor a Red Hat Ceph
Storage cluster, including starting and stopping specific services or the full cluster, and querying
cluster health and utilization.
Client I/O operations continue normally while MGR nodes are down, but queries for cluster
statistics fail. Deploy at least two MGRs for each cluster to provide high availability. MGRs are
typically run on the same hosts as MON nodes, but it is not required.
The first MGR daemon that is started in a cluster becomes the active MGR and all other MGRs
are on standby. If the active MGR does not send a beacon within the configured time interval, a
standby MGR takes over. You can configure the mon_mgr_beacon_grace setting to change the
beacon time interval if needed. The default value is 30 seconds.
Use the ceph mgr fail <MGR_NAME> command to manually failover from the active MGR to a
standby MGR.
Use the ceph mgr stat command to view the status of the MGRs.
View the modules that are available and enabled by using the ceph mgr module ls command.
View published addresses for specific modules, such as the Dashboard module URL, by using the
ceph mgr services command.
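The following commands are a minimal sketch; the MGR instance name passed to ceph mgr fail is one of the illustrative classroom names and differs in your cluster.
[ceph: root@clienta /]# ceph mgr stat
[ceph: root@clienta /]# ceph mgr fail serverc.lab.example.com.xgbgpo
[ceph: root@clienta /]# ceph mgr module ls
[ceph: root@clienta /]# ceph mgr services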
The Dashboard relies on the Prometheus and Grafana services to display collected monitoring
data and to generate alerts. Prometheus is an open source monitoring and alerting tool. Grafana is
an open source statistical graphing tool.
The Dashboard supports alerts based on Ceph metrics and configured thresholds. The
Prometheus AlertManager component configures, gathers, and triggers the alerts. Alerts are
displayed in the Dashboard as notifications. You can view details of recent alerts and mute alerts.
• HEALTH_WARN indicates that the cluster is in a warning condition. For example, an OSD is down,
but there are enough OSDs working properly for the cluster to function.
• HEALTH_ERR indicates that the cluster is in an error condition. For example, a full OSD could
have an impact on the functionality of the cluster.
If the Ceph cluster is in a warning or an error state, the ceph health detail command
provides additional details.
The ceph -w command displays additional real-time monitoring information about the events
happening in the Ceph cluster.
This command provides the status of cluster activities, such as the following details:
To monitor the cephadm log, use the ceph -W cephadm command. Use the ceph log last
cephadm to view the most recent log entries.
Cluster daemons are referred to by their $daemon type and their $id. The $daemon type is
mon, mgr, mds, osd, rgw, rbd-mirror, crash, or cephfs-mirror.
The daemon $id for MON, MGR, and RGW is the host name. The daemon $id for OSD is the
OSD ID. The daemon $id for MDS is the file system name followed by the host name.
Use the ceph orch ps command to list all cluster daemons. Use the --daemon_type=DAEMON
option to filter for a specific daemon type.
To stop, start, or restart a daemon on a host, use systemctl commands and the daemon name.
To list the names of all daemons on a cluster host, run the systemctl list-units command
and search for ceph.
The cluster fsid is in the daemon name. Some service names end in a random six character string
to distinguish individual services of the same type on the same host.
Use systemctl with the ceph.target unit to manage all the daemons on a cluster node.
You can also use the ceph orch command to manage cluster services. First, obtain the service
name by using the ceph orch ls command. For example, find the service name for cluster
OSDs and restart the service.
You can manage an individual cluster daemon by using the ceph orch daemon command.
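A hedged sketch of these commands follows; fsid stands for your cluster's fsid, and service-name stands for a service name that you obtain from the ceph orch ls output.
[root@serverc ~]# systemctl list-units 'ceph*'
[root@serverc ~]# systemctl restart ceph-fsid@osd.1.service
[root@serverc ~]# systemctl restart ceph.target
[ceph: root@clienta /]# ceph orch ls
[ceph: root@clienta /]# ceph orch restart service-name
[ceph: root@clienta /]# ceph orch daemon restart osd.1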
Use the ceph osd set and ceph osd unset commands to manage these flags, as shown in the example after this list:
noup
Do not automatically mark a starting OSD as up. If the cluster network is experiencing latency
issues, OSDs can mark each other down on the MON, then mark themselves up. This scenario
is called flapping. Set the noup and nodown flags to prevent flapping.
nodown
The nodown flag tells the Ceph MON not to mark a stopping OSD with the down state. Use the
nodown flag when performing maintenance or a cluster shutdown. Set the nodown flag to
prevent flapping.
noout
The noout flag tells the Ceph MON not to remove any OSDs from the CRUSH map, which
prevents CRUSH from automatically rebalancing the cluster when OSDs are stopped. Use the
noout flag when performing maintenance on a subset of the cluster. Clear the flag after the
OSDs are restarted.
noin
The noin flag prevents booting OSDs from being marked with the in state. The flag prevents
data from being automatically allocated to that specific OSD.
norecover
The norecover flag prevents recovery operations from running. Use the norecover flag
when performing maintenance or a cluster shutdown.
nobackfill
The nobackfill flag prevents backfill operations from running. Use the nobackfill flag
when performing maintenance or a cluster shutdown. Backfilling is discussed later in this
section.
norebalance
The norebalance flag prevents rebalancing operations from running. Use the norebalance
flag when performing maintenance or a cluster shutdown.
noscrub
The noscrub flag prevents scrubbing operations from running. Scrubbing will be discussed
later in this section.
nodeep-scrub
The nodeep-scrub flag prevents any deep-scrubbing operation from running. Deep-
scrubbing is discussed later in this section.
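The following is a minimal sketch of setting and clearing flags around a maintenance window; the flags shown are illustrative choices.
[ceph: root@clienta /]# ceph osd set noout
noout is set
[ceph: root@clienta /]# ceph osd set norebalance
norebalance is set
[ceph: root@clienta /]# ceph osd unset norebalance
norebalance is unset
[ceph: root@clienta /]# ceph osd unset noout
noout is unset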
Cluster Power Up
Perform the following steps to power on the cluster:
• Power up cluster nodes in the following order: admin node, MON and MGR nodes, OSD nodes,
MDS nodes.
• Clear the noout, norecover, norebalance, nobackfill, nodown, and pause flags.
• Bring up Ceph Object Gateways and iSCSI Gateways.
• Bring up CephFS.
Ceph containers write to individual log files for each daemon. Enable logging for each specific
Ceph daemon by configuring the daemon's log_to_file setting to true. This example enables
logging for MON nodes.
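A hedged reconstruction of that example:
[ceph: root@clienta /]# ceph config set mon log_to_file true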
Monitoring OSDs
If the cluster is not healthy, Ceph displays a detailed status report containing the following
information:
The ceph status and ceph health commands report space-related warning or error
conditions. The various ceph osd subcommands report OSD usage details, status, and location
information.
7     hdd  0.00980  1.00000  10 GiB  1.0 GiB  28 MiB   44 KiB   1024 MiB  9.0 GiB  10.28  1.00  38  up
8     hdd  0.00980  1.00000  10 GiB  1.0 GiB  28 MiB   44 KiB   1024 MiB  9.0 GiB  10.28  1.00  47  up
TOTAL                        90 GiB  9.2 GiB  255 MiB  274 KiB  9.0 GiB   81 GiB   10.28
CLASS The type of devices that the OSD uses (HDD, SSD, or NVMe).
WEIGHT The weight of the OSD in the CRUSH map. By default, this is set to the
OSD capacity in TB and is changed by using the ceph osd crush
reweight command. The weight determines how much data CRUSH
places onto the OSD relative to other OSDs. For example, two OSDs with
the same weight receive roughly the same number of I/O requests and
store approximately the same amount of data.
REWEIGHT Either the default reweight value or the actual value set by the ceph osd
reweight command. You can reweight an OSD to temporarily override the
CRUSH weight.
OMAP The BlueFS storage that is used to store object map (OMAP) data, which
are the key-value pairs stored in RocksDB.
Use the ceph osd perf command to view OSD performance statistics.
• down or up - indicating whether the daemon is running and communicating with the MONs.
If an OSD fails and the daemon goes offline, the cluster might report it as down and in for a short
period of time. This is intended to give the OSD a chance to recover on its own and rejoin the
cluster, avoiding unnecessary recovery traffic.
For example, a brief network interruption might cause the OSD to lose communication with
the cluster and be temporarily reported as down. After a short interval controlled by the
mon_osd_down_out_interval configuration option (five minutes by default), the cluster
reports the OSD as down and out. At this point, the placement groups assigned to the failed OSD
are migrated to other OSDs.
If the failed OSD then returns to the up and in states, the cluster reassigns placement groups
based on the new set of OSDs and by rebalancing the objects in the cluster.
Note
Use the ceph osd set noout and ceph osd unset noout commands to
enable or disable the noout flag on the cluster. However, the ceph osd out
osdid command tells the Ceph cluster to ignore an OSD for data placement and
marks the OSD with the out state.
OSDs verify each other's status at regular time intervals (six seconds by default). They report
their status to the MONs every 120 seconds, by default. If an OSD is down, the other OSDs or the
MONs do not receive heartbeat responses from that down OSD.
When the value of the mon_osd_full_ratio setting is reached or exceeded, the cluster stops
accepting write requests from clients and enters the HEALTH_ERR state. The default full ratio is
0.95 (95%) of the available storage space in the cluster. Use the full ratio to reserve enough space
so that if OSDs fail, there is enough space left for automatic recovery to succeed without running
out of space.
The mon_osd_nearfull_ratio setting is a more conservative limit. When the value of the
mon_osd_nearfull_ratio limit is reached or exceeded, the cluster enters the HEALTH_WARN
state. This is intended to alert you to the need to add OSDs to the cluster or fix issues before you
reach the full ratio. The default near full ratio is 0.85 (85%) of the available storage space in the
cluster.
Use the ceph osd set-full-ratio, ceph osd set-nearfull-ratio, and ceph osd
set-backfillfull-ratio commands to configure these settings.
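A minimal sketch that sets the ratios explicitly follows; the full and nearfull values are the defaults described above, and the backfillfull value is an illustrative choice.
[ceph: root@clienta /]# ceph osd set-nearfull-ratio 0.85
[ceph: root@clienta /]# ceph osd set-backfillfull-ratio 0.90
[ceph: root@clienta /]# ceph osd set-full-ratio 0.95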
Note
The default ratio settings are appropriate for small clusters, such as the one used in
this lab environment. Production clusters typically require lower ratios.
Different OSDs might be at full or nearfull depending on exactly what objects are stored in
which placement groups. If you have some OSDs full or nearfull and others with plenty of
space remaining, analyze your placement group distribution and CRUSH map weights.
A certain amount of scrubbing or deep-scrubbing can also occur in a healthy cluster and does not
indicate a problem.
Placement group scrubbing is a background process that verifies data consistency by comparing
an object's size and other metadata with its replicas on other OSDs and reporting inconsistencies.
Deep scrubbing is a resource-intensive process that compares the contents of data objects by
using a bitwise comparison and recalculates checksums to identify bad sectors on the drive.
Note
Although scrubbing operations are critical to maintain a healthy cluster, they have a
performance impact, particularly deep scrubbing. Schedule scrubbing to avoid peak
I/O times. Temporarily prevent scrub operations with the noscrub and nodeep-scrub
cluster flags.
PG state Description
peering The OSDs are being brought into agreement about the current state
of the objects in the PG.
active Peering is complete. The PG is available for read and write requests.
clean The PG has the correct number of replicas and there are no stray
replicas.
undersized The PG is configured to store more replicas than there are OSDs
available to the placement group.
inconsistent Replicas of this PG are not consistent. One or more replicas in the PG
are different, indicating some form of corruption of the PG.
replay The PG is waiting for clients to replay operations from a log after an
OSD crash.
incomplete The PG is missing information from its history log about writes that
might have occurred. This could indicate that an OSD has failed or is
not started.
remapped The acting set has changed, and the PG is temporarily remapped to a
different set of OSDs while the primary OSD recovers or backfills.
When an OSD is added to a placement group, the PG enters the peering state to ensure that all
nodes agree about the state of the PG. If the PG can handle read and write requests after peering
completes, then it reports an active state. If the PG also has the correct number of replicas for
all of its objects, then it reports a clean state. The normal PG operating state after writes are
complete is active+clean.
When an object is written to the PG's primary OSD, the PG reports a degraded state until all
replica OSDs acknowledge that they have also written the object.
The backfill state means that data is being copied or migrated to rebalance PGs across OSDs. If
a new OSD is added to the PG, it is gradually backfilled with objects to avoid excessive network
traffic. Backfilling occurs in the background to minimize the performance impact on the cluster.
The backfill_wait state indicates that a backfill operation is pending. The backfill state
indicates that a backfill operation is in progress. The backfill_too_full state indicates that a
backfill operation was requested, but could not be completed due to insufficient storage capacity.
A PG marked as inconsistent might have replicas that are different from the others, detected
as a different data checksum or metadata size on one or more replicas. A clock skew in the Ceph
cluster and corrupted object content can also cause an inconsistent PG state.
Note
The MONs use the mon_pg_stuck_threshold parameter to decide if a PG
has been in an error state for too long. The default value for the threshold is 300
seconds.
Ceph marks a PG as stale when all OSDs that have copies of a specific PG are in down and out
states. To return from the stale state, an OSD that holds a copy of the PG must be revived so that
the copy becomes available and PG recovery can begin. If the situation remains unresolved, the
PG is inaccessible and I/O requests
to the PG hang.
By default, Ceph performs an automatic recovery. If recovery fails for any PGs, the cluster status
continues to display HEALTH_ERR.
Ceph can declare that an OSD or a PG is lost, which might result in data loss. To determine
the affected OSDs, first retrieve an overview of cluster status with the ceph health detail
command. Then, use the ceph pg dump_stuck option command to inspect the state of PGs.
Note
If many PGs remain in the peering state, the ceph osd blocked-by command
displays the OSDs that are blocking the peering process.
Inspect the PG by using either the ceph pg dump | grep pgid or the ceph pg pgid query
command. The OSDs hosting the PG are displayed in square brackets ([]).
First, update cephadm by running the cephadm-ansible preflight playbook with the
upgrade_ceph_packages option set to true.
Then run the ceph orch upgrade start --ceph-version VERSION command using the
name of the new version.
Run the ceph status command to view the progress of the upgrade.
Do not mix clients and cluster nodes that use different versions of Red Hat Ceph Storage in the
same cluster. Clients include RADOS gateways, iSCSI gateways, and other applications that use
librados, librbd, or libceph.
Use the ceph versions command after a cluster upgrade to verify that matching versions are
installed.
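As a sketch of this workflow, the command sequence might look like the following. The inventory file name and the VERSION placeholder are assumptions; substitute the values for your environment:
[admin@node ~]$ ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=rhcs upgrade_ceph_packages=true"
[ceph: root@node /]# ceph orch upgrade start --ceph-version VERSION
[ceph: root@node /]# ceph status
[ceph: root@node /]# ceph versions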
The balancer module does not run if the cluster is not in the HEALTH_OK state. When the cluster
is healthy, it throttles its changes so that it keeps the number of PGs that need to be moved under
a 5% threshold. Configure the target_max_misplaced_ratio MGR setting to adjust this
threshold:
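For example, a sketch of raising the threshold to 7 percent; the 0.07 value is illustrative:
[ceph: root@node /]# ceph config set mgr target_max_misplaced_ratio 0.07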
The balancer module is enabled by default. Use the ceph balancer on and ceph balancer
off commands to enable or disable the balancer.
Use the ceph balancer status command to display the balancer status.
Automated Balancing
Automated balancing uses one of the following modes:
crush-compat
This mode uses the compat weight-set feature to calculate and manage an alternative
set of weights for devices in the CRUSH hierarchy. The balancer optimizes these weight-set
values, adjusting them up or down in small increments to achieve a distribution that matches
the target distribution as closely as possible.
upmap
The PG upmap mode enables storing explicit PG mappings for individual OSDs in the OSD
map as exceptions to the normal CRUSH placement calculation. The upmap mode analyzes
PG placement, and then runs the required pg-upmap-items commands to optimize PG
placement and achieve a balanced distribution.
Because these upmap entries provide fine-grained control over the PG mapping, the upmap
mode is usually able to distribute PGs evenly among OSDs, or +/-1 PG if there is an odd
number of PGs.
Setting the mode to upmap requires that all clients be Luminous or newer. Use the ceph
osd set-require-min-compat-client luminous command to set the required
minimum client version.
Use the ceph balancer mode upmap command to set the balancer mode to upmap.
Use the ceph balancer mode crush-compat command to set the balancer mode to crush-
compat.
Manual Balancing
You can run the balancer manually to control when balancing occurs and to evaluate the
balancer plan before executing it. To run the balancer manually, use the following commands to
disable automatic balancing, and then generate and execute a plan.
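A minimal sketch of that sequence follows; the plan name myplan is an illustrative choice:
[ceph: root@node /]# ceph balancer off
[ceph: root@node /]# ceph balancer eval
[ceph: root@node /]# ceph balancer optimize myplan
[ceph: root@node /]# ceph balancer show myplan
[ceph: root@node /]# ceph balancer execute myplan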
Note
Only execute the plan if you expect it to improve the distribution. The plan is
discarded after execution.
References
For more information, refer to the Red Hat Ceph Storage 5 Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index
Guided Exercise
Outcomes
You should be able to administer and monitor the cluster, including starting and stopping
specific services, analyzing placement groups, setting OSD primary affinity, verifying
daemon versions, and querying cluster health and utilization.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. View the enabled MGR modules. Verify that the dashboard module is enabled.
"dashboard",
"insights",
"iostat",
"prometheus",
"restful"
],
"disabled_modules": [
{
"name": "alerts",
"can_run": true,
"error_string": "",
"module_options": {
...output omitted...
4.1. Using a web browser, navigate to the dashboard URL obtained in the previous step.
Log in as the admin user with the redhat password.
4.2. On the Dashboard page, click Monitors to view the status of the Monitor nodes and
quorum.
6. Find the location of the OSD 2 daemon, stop the OSD, and view the cluster OSD status.
}
]
},
"osd_fsid": "1163a19e-e580-40e0-918f-25fd94e97b86",
"host": "serverc.lab.example.com",
"crush_location": {
"host": "serverc",
"root": "default"
}
}
6.3. Exit the serverc node. View the cluster OSD status.
7. Start osd.2 on the serverc node, and then view the cluster OSD status.
8. View the log files for the osd.2 daemon. Filter the output to view only systemd events.
9. Mark the osd.4 daemon as being out of the cluster and observe how it affects the cluster
status. Then, mark the osd.4 daemon as being in the cluster again.
9.1. Mark the osd.4 daemon as being out of the cluster. Verify that the osd.4 daemon
is marked out of the cluster and notice that the OSD's weight is now 0.
Note
Ceph recreates the missing object replicas previously available on the osd.4
daemon on different OSDs. You can trace the recovery of the objects using the
ceph status or the ceph -w commands.
Note
You can mark an OSD as out even though it is still running (up). The in or out
status does not correlate to an OSD's running state.
10. Analyze the current utilization and number of PGs on the OSD 2 daemon.
11. View the placement group status for the cluster. Create a test pool and a test object. Find
the placement group to which the test object belongs and analyze that placement group's
status.
11.1. View the placement group status for the cluster. Examine the PG states. Your output
may be different in your lab environment.
11.2. Create a pool called testpool and an object called testobject containing the
/etc/ceph/ceph.conf file.
11.3. Find the placement group of the testobject object in the testpool pool and
analyze its status. Use the placement group information from your lab environment in
the query.
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "active+clean",
"epoch": 334,
"up": [
8,
2,
5
],
"acting": [
8,
2,
5
],
"acting_recovery_backfill": [
"2",
"5",
"8"
],
"info": {
"pgid": "9.11",
...output omitted...
12. List the OSD and cluster daemon versions. This is a useful command to run after cluster
upgrades.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to perform common cluster maintenance tasks,
such as adding or removing MONs and OSDs, and recovering from various component failures.
Evaluate the potential performance impact before performing cluster maintenance activities. The
following factors typically affect cluster performance when adding or removing OSD nodes:
• Client load
If an OSD node has a pool that is experiencing high client loads, then performance and recovery
time could be negatively affected. Because write operations require data replication for
resiliency, write-intensive client loads increase cluster recovery time.
• Node capacity
The capacity of the node being added or removed affects the cluster recovery time. The node's
storage density also affects recovery times. For example, a node with 36 OSDs takes longer to
recover than a node with 12 OSDs.
When removing nodes, verify that you have sufficient spare capacity to avoid reaching the full or
near full ratios. When a cluster reaches the full ratio, Ceph suspends write operations to prevent
data loss.
• CRUSH rules
A Ceph OSD node maps to at least one CRUSH hierarchy, and that hierarchy maps to at
least one pool via a CRUSH rule. Each pool using a specific CRUSH hierarchy experiences a
performance impact when adding and removing OSDs.
• Pool types
Replication pools use more network bandwidth to replicate data copies, while erasure-coded
pools use more CPU to calculate data and coding chunks.
The more data copies that exist, the longer it takes for the cluster to recover. For example, an
erasure-coded pool with many chunks takes longer to recover than a replicated pool with fewer
copies of the same data.
• Node hardware
Nodes with higher throughput characteristics, such as 10 Gbps network interfaces and SSDs,
recover more quickly than nodes with lower throughput characteristics, such as 1 Gbps network
interfaces and SATA drives.
When a storage device fails, the OSD status changes to down. Other cluster issues, such as a
network error, can also mark an OSD as down. When an OSD is down, first verify if the physical
device has failed.
Replacing a failed OSD requires replacing both the physical storage device and the software-
defined OSD. When an OSD fails, you can replace the physical storage device and either reuse
the same OSD ID or create a new one. Reusing the same OSD ID avoids having to reconfigure the
CRUSH map.
If an OSD has failed, use the Dashboard GUI or the following CLI commands to replace the OSD.
To verify that the OSD has failed, perform the following steps.
• View the cluster status and verify that an OSD has failed.
If the OSD does not start, then the physical storage device might have failed. Use the
journalctl command to view the OSD logs or use the utilities available in your production
environment to verify that the physical device has failed.
If you have verified that the physical device needs replacement, perform the following steps.
[ceph: root@node /]# ceph osd set noscrub ; ceph osd set nodeep-scrub
• Watch cluster events and verify that a backfill operation has started.
• Verify that the backfill process has moved all PGs off the OSD and it is now safe to remove.
• When the OSD is safe to remove, replace the physical storage device and destroy the OSD.
Optionally, remove all data, file systems, and partitions from the device.
[ceph: root@node /]# ceph orch device zap HOST_NAME DEVICE_PATH --force
Note
Find the current device ID using the Dashboard GUI, or the ceph-volume lvm
list or ceph osd metadata CLI commands.
• Replace the OSD using the same ID as the one that failed. Verify that the operation has
completed before continuing.
• Replace the physical device and recreate the OSD. The new OSD uses the same OSD ID as the
one that failed.
Note
The device path of the new storage device might be different than the failed device.
Use the ceph orch device ls command to find the new device path.
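One possible CLI sequence for these steps is sketched below; OSD_ID, HOST_NAME, and DEVICE_PATH are placeholders, and your orchestrator service specification might recreate the OSD automatically instead of requiring the final command:
[ceph: root@node /]# ceph osd safe-to-destroy osd.OSD_ID
[ceph: root@node /]# ceph orch osd rm OSD_ID --replace
[ceph: root@node /]# ceph orch daemon add osd HOST_NAME:DEVICE_PATH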
• Re-enable scrubbing.
[ceph: root@node /]# ceph osd unset noscrub ; ceph osd unset nodeep-scrub
Adding a MON
Add a MON to your cluster by performing the following steps.
Note
Specify all MON nodes when running this command. If you only specify the new
MON node, then the command removes all other MONs, leaving the cluster with
only one MON node.
[ceph: root@node /]# ceph orch apply mon --placement="NODE1 NODE2 NODE3 NODE4 ..."
Removing a MON
Use the ceph orch apply mon command to remove a MON from the cluster. Specify all MONs
except the one that you want to remove.
[ceph: root@node /]# ceph orch apply mon --placement="NODE1 NODE2 NODE3 ..."
To prepare a host for maintenance activities, such as a reboot, place the host in maintenance mode, which stops the Ceph daemons running on that host:
[ceph: root@node /]# ceph orch host maintenance enter HOST_NAME [--force]
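When the maintenance work is finished, take the host out of maintenance mode:
[ceph: root@node /]# ceph orch host maintenance exit HOST_NAME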
References
For more information, refer to the Red Hat Ceph Storage 5 Operations Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/operations_guide/index
Guided Exercise
Outcomes
You should be able to add, replace, and remove components in an operational Red Hat Ceph
Storage cluster.
This command confirms that the hosts required for this exercise are accessible and stops the
osd.3 daemon to simulate an OSD failure.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Set the noscrub and nodeep-scrub flags to prevent the cluster from starting scrubbing
operations temporarily.
3. Verify the Ceph cluster status. The cluster will transition to the HEALTH_WARN status after
some time.
4.3. Log in to the serverd node and use sudo to run the cephadm shell. Identify the
device name for the failed OSD.
Note
You can also identify the device name of an OSD by using the ceph osd
metadata OSD_ID command from the admin node.
5. Exit the cephadm shell. Identify the service name of the osd.3 daemon running on the
serverd node. The service name will be different in your lab environment.
7. On the serverd node, start the osd.3 service. On the admin node, verify that the OSD
has started.
8. Clear the noscrub and nodeep-scrub flags. Verify that the cluster health status returns
to HEALTH_OK. Press Ctrl+C to exit the ceph -w command.
9.4. Verify that the MONs are active and correctly placed.
10. Remove the MON service from the serverg node, remove its OSDs, and then remove
serverg from the cluster. Verify that the serverg node is removed.
Important
Always keep at least three MONs running in a production cluster.
10.3. Remove the serverg node from the cluster. Verify that the serverg node has been
removed.
11. You receive an alert that there is an issue on the servere node. Put the servere node
into maintenance mode, reboot the host, and then exit maintenance mode.
11.1. Put the servere node into maintenance mode, and then verify that it has a
maintenance status.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to locate the Ceph Dashboard URL, set an OSD out and in, watch
cluster events, find and start a down OSD, find an object's PG location and state, and view
the balancer status.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user. Verify that the dashboard module is enabled. Find
the dashboard URL of the active MGR.
2. You receive an alert that an OSD is down. Identify which OSD is down. Identify on which node
the down OSD runs, and start the OSD.
3. Set the OSD 5 daemon to the out state and verify that all data has been migrated off of the
OSD.
4. Set the OSD 5 daemon to the in state and verify that PGs have been placed onto it.
5. Display the balancer status.
6. Identify the PG for object data1 in the pool1 pool. Query the PG and find its state.
7. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade cluster-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to locate the Ceph Dashboard URL, set an OSD out and in, watch
cluster events, find and start a down OSD, find an object's PG location and state, and view
the balancer status.
This command confirms that the hosts required for this exercise are accessible.
Instructions
1. Log in to clienta as the admin user. Verify that the dashboard module is enabled. Find
the dashboard URL of the active MGR.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
Note
Your output might be different depending on which MGR node is active in your lab
environment.
2. You receive an alert that an OSD is down. Identify which OSD is down. Identify on which node
the down OSD runs, and start the OSD.
3. Set the OSD 5 daemon to the out state and verify that all data has been migrated off of the
OSD.
3.2. Verify that all PGs have been migrated off of the OSD 5 daemon. It will take some time
for the data migration to finish. Press Ctrl+C to exit the command.
services:
mon: 4 daemons, quorum serverc.lab.example.com,clienta,serverd,servere (age
9h)
mgr: serverc.lab.example.com.aiqepd(active, since 9h), standbys:
serverd.klrkci, servere.kjwyko, clienta.nncugs
osd: 9 osds: 9 up (since 46s), 8 in (since 7s); 4 remapped pgs
rgw: 2 daemons active (2 hosts, 1 zones)
data:
pools: 5 pools, 105 pgs
objects: 221 objects, 4.9 KiB
usage: 235 MiB used, 80 GiB / 80 GiB avail
pgs: 12.381% pgs not active
1/663 objects degraded (0.151%)
92 active+clean
10 remapped+peering
2 activating
1 activating+degraded
io:
recovery: 199 B/s, 0 objects/s
progress:
Global Recovery Event (2s)
[............................]
4. Set the OSD 5 daemon to the in state and verify that PGs have been placed onto it.
4.2. Verify that PGs have been placed onto the OSD 5 daemon.
6. Identify the PG for object data1 in the pool1 pool. Query the PG and find its state.
Note
In this example, the PG is 6.1c. Use the PG value in the output displayed in your lab
environment.
6.2. Query the PG and view its state and primary OSD.
...output omitted...
Evaluation
Grade your work by running the lab grade cluster-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned how to:
• Enable or disable Ceph Manager (MGR) modules, and describe the role of the Ceph
Manager (MGR).
• Use the CLI to find the URL of the Dashboard GUI on the active MGR.
• View the status of cluster MONs by using the CLI or the Dashboard GUI.
• Power down the entire cluster by setting cluster flags to stop background operations, then
stopping daemons and nodes in a specific order by function.
• Power up the entire cluster by starting nodes and daemons in a specific order by function, then
setting cluster flags to enable background operations.
• Start, stop, or restart individual cluster daemons and view daemon logs.
• Use the balancer module to optimize the placement of PGs across OSDs.
Chapter 12
Tuning and Troubleshooting Red Hat Ceph Storage
Objectives
After completing this section, you should be able to choose Red Hat Ceph Storage architecture
scenarios and operate Red Hat Ceph Storage-specific performance analysis tools to optimize
cluster deployments.
Latency
It is a common misconception that disk latency and response time are the same thing. Disk
latency is a function of the device, but response time is measured as a function of the entire
server.
For hard drives using spinning platters, disk latency has two components:
• Seek time: The time it takes to position the drive heads on the correct track on the platter,
typically 0.2 to 0.8 ms.
• Rotational latency: The additional time it takes for the correct starting sector on that track
to pass under the drive heads, typically a few milliseconds.
After the drive has positioned the heads, it can start transferring data from the platter. At that
point, the sequential data transfer rate is important.
For solid-state drives (SSDs), the equivalent metric is the random access latency of the
storage device, which is typically less than a millisecond. For non-volatile memory express
drives (NVMes), the random access latency of the storage drive is typically in microseconds.
Throughput
Throughput refers to the actual number of bytes per second the system can read or write. The
size of the block and the data transfer rate affect the throughput. The higher the disk block
size, the more you attenuate the latency factor. The higher the data transfer rate, the faster a
disk can transfer data from its surface to a buffer.
As a reference value, hard drives using spinning platters have a throughput of around 150 MB/s,
SSDs around 500 MB/s, and NVMe drives on the order of 2,000 MB/s.
You can measure throughput for networks and the whole system, from a remote client to a
server.
Tuning Objectives
The hardware you use determines the performance limits of your system and your Ceph cluster.
The objective of tuning performance is to use your hardware as efficiently as possible.
It is a common observation that tuning a specific subsystem can adversely affect the performance
of another. For example, you can tune your system for low latency at the expense of high
throughput. Therefore, before starting to tune, establish your goals to align with the expected
workload of your Ceph cluster:
IOPS optimized
Workloads on block devices are often IOPS intensive, for example, databases running on
virtual machines in OpenStack. Typical deployments require high-performance SAS drives for
storage and journals placed on SSDs or NVMe devices.
Throughput optimized
Workloads on a RADOS Gateway are often throughput intensive. Objects can store significant
amounts of data, such as audio and video content.
Capacity optimized
Workloads that require the ability to store a large quantity of data as inexpensively as possible
usually trade performance for price. Selecting less-expensive and slower SATA drives is the
solution for this kind of workload.
Performance tuning typically pursues one or more of the following goals:
• Reduce latency
• Increase IOPS at the device
• Increase block size
Ceph Deployment
It is important to plan a Ceph cluster deployment correctly. MON performance is critical
to overall cluster performance, so MONs should run on dedicated nodes in large deployments. To
ensure a correct quorum, deploy an odd number of MONs.
Designed to handle large quantities of data, Ceph can achieve improved performance if the
correct hardware is used and the cluster is tuned correctly.
After the cluster installation, begin continuous monitoring of the cluster to troubleshoot failures
and schedule maintenance activities. Although Ceph has significant self-healing abilities, many
types of failure events require rapid notification and human intervention. Should performance
issues occur, begin troubleshooting at the disk, network, and hardware level. Then, continue with
diagnosing RADOS block devices and the Ceph RADOS Gateways.
In a typical deployment, OSDs use traditional spinning disks with high latency because they
provide satisfactory metrics that meet defined goals at a lower cost per megabyte. By default,
BlueStore OSDs place the data, block database, and WAL on the same block device. However,
you can maximize efficiency by using separate low-latency SSDs or NVMe devices for the
block database and WAL. Multiple block databases and WALs can share the same SSD or
NVMe device, reducing the cost of the storage infrastructure.
Consider the impact of the following SSD specifications against the expected workload:
• Mean Time Between Failures (MTBF) for the number of supported writes
• IOPS capabilities
• Data transfer rate
• Combined bus and SSD capabilities
Warning
When an SSD or NVMe device that hosts journals fails, every OSD using it to host
its journal also becomes unavailable. Consider this when deciding how many block
databases or WALs to place on the same storage device.
The RADOS Gateway maintains one index per bucket. By default, Ceph stores this index in
one RADOS object. When a bucket stores more than 100,000 objects, the index performance
degrades because the single index object becomes a bottleneck.
Ceph can keep large indexes in multiple RADOS objects, or shards. Enable this feature by
setting the rgw_override_bucket_index_max_shards parameter. The recommended
value is the number of objects expected in a bucket divided by 100,000.
As the index grows, Ceph must regularly reshard the bucket. Red Hat Ceph Storage provides a
bucket index automatic resharding feature. The rgw_dynamic_resharding parameter, set
to true by default, controls this feature.
Each MDS maintains a cache in memory for different kinds of items, such as inodes. Ceph
limits the size of this cache with the mds_cache_memory_limit parameter. Its default value,
expressed in absolute bytes, is equal to 4 GB.
Use this formula to estimate how many PGs should be available for a single, specific pool:
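The guideline used by the Red Hat PG calculator, given here as a sketch, is:
PGs for a pool = (Target PGs per OSD × Number of OSDs × %Data for the pool) / Replica count (pool size)
where Target PGs per OSD is typically 100, %Data is the fraction of the cluster data expected in the pool, and the result is rounded up to the nearest power of two.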
Apply the formula for each pool to get the total number of PGs for the cluster. Red Hat
recommends between 100 and 200 PGs per OSD.
Note
Red Hat provides the Ceph Placement Groups (PGs) per Pool Calculator
to recommend the number of PGs per pool, at https://access.redhat.com/labs/
cephpgc/.
Splitting PGs
Ceph supports increasing or decreasing the number of PGs in a pool. If you do not specify a
value when creating a pool, it is created with a default value of 8 PGs, which is very low.
Red Hat recommends that you make incremental increases in the number of placement
groups until you reach the desired number of PGs. Increasing the number of PGs by a
significant amount can cause cluster performance degradation, because the expected data
relocation and rebalancing is intensive.
Use the ceph osd pool set command to manually increase or decrease the number of
PGs by setting the pg_num parameter. You should only increase the number of PGs in a pool
by small increments when doing it manually with the pg_autoscale_mode option disabled.
Setting the total number of placement groups to a number that is a power of 2 provides better
distribution of the PGs across the OSDs. Increasing the pg_num parameter automatically
increases the pgp_num parameter, but at a gradual rate to minimize the impact on cluster
performance.
Merging PGs
Red Hat Ceph Storage can merge two PGs into a larger PG, reducing the total number of
PGs. Merging can be useful when the number of PGs in a pool is too large and performance
is degraded. Because merging is a complex process, merge only one PG at a time to minimize
the impact on cluster performance.
PG Auto-scaling
As discussed, the PG autoscale feature allows Ceph to make recommendations and
automatically adjust the number of PGs. This feature is enabled by default when creating a
pool. For existing pools, configure autoscaling with this command:
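For example (POOL_NAME is a placeholder, and the mode can be on, off, or warn):
[ceph: root@node /]# ceph osd pool set POOL_NAME pg_autoscale_mode on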
Set the mode parameter to off to disable it, on to enable it and allow Ceph to automatically
make adjustments in the number of PGs, or warn to raise health alerts when the number of
PGs must be adjusted.
In the output of the ceph osd pool autoscale-status command, the first two
pools have the AUTOSCALE feature set to on, with Ceph automatically adjusting the number
of PGs. The third pool is configured to provide a health alert if the number of PGs needs
adjusting. The PG_NUM column shows the current number of PGs in each pool, or the number of
PGs that the pool is working towards. The NEW PG_NUM column shows the number of PGs that
Ceph recommends setting for the pool.
Scalability
You can scale clustered storage in two ways: scaling up, by adding resources to existing nodes, or scaling out, by adding more nodes to the cluster.
Scaling up requires that nodes can accept more CPU and RAM resources to handle an increase
in the number of disks and disk size. Scaling out requires adding nodes with similar resources and
capacity to match the cluster's existing nodes for balanced operations.
• To increase performance and provide better isolation for troubleshooting, use separate
networks for OSD traffic and for client traffic.
• At a minimum, use 10 Gbps networks or faster for the storage cluster. 1 Gbps networks are not
suitable for production environments.
• Evaluate network sizing based on both cluster and client traffic, and the amount of data stored.
• Network monitoring is highly recommended.
• Use separate NICs to connect to the networks where possible, or else use separate ports.
The Ceph daemons automatically bind to the correct interfaces, such as binding MONs to the
public network, and binding OSDs to both public and cluster networks.
To avoid degraded cluster performance, adjust the backfilling and recovery operations to create
a balance between rebalancing and normal cluster operations. Ceph provides parameters to limit
the backfilling and recovery operations' I/O and network activity.
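As a sketch, such limits can be lowered at runtime with the ceph config command; the two parameters shown here are described in more detail later in this chapter, and the values are illustrative:
[ceph: root@node /]# ceph config set osd osd_max_backfills 1
[ceph: root@node /]# ceph config set osd osd_recovery_max_active 1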
Configuring Hardware
Using realistic metrics for your cluster's expected workload, build the cluster's hardware
configuration to provide sufficient performance, but keep the cost as low as possible. Red Hat
suggests these hardware configurations for the three performance priorities:
IOPS optimized
• NVMe drives have data, the block database, and WAL collocated on the same storage
device.
• Assuming a 2 GHz CPU, use 10 cores per NVMe or 2 cores per SSD.
Throughput optimized
Capacity optimized
• HDDs have data, the block database, and WAL collocated on the same storage device.
• The seq and rand tests are sequential and random read benchmarks. These tests require
that a write benchmark is run first with the --no-cleanup option. By default, RADOS
bench removes the objects created for the writing test. The --no-cleanup option keeps
the objects, which can be useful for performing multiple tests on the same objects.
With the --no-cleanup option, you must manually remove data that remains in the pool
after running the rados bench command.
For example, the following information is provided by the rados bench command,
including throughput, IOPS, and latency:
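A typical invocation, sketched here with an assumed pool named testpool, runs a timed write benchmark, keeps the benchmark objects, runs a sequential read benchmark, and then removes the leftover objects:
[ceph: root@node /]# rados bench -p testpool 10 write --no-cleanup
[ceph: root@node /]# rados bench -p testpool 10 seq
[ceph: root@node /]# rados -p testpool cleanup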
For example, information is provided by the rbd bench command, including throughput
and latency:
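A sketch of such an invocation; the benchpool pool, the testimage image, and the I/O sizes are illustrative assumptions:
[ceph: root@node /]# rbd bench --io-type write --io-size 4096 --io-total 1G benchpool/testimage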
References
sysctl(8), ceph(8), rados(8), and rbd(8) man pages
For more information, refer to the Ceph performance benchmark chapter in the
Red Hat Ceph Administration Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index#ceph-performance-benchmarking
Guided Exercise
Outcomes
You should be able to run performance analysis tools and configure the Red Hat Ceph
Storage cluster using the results.
If you performed the practice exercises in the Managing a Red Hat Ceph
Storage Cluster chapter, but have not reset your environment to the default
classroom cluster since that chapter, then you must reset your environment
before executing the lab start command. All remaining chapters use the
default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command ensures that the lab environment is available for the exercise.
Instructions
• Create a new pool called testpool and change the PG autoscale mode to off. Reduce the
number of PGs, and then check the recommended number of PGs. Change the PG autoscale
mode to warn and check the health warning message.
• Modify the primary affinity settings on an OSD so it is less likely to be selected as primary for
placement groups.
• Using the built-in Ceph benchmarking tool, rados bench, measure the
performance of a Ceph cluster at the pool level.
• The admin user has SSH key-based access from the clienta node to the admin account
on all cluster nodes, and has passwordless sudo access to the root and ceph accounts on all
cluster nodes.
• The serverc, serverd, and servere nodes comprise an operational 3-node Ceph cluster. All
three nodes operate as a MON, a MGR, and an OSD host with three 10 GB collocated OSDs.
Warning
The parameters used in this exercise are appropriate for this lab environment.
In production, these parameters should only be modified by qualified Ceph
administrators, or as directed by Red Hat Support.
1. Log in to clienta as the admin user. Create a new pool called testpool, set the
PG autoscale mode to warn, reduce the number of PGs, and view the health warning
messages. Set the PG autoscale mode to on again, and then verify the number of PGs and
that cluster health is ok again.
1.1. Connect to clienta as the admin user and use sudo to run the cephadm shell.
1.2. Create a new pool called testpool with the default number of PGs.
1.3. Verify the cluster health status and the information from the PG autoscaler. The
autoscaler mode for the created pool testpool should be on and the number of
PGs is 32.
1.4. Set the PG autoscale option to off for the pool testpool. Reduce the number of
PGs to 8. Verify the autoscale recommended number of PGs, which should be 32.
Verify that the cluster health is OK.
[ceph: root@clienta /]# ceph osd pool set testpool pg_autoscale_mode off
set pool 6 pg_autoscale_mode to off
[ceph: root@clienta /]# ceph osd pool set testpool pg_num 8
set pool 6 pg_num to 8
[ceph: root@clienta /]# ceph osd pool autoscale-status
POOL                   SIZE  TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
device_health_metrics     0               3.0   92124M        0.0000                                  1.0       1              on
.rgw.root              1323               3.0   92124M        0.0000                                  1.0      32              on
default.rgw.log        3702               3.0   92124M        0.0000                                  1.0      32              on
default.rgw.control       0               3.0   92124M        0.0000                                  1.0      32              on
default.rgw.meta          0               3.0   92124M        0.0000                                  4.0       8              on
testpool                  0               3.0   92124M        0.0000                                  1.0       8          32  off
[ceph: root@clienta /]# ceph health detail
HEALTH_OK
1.5. Set the PG autoscale option to warn for the pool testpool. Verify that cluster
health status is now WARN, because the recommended number of PGs is higher than
the current number of PGs. It might take several minutes before the cluster shows the
health warning message.
[ceph: root@clienta /]# ceph osd pool set testpool pg_autoscale_mode warn
set pool 6 pg_autoscale_mode to warn
[ceph: root@clienta /]# ceph health detail
HEALTH_WARN 1 pools have too few placement groups
[WRN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Pool testpool has 8 placement groups, should have 32
1.6. Enable the PG autoscale option and verify that the number of PGs has been
increased automatically to 32, the recommended value. This increase might take a
few minutes to display.
2. Modify the primary affinity settings on an OSD so that it is less likely to be selected as
primary for placement groups. Set the primary affinity for OSD 7 to 0.
2.3. Verify the primary affinity settings for OSDs in the cluster.
3. Create a pool called benchpool with the object clean-up feature turned off.
[ceph: root@clienta /]# ceph osd pool create benchpool 100 100
pool 'benchpool' created
3.2. Use the rbd pool init command to initialize a custom pool to store RBD images.
This step could take several minutes to complete.
4. Open a second terminal and log in to the clienta node as the admin user. Use the first
terminal to generate a workload and use the second terminal to collect metrics. Run a write
test to the RBD pool benchpool. This might take several minutes to complete.
Note
This step requires sufficient time to complete the write OPS for the test. Be
prepared to run the ceph osd perf command in the second terminal immediately after
starting the rados bench write test in the first terminal.
4.1. Open a second terminal. Log in to clienta as the admin user and use sudo to run
the cephadm shell.
osd  commit_latency(ms)  apply_latency(ms)
  3                  72                 72
  4                 135                135
  5                  59                 59
Note
If no data displays, then use the first terminal to generate the workload again. The
metric collection must run while the bench tool is generating workload.
4.4. In the second terminal, use the ID of the OSD that shows high latency in the previous
step to locate the node that hosts that OSD. Determine the name of that node.
5.1. Verify the performance counters for the OSD. Redirect the output of the command
to a file called perfdump.txt.
[ceph: root@clienta /]# ceph tell osd.6 perf dump > perfdump.txt
5.2. In the perfdump.txt file, locate the section starting with osd:. Note the
op_latency and subop_latency counters, which report the latency of read and write
operations and of their suboperations. Also note the op_r_latency and op_w_latency
counters.
Each counter includes avgcount and sum fields that are required to calculate the
exact counter value. Calculate the value of the op_latency and subop_latency
counters by using the formula counter = counter.sum / counter.avgcount.
"avgtime": 0.020147238
},
...output omitted...
"op_r_latency": {
"avgcount": 3059,
"sum": 1.395967825,
"avgtime": 0.000456347
},
...output omitted...
"op_w_latency": {
"avgcount": 480,
"sum": 71.668254827,
"avgtime": 0.149308864
},
...output omitted...
"op_rw_latency": {
"avgcount": 125,
"sum": 0.755260647,
"avgtime": 0.006042085
},
...output omitted...
"subop_latency": {
"avgcount": 1587,
"sum": 59.679174303,
"avgtime": 0.037605024
},
...output omitted...
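For example, applying the formula to the op_w_latency counter in the preceding output gives 71.668254827 / 480 ≈ 0.149 seconds, which matches the reported avgtime field.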
5.3. In the first terminal, repeat the capture using the rados bench write command.
5.4. In the second terminal, capture the counters again and view how the values changed by applying the same counter = counter.sum / counter.avgcount formula:
[ceph: root@clienta /]# ceph tell osd.6 perf dump > perfdump.txt
[ceph: root@clienta /]# cat perfdump.txt | grep -A88 '"osd"'
...output omitted...
Note
The values are cumulative and are returned when the command is executed.
6.1. In the second terminal, dump the information maintained in memory for the most
recently processed operations. Redirect the dump to the historicdump.txt
file. By default, each OSD records information on the last 20 operations over 600
seconds. View the historicdump.txt file contents.
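A likely form of the commands for this step, assuming the same osd.6 target used earlier, is:
[ceph: root@clienta /]# ceph tell osd.6 dump_historic_ops > historicdump.txt
[ceph: root@clienta /]# cat historicdump.txt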
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to protect OSD and cluster hardware resources
from over-utilization by controlling scrubbing, deep scrubbing, backfill, and recovery processes to
balance CPU, RAM, and I/O requirements.
• Tune the BlueStore back end used by OSDs to store objects on physical devices.
• Adjust the schedule for automatic data scrubbing and deep scrubbing.
• Adjust the schedule of asynchronous snapshot trimming (deleting removed snapshots).
• Control how quickly backfill and recovery operations occur when OSDs fail or are added or
replaced.
Efficient copy-on-write
The Ceph Block Device and Ceph File System snapshots rely on a copy-on-write clone
mechanism that is implemented efficiently in BlueStore. This results in efficient I/O for regular
snapshots and for erasure-coded pools that rely on cloning to implement efficient two-phase
commits.
Multidevice support
BlueStore can use multiple block devices for storing the data, metadata, and write-ahead log.
In BlueStore, the raw partition is managed in chunks of the size specified by the
bluestore_min_alloc_size variable. The bluestore_min_alloc_size is set by default
to 4,096, which is equivalent to 4 KB, for HDDs and SSDs. If the data to write in the raw partition
is smaller than the chunk size, then it is filled with zeroes. This can lead to a waste of the unused
space if the chunk size is not properly sized for your workload, such as for writing many small
objects.
Red Hat recommends setting the bluestore_min_alloc_size variable to match the smallest
common write to avoid wasting unused space. For example, if your client writes 4 KB objects
frequently, then set bluestore_min_alloc_size on the OSD nodes to 4096 bytes to match that
write size.
Important
Red Hat does not recommend changing the bluestore_min_alloc_size value
in your production environment before first contacting Red Hat Support.
Set the value for the bluestore_min_alloc_size variable by using the ceph config
command:
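A sketch of such a command; the OSD target, the _hdd device-class suffix, and the 8192-byte value are illustrative assumptions only, and should not be applied without guidance from Red Hat Support:
[ceph: root@node /]# ceph config set osd.OSD_ID bluestore_min_alloc_size_hdd 8192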
For reference, a BlueStore fragmentation score between 0 and 0.7 indicates small and acceptable
fragmentation, a score between 0.7 and 0.9 indicates considerable but still safe fragmentation,
and a score higher than 0.9 indicates severe fragmentation that causes performance issues.
By default, Red Hat Ceph Storage performs light scrubbing every day and deep scrubbing every
week. However, Ceph can begin the scrubbing operation at any time, which can impact cluster
performance. You can disable or re-enable cluster-level light scrubbing by using the ceph osd set
noscrub and ceph osd unset noscrub commands. Although scrubbing has a performance
impact, Red Hat recommends keeping the feature enabled because it maintains data integrity.
Red Hat recommends setting the scrubbing parameters to restrict scrubbing to known periods
with the lowest workloads.
Note
The default configuration allows light scrubbing at any time during the day.
Light Scrubbing
Tune the light scrubbing process by adding parameters in the [osd] section of the ceph.conf
file. For example, use the osd_scrub_begin_hour parameter to set the time of day that light
scrubbing begins, thereby avoiding light scrubbing during peak workloads.
The light scrubbing feature has the following tuning parameters:
osd_scrub_begin_hour = begin_hour
The begin_hour parameter specifies the time to start scrubbing. Valid values are from 0 to 23.
If the value is set to 0 and osd_scrub_end_hour is also 0, then scrubbing is allowed the
entire day.
osd_scrub_end_hour = end_hour
The end_hour parameter specifies the time to stop scrubbing. Valid values are from 0 to 23.
If the value is set to 0 and osd_scrub_begin_hour is also 0, then scrubbing is allowed the
entire day.
osd_scrub_load_threshold
Perform a scrub only if the system load is below this threshold, which is defined as
getloadavg() divided by the number of online CPUs. The default value is 0.5.
osd_scrub_min_interval
Perform a scrub no more often than the number of seconds defined in this parameter if the
load is below the threshold set in the osd_scrub_load_threshold parameter. The default
value is 1 day.
osd_scrub_interval_randomize_ratio
Add a random delay to the value defined in the osd_scrub_min_interval parameter. The
default value is 0.5.
osd_scrub_max_interval
Do not wait more than this period before performing a scrub, regardless of load. The default
value is 7 days.
osd_scrub_priority
Set the priority for scrub operations by using this parameter. The default value is 5. This value
is relative to the value of the osd_client_op_priority, which has a higher default priority
of 63.
Deep Scrubbing
You can enable and disable deep scrubbing at the cluster level by using the ceph osd set
nodeep-scrub and ceph osd unset nodeep-scrub commands. You can configure deep
scrubbing parameters by adding them to the [osd] section of the ceph configuration file,
ceph.conf. As with the light scrubbing parameters, any changes made to the deep scrub
configuration can impact cluster performance.
The following parameters are the most critical to tuning deep scrubbing:
osd_deep_scrub_interval
The interval for deep scrubbing. The default value is 7 days.
osd_scrub_sleep
Introduces a pause between deep scrub disk reads. Increase this value to slow down scrub
operations and to have a lower impact on client operations. The default value is 0.
You can use an external scheduler to implement light and deep scrubbing by using the following
commands:
• The ceph pg dump command displays the last light and deep scrubbing occurrences in the
LAST_SCRUB and LAST_DEEP_SCRUB columns.
• The ceph pg scrub pg-id command schedules a light scrub on a particular PG.
• The ceph pg deep-scrub pg-id command schedules a deep scrub on a particular PG.
Use the ceph osd pool set pool-name parameter value command to set these
parameters for a specific pool.
noscrub
If set to true, Ceph does not light scrub the pool. The default value is false.
nodeep-scrub
If set to true, Ceph does not deep scrub the pool. The default value is false.
scrub_min_interval
Scrub no more often than the number of seconds defined in this parameter. If set to the
default 0, then Ceph uses the osd_scrub_min_interval global configuration parameter.
scrub_max_interval
Do not wait more than the period defined in this parameter before scrubbing the pool. If set to
the default 0, Ceph uses the osd_scrub_max_interval global configuration parameter.
deep_scrub_interval
The interval for deep scrubbing. If set to the default 0, Ceph uses the
osd_deep_scrub_interval global configuration parameter.
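For example, a sketch that sets pool-specific scrub intervals on a pool named testpool; the pool name and the one-day and seven-day values are illustrative:
[ceph: root@node /]# ceph osd pool set testpool scrub_min_interval 86400
[ceph: root@node /]# ceph osd pool set testpool deep_scrub_interval 604800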
To reduce the impact of the snapshot trimming process on the cluster, you can configure
a pause after the deletion of each snapshot object. Configure this pause by using the
osd_snap_trim_sleep parameter, which is the time in seconds to wait before allowing the next
snapshot trimming operation. The default value for this parameter is 0. Contact Red Hat Support
for further advice on how to set this parameter based on your environment settings.
Control the snapshot trimming process using the osd_snap_trim_priority parameter, which
has a default value of 5.
Backfill occurs when a new OSD joins the cluster or when an OSD dies and Ceph reassigns its PGs
to other OSDs. When such events occur, Ceph creates object replicas across the available OSDs.
Recovery occurs when a Ceph OSD becomes inaccessible and comes back online, for example due
to a short outage. The OSD goes into recovery mode to obtain the latest copy of the data.
Use the following parameters to manage the backfill and recovery operations:
osd_max_backfills
Control the maximum number of concurrent backfill operations per OSD. The default value is
1.
osd_recovery_max_active
Control the maximum number of concurrent recovery operations per OSD. The default value
is 3.
osd_recovery_op_priority
Set the recovery priority. The value can range from 1 - 63. The higher the number, the higher
the priority. The default value is 3.
References
For more information, refer to the OSD Configuration Reference chapter of the
Configuration Guide for Red Hat Ceph Storage at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/configuration_guide/index#ceph-monitor-and-osd-configuration-options_conf
For more information on tuning Red Hat Ceph Storage 5 BlueStore, refer to
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/administration_guide/index#osd-bluestore
Guided Exercise
Outcomes
You should be able to:
Instructions
• The clienta node is the admin node and is a client of the Ceph cluster.
• The serverc, serverd, and servere nodes are an operational 3-node Ceph cluster. All three
nodes operate as a MON, a MGR, and an OSD host with three 10 GB collocated OSDs.
Warning
The parameters used in this exercise are appropriate for this lab environment.
In production, these parameters should only be modified by qualified Ceph
administrators, or as directed by Red Hat Support.
1. Log in to clienta as the admin user. Inspect OSD 0 for BlueStore fragmentation.
1.1. Connect to clienta as the admin user and use sudo to run the cephadm shell.
1.2. Retrieve information about OSD 0 fragmentation. The value should be low because
the number of operations in the cluster is low and the cluster is new.
[ceph: root@clienta /]# ceph tell osd.0 bluestore allocator score block
{
"fragmentation_rating": 0.0016764709285897418
}
2. By default, Red Hat Ceph Storage allows one PG backfill at a time, to or from an OSD.
Modify this parameter to 2 on a per-OSD basis. Configure PG backfilling on an OSD.
2.1. Select one OSD running on the serverc node and obtain its IDs. In the following
example, the options are the osd.0, osd.1 and osd.2 OSDs. Yours might be
different.
2.2. On your selected OSD on host serverc, retrieve the value for the
osd_max_backfills parameter. In this example, the selected OSD is osd.0.
2.3. Modify the current runtime value for the osd_max_backfills parameter to 2.
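A sketch of these two sub-steps, assuming osd.0 is the selected OSD:
[ceph: root@clienta /]# ceph tell osd.0 config get osd_max_backfills
[ceph: root@clienta /]# ceph tell osd.0 config set osd_max_backfills 2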
3. By default, Red Hat Ceph Storage allows three simultaneous recovery operations for HDDs
and ten for SSDs. Modify the maximum number of data recovery operations to 1 per OSD.
3.2. Set the current runtime for the osd_recovery_max_active parameter to 1 on the
OSD of your choice. Verify that the changes are applied.
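A sketch of this step, again assuming osd.0 as the target:
[ceph: root@clienta /]# ceph tell osd.0 config set osd_recovery_max_active 1
[ceph: root@clienta /]# ceph tell osd.0 config get osd_recovery_max_active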
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Objectives
After completing this section, you should be able to identify key tuning parameters and
troubleshoot performance for Ceph clients, including RADOS Gateway, RADOS Block Devices,
and CephFS.
Beginning Troubleshooting
The hardware backing a Ceph cluster is subject to failure over time. The data in your cluster
becomes fragmented and requires maintenance. You should perform consistent monitoring and
troubleshooting in your cluster to keep it in a healthy state. This section presents some practices
that enable troubleshooting for various issues on a Ceph cluster. You can perform this initial
troubleshooting of your cluster before contacting Red Hat Support.
Identifying Problems
When troubleshooting issues with Ceph, the first step is to determine which Ceph component is
causing the problem. Sometimes, you can find this component in the information provided by the
ceph health detail or ceph status commands. Other times, you must investigate
further to discover the issue. Verify the cluster's status to help determine whether there is a single
daemon failure or an entire node failure.
The ceph status and ceph health commands show the cluster health status. When the
cluster health status is HEALTH_WARN or HEALTH_ERR, use the ceph health detail command
to view the health check message so that you can begin troubleshooting the issue.
Some health status messages indicate a specific issue; others provide a more general indication.
For example, if the cluster health status changes to HEALTH_WARN and you see the health
message HEALTH_WARN 1 osds down; Degraded data redundancy, then that is a clear
indication of the problem.
Other health status messages might require further troubleshooting because they might indicate
several possible root causes. For example, the following message indicates an issue that has
multiple possible solutions:
You can resolve this issue by changing the pg_num setting on the specified pool, or by
reconfiguring the pg_autoscaler mode setting from warn to on so that Ceph automatically
adjusts the number of PGs.
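A sketch of both approaches; pool_name is a placeholder for the pool named in the health message, and the pg_num value is illustrative:
[ceph: root@clienta /]# ceph osd pool set pool_name pg_num 128
[ceph: root@clienta /]# ceph osd pool set pool_name pg_autoscale_mode on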
Ceph sends health messages regarding performance when a cluster performance health check
fails. For example, OSDs send heartbeat ping messages to each other to monitor OSD daemon
availability. Ceph also uses the OSD ping response times to monitor network performance. A
single failed OSD ping message could mean a delay from a specific OSD, indicating a potential
problem with that OSD. Multiple failed OSD ping messages might indicate a failure of a network
component, such as a network switch between OSD hosts.
Note
View the list of health check messages of a Ceph cluster at https://
access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/
troubleshooting_guide/index#health-messages-of-a-ceph-cluster_diag.
You can find a list of health check messages specific to CephFS at https://
access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/
file_system_guide/index#health-messages-for-the-ceph-file-system_fs.
Ceph specifies the health check alert by using health check codes. For example, the previous
HEALTH_WARN message shows the POOL_TOO_FEW_PGS health code.
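You can silence an alert with the ceph health mute command. Its general form, with an illustrative invocation (the OSD_DOWN code and 1h duration are examples), is:
ceph health mute health-code [duration]
[ceph: root@clienta /]# ceph health mute OSD_DOWN 1h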
The health-code is the code provided by the ceph health detail command. The optional
parameter duration is the time that the health message is muted, specified in seconds, minutes,
or hours. You can unmute a health message with the ceph health unmute health-code
command.
When you mute a health message, Ceph automatically unmutes the alert if the health status
further degrades. For example, if your cluster reports one OSD down and you mute that alert,
Ceph automatically removes the mute if another OSD goes down. Any muted health alert that
degrades in a measurable way is automatically unmuted.
Configuring Logging
If there is a problem in a specific area of your cluster, then you can enable logging for that area. For
example, if your OSDs are running adequately but your metadata servers are not, enable debug
logging for the specific metadata server instances. Enable logging for each subsystem as needed.
Adding debugging to your Ceph configuration is typically done temporarily during runtime. You
can add Ceph debug logging to your Ceph configuration database if you encounter issues when
starting your cluster. View Ceph log files under the default location /var/log/ceph. Ceph stores
logs in a memory-based cache.
Warning
Logging is resource-intensive. Verbose logging can generate over 1 GB of data per
hour. If your OS disk reaches its capacity, then the node stops working. When you
fix your cluster issues, revert the logging configuration to default values. Consider
setting up log file rotation.
You can set different logging levels for each subsystem in your cluster. Debug levels are on a scale
of 1 to 20, where 1 is terse and 20 is verbose.
Ceph does not send memory-based logs to the output logs except in the following circumstances:
To use different debug levels for the output log level and the memory level, use a slash (/)
character. For example, debug_mon = 1/5 sets the output log level of the ceph-mon daemon to
1 and its memory log level to 5.
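A minimal sketch of setting this split level persistently in the configuration database:
[ceph: root@clienta /]# ceph config set mon debug_mon 1/5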
[ceph: root@node /]# ceph tell type.id config set debug_subsystem debug-level
The type and id arguments are the type of the Ceph daemon and its ID. The subsystem is the
specific subsystem whose debug level you want to modify.
Note
You can find a list of the subsystems at https://access.redhat.com/documentation/
en-us/red_hat_ceph_storage/5/html-single/troubleshooting_guide/index#ceph-
subsystems_diag.
This example modifies the OSD 0 debug level for the messaging system between Ceph
components:
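A minimal sketch of such a command (the debug level of 5 is illustrative):
[ceph: root@clienta /]# ceph tell osd.0 config set debug_ms 5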
For example, add debug levels for specific Ceph daemons by setting these parameters in your
Ceph configuration database:
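A sketch of such settings; the subsystems and levels shown are illustrative:
[ceph: root@clienta /]# ceph config set osd debug_osd 10
[ceph: root@clienta /]# ceph config set mds debug_mds 10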
You can add a size setting after the rotation frequency, so that the log file is rotated when it
reaches the specified size:
rotate 7
weekly
size size
compress
sharedscripts
Use the crontab command to add an entry to inspect the /etc/logrotate.d/ceph file.
For example, you can instruct Cron to check /etc/logrotate.d/ceph every 30 minutes.
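A minimal crontab entry for this, assuming the default logrotate binary location on Red Hat Enterprise Linux:
*/30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1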
you get the clock skew error. Network problems can also cause packet loss, high latency, or limited
bandwidth, impacting cluster performance and stability.
• Verify that the Ceph nodes can reach each other by using their host names.
• Ensure that Ceph nodes are able to reach each other on their appropriate ports, if firewalls are
used. Open the appropriate firewall ports if necessary.
• Validate that network connectivity between hosts has the expected latency and no packet loss,
for example, by using the ping command.
• Slower connected nodes could slow down the faster ones. Verify that the inter-switch links can
handle the accumulated bandwidth of the connected nodes.
• Verify that NTP is working correctly in your cluster nodes. For example, you can check the
information provided by the chronyc tracking command.
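For example, from any cluster node:
[admin@serverc ~]$ chronyc tracking
[admin@serverc ~]$ ping -c 4 serverd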
The ceph-common package provides bash tab completion for the rados, ceph, rbd, and
radosgw-admin commands. You can access option and attribute completions by pressing the
Tab key when you enter the command at the shell prompt.
On the client system, you can add the debug_ms = 1 parameter to the configuration database
by using the ceph config set client debug_ms 1 command. The Ceph client stores
debug messages in the /var/log/ceph/ceph-client.id.log log file.
Most of the Ceph client commands, such as rados, ceph, or rbd, also accept the --debug-ms=1
option to execute only that command with an increased logging level.
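For example:
[ceph: root@clienta /]# ceph --debug-ms=1 health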
In the /var/run/ceph/fsid directory, there is a list of admin sockets for that host: one
admin socket per OSD, one for each MON, and one for each MGR. Administrators can use the
ceph command with the --admin-daemon socket-path option to query the client through
the socket.
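A sketch of querying one daemon through its admin socket; the exact socket file name is one of those listed in the /var/run/ceph/fsid directory on that host:
[root@serverc ~]# ceph --admin-daemon /var/run/ceph/fsid/ceph-osd.0.asok config get debug_ms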
The following example mounts a CephFS file system with the FUSE client, gets the performance
counters, and sets the debug_ms configuration parameter to 1:
...output omitted...
"debug_ms": "5/5",
...output omitted...
From a client, you can find the version of the running Ceph cluster with the ceph versions
command:
You can also list the supported level of features with the ceph features command.
Using this minimum client setting, Ceph denies the use of features that are not compatible with
the current client version. Historically, the main exception has been changes to CRUSH. For
example, if you run the ceph osd set-require-min-compat-client jewel command,
then you cannot use the ceph osd pg-upmap command. This fails because "Jewel" version
clients do not support the PG upmap feature. Verify the minimum version required by your cluster
with the ceph osd command:
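A likely form of this verification (a sketch); the same value also appears in the output of ceph osd dump:
[ceph: root@clienta /]# ceph osd get-require-min-compat-client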
Either enable Cephx for all components, or disable it completely. Ceph does not support a mixed
setting, such as enabling Cephx for clients but disabling it for communication between the Ceph
services. By default, Cephx is enabled and a client trying to access the Ceph cluster without Cephx
receives an error message.
Important
Red Hat recommends using authentication in your production environment.
All Ceph commands authenticate as the client.admin user by default, although you can specify
the user name or the user ID by using the --name and --id options.
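For example, both of the following run the same check as a specific user:
[ceph: root@clienta /]# ceph health --id admin
[ceph: root@clienta /]# ceph health --name client.admin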
The following is a list of the most common Ceph MON error messages:
If the Ceph MON daemon is running but is reported as down, then the cause depends on
the MON state. If the Ceph MON is in the probing state longer than expected, then it cannot
find the other Ceph Monitors. This problem can be caused by networking issues, or the Ceph
Monitor might have an outdated Ceph Monitor map (monmap) and be trying to reach the other Ceph
Monitors at incorrect IP addresses.
If the Ceph MON is in the electing state longer than expected, then its clock might not
be synchronized. If the state changes from synchronizing to electing, then it means
that the Ceph MON is generating maps faster than the synchronization process can handle.
If the state is either leader or peon, then the Ceph Mon has reached a quorum, but the rest
of the cluster does not recognize a quorum. This problem is mainly caused by failed clock
synchronization, such as an incorrect NTP configuration, or by an improperly working network.
clock skew
This error message indicates that the clocks for the MON might not be synchronized. The
mon_clock_drift_allowed parameter controls the maximum difference between clocks
that your cluster allows before showing the warning message. This problem is mainly caused
by failed clock synchronization, such as an incorrect NTP configuration, or by an improperly
working network.
The following is a list of the most common Ceph OSD error messages:
full osds
Ceph returns the HEALTH_ERR full osds message when the cluster reaches the capacity
set by the mon_osd_full_ratio parameter. By default, this parameter is set to 0.95 which
means 95% of the cluster capacity.
Use the ceph df command to determine the percentage of used raw storage, given by the
%RAW USED column. If the percentage of raw storage is above 70%, then you can delete
unnecessary data or expand the cluster by adding OSD nodes to reduce the usage.
nearfull osds
Ceph returns the nearfull osds message when the cluster reaches the capacity set by
the mon_osd_nearfull_ratio default parameter. By default, this parameter is set to 0.85
which means 85% of the cluster capacity.
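For example, to review raw and per-OSD utilization before deciding whether to delete data or add OSDs:
[ceph: root@clienta /]# ceph df
[ceph: root@clienta /]# ceph osd df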
In case of errors, consult the log files in the /var/log/ceph/ directory.
To log to a file, set the log_to_file parameter to true. You can update the location of the
log file and the log level by using the log_file and debug parameters, respectively. You can
also enable the rgw_enable_ops_log and rgw_enable_usage_log parameters in the
Ceph configuration database to log each successful RADOS Gateway operation and the usage,
respectively.
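A sketch of enabling these options in the configuration database; the option names are standard, and whether you apply them to the client.rgw section or to a specific gateway instance depends on your deployment:
[ceph: root@clienta /]# ceph config set client.rgw log_to_file true
[ceph: root@clienta /]# ceph config set client.rgw rgw_enable_ops_log true
[ceph: root@clienta /]# ceph config set client.rgw rgw_enable_usage_log true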
Verify the debugging logs using the radosgw-admin log list command. This command
provides a list of the log objects that are available. View log file information using the radosgw-
admin log show command. To retrieve the information directly from the log object, add the --
object parameter with the object ID. To retrieve the information on the bucket at the timestamp,
add the --bucket, --date, and --bucket-id parameters, which refer to the bucket name, the
timestamp, and the bucket ID.
You can verify issues on RADOS Gateway request completion by looking for HTTP status lines in
the RADOS Gateway log file.
The RADOS Gateway is a Ceph client that stores all of its configuration in RADOS objects. The
RADOS PGs holding this configuration data must be in the active+clean state. If the state is
not active+clean, then Ceph I/O requests will hang if the primary OSD becomes unable to
serve data, and HTTP clients will eventually time out. Identify the inactive PGs with the ceph
health detail command.
Troubleshooting CephFS
A CephFS Metadata Server (MDS) maintains a cache that is shared with its clients, whether FUSE
or kernel clients, so that the MDS can delegate part of its cache to them. For example, a client
accessing an inode can locally manage and cache changes to that object. If another client also
requests access to the same inode, the MDS can request that the first client update the server
with the new metadata.
To maintain cache consistency, an MDS requires a reliable network connection with its clients.
Ceph can automatically disconnect, or evict, unresponsive clients. When this occurs, unflushed
client data is lost.
When a client tries to gain access to CephFS, the MDS requests the client that has the current
capabilities to release them. If the client is unresponsive, then CephFS shows an error message
after a timeout. You can configure the timeout by using the session_timeout attribute with the
ceph fs set command. The default value is 60 seconds.
The session_autoclose attribute controls eviction. If a client fails to communicate with the
MDS for more than the default 300 seconds, then the MDS evicts it.
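A sketch that adjusts both attributes for a file system named fsname (a placeholder); the values shown are illustrative:
[ceph: root@clienta /]# ceph fs set fsname session_timeout 90
[ceph: root@clienta /]# ceph fs set fsname session_autoclose 600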
Ceph temporarily bans evicted clients so that they cannot reconnect. If this ban occurs, you must
reboot the client system or unmount and remount the file system to reconnect.
References
For more information, refer to the Configuring Logging chapter in the
Troubleshooting Guide for Red Hat Ceph Storage at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/troubleshooting_guide/index#configuring-logging
For more information, refer to the Troubleshooting Guide of the Red Hat Customer
Portal Ceph Storage Guide at
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-
single/troubleshooting_guide/index
Guided Exercise
Outcomes
You should be able to identify the error code for each Ceph component and resolve the
issues.
Instructions
1. Log in to clienta as the admin user and use sudo to run the cephadm shell. Verify the
health of the Ceph storage cluster.
Two separate issues need troubleshooting. The first issue is a clock skew error, and the
second issue is a down OSD which is degrading the PGs.
Note
The lab uses chronyd for time synchronization with the classroom server.
2.1. Exit the cephadm shell. On the serverd system, view the chronyd service status.
The chronyd service is inactive on the serverd system.
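A likely way to confirm and correct this on serverd (a sketch; the exact lab steps may differ):
[admin@serverd ~]$ systemctl is-active chronyd
[admin@serverd ~]$ sudo systemctl enable --now chronyd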
2.4. Return to the clienta system and use sudo to run the cephadm shell.
The time skew message might still display if the monitoring service has not yet
updated the time. Allow the cluster sufficient time for services to obtain the
corrected time. Continue with these exercise steps, but verify that the skew issue is
resolved before finishing the exercise.
Note
The health detail output might show the cluster state as HEALTH_OK. When an OSD
is down, the cluster migrates its PGs to other OSDs to return the cluster to a healthy
state. However, the down OSD still requires troubleshooting.
3.1. Exit the cephadm shell. On the serverc system, list the Ceph service units.
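For example:
[admin@serverc ~]$ systemctl list-units 'ceph*'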
3.2. Restart the OSD 0 service. The fsid and the OSD 0 service name are
different in your lab environment.
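A sketch of the restart command; replace fsid with the value shown in your unit names:
[admin@serverc ~]$ sudo systemctl restart ceph-fsid@osd.0.service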
3.4. Return to the clienta system and use sudo to run the cephadm shell.
3.6. Verify the health of the storage cluster. If the status is HEALTH_WARN and you
have resolved the time skew and OSD issues, then wait until the cluster status is
HEALTH_OK before continuing.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Lab
Outcomes
You should be able to:
This command ensures that the lab environment is available for the exercise.
Instructions
1. Log in to clienta as the admin user. Verify the health of the Ceph storage cluster.
Hypothesize possible causes for the displayed issues.
2. First, troubleshoot the clock skew issue.
3. Troubleshoot the down OSD issue. Use diagnostic logging to find and correct a non-working
configuration.
4. For the OSD 5 service, set the operations history size to track 40 completed operations and
the operations history duration to 700 seconds.
5. For all OSDs, modify the current runtime value for the maximum concurrent backfills to 3 and
for the maximum active recovery operations to 1.
6. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade tuning-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Solution
Outcomes
You should be able to:
This command ensures that the lab environment is available for the exercise.
Instructions
1. Log in to clienta as the admin user. Verify the health of the Ceph storage cluster.
Hypothesize possible causes for the displayed issues.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
services:
data:
pools: 5 pools, 105 pgs
objects: 189 objects, 4.9 KiB
usage: 105 MiB used, 90 GiB / 90 GiB avail
pgs: 18.095% pgs not active
35/567 objects degraded (6.173%)
68 active+clean
19 peering
11 active+undersized
7 active+undersized+degraded
2.1. Open a second terminal and log in to serverd as the admin user. The previous health
detail output stated that the time on the serverd system is 300 seconds different
from the other servers. View the chronyd service status on the serverd system to
identify the problem.
The chronyd service is inactive on the serverd system.
2.4. Return to workstation as the student user and close the second terminal.
2.5. In the first terminal, verify the health of the storage cluster.
The time skew message might still display if the monitoring service has not yet updated
the time. Allow the cluster sufficient time for services to obtain the corrected time.
Continue with these exercise steps, but verify that the skew issue is resolved before
finishing the exercise.
3. Troubleshoot the down OSD issue. Use diagnostic logging to find and correct a non-working
configuration.
3.2. Attempt to restart the OSD 4 service with the ceph orch command.
OSD 4 remains down after waiting a sufficient time.
3.4. Open a second terminal and log in to servere as the admin user. On the servere
system, list the Ceph units.
The OSD 4 services might not yet list as failed if the orchestrator is still attempting
to restart the service. Wait until the service lists as failed before continuing this
exercise.
3.5. Restart the OSD 4 service. The fsid and the OSD 4 service name are different
in your lab environment.
The OSD 4 service still fails to start.
3.6. In the first terminal, modify the OSD 4 logging configuration to write to the /var/
log/ceph/myosd4.log file and increase the logging level for OSD 4. Attempt to
restart the OSD 4 service with the ceph orch command.
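One possible sequence for this step; the option names are standard Ceph options, and the exact values used in your lab may differ:
[ceph: root@clienta /]# ceph config set osd.4 log_to_file true
[ceph: root@clienta /]# ceph config set osd.4 log_file /var/log/ceph/myosd4.log
[ceph: root@clienta /]# ceph config set osd.4 debug_ms 5
[ceph: root@clienta /]# ceph orch daemon restart osd.4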
3.7. In the second terminal window, view the myosd4.log file to discover the issue.
Error messages indicate an incorrect cluster network address configuration.
3.8. Return to workstation as the student user and close the second terminal.
3.9. In the first terminal, compare the cluster network addresses for the OSD 0 and OSD 4
services.
3.10. Modify the cluster network value for the OSD 4 service. Attempt to restart the OSD 4
service.
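A sketch of the comparison and the fix; network-prefix is a placeholder for the value that the working OSDs report:
[ceph: root@clienta /]# ceph config get osd.0 cluster_network
[ceph: root@clienta /]# ceph config get osd.4 cluster_network
[ceph: root@clienta /]# ceph config set osd.4 cluster_network network-prefix
[ceph: root@clienta /]# ceph orch daemon restart osd.4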
3.11. Verify that the OSD 4 service is now up. Verify the health of the storage cluster.
4. For the OSD 5 service, set the operations history size to track 40 completed operations and
the operations history duration to 700 seconds.
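For example:
[ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_size 40
[ceph: root@clienta /]# ceph tell osd.5 config set osd_op_history_duration 700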
5. For all OSDs, modify the current runtime value for the maximum concurrent backfills to 3 and
for the maximum active recovery operations to 1.
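For example, using the osd.* wildcard to address every OSD at runtime:
[ceph: root@clienta /]# ceph tell osd.* config set osd_max_backfills 3
[ceph: root@clienta /]# ceph tell osd.* config set osd_recovery_max_active 1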
Evaluation
Grade your work by running the lab grade tuning-review command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
On the workstation machine, use the lab command to complete this exercise. This is important
to ensure that resources from previous exercises do not impact upcoming exercises.
Summary
In this chapter, you learned:
• Red Hat Ceph Storage 5 performance depends on the performance of the underlying storage,
network, and operating system file system components.
• Ceph implements a scale-out model architecture. Increasing the number of OSD nodes
increases the overall performance. The greater the parallel access, the greater the load capacity.
• The RADOS and RBD bench commands are used to stress and benchmark a Ceph cluster.
• Controlling scrubbing, deep scrubbing, backfill, and recovery processes helps avoid cluster over-
utilization.
• Troubleshooting Ceph issues starts with determining which Ceph component is causing the
issue.
• Enabling logging for a failing Ceph subsystem provides diagnostic information about the issue.
Chapter 13
Managing Cloud Platforms with Red Hat Ceph Storage
Objectives
After completing this section, you should be able to describe Red Hat OpenStack Platform
storage requirements, and compare the architecture choices for using Red Hat Ceph Storage as
an RHOSP storage back end.
Figure 13.1 presents a high-level overview of the core service relationships of a simple RHOSP
installation. All services interact with the Identity service (Keystone) to authenticate users,
services, and privileges before any operation is allowed. Cloud users can choose to use the
command-line interface or the graphical Dashboard service to access existing resources and to
create and deploy virtual machines.
The Orchestration service is the primary component for installing and modifying an RHOSP
cloud. This section introduces the OpenStack services for integrating Ceph into an OpenStack
infrastructure.
Note
RHOSP 16.1 and 16.2 support RHCS 5 only as an external cluster. RHOSP 17 supports
dedicated RHCS 5 deployment with cephadm to replace ceph-ansible.
Dedicated
An organization without an existing, stand-alone Ceph cluster installs a dedicated Ceph
cluster that is composed of Ceph services and storage nodes during an RHOSP overcloud
installation. Only services and workloads that are deployed for, or on, an OpenStack overcloud
can use an OpenStack-dedicated Ceph implementation. External applications cannot access
or use OpenStack-dedicated Ceph cluster storage.
External
An organization can use an existing, stand-alone Ceph cluster for storage when creating a new
OpenStack overcloud. The TripleO deployment is configured to access that external cluster
to create the necessary pools, accounts, and other resources during overcloud installation.
Instead of creating internal Ceph services, the deployment configures the OpenStack
overcloud to access the existing Ceph cluster as a Ceph client.
A dedicated Ceph cluster supports a maximum of 750 OSDs when running the Ceph control
plane services on the RHOSP controllers. An external Ceph cluster can scale significantly larger,
depending on the hardware configuration. Updates and general maintenance are easier on an
external cluster because they can occur independently of RHOSP operations.
To maintain Red Hat support, RHOSP installations must be built and configured with the TripleO
Orchestration service. For a dedicated storage configuration, RHOSP 16 TripleO uses the same
RHCS 4 ceph-ansible playbooks that are used to install stand-alone Ceph clusters. However,
because TripleO dynamically organizes the playbooks and environment files to include in the
deployment, direct use of Ansible without TripleO is not supported.
Figure 13.2 presents an example of overcloud nodes to implement different service roles in a
simple overcloud.
The following node roles determine the services that are placed on storage nodes that handle data
plane traffic and on the physical storage devices.
The CephStorage role is the default, and control plane services are expected to be installed on
controller nodes.
• CephStorage - The most common dedicated Ceph storage node configuration. Contains
OSDs only, without control plane services.
• CephAll - A stand-alone full storage node with OSDs and all control plane services. This
configuration might be used with the ControllerNoCeph node role.
• CephFile - A node to scale out file sharing. Contains OSDs and MDS services.
• CephObject - A node to scale out object gateway access. Contains OSDs and RGW services.
When storage management traffic increases, controller nodes can become overloaded. The
following node roles support various configurations and distributions of Ceph control plane
services across multiple nodes. Coordinate controller node roles with role choices for storage
nodes to ensure that all wanted control plane services are deployed.
• Controller - The most common controller node configuration. Contains all normal control plane
services, including Ceph MGR, MDS, MON, RBD, and RGW services.
• ControllerNoCeph - A normal controller, but without Ceph control plane services. This node
role is selected when Ceph control plane services are moved to segregated nodes for increased
performance and scaling.
The following node roles are not included by default in the RHOSP distribution, but are described
in Red Hat online documentation. Use these roles to alleviate overloaded controller nodes by
moving primary Ceph services to separate, dedicated nodes. These roles are commonly found in
larger OpenStack installations with increased storage traffic requirements.
• CephMon - A custom-created node role that moves only the MON service from the controllers
to a separate node.
• CephMDS - A custom-created node role that moves only the MDS service from the controllers
to a separate node.
A Hyperconverged Infrastructure (HCI) node is a configuration with both compute and storage
services and devices on the same node. This configuration can result in increased performance
for heavy storage throughput applications. The default is the ComputeHCI role, which adds only
OSDs to a compute node, effectively enlarging your dedicated Ceph cluster. Ceph control plane
services remain on the controller nodes. The other node roles add various choices of control plane
services to the hyperconverged node.
• ComputeHCI - A compute node plus OSDs. These nodes have no Ceph control plane services.
• HciCephAll - A compute node plus OSDs and all Ceph control plane services.
• HciCephFile - A compute node plus OSDs and the MDS service. Used for scaling out file
sharing storage capacity.
• HciCephMon - A compute node plus OSDs and the MON and MGR services. Used for scaling
out block storage capacity.
• HciCephObject - A compute node plus OSDs and the RGW service. Used for scaling out
object gateway access.
A Distributed Compute Node (DCN) is another form of hyperconverged node that is designed for
use in remote data centers or branch offices that are part of the same OpenStack overcloud. For
DCN, the overcloud deployment creates a dedicated Ceph cluster, with a minimum of three nodes,
per remote site in addition to the dedicated Ceph cluster at the primary site. This architecture
is not a stretch cluster configuration. Later DCN versions support installing the Image service
(Glance) in the remote location for faster local image access.
Note
The following narrative provides a limited view of TripleO cloud deployment
resources. Your organization's deployment will require further design effort, because
every production overcloud has unique storage needs.
Because the default orchestration files are continuously being enhanced, you must not modify
default template files in their original location. Instead, create a directory to store your custom
environment files and parameter overrides. The following ceph-ansible-external.yaml
environment file instructs TripleO to use the ceph-ansible client role to access a preexisting,
external Ceph cluster. To override the default settings in this file, use a custom parameter file.
parameter_defaults:
# NOTE: These example parameters are required when using CephExternal
#CephClusterFSID: '4b5c8c0a-ff60-454b-a1b4-9747aa737d19'
#CephClientKey: 'AQDLOh1VgEp6FRAAFzT7Zw+Y9V6JJExQAsRnRQ=='
#CephExternalMonHost: '172.16.1.7, 172.16.1.8'
# the following parameters enable Ceph backends for Cinder, Glance, Gnocchi, and Nova
NovaEnableRbdBackend: true
CinderEnableRbdBackend: true
CinderBackupBackend: ceph
GlanceBackend: rbd
# Uncomment below if enabling legacy telemetry
# GnocchiBackend: rbd
# If the Ceph pools which host VMs, Volumes and Images do not match these
# names OR the client keyring to use is not called 'openstack', edit the
# following as needed.
NovaRbdPoolName: vms
CinderRbdPoolName: volumes
CinderBackupRbdPoolName: backups
GlanceRbdPoolName: images
# Uncomment below if enabling legacy telemetry
# GnocchiRbdPoolName: metrics
CephClientUserName: openstack
A TripleO deployment specifies a list of environment files for all of the overcloud services to
be deployed, with an openstack overcloud deploy command. Before deployment, the
openstack tripleo container image prepare command is used to determine all of the
services that are referenced in the configuration, and to prepare a list of the correct containers
to download and provide for the overcloud deployment. During the installation, Kolla is used to
configure and start each service container on the correct nodes, as defined by the node roles.
For this external Ceph cluster example, TripleO needs a parameter file that specifies the real
cluster parameters, to override the parameter defaults in the ceph-ansible-external.yaml
file. This example parameter-overrides.yaml file is placed in your custom deployment
files directory. You can obtain the key from the result of an appropriate ceph auth add
client.openstack command.
parameter_defaults:
# The cluster FSID
CephClusterFSID: '4b5c8c0a-ff60-454b-a1b4-9747aa737d19'
# The CephX user auth key
CephClientKey: 'AQDLOh1VgEp6FRAAFzT7Zw+Y9V6JJExQAsRnRQ=='
# The list of Ceph monitors
CephExternalMonHost: '172.16.1.7, 172.16.1.8, 172.16.1.9'
TripleO relies on the Bare Metal service to prepare nodes before installing them as Ceph
servers. Disk devices, both physical and virtual, must be cleaned of all partition tables and other
artifacts. Otherwise, Ceph refuses to overwrite the device, after determining that the device is in
use. To delete all metadata from disks, and create GPT labels, set the following parameter in the
/home/stack/undercloud.conf file on the undercloud. The Bare Metal service boots the
nodes and cleans the disks each time the node status is set to available for provisioning.
clean_nodes=true
References
For more information, refer to the Storage Guide at
https://access.redhat.com/documentation/en-us/
red_hat_openstack_platform/16.2/html-single/storage_guide/index
For more information, refer to the Integrating an Overcloud with an Existing Red Hat
Ceph Cluster at
https://access.redhat.com/documentation/en-
us/red_hat_openstack_platform/16.1/html-single/
integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index
Quiz
2. Which two of the following options describe the implementation choices for Ceph
integration designs? (Choose two.)
a. Stand-alone
b. External
c. Dedicated
d. Containerized
3. Which of the following options is the most common node role that is used when
TripleO builds a Ceph server?
a. CephStorage
b. CephAll
c. ControllerStorage
d. StorageNode
4. Which of the following options are benefits of a dedicated Ceph integration with OSP?
(Choose two.)
a. The number of OSDs in the cluster is limited only by hardware configuration.
b. Integrated installation and update strategies.
c. Hyperconverged infrastructure with compute resources.
d. Storage resources are available to external clients.
e. The Ceph cluster can support multiple OSP environments.
Solution
2. Which two of the following options describe the implementation choices for Ceph
integration designs? (Choose two.)
a. Stand-alone
b. External
c. Dedicated
d. Containerized
3. Which of the following options is the most common node role that is used when
TripleO builds a Ceph server?
a. CephStorage
b. CephAll
c. ControllerStorage
d. StorageNode
4. Which of the following options are benefits of a dedicated Ceph integration with OSP?
(Choose two.)
a. The number of OSDs in the cluster is limited only by hardware configuration.
b. Integrated installation and update strategies.
c. Hyperconverged infrastructure with compute resources.
d. Storage resources are available to external clients.
e. The Ceph cluster can support multiple OSP environments.
Objectives
After completing this section, you should be able to describe how OpenStack implements Ceph
storage for each storage-related OpenStack component.
The Network File System (NFS) is also a valid method for shared storage access across compute
and controller nodes. Although mature and capable of significant performance and resilience when
configured for redundancy, NFS has scaling limitations and was not designed for cloud application
requirements. OpenStack needs a scalable storage solution that is designed for use in the cloud.
Ceph provides the following advantages as an OpenStack storage back end:
• Supports the same API that the Swift Object Store uses.
• Supports thin provisioning by using copy-on-write, making volume-based provisioning fast.
• Supports Keystone identity authentication, for transparent integration with or replacement of
the Swift Object Store.
• Consolidates object storage and block storage.
• Supports the CephFS distributed file system interface.
OpenStack services use unique service accounts, which are named after the service. The service
account runs service actions on behalf of the requesting user or of another service. Similar
accounts are created in Ceph for each OpenStack service that requires storage access. For
example, the Image service is configured for Ceph access by using this command:
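A sketch of such a command; it creates a client.glance cephx user with RBD access to the images pool (the exact capabilities in your deployment may differ):
[ceph: root@clienta /]# ceph auth get-or-create client.glance \
  mon 'profile rbd' osd 'profile rbd pool=images' mgr 'profile rbd pool=images'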
Image Storage
In OpenStack, the default back end for the Image service is a file store that is located with the
Glance API node on the controller node. The location is configurable, with a default of /var/lib/
glance. To improve scalability, the Image service implemented an image cache at the default /
var/lib/glance/image-cache/ location on the controller node. When the Compute service
loads images that are stored in the default QCOW2 format and converts them to RAW for use on
the compute nodes, the converted image is cached.
When Red Hat OpenStack Platform is installed with the Swift Object Store, TripleO places
the image service back end on Swift by default. The Swift service creates a container called
glance for storing Glance images.
When Ceph storage is integrated into RHOSP, TripleO places the image service back end on
Ceph RADOS Block Devices (RBD) by default. Glance images are stored in a Ceph pool called
images. RHOSP works with images as immutable blobs and handles them accordingly. The pool
name is configurable with the glance_pool_name property. The images pool is configured as a
replicated pool by default, which means that all images are replicated across storage devices for
transparent resilience.
An image pool can be configured as erasure-coded to conserve disk space with a slight increase in
CPU utilization.
When using Ceph as the storage back end, it is important to disable the image cache, as it is not
needed because Ceph expects Glance images to be stored in the RAW format. When using RAW
images, all image interactions occur within Ceph, including image clone and snapshot creation.
Disabling the image cache eliminates significant CPU and network activity on controller nodes.
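A sketch of converting a QCOW2 image to RAW before uploading it to the Image service; the file and image names are illustrative:
qemu-img convert -f qcow2 -O raw rhel8.qcow2 rhel8.raw
openstack image create --disk-format raw --container-format bare --file rhel8.raw rhel8-raw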
When using a distributed architecture with Distributed Compute Nodes (DCN), TripleO can
configure the Image service with an image pool at each remote site. You can copy images between
the central (hub) site and the remote sites. The DCN Ceph cluster uses RBD technologies, such as
copy-on-write and snapshot layering, for fast instance launching. The Image, Block Storage,
and Compute services must all be configured to use Ceph RBD as their back-end storage.
Object Storage
Object storage is implemented in OpenStack by the Object Store service (Swift). The Object
Store service implements both the Swift API and the Amazon S3 API. The default storage back
end is file-based, and uses an XFS-formatted partition mounted in subdirectories of /srv/node
on the designated storage node. You can also configure the Object Store service to use an
existing, external Swift cluster as a back end.
When Ceph storage is integrated into RHOSP, TripleO configures the Object Store service
to use the RADOS Gateway (RGW) as the back end. Similarly, the Image service is configured for
RGW because Swift would not be available as a back end.
The Ceph Object Gateway can be integrated with the Keystone identity service. This
integration configures RGW to use the Identity service as the user authority. If Keystone
authorizes a user to access the gateway, then the user is also initially created on the Ceph Object
Gateway. Identity tokens that Keystone validates are considered valid by the Ceph Object
Gateway. The Ceph Object Gateway is also configured as an object-storage endpoint in
Keystone.
Block Storage
Block storage is implemented in OpenStack by the Block Storage service (Cinder). The
Block Storage service provides persistent volumes that remain in storage and are stable
when not attached to any instance. It is common to configure the Block Storage service with
multiple back ends. The default storage back end is the Logical Volume Manager (LVM), which
is configured to use a volume group called cinder-volumes. TripleO can create the volume
group during an installation, or use an existing cinder-volumes volume group.
When Ceph storage is integrated into RHOSP, TripleO configures the Block Storage service
to use RADOS Block Devices (RBD) as the back end. Block Storage volumes are stored in
a Ceph pool called volumes. Volume backups are stored in a Ceph pool called backups. Ceph
block device images attach to an OpenStack instance by using libvirt, which configures the
QEMU interface to the librbd Ceph module. Ceph stripes block volumes across multiple OSDs
within the cluster, providing increased performance for large volumes when compared to local
drives.
OpenStack volumes, snapshots, and clones are implemented as block devices. OpenStack uses
volumes to boot VMs, or to attach to running VMs as further application storage.
File Storage
File storage is implemented in OpenStack by the Shared File Systems service (Manila). The
Shared File Systems service supports multiple back ends and can provision shares from one
or more back ends. Share servers export file shares by using various file system protocols such as
NFS, CIFS, GlusterFS, or HDFS.
The Shared File Systems service is persistent storage and can be mounted to any number of
client machines. You can detach file shares from one instance and attach them to another instance
without data loss. The Shared File Systems service manages share attributes, access rules,
quotas, and rate limits. Because unprivileged users are not allowed to use the mount command,
the Shared File Systems service acts as a broker to mount and unmount shares that the
storage operator configured.
When Ceph storage is integrated into RHOSP, TripleO configures the Shared File Systems
service to use CephFS as the back end. CephFS uses the NFS protocol with the Shared File
Systems service. TripleO can use the ControllerStorageNFS server role to configure an
NFS Ganesha cluster as the scalable interface to the libcephfs back end.
Compute Storage
Ephemeral storage is implemented in OpenStack by the Compute service (Nova). The Compute
service uses the KVM hypervisor with libvirt to launch compute workloads as VMs. The
Compute service requires two types of storage for libvirt operations:
• Base image: A cached and formatted copy of the image from the Image service.
• Instance overlay: A layered volume to be overlaid on the base image to become the VM's
instance disk.
When Ceph storage is integrated into RHOSP, TripleO configures the Compute service to use
RADOS Block Devices (RBD) as the back end. With RBD, instance operating system disks can
be managed either as ephemeral, to be deleted when the instance is shut down, or as a persistent
volume. An ephemeral disk behaves like a normal disk, to be listed, formatted, mounted, and used
as a block device. However, the disk and its data cannot be preserved or accessed beyond the
instance that it is attached to.
In recent versions, you can boot every VM inside Ceph directly without using the Block Storage
service. This feature enables hypervisors to use the live-migration and evacuate operations
to restore VMs in another hypervisor during a maintenance operation or on a hardware failure.
References
Cinder Administration: Configure multiple-storage back ends
https://docs.openstack.org/cinder/latest/admin/blockstorage-multi-backend.html
For more information, refer to the Creating and Managing Instances Guide at
https://access.redhat.com/documentation/en-us/
red_hat_openstack_platform/16.1/html-single/creating_and_managing_instances/
index
For more information, refer to the Distributed compute node and storage
deployment at
https://access.redhat.com/documentation/en-
us/red_hat_openstack_platform/16.1/html-single/
distributed_compute_node_and_storage_deployment/index
For more information, refer to the CephFS Back End Guide for the Shared File
System Service at
https://access.redhat.com/documentation/en-
us/red_hat_openstack_platform/16.1/html-single/
cephfs_back_end_guide_for_the_shared_file_system_service/index
For more information, refer to the Deploying the Shared File Systems service with
CephFS through NFS at
https://access.redhat.com/documentation/en-
us/red_hat_openstack_platform/16.1/html-single/
deploying_the_shared_file_systems_service_with_cephfs_through_nfs/index
Quiz
1. Which of the following is the only image format supported by Red Hat OpenStack
Platform for use with an integrated Ceph cluster?
a. QCOW2
b. VMDK
c. RAW
d. VHDX
2. Which four of the following are the default Ceph pool names used by OpenStack
services? (Choose four.)
a. vms
b. volumes
c. backups
d. glance
e. shares
f. images
g. compute
3. Which three of the following are the OpenStack services backed by Ceph RADOS
Block Devices? (Choose three.)
a. Deployment
b. Images
c. Shared File Systems
d. Block Storage
e. Object Storage
f. Compute
4. Which three of the following are required parameters to integrate an external Ceph
Storage cluster with OpenStack? (Choose three.)
a. FSID
b. Manager node list
c. Monitor node list
d. client.openstack key-ring
e. admin.openstack key-ring
f. Monitor map
Solution
1. Which of the following is the only image format supported by Red Hat OpenStack
Platform for use with an integrated Ceph cluster?
a. QCOW2
b. VMDK
c. RAW
d. VHDX
2. Which four of the following are the default Ceph pool names used by OpenStack
services? (Choose four.)
a. vms
b. volumes
c. backups
d. glance
e. shares
f. images
g. compute
3. Which three of the following are the OpenStack services backed by Ceph RADOS
Block Devices? (Choose three.)
a. Deployment
b. Images
c. Shared File Systems
d. Block Storage
e. Object Storage
f. Compute
4. Which three of the following are required parameters to integrate an external Ceph
Storage cluster with OpenStack? (Choose three.)
a. FSID
b. Manager node list
c. Monitor node list
d. client.openstack key-ring
e. admin.openstack key-ring
f. Monitor map
Objectives
After completing this section, you should be able to describe Red Hat OpenShift Container
Platform storage requirements, and compare the architecture choices for using Red Hat Ceph
Storage as an RHOCP storage back end.
Red Hat OpenShift Container Platform (RHOCP) is a collection of modular components and
services that are built on top of a Kubernetes container infrastructure. OpenShift Container
Platform provides remote management, multitenancy, monitoring, auditing, and application
lifecycle management. It features enhanced security capabilities and self-service interfaces. It also
integrates with major Red Hat products that extend the capabilities of the platform.
OpenShift Container Platform is available in most clouds, whether as a managed cloud service
in public clouds or as self-managed software in your data center. These implementations offer
different levels of platform automation, update strategies, and operation customization. This
training material references RHOCP 4.8.
OpenShift Container Platform assigns the responsibilities of each node within the cluster by using
different roles. The machine config pools (MCP) are sets of hosts where a role is assigned. Each
MCP manages the hosts and their configuration. The control plane and the compute MCPs are
created by default.
Compute nodes are responsible for running the workloads that the control plane schedules.
Compute nodes contain services such as CRI-O (Container Runtime Interface with Open
Container Initiative compatibility), which runs, stops, or restarts containers, and kubelet, which acts
as an agent that accepts requests to operate the containers.
Control plane nodes are responsible for running the main OpenShift services, such as the following
ones:
• OpenShift API server. It validates and configures the data for OpenShift resources, such as
projects, routes, and templates.
• OpenShift controller manager. It watches the etcd service for changes to resources and uses
the API to enforce the specified state.
• OpenShift OAuth API server. It validates and configures the data to authenticate to the
OpenShift Container Platform, such as users, groups, and OAuth tokens.
The operator container image defines the requirements for deployment, such as dependent
services and hardware resources. Because operators require resource access, they typically
use custom security settings. Operators provide an API for resource management and service
configuration, and deliver levels of automated management and upgrade strategies.
OpenShift Container Platform uses the Operator Lifecycle Manager (OLM) to manage operators.
OLM orchestrates the deployment, update, resource utilization, and deletion of other operators
from the operator catalog. Every operator has a Cluster Service Version (CSV) that describes the
required technical information to run the operator, such as the RBAC rules that it requires and the
resources that it manages or depends on. OLM is itself an operator.
Custom Resource Definition (CRD) objects define unique object types in the cluster. Custom
Resource (CR) objects are created from CRDs. Only cluster administrators can create CRDs.
Developers with CRD read permission can add defined CR object types into their project.
Operators use CRDs by packaging them with any required RBAC policy and other software-
specific logic. Cluster administrators can add CRDs manually to a cluster independently from an
Operator lifecycle, to be available to all users.
When the operator bundle is installed, the ocs-operator starts and creates an
OCSInitialization resource if it does not already exist. The OCSInitialization resource
performs basic setup and initializes services. It creates the openshift-storage namespace
in which other bundle operators will create resources. You can edit this resource to adjust the
tools that are included in the OpenShift Data Foundation operator. If the OCSInitialization
resource is in a failed state, further start requests are ignored until the resource is deleted.
The StorageCluster resource manages the creation and reconciliation of CRDs for the Rook-
Ceph and NooBaa operators. These CRDs are defined by known best practices and policies that
Red Hat supports. You can create a StorageCluster resource with an installation wizard in
OpenShift Container Platform.
Rook-Ceph is responsible for the initial storage cluster bootstrap, administrative tasks, and the
creation of the pods and other dependent resources in the openshift-storage namespace.
Many advanced Ceph features, such as Placement Groups and CRUSH maps, are reserved for
Rook management. Rook-Ceph facilitates a seamless storage consumption experience and
minimizes the required cluster administration.
Monitoring is an important Rook-Ceph duty. Rook-Ceph watches the storage cluster state
to ensure that it is available and healthy. Rook-Ceph monitors Ceph Placement Groups and
automatically adjusts their configuration based on pool sizing, and monitors Ceph daemons. Rook-
Ceph communicates with OpenShift APIs to request the necessary resources when the cluster
scales.
Rook-Ceph provides two Container Storage Interface (CSI) drivers to create volumes, the RBD
driver and the CephFS driver. These drivers provide the channel for OpenShift Container Platform
to consume storage.
Note
The OpenShift Container Storage operator does not create Persistent Volume
resources, but tracks resources that Ceph-CSI drivers created.
Figure 13.3 visualizes the Rook-Ceph operator interaction with OpenShift Container Platform.
You can update the configuration of Rook-Ceph components via CRD updates. Rook-Ceph looks
for configuration changes that the service API requested and applies them to the cluster. The
cluster state reflects whether the cluster is in the desired state or is approaching it. Important
CRDs for the cluster configuration are CephCluster, CephObjectStore, CephFilesystem, and
CephBlockPool.
The NooBaa operator creates and reconciles changes for the NooBaa service and creates the
following resources:
• Backing store
• Namespace store
• Bucket class
• Object bucket claims (OBCs)
• Prometheus rules and service monitoring
• Horizontal pod autoscaler (HPA)
NooBaa requires a backing store resource to save objects. A default backing store is created
in an OpenShift Data Foundation deployment, but depends on the platform that OpenShift
Container Platform is running on. For example, when OpenShift Container Platform or OpenShift
Data Foundation is deployed on Amazon Web Services (AWS), it creates the backing store as an
AWS::S3 bucket. For Microsoft Azure, the default backing store is a blob container. NooBaa can
define multiple, concurrent backing stores.
OpenShift Data Foundation pods can be scheduled on the same nodes as application pods or
on separate nodes. When using the same nodes, compute and storage resources must be scaled
together. When using separate nodes, compute and storage resources scale independently.
One Internal installation mode benefit is that the OpenShift Container Platform dashboard
integrates cluster lifecycle management and monitoring.
The Internal installation mode uses the back-end infrastructure to provide storage resources by
default. An alternative configuration is to choose the Internal - Attached Devices option,
which uses the available local storage.
Use of the Internal - Attached Devices option has the following requirements:
• The target nodes are scanned for disks to match specific criteria.
• The default namespace for the Local Storage operator is openshift-local-storage.
Important
In an Internal storage installation, the integrated Ceph Storage cluster can be
used only by the OpenShift Container Platform cluster where it is installed.
The OpenShift Container Platform and the Ceph Storage cluster must meet certain conditions to
correctly integrate the storage cluster.
References
For more information, refer to the Red Hat OpenShift Container Storage 4.8 documentation.
Red Hat OpenShift Container Storage
https://access.redhat.com/documentation/en-us/
red_hat_openshift_container_storage/4.8
For more information, refer to the Red Hat OpenShift Container Platform 4.8 documentation.
Red Hat OpenShift Container Platform
https://access.redhat.com/documentation/en-us/
openshift_container_platform/4.8
Quiz
1. Which component in OpenShift Data Foundation provides the interface for OpenShift
Container Platform to consume storage from Ceph? (Choose one.)
a. CSI drivers
b. NooBaa
c. OCSInitialization
d. CustomResourceDefinitions
2. What are the advantages of installing OpenShift Data Foundation in internal mode?
(Choose three.)
a. Support for several OpenShift Container Platform clusters.
b. Storage back end can use the same infrastructure as the OpenShift Container Platform
cluster.
c. Advanced feature and configuration customization.
d. Automated Ceph storage installation and configuration.
e. Seamless auto-update and lifecycle management.
3. What are the main capabilities of the Rook-Ceph operator? (Choose three.)
a. Provides a bundle with operators and resources to deploy the storage cluster.
b. Monitors Ceph daemons and ensures that the cluster is in a healthy state.
c. Deploys the storage cluster according to best practices and recommendations.
d. Interacts with multiple clouds to provide object services.
e. Looks for and applies configuration changes to the storage cluster.
Solution
1. Which component in OpenShift Data Foundation provides the interface for OpenShift
Container Platform to consume storage from Ceph? (Choose one.)
a. CSI drivers (correct)
b. NooBaa
c. OCSInitialization
d. CustomResourceDefinitions
2. What are the advantages of installing OpenShift Data Foundation in internal mode?
(Choose three.)
a. Support for several OpenShift Container Platform clusters.
b. Storage back end can use the same infrastructure as the OpenShift Container Platform
cluster. (correct)
c. Advanced feature and configuration customization.
d. Automated Ceph storage installation and configuration. (correct)
e. Seamless auto-update and lifecycle management. (correct)
3. What are the main capabilities of the Rook-Ceph operator? (Choose three.)
a. Provides a bundle with operators and resources to deploy the storage cluster.
b. Monitors Ceph daemons and ensures that the cluster is in a healthy state. (correct)
c. Deploys the storage cluster according to best practices and recommendations. (correct)
d. Interacts with multiple clouds to provide object services.
e. Looks for and applies configuration changes to the storage cluster. (correct)
Objectives
After completing this section, you should be able to describe how OpenShift implements Ceph
storage for each storage-related OpenShift feature.
Administrators can use a StorageClass resource to describe the storage types and characteristics
available in the cluster. Administrators can use classes to define storage needs such as QoS levels
or provisioner types.
A PersistentVolumeClaim (PVC), or claim, is a storage request that a cluster user makes from inside
a project. PersistentVolumeClaim resources specify the requested storage capacity and the
required access mode.
Note
The StorageClass and PersistentVolume resources are cluster resources that
are independent of any projects.
The following operations are the most common interactions between PersistentVolume and
PersistentVolumeClaim resources.
When installing OpenShift Data Foundation, the following storage classes are created:
– ocs-storagecluster-ceph-rbd
– ocs-storagecluster-cephfs
– ocs-storagecluster-ceph-rgw
– openshift-storage.noobaa.io
Note
Red Hat recommends changing the default StorageClass to
ocs-storagecluster-ceph-rbd, which is backed by OpenShift Data Foundation.
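One possible way to change the default class is to update the is-default-class annotation on the storage classes; the gp2 class shown here as the previous default is only an example of an existing default class:

[cloud-user@ocp ~]$ oc patch storageclass gp2 -p \
'{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
[cloud-user@ocp ~]$ oc patch storageclass ocs-storagecluster-ceph-rbd -p \
'{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'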
Important
Access modes are a description of the volume's access capabilities. The cluster
does not enforce the claim's requested access, but permits access according to the
volume's capabilities.
A ceph status report, for example from the Rook-Ceph Toolbox, shows the services that the
operator deployed:
services:
    mon: 3 daemons, quorum a,b,c (age 23m)
    mgr: a(active, since 23m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 22m), 3 in (since 22m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
You can list the pools that Rook-Ceph created during the cluster creation:
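For example, from the Rook-Ceph Toolbox pod; the oc rsh invocation assumes the default app=rook-ceph-tools label and the output is abbreviated:

[cloud-user@ocp ~]$ oc rsh -n openshift-storage \
$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph osd lspools
...output omitted...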
The following PersistentVolumeClaim definition, saved as cl260-pvc-01.yml, requests a 5 GiB
volume from the RBD storage class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cl260-pvc-01
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
[cloud-user@ocp ~]$ oc create -f cl260-pvc-01.yml
persistentvolumeclaim/cl260-pvc-01 created
View the PersistentVolumeClaim resource details to check whether it is bound. When the
status is Bound, then view the volume resource.
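A quick check might look like the following; the claim name comes from the earlier example:

[cloud-user@ocp ~]$ oc get pvc cl260-pvc-01
[cloud-user@ocp ~]$ oc describe pv \
$(oc get pvc cl260-pvc-01 -o jsonpath='{.spec.volumeName}')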
To match the volume resource with the rbd device, inspect the VolumeHandle attribute in the
PersistentVolume description.
To find the device in the Ceph Storage cluster, log in to the Rook-Ceph Toolbox shell and list the
ocs-storagecluster-cephblockpool pool. Observe that the device name matches the
second part of the VolumeHandle property of the volume resource.
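A sketch of that check from the Rook-Ceph Toolbox; the pool name matches the default block pool mentioned above:

sh-4.4$ rbd ls -p ocs-storagecluster-cephblockpool
...output omitted...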
To resize the volume, edit the storage field in the YAML file to the new capacity that you want.
Then, apply the changes.
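For example, after increasing the storage request in cl260-pvc-01.yml from 5Gi to a larger value:

[cloud-user@ocp ~]$ oc apply -f cl260-pvc-01.yml
[cloud-user@ocp ~]$ oc get pvc cl260-pvc-01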
The PersistentVolumeClaim description for a claim that uses the CephFS storage class reports the
RWX access mode:
StorageClass: ocs-storagecluster-cephfs
Status: Bound
Volume: pvc-793c06bc-4514-4c11-9272-cf6ce51996e8
...output omitted...
Capacity: 10Gi
Access Modes: RWX
VolumeMode: Filesystem
...output omitted...
To test the volume, create a demo application and scale the deployment to three replicas.
Then, mount the volume in the application pods and verify that all of them can access the volume.
You can view status, access, and secret keys with the noobaa backingstore status
command:
# BackingStore spec:
s3Compatible:
  endpoint: http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
  secret:
    name: rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user
    namespace: openshift-storage
  ...output omitted...
  type: s3-compatible
# Secret data:
AccessKey: D7VHJ1I32B0LVJ0EEL9W
Endpoint: http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
SecretKey: wY8ww5DbdOqwre8gj1HTiA0fADY61zNcX1w8z
When NooBaa finishes the creation, it communicates with the OpenShift Container Platform
cluster and delegates the characteristics of the ObjectBucketClaim that is created.
You can review the attributes of an ObjectBucketClaim resource with the -o yaml option to
query the resource definition. You can use this option to view the S3 access credentials of the
resource.
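For example, assuming a claim named example-obc in the current project:

[cloud-user@ocp ~]$ oc get obc example-obc -o yaml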
References
For more information, refer to the Red Hat OpenShift Container Storage 4.8 Documentation.
Red Hat OpenShift Container Storage
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8
For more information, refer to the Red Hat OpenShift Container Platform 4.8 Documentation.
Red Hat OpenShift Container Platform
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8
Quiz
3. In what scenario, where an application mounts a volume, is a volume with RWX access
mode required? (Choose one.)
a. Mounted to many pods that all have read and write permissions.
b. Mounted to many pods that have read permission only.
c. Mounted to one pod that has read and write permissions.
d. Mounted to one pod that has read permission only.
Solution
3. In what scenario, where an application mounts a volume, is a volume with RWX access
mode required? (Choose one.)
a. Mounted to many pods that all have read and write permissions. (correct)
b. Mounted to many pods that have read permission only.
c. Mounted to one pod that has read and write permissions.
d. Mounted to one pod that has read permission only.
Summary
In this chapter, you learned:
• Red Hat Ceph Storage can provide a unified storage back end for OpenStack services that
consume block, image, object, and file-based storage.
• OpenStack Glance can use Ceph RBD images to store the operating system images that it
manages.
• OpenStack Cinder can also use RADOS block devices to provide block-based storage for virtual
machines that run as cloud instances.
• The RADOS Gateway can replace the native OpenStack Swift storage by providing
object storage for applications that use the OpenStack Swift API, and integrates its user
authentication with OpenStack Keystone.
• Red Hat OpenShift Data Foundation is an operator bundle that provides cloud storage and data
services to Red Hat OpenShift Container Platform; it is composed of the ocs-storage, NooBaa,
and Rook-Ceph operators.
• Rook-Ceph is a cloud storage orchestrator that installs, monitors, and manages the underlying
Ceph cluster in the OpenShift Data Foundation operator bundle. Rook-Ceph provides the
required drivers to request storage from the cluster.
• PersistentVolumeClaims are an OpenShift resource type that represents a request for a storage
object. They reference the StorageClass that describes the PersistentVolume that should bind to
them.
Chapter 14
Comprehensive Review
Goal: Review tasks from Cloud Storage with Red Hat Ceph Storage
Comprehensive Review
Objectives
After completing this section, you should be able to demonstrate knowledge and skills learned in
Cloud Storage with Red Hat Ceph Storage.
You can refer to earlier sections in the textbook for extra study.
• Describe the personas in the cloud storage ecosystem that characterize the use cases and tasks
taught in this course.
• Describe the Red Hat Ceph Storage architecture, introduce the Object Storage Cluster, and
describe the choices in data access methods.
• Describe and compare the use cases for the various management interfaces provided for
Red Hat Ceph Storage.
• Prepare for and perform a Red Hat Ceph Storage cluster deployment using cephadm
command-line tools.
• Identify and configure the primary settings for the overall Red Hat Ceph Storage cluster.
• Describe the purpose of cluster monitors and the quorum procedures, query the monitor map,
manage the configuration database, and describe Cephx.
• Describe the purpose for each of the cluster networks, and view and modify the network
configuration.
• Describe OSD configuration scenarios and create BlueStore OSDs using ceph-volume.
• Describe and compare replicated and erasure coded pools, and create and configure each pool
type.
• Describe Cephx and configure user authentication and authorization for Ceph clients.
• Administer and update the cluster CRUSH map used by the Ceph cluster.
• Provide block storage to Ceph clients using RADOS block devices (RBDs), and manage RBDs
from the command line.
• Export an RBD image from the cluster to an external file and import it into another cluster.
• Configure an RBD mirror to replicate an RBD block device between two Ceph clusters for
disaster recovery purposes.
• Configure the Ceph iSCSI Gateway to export RADOS Block Devices using the iSCSI protocol,
and configure clients to use the iSCSI Gateway.
• Deploy a RADOS Gateway to provide clients with access to Ceph object storage.
• Configure the RADOS Gateway with multisite support to allow objects to be stored in two or
more geographically diverse Ceph storage clusters.
• Configure the RADOS Gateway to provide access to object storage compatible with the
Amazon S3 API, and manage objects stored using that API.
• Configure the RADOS Gateway to provide access to object storage compatible with the Swift
API, and manage objects stored using that API.
• Provide file storage on the Ceph cluster by deploying the Ceph File System (CephFS).
• Configure CephFS, including snapshots, replication, memory management, and client access.
• Administer and monitor a Red Hat Ceph Storage cluster, including starting and stopping specific
services or the full cluster, and querying cluster health and utilization.
• Perform common cluster maintenance tasks, such as adding or removing MONs and OSDs, and
recovering from various component failures.
• Choose Red Hat Ceph Storage architecture scenarios and operate Red Hat Ceph Storage-
specific performance analysis tools to optimize cluster deployments.
• Protect OSD and cluster hardware resources from over-utilization by controlling scrubbing,
deep scrubbing, backfill, and recovery processes to balance CPU, RAM, and I/O requirements.
• Identify key tuning parameters and troubleshoot performance for Ceph clients, including
RADOS Gateway, RADOS Block Devices, and CephFS.
• Describe Red Hat OpenStack Platform storage requirements, and compare the architecture
choices for using Red Hat Ceph Storage as an RHOSP storage back end.
• Describe how OpenStack implements Ceph storage for each storage-related OpenStack
component.
• Describe Red Hat OpenShift Container Platform storage requirements, and compare the
architecture choices for using Red Hat Ceph Storage as an RHOCP storage back end.
• Describe how OpenShift implements Ceph storage for each storage-related OpenShift feature.
Lab
Outcomes
You should be able to deploy a Red Hat Ceph Storage cluster using a service specification
file.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes
a pre-built, fully operational Ceph cluster. This first comprehensive review
will remove that cluster, but still requires the rest of the clean classroom
environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command confirms that the local container registry for the classroom is running and
deletes the prebuilt Ceph cluster so it can be redeployed with the steps in this exercise.
Important
This lab start script immediately deletes the prebuilt Ceph cluster and takes a
few minutes to complete. Wait for the command to finish before continuing.
Specifications
• Deploy a four node Red Hat Ceph Storage cluster using a service specification file with these
parameters:
– Use the registry at registry.lab.example.com with the username registry and the
password redhat.
– Deploy RGWs on the serverc and serverd nodes, with the service_id set to
realm.zone.
– Deploy OSDs on the serverc, serverd, and servere nodes, with the service_id set to
default_drive_group. On all OSD nodes, use the /dev/vdb, /dev/vdc, and /dev/vdd
drives as data devices.
Hostname IP Address
clienta.lab.example.com 172.25.250.10
serverc.lab.example.com 172.25.250.12
serverd.lab.example.com 172.25.250.13
servere.lab.example.com 172.25.250.14
• After the cluster is installed, manually add the /dev/vde and /dev/vdf drives as data devices
on the servere node.
– Use 172.25.250.0/24 for the OSD public network, and 172.25.249.0/24 for the OSD
cluster network.
Evaluation
As the student user on the workstation machine, use the lab command to grade your work.
Correct any reported failures and rerun the command until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Solution
Outcomes
You should be able to deploy a Red Hat Ceph Storage cluster using a service specification
file.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes
a pre-built, fully operational Ceph cluster. This first comprehensive review
will remove that cluster, but still requires the rest of the clean classroom
environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command confirms that the local container registry for the classroom is running and
deletes the prebuilt Ceph cluster so it can be redeployed with the steps in this exercise.
Important
This lab start script immediately deletes the prebuilt Ceph cluster and takes a
few minutes to complete. Wait for the command to finish before continuing.
1. Using the serverc host as the bootstrap host, install the cephadm-ansible package,
create the inventory file, and run the pre-flight playbook to prepare the cluster hosts.
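A possible command sequence for this step; the inventory file name and the playbook location assume the default cephadm-ansible layout in /usr/share/cephadm-ansible:

[root@serverc ~]# dnf install cephadm-ansible
[root@serverc ~]# cd /usr/share/cephadm-ansible
[root@serverc cephadm-ansible]# cat hosts
clienta.lab.example.com
serverc.lab.example.com
serverd.lab.example.com
servere.lab.example.com
[root@serverc cephadm-ansible]# ansible-playbook -i hosts cephadm-preflight.yml \
--extra-vars "ceph_origin="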
Note
The ceph_origin variable is set to empty, which causes some playbooks tasks to
be skipped because, in this classroom, the Ceph packages are installed from a local
classroom repository. In a production environment, set ceph_origin to rhcs to
enable the Red Hat Storage Tools repository for your supported deployment.
Hostname IP Address
clienta.lab.example.com 172.25.250.10
serverc.lab.example.com 172.25.250.12
serverd.lab.example.com 172.25.250.13
servere.lab.example.com 172.25.250.14
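The service specification file that the next step applies is not reproduced here; a sketch that matches the lab parameters might look like the following, where the file name initial-cluster.yaml and the omitted default sections are illustrative:

service_type: host
addr: 172.25.250.10
hostname: clienta.lab.example.com
---
service_type: host
addr: 172.25.250.12
hostname: serverc.lab.example.com
---
service_type: host
addr: 172.25.250.13
hostname: serverd.lab.example.com
---
service_type: host
addr: 172.25.250.14
hostname: servere.lab.example.com
---
service_type: rgw
service_id: realm.zone
placement:
  hosts:
    - serverc.lab.example.com
    - serverd.lab.example.com
---
service_type: osd
service_id: default_drive_group
placement:
  hosts:
    - serverc.lab.example.com
    - serverd.lab.example.com
    - servere.lab.example.com
spec:
  data_devices:
    paths:
      - /dev/vdb
      - /dev/vdc
      - /dev/vdd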
3. As the root user on the serverc host, bootstrap the Ceph cluster using the created service
specification file.
3.1. Set the Ceph dashboard password to redhat and use the --dashboard-password-
noupdate option. Use the --allow-fqdn-hostname option to use fully qualified domain
names for the hosts. The registry URL is registry.lab.example.com, the
username is registry, and the password is redhat.
3.2. As the root user on the serverc host, run the cephadm bootstrap command
with the provided parameters to bootstrap the Ceph cluster. Use the created service
specification file.
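A bootstrap command that matches these parameters might look like the following; the specification file name is illustrative and the monitor IP is the serverc address from the host table:

[root@serverc ~]# cephadm bootstrap --mon-ip=172.25.250.12 \
--apply-spec=initial-cluster.yaml \
--initial-dashboard-password=redhat \
--dashboard-password-noupdate \
--allow-fqdn-hostname \
--registry-url=registry.lab.example.com \
--registry-username=registry \
--registry-password=redhat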
URL: https://serverc.lab.example.com:8443/
User: admin
Password: redhat
ceph telemetry on
https://docs.ceph.com/docs/pacific/mgr/telemetry/
Bootstrap complete.
3.3. As the root user on the serverc host, run the cephadm shell.
3.4. Verify that the cluster status is HEALTH_OK. Wait until the cluster reaches the
HEALTH_OK status.
services:
mon: 1 daemons, quorum serverc.lab.example.com (age 2m)
mgr: serverc.lab.example.com.anabtp(active, since 91s), standbys:
clienta.trffqp
osd: 9 osds: 9 up (since 21s), 9 in (since 46s)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 47 MiB used, 90 GiB / 90 GiB avail
pgs: 1 active+clean
4. Label the clienta host as the admin node. Manually copy the ceph.conf and
ceph.client.admin.keyring files to the admin node. On the admin node, test the
cephadm shell.
[ceph: root@serverc /]# ceph orch host label add clienta.lab.example.com _admin
Added label _admin to host clienta.lab.example.com
4.2. Copy the ceph.conf and ceph.client.admin.keyring files from the serverc
host to the clienta host. Locate these files in /etc/ceph on both hosts.
5. Manually add OSDs to the servere node using devices /dev/vde and /dev/vdf. Set
172.25.250.0/24 for the OSD public network and 172.25.249.0/24 for the OSD
cluster network.
5.1. Display the servere node storage device inventory on the Ceph cluster. Verify that
the /dev/vde and /dev/vdf devices are available.
5.2. Create the OSDs using the /dev/vde and /dev/vdf devices on the servere node.
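The commands for steps 5.1 and 5.2, plus the network settings, might look like the following; setting the networks on the osd configuration section is one possible approach:

[ceph: root@serverc /]# ceph config set osd public_network 172.25.250.0/24
[ceph: root@serverc /]# ceph config set osd cluster_network 172.25.249.0/24
[ceph: root@serverc /]# ceph orch device ls servere.lab.example.com
[ceph: root@serverc /]# ceph orch daemon add osd servere.lab.example.com:/dev/vde
[ceph: root@serverc /]# ceph orch daemon add osd servere.lab.example.com:/dev/vdf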
Evaluation
As the student user on the workstation machine, use the lab command to grade your work.
Correct any reported failures and rerun the command until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Lab
Outcomes
You should be able to configure cluster settings and components, such as pools, users,
OSDs, and the CRUSH map.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
Specifications
• Set the value of osd_pool_default_pg_num to 250 in the configuration database.
• Create a CRUSH rule called onhdd to target HDD-based OSDs for replicated pools.
• Create a replicated pool called rbd1 that uses the onhdd CRUSH map rule. Set the application
type to rbd and the number of replicas for the objects in this pool to five.
• Create the following CRUSH hierarchy. Do not associate any OSD with this new tree.
• Create a new erasure code profile called cl260. Pools using this profile must set two data
chunks and one coding chunk per object.
• Create an erasure coded pool called testec that uses your new cl260 profile. Set its
application type to rgw.
• Create a user called client.fortestec that can store and retrieve objects under
the docs namespace in the pool called testec. This user must not have access
to any other pool or namespace. Save the associated key-ring file as /etc/ceph/
ceph.client.fortestec.keyring on clienta.
• Update the OSD near-capacity limit information for the cluster formed by the serverc,
serverd, and servere nodes. Set the full ratio to 90% and the near-full ratio to 86%.
• Locate the host on which the ceph-osd-7 service is running. List the available storage devices
on that host.
Evaluation
Grade your work by running the lab grade comprehensive-review2 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Solution
Outcomes
You should be able to configure cluster settings and components, such as pools, users,
OSDs, and the CRUSH map.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
1.1. Log in to clienta as the admin user and use sudo to run the cephadm shell.
2. Create a CRUSH rule called onhdd to target HDD-based OSDs for replicated pools.
2.1. Create a new rule called onhdd to target HDD-based OSDs for replicated pools.
[ceph: root@clienta /]# ceph osd crush rule create-replicated onhdd default \
host hdd
3. Create a replicated pool called rbd1 that uses the onhdd CRUSH map rule. Set the
application type to rbd and the number of replicas for the objects in this pool to five.
3.1. Create a new replicated pool called rbd1 that uses the onhdd CRUSH map rule.
[ceph: root@clienta /]# ceph osd pool application enable rbd1 rbd
enabled application 'rbd' on pool 'rbd1'
3.3. Increase the number of replicas for the pool to five and verify the new value.
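The commands behind steps 3.1 and 3.3 might look like the following; creating the pool first and then assigning the CRUSH rule is one possible approach:

[ceph: root@clienta /]# ceph osd pool create rbd1
pool 'rbd1' created
[ceph: root@clienta /]# ceph osd pool set rbd1 crush_rule onhdd
[ceph: root@clienta /]# ceph osd pool set rbd1 size 5
[ceph: root@clienta /]# ceph osd pool get rbd1 size
size: 5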
4. Create the following CRUSH hierarchy. Do not associate any OSD with this new tree.
4.1.
4.4. Display the CRUSH map tree to verify the new hierarchy.
5. Create a new erasure code profile called cl260. Pools that use this profile must set two data
chunks and one coding chunk per object.
[ceph: root@clienta /]# ceph osd erasure-code-profile set cl260 k=2 m=1
6. Create an erasure coded pool called testec that uses your new cl260 profile. Set its
application type to rgw.
6.1. Create an erasure coded pool called testec that uses the cl260 profile.
[ceph: root@clienta /]# ceph osd pool create testec erasure cl260
pool 'testec' created
[ceph: root@clienta /]# ceph osd pool application enable testec rgw
enabled application 'rgw' on pool 'testec'
7. Create a user called client.fortestec that can store and retrieve objects under
the docs namespace in the pool called testec. This user must not have access
to any other pool or namespace. Save the associated key-ring file as /etc/ceph/
ceph.client.fortestec.keyring on clienta.
7.1. Exit from the current cephadm shell. Start a new cephadm shell with the /etc/ceph
directory as a bind mount.
7.2. Create a user called client.fortestec, with read and write capabilities in the
namespace docs within the pool testec. Save the associated key-ring file as /etc/
ceph/ceph.client.fortestec.keyring in the mounted directory.
[ceph: root@clienta /]# ceph auth get-or-create client.fortestec mon 'allow r' \
osd 'allow rw pool=testec namespace=docs' \
-o /etc/ceph/ceph.client.fortestec.keyring
7.3. To verify your work, attempt to store and retrieve an object. The diff command
returns no output when the file contents are the same. When finished, remove the
object.
8.2. Obtain report object details to confirm that the upload was successful.
[ceph: root@clienta ~]# rados --id fortestec -p testec -N docs stat report
testec/report mtime 2021-10-29T11:44:21.000000+0000, size 19216
9. Update the OSD near-capacity limit information for the cluster. Set the full ratio to
90% and the near-full ratio to 86%.
9.1. Set the full_ratio parameter to 0.9 (90%) and the nearfull_ratio to 0.86
(86%) in the OSD map.
9.2. Dump the OSD map and verify the new value of the two parameters.
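The commands for steps 9.1 and 9.2 might look like:

[ceph: root@clienta /]# ceph osd set-full-ratio 0.9
[ceph: root@clienta /]# ceph osd set-nearfull-ratio 0.86
[ceph: root@clienta /]# ceph osd dump | grep ratio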
10. Locate the host with the OSD 7 service. List that host's available storage devices.
10.1. Locate the OSD 7 service. The location of the OSD 7 service might be
different in your lab environment.
10.2. Use the ceph orch device ls command to list the available storage devices on the
located host. Use the host you located in your environment.
Evaluation
Grade your work by running the lab grade comprehensive-review2 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Lab
Deploying CephFS
In this review, you will deploy CephFS on an existing Red Hat Ceph Storage cluster using
specified requirements.
Outcomes
You should be able to deploy a Metadata Server, provide storage with CephFS, and
configure clients for its use.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
Specifications
• Create a CephFS file system cl260-fs. Create an MDS service called cl260-fs with two
MDS instances, one on the serverc node and another on the serverd node. Create a data
pool called cephfs.cl260-fs.data and a metadata pool called cephfs.cl260-fs.meta.
Use replicated as the type for both pools.
• Mount the CephFS file system to the /mnt/cephfs directory on the clienta host and owned
by the admin user. Save the client.admin key-ring to the /root/secretfile and use the
file to authenticate the mount operation.
• Create the ceph01 and ceph02 directories. Create an empty file called firstfile in the
ceph01 directory. Verify that the directories and their contents are owned by the admin user.
• Configure the CephFS file system to be mounted on each system startup. Verify that the /etc/
fstab file is updated accordingly.
Evaluation
Grade your work by running the lab grade comprehensive-review3 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Solution
Deploying CephFS
In this review, you will deploy CephFS on an existing Red Hat Ceph Storage cluster using
specified requirements.
Outcomes
You should be able to deploy a Metadata Server, provide storage with CephFS, and
configure clients for its use.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
1. Create a CephFS file system cl260-fs. Create an MDS service called cl260-fs
with an MDS instance on serverc and another on serverd. Create a data pool called
cephfs.cl260-fs.data and a metadata pool called cephfs.cl260-fs.meta. Use
replicated as the type for both pools. Verify that the MDS service is up and running.
1.1. Log in to clienta and use sudo to run the cephadm shell.
1.2. Create a data pool called cephfs.cl260-fs.data and a metadata pool called
cephfs.cl260-fs.meta for the CephFS service.
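The pool creation commands might look like the following; the placement group counts are left to the defaults:

[ceph: root@clienta /]# ceph osd pool create cephfs.cl260-fs.data
pool 'cephfs.cl260-fs.data' created
[ceph: root@clienta /]# ceph osd pool create cephfs.cl260-fs.meta
pool 'cephfs.cl260-fs.meta' created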
1.4. Create an MDS service called cl260-fs with an MDS instance on serverc.
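A sketch of the file system and MDS service creation; the placement count and host names follow the lab specifications:

[ceph: root@clienta /]# ceph fs new cl260-fs cephfs.cl260-fs.meta cephfs.cl260-fs.data
[ceph: root@clienta /]# ceph orch apply mds cl260-fs \
--placement="2 serverc.lab.example.com serverd.lab.example.com"
Scheduled mds.cl260-fs update...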
services:
...output omitted...
mds: 1/1 daemons up, 1 standby
...output omitted...
2. Install the ceph-common package. Mount the CephFS file system to the /mnt/cephfs
directory on the clienta host. Save the key-ring associated with the client.admin user
to the /root/secretfile file. Use this file to authenticate the mount operation. Verify
that the /mnt/cephfs directory is owned by the admin user.
2.1. Exit the cephadm shell and switch to the root user. Install the ceph-common package.
2.2. Extract the key-ring associated with the client.admin user, and save it in the /
root/secretfile file.
2.3. Create a new directory called /mnt/cephfs to use as a mount point for the CephFS
file system. Mount your new CephFS file system on that directory.
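One way to perform steps 2.2 and 2.3, assuming serverc is used as the monitor host for the mount:

[root@clienta ~]# ceph auth get-key client.admin > /root/secretfile
[root@clienta ~]# mkdir /mnt/cephfs
[root@clienta ~]# mount -t ceph serverc.lab.example.com:/ /mnt/cephfs \
-o name=admin,secretfile=/root/secretfile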
2.5. Change the ownership of the top-level directory of the mounted file system to user
and group admin.
3. As the admin user, create the ceph01 and ceph02 directories. Create an empty file called
firstfile on the ceph01 directory. Ensure the directories and its contents are owned by
the admin user.
3.1. Exit the root user session. Create two directories directly underneath the mount point,
and name them ceph01 and ceph02.
4.4. Verify that the secondfile file has the correct layout attribute.
5. Switch to the root user. Install and use the ceph-fuse client to mount a new directory
called cephfuse.
5.2. Create a directory called /mnt/cephfuse to use as a mount point for the Fuse client.
Mount your new Ceph-Fuse file system on that directory.
/mnt
|-- cephfs
|   |-- ceph01
|   |   |-- firstfile
|   |   `-- secondfile
|   `-- ceph02
`-- cephfuse
    |-- ceph01
    |   |-- firstfile
    |   `-- secondfile
    `-- ceph02
6 directories, 4 files
6. Configure the CephFS file system to be persistently mounted at startup. Use the contents
of the /root/secretfile file to configure the mount operation in the /etc/fstab file.
Verify that the configuration works as expected by using the mount -a command.
6.1. View the contents of the admin key-ring in the /root/secretfile file.
6.2. Configure the /etc/fstab file to mount the file system at startup. The /etc/fstab
file should look like the following output.
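One possible /etc/fstab entry, assuming serverc is used as the monitor host:

serverc.lab.example.com:/ /mnt/cephfs ceph name=admin,secretfile=/root/secretfile,_netdev,noatime 0 0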
6.3. Unmount the CephFS file system, then test mount using the mount -a command.
Verify the mount.
Evaluation
Grade your work by running the lab grade comprehensive-review3 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Lab
Outcomes
You should be able to:
• Deploy and configure Red Hat Ceph Storage for RBD mirroring.
• Configure a client to access RBD images.
• Manage RBD images, RBD mirroring, and RBD snapshots and clones.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command ensures that the production and backup clusters are running and have the RBD
storage pools called rbd, rbdpoolmode, and rbdimagemode in both clusters. It also creates
the data image in the rbd pool in the production cluster.
Specifications
• Deploy and configure a Red Hat Ceph Storage cluster for RBD mirroring between two clusters:
– In the production cluster, create an RBD image called vm1 in the rbdpoolmode pool
configured as one-way pool-mode and with a size of 128 MiB. Create an RBD image called
vm2 in the rbdimagemode pool configured as one-way image-mode and with a size of
128 MiB. Both images should be enabled for mirroring.
– Production and backup clusters should be called production and bck, respectively.
– Map the image called rbd/data using the kernel RBD client on clienta and format the
device with an XFS file system. Store a copy of the /usr/share/dict/words file at the root of
the file system. Create a snapshot called beforeprod of the RBD image data, and create a
clone called prod1 from the snapshot called beforeprod.
– Map the image called rbd/data again using the kernel RBD client on clienta. Copy the
/etc/services file to the root of the file system. Export changes to the rbd/data image to
the /home/admin/cr4/data-diff.img file.
– Configure the clienta node so that it will persistently mount the rbd/data RBD image as /
mnt/data.
Evaluation
Grade your work by running the lab grade comprehensive-review4 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Solution
Outcomes
You should be able to:
• Deploy and configure Red Hat Ceph Storage for RBD mirroring.
• Configure a client to access RBD images.
• Manage RBD images, RBD mirroring, and RBD snapshots and clones.
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command ensures that the production and backup clusters are running and have the RBD
storage pools called rbd, rbdpoolmode, and rbdimagemode in both clusters. It also creates
the data image in the rbd pool in the production cluster.
1. Using two terminals, log in to clienta for the production cluster and serverf for the
backup cluster as the admin user. Verify that each cluster is reachable and has a HEALTH_OK
status.
1.1. In the first terminal, log in to clienta as the admin user and use sudo to run the
cephadm shell. Verify the health of the production cluster.
1.2. In the second terminal, log in to serverf as admin and use sudo to run the cephadm
shell. Verify the health of the backup cluster. Exit from the cephadm shell.
2. In the production cluster, create the rbdpoolmode/vm1 RBD image, enable one-way pool-
mode mirroring on the pool, and view the image information.
2.1. Create an RBD image called vm1 in the rbdpoolmode pool in the production cluster.
Specify a size of 128 megabytes, and enable the exclusive-lock and journaling RBD
image features.
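The image creation and pool-mode mirroring commands for steps 2.1 and 2.2 might look like:

[ceph: root@clienta /]# rbd create vm1 --size 128 --pool rbdpoolmode \
--image-feature exclusive-lock,journaling
[ceph: root@clienta /]# rbd mirror pool enable rbdpoolmode pool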
2.3. View the vm1 image information. Exit from the cephadm shell.
journal: ad7c2dd2d3be
mirroring state: enabled
mirroring mode: journal
mirroring global id: 6ea4b768-a53d-4195-a1f5-37733eb9af76
mirroring primary: true
[ceph: root@clienta /]# exit
exit
[admin@clienta ~]$
3. In the production cluster, run the cephadm shell with a bind mount of /home/admin/cr4/.
Bootstrap the storage cluster peer and create Ceph user accounts, and save the token in the
/home/admin/cr4/pool_token_prod file in the container. Name the production cluster
prod. Copy the bootstrap token file to the backup storage cluster.
3.1. In the production cluster, use sudo to run the cephadm shell with a bind mount of the /
home/admin/cr4/ directory.
3.2. Bootstrap the storage cluster peer, and create Ceph user accounts, save the output in
the /mnt/pool_token_prod file. Name the production cluster prod.
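The peer bootstrap command might look like the following; the /mnt path maps to the bind-mounted /home/admin/cr4/ directory:

[ceph: root@clienta /]# rbd mirror pool peer bootstrap create --site-name prod \
rbdpoolmode > /mnt/pool_token_prod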
3.3. Exit the cephadm shell. Copy the bootstrap token file to the backup storage cluster in
the /home/admin/cr4/ directory.
4. In the backup cluster, run the cephadm shell with a bind mount of /home/admin/cr4/.
Deploy an rbd-mirror daemon in the serverf node. Import the bootstrap token located
in the /home/admin/cr4/ directory. Name the backup cluster bck. Verify that the RBD
image is present.
4.1. In the backup cluster, use sudo to run the cephadm shell with a bind mount of the /
home/admin/cr4/ directory.
4.2. Deploy a rbd-mirror daemon, by using the --placement option to select the
serverf.lab.example.com node. Verify the placement.
4.3. Import the bootstrap token located in /mnt/pool_token_prod. Name the backup
cluster bck.
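For steps 4.2 and 4.3, the commands might look like:

[ceph: root@serverf /]# ceph orch apply rbd-mirror \
--placement=serverf.lab.example.com
[ceph: root@serverf /]# ceph orch ps --daemon-type rbd-mirror
[ceph: root@serverf /]# rbd mirror pool peer bootstrap import --site-name bck \
--direction rx-only rbdpoolmode /mnt/pool_token_prod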
Important
Ignore the known error containing the following text: auth: unable to find a
keyring on …
4.4. Verify that the RBD image is present. Wait until the RBD image is displayed.
5. In the production cluster, create the rbdimagemode/vm2 RBD image and enable one-way
image-mode mirroring on the pool. Also, enable mirroring for the vm2 RBD image in the
rbdimagemode pool.
5.1. In the production cluster, use sudo to run the cephadm shell with a bind mount of the /
home/admin/cr4/ directory.
5.2. Create an RBD image called vm2 in the rbdimagemode pool in the production cluster.
Specify a size of 128 megabytes, and enable the exclusive-lock and journaling RBD
image features.
5.4. Enable mirroring for the vm2 RBD image in the rbdimagemode pool.
6. In the production cluster, bootstrap the storage cluster peer and create Ceph user accounts,
and save the token in the /home/admin/cr4/image_token_prod file in the container.
Copy the bootstrap token file to the backup storage cluster.
6.1. Bootstrap the storage cluster peer and create Ceph user accounts, and save the output
in the /mnt/image_token_prod file.
6.2. Exit from the cephadm shell. Copy the bootstrap token file to the backup storage
cluster in the /home/admin/cr4/ directory.
7. In the backup cluster, import the bootstrap token. Verify that the RBD image is present.
7.1. Import the bootstrap token located in /mnt/image_token_prod. Name the backup
cluster bck.
Important
Ignore the known error containing the following text: auth: unable to find a
keyring on …
7.2. Verify that the RBD image is present. Wait until the RBD image appears.
7.3. Return to workstation as the student user and Exit the second terminal.
8. In the production cluster, map the image called rbd/data using the kernel RBD client on
clienta. Format the device with an XFS file system. Temporarily mount the file system and
store a copy of the /usr/share/dict/words file at the root of the file system. Unmount and
unmap the device when done.
8.1. Map the data image in the rbd pool using the kernel RBD client.
8.2. Format the /dev/rbd0 device with an XFS file system and mount the file system on the
/mnt/data directory.
8.3. Copy the /usr/share/dict/words file to the root of the file system, /mnt/data.
List the content to verify the copy.
9. In the production cluster, create a snapshot called beforeprod of the RBD image data.
Create a clone called prod1 from the snapshot called beforeprod.
9.1. In the production cluster, use sudo to run the cephadm shell. Create a snapshot called
beforeprod of the RBD image data in the rbd pool.
9.2. Verify the snapshot by listing the snapshots of the data RBD image in the rbd pool.
9.3. Protect the beforeprod snapshot and create the clone. Exit from the cephadm shell.
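The snapshot and clone commands for steps 9.1 through 9.3 might look like:

[ceph: root@clienta /]# rbd snap create rbd/data@beforeprod
[ceph: root@clienta /]# rbd snap ls rbd/data
[ceph: root@clienta /]# rbd snap protect rbd/data@beforeprod
[ceph: root@clienta /]# rbd clone rbd/data@beforeprod rbd/prod1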
9.4. Verify that the clone also contains the words file by mapping and mounting the clone
image. Unmount the file system and unmap the device after verification.
10. In the production cluster, export the image called data to the /home/admin/cr4/
data.img file. Import it as an image called data to the rbdimagemode pool. Create a
snapshot called beforeprod of the new data image in the rbdimagemode pool.
10.1. In the production cluster, use sudo to run the cephadm shell with a bind mount of the /
home/admin/cr4/ directory. Export the image called data to the /mnt/data.img
file.
10.2. Import the /mnt/data.img file as an image called data to the pool called
rbdimagemode. Verify the import by listing the images in the rbdimagemode pool.
10.3. Create a snapshot called beforeprod of the image called data in the pool called
rbdimagemode. Exit from the cephadm shell.
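The export and import commands for steps 10.1 through 10.3 might look like the following; /mnt maps to the bind-mounted /home/admin/cr4/ directory:

[ceph: root@clienta /]# rbd export rbd/data /mnt/data.img
[ceph: root@clienta /]# rbd import /mnt/data.img rbdimagemode/data
[ceph: root@clienta /]# rbd ls rbdimagemode
[ceph: root@clienta /]# rbd snap create rbdimagemode/data@beforeprod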
11. On the clienta host, use the kernel RBD client to remap and remount the RBD image
called data in the pool called rbd. Copy the /etc/services file to the root of the file
system. Unmount the file system and unmap the device when done.
11.1. Map the data image in the rbd pool using the kernel RBD client. Mount the file system
on /mnt/data.
11.2. Copy the /etc/services file to the root of the file system, /mnt/data. List the
contents of /mnt/data for verification.
11.3. Unmount the file system and unmap the data image in the rbd pool.
12. In the production cluster, export changes to the rbd/data image, after the creation
of the beforeprod snapshot, to a file called /home/admin/cr4/data-diff.img.
Import the changes from the /mnt/data-diff.img file to the image called data in the
rbdimagemode pool.
12.1. In the production cluster, use sudo to run the cephadm shell with a bind mount of the
/home/admin/cr4/ directory. Export changes to the data image in the rbd pool,
after the creation of the beforeprod snapshot, to a file called /mnt/data-diff.img.
12.2. Import changes from the /mnt/data-diff.img file to the image called data in the
pool called rbdimagemode. Exit from the cephadm shell.
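The differential export and import might look like:

[ceph: root@clienta /]# rbd export-diff --from-snap beforeprod rbd/data \
/mnt/data-diff.img
[ceph: root@clienta /]# rbd import-diff /mnt/data-diff.img rbdimagemode/data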
12.3. Verify that the image called data in the pool called rbdimagemode also contains the
services file by mapping and mounting the image. When done, unmount the file system
and unmap the image.
13. Configure the clienta host so that it will persistently mount the rbd/data RBD image as /
mnt/data. Authenticate as the admin Ceph user by using existing keys found in the /etc/
ceph/ceph.client.admin.keyring file.
13.1. Create an entry for rbd/data in the /etc/ceph/rbdmap RBD map file. The resulting
file should have the following contents:
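One possible /etc/ceph/rbdmap entry for this image:

rbd/data id=admin,keyring=/etc/ceph/ceph.client.admin.keyring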
13.2. Create an entry for /dev/rbd/rbd/data in the /etc/fstab file. The resulting file
should have the following contents:
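One possible /etc/fstab entry; the noauto option lets the rbdmap service handle the mount after it maps the device at boot:

/dev/rbd/rbd/data /mnt/data xfs noauto 0 0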
13.3. Use the rbdmap command to verify your RBD map configuration.
13.4. After you have verified that the RBD mapped devices work, enable the rbdmap service.
Reboot the clienta host to verify that the RBD device mounts persistently.
13.5. When clienta finishes rebooting, log in to clienta as the admin user, and verify
that it has mounted the RBD device.
Evaluation
Grade your work by running the lab grade comprehensive-review4 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Lab
Outcomes
You should be able to:
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command ensures that all cluster hosts are reachable. It also installs the AWS and Swift
clients on the serverc and serverf nodes.
The primary Ceph cluster contains the serverc, serverd, and servere nodes. The
secondary Ceph cluster contains the serverf node.
Specifications
• Create a realm, zonegroup, zone, and a system user called Replication User on the primary
Ceph cluster. Configure each resource as the default. Treat this zone as the primary zone. Use
the names provided in this table:
Resource Name
Realm cl260
Zonegroup classroom
Zone main
Endpoint http://serverc:80
• Deploy a RADOS Gateway service in the primary cluster called cl260-1 with one RGW instance
on serverc. Configure the primary zone name and disable dynamic bucket index resharding.
• On the secondary Ceph cluster, configure a secondary zone called fallback for the
classroom zonegroup. Object resources created in the primary zone must replicate to the
secondary zone. Configure the endpoint of the secondary zone as http://serverf:80
• Deploy a RADOS Gateway service in the secondary cluster called cl260-2 with one RGW
instance. Configure the secondary zone name and disable dynamic bucket index resharding.
• Create an Amazon S3 API user called S3 User with a uid of apiuser, an access key of
review, and a secret key of securekey. Create a Swift API subuser with secret key of
secureospkey. Grant full access to both the user and subuser.
• Create a bucket called images by using the Amazon S3 API. Upload the /etc/favicon.png
file to the images container by using the Swift API. The object must be available as favicon-
image
Evaluation
Grade your work by running the lab grade comprehensive-review5 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.
Solution
Outcomes
You should be able to:
Important
Reset your environment before performing this exercise. All comprehensive
review labs start with a clean, initial classroom environment that includes a
pre-built, fully operational Ceph cluster. All remaining comprehensive reviews
use the default Ceph cluster provided in the initial classroom environment.
As the student user on the workstation machine, use the lab command to prepare your
system for this exercise.
This command ensures that all cluster hosts are reachable. It also installs the AWS and Swift
clients on the serverc and serverf nodes.
The primary Ceph cluster contains the serverc, serverd, and servere nodes. The
secondary Ceph cluster contains the serverf node.
1. Log in to serverc as the admin user. Create a realm called cl260, a zonegroup called
classroom, a zone called main, and a system user called Replication User. Use the UID
of repl.user, access key of replication, and secret key of secret for the user. Set the
zone endpoint as http://serverc:80.
1.1. Log in to serverc as the admin user and use sudo to run the cephadm shell.
1.3. Create a zonegroup called classroom. Configure the classroom zonegroup with an
endpoint on the serverc node. Set the classroom zonegroup as the default.
1.4. Create a master zone called main. Configure the zone with an endpoint pointing to
http://serverc:80. Use replication as the access key and secret as the
secret key. Set the main zone as the default.
1.5. Create a system user called repl.user to access the zone pools. The keys for the
repl.user user must match the keys configured for the zone.
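Collected in one place, the radosgw-admin commands for steps 1.2 through 1.5 might look like the following:

[ceph: root@serverc /]# radosgw-admin realm create --rgw-realm=cl260 --default
[ceph: root@serverc /]# radosgw-admin zonegroup create --rgw-zonegroup=classroom \
--endpoints=http://serverc:80 --master --default
[ceph: root@serverc /]# radosgw-admin zone create --rgw-zonegroup=classroom \
--rgw-zone=main --endpoints=http://serverc:80 \
--access-key=replication --secret=secret --master --default
[ceph: root@serverc /]# radosgw-admin user create --uid="repl.user" \
--display-name="Replication User" --access-key=replication --secret=secret --system
[ceph: root@serverc /]# radosgw-admin period update --commit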
2. Create a RADOS Gateway service called cl260-1 with a single RGW daemon on serverc.
Verify that the RGW daemon is up and running. Configure the zone name in the configuration
database and disable dynamic bucket index resharding.
2.1. Create a RADOS gateway service called cl260-1 with a single RGW daemon on the
serverc node.
[ceph: root@serverc /]# ceph orch apply rgw cl260-1 --realm=cl260 --zone=main \
--placement="1 serverc.lab.example.com"
Scheduled rgw.cl260-1 update...
[ceph: root@serverc /]# ceph orch ps --daemon-type rgw
NAME HOST STATUS REFRESHED
AGE PORTS ...
rgw.cl260-1.serverc.iwsaop serverc.lab.example.com running (70s) 65s
ago 70s *:80 ...
3. Log in to serverf as the admin user. Pull the realm and period configuration from the
serverc node. Use the credentials for repl.user to authenticate. Verify that the pulled
realm and zonegroup are set as default for the secondary cluster. Create a secondary zone
called fallback for the classroom zonegroup.
3.1. In a second terminal, log in to serverf as the admin user and use sudo to run the
cephadm shell.
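The pull commands that produce the period output below might look like the following; the credentials are those of the repl.user system user:

[ceph: root@serverf /]# radosgw-admin realm pull --url=http://serverc:80 \
--access-key=replication --secret=secret
[ceph: root@serverf /]# radosgw-admin period pull --url=http://serverc:80 \
--access-key=replication --secret=secret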
"id": "93a7f406-0bbd-43a5-a32a-c217386d534b",
"zonegroups": [
{
"id": "2b1495f8-5ac3-4ec5-897e-ae5e0923d0b9",
"name": "classroom",
"api_name": "classroom",
"is_master": "true",
"endpoints": [
"http://serverc:80"
...output omitted...
"zones": [
{
"id": "b50c6d11-6ab6-4a3e-9fb6-286798ba950d",
"name": "main",
"endpoints": [
"http://serverc:80"
...output omitted...
"master_zonegroup": "2b1495f8-5ac3-4ec5-897e-ae5e0923d0b9",
"master_zone": "b50c6d11-6ab6-4a3e-9fb6-286798ba950d",
...output omitted...
"realm_id": "8ea5596f-e2bb-4ac5-8fc8-9122de311e26",
"realm_name": "cl260",
"realm_epoch": 2
}
3.3. Set the cl260 realm and classroom zone group as default.
3.4. Create a zone called fallback. Configure the fallback zone with the endpoint
pointing to http://serverf:80.
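The commands for steps 3.3 and 3.4 might look like:

[ceph: root@serverf /]# radosgw-admin realm default --rgw-realm=cl260
[ceph: root@serverf /]# radosgw-admin zonegroup default --rgw-zonegroup=classroom
[ceph: root@serverf /]# radosgw-admin zone create --rgw-zonegroup=classroom \
--rgw-zone=fallback --endpoints=http://serverf:80 \
--access-key=replication --secret=secret --default
[ceph: root@serverf /]# radosgw-admin period update --commit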
"epoch": 2,
"predecessor_uuid": "75c34edd-428f-4c7f-a150-6236bf6102db",
"sync_status": [],
"period_map": {
"id": "93a7f406-0bbd-43a5-a32a-c217386d534b",
"zonegroups": [
{
"id": "2b1495f8-5ac3-4ec5-897e-ae5e0923d0b9",
"name": "classroom",
"api_name": "classroom",
"is_master": "true",
"endpoints": [
"http://serverc:80"
],
...output omitted...
"zones": [
{
"id": "b50c6d11-6ab6-4a3e-9fb6-286798ba950d",
"name": "main",
"endpoints": [
"http://serverc:80"
],
...output omitted...
},
{
"id": "fe105db9-fd00-4674-9f73-0d8e4e93c98c",
"name": "fallback",
"endpoints": [
"http://serverf:80"
],
...output omitted...
"master_zonegroup": "2b1495f8-5ac3-4ec5-897e-ae5e0923d0b9",
"master_zone": "b50c6d11-6ab6-4a3e-9fb6-286798ba950d",
...output omitted...
"realm_id": "8ea5596f-e2bb-4ac5-8fc8-9122de311e26",
"realm_name": "cl260",
"realm_epoch": 2
}
4. Create a RADOS Gateway service called cl260-2 with a single RGW daemon on the
serverf node. Verify that the RGW daemon is up and running. Configure the zone name in
the configuration database and disable dynamic bucket index resharding.
4.1. Create a RADOS gateway service called cl260-2 with a single RGW daemon on
serverf.
5. On serverc, use the radosgw-admin command to create a user called apiuser for the
Amazon S3 API and a subuser called apiuser:swift for the Swift API. For the apiuser
user, utilize the access key of review, secret key of securekey, and grant full access. For
the apiuser:swift subuser, utilize the secret of secureospkey and grant the subuser full
access.
5.1. Create an Amazon S3 API user called S3 user with the UID of apiuser. Assign an
access key of review and a secret of securekey, and grant the user full access.
],
"swift_keys": [],
...output omitted...
5.2. Create a Swift subuser called apiuser:swift, set secureospkey as the subuser
secret and grant full access.
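The user and subuser creation commands for steps 5.1 and 5.2 might look like:

[ceph: root@serverc /]# radosgw-admin user create --uid=apiuser \
--display-name="S3 User" --access-key=review --secret=securekey
[ceph: root@serverc /]# radosgw-admin subuser create --uid=apiuser \
--subuser=apiuser:swift --access=full --secret=secureospkey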
6. On the serverc node, exit the cephadm shell. Create a bucket called review. Configure
the AWS CLI tool to use the apiuser user credentials. Use the swift upload command to
upload the /etc/favicon.png file to the image bucket.
6.1. Exit the cephadm shell. Configure the AWS CLI tool to use operator credentials. Enter
review as the access key and securekey as the secret key.
6.3. Use the upload command of the swift API to upload the /etc/favicon.png file to
the image bucket. The object must be available as favicon-image.
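A possible sequence for this step; the bucket name images follows the lab specifications, and the Swift authentication URL assumes the default RADOS Gateway Swift auth endpoint:

[admin@serverc ~]$ aws configure
AWS Access Key ID [None]: review
AWS Secret Access Key [None]: securekey
Default region name [None]:
Default output format [None]:
[admin@serverc ~]$ aws --endpoint-url=http://serverc:80 s3 mb s3://images
[admin@serverc ~]$ swift -A http://serverc:80/auth/v1.0 -U apiuser:swift \
-K secureospkey upload images /etc/favicon.png --object-name favicon-image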
6.4. Exit and close the second terminal. Return to workstation as the student user.
Evaluation
Grade your work by running the lab grade comprehensive-review5 command from your
workstation machine. Correct any reported failures and rerun the script until successful.
Finish
As the student user on the workstation machine, use the lab command to complete this
exercise. This is important to ensure that resources from previous exercises do not impact
upcoming exercises.