IBM Storage Ceph 5 Documentation
© Copyright IBM Corp. 2024.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Table of Contents
IBM Storage Ceph 1
Summary of changes 1
Release notes 1
Enhancements 1
Bug fixes 3
Known issues 11
Sources 12
Asynchronous updates 12
Release Notes for 5.3z6 12
Enhancements 12
Bug fixes 13
Known issues 18
Concepts 18
Architecture 19
Ceph architecture 19
Core Ceph components 20
Prerequisites 21
Ceph pools 21
Ceph authentication 22
Ceph placement groups 23
Ceph CRUSH ruleset 24
Ceph input/output operations 24
Ceph replication 25
Ceph erasure coding 26
Ceph ObjectStore 27
Ceph BlueStore 28
Ceph self management operations 28
Ceph heartbeat 29
Ceph peering 29
Ceph rebalancing and recovery 29
Ceph data integrity 30
Ceph high availability 30
Clustering the Ceph Monitor 30
Ceph client components 31
Prerequisites 31
Ceph client native protocol 31
Ceph client object watch and notify 32
Ceph client Mandatory Exclusive Locks 32
Ceph client object map 33
Ceph client data striping 34
Ceph on-wire encryption 36
Data Security and Hardening 38
Introduction to data security 38
Preface 38
Introduction to IBM Storage Ceph 38
Supporting Software 39
Threat and Vulnerability Management 39
Threat Actors 40
Security Zones 40
Connecting Security Zones 41
Security-Optimized Architecture 41
Encryption and Key Management 42
SSH 43
SSL Termination 43
Messenger v2 protocol 43
Encryption in transit 44
Encryption at Rest 45
Identity and Access Management 45
Ceph Storage Cluster User Access 46
Ceph Object Gateway User Access 46
Ceph Object Gateway LDAP or AD authentication 47
Ceph Object Gateway OpenStack Keystone authentication 47
Infrastructure Security 47
Administration 48
Network Communication 48
Hardening the Network Service 49
Reporting 50
Auditing Administrator Actions 50
Data Retention 51
Ceph Storage Cluster 51
Ceph Block Device 51
Ceph Object Gateway 51
Federal Information Processing Standard (FIPS) 52
Summary 52
Planning 52
Compatibility 53
Compatibility Matrix for IBM Storage Ceph 5.3 53
Hardware 54
Executive summary 54
General principles for selecting hardware 55
Identify performance use case 55
Consider storage density 56
Identical hardware configuration 56
Network considerations for IBM Storage Ceph 56
Avoid using RAID solutions 57
Summary of common mistakes when selecting hardware 57
Reference 58
Optimize workload performance domains 58
Server and rack solutions 59
Minimum hardware recommendations for containerized Ceph 62
Recommended minimum hardware requirements for the IBM Storage Ceph Dashboard 63
Storage Strategies 63
Overview 63
What are storage strategies? 64
Configuring storage strategies 65
Crush admin overview 65
CRUSH introduction 66
Dynamic data placement 67
CRUSH failure domain 68
CRUSH performance domain 68
Using different device classes 69
CRUSH hierarchy 69
CRUSH location 70
Adding a bucket 71
Moving a bucket 71
Removing a bucket 72
CRUSH Bucket algorithms 72
Ceph OSDs in CRUSH 72
Viewing OSDs in CRUSH 74
Adding an OSD to CRUSH 76
Moving an OSD within a CRUSH Hierarchy 76
Removing an OSD from a CRUSH Hierarchy 76
Device class 77
Setting a device class 77
Removing a device class 78
Renaming a device class 78
Listing a device class 78
Listing OSDs of a device class 78
Listing CRUSH Rules by Class 79
CRUSH weights 79
Setting CRUSH weights of OSDs 79
Setting a Bucket’s OSD Weights 80
Set an OSD’s in Weight 80
Setting the OSDs weight by utilization 80
Setting an OSD’s Weight by PG distribution 81
Recalculating a CRUSH Tree’s weights 82
Primary affinity 82
CRUSH rules 82
Listing CRUSH rules 85
Dumping CRUSH rules 85
Adding CRUSH rules 85
Creating CRUSH rules for replicated pools 85
Creating CRUSH rules for erasure coded pools 86
Removing CRUSH rules 86
CRUSH tunables overview 86
CRUSH tuning 87
CRUSH tuning, the hard way 87
CRUSH legacy values 88
Edit a CRUSH map 88
Getting the CRUSH map 88
Decompiling the CRUSH map 89
Compiling the CRUSH map 89
Setting a CRUSH map 89
CRUSH storage strategies examples 89
Placement Groups 91
About placement groups 91
Placement group states 92
Placement group tradeoffs 94
Data durability 94
Data distribution 95
Resource usage 96
Placement group count 96
Placement group calculator 96
Configuring default placement group count 97
Placement group count for small clusters 97
Calculating placement group count 97
Maximum placement group count 97
Auto-scaling placement groups 98
Placement group auto-scaling 98
Viewing placement group scaling recommendations 98
Placement group splitting and merging 100
Setting placement group auto-scaling modes 101
Setting placement group auto-scaling 103
Setting minimum and maximum number of placement groups for pools 104
Updating noautoscale flag 105
Specifying target pool size 105
Specifying target size using the absolute size of the pool 106
Specifying target size using the total cluster capacity 106
Placement group command line interface 106
Setting number of placement groups in a pool 107
Getting number of placement groups in a pool 107
Getting statistics for placement groups 107
Getting statistics for stuck placement groups 108
Getting placement group maps 108
Scrubbing placement groups 108
Getting a placement group statistics 108
Marking unfound objects 109
Pools overview 109
Pools and storage strategies overview 110
Listing pool 111
Creating a pool 111
Setting pool quota 113
Deleting a pool 113
Renaming a pool 113
Viewing pool statistics 114
Setting pool values 114
Getting pool values 114
Enabling a client application 114
Disabling a client application 115
Setting application metadata 115
Removing application metadata 116
Setting the number of object replicas 116
Getting the number of object replicas 116
Pool values 117
Erasure code pools overview 120
Creating a sample erasure-coded pool 121
Erasure code profiles 121
Setting OSD erasure-code-profile 123
Removing OSD erasure-code-profile 124
Getting OSD erasure-code-profile 124
Listing OSD erasure-code-profile 125
Erasure Coding with Overwrites 125
Erasure Code Plugins 125
Creating a new erasure code profile using jerasure erasure code plugin 125
Controlling CRUSH Placement 127
Installing 128
IBM Storage Ceph 128
IBM Storage Ceph considerations and recommendations 129
Basic IBM Storage Ceph considerations 129
IBM Storage Ceph workload considerations 131
Network considerations for IBM Storage Ceph 133
Considerations for using a RAID controller with OSD hosts 134
Tuning considerations for the Linux kernel when running Ceph 135
How colocation works and its advantages 135
Operating system requirements for IBM Storage Ceph 138
Minimum hardware considerations for IBM Storage Ceph 139
IBM Storage Ceph installation 140
cephadm utility 141
How cephadm works 142
cephadm-ansible playbooks 143
Registering the IBM Storage Ceph nodes 144
Configuring Ansible inventory location 145
Enabling SSH login as root user on Red Hat Enterprise Linux 9 146
Creating an Ansible user with sudo access 147
Enabling password-less SSH for Ansible 148
Configuring SSH 149
Configuring a different SSH user 150
Running the preflight playbook 151
Bootstrapping a new storage cluster 153
Recommended cephadm bootstrap command options 155
Obtaining entitlement key 155
Using a JSON file to protect login information 156
Bootstrapping a storage cluster using a service configuration file 156
Bootstrapping the storage cluster as a non-root user 158
Bootstrap command options 159
Configuring a private registry for a disconnected installation 160
Running the preflight playbook for a disconnected installation 165
Performing a disconnected installation 166
Changing configurations of custom container images for disconnected installations 168
Distributing SSH keys 169
Launching the cephadm shell 170
Verifying the cluster installation 171
Adding hosts 172
Using the addr option to identify hosts 174
Adding multiple hosts 174
Adding hosts in disconnected deployments 176
Removing hosts 176
Labeling hosts 177
Adding a label to a host 178
Removing a label from a host 178
Using host labels to deploy daemons on specific hosts 179
Adding Monitor service 181
Setting up the admin node 182
Deploying Ceph monitor nodes using host labels 183
Adding Ceph Monitor nodes by IP address or network name 184
Removing the admin label from a host 185
Adding Manager service 186
Adding OSDs 186
Purging the Ceph storage cluster 187
Deploying client nodes 189
Managing an IBM Storage Ceph cluster using cephadm-ansible modules 191
cephadm-ansible modules 191
cephadm-ansible modules options 192
Bootstrapping a storage cluster using the cephadm_bootstrap and cephadm_registry_login modules 193
Adding or removing hosts using the ceph_orch_host module 195
Setting configuration options using the ceph_config module 199
Applying a service specification using the ceph_orch_apply module 201
Managing Ceph daemon states using the ceph_orch_daemon module 202
Comparison between Ceph Ansible and Cephadm 203
cephadm commands 204
What to do next? Day 2 209
Upgrading 209
Upgrading to an IBM Storage Ceph cluster using cephadm 209
Upgrading to IBM Storage Ceph cluster 210
Upgrading the IBM Storage Ceph cluster in a disconnected environment 213
Staggered upgrade 215
Staggered upgrade options 215
Performing a staggered upgrade 216
Monitoring and managing upgrade of the storage cluster 218
Troubleshooting upgrade error messages 218
Configuring 219
The basics of Ceph configuration 219
Ceph configuration 219
The Ceph configuration database 220
Using the Ceph metavariables 222
Viewing the Ceph configuration at runtime 222
Viewing a specific configuration at runtime 223
Setting a specific configuration at runtime 223
OSD Memory Target 224
Setting the OSD memory target 224
MDS Memory Cache Limit 225
Ceph network configuration 225
Network configuration for Ceph 226
Ceph network messenger 228
Configuring a public network 228
Configuring multiple public networks to the cluster 229
Configuring a private network 231
Verifying firewall rules are configured for default Ceph ports 232
Firewall settings for Ceph Monitor node 232
Firewall settings for Ceph OSDs 233
Ceph Monitor configuration 234
Ceph Monitor configuration 235
Viewing the Ceph Monitor configuration database 235
Ceph cluster maps 236
Ceph Monitor quorum 236
Ceph Monitor consistency 237
Bootstrap the Ceph Monitor 237
Minimum configuration for a Ceph Monitor 238
Unique identifier for Ceph 238
Ceph Monitor data store 238
Ceph storage capacity 239
Ceph heartbeat 240
Ceph Monitor synchronization role 240
Ceph time synchronization 241
Ceph authentication configuration 241
Cephx authentication 242
Enabling Cephx 242
Disabling Cephx 243
Cephx user keyrings 244
Cephx daemon keyrings 244
Cephx message signatures 244
Pools, placement groups, and CRUSH configuration 244
Pools placement groups and CRUSH 245
Ceph Object Storage Daemon (OSD) configuration 245
Ceph OSD configuration 245
Scrubbing the OSD 246
Backfilling an OSD 246
OSD recovery 246
Ceph Monitor and OSD interaction configuration 247
Ceph Monitor and OSD interaction 247
OSD heartbeat 247
Reporting an OSD as down 248
Reporting a peering failure 249
OSD reporting status 249
Ceph debugging and logging configuration 250
General configuration options 251
Ceph network configuration options 252
Ceph Monitor configuration options 256
Cephx configuration options 269
Pools, placement groups, and CRUSH configuration options 271
Object Storage Daemon (OSD) configuration options 275
Ceph Monitor and OSD configuration options 287
Ceph debugging and logging configuration options 290
Ceph scrubbing options 294
BlueStore configuration options 298
Administering 298
Administration 299
Ceph administration 299
Understanding process management for Ceph 299
Ceph process management 300
Starting, stopping, and restarting all Ceph daemons 300
Starting, stopping, and restarting all Ceph services 301
Viewing log files of Ceph daemons that run in containers 302
Powering down and rebooting IBM Storage Ceph cluster 303
Powering down and rebooting the cluster using the systemctl commands 303
Powering down and rebooting the cluster using the Ceph Orchestrator 305
Monitoring a Ceph storage cluster 308
High-level monitoring of a Ceph storage cluster 309
Using the Ceph command interface interactively 309
Checking the storage cluster health 309
Watching storage cluster events 310
How Ceph calculates data usage 311
Understanding the storage clusters usage stats 312
Understanding the OSD usage stats 313
Checking the storage cluster status 314
Checking the Ceph Monitor status 315
Using the Ceph administration socket 317
Understanding the Ceph OSD status 321
Low-level monitoring of a Ceph storage cluster 323
Monitoring Placement Group Sets 323
Ceph OSD peering 324
Placement Group States 324
Placement Group creating state 327
Placement group peering state 327
Placement group active state 327
Placement Group clean state 328
Placement Group degraded state 328
Placement Group recovering state 328
Back fill state 328
Placement Group remapped state 329
Placement Group stale state 329
Placement Group misplaced state 329
Placement Group incomplete state 330
Identifying stuck Placement Groups 330
Finding an object’s location 331
Stretch clusters for Ceph storage 331
Stretch mode for a storage cluster 332
Setting the crush location for the daemons 333
Entering the stretch mode 336
Adding OSD hosts in stretch mode 338
Override Ceph behavior 339
Setting and unsetting Ceph override options 340
Ceph override use cases 340
Ceph user management 341
Ceph user management background 341
Managing Ceph users 343
Listing Ceph users 343
Display Ceph user information 345
Add a new Ceph user 346
Modifying a Ceph User 346
Deleting a Ceph user 347
Print a Ceph user key 347
The ceph-volume utility 348
Ceph volume lvm plugin 348
Why does ceph-volume replace ceph-disk? 349
Preparing Ceph OSDs using ceph-volume 350
Listing devices using ceph-volume 351
Activating Ceph OSDs using ceph-volume 352
Deactivating Ceph OSDs using ceph-volume 353
Creating Ceph OSDs using ceph-volume 354
Migrating BlueFS data 355
Using batch mode with ceph-volume 357
Zapping data using ceph-volume 357
Ceph performance benchmark 358
Performance baseline 359
Benchmarking Ceph performance 359
Benchmarking Ceph block performance 361
Ceph performance counters 362
Access to Ceph performance counters 363
Display the Ceph performance counters 363
Dump the Ceph performance counters 364
Average count and sum 365
Ceph Monitor metrics 365
Ceph OSD metrics 367
Ceph Object Gateway metrics 372
BlueStore 374
Ceph BlueStore 374
Ceph BlueStore devices 375
Ceph BlueStore caching 376
Sizing considerations for Ceph BlueStore 376
Tuning Ceph BlueStore using bluestore_min_alloc_size parameter 376
Resharding the RocksDB database using the BlueStore admin tool 378
The BlueStore fragmentation tool 379
What is the BlueStore fragmentation tool? 380
Checking for fragmentation 380
Ceph BlueStore BlueFS 381
Viewing the bluefs_buffered_io setting 382
Viewing Ceph BlueFS statistics for Ceph OSDs 383
Cephadm troubleshooting 384
Pause or disable cephadm 385
Per service and per daemon event 385
Check cephadm logs 386
Gather log files 386
Collect systemd status 387
List all downloaded container images 387
Manually run containers 387
CIDR network error 388
Access the admin socket 388
Manually deploying a mgr daemon 388
Cephadm operations 390
Monitor cephadm log messages 390
Ceph daemon logs 391
Data location 392
Cephadm health checks 392
Cephadm operations health checks 392
Cephadm configuration health checks 393
Managing an IBM Storage Ceph cluster using cephadm-ansible modules 394
The cephadm-ansible modules 395
The cephadm-ansible modules options 395
Bootstrapping a storage cluster using the cephadm_bootstrap and cephadm_registry_login modules 397
Adding or removing hosts using the ceph_orch_host module 399
Setting configuration options using the ceph_config module 402
Applying a service specification using the ceph_orch_apply module 404
Managing Ceph daemon states using the ceph_orch_daemon module 406
Operations 407
Introduction to the Ceph Orchestrator 407
Use of the Ceph Orchestrator 407
Management of services 409
Checking service status 409
Checking daemon status 410
Placement specification of the Ceph Orchestrator 411
Deploying the Ceph daemons using the command line interface 411
Deploying the Ceph daemons on a subset of hosts using the command line interface 413
Service specification of the Ceph Orchestrator 414
Deploying the Ceph daemons using the service specification 414
Management of hosts 416
Adding hosts 416
Adding multiple hosts 418
Listing hosts 419
Adding labels to hosts 420
Removing labels from hosts 421
Removing hosts 422
Placing hosts in the maintenance mode 423
Management of monitors 424
Ceph Monitors 424
Configuring monitor election strategy 425
Deploying the Ceph monitor daemons using the command line interface 425
Deploying the Ceph monitor daemons using the service specification 427
Deploying the monitor daemons on specific network 428
Removing the monitor daemons 429
Removing a Ceph Monitor from an unhealthy storage cluster 430
Management of managers 432
Deploying the manager daemons 432
Removing the manager daemons 433
Using Ceph Manager modules 434
Using the Ceph Manager balancer module 436
Using the Ceph Manager alerts module 439
Using the Ceph manager crash module 441
Management of OSDs 443
Ceph OSDs 444
Ceph OSD node configuration 444
Automatically tuning OSD memory 444
Listing devices for Ceph OSD deployment 445
Zapping devices for Ceph OSD deployment 447
Deploying Ceph OSDs on all available devices 448
Deploying Ceph OSDs on specific devices and hosts 449
Advanced service specifications and filters for deploying OSDs 450
Deploying Ceph OSDs using advanced service specifications 452
Removing the OSD daemons 455
Replacing the OSDs 457
Replacing the OSDs with pre-created LVM 458
Replacing the OSDs in a non-colocated scenario 460
Stopping the removal of the OSDs 464
Activating the OSDs 465
Observing the data migration 466
Recalculating the placement groups 467
Management of monitoring stack 467
Deploying the monitoring stack 468
Removing the monitoring stack 470
Basic IBM Storage Ceph client setup 471
Configuring file setup on client machines 471
Setting-up keyring on client machines 472
Management of MDS service 472
Deploying the MDS service using the command line interface 473
Deploying the MDS service using the service specification 475
Removing the MDS service 476
Management of Ceph object gateway 478
Deploying the Ceph Object Gateway using the command line interface 478
Deploying the Ceph Object Gateway using the service specification 480
Deploying a multi-site Ceph Object Gateway 482
Removing the Ceph Object Gateway 486
Configuration of SNMP traps 487
Simple network management protocol 487
Configuring snmptrapd 488
Deploying the SNMP gateway 491
Handling a node failure 494
Considerations before adding or removing a node 494
Performance considerations 495
Recommendations for adding or removing nodes 496
Adding a Ceph OSD node 496
Removing a Ceph OSD node 498
Simulating a node failure 499
Handling a data center failure 500
Avoiding a data center failure 500
Handling a data center failure 501
Dashboard 502
Ceph dashboard overview 503
Ceph Dashboard components 503
Ceph Dashboard features 504
IBM Storage Ceph Dashboard architecture 505
Ceph Dashboard installation and access 506
Network port requirements for Ceph Dashboard 507
Accessing the Ceph dashboard 508
Setting message of the day (MOTD) 509
Expanding the cluster 511
Toggling Ceph dashboard features 512
Understanding the landing page of the Ceph dashboard 515
Changing the dashboard password 517
Changing the Ceph dashboard password using the command line interface 518
Setting admin user password for Grafana 518
Enabling IBM Storage Ceph Dashboard manually 520
Creating an admin account for syncing users to the Ceph dashboard 521
Syncing users to the Ceph dashboard using Red Hat Single Sign-On 522
Enabling Single Sign-On for the Ceph Dashboard 525
Disabling Single Sign-On for the Ceph Dashboard 526
Management of roles 527
User roles and permissions 527
Creating roles 529
Editing roles 531
Cloning roles 531
Deleting roles 532
Management of users 532
Creating users 533
Editing users 534
Deleting users 535
Management of Ceph daemons 535
Daemon actions 535
Monitor the cluster 536
Monitoring hosts of the Ceph cluster 537
Viewing and editing the configuration of the Ceph cluster 538
Viewing and editing the manager modules of the Ceph cluster 538
Monitoring monitors of the Ceph cluster 539
Monitoring services of the Ceph cluster 540
Monitoring Ceph OSDs 540
Monitoring HAProxy 541
Viewing the CRUSH map of the Ceph cluster 542
Filtering logs of the Ceph cluster 543
Monitoring pools of the Ceph cluster 544
Monitoring Ceph file systems 544
Monitoring Ceph object gateway daemons 545
Monitoring Block device images 545
Management of Alerts 546
Enabling monitoring stack 547
Configuring Grafana certificate 549
Adding Alertmanager webhooks 550
Viewing alerts 552
Creating a silence 552
Re-creating a silence 553
Editing a silence 554
Expiring a silence 554
Management of pools 555
Creating pools 555
Editing pools 556
Deleting pools 556
Management of hosts 558
Entering maintenance mode 558
Exiting maintenance mode 559
Removing hosts 560
Management of Ceph OSDs 561
Managing the OSDs 562
Replacing the failed OSDs 564
Management of Ceph Object Gateway 566
Manually adding Ceph object gateway login credentials to the dashboard 566
Creating the Ceph Object Gateway services with SSL using the dashboard 568
Management of Ceph Object Gateway users 569
Creating Ceph object gateway users 570
Creating Ceph object gateway subusers 571
Editing Ceph object gateway users on the dashboard 572
Deleting Ceph object gateway users 573
Management of Ceph Object Gateway buckets 574
Creating Ceph object gateway buckets 574
Editing Ceph object gateway buckets 575
Deleting Ceph object gateway buckets 576
Monitoring multisite object gateway configuration 577
Management of buckets of a multisite object configuration 578
Editing buckets of a multisite object gateway configuration 578
Deleting buckets of a multisite object gateway configuration 580
Management of block devices 581
Management of block device images 582
Creating images 582
Creating namespaces 583
Editing images 584
Copying images 585
Moving images to trash 586
Purging trash 586
Restoring images from trash 587
Deleting images 587
Deleting namespaces 588
Creating snapshots of images 589
Renaming snapshots of images 589
Protecting snapshots of images 590
Cloning snapshots of images 591
Copying snapshots of images 592
Unprotecting snapshots of images 593
Rolling back snapshots of images 594
Deleting snapshots of images 594
Management of mirroring functions 595
Mirroring view 595
Editing mode of pools 596
Adding peer in mirroring 596
Editing peer in mirroring 598
Deleting peer in mirroring 599
Activating and deactivating telemetry 600
Ceph Object Gateway 601
The Ceph Object Gateway 601
Considerations and recommendations 602
Network considerations for IBM Storage Ceph 603
Basic IBM Storage Ceph considerations 604
Colocating Ceph daemons and its advantages 605
IBM Storage Ceph workload considerations 607
Ceph Object Gateway considerations 610
Administrative data storage 610
Index pool 611
Data pool 612
Data extra pool 612
Developing CRUSH hierarchies 612
Creating CRUSH roots 613
Creating CRUSH rules 613
Ceph Object Gateway multi-site considerations 615
Considering storage sizing 616
Considering storage density 617
Considering disks for the Ceph Monitor nodes 617
Adjusting backfill and recovery settings 617
Adjusting the cluster map size 617
Adjusting scrubbing 618
Increase rgw_thread_pool_size 618
Increase objecter_inflight_ops 618
Tuning considerations for the Linux kernel when running Ceph 618
Deployment 619
Deploying the Ceph Object Gateway using the command line interface 620
Deploying the Ceph Object Gateway using the service specification 622
Deploying a multi-site Ceph Object Gateway using the Ceph Orchestrator 624
Removing the Ceph Object Gateway using the Ceph Orchestrator 627
Basic configuration 628
Add a wildcard to the DNS 629
The Beast front-end web server 631
Configuring SSL for Beast 631
Adjusting logging and debugging output 632
Static web hosting 633
Static web hosting assumptions 634
Static web hosting requirements 634
Static web hosting gateway setup 634
Static web hosting DNS configuration 635
Creating a static web hosting site 636
High availability for the Ceph Object Gateway 636
High availability service 636
Configuring high availability for the Ceph Object Gateway 637
HAProxy/keepalived Prerequisites 639
HAProxy/keepalived Prerequisites 640
Preparing HAProxy Nodes 640
Installing and Configuring HAProxy 641
Installing and Configuring keepalived 642
Advanced configuration 644
Multi-site configuration and administration 644
Requirements and Assumptions 645
Pools 647
Migrating a single site system to multi-site 648
Establishing a secondary zone 649
Configuring the archive zone (Technology Preview) 652
Deleting objects in archive zone 652
Failover and disaster recovery 654
Configuring multiple zones without replication 656
Configuring multiple realms in the same storage cluster 658
Multi-site Ceph Object Gateway command line usage 666
Realms 666
Creating a realm 666
Making a Realm the Default 667
Deleting a Realm 667
Getting a realm 667
Listing realms 667
Setting a realm 668
Listing Realm Periods 668
Pulling a Realm 668
Renaming a Realm 668
Zone Groups 668
Creating a Zone Group 669
Making a Zone Group the Default 669
Renaming a Zone Group 669
Deleting a zone group 670
Listing Zone Groups 670
Getting a Zone Group 670
Setting a Zone Group Map 671
Setting a Zone Group 672
Zones 673
Creating a Zone 673
Deleting a zone 674
Modifying a Zone 674
Listing Zones 675
Getting a Zone 675
Setting a Zone 675
Renaming a zone 676
Adding a Zone to a Zone Group 676
Removing a Zone from a Zone Group 676
Configure LDAP and Ceph Object Gateway 677
Install Red Hat Directory Server 677
Configure the Directory Server firewall 677
Label ports for SELinux 678
Configure LDAPS 678
Check if the gateway user exists 678
Add a gateway user 679
Configure the gateway to use LDAP 679
Using a custom search filter 680
Add an S3 user to the LDAP server 680
Export an LDAP token 681
Test the configuration with an S3 client 681
Configure Active Directory and Ceph Object Gateway 682
Using Microsoft Active Directory 683
Configuring Active Directory for LDAPS 683
Check if the gateway user exists 683
Add a gateway user 683
Configuring the gateway to use Active Directory 684
Add an S3 user to the LDAP server 684
Export an LDAP token 685
Test the configuration with an S3 client 685
The Ceph Object Gateway and OpenStack Keystone 686
Roles for Keystone authentication 687
Keystone authentication and the Ceph Object Gateway 687
Creating the Swift service 687
Setting the Ceph Object Gateway endpoints 688
Verifying Openstack is using the Ceph Object Gateway endpoints 689
Configuring the Ceph Object Gateway to use Keystone SSL 690
Configuring the Ceph Object Gateway to use Keystone authentication 690
Restarting the Ceph Object Gateway daemon 691
Security 692
S3 server-side encryption 692
Server-side encryption requests 693
Configuring server-side encryption 693
The HashiCorp Vault 695
Secret engines for Vault 696
Authentication for Vault 697
Namespaces for Vault 697
Transit engine compatibility support 697
Creating token policies for Vault 698
Configuring the Ceph Object Gateway to use SSE-S3 with Vault 699
Configuring the Ceph Object Gateway to use SSE-KMS with Vault 702
Creating a key using the kv engine 704
Creating a key using the transit engine 705
Uploading an object using AWS and the Vault 706
The Ceph Object Gateway and multi-factor authentication 707
Multi-factor authentication 707
Creating a seed for multi-factor authentication 707
Creating a new multi-factor authentication TOTP token 708
Test a multi-factor authentication TOTP token 709
Resynchronizing a multi-factor authentication TOTP token 710
Listing multi-factor authentication TOTP tokens 711
Display a multi-factor authentication TOTP token 711
Deleting a multi-factor authentication TOTP token 712
Administration 713
Creating storage policies 713
Creating indexless buckets 715
Configure bucket index resharding 716
Bucket index resharding 716
Recovering bucket index 717
Limitations of bucket index resharding 718
Configuring bucket index resharding in simple deployments 718
Configuring bucket index resharding in multi-site deployments 719
Resharding bucket index dynamically 721
Resharding bucket index dynamically in multi-site configuration 723
Resharding bucket index manually 725
Cleaning stale instances of bucket entries after resharding 726
Fixing lifecycle policies after resharding 727
Enabling compression 727
User management 728
Multi-tenant namespace 729
Create a user 730
Create a subuser 730
Get user information 731
Modify user information 731
Enable and suspend users 731
Remove a user 732
Remove a subuser 732
Rename a user 732
Create a key 735
Add and remove access keys 735
Add and remove admin capabilities 736
Role management 736
Creating a role 737
Getting a role 738
Listing a role 738
Updating assume role policy document of a role 739
Getting permission policy attached to a role 740
Listing permission policy attached to a role 741
Deleting policy attached to a role 741
Deleting a role 742
Updating the session duration of a role 743
Quota management 744
Set user quotas 744
Enable and disable user quotas 744
Set bucket quotas 745
Enable and disable bucket quotas 745
Get quota settings 745
Update quota stats 745
Get user quota usage stats 746
Quota cache 746
Reading and writing global quotas 746
Bucket management 746
Renaming buckets 747
Moving buckets 748
Moving buckets between non-tenanted users 748
Moving buckets between tenanted users 749
Moving buckets from non-tenanted users to tenanted users 750
Finding orphan and leaky objects 751
Managing bucket index entries 753
Bucket notifications 754
Creating bucket notifications 755
Bucket lifecycle 757
Creating a lifecycle management policy 758
Deleting a lifecycle management policy 760
Updating a lifecycle management policy 761
Monitoring bucket lifecycles 764
Configuring lifecycle expiration window 765
S3 bucket lifecycle transition within a storage cluster 766
Transitioning an object from one storage class to another 766
Enabling object lock for S3 772
Usage 774
Show usage 774
Trim usage 775
Ceph Object Gateway data layout 775
Object lookup path 776
Multiple data pools 776
Bucket and object listing 776
Object Gateway data layout parameters 777
Optimize the Ceph Object Gateway's garbage collection 778
Viewing the garbage collection queue 778
Adjusting Garbage Collection Settings 778
Adjusting garbage collection for delete-heavy workloads 779
Optimize the Ceph Object Gateway's data object storage 780
Parallel thread processing for bucket life cycles 780
Optimizing the bucket lifecycle 781
Testing 781
Create an S3 user 782
Create a Swift user 783
Test S3 access 785
Test Swift access 786
Configuration reference 786
General settings 787
About pools 789
Lifecycle settings 790
Swift settings 790
Logging settings 791
Keystone settings 791
LDAP settings 792
Block devices 792
Introduction to Ceph block devices 792
Ceph block devices 793
Displaying the command help 793
Creating a block device pool 794
Creating a block device image 794
Listing the block device images 795
Retrieving the block device image information 796
Resizing a block device image 796
Removing a block device image 797
Moving a block device image to the trash 798
Defining an automatic trash purge schedule 799
Enabling and disabling image features 799
Working with image metadata 800
Moving images between pools 802
The rbdmap service 803
Configuring the rbdmap service 804
Persistent Write Log Cache (Technology Preview) 804
Persistent write log cache limitations 805
Enabling persistent write log cache 805
Checking persistent write log cache status 807
Flushing persistent write log cache 808
Discarding persistent write log cache 808
Monitoring performance of Ceph Block Devices using the command-line interface 809
Live migration of images 810
The live migration process 810
Formats 811
Streams 812
Preparing the live migration process 813
Preparing import-only migration 814
Executing the live migration process 815
Committing the live migration process 815
Aborting the live migration process 816
Image encryption 817
Encryption format 817
Encryption load 817
Supported formats 818
Adding encryption format to images and clones 819
Snapshot management 820
Ceph block device snapshots 821
The Ceph user and keyring 821
Creating a block device snapshot 821
Listing the block device snapshots 822
Rolling back a block device snapshot 823
Deleting a block device snapshot 823
Purging the block device snapshots 824
Renaming a block device snapshot 824
Ceph block device layering 825
Protecting a block device snapshot 826
Cloning a block device snapshot 826
Unprotecting a block device snapshot 827
Listing the children of a snapshot 827
Flattening cloned images 828
Mirroring Ceph block devices 828
Ceph block device mirroring 829
An overview of journal-based and snapshot-based mirroring 831
Configuring one-way mirroring using the command-line interface 831
Configuring two-way mirroring using the command-line interface 834
Administration for mirroring Ceph block devices 837
Viewing information about peers 838
Enabling mirroring on a pool 838
Disabling mirroring on a pool 839
Enabling image mirroring 839
Disabling image mirroring 840
Image promotion and demotion 841
Image resynchronization 841
Adding a storage cluster peer 842
Removing a storage cluster peer 843
Getting mirroring status for a pool 843
Getting mirroring status for a single image 844
Delaying block device replication 844
Asynchronous updates and Ceph block device mirroring 845
Converting journal-based mirroring to snapshot-based mirroring 845
Creating an image mirror-snapshot 846
Scheduling mirror-snapshots 846
Creating a mirror-snapshot schedule 847
Listing all snapshot schedules at a specific level 847
Removing a mirror-snapshot schedule 848
Viewing the status for the next snapshots to be created 849
Recover from a disaster 849
Disaster recovery 850
Recover from a disaster with one-way mirroring 850
Recover from a disaster with two-way mirroring 850
Failover after an orderly shutdown 850
Failover after a non-orderly shutdown 851
Prepare for fail back 852
Fail back to the primary storage cluster 854
Remove two-way mirroring 856
Management of ceph-immutable-object-cache daemons 857
Explanation of ceph-immutable-object-cache daemons 857
Configuring the ceph-immutable-object-cache daemon 858
Generic settings of ceph-immutable-object-cache daemons 860
QOS settings of ceph-immutable-object-cache daemons 860
The rbd kernel module 862
Creating a Ceph Block Device and using it from a Linux kernel module client 862
Creating a Ceph block device for a Linux kernel module client using dashboard 862
Map and mount a Ceph Block Device on Linux using the command line 863
Mapping a block device 865
Displaying mapped block devices 866
Unmapping a block device 867
Using the Ceph block device Python module 867
Ceph block device configuration reference 868
Block device default options 869
Block device general options 870
Block device caching options 872
Block device parent and child read options 874
Block device read ahead options 874
Block device blocklist options 875
Block device journal options 875
Block device configuration override options 877
Block device input and output options 879
Developer 880
Ceph RESTful API 880
Prerequisites 881
Versioning for the Ceph API 881
Authentication and authorization for the Ceph API 881
Enabling and Securing the Ceph API module 882
Questions and Answers 883
Getting information 883
How Can I View All Cluster Configuration Options? 884
How Can I View a Particular Cluster Configuration Option? 885
How Can I View All Configuration Options for OSDs? 886
How Can I View CRUSH Rules? 887
How Can I View Information about Monitors? 888
How Can I View Information About a Particular Monitor? 889
How Can I View Information about OSDs? 890
How Can I View Information about a Particular OSD? 891
How Can I Determine What Processes Can Be Scheduled on an OSD? 892
How Can I View Information About Pools? 894
How Can I View Information About a Particular Pool? 895
How Can I View Information About Hosts? 896
How Can I View Information About a Particular Host? 897
Changing Configuration 898
How Can I Change OSD Configuration Options? 898
How Can I Change the OSD State? 899
How Can I Reweight an OSD? 900
How Can I Change Information for a Pool? 901
Administering the Cluster 902
How Can I Run a Scheduled Process on an OSD? 902
How Can I Create a New Pool? 903
How Can I Remove Pool? 904
Ceph Object Gateway administrative API 905
Prerequisites 907
Administration operations 907
Administration authentication requests 907
Creating an administrative user 913
Get user information 915
Create a user 917
Modify a user 921
Remove a user 926
Create a subuser 927
Modify a subuser 929
Remove a subuser 931
Add capabilities to a user 932
Remove capabilities from a user 934
Create a key 935
Remove a key 938
Bucket notifications 939
Prerequisites 939
Overview of bucket notifications 939
Persistent notifications 940
Creating a topic 940
Getting topic information 942
Listing topics 943
Deleting topics 944
Using the command-line interface for topic management 944
Event record 945
Supported event types 947
Get bucket information 947
Check a bucket index 950
Remove a bucket 951
Link a bucket 952
Unlink a bucket 954
Get a bucket or object policy 955
Remove an object 956
Quotas 957
Get a user quota 957
Set a user quota 957
Get a bucket quota 958
Set a bucket quota 960
Get usage information 960
Remove usage information 963
Standard error responses 964
Ceph Object Gateway and the S3 API 965
Prerequisites 965
S3 limitations 965
Accessing the Ceph Object Gateway with the S3 API 966
Prerequisites 966
S3 authentication 966
S3 server-side encryption 968
S3 access control lists 968
Preparing access to the Ceph Object Gateway using S3 969
Accessing the Ceph Object Gateway using Ruby AWS S3 970
Accessing the Ceph Object Gateway using Ruby AWS SDK 973
Accessing the Ceph Object Gateway using PHP 977
Secure Token Service 980
The Secure Token Service application programming interfaces 981
Configuring the Secure Token Service 984
Creating a user for an OpenID Connect provider 985
Obtaining a thumbprint of an OpenID Connect provider 986
Configuring and using STS Lite with Keystone (Technology Preview) 987
Working around the limitations of using STS Lite with Keystone (Technology Preview) 989
S3 bucket operations 990
Prerequisites 992
S3 create bucket notifications 992
S3 get bucket notifications 995
S3 delete bucket notifications 997
Accessing bucket host names 997
S3 list buckets 998
S3 return a list of bucket objects 999
S3 create a new bucket 1001
S3 put bucket website 1002
S3 get bucket website 1003
S3 delete bucket website 1003
S3 delete a bucket 1003
S3 bucket lifecycle 1004
S3 GET bucket lifecycle 1005
S3 create or replace a bucket lifecycle 1006
S3 delete a bucket lifecycle 1006
S3 get bucket location 1007
S3 get bucket versioning 1007
S3 put bucket versioning 1007
S3 get bucket access control lists 1008
S3 put bucket Access Control Lists 1009
S3 get bucket cors 1010
S3 put bucket cors 1010
S3 delete a bucket cors 1011
S3 list bucket object versions 1011
S3 head bucket 1013
S3 list multipart uploads 1013
S3 bucket policies 1017
S3 get the request payment configuration on a bucket 1019
S3 set the request payment configuration on a bucket 1019
Multi-tenant bucket operations 1020
S3 Block Public Access 1020
S3 GET PublicAccessBlock 1022
S3 PUT PublicAccessBlock 1022
S3 delete PublicAccessBlock 1023
S3 object operations 1023
Prerequisites 1024
S3 get an object from a bucket 1025
S3 get information on an object 1026
S3 put object lock 1027
S3 get object lock 1028
S3 put object legal hold 1030
S3 get object legal hold 1031
S3 put object retention 1031
S3 get object retention 1032
S3 put object tagging 1033
S3 get object tagging 1034
S3 delete object tagging 1034
S3 add an object to a bucket 1035
S3 delete an object 1035
S3 delete multiple objects 1036
S3 get an object’s Access Control List (ACL) 1036
S3 set an object’s Access Control List (ACL) 1037
S3 copy an object 1038
S3 add an object to a bucket using HTML forms 1040
S3 determine options for a request 1040
S3 initiate a multipart upload 1040
S3 add a part to a multipart upload 1041
S3 list the parts of a multipart upload 1042
S3 assemble the uploaded parts 1044
S3 copy a multipart upload 1045
S3 abort a multipart upload 1046
S3 Hadoop interoperability 1046
S3 select operations (Technology Preview) 1047
Prerequisites 1047
S3 select content from an object 1047
S3 supported select functions 1051
S3 alias programming construct 1053
S3 CSV parsing explained 1053
Ceph Object Gateway and the Swift API 1054
Prerequisites 1055
Swift API limitations 1055
Create a Swift user 1055
Swift authenticating a user 1057
Swift container operations 1058
Prerequisites 1058
Swift container operations 1058
Swift update a container’s Access Control List (ACL) 1059
Swift list containers 1059
Swift list a container’s objects 1061
Swift create a container 1063
Swift delete a container 1064
Swift add or update the container metadata 1064
Swift object operations 1065
Prerequisites 1065
Swift object operations 1065
Swift get an object 1065
Swift create or update an object 1066
Swift delete an object 1067
Swift copy an object 1068
Swift get object metadata 1069
Swift add or update object metadata 1069
Swift temporary URL operations 1070
Swift get temporary URL objects 1070
Swift POST temporary URL keys 1070
Swift multi-tenancy container operations 1071
The Ceph RESTful API specifications 1071
Prerequisites 1072
Ceph summary 1072
Authentication 1073
Ceph File System 1074
Storage cluster configuration 1079
CRUSH rules 1081
Erasure code profiles 1083
Feature toggles 1084
Grafana 1085
Storage cluster health 1086
Host 1087
Logs 1091
Ceph Manager modules 1091
Ceph Monitor 1094
Ceph OSD 1094
Ceph Object Gateway 1102
REST APIs for manipulating a role 1111
Ceph Orchestrator 1113
Pools 1114
Prometheus 1116
RADOS block device 1118
Performance counters 1132
Roles 1135
Services 1137
Settings 1139
Ceph task 1141
Telemetry 1142
Ceph users 1143
S3 common request headers 1146
S3 common response status codes 1146
S3 unsupported header fields 1147
Swift request headers 1147
Swift response headers 1147
Examples using the Secure Token Service APIs 1147
Troubleshooting 1149
Initial Troubleshooting 1150
Identifying problems 1150
Diagnosing the health of a storage cluster 1150
Understanding Ceph health 1151
Muting health alerts of a Ceph cluster 1152
Understanding Ceph logs 1153
Generating an sos report 1154
Configuring logging 1154
Ceph subsystems 1155
Configuring logging at runtime 1156
Configuring logging in configuration file 1157
Accelerating log rotation 1158
Creating and collecting operation logs for Ceph Object Gateway 1158
Troubleshooting networking issues 1159
Basic networking troubleshooting 1160
Basic chrony NTP troubleshooting 1163
Troubleshooting Ceph Monitors 1164
Most common Ceph Monitor errors 1164
Ceph Monitor error messages 1164
Common Ceph Monitor error messages in the Ceph logs 1164
Ceph Monitor is out of quorum 1165
Clock skew 1166
The Ceph Monitor store is getting too big 1167
Understanding Ceph Monitor status 1168
Injecting a monmap 1169
Replacing a failed Monitor 1170
Compacting the monitor store 1171
Opening port for Ceph manager 1172
Recovering the Ceph Monitor store 1173
Recovering the Ceph Monitor store when using BlueStore 1173
Troubleshooting Ceph OSDs 1176
Most common Ceph OSD errors 1176
Ceph OSD error messages 1177
Common Ceph OSD error messages in the Ceph logs 1177
Full OSDs 1177
Backfillfull OSDs 1178
Nearfull OSDs 1178
Down OSDs 1179
Flapping OSDs 1181
Slow requests or requests are blocked 1182
Stopping and starting rebalancing 1183
Mounting the OSD data partition 1184
Replacing an OSD drive 1185
Increasing the PID count 1187
Deleting data from a full storage cluster 1187
Troubleshooting a multi-site Ceph Object Gateway 1188
Error code definitions for the Ceph Object Gateway 1188
Syncing a multisite Ceph Object Gateway 1189
Performance counters for multi-site Ceph Object Gateway data sync 1190
Synchronizing data in a multi-site Ceph Object Gateway configuration 1191
Troubleshooting Ceph placement groups 1192
Most common Ceph placement groups errors 1192
Placement group error messages 1192
Stale placement groups 1193
Inconsistent placement groups 1193
Unclean placement groups 1195
Inactive placement groups 1195
Placement groups are down 1195
Unfound objects 1196
Listing placement groups stuck in stale, inactive, or unclean state 1198
Listing placement group inconsistencies 1199
Repairing inconsistent placement groups 1202
Increasing the placement group 1202
Troubleshooting Ceph objects 1204
Troubleshooting high-level object operations 1204
Listing objects 1204
Fixing lost objects 1205
Troubleshooting low-level object operations 1206
Manipulating the object’s content 1207
Removing an object 1208
Listing the object map 1209
Manipulating the object map header 1209
Manipulating the object map key 1210
Listing the object’s attributes 1211
Manipulating the object attribute key 1212
Troubleshooting clusters in stretch mode 1213
Replacing the tiebreaker with a monitor in quorum 1213
Replacing the tiebreaker with a new monitor 1215
Forcing stretch cluster into recovery or healthy mode 1217
Contacting IBM support for service 1217
Providing information to IBM Support engineers 1217
Generating readable core dump files 1218
Generating readable core dump files in containerized deployments 1219
Ceph subsystems default logging level values 1221
Health messages of a Ceph cluster 1222
Related information 1226
Acknowledgments 1226
IBM Storage Ceph
IBM Storage Ceph is a software-defined storage platform engineered for private cloud architectures.
Summary of changes
This topic lists the dates and nature of updates to the published information for IBM Storage Ceph.
Concepts
Architecture
Data Security and Hardening
Planning
Compatibility
Hardware
Storage Strategies
Administering
Operations
Ceph Object Gateway
Block devices
Developer
Troubleshooting
10 March 2023: The version information was added to IBM Documentation as part of the initial IBM Storage Ceph 5.3 release.
Release notes
IBM Storage Ceph is a hardened, qualified, secure, and supported enterprise software curated from the Ceph open-source project
and delivered by IBM.
Enhancements
This section lists all major updates, enhancements, and new features introduced in this release of IBM Storage Ceph.
Bug fixes
This section describes bugs with significant user impact that were fixed in this release of IBM Storage Ceph. In addition, the
section includes descriptions of fixed known issues found in previous versions.
Known issues
This section documents known issues found in this release of IBM Storage Ceph.
Sources
Enhancements
With this enhancement, if initial_admin_password is set in an applied Grafana specification, cephadm automatically updates
the dashboard Grafana password, which is equivalent to running the ceph dashboard set-grafana-api-password command.
This streamlines the process of fully setting up Grafana: users no longer have to manually set the dashboard Grafana
password after applying a specification that includes the password.
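For illustration, a minimal sketch of such a Grafana service specification; the file name and password value are placeholders, not part of this release note:

    service_type: grafana
    placement:
      count: 1
    spec:
      initial_admin_password: <grafana_admin_password>

Applying it with ceph orch apply -i grafana.yaml now also updates the dashboard Grafana password, with no separate ceph dashboard set-grafana-api-password step required.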
OSDs automatically update their Ceph configuration files with the new mon locations
With this enhancement, whenever a monmap change is detected, cephadm automatically updates the Ceph configuration
files for each OSD with the new mon locations.
Note: If the storage cluster has a large number of OSDs, this update can take some time to reach all of them.
Ceph Dashboard
The Block Device images table is paginated
With this enhancement, the Block Device images table is paginated for use with storage clusters that contain 10,000 or more
images, because retrieving information for a block device image is expensive.
With this enhancement, cross-origin resource sharing (CORS) can be allowed by setting the cross_origin_url option to a
particular URL, for example, ceph config set mgr mgr/dashboard/cross_origin_url localhost, and the REST API then allows
communication with only that URL.
With this enhancement, the result code 2002 is explicitly translated to 2 and users can see the original behavior.
With this release, the rgw-restore-bucket-index command can restore the indices for buckets that are in non-default
realms, non-default zonegroups, or non-default zones when the user specifies that information on the command line.
With this enhancement, dynamic bucket resharding is supported in multi-site configurations. Once the storage clusters are
upgraded, enable the resharding feature at the zone group and zone level. You can either manually reshard the buckets with
the radosgw-admin bucket reshard command or automatically reshard them with dynamic resharding, independently of other
zones in the storage cluster.
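As a sketch of the workflow, assuming a zone group named us and a zone named us-east (both hypothetical); verify the exact feature-flag syntax against your release before use:

    radosgw-admin zonegroup modify --rgw-zonegroup=us --enable-feature=resharding
    radosgw-admin zone modify --rgw-zone=us-east --enable-feature=resharding
    radosgw-admin period update --commit
    radosgw-admin bucket reshard --bucket=<bucket-name> --num-shards=<new-shard-count>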
Users can now reshard bucket index dynamically with multi-site archive zones
With this enhancement, the bucket index of a multi-site archive zone can be resharded dynamically when dynamic resharding is
enabled for that zone.
RADOS
Low-level log messages are introduced to warn users about hitting throttle limits
Previously, there was no low-level logging to indicate that throttle limits had been hit, causing these occurrences to
incorrectly appear to be networking issues.
With this enhancement, the introduction of low-level log messages makes it much clearer when throttle limits are hit.
Bug fixes
This section describes bugs with significant user impact that were fixed in this release of IBM Storage Ceph. In addition, the
section includes descriptions of fixed known issues found in previous versions.
With this fix, care has been taken to identify the images to which docker.io is added by default. Users who use a local repository
image can upgrade to that image without encountering issues.
(BZ#2100553)
With this fix, tcmu-runner is added to the postrotate actions in the logrotate file that Cephadm deploys for rotating Ceph
daemon logs. tcmu-runner no longer stops logging after its log file is rotated.
(BZ#2204505)
With this fix, users can configure the retention.size parameter in Prometheus's specification file. Cephadm passes this
value to the Prometheus daemon, allowing it to control the disk space usage of Prometheus by limiting the size of the data
directory.
(BZ#2207748)
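A minimal sketch of such a Prometheus specification, assuming the service spec accepts a retention_size field corresponding to the retention.size setting (the key name and size value are assumptions):

    service_type: prometheus
    placement:
      count: 1
    spec:
      retention_size: 50GB

Applying the file with ceph orch apply -i prometheus.yaml redeploys Prometheus with the size-based retention limit.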
With this fix, both the headers are added to the emails sent by Ceph Manager Alerts module and the messages are not flagged
as spam.
(BZ#2064481)
With this release, the email header is modified to include these two fields and the emails generated by the module are no
longer flagged as spam.
(BZ#2210906)
With this fix, the GIL is released during all libcephfs or librbd calls and other Python tasks may acquire the GIL normally.
(BZ#2219093)
With this fix, if no ceph-osd container is found, the volume list remains empty and the cephvolumescan actor does not
fail.
(BZ#2141393)
Ceph OSD deployment no longer fails when ceph-volume treats multiple devices.
Previously, ceph-volume computed wrong sizes when there were multiple devices to treat, resulting in failure to deploy
OSDs.
With this fix, ceph-volume computes the correct size when multiple devices are to be treated, and deployment of OSDs works
as expected.
(BZ#2119774)
Re-running ceph-volume lvm batch command against created devices is now possible
Previously, in ceph-volume, lvm membership was not set for mpath devices like it was for other types of supported devices.
Due to this, re-running the ceph-volume lvm batch command against already created devices was not possible.
With this fix, the lvm membership is set for mpath devices and re-running ceph-volume lvm batch command against
already created devices is now possible.
(BZ#2215042)
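For example, a re-run against previously prepared multipath devices might look like the following; the device paths are hypothetical and the --report option only previews the result:

    ceph-volume lvm batch --report /dev/mapper/mpatha /dev/mapper/mpathb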
With this fix, devices already used by Ceph are filtered out as expected and adding new OSDs with pre-created LVs no longer
fails.
(BZ#2209319)
With this fix, a new configuration parameter, rgw_allow_notification_secrets_in_cleartext, is added. Users can now set up
Kafka connectivity with SASL in a non-TLS environment.
(BZ#2014330)
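A hedged example of enabling the new parameter; the configuration target client.rgw is an assumption and may differ in your deployment:

    ceph config set client.rgw rgw_allow_notification_secrets_in_cleartext true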
With this fix, the internal token handling is fixed and it works as expected.
(BZ#2055137)
With this fix, the object version access is corrected, thereby preventing object lock violation.
(BZ#2108394)
With this fix, a check on the pointer has been implemented into the call path and Ceph Object Gateway returns a permission
error, rather than crashing, if it is uninitialized.
(BZ#2109256)
With this fix, the code in the Ceph Object Gateway that parses dates in x-amz-date format is changed to also accept the new
date format.
(BZ#2109675)
New logic in processing of lifecycle shards prevents stalling due to deleted buckets
Previously, changes were made to cause lifecycle processing to continuously cycle across days, that is, to not restart from the
beginning of the list of eligible buckets each day. However, the changes contained a bug that stalled the processing of
lifecycle shards that contained deleted buckets.
With this fix, logic is introduced to skip over the deleted buckets, and the processing no longer stalls.
(BZ#2118295)
With this fix, header processing is corrected and new diagnostics are added. The logic now works as expected.
(BZ#2123335)
With this fix, the corrected logic no longer logs the warning in inappropriate circumstances.
(BZ#2126787)
With this fix, care is taken to prevent various operations from creating bucket index shards and recover when the race
condition is encountered. PUT object operations now always write to the correct bucket index shards.
(BZ#2145022)
Blocksize is changed to 4K
Previously, Ceph Object Gateway GC processing would consume excessive time due to the use of a 1K blocksize when
consuming the GC queue. This caused slower processing of large GC queues.
With this fix, blocksize is changed to 4K, which has accelerated the processing of large GC queues.
(BZ#2215062)
With this release, the </body> and </html> closing tags are sent to the client under all required conditions. The value of the
Content-Length header field correctly represents the length of the data sent to the client, and the client no longer resets the
connection for an incorrect Content-Length reason.
(BZ#2227048)
With this fix, archive zone versioning is always enabled irrespective of bucket versioning changes on other zones. Bucket
versioning in the archive zone no longer gets suspended.
(BZ#1957088)
With this update, the radosgw-admin sync status command does not get stuck and works as expected.
(BZ#1749627)
Processes trimming retired bucket index entries no longer cause radosgw instance to crash
Previously, under some circumstances, processes trimming retired bucket index entries could access an uninitialized pointer
variable, causing the radosgw instance to crash.
With this fix, the variable is initialized immediately before use and the radosgw instance no longer crashes.
(BZ#2139258)
With this fix, bucket sync run is given control logic which enables it to run the sync from oldest outstanding to current and all
objects are now synced as expected.
(BZ#2066453)
With this fix, the logic error responsible for confusing the source and destination bucket information is corrected and the
policies execute correctly.
(BZ#2108886)
With this fix, variable access is fixed and the potential fault can no longer occur.
(BZ#2123423)
With this fix, requests that access the bucket but do not specify a valid bucket are denied, resulting in an error instead of a
crash.
(BZ#2139422)
RADOS
Performing a DR test with a two-site stretch cluster no longer causes Ceph to become unresponsive
Previously, when performing a DR test with a two-site stretch cluster, removing and adding new monitors to the cluster would
cause an incorrect rank in the ConnectionTracker class. Due to this, the monitor would fail to identify itself in the
peer_tracker copy and would never update its correct field, causing a deadlock in the election process which would lead to
Ceph becoming unresponsive.
With this fix:
- An assert is added in the function notify_rank_removed() to compare the expected rank provided by the Monmap against the rank that is manually adjusted, as a sanity check.
- The variable removed_ranks is cleared from every Monmap update.
- An action is added to manually reset peer_tracker.rank when executing the ceph connection scores reset command for each monitor, so that peer_tracker.rank matches the current rank of the monitor.
- Functions are added in the Elector and ConnectionTracker classes to check for a clean peer_tracker when upgrading the monitors, including during boot up. If found unclean, peer_tracker is cleared.
- Because the user can choose to manually remove a monitor rank before shutting down the monitor, causing inconsistency in the Monmap, Monitor::notify_new_monmap() now prevents removal of our rank or of ranks that don't exist in the Monmap.
The cluster now works as expected and there is no unwarranted downtime. The cluster no longer becomes unresponsive
when performing a DR test with a two-site stretch cluster.
(BZ#2142674)
Rank is removed from the live_pinging and dead_pinging set to mitigate the inconsistent connectivity score issue
Previously, when removing two monitors consecutively, if the rank size was equal to Paxos’s size, the monitor would hit a
condition in which it did not remove the rank from the dead_pinging set. The rank remained in the dead_pinging set, which
caused problems, such as an inconsistent connectivity score when stretch-cluster mode was enabled.
With this fix, a case is added where the highest ranked monitor is removed, that is, when the rank is equal to Paxos’s size,
remove the rank from the live_pinging and dead_pinging set. The monitor stays healthy with a clean live_pinging and
dead_pinging set.
(BZ#2142174)
The Prometheus metrics now reflect the correct Ceph version for all Ceph Monitors whenever requested
Previously, the Prometheus metrics reported mismatched Ceph versions for Ceph Monitors when the monitor was upgraded.
As a result, the active Ceph Manager daemon needed to be restarted to resolve this inconsistency.
With this fix, the Ceph Monitors explicitly send metadata update requests with mon metadata to mgr when MON election is
over.
(BZ#2008524)
(BZ#2107407)
The correct set of replicas are used for remapped placement groups
Previously, for remapped placement groups, the wrong set of replicas would be queried for the scrub information causing a
failure of the scrub process, after identifying mismatches that would not exist.
With this fix, the correct set of replicas are now queried.
The ceph daemon heap status command shows the heap status
Previously, due to a failure to get heap information through the ceph daemon command, the ceph daemon heap stats
command would return empty output instead of the current heap usage for a Ceph daemon. This was because
ceph::osd_cmds::heap() confused stderr with stdout, which caused the difference in output.
With this fix, the ceph daemon heap stats command returns heap usage information for a Ceph daemon, similar to the output
of the ceph tell command.
(BZ#2119100)
Ceph Monitors no longer crash when using ceph orch apply mon <num> command
Previously, when the command ceph orch apply mon <num> was used to decrease the number of monitors in a cluster, cephadm
removed the monitors before shutting them down, causing the monitors to crash.
With this fix, a sanity check is added to all code paths that check whether the peer rank is more than or equal to the size of the
ranks from the monitor map. If the condition is satisfied, then skip certain operations that lead to the monitor crashing. The
peer rank eventually resolves itself in the next version of the monitor map. The monitors no longer crash when removed from
the monitor map before shutting down.
(BZ#2142141)
End-user can now see the scrub or deep-scrub starts message from the Ceph cluster log
Previously, due to the scrub or deep-scrub starts message missing in the Ceph cluster log, the end user could not tell from the
Ceph cluster log whether PG scrubbing had started for a PG.
With this fix, the scrub or deep-scrub starts message is reintroduced. The Ceph cluster log now shows the message for a PG,
whenever it goes for a scrubbing or deep-scrubbing process.
(BZ#2091773)
With this fix, the check in the manager that deals with the initial service map is relaxed and there is no assertion during the
Ceph Manager failover.
(BZ#2095062)
With this fix, SnapMapper’s legacy conversion is updated to match the new key format. The cloned objects in earlier
versions of Ceph can now be easily removed after an upgrade.
(BZ#2107405)
With this fix, deferred replay no longer overwrites BlueFS data and some RocksDB errors do not occur, such as:
osd_superblock corruption.
.sst files checksum error.
Note: Do not write deferred data as the write location might either contain a proper object or be empty. It is not possible to
corrupt object data this way. BlueFS is the only entity that can allocate this space.
(BZ#2109886)
Corrupted dups entries of a PG Log can be removed by off-line and on-line trimming
Previously, trimming of PG log dups entries could be prevented during the low-level PG split operation, which is used by the
PG autoscaler with far higher frequency than by a human operator. Stalling the trimming of dups resulted in significant memory growth.
With this fix, both off-line, using the ceph-objectstore-tool command, and on-line, within OSD, trimming can remove
corrupted dups entries of a PG log that jammed the on-line trimming machinery and were responsible for the memory growth.
A debug improvement is implemented that prints the number of dups entries to the OSD’s log to help future investigations.
(BZ#2119853)
Manager continues to send beacons in the event of an error during authentication check
Previously, if an error was encountered when performing an authentication check with a monitor, the manager would get into a
state where it would no longer have an active connection. Due to this, the manager could no longer send beacons and the
monitor would mark it as lost.
With this fix, a session (active con) is reopened in the event of an error and the manager is able to continue to send beacons
and is no longer marked as lost.
(BZ#2192479)
With this fix, the implementation defect is fixed and the rbd info command no longer fails, even when it is run while the image is
being flattened.
(BZ#1989527)
Removing a pool with pending Block Device tasks no longer causes all the tasks to hang
Previously, due to an implementation defect, removing a pool with pending Block Device tasks caused all Block Device tasks,
including other pools, to hang. To resume hung Block Device tasks, the administrator had to restart the ceph-mgr daemon.
With this fix, the implementation defect is fixed and removing a pool with pending RBD tasks no longer causes any hangs.
Block Device tasks for the removed pool are cleaned up. Block Device tasks for other pools continue executing uninterrupted.
(BZ#2150968)
Object map for the snapshot accurately reflects the contents of the snapshot
Previously, due to an implementation defect, a stale snapshot context would be used when handling a write-like operation.
Due to this, the object map for the snapshot was not guaranteed to accurately reflect the contents of the snapshot if the
snapshot was taken without quiescing the workload. In differential backup and snapshot-based mirroring use cases with the
object-map and/or fast-diff features enabled, the destination image could become corrupted.
With this fix, the implementation defect is fixed and everything works as expected.
(BZ#2216188)
RBD Mirroring
The image replayer shuts down as expected
Previously, due to an implementation defect, a request to shut down a particular image replayer would cause the rbd-
mirror daemon to hang indefinitely, especially in cases where the daemon was blocklisted on the remote storage cluster.
With this fix, the implementation defect is fixed and a request to shut down a particular image replayer no longer causes the
rbd-mirror daemon to hang and the image replayer shuts down as expected.
(BZ#2086471)
The rbd mirror pool peer bootstrap create command guarantees correct monitor addresses in the bootstrap token
Previously, a bootstrap token generated with the rbd mirror pool peer bootstrap create command contained monitor
addresses as specified by the mon_host option in the ceph.conf file. This was fragile and caused issues to users, such as
causing confusion between V1 and V2 endpoints, specifying only one of them, grouping them incorrectly, and the like.
(BZ#2122130)
With this fix, a dependency on ansible-collection-ansible is created when deploying ceph-ansible, and the cephadm-
adopt.yml playbook completes successfully.
(BZ#2207872)
With this fix, the SELinux separation for the container is disabled and all Ceph containers start successfully.
(BZ#2222003)
Known issues
This section documents known issues found in this release of IBM Storage Ceph.
As a workaround, if the user needs to find the version, the daemons' container names include the version.
(BZ#2125382)
Encryption of multipart uploads requires special handling around the part boundaries because each part is uploaded and
encrypted separately. In multi-site, objects are encrypted, and multipart uploads are replicated as a single part. As a result,
the replicated copy loses its knowledge about the original part boundaries required to decrypt the data correctly, which
causes this corruption.
As a workaround, multi-site users should not use server-side encryption for multipart uploads. For more detailed information,
see the KCS Server-side encryption with RGW multisite configuration might lead to data corruption of multipart objects.
(BZ#2214252)
The updated IBM Storage Ceph source code packages are available at the following location:
Asynchronous updates
This section describes the bug fixes, known issues, and enhancements of the z-stream releases for IBM Storage Ceph 5.3.
Enhancements
This section describes the enhancements in the IBM Storage Ceph 5.3z6 release.
Bug fixes
This section describes the bug fixes in the IBM Storage Ceph 5.3z6 release.
Known issues
This section documents known issues found in IBM Storage Ceph 5.3z6 release.
Enhancements
This section describes the enhancements in the IBM Storage Ceph 5.3z6 release.
BZ#2255436
BZ#2240089
With this enhancement, such buckets perform an ordered bucket listing more quickly.
BZ#2239433
RADOS
Improved protection against running BlueStore twice
Previously, advisory locking was used to protect against running BlueStore twice. This worked well on bare-metal deployments.
However, when used in containers, it could create unrelated inodes that targeted the same mknod b block device. As a result,
two containers might assume that they had exclusive access, which led to severe errors.
With this release, protection against running OSDs twice at the same time on one block device is improved by reinforcing
advisory locking with the O_EXCL open flag dedicated for block devices. It is no longer possible to open one BlueStore
instance twice, and the overwrite and corruption no longer occur.
BZ#2239455
With this enhancement, you can view the detailed descriptions of delayed sub-events for operations.
BZ#2240839
With this update, one finisher thread is added for each Ceph Manager module, so each module has a separate thread for
running its commands. Even if one module’s command hangs, the other modules can still run.
BZ#2234610
Bug fixes
This section describes the bug fixes in the IBM Storage Ceph 5.3z6 release.
Cephadm
Ceph File System
Ceph Object Gateway
RADOS
Ceph-Ansible
Cephadm
The cephadm-adopt playbook completes with an IPv6 address
Previously, due to the single quotes around the IPv6 address, the cephadm-adopt playbook would fail because the IPv6 address
was not recognized as a valid IP address.
With this fix, the single quotes around the IPv6 address are removed and the cephadm-adopt playbook completes
successfully with an IPv6 setup.
BZ#2153448
With this fix, custom config files are added to the tcmu-runner container as well as the rbd-target-api container. You
can now use custom config files to bindmount a tcmu.conf file into the tcmu-runner containers deployed by cephadm.
BZ#2193419
Important: Be careful when configuring mds_max_snaps_per_dir and snapshot scheduling limits to avoid unintentional
deactivation of snapshot schedules due to the file system returning a "Too many links" error if the mds_max_snaps_per_dir
limit is breached.
BZ#2227806
With this fix, the module captures all exceptions. The resulting traceback is also dumped to the console or the log file to report
unexpected events. As a result, the module continues to stay operational, providing a better user experience.
BZ#2227810
With this fix, the client always sends a caps revocation acknowledgment to the MDS Daemon, even when no inode exists,
and the MDS Daemon no longer gets stuck.
BZ#2227997
With this fix, split_realms are not sent to the kclient, which works as expected.
BZ#2228001
Laggy clients are now evicted only if there are no laggy OSDs
Previously, monitoring performance dumps from the MDS would sometimes show that the OSDs were laggy,
objecter.op_laggy and objecter.osd_laggy, causing clients to become laggy (unable to flush dirty data for cap revokes).
BZ#2228039
With this release, the Python librados supports iterating object omap key/values with unicode or binary keys and the iteration
continues as expected.
BZ#2232164
With this fix, the old commits are reverted and there is no longer a deadlock between unlink and reintegration requests.
BZ#2233886
With this fix, MDS ensures that the locks are obtained in the correct order and the requests are processed correctly.
BZ#2236190
With this fix, the client that is causing the buildup is blocklisted and evicted, allowing the MDS to work as expected.
BZ#2238665
With this fix, the ENOTEMPTY output is detected for the subvolume group rm command when there is a subvolume present
inside the subvolume group, and the message is displayed correctly.
BZ#2240727
The next client replay request is queued automatically while in the up:client-replay state
Previously, MDS would not queue the next client request for replay in the up:client-replay state causing the MDS to hang
in that state.
With this fix, the next client replay request is queued automatically as part of the request clean up and the MDS proceeds with
failover recovery normally.
BZ#2244868
The MDS no longer crashes when the journal logs are flushed
Previously, when the journal logs were successfully flushed, the lockers’ state could be set to LOCK_SYNC or
LOCK_PREXLOCK while the xclock count was non-zero. However, the MDS would not allow that and would crash.
With this fix, the MDS allows the lockers’ state to be set to LOCK_SYNC or LOCK_PREXLOCK while the xclock count is non-zero,
and the MDS does not crash.
BZ#2248825
With this fix, only one reintegration is triggered for each case and no redundant reintegration is triggered.
With this fix, the loner member is set to true and as a result the corresponding request is not blocked.
BZ#2251768
With this fix, you can add MDS metadata with FSMap changes in batches to ensure consistency. The ceph mds metadata
command functions as expected across upgrades.
BZ#2255035
With this fix, the standby-replay MDS daemons trim their caches and keep the cache usage below the configured limit and no
“MDSs report oversized cache” warnings are emitted.
BZ#2257421
With this fix, the counters are updated during replay and the perf dump command works as expected.
BZ#2259297
With this fix, a test for reshardable bucket layout is added to prevent such crashes. In case of immediate and scheduled
resharding a descriptive error message is displayed, and for dynamic bucket resharding the bucket is simply skipped.
BZ#2245335
The user modify --placement-id command can now be used with an empty --storage-class argument
Previously, if the --storage-class argument was not used when running the user modify --placement-id command, the
command would fail.
With this fix, the --storage-class argument can be left empty and the command works as expected.
BZ#2245699
The rados cppool command ceases the operation without the --yes-i-really-mean-it parameter
The rados cppool command does not preserve self-managed snapshots when copying an RBD pool, so it must be run with the
--yes-i-really-mean-it parameter. Previously, this obligatory switch was not enforced for RBD pools.
With this fix, if the user omits this switch, the rados cppool command ceases the operation and exits with a warning.
BZ#2252781
With this fix, the logic is fixed and evicting a single client does not evict other clients.
BZ#2237391
RADOS
The scrub starts message is logged for a PG only when it goes through the scrubbing process
Previously, the scrub reservation would get canceled for a PG, causing frequent scrubbing of that PG. This resulted in
multiple scrub messages being printed in the cluster log for the same PG.
With this fix, the scrub starts message is logged for a PG only when the replicas confirm the scrub reservation and the PG is
going through the scrubbing process.
BZ#2211758
With this fix, the code is reintroduced, and the spillover appears properly.
BZ#2237880
With this fix, the libcephsqlite zeros short reads at the correct region of the buffer with no corruption.
BZ#2240144
With this fix, an exception is added to handle the ceph health mute command and the monitor handles this exception.
The command errors out and notifies that the command has the wrong syntax.
BZ#2247232
With this fix, the correct CRUSH location of the OSD’s parent (host) is determined based on the host mask. This allows the
change to propagate to the OSDs on the host. All the OSDs hosted by the machine are notified whenever the auto-tuner
applies a new osd_memory_target and the change is reflected.
BZ#2249014
Normalized: mgr/dashboard/ssl_server_port
Localized: mgr/dashboard/x/ssl_server_port
However, the pretty-printed (for example, JSON) version of the command only showed the normalized option name as shown
in the example above. The ceph config dump command output was therefore inconsistent with and without the pretty-print
option.
BZ#2249017
BZ#2253672
Ceph-Ansible
The “manage nodes with cephadm - ipv4/6” task option works as expected
Previously, when there was more than one IP address in the cephadm_mgmt_network, the tasks with the “manage nodes
with cephadm - ipv4/6” parameter would be skipped because the condition was not met. The condition tests whether the
cephadm_mgmt_network is an IP address, and a list of IPs is not a single address.
With this fix, the first IP is extracted from the list to test whether it is an IP address, and the “manage nodes with cephadm - ipv4/6”
task works as expected.
BZ#2231469
The Ceph packages now install without stopping any of the running Ceph services
Previously, during the upgrade, all Ceph services stopped running as the Ceph 4 packages would be uninstalled instead of
updating.
With this fix, the new Ceph 5 packages are installed during upgrades and do not impact the running Ceph processes.
BZ#2233444
Known issues
This section documents known issues found in IBM Storage Ceph 5.3z6 release.
Ceph Dashboard
Ceph Dashboard
Some metrics are displayed as null leading to blank spaces in graphs
Some metrics on the Ceph dashboard are shown as null, which leads to blank spaces in the graphs, because a metric is not
initialized until it has a value.
As a workaround, edit the Grafana panel in which the issue is present. From the Edit menu, click Migrate and select Connect
Nulls. Choose Always and the issue is resolved.
Concepts
Learn about the architecture, data security, and hardening concepts for IBM Storage Ceph.
Architecture
Data Security and Hardening
Ceph architecture
Core Ceph components
Ceph client components
Ceph on-wire encryption
Ceph architecture
IBM Ceph Storage cluster is a distributed data object store designed to provide excellent performance, reliability and scalability.
Distributed object stores are the future of storage, because they accommodate unstructured data, and because clients can use
modern object interfaces and legacy interfaces simultaneously.
For example:
Filesystem interface
The power of IBM Ceph Storage cluster can transform your organization’s IT infrastructure and your ability to manage vast amounts
of data, especially for cloud computing platforms like Red Hat Enterprise Linux OSP. IBM Ceph Storage cluster delivers extraordinary
scalability–thousands of clients accessing petabytes to exabytes of data and beyond.
At the heart of every Ceph deployment is the IBM Ceph Storage cluster. It consists of three types of daemons:
Ceph Monitor
A Ceph Monitor maintains a master copy of the IBM Ceph Storage cluster map with the current state of the IBM Ceph Storage
cluster. Monitors require high consistency, and use Paxos to ensure agreement about the state of the IBM Ceph Storage
cluster.
Ceph Manager
The Ceph Manager maintains detailed information about placement groups, process metadata and host metadata in lieu of
the Ceph Monitor—significantly improving performance at scale. The Ceph Manager handles execution of many of the read-
only Ceph CLI queries, such as placement group statistics. The Ceph Manager also provides the RESTful monitoring APIs.
Figure 1. Daemons
Ceph client interfaces read data from and write data to the IBM Ceph Storage cluster. Clients need the following data to
communicate with the IBM Ceph Storage cluster:
The Ceph configuration file, or the cluster name (usually ceph) and the monitor address.
Ceph clients maintain object IDs and the pool names where they store the objects. However, they do not need to maintain an object-
to-OSD index or communicate with a centralized object index to look up object locations. Instead, Ceph clients provide an object name
and pool name to librados, which computes an object’s placement group and the primary OSD for storing and retrieving data using
the CRUSH (Controlled Replication Under Scalable Hashing) algorithm. The Ceph client connects to the primary OSD where it may
perform read and write operations. There is no intermediary server, broker or bus between the client and the OSD.
When an OSD stores data, it receives data from a Ceph client—whether the client is a Ceph Block Device, a Ceph Object Gateway, a
Ceph Filesystem or another interface and it stores the data as an object.
NOTE: An object ID is unique across the entire cluster, not just an OSD’s storage media.
Ceph OSDs store all data as objects in a flat namespace. There are no hierarchies of directories. An object has a cluster-wide unique
identifier, binary data, and metadata consisting of a set of name/value pairs.
Figure 2. Object
Ceph clients define the semantics for the client’s data format. For example, the Ceph block device maps a block device image to a
series of objects stored across the cluster.
NOTE: Objects consisting of a unique ID, data, and name/value paired metadata can represent both structured and unstructured
data, as well as legacy and leading edge data storage interfaces.
Compress data
To the Ceph client interface that reads and writes data, a IBM Storage Ceph cluster looks like a simple pool where it stores data.
However, librados and the storage cluster perform many complex operations in a manner that is completely transparent to the
client interface. Ceph clients and Ceph OSDs both use the CRUSH (Controlled Replication Under Scalable Hashing) algorithm. The
following sections provide details on how CRUSH enables Ceph to perform these operations seamlessly.
Prerequisites
Ceph pools
Ceph authentication
Ceph placement groups
Prerequisites
Ceph pools
The Ceph storage cluster stores data objects in logical partitions called Pools. Ceph administrators can create pools for particular
types of data, such as for block devices, object gateways, or simply just to separate one group of users from another.
From the perspective of a Ceph client, the storage cluster is very simple. When a Ceph client reads or writes data using an I/O
context, it always connects to a storage pool in the Ceph storage cluster. The client specifies the pool name, a user and a secret key,
so the pool appears to act as a logical partition with access controls to its data objects.
In actual fact, a Ceph pool is not only a logical partition for storing object data. A pool plays a critical role in how the Ceph storage
cluster distributes and stores data. However, these complex operations are completely transparent to the Ceph client.
Pool Type: In early versions of Ceph, a pool simply maintained multiple deep copies of an object. Today, Ceph can maintain
multiple copies of an object, or it can use erasure coding to ensure durability. The data durability method is pool-wide, and
does not change after creating the pool. The pool type defines the data durability method when creating the pool. Pool types
are completely transparent to the client.
Placement Groups: In an exabyte scale storage cluster, a Ceph pool might store millions of data objects or more. Ceph must
handle many types of operations, including data durability via replicas or erasure code chunks, data integrity by scrubbing or
CRC checks, replication, rebalancing and recovery. Consequently, managing data on a per-object basis presents a scalability
and performance bottleneck. Ceph addresses this bottleneck by sharding a pool into placement groups. The CRUSH algorithm
computes the placement group for storing an object and computes the Acting Set of OSDs for the placement group. CRUSH
puts each object into a placement group. Then, CRUSH stores each placement group in a set of OSDs. System administrators
set the placement group count when creating or modifying a pool.
CRUSH Ruleset: CRUSH plays another important role: CRUSH can detect failure domains and performance domains. CRUSH
can identify OSDs by storage media type and organize OSDs hierarchically into nodes, racks, and rows. CRUSH enables Ceph
OSDs to store object copies across failure domains. For example, copies of an object may get stored in different server rooms,
aisles, racks and nodes. If a large part of a cluster fails, such as a rack, the cluster can still operate in a degraded state until
the cluster recovers.
Additionally, CRUSH enables clients to write data to particular types of hardware, such as SSDs, hard drives with SSD journals, or
hard drives with journals on the same drive as the data. The CRUSH ruleset determines failure domains and performance domains
for the pool. Administrators set the CRUSH ruleset when creating a pool.
NOTE: An administrator CANNOT change a pool’s ruleset after creating the pool.
Replica pools store multiple deep copies of an object using the CRUSH failure domain to physically separate one data
object copy from another. That is, copies get distributed to separate physical hardware. This increases durability during
hardware failures.
Erasure coded pools store each object as K+M chunks, where K represents data chunks and M represents coding
chunks. The sum K+M is the number of OSDs used to store the object, and the M value is the number of
OSDs that can fail while the data can still be restored.
From the client perspective, Ceph is elegant and simple. The client simply reads from and writes to pools. However, pools play an
important role in data durability, performance and high availability.
Ceph authentication
To identify users and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system, which
authenticates users and daemons.
NOTE: The cephx protocol does not address data encryption for data transported over the network or data stored in OSDs.
Cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client’s secret
key. The authentication protocol enables both parties to prove to each other that they have a copy of the key without actually
revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is
sure that the cluster has a copy of the secret key.
Cephx
A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each monitor can authenticate users and distribute keys, so
there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure similar to a
Kerberos ticket that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the user’s
permanent secret key, so that only the user can request services from the Ceph monitors. The client then uses the session key to
request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the
OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with
any OSD or metadata server in the cluster. Like Kerberos, cephx tickets expire, so an attacker cannot use an expired ticket or session
key obtained surreptitiously. This form of authentication will prevent attackers with access to the communications medium from
either creating bogus messages under another user’s identity or altering another user’s legitimate messages, as long as the user’s
secret key is not divulged before it expires.
To use cephx, an administrator must set up users first. In the following diagram, the client.admin user invokes ceph auth
get-or-create-key from the command line to generate a username and secret key. Ceph’s auth subsystem generates the
username and key, stores a copy with the monitor(s) and transmits the user’s secret back to the client.admin user. This means
that the client and the monitor share a secret key.
NOTE: The client.admin user must provide the user ID and secret key to the user in a secure manner.
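In practice, a client application authenticates with nothing more than its user name and a keyring that holds the shared secret; the secret itself is never sent over the wire. The following is a minimal sketch using the python-rados bindings, assuming those bindings are installed and that a hypothetical client.john user and keyring have already been created and distributed by client.admin:

import rados

# Connect as the cephx user "client.john" (illustrative name). The conffile and
# keyring paths are assumptions; adjust them for your deployment.
cluster = rados.Rados(
    conffile="/etc/ceph/ceph.conf",
    name="client.john",
    conf={"keyring": "/etc/ceph/ceph.client.john.keyring"},
)
cluster.connect()          # cephx handshake: session key and tickets are obtained here
print(cluster.get_fsid())  # confirms the client can talk to the cluster
cluster.shutdown()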
Figure 1. CephX
A PG is a subset of a pool that serves to contain a collection of objects. Ceph shards a pool into a series of PGs. Then, the CRUSH
algorithm takes the cluster map and the status of the cluster into account and distributes the PGs evenly and pseudo-randomly to
OSDs in the cluster.
When a system administrator creates a pool, CRUSH creates a user-defined number of PGs for the pool. Generally, the number of
PGs should be a reasonably fine-grained subset of the data. For example, 100 PGs per OSD per pool would mean that each PG
contains approximately 1% of the pool’s data.
The number of PGs has a performance impact when Ceph needs to move a PG from one OSD to another OSD. If the pool has too few
PGs, Ceph will move a large percentage of the data simultaneously and the network load will adversely impact the cluster’s
performance. If the pool has too many PGs, Ceph will use too much CPU and RAM when moving tiny percentages of the data and
thereby adversely impact the cluster’s performance. For details on calculating the number of PGs to achieve optimal performance,
see PG Count
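As a rough illustration of this trade-off, a widely used rule of thumb targets on the order of 100 PGs per OSD divided by the pool's replica count (or K+M for erasure-coded pools), rounded up to a power of two. The sketch below implements only that heuristic; the target value and the rounding are assumptions, so always confirm the result against the PG Count guidance:

def suggested_pg_count(num_osds: int, pool_size: int, target_pgs_per_osd: int = 100) -> int:
    # Rule-of-thumb PG count: (OSDs x target) / pool size, rounded up to a power of two.
    raw = (num_osds * target_pgs_per_osd) / pool_size
    pg_num = 1
    while pg_num < raw:        # round up to the nearest power of two
        pg_num *= 2
    return pg_num

# 20 OSDs and a 3-way replicated pool: 666.7 raw, rounded up to 1024.
print(suggested_pg_count(num_osds=20, pool_size=3))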
Ceph ensures against data loss by storing replicas of an object or by storing erasure code chunks of an object. Since Ceph stores
objects or erasure code chunks of an object within PGs, Ceph replicates each PG in a set of OSDs called the Acting Set for each copy
of an object or each erasure code chunk of an object. A system administrator can determine the number of PGs in a pool and the
number of replicas or erasure code chunks. However, the CRUSH algorithm calculates which OSDs are in the acting set for a
particular PG.
The CRUSH algorithm and PGs make Ceph dynamic. Changes in the cluster map or the cluster state may result in Ceph moving PGs
from one OSD to another automatically.
Expanding the Cluster: When adding a new host and its OSDs to the cluster, the cluster map changes. Since CRUSH evenly
and pseudo-randomly distributes PGs to OSDs throughout the cluster, adding a new host and its OSDs means that CRUSH will
reassign some of the pool’s placement groups to those new OSDs. That means that system administrators do not have to
rebalance the cluster manually. Also, it means that the new OSDs contain approximately the same amount of data as the other
OSDs. This also means that new OSDs do not contain only newly written data, preventing hot spots in the cluster.
An OSD Fails: When an OSD fails, the state of the cluster changes. Ceph temporarily loses one of the replicas or erasure code
chunks, and needs to make another copy. If the primary OSD in the acting set fails, the next OSD in the acting set becomes the
primary and CRUSH calculates a new OSD to store the additional copy or erasure code chunk.
By managing millions of objects within the context of hundreds to thousands of PGs, the Ceph storage cluster can grow, shrink and
recover from failure efficiently.
For Ceph clients, the CRUSH algorithm via librados makes the process of reading and writing objects very simple. A Ceph client
simply writes an object to a pool or reads an object from a pool. The primary OSD in the acting set can write replicas of the object or
erasure code chunks of the object to the secondary OSDs in the acting set on behalf of the Ceph client.
The Ceph client via librados connects directly to the primary OSD within an acting set when writing and reading objects. Since I/O
operations do not use a centralized broker, network oversubscription is typically NOT an issue with Ceph.
The following diagram depicts how CRUSH assigns objects to PGs, and PGs to OSDs. The CRUSH algorithm assigns the PGs to OSDs
such that each OSD in the acting set is in a separate failure domain, which typically means the OSDs will always be on separate
server hosts and sometimes in separate racks.
To map placement groups to OSDs, a CRUSH map defines a hierarchical list of bucket types. The list of bucket types are located
under types in the generated CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf nodes by their failure
domains and/or performance domains, such as drive type, hosts, chassis, racks, power distribution units, pods, rows, rooms, and
data centers.
With the exception of the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary. Administrators may define it according
to their own needs if the default types don’t suit their requirements. CRUSH supports a directed acyclic graph that models the Ceph
OSD nodes, typically in a hierarchy. So Ceph administrators can support multiple hierarchies with multiple root nodes in a single
CRUSH map. For example, an administrator can create a hierarchy representing higher cost SSDs for high performance, and a
separate hierarchy of lower cost hard drives with SSD journals for moderate performance.
The only inputs required by the client are the object ID and the pool name. It is simple: Ceph stores data in named pools. When a
client wants to store a named object in a pool it takes the object name, a hash code, the number of PGs in the pool and the pool
name as inputs; then, CRUSH (Controlled Replication Under Scalable Hashing) calculates the ID of the placement group and the
primary OSD for the placement group.
1. The client inputs the pool name and the object ID. For example, pool = liverpool and object-id = john.
2. CRUSH takes the object ID and hashes it.
3. CRUSH calculates the hash modulo of the number of PGs to get a PG ID. For example, 58.
4. CRUSH calculates the primary OSD corresponding to the PG ID.
5. The client gets the pool ID given the pool name. For example, the pool liverpool is pool number 4.
6. The client prepends the pool ID to the PG ID. For example, 4.58.
7. The client performs an object operation such as write, read, or delete by communicating directly with the Primary OSD in the
Acting Set.
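The following is a simplified sketch of steps 2 through 6 above. Real clients use Ceph's rjenkins object-name hash and a stable modulo so that the mapping degrades gracefully when pg_num changes, and CRUSH (not shown here) then derives the acting set; the generic CRC-32 hash and the pg_num value below are stand-ins for illustration only:

import zlib

def placement_group_id(pool_id: int, object_name: str, pg_num: int) -> str:
    # Toy version of the client-side placement calculation described above.
    object_hash = zlib.crc32(object_name.encode())   # stand-in for Ceph's object-name hash
    pg = object_hash % pg_num                        # hash modulo the number of PGs
    return f"{pool_id}.{pg}"                         # prepend the pool ID to the PG ID

# Pool "liverpool" is pool number 4 and the object ID is "john", as in the text.
print(placement_group_id(pool_id=4, object_name="john", pg_num=128))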
The topology and state of the Ceph storage cluster are relatively stable during a session. Empowering a Ceph client via librados to
compute object locations is much faster than requiring the client to make a query to the storage cluster over a chatty session for
each read/write operation. The CRUSH algorithm allows a client to compute where objects should be stored, and enables the client
to contact the primary OSD in the acting set directly to store or retrieve data in the objects. Since a cluster at the exabyte scale has
thousands of OSDs, network oversubscription between a client and a Ceph OSD is not a significant problem. If the cluster state
changes, the client can simply request an update to the cluster map from the Ceph monitor.
Ceph replication
Like Ceph clients, Ceph OSDs can contact Ceph monitors to retrieve the latest copy of the cluster map. Ceph OSDs also use the
CRUSH algorithm, but they use it to compute where to store replicas of objects. In a typical write scenario, a Ceph client uses the
CRUSH algorithm to compute the placement group ID and the primary OSD in the Acting Set for an object. When the client writes the
object to the primary OSD, the primary OSD finds the number of replicas that it should store. The value is found in the
osd_pool_default_size setting. Then, the primary OSD takes the object ID, pool name and the cluster map and uses the CRUSH
algorithm to calculate the IDs of secondary OSDs for the acting set. The primary OSD writes the object to the secondary OSDs. When
the primary OSD receives an acknowledgment from the secondary OSDs and the primary OSD itself completes its write operation, it
acknowledges a successful write operation to the Ceph client.
Figure 1. Replicated IO
With the ability to perform data replication on behalf of Ceph clients, Ceph OSD Daemons relieve Ceph clients from that duty, while
ensuring high data availability and data safety.
NOTE: The primary OSD and the secondary OSDs are typically configured to be in separate failure domains. CRUSH computes the
IDs of the secondary OSDs with consideration for the failure domains.
Data copies
In a replicated storage pool, Ceph needs multiple copies of an object to operate in a degraded state. Ideally, a Ceph
storage cluster enables a client to read and write data even if one of the OSDs in an acting set fails. For this reason, Ceph defaults to
making three copies of an object, with a minimum of two copies clean for write operations.
In an erasure-coded pool, Ceph needs to store chunks of an object across multiple OSDs so that it can operate in a degraded state.
Similar to replicated pools, ideally an erasure-coded pool enables a Ceph client to read and write in a degraded state.
IMPORTANT: IBM supports the following jerasure coding values for k and m:
k=8 m=3
k=8 m=4
k=4 m=2
More specifically, N = K+M, where the variable K is the original number of data chunks. The variable M stands for the extra or
redundant chunks that the erasure code algorithm adds to provide protection from failures, and the variable N is the total number of
chunks created after the erasure coding process. The value of M is simply N-K, which means that the algorithm computes N-K
redundant chunks from K original data chunks. This approach guarantees that Ceph can access all the original data, and the system is
resilient to arbitrary N-K failures. For instance, in a 10 K of 16 N configuration, or erasure coding 10/16, the erasure code algorithm
adds six extra chunks to the 10 base chunks K. For example, in an M = N-K or 16-10 = 6 configuration, Ceph spreads the 16
chunks N across 16 OSDs. The original file could be reconstructed from 10 verified chunks even if 6 OSDs fail, ensuring that the
IBM Ceph Storage cluster will not lose data and providing a very high level of fault tolerance.
Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations. In replicated pools, Ceph
makes a deep copy of each object in the placement group on the secondary OSDs in the set. For erasure coding, the process is a bit
different. An erasure coded pool stores each object as K+M chunks. It is divided into K data chunks and M coding chunks. The pool is
configured to have a size of K+M so that Ceph stores each chunk in an OSD in the acting set. Ceph stores the rank of the chunk as an
attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and sends them to the other OSDs.
The primary OSD is also responsible for maintaining an authoritative version of the placement group logs.
For example, in a typical configuration a system administrator creates an erasure coded pool to use six OSDs and sustain the loss of
two of them. That is, (K+M = 6) such that (M = 2).
When Ceph writes the object NYAN containing ABCDEFGHIJKL to the pool, the erasure encoding algorithm splits the content into
four data chunks by simply dividing the content into four parts: ABC, DEF, GHI, and JKL. The algorithm will pad the content if the
content length is not a multiple of K. The function also creates two coding chunks: the fifth with YXY and the sixth with QGC. Ceph
stores each chunk on an OSD in the acting set, where it stores the chunks in objects that have the same name, NYAN, but reside on
different OSDs. The algorithm must preserve the order in which it created the chunks as an attribute of the object shard_t, in
addition to its name. For example, Chunk 1 contains ABC and Ceph stores it on OSD5 while chunk 5 contains YXY and Ceph stores it
on OSD4.
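A minimal sketch of the data-chunking half of this example is shown below, assuming K=4 as in the NYAN object. The M coding chunks are produced by the erasure-code plugin, for example jerasure, and are not computed here:

def split_into_data_chunks(payload: bytes, k: int) -> list:
    # Divide an object's payload into K equally sized data chunks, padding if needed.
    remainder = len(payload) % k
    if remainder:
        payload += b"\0" * (k - remainder)           # pad so the length is a multiple of K
    chunk_len = len(payload) // k
    return [payload[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

print(split_into_data_chunks(b"ABCDEFGHIJKL", k=4))  # [b'ABC', b'DEF', b'GHI', b'JKL']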
Splitting data into chunks is independent from object placement. The CRUSH ruleset along with the erasure-coded pool profile
determines the placement of chunks on the OSDs. For instance, using the Locally Repairable Code (lrc) plugin in the erasure code
profile creates additional chunks and requires fewer OSDs to recover from. For example, in an lrc profile configuration K=4 M=2
L=3, the algorithm creates six chunks (K+M), just as the jerasure plugin would, but the locality value (L=3) requires that the
algorithm create 2 more chunks locally. The algorithm creates the additional chunks as such, (K+M)/L. If the OSD containing chunk
0 fails, this chunk can be recovered by using chunks 1, 2 and the first local chunk. In this case, the algorithm only requires 3 chunks
for recovery instead of 5.
Reference
For more information about CRUSH, the erasure-coding profiles, and plugins, see Storage strategies.
For more details on Object Map, see Ceph client object map.
Ceph ObjectStore
ObjectStore provides a low-level interface to an OSD’s raw block device. When a client reads or writes data, it interacts with the
ObjectStore interface. Ceph write operations are essentially ACID transactions: that is, they provide Atomicity, Consistency,
Isolation and Durability. ObjectStore ensures that a Transaction is all-or-nothing to provide Atomicity. The ObjectStore
also handles object semantics. An object stored in the storage cluster has a unique identifier, object data and metadata. So
ObjectStore provides Consistency by ensuring that Ceph object semantics are correct. ObjectStore also provides the
Isolation portion of an ACID transaction by invoking a Sequencer on write operations to ensure that Ceph write operations occur
sequentially. In contrast, an OSD’s replication or erasure coding functionality provides the Durability component of the ACID
transaction. Since ObjectStore is a low-level interface to storage media, it also provides performance statistics.
Memstore
A developer implementation for testing read/write operations directly in RAM.
K/V Store
An internal implementation for Ceph’s use of key/value databases.
Since administrators generally only address BlueStore, the following sections describe only BlueStore in greater detail.
Ceph BlueStore
BlueStore is the next generation storage implementation for Ceph. As the market for storage devices now includes solid state
drives or SSDs and non-volatile memory over PCI Express or NVMe, their use in Ceph reveals some of the limitations of the
FileStore storage implementation. While FileStore has many improvements to facilitate SSD and NVMe storage, other
limitations remain. Among them, increasing placement groups remains computationally expensive, and the double write penalty
remains. Whereas FileStore interacts with a file system on a block device, BlueStore eliminates that layer of indirection and
directly consumes a raw block device for object storage. BlueStore uses the very light weight BlueFS file system on a small
partition for its k/v databases. BlueStore eliminates the paradigm of a directory representing a placement group, a file representing
an object and file XATTRs representing metadata. BlueStore also eliminates the double write penalty of FileStore, so write
operations are nearly twice as fast with BlueStore under most workloads.
Object Data
In BlueStore, Ceph stores objects as blocks directly on a raw block device. The portion of the raw block device that stores
object data does NOT contain a filesystem. The omission of the filesystem eliminates a layer of indirection and thereby
improves performance. However, much of the BlueStore performance improvement comes from the block database and
write-ahead log.
Block Database
In BlueStore, the block database handles the object semantics to guarantee Consistency. An object’s unique identifier is a
key in the block database. The values in the block database consist of a series of block addresses that refer to the stored
object data, the object’s placement group, and object metadata. The block database may reside on a BlueFS partition on the
same raw block device that stores the object data, or it may reside on a separate block device, usually when the primary block
device is a hard disk drive and an SSD or NVMe will improve performance. The block database provides a number of
improvements over FileStore; namely, the key/value semantics of BlueStore do not suffer from the limitations of
filesystem XATTRs. BlueStore may assign objects to other placement groups quickly within the block database without the
overhead of moving files from one directory to another, as is the case in FileStore. BlueStore also introduces new
features. The block database can store the checksum of the stored object data and its metadata, allowing full data checksum
operations for each read, which is more efficient than periodic scrubbing to detect bit rot. BlueStore can compress an object
and the block database can store the algorithm used to compress an object—ensuring that read operations select the
appropriate algorithm for decompression.
Write-ahead Log
In BlueStore, the write-ahead log ensures Atomicity, similar to the journaling functionality of FileStore. Like FileStore,
BlueStore logs all aspects of each transaction. However, the BlueStore write-ahead log or WAL can perform this function
simultaneously, which eliminates the double write penalty of FileStore. Consequently, BlueStore is nearly twice as fast as
FileStore on write operations for most workloads. BlueStore can deploy the WAL on the same device for storing object
data, or it may deploy the WAL on another device, usually when the primary block device is a hard disk drive and an SSD or
NVMe will improve performance.
NOTE: It is only helpful to store a block database or a write-ahead log on a separate block device if the separate device is faster than
the primary storage device. For example, SSD and NVMe devices are generally faster than HDDs. Placing the block database and the
WAL on separate devices may also have performance benefits due to differences in their workloads.
Ceph heartbeat
Ceph OSDs join a cluster and report to Ceph Monitors on their status. At the lowest level, the Ceph OSD status is up or down
reflecting whether or not it is running and able to service Ceph client requests. If a Ceph OSD is down and in the Ceph storage
cluster, this status may indicate the failure of the Ceph OSD. If a Ceph OSD is not running, for example, because it has crashed, the Ceph OSD
cannot notify the Ceph Monitor that it is down. The Ceph Monitor can ping a Ceph OSD daemon periodically to ensure that it is
running. However, heartbeating also empowers Ceph OSDs to determine if a neighboring OSD is down, to update the cluster map and
to report it to the Ceph Monitors. This means that Ceph Monitors can remain light weight processes.
Ceph peering
Ceph stores copies of placement groups on multiple OSDs. Each copy of a placement group has a status. These OSDs peer with
each other to check that they agree on the status of each copy of the PG. Peering issues usually resolve themselves.
NOTE: When Ceph monitors agree on the state of the OSDs storing a placement group, that does not mean that the placement group
has the latest contents.
When Ceph stores a placement group in an acting set of OSDs, refer to them as Primary, Secondary, and so forth. By convention, the
Primary is the first OSD in the Acting Set. The Primary that stores the first copy of a placement group is responsible for coordinating
the peering process for that placement group. The Primary is the ONLY OSD that will accept client-initiated writes to objects for a
given placement group where it acts as the Primary.
An Acting Set is a series of OSDs that are responsible for storing a placement group. An Acting Set may refer to the Ceph OSD
Daemons that are currently responsible for the placement group, or the Ceph OSD Daemons that were responsible for a particular
placement group as of some epoch.
The Ceph OSD daemons that are part of an Acting Set may not always be up. When an OSD in the Acting Set is up, it is part of the Up
Set. The Up Set is an important distinction, because Ceph can remap PGs to other Ceph OSDs when an OSD fails.
NOTE: In an Acting Set for a PG containing osd.25, osd.32 and osd.61, the first OSD, osd.25, is the Primary. If that OSD fails, the
Secondary, osd.32, becomes the Primary, and Ceph will remove osd.25 from the Up Set.
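A toy sketch of the behavior described in the note, reusing its OSD names, only to illustrate how the next OSD in the acting set takes over as Primary while the failed OSD leaves the Up Set:

acting_set = ["osd.25", "osd.32", "osd.61"]   # osd.25 is the Primary by convention
up_set = list(acting_set)                     # all three OSDs are currently up

failed_osd = "osd.25"
up_set.remove(failed_osd)                     # Ceph removes osd.25 from the Up Set
new_primary = up_set[0]                       # the Secondary, osd.32, becomes the Primary
print(new_primary, up_set)                    # osd.32 ['osd.32', 'osd.61']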
The following diagram depicts the rebalancing process where some, but not all of the PGs migrate from existing OSDs, OSD 1 and 2
in the diagram, to the new OSD, OSD 3, in the diagram. Even when rebalancing, CRUSH is stable. Many of the placement groups
remain in their original configuration, and each OSD gets some added capacity, so there are no load spikes on the new OSD after the
cluster rebalances.
Scrubbing
Ceph OSD Daemons can scrub objects within placement groups. That is, Ceph OSD Daemons can compare object metadata in
one placement group with its replicas in placement groups stored on other OSDs. Scrubbing, usually performed daily, catches
bugs or storage errors. Ceph OSD Daemons also perform deeper scrubbing by comparing data in objects bit-for-bit. Deep
scrubbing, usually performed weekly, finds bad sectors on a drive that weren’t apparent in a light scrub.
CRC Checks
In IBM Ceph Storage 5 when using BlueStore, Ceph can ensure data integrity by conducting a cyclical redundancy check
(CRC) on write operations; then, store the CRC value in the block database. On read operations, Ceph can retrieve the CRC
value from the block database and compare it with the generated CRC of the retrieved data to ensure data integrity instantly.
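The following sketch illustrates the checksum-on-write, verify-on-read flow described above. BlueStore computes crc32c checksums and keeps them in its block database; zlib.crc32 and a plain dictionary stand in here purely to show the idea:

import zlib

block_db = {}                                 # stand-in for the BlueStore block database

def write_object(name: str, data: bytes) -> None:
    block_db[name] = zlib.crc32(data)         # store the checksum with the object metadata

def read_object(name: str, data_from_disk: bytes) -> bytes:
    if zlib.crc32(data_from_disk) != block_db[name]:
        raise IOError("checksum mismatch for " + name + ": possible bit rot")
    return data_from_disk

write_object("obj1", b"hello ceph")
print(read_object("obj1", b"hello ceph"))     # passes; altered data would raise an error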
For added reliability and fault tolerance, Ceph supports a cluster of monitors. In a cluster of Ceph Monitors, latency and other faults
can cause one or more monitors to fall behind the current state of the cluster. For this reason, Ceph must have agreement among
various monitor instances regarding the state of the storage cluster. Ceph always uses a majority of monitors and the Paxos
algorithm to establish a consensus among the monitors about the current state of the storage cluster. Ceph Monitors nodes require
NTP to prevent clock drift.
Ceph clients tend to follow some similar patterns, such as object-watch-notify and striping. The following sections describe a little
bit more about RADOS, librados and common patterns used in Ceph clients.
Prerequisites
Ceph client native protocol
Ceph client object watch and notify
Ceph client Mandatory Exclusive Locks
Ceph client object map
Ceph client data striping
Prerequisites
Prerequisites
Pool Operations
Snapshots
Read/Write Objects
Create or Remove
Create/Set/Get/Remove XATTRs
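A minimal sketch of a few of the operations listed above, using the python-rados bindings. The pool name mypool and object name hello are illustrative, and the pool must already exist:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")              # I/O context bound to one pool
try:
    ioctx.write_full("hello", b"hello ceph")      # write (or overwrite) an entire object
    ioctx.set_xattr("hello", "lang", b"en")       # set an extended attribute
    print(ioctx.read("hello"), ioctx.get_xattr("hello", "lang"))
    ioctx.remove_object("hello")                  # remove the object
finally:
    ioctx.close()
    cluster.shutdown()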
Ceph client Mandatory Exclusive Locks
Mandatory Exclusive Locks is a feature that locks an RBD to a single client, if multiple mounts are in place. This helps address the
write conflict situation when multiple mounted clients try to write to the same object. This feature is built on object-watch-
notify explained in the previous section. So, when writing, if one client first establishes an exclusive lock on an object, another
mounted client will first check to see if a peer has placed a lock on the object before writing.
With this feature enabled, only one client can modify an RBD device at a time, especially when changing internal RBD structures
during operations like snapshot create/delete. It also provides some protection for failed clients. For instance, if a virtual
machine seems to be unresponsive and you start a copy of it with the same disk elsewhere, the first one will be blacklisted in Ceph
and unable to corrupt the new one.
Example
[root@mon ~]# rbd create --size 102400 mypool/myimage --image-feature 13
Here, the numeral 13 is a summation of 1, 4 and 8, where 1 enables layering support, 4 enables exclusive locking support, and 8
enables object map support. So, the above command creates a 100 GB RBD image with layering, exclusive lock, and object map enabled.
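The feature value is a simple bitmask. The sketch below reproduces the arithmetic using the feature-bit values referenced above (1, 4, and 8); treat it as an illustration of how 13 is composed rather than as an API call:

RBD_FEATURE_LAYERING = 1        # bit for layering support
RBD_FEATURE_EXCLUSIVE_LOCK = 4  # bit for exclusive locking support
RBD_FEATURE_OBJECT_MAP = 8      # bit for object map support

features = RBD_FEATURE_LAYERING | RBD_FEATURE_EXCLUSIVE_LOCK | RBD_FEATURE_OBJECT_MAP
print(features)                                   # 13, the value passed to --image-feature
print(bool(features & RBD_FEATURE_OBJECT_MAP))    # True: object map is enabled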
Mandatory Exclusive Locks is also a prerequisite for object map. Without enabling exclusive locking support, object map support
cannot be enabled.
Mandatory Exclusive Locks also does some ground work for mirroring.
Resize
Export
Copy
Flatten
Delete
Read
A shrink resize operation is like a partial delete where the trailing objects are deleted.
A copy operation knows which objects exist and need to be copied. It does not have to iterate over potentially hundreds and
thousands of possible objects.
A flatten operation performs a copy-up for all parent objects to the clone so that the clone can be detached from the parent, that is, the
reference from the child clone to the parent snapshot can be removed. So, instead of all potential objects, copy-up is done only for
the objects that exist.
A delete operation deletes only the objects that exist in the image.
A read operation skips the read for objects it knows do not exist.
So, without an object map, operations like resize (shrinking only), export, copy, flatten, and delete need to issue an
operation for every potentially affected RADOS object, whether it exists or not. With object map enabled, if the object does not exist,
the operation need not be issued.
For example, if we have a 1 TB sparse RBD image, it can have hundreds and thousands of backing RADOS objects. A delete operation
without object map enabled would need to issue a remove object operation for each potential object in the image. But if object
map is enabled, it only needs to issue remove object operations for the objects that exist.
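An illustrative sketch, not librbd code, of that difference: with an in-memory record of which backing objects exist, a delete issues remove operations only for the objects that are actually present:

def delete_image(total_objects: int, object_map: set, issue_remove) -> int:
    # Issue removes only for objects the object map says exist; return how many were issued.
    issued = 0
    for obj_no in range(total_objects):
        if obj_no in object_map:          # skip objects that were never written
            issue_remove(obj_no)
            issued += 1
    return issued

# A sparse image with 262,144 potential backing objects, only three of which exist.
existing_objects = {0, 42, 1000}
print(delete_image(262144, existing_objects, issue_remove=lambda o: None))   # prints 3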
Object map is valuable against clones that don’t have actual objects but get objects from parents. When there is a cloned image, the
clone initially has no objects and all reads are redirected to the parent. So, object map can improve reads: without the object map,
the client first issues a read operation to the OSD for the clone and, when that fails, issues another read to the parent. With object
map enabled, the client skips the read for objects it knows do not exist in the clone.
Example
[root@mon ~]# rbd create --size 102400 mypool/myimage --image-feature 13
Here, the numeral 13 is a summation of 1, 4 and 8, where 1 enables layering support, 4 enables exclusive locking support, and 8
enables object map support. So, the above command creates a 100 GB RBD image with layering, exclusive lock, and object map enabled.
Ceph provides three types of clients: Ceph Block Device, Ceph Filesystem, and Ceph Object Storage. A Ceph Client converts its data
from the representation format it provides to its users, such as a block device image, RESTful objects, CephFS filesystem directories,
into objects for storage in the Ceph Storage Cluster.
TIP: The objects Ceph stores in the Ceph Storage Cluster are not striped. Ceph Object Storage, Ceph Block Device, and the Ceph
Filesystem stripe their data over multiple Ceph Storage Cluster objects. Ceph Clients that write directly to the Ceph storage cluster
using librados must perform the striping and parallel I/O for themselves to obtain these benefits.
The simplest Ceph striping format involves a stripe count of 1 object. Ceph Clients write stripe units to a Ceph Storage Cluster object
until the object is at its maximum capacity, and then create another object for additional stripes of data. The simplest form of striping
may be sufficient for small block device images, S3 or Swift objects. However, this simple form doesn’t take maximum advantage of
Ceph’s ability to distribute data across placement groups, and consequently doesn’t improve performance very much. The following
diagram depicts the simplest form of striping:
If you anticipate large image sizes, for example large S3 or Swift objects such as video, you may see considerable read/write performance improvements by striping client data over multiple objects within an object set. Significant write performance occurs when the client writes the stripe units to their corresponding objects in parallel. Since objects get mapped to different placement groups and further mapped to different OSDs, each write occurs in parallel at the maximum write speed.
NOTE: Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically.
In the following diagram, client data gets striped across an object set (object set 1 in the following diagram) consisting of 4
objects, where the first stripe unit is stripe unit 0 in object 0, and the fourth stripe unit is stripe unit 3 in object 3.
After writing the fourth stripe, the client determines if the object set is full. If the object set is not full, the client begins writing a
stripe to the first object again, see object 0 in the following diagram. If the object set is full, the client creates a new object set, see
object set 2 in the following diagram, and begins writing to the first stripe, with a stripe unit of 16, in the first object in the new
object set, see object 4 in the diagram below.
Object Size
Objects in the Ceph Storage Cluster have a maximum configurable size, for example 2 MB or 4 MB. The object size should be large enough to accommodate many stripe units, and should be a multiple of the stripe unit.
Stripe Width
Stripes have a configurable unit size, for example 64 KB. The Ceph Client divides the data it will write to objects into equally
sized stripe units, except for the last stripe unit. A stripe width should be a fraction of the Object Size so that an object may
contain many stripe units.
Stripe Count
The Ceph Client writes a sequence of stripe units over a series of objects determined by the stripe count. The series of objects
is called an object set. After the Ceph Client writes to the last object in the object set, it returns to the first object in the object
set.
IMPORTANT: Test the performance of your striping configuration before putting your cluster into production. You CANNOT change
these striping parameters after you stripe the data and write it to objects.
Once the Ceph Client has striped data to stripe units and mapped the stripe units to objects, Ceph’s CRUSH algorithm maps the
objects to placement groups, and the placement groups to Ceph OSD Daemons before the objects are stored as files on a storage
disk.
NOTE: Since a client writes to a single pool, all data striped into objects get mapped to placement groups in the same pool. So they
use the same CRUSH map and the same access controls.
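As an illustration, the striping parameters can be supplied when a Ceph Block Device image is created; the pool name, image name, and specific values shown here are assumptions chosen only for the sketch:
# rbd create mypool/myimage --size 102400 --object-size 4M --stripe-unit 64K --stripe-count 4
With these illustrative values, data is written in 64 KB stripe units across an object set of four 4 MB objects before the next object set is created.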
The second version of Ceph’s on-wire protocol, msgr2, includes several new features:
The Ceph daemons bind to multiple ports allowing both the legacy, v1-compatible, and the new, v2-compatible, Ceph clients to
connect to the same storage cluster. Ceph clients or other Ceph daemons connecting to the Ceph Monitor daemon will try to use the
v2 protocol first, if possible, but if not, then the legacy v1 protocol will be used. By default, both messenger protocols, v1 and v2, are
enabled. The new v2 port is 3300, and the legacy v1 port is 6789, by default.
The messenger v2 protocol has two configuration options that control whether the v1 or the v2 protocol is used:
ms_bind_msgr1
This option controls whether a daemon binds to a port speaking the v1 protocol; it is true by default.
ms_bind_msgr2
This option controls whether a daemon binds to a port speaking the v2 protocol; it is true by default.
Similarly, two options control whether IPv4 and IPv6 addresses are used:
ms_bind_ipv4
This option controls whether a daemon binds to an IPv4 address; it is true by default.
ms_bind_ipv6
This option controls whether a daemon binds to an IPv6 address; it is true by default.
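For example, these options can be adjusted at runtime with the ceph config command; the values shown here, disabling the legacy v1 protocol and enabling IPv6 binding, are illustrative only:
# ceph config set global ms_bind_msgr1 false
# ceph config set global ms_bind_ipv6 true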
The messenger v2 protocol supports two connection modes:
crc
Provides strong initial authentication when a connection is established, and a CRC32 integrity check.
secure
Provides strong initial authentication when a connection is established, and full encryption of all post-authentication traffic, including a cryptographic integrity check.
Ensure that you consider cluster CPU requirements when you plan the IBM Storage Ceph cluster deployment to include encryption
overhead.
IMPORTANT: Using secure mode is supported by Ceph clients using librbd, such as OpenStack Nova, Glance, and Cinder.
Address Changes
For both versions of the messenger protocol to coexist in the same storage cluster, the address formatting has changed:
Old address format: IP_ADDR:PORT/CLIENT_ID
New address format: PROTOCOL_VERSION:IP_ADDR:PORT/CLIENT_ID
Because the Ceph daemons now bind to multiple ports, the daemons display multiple addresses instead of a single address. Here is
an example from a dump of the monitor map:
epoch 1
fsid 50fcf227-be32-4bcb-8b41-34ca8370bd17
last_changed 2021-12-12 11:10:46.700821
created 2021-12-12 11:10:46.700821
min_mon_release 14 (nautilus)
0: [v2:10.0.0.10:3300/0,v1:10.0.0.10:6789/0] mon.a
1: [v2:10.0.0.11:3300/0,v1:10.0.0.11:6789/0] mon.b
2: [v2:10.0.0.12:3300/0,v1:10.0.0.12:6789/0] mon.c
Also, the mon_host configuration option and addresses specified on the command line, using -m, support the new address format.
Connection Phases
Banner
On connection, both the client and the server send a banner. Currently, the Ceph banner is ceph v2\n.
Authentication Exchange
All data, sent or received, is contained in a frame for the duration of the connection. The server decides if authentication has
completed, and what the connection mode will be. The frame format is fixed, and can be in three different forms depending on
the authentication flags being used.
Message Exchange
The client and server start exchanging messages, until the connection is closed.
Preface
Introduction to IBM Storage Ceph
Supporting Software
Preface
Edit online
This document provides advice and good practice information for hardening the security of IBM Storage Ceph, with a focus on the
Ceph Orchestrator using cephadm for IBM Storage Ceph deployments.
While following the instructions in this guide will help harden the security of your environment, we do not guarantee security or
compliance from following these recommendations.
All IBM Storage Ceph deployments consist of a storage cluster commonly referred to as the Ceph Storage Cluster or RADOS (Reliable
Autonomous Distributed Object Store), which consists of three types of daemons:
Ceph Monitors (ceph-mon): Ceph monitors provide a few critical functions such as establishing an agreement about the state
of the cluster, maintaining a history of the state of the cluster such as whether an OSD is up and running and in the cluster,
providing a list of pools through which clients write and read data, and providing authentication for clients and the Ceph
Storage Cluster daemons.
Ceph Managers (ceph-mgr): Ceph Manager daemons track runtime metrics and the current state of the storage cluster, such as storage utilization and system load, and host the Ceph module framework used by the dashboard and other monitoring integrations.
Ceph OSDs (ceph-osd): Ceph Object Storage Daemons (OSDs) store and serve client data, replicate client data to secondary Ceph OSD daemons, track and report to Ceph Monitors on their health and on the health of neighboring OSDs, dynamically recover from failures, and backfill data when the cluster size changes, among other functions.
All IBM Storage Ceph deployments store end-user data in the Ceph Storage Cluster or RADOS (Reliable Autonomous Distributed
Object Store). Generally, users DO NOT interact with the Ceph Storage Cluster directly; rather, they interact with a Ceph client.
Ceph Object Gateway (radosgw): The Ceph Object Gateway, also known as RADOS Gateway, radosgw or rgw provides an
object storage service with RESTful APIs. Ceph Object Gateway stores data on behalf of its clients in the Ceph Storage Cluster
or RADOS.
Ceph Block Device (rbd): The Ceph Block Device provides copy-on-write, thin-provisioned, and cloneable virtual block
devices to a Linux kernel via Kernel RBD (krbd) or to cloud computing solutions like OpenStack via librbd.
Ceph File System (cephfs): The Ceph File System consists of one or more Metadata Servers (mds), which store the inode
portion of a file system as objects on the Ceph Storage Cluster. Ceph file systems can be mounted via a kernel client, a FUSE
client, or via the libcephfs library for cloud computing solutions like OpenStack.
Additional clients include librados, which enables developers to create custom applications that interact with the Ceph Storage Cluster, and command-line interface clients for administrative purposes.
Supporting Software
Edit online
An important aspect of IBM Storage Ceph security is to deliver solutions that have security built-in upfront, that IBM supports over
time. Specific steps which IBM takes with IBM Storage Ceph include:
Maintaining upstream relationships and community involvement to help focus on security from the start.
Selecting and configuring packages based on their security and performance track records.
Building binaries from associated source code (instead of simply accepting upstream builds).
Applying a suite of inspection and quality assurance tools to prevent an extensive array of potential security issues and
regressions.
Digitally signing all released packages and distributing them through cryptographically authenticated distribution channels.
In addition, IBM maintains a dedicated security team that analyzes threats and vulnerabilities against our products, and provides
relevant advice and updates through the Customer Portal. This team determines which issues are important, as opposed to those
that are mostly theoretical problems. The IBM Product Security team maintains expertise in, and makes extensive contributions to
the upstream communities associated with our subscription products. A key part of the process, IBM Security Advisories, deliver
proactive notification of security flaws affecting IBM solutions, along with patches that are frequently distributed on the same day
the vulnerability is first published.
Threat Actors
Edit online
A threat actor is an abstract way to refer to a class of adversary that you might attempt to defend against. The more capable the
actor, the more rigorous the security controls that are required for successful attack mitigation and prevention. Security is a matter of
balancing convenience, defense, and cost, based on requirements.
In some cases, it’s impossible to secure an IBM Storage Ceph deployment against all threat actors described here. When deploying
IBM Storage Ceph, you must decide where the balance lies for your deployment and usage.
As part of your risk assessment, you must also consider the type of data you store and any accessible resources, as this will also
influence certain actors. However, even if your data is not appealing to threat actors, they could simply be attracted to your
computing resources.
Nation-State Actors: This is the most capable adversary. Nation-state actors can bring tremendous resources against a
target. They have capabilities beyond that of any other actor. It’s difficult to defend against these actors without stringent
controls in place, both human and technical.
Serious Organized Crime: This class describes highly capable and financially driven groups of attackers. They are able to fund
in-house exploit development and target research. In recent years, the rise of organizations such as the Russian Business
Network, a massive cyber-criminal enterprise, has demonstrated how cyber attacks have become a commodity. Industrial
espionage falls within the serious organized crime group.
Highly Capable Groups: This refers to ‘Hacktivist’ type organizations who are not typically commercially funded, but can pose
a serious threat to service providers and cloud operators.
Motivated Individuals Acting Alone: These attackers come in many guises, such as rogue or malicious employees,
disaffected customers, or small-scale industrial espionage.
Script Kiddies: These attackers don’t target a specific organization, but run automated vulnerability scanning and
exploitation. They are often a nuisance; however, compromise by one of these actors is a major risk to an organization’s
reputation.
The following practices can help mitigate some of the risks identified above:
Security Updates: You must consider the end-to-end security posture of your underlying physical infrastructure, including
networking, storage, and server hardware. These systems will require their own security hardening practices. For your IBM
Storage Ceph deployment, you should have a plan to regularly test and deploy security updates.
Product Updates: IBM recommends running product updates as they become available. Updates are typically released every
six weeks (and occasionally more frequently). IBM endeavors to make point releases and z-stream releases fully compatible
within a major release in order to not require additional integration testing.
Access Management: Access management includes authentication, authorization, and accounting. Authentication is the
process of verifying the user’s identity. Authorization is the process of granting permissions to an authenticated user.
Accounting is the process of tracking which user performed an action. When granting system access to users, apply the
principle of least privilege, and only grant users the granular system privileges they actually need. This approach can also help
mitigate the risks of both malicious actors and typographical errors from system administrators.
Manage Insiders: You can help mitigate the threat of malicious insiders by applying careful assignment of role-based access
control (minimum required access), using encryption on internal interfaces, and using authentication/authorization security
(such as centralized identity management). You can also consider additional non-technical options, such as separation of
duties and irregular job role rotation.
Security Zones
Public Security Zone: The public security zone is an entirely untrusted area of the cloud infrastructure. It can refer to the
Internet as a whole or simply to networks that are external to your OpenStack deployment over which you have no authority.
Any data with confidentiality or integrity requirements that traverse this zone should be protected using compensating
controls such as encryption. The public security zone SHOULD NOT be confused with the Ceph Storage Cluster’s front- or
client-side network, which is referred to as the public_network in IBM Storage Ceph and is usually NOT part of the public
security zone or the Ceph client security zone.
Ceph Client Security Zone: With IBM Storage Ceph, the Ceph client security zone refers to networks accessing Ceph clients
such as Ceph Object Gateway, Ceph Block Device, Ceph Filesystem, or librados. The Ceph client security zone is typically
behind a firewall separating itself from the public security zone. However, Ceph clients are not always protected from the
public security zone. It is possible to expose the Ceph Object Gateway’s S3 and Swift APIs in the public security zone.
Storage Access Security Zone: The storage access security zone refers to internal networks providing Ceph clients with
access to the Ceph Storage Cluster. We use the phrase storage access security zone so that this document is consistent with
the terminology used in the OpenStack Platform Security and Hardening Guide. The storage access security zone includes the
Ceph Storage Cluster’s front- or client-side network, which is referred to as the public_network in IBM Storage Ceph.
Ceph Cluster Security Zone: The Ceph cluster security zone refers to the internal networks providing the Ceph Storage
Cluster’s OSD daemons with network communications for replication, heartbeating, backfilling, and recovery. The Ceph
cluster security zone includes the Ceph Storage Cluster’s backside network, which is referred to as the cluster_network in
IBM Storage Ceph. These security zones can be mapped separately, or combined to represent the majority of the possible
areas of trust within a given IBM Storage Ceph deployment. Security zones should be mapped out against your specific
deployment topology. The zones and their trust requirements will vary depending upon whether the storage cluster is
operating in a standalone capacity or is serving a public, private, or hybrid cloud.
Reference
In some cases, IBM Storage Ceph administrators might want to consider securing integration points at a higher standard than any of
the zones in which the integration point resides. For example, the Ceph Cluster Security Zone can be isolated from other security
zones easily, because there is no reason for it to connect to other security zones. By contrast, the Storage Access Security Zone must
provide access to port 6789 on Ceph monitor nodes, and ports 6800-7300 on Ceph OSD nodes. However, port 3000 should be
exclusive to the Storage Access Security Zone, because it provides access to Ceph Grafana monitoring information that should be
exposed to Ceph administrators only. A Ceph Object Gateway in the Ceph Client Security Zone will need to access the Ceph Cluster
Security Zone’s monitors (port 6789) and OSDs (ports 6800-7300), and may expose its S3 and Swift APIs to the Public Security
Zone such as over HTTP port 80 or HTTPS port 443; yet, it may still need to restrict access to the admin API.
As core services usually span at least two zones, special consideration must be given when applying security controls to them.
Security-Optimized Architecture
Edit online
The Ceph Storage Cluster daemons, Ceph Monitor, Ceph Manager, and Ceph OSD, operate within the storage cluster's own security zones and do not need to expose their services outside of those zones. By contrast, IBM Storage Ceph clients such as Ceph Block Device (rbd), Ceph Filesystem (cephfs), and Ceph Object Gateway (rgw) access the IBM Storage Ceph cluster, but expose their services to other cloud computing platforms.
IMPORTANT: Security zone separation might be insufficient for protection if an attacker gains access to Ceph clients on the public
network.
There are situations where there is a security requirement to assure the confidentiality or integrity of network traffic, and where IBM
Storage Ceph uses encryption and key management, including:
SSH
Encryption in Transit
Encryption at Rest
SSH
SSL Termination
Messenger v2 protocol
Encryption in transit
Encryption at Rest
SSH
Edit online
All nodes in the IBM Storage Ceph cluster use SSH as part of deploying the cluster. This means that on each node:
A cephadm user exists with passwordless root privileges.
The SSH service is enabled and, by extension, port 22 is open.
IMPORTANT: Any person with access to the cephadm user by extension has permission to run commands as root on any node in
the IBM Storage Ceph cluster.
Reference
SSL Termination
Edit online
The Ceph Object Gateway may be deployed in conjunction with HAProxy and keepalived for load balancing and failover. Earlier
versions of Civetweb do not support SSL and later versions support SSL with some performance limitations.
You can configure the Beast front-end web server to use the OpenSSL library to provide Transport Layer Security (TLS).
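A sketch of a Ceph Orchestrator service specification that enables SSL on the Beast front end; the service ID, host name, and certificate contents are placeholders and are assumptions for this example:
service_type: rgw
service_id: rgw.example
placement:
  hosts:
    - host01
spec:
  ssl: true
  rgw_frontend_port: 443
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----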
When using HAProxy and keepalived to terminate SSL connections, the HAProxy and keepalived components use encryption
keys.
When using HAProxy and keepalived to terminate SSL, the connection between the load balancer and the Ceph Object Gateway is
NOT encrypted.
Reference
For more information, see Configuring SSL for Beast and High availability service.
Messenger v2 protocol
Edit online
The second version of Ceph’s on-wire protocol, msgr2, has the following features:
A secure mode that encrypts all data moving through the network.
Encapsulation improvement of authentication payloads, enabling future integration of new authentication modes.
Improvements to feature advertisement and negotiation.
The messenger v2 protocol has two configuration options that control whether the v1 or the v2 protocol is used:
ms_bind_msgr1 - This option controls whether a daemon binds to a port speaking the v1 protocol; it is true by default.
ms_bind_msgr2 - This option controls whether a daemon binds to a port speaking the v2 protocol; it is true by default.
Similarly, two options control whether IPv4 and IPv6 addresses are used:
ms_bind_ipv4 - This option controls whether a daemon binds to an IPv4 address; it is true by default.
ms_bind_ipv6 - This option controls whether a daemon binds to an IPv6 address; it is true by default.
NOTE: The ability to bind to multiple ports has paved the way for dual-stack IPv4 and IPv6 support.
The messenger v2 protocol supports two connection modes:
crc - Provides strong initial authentication when a connection is established, and a CRC32 integrity check.
secure - Provides strong initial authentication when a connection is established, and full encryption of all post-authentication traffic, including a cryptographic integrity check.
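For example, the connection-mode options can be set with the ceph config command to require the secure mode for cluster-internal, service, and client connections; requiring secure everywhere is shown only as an illustration, and the additional CPU cost of encryption should be evaluated first:
# ceph config set global ms_cluster_mode secure
# ceph config set global ms_service_mode secure
# ceph config set global ms_client_mode secure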
Also, the Ceph Object Gateway supports encryption with customer-provided keys using its S3 API.
IMPORTANT: To comply with regulatory compliance standards requiring strict encryption in transit, administrators MUST deploy the
Ceph Object Gateway with client-side encryption.
System administrators integrating Ceph as a backend for OpenStack Platform 13 MUST encrypt Ceph block device volumes using
dm_crypt for RBD Cinder to ensure on-wire encryption within the Ceph storage cluster.
IMPORTANT: To comply with regulatory compliance standards requiring strict encryption in transit, system administrators MUST use
dmcrypt for RBD Cinder to ensure on-wire encryption within the Ceph storage cluster.
Reference
Encryption in transit
Edit online
The secure mode setting for messenger v2 encrypts communication between Ceph daemons and Ceph clients, providing end-to-
end encryption.
Reference
Encryption at Rest
Edit online
IBM Storage Ceph supports encryption at rest in a few scenarios:
1. Ceph Storage Cluster: The Ceph Storage Cluster supports Linux Unified Key Setup or LUKS encryption of Ceph OSDs and their
corresponding journals, write-ahead logs, and metadata databases. In this scenario, Ceph will encrypt all data at rest
irrespective of whether the client is a Ceph Block Device, Ceph Filesystem, or a custom application built on librados.
2. Ceph Object Gateway: The Ceph storage cluster supports encryption of client objects. Additionally, the data transmitted between the Ceph Object Gateway and the Ceph Storage Cluster is in encrypted form.
The Ceph storage cluster supports encrypting data stored in Ceph OSDs. IBM Storage Ceph can encrypt logical volumes with lvm by
specifying dmcrypt; that is, lvm, invoked by ceph-volume, encrypts an OSD’s logical volume, not its physical volume. It can
encrypt non-LVM devices like partitions using the same OSD key. Encrypting logical volumes allows for more configuration flexibility.
Ceph uses LUKS v1 rather than LUKS v2, because LUKS v1 has the broadest support among Linux distributions.
When creating an OSD, lvm will generate a secret key and pass the key to the Ceph Monitors securely in a JSON payload via stdin.
The attribute name for the encryption key is dmcrypt_key.
By default, Ceph does not encrypt data stored in Ceph OSDs. System administrators must enable dmcrypt to encrypt data stored in
Ceph OSDs. When using a Ceph Orchestrator service specification file for adding Ceph OSDs to the storage cluster, set the following
option in the file to encrypt Ceph OSDs:
Example
...
encrypted: true
...
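A fuller sketch of such a service specification file; the service ID, placement pattern, and device selection are assumptions and should be adapted to the environment:
service_type: osd
service_id: osd_encrypted
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
  encrypted: true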
NOTE: LUKS and dmcrypt only address encryption for data at rest, not encryption for data in transit.
The Ceph Object Gateway supports encryption with customer-provided keys using its S3 API. When using customer-provided keys,
the S3 client passes an encryption key along with each request to read or write encrypted data. It is the customer’s responsibility to
manage those keys. Customers must remember which key the Ceph Object Gateway used to encrypt each object.
Reference
IMPORTANT: The cephx protocol DOES NOT address data encryption in transport or encryption at rest.
Cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client’s secret
key. The authentication protocol is such that both parties are able to prove to each other they have a copy of the key without actually
revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is
sure that the cluster has a copy of the secret key.
In the figure below, users are either individuals or system actors such as applications, which use Ceph clients to interact with the
IBM Storage Ceph cluster daemons.
Ceph runs with authentication and authorization enabled by default. Ceph clients may specify a user name and a keyring containing
the secret key of the specified user, usually by using the command line. If the user and keyring are not provided as arguments, Ceph
will use the client.admin administrative user as the default. If a keyring is not specified, Ceph will look for a keyring by using the
keyring setting in the Ceph configuration.
IMPORTANT: To harden a Ceph cluster, keyrings SHOULD ONLY have read and write permissions for the current user and root. The
keyring containing the client.admin administrative user key must be restricted to the root user.
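As an illustration, assuming the default keyring location under /etc/ceph (paths and ownership may differ in containerized deployments), the administrative keyring can be restricted as follows:
# chown root:root /etc/ceph/ceph.client.admin.keyring
# chmod 600 /etc/ceph/ceph.client.admin.keyring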
For details on configuring the IBM Storage Ceph cluster to use authentication, see the IBM Storage Ceph Configuration Guide 5. More
specifically, see section Ceph authentication configuration.
S3 User: An access key and secret for a user of the S3 API.
Swift User: An access key and secret for a user of the Swift API. The Swift user is a subuser of an S3 user. Deleting the S3 parent user will delete the Swift user.
Administrative User: An access key and secret for a user of the administrative API. Administrative users should be created
sparingly, as the administrative user will be able to access the Ceph Admin API and execute its functions, such as creating
users, and giving them permissions to access buckets or containers and their objects among other things.
The Ceph Object Gateway stores all user authentication information in Ceph Storage cluster pools. Additional information may be
stored about users including names, email addresses, quotas, and usage.
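For example, an S3 user and a corresponding Swift subuser can be created with radosgw-admin; the user ID, display name, and access level shown here are illustrative assumptions:
# radosgw-admin user create --uid=johndoe --display-name="John Doe"
# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full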
Reference
Ceph Object Gateway controls whether to use LDAP. However, once configured, it is the LDAP server that is responsible for
authenticating users.
To secure communications between the Ceph Object Gateway and the LDAP server, IBM recommends deploying configurations with
LDAP Secure or LDAPS.
IMPORTANT: When using LDAP, ensure that access to the rgw_ldap_secret = _PATH_TO_SECRET_FILE_ secret file is secure.
Reference
For more information, see Configure LDAP and Ceph Object Gateway and Configure Active Directory and Ceph Object Gateway.
Ceph Object Gateway controls whether to use OpenStack Keystone for authentication. However, once configured, it is the OpenStack
Keystone service that is responsible for authenticating users.
Configuring the Ceph Object Gateway to work with Keystone requires converting the OpenSSL certificates that Keystone uses for
creating the requests to the nss db format.
Reference
For more information, see The Ceph Object Gateway and OpenStack Keystone.
Infrastructure Security
Edit online
The scope of this guide is IBM Storage Ceph. However, a proper IBM Storage Ceph security plan requires consideration of the
following prerequisites.
Prerequisites
Administration
Network Communication
Hardening the Network Service
Reporting
Auditing Administrator Actions
Network Communication
Edit online
IBM Storage Ceph provides two networks:
A public network.
A cluster network.
All Ceph daemons and Ceph clients require access to the public network, which is part of the storage access security zone. By
contrast, ONLY the OSD daemons require access to the cluster network, which is part of the Ceph cluster security zone.
The Ceph configuration contains public_network and cluster_network settings. For hardening purposes, specify the IP
address and the netmask using CIDR notation. Specify multiple comma-delimited IP address and netmask entries if the cluster will
have multiple subnets.
public_network = <public-network/netmask>[,<public-network/netmask>]
cluster_network = <cluster-network/netmask>[,<cluster-network/netmask>]
1. Start the firewalld service, enable it to run on boot, and ensure that it is running:
# systemctl enable firewalld
# systemctl start firewalld
# systemctl status firewalld
2. Check the firewall configuration:
# firewall-cmd --list-all
On a new installation, the sources: section should be blank, indicating that no ports have been opened specifically. The services section should indicate ssh and dhcpv6-client, meaning that the SSH service (port 22) and the DHCPv6 client are enabled.
sources:
services: ssh dhcpv6-client
3. Ensure that SELinux is running in Enforcing mode:
# getenforce
Enforcing
If SELinux is not in Enforcing mode, set it:
# setenforce 1
If SELinux is not running, enable it. For more information, see Red Hat Enterprise Linux 8 Using SELinux.
Each Ceph daemon uses one or more ports to communicate with other daemons in the IBM Storage Ceph cluster. In some cases, you
may change the default port settings. Administrators typically only change the default port with the Ceph Object Gateway or ceph-
radosgw daemon.
The Ceph clients include ceph-radosgw, ceph-mds, ceph-fuse, libcephfs, rbd, librbd, and librados. These daemons and
their hosts comprise the storage access security zone, which should use its own subnet for hardening purposes.
On the Ceph Storage Cluster zone’s hosts, consider enabling only hosts running Ceph clients to connect to the Ceph Storage Cluster
daemons. For example:
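A sketch of such a restriction using a firewalld rich rule; the zone name, source subnet, and port range are assumptions and should match your storage access network and OSD port configuration:
# firewall-cmd --zone=public --permanent --add-rich-rule='rule family="ipv4" source address="192.168.0.0/24" port protocol="tcp" port="6800-7300" accept'
# firewall-cmd --reload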
Reporting
Edit online
IBM Storage Ceph provides basic system monitoring and reporting with the ceph-mgr daemon plug-ins, namely, the RESTful API,
the dashboard, and other plug-ins such as Prometheus and Zabbix. Ceph collects this information using collectd and sockets to
retrieve settings, configuration details, and statistical information.
In addition to default system behavior, system administrators may configure collectd to report on security matters, such as
configuring the IP-Tables or ConnTrack plug-ins to track open ports and connections respectively.
System administrators may also retrieve configuration settings at runtime. For more information, see Viewing the Ceph configuration
at runtime.
Example
In distributed systems such as Ceph, actions may begin on one instance and get propagated to other nodes in the cluster. When the
action begins, the log indicates dispatch. When the action ends, the log indicates finished.
Cephx controls access to the pools storing object data. However, Ceph Storage Cluster users are typically Ceph clients, and not
users. Consequently, users generally DO NOT have the ability to write, read or delete objects directly in a Ceph Storage Cluster pool.
Depending upon the application consuming the Ceph Block Device interface, usually OpenStack Platform, users may create, modify,
and delete volumes and images. Ceph handles the create, retrieve, update, and delete operations of each individual object.
Deleting volumes and images destroys the corresponding objects in an unrecoverable manner. However, residual data artifacts may
continue to reside on storage media until overwritten. Data may also remain in backup archives.
User Authentication Information: User authentication information generally consists of user IDs, user access keys, and user
secrets. It may also comprise a user’s name and email address if provided. Ceph Object Gateway will retain user
authentication data unless the user is explicitly deleted from the system.
User Data: User data generally comprises user- or administrator-created buckets or containers, and the user-created S3 or
Swift objects contained within them. The Ceph Object Gateway interface creates one or more Ceph Storage cluster objects for
each S3 or Swift object and stores the corresponding Ceph Storage cluster objects within a data pool. Ceph assigns the Ceph
Storage cluster objects to placement groups and distributes or places them pseudo-randomly in OSDs throughout the cluster.
The Ceph Object Gateway may also store an index of the objects contained within a bucket or container to enable services such as
listing the contents of an S3 bucket or Swift container. Additionally, when implementing multi-part uploads, the Ceph Object
Gateway may temporarily store partial uploads of S3 or Swift objects.
Deleting S3 or Swift objects destroys the corresponding Ceph Storage cluster objects in an unrecoverable manner. However,
residual data artifacts may continue to reside on storage media until overwritten. Data may also remain in backup archives.
Logging: Ceph Object Gateway also stores logs of user operations that the user intends to accomplish and operations that
have been executed. This data provides traceability about who created, modified or deleted a bucket or container, or an S3 or
Swift object residing in an S3 bucket or Swift container. When users delete their data, the logging information is not affected
and will remain in storage until deleted by a system administrator or removed automatically by expiration policy.
Bucket Lifecycle
Ceph Object Gateway also supports bucket lifecycle features, including object expiration. Data retention regulations like the General
Data Protection Regulation may require administrators to set object expiration policies and disclose them to users among other
compliance factors.
Multi-site
Ceph Object Gateway is often deployed in a multi-site context whereby a user stores an object at one site and the Ceph Object
Gateway creates a replica of the object in another cluster possibly at another geographic location. For example, if a primary cluster
fails, a secondary cluster may resume operations. In another example, a secondary cluster may be in a different geographic location,
such as an edge network or content-delivery network such that a client may access the closest cluster to improve response time,
throughput, and other performance characteristics. In multi-site scenarios, administrators must ensure that each site has
implemented security measures. Additionally, if geographic distribution of data would occur in a multi-site scenario, administrators
must be aware of any regulatory implications when the data crosses political boundaries.
Enable FIPS mode on Red Hat Enterprise Linux either during system installation or after it.
For container deployments, follow the instructions in the Red Hat Enterprise Linux 8 Security Hardening Guide.
Reference
For the latest information on FIPS validation, refer to the US Government Standards.
Summary
Edit online
This document has provided only a general introduction to security for IBM Storage Ceph. Contact the IBM Storage Ceph consulting
team for additional help.
Planning
Edit online
Planning involves considering the supported compatibility, physical configuration, and various storage strategy prerequisites before
working with IBM Storage Ceph.
Compatibility
Hardware
Storage Strategies
Host Operating System: Red Hat Enterprise Linux
Version: 9.2, 9.1, 9.0, 8.8, 8.7, 8.6, 8.5, and EUS 8.4
Notes: Standard lifecycle 9.1 is included in the product (recommended). Red Hat Enterprise Linux EUS is optional.
IMPORTANT: All nodes in the cluster and their clients must use the supported OS version(s) to ensure that the version of the ceph
package is the same on all nodes. Using different versions of the ceph package is not supported.
IMPORTANT: IBM no longer supports using Ubuntu as a host operating system to deploy IBM Storage Ceph.
Executive summary
General principles for selecting hardware
Optimize workload performance domains
Server and rack solutions
Minimum hardware recommendations for containerized Ceph
Recommended minimum hardware requirements for the IBM Storage Ceph Dashboard
Executive summary
Edit online
Many hardware vendors now offer both Ceph-optimized servers and rack-level solutions designed for distinct workload profiles. To
simplify the hardware selection process and reduce risk for organizations, IBM has worked with multiple storage server vendors to
test and evaluate specific cluster options for different cluster sizes and workload profiles. IBM’s exacting methodology combines
performance testing with proven guidance for a broad range of cluster capabilities and sizes.
With appropriate storage servers and rack-level solutions, IBM Storage Ceph can provide storage pools serving a variety of
workloads—from throughput-sensitive and cost and capacity-focused workloads to emerging IOPS-intensive workloads.
IBM Storage Ceph significantly lowers the cost of storing enterprise data and helps organizations manage exponential data growth.
The software is a robust and modern petabyte-scale storage platform for public or private cloud deployments. IBM Storage Ceph
offers mature interfaces for enterprise block and object storage, making it an optimal solution for active archive, rich media, and
cloud infrastructure workloads characterized by tenant-agnostic OpenStack® environments [1]. Delivered as a unified, software-
defined, scale-out storage platform, IBM Storage Ceph lets businesses focus on improving application innovation and availability by
offering capabilities such as:
IBM Storage Ceph can run on myriad industry-standard hardware configurations to satisfy diverse needs. To simplify and accelerate
the cluster design process, IBM conducts extensive performance and suitability testing with participating hardware vendors. This
testing allows evaluation of selected hardware under load and generates essential performance and sizing data for diverse
workloads—ultimately simplifying Ceph storage cluster hardware selection. As discussed in this guide, multiple hardware vendors
now provide server and rack-level solutions optimized for IBM Storage Ceph deployments with IOPS-, throughput-, and cost and
capacity-optimized solutions as available options.
Software-defined storage presents many advantages to organizations seeking scale-out solutions to meet demanding applications
and escalating storage needs. With a proven methodology and extensive testing performed with multiple vendors, IBM simplifies the
process of selecting hardware to meet the demands of any environment. Importantly, the guidelines and example systems listed in
this document are not a substitute for quantifying the impact of production workloads on sample systems.
[1] Ceph is and has been the leading storage for OpenStack according to several semi-annual OpenStack user surveys.
[2] See Yahoo Cloud Object Store - Object Storage at Exabyte Scale for details.
Prerequisites
Edit online
IOPS optimized: IOPS optimized deployments are suitable for cloud computing operations, such as running MySQL or MariaDB instances as virtual machines on OpenStack. IOPS optimized deployments require higher performance storage such
as 15k RPM SAS drives and separate SSD journals to handle frequent write operations. Some high IOPS scenarios use all flash
storage to improve IOPS and total throughput.
Throughput optimized: Throughput-optimized deployments are suitable for serving up significant amounts of data, such as
graphic, audio and video content. Throughput-optimized deployments require networking hardware, controllers and hard disk
drives with acceptable total throughput characteristics. In cases where write performance is a requirement, SSD journals will
substantially improve write performance.
Capacity optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as
possible. Capacity-optimized deployments typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive SATA drives and co-locate journals rather than using
SSDs for journaling.
This document provides examples of IBM tested hardware suitable for these use cases.
Same controller.
Same RPMs.
Same I/O.
Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines
troubleshooting.
Storage administrators prefer that a storage cluster recovers as quickly as possible. Carefully consider bandwidth requirements for
the storage cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-
cluster traffic. Also consider that network performance is increasingly important when considering the use of Solid State Disks (SSD),
flash, NVMe, and other high performing storage devices.
Ceph supports a public network and a storage cluster network. The public network handles client traffic and communication with
Ceph Monitors. The storage cluster network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic. At a
minimum, a single 10 GB Ethernet link should be used for storage hardware, and you can add additional 10 GB Ethernet links for
connectivity and throughput.
IMPORTANT: IBM recommends allocating bandwidth to the storage cluster network, such that it is a multiple of the public network
using the osd_pool_default_size as the basis for the multiple on replicated pools. IBM also recommends running the public and
storage cluster networks on separate network cards.
IMPORTANT: IBM recommends using 10 GB Ethernet for IBM Storage Ceph deployments in production. A 1 GB Ethernet network is
not suitable for production storage clusters.
In the case of a drive failure, replicating 1 TB of data across a 1 GB Ethernet network takes 3 hours, and replicating 3 TB, a typical drive capacity, takes 9 hours. By contrast, with a 10 GB Ethernet network, the replication times would be 20 minutes and 1
hour. Remember that when a Ceph OSD fails, the storage cluster will recover by replicating the data it contained to other Ceph OSDs
within the pool.
Before installing and testing an IBM Storage Ceph cluster, verify the network throughput. Most performance-related problems in Ceph
usually begin with a networking issue. Simple network issues like a kinked or bent Cat-6 cable could result in degraded bandwidth.
Use a minimum of 10 GB ethernet for the front side network. For large clusters, consider using 40 GB ethernet for the backend or
cluster network.
IMPORTANT: For network optimization, IBM recommends using jumbo frames for a better CPU per bandwidth ratio, and a non-
blocking network switch back-plane. IBM Storage Ceph requires the same MTU value throughout all networking devices in the
communication path, end-to-end for both public and cluster networks. Verify that the MTU value is the same on all hosts and
networking equipment in the environment before using an IBM Storage Ceph cluster in production.
IMPORTANT: IBM recommends that each hard drive be exported separately from the RAID controller as a single volume with write-
back caching enabled.
This requires a battery-backed, or a non-volatile flash memory device on the storage controller. It is important to make sure the
battery is working, as most controllers will disable write-back caching if the memory on the controller can be lost as a result of a
power failure. Periodically, check the batteries and replace them if necessary, as they do degrade over time. See the storage
controller vendor’s documentation for details. Typically, the storage controller vendor provides storage management utilities to
monitor and adjust the storage controller configuration without any downtime.
Using Just a Bunch of Drives (JBOD) in independent drive mode with Ceph is supported when using all Solid State Drives (SSDs), or
for configurations with high numbers of drives per controller. For example, 60 drives attached to one controller. In this scenario, the
write-back caching can become a source of I/O contention. Since JBOD disables write-back caching, it is ideal in this scenario. One
advantage of using JBOD mode is the ease of adding or replacing drives and then exposing the drive to the operating system
immediately after it is physically plugged in.
Journaling on OSD data drives when the use case calls for an SSD journal.
Use the examples in this document of IBM tested configurations for different workloads to avoid some of the foregoing hardware
selection mistakes.
The following lists provide the criteria IBM uses to identify optimal IBM Storage Ceph cluster configurations on storage servers.
These categories are provided as general guidelines for hardware purchases and configuration decisions, and can be adjusted to
satisfy unique workload blends. Actual hardware configurations chosen will vary depending on specific workload mix and vendor
capabilities.
IOPS optimized
3x replication for hard disk drives (HDDs) or 2x replication for solid state drives (SSDs).
Throughput optimized
3x replication.
Streaming media.
A cost- and capacity-optimized storage cluster typically has the following properties:
Object archive.
To the Ceph client interface that reads and writes data, a Ceph storage cluster appears as a simple pool where the client stores data.
However, the storage cluster performs many complex operations in a manner that is completely transparent to the client interface.
Ceph clients and Ceph object storage daemons (Ceph OSDs, or simply OSDs) both use the controlled replication under scalable
hashing (CRUSH) algorithm for storage and retrieval of objects. OSDs run on OSD hosts—the storage servers within the cluster.
A CRUSH map describes a topography of cluster resources, and the map exists both on client nodes as well as Ceph Monitor (MON)
nodes within the cluster. Ceph clients and Ceph OSDs both use the CRUSH map and the CRUSH algorithm. Ceph clients communicate
directly with OSDs, eliminating a centralized object lookup and a potential performance bottleneck. With awareness of the CRUSH
map and communication with their peers, OSDs can handle replication, backfilling, and recovery—allowing for dynamic failure
recovery.
Ceph uses the CRUSH map to implement failure domains. Ceph also uses the CRUSH map to implement performance domains,
which simply take the performance profile of the underlying hardware into consideration. The CRUSH map describes how Ceph
stores data, and it is implemented as a simple hierarchy (acyclic graph) and a ruleset. The CRUSH map can support multiple
hierarchies to separate one type of hardware performance profile from another.
Hard disk drives (HDDs) are typically appropriate for cost- and capacity-focused workloads.
Throughput-sensitive workloads typically use HDDs with Ceph write journals on solid state drives (SSDs).
Network switching: Redundant network switching interconnects the cluster and provides access to clients.
Ceph MON nodes: The Ceph monitor is a datastore for the health of the entire cluster, and contains the cluster log. A
minimum of three monitor nodes are strongly recommended for a cluster quorum in production.
Ceph OSD hosts: Ceph OSD hosts house the storage capacity for the cluster, with one or more OSDs running per individual
storage device. OSD hosts are selected and configured differently depending on both workload optimization and the data
devices installed: HDDs, SSDs, or NVMe SSDs.
IBM Storage Ceph: Many vendors provide a capacity-based subscription for IBM Storage Ceph bundled with both server and
rack-level solution SKUs.
With the growing use of flash storage, organizations increasingly host IOPS-intensive workloads on Ceph storage clusters to let them
emulate high-performance public cloud solutions with private cloud storage. These workloads commonly involve structured data
from MySQL-, MariaDB-, or PostgreSQL-based applications.
Bluestore WAL/DB: High-performance, high-endurance enterprise NVMe SSD, co-located with OSDs.
NOTE: For non-NVMe SSDs, use two CPU cores per SSD OSD.
Throughput-optimized Solutions
Throughput-optimized Ceph solutions are usually centered around semi-structured or unstructured data. Large-block sequential I/O
is typical.
Networking: 10 GbE per 12 OSDs each for client- and cluster-facing networks.
Bluestore WAL/DB: High-performance, high-endurance enterprise NVMe SSD, co-located with OSDs.
Several vendors provide pre-configured server and rack-level solutions for throughput-optimized Ceph workloads. IBM has
conducted extensive testing and evaluation of servers from Supermicro and Quanta Cloud Technologies (QCT).
Table 2. Rack-level SKUs for Ceph OSDs, MONs, and top-of-rack (TOR)
switches.
Vendor Small (250TB) Medium (1PB) Large (2PB+)
SuperMicro SRS-42E112-Ceph-03 SRS-42E136-Ceph-03 SRS-42E136-Ceph-03
See Dell PowerEdge R730xd Performance and Sizing Guide for IBM Storage Ceph - A Dell IBM Technical White Paper for
details.
See Dell EMC DSS 7000 Performance & Sizing Guide for IBM Storage Ceph for details.
Cost- and capacity-optimized solutions typically focus on higher capacity, or longer archival scenarios. Data can be either semi-
structured or unstructured. Workloads include media archives, big data analytics archives, and machine image backups. Large-block
sequential I/O is typical.
Networking: 10 GbE per 12 OSDs (each for client- and cluster-facing networks).
HBA: JBOD.
Supermicro and QCT provide pre-configured server and rack-level solution SKUs for cost- and capacity-focused Ceph workloads.
Intel® Data Center Blocks for Cloud – IBM OpenStack Platform with Red Hat Ceph Storage
Red Hat Ceph Storage on Servers with Intel Processors and SSDs
RAM: This number is highly dependent on the configurable MDS cache size. The RAM requirement is typically twice as much as the amount set in the mds_cache_memory_limit configuration setting. Note also that this is the memory for your daemon, not the overall system memory.
Disk Space: 2 GB per mds-container, plus taking into consideration any additional space required for possible debug logging; 20 GB is a good start.
Network: 2x 1 GB Ethernet NICs, 10 GB recommended. Note that this is the same network as the OSD containers. If you have a 10 GB network on your OSDs, you should use the same on your MDS so that the MDS is not disadvantaged when it comes to latency.
Minimum requirements
8 GB RAM
Reference
For more information, see High-level monitoring of a Ceph storage cluster in the Administration Guide.
Storage Strategies
Edit online
Creating storage strategies for IBM Storage Ceph clusters
This section of the document provides instructions for creating storage strategies, including creating CRUSH hierarchies, estimating
the number of placement groups, determining which type of storage pool to create, and managing pools.
Overview
Crush admin overview
Placement Groups
Pools overview
Erasure code pools overview
Overview
Edit online
From the perspective of a Ceph client, interacting with the Ceph storage cluster is remarkably simple:
1. Connect to the cluster.
2. Create a pool I/O context.
This remarkably simple interface is how a Ceph client selects one of the storage strategies you define. Storage strategies are invisible
to the Ceph client in all but storage capacity and performance.
The diagram below shows the logical data flow starting from the client into the IBM Storage Ceph cluster.
Storage strategies include the storage media (hard drives, SSDs, and the rest), the CRUSH maps that set up performance and failure
domains for the storage media, the number of placement groups, and the pool interface. Ceph supports multiple storage strategies.
Use cases, cost/benefit performance tradeoffs and data durability are the primary considerations that drive storage strategies.
1. Use Cases: Ceph provides massive storage capacity, and it supports numerous use cases. For example, the Ceph Block Device
client is a leading storage backend for cloud platforms like OpenStack—providing limitless storage for volumes and images
with high performance features like copy-on-write cloning. Likewise, Ceph can provide container-based storage for OpenShift
environments. By contrast, the Ceph Object Gateway client is a leading storage backend for cloud platforms that provides
RESTful S3-compliant and Swift-compliant object storage for objects like audio, bitmap, video and other data.
2. Cost/Benefit of Performance: Faster is better. Bigger is better. High durability is better. However, there is a price for each
superlative quality, and a corresponding cost/benefit trade off. Consider the following use cases from a performance perspective: SSDs can provide very fast storage for relatively small amounts of data and journaling. Storing a database or an object index might benefit from a pool of very fast SSDs, but can prove too expensive for other data.
3. Durability: In large scale clusters, hardware failure is an expectation, not an exception. However, data loss and service
interruption remain unacceptable. For this reason, data durability is very important. Ceph addresses data durability with
multiple deep copies of an object or with erasure coding and multiple coding chunks. Multiple copies or multiple coding
chunks present an additional cost/benefit tradeoff: it’s cheaper to store fewer copies or coding chunks, but it might lead to the
inability to service write requests in a degraded state. Generally, one object with two additional copies (that is, size = 3) or
two coding chunks might allow a cluster to service writes in a degraded state while the cluster recovers. The CRUSH algorithm
aids this process by ensuring that Ceph stores additional copies or coding chunks in different locations within the cluster. This
ensures that the failure of a single storage device or node doesn’t lead to a loss of all of the copies or coding chunks necessary
to preclude data loss.
You can capture use cases, cost/benefit performance tradeoffs and data durability in a storage strategy and present it to a Ceph
client as a storage pool.
IMPORTANT: Ceph’s object copies or coding chunks make RAID obsolete. Do not use RAID, because Ceph already handles data
durability, a degraded RAID has a negative impact on performance, and recovering data using RAID is substantially slower than using
deep copies or erasure coding chunks.
1. Define a Storage Strategy: Storage strategies require you to analyze your use case, cost/benefit performance tradeoffs and
data durability. Then, you create OSDs suitable for that use case. For example, you can create SSD-backed OSDs for a high
performance pool; SAS drive/SSD journal-backed OSDs for high-performance block device volumes and images; or, SATA-
backed OSDs for low cost storage. Ideally, each OSD for a use case should have the same hardware configuration so that you
have a consistent performance profile.
2. Define a CRUSH Hierarchy: Ceph rules select a node, usually the root, in a CRUSH hierarchy, and identify the appropriate
OSDs for storing placement groups and the objects they contain. You must create a CRUSH hierarchy and a CRUSH rule for
your storage strategy. CRUSH hierarchies get assigned directly to a pool by the CRUSH rule setting.
3. Calculate Placement Groups: Ceph shards a pool into placement groups. You do not have to manually set the number of
placement groups for your pool. PG autoscaler sets an appropriate number of placement groups for your pool that remains
within a healthy maximum number of placement groups in the event that you assign multiple pools to the same CRUSH rule.
4. Create a Pool: Finally, you must create a pool and determine whether it uses replicated or erasure-coded storage. You must
set the number of placement groups for the pool, the rule for the pool and the durability, such as size or K+M coding chunks.
Remember, the pool is the Ceph client’s interface to the storage cluster, but the storage strategy is completely transparent to the
Ceph client, except for capacity and performance.
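The following sketch ties these steps together for a hypothetical SSD-backed replicated pool; the rule name, pool name, and placement group count are assumptions chosen only for illustration:
# ceph osd crush rule create-replicated fast_ssd default host ssd
# ceph osd pool create fastpool 64 64 replicated fast_ssd
# ceph osd pool set fastpool size 3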
"Any sufficiently advanced technology is indistinguishable from magic."
— Arthur C. Clarke
CRUSH introduction
CRUSH hierarchy
CRUSH introduction
Edit online
The CRUSH map for your storage cluster describes your device locations within CRUSH hierarchies and a rule for each hierarchy that
determines how Ceph stores data.
The CRUSH map contains at least one hierarchy of nodes and leaves. The nodes of a hierarchy, called "buckets" in Ceph, are any
aggregation of storage locations as defined by their type. For example, rows, racks, chassis, hosts, and devices. Each leaf of the
hierarchy consists essentially of one of the storage devices in the list of storage devices. A leaf is always contained in one node or
"bucket." A CRUSH map also has a list of rules that determine how CRUSH stores and retrieves data.
NOTE: Storage devices are added to the CRUSH map when adding an OSD to the cluster.
The CRUSH algorithm distributes data objects among storage devices according to a per-device weight value, approximating a
uniform probability distribution. CRUSH distributes objects and their replicas or erasure-coding chunks according to the hierarchical
cluster map an administrator defines. The CRUSH map represents the available storage devices and the logical buckets that contain
them for the rule, and by extension each pool that uses the rule.
To map placement groups to OSDs across failure domains or performance domains, a CRUSH map defines a hierarchical list of
bucket types; that is, under types in the generated CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
nodes by their failure domains or performance domains or both. Failure domains include hosts, chassis, racks, power distribution
units, pods, rows, rooms, and data centers. Performance domains include failure domains and OSDs of a particular configuration. For
example, SSDs, SAS drives with SSD journals, SATA drives, and so on. Devices have the notion of a class, such as hdd, ssd and
nvme to more rapidly build CRUSH hierarchies with a class of devices.
With the exception of the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and you can define it according to your
own needs if the default types do not suit your requirements. We recommend adapting your CRUSH map bucket types to your
organization’s hardware naming conventions and using instance names that reflect the physical hardware names. Your naming
practice can make it easier to administer the cluster and troubleshoot problems when an OSD or other hardware malfunctions and
the administrator needs remote or physical access to the host or other hardware.
In the following example, the bucket hierarchy has four leaf buckets (osd 1-4), two node buckets (host 1-2) and one rack node
(rack 1).
When declaring a bucket instance, specify its type, give it a unique name as a string, assign it an optional unique ID expressed as a
negative integer, specify a weight relative to the total capacity or capability of its items, specify the bucket algorithm such as
straw2, and the hash that is usually 0 reflecting hash algorithm rjenkins1. A bucket can have one or more items. The items can
consist of node buckets or leaves. Items can have a weight that reflects the relative weight of the item.
Ceph Clients: By distributing CRUSH maps to Ceph clients, CRUSH empowers Ceph clients to communicate with OSDs
directly. This means that Ceph clients avoid a centralized object look-up table that could act as a single point of failure, a
performance bottleneck, a connection limitation at a centralized look-up server and a physical limit to the storage cluster’s
scalability.
Ceph OSDs: By distributing CRUSH maps to Ceph OSDs, Ceph empowers OSDs to handle replication, backfilling and recovery.
This means that the Ceph OSDs handle storage of object replicas (or coding chunks) on behalf of the Ceph client. It also
means that Ceph OSDs know enough about the cluster to re-balance the cluster (backfilling) and recover from failures
dynamically.
For example, to address the possibility of concurrent failures, it might be desirable to ensure that data replicas or erasure coding
chunks are on devices using different shelves, racks, power supplies, controllers or physical locations. This helps to prevent data loss
and allows the cluster to operate in a degraded state.
Object Storage: Ceph hosts that serve as an object storage back end for S3 and Swift interfaces might take advantage of less
expensive storage media such as SATA drives that might not be suitable for VMs—reducing the cost per gigabyte for object
storage, while separating more economical storage hosts from higher-performing hosts intended for storing volumes and
images on cloud platforms. HTTP tends to be the bottleneck in object storage systems.
Cold Storage: Systems designed for cold storage—infrequently accessed data, or data retrieval with relaxed performance
requirements—might take advantage of less expensive storage media and erasure coding. However, erasure coding might
require a bit of additional RAM and CPU, and thus differ in RAM and CPU requirements from a host used for object storage or
VMs.
SSD-backed Pools: SSDs are expensive, but they provide significant advantages over hard disk drives. SSDs have no seek
time and they provide high total throughput. In addition to using SSDs for journaling, a cluster can support SSD-backed pools.
Common use cases include high performance SSD pools. For example, it is possible to map the .rgw.buckets.index pool
for the Ceph Object Gateway to SSDs instead of SATA drives.
A CRUSH map supports the notion of a device class. Ceph can discover aspects of a storage device and automatically assign a class
such as hdd, ssd or nvme. However, CRUSH is not limited to these defaults. For example, CRUSH hierarchies might also be used to
separate different types of workloads. For example, an SSD might be used for a journal or write-ahead log, a bucket index or for raw
object storage. CRUSH can support different device classes, such as ssd-bucket-index or ssd-object-storage so Ceph does
not use the same storage media for different workloads—making performance more predictable and consistent.
Behind the scenes, Ceph generates a crush root for each device-class. These roots should only be modified by setting or changing
device classes on OSDs. You can view the generated roots using the following command:
Example
Syntax
Syntax
Syntax
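As a sketch, the shadow roots that Ceph generates for each device class can be listed with the CRUSH tree command and its shadow option; the class-specific root and host names in the output vary by cluster:
[ceph: root@host01 /]# ceph osd crush tree --show-shadow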
CRUSH hierarchy
When declaring a bucket instance with the Ceph CLI, you must specify its type and give it a unique string name. Ceph automatically
assigns a bucket ID, sets the algorithm to straw2, sets the hash to 0 reflecting rjenkins1 and sets a weight. When modifying a
decompiled CRUSH map, assign the bucket a unique ID expressed as a negative integer (optional), specify a weight relative to the
total capacity/capability of its item(s), specify the bucket algorithm (usually straw2), and the hash (usually 0, reflecting hash
algorithm rjenkins1).
A bucket can have one or more items. The items can consist of node buckets (for example, racks, rows, hosts) or leaves (for example,
an OSD disk). Items can have a weight that reflects the relative weight of the item.
When modifying a decompiled CRUSH map, you can declare a node bucket with the following syntax:
[bucket-type] [bucket-name] {
id [a unique negative numeric ID]
weight [the relative capacity/capability of the item(s)]
alg [the bucket type: uniform | list | tree | straw2 ]
hash [the hash type: 0 by default]
item [item-name] weight [weight]
}
For example, using the diagram above, we would define two host buckets and one rack bucket. The OSDs are declared as items
within the host buckets:
host node1 {
id -1
alg straw2
hash 0
item osd.0 weight 1.00
item osd.1 weight 1.00
}
host node2 {
id -2
alg straw2
hash 0
item osd.2 weight 1.00
item osd.3 weight 1.00
}
rack rack1 {
id -3
alg straw2
hash 0
item node1 weight 2.00
item node2 weight 2.00
}
NOTE: In the foregoing example, note that the rack bucket does not contain any OSDs. Rather it contains lower level host buckets,
and includes the sum total of their weight in the item entry.
CRUSH location
Adding a bucket
Moving a bucket
Removing a bucket
CRUSH Bucket algorithms
CRUSH location
A CRUSH location is the position of an OSD in terms of the CRUSH map’s hierarchy. When you express a CRUSH location on the
command line interface, a CRUSH location specifier takes the form of a list of name/value pairs describing the OSD’s position. For
example, if an OSD is in a particular row, rack, chassis and host, and is part of the default CRUSH tree, its crush location could be
described as:
root=default row=a rack=a2 chassis=a2a host=a2a1
Note the following about the CRUSH location specifier:
1. The order of the name/value pairs does not matter.
2. The key name (left of = ) must be a valid CRUSH type. By default these include root, datacenter, room, row, pod, pdu,
rack, chassis and host. You might edit the CRUSH map to change the types to suit your needs.
3. You do not need to specify all the buckets/keys. For example, by default, Ceph automatically sets a ceph-osd daemon’s
location to be root=default host={HOSTNAME} (based on the output from hostname -s).
Adding a bucket
To add a bucket instance to your CRUSH hierarchy, specify the bucket name and its type. Bucket names must be unique in the
CRUSH map.
If you plan to use multiple hierarchies, for example, for different hardware performance profiles, consider naming buckets based on
their type of hardware or use case.
For example, you could create a hierarchy for solid state drives (ssd), a hierarchy for SAS disks with SSD journals (hdd-journal),
and another hierarchy for SATA drives (hdd):
Add an instance of each bucket type you need for your hierarchy. The following example demonstrates adding buckets for a row with
a rack of SSD hosts and a rack of hosts for object storage.
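A sketch of how such buckets might be added with ceph osd crush add-bucket; the bucket names here are illustrative rather than part of the original example:
Syntax
ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE
Example
[ceph: root@host01 /]# ceph osd crush add-bucket row1 row
[ceph: root@host01 /]# ceph osd crush add-bucket ssd-rack1 rack
[ceph: root@host01 /]# ceph osd crush add-bucket object-rack1 rack
[ceph: root@host01 /]# ceph osd crush add-bucket ssd-host1 host
[ceph: root@host01 /]# ceph osd crush add-bucket object-host1 host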
Notice that the hierarchy remains flat. You must move your buckets into a hierarchical position after you add them to the CRUSH
map.
Moving a bucket
When you create your initial cluster, Ceph has a default CRUSH map with a root bucket named default and your initial OSD hosts
appear under the default bucket. When you add a bucket instance to your CRUSH map, it appears in the CRUSH hierarchy, but it
does not necessarily appear under a particular bucket.
To move a bucket instance to a particular location in your CRUSH hierarchy, specify the bucket name and its type.
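A sketch of the move command; the bucket and parent names are illustrative:
Syntax
ceph osd crush move BUCKET_NAME BUCKET_TYPE=PARENT_NAME [BUCKET_TYPE=PARENT_NAME ...]
Example
[ceph: root@host01 /]# ceph osd crush move ssd-rack1 row=row1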
Once you have completed these steps, you can view your tree.
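For example, the hierarchy can be displayed with:
[ceph: root@host01 /]# ceph osd tree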
NOTE: You can also use ceph osd crush create-or-move to create a location while moving an OSD.
Removing a bucket
To remove a bucket instance from your CRUSH hierarchy, specify the bucket name with one of the following commands:
ceph osd crush remove BUCKET_NAME
Or:
ceph osd crush rm BUCKET_NAME
If you are removing higher level buckets (for example, a root like default), check to see if a pool uses a CRUSH rule that selects
that bucket. If so, you need to modify your CRUSH rules; otherwise, peering fails.
CRUSH Bucket algorithms
1. Uniform: Uniform buckets aggregate devices with exactly the same weight. For example, when firms commission or
decommission hardware, they typically do so with many machines that have exactly the same physical configuration (for
example, bulk purchases). When storage devices have exactly the same weight, you can use the uniform bucket type, which
allows CRUSH to map replicas into uniform buckets in constant time. With non-uniform weights, you should use another
bucket algorithm.
2. List: List buckets aggregate their content as linked lists. Based on the RUSH (Replication Under Scalable Hashing) P algorithm,
a list is a natural and intuitive choice for an expanding cluster: either an object is relocated to the newest device with some
appropriate probability, or it remains on the older devices as before. The result is optimal data migration when items are
added to the bucket. Items removed from the middle or tail of the list, however, can result in a significant amount of
unnecessary movement, making list buckets most suitable for circumstances in which they never, or very rarely shrink.
3. Tree: Tree buckets use a binary search tree. They are more efficient than list buckets when a bucket contains a larger set of
items. Based on the RUSH (Replication Under Scalable Hashing) R algorithm, tree buckets reduce the placement time to
O(log n), making them suitable for managing much larger sets of devices or nested buckets.
4. Straw2 (default): List and Tree buckets use a divide-and-conquer strategy that either gives certain items precedence, for
example, those at the beginning of a list, or obviates the need to consider entire subtrees of items at all. That improves the
performance of the replica placement process, but can also introduce suboptimal reorganization behavior when the contents
of a bucket change due to an addition, removal, or re-weighting of an item. The straw2 bucket type allows all items to fairly
“compete” against each other for replica placement through a process analogous to a draw of straws.
id
Description
The numeric ID of the OSD.
Type
Integer
Required
Yes
Example
0
name
Description
The full name of the OSD.
Type
String
Required
Yes
Example
osd.0
weight
Description
The CRUSH weight for the OSD.
Type
Double
Required
Yes
Example
2.0
root
Description
The name of the root bucket of the hierarchy or tree in which the OSD resides.
Type
Key-value pair.
Required
Yes
Example
root=default, root=replicated_rule, and so on
bucket-type
Description
One or more name-value pairs, where the name is the bucket type and the value is the bucket’s name. You can specify a CRUSH
location for an OSD in the CRUSH hierarchy.
Type
Key-value pairs.
Required
No
Example
datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
[
{
"id": -2,
"name": "ssd",
"type": "root",
"type_id": 10,
"items": [
{
"id": -6,
"name": "dell-per630-11-ssd",
"type": "host",
"type_id": 1,
"items": [
{
"id": 6,
"name": "osd.6",
"type": "osd",
"type_id": 0,
"crush_weight": 0.099991,
"depth": 2
}
]
},
{
"id": -7,
"name": "dell-per630-12-ssd",
"type": "host",
"type_id": 1,
"items": [
{
"id": 7,
"name": "osd.7",
"type": "osd",
"type_id": 0,
"crush_weight": 0.099991,
"depth": 2
}
]
},
{
"id": -8,
"name": "dell-per630-13-ssd",
"type": "host",
"type_id": 1,
"items": [
{
"id": 8,
"name": "osd.8",
"type": "osd",
"type_id": 0,
"crush_weight": 0.099991,
"depth": 2
}
]
}
]
}
]
You must prepare a Ceph OSD before you add it to the CRUSH hierarchy. Deployment utilities, such as the Ceph Orchestrator, can
perform this step for you. For example, to create a Ceph OSD on a single node:
Syntax
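A sketch of creating an OSD with the orchestrator; the host and device names are illustrative:
[ceph: root@host01 /]# ceph orch daemon add osd host01:/dev/sdb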
The CRUSH hierarchy is notional, so the ceph osd crush add command allows you to add OSDs to the CRUSH hierarchy wherever
you wish. The location you specify should reflect its actual location. If you specify at least one bucket, the command places the OSD
into the most specific bucket you specify, and it moves that bucket underneath any other buckets you specify.
Syntax
IMPORTANT: If you specify only the root bucket, the command attaches the OSD directly to the root. However, CRUSH rules expect
OSDs to be inside of hosts or chassis, and host or chassis should be inside of other buckets reflecting your cluster topology.
ceph osd crush add osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
NOTE: You can also use ceph osd crush set or ceph osd crush create-or-move to add an OSD to the CRUSH hierarchy.
IMPORTANT: Moving an OSD in the CRUSH hierarchy means that Ceph will recompute which placement groups get assigned to the
OSD, potentially resulting in significant redistribution of data.
Syntax
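A sketch of one way to move an OSD to a different CRUSH location; the OSD name, weight, and bucket names are illustrative:
[ceph: root@host01 /]# ceph osd crush set osd.0 1.0 root=default rack=rack1 host=host01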
NOTE: You can also use ceph osd crush create-or-move to move an OSD within the CRUSH hierarchy.
To remove an OSD from the CRUSH map of a running cluster, execute the following:
Syntax
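A sketch of the removal command; the OSD name is illustrative:
ceph osd crush remove OSD_NAME
Example
[ceph: root@host01 /]# ceph osd crush remove osd.6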
Device class
Ceph’s CRUSH map provides extraordinary flexibility in controlling data placement. This is one of Ceph’s greatest strengths. Early
Ceph deployments used hard disk drives almost exclusively. Today, Ceph clusters are frequently built with multiple types of storage
devices: HDD, SSD, NVMe, or even various classes of the foregoing. For example, it is common in Ceph Object Gateway deployments
to have storage policies where clients can store data on slower HDDs and other storage policies for storing data on fast SSDs. Ceph
Object Gateway deployments might even have a pool backed by fast SSDs for bucket indices. Additionally, OSD nodes also frequently
have SSDs used exclusively for journals or write-ahead logs that do NOT appear in the CRUSH map. These complex hardware
scenarios historically required manually editing the CRUSH map, which can be time-consuming and tedious. It is not required to have
different CRUSH hierarchies for different classes of storage devices.
CRUSH rules work in terms of the CRUSH hierarchy. However, if different classes of storage devices reside in the same hosts, the
process becomes more complicated—requiring users to create multiple CRUSH hierarchies for each class of device, and then disable
the osd crush update on start option that automates much of the CRUSH hierarchy management. Device classes eliminate
this tediousness by telling the CRUSH rule what class of device to use, dramatically simplifying CRUSH management tasks.
NOTE: The ceph osd tree command has a column reflecting a device class.
Reference
Syntax
Example
[ceph: root@host01 /]# ceph osd crush set-device-class hdd osd.0 osd.1
[ceph: root@host01 /]# ceph osd crush set-device-class ssd osd.2 osd.3
[ceph: root@host01 /]# ceph osd crush set-device-class bucket-index osd.4
NOTE: Ceph might assign a class to a device automatically. However, class names are simply arbitrary strings. There is no
requirement to adhere to hdd, ssd or nvme. In the foregoing example, a device class named bucket-index might indicate an SSD
that a Ceph Object Gateway pool uses exclusively for bucket index workloads.
Syntax
Example
[ceph: root@host01 /]# ceph osd crush rm-device-class hdd osd.0 osd.1
[ceph: root@host01 /]# ceph osd crush rm-device-class ssd osd.2 osd.3
[ceph: root@host01 /]# ceph osd crush rm-device-class bucket-index osd.4
Syntax
Example
[ceph: root@host01 /]# ceph osd crush class rename hdd sas15k
Syntax
Example
[
"hdd",
"ssd",
"bucket-index"
]
Syntax
0
1
2
3
4
5
6
Syntax
Example
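Presumably this lists the CRUSH rules that reference a given device class; a sketch, with an illustrative class name:
[ceph: root@host01 /]# ceph osd crush rule ls-by-class ssd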
CRUSH weights
The CRUSH algorithm assigns a weight value in terabytes (by convention) per OSD device with the objective of approximating a
uniform probability distribution for write requests that assign new data objects to PGs and PGs to OSDs. For this reason, as a best
practice, we recommend creating CRUSH hierarchies with devices of the same type and size, and assigning the same weight. We also
recommend using devices with the same I/O and throughput characteristics so that you will also have uniform performance
characteristics in your CRUSH hierarchy, even though performance characteristics do not affect data distribution.
Since using uniform hardware is not always practical, you might incorporate OSD devices of different sizes and use a relative weight
so that Ceph will distribute more data to larger devices and less data to smaller devices.
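A sketch of setting an OSD's CRUSH weight; the name and weight parameters are described below:
Syntax
ceph osd crush reweight NAME WEIGHT
Example
[ceph: root@host01 /]# ceph osd crush reweight osd.0 2.0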
Where:
name
Description
The full name of the OSD.
Required
Yes
Example
osd.0
weight
Description
The CRUSH weight for the OSD. This should be the size of the OSD in Terabytes, where 1.0 is 1 Terabyte.
Type
Double
Required
Yes
Example
2.0
This setting is used when creating an OSD or adjusting the CRUSH weight immediately after adding the OSD. It usually does not
change over the life of the OSD.
Syntax
Where,
For the purposes of ceph osd in and ceph osd out, an OSD is either in the cluster or out of the cluster. That is how a monitor
records an OSD’s status. However, even though an OSD is in the cluster, it might be experiencing a malfunction such that you do not
want to rely on it as much until you fix it (for example, replace a storage drive, change out a controller, and so on).
You can increase or decrease the in weight of a particular OSD (that is, without changing its weight in Terabytes) by executing:
Syntax
Where:
weight is a range from 0.0-1.0, where 0 is not in the cluster (that is, it does not have any PGs assigned to it) and 1.0 is in the
cluster (that is, the OSD receives the same number of PGs as other OSDs).
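A sketch of the command; the OSD ID and weight are illustrative:
Syntax
ceph osd reweight OSD_ID WEIGHT
Example
[ceph: root@host01 /]# ceph osd reweight 0 0.8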
Multiple Pools: You can assign multiple pools to a CRUSH hierarchy, but the pools might have different numbers of placement
groups, size (number of replicas to store), and object size characteristics.
Custom Clients: Ceph clients such as block device, object gateway and filesystem shard data from their clients and stripe the
data as objects across the cluster as uniform-sized smaller RADOS objects. So except for the foregoing scenario, CRUSH
usually achieves its goal. However, there is another case where a cluster can become imbalanced: namely, using librados to
store data without normalizing the size of objects. This scenario can lead to imbalanced clusters (for example, storing 100 1
MB objects and 10 4 MB objects will make a few OSDs have more data than the others).
Probability: A uniform distribution will result in some OSDs with more PGs and some with fewer. For clusters with a large
number of OSDs, the statistical outliers will be further out.
Syntax
Example
Where:
threshold is a percentage of utilization such that OSDs facing higher data storage loads will receive a lower weight and thus
fewer PGs assigned to them. The default value is 120, reflecting 120%. Any value from 100+ is a valid threshold. Optional.
weight_change_amount is the amount by which to change the weight. Valid values are greater than 0.0 and at most 1.0. The
default value is 0.05. Optional.
number_of_OSDs is the maximum number of OSDs to reweight. For large clusters, limiting the number of OSDs to reweight
prevents significant rebalancing. Optional.
no-increasing is off by default. Increasing the osd weight is allowed when using the reweight-by-utilization or
test-reweight-by-utilization commands. If this option is used with these commands, it prevents the OSD weight
from increasing, even if the OSD is underutilized. Optional.
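A sketch of the commands described above, with illustrative argument values; test-reweight-by-utilization reports the proposed changes without applying them:
[ceph: root@host01 /]# ceph osd test-reweight-by-utilization 110 0.05 20 --no-increasing
[ceph: root@host01 /]# ceph osd reweight-by-utilization 110 0.05 20 --no-increasing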
IMPORTANT: Executing reweight-by-utilization is recommended and somewhat inevitable for large clusters. Utilization rates
might change over time, and as your cluster size or hardware changes, the weightings might need to be updated to reflect changing
utilization. If you elect to reweight by utilization, you might need to re-run this command as utilization, hardware or cluster size
change.
Executing this or other weight commands that assign a weight will override the weight assigned by this command (for example, osd
reweight-by-utilization, osd crush weight, osd weight, in or out).
Syntax
Where:
poolname is the name of the pool. Ceph will examine how the pool assigns PGs to OSDs and reweight the OSDs according to
this pool’s PG distribution. Note that multiple pools could be assigned to the same CRUSH hierarchy. Reweighting OSDs
according to one pool’s distribution could have unintended effects for other pools assigned to the same CRUSH hierarchy if
they do not have the same size (number of replicas) and PGs.
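A sketch, with an illustrative utilization threshold and pool name:
[ceph: root@host01 /]# ceph osd reweight-by-pg 110 testpool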
Syntax
Primary affinity
When a Ceph Client reads or writes data, it always contacts the primary OSD in the acting set. For set [2, 3, 4], osd.2 is the
primary. Sometimes an OSD is not well suited to act as a primary compared to other OSDs (for example, it has a slow disk or a slow
controller). To prevent performance bottlenecks (especially on read operations) while maximizing utilization of your hardware, you
can set a Ceph OSD’s primary affinity so that CRUSH is less likely to use the OSD as a primary in an acting set:
Syntax
Primary affinity is 1 by default (that is, an OSD might act as a primary). You might set the OSD primary range from 0-1, where 0
means that the OSD might NOT be used as a primary and 1 means that an OSD might be used as a primary. When the weight is < 1,
it is less likely that CRUSH will select the Ceph OSD Daemon to act as a primary.
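A sketch of setting primary affinity; the OSD name and affinity value are illustrative:
Syntax
ceph osd primary-affinity OSD_NAME WEIGHT
Example
[ceph: root@host01 /]# ceph osd primary-affinity osd.4 0.5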
CRUSH rules
CRUSH rules define how a Ceph client selects buckets and the primary OSD within them to store objects, and how the primary OSD
selects buckets and the secondary OSDs to store replicas or coding chunks. For example, you might create a rule that selects a pair
of target OSDs backed by SSDs for two object replicas, and another rule that selects three target OSDs backed by SAS drives in
different data centers for three replicas.
rule <rulename> {
id <unique number>
type [replicated | erasure]
min_size <min-size>
max_size <max-size>
step take <bucket-type> [class <class-name>]
step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
step emit
}
id
Description
A unique whole number for identifying the rule.
Purpose
A component of the rule mask.
Type
Integer
Required
Yes
type
Description
Describes whether the rule is for replicated or erasure-coded storage.
Purpose
A component of the rule mask.
Type
String
Required
Yes
Default
replicated
Valid Values
replicated or erasure
min_size
Description
If a pool makes fewer replicas than this number, CRUSH will not select this rule.
Type
Integer
Purpose
A component of the rule mask.
Required
Yes
Default
1
max_size
Description
If a pool makes more replicas than this number, CRUSH will not select this rule.
Type
Integer
Purpose
A component of the rule mask.
Required
Yes
Default
10
Purpose
A component of the rule.
Required
Yes
Example
step take data
step take data class ssd
Purpose
A component of the rule.
Prerequisite
Follow step take or step choose.
Example
step choose firstn 1 type row
Purpose
A component of the rule. Usage removes the need to select a device using two steps.
Prerequisite
Follows step take or step choose.
Example
step chooseleaf firstn 0 type row
step emit
Description
Outputs the current value and empties the stack. Typically used at the end of a rule, but might also be used to pick from different
trees in the same rule.
Purpose
A component of the rule.
Prerequisite
Follows step choose.
Example
step emit
Example
You have a PG stored on OSDs 1, 2, 3, 4, 5, in which OSD 3 goes down. In the first scenario, with the firstn mode, CRUSH adjusts its
calculation to select 1 and 2, then selects 3 but discovers it is down, so it retries and selects 4 and 5, and then goes on to select a
new OSD 6. The final CRUSH mapping change is from 1, 2, 3, 4, 5 to 1, 2, 4, 5, 6. In the second scenario, with indep mode on an
erasure-coded pool, CRUSH attempts to select the failed OSD 3, tries again and picks out 6, for a final transformation from 1, 2, 3, 4,
5 to 1, 2, 6, 4, 5.
IMPORTANT: A given CRUSH rule can be assigned to multiple pools, but it is not possible for a single pool to have multiple CRUSH
rules.
Syntax
Syntax
Syntax
Ceph creates a rule with chooseleaf and one bucket of the type you specify.
Example
[ceph: root@host01 /]# ceph osd crush rule create-simple deleteme default host firstn
{ "id": 1,
"rule_name": "deleteme",
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_firstn",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
Where:
Example
[ceph: root@host01 /]# ceph osd crush rule create-replicated fast default host ssd
Syntax
Syntax
Adjusting CRUSH values might result in the shift of some PGs between storage nodes. If the Ceph cluster is already storing a
lot of data, be prepared for some fraction of the data to move.
The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they receive an
updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support
the new feature. Make sure when you upgrade your Ceph Storage Cluster daemons that you also update your Ceph clients.
If the CRUSH tunables are set to non-legacy values and then later changed back to the legacy values, ceph-osd daemons will
not be required to support the feature. However, the OSD peering process requires examining and understanding old maps.
Therefore, you should not run old versions of the ceph-osd daemon if the cluster has previously used non-legacy CRUSH
values, even if the latest version of the map has been switched back to using the legacy defaults.
CRUSH tuning
CRUSH tuning, the hard way
CRUSH legacy values
The simplest way to adjust the CRUSH tunables is by changing to a known profile. Those are:
Syntax
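In upstream Ceph, the known profiles include legacy, argonaut, bobtail, firefly, hammer, jewel, optimal, and default; a profile is applied with the following command (a sketch):
ceph osd crush tunables PROFILE_NAME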
Generally, you should set the CRUSH tunables after you upgrade, or if you receive a warning. Starting with version v0.74, Ceph issues
a health warning if the CRUSH tunables are not set to their optimal values; the optimal values are the default as of v0.73.
1. Adjust the tunables on the existing cluster. Note that this will result in some data movement (possibly as much as 10%). This
is the preferred route, but should be taken with care on a production cluster where the data movement might affect
performance. You can enable optimal tunables with:
[ceph: root@host01 /]# ceph osd crush tunables optimal
If things go poorly (for example, too much load) and not very much progress has been made, or there is a client compatibility
problem (old kernel cephfs or rbd clients, or pre-bobtail librados clients), you can switch back to an earlier profile, for example:
[ceph: root@host01 /]# ceph osd crush tunables legacy
2. You can make the warning go away without making any changes to CRUSH by adding the following option to the mon section of
the ceph.conf file:
mon_warn_on_legacy_crush_tunables = false
For the change to take effect, restart the monitors, or apply the option to running monitors with:
[ceph: root@host01 /]# ceph tell mon.\* injectargs --mon-warn-on-legacy-crush-tunables false
Adjust tunables. These values appear to offer the best behavior for both large and small clusters we tested with. You will need
to additionally specify the --enable-unsafe-tunables argument to crushtool for this to work. Please use this option
with extreme care:
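A partial sketch of what the crushtool invocations described here can look like, based on the upstream crushtool options; the file names are illustrative and the exact set of tunable values to adjust should be taken from the upstream documentation. The first command applies the suggested values, the second reverts to the legacy values:
crushtool -i original.map --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 --enable-unsafe-tunables -o adjusted.map
crushtool -i original.map --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --enable-unsafe-tunables -o legacy.map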
Again, the special --enable-unsafe-tunables option is required. Further, as noted above, be careful running old versions of the
ceph-osd daemon after reverting to legacy values as the feature bit is not perfectly enforced.
To activate a CRUSH Map rule for a specific pool, identify the common rule number and specify that rule number for the pool when
creating the pool.
Syntax
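A sketch of assigning a rule at pool creation time, and of changing the rule for an existing pool; the pool name, PG counts, and rule name are illustrative:
[ceph: root@host01 /]# ceph osd pool create mypool 128 128 replicated my_rule
[ceph: root@host01 /]# ceph osd pool set mypool crush_rule my_rule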
Ceph will output (-o) a compiled CRUSH map to the file name you specified. Since the CRUSH map is in a compiled form, you must
decompile it first before you can edit it.
Syntax
Ceph decompiles (-d) the compiled CRUSH map and sends the output (-o) to the file name you specified.
Syntax
Ceph will store a compiled CRUSH map to the file name you specified.
Syntax
Ceph inputs the compiled CRUSH map from the file name you specified as the new CRUSH map for the cluster.
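Taken together, the steps described above can be sketched as follows; the file names are illustrative:
[ceph: root@host01 /]# ceph osd getcrushmap -o crushmap.bin
[ceph: root@host01 /]# crushtool -d crushmap.bin -o crushmap.txt
After editing crushmap.txt, recompile it and inject it back into the cluster:
[ceph: root@host01 /]# crushtool -c crushmap.txt -o crushmap-new.bin
[ceph: root@host01 /]# ceph osd setcrushmap -i crushmap-new.bin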
Use device classes. The process is as simple as adding a class to each device.
Syntax
Example
[ceph:root@host01 /]# ceph osd crush set-device-class hdd osd.0 osd.1 osd.4 osd.5
[ceph:root@host01 /]# ceph osd crush set-device-class ssd osd.2 osd.3 osd.6 osd.7
Syntax
Example
[ceph:root@host01 /]# ceph osd crush rule create-replicated cold default host hdd
[ceph:root@host01 /]# ceph osd crush rule create-replicated hot default host ssd
Syntax
Example
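A sketch of pointing pools at the two rules created above; the pool names are illustrative:
[ceph: root@host01 /]# ceph osd pool set cold-pool crush_rule cold
[ceph: root@host01 /]# ceph osd pool set hot-pool crush_rule hot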
There is no need to manually edit the CRUSH map, because one hierarchy can serve multiple classes of devices.
host ceph-osd-server-1 {
id -1
alg straw2
hash 0
item osd.0 weight 1.00
item osd.1 weight 1.00
item osd.2 weight 1.00
item osd.3 weight 1.00
}
host ceph-osd-server-2 {
id -2
alg straw2
hash 0
item osd.4 weight 1.00
item osd.5 weight 1.00
item osd.6 weight 1.00
item osd.7 weight 1.00
}
root default {
id -3
alg straw2
hash 0
item ceph-osd-server-1 weight 4.00
item ceph-osd-server-2 weight 4.00
}
rule cold {
ruleset 0
type replicated
min_size 2
max_size 11
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}
rule hot {
ruleset 1
type replicated
min_size 2
max_size 11
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}
Placement Groups
Placement Groups (PGs) are invisible to Ceph clients, but they play an important role in Ceph Storage Clusters.
A Ceph Storage Cluster might require many thousands of OSDs to reach an exabyte level of storage capacity. Ceph clients store
objects in pools, which are a logical subset of the overall cluster. The number of objects stored in a pool might easily run into the
millions and beyond. A system with millions of objects or more cannot realistically track placement on a per-object basis and still
perform well. Ceph assigns objects to placement groups, and placement groups to OSDs to make re-balancing dynamic and efficient.
All problems in computer science can be solved by another level of indirection, except of course for the problem of too
many indirections.
— David Wheeler
When the primary OSD fails and gets marked out of the cluster, CRUSH assigns the placement group to another OSD, which receives
copies of objects in the placement group. Another OSD in the Up Set will assume the role of the primary OSD.
When you increase the number of object replicas or coding chunks, CRUSH will assign each placement group to additional OSDs as
required.
NOTE: PGs do not own OSDs. CRUSH assigns many placement groups to each OSD pseudo-randomly to ensure that data gets
distributed evenly across the cluster.
activating
The PG is peered, but not yet active.
backfill_toofull
A backfill operation is waiting because the destination OSD is over the backfillfull ratio.
backfill_unfound
Backfill stopped due to unfound objects.
backfill_wait
The PG is waiting in line to start backfill.
backfilling
Ceph is scanning and synchronizing the entire contents of a PG instead of inferring what contents need to be synchronized from the
logs of recent operations. Backfill is a special case of recovery.
clean
Ceph replicated all objects in the PG accurately.
creating
Ceph is still creating the PG.
deep
Ceph is checking the PG data against stored checksums.
degraded
Ceph has not replicated some objects in the PG accurately yet.
down
A replica with necessary data is down, so the PG is offline. A PG with fewer than min_size replicas is marked as down. Use ceph
health detail to understand the backing OSD state.
forced_backfill
A high backfill priority for this PG is enforced by the user.
forced_recovery
A high recovery priority for this PG is enforced by the user.
incomplete
Ceph detects that a PG is missing information about writes that might have occurred, or does not have any healthy copies. If you see
this state, try to start any failed OSDs that might contain the needed information. In the case of an erasure coded pool, temporarily
reducing min_size might allow recovery.
inconsistent
Ceph detects inconsistencies in one or more replicas of an object in the PG, such as objects having the wrong size, or objects missing
from one replica after recovery finished.
peering
The PG is undergoing the peering process. A peering process should clear off without much delay, but if it stays and the number of
PGs in a peering state does not reduce in number, the peering might be stuck.
peered
The PG has peered, but cannot serve client IO due to not having enough copies to reach the pool’s configured min_size parameter.
Recovery might occur in this state, so the PG might heal up to min_size eventually.
recovering
Ceph is migrating or synchronizing objects and their replicas.
recovery_toofull
A recovery operation is waiting because the destination OSD is over its full ratio.
recovery_unfound
Recovery stopped due to unfound objects.
recovery_wait
The PG is waiting in line to start recovery.
repair
Ceph is checking the PG and repairing any inconsistencies it finds, if possible.
replay
The PG is waiting for clients to replay operations after an OSD crashed.
snaptrim
Trimming snaps.
snaptrim_error
Error stopped trimming snaps.
snaptrim_wait
Queued to trim snaps.
scrubbing
Ceph is checking the PG metadata for inconsistencies.
splitting
Ceph is splitting the PG into multiple PGs.
stale
The PG is in an unknown state; the monitors have not received an update for it since the PG mapping changed.
undersized
The PG has fewer copies than the configured pool replication level.
unknown
The ceph-mgr has not yet received any information about the PG’s state from an OSD since Ceph Manager started up.
References
See the knowledge base What are the possible Placement Group states in a Ceph cluster for more information.
Data durability
Data distribution
Resource usage
Data durability
Ceph strives to prevent the permanent loss of data. However, after an OSD fails, the risk of permanent data loss increases until the
data it had is fully recovered. Permanent data loss, though rare, is still possible. The following scenario describes how Ceph could
permanently lose data in a single placement group with three copies of the data:
An OSD fails and all copies of the object it contains are lost. For all objects within a placement group stored on the OSD, the
number of replicas suddenly drops from three to two.
Ceph starts recovery for each placement group stored on the failed OSD by choosing a new OSD to re-create the third copy of
all objects for each placement group.
The second OSD containing a copy of the same placement group fails before the new OSD is fully populated with the third
copy. Some objects will then only have one surviving copy.
The third OSD containing a copy of the same placement group fails before recovery is complete. If this OSD contained the only
remaining copy of an object, the object is lost permanently.
Hardware failure isn’t an exception, but an expectation. To prevent the foregoing scenario, ideally the recovery process should be as
fast as reasonably possible. The size of your cluster, your hardware configuration and the number of placement groups play an
important role in total recovery time.
In a cluster containing 10 OSDs with 512 placement groups in a three replica pool, CRUSH will give each placement group three
OSDs. Each OSD will end up hosting (512 * 3) / 10 = ~150 placement groups. When the first OSD fails, the cluster will start
recovery for all 150 placement groups simultaneously.
It is likely that Ceph stored the remaining 150 placement groups randomly across the 9 remaining OSDs. Therefore, each remaining
OSD is likely to send copies of objects to all other OSDs and also receive some new objects, because the remaining OSDs become
responsible for some of the 150 placement groups now assigned to them.
The total recovery time depends upon the hardware supporting the pool. For example, in a 10 OSD cluster, if a host contains one OSD
with a 1 TB SSD, and a 10 GB/s switch connects each of the 10 hosts, the recovery time will take M minutes. By contrast, if a host
contains two SATA OSDs and a 1 GB/s switch connects the five hosts, recovery will take substantially longer. Interestingly, in a
cluster of this size, the number of placement groups has almost no influence on data durability. The placement group count could be
128 or 8192 and the recovery would not be slower or faster.
However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs is likely to speed up recovery and therefore improve data
durability significantly. Why? Each OSD now participates in only 75 placement groups instead of 150. The 20 OSD cluster will still
require all 19 remaining OSDs to perform the same amount of copy operations in order to recover. In the 10 OSD cluster, each OSD
had to copy approximately 100 GB. In the 20 OSD cluster, each OSD only has to copy 50 GB. If the network was the bottleneck,
recovery will happen twice as fast. In other words, recovery time decreases as the number of OSDs increases.
If the exemplary cluster grows to 40 OSDs, each OSD will only host 35 placement groups. If an OSD dies, recovery time will decrease
unless another bottleneck precludes improvement. However, if this cluster grows to 200 OSDs, each OSD will only host
approximately 7 placement groups. If an OSD dies, recovery will happen among at most 21 (7 * 3) OSDs in these placement
groups: recovery will take longer than when there were 40 OSDs, meaning the number of placement groups should be
increased!
IMPORTANT: No matter how short the recovery time, there is a chance for another OSD storing the placement group to fail while
recovery is in progress.
In the 10 OSD cluster described above, if any OSD fails, then approximately 8 placement groups (that is 75 pgs / 9 osds being
recovered) will only have one surviving copy. And if any of the 8 remaining OSDs fail, the last objects of one placement group are
likely to be lost (that is 8 pgs / 8 osds with only one remaining copy being recovered). This is why starting with a somewhat
larger cluster is preferred (for example, 50 OSDs).
When the size of the cluster grows to 20 OSDs, the number of placement groups damaged by the loss of three OSDs drops. The
second OSD lost will degrade approximately 2 (that is 35 pgs / 19 osds being recovered) instead of 8 and the third OSD lost will
only lose data if it is one of the two OSDs containing the surviving copy. In other words, if the probability of losing one OSD is
0.0001% during the recovery time frame, it goes from 8 * 0.0001% in the cluster with 10 OSDs to 2 * 0.0001% in the cluster
with 20 OSDs. Having 512 or 4096 placement groups is roughly equivalent in a cluster with less than 50 OSDs as far as data
durability is concerned.
TIP In a nutshell, more OSDs means faster recovery and a lower risk of cascading failures leading to the permanent loss of a
placement group and its objects.
When you add an OSD to the cluster, it might take a long time to populate the new OSD with placement groups and objects. However,
there is no degradation of any object, and adding the OSD has no impact on data durability.
Data distribution
Since CRUSH computes the placement group for each object, but does not actually know how much data is stored in each OSD within
this placement group, the ratio between the number of placement groups and the number of OSDs might influence the
distribution of the data significantly.
For instance, if there was only one placement group with ten OSDs in a three replica pool, Ceph would only use three OSDs to store
data because CRUSH would have no other choice. When more placement groups are available, CRUSH is more likely to evenly spread
objects across OSDs. CRUSH also evenly assigns placement groups to OSDs.
As long as there are one or two orders of magnitude more placement groups than OSDs, the distribution should be even. For
instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs, and so forth.
The ratio between OSDs and placement groups usually solves the problem of uneven data distribution for Ceph clients that
implement advanced features like object striping. For example, a 4 TB block device might get sharded up into 4 MB objects.
The ratio between OSDs and placement groups does not address uneven data distribution in other cases, because CRUSH does
not take object size into account. Using the librados interface to store some relatively small objects and some very large objects
can lead to uneven data distribution. For example, one million 4K objects totaling 4 GB are evenly spread among 1000 placement
groups on 10 OSDs. They will use 4 GB / 10 = 400 MB on each OSD. If one 400 MB object is added to the pool, the three OSDs
supporting the placement group in which the object has been placed will be filled with 400 MB + 400 MB = 800 MB while the
seven others will remain occupied with only 400 MB.
Resource usage
For each placement group, OSDs and Ceph monitors need memory, network and CPU at all times, and even more during recovery.
Sharing this overhead by clustering objects within a placement group is one of the main reasons placement groups exist.
You need to set both the number of placement groups (total), and the number of placement groups used for objects (used in PG
splitting). They should be equal.
(OSDs * 100)
Total PGs = ------------
pool size
Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph
osd erasure-code-profile get).
You should then check if the result makes sense with the way you designed your Ceph cluster to maximize data durability, data
distribution and minimize resource usage.
The result should be rounded up to the nearest power of two. Rounding up is optional, but recommended for CRUSH to evenly
balance the number of objects among placement groups.
For a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:
(200 * 100)
----------- = 6667. Nearest power of 2: 8192
3
With 8192 placement groups distributed across 200 OSDs, that evaluates to approximately 41 placement groups per OSD. You also
need to consider the number of pools you are likely to use in your cluster, since each pool will create placement groups too. Ensure
that you have a reasonable maximum PG count.
In an exemplary Ceph Storage Cluster consisting of 10 pools, each pool with 512 placement groups on ten OSDs, there are a total of
5,120 placement groups spread over ten OSDs, or 512 placement groups per OSD. That might not use too many resources
depending on your hardware configuration. By contrast, if you create 1,000 pools with 512 placement groups each, the OSDs will
handle ~50,000 placement groups each and it would require significantly more resources. Operating with too many placement
groups per OSD can significantly reduce performance, especially during rebalancing or recovery.
The Ceph Storage Cluster has a default maximum value of 300 placement groups per OSD. You can set a different maximum value in
your Ceph configuration file.
TIP Ceph Object Gateways deploy with 10-15 pools, so you might consider using less than 100 PGs per OSD to arrive at a
reasonable maximum number.
Auto-scaling the number of PGs can make managing the cluster easier. The pg-autoscaling command provides recommendations
for scaling PGs, or automatically scales PGs based on how the cluster is being used.
To learn more about how auto-scaling works, see Placement group auto-scaling.
To view placement group scaling recommendations, see Viewing placement group scaling recommendations.
The auto-scaler analyzes pools and adjusts on a per-subtree basis. Because each pool can map to a different CRUSH rule, and each
rule can distribute data across different devices, Ceph considers utilization of each subtree of the hierarchy independently. For
example, a pool that maps to OSDs of class ssd, and a pool that maps to OSDs of class hdd, will each have optimal PG counts that
depend on the number of those respective device types.
Prerequisites
Procedure
You can view each pool, its relative utilization, and any suggested changes to the PG count using:
[ceph: root@host01 /]# ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO
EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
device_health_metrics 0 3.0 374.9G 0.0000
1.0 1 on False
cephfs.cephfs.meta 24632 3.0 374.9G 0.0000
4.0 32 on False
cephfs.cephfs.data 0 3.0 374.9G 0.0000
1.0 32 on False
.rgw.root 1323 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.log 3702 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.control 0 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.meta 382 3.0 374.9G 0.0000
4.0 8 on False
TARGET SIZE, if present, is the amount of data the administrator has specified they expect to eventually be stored in this pool. The
system uses the larger of the two values for its calculation.
RATE is the multiplier for the pool that determines how much raw storage capacity the pool uses. For example, a 3 replica pool has a
ratio of 3.0, while a k=4,m=2 erasure coded pool has a ratio of 1.5.
RAW CAPACITY is the total amount of raw storage capacity on the OSDs that are responsible for storing the pool’s data.
RATIO is the ratio of the total capacity that the pool is consuming, that is, ratio = size * rate / raw capacity.
TARGET RATIO, if present, is the ratio of storage the administrator has specified that they expect the pool to consume relative to
other pools with target ratios set. If both target size bytes and ratio are specified, the ratio takes precedence. The default value of
TARGET RATIO is 0 unless it was specified while creating the pool. The higher the --target_ratio you give a pool, the more
PGs you expect the pool to have.
EFFECTIVE RATIO, is the target ratio after adjusting in two ways: 1. subtracting any capacity expected to be used by pools with
target size set. 2. normalizing the target ratios among pools with target ratio set so they collectively target the rest of the space. For
example, 4 pools with target ratio 1.0 would have an effective ratio of 0.25. The system uses the larger of the actual ratio
and the effective ratio for its calculation.
BIAS is used as a multiplier to manually adjust a pool’s PG count based on prior information about how many PGs a specific pool is
expected to have. By default, the value is 1.0 unless it was specified when creating a pool. The higher the --bias you give a pool, the
more PGs you expect the pool to have.
PG_NUM is the current number of PGs for the pool, or the current number of PGs that the pool is working towards, if a pg_num change
is in progress. NEW PG_NUM, if present, is the suggested number of PGs (pg_num). It is always a power of 2, and is only present if the
suggested value varies from the current value by more than a factor of 3.
The BULK values are true, false, 1, or 0, where 1 is equivalent to true and 0 is equivalent to false. The default value is false.
For more information about using the bulk flag, see Creating a pool and Setting placement group auto-scaling modes.
IBM Storage Ceph can split existing placement groups (PGs) into smaller PGs, which increases the total number of PGs for a given
pool. Splitting existing placement groups (PGs) allows a small IBM Storage Ceph cluster to scale over time as storage requirements
increase. The PG auto-scaling feature can increase the pg_num value, which causes the existing PGs to split as the storage cluster
expands. If the PG auto-scaling feature is disabled, then you can manually increase the pg_num value, which triggers the PG split
process to begin. For example, increasing the pg_num value from 4 to 16 splits each existing PG into four pieces. Increasing the pg_num value will
also increase the pgp_num value, but the pgp_num value increases at a gradual rate. This gradual increase is done to minimize the
impact to a storage cluster’s performance and to a client’s workload, because migrating object data adds a significant load to the
system. By default, Ceph queues and moves no more than 5% of the object data that is in a "misplaced" state. This default
percentage can be adjusted with the target_max_misplaced_ratio option.
Figure 1. Splitting
Merging
IBM Storage Ceph can also merge two existing PGs into a larger PG, which decreases the total number of PGs. Merging two PGs
together can be useful, especially when the relative amount of objects in a pool decreases over time, or when the initial number of
PGs chosen was too large. While merging PGs can be useful, it is also a complex and delicate process. When a merge occurs, I/O to
the PG is paused, and only one PG is merged at a time to minimize the impact to a storage cluster’s performance. Ceph works
slowly on merging the object data until the new pg_num value is reached.
Figure 2. Merging
off: Disables auto-scaling for the pool. It is up to the administrator to choose an appropriate PG number for each pool. Refer
to the PG count section for more information.
on: Enables automated adjustments of the PG count for the given pool.
NOTE: In IBM Storage Ceph 5.3, pg_autoscale_mode is on by default. Upgraded storage clusters retain the existing
pg_autoscale_mode setting. The pg_autoscale_mode is on for newly created pools. PG count is automatically adjusted,
and ceph status might display a recovering state during PG count adjustment.
The autoscaler uses the bulk flag to determine which pool should start with a full complement of PGs; it scales down only when
the usage ratio across the pool is not even. However, if the pool does not have the bulk flag, the pool starts with minimal PGs and
scales up only when there is more usage in the pool.
NOTE: The autoscaler identifies any overlapping roots and prevents the pools with such roots from scaling because overlapping
roots can cause problems with the scaling process.
Procedure
Syntax
Example
Example
Syntax
Example
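A sketch of the commands this procedure refers to; the pool name is illustrative. The first command sets the mode for one pool, the second changes the default mode for newly created pools:
[ceph: root@host01 /]# ceph osd pool set testpool pg_autoscale_mode on
[ceph: root@host01 /]# ceph config set global osd_pool_default_pg_autoscale_mode off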
IMPORTANT: The values must be written as true, false, 1, or 0. 1 is equivalent to true and 0 is equivalent to false. If
written with different capitalization, or with other content, an error is emitted.
The following is an example of the command written with the wrong syntax:
[ceph: root@host01 /]# ceph osd pool set ec_pool_overwrite bulk True
Error EINVAL: expecting value 'true', 'false', '0', or '1'
Syntax
Example
[ceph: root@host01 /]# ceph osd pool set testpool bulk true
Syntax
Example
Prerequisites
Procedure
You can view each pool, its relative utilization, and any suggested changes to the PG count using:
[ceph: root@host01 /]# ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO
EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
device_health_metrics 0 3.0 374.9G 0.0000
1.0 1 on False
cephfs.cephfs.meta 24632 3.0 374.9G 0.0000
4.0 32 on False
cephfs.cephfs.data 0 3.0 374.9G 0.0000
1.0 32 on False
.rgw.root 1323 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.log 3702 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.control 0 3.0 374.9G 0.0000
1.0 32 on False
default.rgw.meta 382 3.0 374.9G 0.0000
4.0 8 on False
TARGET SIZE, if present, is the amount of data the administrator has specified they expect to eventually be stored in this pool. The
system uses the larger of the two values for its calculation.
RATE is the multiplier for the pool that determines how much raw storage capacity the pool uses. For example, a 3 replica pool has a
ratio of 3.0, while a k=4,m=2 erasure coded pool has a ratio of 1.5.
RAW CAPACITY is the total amount of raw storage capacity on the OSDs that are responsible for storing the pool’s data.
RATIO is the ratio of the total capacity that the pool is consuming, that is, ratio = size * rate / raw capacity.
TARGET RATIO, if present, is the ratio of storage the administrator has specified that they expect the pool to consume relative to
other pools with target ratios set. If both target size bytes and ratio are specified, the ratio takes precedence. The default value of
TARGET RATIO is 0 unless it was specified while creating the pool. The higher the --target_ratio you give a pool, the more
PGs you expect the pool to have.
EFFECTIVE RATIO, is the target ratio after adjusting in two ways: 1. subtracting any capacity expected to be used by pools with
target size set. 2. normalizing the target ratios among pools with target ratio set so they collectively target the rest of the space. For
example, 4 pools with target ratio 1.0 would have an effective ratio of 0.25. The system uses the larger of the actual ratio
and the effective ratio for its calculation.
BIAS is used as a multiplier to manually adjust a pool’s PG count based on prior information about how many PGs a specific pool is
expected to have. By default, the value is 1.0 unless it was specified when creating a pool. The higher the --bias you give a pool, the
more PGs you expect the pool to have.
PG_NUM is the current number of PGs for the pool, or the current number of PGs that the pool is working towards, if a pg_num change
is in progress. NEW PG_NUM, if present, is the suggested number of PGs (pg_num). It is always a power of 2, and is only present if the
suggested value varies from the current value by more than a factor of 3.
BULK is used to determine which pool should start out with a full complement of PGs. BULK only scales down when the usage ratio
across the pool is not even. If the pool does not have this flag, the pool starts out with a minimal number of PGs and scales up only
when there is more usage in the pool.
The BULK values are true, false, 1, or 0, where 1 is equivalent to true and 0 is equivalent to false. The default value is false.
For more information about using the bulk flag, see Creating a pool and Setting placement group auto-scaling modes.
The target number of PGs per OSD is based on the mon_target_pg_per_osd configurable. The default value is set to 100.
To adjust mon_target_pg_per_osd:
Syntax
Example
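A sketch, with an illustrative value:
[ceph: root@host01 /]# ceph config set global mon_target_pg_per_osd 100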
If a minimum value is set, Ceph does not automatically reduce, or recommend to reduce, the number of PGs to a value below the set
minimum value.
If a maximum value is set, Ceph does not automatically increase, or recommend to increase, the number of PGs to a value above the
set maximum value.
In addition to this procedure, the ceph osd pool create command has two command-line options that can be used to
specify the minimum or maximum PG count at the time of pool creation.
Syntax
Example
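A sketch of those options at creation time; the pool names and values are illustrative:
[ceph: root@host01 /]# ceph osd pool create testpool --pg-num-min 32
[ceph: root@host01 /]# ceph osd pool create testpool2 --pg-num-max 128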
Prerequisites
Procedure
Syntax
Example
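A sketch of setting the minimum PG count for a pool, as a counterpart to the pg_num_max example below; the pool name and value are illustrative:
[ceph: root@host01 /]# ceph osd pool set testpool pg_num_min 32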
Syntax
Example
[ceph: root@host01 /]# ceph osd pool set testpool pg_num_max 150
Resources
For more information, see:
If you want to enable or disable the autoscaler for all the pools at the same time, you can use the noautoscale global flag. This global
flag is useful during an upgrade of the storage cluster when some OSDs are bounced or when the cluster is under maintenance. You
can set the flag before any activity and unset it once the activity is complete.
By default, the noautoscale flag is set to off. When this flag is set, then all the pools have pg_autoscale_mode as off and all
the pools have the autoscaler disabled.
Prerequisites
Procedure
Example
Example
Example
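A sketch of setting, inspecting, and unsetting the global flag:
[ceph: root@host01 /]# ceph osd pool set noautoscale
[ceph: root@host01 /]# ceph osd pool get noautoscale
[ceph: root@host01 /]# ceph osd pool unset noautoscale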
1. Set the target size using the absolute size of the pool in bytes:
For example, to instruct the system that mypool is expected to consume 100T of space:
[ceph: root@host01 /]# ceph osd pool set mypool target_size_bytes 100T
You can also set the target size of a pool at creation time by adding the optional --target-size-bytes <bytes> argument to the
ceph osd pool create command.
1. Set the target size using the ratio of the total cluster capacity:
Syntax
Example
[ceph: root@host01 /]# ceph osd pool set mypool target_size_ratio 1.0
tells the system that the pool mypool is expected to consume 1.0 relative to the other pools with target_size_ratio set.
If mypool is the only pool in the cluster, this means an expected use of 100% of the total capacity. If there is a second pool
with target_size_ratio as 1.0, both pools would expect to use 50% of the cluster capacity.
You can also set the target size of a pool at creation time by adding the optional --target-size-ratio <ratio> argument to the
ceph osd pool create command.
NOTE If you specify impossible target size values, for example, a capacity larger than the total cluster, or ratios that sum to more
than 1.0, the cluster raises a POOL_TARGET_SIZE_RATIO_OVERCOMMITTED or POOL_TARGET_SIZE_BYTES_OVERCOMMITTED
health warning. If you specify both target_size_ratio and target_size_bytes for a pool, the cluster considers only the ratio, and raises
a POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO health warning.
Syntax
Example
Once you increase or decrease the number of placement groups, you must also adjust the number of placement groups for
placement (pgp_num) before your cluster rebalances. The pgp_num should be equal to the pg_num. To increase the number of
placement groups for placement, execute the following:
Syntax
Example
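A minimal sketch, assuming a hypothetical pool named testpool being raised to 128 placement groups:

[ceph: root@host01 /]# ceph osd pool set testpool pg_num 128
[ceph: root@host01 /]# ceph osd pool set testpool pgp_num 128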
Syntax
Example
Syntax
Syntax
Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to
come up and in.
Unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
Stale: Placement groups are in an unknown state - the OSDs that host them have not reported to the monitor cluster in a while
(configured by mon_osd_report_timeout).
Valid formats are plain (default) and json. The threshold defines the minimum number of seconds the placement group is stuck
before including it in the returned statistics (default 300 seconds).
Syntax
Example
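A minimal sketch of querying stuck placement groups, assuming you want the stale and inactive states:

[ceph: root@host01 /]# ceph pg dump_stuck stale
[ceph: root@host01 /]# ceph pg dump_stuck inactive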
Ceph returns the placement group map, the placement group, and the OSD status:
Syntax
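A minimal sketch, assuming a hypothetical placement group ID of 1.6c:

[ceph: root@host01 /]# ceph pg map 1.6c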
Ceph checks the primary and any replica nodes, generates a catalog of all objects in the placement group and compares them to
ensure that no objects are missing or mismatched, and their contents are consistent. Assuming the replicas all match, a final
semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.
Syntax
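A minimal sketch, again assuming the hypothetical placement group ID 1.6c:

[ceph: root@host01 /]# ceph pg scrub 1.6c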
Edit online
If the cluster has lost one or more objects, and you have decided to abandon the search for the lost data, you must mark the unfound
objects as lost.
If all possible locations have been queried and objects are still lost, you might have to give up on the lost objects. This is possible
given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves
are recovered.
Currently the only supported option is "revert", which will either roll back to a previous version of the object or (if it was a new object)
forget about it entirely. To mark the "unfound" objects as "lost", execute the following:
Syntax
IMPORTANT: Use this feature with caution, because it might confuse applications that expect the object(s) to exist.
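A minimal sketch, assuming a hypothetical placement group 2.5 that still has unfound objects:

[ceph: root@host01 /]# ceph pg 2.5 mark_unfound_lost revert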
Pools overview
Edit online
Ceph clients store data in pools. When you create pools, you are creating an I/O interface for clients to store data. From the
perspective of a Ceph client (that is, block device, gateway, and the rest), interacting with the Ceph storage cluster is remarkably
simple: create a cluster handle and connect to the cluster; then, create an I/O context for reading and writing objects and their
extended attributes.
To connect to the Ceph storage cluster, the Ceph client needs the cluster name (usually ceph by default) and an initial monitor
address. Ceph clients usually retrieve these parameters from the Ceph configuration file at its default path, but a user can also
specify the parameters on the command line. The Ceph client also provides a user name and secret key; authentication is on by
default. Then, the client contacts the Ceph monitor cluster and retrieves a recent copy of the cluster map, including its monitors,
OSDs, and pools.
To read and write data, the Ceph client creates an I/O context to a specific pool in the Ceph storage cluster. If the specified user has
permissions for the pool, the Ceph client can read from and write to the specified pool.
Ceph’s architecture enables the storage cluster to provide this remarkably simple interface to Ceph clients so that clients might
select one of the sophisticated storage strategies you define simply by specifying a pool name and creating an I/O context. Storage
strategies are invisible to the Ceph client in all but capacity and performance. Similarly, the complexities of Ceph clients (mapping
objects into a block device representation, providing an S3/Swift RESTful service) are invisible to the Ceph storage cluster.
Resilience: You can set how many OSDs are allowed to fail without losing data. For replicated pools, it is the desired number of
copies/replicas of an object. A typical configuration stores an object and one additional copy (that is, size = 2), but you can
determine the number of copies/replicas. For erasure coded pools, it is the number of coding chunks (that is, m=2 in the
erasure code profile)
Placement Groups: You can set the number of placement groups for the pool. A typical configuration uses approximately 50-
100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting
up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as
a whole.
CRUSH Rules: When you store data in a pool, a CRUSH rule mapped to the pool enables CRUSH to identify the rule for the
placement of each object and its replicas or chunks for erasure coded pools in your cluster. You can create a custom CRUSH
rule for your pool.
Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.
Quotas: When you set quotas on a pool with ceph osd pool set-quota you might limit the maximum number of objects
or the maximum number of bytes stored in the specified pool.
Listing pool
Edit online
To list your cluster’s pools, execute:
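A minimal sketch of listing pools from the cephadm shell:

[ceph: root@host01 /]# ceph osd lspools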
Creating a pool
Edit online
Before creating pools, see the Pool, PG and CRUSH Configuration Reference.
NOTE The system administrators must expressly enable a pool to receive I/O operations from Ceph clients. See Enabling a client
application for details. Failure to enable a pool will result in a HEALTH_WARN status.
It is better to adjust the default value for the number of placement groups in the Ceph configuration file, as the default value might
not suit your needs.
Example
Syntax
Syntax
Syntax
Where:
POOL_NAME
Description
The name of the pool. It must be unique.
Type
String
Required
Yes. If not specified, it is set to the value listed in the Ceph configuration file or to the default value.
Default
ceph
PG_NUM
Description
The total number of placement groups for the pool. See the Placement Groups section and the Ceph Placement Groups (PGs) per
Pool Calculator for details on calculating a suitable number. The default value 8 is not suitable for most systems.
Type
Integer
Required
Yes
Default
8
PGP_NUM
Description
The total number of placement groups for placement purposes. This value must be equal to the total number of placement groups,
except for placement group splitting scenarios.
Type
Integer
Required
Yes. If not specified it is set to the value listed in the Ceph configuration file or to the default value.
Default
8
replicated or erasure
Description
The pool type can be either replicated to recover from lost OSDs by keeping multiple copies of the objects or erasure to get a
kind of generalized RAID5 capability. The replicated pools require more raw storage but implement all Ceph operations. The erasure-
coded pools require less raw storage but only implement a subset of the available operations.
Type
String
Required
No
Default
replicated
crush-rule-name
Description
The name of the crush rule for the pool. The rule MUST exist. For replicated pools, the name is the rule specified by the
osd_pool_default_crush_rule configuration setting. For erasure-coded pools the name is erasure-code if you specify the
default erasure code profile or POOL_NAME otherwise. Ceph creates this rule with the specified name implicitly if the rule doesn’t
already exist.
Type
String
Required
No
Default
Uses erasure-code for an erasure-coded pool. For replicated pools, it uses the value of the osd_pool_default_crush_rule
variable from the Ceph configuration.
expected-num-objects
Description
The expected number of objects for the pool. By setting this value together with a negative filestore_merge_threshold
variable, Ceph splits the placement groups at pool creation time to avoid the latency impact to perform runtime directory splitting.
Type
Integer
Required
No
Default
0, no splitting at the pool creation time
Type
String
Required
No
When you create a pool, set the number of placement groups to a reasonable value (for example to 100). Consider the total number
of placement groups per OSD too. Placement groups are computationally expensive, so performance will degrade when you have
many pools with many placement groups, for example, 50 pools with 100 placement groups each. The point of diminishing returns
depends upon the power of the OSD host.
See the Placement Groups section and Ceph Placement Groups (PGs) per Pool Calculator for details on calculating an appropriate
number of placement groups for your pool.
Syntax
Example
[ceph: root@host01 /]# ceph osd pool set-quota data max_objects 10000
NOTE: In-flight write operations might overrun pool quotas for a short time until Ceph propagates the pool usage across the cluster.
This is normal behavior. Enforcing pool quotas on in-flight write operations would impose significant performance penalties.
Deleting a pool
Edit online
To delete a pool, execute:
Syntax
IMPORTANT: To protect data, storage administrators cannot delete pools by default. Set the mon_allow_pool_delete
configuration option before deleting pools.
If a pool has its own rule, consider removing it after deleting the pool. If a pool has users strictly for its own use, consider deleting
those users after deleting the pool.
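A minimal sketch, assuming a hypothetical pool named testpool and that the mon_allow_pool_delete option has not yet been enabled:

[ceph: root@host01 /]# ceph config set mon mon_allow_pool_delete true
[ceph: root@host01 /]# ceph osd pool delete testpool testpool --yes-i-really-really-mean-it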
Renaming a pool
Edit online
To rename a pool, execute:
Syntax
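A minimal sketch, assuming a hypothetical pool named testpool being renamed to newpool:

[ceph: root@host01 /]# ceph osd pool rename testpool newpool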
Syntax
rados df
Syntax
The Pool Values section lists all key-value pairs that you can set.
Syntax
The Pool Values section lists all key-value pairs that you can get.
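A minimal sketch of showing pool statistics and setting and getting a pool value, assuming a hypothetical pool named testpool:

[ceph: root@host01 /]# rados df
[ceph: root@host01 /]# ceph osd pool set testpool size 3
[ceph: root@host01 /]# ceph osd pool get testpool size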
To enable a client application to conduct I/O operations on a pool, execute the following:
Syntax
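A minimal sketch, assuming a hypothetical pool named testpool that is used by the Ceph Block Device (rbd) application. If no application is enabled on a pool, the cluster reports a HEALTH_WARN, as shown in the output below.

[ceph: root@host01 /]# ceph osd pool application enable testpool rbd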
In that scenario, the output for ceph health detail -f json-pretty gives the following output:
{
"checks": {
"POOL_APP_NOT_ENABLED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "application not enabled on 1 pool(s)"
},
"detail": [
{
"message": "application not enabled on pool 'POOL_NAME'"
},
{
"message": "use 'ceph osd pool application enable POOL_NAME APP', where APP is
'cephfs', 'rbd', 'rgw', or freeform for custom applications."
}
]
}
},
"status": "HEALTH_WARN",
"overall_status": "HEALTH_WARN",
"detail": [
"'ceph health' JSON format has changed in luminous. If you see this your monitoring system
is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'"
]
}
NOTE: Initialize pools for the Ceph Block Device with rbd pool init POOL_NAME.
Syntax
Syntax
Syntax
Syntax
IMPORTANT: The NUMBER_OF_REPLICAS parameter includes the object itself. If you want to include the object and two copies of
the object for a total of three instances of the object, specify 3.
Example
NOTE: An object might accept I/O operations in degraded mode with fewer replicas than specified by the pool size setting. To set
a minimum number of required replicas for I/O, use the min_size setting.
Example
This ensures that no object in the data pool will receive I/O with fewer replicas than specified by the min_size setting.
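A minimal sketch, assuming a hypothetical pool named data with a desired size of 3 and min_size of 2:

[ceph: root@host01 /]# ceph osd pool set data size 3
[ceph: root@host01 /]# ceph osd pool set data min_size 2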
Ceph will list the pools, with the replicated size attribute highlighted. By default, Ceph creates two replicas of an object, that is
a total of three copies, or a size of 3.
size
Description
Specifies the number of replicas for objects in the pool. See the Setting the number of object replicas section for further details.
Applicable for the replicated pools only.
Type
Integer
min_size
Description
Specifies the minimum number of replicas required for I/O. See the Setting the number of object replicas section for further
details. For erasure-coded pools, this should be set to a value greater than k. If I/O is allowed at the value k, then there is no
redundancy and data is lost in the event of a permanent OSD failure. For more information, see Erasure code pools overview.
Type
Integer
crash_replay_interval
Description
Specifies the number of seconds to allow clients to replay acknowledged, but uncommitted requests.
Type
Integer
pg-num
Description The total number of placement groups for the pool. See the Pools, placement groups, and CRUSH configuration section
for details on calculating a suitable number. The default value 8 is not suitable for most systems.
Type
Integer
Required
Yes.
Default
8
pgp-num
Description
The total number of placement groups for placement purposes. This should be equal to the total number of placement groups,
except for placement group splitting scenarios.
Type
Integer
Required
Yes. Picks up default or Ceph configuration value if not specified.
Default
8
Valid Range
Equal to or less than the value specified by the pg_num variable.
crush_rule
Type
String
hashpspool
Description
Enable or disable the HASHPSPOOL flag on a given pool. With this option enabled, pool hashing and placement group mapping are
changed to improve the way pools and placement groups overlap.
Type
Integer
Valid Range
1 enables the flag, 0 disables the flag.
IMPORTANT: Do not enable this option on production pools of a cluster with a large number of OSDs and data. All placement groups
in the pool would have to be remapped, causing too much data movement.
fast_read
Description
On a pool that uses erasure coding, if this flag is enabled, the read request issues subsequent reads to all shards, and waits until it
receives enough shards to decode to serve the client. In the case of the jerasure and isa erasure plug-ins, once the first K
replies return, the client’s request is served immediately using the data decoded from these replies. This helps to allocate some
resources for better performance. Currently this flag is only supported for erasure coding pools.
Type
Boolean
Defaults
0
allow_ec_overwrites
Description
Whether writes to an erasure coded pool can update part of an object, so the Ceph Filesystem and Ceph Block Device can use it.
Type
Boolean
compression_algorithm
Description
Sets inline compression algorithm to use with the BlueStore storage backend. This setting overrides the
bluestore_compression_algorithm configuration setting.
Type
String
Valid Settings
lz4, snappy, zlib, zstd
compression_mode
Description
Sets the policy for the inline compression algorithm for the BlueStore storage backend. This setting overrides the
bluestore_compression_mode configuration setting.
Type
String
Valid Settings
none, passive, aggressive, force
compression_min_blob_size
Description
BlueStore does not compress chunks smaller than this size. This setting overrides the bluestore_compression_min_blob_size
configuration setting.
Type
Unsigned Integer
compression_max_blob_size
Description
BlueStore will break chunks larger than this size into smaller blobs of compression_max_blob_size before compressing the data.
Type
Unsigned Integer
nodelete
Description
Set or unset the NODELETE flag on a given pool.
Type
Integer
Valid Range
1 sets flag. 0 unsets flag.
nopgchange
Description
Set or unset the NOPGCHANGE flag on a given pool.
Type
Integer
Valid Range
1 sets the flag. 0 unsets the flag.
nosizechange
Description
Set or unset the NOSIZECHANGE flag on a given pool.
Type
Integer
Valid Range
1 sets the flag. 0 unsets the flag.
write_fadvise_dontneed
Description
Set or unset the WRITE_FADVISE_DONTNEED flag on a given pool.
Type
Integer
Valid Range
1 sets the flag. 0 unsets the flag.
noscrub
Description
Set or unset the NOSCRUB flag on a given pool.
Type
Integer
Valid Range
1 sets the flag. 0 unsets the flag.
nodeep-scrub
Description
Set or unset the NODEEP_SCRUB flag on a given pool.
Type
Integer
Valid Range
1 sets the flag. 0 unsets the flag.
scrub_min_interval
Description
The minimum interval in seconds for pool scrubbing when load is low. If it is 0, Ceph uses the osd_scrub_min_interval
configuration setting.
Type
Double
Default
0
scrub_max_interval
Description
The maximum interval in seconds for pool scrubbing irrespective of cluster load. If it is 0, Ceph uses the
osd_scrub_max_interval configuration setting.
Type
Double
Default
0
deep_scrub_interval
Description
The interval in seconds for pool deep scrubbing. If it is 0, Ceph uses the osd_deep_scrub_interval configuration setting.
Type
Double
Default
0
Ceph stores data in pools and there are two types of the pools:
replicated
erasure-coded
Ceph uses the replicated pools by default, meaning the Ceph copies every object from a primary OSD node to one or more secondary
OSDs.
The erasure-coded pools reduce the amount of disk space required to ensure data durability but it is computationally a bit more
expensive than replication.
Erasure coding is a method of storing an object in the Ceph storage cluster durably where the erasure code algorithm breaks the
object into data chunks (k) and coding chunks (m), and stores those chunks in different OSDs.
In the event of the failure of an OSD, Ceph retrieves the remaining data (k) and coding (m) chunks from the other OSDs and the
erasure code algorithm restores the object from those chunks.
Erasure coding uses storage capacity more efficiently than replication. The n-replication approach maintains n copies of an object
(3x by default in Ceph), whereas erasure coding maintains only k + m chunks. For example, 3 data and 2 coding chunks use
approximately 1.7x the storage space of the original object.
While erasure coding uses less storage overhead than replication, the erasure code algorithm uses more RAM and CPU than
replication when it accesses or recovers objects. Erasure coding is advantageous when data storage must be durable and fault
tolerant, but does not require fast read performance (for example, cold storage, historical records, and so on).
For the mathematical and detailed explanation on how erasure code works in Ceph, see the Ceph Erasure Coding.
Ceph creates a default erasure code profile when initializing a cluster with k=2 and m=2. This means that Ceph spreads the object
data over four OSDs (k+m=4) and can lose two of those OSDs without losing data. For more information about erasure code
profiling, see the Erasure Code Profiles section.
IMPORTANT: Configure only the .rgw.buckets pool as erasure-coded and all other Ceph Object Gateway pools as replicated,
otherwise an attempt to create a new bucket fails with the following error:
The reason for this is that erasure-coded pools do not support the omap operations and certain Ceph Object Gateway metadata
pools require the omap support.
Example
NOTE: The 32 in pool create stands for the number of placement groups.
Ceph creates a default erasure code profile when initializing a cluster. The default profile defines k=2 and m=2, meaning Ceph
spreads the object data over four OSDs (k+m=4) and can lose two of those OSDs without losing data. This provides the same level
of redundancy as three copies in a replicated pool while using less raw capacity: storing 1 TB of data requires 2 TB instead of 3 TB.
To display the default profile, use the following command:
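A minimal sketch, assuming the profile is still named default:

[ceph: root@host01 /]# ceph osd erasure-code-profile get default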
The most important parameters of the profile are k, m and crush-failure-domain, because they define the storage overhead and the
data durability.
IMPORTANT: Choosing the correct profile is important because you cannot change the profile after you create the pool. To modify a
profile, you must create a new pool with a different profile and migrate the objects from the old pool to the new pool.
For instance, if the desired architecture must sustain the loss of two racks with a storage overhead of 50%, the following
profile can be defined:
The primary OSD will divide the NYAN object into four (k=4) data chunks and create two additional chunks (m=2). The value of m
defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack will create a CRUSH rule
that ensures no two chunks are stored in the same rack.
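A minimal sketch of such a profile and a pool that uses it; the profile name myprofile, the pool name ecpool, and the placement group count of 32 are hypothetical:

[ceph: root@host01 /]# ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack
[ceph: root@host01 /]# ceph osd pool create ecpool 32 32 erasure myprofile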
IMPORTANT: IBM supports the following jerasure coding values for k and m:
k=8 m=3
k=8 m=4
k=4 m=2
IMPORTANT: If the number of OSDs lost equals the number of coding chunks (m), some placement groups in the erasure coding pool
will go into incomplete state. If the number of OSDs lost is less than m, no placement groups will go into incomplete state. In either
situation, no data loss will occur. If placement groups are in incomplete state, temporarily reducing min_size of an erasure coded
pool will allow recovery.
Edit online
To create a new erasure code profile:
Syntax
Where:
directory
Description
Set the directory name from which the erasure code plug-in is loaded.
Type
String
Required
No.
Default
/usr/lib/ceph/erasure-code
plugin
Description
Use the erasure code plug-in to compute coding chunks and recover missing details. See the Erasure Code Plug-ins section for
details.
Type
String
Required
No.
Default
jerasure
stripe_unit
Description
The amount of data in a data chunk, per stripe. For example, a profile with 2 data chunks and stripe_unit=4K would put the range
0-4K in chunk 0, 4K-8K in chunk 1, then 8K-12K in chunk 0 again. This should be a multiple of 4K for best performance. The default
value is taken from the monitor config option osd_pool_erasure_code_stripe_unit when a pool is created. The stripe_width
of a pool using this profile will be the number of data chunks multiplied by this stripe_unit.
Type
String
Required
No.
Default
4K
crush-device-class
Description
Type
String
Required
No
Default
none, meaning CRUSH uses all devices regardless of class.
crush-failure-domain
Description
The failure domain, such as host or rack.
Type
String
Required
No
Default
host
key
Description
The semantic of the remaining key-value pairs is defined by the erasure code plug-in.
Type
String
Required
No.
--force
Description
Override an existing profile by the same name.
Type
String
Required
No.
Edit online
To remove an erasure code profile:
Syntax
Edit online
To display an erasure code profile:
Syntax
Edit online
To list the names of all erasure code profiles:
Syntax
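A minimal sketch of displaying, listing, and removing profiles, assuming a hypothetical profile named myprofile:

[ceph: root@host01 /]# ceph osd erasure-code-profile get myprofile
[ceph: root@host01 /]# ceph osd erasure-code-profile ls
[ceph: root@host01 /]# ceph osd erasure-code-profile rm myprofile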
Using erasure coded pools with overwrites allows Ceph Block Devices and CephFS to store their data in an erasure coded pool:
Syntax
Example
[ceph: root@host01 /]# ceph osd pool set ec_pool allow_ec_overwrites true
Erasure coded pools with overwrites can only be used with BlueStore OSDs, because BlueStore's checksumming is used to detect
bit rot or other corruption during deep scrubs. Using FileStore with erasure coded overwrites is unsafe and yields lower performance
compared to BlueStore.
Erasure coded pools do not support omap. To use erasure coded pools with Ceph Block Devices and CephFS, store the data in an
erasure coded pool, and the metadata in a replicated pool.
For Ceph Block Devices, use the --data-pool option during image creation:
Syntax
Example
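A minimal sketch, assuming a replicated pool named rbdpool for the image metadata, an erasure coded data pool named ec_pool with overwrites enabled, and a hypothetical image named myimage:

[ceph: root@host01 /]# rbd create --size 1G --data-pool ec_pool rbdpool/myimage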
If you use erasure coded pools for CephFS, you must specify the erasure coded pool in a file layout.
Creating a new erasure code profile using jerasure erasure code plugin
Controlling CRUSH Placement
The jerasure plug-in encapsulates the Jerasure library. For detailed information about the parameters, see the jerasure
documentation.
To create a new erasure code profile using the jerasure plug-in, run the following command:
Syntax
Where:
k
Description
Each object is split into k data chunks, each stored on a different OSD.
Type
Integer
Required
Yes.
Example
4
m
Description
Compute coding chunks for each object and store them on different OSDs. The number of coding chunks is also the number of OSDs
that can be down without losing data.
Type
Integer
Required
Yes.
Example
2
technique
Description
The more flexible technique is reed_sol_van; it is enough to set k and m. The cauchy_good technique can be faster but you need to
choose the packetsize carefully. All of reed_sol_r6_op, liberation, blaum_roth, liber8tion are RAID6 equivalents in the sense that they
can only be configured with m=2.
Type
String
Required
No.
Valid Settings
reed_sol_van reed_sol_r6_op cauchy_orig cauchy_good liberation blaum_roth liber8tion
Default
reed_sol_van
packetsize
Description
The encoding is done on packets of packetsize bytes at a time. Choosing the correct packet size is difficult. The jerasure documentation
contains extensive information on this topic.
Required
No.
Default
2048
crush-root
Description
The name of the CRUSH bucket used for the first step of the rule. For instance step take default.
Type
String
Required
No.
Default
default
crush-failure-domain
Description
Ensure that no two chunks are in a bucket with the same failure domain. For instance, if the failure domain is host no two chunks will
be stored on the same host. It is used to create a rule step such as step chooseleaf host.
Type
String
Required
No.
Default
host
directory
Description
Set the directory name from which the erasure code plug-in is loaded.
Type
String
Required
No.
Default
/usr/lib/ceph/erasure-code
--force
Description
Override an existing profile by the same name.
Type
String
Required
No.
chunk nr 01234567
step 1 _cDD_cDD
needs exactly 8 OSDs, one for each chunk. If the hosts are in two adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack. Recovering from the loss of a single OSD does not require using bandwidth between the two
racks.
For instance, the profile sketched below creates a rule that selects two CRUSH buckets of type rack and, for each of them, chooses
four OSDs, each located in a different bucket of type host.
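A sketch of such a profile, assuming the lrc erasure code plug-in with its mapping, layers, and crush-steps parameters; the profile name LRCprofile is hypothetical:

[ceph: root@host01 /]# ceph osd erasure-code-profile set LRCprofile plugin=lrc mapping=__DD__DD layers='[[ "_cDD_cDD", "" ],]' crush-steps='[[ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ]]'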
Installing
Edit online
This information provides instructions on installing IBM Storage Ceph on Red Hat Enterprise Linux running on AMD64 and Intel 64
architectures.
IBM Storage Ceph is designed for cloud infrastructure and web-scale object storage. IBM Storage Ceph clusters consist of the
following types of nodes:
Ceph Monitor
Each Ceph Monitor node runs the ceph-mon daemon, which maintains a master copy of the storage cluster map. The storage cluster
map includes the storage cluster topology. A client connecting to the Ceph storage cluster retrieves the current copy of the storage
cluster map from the Ceph Monitor, which enables the client to read from and write data to the storage cluster.
IMPORTANT: The storage cluster can run with only one Ceph Monitor; however, to ensure high availability in a production storage
cluster, IBM supports deployments with at least three Ceph Monitor nodes. Deploy a total of 5 Ceph Monitors for storage clusters
exceeding 750 Ceph OSDs.
Ceph Manager
The Ceph Manager daemon, ceph-mgr, co-exists with the Ceph Monitor daemons running on Ceph Monitor nodes to provide
additional services. The Ceph Manager provides an interface for other monitoring and management systems using Ceph Manager
modules. Running the Ceph Manager daemons is a requirement for normal storage cluster operations.
Ceph OSD
Each Ceph Object Storage Device (OSD) node runs the ceph-osd daemon, which interacts with logical disks attached to the node.
The storage cluster stores data on these Ceph OSD nodes.
Ceph can run with very few OSD nodes (three is the default minimum), but production storage clusters realize better performance
beginning at modest scales, for example, 50 Ceph OSDs in a storage cluster. Ideally, a Ceph storage cluster has multiple OSD nodes,
allowing isolated failure domains by configuring the CRUSH map.
Ceph MDS
Each Ceph Metadata Server (MDS) node runs the ceph-mds daemon, which manages metadata related to files stored on the Ceph
File System (CephFS). The Ceph MDS daemon also coordinates access to the shared storage cluster.
Ceph Object Gateway
Each Ceph Object Gateway node runs the ceph-radosgw daemon, which provides an object storage interface built on top of librados,
giving applications a RESTful access point to the Ceph storage cluster. The Ceph Object Gateway supports two interfaces:
S3
Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.
Swift
Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.
Reference
Edit online
Before installing IBM Storage Ceph, understand the hardware and network requirements, the types of workloads that work well with
an IBM Storage Ceph cluster, and IBM's recommendations. IBM Storage Ceph can be used for different workloads based on a
particular business need or set of requirements. Doing the necessary planning before installing IBM Storage Ceph is critical to
running a Ceph storage cluster efficiently and achieving the business requirements.
NOTE: Want help with planning an IBM Storage Ceph cluster for a specific use case? Contact your IBM representative for assistance.
IBM Storage Ceph can support multiple storage strategies. Use cases, cost versus benefit performance tradeoffs, and data
durability are the primary considerations that help develop a sound storage strategy.
Use Cases
Ceph provides massive storage capacity, and it supports numerous use cases, such as:
The Ceph Block Device client is a leading storage backend for cloud platforms that provides limitless storage for volumes and
images with high performance features like copy-on-write cloning.
The Ceph Object Gateway client is a leading storage backend for cloud platforms that provides a RESTful S3-compliant and
Swift-compliant object storage for objects like audio, bitmap, video, and other data.
Faster is better. Bigger is better. High durability is better. However, there is a price for each superlative quality, and a corresponding
cost versus benefit tradeoff. Consider the following use cases from a performance perspective: SSDs can provide very fast storage
for relatively small amounts of data and journaling. Storing a database or object index can benefit from a pool of very fast SSDs, but
proves too expensive for other data. SAS drives with SSD journaling provide fast performance at an economical price for volumes and
images. SATA drives without SSD journaling provide cheap storage with lower overall performance. When you create a CRUSH
hierarchy of OSDs, you need to consider the use case and an acceptable cost versus performance tradeoff.
Data Durability
In large scale storage clusters, hardware failure is an expectation, not an exception. However, data loss and service interruption
remain unacceptable. For this reason, data durability is very important. Ceph addresses data durability with multiple replica copies
of an object or with erasure coding and multiple coding chunks. Multiple copies or multiple coding chunks present an additional cost
versus benefit tradeoff: it is cheaper to store fewer copies or coding chunks, but it can lead to the inability to service write requests in
a degraded state. Generally, one object with two additional copies, or two coding chunks can allow a storage cluster to service writes
in a degraded state while the storage cluster recovers.
Replication stores one or more redundant copies of the data across failure domains in case of a hardware failure. However,
redundant copies of data can become expensive at scale. For example, to store 1 petabyte of data with triple replication would
require a cluster with at least 3 petabytes of storage capacity.
Erasure coding stores data as data chunks and coding chunks. In the event of a lost data chunk, erasure coding can recover the lost
data chunk with the remaining data chunks and coding chunks. Erasure coding is substantially more economical than replication. For
example, using erasure coding with 8 data chunks and 3 coding chunks provides the same redundancy as 3 copies of the data.
However, such an encoding scheme uses approximately 1.5x the initial data stored compared to 3x with replication.
The CRUSH algorithm aids this process by ensuring that Ceph stores additional copies or coding chunks in different locations within
the storage cluster. This ensures that the failure of a single storage device or host does not lead to a loss of all of the copies or coding
chunks necessary to preclude data loss. You can plan a storage strategy with cost versus benefit tradeoffs, and data durability in
mind, then present it to a Ceph client as a storage pool.
IMPORTANT: ONLY the data storage pool can use erasure coding. Pools storing service data and bucket indexes use replication.
IMPORTANT: Ceph’s object copies or coding chunks make RAID solutions obsolete. Do not use RAID, because Ceph already handles
data durability, a degraded RAID has a negative impact on performance, and recovering data using RAID is substantially slower than
using deep copies or erasure coding chunks.
Reference
Edit online
To the Ceph client interface that reads and writes data, a Ceph storage cluster appears as a simple pool where the client stores data.
However, the storage cluster performs many complex operations in a manner that is completely transparent to the client interface.
Ceph clients and Ceph object storage daemons, referred to as Ceph OSDs, or simply OSDs, both use the Controlled Replication Under
Scalable Hashing (CRUSH) algorithm for the storage and retrieval of objects. Ceph OSDs can run in containers within the storage
cluster.
A CRUSH map describes a topography of cluster resources, and the map exists both on client hosts as well as Ceph Monitor hosts
within the cluster. Ceph clients and Ceph OSDs both use the CRUSH map and the CRUSH algorithm. Ceph clients communicate
directly with OSDs, eliminating a centralized object lookup and a potential performance bottleneck. With awareness of the CRUSH
map and communication with their peers, OSDs can handle replication, backfilling, and recovery—allowing for dynamic failure
recovery.
Ceph uses the CRUSH map to implement failure domains. Ceph also uses the CRUSH map to implement performance domains,
which simply take the performance profile of the underlying hardware into consideration. The CRUSH map describes how Ceph
stores data, and it is implemented as a simple hierarchy, specifically an acyclic graph, and a ruleset. The CRUSH map can support
multiple hierarchies to separate one type of hardware performance profile from another. Ceph implements performance domains
with device "classes".
For example, you can have these performance domains coexisting in the same IBM Storage Ceph cluster:
Hard disk drives (HDDs) are typically appropriate for cost and capacity-focused workloads.
Throughput-sensitive workloads typically use HDDs with Ceph write journals on solid state drives (SSDs).
IMPORTANT: Carefully consider the workload being run by IBM Storage Ceph clusters BEFORE considering what hardware to
purchase, because it can significantly impact the price and performance of the storage cluster. For example, if the workload is
capacity-optimized and the hardware is better suited to a throughput-optimized workload, then hardware will be more expensive
than necessary. Conversely, if the workload is throughput-optimized and the hardware is better suited to a capacity-optimized
workload, then the storage cluster can suffer from poor performance.
IOPS optimized: Input, output per second (IOPS) optimization deployments are suitable for cloud computing operations,
such as running MYSQL or MariaDB instances as virtual machines on OpenStack. IOPS optimized deployments require higher
performance storage such as 15k RPM SAS drives and separate SSD journals to handle frequent write operations. Some high
IOPS scenarios use all flash storage to improve IOPS and total throughput.
3x replication for hard disk drives (HDDs) or 2x replication for solid state drives (SSDs).
3x replication.
Capacity optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as
possible. Capacity-optimized deployments typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive SATA drives and co-locate journals rather than using
SSDs for journaling.
Object archive.
Storage administrators prefer that a storage cluster recovers as quickly as possible. Carefully consider bandwidth requirements for
the storage cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-
cluster traffic. Also consider that network performance is increasingly important when considering the use of Solid State Disks (SSD),
flash, NVMe, and other high performing storage devices.
Ceph supports a public network and a storage cluster network. The public network handles client traffic and communication with
Ceph Monitors. The storage cluster network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic.
IMPORTANT: Allocate bandwidth to the storage cluster network, such that it is a multiple of the public network using the
osd_pool_default_size as the basis for the multiple on replicated pools. Run the public and storage cluster networks on
separate network cards.
IMPORTANT: Use 10 Gb/s Ethernet for IBM Storage Ceph deployments in production. A 1 Gb/s Ethernet network is not suitable for
production storage clusters.
In the case of a drive failure, replicating 1 TB of data across a 1 Gb/s network takes 3 hours, and replicating 10 TB takes 30 hours;
10 TB is a typical drive capacity. By contrast, with a 10 Gb/s Ethernet network, the replication times would be 20 minutes for 1 TB
and 1 hour for 10 TB. Remember that when a Ceph OSD fails, the storage cluster will recover by
replicating the data it contained to other Ceph OSDs within the pool.
The failure of a larger domain such as a rack means that the storage cluster utilizes considerably more bandwidth. When building a
storage cluster consisting of multiple racks, which is common for large storage implementations, consider utilizing as much network
bandwidth between switches in a "fat tree" design for optimal performance. A typical 10 Gb/s Ethernet switch has 48 10 Gb/s ports
and four 40 Gb/s ports. Use the 40 Gb/s ports on the spine for maximum throughput. Alternatively, consider aggregating unused 10
Gb/s ports with QSFP+ and SFP+ cables into more 40 Gb/s ports to connect to other rack and spine routers. Also, consider using
LACP mode 4 to bond network interfaces. Additionally, use jumbo frames, with a maximum transmission unit (MTU) of 9000,
especially on the backend or cluster network.
Before installing and testing an IBM Storage Ceph cluster, verify the network throughput. Most performance-related problems in
Ceph usually begin with a networking issue. Simple network issues like a kinked or bent Cat-6 cable could result in degraded
bandwidth. Use a minimum of 10 Gb/s ethernet for the front side network. For large clusters, consider using 40 Gb/s ethernet for the
backend or cluster network.
IMPORTANT: For network optimization, use jumbo frames for a better CPU per bandwidth ratio, and a non-blocking network switch
back-plane. IBM Storage Ceph requires the same MTU value throughout all networking devices in the communication path, end-to-
end for both public and cluster networks. Verify that the MTU value is the same on all hosts and networking equipment in the
environment before using an IBM Storage Ceph cluster in production.
Reference
Edit online
For more information, see:
If an OSD host has a RAID controller with 1-2 Gb of cache installed, enabling the write-back cache might result in increased
small I/O write throughput. However, the cache must be non-volatile.
Most modern RAID controllers have super capacitors that provide enough power to drain volatile memory to non-volatile
NAND memory during a power-loss event. It is important to understand how a particular controller and its firmware behave
after power is restored.
Some RAID controllers require manual intervention. Hard drives typically advertise to the operating system whether their disk
caches should be enabled or disabled by default. However, certain RAID controllers and some firmware do not provide such
information. Verify that disk level caches are disabled to avoid file system corruption.
Create a single RAID 0 volume with write-back cache enabled for each Ceph OSD data drive.
The Ceph Object Gateway can hang if it runs out of file descriptors. You can modify the /etc/security/limits.conf file on Ceph
Object Gateway hosts to increase the file descriptors for the Ceph Object Gateway.
When running Ceph administrative commands on large storage clusters, for example, with 1024 Ceph OSDs or more, create an
/etc/security/limits.d/50-ceph.conf file on each host that runs administrative commands with the following contents:
Syntax
Replace USER_NAME with the name of the non-root user account that runs the Ceph administrative commands.
NOTE: The root user’s ulimit value is already set to unlimited by default on Red Hat Enterprise Linux.
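A minimal sketch of such a file, assuming a hypothetical administrative user named ceph-admin; verify the exact limits that your environment requires:

ceph-admin       soft    nproc     unlimited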
Easier upgrade
Additionally, for the Ceph Object Gateway (radosgw) and the Ceph File System Metadata Server (ceph-mds), you can colocate either
of them with an OSD daemon plus a daemon from the above list, excluding RBD mirror.
NOTE: Colocating two of the same kind of daemons on a given node is not supported.
NOTE: IBM recommends colocating the Ceph Object Gateway with Ceph OSD containers to increase performance.
With the colocation rules shared above, the following minimum cluster sizes comply with these rules:
Example 1
2. Use case: Block (RBD) and File (CephFS), or Object (Ceph Object Gateway)
3. Number of nodes: 3
4. Replication scheme: 2
Example 2
2. Use case: Block (RBD), File (CephFS), and Object (Ceph Object Gateway)
3. Number of nodes: 4
4. Replication scheme: 3
Example 3
2. Use case: Block (RBD), Object (Ceph Object Gateway), and NFS for Ceph Object Gateway
3. Number of nodes: 4
4. Replication scheme: 3
The diagrams below show the differences between storage clusters with colocated and non-colocated daemons.
The release of IBM Storage Ceph 5.3 is supported on Red Hat Enterprise Linux 8.4 EUS or later.
Use the same operating system version, architecture, and deployment type across all nodes.
For example, do not use a mixture of nodes with both AMD64 and Intel 64 architectures, a mixture of nodes with different Red Hat
Enterprise Linux operating system versions, or a mixture of deployment types.
IMPORTANT: IBM does not support clusters with heterogeneous architectures, operating system versions, or deployment types.
SELinux
By default, SELinux is set to Enforcing mode and the ceph-selinux packages are installed. For additional information on SELinux,
see the Red Hat Enterprise Linux 8 Using SELinux Guide.
NOTE: Disk space requirements are based on the Ceph daemons' default path under /var/lib/ceph/ directory.
Table 1. Containers
This number is highly dependent on the configurable MDS cache size. The RAM requirement is
typically twice as much as the amount set in the mds_cache_memory_limit configuration
setting. Note also that this is the memory for your daemon, not the overall system memory.
Disk Space 2 GB per mds-container, plus taking into consideration any additional space required for
possible debug logging, 20GB is a good start.
Network 2x 1 GB Ethernet NICs, 10 GB Recommended
Note that this is the same network as the OSD containers. If you have a 10 GB network on your
OSDs you should use the same on your MDS so that the MDS is not disadvantaged when it
comes to latency.
Reference
Edit online
To take a deeper look into Ceph’s various internal components and the strategies around those components, see Storage
Strategies.
The cephadm utility manages the entire life cycle of a Ceph cluster. Installation and management tasks comprise two types of
operations:
Day One operations involve installing and bootstrapping a bare-minimum, containerized Ceph storage cluster, running on a
single node. Day One also includes deploying the Monitor and Manager daemons and adding Ceph OSDs.
Day Two operations use the Ceph orchestration interface, cephadm orch, or the IBM Storage Ceph Dashboard to expand the
storage cluster by adding other Ceph services to the storage cluster.
cephadm utility
How cephadm works
cephadm-ansible playbooks
Registering the IBM Storage Ceph nodes
Configuring Ansible inventory location
Enabling SSH login as root user on Red Hat Enterprise Linux 9
Creating an Ansible user with sudo access
Enabling password-less SSH for Ansible
Configuring SSH
Configuring a different SSH user
Running the preflight playbook
Bootstrapping a new storage cluster
Distributing SSH keys
Launching the cephadm shell
Verifying the cluster installation
Adding hosts
Removing hosts
Labeling hosts
Adding Monitor service
Setting up the admin node
Adding Manager service
Adding OSDs
Purging the Ceph storage cluster
Deploying client nodes
Prerequisites
Edit online
At least one running virtual machine (VM) or bare-metal server with an active internet connection.
Remove troubling configurations in iptables so that refresh of iptables services does not cause issues to the cluster. For an
example, see Verifying firewall rules are configured for default Ceph ports.
Procedure
Edit online
1. Enable the Red Hat Enterprise Linux baseos and appstream repositories:
Example
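A minimal sketch of enabling the base repositories with subscription-manager, assuming Red Hat Enterprise Linux 9 on x86_64; the repository IDs differ on Red Hat Enterprise Linux 8:

[root@host01 ~]# subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms --enable=rhel-9-for-x86_64-appstream-rpms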
2. Enable the ceph-tools repository for both Red Hat Enterprise Linux 8 and Red Hat Enterprise Linux 9:
Repeat the above steps on all the nodes of the storage cluster.
Example
Example
cephadm utility
Edit online
The cephadm utility deploys and manages a Ceph storage cluster. It is tightly integrated with both the command-line interface (CLI)
and the IBM Storage Ceph Dashboard web interface, so that you can manage storage clusters from either environment. cephadm
uses SSH to connect to hosts from the manager daemon to add, remove, or update Ceph daemon containers. It does not rely on
external configuration or orchestration tools such as Ansible or Rook.
NOTE: The cephadm utility is available after running the preflight playbook on a host.
The cephadm shell launches a bash shell within a container. This enables you to perform “Day One” cluster setup tasks, such as
installation and bootstrapping, and to invoke ceph commands.
Example
At the system prompt, type cephadm shell and the command you want to execute:
Example
NOTE: If the node contains configuration and keyring files in /etc/ceph/, the container environment uses the values in those files
as defaults for the cephadm shell. However, if you execute the cephadm shell on a Ceph Monitor node, the cephadm shell inherits its
default configuration from the Ceph Monitor container, instead of using the default configuration.
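A minimal sketch of both ways of using the shell, assuming you want to check the cluster status:

[root@host01 ~]# cephadm shell
[ceph: root@host01 /]# ceph -s

[root@host01 ~]# cephadm shell ceph -s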
The cephadm orchestrator enables you to perform “Day Two” Ceph functions, such as expanding the storage cluster and
provisioning Ceph daemons and services. You can use the cephadm orchestrator through either the command-line interface (CLI) or
the web-based IBM Storage Ceph Dashboard. Orchestrator commands take the form ceph orch.
The cephadm script interacts with the Ceph orchestration module used by the Ceph Manager.
Edit online
The cephadm command manages the full lifecycle of an IBM Storage Ceph cluster. The cephadm command can perform the
following operations:
Launch a containerized shell that works with the IBM Storage Ceph command-line interface (CLI).
The cephadm command uses ssh to communicate with the nodes in the storage cluster. This allows you to add, remove, or update
IBM Storage Ceph containers without using external tools. Generate the ssh key pair during the bootstrapping process, or use your
own ssh key.
The cephadm bootstrapping process creates a small storage cluster on a single node, consisting of one Ceph Monitor and one Ceph
Manager, as well as any required dependencies. You then use the orchestrator CLI or the IBM Storage Ceph Dashboard to expand the
storage cluster to include nodes, and to provision all of the IBM Storage Ceph daemons and services. You can perform management
functions through the CLI or from the IBM Storage Ceph Dashboard web interface.
NOTE: The cephadm utility is a new feature in IBM Storage Ceph 5. It does not support older versions of IBM Storage Ceph.
Edit online
The cephadm-ansible package is a collection of Ansible playbooks to simplify workflows that are not covered by cephadm. After
installation, the playbooks are located in /usr/share/cephadm-ansible/.
IMPORTANT: Red Hat Enterprise Linux 9 and later does not support the cephadm-ansible playbook.
cephadm-preflight.yml
cephadm-clients.yml
cephadm-purge-cluster.yml
Use the cephadm-preflight playbook to initially set up hosts before bootstrapping the storage cluster and before adding new
nodes or clients to your storage cluster. This playbook configures the Ceph repository and installs some prerequisites such as
podman, lvm2, chronyd, and cephadm.
Use the cephadm-clients playbook to set up client hosts. This playbook handles the distribution of configuration and keyring files
to a group of Ceph clients.
Prerequisites
Edit online
At least one running virtual machine (VM) or bare-metal server with an active internet connection.
Procedure
Edit online
1. Enable the Red Hat Enterprise Linux baseos and appstream repositories:
Example
Example
2. Enable the ceph-tools repository for both Red Hat Enterprise Linux 8 and Red Hat Enterprise Linux 9:
Repeat the above steps on all the nodes of the storage cluster.
Syntax
NOTE: If deploying clients, client nodes must be defined in a dedicated [clients] group.
IMPORTANT: Skip these steps for Red Hat Enterprise Linux 9 as cephadm-ansible is not supported.
Prerequisites
Edit online
Procedure
Edit online
3. Optional: Edit the ansible.cfg file and add the following line to assign a default inventory location:
[defaults]
inventory = ./inventory/staging
5. Open and edit each hosts file and add the nodes and [admin] group:
Syntax
NODE_NAME_1
NODE_NAME_2
[admin]
ADMIN_NODE_NAME_1
Replace NODE_NAME_1 and NODE_NAME_2 with the Ceph nodes such as monitors, OSDs, MDSs, and gateway nodes.
Replace ADMIN_NODE_NAME_1 with the name of the node where the admin keyring is stored.
Example
host02
host03
host04
[admin]
host01
Syntax
Syntax
Example
You can run one of the following methods to enable login as a root user:
Use "Allow root SSH login with password" flag while setting the root password during installation of Red Hat Enterprise Linux
9.
Manually set the PermitRootLogin parameter after Red Hat Enterprise Linux 9 installation.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Syntax
ssh root@HOST_NAME
Example
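A minimal sketch of manually permitting root login, assuming a drop-in file under /etc/ssh/sshd_config.d/ (the file name is hypothetical):

[root@host01 ~]# echo "PermitRootLogin yes" > /etc/ssh/sshd_config.d/01-permitrootlogin.conf
[root@host01 ~]# systemctl restart sshd.service
[root@host01 ~]# ssh root@host01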
Reference
Edit online
For more information, see the Not able to login as root user via ssh in RHEL 9 server.
Edit online
You can create an Ansible user with password-less root access on all nodes in the storage cluster to run the cephadm-ansible
playbooks. The Ansible user must be able to log into all the IBM Storage Ceph nodes as a user that has root privileges to install
software and create configuration files without prompting for a password.
IMPORTANT: On Red Hat Enterprise Linux 9, follow these steps to create the user only if you are using a non-root user; otherwise,
you can skip these steps.
Prerequisites
Edit online
For Red Hat Enterprise Linux 9, to log in as a root user, see Enabling SSH login as root user on Red Hat Enterprise Linux 9.
Procedure
Edit online
Syntax
ssh root@HOST_NAME
Example
Syntax
adduser USER_NAME
Replace USER_NAME with the new user name for the Ansible user.
Example
IMPORTANT: Do not use ceph as the user name. The ceph user name is reserved for the Ceph daemons. A uniform user
name across the cluster can improve ease of use, but avoid using obvious user names, because intruders typically use them
for brute-force attacks.
Syntax
passwd USER_NAME
Example
Syntax
Replace USER_NAME with the new user name for the Ansible user.
Example
Syntax
Replace USER_NAME with the new user name for the Ansible user.
Example
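A minimal sketch of granting password-less sudo, assuming the hypothetical Ansible user ceph-admin:

[root@host01 ~]# cat << EOF > /etc/sudoers.d/ceph-admin
ceph-admin ALL = (root) NOPASSWD:ALL
EOF
[root@host01 ~]# chmod 0440 /etc/sudoers.d/ceph-admin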
Reference
Edit online
For more information about creating user accounts, see Configuring basic system settings > Getting started with managing
user accounts within the Red Hat Enterprise Linux guide.
IMPORTANT: On Red Hat Enterprise Linux 9, follow these steps only if you are using a non-root user; otherwise, you can skip these
steps.
Prerequisites
Edit online
Ansible user with sudo access to all nodes in the storage cluster.
For Red Hat Enterprise Linux 9, to log in as a root user, see Enabling SSH login as root user on Red Hat Enterprise Linux 9.
Procedure
Edit online
1. Generate the SSH key pair, accept the default file name and leave the passphrase empty:
ssh-copy-id USER_NAME@HOST_NAME
Replace USER_NAME with the new user name for the Ansible user. Replace HOST_NAME with the host name of the Ceph node.
Example
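A minimal sketch of the key generation and copy steps, assuming the hypothetical Ansible user ceph-admin and a node named host02:

[ceph-admin@host01 ~]$ ssh-keygen
[ceph-admin@host01 ~]$ ssh-copy-id ceph-admin@host02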
4. Open the ~/.ssh/config file for editing and set values for the Hostname and User options for each node in the storage cluster:
Syntax
Host host01
Hostname HOST_NAME
User USER_NAME
Host host02
Hostname HOST_NAME
User USER_NAME
...
Replace HOST_NAME with the host name of the Ceph node. Replace USER_NAME with the new user name for the Ansible user.
Example
Host host01
Hostname host01
User ceph-admin
Host host02
Hostname host02
User ceph-admin
Host host03
Hostname host03
User ceph-admin
IMPORTANT: By configuring the ~/.ssh/config file you do not have to specify the -u _USER_NAME_ option each time you
execute the ansible-playbook command.
Reference
Edit online
Configuring SSH
Edit online
As a storage administrator, with Cephadm, you can use an SSH key to securely authenticate with remote hosts. The SSH key is stored
in the monitor to connect to remote hosts.
Procedure
Edit online
Example
Example
Example
Example
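A minimal sketch of generating, retrieving, and clearing the cluster SSH key, assuming the commands are run from the cephadm shell:

[ceph: root@host01 /]# ceph cephadm generate-key
[ceph: root@host01 /]# ceph cephadm get-pub-key
[ceph: root@host01 /]# ceph cephadm clear-key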
IMPORTANT: Prior to configuring a non-root SSH user, the cluster SSH key needs to be added to the user's authorized_keys file
and non-root users must have passwordless sudo access.
Prerequisites
Edit online
Procedure
Edit online
2. Provide Cephadm the name of the user who is going to perform all the Cephadm operations:
Syntax
Example
Syntax
Example
Syntax
Example
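A minimal sketch, assuming the hypothetical non-root user ceph-admin and that the cluster public key is then distributed to that user on each host:

[ceph: root@host01 /]# ceph cephadm set-user ceph-admin
[ceph: root@host01 /]# ceph cephadm get-pub-key > ~/ceph.pub
[ceph: root@host01 /]# ssh-copy-id -f -i ~/ceph.pub ceph-admin@host01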
The preflight playbook uses the cephadm-ansible inventory file to identify the admin node and all other nodes in the storage cluster.
IMPORTANT: Skip these steps for Red Hat Enterprise Linux 9 as cephadm-ansible is not supported.
The default location for the inventory file is /usr/share/cephadm-ansible/hosts. The following example shows the structure
of a typical inventory file:
Example
host02
host03
host04
[admin]
host01
The [admin] group in the inventory file contains the name of the node where the admin keyring is stored. On a new storage cluster,
the node in the [admin] group will be the bootstrap node. To add additional admin hosts after bootstrapping the cluster see Setting
up the admin node.
NOTE: Run the preflight playbook before you bootstrap the initial host.
IMPORTANT: If you are performing a disconnected installation, see Running the preflight playbook for a disconnected installation.
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
Procedure
2. Open and edit the hosts file and add your nodes:
Example
host02
host03
host04
[admin]
host01
3. Add the license to install IBM Storage Ceph and accept the license on all nodes:
Example
Example
Syntax
Example
Use the --limit option to run the preflight playbook on a selected set of hosts in the storage cluster:
Syntax
Replace GROUP_NAME with a group name from your inventory file. Replace NODE_NAME with a specific node name
from your inventory file.
NOTE: Optionally, you can group your nodes in your inventory file by group name such as [mons], [osds], and
[mgrs]. However, admin nodes must be added to the [admin] group and clients must be added to the [clients]
group.
Example
When you run the preflight playbook, cephadm-ansible automatically installs chronyd and ceph-common on the client nodes.
The preflight playbook installs chronyd but configures it for a single NTP source.
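A minimal sketch of running the preflight playbook, assuming the default inventory location, a hypothetical [osds] group for the --limit run, and that ceph_origin=ibm selects the IBM Storage Ceph repositories (verify the correct value for your environment):

[ceph-admin@admin cephadm-ansible]$ ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=ibm"
[ceph-admin@admin cephadm-ansible]$ ansible-playbook -i hosts cephadm-preflight.yml --extra-vars "ceph_origin=ibm" --limit osds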
Installs and starts a Ceph Monitor daemon and a Ceph Manager daemon for a new IBM Storage Ceph cluster on the local node
as containers.
Writes a copy of the public key to /etc/ceph/ceph.pub for the IBM Storage Ceph cluster and adds the SSH key to the root
user’s /root/.ssh/authorized_keys file.
Writes a minimal configuration file needed to communicate with the new cluster to /etc/ceph/ceph.conf.
Deploys a basic monitoring stack with prometheus, grafana, and other tools such as node-exporter and alert-manager.
IMPORTANT: If you are performing a disconnected installation, see Performing a disconnected installation.
NOTE: If you have existing prometheus services that you want to run with the new storage cluster, or if you are running Ceph with
Rook, use the --skip-monitoring-stack option with the cephadm bootstrap command. This option bypasses the basic
monitoring stack so that you can manually configure it later.
IMPORTANT: If you are deploying a monitoring stack, see Deploying the monitoring stack using the Ceph Orchestrator
IMPORTANT: Bootstrapping provides the default user name and password for the initial login to the Dashboard. Bootstrap requires
you to change the password after you log in.
IMPORTANT: Before you begin the bootstrapping process, make sure that the container image that you want to use has the same
version of IBM Storage Ceph as cephadm. If the two versions do not match, bootstrapping fails at the Creating initial admin
user stage.
NOTE: Before you begin the bootstrapping process, you must create a username and password for the cp.icr.io/cp container
registry.
Prerequisites
Edit online
An IP address for the first Ceph Monitor container, which is also the IP address for the first node in the storage cluster.
NOTE: If the storage cluster includes multiple networks and interfaces, be sure to choose a network that is accessible by any node
that uses the storage cluster.
NOTE: If the local node uses fully-qualified domain names (FQDN), then add the --allow-fqdn-hostname option to cephadm
bootstrap on the command line.
IMPORTANT: Run cephadm bootstrap on the node that you want to be the initial Monitor node in the cluster. The IP_ADDRESS
option should be the IP address of the node you are using to run cephadm bootstrap.
NOTE: If you want to deploy a storage cluster using IPV6 addresses, then use the IPV6 address format for the --mon-ip
IP_ADDRESS option. For example: cephadm bootstrap --mon-ip 2620:52:0:880:225:90ff:fefc:2536 --registry-
json /etc/mylogin.json
IMPORTANT: Configuring Ceph Object Gateway multi-site on IBM Storage Ceph 5.3 is not supported due to several open issues. For
more information, see the Red Hat knowledge base article Red Hat Ceph Storage 5.3 does not support multi-site configuration. Use
the --yes-i-know flag while bootstrapping a new IBM Storage Ceph cluster to get past the warning about multi-site regressions.
NOTE: Follow the knowledge base article How to upgrade from Red Hat Ceph Storage 4.2z4 to Red Hat Ceph Storage 5.0z4 with the
bootstrapping procedure if you are planning for a new installation of IBM Storage Ceph 5.3z4.
Procedure
Edit online
Syntax
Example
NOTE: If you want internal cluster traffic routed over the public network, you can omit the --cluster-network
NETWORK_CIDR option.
The script takes a few minutes to complete. Once the script completes, it provides the credentials to the IBM Storage Ceph
Dashboard URL, a command to access the Ceph command-line interface (CLI), and a request to enable telemetry.
URL: https://host01:8443/
User: admin
Password: i8nhu7zham
ceph telemetry on
https://docs.ceph.com/docs/master/mgr/telemetry/
Bootstrap complete.
IBM recommends that you use a basic set of command options for cephadm bootstrap. You can configure additional options after
your initial cluster is up and running.
Syntax
Example
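As an illustration only, a basic bootstrap command that uses the options discussed in this section might look like the following; the Monitor IP address and the cluster network CIDR are placeholders for your environment, and the registry credentials mirror the JSON example later in this section.
cephadm bootstrap --mon-ip 10.10.128.68 --cluster-network 10.10.128.0/24 --registry-url cp.icr.io/cp --registry-username myuser1 --registry-password mypassword1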
For non-root users, see Creating an Ansible user with sudo access and Enabling password-less SSH for Ansible for more details.
Reference
Edit online
For more information about the --registry-json option, see Using a JSON file to protect login information
For more information about all available cephadm bootstrap options, see Bootstrap command options
For more information about bootstrapping the storage cluster as a non-root user, see Bootstrapping the storage cluster as a
non-root user
Procedure
Edit online
1. Log in to the IBM container software library with the IBM ID and password that is associated with the entitled IBM Storage
Ceph software.
3. On the Access your container software page, click Copy key to copy the generated entitlement key.
5. The user name is cp and the password is the entitlement key (the token).
Login Succeeded!
NOTE: You can also use a JSON file with the cephadm --registry-login command.
Prerequisites
Edit online
An IP address for the first Ceph Monitor container, which is also the IP address for the first node in the storage cluster.
Procedure
Edit online
1. Create the JSON file. In this example, the file is named mylogin.json.
Syntax
{
"url":"REGISTRY_URL",
"username":"USER_NAME",
"password":"PASSWORD"
}
Example
{
"url":"cp.icr.io/cp",
"username":"myuser1",
"password":"mypassword1"
}
Syntax
Example
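For instance, after the JSON file has been created, it can be passed to the bootstrap command as in the following sketch; the Monitor IP address is a placeholder.
cephadm bootstrap --mon-ip 10.10.128.68 --registry-json mylogin.json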
NOTE: If you want to use a non-default realm or zone for applications such as multi-site, configure your Ceph Object Gateway
daemons after you bootstrap the storage cluster, instead of adding them to the configuration file and using the --apply-spec
option. This gives you the opportunity to create the realm or zone you need for the Ceph Object Gateway daemons before deploying
them.
NOTE: To deploy a Metadata Server (MDS) service, configure it after bootstrapping the storage cluster.
To deploy the MDS service, you must create a CephFS volume first.
NOTE: If you run the bootstrap command with the --apply-spec option, be sure to include the IP address of the bootstrap host in the
specification file. This prevents the IP address from being resolved to the loopback address when the bootstrap host, on which the
active Ceph Manager is already running, is re-added. If you do not use the --apply-spec option during bootstrap and instead re-add
the host with the ceph orch apply command and another specification file while an active Ceph Manager is running, be sure to
explicitly provide the addr field. This applies to any specification file that is applied after bootstrapping.
Prerequisites
Edit online
cephadm is installed on the node that you want to be the initial Monitor node in the storage cluster.
Procedure
Edit online
2. Create the service configuration .yaml file for your storage cluster. The example file directs cephadm bootstrap to
configure the initial host and two additional hosts, and it specifies that OSDs be created on all available disks.
Example
service_type: host
addr: host01
hostname: host01
---
service_type: host
addr: host02
hostname: host02
---
service_type: host
addr: host03
hostname: host03
---
service_type: host
addr: host04
hostname: host04
---
service_type: mon
placement:
host_pattern: "host[0-2]"
---
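The OSD portion of the specification, which creates OSDs on all available devices as mentioned above, is not shown in this example. A minimal sketch of what it might look like follows; the service_id and host pattern are illustrative assumptions.
service_type: osd
service_id: all_available_devices
placement:
  host_pattern: "host[0-3]"
spec:
  data_devices:
    all: true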
Syntax
Example
The script takes a few minutes to complete. Once the script completes, it provides the credentials to the IBM Storage Ceph
Dashboard URL, a command to access the Ceph command-line interface (CLI), and a request to enable telemetry.
Once your storage cluster is up and running, see Operations for more information about configuring additional daemons and
services.
Reference
Edit online
Non-root users must have passwordless sudo access. See the Creating an Ansible user with sudo access and Enabling
password-less SSH for Ansible sections for more details.
Prerequisites
Edit online
An IP address for the first Ceph Monitor container, which is also the IP address for the initial Monitor node in the storage
cluster.
Procedure
Edit online
Syntax
su - SSH_USER_NAME
Example
NOTE: Using private and public keys is optional. If SSH keys have not previously been created, these can be created during
this step.
Syntax
Example
Reference
Edit online
For more information about utilizing Ansible to automate bootstrapping a rootless cluster, see the knowledge base article Red
Hat Ceph Storage 5.3 rootless deployment utilizing ansible ad-hoc commands.
The following table lists the available options for cephadm bootstrap.
Reference
Edit online
For more information about the --skip-monitoring-stack option, see Adding Hosts.
For more information about logging into the registry with the registry-json option, see help for the registry-login
command.
For more information about cephadm options, see help for cephadm.
Follow this procedure to set up a secure private registry using authentication and a self-signed certificate. Perform these steps on a
node that has both Internet access and access to the local cluster.
Prerequisites
Edit online
At least one running virtual machine (VM) or server with an active internet connection.
Procedure
Edit online
1. Enable the Red Hat Enterprise Linux baseos and appstream repositories:
Example
Example
2. Enable the ceph-tools repository for both Red Hat Enterprise Linux 8 and Red Hat Enterprise Linux 9:
Repeat the above steps on all the nodes of the storage cluster.
Example
Example
The registry will be stored in /opt/registry and the directories are mounted in the container running the registry.
The auth directory stores the htpasswd file the registry uses for authentication.
The certs directory stores the certificates the registry uses for authentication.
Replace PRIVATE_REGISTRY_USERNAME with the username to create for the private registry.
Replace PRIVATE_REGISTRY_PASSWORD with the password to create for the private registry username.
Example
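For example, the htpasswd file can be created with the htpasswd utility from the httpd-tools package. This is a sketch with placeholder credentials; -b supplies the password on the command line, -B selects bcrypt hashing, and -c creates the file.
htpasswd -bBc /opt/registry/auth/htpasswd PRIVATE_REGISTRY_USERNAME PRIVATE_REGISTRY_PASSWORD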
Syntax
Replace LOCAL_NODE_FQDN with the fully qualified host name of the private registry node.
NOTE: You will be prompted for the respective options for your certificate. The CN= value is the host name of your node
and should be resolvable by DNS or the /etc/hosts file.
Example
# openssl req -newkey rsa:4096 -nodes -sha256 -keyout /opt/registry/certs/domain.key \
  -x509 -days 365 -out /opt/registry/certs/domain.crt \
  -addext "subjectAltName = DNS:admin.lab.ibm.com"
NOTE: When creating a self-signed certificate, be sure to create a certificate with a proper Subject Alternative Name
(SAN). Podman commands that require TLS verification for certificates that do not include a proper SAN, return the
following error: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common
Name matching with GODEBUG=x509ignoreCN=0
7. Create a symbolic link to domain.cert to allow skopeo to locate the certificate with the file extension .cert:
Example
8. Add the certificate to the trusted list on the private registry node:
Syntax
cp /opt/registry/certs/domain.crt /etc/pki/ca-trust/source/anchors/
update-ca-trust
trust list | grep -i "LOCAL_NODE_FQDN"
Example
label: admin.lab.ibm.com
9. Copy the certificate to any nodes that will access the private registry for installation and update the trusted list:
Example
label: admin.lab.ibm.com
Syntax
Example
This starts the private registry on port 5000 and mounts the volumes of the registry directories in the container running the
registry.
11. On the local registry node, verify that cp.icr.io/cp is in the container registry search path.
a. Open for editing the /etc/containers/registries.conf file, and add cp.icr.io/cp to the unqualified-
search-registries list, if it does not exist:
Example
unqualified-search-registries = ["cp.icr.io/cp"]
Syntax
13. Copy the following IBM Storage Ceph 5 image, Prometheus images, and Dashboard image from the IBM Customer Portal to
the private registry:
Syntax
Replace SRC_IMAGE and SRC_TAG with the name and tag of the image to copy from cp.icr.io/cp.
Replace DST_IMAGE and DST_TAG with the name and tag of the image to copy to the private registry.
Example
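As an illustration, copying the Ceph container image with skopeo might look like the following sketch. The source path under cp.icr.io/cp and the latest tag are assumptions; substitute the image names, tags, and credentials for your environment.
skopeo copy --src-creds cp:ENTITLEMENT_KEY --dest-creds PRIVATE_REGISTRY_USERNAME:PRIVATE_REGISTRY_PASSWORD docker://cp.icr.io/cp/ibm-ceph/ceph-5-rhel8:latest docker://admin.lab.ibm.com:5000/ibm-ceph/ceph-5-rhel8:latest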
14. Using the curl command, verify the images reside in the local registry:
Syntax
curl -u PRIVATE_REGISTRY_USERNAME:PRIVATE_REGISTRY_PASSWORD https://LOCAL_NODE_FQDN:5000/v2/_catalog
Example
{"repositories":["ibm-ceph/prometheus","ibm-ceph/prometheus-alertmanager","ibm-
ceph/prometheus-node-exporter","ibm-ceph/ceph-5-dashboard-rhel8","ibm-ceph/ceph-5-rhel8"]}
See the Red Hat knowledge centered solution What are the Red Hat Ceph Storage releases and corresponding Ceph package
versions? for different image Ceph package versions.
IMPORTANT: Skip these steps for Red Hat Enterprise Linux 9 as cephadm-preflight playbook is not supported.
The preflight playbook uses the cephadm-ansible inventory hosts file to identify all the nodes in the storage cluster. The default
location for cephadm-ansible, cephadm-preflight.yml, and the inventory hosts file is /usr/share/cephadm-ansible/.
Example
host02
host03
host04
[admin]
host01
The [admin] group in the inventory file contains the name of the node where the admin keyring is stored.
NOTE: Run the preflight playbook before you bootstrap the initial host.
Prerequisites
Nodes configured to access a local YUM repository server with the following repositories enabled on respective Red Hat
Enterprise Linux versions.
rhel-8-for-x86_64-baseos-rpms
rhel-8-for-x86_64-appstream-rpms
rhel-9-for-x86_64-baseos-rpms
rhel-9-for-x86_64-appstream-rpms
NOTE: For more information about setting up a local YUM repository, see the Red Hat knowledge base article Creating a Local
Repository and Sharing with Disconnected/Offline/Air-gapped Systems
Procedure
2. Open and edit the hosts file and add your nodes.
Example
Example
4. Run the preflight playbook with the ceph_origin parameter set to custom to use a local YUM repository:
Syntax
Example
5. Alternatively, you can use the --limit option to run the preflight playbook on a selected set of hosts in the storage cluster:
Syntax
Replace GROUP_NAME with a group name from your inventory file. Replace NODE_NAME with a specific node name from your
inventory file.
Example
NOTE: When you run the preflight playbook, cephadm-ansible automatically installs chronyd and ceph-common on the
client nodes.
NOTE: If your local registry uses a self-signed certificate, ensure that you have added the trusted root certificate to the bootstrap
host. For more information, see Configuring a private registry for a disconnected installation.
IMPORTANT: Before you begin the bootstrapping process, make sure that the container image that you want to use has the same
version of IBM Storage Ceph as cephadm. If the two versions do not match, bootstrapping fails at the Creating initial admin
user stage.
Prerequisites
Edit online
The preflight playbook has been run on the bootstrap host in the storage cluster. For more information, see Running the
preflight playbook for a disconnected installation.
A private registry has been configured and the bootstrap node has access to it. For more information, see Configuring a private
registry for a disconnected installation
Procedure
Edit online
Syntax
Replace PRIVATE_REGISTRY_NODE_FQDN with the fully qualified domain name of your private registry.
Replace CUSTOM_IMAGE_NAME and IMAGE_TAG with the name and tag of the IBM Storage Ceph container image that
resides in the private registry.
Replace IP_ADDRESS with the IP address of the node you are using to run cephadm bootstrap.
Replace PRIVATE_REGISTRY_USERNAME with the username to create for the private registry.
Replace PRIVATE_REGISTRY_PASSWORD with the password to create for the private registry username.
Example
The script takes a few minutes to complete. Once the script completes, it provides the credentials to the IBM Storage
Ceph Dashboard URL, a command to access the Ceph command-line interface (CLI), and a request to enable telemetry.
URL: https://host01:8443/
User: admin
Password: i8nhu7zham
ceph telemetry on
https://docs.ceph.com/docs/master/mgr/telemetry/
Bootstrap complete.
After the bootstrap process is complete, configure the container images, as detailed in Changing configurations of custom container
images for disconnected installations.
Once your storage cluster is up and running, configure additional daemons and services. For more information, see Operations.
NOTE: Make sure that the bootstrap process on the initial host is complete before making any configuration changes.
By default, the monitoring stack components are deployed based on the primary Ceph image. For a disconnected environment of the
storage cluster, you can use the latest available monitoring stack component images.
NOTE: When using a custom registry, be sure to log in to the custom registry on newly added nodes before adding any Ceph
daemons.
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
1. Set the custom container images with the ceph config command:
Syntax
container_image_prometheus
container_image_grafana
container_image_alertmanager
container_image_node_exporter
Example
Syntax
NOTE: If any of the services do not deploy, you can redeploy them with the ceph orch redeploy command.
NOTE: Setting a custom image overrides, but does not overwrite, the default values for the configuration image name and tag. The
default values change when updates become available. A component for which you set a custom image is not updated automatically;
you must manually update the configuration image name and tag to install updates.
If you choose to revert to using the default configuration, you can reset the custom container image. Use ceph config rm to
reset the configuration option:
Syntax
Example
Reference
Edit online
You can also generate an SSH key pair on the Ansible administration node and distribute the public key to each node in the storage
cluster so that Ansible can access the nodes without being prompted for a password.
Prerequisites
Edit online
Ansible user with sudo access to all nodes in the storage cluster.
Bootstrapping is completed. See Bootstrapping a new storage cluster for more details.
Procedure
Edit online
Example
2. From the Ansible administration node, distribute the SSH keys. The optional cephadm_pubkey_path parameter is the full
path name of the SSH public key file on the ansible controller host.
Syntax
Example
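A hedged sketch of such a run is shown below; it assumes the cephadm-distribute-ssh-key.yml playbook shipped with cephadm-ansible and its cephadm_ssh_user, cephadm_pubkey_path, and admin_node extra variables, with illustrative values.
ansible-playbook -i hosts cephadm-distribute-ssh-key.yml -e cephadm_ssh_user=ceph-admin -e cephadm_pubkey_path=/home/ceph-admin/.ssh/ceph.pub -e admin_node=host01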
Edit online
The cephadm shell command launches a bash shell in a container with all of the Ceph packages installed. This enables you to
perform “Day One” cluster setup tasks, such as installation and bootstrapping, and to invoke ceph commands.
Prerequisites
Edit online
Procedure
Edit online
There are two ways to launch the cephadm shell:
Enter cephadm shell at the system prompt. This example invokes the ceph -s command from within the shell.
Example
At the system prompt, type cephadm shell and the command you want to execute:
Example
services:
mon: 3 daemons, quorum host01,host02,host03 (age 94m)
mgr: host01.lbnhug(active, since 59m), standbys: host02.rofgay, host03.ohipra
mds: 1/1 daemons up, 1 standby
osd: 18 osds: 18 up (since 10m), 18 in (since 10m)
rgw: 4 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 8 pools, 225 pgs
objects: 230 objects, 9.9 KiB
usage: 271 MiB used, 269 GiB / 270 GiB avail
pgs: 225 active+clean
NOTE: If the node contains configuration and keyring files in /etc/ceph/, the container environment uses the values in those files
as defaults for the cephadm shell. If you execute the cephadm shell on a MON node, the cephadm shell inherits its default
configuration from the MON container, instead of using the default configuration.
There are two ways of verifying the storage cluster installation as a root user:
Prerequisites
Edit online
Procedure
Edit online
Example
NOTE: In the NAMES column, the unit files now include the FSID.
Example
cluster:
id: f64f341c-655d-11eb-8778-fa163e914bcc
health: HEALTH_OK
services:
mon: 3 daemons, quorum host01,host02,host03 (age 94m)
mgr: host01.lbnhug(active, since 59m), standbys: host02.rofgay, host03.ohipra
mds: 1/1 daemons up, 1 standby
osd: 18 osds: 18 up (since 10m), 18 in (since 10m)
rgw: 4 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 8 pools, 225 pgs
objects: 230 objects, 9.9 KiB
usage: 271 MiB used, 269 GiB / 270 GiB avail
pgs: 225 active+clean
io:
client: 85 B/s rd, 0 op/s rd, 0 op/s wr
NOTE: The health of the storage cluster is in HEALTH_WARN status when hosts and daemons have not yet been added.
NOTE: For Red Hat Enterprise Linux 8, running the preflight playbook installs podman, lvm2, chronyd, and cephadm on all hosts
listed in the Ansible inventory file.
NOTE: For Red Hat Enterprise Linux 9, you need to manually install podman, lvm2, chronyd, and cephadm on all hosts and skip
steps for running ansible playbooks as the preflight playbook is not supported.
NOTE: When using a custom registry, be sure to log in to the custom registry on newly added nodes before adding any Ceph
daemons.
Syntax
Example
Prerequisites
Edit online
Root-level or user with sudo access to all nodes in the storage cluster.
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
Procedure
Edit online
1. From the node that contains the admin keyring, install the storage cluster’s public SSH key in the root user’s
authorized_keys file on the new host:
NOTE: In the following procedure, use either root, as indicated, or the user name with which the cluster was bootstrapped.
Syntax
Example
Example
Example
host02
host03
host04
[admin]
host01
NOTE: If you have previously added the new host to the Ansible inventory file and run the preflight playbook on the host, skip
to step 4.
Syntax
Example
The preflight playbook installs podman, lvm2, chronyd, and cephadm on the new host. After installation is complete,
cephadm resides in the /usr/sbin/ directory.
For Red Hat Enterprise Linux 9, install podman, lvm2, chronyd, and cephadm manually:
Example
5. From the bootstrap node, use the cephadm orchestrator to add the new host to the storage cluster:
Syntax
Example
6. Optional: You can also add nodes by IP address, before and after you run the preflight playbook. If you do not have DNS
configured in your storage cluster environment, you can add the hosts by IP address, along with the host names.
Syntax
Example
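For instance, adding a hypothetical host05 by its IP address might look like the following; the host name and address are placeholders.
ceph orch host add host05 10.10.128.70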
View the status of the storage cluster and verify that the new host has been added. The STATUS of the hosts is blank in
the output of the ceph orch host ls command.
Example
Reference
Registering IBM Storage Ceph nodes to the CDN and attaching subscriptions
Edit online
The addr option offers an additional way to contact a host. Add the IP address of the host to the addr option. If ssh cannot connect
to the host by its hostname, then it uses the value stored in addr to reach the host by its IP address.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
NOTE: If adding a host by hostname results in that host being added with an IPv6 address instead of an IPv4 address, use
ceph orch host set-addr to specify the IP address of that host:
Syntax
To change the IP address of a host that you have added from IPv6 format to IPv4 format, use the following command:
NOTE: Be sure to create the hosts.yaml file within a host container, or create the file on the local host and then use the cephadm
shell to mount the file within the container. The cephadm shell automatically places mounted files in /mnt. If you create the file
directly on the local host and then apply the hosts.yaml file instead of mounting it, you might see a File does not exist error.
Prerequisites
Edit online
Procedure
Edit online
1. Copy over the public ssh key to each of the hosts that you want to add.
3. Add the host descriptions to the hosts.yaml file, as shown in the following example. Include the labels to identify
placements for the daemons that you want to deploy on each host. Separate each host description with three dashes (---).
Example
service_type: host
addr:
hostname: host02
labels:
- mon
- osd
- mgr
---
service_type: host
addr:
hostname: host03
labels:
- mon
- osd
- mgr
---
service_type: host
addr:
hostname: host04
labels:
- mon
- osd
4. If you created the hosts.yaml file within the host container, invoke the ceph orch apply command:
Example
5. If you created the hosts.yaml file directly on the local host, use the cephadm shell to mount the file:
Example
[root@host01 ~]# cephadm shell --mount hosts.yaml -- ceph orch apply -i /mnt/hosts.yaml
Example
NOTE: If a host is online and operating normally, its status is blank. An offline host shows a status of OFFLINE, and a host in
maintenance mode shows a status of MAINTENANCE.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Syntax
Example
Removing hosts
Edit online
You can remove hosts from a Ceph cluster with the Ceph Orchestrator. All of the daemons are removed with the drain option, which
adds the _no_schedule label to ensure that you cannot deploy any daemons on the host until the operation is complete.
IMPORTANT: If you are removing the bootstrap host, be sure to copy the admin keyring and the configuration file to another host in
the storage cluster before you remove the host.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Syntax
Example
The _no_schedule label is automatically applied to the host which blocks deployment.
Example
When no placement groups (PG) are left on the OSD, the OSD is decommissioned and removed from the storage cluster.
5. Check if all the daemons are removed from the storage cluster:
Syntax
Example
Syntax
Example
Reference
Edit online
Labeling hosts
Edit online
The Ceph orchestrator supports assigning labels to hosts. Labels are free-form and have no specific meanings. This means that you
can use mon, monitor, mycluster_monitor, or any other text string. Each host can have multiple labels.
For example, apply the mon label to all hosts on which you want to deploy Ceph Monitor daemons, mgr for all hosts on which you
want to deploy Ceph Manager daemons, rgw for Ceph Object Gateway daemons, and so on.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host02 mon
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Method 1: Use the --placement option to deploy a daemon from the command line:
Syntax
Example
Method 2: To assign the daemon to a specific host label in a YAML file, specify the service type and label in the YAML
file:
Example
Example
service_type: prometheus
placement:
label: "mylabel"
Syntax
Example
Verification
Edit online
Syntax
Example
NOTE: In the case of a firewall, see Firewall settings for Ceph Monitor node
NOTE: The bootstrap node is the initial monitor of the storage cluster. Be sure to include the bootstrap node in the list of hosts to
which you want to deploy.
NOTE: If you want to apply Monitor service to more than one specific host, be sure to specify all of the host names within the same
ceph orch apply command. If you specify ceph orch apply mon --placement host1 and then specify ceph orch
apply mon --placement host2, the second command removes the Monitor service on host1 and applies a Monitor service to
host2.
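For example, to place Monitor daemons on three specific hosts with a single command (the host names are illustrative):
ceph orch apply mon --placement="host01 host02 host03"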
If your Monitor nodes or your entire cluster are located on a single subnet, then cephadm automatically adds up to five Monitor
daemons as you add new hosts to the cluster. cephadm automatically configures the Monitor daemons on the new hosts. The new
hosts reside on the same subnet as the first (bootstrap) host in the storage cluster. cephadm can also deploy and scale monitors to
correspond to changes in the size of the storage cluster.
Prerequisites
Edit online
Procedure
Edit online
1. Apply the five Monitor daemons to five random hosts in the storage cluster:
Use host labels to identify the hosts that contain Monitor nodes.
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host01 mon
Syntax
Example
Syntax
Syntax
Example
NOTE: Be sure to include the bootstrap node in the list of hosts to which you want to deploy.
An admin node contains both the cluster configuration file and the admin keyring. Both of these files are stored in the directory
/etc/ceph and use the name of the storage cluster as a prefix.
For example, the default ceph cluster name is ceph. In a cluster using the default name, the admin keyring is named
/etc/ceph/ceph.client.admin.keyring. The corresponding cluster configuration file is named /etc/ceph/ceph.conf.
To set up additional hosts in the storage cluster as admin nodes, apply the _admin label to the host you want to designate as an
administrator node.
NOTE: By default, after applying the _admin label to a node, cephadm copies the ceph.conf and client.admin keyring files to
that node. The _admin label is automatically applied to the bootstrap node unless the --skip-admin-label option was specified
with the cephadm bootstrap command.
Prerequisites
Edit online
Procedure
Edit online
1. Use ceph orch host ls to view the hosts in your storage cluster:
2. Use the _admin label to designate the admin host in your storage cluster. For best results, this host should have both Monitor
and Manager daemons running.
Syntax
Example
Example
If your Ceph Monitor nodes or your entire cluster are located on a single subnet, then cephadm automatically adds up to five Ceph
Monitor daemons as you add new nodes to the cluster. cephadm automatically configures the Ceph Monitor daemons on the new
nodes. The new nodes reside on the same subnet as the first (bootstrap) node in the storage cluster. cephadm can also deploy and
scale monitors to correspond to changes in the size of the storage cluster.
NOTE: Use host labels to identify the hosts that contain Ceph Monitor nodes.
Prerequisites
Edit online
Procedure
Edit online
Syntax
[ceph: root@host01 /]# ceph orch host label add host02 mon
[ceph: root@host01 /]# ceph orch host label add host03 mon
Syntax
Example
Syntax
Syntax
Example
NOTE: Be sure to include the bootstrap node in the list of hosts to which you want to deploy.
If your Monitor nodes or your entire cluster are located on a single subnet, then cephadm automatically adds up to five Monitor
daemons as you add new nodes to the cluster. You do not need to configure the Monitor daemons on the new nodes. The new nodes
reside on the same subnet as the first node in the storage cluster. The first node in the storage cluster is the bootstrap node.
cephadm can also deploy and scale monitors to correspond to changes in the size of the storage cluster.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
1. Use ceph orch host ls to view the hosts and identify the admin host in your storage cluster:
Example
Example
3. Use the ceph orchestrator to remove the admin label from a host:
Syntax
Example
Example
The Ceph orchestrator deploys two Manager daemons by default. To deploy a different number of Manager daemons, specify a
different number. If you do not specify the hosts where the Manager daemons should be deployed, the Ceph orchestrator randomly
selects the hosts and deploys the Manager daemons to them.
NOTE: If you want to apply Manager daemons to more than one specific host, be sure to specify all of the host names within the
same ceph orch apply command. If you specify ceph orch apply mgr --placement host1 and then specify ceph orch
apply mgr --placement host2, the second command removes the Manager daemon on host1 and applies a Manager daemon
to host2.
Prerequisites
Edit online
Procedure
Edit online
To specify that you want to apply a certain number of Manager daemons to randomly selected hosts:
Syntax
Example
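For example, to let the orchestrator place three Manager daemons on randomly selected hosts:
ceph orch apply mgr 3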
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mgr --placement "host02 host03 host04"
Adding OSDs
Edit online
Cephadm will not provision an OSD on a device that is not available. A storage device is considered available if it meets all of the
following conditions:
IMPORTANT: By default, the osd_memory_target_autotune parameter is set to true in IBM Storage Ceph 5.3.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
2. You can either deploy the OSDs on specific hosts or on all the available devices:
Syntax
Example
To deploy OSDs on any available and unused devices, use the --all-available-devices option.
Example
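For example, a minimal form of that command is:
ceph orch apply osd --all-available-devices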
NOTE: This command creates colocated WAL and DB daemons. If you want to create non-colocated daemons, do not
use this command.
Reference
Edit online
For more information about drive specifications for OSDs, see Advanced service specifications and filters for deploying OSDs
For more information about zapping devices to clear data on devices, see Zapping devices for Ceph OSD deployment
The Ansible inventory file lists all the hosts in your cluster and what roles each host plays in your Ceph storage cluster. The default
location for an inventory file is /usr/share/cephadm-ansible/hosts, but this file can be placed anywhere.
Example
host02
host03
host04
[admin]
host01
[clients]
client01
client02
client03
Prerequisites
Edit online
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
The [admin] group is defined in the inventory file with a node where the admin keyring is present at
/etc/ceph/ceph.client.admin.keyring.
Procedure
Edit online
Syntax
Example
NOTE: An additional extra-var (-e ceph_origin=ibm) is required to zap the disk devices during the purge.
When the script has completed, the entire storage cluster, including all OSD disks, will have been removed from all hosts in the
cluster.
Prerequisites
Edit online
1. Disable cephadm to stop all the orchestration operations to avoid deploying new daemons:
Example
Example
Syntax
Example
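A sketch of running the purge is shown below; it assumes the cephadm-purge-cluster.yml playbook from cephadm-ansible and its fsid extra variable, together with the ceph_origin=ibm extra-var described above. Replace FSID with the FSID of your cluster.
ansible-playbook -i hosts cephadm-purge-cluster.yml -e fsid=FSID -e ceph_origin=ibm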
IMPORTANT: Skip these steps for Red Hat Enterprise Linux 9 as cephadm-preflight playbook is not supported.
The cephadm-clients.yml playbook handles the distribution of configuration and keyring files to a group of Ceph clients.
NOTE: If you are not using the cephadm-ansible playbooks, after upgrading your Ceph cluster, you must upgrade the ceph-
common package and client libraries on your client nodes. For more information, see Upgrading the IBM Storage Ceph cluster.
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The [admin] group is defined in the inventory file with a node where the admin keyring is present at
/etc/ceph/ceph.client.admin.keyring.
Procedure
Edit online
1. As an Ansible user, navigate to the /usr/share/cephadm-ansible directory on the Ansible administration node.
Example
2. Open and edit the hosts inventory file and add the [clients] group and clients to your inventory:
host02
host03
host04
[clients]
client01
client02
client03
[admin]
host01
Syntax
Example
4. Run the cephadm-clients.yml playbook to distribute the keyring and Ceph configuration files to a set of clients.
Syntax
Replace KEYRING_PATH with the full path name to the keyring on the admin host that you want to copy to the
client.
Optional: Replace CLIENT_GROUP_NAME with the Ansible group name for the clients to set up.
Optional: Replace CEPH_CONFIGURATION_PATH with the full path to the Ceph configuration file on the admin
node.
Optional: Replace KEYRING_DESTINATION_PATH with the full path name of the destination where the keyring
will be copied.
NOTE: If you do not specify a configuration file with the conf option when you run the playbook, the playbook
generates and distributes a minimal configuration file. By default, the generated file is located at
/etc/ceph/ceph.conf.
Example
b. To copy a keyring with the default destination keyring name of ceph.keyring and using the default group of
clients:
Syntax
Example
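As a hedged sketch, and assuming the playbook's extra variables are named fsid, keyring, client_group, and conf to match the placeholders above, such a run might look like the following:
ansible-playbook -i hosts cephadm-clients.yml -e fsid=FSID -e keyring=/etc/ceph/ceph.client.admin.keyring -e client_group=clients -e conf=/etc/ceph/ceph.conf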
Verification
Edit online
Log into the client nodes and verify that the keyring and configuration files exist.
Example
Reference
Edit online
For more information about admin keys, see Ceph User Management.
For more information about the cephadm-preflight playbook, see Running the preflight playbook.
NOTE: At this time, cephadm-ansible modules only support the most important tasks. Any operation not covered by cephadm-
ansible modules must be completed using either the command or shell Ansible modules in your playbooks.
cephadm-ansible modules
cephadm-ansible modules options
Bootstrapping a storage cluster using the cephadm_bootstrap and cephadm_registry_login modules
Adding or removing hosts using the ceph_orch_host module
Setting configuration options using the ceph_config module
Applying a service specification using the ceph_orch_apply module
Managing Ceph daemon states using the ceph_orch_daemon module
cephadm-ansible modules
Edit online
The cephadm-ansible modules are a collection of modules that simplify writing Ansible playbooks by providing a wrapper around
cephadm and ceph orch commands. You can use the modules to write your own unique Ansible playbooks to administer your
cluster using one or more of the modules.
cephadm_bootstrap
ceph_orch_host
ceph_config
ceph_orch_daemon
cephadm_registry_login
Edit online
The following tables list the available options for the cephadm-ansible modules. Options listed as required need to be set when
using the modules in your Ansible playbooks. Options listed with a default value of true indicate that the option is automatically set
when using the modules and you do not need to specify it in your playbook. For example, for the cephadm_bootstrap module, the
Ceph Dashboard is installed unless you set dashboard: false.
Edit online
As a storage administrator, you can bootstrap a storage cluster using Ansible by using the cephadm_bootstrap and
cephadm_registry_login modules in your Ansible playbook.
Prerequisites
Edit online
An IP address for the first Ceph Monitor container, which is also the IP address for the first node in the storage cluster.
Procedure
Edit online
Example
3. Create the hosts file and add hosts, labels, and monitor IP address of the first host in the storage cluster:
Syntax
sudo vi INVENTORY_FILE
[admin]
ADMIN_HOST monitor_address=MONITOR_IP_ADDRESS labels="[ADMIN_LABEL, LABEL1, LABEL2]"
Example
[admin]
host01 monitor_address=10.10.128.68 labels="['_admin', 'mon', 'mgr']"
Syntax
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: NAME_OF_PLAY
hosts: BOOTSTRAP_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
-name: NAME_OF_TASK
cephadm_registry_login:
state: STATE
registry_url: REGISTRY_URL
registry_username: REGISTRY_USER_NAME
registry_password: REGISTRY_PASSWORD
Example
---
- name: bootstrap the cluster
hosts: host01
become: true
gather_facts: false
tasks:
- name: login to registry
cephadm_registry_login:
state: login
registry_url: cp.icr.io/cp
registry_username: user1
registry_password: mypassword1
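In a complete playbook, a bootstrap task typically follows the registry login task. A minimal sketch using the cephadm_bootstrap module is shown below; only the mon_ip option is set, and its value is a placeholder.
    - name: bootstrap the initial cluster
      cephadm_bootstrap:
        mon_ip: 10.10.128.68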
Syntax
Example
Edit online
Add and remove hosts in your storage cluster by using the ceph_orch_host module in your Ansible playbook.
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
New hosts have the storage cluster’s public SSH key. For more information about copying the storage cluster’s public SSH
keys to new hosts, see Adding hosts.
Procedure
Example
c. Add the new hosts and labels to the Ansible inventory file.
Syntax
sudo vi INVENTORY_FILE
[admin]
ADMIN_HOST monitor_address=MONITOR_IP_ADDRESS labels="[ADMIN_LABEL, LABEL1, LABEL2]"
Example
[admin]
host01 monitor_address=10.10.128.68 labels="['_admin', 'mon', 'mgr']"
NOTE: If you have previously added the new hosts to the Ansible inventory file and ran the preflight playbook on the
hosts, skip to step 3.
Syntax
Example
The preflight playbook installs podman, lvm2, chronyd, and cephadm on the new host. After installation is complete,
cephadm resides in the /usr/sbin/ directory.
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: HOSTS_OR_HOST_GROUPS
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_host:
name: "{{ ansible_facts[hostname] }}"
address: "{{ ansible_facts[default_ipv4][address] }}"
labels: "{{ labels }}"
delegate_to: HOST_TO_DELEGATE_TASK_TO
- name: NAME_OF_TASK
when: inventory_hostname in groups[admin]
debug:
msg: "{{ REGISTER_NAME.stdout }}"
NOTE: By default, Ansible executes all tasks on the host that matches the hosts line of your playbook. The ceph
orch commands must run on the host that contains the admin keyring and the Ceph configuration file. Use the
delegate_to keyword to specify the admin host in your cluster.
Example
---
- name: add additional hosts to the cluster
hosts: all
become: true
gather_facts: true
tasks:
- name: add hosts to the cluster
ceph_orch_host:
name: "{{ ansible_facts['hostname'] }}"
address: "{{ ansible_facts['default_ipv4']['address'] }}"
labels: "{{ labels }}"
delegate_to: host01
In this example, the playbook adds the new hosts to the cluster and displays a current list of hosts.
Syntax
Example
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: NAME_OF_PLAY
hosts: ADMIN_HOST
- name: NAME_OF_TASK
ceph_orch_host:
name: HOST_TO_REMOVE
state: STATE
retries: NUMBER_OF_RETRIES
delay: DELAY
until: CONTINUE_UNTIL
register: REGISTER_NAME
- name: NAME_OF_TASK
ansible.builtin.shell:
cmd: ceph orch host ls
register: REGISTER_NAME
- name: NAME_OF_TASK
debug:
msg: "{{ REGISTER_NAME.stdout }}"
Example
---
- name: remove host
hosts: host01
become: true
gather_facts: true
tasks:
- name: drain host07
ceph_orch_host:
name: host07
state: drain
In this example, the playbook tasks drain all daemons on host07, removes the host from the cluster, and displays a
current list of hosts.
Syntax
Example
Verification
Review the Ansible task output displaying the current list of hosts in the cluster:
Example
Edit online
As a storage administrator, you can set or get IBM Storage Ceph configuration options using the ceph_config module.
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts. For more information about adding hosts to your storage
cluster, see Adding or removing hosts using the ceph_orch_host module.
Procedure
Edit online
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: ADMIN_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_config:
action: GET_OR_SET
who: DAEMON_TO_SET_CONFIGURATION_TO
option: CEPH_CONFIGURATION_OPTION
- name: NAME_OF_TASK
ceph_config:
action: GET_OR_SET
who: DAEMON_TO_SET_CONFIGURATION_TO
option: CEPH_CONFIGURATION_OPTION
register: REGISTER_NAME
- name: NAME_OF_TASK
debug:
msg: "MESSAGE_TO_DISPLAY {{ REGISTER_NAME.stdout }}"
Example
---
- name: set pool delete
hosts: host01
become: true
gather_facts: false
tasks:
- name: set the allow pool delete option
ceph_config:
action: set
who: mon
option: mon_allow_pool_delete
value: true
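    # Illustrative addition (not part of the example above): read the option back
    # and print it, using the register and debug pattern from the syntax template.
    - name: get the allow pool delete option
      ceph_config:
        action: get
        who: mon
        option: mon_allow_pool_delete
      register: verify_pool_delete
    - name: print current mon_allow_pool_delete setting
      debug:
        msg: "the value of 'mon_allow_pool_delete' is {{ verify_pool_delete.stdout }}"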
In this example, the playbook first sets the mon_allow_pool_delete option to true. The playbook then gets the current
mon_allow_pool_delete setting and displays the value in the Ansible output.
Syntax
Example
Verification
Edit online
Example
Reference
Edit online
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts. For more information about adding hosts to your storage
cluster, see Adding or removing hosts using the ceph_orch_host module.
Procedure
Edit online
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: HOSTS_OR_HOST_GROUPS
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_apply:
spec: |
service_type: SERVICE_TYPE
service_id: UNIQUE_NAME_OF_SERVICE
placement:
host_pattern: HOST_PATTERN_TO_SELECT_HOSTS
label: LABEL
spec:
SPECIFICATION_OPTIONS:
Example
---
- name: deploy osd service
In this example, the playbook deploys the Ceph OSD service on all hosts with the label osd.
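Because the example above is abbreviated, the following is a hedged sketch of what a complete play might look like, built from the syntax template above; the service_id, label, and data_devices settings are illustrative assumptions.
---
- name: deploy osd service
  hosts: host01
  become: true
  gather_facts: true
  tasks:
    - name: apply the osd spec
      ceph_orch_apply:
        spec: |
          service_type: osd
          service_id: osd
          placement:
            label: osd
          spec:
            data_devices:
              all: true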
Syntax
Example
Verification
Edit online
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts. For more information about adding hosts to your storage
cluster, see Adding or removing hosts using the ceph_orch_host module.
Procedure
Edit online
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: ADMIN_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_daemon:
state: STATE_OF_SERVICE
daemon_id: DAEMON_ID
daemon_type: TYPE_OF_SERVICE
Example
---
- name: start and stop services
hosts: host01
become: true
gather_facts: false
tasks:
- name: start osd.0
ceph_orch_daemon:
state: started
daemon_id: 0
daemon_type: osd
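    # Illustrative addition (not shown in the original example): the Monitor stop
    # task that the description below refers to might look like this.
    - name: stop mon on host02
      ceph_orch_daemon:
        state: stopped
        daemon_id: host02
        daemon_type: mon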
In this example, the playbook starts the OSD with an ID of 0 and stops a Ceph Monitor with an id of host02.
Syntax
Example
Verification
Edit online
The tables compare Cephadm with Ceph-Ansible playbooks for managing the containerized deployment of a Ceph cluster for day
one and day two operations.
cephadm commands
Edit online
The cephadm is a command line tool to manage the local host for the Cephadm Orchestrator. It provides commands to investigate
and modify the state of the current host.
NOTE: cephadm is not required on all hosts; however, it is useful when investigating a particular daemon. The cephadm-preflight
playbook installs cephadm on all hosts, and the cephadm-ansible purge playbook requires cephadm to be installed on all hosts to
work properly.
Description
Convert an upgraded storage cluster daemon to run cephadm.
Syntax
cephadm adopt [-h] --name DAEMON_NAME --style STYLE [--cluster CLUSTER] [--legacy-dir LEGACY_DIR]
[--config-json CONFIG_JSON] [--skip-firewalld] [--skip-pull]
Example
ceph-volume
Description
Runs the ceph-volume command inside a container. This command lists all of the devices on the particular host and deploys OSDs on
different device technologies, such as lvm or physical disks, using pluggable tools. It follows a predictable and robust way of
preparing, activating, and starting OSDs.
Syntax
Example
check-host
Description
Check the host configuration that is suitable for a Ceph cluster.
Syntax
Example
deploy
Description
Deploys a daemon on the local host.
Syntax
cephadm shell deploy DAEMON_TYPE [-h] [--name DAEMON_NAME] [--fsid FSID] [--config CONFIG, -c CONFIG]
[--config-json CONFIG_JSON] [--keyring KEYRING] [--key KEY] [--osd-fsid OSD_FSID] [--skip-firewalld]
[--tcp-ports TCP_PORTS] [--reconfig] [--allow-ptrace] [--memory-request MEMORY_REQUEST]
[--memory-limit MEMORY_LIMIT] [--meta-json META_JSON]
Example
enter
Description
Run an interactive shell inside a running daemon container.
Syntax
cephadm enter [-h] [--fsid FSID] --name NAME [command [command …]]
Example
help
Syntax
cephadm help
Example
install
Description
Install the packages.
Syntax
Example
inspect-image
Description
Inspect the local Ceph container image.
Syntax
Example
list-networks
Description
List the IP networks.
Syntax
cephadm list-networks
Example
ls
Description
List daemon instances known to cephadm on the hosts. You can use --no-detail for the command to run faster, which gives
details of the daemon name, fsid, style, and systemd unit per daemon. You can use --legacy-dir option to specify a legacy base
directory to search for daemons.
Syntax
Example
logs
Description
Print journald logs for a daemon container. This is similar to the journalctl command.
Syntax
Example
prepare-host
Description
Prepare a host for cephadm.
Syntax
Example
pull
Description
Pull the Ceph image.
Syntax
Example
registry-login
Description
Give cephadm login information for an authenticated registry. Cephadm attempts to log the calling host into that registry.
Syntax
Example
You can also use a JSON registry file containing the login info formatted as:
Syntax
cat REGISTRY_FILE
{
"url":"REGISTRY_URL",
"username":"REGISTRY_USERNAME",
"password":"REGISTRY_PASSWORD"
}
Example
{
"url":"cp.icr.io/cp",
"username":"myuser",
"password":"mypass"
}
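A file in this format can then be passed to the command. The --registry-json flag shown here is an assumption, so confirm it with cephadm registry-login --help on your system.
cephadm registry-login --registry-json mylogin.json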
rm-daemon
Description
Remove a specific daemon instance. If you run the cephadm rm-daemon command on the host directly, although the command
removes the daemon, the cephadm mgr module notices that the daemon is missing and redeploys it. This command is problematic
and should be used only for experimental purposes and debugging.
Syntax
Example
rm-cluster
Description
Remove all the daemons from a storage cluster on that specific host where it is run. Similar to rm-daemon, if you remove a few
daemons this way and the Ceph Orchestrator is not paused and some of those daemons belong to services that are not unmanaged,
the cephadm orchestrator just redeploys them there.
Syntax
Example
rm-repo
Description
Remove a package repository configuration. This is mainly used for the disconnected installation of IBM Storage Ceph.
Syntax
Example
run
Description
Run a Ceph daemon, in a container, in the foreground.
Syntax
Example
shell
Description
Run an interactive shell with access to Ceph commands over the inferred or specified Ceph cluster. You can enter the shell using the
cephadm shell command and run all the orchestrator commands within the shell.
Syntax
cephadm shell [--fsid FSID] [--name DAEMON_NAME, -n DAEMON_NAME] [--config CONFIG, -c CONFIG] [--
mount MOUNT, -m MOUNT] [--keyring KEYRING, -k KEYRING] [--env ENV, -e ENV]
Example
Description
Start, stop, restart, enable, and disable the daemons with this operation. This operates on the daemon’s systemd unit.
Syntax
Example
version
Description
Provides the version of the storage cluster.
Syntax
cephadm version
Example
For more information about how to use the cephadm orchestrator to perform "Day Two" operations, see Operations.
To deploy, configure, and administer the Ceph Object Gateway on "Day Two" operations, see Object Gateway.
Upgrading
Edit online
Upgrade to an IBM Storage Ceph cluster running Red Hat Enterprise Linux on AMD64 and Intel 64 architectures.
Edit online
As a storage administrator, you can use the cephadm Orchestrator to upgrade from Red Hat Ceph Storage 5.3 to IBM Storage Ceph
5.3.
You can also use the Orchestrator to upgrade an IBM Storage Ceph cluster to later releases.
The automated upgrade process follows Ceph best practices. For example:
The upgrade order starts with Ceph Managers, Ceph Monitors, then other daemons.
Each daemon is restarted only after Ceph indicates that the cluster will remain available.
The storage cluster health status is likely to switch to HEALTH_WARNING during the upgrade. When the upgrade is complete, the
health status should switch back to HEALTH_OK.
IMPORTANT: Red Hat Enterprise Linux 9 and later does not support the cephadm-ansible playbook.
Prerequisites
Edit online
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
At least two Ceph Manager nodes in the storage cluster: one active and one standby.
NOTE: IBM Storage Ceph 5 also includes a health check function that returns a DAEMON_OLD_VERSION warning if it detects that
any of the daemons in the storage cluster are running multiple versions of IBM Storage Ceph. The warning is triggered when the
daemons continue to run multiple versions of IBM Storage Ceph beyond the time value set in the
mon_warn_older_version_delay option. By default, the mon_warn_older_version_delay option is set to 1 week. This
setting allows most upgrades to proceed without falsely seeing the warning.
If the upgrade process is paused for an extended time period, you can mute the health warning:
Procedure
Edit online
1. Enable the Red Hat Enterprise Linux baseos and appstream repositories:
Example
Example
IMPORTANT: This step must be performed to avoid upgrade failures of cephadm and ceph-ansible packages.
Example
3. Enable the ceph-tools repository (/etc/yum.repos.d/) for both Red Hat Enterprise Linux 8 and Red Hat Enterprise Linux
9:
Repeat the above steps on all the nodes of the storage cluster
4. Add license to install IBM Storage Ceph and click Accept on all nodes:
Example
Example
5. Perform this step if upgrading from Red Hat Ceph Storage 5.3 to IBM Storage Ceph 5.3:
Example
Example
7. Run the preflight playbook with the upgrade_ceph_packages parameter set to true on the bootstrapped host in the
storage cluster:
Syntax
Example
Example
9. Ensure all the hosts are online and that the storage cluster is healthy:
Example
10. Set the OSD noout, noscrub, and nodeep-scrub flags to prevent OSDs from getting marked out during upgrade and to
avoid unnecessary load on the cluster:
Example
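For example, the three flags named above can be set as follows:
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub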
11. Log in to the registry, and check the service versions and the available target containers:
Syntax
cat mylogin.json
{ "url":"REGISTRY_URL",
"username":"USER_NAME",
"password":"PASSWORD" }
ceph cephadm registry-login -i mylogin.json
Syntax
Example
Syntax
Example
While the upgrade is underway, a progress bar appears in the ceph status output.
Example
14. Verify the new IMAGE_ID and VERSION of the Ceph cluster:
Example
NOTE: If you are not using the cephadm-ansible playbooks, after upgrading your Ceph cluster, you must upgrade the
ceph-common package and client libraries on your client nodes.
Example
Example
15. When the upgrade is complete, unset the noout, noscrub, and nodeep-scrub flags:
Example
Prerequisites
Edit online
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
At least two Ceph Manager nodes in the storage cluster: one active and one standby.
Check the custom container images in a disconnected environment and change the configuration, if required. See
Changing configurations of custom container images for disconnected installations.
By default, the monitoring stack components are deployed based on the primary Ceph image. For a disconnected environment of the
storage cluster, you have to use the latest available monitoring stack component images.
Procedure
Edit online
1. Add license to install IBM Storage Ceph and click Accept on all nodes:
Example
Example
Example
3. Run the preflight playbook with the upgrade_ceph_packages parameter set to true and the ceph_origin parameter set
to custom on the bootstrapped host in the storage cluster:
Syntax
Example
Example
5. Ensure all the hosts are online and that the storage cluster is healthy:
Example
Syntax
Example
7. Set the OSD noout, noscrub, and nodeep-scrub flags to prevent OSDs from getting marked out during upgrade and to
avoid unnecessary load on the cluster:
Example
Syntax
Example
While the upgrade is underway, a progress bar appears in the ceph status output.
Example
Example
10. When the upgrade is complete, unset the noout, noscrub, and nodeep-scrub flags:
Example
Reference
Edit online
See the Registering IBM Storage Ceph nodes to the CDN and attaching subscriptions
Staggered upgrade
Edit online
As a storage administrator, you can upgrade IBM Storage Ceph components in phases rather than all at once. The ceph orch
upgrade command enables you to specify options to limit which daemons are upgraded by a single upgrade command.
NOTE: If you want to upgrade from a version that does not support staggered upgrades, you must first manually upgrade the Ceph
Manager (ceph-mgr) daemons.
--daemon_types: The --daemon_types option takes a comma-separated list of daemon types and will only upgrade
daemons of those types. Valid daemon types for this option include mgr, mon, crash, osd, mds, rgw, rbd-mirror, and
cephfs-mirror.
--services: The --services option is mutually exclusive with --daemon-types, only takes services of one type at a time,
and will only upgrade daemons belonging to those services. For example, you cannot provide an OSD and RGW service
simultaneously.
--hosts: You can combine the --hosts option with --daemon_types, --services, or use it on its own. The --hosts
option parameter follows the same format as the command line options for orchestrator CLI placement specification.
--limit: The --limit option takes an integer greater than zero and provides a numerical limit on the number of daemons
cephadm will upgrade. You can combine the --limit option with --daemon_types, --services, or --hosts. For example, upgrading daemons of type osd on host01 with --limit set to 3 upgrades at most three OSD daemons on host01.
Cephadm strictly enforces an order for the upgrade of daemons that is still present in staggered upgrade scenarios. The current
upgrade order is:
Ceph Manager daemons
Ceph Monitor daemons
Ceph-crash daemons
Ceph OSD daemons
Ceph Metadata Server (MDS) daemons
Ceph Object Gateway (RGW) daemons
RBD-mirror daemons
CephFS-mirror daemons
NOTE: If you specify parameters that upgrade daemons out of order, the upgrade command blocks and notes which daemons you
need to upgrade before you proceed.
Example
Error EINVAL: Cannot start upgrade. Daemons with types earlier in upgrade order than daemons on
given host need upgrading.
Please first upgrade mon.ceph-host01
NOTE: Enforced upgrade order is: mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-
mirror
Prerequisites
Edit online
At least two Ceph Manager nodes in the storage cluster: one active and one standby.
Procedure
Edit online
Example
2. Ensure all the hosts are online and that the storage cluster is healthy:
Example
Syntax
Example
Syntax
Example
Syntax
Example
NOTE: In staggered upgrade scenarios, if using a limiting parameter, the monitoring stack daemons, including
Prometheus and node-exporter, are refreshed after the upgrade of the Ceph Manager daemons. As a result of the
limiting parameter, Ceph Manager upgrades take longer to complete. The versions of monitoring stack daemons might
not change between Ceph releases, in which case, they are only redeployed.
NOTE: Upgrade commands with limiting parameters validate the options before beginning the upgrade, which can
require pulling the new container image. As a result, the upgrade start command might take a while to return when
you provide limiting parameters.
5. To see which daemons you still need to upgrade, run the ceph orch upgrade check or ceph versions command:
Example
6. To complete the staggered upgrade, verify the upgrade of all remaining services:
Syntax
Example
Example
Reference
Edit online
For more information about performing a staggered upgrade and staggered upgrade options, see Performing a staggered
upgrade.
NOTE: You have to upgrade one daemon type after the other. If a daemon cannot be upgraded, the upgrade is paused.
Prerequisites
Edit online
At least two Ceph Manager nodes in the storage cluster: one active and one standby.
Procedure
1. Determine whether an upgrade is in process and the version to which the cluster is upgrading:
Example
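A hedged example of checking the upgrade state with the standard orchestrator status query:
[ceph: root@host01 /]# ceph orch upgrade status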
NOTE: You do not get a message once the upgrade is successful. Run ceph versions and ceph orch ps commands to
verify the new image ID and the version of the storage cluster.
Example
Example
Example
Configuring
This document provides instructions for configuring IBM Storage Ceph at boot time and run time. It also provides configuration
reference information.
Prerequisites
Ceph configuration
The Ceph configuration database
Using the Ceph metavariables
Viewing the Ceph configuration at runtime
Viewing a specific configuration at runtime
Setting a specific configuration at runtime
OSD Memory Target
MDS Memory Cache Limit
Ceph configuration
Cluster Identity
Authentication settings
Ceph daemons
Network configuration
Paths to keyrings
A deployment tool, such as cephadm, will typically create an initial Ceph configuration file for you. However, you can create one
yourself if you prefer to bootstrap an IBM Storage Ceph cluster without using a deployment tool.
Reference
For more information about cephadm and the Ceph orchestrator, see Operations.
Runtime override, using the ceph daemon DAEMON-NAME config set or ceph tell DAEMON-NAME injectargs
commands
There are still a few Ceph options that can be defined in the local Ceph configuration file, which is /etc/ceph/ceph.conf by
default.
cephadm uses a basic ceph.conf file that only contains a minimal set of options for connecting to Ceph Monitors, authenticating,
and fetching configuration information. In most cases, cephadm uses only the mon_host option. To avoid using ceph.conf only for
the mon_host option, use DNS SRV records to perform operations with Monitors.
IMPORTANT: IBM recommends that you use the assimilate-conf administrative command to move valid options into the
configuration database from the ceph.conf file. For more information about assimilate-conf, see Administrative Commands.
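As a minimal sketch, assuming your legacy options live in the default /etc/ceph/ceph.conf, the assimilation can be run from a node with an admin keyring:
[ceph: root@host01 /]# ceph config assimilate-conf -i /etc/ceph/ceph.conf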
Ceph allows you to make changes to the configuration of a daemon at runtime. This capability is useful for increasing or
decreasing logging output, enabling or disabling debug settings, and even for runtime optimization.
NOTE: When the same option exists in the configuration database and the Ceph configuration file, the configuration database option
has a lower priority than what is set in the Ceph configuration file.
Just as you can configure Ceph options globally, per daemon type, or by a specific daemon in the Ceph configuration file, you can
also configure the Ceph options in the configuration database according to these sections:
Section Description
type:location
The type is a CRUSH property, for example, rack or host. The location is a value for the property type. For example, host:foo
limits the option only to daemons or clients running on the foo host.
Example
class:device-class
The device-class is the name of the CRUSH device class, such as hdd or ssd. For example, class:ssd limits the option only to
Ceph OSDs backed by solid state drives (SSD). This mask has no effect on non-OSD daemons or clients.
Example
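The following hedged sketch shows both masks; the option names and values are illustrative, not prescriptive:
[ceph: root@host01 /]# ceph config set osd/host:foo osd_max_backfills 2
[ceph: root@host01 /]# ceph config set osd/class:ssd osd_recovery_max_active 8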
Administrative Commands
The Ceph configuration database can be administered with the subcommand ceph config ACTION. These are the actions you can
do:
ls
Lists the available configuration options.
dump
Dumps the entire configuration database of options for the storage cluster.
get WHO
Dumps the configuration for a specific daemon or client. For example, WHO can be a daemon, like mds.a.
show WHO
Shows the reported running configuration for a running daemon. These options might be different from those stored by the
Ceph Monitors if there is a local configuration file in use or options have been overridden on the command line or at run time.
Also, the source of the option values is reported as part of the output.
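A hedged example of each action follows; the daemon names mds.a and osd.0 are illustrative:
[ceph: root@host01 /]# ceph config ls
[ceph: root@host01 /]# ceph config dump
[ceph: root@host01 /]# ceph config get mds.a
[ceph: root@host01 /]# ceph config show osd.0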
Reference
For more information about the command, see Setting a specific configuration at runtime.
Metavariables are very powerful when used within the [global], [osd], [mon], or [client] sections of the Ceph configuration
file. However, you can also use them with the administration socket. Ceph metavariables are similar to Bash shell expansion.
$cluster
Description
Expands to the Ceph storage cluster name. Useful when running multiple Ceph storage clusters on the same hardware.
Example
/etc/ceph/$cluster.keyring
Default ceph
$type
Description
Expands to one of osd or mon, depending on the type of instant daemon.
Example
/var/lib/ceph/$type
$id
Description
Expands to the daemon identifier. For osd.0, this would be 0.
Example
/var/lib/ceph/$type/$cluster-$id
$host
Description
Expands to the host name of the instant daemon.
$name
Description
Expands to $type.$id.
Example
/var/run/ceph/$cluster-$name.asok
Prerequisites
Procedure
1. To view a runtime configuration, log in to a Ceph node running the daemon and execute:
Syntax
To see the configuration for osd.0, you can log into the node containing osd.0 and execute this command:
Example
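For example, a hedged sketch using the admin socket interface on the node that hosts osd.0:
[ceph: root@host01 /]# ceph daemon osd.0 config show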
Prerequisites
Procedure
Syntax
Example
Prerequisites
Procedure
Syntax
Example
Example
Syntax
Example
Syntax
Example
Example
Syntax
Example
NOTE: If you use a client that does not support reading options from the configuration database, or if you still need to use
ceph.conf to change your cluster configuration for other reasons, run the following command:
You must maintain and distribute the ceph.conf file across the storage cluster.
The option osd_memory_target sets OSD memory based upon the available RAM in the system. Use this option when TCMalloc is
configured as the memory allocator, and when the bluestore_cache_autotune option in BlueStore is set to true.
Ceph OSD memory caching is more important when the block device is slow; for example, traditional hard drives, because the
benefit of a cache hit is much higher than it would be with a solid state drive. However, this must be weighed into a decision to
collocate OSDs with other services, such as in a hyper-converged infrastructure (HCI) or other applications.
NOTE: Configuration options for individual OSDs take precedence over the settings for all OSDs.
Prerequisites
Procedure
Syntax
VALUE is the number of GBytes of memory to be allocated to each OSD in the storage cluster.
Syntax
ID is the ID of the OSD and VALUE is the number of GB of memory to be allocated to the specified OSD. For example, to
configure the OSD with ID 8 to use up to 16 GB of memory:
Example
3. To set an individual OSD to use one maximum amount of memory and configure the rest of the OSDs to use another amount,
specify the individual OSD first:
Example
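As a hedged sketch, assuming the values are given in bytes (osd_memory_target is a size-type option) and OSD ID 8 is the exception:
[ceph: root@host01 /]# ceph config set osd.8 osd_memory_target 17179869184
[ceph: root@host01 /]# ceph config set osd osd_memory_target 8589934592
Here the individual OSD (about 16 GB) is set first and the remaining OSDs fall back to the general osd setting (about 8 GB), because the more specific setting takes precedence.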
Reference
To configure IBM Storage Ceph to autotune OSD memory usage, see Automatically tuning OSD memory.
Example
ceph_conf_overrides:
  mds:
    mds_cache_memory_limit: 2000000000
NOTE: For a large IBM Storage Ceph cluster with a metadata-intensive workload, do not put an MDS server on the same node as
other memory-intensive services. Keeping the MDS on its own node gives you the option to allocate more memory to it, for
example, sizes greater than 100 GB.
Reference
Prerequisites
Network connectivity.
Reference
Ceph has one network configuration requirement that applies to all daemons. The Ceph configuration file must specify the host for
each daemon.
Some deployment utilities, such as cephadm, create a configuration file for you. Do not set these values if the deployment utility
does it for you.
IMPORTANT: The host option is the short name of the node, not its FQDN. It is not an IP address.
All Ceph clusters must use a public network. However, unless you specify an internal cluster network, Ceph assumes a single public
network. Ceph can function with a public network only, but for large storage clusters, you will see significant performance
improvement with a second private network for carrying only cluster-related traffic.
IMPORTANT: IBM recommends running a Ceph storage cluster with two networks: one public network and one private network.
To support two networks, each Ceph Node will need to have more than one network interface card (NIC).
Performance: Ceph OSDs handle data replication for the Ceph clients. When Ceph OSDs replicate data more than once, the
network load between Ceph OSDs easily dwarfs the network load between Ceph clients and the Ceph storage cluster. This can
introduce latency and create a performance problem. Recovery and rebalancing can also introduce significant latency on the
public network.
Security: While most people are generally civil, some actors will engage in what is known as a Denial of Service (DoS) attack.
When traffic between Ceph OSDs gets disrupted, peering may fail and placement groups may no longer reflect an active +
clean state, which may prevent users from reading and writing data. A great way to defeat this type of attack is to maintain a
completely separate cluster network that does not connect directly to the internet.
Network configuration settings are not required. Ceph can function with a public network only, assuming a public network is
configured on all hosts running a Ceph daemon. However, Ceph allows you to establish much more specific criteria, including
multiple IP networks and subnet masks for your public network. You can also establish a separate cluster network to handle OSD
heartbeat, object replication, and recovery traffic.
Do not confuse the IP addresses you set in the configuration with the public-facing IP addresses network clients might use to access
your service. Typical internal IP networks are often 192.168.0.0 or 10.0.0.0.
NOTE: Ceph uses CIDR notation for subnets, for example, 10.0.0.0/24.
IMPORTANT: If you specify more than one IP address and subnet mask for either the public or the private network, the subnets
within the network must be capable of routing to each other. Additionally, make sure you include each IP address and subnet in your
IP tables and open ports for them as necessary.
When you have configured the networks, you can restart the cluster or restart each daemon. Ceph daemons bind dynamically, so you do
not have to restart the entire cluster at once if you change the network configuration.
Reference
For common option descriptions and usage information, see Ceph network configuration options.
simple
async
In IBM Storage Ceph 5.3 and higher, async is the default messenger type. To change the messenger type, specify the ms_type
configuration setting in the [global] section of the Ceph configuration file.
NOTE: For the async messenger, IBM supports the posix transport type, but does not currently support rdma or dpdk. By default,
the ms_type setting in IBM Storage Ceph 5.3 or higher reflects async+posix, where async is the messenger type and posix is
the transport type.
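A minimal sketch of the corresponding Ceph configuration file entry; the value shown is the default and is illustrative only:
[global]
ms_type = async+posix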
SimpleMessenger
The SimpleMessenger implementation uses TCP sockets with two threads per socket. Ceph associates each logical session
with a connection. A pipe handles the connection, including the input and output of each message. While SimpleMessenger
is effective for the posix transport type, it is not effective for other transport types such as rdma or dpdk.
AsyncMessenger
Consequently, AsyncMessenger is the default messenger type for IBM Storage Ceph 5.3 or higher. For IBM Storage Ceph 5.3
or higher, the AsyncMessenger implementation uses TCP sockets with a fixed-size thread pool for connections, which should
be equal to the highest number of replicas or erasure-code chunks. The thread count can be set to a lower value if
performance degrades due to a low CPU count or a high number of OSDs per server.
NOTE: IBM does not support other transport types such as rdma or dpdk at this time.
Reference
For more information about using on-wire encryption with the Ceph messenger version 2 protocol, see Ceph on-wire
encryption.
For more information about asynchronous messenger options, see Ceph network configuration options.
Ceph functions perfectly well with only a public network. However, Ceph allows you to establish much more specific criteria,
including multiple IP networks for your public network.
You can also establish a separate, private cluster network to handle OSD heartbeat, object replication, and recovery traffic. For more
information about the private network, see Configuring a private network.
NOTE:
Ceph uses CIDR notation for subnets, for example, 10.0.0.0/24. Typical internal IP networks are often 192.168.0.0/24 or
10.0.0.0/24.
If you specify more than one IP address for either the public or the cluster network, the subnets within the network must be
capable of routing to each other. In addition, make sure you include each IP address in your IP tables, and open ports for them
as necessary.
The public network configuration allows you to specifically define IP addresses and subnets for the public network.
Prerequisites
Example
Syntax
Example
Example
4. Restart the daemons. Ceph daemons bind dynamically, so you do not have to restart the entire cluster at once if you change
the network configuration for a specific daemon.
Example
5. Optional: If you want to restart the cluster, on the admin node as a root user, run systemctl command:
Syntax
Example
Reference
For common option descriptions and usage information, see Ceph network configuration options.
An example of usage is a stretch cluster mode used for Advanced Cluster Management (ACM) in Metro DR for OpenShift Data
Foundation.
You can configure multiple public networks to the cluster during bootstrap and once bootstrap is complete.
Prerequisites
Before adding a host be sure that you have a running IBM Storage Ceph cluster.
Procedure
IMPORTANT: At least one of the provided public networks must be configured on the current host used for bootstrap.
Syntax
Example
[mon]
public_network = 10.40.0.0/24, 10.41.0.0/24, 10.42.0.0/24
NOTE: During the bootstrap you can include any other arguments that you want to provide.
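As a hedged sketch, assuming the public networks are supplied through an initial configuration file passed to bootstrap with the --config option; MON_IP and the file name are placeholders:
[root@host01 ~]# cephadm bootstrap --mon-ip MON_IP --config initial-ceph.conf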
Syntax
Example
NOTE: The host being added must be reachable from the host that the active manager is running on.
a. Install the cluster’s public SSH key in the new host’s root user’s authorized_keys file:
Syntax
Example
Syntax
Example
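A hedged example of the two sub-steps; host02 and its IP address are illustrative:
[root@host01 ~]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@host02
[ceph: root@host01 /]# ceph orch host add host02 10.10.0.102 --labels _admin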
NOTE:
It is best to explicitly provide the host IP address. If an IP is not provided, then the host name will be
immediately resolved via DNS and that IP will be used.
One or more labels can also be included to immediately label the new host. For example, by default the _admin
label will make cephadm maintain a copy of the ceph.conf file and a client.admin keyring file in
/etc/ceph.
3. Add the networks configurations for the public network parameters to a running cluster. Be sure that the subnets are
separated by commas and that the subnets are listed in subnet/mask format.
Syntax
Example
[root@host01 ~]# ceph config set mon public_network "192.168.0.0/24, 10.42.0.0/24, ..."
Reference
For more information about stretch clusters, see Stretch clusters for Ceph storage.
If you create a cluster network, OSDs route heartbeat, object replication, and recovery traffic over the cluster network. This can
improve performance, compared to using a single network.
IMPORTANT: For added security, the cluster network should not be reachable from the public network or the Internet.
To assign a cluster network, use the --cluster-network option with the cephadm bootstrap command. The cluster network
that you specify must define a subnet in CIDR notation (for example, 10.90.90.0/24 or fe80::/64).
Prerequisites
Procedure
1. Run the cephadm bootstrap command from the initial node that you want to use as the Monitor node in the storage cluster.
Include the --cluster-network option in the command.
Syntax
Example
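A minimal sketch of the bootstrap invocation; MON_IP and the subnet are illustrative placeholders:
[root@host01 ~]# cephadm bootstrap --mon-ip MON_IP --cluster-network 10.90.90.0/24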
2. To configure the cluster_network after bootstrap, run the config set command and redeploy the daemons:
Example
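As a hedged sketch of the post-bootstrap variant, with an illustrative subnet value:
[ceph: root@host01 /]# ceph config set global cluster_network 10.90.90.0/24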
Syntax
Example
Example
Example
e. Optional: If you want to restart the cluster, on the admin node as a root user, run systemctl command:
Syntax
Example
Reference
For more information about invoking cephadm bootstrap, see Bootstrapping a new storage cluster.
NOTE: If your network has a dedicated firewall, you might need to verify its configuration in addition to following this procedure. See
the firewall’s documentation for more information.
Prerequisites
Procedure
b. Verify the absence of rules that restrict connectivity on TCP ports 6800-7100.
Example
Syntax
Example
Messenger v2 Protocol
The second version of Ceph’s on-wire protocol, msgr2, includes several new features:
The Ceph daemons bind to multiple ports allowing both the legacy, v1-compatible, and the new, v2-compatible, Ceph clients to
connect to the same storage cluster. Ceph clients or other Ceph daemons connecting to the Ceph Monitor daemon will try to use the
v2 protocol first, if possible, but if not, then the legacy v1 protocol will be used. By default, both messenger protocols, v1 and v2, are
enabled. The new v2 port is 3300, and the legacy v1 port is 6789, by default.
Prerequisites
Procedure
a. Replace IFACE with the public network interface (for example, eth0, eth1, and so on).
b. Replace IP-ADDRESS with the IP address of the public network and NETMASK with the netmask for the public network.
If you set separate public and cluster networks, you must add rules for both the public network and the cluster network, because
clients will connect using the public network and other Ceph OSD Daemons will connect using the cluster network.
Prerequisites
Procedure
a. Replace IFACE with the public network interface (for example, eth0, eth1, and so on).
b. Replace IP-ADDRESS with the IP address of the public network and NETMASK with the netmask for the public network.
If you put the cluster network into another zone, open the ports within that zone as appropriate.
Prerequisites
Reference
Ceph monitors maintain a master copy of the cluster map. That means a Ceph client can determine the location of all Ceph monitors
and Ceph OSDs just by connecting to one Ceph monitor and retrieving a current cluster map.
Before Ceph clients can read from or write to Ceph OSDs, they must connect to a Ceph Monitor first. With a current copy of the
cluster map and the CRUSH algorithm, a Ceph client can compute the location for any object. The ability to compute object locations
allows a Ceph client to talk directly to Ceph OSDs, which is a very important aspect of Ceph’s high scalability and performance.
The primary role of the Ceph Monitor is to maintain a master copy of the cluster map. Ceph Monitors also provide authentication and
logging services. Ceph Monitors write all changes in the monitor services to a single Paxos instance, and Paxos writes the changes to
a key-value store for strong consistency. Ceph Monitors can query the most recent version of the cluster map during synchronization
operations. Ceph Monitors leverage the key-value store’s snapshots and iterators, using the rocksdb database, to perform store-
wide synchronization.
Figure 1. Paxos
NOTE: Previous releases of IBM Storage Ceph centralized Ceph Monitor configuration in /etc/ceph/ceph.conf. This configuration
file has been deprecated as of IBM Storage Ceph 5.3.
Procedure
Example
For more information about the options available for the ceph config command, use ceph config -h.
Which processes in the IBM Storage Ceph cluster are up and running or down.
Whether the placement groups are active or inactive, and clean or in some other state.
Other details that reflect the current state of the cluster.
When there is a significant change in the state of the cluster, for example, a Ceph OSD goes down or a placement group falls into a
degraded state, the cluster map gets updated to reflect the current state of the cluster. Additionally, the Ceph monitor
maintains a history of the prior states of the cluster. The monitor map, OSD map, and placement group map each maintain a
history of their map versions. Each version is called an epoch.
When operating the IBM Storage Ceph cluster, keeping track of these states is an important part of the cluster administration.
When a Ceph storage cluster runs multiple Ceph Monitors for high availability, Ceph Monitors use the Paxos algorithm to establish
consensus about the master cluster map. A consensus requires a majority of monitors running to establish a quorum for consensus
about the cluster map. For example, 1 out of 1, 2 out of 3, 3 out of 5, 4 out of 6, and so on.
IBM recommends running a production IBM Storage Ceph cluster with at least three Ceph Monitors to ensure high availability. When
you run multiple monitors, you can specify the initial monitors that must be members of the storage cluster to establish a quorum.
This may reduce the time it takes for the storage cluster to come online.
[mon]
mon_initial_members = a,b,c
A Ceph Monitor always refers to the local copy of the monitor map when discovering other Ceph Monitors in the IBM Storage Ceph
cluster. Using the monitor map instead of the Ceph configuration file avoids errors that could break the cluster. For example, typos in
the Ceph configuration file when specifying a monitor address or port. Since monitors use monitor maps for discovery and they share
monitor maps with clients and other Ceph daemons, the monitor map provides monitors with a strict guarantee that their consensus
is valid.
As with any other updates on the Ceph Monitor, changes to the monitor map always run through a distributed consensus algorithm
called Paxos. The Ceph Monitors must agree on each update to the monitor map, such as adding or removing a Ceph Monitor, to
ensure that each monitor in the quorum has the same version of the monitor map. Updates to the monitor map are incremental so
that Ceph Monitors have the latest agreed-upon version and a set of previous versions.
Maintaining history
Maintaining a history enables a Ceph Monitor that has an older version of the monitor map to catch up with the current state of the
IBM Storage Ceph cluster.
If Ceph Monitors discovered each other through the Ceph configuration file instead of through the monitor map, it would introduce
additional risks because the Ceph configuration files are not updated and distributed automatically. Ceph Monitors might
inadvertently use an older Ceph configuration file, fail to recognize a Ceph Monitor, fall out of a quorum, or develop a situation where
Paxos is not able to determine the current state of the system accurately.
File System ID: The fsid is the unique identifier for your object store. Since you can run multiple storage clusters on the
same hardware, you must specify the unique ID of the object store when bootstrapping a monitor. Deployment tools,
such as cephadm, generate a file system identifier for you, but you can also specify the fsid manually.
Monitor ID: A monitor ID is a unique ID assigned to each monitor within the cluster. By convention, the ID is set to the
monitor’s hostname. This option can be set using a deployment tool, using the ceph command, or in the Ceph configuration
file. In the Ceph configuration file, sections are formed as follows:
Example
[mon.host1]
[mon.host2]
Reference
For more information about cephadm and the Ceph orchestrator, see Operations.
NOTE: This minimum configuration for monitors assumes that a deployment tool generates the fsid and the mon. key for you.
You can use the following commands to set or read the storage cluster configuration options.
Here, the WHO parameter is the name of a section or a Ceph daemon, OPTION is a configuration option, and VALUE is the value to
set, for example, true or false.
IMPORTANT: When a Ceph daemon needs a config option prior to getting the option from the config store, you can set the
configuration by running the following command:
This command adds text to all the daemon’s ceph.conf files. It is a workaround and is NOT a recommended operation.
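A hedged sketch of such a command, assuming the cephadm set-extra-ceph-conf mechanism is what is meant here; ceph.conf.extra is an illustrative file name containing the extra options:
[ceph: root@host01 /]# ceph cephadm set-extra-ceph-conf -i ceph.conf.extra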
NOTE: Do not set this value if you use a deployment tool that does it for you.
IMPORTANT: IBM recommends running Ceph monitors on separate hosts and drives from Ceph OSDs for optimal performance in a
production IBM Storage Ceph cluster.
Ceph monitors call the fsync() function often, which can interfere with Ceph OSD workloads.
Ceph monitors store their data as key-value pairs. Using a data store prevents recovering Ceph monitors from running corrupted
versions through Paxos, and it enables multiple modification operations in one single atomic batch, among other advantages.
TIP: When monitoring a cluster, be alert to warnings related to the nearfull ratio. This warning means that a failure of one or more
OSDs could result in a temporary service disruption. Consider adding more OSDs to increase storage capacity.
A common scenario for test clusters involves a system administrator removing a Ceph OSD from the IBM Storage Ceph cluster to
watch the cluster re-balance. Then, removing another Ceph OSD, and so on until the IBM Storage Ceph cluster eventually reaches
the full ratio and locks up.
IMPORTANT: IBM recommends a bit of capacity planning even with a test cluster. Planning enables you to gauge how much spare
capacity you will need in order to maintain high availability.
Ideally, you want to plan for a series of Ceph OSD failures where the cluster can recover to an active + clean state without
replacing those Ceph OSDs immediately. You can run a cluster in an active + degraded state, but this is not ideal for normal
operating conditions.
The following diagram depicts a simplistic IBM Storage Ceph cluster containing 33 Ceph Nodes with one Ceph OSD per host, each
Ceph OSD Daemon reading from and writing to a 3TB drive. So this exemplary IBM Storage Ceph cluster has a maximum actual
capacity of 99TB. With a mon osd full ratio of 0.95, if the IBM Storage Ceph cluster falls to 5 TB of remaining capacity, the
cluster will not allow Ceph clients to read and write data. So, the IBM Storage Ceph cluster's operating capacity is 95 TB, not 99 TB.
It is normal in such a cluster for one or two OSDs to fail. A less frequent but reasonable scenario involves a rack’s router or power
supply failing, which brings down multiple OSDs simultaneously, for example, OSDs 7-12. In such a scenario, you should still strive
for a cluster that can remain operational and achieve an active + clean state, even if that means adding a few hosts with
additional OSDs in short order. If your capacity utilization is too high, you might not lose data, but you could still sacrifice data
availability while resolving an outage within a failure domain if capacity utilization of the cluster exceeds the full ratio. For this
reason, IBM recommends at least some rough capacity planning.
To determine the mean average capacity of an OSD within a cluster, divide the total capacity of the cluster by the number of OSDs in
the cluster. Consider multiplying that number by the number of OSDs you expect to fail simultaneously during normal operations (a
relatively small number). Finally, multiply the capacity of the cluster by the full ratio to arrive at a maximum operating capacity. Then,
subtract the amount of data from the OSDs you expect to fail to arrive at a reasonable full ratio. Repeat the foregoing process with a
higher number of OSD failures (for example, a rack of OSDs) to arrive at a reasonable number for a near full ratio.
Ceph heartbeat
Ceph monitors know about the cluster by requiring reports from each OSD, and by receiving reports from OSDs about the status of
their neighboring OSDs. Ceph provides reasonable default settings for interaction between monitor and OSD, however, you can
modify them as needed.
Synchronization roles
For the purposes of synchronization, monitors can assume one of three roles:
Leader: The Leader is the first monitor to achieve the most recent Paxos version of the cluster map.
Provider: The Provider is a monitor that has the most recent version of the cluster map, but was not the first to achieve the
most recent version.
Requester: The Requester is a monitor that has fallen behind the leader and must synchronize to retrieve the most recent
information about the cluster before it can rejoin the quorum.
These roles enable a leader to delegate synchronization duties to a provider, which prevents synchronization requests from
overloading the leader and improves performance. In the following diagram, the requester has learned that it has fallen behind the
other monitors. The requester asks the leader to synchronize, and the leader tells the requester to synchronize with a provider.
Synchronization always occurs when a new monitor joins the cluster. During runtime operations, monitors can receive updates to the
cluster map at different times. This means the leader and provider roles may migrate from one monitor to another. If this happens
while synchronizing, for example, a provider falls behind the leader, the provider can terminate synchronization with a requester.
Once synchronization is complete, Ceph requires trimming across the cluster. Trimming requires that the placement groups are
active + clean.
For example:
Timeouts triggered too soon or too late when a message was not received in time.
TIP: Install NTP on the Ceph monitor hosts to ensure that the monitor cluster operates with synchronized clocks.
Clock drift may still be noticeable with NTP even though the discrepancy is not yet harmful. Ceph clock drift and clock skew warnings
can get triggered even though NTP maintains a reasonable level of synchronization. Some clock drift may be tolerable
under such circumstances. However, a number of factors, such as workload, network latency, configuring overrides to default
timeouts, and other synchronization options, can influence the level of acceptable clock drift without compromising Paxos
guarantees.
Prerequisites
Reference
Cephx authentication
Enabling Cephx
Disabling Cephx
Cephx user keyrings
Cephx daemon keyrings
Cephx message signatures
Cephx authentication
The cephx protocol is enabled by default. Cryptographic authentication has some computational costs, though they are generally
quite low. If the network environment connecting clients and hosts is considered safe and you cannot afford authentication
computational costs, you can disable it. When deploying a Ceph storage cluster, the deployment tool will create the client.admin
user and keyring.
NOTE: If you disable authentication, you are at risk of a man-in-the-middle attack altering client and server messages, which could
lead to significant security issues.
Enabling Cephx requires that you have deployed keys for the Ceph Monitors and OSDs. When toggling Cephx authentication on or off,
you do not have to repeat the deployment procedures.
Enabling Cephx
When cephx is enabled, Ceph will look for the keyring in the default search path, which includes
/etc/ceph/$cluster.$name.keyring. You can override this location by adding a keyring option in the [global] section of
the Ceph configuration file, but this is not recommended.
Execute the following procedures to enable cephx on a cluster with authentication disabled. If you or your deployment utility have
already generated the keys, you may skip the steps related to generating keys.
Prerequisites
Procedure
1. Create a client.admin key, and save a copy of the key for your client host:
[root@mon ~]# ceph auth get-or-create client.admin mon 'allow *' osd 'allow *' -o
/etc/ceph/ceph.client.admin.keyring
WARNING: This will erase the contents of any existing /etc/ceph/client.admin.keyring file. Do not perform this step if
a deployment tool has already done it for you.
2. Create a keyring for the monitor cluster and generate a monitor secret key:
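A hedged sketch of this step, following the standard upstream keyring-creation form; the path is illustrative:
[root@mon ~]# ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'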
4. Generate a secret key for every OSD, where _ID_ is the OSD number:
ceph auth get-or-create osd.ID mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-ID/keyring
NOTE: If the cephx authentication protocol was disabled previously by setting the authentication options to none, then
removing the following lines under the [global] section in the Ceph configuration file (/etc/ceph/ceph.conf) will
re-enable the cephx authentication protocol:
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
IMPORTANT: Enabling cephx requires downtime because the cluster needs to be completely restarted, or it needs to be shut
down and then started while client I/O is disabled. These flags need to be set before restarting or shutting down the storage
cluster:
Once cephx is enabled and all PGs are active and clean, unset the flags:
Disabling Cephx
The following procedure describes how to disable Cephx. If your cluster environment is relatively safe, you can offset the
computational expense of running authentication by disabling it.
Prerequisites
Procedure
1. Disable cephx authentication by setting the following options in the [global] section of the Ceph configuration file:
Example
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
The most common way to provide these keys to the ceph administrative commands and clients is to include a Ceph keyring under
the /etc/ceph/ directory. The file name is usually ceph.client.admin.keyring or $cluster.client.admin.keyring. If
you include the keyring under the /etc/ceph/ directory, you do not need to specify a keyring entry in the Ceph configuration file.
IMPORTANT: IBM recommends copying the IBM Storage Ceph cluster keyring file to nodes where you will run administrative
commands, because it contains the client.admin key.
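For example, a hedged sketch of copying the keyring with scp; the destination path is illustrative:
[root@host01 ~]# scp /etc/ceph/ceph.client.admin.keyring USER@HOSTNAME:/etc/ceph/ceph.client.admin.keyring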
Replace USER with the user name used on the host with the client.admin key and HOSTNAME with the host name of that host.
NOTE: Ensure the ceph.keyring file has appropriate permissions set on the client machine.
You can specify the key itself in the Ceph configuration file using the key setting, which is not recommended, or a path to a key file
using the keyfile setting.
NOTE: The monitor keyring contains a key but no capabilities, and is not part of the Ceph storage cluster auth database.
/var/lib/ceph/$type/CLUSTER-ID
Example
/var/lib/ceph/osd/ceph-12
IMPORTANT: IBM recommends that Ceph authenticate all ongoing messages between the entities using the session key set up for
that initial authentication.
IMPORTANT: IBM recommends overriding some of the defaults. Specifically, set a pool’s replica size and override the default
number of placement groups.
By default, Ceph makes three replicas of objects. If you want to set four copies of an object as the default value, a primary copy and
three replica copies, reset the default values as shown in osd_pool_default_size. If you want to allow Ceph to write a lesser
number of copies in a degraded state, set osd_pool_default_min_size to a number less than the osd_pool_default_size
value.
Example
[ceph: root@host01 /]# ceph config set global osd_pool_default_size 4 # Write an object four
times.
[ceph: root@host01 /]# ceph config set global osd_pool_default_min_size 1 # Allow writing one copy
in a degraded state.
Ensure you have a realistic number of placement groups. IBM recommends approximately 100 per OSD. Total number of OSDs
multiplied by 100 divided by the number of replicas, that is, osd_pool_default_size. For 10 OSDs and
osd_pool_default_size = 4, we would recommend approximately (100*10)/4=250.
Example
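A hedged sketch of applying the calculated value from the example above (250 placement groups) as the pool defaults:
[ceph: root@host01 /]# ceph config set global osd_pool_default_pg_num 250
[ceph: root@host01 /]# ceph config set global osd_pool_default_pgp_num 250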
Reference
Cluster identity
Network configuration
Paths to keyrings
A deployment tool, such as cephadm, will typically create an initial Ceph configuration file for you. However, you can create one
yourself if you prefer to bootstrap a cluster without using a deployment tool.
For your convenience, each daemon has a series of default values. Many of them are set in the ceph/src/common/config_opts.h file.
You can override these settings with a Ceph configuration file or at runtime by using the monitor tell command or connecting
directly to a daemon socket on a Ceph node.
IMPORTANT: IBM does not recommend changing the default paths, as it makes it more difficult to troubleshoot Ceph later.
Reference
For more information about cephadm and the Ceph orchestrator, see Operations.
For each placement group, Ceph generates a catalog of all objects and compares each primary object and its replicas to ensure that
no objects are missing or mismatched.
Light scrubbing (daily) checks the object size and attributes. Deep scrubbing (weekly) reads the data and uses checksums to ensure
data integrity.
Scrubbing is important for maintaining data integrity, but it can reduce performance. Adjust the following settings to increase or
decrease scrubbing operations.
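As a hedged illustration, these are typical scrub-related options you might adjust; the values are arbitrary examples, not recommendations:
[ceph: root@host01 /]# ceph config set osd osd_max_scrubs 1
[ceph: root@host01 /]# ceph config set osd osd_scrub_begin_hour 22
[ceph: root@host01 /]# ceph config set osd osd_scrub_end_hour 7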
Reference
Backfilling an OSD
When you add Ceph OSDs to a cluster or remove them from the cluster, the CRUSH algorithm rebalances the cluster by moving
placement groups to or from Ceph OSDs to restore the balance. The process of migrating placement groups and the objects they
contain can reduce the cluster operational performance considerably. To maintain operational performance, Ceph performs this
migration with the backfill process, which allows Ceph to set backfill operations to a lower priority than requests to read or write
data.
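A hedged sketch of lowering backfill activity; osd_max_backfills is the per-OSD limit on concurrent backfill operations, and the value shown is illustrative:
[ceph: root@host01 /]# ceph config set osd osd_max_backfills 1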
OSD recovery
If a Ceph OSD crashes and comes back online, usually it will be out of sync with other Ceph OSDs containing more recent versions of
objects in the placement groups. When this happens, the Ceph OSD goes into recovery mode and seeks to get the latest copy of the
data and bring its map back up to date. Depending upon how long the Ceph OSD was down, the OSD’s objects and placement groups
may be significantly out of date. Also, if a failure domain went down, for example, a rack, more than one Ceph OSD might come back
online at the same time. This can make the recovery process time consuming and resource intensive.
To maintain operational performance, Ceph performs recovery with limitations on the number of recovery requests, threads, and
object chunk sizes which allows Ceph to perform well in a degraded state.
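Similarly, a hedged sketch of limiting recovery activity; the options and values shown are illustrative examples of such limits:
[ceph: root@host01 /]# ceph config set osd osd_recovery_max_active 3
[ceph: root@host01 /]# ceph config set osd osd_recovery_op_priority 3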
Reference
Ceph provides reasonable default settings for Ceph Monitor and OSD interaction. However, you can override the defaults. The
following sections describe how Ceph Monitors and Ceph OSD daemons interact for the purposes of monitoring the Ceph storage
cluster.
OSD heartbeat
Each Ceph OSD daemon checks the heartbeat of other Ceph OSD daemons every 6 seconds. To change the heartbeat interval,
change the value at runtime:
Syntax
Example
If a neighboring Ceph OSD daemon does not send heartbeat packets within a 20 second grace period, the Ceph OSD daemon might
consider the neighboring Ceph OSD daemon down. It can report it back to a Ceph Monitor, which updates the Ceph cluster map. To
change the grace period, set the value at runtime:
Example
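A hedged sketch of both runtime changes, assuming the osd_heartbeat_interval and osd_heartbeat_grace options with illustrative values:
[ceph: root@host01 /]# ceph config set osd osd_heartbeat_interval 6
[ceph: root@host01 /]# ceph config set osd osd_heartbeat_grace 20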
However, there is the chance that all the OSDs reporting the failure are in different hosts in a rack with a bad switch that causes
connection problems between OSDs.
To avoid a "false alarm," Ceph considers the peers reporting the failure as a proxy for a "subcluster" that is similarly laggy. While this
is not always the case, it may help administrators localize the grace correction to a subset of the system that is performing poorly.
Ceph uses the mon_osd_reporter_subtree_level setting to group the peers into the "subcluster" by their common ancestor
type in the CRUSH map.
By default, only two reports from different subtrees are required to report another Ceph OSD Daemon down. Administrators can
change the number of reporters from unique subtrees and the common ancestor type required to report a Ceph OSD Daemon down
to a Ceph Monitor by setting the mon_osd_min_down_reporters and mon_osd_reporter_subtree_level values at runtime:
Syntax
Example
Syntax
Syntax
Example
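A hedged sketch of both settings; the values shown are illustrative:
[ceph: root@host01 /]# ceph config set mon mon_osd_min_down_reporters 3
[ceph: root@host01 /]# ceph config set mon mon_osd_reporter_subtree_level rack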
You can change the Ceph OSD Daemon minimum report interval by setting the osd_mon_report_interval value at runtime:
Syntax
To get, set, and verify the config you can use the following example:
Example
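For example, a hedged get/set/verify sequence using osd_mon_report_interval with an illustrative value:
[ceph: root@host01 /]# ceph config get osd osd_mon_report_interval
[ceph: root@host01 /]# ceph config set osd osd_mon_report_interval 5
[ceph: root@host01 /]# ceph config dump | grep osd_mon_report_interval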
Prerequisite
NOTE: Typically, these will be set automatically by deployment tools, such as cephadm.
fsid
Description
The file system ID. One per cluster.
Type
UUID
Required
No.
Default
N/A. Usually generated by deployment tools.
admin_socket
Description
The socket for executing administrative commands on a daemon, irrespective of whether Ceph monitors have established a
quorum.
Type
String
Required
No
Default
/var/run/ceph/$cluster-$name.asok
pid_file
Description
The file in which the monitor or OSD will write its PID. For instance, /var/run/$cluster/$type.$id.pid will create
/var/run/ceph/mon.a.pid for the mon with id a running in the ceph cluster. The pid file is removed when the daemon stops
gracefully. If the process is not daemonized (meaning it runs with the -f or -d option), the pid file is not created.
Type
String
Required
No
Default
No
chdir
Description
The directory Ceph daemons change to once they are up and running. Default / directory recommended.
Type
String
Required
No
max_open_files
Description
If set, when the IBM Storage Ceph cluster starts, Ceph sets the max_open_fds at the OS level (that is, the maximum number of file
descriptors). It helps prevent Ceph OSDs from running out of file descriptors.
Type
64-bit Integer
Required
No
Default
0
fatal_signal_handlers
Description
If set, we will install signal handlers for SEGV, ABRT, BUS, ILL, FPE, XCPU, XFSZ, SYS signals to generate a useful log message.
Type
Boolean
Default
true
public_network
Description
The IP address and netmask of the public (front-side) network (for example, 192.168.0.0/24). Set in [global]. You can
specify comma-delimited subnets.
Type
<ip-address>/<netmask> [, <ip-address>/<netmask>]
Required No
Default N/A
public_addr
Description
The IP address for the public (front-side) network. Set for each daemon.
Type
IP Address
Required No
Default N/A
cluster_network
Description
The IP address and netmask of the cluster network (for example, 10.0.0.0/24). Set in [global]. You can specify comma-
delimited subnets.
Type
<ip-address>/<netmask> [, <ip-address>/<netmask>]
Required No
Default NA
cluster_addr
Description
The IP address for the cluster (back-side) network. Set for each daemon.
Type
Address
Required No
Default NA
ms_type
Description
The messenger type for the network transport layer. IBM supports the simple and the async messenger type using posix
semantics.
Type
String.
Required No.
Default async+posix
ms_public_type
Description
The messenger type for the network transport layer of the public network. It operates identically to ms_type, but is
applicable only to the public or front-side network. This setting enables Ceph to use a different messenger type for the public
or front-side and cluster or back-side networks.
Type String.
Required No.
Default None.
ms_cluster_type
Description
The messenger type for the network transport layer of the cluster network. It operates identically to ms_type, but is
applicable only to the cluster or back-side network. This setting enables Ceph to use a different messenger type for the public
or front-side and cluster or back-side networks.
Type String.
Required No.
Default None.
Host options
You must declare at least one Ceph Monitor in the Ceph configuration file, with a mon_addr setting under each
declared monitor. Ceph expects a host setting under each declared monitor, metadata server, and OSD in the Ceph configuration file.
IMPORTANT: Do not use localhost. Use the short name of the node, not the fully-qualified domain name (FQDN). Do not specify
any value for host when using a third party deployment system that retrieves the node name for you.
mon_addr
Description
A list of <hostname>:<port> entries that clients can use to connect to a Ceph monitor. If not set, Ceph searches [mon.*]
sections.
Type String
Required No
Default NA
host
Description
The host name. Use this setting for specific daemon instances (for example, [osd.0]).
Type String
Default localhost
ms_tcp_nodelay
Description
Ceph enables ms_tcp_nodelay so that each request is sent immediately (no buffering). Disabling Nagle’s algorithm
increases network traffic, which can introduce congestion. If you experience large numbers of small packets, you may try
disabling ms_tcp_nodelay, but be aware that disabling it will generally increase latency.
Type Boolean
Required No
Default true
ms_tcp_rcvbuf
Description
The size of the socket buffer on the receiving end of a network connection. Disabled by default.
Required No
Default 0
ms_tcp_read_timeout
Description
If a client or daemon makes a request to another Ceph daemon and does not drop an unused connection, the tcp read
timeout defines the connection as idle after the specified number of seconds.
Required No
Bind options
The bind options configure the default port ranges for the Ceph OSD daemons. The default range is 6800:7100. You
can also enable Ceph daemons to bind to IPv6 addresses.
IMPORTANT: Verify that the firewall configuration allows you to use the configured port range.
ms_bind_port_min
Description
The minimum port number to which an OSD daemon will bind.
Default 6800
Required No
ms_bind_port_max
Description
The maximum port number to which an OSD daemon will bind.
Default 7300
Required No.
ms_bind_ipv6
Description
Enables Ceph daemons to bind to IPv6 addresses.
Type Boolean
Default false
Asynchronous messenger options
These Ceph messenger options configure the behavior of AsyncMessenger.
ms_async_transport_type
Description
Transport type used by the AsyncMessenger. IBM supports the posix setting, but does not support the dpdk or rdma
settings at this time. POSIX uses standard TCP/IP networking and is the default value. Other transport types are experimental
and are NOT supported.
Type String
Required No
Default posix
ms_async_op_threads
Description
Initial number of worker threads used by each AsyncMessenger instance. This configuration setting SHOULD equal the
number of replicas or erasure code chunks, but it may be set lower if the CPU core count is low or the number of OSDs on a
single server is high.
Required No
Default 3
ms_async_max_op_threads
Description
The maximum number of worker threads used by each AsyncMessenger instance. Set it to lower values if the OSD host has a
limited CPU count, and increase it if Ceph is underutilizing the CPUs.
Required No
Default 5
ms_async_set_affinity
Description
Set to true to bind AsyncMessenger workers to particular CPU cores.
Type Boolean
Required No
Default true
ms_async_affinity_cores
Description
When ms_async_set_affinity is true, this string specifies how AsyncMessenger workers are bound to CPU cores. For
example, 0,2 will bind workers #1 and #2 to CPU cores #0 and #2, respectively.
NOTE: When manually setting affinity, make sure to not assign workers to virtual CPUs created as an effect of hyper threading
or similar technology, because they are slower than physical CPU cores.
Type String
Required No
Default (empty)
ms_async_send_inline
Description
Send messages directly from the thread that generated them instead of queuing and sending from the AsyncMessenger
thread. This option is known to decrease performance on systems with a lot of CPU cores, so it’s disabled by default.
Type Boolean
Default false
You can set these configuration options with the ceph config set mon CONFIGURATION_OPTION VALUE command.
mon_initial_members
Description The IDs of initial monitors in a cluster during startup. If specified, Ceph requires an odd number of monitors to
form an initial quorum (for example, 3).
Type
String
Default
None
mon_force_quorum_join
Description
Force monitor to join quorum even if it has been previously removed from the map
Type
Boolean
Default
False
mon_dns_srv_name
Description
The service name used for querying the DNS for the monitor hosts/addresses.
Type
String
Default
ceph-mon
fsid
Description
The cluster ID. One per cluster.
Type
UUID
Required
Yes.
Default
N/A. May be generated by a deployment tool if not specified.
mon_data
Description
The monitor’s data location.
Type
String
Default
/var/lib/ceph/mon/$cluster-$id
mon_data_size_warn
Description
Ceph issues a HEALTH_WARN status in the cluster log when the monitor's data store reaches this threshold. The default value is 15 GB.
Type
Integer
Default
15*1024*1024*1024
mon_data_avail_warn
Description
Ceph issues a HEALTH_WARN status in the cluster log when the available disk space of the monitor’s data store is lower than
or equal to this percentage.
Type
Integer
Default
30
mon_data_avail_crit
Description
Ceph issues a HEALTH_ERR status in the cluster log when the available disk space of the monitor’s data store is lower or equal
to this percentage.
Type
Integer
Default
5
mon_warn_on_cache_pools_without_hit_sets
Description
Ceph issues a HEALTH_WARN status in the cluster log if a cache pool does not have the hit_set_type parameter set.
Type
Boolean
Default
True
mon_warn_on_crush_straw_calc_version_zero
Description
Ceph issues a HEALTH_WARN status in the cluster log if the CRUSH’s straw_calc_version is zero.
Type
Boolean
Default
True
mon_warn_on_legacy_crush_tunables
Description
Ceph issues a HEALTH_WARN status in the cluster log if CRUSH tunables are too old (older than
mon_crush_min_required_version).
Type
Boolean
Default
True
mon_crush_min_required_version
Description
This setting defines the minimum tunable profile version required by the cluster.
Type
String
Default
hammer
Type
Boolean
Default
True
mon_cache_target_full_warn_ratio
Description
The position between a pool's cache_target_full and target_max_object values at which Ceph issues a warning.
Type
Float
Default
0.66
mon_health_data_update_interval
Description
How often (in seconds) a monitor in the quorum shares its health status with its peers. A negative number disables health
updates.
Type
Float
Default
60
mon_health_to_clog
Description
This setting enables Ceph to send a health summary to the cluster log periodically.
Type
Boolean
Default
True
mon_health_detail_to_clog
Description
This setting enables Ceph to send health details to the cluster log periodically.
Type
Boolean
Default
True
mon_op_complaint_time
Description
Number of seconds after which the Ceph Monitor operation is considered blocked after no updates.
Type
Integer
Default
30
mon_health_to_clog_tick_interval
Description
How often (in seconds) the monitor sends a health summary to the cluster log. A non-positive number disables it. If the
current health summary is empty or identical to the last time, the monitor will not send the status to the cluster log.
Default
60.000000
mon_health_to_clog_interval
Description
How often (in seconds) the monitor sends a health summary to the cluster log. A non-positive number disables it. The monitor
will always send the summary to the cluster log.
Type
Integer
Default
600
mon_osd_full_ratio
Description
The percentage of disk space used before an OSD is considered full.
Type
Float:
Default
.95
mon_osd_nearfull_ratio
Description
The percentage of disk space used before an OSD is considered nearfull.
Type
Float
Default
.85
mon_sync_trim_timeout
Description;
Type
Double
Default
30.0
mon_sync_heartbeat_timeout
Description;
Type
Double
Default
30.0
mon_sync_heartbeat_interval
Description;
Type
Double
Default
5.0
mon_sync_backoff_timeout
Description;
Type
Double
mon_sync_timeout
Description
The number of seconds the monitor will wait for the next update message from its sync provider before it gives up and
bootstraps again.
Type
Double
Default
60.000000
mon_sync_max_retries
Description;
Type
Integer
Default
5
mon_sync_max_payload_size
Description
The maximum size for a sync payload (in bytes).
Type
32-bit Integer
Default
1048576
paxos_max_join_drift
Description
The maximum Paxos iterations before we must first sync the monitor data stores. When a monitor finds that its peer is too far
ahead of it, it will first sync with data stores before moving on.
Type
Integer
Default
10
paxos_stash_full_interval
Description
How often (in commits) to stash a full copy of the PaxosService state. Currently this setting only affects mds, mon, auth and
mgr PaxosServices.
Type
Integer
Default
25
paxos_propose_interval
Description
Gather updates for this time interval before proposing a map update.
Type
Double
Default
1.0
paxos_min
Description
The minimum number of paxos states to keep around
Default
500
paxos_min_wait
Description
The minimum amount of time to gather updates after a period of inactivity.
Type
Double
Default
0.05
paxos_trim_min
Description
Number of extra proposals tolerated before trimming.
Type
Integer
Default
250
paxos_trim_max
Description
The maximum number of extra proposals to trim at a time.
Type
Integer
Default
500
paxos_service_trim_min
Description
The minimum amount of versions to trigger a trim (0 disables it).
Type
Integer
Default
250
paxos_service_trim_max
Description
The maximum amount of versions to trim during a single proposal (0 disables it).
Type
Integer
Default
500
mon_max_log_epochs
Description
The maximum amount of log epochs to trim during a single proposal.
Type
Integer
Default
500
mon_max_pgmap_epochs
Description
The maximum amount of pgmap epochs to trim during a single proposal
Default
500
mon_mds_force_trim_to
Description
Force the monitor to trim mdsmaps to this point (0 disables it; dangerous, use with care).
Type
Integer
Default
0
mon_osd_force_trim_to
Description
Force the monitor to trim osdmaps to this point, even if there are PGs that are not clean at the specified epoch (0 disables it;
dangerous, use with care).
Type
Integer
Default
0
mon_osd_cache_size
Description
The size of the osdmap cache, so that the monitor does not rely on the underlying store's cache.
Type
Integer
Default
500
mon_election_timeout
Description
On election proposer, maximum waiting time for all ACKs in seconds.
Type
Float
Default
5
mon_lease
Description
The length (in seconds) of the lease on the monitor’s versions.
Type
Float
Default
5
mon_lease_renew_interval_factor
Description
mon_lease * mon_lease_renew_interval_factor will be the interval for the Leader to renew the other monitors'
leases. The factor should be less than 1.0.
Type
Float
Default
0.6
mon_lease_ack_timeout_factor
Type
Float
Default
2.0
mon_accept_timeout_factor
Description
The Leader will wait mon lease \* mon accept timeout factor for the Requesters to accept a Paxos update. It is also
used during the Paxos recovery phase for similar purposes.
Type
Float
Default
2.0
mon_min_osdmap_epochs
Description
Minimum number of OSD map epochs to keep at all times.
Type
32-bit Integer
Default
500
mon_max_pgmap_epochs
Description
Maximum number of PG map epochs the monitor should keep.
Type
32-bit Integer
Default
500
mon_max_log_epochs
Description
Maximum number of Log epochs the monitor should keep.
Type
32-bit Integer
Default
500
clock_offset
Description
How much to offset the system clock. See Clock.cc for details.
Type
Double
Default
0
mon_tick_interval
Description
A monitor’s tick interval in seconds.
Type
32-bit Integer
Default
5
mon_clock_drift_allowed
Description
The clock drift in seconds allowed between monitors.
Type
Float
Default
.050
mon_clock_drift_warn_backoff
Description
Exponential backoff for clock drift warnings.
Type
Float
Default
5
mon_timecheck_interval
Description
The time check interval (clock drift check) in seconds for the leader.
Type
Float
Default
300.0
mon_timecheck_skew_interval
Description
The time check interval (clock drift check) in seconds for the Leader when in the presence of a skew.
Type
Float
Default
30.0
mon_max_osd
Description
The maximum number of OSDs allowed in the cluster.
Type
32-bit Integer
Default
10000
mon_globalid_prealloc
Description
The number of global IDs to pre-allocate for clients and daemons in the cluster.
Type
32-bit Integer
Default
10000
mon_sync_fs_threshold
Description
Synchronize with the filesystem when writing the specified number of objects. Set it to 0 to disable it.
Type
32-bit Integer
Default
5
mon_subscribe_interval
Type
Double
Default
86400.000000
mon_stat_smooth_intervals
Description
Ceph will smooth statistics over the last N PG maps.
Type
Integer
Default
6
mon_probe_timeout
Description
Number of seconds the monitor will wait to find peers before bootstrapping.
Type
Double
Default
2.0
mon_daemon_bytes
Description
The message memory cap for metadata server and OSD messages (in bytes).
Type
64-bit Integer Unsigned
Default
400ul << 20
mon_max_log_entries_per_event
Description
The maximum number of log entries per event.
Type
Integer
Default
4096
mon_osd_prime_pg_temp
Description
Enables or disables priming the PGMap with the previous OSDs when an out OSD comes back into the cluster. With the true
setting, clients continue to use the previous OSDs until the newly in OSDs for that PG have peered.
Type
Boolean
Default
true
mon_osd_prime_pg_temp_max_time
Description
How much time in seconds the monitor should spend trying to prime the PGMap when an out OSD comes back into the
cluster.
Type
Float
mon_osd_prime_pg_temp_max_time_estimate
Description
Maximum estimate of time spent on each PG before we prime all PGs in parallel.
Type
Float
Default
0.25
mon_osd_allow_primary_affinity
Description
Allow primary_affinity to be set in the osdmap.
Type
Boolean
Default
False
mon_osd_pool_ec_fast_read
Description
Whether to turn on fast read on the pool or not. It is used as the default setting for newly created erasure-coded pools if
fast_read is not specified at create time.
Type
Boolean
Default
False
mon_mds_skip_sanity
Description
Skip safety assertions on FSMap, in case of bugs where you want to continue anyway. The monitor terminates if the FSMap
sanity check fails, but the check can be disabled by enabling this option.
Type
Boolean
Default
False
mon_max_mdsmap_epochs
Description
The maximum number of mdsmap epochs to trim during a single proposal.
Type
Integer
Default
500
mon_config_key_max_entry_size
Description
The maximum size of config-key entry (in bytes).
Type
Integer
Default
65536
mon_warn_pg_not_scrubbed_ratio
Description
The percentage of the scrub max interval by which a placement group may exceed the scrub max interval before a warning is issued.
Type
float
Default
0.5
mon_warn_pg_not_deep_scrubbed_ratio
Description
The percentage of the deep scrub interval by which a placement group may exceed the deep scrub interval before a warning is issued.
Type
float
Default
0.75
mon_scrub_interval
Description
How often, in seconds, the monitor scrubs its store by comparing the stored checksums with the computed ones of all the
stored keys.
Type
Integer
Default
3600*24
mon_scrub_timeout
Description
The timeout, in seconds, to restart the scrub if a monitor quorum participant does not respond for the latest chunk.
Type
Integer
Default
5 min
mon_scrub_max_keys
Description
The maximum number of keys to scrub each time.
Type
Integer
Default
100
mon_scrub_inject_crc_mismatch
Description
The probability of injecting CRC mismatches into Ceph Monitor scrub.
Type
Float
Default
0
mon_scrub_inject_missing_keys
Description
The probability of injecting missing keys into monitor scrub.
Type
float
Default
0
mon_compact_on_start
Type
Boolean
Default
False
mon_compact_on_bootstrap
Description
Compact the database used as the Ceph Monitor store on bootstrap. The monitors start probing each other to create a quorum
after bootstrap. If a monitor times out before joining the quorum, it starts over and bootstraps itself again.
Type
Boolean
Default
False
mon_compact_on_trim
Description
Compact a certain prefix (including paxos) when we trim its old states.
Type
Boolean
Default
True
mon_cpu_threads
Description
Number of threads for performing CPU-intensive work on the monitor.
Type
Integer
Default
4
mon_osd_mapping_pgs_per_chunk
Description
We calculate the mapping from the placement group to OSDs in chunks. This option specifies the number of placement groups
per chunk.
Type
Integer
Default
4096
mon_osd_max_split_count
Description
Largest number of PGs per "involved" OSD to let split create. When we increase the pg_num of a pool, the placement groups
will be split on all OSDs serving that pool. We want to avoid extreme multipliers on PG splits.
Type
Integer
Default
300
rados_mon_op_timeout
Description
Number of seconds to wait for a response from the monitor before returning an error from a rados operation. A value of 0
means no limit or wait time.
Type
Double
Default
0
auth_cluster_required
Description
Valid settings are cephx or none.
Type
String
Required
No
Default
cephx.
auth_service_required
Description
Valid settings are cephx or none.
Type
String
Required
No
Default
cephx.
auth_client_required
Description
If enabled, the IBM Storage Ceph cluster daemons require Ceph clients to authenticate with the IBM Storage Ceph cluster in
order to access Ceph services. Valid settings are cephx or none.
Type
String
Required
No
Default
cephx.
keyring
Description
The path to the keyring file.
Type
String
Required
No
Default
/etc/ceph/$cluster.$name.keyring, /etc/ceph/$cluster.keyring, /etc/ceph/keyring,
/etc/ceph/keyring.bin
keyfile
Description
The path to a key file (that is, a file containing only the key).
Type
String
Required
No
Default
None
key
Description
The key (that is, the text string of the key itself). Not recommended.
Type
String
Required
No
Default
None
ceph-mon
Location
$mon_data/keyring
Capabilities
mon 'allow *'
ceph-osd
Location
$osd_data/keyring
Capabilities
mon 'allow profile osd' osd 'allow *'
radosgw
Location
$rgw_data/keyring
Capabilities
mon 'allow rwx' osd 'allow rwx'
cephx_require_signatures
Description
If set to true, Ceph requires signatures on all message traffic between the Ceph client and the IBM Storage Ceph cluster, and
between daemons comprising the IBM Storage Ceph cluster.
Type
Boolean
Required
No
Default
false
cephx_cluster_require_signatures
Description
If set to true, Ceph requires signatures on all message traffic between Ceph daemons comprising the IBM Storage Ceph
cluster.
Type
Boolean
Required
No
cephx_service_require_signatures
Description
If set to true, Ceph requires signatures on all message traffic between Ceph clients and the IBM Storage Ceph cluster.
Type
Boolean
Required
No
Default
false
cephx_sign_messages
Description
If the Ceph version supports message signing, Ceph will sign all messages so they cannot be spoofed.
Type
Boolean
Default
true
auth_service_ticket_ttl
Description
When the IBM Storage Ceph cluster sends a Ceph client a ticket for authentication, the cluster assigns the ticket a time to live.
Type
Double
Default
60*60
mon_allow_pool_delete
Description
Allows a monitor to delete a pool. In RHCS 3 and later releases, the monitor cannot delete the pool by default as an added
measure to protect data.
Type
Boolean
Default
false
mon_max_pool_pg_num
Description
The maximum number of placement groups per pool.
Type
Integer
Default
65536
mon_pg_create_interval
Description
Number of seconds between PG creation in the same Ceph OSD Daemon.
Type
Float
Default
30.0
mon_pg_stuck_threshold
Description
Number of seconds after which PGs can be considered as being stuck.
Type
32-bit Integer
Default
300
mon_pg_min_inactive
Description
Ceph issues a HEALTH_ERR status in the cluster log if the number of PGs that remain inactive longer than the
mon_pg_stuck_threshold exceeds this setting. The default setting is one PG. A non-positive number disables this setting.
Type
Integer
Default
1
mon_pg_warn_min_per_osd
Description
Ceph issues a HEALTH_WARN status in the cluster log if the average number of PGs per OSD in the cluster is less than this
setting. A non-positive number disables this setting.
Type
Integer
Default
30
mon_pg_warn_max_per_osd
Description
Ceph issues a HEALTH_WARN status in the cluster log if the average number of PGs per OSD in the cluster is greater than this
setting. A non-positive number disables this setting.
Type
Integer
Default
300
mon_pg_warn_min_objects
Description
Do not warn if the total number of objects in the cluster is below this number.
Type
Integer
Default
1000
mon_pg_warn_min_pool_objects
Description
Do not warn on pools whose object number is below this number.
Type
Integer
Default
1000
mon_pg_check_down_all_threshold
Type
Float
Default
0.5
mon_pg_warn_max_object_skew
Description
Ceph issues a HEALTH_WARN status in the cluster log if the average number of objects in a pool is greater than mon pg warn
max object skew times the average number of objects for all pools. A non-positive number disables this setting.
Type
Float
Default
10
mon_delta_reset_interval
Description
The number of seconds of inactivity before Ceph resets the PG delta to zero. Ceph keeps track of the delta of the used space
for each pool to aid administrators in evaluating the progress of recovery and performance.
Type
Integer
Default
10
mon_osd_max_op_age
Description
The maximum age in seconds for an operation to complete before issuing a HEALTH_WARN status.
Type
Float
Default
32.0
osd_pg_bits
Description
Placement group bits per Ceph OSD Daemon.
Type
32-bit Integer
Default
6
osd_pgp_bits
Description
The number of bits per Ceph OSD Daemon for Placement Groups for Placement purpose (PGPs).
Type
32-bit Integer
Default
6
osd_crush_chooseleaf_type
Description
The bucket type to use for chooseleaf in a CRUSH rule. Uses ordinal rank rather than name.
Type
32-bit Integer
Default
1. Typically a host containing one or more Ceph OSD Daemons.
Type
8-bit Integer
Default
0
osd_pool_erasure_code_stripe_unit
Description
Sets the default size, in bytes, of a chunk of an object stripe for erasure coded pools. Every object of size S will be stored as N
stripes, with each data chunk receiving stripe unit bytes. Each stripe of N * stripe unit bytes will be
encoded/decoded individually. This option can be overridden by the stripe_unit setting in an erasure code profile.
Type
Unsigned 32-bit Integer
Default
4096
osd_pool_default_size
Description
Sets the number of replicas for objects in the pool. The default value is the same as ceph osd pool set {pool-name}
size {size}.
Type
32-bit Integer
Default
3
osd_pool_default_min_size
Description
Sets the minimum number of written replicas for objects in the pool in order to acknowledge a write operation to the client. If
the minimum is not met, Ceph will not acknowledge the write to the client. This setting ensures a minimum number of replicas
when operating in degraded mode.
Type
32-bit Integer
Default
0, which means no particular minimum. If 0, minimum is size - (size / 2).
osd_pool_default_pg_num
Description
The default number of placement groups for a pool. The default value is the same as pg_num with mkpool.
Type
32-bit Integer
Default
32
osd_pool_default_pgp_num
Description
The default number of placement groups for placement for a pool. The default value is the same as pgp_num with mkpool. PG
and PGP should be equal.
Type
32-bit Integer
Default
0
osd_pool_default_flags
Description
The default flags for new pools.
Default
0
osd_max_pgls
Description
The maximum number of placement groups to list. A client requesting a large number can tie up the Ceph OSD Daemon.
Type
Unsigned 64-bit Integer
Default
1024
Note
Default should be fine.
osd_min_pg_log_entries
Description
The minimum number of placement group logs to maintain when trimming log files.
Type
32-bit Int Unsigned
Default
250
osd_default_data_pool_replay_window
Description
The time, in seconds, for an OSD to wait for a client to replay a request.
Type
32-bit Integer
Default
45
You can set these configuration options with the ceph config set osd CONFIGURATION_OPTION VALUE command.
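For example, a minimal sketch of the command pattern; the option name and value below are chosen only for illustration:
ceph config set osd osd_max_write_size 90
ceph config get osd osd_max_write_size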
osd_uuid
Description
The universally unique identifier (UUID) for the Ceph OSD.
Type
UUID
Default
The UUID.
NOTE: The osd uuid applies to a single Ceph OSD. The fsid applies to the entire cluster.
osd_data
Description
The path to the OSD’s data. You must create the directory when deploying Ceph. Mount a drive for OSD data at this mount
point.
Type
String
osd_max_write_size
Description
The maximum size of a write in megabytes.
Type
32-bit Integer
Default
90
osd_client_message_size_cap
Description
The largest client data message allowed in memory.
Type
64-bit Integer Unsigned
Default
500MB. 500*1024L*1024L
osd_class_dir
Description
The class path for RADOS class plug-ins.
Type
String
Default
$libdir/rados-classes
osd_max_scrubs
Description
The maximum number of simultaneous scrub operations for a Ceph OSD.
Type
32-bit Int
Default
1
osd_scrub_thread_timeout
Description
The maximum time in seconds before timing out a scrub thread.
Type
32-bit Integer
Default
60
osd_scrub_finalize_thread_timeout
Description
The maximum time in seconds before timing out a scrub finalize thread.
Type
32-bit Integer
Default
60*10
osd_scrub_begin_hour
Description
This restricts scrubbing to this hour of the day or later. Use osd_scrub_begin_hour = 0 and osd_scrub_end_hour = 0
to allow scrubbing the entire day. Along with osd_scrub_end_hour, they define a time window, in which the scrubs can
happen. But a scrub is performed no matter whether the time window allows or not, as long as the placement group’s scrub
interval exceeds osd_scrub_max_interval.
Type
Integer
Default
0
Allowed range
0 to 23
osd_scrub_end_hour
Description
This restricts scrubbing to the hour earlier than this. Use osd_scrub_begin_hour = 0 and osd_scrub_end_hour = 0 to
allow scrubbing for the entire day. Along with osd_scrub_begin_hour, they define a time window, in which the scrubs can
happen. But a scrub is performed no matter whether the time window allows or not, as long as the placement group's scrub
interval exceeds osd_scrub_max_interval.
Type
Integer
Default
0
Allowed range
0 to 23
osd_scrub_load_threshold
Description
The maximum load. Ceph will not scrub when the system load (as defined by the getloadavg() function) is higher than this
number. Default is 0.5.
Type
Float
Default
0.5
osd_scrub_min_interval
Description
The minimum interval in seconds for scrubbing the Ceph OSD when the IBM Storage Ceph cluster load is low.
Type
Float
Default
Once per day. 60*60*24
osd_scrub_max_interval
Description
The maximum interval in seconds for scrubbing the Ceph OSD irrespective of cluster load.
Type
Float
Default
Once per week. 7*60*60*24
osd_scrub_interval_randomize_ratio
Description
Takes the ratio and randomizes the scheduled scrub between osd scrub min interval and osd scrub max
interval.
Type
Float
Default
0.5.
mon_warn_not_scrubbed
Description
Number of seconds after osd_scrub_interval to warn about any PGs that were not scrubbed.
Type
Integer
Default
0 (no warning).
osd_scrub_chunk_min
Description
The object store is partitioned into chunks which end on hash boundaries. For chunky scrubs, Ceph scrubs objects one chunk
at a time with writes blocked for that chunk. The osd scrub chunk min setting represents the minimum number of chunks
to scrub.
Type
32-bit Integer
Default
5
osd_scrub_chunk_max
Description
The maximum number of chunks to scrub.
Type
32-bit Integer
Default
25
osd_scrub_sleep
Description
The time to sleep between deep scrub operations.
Type
Float
Default
0 (or off).
osd_scrub_during_recovery
Description
Allows scrubbing during recovery.
Type
Boolean
Default
false
osd_scrub_invalid_stats
Description
Forces extra scrub to fix stats marked as invalid.
Type
Boolean
Default
true
osd_scrub_priority
Description
Controls queue priority of scrub operations versus client I/O.
Type
Unsigned 32-bit Integer
Default
5
osd_requested_scrub_priority
Description
The priority set for user-requested scrubs on the work queue.
Type
Unsigned 32-bit Integer
Default
120
osd_scrub_cost
Description
Cost of scrub operations in megabytes for queue scheduling purposes.
Type
Unsigned 32-bit Integer
Default
52428800
osd_deep_scrub_interval
Description
The interval for deep scrubbing, that is fully reading all data. The osd scrub load threshold parameter does not affect
this setting.
Type
Float
Default
Once per week. 60*60*24*7
osd_deep_scrub_stride
Description
Read size when doing a deep scrub.
Type
32-bit Integer
Default
512 KB. 524288
mon_warn_not_deep_scrubbed
Description
Number of seconds after osd_deep_scrub_interval to warn about any PGs that were not scrubbed.
Type
Integer
Default
0 (no warning)
osd_deep_scrub_randomize_ratio
Description
The rate at which scrubs will randomly become deep scrubs (even before osd_deep_scrub_interval has passed).
Type
Float
Default
0.15 or 15%
osd_deep_scrub_update_digest_min_age
Description
How many seconds old objects must be before scrub updates the whole-object digest.
Type
Integer
osd_deep_scrub_large_omap_object_key_threshold
Description
Warning when you encounter an object with more OMAP keys than this.
Type
Integer
Default
200000
osd_deep_scrub_large_omap_object_value_sum_threshold
Description
Warning when you encounter an object with more OMAP key bytes than this.
Type
Integer
Default
1 G
osd_delete_sleep
Description
Time in seconds to sleep before the next removal transaction. This throttles the placement group deletion process.
Type
Float
Default
0.0
osd_delete_sleep_hdd
Description
Time in seconds to sleep before the next removal transaction for HDDs.
Type
Float
Default
5.0
osd_delete_sleep_ssd
Description
Time in seconds to sleep before the next removal transaction for SSDs.
Type
Float
Default
1.0
osd_delete_sleep_hybrid
Description
Time in seconds to sleep before the next removal transaction when Ceph OSD data is on HDD and OSD journal or WAL and DB
is on SSD.
Type
Float
Default
1.0
osd_op_num_shards
Description
The number of shards for client operations.
Type
32-bit Integer
osd_op_num_threads_per_shard
Description
The number of threads per shard for client operations.
Type
32-bit Integer
Default
0
osd_op_num_shards_hdd
Description
The number of shards for HDD operations.
Type
32-bit Integer
Default
5
osd_op_num_threads_per_shard_hdd
Description
The number of threads per shard for HDD operations.
Type
32-bit Integer
Default
1
osd_op_num_shards_ssd
Description
The number of shards for SSD operations.
Type
32-bit Integer
Default
8
osd_op_num_threads_per_shard_ssd
Description
The number of threads per shard for SSD operations.
Type
32-bit Integer
Default
2
osd_client_op_priority
Description
The priority set for client operations. It is relative to osd recovery op priority.
Type
32-bit Integer
Default
63
Valid Range
1-63
osd_recovery_op_priority
Description
The priority set for recovery operations. It is relative to osd client op priority.
Default
3
Valid Range
1-63
osd_op_thread_timeout
Description
The Ceph OSD operation thread timeout in seconds.
Type
32-bit Integer
Default
15
osd_op_complaint_time
Description
An operation becomes complaint worthy after the specified number of seconds have elapsed.
Type
Float
Default
30
osd_disk_threads
Description
The number of disk threads, which are used to perform background disk intensive OSD operations such as scrubbing and snap
trimming.
Type
32-bit Integer
Default
1
osd_op_history_size
Description
The maximum number of completed operations to track.
Type
32-bit Unsigned Integer
Default
20
osd_op_history_duration
Description
The oldest completed operation to track.
Type
32-bit Unsigned Integer
Default
600
osd_op_log_threshold
Description
How many operations logs to display at once.
Type
32-bit Integer
Default
5
osd_op_timeout
Type
Integer
Default
0
IMPORTANT: Do not set the osd op timeout option unless your clients can handle the consequences. For example, setting
this parameter on clients running in virtual machines can lead to data corruption because the virtual machines interpret this
timeout as a hardware failure.
osd_max_backfills
Description
The maximum number of backfill operations allowed to or from a single OSD.
Type
64-bit Unsigned Integer
Default
1
osd_backfill_scan_min
Description
The minimum number of objects per backfill scan.
Type
32-bit Integer
Default
64
osd_backfill_scan_max
Description
The maximum number of objects per backfill scan.
Type
32-bit Integer
Default
512
osd_backfill_full_ratio
Description
Refuse to accept backfill requests when the Ceph OSD’s full ratio is above this value.
Type
Float
Default
0.85
osd_backfill_retry_interval
Description
The number of seconds to wait before retrying backfill requests.
Type
Double
Default
30.000000
osd_map_dedup
Description
Enable removing duplicates in the OSD map.
Type
Boolean
osd_map_cache_size
Description
The size of the OSD map cache in megabytes.
Type
32-bit Integer
Default
50
osd_map_cache_bl_size
Description
The size of the in-memory OSD map cache in OSD daemons.
Type
32-bit Integer
Default
50
osd_map_cache_bl_inc_size
Description
The size of the in-memory OSD map cache incrementals in OSD daemons.
Type
32-bit Integer
Default
100
osd_map_message_max
Description
The maximum map entries allowed per MOSDMap message.
Type
32-bit Integer
Default
40
osd_snap_trim_thread_timeout
Description
The maximum time in seconds before timing out a snap trim thread.
Type
32-bit Integer
Default
60*60*1
osd_pg_max_concurrent_snap_trims
Description
The maximum number of parallel snap trims per PG. This controls how many objects per PG to trim at once.
Type
32-bit Integer
Default
2
osd_snap_trim_sleep
Description
Insert a sleep between every trim operation a PG issues.
Type
32-bit Integer
osd_max_trimming_pgs
Description
The maximum number of trimming PGs.
Type
32-bit Integer
Default
2
osd_backlog_thread_timeout
Description
The maximum time in seconds before timing out a backlog thread.
Type
32-bit Integer
Default
60*60*1
osd_default_notify_timeout
Description
The OSD default notification timeout (in seconds).
Type
32-bit Integer Unsigned
Default
30
osd_check_for_log_corruption
Description
Check log files for corruption. Can be computationally expensive.
Type
Boolean
Default
false
osd_remove_thread_timeout
Description
The maximum time in seconds before timing out a remove OSD thread.
Type
32-bit Integer
Default
60*60
osd_command_thread_timeout
Description
The maximum time in seconds before timing out a command thread.
Type
32-bit Integer
Default
10*60
osd_command_max_records
Description
Limits the number of lost objects to return.
Type
32-bit Integer
osd_auto_upgrade_tmap
Description
Uses tmap for omap on old objects.
Type
Boolean
Default
true
osd_tmapput_sets_uses_tmap
Description
Uses tmap for debugging only.
Type
Boolean
Default
false
osd_preserve_trimmed_log
Description
Preserves trimmed log files, but uses more disk space.
Type
Boolean
Default
false
osd_recovery_delay_start
Description
After peering completes, Ceph delays for the specified number of seconds before starting to recover objects.
Type
Float
Default
0
osd_recovery_max_active
Description
The number of active recovery requests per OSD at one time. More requests will accelerate recovery, but the requests place
an increased load on the cluster.
Type
32-bit Integer
Default
0
osd_recovery_max_chunk
Description
The maximum size of a recovered chunk of data to push.
Type
64-bit Integer Unsigned
Default
8388608
osd_recovery_threads
Description
The number of threads for recovering data.
Type
32-bit Integer
osd_recovery_thread_timeout
Description
The maximum time in seconds before timing out a recovery thread.
Type
32-bit Integer
Default
30
osd_recover_clone_overlap
Description
Preserves clone overlap during recovery. Should always be set to true.
Type
Boolean
Default
true
rados_osd_op_timeout
Description
Number of seconds that RADOS waits for a response from the OSD before returning an error from a RADOS operation. A value
of 0 means no limit.
Type
Double
Default
0
mon_osd_min_up_ratio
Description
The minimum ratio of up Ceph OSD Daemons before Ceph will mark Ceph OSD Daemons down.
Type
Double
Default
.3
mon_osd_min_in_ratio
Description
The minimum ratio of in Ceph OSD Daemons before Ceph will mark Ceph OSD Daemons out.
Type
Double
Default
0.750000
mon_osd_laggy_halflife
Description
The number of seconds laggy estimates will decay.
Type
Integer
Default
60*60
mon_osd_laggy_weight
Description
The weight for new samples in the laggy estimation decay.
Type
Double
Default
0.3
mon_osd_laggy_max_interval
Description
Maximum value of laggy_interval in laggy estimations (in seconds). The monitor uses an adaptive approach to evaluate
the laggy_interval of a certain OSD. This value will be used to calculate the grace time for that OSD.
Type
Integer
Default
300
mon_osd_adjust_heartbeat_grace
Description
If set to true, Ceph will scale based on laggy estimations.
Type
Boolean
Default
true
mon_osd_adjust_down_out_interval
Description
If set to true, Ceph will scale based on laggy estimations.
Type
Boolean
Default
true
mon_osd_auto_mark_in
Description
Ceph will mark any booting Ceph OSD Daemons as in the Ceph Storage Cluster.
Type
Boolean
Default
false
mon_osd_auto_mark_auto_out_in
Description
Ceph will mark booting Ceph OSD Daemons auto marked out of the Ceph Storage Cluster as in the cluster.
Type
Boolean
Default
true
mon_osd_auto_mark_new_in
Description
Ceph will mark booting new Ceph OSD Daemons as in the Ceph Storage Cluster.
Type
Boolean
Default
true
mon_osd_down_out_interval
Description
The number of seconds Ceph waits before marking a Ceph OSD Daemon down and out if it does not respond.
Type
32-bit Integer
Default
600
mon_osd_downout_subtree_limit
Description
The largest CRUSH unit type that Ceph will automatically mark out.
Type
String
Default
rack
mon_osd_reporter_subtree_level
Description
This setting defines the parent CRUSH unit type for the reporting OSDs. The OSDs send failure reports to the monitor if they
find an unresponsive peer. The monitor may mark the reported OSD down and then out after a grace period.
Type
String
Default
host
mon_osd_report_timeout
Description
The grace period in seconds before declaring unresponsive Ceph OSD Daemons down.
Type
32-bit Integer
Default
900
mon_osd_min_down_reporters
Description
The minimum number of Ceph OSD Daemons required to report a down Ceph OSD Daemon.
Type
32-bit Integer
Default
2
osd_heartbeat_address
Description
A Ceph OSD Daemon’s network address for heartbeats.
Type
Address
Default
The host address.
osd_heartbeat_interval
Description
How often a Ceph OSD Daemon pings its peers (in seconds).
Type
32-bit Integer
Default
6
osd_heartbeat_grace
Description
The elapsed time, in seconds, without a heartbeat after which the Ceph Storage Cluster considers a Ceph OSD Daemon down.
Type
32-bit Integer
Default
20
osd_mon_heartbeat_interval
Description
Frequency of Ceph OSD Daemon pinging a Ceph Monitor if it has no Ceph OSD Daemon peers.
Type
32-bit Integer
Default
30
osd_mon_report_interval_max
Description
The maximum time in seconds that a Ceph OSD Daemon can wait before it must report to a Ceph Monitor.
Type
32-bit Integer
Default
120
osd_mon_report_interval_min
Description
The minimum number of seconds a Ceph OSD Daemon may wait from startup or another reportable event before reporting to
a Ceph Monitor.
Type
32-bit Integer
Default
5
Valid Range
Should be less than osd mon report interval max
osd_mon_ack_timeout
Description
The number of seconds to wait for a Ceph Monitor to acknowledge a request for statistics.
Type
32-bit Integer
Default
30
The options take a single item that is assumed to be the default for all daemons regardless of channel. For example, specifying "info"
is interpreted as "default=info". However, options can also take key/value pairs. For example, "default=daemon audit=local0" is
interpreted as "default all to daemon, override audit with local0."
log_file
Description
The location of the logging file for the cluster.
Required
No
Default
/var/log/ceph/$cluster-$name.log
mon_cluster_log_file
Description
The location of the monitor cluster’s log file.
Type
String
Required
No
Default
/var/log/ceph/$cluster.log
log_max_new
Description
The maximum number of new log files.
Type
Integer
Required
No
Default
1000
log_max_recent
Description
The maximum number of recent events to include in a log file.
Type
Integer
Required
No
Default
10000
log_flush_on_exit
Description
Determines if Ceph flushes the log files after exit.
Type
Boolean
Required
No
Default
true
mon_cluster_log_file_level
Description
The level of file logging for the monitor cluster. Valid settings include "debug", "info", "sec", "warn", and "error".
Type
String
Default
"info"
log_to_stderr
Description
Determines if logging messages appear in stderr.
Type
Boolean
Required
No
Default
true
err_to_stderr
Description
Determines if error messages appear in stderr.
Type
Boolean
Required
No
Default
true
log_to_syslog
Description
Determines if logging messages appear in syslog.
Type
Boolean
Required
No
Default
false
err_to_syslog
Description
Determines if error messages appear in syslog.
Type
Boolean
Required
No
Default
false
clog_to_syslog
Description
Determines if clog messages will be sent to syslog.
Type
Boolean
Required
No
Default
false
mon_cluster_log_to_syslog
Description
Determines if the cluster log will be output to syslog.
Type
Boolean
Required
No
mon_cluster_log_to_syslog_level
Description
The level of syslog logging for the monitor cluster. Valid settings include "debug", "info", "sec", "warn", and "error".
Type
String
Default
"info"
mon_cluster_log_to_syslog_facility
Description
The facility generating the syslog output. This is usually set to "daemon" for the Ceph daemons.
Type
String
Default
"daemon"
clog_to_monitors
Description
Determines if clog messages will be sent to monitors.
Type
Boolean
Required
No
Default
true
mon_cluster_log_to_graylog
Description
Determines if the cluster will output log messages to graylog.
Type
String
Default
"false"
mon_cluster_log_to_graylog_host
Description
The IP address of the graylog host. If the graylog host is different from the monitor host, override this setting with the
appropriate IP address.
Type
String
Default
"127.0.0.1"
mon_cluster_log_to_graylog_port
Description
Graylog logs will be sent to this port. Ensure the port is open for receiving data.
Type
String
Default
"12201"
osd_preserve_trimmed_log
Description
Preserves trimmed logs after trimming.
Type
Boolean
Required
No
Default
false
osd_tmapput_sets_uses_tmap
Description
Uses tmap. For debug only.
Type
Boolean
Required
No
Default
false
osd_min_pg_log_entries
Description
The minimum number of log entries for placement groups.
Type
32-bit Unsigned Integer
Required
No
Default
1000
osd_op_log_threshold
Description
Number of op log messages to show up in one pass.
Type
Integer
Required
No
Default
5
You can set these configuration options with the ceph config set global CONFIGURATION_OPTION VALUE command.
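For example, a minimal sketch of the pattern; the option and value are taken from the entries above for illustration only:
ceph config set global osd_op_log_threshold 5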
mds_max_scrub_ops_in_progress
Description
The maximum number of scrub operations performed in parallel. You can set this value with the ceph config set
mds_max_scrub_ops_in_progress VALUE command.
Type
integer
Default
5
osd_max_scrubs
Description
The maximum number of simultaneous scrub operations for a Ceph OSD Daemon.
Type
integer
Default
1
osd_scrub_begin_hour
Description
The specific hour at which the scrubbing begins. Along with osd_scrub_end_hour, you can define a time window in which
the scrubs can happen. Use osd_scrub_begin_hour = 0 and osd_scrub_end_hour = 0 to allow scrubbing the entire
day.
Type
integer
Default
0
Allowed range
[0, 23]
osd_scrub_end_hour
Description
The specific hour at which the scrubbing ends. Along with osd_scrub_begin_hour, you can define a time window, in which
the scrubs can happen. Use osd_scrub_begin_hour = 0 and osd_scrub_end_hour = 0 to allow scrubbing for the
entire day.
Type
integer
Default
0
Allowed range
[0, 23]
osd_scrub_begin_week_day
Description
The specific day on which the scrubbing begins. 0 = Sunday, 1 = Monday, etc. Along with osd_scrub_end_week_day, you
can define a time window in which scrubs can happen. Use osd_scrub_begin_week_day = 0 and
osd_scrub_end_week_day = 0 to allow scrubbing for the entire week.
Type
integer
Default
0
Allowed range
[0, 6]
osd_scrub_end_week_day
Description
This defines the day on which the scrubbing ends. 0 = Sunday, 1 = Monday, etc. Along with osd_scrub_begin_week_day,
they define a time window, in which the scrubs can happen. Use osd_scrub_begin_week_day = 0 and
osd_scrub_end_week_day = 0 to allow scrubbing for the entire week.
Type
integer
Default
0
Allowed range
[0, 6]
osd_scrub_during_recovery
Description
Allows scrubbing when placement groups on the OSD are in the process of recovery.
Type
boolean
Default
false
osd_scrub_load_threshold
Description
The normalized maximum load. Scrubbing does not happen when the system load, as defined by getloadavg()/number of
online CPUs, is higher than this defined number.
Type
float
Default
0.5
osd_scrub_min_interval
Description
The minimal interval in seconds for scrubbing the Ceph OSD daemon when the Ceph storage Cluster load is low.
Type
float
Default
1 day
osd_scrub_max_interval
Description
The maximum interval in seconds for scrubbing the Ceph OSD daemon irrespective of cluster load.
Type
float
Default
7 days
osd_scrub_chunk_min
Description
The minimal number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during
scrub.
Type
integer
Default
5
osd_scrub_chunk_max
Description
The maximum number of object store chunks to scrub during a single operation.
Type
integer
Default
25
osd_scrub_sleep
Description
Time to sleep before scrubbing the next group of chunks. Increasing this value slows down the overall rate of scrubbing, so
that client operations are less impacted.
Type
float
Default
0.0
osd_scrub_extended_sleep
Description
Duration, in seconds, of the delay injected during scrubbing that occurs outside of the configured scrubbing hours or days.
Type
float
Default
0.0
osd_scrub_backoff_ratio
Description
Backoff ratio for scheduling scrubs. This is the percentage of ticks that do NOT schedule scrubs; 66% means that 1 out of 3
ticks schedules scrubs.
Type
float
Default
0.66
osd_deep_scrub_interval
Description
The interval for deep scrubbing, fully reading all data. The osd_scrub_load_threshold does not affect this setting.
Type
float
Default
7 days
osd_debug_deep_scrub_sleep
Description
Inject an expensive sleep during deep scrub I/O to make it easier to induce preemption.
Type
float
Default
0
osd_scrub_interval_randomize_ratio
Description
Add a random delay to osd_scrub_min_interval when scheduling the next scrub job for a placement group. The delay is a
random value less than osd_scrub_min_interval * osd_scrub_interval_randomize_ratio. The default setting
spreads scrubs throughout the allowed time window of [1, 1.5] * osd_scrub_min_interval.
Type
float
Default
0.5
osd_deep_scrub_stride
Description
Read size when doing a deep scrub.
Type
size
Default
512 KB
osd_scrub_auto_repair_num_errors
Description
Auto repair does not occur if more than this many errors are found.
Type
integer
Default
5
osd_scrub_auto_repair
Description
Setting this to true enables automatic Placement Group (PG) repair when errors are found by scrubs or deep-scrubs.
However, if more than osd_scrub_auto_repair_num_errors errors are found, a repair is NOT performed.
Type
boolean
Default
false
osd_scrub_max_preemptions
Description
Set the maximum number of times you need to preempt a deep scrub due to a client operation before blocking client IO to
complete the scrub.
Type
integer
Default
5
osd_deep_scrub_keys
Description
Number of keys to read from an object at a time during deep scrub.
Type
integer
Default
1024
rocksdb_cache_size
Description
The size of the RocksDB cache in MB.
Type
32-bit Integer
Default
512
Administering
Edit online
Learn how to properly administer and operate IBM Storage Ceph.
Administration
Operations
Ceph administration
Understanding process management for Ceph
Monitoring a Ceph storage cluster
Stretch clusters for Ceph storage
Override Ceph behavior
Ceph user management
The ceph-volume utility
Ceph performance benchmark
Ceph performance counters
BlueStore
Cephadm troubleshooting
Cephadm operations
Managing an IBM Storage Ceph cluster using cephadm-ansible modules
Ceph administration
Edit online
An IBM Storage Ceph cluster is the foundation for all Ceph deployments. After deploying an IBM Storage Ceph cluster, there are
administrative operations for keeping an IBM Storage Ceph cluster healthy and performing optimally.
How do I start and stop the IBM Storage Ceph cluster services?
How do I add or remove an OSD from a running IBM Storage Ceph cluster?
How do I manage user authentication and access controls to the objects stored in an IBM Storage Ceph cluster?
I want to understand how to use overrides with an IBM Storage Ceph cluster.
A Ceph Object Storage Device (OSD) stores data as objects within placement groups assigned to the OSD.
A production system will have three or more Ceph Monitors for high availability and typically a minimum of 50 OSDs for acceptable
load balancing, data re-balancing and data recovery.
Reference
Edit online
For more information on using systemd, see Introduction to systemd and Managing system services with systemctl within the
Configuring basic system settings guide for Red Hat Enterprise Linux 8.
Prerequisites
Edit online
Procedure
Edit online
1. On the host where you want to start, stop, and restart the daemons, run the systemctl service to get the SERVICE_ID of the
service.
Example
Syntax
Example
Syntax
Example
Syntax
Example
IMPORTANT: If you want to start, stop, or restart a specific Ceph daemon on a specific host, you need to use the SystemD service. To
obtain a list of the SystemD services running in a specific host, connect to the host, and run the following command:
Example
The output will give you a list of the service names that you can use to manage each Ceph daemon.
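A hedged sketch of the commands this step describes; the FSID and the daemon name osd.8 are placeholders for values from your own cluster:
# list the Ceph systemd services running on this host
systemctl --type=service | grep ceph
# start, stop, or restart a specific daemon through its SystemD service name
systemctl restart ceph-FSID@osd.8.service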
Prerequisites
Edit online
Procedure
Edit online
Example
2. Run the ceph orch ls command to get a list of Ceph services configured in the IBM Storage Ceph cluster and to get the
specific service ID.
Example
Syntax
Example
IMPORTANT: The ceph orch stop SERVICE_ID command results in the IBM Storage Ceph cluster being inaccessible, but
only for the MON and MGR services. It is recommended to use the systemctl stop SERVICE_ID command to stop a
specific daemon on the host.
Syntax
Example
In the example the ceph orch stop node-exporter command removes all the daemons of the node exporter service.
Syntax
Example
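A hedged sketch of the orchestrator commands referenced above, reusing the node-exporter service from the example:
ceph orch ls                      # list services and their service IDs
ceph orch stop node-exporter      # stop all daemons of the node-exporter service
ceph orch start node-exporter     # start them again
ceph orch restart node-exporter   # or restart the whole service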
Prerequisites
Edit online
Procedure
Edit online
1. To view the entire Ceph log file, run a journalctl command as root composed in the following format:
Syntax
Syntax
Example
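For illustration, assuming a cephadm deployment where the unit name follows the ceph-FSID@SERVICE_ID pattern; both values are placeholders:
journalctl -u ceph-FSID@osd.8.service     # view the entire log for the daemon
journalctl -fu ceph-FSID@osd.8.service    # follow the log in real time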
NOTE: You can also use the sosreport utility to view the journald logs. For more details about SOS reports, see the What is an
sosreport and how to create one in Red Hat Enterprise Linux? solution on the Red Hat Customer Portal.
Reference
Edit online
Powering down and rebooting the cluster using the systemctl commands
Powering down and rebooting the cluster using the Ceph Orchestrator
Prerequisites
Edit online
Root-level access.
Procedure
Edit online
Powering down the IBM Storage Ceph cluster
1. Stop the clients from using the Block Device images and the RADOS Gateway (Ceph Object Gateway) on this cluster, as well
as any other clients.
Example
Example
4. If you use the Ceph File System (CephFS), bring down the CephFS cluster:
Syntax
Example
5. Set the noout, norecover, norebalance, nobackfill, nodown, and pause flags. Run the following on a node with the
client keyrings, for example, the Ceph Monitor or OpenStack controller node:
Example
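A minimal sketch of this flag-setting step, run on a node that has the client keyrings:
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause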
6. If the MDS and Ceph Object Gateway nodes are on their own dedicated nodes, power them off.
Example
Example
1. If network equipment was involved, ensure it is powered ON and stable prior to powering ON any Ceph hosts or nodes.
Example
Example
5. Wait for all the nodes to come up. Verify all the services are up and there are no connectivity issues between the nodes.
6. Unset the noout, norecover, norebalance, nobackfill, nodown and pause flags. Run the following on a node with the
client keyrings, for example, the Ceph Monitor or OpenStack controller node:
7. If you use the Ceph File System (CephFS), bring the CephFS cluster back up by setting the joinable flag to true:
Syntax
Example
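A sketch of the corresponding unset step together with the CephFS joinable flag; FS_NAME is a placeholder for your file system name:
ceph osd unset noout
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nobackfill
ceph osd unset nodown
ceph osd unset pause
ceph fs set FS_NAME joinable true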
Verification
Edit online
Verify the cluster is in healthy state (Health_OK and all PGs active+clean). Run ceph status on a node with the client
keyrings, for example, the Ceph Monitor or OpenStack controller nodes, to ensure the cluster is healthy.
Example
Reference
Edit online
The Ceph Orchestrator supports several operations, such as start, stop, and restart. You can use these commands with
systemctl, for some cases, in powering down or rebooting the cluster.
Prerequisites
Edit online
Procedure
Edit online
Powering down the IBM Storage Ceph cluster
1. Stop the clients from using the Block Device images and the Ceph Object Gateway on this cluster, as well as any other clients.
3. The cluster must be in healthy state (Health_OK and all PGs active+clean) before proceeding. Run ceph status on the
host with the client keyrings, for example, the Ceph Monitor or OpenStack controller nodes, to ensure the cluster is healthy.
Example
4. If you use the Ceph File System (CephFS), bring down the CephFS cluster:
Syntax
Example
5. Set the noout, norecover, norebalance, nobackfill, nodown, and pause flags. Run the following on a node with the
client keyrings, for example, the Ceph Monitor or OpenStack controller node:
Example
Example
b. Stop the MDS service using the fetched name in the previous step:
Syntax
7. Stop the Ceph Object Gateway services. Repeat for each deployed service.
Example
b. Stop the Ceph Object Gateway service using the fetched name:
Syntax
Example
Example
Example
Example
Example
13. Shut down the OSD nodes from the cephadm node, one by one. Repeat this step for all the OSDs in the cluster.
Example
b. Shut down the OSD node using the OSD ID you fetched:
Example
Example
Example
Syntax
1. If network equipment was involved, ensure it is powered ON and stable prior to powering ON any Ceph hosts or nodes.
Example
Example
Example
6. Unset the noout, norecover, norebalance, nobackfill, nodown and pause flags. Run the following on a node with the
client keyrings, for example, the Ceph Monitor or OpenStack controller node:
Example
7. If you use the Ceph File System (CephFS), bring the CephFS cluster back up by setting the joinable flag to true:
Syntax
Example
Verification
Edit online
Verify the cluster is in healthy state (Health_OK and all PGs active+clean). Run ceph status on a node with the client
keyrings, for example, the Ceph Monitor or OpenStack controller nodes, to ensure the cluster is healthy.
Example
Reference
Edit online
Once you have a running IBM Storage Ceph cluster, you can begin monitoring the storage cluster to ensure, at a high level, that
the Ceph Monitor and Ceph OSD daemons are running. Ceph storage cluster clients connect to a Ceph Monitor and receive the latest
version of the storage cluster map before they can read and write data to the Ceph pools within the storage cluster. So the monitor
cluster must have agreement on the state of the cluster before Ceph clients can read and write data.
Ceph OSDs must peer the placement groups on the primary OSD with the copies of the placement groups on secondary OSDs. If
faults arise, peering will reflect something other than the active + clean state.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Replace:
MONITOR_NAME with the name of the Ceph Monitor container, found by running the podman ps command.
Example
This example opens an interactive terminal session on mon.host01, where you can start the Ceph interactive shell.
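An illustrative sketch only; the container name is an assumption, so locate the real name with podman ps first:
podman ps --format "{{.Names}}" | grep mon          # find the monitor container name
podman exec -it ceph-FSID-mon.host01 /bin/bash      # open an interactive session in the container
ceph                                                # start the Ceph interactive shell (ceph> prompt)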
Prerequisites
Edit online
Procedure
Edit online
Example
2. You can check on the health of the Ceph storage cluster with the following command:
Example
3. You can check the status of the Ceph storage cluster by running ceph status command:
Example
Cluster ID
The monitor map epoch and the status of the monitor quorum.
The notional amount of data stored and the number of objects stored.
Upon starting the Ceph cluster, you will likely encounter a health warning such as HEALTH_WARN XXX num
placement groups stale. Wait a few moments and check it again. When the storage cluster is ready, ceph
health should return a message such as HEALTH_OK. At that point, it is okay to begin using the cluster.
Procedure
Edit online
Example
Example
services:
mon: 2 daemons, quorum Ceph5-2,Ceph5-adm (age 3d)
mgr: Ceph5-1.nqikfh(active, since 3w), standbys: Ceph5-adm.meckej
osd: 5 osds: 5 up (since 2d), 5 in (since 8w)
rgw: 2 daemons active (test_realm.test_zone.Ceph5-2.bfdwcn, test_realm.test_zone.Ceph5-
adm.acndrh)
data:
pools: 11 pools, 273 pgs
objects: 459 objects, 32 KiB
usage: 2.6 GiB used, 72 GiB / 75 GiB avail
pgs: 273 active+clean
io:
client: 170 B/s rd, 730 KiB/s wr, 0 op/s rd, 729 op/s wr
The SIZE/AVAIL/RAW USED in the ceph df and ceph status command output are different if some OSDs are marked OUT of the
cluster compared to when all OSDs are IN. The SIZE/AVAIL/RAW USED is calculated from sum of SIZE (osd disk size), RAW USE
(total used space on disk), and AVAIL of all OSDs which are in IN state. You can see the total of SIZE/AVAIL/RAW USED for all OSDs
in ceph osd df tree command output.
Example
The ceph df detail command gives more details about other pool statistics such as quota objects, quota bytes, used
compression, and under compression.
The RAW STORAGE section of the output provides an overview of the amount of storage the storage cluster manages for data.
In the above example, if the SIZE is 90 GiB, it is the total size without the replication factor, which is three by default. The
total available capacity with the replication factor is 90 GiB/3 = 30 GiB. Based on the full ratio, which is 0.85 by default, the
maximum available space is 30 GiB * 0.85 = 25.5 GiB
In the above example, if the SIZE is 90 GiB and the USED space is 6 GiB, then the AVAIL space is 84 GiB. The total available
space with the replication factor, which is three by default, is 84 GiB/3 = 28 GiB
In the above example, 100 MiB is the total space available after considering the replication factor. The actual available size is
33 MiB.
RAW USED: The amount of raw storage consumed by user data, internal overhead, or reserved capacity.
% RAW USED: The percentage of RAW USED. Use this number in conjunction with the full ratio and near full ratio
to ensure that you are not reaching the storage cluster’s capacity.
The POOLS section of the output provides a list of pools and the notional usage of each pool. The output from this section DOES NOT
reflect replicas, clones or snapshots. For example, if you store an object with 1 MB of data, the notional usage will be 1 MB, but the
actual usage may be 3 MB or more depending on the number of replicas (for example, size = 3), clones, and snapshots.
OBJECTS: The notional number of objects stored per pool. It is STORED size * replication factor.
USED: The notional amount of data stored in kilobytes, unless the number appends M for megabytes or G for gigabytes.
MAX AVAIL: An estimate of the notional amount of data that can be written to this pool. It is the amount of data that can be
used before the first OSD becomes full. It considers the projected distribution of data across disks from the CRUSH map and
uses the first OSD to fill up as the target.
In the above example, MAX AVAIL is 153.85 MB without considering the replication factor, which is three by default.
See the Knowledgebase article titled ceph df MAX AVAIL is incorrect for simple replicated pool to calculate the value of MAX
AVAIL.
USED COMPR: The amount of space allocated for compressed data. This includes the compressed data plus the allocation,
replication, and erasure coding overhead.
UNDER COMPR: The amount of data passed through compression and beneficial enough to be stored in a compressed form.
NOTE: The numbers in the POOLS section are notional. They are not inclusive of the number of replicas, snapshots or clones. As a
result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL
section of the output.
NOTE: The MAX AVAIL value is a complicated function of the replication or erasure code used, the CRUSH rule that maps storage to
devices, the utilization of those devices, and the configured mon_osd_full_ratio.
Reference
Edit online
Example
OMAP: An estimated value of the bluefs storage that is being used to store object map (omap) data (key value pairs stored in
rocksdb).
META: The bluefs space allocated, or the value set in the bluestore_bluefs_min parameter, whichever is larger, for
internal metadata which is calculated as the total space allocated in bluefs minus the estimated omap data size.
MIN/MAX VAR: The minimum and maximum variation across all OSDs.
Reference
Edit online
For more information, see:
CRUSH Weights
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Or
Example
Example
services:
mon: 3 daemons, quorum host03,host02 (age 3d), out of quorum: host01
mgr: host01.hdhzwn(active, since 9d), standbys: host05.eobuuv, host06.wquwpj
osd: 12 osds: 11 up (since 2w), 11 in (since 5w)
rgw: 2 daemons active (test_realm.test_zone.host04.hgbvnq,
test_realm.test_zone.host05.yqqilm)
data:
pools: 8 pools, 960 pgs
objects: 414 objects, 1.0 MiB
usage: 5.7 GiB used, 214 GiB / 220 GiB avail
pgs: 960 active+clean
io:
client: 41 KiB/s rd, 0 B/s wr, 41 op/s rd, 27 op/s wr
ceph> health
HEALTH_WARN 1 stray daemon(s) not managed by cephadm; 1/3 mons down, quorum host03,host02; too
many PGs per OSD (261 > max 250)
Check the Ceph Monitor status periodically to ensure that they are running. If there is a problem with the Ceph Monitors that
prevents agreement on the state of the storage cluster, the fault can prevent Ceph clients from reading and writing data.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
or
Example
3. To check the quorum status for the storage cluster, execute the following:
Example
{
"election_epoch": 6686,
"quorum": [
0,
1,
2
],
"quorum_names": [
"host01",
"host03",
"host02"
],
"quorum_leader_name": "host01",
"quorum_age": 424884,
"features": {
"quorum_con": "4540138297136906239",
"quorum_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
]
},
"monmap": {
"epoch": 3,
"fsid": "499829b4-832f-11eb-8d6d-001a4a000635",
"modified": "2021-03-15T04:51:38.621737Z",
"created": "2021-03-12T12:35:16.911339Z",
"min_mon_release": 16,
"min_mon_release_name": "pacific",
"election_strategy": 1,
"disallowed_leaders: ": "",
"stretch_mode": false,
"features": {
"persistent": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "host01",
"public_addrs": {
Use the administration socket to set configuration values at runtime directly without relying on Monitors. This is useful when the Monitors are down.
In addition, using the socket is helpful when troubleshooting problems related to Ceph Monitors or OSDs.
Regardless, if the daemon is not running, an error is returned when attempting to use the administration socket.
IMPORTANT:
The administration socket is only available while a daemon is running. When you shut down the daemon properly, the administration
socket is removed. However, if the daemon terminates unexpectedly, the administration socket might persist.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Replace:
COMMAND with the command to run. Use help to list the available commands for a given daemon.
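A hedged example of the pattern, using help and mon_status; the daemon name mon.host01 is an assumption, and the JSON shown below is typical mon_status output:
ceph daemon mon.host01 help          # list the commands this daemon's admin socket supports
ceph daemon mon.host01 mon_status    # query the monitor's status over the admin socket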
Example
Example
{
"name": "host01",
"rank": 0,
"state": "leader",
"election_epoch": 120,
"quorum": [
0,
1,
2
],
"quorum_age": 206358,
"features": {
"required_con": "2449958747317026820",
"required_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
],
"quorum_con": "4540138297136906239",
"quorum_mon": [
"kraken",
"luminous",
"mimic",
"osdmap-prune",
"nautilus",
"octopus",
"pacific",
"elector-pinging"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 3,
Syntax
Example
Example
Reference
Edit online
If you execute a command such as ceph health, ceph -s or ceph -w, you might notice that the storage cluster does not always
echo back HEALTH OK. Do not panic. With respect to Ceph OSDs, you can expect that the storage cluster will NOT echo HEALTH OK
in a few expected circumstances:
You have not started the storage cluster yet, and it is not responding.
You have just started or restarted the storage cluster, and it is not ready yet, because the placement groups are getting
created and the Ceph OSDs are in the process of peering.
An important aspect of monitoring Ceph OSDs is to ensure that, when the storage cluster is up and running, all Ceph OSDs that
are in the storage cluster are up and running, too.
Example
or
Example
The result should tell you the map epoch, eNNNN, the total number of OSDs, x, how many, y, are up, and how many, z, are in:
If the number of Ceph OSDs that are in the storage cluster is greater than the number of Ceph OSDs that are up, execute the
following command to identify the ceph-osd daemons that are not running:
Example
TIP: The ability to search through a well-designed CRUSH hierarchy can help you troubleshoot the storage cluster by identifying the
physical locations faster.
If a Ceph OSD is down, connect to the node and start it. You can use IBM Storage Ceph Console to restart the Ceph OSD daemon, or
you can use the command line.
Syntax
Example
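A minimal sketch for a cephadm deployment; the FSID and the OSD ID are placeholders:
systemctl restart ceph-FSID@osd.8.service
# or, from a node with the admin keyring, using the orchestrator:
ceph orch daemon restart osd.8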
Reference
Edit online
You added or removed an OSD. Then, CRUSH reassigned the placement group to other Ceph OSDs, thereby changing the
composition of the acting set and spawning the migration of data with a "backfill" process.
A Ceph OSD in the acting set is down or unable to service requests, and another Ceph OSD has temporarily assumed its duties.
Ceph processes a client request using the Acting Set, which is the set of Ceph OSDs that actually handle the requests. In most cases,
the Up Set and the Acting Set are virtually identical. When they are not, it can indicate that Ceph is migrating data, a Ceph OSD is
recovering, or that there is a problem, that is, Ceph usually echoes a HEALTH WARN state with a "stuck stale" message in such
scenarios.
Procedure
Edit online
Example
Example
3. View which Ceph OSDs are in the Acting Set or in the Up Set for a given placement group:
Syntax
Example
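For illustration, using a placeholder placement group ID:
ceph pg map 0.1f
# typical form of the reply: osdmap eNNNN pg 0.1f (0.1f) -> up [1,5,9] acting [1,5,9]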
NOTE: If the Up Set and Acting Set do not match, this may be an indicator that the storage cluster is rebalancing itself or of a
potential problem with the storage cluster.
Peering
Figure 1. Peering
You have just created a pool and placement groups have not peered yet.
You have just added an OSD to or removed an OSD from the cluster.
You have just modified the CRUSH map and the placement groups are migrating.
Ceph does not have enough storage capacity to complete backfilling operations.
If one of the foregoing circumstances causes Ceph to echo HEALTH WARN, do not panic. In many cases, the cluster will recover on its
own. In some cases, you may need to take action. An important aspect of monitoring placement groups is to ensure that, when the
cluster is up and running, all placement groups are active, and preferably in the clean state.
Example
The result should tell you the placement group map version, vNNNNNN, the total number of placement groups, x, and how many
placement groups, y, are in a particular state such as active+clean:
NOTE: It is common for Ceph to report multiple states for placement groups.
Example Output:
244 active+clean+snaptrim_wait
32 active+clean+snaptrim
In addition to the placement group states, Ceph will also echo back the amount of data used, aa, the amount of storage capacity
remaining, bb, and the total storage capacity cc for the placement group. These numbers can be important in a few cases:
Your data isn’t getting distributed across the cluster due to an error in the CRUSH configuration.
Placement group IDs consist of the pool number, and not the pool name, followed by a period (.) and the placement group ID—a
hexadecimal number. You can view pool numbers and their names from the output of ceph osd lspools. The default pool names
data, metadata and rbd correspond to pool numbers 0, 1 and 2 respectively. A fully qualified placement group ID has the following
form:
Syntax
POOL_NUM.PG_ID
Example output:
0.1f
Syntax
Example
Syntax
Example
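A hedged sketch using the fully qualified placement group ID from the example above:
ceph pg map 0.1f       # show the up and acting sets for the placement group
ceph pg 0.1f query     # retrieve detailed state and statistics for the placement group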
Reference
Edit online
See the Object Storage Daemon (OSD) configuration options for more details on the snapshot trimming settings.
Authoritative History
Ceph will NOT acknowledge a write operation to a client, until all OSDs of the acting set persist the write operation. This practice
ensures that at least one member of the acting set will have a record of every acknowledged write operation since the last successful
peering operation.
With an accurate record of each acknowledged write operation, Ceph can construct and disseminate a new authoritative history of
the placement group: a complete and fully ordered set of operations that, if performed, would bring an OSD’s copy of a placement
group up to date.
The reason a placement group can be active+degraded is that an OSD may be active even though it doesn’t hold all of the
objects yet. If an OSD goes down, Ceph marks each placement group assigned to the OSD as degraded. The Ceph OSDs must peer
again when the Ceph OSD comes back online. However, a client can still write a new object to a degraded placement group if it is
active.
If an OSD is down and the degraded condition persists, Ceph may mark the down OSD as out of the cluster and remap the data
from the down OSD to another OSD. The time between being marked down and being marked out is controlled by
mon_osd_down_out_interval, which is set to 600 seconds by default.
A placement group can also be degraded, because Ceph cannot find one or more objects that Ceph thinks should be in the
placement group. While you cannot read or write to unfound objects, you can still access all of the other objects in the degraded
placement group.
For example, assume there are nine OSDs in a three-way replica pool. If OSD number 9 goes down, the PGs assigned to OSD 9 go into
a degraded state. If OSD 9 does not recover, it goes out of the storage cluster and the storage cluster rebalances. In that scenario,
the PGs are degraded and then recover to an active state.
Recovery is not always trivial, because a hardware failure might cause a cascading failure of multiple Ceph OSDs. For example, a
network switch for a rack or cabinet may fail, which can cause the OSDs of a number of host machines to fall behind the current state
of the storage cluster. Each one of the OSDs must recover once the fault is resolved.
Ceph provides a number of settings to balance the resource contention between new service requests and the need to recover data
objects and restore the placement groups to the current state. The osd recovery delay start setting allows an OSD to restart,
re-peer and even process some replay requests before starting the recovery process. The osd recovery threads setting limits
the number of threads for the recovery process, by default one thread. The osd recovery thread timeout sets a thread
timeout, because multiple Ceph OSDs can fail, restart and re-peer at staggered rates. The osd recovery max active setting
limits the number of recovery requests a Ceph OSD works on simultaneously to prevent the Ceph OSD from failing to serve. The osd
recovery max chunk setting limits the size of the recovered data chunks to prevent network congestion.
During the backfill operations, you might see one of several states:
backfill_wait indicates that a backfill operation is pending, but isn’t underway yet
backfill_too_full indicates that a backfill operation was requested, but couldn’t be completed due to insufficient storage
capacity.
Ceph provides a number of settings to manage the load spike associated with reassigning placement groups to a Ceph OSD,
especially a new Ceph OSD. By default, osd_max_backfills sets the maximum number of concurrent backfills to or from a Ceph
OSD to 10. The osd backfill full ratio enables a Ceph OSD to refuse a backfill request if the OSD is approaching its full ratio,
by default 85%. If an OSD refuses a backfill request, the osd backfill retry interval enables an OSD to retry the request, by
default after 10 seconds. OSDs can also set osd backfill scan min and osd backfill scan max to manage scan intervals,
by default 64 and 512.
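As with the recovery options, the backfill throttles can be viewed and changed at runtime with ceph config. A minimal sketch, with an illustrative value only:
[ceph: root@host01 /]# ceph config get osd osd_max_backfills
[ceph: root@host01 /]# ceph config set osd osd_max_backfills 1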
For some workloads, it is beneficial to avoid regular recovery entirely and use backfill instead. Since backfilling occurs in the
background, this allows I/O to proceed on the objects in the OSD. You can force a backfill rather than a recovery by setting the
osd_min_pg_log_entries option to 1, and setting the osd_max_pg_log_entries option to 2. Contact your IBM Support
account team for details on when this situation is appropriate for your workload.
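A minimal sketch of forcing backfill instead of recovery by shrinking the PG log, as described above. Apply it only after confirming with your IBM Support account team that it suits your workload:
[ceph: root@host01 /]# ceph config set osd osd_min_pg_log_entries 1
[ceph: root@host01 /]# ceph config set osd osd_max_pg_log_entries 2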
When you start the storage cluster, it is common to see the stale state until the peering process completes. After the storage
cluster has been running for a while, seeing placement groups in the stale state indicates that the primary OSD for those placement
groups is down or not reporting placement group statistics to the monitor.
For example, there are 3 OSDs: 0, 1, and 2, and all PGs map to some permutation of those three. If you add another OSD (OSD 3), some
PGs now map to OSD 3 instead of one of the others. However, until OSD 3 is backfilled, the PG has a temporary mapping that allows it
to continue serving I/O from the old acting set. Because this is a temporary mapping, the up set is not equal to the acting set, and the
PG is misplaced but not degraded, since there are still three copies. Once OSD 3 is backfilled, the temporary mapping is removed and
the PG is neither degraded nor misplaced.
Let's say OSD 1, 2, and 3 are the acting OSD set and it switches to OSD 1, 4, and 3. In that case, osd.1 requests a temporary acting set of
OSD 1, 2, and 3 while OSD 4 is backfilled. During this time, if OSD 1, 2, and 3 all go down, osd.4 is the only one left, and it might not
have fully backfilled all the data. At this point, the PG goes incomplete, indicating that there are no complete OSDs that are
current enough to perform recovery.
Alternately, if osd.4 is not involved and the acting set is simply OSD 1, 2, and 3 when OSD 1, 2, and 3 go down, the PG would likely
go stale, indicating that the monitors have not heard anything about that PG since the acting set changed, because there are no
OSDs left to notify the new OSDs.
Unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date
data to come back up.
Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster
in a while. The reporting interval can be configured with the mon osd report timeout setting.
Prerequisites
Procedure
To identify stuck placement groups, execute the following:
Syntax
Example
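A minimal sketch of the preceding syntax; the state arguments shown are the standard ones accepted by the command:
[ceph: root@host01 /]# ceph pg dump_stuck stale
[ceph: root@host01 /]# ceph pg dump_stuck inactive
[ceph: root@host01 /]# ceph pg dump_stuck unclean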
Prerequisites
Procedure
To find the object location, all you need is the object name and the pool name:
Syntax
Example
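A minimal sketch, assuming a pool named mypool and an illustrative object named myobject; the ceph osd map command reports the placement group and the acting set for the object:
[ceph: root@host01 /]# ceph osd map mypool myobject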
IBM Storage Ceph is capable of withstanding the loss of Ceph OSDs because its network and cluster components are assumed to be
equally reliable, with failures randomly distributed across the CRUSH map. If a number of OSDs are shut down, the remaining OSDs and
monitors still manage to operate.
However, this might not be the best solution for some stretched cluster configurations, where a significant part of the Ceph cluster
can use only a single network component. An example is a single cluster located in multiple data centers, for which the user wants
to sustain the loss of a full data center.
The standard configuration is with two data centers. Other configurations are in clouds or availability zones. Each site holds two
copies of the data; therefore, the replication size is four. The third site should have a tiebreaker monitor; this can be a virtual machine,
or it can have high latency compared to the main sites. This monitor chooses one of the sites to restore data on if the network
connection fails while both data centers remain active.
IMPORTANT: The standard Ceph configuration survives many failures of the network or data centers and it never compromises data
consistency. If you restore enough Ceph servers following a failure, it recovers. Ceph maintains availability if you lose a data center,
but can still form a quorum of monitors and have all the data available with enough copies to satisfy pools’ min_size, or CRUSH
rules that replicate again to meet the size.
NOTE: There are no additional steps to power down a stretch cluster. See Powering down and rebooting IBM Storage Ceph cluster.
However, there are situations where you lose data availability even if you have enough servers available to meet Ceph’s consistency
and sizing constraints, or where you unexpectedly do not meet the constraints.
The first important type of failure is caused by inconsistent networks. If there is a network split, Ceph might be unable to mark an OSD
as down and remove it from the acting placement group (PG) sets, despite the primary OSD being unable to replicate data. When this
happens, I/O is not permitted, because Ceph cannot meet its durability guarantees.
The second important category of failures is when it appears that you have data replicated across data centers, but the constraints are
not sufficient to guarantee this. For example, you might have data centers A and B, and the CRUSH rule targets three copies and
places a copy in each data center with a min_size of 2. The PG might go active with two copies in site A and no copies in site B,
which means that if you lose site A, you lose the data and Ceph cannot operate on it. This situation is difficult to avoid with standard
CRUSH rules.
In stretch mode, Ceph OSDs are only allowed to connect to monitors within the same data center. New monitors are not allowed to
join the cluster without a specified location.
If all the OSDs and monitors from a data center become inaccessible at once, the surviving data center will enter a degraded stretch
mode. This issues a warning, reduces the min_size to 1, and allows the cluster to reach an active state with the data from the
remaining site.
NOTE: The degraded state also triggers warnings that the pools are too small, because the pool size does not get changed.
However, a special stretch mode flag prevents the OSDs from creating extra copies in the remaining data center, therefore it still
keeps 2 copies.
When the missing data center becomes accessible again, the cluster enters recovery stretch mode. This changes the warning and
allows peering, but still requires only the OSDs from the data center that was up the whole time.
When all PGs are in a known state and are neither degraded nor incomplete, the cluster goes back to the regular stretch mode, ends the
warning, and restores min_size to its starting value of 2. The cluster again requires both sites to peer, not only the site that stayed up
the whole time; therefore, you can fail over to the other site, if necessary.
You cannot use erasure-coded pools with clusters in stretch mode. You can neither enter the stretch mode with erasure-
coded pools, nor create an erasure-coded pool when the stretch mode is active.
The weights of the two sites should be the same. If they are not, you receive the following error:
Example
Error EINVAL: the 2 datacenter instances in the cluster have differing weights 25947 and
15728 but stretch mode currently requires they be the same!
While it is not enforced, you should run two Ceph monitors on each site and a tiebreaker, for a total of five. This is because
OSDs can only connect to monitors in their own site when in stretch mode.
You have to create your own CRUSH rule, which provides two copies on each site, for a total of four copies across both sites.
You cannot enable stretch mode if you have existing pools with non-default size or min_size.
Because the cluster runs with min_size 1 when degraded, you should only use stretch mode with all-flash OSDs. This
minimizes the time needed to recover once connectivity is restored, and minimizes the potential for data loss.
Reference
Bootstrap the cluster through a service configuration file, where the locations are added to the hosts as part of deployment.
Set the locations manually through ceph osd crush add-bucket and ceph osd crush move commands after the
cluster is deployed.
Prerequisites
Procedure
1. If you are bootstrapping your new storage cluster, you can create the service configuration .yaml file that adds the nodes to
the IBM Storage Ceph cluster and also sets specific labels for where the services should run:
Example
service_type: host
addr: host01
hostname: host01
location:
root: default
datacenter: DC1
labels:
- osd
- mon
- mgr
---
service_type: host
addr: host02
hostname: host02
Syntax
Example
IMPORTANT: You can use different command options with the cephadm bootstrap command. However, always include the
--apply-spec option to use the service configuration file and configure the host locations.
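A minimal sketch of a bootstrap invocation that applies the service configuration file shown above; the file name and monitor IP are illustrative assumptions:
[root@host01 ~]# cephadm bootstrap --apply-spec stretch-cluster-spec.yaml --mon-ip 10.0.0.10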
Reference
For more information about Ceph bootstrapping and different cephadm bootstrap command options, see Bootstrapping a
new storage cluster
Prerequisites
Procedure
1. Add two buckets to which you plan to set the location of your non-tiebreaker monitors to the CRUSH map, specifying the
bucket type as datacenter:
Syntax
Example
Syntax
Example
Syntax
Example
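A minimal sketch of the preceding steps, assuming the two data center buckets DC1 and DC2 used in this section and the host host01; the commands add the buckets, move them under the default root, and move a host into its data center:
[ceph: root@host01 /]# ceph osd crush add-bucket DC1 datacenter
[ceph: root@host01 /]# ceph osd crush add-bucket DC2 datacenter
[ceph: root@host01 /]# ceph osd crush move DC1 root=default
[ceph: root@host01 /]# ceph osd crush move DC2 root=default
[ceph: root@host01 /]# ceph osd crush move host01 datacenter=DC1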
Prerequisites
Procedure
Syntax
Example
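The placeholders above presumably cover fetching and decompiling the current CRUSH map so that the rule can be added; a minimal sketch, with illustrative file names:
[ceph: root@host01 /]# ceph osd getcrushmap > crush.map.bin
[ceph: root@host01 /]# crushtool -d crush.map.bin -o crush.map.txt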
2. Generate a CRUSH rule which places two copies on each data center:
Example
Example
rule stretch_rule {
id 1
type replicated
min_size 1
max_size 10
step take DC1
step chooseleaf firstn 2 type host
step emit
step take DC2
step chooseleaf firstn 2 type host
step emit
}
NOTE: This rule gives the cluster read affinity towards data center DC1. Therefore, all reads and writes happen
through the Ceph OSDs placed in DC1. If this is not desirable, and reads and writes are to be distributed evenly across the zones,
use the following CRUSH rule instead:
Example
rule stretch_rule {
id 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type datacenter
step chooseleaf firstn 2 type host
step emit
}
In this rule, the data center is selected randomly and automatically. See CRUSH rules for more information on firstn and
indep options.
4. Inject the CRUSH map to make the rule available to the cluster:
Syntax
Example
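The placeholders above presumably cover recompiling the edited CRUSH map and injecting it into the cluster; a minimal sketch, reusing the illustrative file names from the earlier step:
[ceph: root@host01 /]# crushtool -c crush.map.txt -o crush2.map.bin
[ceph: root@host01 /]# ceph osd setcrushmap -i crush2.map.bin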
5. If you do not run the monitors in connectivity mode, set the election strategy to connectivity:
Example
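A minimal sketch of the election strategy change required by this step:
[ceph: root@host01 /]# ceph mon set election_strategy connectivity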
6. Enter stretch mode by setting the location of the tiebreaker monitor to split across the data centers:
Syntax
Example
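A minimal sketch of entering stretch mode, assuming the tiebreaker monitor host07 in data center DC3, the stretch_rule created earlier, and datacenter as the dividing bucket type:
[ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3
[ceph: root@host01 /]# ceph mon enable_stretch_mode host07 stretch_rule datacenter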
IMPORTANT: The location of the tiebreaker monitor should differ from the data centers to which you previously set the non-
tiebreaker monitors. In the example above, it is data center DC3.
IMPORTANT: Do not add this data center to the CRUSH map as it results in the following error when you try to enter stretch
mode: Error EINVAL: there are 3 datacenters in the cluster but stretch mode currently only works with 2!
NOTE: If you are writing your own tooling for deploying Ceph, you can use a new --set-crush-location option when
booting monitors, instead of running the ceph mon set_location command. This option accepts only a single
bucket=location pair, for example ceph-mon --set-crush-location 'datacenter=DC1', which must match the
bucket type you specified when running the enable_stretch_mode command.
Example
epoch 361
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
created 2023-01-16T05:47:28.4827170000
modified 2023-01-17T17:36:50.0661830000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 31
full_ratio 0.95
backfillfull_ratio 0.92
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release quincy
stretch_mode_enabled true
stretch_bucket_count 2
degraded_stretch_mode 0
recovering_stretch_mode 0
stretch_mode_bucket 8
The stretch_mode_enabled field should be set to true. You can also see the stretch bucket count, the stretch mode bucket,
and whether the stretch mode is degraded or recovering.
Example
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.7094750000
created 2023-01-16T05:47:25.6316840000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host07
disallowed_leaders host07
0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host07; crush_location
{datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location
{datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location
{datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location
{datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location
{datacenter=DC2}
dumped monmap epoch 19
You can also see which monitor is the tiebreaker, and the monitor election strategy.
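The example outputs above presumably come from the OSD and monitor map dumps; a minimal sketch:
[ceph: root@host01 /]# ceph osd dump
[ceph: root@host01 /]# ceph mon dump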
Reference
Prerequisites
Procedure
Syntax
Example
Syntax
Example
IMPORTANT: This command creates collocated WAL and DB devices. If you want to create non-collocated devices, do
not use this command.
Example
Syntax
Example
NOTE: Ensure you add the same topology nodes on both sites. Issues might arise if hosts are added only on one site.
Reference
See Adding OSDs for more information about the addition of Ceph OSDs.
Prerequisites
Procedure
1. To override Ceph’s default behavior, use the ceph osd set command and the behavior you wish to override:
Syntax
Once you set the behavior, ceph health will reflect the override(s) that you have set for the cluster.
Example
2. To cease overriding Ceph’s default behavior, use the ceph osd unset command and the override you wish to cease.
Syntax
Example
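A minimal sketch of setting and then clearing an override, using the noout flag described in the table below as an example:
[ceph: root@host01 /]# ceph osd set noout
[ceph: root@host01 /]# ceph osd unset noout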
Flag Description
noin Prevents OSDs from being treated as in the cluster.
noout Prevents OSDs from being treated as out of the cluster.
noup Prevents OSDs from being treated as up and running.
nodown Prevents OSDs from being treated as down.
full Makes a cluster appear to have reached its full_ratio, and thereby prevents write operations.
pause Ceph will stop processing read and write operations, but will not affect OSD in, out, up or down statuses.
nobackfill Ceph will prevent new backfill operations.
norebalance Ceph will prevent new rebalancing operations.
norecover Ceph will prevent new recovery operations.
noscrub Ceph will prevent new scrubbing operations.
nodeep-scrub Ceph will prevent new deep scrubbing operations.
notieragent Ceph will disable the process that is looking for cold/dirty objects to flush and evict.
noout: If the mon osd report timeout is exceeded and an OSD has not reported to the monitor, the OSD will get marked
out. If this happens erroneously, you can set noout to prevent the OSD(s) from getting marked out while you troubleshoot
the issue.
nodown: Networking issues may interrupt Ceph heartbeat processes, and an OSD may be up but still get marked down. You
can set nodown to prevent OSDs from getting marked down while troubleshooting the issue.
full: If a cluster is reaching its full_ratio, you can pre-emptively set the cluster to full and expand capacity.
pause: If you need to troubleshoot a running Ceph cluster without clients reading and writing data, you can set the cluster to
pause to prevent client operations.
nobackfill: If you need to take an OSD or node down temporarily, for example, when upgrading daemons, you can set
nobackfill so that Ceph will not backfill while the OSD is down.
norecover: If you need to replace an OSD disk and don't want the PGs to recover to another OSD while you are hot-swapping
disks, you can set norecover to prevent the other OSDs from copying a new set of PGs to other OSDs.
noscrub and nodeep-scrub: If you want to prevent scrubbing, for example, to reduce overhead during high loads,
recovery, backfilling, and rebalancing, you can set noscrub and/or nodeep-scrub to prevent the cluster from scrubbing
OSDs.
notieragent: If you want to stop the tier agent process from finding cold objects to flush to the backing storage tier, you
may set notieragent.
IMPORTANT: Cephadm manages the client keyrings for the IBM Storage Ceph cluster as long as the clients are within the scope of
Cephadm. Users should not modify the keyrings that are managed by Cephadm, except when troubleshooting.
Alternatively, you may use the CEPH_ARGS environment variable to avoid re-entry of the user name and secret.
The following concepts can help you understand Ceph user management.
A user of the IBM Storage Ceph cluster is either an individual or an application. Creating users allows you to control who can
access the storage cluster, its pools, and the data within those pools.
Ceph has the notion of a type of user. For the purposes of user management, the type will always be client. Ceph identifies users
in period (.) delimited form consisting of the user type and the user ID. For example, TYPE.ID, client.admin, or client.user1.
The reason for user typing is that Ceph Monitors and OSDs also use the Cephx protocol, but they are not clients. Distinguishing the
user type helps to distinguish between client users and other users, streamlining access control, user monitoring, and traceability.
Sometimes Ceph’s user type may seem confusing, because the Ceph command line allows you to specify a user with or without the
type, depending upon the command line usage. If you specify --user or --id, you can omit the type. So client.user1 can be
entered simply as user1. If you specify --name or -n, you must specify the type and name, such as client.user1. IBM
recommends using the type and name as a best practice wherever possible.
NOTE: An IBM Storage Ceph cluster user is not the same as a Ceph Object Gateway user. The object gateway uses an IBM Storage
Ceph cluster user to communicate between the gateway daemon and the storage cluster, but the gateway has its own user
management functionality for its end users.
Authorization capabilities
Ceph uses the term "capabilities" (caps) to describe authorizing an authenticated user to exercise the functionality of the Ceph
Monitors and OSDs. Capabilities can also restrict access to data within a pool or a namespace within a pool. A Ceph administrative
user sets a user’s capabilities when creating or updating a user. Capability syntax follows the form:
Syntax
Monitor Caps: Monitor capabilities include r, w, x, allow profile CAP, and profile rbd.
Example
OSD Caps: OSD capabilities include r, w, x, class-read, class-write, profile osd, profile rbd, and profile
rbd-read-only. Additionally, OSD capabilities also allow for pool and namespace settings:
Syntax
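A minimal sketch of the capability syntax, using the generic placeholders already used in this document; the pool and namespace qualifiers are optional:
DAEMON_TYPE 'allow CAPABILITY' [DAEMON_TYPE 'allow CAPABILITY']
mon 'allow r'
osd 'allow rw pool=POOL_NAME namespace=NAMESPACE_NAME'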
NOTE: The Ceph Object Gateway daemon (radosgw) is a client of the Ceph storage cluster, so it is not represented as a Ceph storage
cluster daemon type.
A pool defines a storage strategy for Ceph clients, and acts as a logical partition for that strategy.
In Ceph deployments, it is common to create a pool to support different types of use cases. For example, cloud volumes or images,
object storage, hot storage, cold storage, and so on. When deploying Ceph as a back end for OpenStack, a typical deployment would
have pools for volumes, images, backups and virtual machines, and users such as client.glance, client.cinder, and so on.
Namespace
Objects within a pool can be associated to a namespace—a logical group of objects within the pool. A user’s access to a pool can be
associated with a namespace such that reads and writes by the user take place only within the namespace. Objects written to a
namespace within the pool can only be accessed by users who have access to the namespace.
NOTE: Currently, namespaces are only useful for applications written on top of librados. Ceph clients such as block device and
object storage do not currently support this feature.
The rationale for namespaces is that pools can be a computationally expensive method of segregating data by use case, because
each pool creates a set of placement groups that get mapped to OSDs. If multiple pools use the same CRUSH hierarchy and ruleset,
OSD performance may degrade as load increases.
For example, a pool should have approximately 100 placement groups per OSD. So an exemplary cluster with 1000 OSDs would
have 100,000 placement groups for one pool. Each pool mapped to the same CRUSH hierarchy and ruleset would create another
100,000 placement groups in the exemplary cluster. By contrast, writing an object to a namespace simply associates the namespace
to the object name without the computational overhead of a separate pool. Rather than creating a separate pool for a user or set of
users, you may use a namespace.
Reference
A Ceph client user is either an individual or an application that uses Ceph clients to interact with the IBM Storage Ceph cluster
daemons.
Prerequisites
Procedure
Example
osd.10
key: AQBW7U5gqOsEExAAg/CxSwZ/gSh8iOsDV3iQOA==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *
osd.11
key: AQBX7U5gtj/JIhAAPsLBNG+SfC2eMVEFkl3vfA==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *
osd.9
key: AQBV7U5g1XDULhAAKo2tw6ZhH1jki5aVui2v7g==
caps: [mgr] allow profile osd
caps: [mon] allow profile osd
caps: [osd] allow *
client.admin
key: AQADYEtgFfD3ExAAwH+C1qO7MSLE4TWRfD2g6g==
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *
client.bootstrap-mds
key: AQAHYEtgpbkANBAANqoFlvzEXFwD8oB0w3TF4Q==
caps: [mon] allow profile bootstrap-mds
client.bootstrap-mgr
key: AQAHYEtg3dcANBAAVQf6brq3sxTSrCrPe0pKVQ==
caps: [mon] allow profile bootstrap-mgr
client.bootstrap-osd
key: AQAHYEtgD/QANBAATS9DuP3DbxEl86MTyKEmdw==
caps: [mon] allow profile bootstrap-osd
client.bootstrap-rbd
key: AQAHYEtgjxEBNBAANho25V9tWNNvIKnHknW59A==
caps: [mon] allow profile bootstrap-rbd
client.bootstrap-rbd-mirror
key: AQAHYEtgdE8BNBAAr6rLYxZci0b2hoIgH9GXYw==
caps: [mon] allow profile bootstrap-rbd-mirror
client.bootstrap-rgw
key: AQAHYEtgwGkBNBAAuRzI4WSrnowBhZxr2XtTFg==
caps: [mon] allow profile bootstrap-rgw
client.crash.host04
key: AQCQYEtgz8lGGhAAy5bJS8VH9fMdxuAZ3CqX5Q==
caps: [mgr] profile crash
caps: [mon] profile crash
client.crash.host02
key: AQDuYUtgqgfdOhAAsyX+Mo35M+HFpURGad7nJA==
caps: [mgr] profile crash
caps: [mon] profile crash
client.crash.host03
key: AQB98E5g5jHZAxAAklWSvmDsh2JaL5G7FvMrrA==
caps: [mgr] profile crash
caps: [mon] profile crash
client.rgw.test_realm.test_zone.host01.hgbvnq
key: AQD5RE9gAQKdCRAAJzxDwD/dJObbInp9J95sXw==
caps: [mgr] allow rw
caps: [mon] allow *
caps: [osd] allow rwx tag rgw *=*
client.rgw.test_realm.test_zone.host02.yqqilm
NOTE: The TYPE.ID notation for users applies such that osd.0 is a user of type osd and its ID is 0, client.admin is a user of type
client and its ID is admin, that is, the default client.admin user. Note also that each entry has a key: VALUE entry, and one or
more caps: entries.
You may use the -o FILE_NAME option with ceph auth list to save the output to a file.
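A minimal sketch of producing the listing shown above and saving it to a file; the file name is illustrative:
[ceph: root@host01 /]# ceph auth list
[ceph: root@host01 /]# ceph auth list -o auth_output.txt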
Prerequisites
Procedure
Syntax
Example
Syntax
Example
The auth export command is identical to auth get, but also prints out the internal auid, which isn’t relevant to end users.
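A minimal sketch of retrieving a single user's key and capabilities, using the client.admin user shown earlier; the output file name is illustrative:
[ceph: root@host01 /]# ceph auth get client.admin
[ceph: root@host01 /]# ceph auth export client.admin -o client.admin.export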
A user’s key enables the user to authenticate with the Ceph storage cluster. The user’s capabilities authorize the user to read, write,
or execute on Ceph monitors (mon), Ceph OSDs (osd) or Ceph Metadata Servers (mds).
ceph auth add: This command is the canonical way to add a user. It will create the user, generate a key and add any
specified capabilities.
ceph auth get-or-create: This command is often the most convenient way to create a user, because it returns a keyfile
format with the user name (in brackets) and the key. If the user already exists, this command simply returns the user name
and key in the keyfile format. You may use the -o FILE_NAME option to save the output to a file.
ceph auth get-or-create-key: This command is a convenient way to create a user and return the user’s key only. This is
useful for clients that need the key only, for example, libvirt. If the user already exists, this command simply returns the
key. You may use the -o FILE_NAME option to save the output to a file.
When creating client users, you may create a user with no capabilities. A user with no capabilities is useless beyond mere
authentication, because the client cannot retrieve the cluster map from the monitor. However, you can create a user with no
capabilities if you wish to defer adding capabilities later using the ceph auth caps command.
A typical user has at least read capabilities on the Ceph monitor and read and write capabilities on Ceph OSDs. Additionally, a user's
OSD permissions are often restricted to accessing a particular pool:
[ceph: root@host01 /]# ceph auth add client.john mon 'allow r' osd 'allow rw pool=mypool'
[ceph: root@host01 /]# ceph auth get-or-create client.paul mon 'allow r' osd 'allow rw pool=mypool'
[ceph: root@host01 /]# ceph auth get-or-create client.george mon 'allow r' osd 'allow rw pool=mypool' -o george.keyring
[ceph: root@host01 /]# ceph auth get-or-create-key client.ringo mon 'allow r' osd 'allow rw pool=mypool' -o ringo.key
IMPORTANT: If you provide a user with capabilities to OSDs, but you DO NOT restrict access to particular pools, the user will have
access to ALL pools in the cluster.
Prerequisites
Procedure
Syntax
[ceph: root@host01 /]# ceph auth caps client.john mon 'allow r' osd 'allow rw pool=mypool'
[ceph: root@host01 /]# ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=mypool'
[ceph: root@host01 /]# ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'
2. To remove a capability, you may reset the capability. If you want the user to have no access to a particular daemon that was
previously set, specify an empty string:
Example
[ceph: root@host01 /]# ceph auth caps client.ringo mon ' ' osd ' '
Reference
Prerequisites
Procedure
Syntax
Example
Prerequisites
Syntax
Example
2. Printing a user’s key is useful when you need to populate client software with a user’s key, for example, libvirt.
Syntax
Example
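A minimal sketch of printing a single user's key, assuming the client.user1 user mentioned earlier:
[ceph: root@host01 /]# ceph auth print-key client.user1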
As a storage administrator, you can prepare, list, create, activate, deactivate, batch, trigger, zap, and migrate Ceph OSDs using the
ceph-volume utility. The ceph-volume utility is a single-purpose command-line tool to deploy logical volumes as OSDs. It uses a
plugin-type framework to deploy OSDs with different device technologies. The ceph-volume utility follows a workflow similar to that of
the ceph-disk utility for deploying OSDs, with a predictable and robust way of preparing, activating, and starting OSDs. Currently, the
ceph-volume utility only supports the lvm plugin, with the plan to support other technologies in the future.
By making use of LVM tags, the lvm subcommand is able to store and later re-discover and query devices associated with OSDs so that
they can be activated. This includes support for LVM-based technologies like dm-cache as well.
When using ceph-volume, the use of dm-cache is transparent, and ceph-volume treats dm-cache like a logical volume. The performance
gains and losses when using dm-cache depend on the specific workload. Generally, random and sequential reads see an increase
in performance at smaller block sizes, while random and sequential writes see a decrease in performance at larger block sizes.
To use the LVM plugin, add lvm as a subcommand to the ceph-volume command within the cephadm shell:
activate - Discover and mount the LVM device associated with an OSD ID and start the Ceph OSD.
batch - Automatically size devices for multi-OSD provisioning with minimal interaction.
zap - Removes all data and filesystems from a logical volume or partition.
new-wal - Allocate new WAL volume for the OSD at specified logical volume.
new-db - Allocate new DB volume for the OSD at specified logical volume.
NOTE: Using the create subcommand combines the prepare and activate subcommands into one subcommand.
Reference
See the create subcommand in the Creating OSDs section for more details.
Previous versions of Ceph used the ceph-disk utility to prepare, activate, and create OSDs. Starting with IBM Storage Ceph 5,
ceph-disk is replaced by the ceph-volume utility that aims to be a single purpose command-line tool to deploy logical volumes as
OSDs, while maintaining a similar API to ceph-disk when preparing, activating, and creating OSDs.
The ceph-volume utility is a modular tool that currently supports two ways of provisioning hardware devices: legacy ceph-disk devices
and LVM (Logical Volume Manager) devices. The ceph-volume lvm command uses LVM tags to store information about devices
specific to Ceph and their relationship with OSDs. It uses these tags to later re-discover and query devices associated with OSDs so
that it can activate them. It supports technologies based on LVM and dm-cache as well.
The ceph-volume utility uses dm-cache transparently and treats it as a logical volume. You might consider the performance gains
and losses when using dm-cache, depending on the specific workload you are handling. Generally, the performance of random and
sequential read operations increases at smaller block sizes; while the performance of random and sequential write operations
decreases at larger block sizes. Using ceph-volume does not introduce any significant performance penalties.
NOTE: The ceph-volume simple command can handle legacy ceph-disk devices, if these devices are still in use.
The ceph-disk utility was required to support many different types of init systems, such as upstart or sysvinit, while being
able to discover devices. For this reason, ceph-disk concentrates only on GUID Partition Table (GPT) partitions. Specifically on GPT
GUIDs that label devices in a unique way to answer questions like:
To solve these questions, ceph-disk uses UDEV rules to match the GUIDs.
Using the UDEV rules to call ceph-disk can lead to a back-and-forth between the ceph-disk systemd unit and the ceph-disk
executable. The process is very unreliable and time consuming and can cause OSDs to not come up at all during the boot process of
a node. Moreover, it is hard to debug, or even replicate these problems given the asynchronous behavior of UDEV.
Because ceph-disk works with GPT partitions exclusively, it cannot support other technologies, such as Logical Volume Manager
(LVM) volumes, or similar device mapper devices.
To ensure the GPT partitions work correctly with the device discovery workflow, ceph-disk requires a large number of special flags
to be used. In addition, these partitions require devices to be exclusively owned by Ceph.
The prepare subcommand prepares an OSD back-end object store and consumes logical volumes (LV) for both the OSD data and
journal. It does not modify the logical volumes, except for adding some extra metadata tags using LVM. These tags make volumes
easier to discover, and they also identify the volumes as part of the Ceph Storage Cluster and the roles of those volumes in the
storage cluster.
The prepare subcommand accepts a whole device or partition, or a logical volume for block.
Prerequisites
Optionally, create logical volumes. If you provide a path to a physical device, the subcommand turns the device into a logical
volume. This approach is simpler, but you cannot configure or change the way the logical volume is created.
Procedure
Syntax
Example
Syntax
a. Optionally, if you want to use a separate device for RocksDB, specify the --block.db and --block.wal options:
Syntax
Example
[ceph: root@host01 /]# ceph-volume lvm prepare --bluestore --block.db example_vg/db_lv --block.wal example_vg/wal_lv --data example_vg/data_lv
Syntax
Example
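A minimal sketch of the basic prepare invocation, reusing the volume group and logical volume naming already used in this section:
[ceph: root@host01 /]# ceph-volume lvm prepare --bluestore --data example_vg/data_lv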
References
You can use the ceph-volume lvm list subcommand to list logical volumes and devices associated with a Ceph cluster, as long
as they contain enough metadata to allow for that discovery. The output is grouped by the OSD ID associated with the devices. For
logical volumes, the devices key is populated with the physical devices associated with the logical volume.
In some cases, the output of the ceph -s command shows the following error message:
In such cases, you can list the devices with the ceph device ls-lights command, which gives the details about the lights on the
devices. Based on the information, you can turn off the lights on the devices.
Prerequisites
Procedure
Example
[block] /dev/ceph-83909f70-95e9-4273-880e-5851612cbe53/osd-block-7ce687d9-07e7-4f8f-
a34e-d1b0efb89920
Optional: List the devices in the storage cluster with the lights:
Example
{
"fault": [
"SEAGATE_ST12000NM002G_ZL2KTGCK0000C149"
],
"ident": []
}
Syntax
Example
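A minimal sketch of turning off a fault light, using the device identifier reported in the example output above:
[ceph: root@host01 /]# ceph device light off SEAGATE_ST12000NM002G_ZL2KTGCK0000C149 fault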
The activation process enables a systemd unit at boot time, which allows the correct OSD identifier and its UUID to be enabled and
mounted.
Prerequisites
Procedure
Syntax
Example
To activate all OSDs that are prepared for activation, use the --all option:
Example
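A minimal sketch of the activation commands; the OSD ID and OSD FSID placeholders come from the output of ceph-volume lvm list:
[ceph: root@host01 /]# ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
[ceph: root@host01 /]# ceph-volume lvm activate --all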
3. Optionally, you can use the trigger subcommand. This command cannot be used directly, and it is used by systemd so that
it proxies input to ceph-volume lvm activate. This parses the metadata coming from systemd and startup, detecting the
UUID and ID associated with an OSD.
Syntax
Example
Reference
For more information, see:
You can deactivate the Ceph OSDs using the ceph-volume lvm subcommand. This subcommand removes the volume groups and
the logical volume.
Prerequisites
Procedure
Syntax
Example
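A minimal sketch of deactivating an OSD by its ID; the ID is illustrative:
[ceph: root@host01 /]# ceph-volume lvm deactivate 0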
Reference
For more information, see:
The create subcommand calls the prepare subcommand, and then calls the activate subcommand.
Prerequisites
NOTE: If you prefer to have more control over the creation process, you can use the prepare and activate subcommands
separately to create the OSD, instead of using create. You can use the two subcommands to gradually introduce new OSDs into a
storage cluster, while avoiding having to rebalance large amounts of data. Both approaches work the same way, except that using the
create subcommand causes the OSD to become up and in immediately after completion.
Procedure
Syntax
Example
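A minimal sketch of the create invocation, reusing the volume group and logical volume naming from the prepare section:
[ceph: root@host01 /]# ceph-volume lvm create --bluestore --data example_vg/data_lv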
Reference
For more information, see:
The new volumes are attached to the OSD, replacing one of the source drives.
If the source list has a DB or WAL volume, then the target device replaces it.
If the source list has only a slow volume, then explicit allocation using the new-db or new-wal command is needed first.
The new-db and new-wal commands attach the given logical volume to the given OSD as a DB or a WAL volume, respectively.
Prerequisites
Procedure
Example
2. Stop the OSD to which you have to add the DB or the WAL device:
Example
Example
Syntax
Example
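A minimal sketch of attaching a new DB volume to a stopped OSD; the OSD ID, FSID, and volume names are illustrative assumptions:
[ceph: root@host01 /]# ceph-volume lvm new-db --osd-id 1 --osd-fsid 7ce687d9-07e7-4f8f-a34e-d1b0efb89920 --target vgname/new_db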
Move BlueFS data from the main device to an LV that is already attached as DB:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from data --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
Move BlueFS data from the shared main device to an LV which shall be attached as a new DB:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from data --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
Move BlueFS data from the DB device to a new LV, and replace the DB device:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from db --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
Move BlueFS data from the main and DB devices to a new LV, and replace the DB device:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from data db --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
Move BlueFS data from the main, DB, and WAL devices to a new LV, remove the WAL device, and replace the DB device:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from data db wal --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
Move BlueFS data from the main, DB, and WAL devices to the main device, and remove the WAL and DB devices:
Syntax
ceph-volume lvm migrate --osd-id OSD_ID --osd-fsid OSD_UUID --from db wal --target VOLUME_GROUP_NAME/LOGICAL_VOLUME_NAME
Example
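A minimal sketch of a concrete migration, moving BlueFS data from the main device to an attached DB volume; the OSD ID, FSID, and volume names are illustrative assumptions:
[ceph: root@host01 /]# ceph-volume lvm migrate --osd-id 1 --osd-fsid 7ce687d9-07e7-4f8f-a34e-d1b0efb89920 --from data --target vgname/db_lv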
The batch subcommand automates the creation of multiple OSDs when single devices are provided.
The ceph-volume command decides the best method to use to create the OSDs, based on drive type. Ceph OSD optimization
depends on the available devices:
If all devices are traditional hard drives, batch creates one OSD per device.
If all devices are solid state drives, batch creates two OSDs per device.
If there is a mix of traditional hard drives and solid state drives, batch uses the traditional hard drives for data, and creates
the largest possible journal (block.db) on the solid state drive.
NOTE: The batch subcommand does not support the creation of a separate logical volume for the write-ahead-log (block.wal)
device.
Prerequisites
Procedure
Syntax
Example
[ceph: root@host01 /]# ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/nvme0n1
Reference
The zap subcommand removes all data and filesystems from a logical volume or partition.
You can use the zap subcommand to zap logical volumes, partitions, or raw devices that are used by Ceph OSDs for reuse. Any
filesystems present on the given logical volume or partition are removed and all data is purged.
Optionally, you can use the --destroy flag for complete removal of a logical volume, partition, or the physical device.
Prerequisites
Procedure
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
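A minimal sketch of the zap variants described above; the device path and OSD ID are illustrative assumptions:
[ceph: root@host01 /]# ceph-volume lvm zap /dev/sdc
[ceph: root@host01 /]# ceph-volume lvm zap --destroy /dev/sdc
[ceph: root@host01 /]# ceph-volume lvm zap --destroy --osd-id 16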
Performance baseline
Benchmarking Ceph performance
Benchmarking Ceph block performance
Performance baseline
The OSDs, including their journal and disks, and the network throughput should each have a performance baseline to compare against.
You can identify potential tuning opportunities by comparing the baseline performance data with the data from Ceph's native tools. Red
Hat Enterprise Linux has many built-in tools, along with a plethora of open source community tools, available to help accomplish
these tasks.
Reference
For more details about some of the available tools, see this Knowledgebase article.
NOTE: Before running these performance tests, drop all the file system caches by running the following:
Example
[ceph: root@host01 /]# echo 3 | sudo tee /proc/sys/vm/drop_caches && sudo sync
Prerequisites
Procedure
Example
[ceph: root@host01 /]# ceph osd pool create testbench 100 100
2. Execute a write test for 10 seconds to the newly created storage pool:
Example
Example
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
Total time run: 0.804869
Total reads made: 28
Read size: 4194304
Bandwidth (MB/sec): 139.153
Example
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 46 30 119.801 120 0.440184 0.388125
2 16 81 65 129.408 140 0.577359 0.417461
3 16 120 104 138.175 156 0.597435 0.409318
4 15 157 142 141.485 152 0.683111 0.419964
5 16 206 190 151.553 192 0.310578 0.408343
6 16 253 237 157.608 188 0.0745175 0.387207
7 16 287 271 154.412 136 0.792774 0.39043
8 16 325 309 154.044 152 0.314254 0.39876
9 16 362 346 153.245 148 0.355576 0.406032
10 16 405 389 155.092 172 0.64734 0.398372
Total time run: 10.302229
Total reads made: 405
Read size: 4194304
Bandwidth (MB/sec): 157.248
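The commands behind the preceding example outputs are presumably the standard rados bench runs against the testbench pool; a minimal sketch of the write, sequential read, and random read tests, followed by cleanup of the benchmark objects:
[ceph: root@host01 /]# rados bench -p testbench 10 write --no-cleanup
[ceph: root@host01 /]# rados bench -p testbench 10 seq
[ceph: root@host01 /]# rados bench -p testbench 10 rand
[ceph: root@host01 /]# rados -p testbench cleanup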
5. To increase the number of concurrent reads and writes, use the -t option, which defaults to 16 threads. Also, the -b
parameter can adjust the size of the object being written. The default object size is 4 MB. A safe maximum object size is 16
MB. IBM recommends running multiple copies of these benchmark tests against different pools. Doing this shows the changes in
performance from multiple clients.
Add the --run-name LABEL option to control the names of the objects that get written during the benchmark test. Multiple
rados bench commands can be run simultaneously by changing the --run-name label for each running command
instance. This prevents potential I/O errors that can occur when multiple clients try to access the same object, and it
allows different clients to access different objects. The --run-name option is also useful when trying to simulate a real-world
workload.
Example
Example
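A minimal sketch of two concurrent runs with distinct --run-name labels; the labels and thread count are illustrative:
[ceph: root@host01 /]# rados bench -p testbench 10 write -t 4 --run-name client1
[ceph: root@host01 /]# rados bench -p testbench 10 write -t 4 --run-name client2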
Prerequisites
Example
Example
Example
Example
Example
Example
Example
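The empty example placeholders above belong to the block-device benchmark; a minimal sketch, assuming an illustrative image named image01 in the testbench pool and using the rbd bench write test:
[ceph: root@host01 /]# rbd create image01 --size 1024 --pool testbench
[ceph: root@host01 /]# rbd bench --io-type write image01 --pool=testbench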
Reference
For more information about the rbd command, see Ceph block devices.
Here is the full list of the Monitor and the OSD collection name categories with a brief description for each:
Cluster Metrics - Displays information about the storage cluster: Monitors, OSDs, Pools, and PGs
Level Database Metrics - Displays information about the back-end KeyValueStore database
Write Back Throttle Metrics - Displays the statistics on how the write back throttle is tracking unflushed IO
Level Database Metrics - Displays information about the back-end KeyValueStore database
Read and Write Operations Metrics - Displays information on various read and write operations
OSD Throttle Metrics - Displays the statistics on how the OSD is throttling
Object Gateway Client Metrics - Displays statistics on GET and PUT requests
Object Gateway Throttle Metrics - Displays the statistics on how the Object Gateway is throttling
Prerequisites
Procedure
Syntax
NOTE: You must run the ceph daemon command from the node running the daemon.
2. Execute the ceph daemon DAEMON_NAME perf schema command from the Monitor node:
Example
3. Execute the ceph daemon DAEMON_NAME perf schema command from the OSD node:
Example
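A minimal sketch of the schema dump for a monitor and an OSD daemon; the daemon names are illustrative:
[ceph: root@host01 /]# ceph daemon mon.host01 perf schema
[ceph: root@host01 /]# ceph daemon osd.0 perf schema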
Reference
Prerequisites
Procedure
Syntax
NOTE: You must run the ceph daemon command from the node running the daemon.
2. Execute the ceph daemon DAEMON_NAME perf dump command from the Monitor node:
3. Execute the ceph daemon DAEMON_NAME perf dump command from the OSD node:
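A minimal sketch of the counter dump for a monitor and an OSD daemon; the daemon names are illustrative:
[ceph: root@host01 /]# ceph daemon mon.host01 perf dump
[ceph: root@host01 /]# ceph daemon osd.0 perf dump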
Reference
To view a short description of each Monitor metric available, see the Ceph monitor metrics table.
Reference
To view a short description of each OSD metric available, see the Ceph OSD table.
Cluster Metrics
Paxos Metrics
Throttle Metrics
Cluster Metrics
Table 1. Cluster Metrics Table
Collection Name Metric Name Bit Field Value Short Description
cluster num_mon 2 Number of monitors
num_mon_quorum 2 Number of monitors in quorum
num_osd 2 Total number of OSD
Paxos Metrics
Table 4. Paxos Metrics Table
Collection Name Metric Name Bit Field Value Short Description
Throttle Metrics
Table 5. Throttle Metrics Table
Collection Name Metric Name Bit Field Value Short Description
throttle-* val 10 Currently available throttle
max 10 Max value for throttle
get 10 Gets
get_sum 10 Got data
get_or_fail_fail 10 Get blocked during get_or_fail
get_or_fail_success 10 Successful get during get_or_fail
take 10 Takes
take_sum 10 Taken data
put 10 Puts
put_sum 10 Put data
wait 5 Waiting latency
Objecter Metrics
Table 3. Objecter Metrics Table
Collection Name Metric Name Bit Field Value Short Description
objecter op_active 2 Active operations
op_laggy 2 Laggy operations
op_send 10 Sent operations
op_send_bytes 10 Sent data
op_resend 10 Resent operations
op_ack 10 Commit callbacks
op_commit 10 Operation commits
op 10 Operation
op_r 10 Read operations
op_w 10 Write operations
op_rmw 10 Read-modify-write operations
op_pg 10 PG operation
osdop_stat 10 Stat operations
osdop_create 10 Create object operations
Objecter Metrics
Table 2. Objecter Metrics Table
Collection Name Metric Name Bit Field Value Short Description
objecter op_active 2 Active operations
op_laggy 2 Laggy operations
op_send 10 Sent operations
op_send_bytes 10 Sent data
op_resend 10 Resent operations
op_ack 10 Commit callbacks
op_commit 10 Operation commits
op 10 Operation
op_r 10 Read operations
op_w 10 Write operations
op_rmw 10 Read-modify-write operations
op_pg 10 PG operation
osdop_stat 10 Stat operations
osdop_create 10 Create object operations
osdop_read 10 Read operations
osdop_write 10 Write operations
osdop_writefull 10 Write full object operations
osdop_append 10 Append operation
osdop_zero 10 Set object to zero operations
osdop_truncate 10 Truncate object operations
osdop_delete 10 Delete object operations
osdop_mapext 10 Map extent operations
osdop_sparse_read 10 Sparse read operations
osdop_clonerange 10 Clone range operations
osdop_getxattr 10 Get xattr operations
osdop_setxattr 10 Set xattr operations
osdop_cmpxattr 10 Xattr comparison operations
osdop_rmxattr 10 Remove xattr operations
osdop_resetxattrs 10 Reset xattr operations
osdop_tmap_up 10 TMAP update operations
osdop_tmap_put 10 TMAP put operations
osdop_tmap_get 10 TMAP get operations
osdop_call 10 Call (execute) operations
osdop_watch 10 Watch by object operations
osdop_notify 10 Notify about object operations
osdop_src_cmpxattr 10 Extended attribute comparison in multi operations
osdop_other 10 Other operations
linger_active 2 Active lingering operations
linger_send 10 Sent lingering operations
linger_resend 10 Resent lingering operations
linger_ping 10 Sent pings to lingering operations
poolop_active 2 Active pool operations
poolop_send 10 Sent pool operations
poolop_resend 10 Resent pool operations
poolstat_active 2 Active get pool stat operations
poolstat_send 10 Pool stat operations sent
poolstat_resend 10 Resent pool stats
statfs_active 2 Statfs operations
statfs_send 10 Sent FS stats
statfs_resend 10 Resent FS stats
command_active 2 Active commands
BlueStore
BlueStore is the back-end object store for the OSD daemons and puts objects directly on the block device.
IMPORTANT: BlueStore provides a high-performance backend for OSD daemons in a production environment. By default, BlueStore
is configured to be self-tuning. If you determine that your environment performs better with BlueStore tuned manually, please
contact IBM support and share the details of your configuration to help us improve the auto-tuning capability. IBM looks forward to
your feedback and appreciates your recommendations.
Ceph BlueStore
Ceph BlueStore devices
Ceph BlueStore caching
Sizing considerations for Ceph BlueStore
Tuning Ceph BlueStore using bluestore_min_alloc_size parameter
Resharding the RocksDB database using the BlueStore admin tool
The BlueStore fragmentation tool
Ceph BlueStore BlueFS
Ceph BlueStore
The following are some of the main features of using BlueStore:
Metadata management with RocksDB
BlueStore uses the RocksDB key-value database to manage internal metadata, such as the mapping from object names to block
locations on a disk.
Full data and metadata checksumming
By default, all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata is read from
disk or returned to the user without verification.
Efficient copy-on-write
The Ceph Block Device and Ceph File System snapshots rely on a copy-on-write clone mechanism that is implemented efficiently in
BlueStore. This results in efficient I/O both for regular snapshots and for erasure coded pools which rely on cloning to implement
efficient two-phase commits.
No large double-writes
BlueStore first writes any new data to unallocated space on a block device, and then commits a RocksDB transaction that updates
the object metadata to reference the new region of the disk. Only when the write operation is below a configurable size threshold does
BlueStore fall back to a write-ahead journaling scheme.
Multi-device support
BlueStore can use multiple block devices for storing different data. For example: Hard Disk Drive (HDD) for the data, Solid-state Drive
(SSD) for metadata, Non-volatile Memory (NVM) or Non-volatile random-access memory (NVRAM) or persistent memory for the
RocksDB write-ahead log (WAL). See Ceph BlueStore devices for details.
Because BlueStore does not use any file system, it minimizes the need to clear the storage device cache.
Primary
WAL
DB
In the simplest case, BlueStore consumes a single primary storage device. The storage device is partitioned into two parts that
contain:
OSD metadata: A small partition formatted with XFS that contains basic metadata for the OSD. This data directory includes
information about the OSD, such as its identifier, which cluster it belongs to, and its private keyring.
Data: A large partition occupying the rest of the device that is managed directly by BlueStore and that contains all of the OSD
data. This primary device is identified by a block symbolic link in the data directory.
A WAL (write-ahead-log) device: A device that stores BlueStore internal journal or write-ahead log. It is identified by the
block.wal symbolic link in the data directory. Consider using a WAL device only if the device is faster than the primary
device. For example, when the WAL device uses an SSD disk and the primary device uses an HDD disk.
A DB device: A device that stores BlueStore internal metadata. The embedded RocksDB database puts as much metadata as
it can on the DB device instead of on the primary device to improve performance. If the DB device is full, it starts adding
metadata to the primary device. Consider using a DB device only if the device is faster than the primary device.
If the bluestore_default_buffered_write option is set to true, data is written to the buffer first, and then committed to disk.
Afterwards, a write acknowledgement is sent to the client, allowing subsequent reads faster access to the data already in cache,
until that data is evicted.
Read-heavy workloads will not see an immediate benefit from BlueStore caching. As more reading is done, the cache will grow over
time and subsequent reads will see an improvement in performance. How fast the cache populates depends on the BlueStore block
and database disk type, and the client’s workload requirements.
IMPORTANT: Please contact IBM Support before enabling the bluestore_default_buffered_write option.
When not mixing drive types, there is no requirement to have a separate RocksDB logical volume. BlueStore will automatically
manage the sizing of RocksDB.
BlueStore’s cache memory is used for the key-value pair metadata for RocksDB, BlueStore metadata, and object data.
NOTE: The BlueStore cache memory values are in addition to the memory footprint already being consumed by the OSD.
In BlueStore, the raw partition is allocated and managed in chunks of bluestore_min_alloc_size. By default,
bluestore_min_alloc_size is 4096, equivalent to 4 KiB for HDDs and SSDs. The unwritten area in each chunk is filled with
zeroes when it is written to the raw partition. This can lead to wasted unused space when not properly sized for your workload, for
example when writing small objects.
It is best practice to set bluestore_min_alloc_size to match the smallest write so this write amplification penalty can be
avoided.
IMPORTANT: Changing the value of bluestore_min_alloc_size is not recommended. For any assistance, contact IBM support.
NOTE: The settings bluestore_min_alloc_size_ssd and bluestore_min_alloc_size_hdd are specific to SSDs and HDDs,
respectively, but setting them is not necessary because setting bluestore_min_alloc_size overrides them.
Prerequisites
The admin keyring for the Ceph Monitor node, if you are redeploying an existing Ceph OSD node.
Procedure
Syntax
Example
You can see bluestore_min_alloc_size is set to 8192 bytes, which is equivalent to 8 KiB.
Syntax
Example
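A minimal sketch of setting and verifying the allocation size with the ceph config commands; the OSD specification and the 8192-byte value are illustrative, and the OSD must be redeployed for the change to take effect:
[ceph: root@host01 /]# ceph config set osd.4 bluestore_min_alloc_size 8192
[ceph: root@host01 /]# ceph config get osd.4 bluestore_min_alloc_size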
Verification
Syntax
Example
Reference
For OSD removal and addition, see Management of OSDs using the Ceph Orchestrator.
NOTE: For already deployed OSDs, you cannot modify the bluestore_min_alloc_size parameter so you have to remove
the OSDs and freshly deploy them again.
Prerequisites
Procedure
Example
2. Fetch the OSD_ID and the host details from the administration node:
Example
3. Log into the respective host as a root user and stop the OSD:
Syntax
Example
Syntax
Example
5. Log into the cephadm shell and check the file system consistency:
Syntax
Example
Syntax
Example
7. Run the ceph-bluestore-tool command to reshard. IBM recommends using the parameters as given in the command:
Syntax
Example
reshard success
8. To check the sharding status of the OSD node, run the show-sharding command:
Syntax
Example
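A minimal sketch of resharding a stopped OSD and then checking the result; the OSD data path is illustrative, and the sharding specification shown here is only an example, not necessarily the one recommended for your cluster:
[ceph: root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
[ceph: root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-6 show-sharding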
10. Log into the respective host as a root user and start the OSD:
Syntax
Example
Reference
The BlueStore fragmentation tool generates a score on the fragmentation level of the BlueStore OSD. This fragmentation score is
given as a range, 0 through 1. A score of 0 means no fragmentation, and a score of 1 means severe fragmentation.
Prerequisites
BlueStore OSDs.
a. Simple report:
Syntax
Example
[ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator score block
Syntax
Example
[ceph: root@host01 /]# ceph daemon osd.123 bluestore allocator dump block
1. Follow the steps for resharding for checking the offline fragmentation score.
Example
a. Simple report:
Syntax
Example
Syntax
Example
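A minimal sketch of the offline checks with ceph-bluestore-tool, run while the OSD is stopped; the OSD data path is illustrative:
[ceph: root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-score
[ceph: root@host01 /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-123 --allocator block free-dump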
Reference
See the BlueStore Fragmentation Tool for details on the fragmentation score.
See Resharding the RocksDB database using the BlueStore admin tool for details on resharding.
There is also an internal, hidden file that serves as the BlueFS replay log, ino 1, which holds the directory structure, the file mapping,
and the operations log.
Fallback hierarchy
With BlueFS, it is possible to put any file on any device. Parts of a file can even reside on different devices, that is, WAL, DB, and SLOW.
There is an order to where BlueFS puts files. A file is put on secondary storage only when primary storage is exhausted, and on tertiary
storage only when secondary storage is exhausted.
The order for the specific files is as follows, for each device type.
IMPORTANT: There is an exception to control and DB file order. When RocksDB detects that you are running out of space on the DB file,
it directly notifies you to put the file on the SLOW device.
As a storage administrator, you can view the current setting for the bluefs_buffered_io parameter.
The option bluefs_buffered_io is set to True by default for IBM Storage Ceph. This option enables BlueFS to perform buffered
reads in some cases, and enables the kernel page cache to act as a secondary cache for reads like RocksDB block reads.
IMPORTANT: Changing the value of bluefs_buffered_io is not recommended. Before changing the bluefs_buffered_io
parameter, contact your IBM Support account team.
Prerequisites
Log into the Cephadm shell, using the cephadm shell command.
Procedure
View the current value of the bluefs_buffered_io parameter, using one of the following procedures.
Example
View the running value for an OSD where the running value is different from the value stored in the configuration database.
Syntax
Example
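A minimal sketch of both checks described above; the OSD identifier is illustrative:
[ceph: root@host01 /]# ceph config get osd bluefs_buffered_io
[ceph: root@host01 /]# ceph config show osd.2 bluefs_buffered_io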
Prerequisites
Log into the Cephadm shell, using the cephadm shell command.
Procedure
View the BlueStore OSD statistics.
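The example output that follows presumably comes from the bluefs stats admin socket command; a minimal sketch, with an illustrative OSD identifier:
[ceph: root@host01 /]# ceph daemon osd.1 bluefs stats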
0 :
1 : device size 0x1dfbfe000 : using 0x1100000(17 MiB)
2 : device size 0x27fc00000 : using 0x248000(2.3 MiB)
RocksDBBlueFSVolumeSelector: wal_total:0, db_total:7646425907, slow_total:10196562739,
db_avail:935539507
Usage matrix:
DEV/LEV WAL DB SLOW * * REAL FILES
LOG 0 B 4 MiB 0 B 0 B 0 B 756 KiB 1
WAL 0 B 4 MiB 0 B 0 B 0 B 3.3 MiB 1
DB 0 B 9 MiB 0 B 0 B 0 B 76 KiB 10
SLOW 0 B 0 B 0 B 0 B 0 B 0 B 0
TOTALS 0 B 17 MiB 0 B 0 B 0 B 0 B 12
MAXIMUMS:
In this example,
IMPORTANT: DB and WAL devices are used only by BlueFS. For a main device, usage from stored BlueStore data is also
included. In this example, 2.3 MiB is the data from BlueStore.
wal_total, db_total, and slow_total are values that reiterate the device values previously stated.
db_avail represents how many bytes can be taken from the SLOW device, if necessary.
Usage matrix:
Rows WAL, DB, and SLOW describe where the specific file was intended to be put.
Columns WAL, DB, and SLOW describe where data is actually put. The values are in allocation units. WAL and DB
have bigger allocation units for performance reasons.
Columns * relate to the virtual devices new-db and new-wal that are used by ceph-bluestore-tool. They should
always show 0 B.
MAXIMUMS: this table captures the maximum value of each entry from the usage matrix.
Cephadm troubleshooting
Edit online
As a storage administrator, you can troubleshoot the IBM Storage Ceph cluster. Sometimes there is a need to investigate why a
Cephadm command failed or why a specific service does not run properly.
Prerequisites
Edit online
Example
This stops any changes, but Cephadm periodically checks hosts to refresh its inventory of daemons and devices.
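For illustration, pausing the Cephadm background activity can look like this:
[ceph: root@host01 /]# ceph orch pause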
Example
Note that previously deployed daemon containers continue to exist and start as they did before.
Example
Per service
Syntax
Example
Per daemon
Syntax
Example
[ceph: root@host01 /]# ceph orch ps --service-name mds --daemon-id cephfs.hostname.ppdhsz --format
yaml
daemon_type: mds
Example
You can see the last few messages with the following command:
Example
If you have enabled logging to files, you can see a Cephadm log file called ceph.cephadm.log on the monitor hosts.
NOTE: You have to run all these commands outside the cephadm shell.
NOTE: By default, Cephadm stores logs in journald which means that daemon logs are no longer available in /var/log/ceph.
To read the log file of a specific daemon, run the following command:
Syntax
Example
NOTE: This command works when run on the same hosts where the daemon is running.
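For illustration, assuming an OSD daemon named osd.0 is running on this host (a placeholder name), the command takes this form:
[root@host01 ~]# cephadm logs --name osd.0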
Syntax
Example
Syntax
[root@host01 ~]# for name in $(cephadm ls | python3 -c "import sys, json; [print(i['name'])
for i in json.load(sys.stdin)]") ; do cephadm logs --fsid 57bddb48-ee04-11eb-9962-001a4a000672
--name "$name" > $name; done
Example
Example
Example
...
Please make sure that the host is reachable and accepts connections using the cephadm SSH key
To ensure Cephadm has a SSH identity key, run the following command:
Example
If the above command fails, Cephadm does not have a key. To generate a SSH key, run the following command:
Or
Example
To ensure that the SSH configuration is correct, run the following command:
Example
Example
To verify that the public key is in the authorized_keys file, run the following commands:
Example
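A hedged illustration of this verification, assuming the cluster public key is exported first and that Cephadm connects as root:
[ceph: root@host01 /]# ceph cephadm get-pub-key > ~/ceph.pub
[ceph: root@host01 /]# grep "$(cat ~/ceph.pub)" /root/.ssh/authorized_keys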
ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later
Or
Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP
Example
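One common way to resolve this error is to set the public network explicitly; the subnet shown here is only an illustrative value:
[ceph: root@host01 /]# ceph config set mon public_network 10.10.128.0/24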
To access the admin socket, enter the daemon container on the host:
Example
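For illustration, assuming the daemon of interest is osd.0 (a placeholder name), entering its container and querying its admin socket can look like this:
[root@host01 ~]# cephadm enter --name osd.0
[root@host01 /]# ceph daemon osd.0 config show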
Prerequisites
Edit online
Procedure
Edit online
Example
2. Disable the Cephadm scheduler to prevent Cephadm from removing the new MGR daemon, with the following command:
Example
3. Get or create the auth entry for the new MGR daemon:
Example
[ceph: root@host01 /]# ceph auth get-or-create mgr.host01.smfvfd1 mon "profile mgr" osd "allow
*" mds "allow *"
[mgr.host01.smfvfd1]
key = AQDhcORgW8toCRAAlMzlqWXnh3cGRjqYEa9ikw==
Example
Example
NOTE: Use the values from the output of the ceph config generate-minimal-conf command.
Example
{
{
"config": "# minimal ceph.conf for 8c9b0072-67ca-11eb-af06-001a4a0002a0\n[global]\n\tfsid =
8c9b0072-67ca-11eb-af06-001a4a0002a0\n\tmon_host =
[v2:10.10.200.10:3300/0,v1:10.10.200.10:6789/0]
[v2:10.10.10.100:3300/0,v1:10.10.200.100:6789/0]\n",
"keyring": "[mgr.Ceph5-2.smfvfd1]\n\tkey = AQDhcORgW8toCRAAlMzlqWXnh3cGRjqYEa9ikw==\n"
}
}
Example
Example
Verification
Edit online
Example
Cephadm operations
Edit online
As a storage administrator, you can carry out Cephadm operations in the IBM Storage Ceph cluster.
Prerequisites
Edit online
Example
Example
By default, the log displays info-level events and above. To see the debug-level messages, run the following commands:
Example
Example
Example
These events are also logged to the ceph.cephadm.log file on the monitor hosts and to the monitor daemon's stderr.
Logging to stdout
Traditionally, Ceph daemons have logged to /var/log/ceph. By default, Cephadm daemons log to stderr and the logs are
captured by the container runtime environment. For most systems, by default, these logs are sent to journald and accessible
through the journalctl command.
For example, to view the logs for the daemon on host01 for a storage cluster with ID 5c5a50ae-272a-455d-99e9-
32c6a013e694:
Example
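For illustration, using the cluster ID mentioned above and a placeholder daemon name of mon.host01, the journalctl invocation takes this form:
[root@host01 ~]# journalctl -u ceph-5c5a50ae-272a-455d-99e9-32c6a013e694@mon.host01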
This works well for normal Cephadm operations when logging levels are low.
Example
Logging to files
You can also configure Ceph daemons to log to files instead of stderr. When logging to files, Ceph logs are located in
/var/log/ceph/CLUSTER_FSID.
Example
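For illustration, file logging for the daemons and the cluster log can be enabled with settings of this form:
[ceph: root@host01 /]# ceph config set global log_to_file true
[ceph: root@host01 /]# ceph config set global mon_cluster_log_to_file true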
By default, Cephadm sets up log rotation on each host to rotate these files. You can configure the logging retention schedule by
modifying /etc/logrotate.d/ceph.CLUSTER_FSID.
Data location
Edit online
Cephadm daemon data and logs are located in slightly different locations than the older versions of Ceph:
/var/lib/ceph/CLUSTER_FSID/removed contains old daemon data directories for the stateful daemons, for example
monitor or Prometheus, that have been removed by Cephadm.
Disk usage
A few Ceph daemons, notably the monitors and the Prometheus daemon, may store a significant amount of data in /var/lib/ceph.
IBM recommends moving this directory to its own disk, partition, or logical volume so that the root file system is not filled up.
CEPHADM_PAUSED
Cephadm background work is paused with the ceph orch pause command. Cephadm continues to perform passive monitoring
activities such as checking the host and daemon status, but it does not make any changes like deploying or removing daemons. You
can resume Cephadm work with the ceph orch resume command.
CEPHADM_STRAY_HOST
One or more hosts have running Ceph daemons but are not registered as hosts managed by the Cephadm module. This means that
those services are not currently managed by Cephadm; for example, they are not restarted or upgraded, and they are not included in the
ceph orch ps output. You can manage the host(s) with the ceph orch host add HOST_NAME command, but ensure that SSH access to the
remote hosts is configured. Alternatively, you can manually connect to the host and ensure that services on that host are removed or
migrated to a host that is managed by Cephadm. You can also disable this warning with the setting ceph config set mgr
mgr/cephadm/warn_on_stray_hosts false.
CEPHADM_STRAY_DAEMON
One or more Ceph daemons are running but are not managed by the Cephadm module. This might be because they were deployed
using a different tool, or because they were started manually. Those services are not currently managed by Cephadm; for example, they
are not restarted or upgraded, and they are not included in the ceph orch ps output.
If the daemon is a stateful one, that is, a monitor or OSD daemon, it should be adopted by Cephadm. For stateless
daemons, you can provision a new daemon with the ceph orch apply command and then stop the unmanaged daemon.
You can disable this health warning with the setting ceph config set mgr mgr/cephadm/warn_on_stray_daemons false.
CEPHADM_HOST_CHECK_FAILED
One or more hosts have failed the basic Cephadm host check, which verifies that the host meets the basic prerequisites, such as a
working container runtime (Podman) and working time synchronization. If this test fails, Cephadm won't be able to manage the
services on that host.
You can manually run this check with the ceph cephadm check-host HOST_NAME command. You can remove a broken host from
management with the ceph orch host rm HOST_NAME command. You can disable this health warning with the setting ceph
config set mgr mgr/cephadm/warn_on_failed_host_check false.
Example
The configuration checks are triggered after each host scan, which runs at an interval of one minute.
The ceph -W cephadm command shows log entries of the current state and outcome of the configuration checks as follows:
Disabled state
Example
ALL cephadm checks are disabled, use 'ceph config set mgr mgr/cephadm/config_checks_enabled true' to enable
Enabled state
Example
CEPHADM 8/8 checks enabled and executed (0 bypassed, 0 disabled). No issues detected
The configuration checks themselves are managed through several cephadm subcommands.
To determine whether the configuration checks are enabled, run the following command:
Example
This command returns the status of the configuration checker as either Enabled or Disabled.
To list all the configuration checks and their current state, run the following command:
Example
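For illustration, the status and listing subcommands take this form:
[ceph: root@host01 /]# ceph cephadm config-check status
[ceph: root@host01 /]# ceph cephadm config-check ls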
CEPHADM_CHECK_KERNEL_LSM
Each host within the storage cluster is expected to operate within the same Linux Security Module (LSM) state. For example, if the
majority of the hosts are running with SELINUX in enforcing mode, any host not running in this mode would be flagged as an
anomaly and a healthcheck with a warning state is raised.
CEPHADM_CHECK_SUBSCRIPTION
This check relates to the status of the vendor subscription. This check is only performed for hosts using Red Hat Enterprise Linux, but
helps to confirm that all the hosts are covered by an active subscription so that patches and updates are available.
CEPHADM_CHECK_PUBLIC_MEMBERSHIP
All members of the cluster should have NICs configured on at least one of the public network subnets. Hosts that are not on the
public network will rely on routing which may affect performance.
CEPHADM_CHECK_MTU
The maximum transmission unit (MTU) of the NICs on OSDs can be a key factor in consistent performance. This check examines
hosts that are running OSD services to ensure that the MTU is configured consistently within the cluster. This is determined by
establishing the MTU setting that the majority of hosts are using, with any anomalies resulting in a Ceph healthcheck.
CEPHADM_CHECK_LINKSPEED
Similar to the MTU check, linkspeed consistency is also a factor in consistent cluster performance. This check determines the
linkspeed shared by the majority of the OSD hosts, resulting in a healthcheck for any hosts that are set at a lower linkspeed rate.
CEPHADM_CHECK_NETWORK_MISSING
The public_network and cluster_network settings support subnet definitions for IPv4 and IPv6. If these settings are not found
on any host in the storage cluster a healthcheck is raised.
CEPHADM_CHECK_CEPH_RELEASE
Under normal operations, the Ceph cluster should be running daemons under the same Ceph release, for example all IBM Storage Ceph
cluster 5 releases. This check looks at the active release for each daemon, and reports any anomalies as a healthcheck. This check is
bypassed if an upgrade process is active within the cluster.
CEPHADM_CHECK_KERNEL_VERSION
The OS kernel version is checked for consistency across the hosts. Once again, the majority of the hosts is used as the basis of
identifying anomalies.
NOTE: At this time, cephadm-ansible modules only support the most important tasks. Any operation not covered by cephadm-
ansible modules must be completed using either the command or shell Ansible modules in your playbooks.
Edit online
The cephadm-ansible modules are a collection of modules that simplify writing Ansible playbooks by providing a wrapper around
cephadm and ceph orch commands. You can use the modules to write your own unique Ansible playbooks to administer your
cluster using one or more of the modules.
cephadm_bootstrap
ceph_orch_host
ceph_config
ceph_orch_apply
ceph_orch_daemon
cephadm_registry_login
Edit online
The following tables list the available options for the cephadm-ansible modules. Options listed as required need to be set when
using the modules in your Ansible playbooks. Options listed with a default value of true indicate that the option is automatically set
when using the modules and you do not need to specify it in your playbook. For example, for the cephadm_bootstrap module, the
Ceph Dashboard is installed unless you set dashboard: false.
Edit online
As a storage administrator, you can bootstrap a storage cluster using Ansible by using the cephadm_bootstrap and
cephadm_registry_login modules in your Ansible playbook.
Prerequisites
Edit online
An IP address for the first Ceph Monitor container, which is also the IP address for the first node in the storage cluster.
Procedure
Edit online
Example
3. Create the hosts file and add hosts, labels, and monitor IP address of the first host in the storage cluster:
Syntax
sudo vi INVENTORY_FILE
[admin]
ADMIN_HOST monitor_address=MONITOR_IP_ADDRESS labels="['ADMIN_LABEL', 'LABEL1', 'LABEL2']"
Example
[admin]
host01 monitor_address=10.10.128.68 labels="['_admin', 'mon', 'mgr']"
Syntax
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: NAME_OF_PLAY
hosts: BOOTSTRAP_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
cephadm_registry_login:
state: STATE
registry_url: REGISTRY_URL
registry_username: REGISTRY_USER_NAME
registry_password: REGISTRY_PASSWORD
- name: NAME_OF_TASK
cephadm_bootstrap:
mon_ip: "{{ monitor_address }}"
dashboard_user: DASHBOARD_USER
dashboard_password: DASHBOARD_PASSWORD
allow_fqdn_hostname: ALLOW_FQDN_HOSTNAME
cluster_network: NETWORK_CIDR
Example
---
- name: bootstrap the cluster
hosts: host01
become: true
gather_facts: false
tasks:
- name: login to registry
cephadm_registry_login:
state: login
registry_url: cp.icr.io/cp
registry_username: user1
registry_password: mypassword1
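    # Illustrative continuation (not part of the original example): the bootstrap
    # task that would normally follow the registry login in the same play. The
    # dashboard credentials and cluster network shown here are placeholder values.
    - name: bootstrap the cluster
      cephadm_bootstrap:
        mon_ip: "{{ monitor_address }}"
        dashboard_user: mydashboarduser
        dashboard_password: mydashboardpassword
        allow_fqdn_hostname: true
        cluster_network: 10.10.128.0/28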
Syntax
Example
Verification
Edit online
Edit online
Add and remove hosts in your storage cluster by using the ceph_orch_host module in your Ansible playbook.
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
For more information about copying the storage cluster's public SSH keys to new hosts, see Adding hosts.
Procedure
Edit online
Example
c. Add the new hosts and labels to the Ansible inventory file.
Syntax
sudo vi INVENTORY_FILE
[admin]
ADMIN_HOST monitor_address=MONITOR_IP_ADDRESS labels="['ADMIN_LABEL', 'LABEL1',
'LABEL2']"
Example
[admin]
host01 monitor_address=10.10.128.68 labels="['_admin', 'mon', 'mgr']"
NOTE: If you have previously added the new hosts to the Ansible inventory file and ran the preflight playbook on the
hosts, skip to step 3.
Syntax
Example
The preflight playbook installs podman, lvm2, chronyd, and cephadm on the new host. After installation is complete,
cephadm resides in the /usr/sbin/ directory.
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: HOSTS_OR_HOST_GROUPS
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_host:
name: "{{ ansible_facts[hostname] }}"
address: "{{ ansible_facts[default_ipv4][address] }}"
labels: "{{ labels }}"
delegate_to: HOST_TO_DELEGATE_TASK_TO
- name: NAME_OF_TASK
when: inventory_hostname in groups['admin']
ansible.builtin.shell:
cmd: CEPH_COMMAND_TO_RUN
register: REGISTER_NAME
- name: NAME_OF_TASK
when: inventory_hostname in groups['admin']
debug:
msg: "{{ REGISTER_NAME.stdout }}"
NOTE: By default, Ansible executes all tasks on the host that matches the hosts line of your playbook. The ceph
orch commands must run on the host that contains the admin keyring and the Ceph configuration file. Use the
delegate_to keyword to specify the admin host in your cluster.
Example
---
- name: add additional hosts to the cluster
hosts: all
become: true
gather_facts: true
tasks:
- name: add hosts to the cluster
ceph_orch_host:
name: "{{ ansible_facts['hostname'] }}"
address: "{{ ansible_facts['default_ipv4']['address'] }}"
In this example, the playbook adds the new hosts to the cluster and displays a current list of hosts.
Syntax
Example
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: NAME_OF_PLAY
hosts: ADMIN_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_host:
name: HOST_TO_REMOVE
state: STATE
- name: NAME_OF_TASK
ceph_orch_host:
name: HOST_TO_REMOVE
state: STATE
retries: NUMBER_OF_RETRIES
delay: DELAY
until: CONTINUE_UNTIL
register: REGISTER_NAME
- name: NAME_OF_TASK
ansible.builtin.shell:
cmd: ceph orch host ls
register: REGISTER_NAME
- name: NAME_OF_TASK
debug:
msg: "{{ REGISTER_NAME.stdout }}"
Example
In this example, the playbook tasks drain all daemons on host07, removes the host from the cluster, and displays a current
list of hosts.
Syntax
Example
Verification
Edit online
Review the Ansible task output displaying the current list of hosts in the cluster:
Example
Edit online
As a storage administrator, you can set or get IBM Storage Ceph configuration options using the ceph_config module.
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts.
Procedure
Edit online
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: ADMIN_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_config:
action: GET_OR_SET
who: DAEMON_TO_SET_CONFIGURATION_TO
option: CEPH_CONFIGURATION_OPTION
value: VALUE_OF_PARAMETER_TO_SET
- name: NAME_OF_TASK
ceph_config:
action: GET_OR_SET
who: DAEMON_TO_SET_CONFIGURATION_TO
option: CEPH_CONFIGURATION_OPTION
register: REGISTER_NAME
- name: NAME_OF_TASK
debug:
msg: "MESSAGE_TO_DISPLAY {{ REGISTER_NAME.stdout }}"
Example
---
- name: set pool delete
hosts: host01
become: true
gather_facts: false
tasks:
- name: set the allow pool delete option
ceph_config:
action: set
who: mon
option: mon_allow_pool_delete
value: true
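    # Illustrative continuation (not in the original excerpt): get the option back
    # and display it, so the play matches the description that follows.
    - name: get the allow pool delete option
      ceph_config:
        action: get
        who: mon
        option: mon_allow_pool_delete
      register: verify_mon_allow_pool_delete
    - name: print current mon_allow_pool_delete setting
      debug:
        msg: "the value of 'mon_allow_pool_delete' is {{ verify_mon_allow_pool_delete.stdout }}"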
In this example, the playbook first sets the mon_allow_pool_delete option to true. The playbook then gets the current
mon_allow_pool_delete setting and displays the value in the Ansible output.
Syntax
Example
Verification
Edit online
Example
Reference
Edit online
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts.
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: HOSTS_OR_HOST_GROUPS
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_apply:
spec: |
service_type: SERVICE_TYPE
service_id: UNIQUE_NAME_OF_SERVICE
placement:
host_pattern: HOST_PATTERN_TO_SELECT_HOSTS
label: LABEL
spec:
SPECIFICATION_OPTIONS:
Example
---
- name: deploy osd service
hosts: host01
become: true
gather_facts: true
tasks:
- name: apply osd spec
ceph_orch_apply:
spec: |
service_type: osd
service_id: osd
placement:
host_pattern: '*'
label: osd
spec:
data_devices:
all: true
In this example, the playbook deploys the Ceph OSD service on all hosts with the label osd.
Syntax
Example
Verification
Edit online
Reference
Edit online
Prerequisites
Edit online
Ansible user with sudo and passwordless SSH access to all nodes in the storage cluster.
The Ansible inventory file contains the cluster and admin hosts.
Procedure
Edit online
Example
Syntax
sudo vi PLAYBOOK_FILENAME.yml
---
- name: PLAY_NAME
hosts: ADMIN_HOST
become: USE_ELEVATED_PRIVILEGES
gather_facts: GATHER_FACTS_ABOUT_REMOTE_HOSTS
tasks:
- name: NAME_OF_TASK
ceph_orch_daemon:
state: STATE_OF_SERVICE
daemon_id: DAEMON_ID
daemon_type: TYPE_OF_SERVICE
Example
---
- name: start and stop services
hosts: host01
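  # Illustrative continuation (not in the original excerpt): the remainder of the
  # play that the description below refers to.
  become: true
  gather_facts: false
  tasks:
    - name: start osd.0
      ceph_orch_daemon:
        state: started
        daemon_id: "0"
        daemon_type: osd
    - name: stop mon.host02
      ceph_orch_daemon:
        state: stopped
        daemon_id: host02
        daemon_type: mon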
In this example, the playbook starts the OSD with an ID of 0 and stops a Ceph Monitor with an id of host02.
Syntax
Example
Verification
Edit online
Operations
Edit online
Learn to do operational tasks for IBM Storage Ceph.
Orchestrator CLI : These are common APIs used in orchestrators and include a set of commands that can be implemented.
These APIs also provide a common command line interface (CLI) to orchestrate ceph-mgr modules with external
orchestration services. The following is the nomenclature used with the Ceph Orchestrator:
Host : This is the host name of the physical host and not the pod name, DNS name, container name, or host name inside
the container.
Service type : This is the type of the service, such as mds, osd, mon, rgw, and mgr.
Service : A functional service provided by a Ceph storage cluster such as monitors service, managers service, OSD
services, and Ceph Object Gateway service.
Daemon : A specific instance of a service deployed on one or more hosts. For example, a Ceph Object Gateway service can
have different Ceph Object Gateway daemons running on three different hosts.
Cephadm Orchestrator - This is a Ceph Orchestrator module that does not rely on an external tool such as Rook or Ansible,
but rather manages nodes in a cluster by establishing an SSH connection and issuing explicit management commands. This
module is intended for day-one and day-two operations.
Using the Cephadm Orchestrator is the recommended way of installing a Ceph storage cluster without leveraging any
deployment frameworks like Ansible. The idea is to provide the manager daemon with access to an SSH configuration and key
that is able to connect to all nodes in a cluster to perform any management operations, like creating an inventory of storage
devices, deploying and replacing OSDs, or starting and stopping Ceph daemons. In addition, the Cephadm Orchestrator will
deploy container images managed by systemd in order to allow independent upgrades of co-located services.
This orchestrator will also likely highlight a tool that encapsulates all necessary operations to manage the deployment of
container image based services on the current host, including a command that bootstraps a minimal cluster running a Ceph
Monitor and a Ceph Manager.
Rook follows the “operator” model, in which a custom resource definition (CRD) object is defined in Kubernetes to describe a
Ceph storage cluster and its desired state, and a rook operator daemon is running in a control loop that compares the current
cluster state to desired state and takes steps to make them converge. The main object describing Ceph’s desired state is the
Ceph storage cluster CRD, which includes information about which devices should be consumed by OSDs, how many monitors
should be running, and what version of Ceph should be used. Rook defines several other CRDs to describe RBD pools, CephFS
file systems, and so on.
The Rook Orchestrator module is the glue that runs in the ceph-mgr daemon and implements the Ceph orchestration API by
making changes to the Ceph storage cluster in Kubernetes that describe desired cluster state. A Rook cluster’s ceph-mgr
daemon is running as a Kubernetes pod, and hence, the rook module can connect to the Kubernetes API without any explicit
configuration.
Management of services
Edit online
As a storage administrator, after installing the IBM Storage Ceph cluster, you can monitor and manage the services in a storage
cluster. A service is a group of daemons that are configured together.
NOTE: If the services are applied with the ceph orch apply command while bootstrapping, changing the service specification file
is complicated. Instead, you can use the --export option with the ceph orch ls command to export the running specification,
update the yaml file, and re-apply the service.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Example
[ceph: root@host01 /]# ceph orch ls --service-type mgr --export > mgr.yaml
[ceph: root@host01 /]# ceph orch ls --export > cluster.yaml
This exports the file in the .yaml file format. This file can be used with the ceph orch apply -i command for retrieving
the service specification of a single service.
You can check the status of the daemons of the storage cluster by using the ceph orch ps command:
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Example
There are two ways of deploying the services using the placement specification:
Using the placement specification directly in the command line interface. For example, if you want to deploy three monitors on
the hosts, running the following command deploys three monitors on host01, host02, and host03.
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="3 host01 host02 host03"
Using the placement specification in the YAML file. For example, if you want to deploy node-exporter on all the hosts, then
you can specify the following in the yaml file.
Example
service_type: node-exporter
placement:
host_pattern: '*'
Prerequisites
Edit online
Procedure
Edit online
Example
2. Use one of the following methods to deploy the daemons on the hosts:
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="3 host01 host02 host03"
Method 2: Add the labels to the hosts and then deploy the daemons using the labels:
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host01 mon
Syntax
Example
Method 3: Add the labels to the hosts and deploy using the --placement argument:
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host01 mon
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Adding hosts
Prerequisites
Edit online
Procedure
Edit online
Example
2. List the hosts on which you want to deploy the Ceph daemons:
Example
Syntax
Example
In this example, the mgr daemons are deployed only on two hosts.
Verification
Edit online
Example
Reference
See the Listing hosts section in the IBM Storage Ceph Operations Guide.
Example
service_type: mon
placement:
host_pattern: "mon*"
---
service_type: mgr
placement:
host_pattern: "mgr*"
---
service_type: osd
service_id: default_drive_group
placement:
host_pattern: "osd*"
data_devices:
all: true
The following list describes the parameters that define the properties of a service specification:
service_type: The type of Ceph service, such as mon, crash, mds, mgr, osd, rbd, or rbd-mirror.
placement: This is used to define where and how to deploy the daemons.
unmanaged: If set to true, the Orchestrator will neither deploy nor remove any daemon associated with this service.
A stateless service is a service that does not need state information to be available. For example, to start an rgw service,
additional information is not needed to start or run the service. The rgw service does not create information about this state in order
to provide the functionality. Regardless of when the rgw service starts, the state is the same.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
service_type: SERVICE_NAME
placement:
hosts:
- HOST_NAME_1
- HOST_NAME_2
Example
service_type: mon
placement:
hosts:
- host01
- host02
- host03
Syntax
service_type: SERVICE_NAME
placement:
label: "LABEL_1"
Example
service_type: mon
placement:
label: "mon"
3. Optional: You can also use extra container arguments in the service specification files such as CPUs, CA certificates, and other
files while deploying services:
Example
extra_container_args:
- "-v"
- "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro"
- "--security-opt"
- "label=disable"
- "cpus=2"
Example
Example
Example
Verification
Edit online
Example
Syntax
Example
Reference
Edit online
See the Listing hosts section in the IBM Storage Ceph Operations Guide.
Management of hosts
Edit online
As a storage administrator, you can use the Ceph Orchestrator with Cephadm in the backend to add, list, and remove hosts in an
existing IBM Storage Ceph cluster.
You can also add labels to hosts. Labels are free-form and have no specific meanings. Each host can have multiple labels. For
example, apply the mon label to all hosts that have monitor daemons deployed, mgr for all hosts with manager daemons deployed,
rgw for Ceph object gateways, and so on.
Labeling all the hosts in the storage cluster helps to simplify system management tasks by allowing you to quickly identify the
daemons running on each host. In addition, you can use the Ceph Orchestrator or a YAML file to deploy or remove daemons on hosts
that have specific host labels.
Adding hosts
Adding multiple hosts
Listing hosts
Adding labels to hosts
Removing labels from hosts
Removing hosts
Placing hosts in the maintenance mode
Adding hosts
Edit online
You can use the Ceph Orchestrator with Cephadm in the backend to add hosts to an existing IBM Storage Ceph cluster.
Ansible user with sudo and passwordless ssh access to all nodes in the storage cluster.
Procedure
Edit online
1. From the Ceph administration node, log into the Cephadm shell:
Example
Syntax
Example
3. Copy Ceph cluster’s public SSH keys to the root user’s authorized_keys file on the new host:
Syntax
Example
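A hedged illustration of this step, assuming the new host is host02 and the cluster public key is at its default location:
[ceph: root@host01 /]# ssh-copy-id -f -i /etc/ceph/ceph.pub root@host02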
4. From the Ansible administration node, add the new host to the Ansible inventory file. The default location for the file is
/usr/share/cephadm-ansible/hosts. The following example shows the structure of a typical inventory file:
Example
host01
host02
host03
[admin]
host00
NOTE: If you have previously added the new host to the Ansible inventory file and run the preflight playbook on the host, skip
to step 6.
Syntax
Example
6. From the Ceph administration node, log into the Cephadm shell:
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch host add host02 10.10.128.70 --labels=mon,mgr
Verification
Edit online
Example
Reference
Edit online
See the Listing hosts section in the IBM Storage Ceph Operations Guide.
For more information about the cephadm-preflight playbook, see Running the preflight playbook section in the IBM
Storage Ceph Installation Guide.
See the Registering IBM Storage Ceph nodes to the CDN and attaching subscriptions section in the IBM Storage Ceph
Installation Guide.
See the Creating an Ansible user with sudo access section in the IBM Storage Ceph Installation Guide.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
service_type: host
addr: host01
hostname: host01
labels:
- mon
- osd
- mgr
---
service_type: host
addr: host02
hostname: host02
labels:
- mon
- osd
- mgr
---
service_type: host
addr: host03
hostname: host03
labels:
- mon
- osd
Example
Example
Syntax
Example
Verification
Edit online
Example
Reference
Edit online
Listing hosts
Edit online
NOTE: The STATUS of the hosts is blank, in the output of the ceph orch host ls command.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
You will see that the STATUS of the hosts is blank, which is expected.
You can also add the following host labels that have special meaning to cephadm and they begin with _:
_no_schedule: This label prevents cephadm from scheduling or deploying daemons on the host. If it is added to an existing
host that already contains Ceph daemons, it causes cephadm to move those daemons elsewhere, except OSDs which are not
removed automatically. When a host is added with the _no_schedule label, no daemons are deployed on it. When the
daemons are drained before the host is removed, the _no_schedule label is set on that host.
_no_autotune_memory: This label prevents memory autotuning on the host. It prevents the daemon memory from being
tuned even when the osd_memory_target_autotune option or other similar options are enabled for one or more daemons
on that host.
_admin: By default, the _admin label is applied to the bootstrapped host in the storage cluster and the client.admin key is
set to be distributed to that host with the ceph orch client-keyring {ls|set|rm} function. Adding this label to
additional hosts normally causes cephadm to deploy configuration and keyring files in the /etc/ceph directory.
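For illustration, the special labels described above are applied like any other label; for example, to make host02 an additional admin host:
[ceph: root@host01 /]# ceph orch host label add host02 _admin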
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host02 mon
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Verification
Edit online
Verify that the label has been moved from the host, by using the ceph orch host ls command.
IMPORTANT: If you are removing the bootstrap host, be sure to copy the admin keyring and the configuration file to another host in
the storage cluster before you remove the host.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Syntax
Example
The _no_schedule label is automatically applied to the host, which blocks deployment.
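For illustration, the drain step typically takes this form, with host02 as a placeholder host name:
[ceph: root@host01 /]# ceph orch host drain host02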
Example
When no placement groups (PG) are left on the OSD, the OSD is decommissioned and removed from the storage cluster.
5. Check if all the daemons are removed from the storage cluster:
Syntax
Example
Syntax
Example
Reference
Edit online
Adding hosts
Listing hosts
The orchestrator adopts the following workflow when the host is placed in maintenance:
1. Confirms the removal of hosts does not impact data availability by running the orch host ok-to-stop command.
2. If the host has Ceph OSD daemons, it applies noout to the host subtree to prevent data migration from triggering during the
planned maintenance slot.
3. Stops the Ceph target on the host, in order to stop all the daemons.
4. Disables the ceph target on the host, to prevent a reboot from automatically starting Ceph services.
Prerequisites
Edit online
Procedure
Edit online
Example
2. You can either place the host in maintenance mode or place it out of the maintenance mode:
Syntax
Example
[ceph: root@host01 /]# ceph orch host maintenance enter host02 --force
The --force flag allows the user to bypass warnings, but not alerts.
Syntax
Example
Example
Management of monitors
Edit online
As a storage administrator, you can deploy additional monitors using placement specification, add monitors using service
specification, add monitors to a subnet configuration, and add monitors to specific hosts. Apart from this, you can remove the
monitors.
By default, a typical IBM Storage Ceph cluster has three or five monitor daemons deployed on different hosts.
IBM recommends deploying five monitors if there are five or more nodes in a cluster.
Ceph deploys monitor daemons automatically as the cluster grows, and scales back monitor daemons automatically as the cluster
shrinks. The smooth execution of this automatic growing and shrinking depends upon proper subnet configuration.
If your monitor nodes or your entire cluster are located on a single subnet, then Cephadm automatically adds up to five monitor
daemons as you add new hosts to the cluster. Cephadm automatically configures the monitor daemons on the new hosts. The new
hosts reside on the same subnet as the bootstrapped host in the storage cluster.
Cephadm can also deploy and scale monitors to correspond to changes in the size of the storage cluster.
Ceph Monitors
Configuring monitor election strategy
Deploying the Ceph monitor daemons using the command line interface
Deploying the Ceph monitor daemons using the service specification
Deploying the monitor daemons on specific network
Removing the monitor daemons
Removing a Ceph Monitor from an unhealthy storage cluster
Ceph Monitors
Edit online
Ceph Monitors are lightweight processes that maintain a master copy of the storage cluster map. All Ceph clients contact a Ceph
monitor and retrieve the current copy of the storage cluster map, enabling clients to bind to a pool and read and write data.
Ceph Monitors use a variation of the Paxos protocol to establish consensus about maps and other critical information across the
storage cluster. Due to the nature of Paxos, Ceph requires a majority of monitors running to establish a quorum, thus establishing
consensus.
IMPORTANT: IBM requires at least three monitors on separate hosts to receive support for a production cluster.
For an initial deployment of a multi-node Ceph storage cluster, IBM requires three monitors, increasing the number two at a time if a
valid need for more than three monitors exists.
Since Ceph Monitors are lightweight, it is possible to run them on the same host as OpenStack nodes. However, IBM recommends
running monitors on separate hosts.
When you remove monitors from a storage cluster, consider that Ceph Monitors use the Paxos protocol to establish a consensus
about the master storage cluster map. You must have a sufficient number of Ceph Monitors to establish a quorum.
Reference
Edit online
See the IBM Storage Ceph Supported configurations Knowledgebase article for all the supported Ceph configurations.
1. classic - This is the default mode, in which the monitor with the lowest rank is voted for, based on the elector module, between the
two sites.
2. disallow - This mode lets you mark monitors as disallowed, in which case they will participate in the quorum and serve
clients, but cannot be an elected leader. This lets you add monitors to a list of disallowed leaders. If a monitor is in the
disallowed list, it will always defer to another monitor.
3. connectivity - This mode is mainly used to resolve network discrepancies. It evaluates connection scores, based on pings
that check liveness, provided by each monitor for its peers and elects the most connected and reliable monitor to be the
leader. This mode is designed to handle net splits, which may happen if your cluster is stretched across multiple data centers
or otherwise susceptible. This mode incorporates connection score ratings and elects the monitor with the best score. If a
specific monitor is desired to be the leader, configure the election strategy so that the specific monitor is the first monitor in
the list with a rank of 0.
IBM recommends that you stay in the classic mode unless you require the features of the other modes.
Before constructing the cluster, change the election_strategy to classic, disallow, or connectivity with the following
command:
Syntax
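For illustration, switching to the connectivity strategy takes this form:
[ceph: root@host01 /]# ceph mon set election_strategy connectivity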
Prerequisites
Edit online
Procedure
Edit online
Example
Method 1
NOTE: IBM recommends that you use the --placement option to deploy on specific hosts.
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01 host02 host03"
NOTE: Be sure to include the bootstrap node as the first node in the command.
IMPORTANT: Do not add the monitors individually, because each ceph orch apply mon command supersedes the previous one and does
not add the monitors to all the hosts. For example, if you run the following commands, the first command creates a monitor on host01.
Then the second command supersedes the monitor on host01 and creates a monitor on host02. Then the third command supersedes
the monitor on host02 and creates a monitor on host03. Eventually, there is a monitor only on the third host.
Method 2
Use placement specification to deploy specific number of monitors on specific hosts with labels:
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host01 mon
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01:mon host02:mon host03:mon"
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="3 host01 host02 host03"
Method 4
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
service_type: mon
placement:
hosts:
- HOST_NAME_1
- HOST_NAME_2
Example
service_type: mon
placement:
hosts:
- host01
- host02
Example
Example
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Procedure
Edit online
Example
Example
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Example
2. Run the ceph orch apply command to deploy the required monitor daemons:
Syntax
If you want to remove monitor daemons from host02, then you can redeploy the monitors on other hosts.
Example
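As an illustration of this redeployment, where the placement count and host names are placeholders:
[ceph: root@host01 /]# ceph orch apply mon "2 host01 host03"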
Verification
Edit online
Syntax
Example
Reference
Edit online
See Deploying the Ceph monitor daemons using the command line interface section in the IBM Storage Ceph Operations Guide
for more information.
See Deploying the Ceph monitor daemons using the service specification section in the IBM Storage Ceph Operations Guide for
more information.
Prerequisites
Edit online
Syntax
ssh root@MONITOR_ID
Example
2. Log in to each Ceph Monitor host and stop all the Ceph Monitors:
Syntax
Example
3. Set up the environment suitable for extended daemon maintenance and to run the daemon interactively:
Syntax
Example
Syntax
Example
Syntax
Example
6. Inject the surviving monitor map with the removed monitor(s) into the surviving Ceph Monitor:
Syntax
Example
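A hedged illustration of the inject step, assuming host01 is the surviving monitor and the edited monitor map was saved to /tmp/monmap:
[ceph: root@host01 /]# ceph-mon -i host01 --inject-monmap /tmp/monmap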
Syntax
Example
Example
Management of managers
Edit online
As a storage administrator, you can use the Ceph Orchestrator to deploy additional manager daemons. Cephadm automatically
installs a manager daemon on the bootstrap node during the bootstrapping process.
In general, you should set up a Ceph Manager on each of the hosts running a Ceph Monitor daemon to achieve the same level of
availability.
By default, whichever ceph-mgr instance comes up first is made active by the Ceph Monitors, and others are standby managers.
There is no requirement that there should be a quorum among the ceph-mgr daemons.
If the active daemon fails to send a beacon to the monitors for more than the mon mgr beacon grace, then it is replaced by a
standby.
If you want to pre-empt failover, you can explicitly mark a ceph-mgr daemon as failed with ceph mgr fail MANAGER_NAME
command.
Prerequisites
Edit online
Prerequisites
Edit online
NOTE: Ensure that your deployment has at least three Ceph Managers.
Procedure
Edit online
Example
Method 1
NOTE: IBM recommends that you use the --placement option to deploy on specific hosts.
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mgr --placement="host01 host02 host03"
Method 2
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Example
2. Run the ceph orch apply command to redeploy the required manager daemons:
Syntax
If you want to remove manager daemons from host02, then you can redeploy the manager daemons on other hosts.
Example
[ceph: root@host01 /]# ceph orch apply mgr "2 host01 host03"
Verification
Edit online
Syntax
Example
Reference
Edit online
See Deploying the manager daemons section in the IBM Storage Ceph Operations Guide for more information.
Enable or disable modules with ceph mgr module enable MODULE command or ceph mgr module disable MODULE
command respectively.
If a module is enabled, then the active ceph-mgr daemon loads and executes it. In the case of modules that provide a service, such
as an HTTP server, the module might publish its address when it is loaded. To see the addresses of such modules, run the ceph mgr
services command.
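For illustration, enabling a module and inspecting module state and published service addresses can look like this:
[ceph: root@host01 /]# ceph mgr module enable dashboard
[ceph: root@host01 /]# ceph mgr module ls
[ceph: root@host01 /]# ceph mgr services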
MODULE
balancer on (always on)
crash on (always on)
devicehealth on (always on)
orchestrator on (always on)
pg_autoscaler on (always on)
progress on (always on)
rbd_support on (always on)
status on (always on)
telemetry on (always on)
volumes on (always on)
cephadm on
dashboard on
iostat on
nfs on
prometheus on
restful on
alerts -
diskprediction_local -
influx -
insights -
k8sevents -
localpool -
mds_autoscaler -
mirroring -
osd_perf_query -
osd_support -
rgw -
rook -
selftest -
snap_schedule -
stats -
telegraf -
test_orchestrator -
zabbix -
The first time the cluster starts, it uses the mgr_initial_modules setting to override which modules to enable. However, this
setting is ignored through the rest of the lifetime of the cluster: only use it for bootstrapping. For example, before starting your
monitor daemons for the first time, you might add a section like this to your ceph.conf file:
[mon]
mgr initial modules = dashboard balancer
Where a module implements command line hooks, the commands are accessible as ordinary Ceph commands, and Ceph
automatically incorporates module commands into the standard CLI interface and routes them appropriately to the module:
You can use the following configuration parameters with the above command:
Currently the balancer module cannot be disabled. It can only be turned off to customize the configuration.
Modes
Edit online
There are currently two supported balancer modes:
crush-compat: The CRUSH compat mode uses the compat weight-set feature, introduced in Ceph Luminous, to manage an
alternative set of weights for devices in the CRUSH hierarchy. The normal weights should remain set to the size of the device
to reflect the target amount of data that you want to store on the device. The balancer then optimizes the weight-set
values, adjusting them up or down in small increments in order to achieve a distribution that matches the target distribution as
closely as possible. Because PG placement is a pseudorandom process, there is a natural amount of variation in the
placement; by optimizing the weights, the balancer counteracts that natural variation.
This mode is fully backwards compatible with older clients. When an OSDMap and CRUSH map are shared with older clients, the
balancer presents the optimized weights as the real weights.
The primary restriction of this mode is that the balancer cannot handle multiple CRUSH hierarchies with different placement rules if
the subtrees of the hierarchy share any OSDs. Because this configuration makes managing space utilization on the shared OSDs
difficult, it is generally not recommended. As such, this restriction is normally not an issue.
upmap: Starting with Luminous, the OSDMap can store explicit mappings for individual OSDs as exceptions to the normal
CRUSH placement calculation. These upmap entries provide fine-grained control over the PG mapping. This balancer mode
optimizes the placement of individual PGs in order to achieve a balanced distribution. In most cases, this distribution is
"perfect", with an equal number of PGs on each OSD +/-1 PG, as they might not divide evenly.
IMPORTANT:
To allow use of this feature, you must tell the cluster that it only needs to support luminous or later clients with the following command:
This command fails if any pre-luminous clients or daemons are connected to the monitors.
Due to a known issue, kernel CephFS clients report themselves as jewel clients. To work around this issue, use the --yes-i-really-mean-it flag:
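The commands referred to above, shown here for illustration:
[ceph: root@host01 /]# ceph osd set-require-min-compat-client luminous
[ceph: root@host01 /]# ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it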
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Example
or
Example
Status
The current status of the balancer can be checked at any time with:
Example
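For illustration:
[ceph: root@host01 /]# ceph balancer status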
Automatic balancing
Example
Example
This will use the crush-compat mode, which is backward compatible with older clients and will make small changes to the data
distribution over time to ensure that OSDs are equally utilized.
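For illustration, enabling automatic balancing and selecting this mode take the following form:
[ceph: root@host01 /]# ceph balancer on
[ceph: root@host01 /]# ceph balancer mode crush-compat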
Throttling
No adjustments will be made to the PG distribution if the cluster is degraded, for example, if an OSD has failed and the system has
not yet healed itself.
When the cluster is healthy, the balancer throttles its changes such that the percentage of PGs that are misplaced, or need to be
moved, is below a threshold of 5% by default. This percentage can be adjusted using the target_max_misplaced_ratio setting.
For example, to increase the threshold to 7%:
Example
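For illustration, the threshold can be raised with a setting of this form:
[ceph: root@host01 /]# ceph config set mgr target_max_misplaced_ratio .07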
Supervised optimization
1. Building a plan.
2. Evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result
after executing a plan.
Example
Syntax
Example
Example
Syntax
Example
Syntax
Example
Syntax
Example
To calculate the quality of the distribution that would result after executing a plan:
Syntax
Example
Syntax
NOTE: Only execute the plan if it is expected to improve the distribution. After execution, the plan will be discarded.
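For illustration, a typical supervised workflow, where my_plan is a placeholder plan name:
[ceph: root@host01 /]# ceph balancer eval
[ceph: root@host01 /]# ceph balancer optimize my_plan
[ceph: root@host01 /]# ceph balancer eval my_plan
[ceph: root@host01 /]# ceph balancer show my_plan
[ceph: root@host01 /]# ceph balancer execute my_plan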
NOTE: This module is not intended to be a robust monitoring solution. The fact that it is run as part of the Ceph cluster itself is
fundamentally limiting, in that a failure of the ceph-mgr daemon prevents alerts from being sent. This module can, however, be
useful for standalone clusters that exist in environments where no other monitoring infrastructure exists.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Example
Example
5. Optional: By default, the alerts module uses SSL and port 465.
Syntax
Example
Syntax
Example
7. Optional: By default, SMTP From name is Ceph. To change that, set the smtp_from_name parameter:
Syntax
Example
[ceph: root@host01 /]# ceph config set mgr mgr/alerts/smtp_from_name 'Ceph Cluster Test'
8. Optional: By default, the alerts module checks the storage cluster’s health every minute, and sends a message when there is a
change in the cluster health status. To change the frequency, set the interval parameter:
Syntax
Example
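For illustration, the check interval can be changed with a setting of this form; the value shown is a placeholder:
[ceph: root@host01 /]# ceph config set mgr mgr/alerts/interval "5m"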
Example
Reference
Edit online
See the Health messages of a Ceph cluster section in the IBM Storage Ceph Troubleshooting Guide for more information on
Ceph health messages.
You can use ceph-crash.service to submit these crashes automatically and persist them in the Ceph Monitors. The ceph-
crash.service watches the crashdump directory and uploads the crash dumps with ceph crash post.
The RECENT_CRASH health message is one of the most common health messages in a Ceph cluster. This health message means that
one or more Ceph daemons has crashed recently, and the crash has not yet been archived or acknowledged by the administrator.
This might indicate a software bug, a hardware problem like a failing disk, or some other problem. The option
mgr/crash/warn_recent_interval controls the time period of what recent means, which is two weeks by default. You can
disable the warnings by running the following command:
Example
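For illustration, the warning can be silenced with a setting of this form:
[ceph: root@host01 /]# ceph config set mgr mgr/crash/warn_recent_interval 0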
The option mgr/crash/retain_interval controls the period for which you want to retain the crash reports before they are
automatically purged. The default for this option is one year.
Prerequisites
Edit online
Procedure
Edit online
Example
2. Save a crash dump: The metadata file is a JSON blob stored in the crash directory as meta. You can invoke the ceph command
with the -i - option, which reads from stdin.
Example
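For illustration, posting a saved metadata file takes this form:
[ceph: root@host01 /]# ceph crash post -i meta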
3. List the timestamp or the UUID crash IDs for all the new and archived crash info:
Example
Example
5. List the timestamp or the UUID crash IDs for all the new crash information:
Example
Example
Syntax
Example
8. Remove saved crashes older than KEEP days: Here, KEEP must be an integer.
Syntax
Example
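For illustration, pruning crashes older than 60 days takes this form:
[ceph: root@host01 /]# ceph crash prune 60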
9. Archive a crash report so that it is no longer considered for the RECENT_CRASH health check and does not appear in the
crash ls-new output. It appears in the crash ls.
Syntax
Example
Example
Syntax
Example
Reference
Edit online
See the Health messages of a Ceph cluster section in the IBM Storage Ceph Troubleshooting Guide for more information on
Ceph health messages.
Management of OSDs
Edit online
As a storage administrator, you can use the Ceph Orchestrators to manage OSDs of an IBM Storage Ceph cluster.
Ceph OSDs
Ceph OSD node configuration
Automatically tuning OSD memory
Listing devices for Ceph OSD deployment
Zapping devices for Ceph OSD deployment
Deploying Ceph OSDs on all available devices
Deploying Ceph OSDs on specific devices and hosts
Advanced service specifications and filters for deploying OSDs
Deploying Ceph OSDs using advanced service specifications
Removing the OSD daemons
Replacing the OSDs
Replacing the OSDs with pre-created LVM
Replacing the OSDs in a non-colocated scenario
Ceph OSDs
Edit online
A Ceph OSD generally consists of one ceph-osd daemon for one storage drive and its associated journal within a node. If a node has
multiple storage drives, then map one ceph-osd daemon for each drive.
IBM recommends checking the capacity of a cluster regularly to see if it is reaching the upper end of its storage capacity. As a
storage cluster reaches its near full ratio, add one or more OSDs to expand the storage cluster’s capacity.
If the node has multiple storage drives, you might also need to remove one of the ceph-osd daemons for that drive. Generally, it’s a
good idea to check the capacity of the storage cluster to see if you are reaching the upper end of its capacity. Ensure that when you
remove an OSD that the storage cluster is not at its near full ratio.
IMPORTANT: Do not let a storage cluster reach the full ratio before adding an OSD. OSD failures that occur after the storage
cluster reaches the near full ratio can cause the storage cluster to exceed the full ratio. Ceph blocks write access to protect the
data until you resolve the storage capacity issues. Do not remove OSDs without considering the impact on the full ratio first.
If you add drives of dissimilar size, adjust their weights accordingly. When you add the OSD to the CRUSH map, consider the weight
for the new OSD. Hard drive capacity grows approximately 40% per year, so newer OSD nodes might have larger hard drives than
older nodes in the storage cluster, that is, they might have a greater weight.
Syntax
Cephadm starts with the fraction mgr/cephadm/autotune_memory_target_ratio, which defaults to 0.7 of the total RAM in the
system, subtracts any memory consumed by non-autotuned daemons (non-OSD daemons, and OSDs for which
osd_memory_target_autotune is false), and then divides the remaining memory by the number of OSDs.
By default, autotune_memory_target_ratio is 0.2 for hyper-converged infrastructure and 0.7 for other environments.
Syntax
Alertmanager: 1 GB
Ceph Manager: 4 GB
Ceph Monitor: 2 GB
Node-exporter: 1 GB
Prometheus: 1 GB
For example, if a node has 24 OSDs and has 251 GB RAM space, then osd_memory_target is 7860684936.
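As a worked illustration of that figure, assuming 251 GiB of RAM and no memory reserved for other daemons on that node:
0.7 × 251 GiB = 0.7 × 269,509,197,824 bytes ≈ 188,656,438,476 bytes, and 188,656,438,476 ÷ 24 OSDs ≈ 7,860,684,936 bytes per OSD.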
The final targets are reflected in the configuration database with options. You can view the limits and the current memory consumed
by each daemon from the ceph orch ps output under MEM LIMIT column.
NOTE: In a hyperconverged infrastructure, the autotune_memory_target_ratio can be set to 0.2 to reduce the memory
consumption of Ceph.
Example
You can manually set a specific memory target for an OSD in the storage cluster.
Example
You can manually set a specific memory target for an OSD host in the storage cluster.
Syntax
Example
NOTE: Enabling osd_memory_target_autotune overwrites existing manual OSD memory target settings. To prevent daemon
memory from being tuned even when the osd_memory_target_autotune option or other similar options are enabled, set the
_no_autotune_memory label on the host.
Syntax
You can exclude an OSD from memory autotuning by disabling the autotune option and setting a specific memory target.
Example
NOTE: Ceph will not provision an OSD on a device that is not available.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Using the --wide option provides all details relating to the device, including any reasons that the device might not be eligible
for use as an OSD. This option does not support NVMe devices.
3. Optional: To enable Health, Ident, and Fault fields in the output of ceph orch device ls, run the following commands:
NOTE: These fields are supported by the libstoragemgmt library, which currently supports SCSI, SAS, and SATA devices.
a. As root user outside the Cephadm shell, check your hardware’s compatibility with libstoragemgmt library to avoid
unplanned interruption to services:
Example
In the output, you see the Health Status as Good with the respective SCSI VPD 0x83 ID.
NOTE: If you do not get this information, then enabling the fields might cause erratic behavior of devices.
b. Log back into the Cephadm shell and enable libstoragemgmt support:
Example
Once this is enabled, ceph orch device ls gives the output of Health field as Good.
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch device zap host02 /dev/sdb --force
Verification
Edit online
Example
Reference
Edit online
To deploy OSDs on all available devices, run the command without the unmanaged parameter, and then re-run the command with the
parameter to prevent the creation of future OSDs.
NOTE: The deployment of OSDs with --all-available-devices is generally used for smaller clusters. For larger clusters, use
the OSD specification file.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Example
The effect of ceph orch apply is persistent which means that the Orchestrator automatically finds the device, adds it to the
cluster, and creates new OSDs. This occurs under the following conditions:
You can disable automatic creation of OSDs on all the available devices by using the --unmanaged parameter.
Example
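For illustration, disabling automatic OSD creation on all available devices takes this form:
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true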
NOTE: The command ceph orch daemon add creates new OSDs, but does not add an OSD service.
Verification
Edit online
Example
Example
Reference
Edit online
See the Listing devices for Ceph OSD deployment section in the IBM Storage Ceph Operations Guide.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Example
To deploy OSDs on a raw physical device, without an LVM layer, use the --method raw option.
Syntax
Example
[ceph: root@host01 /]# ceph orch daemon add osd --method raw host02:/dev/sdb
NOTE: If you have separate DB or WAL devices, the ratio of block to DB or WAL devices MUST be 1:1.
Verification
Edit online
Example
Example
Syntax
Example
Reference
Edit online
See the Listing devices for Ceph OSD deployment section in the IBM Storage Ceph Operations Guide.
service_id: Use the service name or identification you prefer. A set of OSDs is created using the specification file. This name is
used to manage all the OSDs together and represent an Orchestrator service.
placement: This is used to define the hosts on which the OSDs need to be deployed.
label: osd_host - A label used in the hosts where OSD need to be deployed.
hosts: host01, host02 - An explicit list of host names where OSDs needs to be deployed.
selection of devices: The devices on which the OSDs are created. This makes it possible to separate an OSD across different devices. You
can create only BlueStore OSDs, which have three components:
data_devices: Define the devices to deploy OSD. In this case, OSDs are created in a collocated schema. You can use filters to
select devices and folders.
wal_devices: Define the devices used for WAL OSDs. You can use filters to select devices and folders.
db_devices: Define the devices for DB OSDs. You can use the filters to select devices and folders.
encrypted: An optional parameter to encrypt information on the OSD, which can be set to either True or False.
unmanaged: An optional parameter, set to False by default. You can set it to True if you do not want the Orchestrator to
manage the OSD service.
osds_per_device: User-defined value for deploying more than one OSD per device.
method: An optional parameter to specify if an OSD is created with an LVM layer or not. Set to raw if you want to create OSDs
on raw physical devices that do not include an LVM layer. If you have separate DB or WAL devices, the ratio of block to DB or
WAL devices MUST be 1:1.
Filters are used in conjunction with the data_devices, wal_devices and db_devices parameters.
NOTE: The devices used for deploying OSDs must be supported by libstoragemgmt.
Reference
Edit online
See the Deploying Ceph OSDs using the advanced specifications section in the IBM Storage Ceph Operations Guide.
For more information on libstoragemgmt, see the Listing devices for Ceph OSD deployment section in the IBM Storage Ceph
Operations Guide.
You can deploy the OSD for each device and each host by defining a yaml file or a json file.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
service_type: osd
service_id: SERVICE_ID
placement:
host_pattern: '*' # optional
data_devices: # optional
model: DISK_MODEL_NAME # optional
paths:
- /DEVICE_PATH
osds_per_device: NUMBER_OF_DEVICES # optional
db_devices: # optional
size: # optional
all: true # optional
paths:
- /DEVICE_PATH
encrypted: true
a. Simple scenarios: In these cases, all the nodes have the same set-up.
service_type: osd
service_id: osd_spec_default
placement:
host_pattern: '*'
data_devices:
all: true
paths:
- /dev/sdb
encrypted: true
Example
service_type: osd
service_id: osd_spec_default
placement:
host_pattern: '*'
data_devices:
size: '80G'
db_devices:
size: '40G:'
paths:
- /dev/sdc
b. Simple scenario: In this case, all the nodes have the same setup with OSD devices created in raw mode, without an LVM
layer.
Example
service_type: osd
service_id: all-available-devices
encrypted: "true"
method: raw
placement:
host_pattern: "*"
data_devices:
all: "true"
c. Advanced scenario: This creates the desired layout by using all HDDs as data_devices, with two SSDs assigned
as dedicated DB or WAL devices. The remaining SSDs are data_devices that have the NVMe vendors assigned as
dedicated DB or WAL devices.
Example
service_type: osd
service_id: osd_spec_hdd
placement:
host_pattern: '*'
data_devices:
rotational: 1
db_devices:
model: Model-name
limit: 2
---
service_type: osd
service_id: osd_spec_ssd
placement:
host_pattern: '*'
data_devices:
model: Model-name
db_devices:
vendor: Vendor-name
d. Advanced scenario with non-uniform nodes: This applies different OSD specs to different hosts depending on the
host_pattern key.
Example
service_type: osd
service_id: osd_spec_node_one_to_five
placement:
host_pattern: 'node[1-5]'
data_devices:
rotational: 1
Example
service_type: osd
service_id: osd_using_paths
placement:
hosts:
- host01
- host02
data_devices:
paths:
- /dev/sdb
db_devices:
paths:
- /dev/sdc
wal_devices:
paths:
- /dev/sdd
Example
service_type: osd
service_id: multiple_osds
placement:
hosts:
- host01
- host02
osds_per_device: 4
data_devices:
paths:
- /dev/sdb
g. For pre-created volumes, edit the osd_spec.yaml file to include the following details:
Syntax
service_type: osd
service_id: SERVICE_ID
placement:
hosts:
- HOSTNAME
data_devices: # optional
model: DISK_MODEL_NAME # optional
paths:
- /DEVICE_PATH
db_devices: # optional
size: # optional
all: true # optional
paths:
- /DEVICE_PATH
Example
service_type: osd
service_id: osd_spec
placement:
hosts:
- machine1
data_devices:
paths:
Example
Example
NOTE: This step gives a preview of the deployment, without deploying the daemons.
Example
Syntax
Example
Verification
Edit online
Example
Example
Reference
Edit online
See the Advanced service specifications and filters for deploying OSDs section in the IBM Storage Ceph Operations Guide.
Ceph Monitor, Ceph Manager and Ceph OSD daemons are deployed on the storage cluster.
Procedure
Edit online
Example
2. Check the device and the node from which the OSD has to be removed:
Example
Syntax
Example
NOTE: If you remove the OSD from the storage cluster without an option, such as --replace, the device is removed from the
storage cluster completely. If you want to use the same device for deploying OSDs, you have to first zap the device before
adding it to the storage cluster.
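A hedged sketch of removing one OSD and checking the removal status (the OSD ID is illustrative):
[ceph: root@host01 /]# ceph orch osd rm 0
[ceph: root@host01 /]# ceph orch osd rm status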
4. Optional: To remove multiple OSDs from a specific node, run the following command:
Syntax
Example
Example
When no PGs are left on the OSD, it is decommissioned and removed from the cluster.
Verification
Edit online
Verify the details of the devices and the nodes from which the Ceph OSDs are removed:
Example
Reference
Edit online
See the Deploying Ceph OSDs on all available devices section in the IBM Storage Ceph Operations Guide for more information.
See the Deploying Ceph OSDs on specific devices and hosts section in the IBM Storage Ceph Operations Guide for more
information.
See the Zapping devices for Ceph OSD deployment section in the IBM Storage Ceph Operations Guide for more information on
clearing space on devices.
You can replace the OSDs from the cluster by preserving the OSD ID using the ceph orch rm command.
NOTE: If you want to replace a single OSD, see Deploying Ceph OSDs on specific devices and hosts. If you want to deploy OSDs on all
available devices, see Deploying Ceph OSDs on all available devices.
The OSD is not permanently removed from the CRUSH hierarchy, but is assigned the destroyed flag. This flag is used to determine
which OSD IDs can be reused in the next OSD deployment.
If you use an OSD specification for deployment, the newly added disks are assigned the OSD IDs of their replaced counterparts.
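A hedged sketch of replacing an OSD while preserving its ID (the OSD ID is illustrative):
[ceph: root@host01 /]# ceph orch osd rm 0 --replace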
Prerequisites
Edit online
Monitor, Manager, and OSD daemons are deployed on the storage cluster.
Procedure
Edit online
Example
2. Check the device and the node from which the OSD has to be replaced:
Example
IMPORTANT: If the storage cluster has health_warn or other errors associated with it, check the cluster health and try to fix any
errors before replacing the OSD to avoid data loss.
Syntax
Example
Example
Verification
Edit online
Verify the details of the devices and the nodes from which the Ceph OSDs are replaced:
Example
You will see an OSD with the same id as the one you replaced running on the same host.
Reference
Edit online
See the Deploying Ceph OSDs on all available devices section in the IBM Storage Ceph Operations Guide for more information.
See the Deploying Ceph OSDs on specific devices and hosts section in the IBM Storage Ceph Operations Guide for more
information.
Prerequisites
Edit online
Failed OSD
Procedure
Edit online
Example
Syntax
Example
Example
Syntax
Example
Zapping: /dev/vg1/data-lv2
Closing encrypted path /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
Running command: /usr/sbin/cryptsetup remove /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
Running command: /usr/bin/dd if=/dev/zero of=/dev/vg1/data-lv2 bs=1M count=10 conv=fsync
stderr: 10+0 records in
10+0 records out
stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.034742 s, 302 MB/s
Zapping successful for OSD: 8
Example
6. Recreate the OSD with a specification file corresponding to that specific OSD topology:
Example
Example
Example
Prerequisites
Edit online
Failed OSD
Procedure
Edit online
Example
NAME                                                                                                  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda                                                                                                     8:0    0  20G  0 disk
├─sda1                                                                                                  8:1    0   1G  0 part /boot
└─sda2                                                                                                  8:2    0  19G  0 part
  ├─rhel-root                                                                                         253:0    0  17G  0 lvm  /
  └─rhel-swap                                                                                         253:1    0   2G  0 lvm  [SWAP]
sdb                                                                                                     8:16   0  10G  0 disk
└─ceph--5726d3e9--4fdb--4eda--b56a--3e0df88d663f-osd--block--3ceb89ec--87ef--46b4--99c6--2a56bac09ff0 253:2    0  10G  0 lvm
sdc
Example
Example
[db] /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-0dfe6eca-ba58-438a-9510-d96e6814d853
[db] /dev/ceph-d7064874-66cb-4a77-a7c2-8aa0b0125c3c/osd-db-26b70c30-8817-45de-8843-4c0932ad2429
4. In the osds.yaml file, set the unmanaged parameter to true; otherwise, cephadm redeploys the OSDs:
Example
Example
Example
7. Remove the OSDs. Ensure that you use the --zap option to remove the backend services and the --replace option to retain the
OSD IDs:
Example
Example
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP META   AVAIL  %USE  VAR  PGS STATUS    TYPE NAME
-5       0.04877        - 55 GiB 15 GiB  4.1 MiB 0 B  60 MiB 40 GiB 27.27 1.17   -           host02
 2   hdd 0.01219  1.00000 15 GiB 5.0 GiB 996 KiB 0 B  15 MiB 10 GiB 33.33 1.43   0 destroyed osd.2
9. Edit the osds.yaml specification file to change the unmanaged parameter to false and replace the path to the DB device if it
changed after the device was physically replaced:
Example
IMPORTANT: If you use the same host specification file to replace the faulty DB device on a single OSD node, modify the
host_pattern option to specify only that OSD node; otherwise, the deployment fails and you cannot find the new DB device on
other hosts.
10. Reapply the specification file with the --dry-run option to verify that the OSDs will be deployed with the new DB device:
Example
Example
Example
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
Verification
Edit online
1. From the OSD host where the OSDs are redeployed, verify that they are on the new DB device:
Example
[db] /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-1998a02e-5e67-42a9-b057-e02c22bbf461
[db] /dev/ceph-15ce813a-8a4c-46d9-ad99-7e0845baf15e/osd-db-6c154191-846d-4e63-8c57-fc4b99e182bd
If the OSD is in the process of removal, then you cannot stop the process.
Procedure
Edit online
Example
2. Check the device and the node from which the OSD was initiated to be removed:
Example
Syntax
Example
Example
Verify the details of the devices and the nodes from which the Ceph OSDs were queued for removal:
Example
Reference
Edit online
See the Removing the OSD daemons section in the IBM Storage Ceph Operations Guide for more information.
Prerequisites
Monitor, Manager and OSD daemons are deployed on the storage cluster.
Procedure
Edit online
Example
2. After the operating system of the host is reinstalled, activate the OSDs:
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Example
2. Watch as the placement group states change from active+clean to active, some degraded objects, and finally
active+clean when migration completes.
When defining a pool, the number of placement groups determines the granularity with which data is spread across all available
OSDs. The higher the number, the better the equalization of the capacity load. However, because placement groups also have to be
handled when data is reconstructed, the number must be chosen carefully upfront. To support the calculation, a tool is available.
During the lifetime of a storage cluster, a pool may grow above the initially anticipated limits. With a growing number of drives, a
recalculation is recommended. The number of placement groups per OSD should be around 100. When adding more OSDs to the
storage cluster, the number of PGs per OSD lowers over time. Starting with 120 drives in the storage cluster and setting the pg_num
of the pool to 4000 results in 100 PGs per OSD, given a replication factor of three: (4000 PGs x 3 replicas) / 120 OSDs = 100 PGs per
OSD. Over time, when growing to ten times the number of OSDs, the number of PGs per OSD goes down to only ten. Because a small
number of PGs per OSD tends to result in unevenly distributed capacity, consider adjusting the PGs per pool.
Adjusting the number of placement groups can be done online. Recalculating is not only a recalculation of the PG numbers, but also
involves data relocation, which is a lengthy process. However, data availability is maintained at all times.
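A hedged sketch of such an online adjustment (the pool name and PG count are illustrative):
[ceph: root@host01 /]# ceph osd pool set data pg_num 4096
[ceph: root@host01 /]# ceph osd pool set data pgp_num 4096
Setting pgp_num to the same value triggers the actual data relocation.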
Very high numbers of PGs per OSD should be avoided, because reconstruction of all PGs on a failed OSD starts at once. A high
number of IOPS is required to perform reconstruction in a timely manner, which might not be available. This leads to deep I/O
queues and high latency, rendering the storage cluster unusable, or results in long healing times.
Reference
Edit online
See the PG calculator for calculating the values by a given use case.
See the Erasure Code Pools chapter in the IBM Storage Ceph Strategies Guide for more information.
NOTE: IBM Storage Ceph 5.3 does not support custom images for deploying monitoring services such as Prometheus, Grafana,
Alertmanager, and node-exporter.
The Prometheus configuration, including scrape targets, such as metrics providing daemons, is set up automatically by Cephadm.
Cephadm also deploys a list of default alerts, for example, health error, 10% OSDs down, or pgs inactive.
Alertmanager handles alerts sent by the Prometheus server. It deduplicates, groups, and routes the alerts to the correct
receiver. By default, the Ceph dashboard is automatically configured as the receiver. Alerts can be silenced using the
Alertmanager, but silences can also be managed using the Ceph Dashboard.
Grafana is a visualization and alerting software. The alerting functionality of Grafana is not used by this monitoring stack. For
alerting, the Alertmanager is used.
By default, traffic to Grafana is encrypted with TLS. You can either supply your own TLS certificate or use a self-signed one. If no
custom certificate has been configured before Grafana has been deployed, then a self-signed certificate is automatically created and
configured for Grafana. Custom certificates for Grafana can be configured using the following commands:
Syntax
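The exact syntax is not reproduced here; a hedged sketch, assuming the cephadm configuration keys for Grafana and certificate files named cert.pem and key.pem:
[ceph: root@host01 /]# ceph config-key set mgr/cephadm/grafana_crt -i cert.pem
[ceph: root@host01 /]# ceph config-key set mgr/cephadm/grafana_key -i key.pem
[ceph: root@host01 /]# ceph orch reconfig grafana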
Node exporter is an exporter for Prometheus which provides data about the node on which it is installed. It is recommended to
install the node exporter on all nodes. This can be done using the monitoring.yml file with the node-exporter service type.
You can deploy the monitoring stack using the service specification in YAML file format. All the monitoring services can have the
network and port they bind to configured in the yml file.
Prerequisites
Edit online
Procedure
Edit online
1. Enable the prometheus module in the Ceph Manager daemon. This exposes the internal Ceph metrics so that Prometheus can
read them:
Example
IMPORTANT: Ensure this command is run before Prometheus is deployed. If the command was not run before the
deployment, you must redeploy Prometheus to update the configuration:
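A hedged sketch of both steps, enabling the module and redeploying the service if needed:
[ceph: root@host01 /]# ceph mgr module enable prometheus
[ceph: root@host01 /]# ceph orch redeploy prometheus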
cd /var/lib/ceph/DAEMON_PATH/
Example
Example
4. Edit the specification file with a content similar to the following example:
Example
service_type: prometheus
service_name: prometheus
placement:
hosts:
- host01
networks:
- 192.169.142.0/24
---
service_type: node-exporter
---
service_type: alertmanager
service_name: alertmanager
placement:
hosts:
- host01
networks:
- 192.169.142.0/24
---
service_type: grafana
service_name: grafana
placement:
hosts:
- host01
networks:
- 192.169.142.0/24
NOTE: Ensure the monitoring stack components alertmanager, prometheus, and grafana are deployed on the same
host. The node-exporter component should be deployed on all the hosts.
Example
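A hedged sketch of applying the specification, assuming the file is named monitoring.yml and is mounted in the Cephadm shell:
[ceph: root@host01 /]# ceph orch apply -i monitoring.yml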
Verification
Edit online
Example
Syntax
Example
IMPORTANT: Prometheus, Grafana, and the Ceph dashboard are all automatically configured to talk to each other, resulting in a fully
functional Grafana integration in the Ceph dashboard.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Example
Verification
Edit online
Example
Syntax
ceph orch ps
Example
See Deploying the monitoring stack section in the IBM Storage Ceph Operations Guide for more information.
Prerequisites
Edit online
Prerequisites
Edit online
Procedure
Edit online
1. On the node where you want to set up the files, create a directory ceph in the /etc folder:
Example
Example
The contents of this file should be installed in /etc/ceph/ceph.conf path. You can use this configuration file to reach the
Ceph monitors.
Prerequisites
Edit online
Procedure
Edit online
1. On the node where you want to set up the keyring, create a directory ceph in the /etc folder:
Example
Example
Syntax
Example
Example
[client.fs]
key = AQAvoH5gkUCsExAATz3xCBLd4n6B6jRv+Z7CVQ==
The resulting output should be put into a keyring file, for example /etc/ceph/ceph.keyring.
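A hedged sketch of printing such a keyring entry for the client.fs user shown above:
[ceph: root@host01 /]# ceph auth get client.fs
Copy the output into /etc/ceph/ceph.keyring on the client node.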
Prerequisites
Edit online
NOTE: Ensure you have at least two pools, one for Ceph file system (CephFS) data and one for CephFS metadata.
Prerequisites
Edit online
Procedure
Edit online
Example
2. There are two ways of deploying MDS daemons using placement specification:
Method 1
Use ceph fs volume to create the MDS daemons. This creates the CephFS volume and pools associated with the CephFS,
and also starts the MDS service on the hosts.
Syntax
[ceph: root@host01 /]# ceph fs volume create test --placement="2 host01 host02"
Method 2
Create the pools, CephFS, and then deploy MDS service using placement specification:
Syntax
Example
Typically, the metadata pool can start with a conservative number of Placement Groups (PGs) as it generally has far
fewer objects than the data pool. It is possible to increase the number of PGs if needed. Typical pool sizes range from 64
PGs to 512 PGs. Size the data pool proportionally to the number and sizes of files you expect in the file system.
For the metadata pool, consider using:
* A higher replication level, because any data loss to this pool can make the whole file
system inaccessible.
* Storage with lower latency, such as Solid-State Drive (SSD) disks, because this directly
affects the observed latency of file system operations on clients.
2. Create the file system for the data pools and metadata pools:
Syntax
Example
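A hedged sketch of creating the file system from existing pools; the file system and pool names are illustrative and assume the pools were created earlier:
[ceph: root@host01 /]# ceph fs new test cephfs_metadata cephfs_data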
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mds test --placement="2 host01 host02"
Verification
Edit online
Example
Example
Syntax
Example
NOTE: Ensure you have at least two pools, one for the Ceph File System (CephFS) data and one for the CephFS metadata.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
service_type: mds
service_id: FILESYSTEM_NAME
placement:
hosts:
- HOST_NAME_1
- HOST_NAME_2
- HOST_NAME_3
Example
service_type: mds
service_id: fs_name
placement:
hosts:
- host01
- host02
Example
Example
Example
Syntax
Example
8. Once the MDS service is deployed and functional, create the CephFS:
Syntax
Example
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
There are two ways of removing MDS daemons from the cluster:
Method 1
Example
Example
Syntax
Example
This command removes the file system and its data and metadata pools. It also tries to remove the MDS daemons by using the
enabled ceph-mgr Orchestrator module.
Method 2
Use the ceph orch rm command to remove the MDS service from the entire cluster:
Example
Syntax
Example
Verification
Edit online
Syntax
ceph orch ps
Example
Reference
Edit online
See Deploying the MDS service using the command line interface section in the IBM Storage Ceph Operations Guide for more
information.
See Deploying the MDS service using the service specification section in the IBM Storage Ceph Operations Guide for more
information.
You can also configure multisite object gateways, and remove the Ceph object gateway.
Cephadm deploys Ceph object gateway as a collection of daemons that manages a single-cluster deployment or a particular realm
and zone in a multisite deployment.
NOTE: With Cephadm, the object gateway daemons are configured using the monitor configuration database instead of a
ceph.conf file or the command line. If that configuration is not already in the client.rgw section, then the object gateway daemons
start up with default settings and bind to port 80.
NOTE: The .default.rgw.buckets.index pool is created only after the bucket is created in Ceph Object Gateway, while the
.default.rgw.buckets.data pool is created after the data is uploaded to the bucket.
Deploying the Ceph Object Gateway using the command line interface
Deploying the Ceph Object Gateway using the service specification
Deploying a multi-site Ceph Object Gateway
Removing the Ceph Object Gateway
Prerequisites
Edit online
All the managers, monitors, and OSDs are deployed in the storage cluster.
Prerequisites
Edit online
Log in to the Cephadm shell by using the cephadm shell command to deploy Ceph Object Gateway daemons.
Procedure
Edit online
1. You can deploy the Ceph object gateway daemons in three different ways:
Method 1:
Create the realm, zone group, and zone, and then use the placement specification with the host name:
1. Create a realm:
Syntax
Example
Syntax
Example
3. Create a zone:
Syntax
Example
Syntax
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch apply rgw test --realm=test_realm --zone=test_zone --placement="2 host01 host02"
Method 2:
Use an arbitrary service name to deploy two Ceph Object Gateway daemons for a single cluster deployment:
Syntax
Example
Method 3:
Syntax
NUMBER_OF_DAEMONS controls the number of Ceph object gateways deployed on each host. To achieve the highest
performance without incurring an additional cost, set this value to 2.
Example
[ceph: root@host01 /]# ceph orch host label add host01 rgw # the 'rgw' label can be anything
[ceph: root@host01 /]# ceph orch host label add host02 rgw
[ceph: root@host01 /]# ceph orch apply rgw foo "--placement=label:rgw count-per-host:2" --port=8000
Verification
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Example
2. Edit the radosgw.yml file to include the following details for the default realm, zone, and zone group:
Syntax
service_type: rgw
service_id: REALM_NAME.ZONE_NAME
placement:
hosts:
- HOST_NAME_1
- HOST_NAME_2
count-per-host: NUMBER_OF_DAEMONS
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
rgw_frontend_port: FRONT_END_PORT
networks:
- NETWORK_CIDR # Ceph Object Gateway service binds to a specific network
NOTE: NUMBER_OF_DAEMONS controls the number of Ceph Object Gateways deployed on each host. To achieve the highest
performance without incurring an additional cost, set this value to 2.
Example
service_type: rgw
service_id: default
placement:
hosts:
- host01
- host02
- host03
count-per-host: 2
spec:
rgw_realm: default
rgw_zone: default
rgw_frontend_port: 1234
networks:
- 192.169.142.0/24
3. Optional: For custom realm, zone, and zone group, create the resources and then create the radosgw.yml file:
Example
Example
service_type: rgw
service_id: test_realm.test_zone
placement:
hosts:
- host01
- host02
- host03
count-per-host: 2
spec:
rgw_realm: test_realm
rgw_zone: test_zone
rgw_frontend_port: 1234
Example
NOTE: Every time you exit the shell, you have to mount the file in the container before deploying the daemon.
Syntax
Example
Verification
Edit online
Example
Syntax
Example
You can configure each object gateway to work in an active-active zone configuration allowing writes to a non-primary zone. The
multi-site configuration is stored within a container called a realm.
The realm stores zone groups, zones, and a time period. The rgw daemons handle the synchronization, eliminating the need for a
separate synchronization agent, and thereby operate in an active-active configuration.
You can also deploy multi-site zones using the command line interface (CLI).
NOTE: The following configuration assumes at least two IBM Storage Ceph clusters are in geographically separate locations.
However, the configuration also works on the same site.
Prerequisites
Edit online
At least two Ceph Object Gateway instances, one for each IBM Storage Ceph cluster.
Procedure
Edit online
a. Create a realm:
Syntax
Example
If the storage cluster has a single realm, then specify the --default flag.
Syntax
Example
Syntax
Example
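The commands for steps a through c are not reproduced above; a hedged sketch using the example names that appear elsewhere in this guide (test_realm, us, us-east-1):
[ceph: root@host01 /]# radosgw-admin realm create --rgw-realm=test_realm --default
[ceph: root@host01 /]# radosgw-admin zonegroup create --rgw-zonegroup=us --master --default
[ceph: root@host01 /]# radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-1 --master --default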
d. Optional: Delete the default zone, zone group, and the associated pools.
IMPORTANT: Do not delete the default zone and its pools if you are using the default zone and zone group to store
data. Also, removing the default zone group deletes the system user.
To access old data in the default zone and zonegroup, use --rgw-zone default and --rgw-zonegroup
default in radosgw-admin commands.
Example
Syntax
Example
f. Add the access key and system key to the primary zone:
Syntax
Example
Syntax
Example
h. Outside the cephadm shell, fetch the FSID of the storage cluster and the processes:
Example
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
IMPORTANT: Do not delete the default zone and its pools if you are using the default zone and zone group to store
data.
To access old data in the default zone and zonegroup, use --rgw-zone default and --rgw-zonegroup
default in radosgw-admin commands.
Example
Syntax
Example
Syntax
Example
g. Outside the Cephadm shell, fetch the FSID of the storage cluster and the processes:
Example
Syntax
3. Optional: Deploy multi-site Ceph Object Gateways using the placement specification:
Syntax
Example
[ceph: root@host04 /]# ceph orch apply rgw east --realm=test_realm --zone=us-east-1 --placement="2 host01 host02"
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Syntax
Example
Verification
Edit online
Syntax
ceph orch ps
Example
Reference
Edit online
See Deploying the Ceph object gateway using the command line interface section in the IBM Storage Ceph Operations Guide for
more information.
See Deploying the Ceph object gateway using the service specification section in the IBM Storage Ceph Operations Guide for
more information.
The IBM Storage Ceph SNMP gateway service deploys one instance of the gateway by default. You can increase this by providing
placement information. However, if you enable multiple SNMP gateway daemons, your SNMP management platform receives
multiple notifications for the same event.
SNMP traps are alert messages, and the Prometheus Alertmanager sends these alerts to the SNMP notifier, which then looks for the
object identifier (OID) in the given alerts' labels. Each SNMP trap has a unique ID, which allows it to send additional traps with
updated status to a given SNMP poller. SNMP hooks into the Ceph health checks so that every health warning generates a specific
SNMP trap.
To work correctly and transfer information on device status to the user for monitoring, SNMP relies on several components.
There are four main components that make up SNMP:
SNMP Agent - An SNMP agent is a program that runs on a system to be managed and contains the MIB database for the
system. The agent collects data, such as bandwidth and disk space, aggregates it, and sends it to the management information
base (MIB).
Management information base (MIB) - These are components contained within the SNMP agents. The SNMP manager uses
this as a database and asks the agent for access to particular information. This information is needed for the network
management systems (NMS). The NMS polls the agent to take information from these files and then proceeds to translate it
into graphs and displays that can be viewed by the user. MIBs contain statistical and control values that are determined by the
network device.
SNMP Devices
The following versions of SNMP are compatible and supported for gateway implementation:
V2c - Uses a community string without any authentication and is vulnerable to outside attacks.
V3 authPriv - Uses the username and password authentication with encryption to the SNMP management platform.
IMPORTANT: When using SNMP traps, ensure that you have the correct security configuration for your version number to minimize
the vulnerabilities that are inherent to SNMP and keep your network protected from unauthorized users.
Configuring snmptrapd
Edit online
It is important to configure the simple network management protocol (SNMP) target before deploying the snmp-gateway because
the snmptrapd daemon contains the auth settings that you need to specify when creating the snmp-gateway service.
The SNMP gateway feature provides a means of exposing the alerts that are generated in the Prometheus stack to an SNMP
management platform. You can configure the SNMP traps to the destination based on the snmptrapd tool. This tool allows you to
establish one or more SNMP trap listeners.
The engine-id is a unique identifier for the device, in hex, and is required for the SNMPV3 gateway. IBM recommends using
8000C53F_CLUSTER_FSID_WITHOUT_DASHES_ for this parameter.
The snmp-community, which is the SNMP_COMMUNITY_FOR_SNMPV2 parameter, is public for the SNMPV2c gateway.
The auth-protocol, which is the AUTH_PROTOCOL, is mandatory for the SNMPV3 gateway and is SHA by default.
The SNMP_V3_AUTH_USER_NAME is the user name and is mandatory for the SNMPV3 gateway.
Prerequisites
Edit online
Example
Example
3. Implement the management information base (MIB) to make sense of the SNMP notification and enhance SNMP support on
the destination host. Copy the raw file from the main repository:
https://github.com/ceph/ceph/blob/master/monitoring/snmp/CEPH-MIB.txt
Example
Example
5. Create the configuration files in snmptrapd directory for each protocol based on the SNMP version:
Syntax
Example
The public setting here must match the snmp_community setting used when deploying the snmp-gateway service.
For SNMPV3 with authentication only, create the snmptrapd_auth.conf file as follows:
Example
snmp_v3_auth_username: myuser
snmp_v3_auth_password: mypassword
For SNMPV3 with authentication and encryption, create the snmptrapd_authpriv.conf file as follows:
Example
Example
snmp_v3_auth_username: myuser
snmp_v3_auth_password: mypassword
snmp_v3_priv_password: mysecret
Syntax
Example
7. If any alert is triggered on the storage cluster, you can monitor the output on the SNMP management host. Verify the SNMP
traps and also the traps decoded by MIB.
Example
Reference
Edit online
See the Deploying the SNMP gateway section in the IBM Storage Ceph Operations Guide.
2. By creating one service configuration yaml file with all the details.
You can use the following parameters to deploy the SNMP gateway based on the versions:
The engine-id is a unique identifier for the device, in hex, and is required for the SNMPV3 gateway. IBM recommends using
8000C53F_CLUSTER_FSID_WITHOUT_DASHES_ for this parameter.
The privacy-protocol is mandatory for the SNMPV3 gateway with authentication and encryption.
You must provide a -i FILENAME option to pass the secrets and passwords to the orchestrator.
Once the SNMP gateway service is deployed or updated, the Prometheus Alertmanager configuration is automatically updated to
forward any alert that has an object identifier to the SNMP gateway daemon for further processing.
Prerequisites
Edit online
Configuring snmptrapd on the destination host, which is the SNMP management host.
Procedure
Edit online
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch host label add host02 snmp-gateway
3. Create a credentials file or a service configuration file based on the SNMP version:
Example
snmp_community: public
OR
Example
service_type: snmp-gateway
service_name: snmp-gateway
placement:
count: 1
spec:
credentials:
snmp_community: public
port: 9464
snmp_destination: 192.168.122.73:162
snmp_version: V2c
Example
snmp_v3_auth_username: myuser
snmp_v3_auth_password: mypassword
OR
Example
service_type: snmp-gateway
service_name: snmp-gateway
placement:
count: 1
spec:
credentials:
snmp_v3_auth_password: mypassword
snmp_v3_auth_username: myuser
engine_id: 8000C53Ff64f341c655d11eb8778fa163e914bcc
port: 9464
snmp_destination: 192.168.122.1:162
snmp_version: V3
For SNMPV3 with authentication and encryption, create the file as follows:
Example
snmp_v3_auth_username: myuser
snmp_v3_auth_password: mypassword
snmp_v3_priv_password: mysecret
Example
service_type: snmp-gateway
service_name: snmp-gateway
placement:
count: 1
spec:
credentials:
snmp_v3_auth_password: mypassword
snmp_v3_auth_username: myuser
snmp_v3_priv_password: mysecret
engine_id: 8000C53Ff64f341c655d11eb8778fa163e914bcc
port: 9464
snmp_destination: 192.168.122.1:162
snmp_version: V3
Syntax
OR Syntax
For SNMPV2c, with the snmp_creds file, run the ceph orch command with the snmp-version as V2c:
Example
For SNMPV3 with authentication only, with the snmp_creds file, run the ceph orch command with the snmp-version as
V3 and engine-id:
Example
For SNMPV3 with authentication and encryption, with the snmp_creds file, run the ceph orch command with the
snmp-version as V3, privacy-protocol, and engine-id:
Example
OR
For all the SNMP versions, with the snmp-gateway file, run the following command:
Example
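A hedged sketch, assuming the service configuration file is named snmp-gateway.yml and is available inside the Cephadm shell:
[ceph: root@host01 /]# ceph orch apply -i snmp-gateway.yml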
Reference
Edit online
See the Configuring snmptrapd section in the IBM Storage Ceph Operations Guide.
There are three node failure scenarios. Here is the high-level workflow for each scenario when replacing a node:
Replacing the node, but using the root and Ceph OSD disks from the failed node.
Disable backfilling.
Replace the node, taking the disks from the old node, and adding them to the new node.
Enable backfilling.
Replacing the node, reinstalling the operating system, and using the Ceph OSD disks from the failed node.
Disable backfilling.
Replace the node and add the Ceph OSD disks from the failed node.
Add the new node to the storage cluster, and the Ceph daemons are placed automatically on the respective
node.
Enable backfilling.
Replacing the node, reinstalling the operating system, and using all new Ceph OSD disks.
Disable backfilling.
Remove all OSDs on the failed node from the storage cluster.
Replace the node and add new Ceph OSD disks.
Add the new node to the storage cluster, and the Ceph daemons are placed automatically on the respective
node.
Enable backfilling.
Prerequisites
Edit online
A failed node.
One of the outstanding features of Ceph is the ability to add or remove Ceph OSD nodes at run time. This means that you can resize
the storage cluster capacity or replace hardware without taking down the storage cluster.
The ability to serve Ceph clients while the storage cluster is in a degraded state also has operational benefits. For example, you can
add or remove or replace hardware during regular business hours, rather than working overtime or on weekends. However, adding
and removing Ceph OSD nodes can have a significant impact on performance.
Before you add or remove Ceph OSD nodes, consider the effects on storage cluster performance:
Whether you are expanding or reducing the storage cluster capacity, adding or removing Ceph OSD nodes induces backfilling
as the storage cluster rebalances. During that rebalancing time period, Ceph uses additional resources, which can impact
storage cluster performance.
In a production Ceph storage cluster, a Ceph OSD node has a particular hardware configuration that facilitates a particular
type of storage strategy.
Since a Ceph OSD node is part of a CRUSH hierarchy, the performance impact of adding or removing a node typically affects
the performance of pools that use the CRUSH ruleset.
Reference
Performance considerations
Edit online
The following factors typically affect a storage cluster’s performance when adding or removing Ceph OSD nodes:
Ceph clients place load on the I/O interface to Ceph; that is, the clients place load on a pool. A pool maps to a CRUSH ruleset.
The underlying CRUSH hierarchy allows Ceph to place data across failure domains. If the underlying Ceph OSD node involves a
pool that is experiencing high client load, the client load could significantly affect recovery time and reduce performance.
Because write operations require data replication for durability, write-intensive client loads in particular can increase the time
for the storage cluster to recover.
Generally, the capacity you are adding or removing affects the storage cluster’s time to recover. In addition, the storage
density of the node you add or remove might also affect recovery times. For example, a node with 36 OSDs typically takes
longer to recover than a node with 12 OSDs.
When removing nodes, you MUST ensure that you have sufficient spare capacity so that you will not reach full ratio or
near full ratio. If the storage cluster reaches full ratio, Ceph will suspend write operations to prevent data loss.
A Ceph OSD node maps to at least one Ceph CRUSH hierarchy, and the hierarchy maps to at least one pool. Each pool that
uses a CRUSH ruleset experiences a performance impact when Ceph OSD nodes are added or removed.
Replication pools tend to use more network bandwidth to replicate deep copies of the data, whereas erasure coded pools
tend to use more CPU to calculate k+m coding chunks. The more copies that exist of the data, the longer it takes for the
storage cluster to recover. For example, a larger pool or one that has a greater number of k+m chunks will take longer to
recover than a replication pool with fewer copies of the same data.
Drives, controllers and network interface cards all have throughput characteristics that might impact the recovery time.
Generally, nodes with higher throughput characteristics, such as 10 Gbps and SSDs, recover more quickly than nodes with
lower throughput characteristics, such as 1 Gbps and SATA drives.
To remove an OSD:
To add an OSD:
When adding or removing Ceph OSD nodes, consider that other ongoing processes also affect storage cluster performance. To
reduce the impact on client I/O, IBM recommends the following:
Calculate capacity
Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all its OSDs without reaching the full
ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.
Scrubbing is essential to ensuring the durability of the storage cluster’s data; however, it is resource intensive. Before adding or
removing a Ceph OSD node, disable scrubbing and deep-scrubbing and let the current scrubbing operations complete before
proceeding.
Once you have added or removed a Ceph OSD node and the storage cluster has returned to an active+clean state, unset the
noscrub and nodeep-scrub settings.
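A brief sketch of toggling these flags:
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
After the cluster returns to an active+clean state:
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub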
If you have reasonable data durability, there is nothing wrong with operating in a degraded state. For example, you can operate the
storage cluster with osd_pool_default_size = 3 and osd_pool_default_min_size = 2. You can tune the storage cluster
for the fastest possible recovery time, but doing so significantly affects Ceph client I/O performance. To maintain the highest Ceph
client I/O performance, limit the backfill and recovery operations and allow them to take longer.
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
You can also consider setting the sleep and delay parameters, such as osd_recovery_sleep.
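A hedged sketch of applying the limits listed above cluster-wide with the ceph config command:
[ceph: root@host01 /]# ceph config set osd osd_max_backfills 1
[ceph: root@host01 /]# ceph config set osd osd_recovery_max_active 1
[ceph: root@host01 /]# ceph config set osd osd_recovery_op_priority 1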
Finally, if you are expanding the size of the storage cluster, you may need to increase the number of placement groups. If you
determine that you need to expand the number of placement groups, IBM recommends making incremental increases in the number
of placement groups. Increasing the number of placement groups by a significant amount will cause a considerable degradation in
performance.
Prerequisites
Edit online
Procedure
Edit online
1. Verify that other nodes in the storage cluster can reach the new node by its short host name.
Example
Syntax
Example
Syntax
Example
5. Copy the Ceph cluster’s public SSH keys to the root user’s authorized_keys file on the new host:
Syntax
Example
Syntax
Example
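The commands for steps 5 and 6 are not reproduced above; a hedged sketch, assuming a new node named host03 with IP address 10.0.0.3, where the ssh-copy-id command is run as root outside the Cephadm shell:
# ssh-copy-id -f -i /etc/ceph/ceph.pub root@host03
[ceph: root@host01 /]# ceph orch host add host03 10.0.0.3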
7. Add an OSD for each disk on the node to the storage cluster.
IMPORTANT: When adding an OSD node to an IBM Storage Ceph cluster, IBM recommends adding one OSD daemon at a time and
allowing the cluster to recover to an active+clean state before proceeding to the next OSD.
Reference
Edit online
See Adding a Bucket and Moving a Bucket for details on placing the node at an appropriate location in the CRUSH hierarchy.
WARNING: Before removing a Ceph OSD node, ensure that the storage cluster can backfill the contents of all OSDs without reaching
the full ratio. Reaching the full ratio will cause the storage cluster to refuse write operations.
Prerequisites
Edit online
Procedure
Edit online
Syntax
ceph df
rados df
ceph osd df
Syntax
Syntax
Example
IMPORTANT: When removing an OSD node from the storage cluster, IBM recommends removing one OSD at a time
within the node and allowing the cluster to recover to an active+clean state before proceeding to remove the next
OSD.
Syntax
ceph -s
ceph df
Repeat this step until all OSDs on the node are removed from the storage cluster.
Reference
Edit online
See the Setting a specific configuration at runtime section in the IBM Storage Ceph Configuration Guide for more details.
Prerequisites
Edit online
Procedure
Edit online
1. Check the storage cluster’s capacity to understand the impact of removing the node:
Example
Example
4. If you are changing the host name, remove the node from CRUSH map:
Example
Example
Example
Example
Reference
Edit online
Prerequisites
Edit online
Each data center within a stretch cluster can have a different storage cluster configuration to reflect local capabilities and
dependencies. Set up replication between the data centers to help preserve the data. If one data center fails, the other data centers
in the storage cluster contain copies of the data.
Failure, or failover, domains are redundant copies of domains within the storage cluster. If an active domain fails, the failure domain
becomes the active domain.
When planning a storage cluster that contains multiple data centers, place the nodes within the CRUSH map hierarchy so that if one
data center goes down, the rest of the storage cluster stays up and running.
If you plan to use three-way replication for data within the storage cluster, consider the location of the nodes within the failure
domain. If an outage occurs within a data center, it is possible that some data might reside in only one copy. When this scenario
happens, there are two options:
Live with only one copy for the duration of the outage.
With the standard settings, and because of the randomness of data placement across the nodes, not all the data will be affected, but
some data can have only one copy. If some data exists in only one copy, the storage cluster reverts to read-only mode.
A logical structure of the placement hierarchy should be considered. A proper CRUSH map can be used, reflecting the hierarchical
structure of the failure domains within the infrastructure. Using logical hierarchical definitions improves the reliability of the storage
cluster, versus using the standard hierarchical definitions. Failure domains are defined in the CRUSH map. The default CRUSH map
contains all nodes in a flat hierarchy. In a three data center environment, such as a stretch cluster, the placement of nodes should be
managed in a way that one data center can go down, but the storage cluster stays up and running. Consider which failure domain a
node resides in when using 3-way replication for the data.
In the example below, the resulting map is derived from the initial setup of the storage cluster with 6 OSD nodes. In this example, all
nodes have only one disk and hence one OSD. All of the nodes are arranged under the default root, that is, the standard root of the
hierarchy tree. Because a weight is assigned to two of the OSDs, these OSDs receive fewer chunks of data than the other OSDs.
These nodes were introduced later with bigger disks than the initial OSD disks. This does not affect the ability of the data placement
to withstand a failure of a group of nodes.
Example
Using logical hierarchical definitions to group the nodes into the same data center achieves mature data placement. The possible
definition types of root, datacenter, rack, row, and host allow the failure domains for the three data center stretch
cluster to be reflected:
Since all OSDs in a host belong to the host definition, no change is needed there. All the other assignments can be adjusted during
runtime of the storage cluster by:
Moving the nodes into the appropriate place within this structure by modifying the CRUSH map:
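A hedged sketch of such adjustments, assuming data center buckets named dc1, dc2, and dc3:
[ceph: root@host01 /]# ceph osd crush add-bucket dc1 datacenter
[ceph: root@host01 /]# ceph osd crush move dc1 root=default
[ceph: root@host01 /]# ceph osd crush move host01 datacenter=dc1
Repeat for the remaining data centers and hosts.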
Within this structure, any new hosts can be added, as well as new disks. By placing the OSDs at the right place in the hierarchy, the
CRUSH algorithm places redundant pieces into different failure domains within the structure.
Example
The listing above shows the resulting CRUSH map by displaying the OSD tree. It is now easy to see how the hosts belong to a data
center and how all data centers belong to the same top-level structure, while clearly distinguishing between locations.
NOTE: Placing the data in the proper locations according to the map works properly only within a healthy cluster. Misplacement
might happen under circumstances when some OSDs are not available. Such misplacements are corrected automatically once it is
possible to do so.
Reference
Edit online
See the CRUSH administration chapter in the IBM Storage Ceph Storage Strategies Guide for more information.
Dashboard
Edit online
Use this information to understand how to use the IBM Storage Ceph Dashboard for monitoring and management purposes.
The dashboard is accessible from a web browser and includes many useful management and monitoring features, for example, to
configure manager modules and monitor the state of OSDs.
Prerequisites
Edit online
The Prometheus node-exporter daemon, running on each host of the storage cluster.
Reference
Edit online
Multi-user and role management: The dashboard supports multiple user accounts with different permissions and roles. User
accounts and roles can be managed using both, the command line and the web user interface. The dashboard supports
various methods to enhance password security. Password complexity rules may be configured, requiring users to change their
password after the first login or after a configurable time period.
Single Sign-On (SSO): The dashboard supports authentication with an external identity provider using the SAML 2.0 protocol.
Auditing: The dashboard backend can be configured to log all PUT, POST and DELETE API requests in the Ceph manager log.
Management features
View cluster hierarchy: You can view the CRUSH map, for example, to determine which host a specific OSD ID is running on.
This is helpful if there is an issue with an OSD.
Configure manager modules: You can view and change parameters for Ceph manager modules.
Embedded Grafana Dashboards: Ceph Dashboard Grafana dashboards might be embedded in external applications and web
pages to surface information and performance metrics gathered by the Prometheus module.
View and filter logs: You can view event and audit cluster logs and filter them based on priority, keyword, date, or time range.
Toggle dashboard components: You can enable and disable dashboard components so only the features you need are
available.
Manage OSD settings: You can set cluster-wide OSD flags using the dashboard. You can also Mark OSDs up, down or out,
purge and reweight OSDs, perform scrub operations, modify various scrub-related configuration options, select profiles to
adjust the level of backfilling activity. You can set and change the device class of an OSD, display and sort OSDs by device
class. You can deploy OSDs on new drives and hosts.
Viewing Alerts: The alerts page allows you to see details of current alerts.
Quality of Service for images: You can set performance limits on images, for example limiting IOPS or read BPS burst rates.
Monitoring features
Username and password protection: You can access the dashboard only by providing a configurable user name and
password.
Overall cluster health: Displays performance and capacity metrics. This also displays the overall cluster status, storage
utilization, for example, number of objects, raw capacity, usage per pool, a list of pools and their status and usage statistics.
Hosts: Provides a list of all hosts associated with the cluster along with the running services and the installed Ceph version.
Monitors: Lists all Monitors, their quorum status and open sessions.
Configuration editor: Displays all the available configuration options, their descriptions, types, default, and currently set
values. These values are editable.
Cluster logs: Displays and filters the latest updates to the cluster’s event and audit log files by priority, date, or keyword.
Device management: Lists all hosts known by the Orchestrator. Lists all drives attached to a host and their properties.
Displays drive health predictions, SMART data, and blink enclosure LEDs.
View storage cluster capacity: You can view raw storage capacity of the IBM Storage Ceph cluster in the Capacity panels of
the Ceph dashboard.
OSDs: Lists and manages all OSDs, their status and usage statistics as well as detailed information like attributes, like OSD
map, metadata, and performance counters for read and write operations. Lists all drives associated with an OSD.
Images: Lists all RBD images and their properties such as size, objects, and features. Create, copy, modify and delete RBD
images. Create, delete, and rollback snapshots of selected images, protect or unprotect these snapshots against modification.
Copy or clone snapshots, flatten cloned images.
NOTE: The performance graph for I/O changes in the Overall Performance tab for a specific image shows values only after
specifying the pool that includes that image by setting the rbd_stats_pool parameter in Cluster > Manager modules >
Prometheus.
RBD Mirroring: Enables and configures RBD mirroring to a remote Ceph server. Lists all active sync daemons and their status,
pools and RBD images including their synchronization state.
Ceph File Systems: Lists all active Ceph file system (CephFS) clients and associated pools, including their usage statistics.
Evict active CephFS clients, manage CephFS quotas and snapshots, and browse a CephFS directory structure.
Object Gateway (RGW): Lists all active object gateways and their performance counters. Displays and manages, including
add, edit, delete, object gateway users and their details, for example quotas, as well as the users’ buckets and their details,
for example, owner or quotas.
Security features
SSL and TLS support: All HTTP communication between the web browser and the dashboard is secured via SSL. A self-signed
certificate can be created with a built-in command, but it is also possible to import custom certificates signed and issued by a
Certificate Authority (CA).
Reference
Edit online
Cephadm installs the dashboard by default. Following is an example of the dashboard URL:
URL: https://host01:8443/
User: admin
Password: zbiql951ar
NOTE: Update the browser and clear the cookies prior to accessing the dashboard URL.
The following are the Cephadm bootstrap options that are available for the Ceph dashboard configurations:
--initial-dashboard-user INITIAL_DASHBOARD_USER
Use this option while bootstrapping to set the initial-dashboard-user.
--initial-dashboard-password INITIAL_DASHBOARD_PASSWORD
Use this option while bootstrapping to set the initial-dashboard-password.
--ssl-dashboard-port SSL_DASHBOARD_PORT
Use this option while bootstrapping to set a custom dashboard port other than the default 8443.
--dashboard-key DASHBOARD_KEY
Use this option while bootstrapping to set a custom key for SSL.
--dashboard-crt DASHBOARD_CRT
Use this option while bootstrapping to set a custom certificate for SSL.
--skip-dashboard
Use this option while bootstrapping to deploy Ceph without the dashboard.
--dashboard-password-noupdate
Use this option while bootstrapping if you used the above two options and do not want to reset the password at the first login.
--allow-fqdn-hostname
Use this option while bootstrapping to allow hostnames that are fully qualified.
--skip-prepare-host
Use this option while bootstrapping to skip preparing the host.
NOTE:
To avoid connectivity issues with the dashboard-related external URL, use fully qualified domain names (FQDN) for
hostnames, for example, host01.ceph.redhat.com.
Open the Grafana URL directly in the client internet browser and accept the security exception to see the graphs on the Ceph
dashboard. Reload the browser to view the changes.
Example
NOTE:
While bootstrapping the storage cluster using cephadm, you can use the --image option for either custom container images
or local container images.
You have to change the password the first time you log in to the dashboard with the credentials provided on bootstrapping only
if the --dashboard-password-noupdate option is not used while bootstrapping. You can find the Ceph dashboard credentials
in the /var/log/ceph/cephadm.log file. Search for the "Ceph Dashboard is now available at" string.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
https://HOST_NAME:PORT
Replace:
HOST_NAME with the fully qualified domain name (FQDN) of the active manager host.
Example
https://host01:8443
Example
This command will show you all endpoints that are currently configured. Look for the dashboard key to obtain the URL for
accessing the dashboard.
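A hedged sketch of that query; the endpoints shown are illustrative:
[ceph: root@host01 /]# ceph mgr services
{
    "dashboard": "https://host01:8443/",
    "prometheus": "http://host01:9283/"
}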
2. On the login page, enter the username admin and the default password provided during bootstrapping.
3. You have to change the password the first time you log in to the IBM Storage Ceph dashboard.
4. After logging in, the dashboard default landing page is displayed, which provides a high-level overview of status, performance,
and capacity metrics of the IBM Storage Ceph cluster.
5. Click the following icon on the dashboard landing page to collapse or display the options in the vertical menu:
Reference
Edit online
As a storage administrator, you can configure a message of the day (MOTD) using the command-line interface (CLI).
When the user logs in to the Ceph Dashboard, the configured MOTD is displayed at the top of the Ceph Dashboard similar to the
Telemetry module.
The importance of MOTD can be configured based on severity, such as info, warning, or danger.
A MOTD with an info or warning severity can be closed by the user. The info MOTD is not displayed anymore until the local storage
cookies are cleared or a new MOTD with a different severity is displayed. A MOTD with a warning severity is displayed again in a
new session.
Prerequisites
Edit online
A running IBM Storage Ceph cluster with the monitoring stack installed.
Procedure
Edit online
Syntax
Example
[ceph: root@host01 /]# ceph dashboard motd set danger 2d "Custom login message"
Message of the day has been set.
Replace
EXPIRES can be specified in seconds (s), minutes (m), hours (h), days (d), or weeks (w), or set to never expire (0).
MESSAGE can be any custom message that users can view as soon as they log in to the dashboard.
Example
[ceph: root@host01 /]# ceph dashboard motd set danger 0 "Custom login message"
Message of the day has been set.
Example
Example
https://HOST_NAME:8443
Once you bootstrap a new storage cluster, the Ceph Monitor and Ceph Manager daemons are created and the cluster is in
HEALTH_WARN state. After creating all the services for the cluster on the dashboard, the health of the cluster changes from
HEALTH_WARN to HEALTH_OK status.
Prerequisites
Edit online
Bootstrapped storage cluster. For more information, see Bootstrapping a new storage cluster.
At least cluster-manager role for the user on the IBM Storage Ceph Dashboard. For more information, see User roles and
permissions.
Procedure
Edit online
1. Copy the admin key from the bootstrapped host to other hosts:
Syntax
Example
2. Log in to the dashboard with the default credentials provided during bootstrap.
3. Change the password and log in to the dashboard with the new password.
5. Add hosts:
b. Provide the hostname. This is the same as the hostname that was provided while copying the key from the bootstrapped
host.
NOTE: You can use the tool tip in the Add Hosts dialog box for more details.
d. Optional: Select the labels for the hosts on which the services are going to be created.
7. Create OSDs:
b. In the Primary Devices window, filter for the device and select the device.
c. Click Add.
d. Optional: In the Create OSDs window, if you have any shared devices such as WAL or DB devices, then add the devices.
8. Create services:
10. Review the Cluster Resources, Hosts by Services, Host Details. If you want to edit any parameter, click Back and follow the
above steps.
12. You get a notification that the cluster expansion was successful.
Verification
Edit online
Example
Example
Available features:
Mirroring, mirroring
NOTE: By default, the Ceph Manager is collocated with the Ceph Monitor.
IMPORTANT: Once a feature is disabled, it can take up to 20 seconds to reflect the change in the web interface.
Prerequisites
Edit online
User access to the Ceph Manager host or the dashboard web interface.
Procedure
Edit online
c. In the Edit Manager module page, you can enable or disable the dashboard features by checking or unchecking the selection
box next to the feature name.
c. Disable a feature:
d. Enable a feature:
Link to the documentation, Ceph Rest API, and details about the IBM Storage Ceph Dashboard.
The individual panel displays specific information about the state of the cluster.
Categories
The landing page organizes panels into the following three categories:
1. Status
2. Capacity
3. Performance
Status panel
The status panels display the health of the cluster and host and daemon states.
Cluster Status
Displays the current health status of the Ceph storage cluster.
Hosts
Displays the total number of hosts in the Ceph storage cluster.
Monitors
Displays the number of Ceph Monitors and the quorum status.
OSDs
Displays the total number of OSDs in the Ceph Storage cluster and the number that are up and in.
Managers
Displays the number and status of the Manager Daemons.
Metadata Servers
Displays the number and status of metadata servers for Ceph Filesystems (CephFS).
Capacity panel
The capacity panel displays storage usage metrics.
Raw Capacity
Displays the utilization and availability of the raw storage capacity of the cluster.
Objects
Displays the total number of objects in the pools and a graph dividing objects into states of Healthy, Misplaced, Degraded, or
Unfound.
PG Status
Displays the total number of Placement Groups and a graph dividing PGs into states of Clean, Working, Warning, or Unknown.
To simplify the display of PG states, the Working and Warning categories each encompass multiple states.
The Working state includes PGs with any of these states:
activating
backfill_wait
backfilling
creating
deep
degraded
forced_backfill
forced_recovery
peering
peered
recovering
recovery_wait
repair
scrubbing
snaptrim
snaptrim_wait
The Warning state includes PGs with any of these states:
backfill_toofull
backfill_unfound
down
incomplete
inconsistent
recovery_toofull
recovery_unfound
remapped
stale
undersized
Pools
Displays the number of storage pools in the Ceph cluster.
Performance panel
The performance panel displays information related to data transfer speeds.
Client Read/Write
Displays total input/output operations per second, reads per second, and writes per second.
Client Throughput
Displays total client throughput, read throughput, and write throughput.
Recovery Throughput
Displays the data recovery rate.
Scrubbing
Displays whether Ceph is scrubbing data to verify its integrity.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
https://HOST_NAME:8443
2. Click the Dashboard Settings icon and then click User management.
5. In the Edit User window, enter the new password, change the other parameters as needed, and then click Edit User.
You will be logged out and redirected to the log-in screen. A notification appears confirming the password change.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Example
Syntax
Example
Verification
Edit online
Edit online
By default, cephadm does not create an admin user for Grafana. With the Ceph Orchestrator, you can create an admin user and set
the password.
Prerequisites
Edit online
A running IBM Storage Ceph cluster with the monitoring stack installed.
Procedure
Edit online
1. As a root user, create a grafana.yml file and provide the following details:
Syntax
service_type: grafana
spec:
initial_admin_password: PASSWORD
Example
service_type: grafana
spec:
initial_admin_password: mypassword
Example
NOTE: Every time you exit the shell, you have to mount the file in the container before deploying the daemon.
Example
Example
Syntax
Example
Example
This creates an admin user called admin with the given password and the user can log in to the Grafana URL with these
credentials.
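The surrounding example commands are empty in this copy. As a rough sketch, assuming the specification is saved as grafana.yml in the current directory, you would mount it into the cephadm shell, apply it with the orchestrator, and redeploy Grafana:
[root@host01 ~]# cephadm shell --mount grafana.yml:/var/lib/ceph/grafana.yml
[ceph: root@host01 /]# ceph orch apply -i /var/lib/ceph/grafana.yml
[ceph: root@host01 /]# ceph orch redeploy grafana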
Verification
Edit online
Syntax
https://HOST_NAME:PORT
Example
https://host01:3000/
Prerequisites
Edit online
A running IBM Storage Ceph cluster installed with the --skip-dashboard option during bootstrap.
Procedure
Edit online
Example
Example
{
"prometheus": "http://10.8.0.101:9283/"
}
Example
Example
NOTE: You can disable the certificate verification to avoid certificate errors.
Example
{
"dashboard": "https://10.8.0.101:8443/",
"prometheus": "http://10.8.0.101:9283/"
}
Syntax
Example
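The exact commands for this procedure are not preserved in this copy. A hedged sketch of the usual sequence, assuming an administrator user named admin with the password stored in a file named password.txt:
# admin, p@ssw0rd, and password.txt are example values
[ceph: root@host01 /]# ceph mgr module enable dashboard
[ceph: root@host01 /]# ceph dashboard create-self-signed-cert
[ceph: root@host01 /]# echo -n 'p@ssw0rd' > password.txt
[ceph: root@host01 /]# ceph dashboard ac-user-create admin -i password.txt administrator
[ceph: root@host01 /]# ceph mgr services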
After creating the account, use Red Hat Single Sign-on (SSO) to synchronize users to the Ceph dashboard. See Syncing users to the
Ceph dashboard using Red Hat Single Sign-On.
Prerequisites
Edit online
Dashboard is installed.
Red Hat Single Sign-On installed from a ZIP file. For more information, see Installing Red Hat Single Sign-On from a zip file.
Procedure
Edit online
1. Download the Red Hat Single Sign-On 7.4.0 Server on the system where IBM Storage Ceph is installed.
3. Navigate to the standalone/configuration directory and open the standalone.xml for editing:
4. Replace all instances of localhost and two instances of 127.0.0.1 with the IP address of the machine where Red Hat SSO
is installed.
Example
[root@host01 ~]# keytool -import -noprompt -trustcacerts -alias ca -file ../ca.cer -keystore /etc/java/java-1.8.0-openjdk/java-1.8.0-openjdk-1.8.0.272.b10-3.el8_3.x86_64/lib/security/cacert
6. To start the server from the bin directory of rh-sso-7.4 folder, run the standalone boot script:
7. Create the admin account in https://IP_ADDRESS:8080/auth with a username and password:
NOTE: You have to create an admin account only the first time that you log in to the console.
Reference
Edit online
For adding roles for users on the dashboard, see Creating roles.
Syncing users to the Ceph dashboard using Red Hat Single Sign-On
Edit online
You can use Red Hat Single Sign-on (SSO) with Lightweight Directory Access Protocol (LDAP) integration to synchronize users to the
IBM Storage Ceph Dashboard.
The users are added to specific realms in which they can access the dashboard through SSO without any additional requirements of
a password.
Prerequisites
Edit online
Dashboard is installed.
Admin account created for syncing users. See Creating an admin account for syncing users to the Ceph dashboard.
Procedure
Edit online
1. To create a realm, click the Master drop-down menu. In this realm, you can provide access to users and applications.
2. In the Add Realm window, enter a case-sensitive realm name and set the parameter Enabled to ON and click Create.
3. In the Realm Settings tab, set the following parameters and click Save:
a. Enabled - ON
b. User-Managed Access - ON
5. In the Add Client window, set the following parameters and click Save:
a. Client ID - BASE_URL:8443/auth/saml2/metadata
Example
https://example.ceph.redhat.com:8443/auth/saml2/metadata
Client Protocol - saml
6. In the Client window, under Settings tab, set the following parameters:
Client ID: BASE_URL:8443/auth/saml2/metadata, for example https://example.ceph.redhat.com:8443/auth/saml2/metadata
Enabled: ON
Client Protocol: saml
Include AuthnStatement: ON
Sign Documents: ON
Signature Algorithm: RSA_SHA1
SAML Signature Key Name: KEY_ID
Valid Redirect URLs: BASE_URL:8443, for example https://example.ceph.redhat.com:8443/
Base URL: BASE_URL:8443, for example https://example.ceph.redhat.com:8443/
Master SAML Processing URL: https://localhost:8080/auth/realms/REALM_NAME/protocol/saml/descriptor, for example https://localhost:8080/auth/realms/Ceph_LDAP/protocol/saml/descriptor
NOTE: Paste the link of SAML 2.0 Identity Provider Metadata from Realm Settings tab.
Under Fine Grain SAML Endpoint Configuration, set the following parameters and click Save:
a. In the Mappers tab, select role list and set Single Role Attribute to ON.
c. In the Mappers tab, select the first name row, edit the following parameter, and click Save:
You will get a notification that the sync of users is finished successfully.
10. In the Users tab, search for the user added to the dashboard and click the Search icon.
11. To view the user, click the specific row. You should see the federation link as the name provided for the User Federation.
IMPORTANT: Do not add users manually as the users will not be synchronized by LDAP. If added manually, delete the user by
clicking Delete.
Users added to the realm and the dashboard can access the Ceph dashboard with their mail address and password.
Example
https://example.ceph.redhat.com:8443
Reference
Edit online
For adding roles for users on the dashboard, see Creating roles.
For more information, see Creating an admin account for syncing users to the Ceph dashboard.
Prerequisites
Edit online
Procedure
Edit online
Syntax
podman exec CEPH_MGR_HOST ceph dashboard sso setup saml2 CEPH_DASHBOARD_BASE_URL IDP_METADATA
IDP_USERNAME_ATTRIBUTE IDP_ENTITY_ID SP_X_509_CERT SP_PRIVATE_KEY
Example
[root@host01 ~]# podman exec host01 ceph dashboard sso setup saml2
https://dashboard_hostname.ceph.redhat.com:8443 idp-metadata.xml username
https://10.70.59.125:8080/auth/realms/realm_name /home/certificate.txt /home/private-key.txt
Replace
IDP_METADATA with the URL to remote or local path or content of the IdP metadata XML. The supported URL types are
http, https, and file.
Optional: IDP_USERNAME_ATTRIBUTE with the attribute used to get the username from the authentication response.
Defaults to uid.
Optional: SP_X_509_CERT with the file path of the certificate used by Ceph Dashboard for signing and encryption.
Optional: SP_PRIVATE_KEY with the file path of the private key used by Ceph Dashboard for signing and encryption.
Syntax
Example
[root@host01 ~]# podman exec host01 ceph dashboard sso show saml2
Syntax
Example
[root@host01 ~]# podman exec host01 ceph dashboard sso enable saml2
Example
https://dashboard_hostname.ceph.redhat.com:8443
5. On the SSO page, enter the login credentials. SSO redirects to the dashboard web interface.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
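The syntax and example entries above are empty in this copy. As a minimal sketch, following the convention of the earlier examples in this procedure, the SAML 2.0 single sign-on state can be checked and disabled with the following dashboard subcommands:
[root@host01 ~]# podman exec host01 ceph dashboard sso status
[root@host01 ~]# podman exec host01 ceph dashboard sso disable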
Reference
Edit online
For information about enabling single sign-on, see Enabling Single Sign-on for the Ceph Dashboard.
Management of roles
Edit online
As a storage administrator, you can create, edit, clone, and delete roles on the dashboard.
By default, there are eight system roles. You can create custom roles and give permissions to those roles. These roles can be
assigned to users based on the requirements.
The IBM Storage Ceph dashboard functionality or modules are grouped within a security scope. Security scopes are predefined and
static. The current available security scopes on the IBM Storage Ceph dashboard are:
rgw: Includes all features related to Ceph object gateway (RGW) management.
A role specifies a set of mappings between a security scope and a set of permissions. There are four types of permissions:
Read
Create
Update
Delete
cephfs-manager: Allows full permissions for the Ceph file system scope.
cluster-manager: Allows full permissions for the hosts, OSDs, monitor, manager, and config-opt scopes.
read-only: Allows read permission for all security scopes except the dashboard settings and config-opt scopes.
rgw-manager: Allows full permissions for the Ceph object gateway scope.
For example, you need to provide rgw-manager access to the users for all Ceph object gateway operations.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
4. In the Create Role window, set the Name, Description, and select the Permissions for this role, and then click the Create Role
button.
6. Click on the Expand/Collapse icon of the row to view the details and permissions given to the roles.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
4. In the Edit Role window, edit the parameters, and then click Edit Role.
Cloning roles
Edit online
When you want to assign additional permissions to existing roles, you can clone the system roles and edit them on the IBM Storage Ceph
Dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
6. Once you clone the role, you can customize the permissions as per the requirements.
Reference
Edit online
Deleting roles
Edit online
You can delete the custom roles that you have created on the IBM Storage Ceph dashboard.
NOTE: You cannot delete the system roles of the Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
5. In the Delete Role dialog box, select the Yes, I am sure box and then click Delete Role.
Reference
Edit online
Management of users
Edit online
As a storage administrator, you can create, edit, and delete users with specific roles on the IBM Storage Ceph dashboard. Role-based
access control is given to each user based on their roles and requirements.
Creating users
Editing users
Deleting users
Prerequisites
Edit online
Dashboard is installed.
NOTE: The IBM Storage Ceph Dashboard does not support any email verification when changing a user's password. This behavior is
intentional, because the Dashboard supports Single Sign-On (SSO) and this feature can be delegated to the SSO provider.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
4. In the Create User window, set the Username and other parameters including the roles, and then click Create User.
Reference
Editing users
Edit online
You can edit the users on the IBM Storage Ceph dashboard. You can modify the user’s password and roles based on the
requirements.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
5. In the Edit User window, edit parameters like password and roles, and then click Edit User.
Deleting users
Edit online
You can delete users on the Ceph dashboard. If a user has been removed from the system, you can delete that user's access from the Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
2. Click the Dashboard Settings icon and then click User management.
5. In the Delete User dialog box, select the Yes, I am sure box and then click Delete User to save the settings.
Reference
Edit online
Daemon actions
Edit online
The IBM Storage Ceph dashboard allows you to start, stop, restart, and redeploy daemons.
NOTE: These actions are supported on all daemons except monitor and manager daemons.
Dashboard is installed.
Procedure
Edit online
You can manage daemons in two ways.
3. View the details of the service with the daemon to perform the action on by clicking the Expand/Collapse icon on its row.
4. In Details, select the drop down next to the desired daemon to perform Start, Stop, Restart, or Redeploy.
3. From the Hosts List, select the host with the daemon to perform the action on.
5. Use the drop down at the top to perform Start, Stop, Restart, or Redeploy.
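If you prefer the command line, the orchestrator exposes equivalent daemon actions. A hedged sketch, assuming a daemon named osd.3:
# osd.3 is an example daemon name
[ceph: root@host01 /]# ceph orch daemon stop osd.3
[ceph: root@host01 /]# ceph orch daemon start osd.3
[ceph: root@host01 /]# ceph orch daemon restart osd.3
[ceph: root@host01 /]# ceph orch daemon redeploy osd.3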
Devices - This tab has details such as device ID, state of health, device name, and the daemons on the hosts.
Inventory - This tab shows all disks attached to a selected host, as well as their type, size, and other attributes. It has details such as
device path, type of device, available, vendor, model, size, and the OSDs deployed.
Daemons - This tab shows all services that have been deployed on the selected host, which container they are running in, and
their current status. It has details such as hostname, daemon type, daemon ID, container ID, container image name, container
image ID, version, status, and last refreshed time.
Performance details - This tab has details such as OSDs deployed, CPU utilization, RAM usage, network load, network drop
rate, and OSD disk performance statistics.
Device health - For SMART-enabled devices, you can get the individual health status and SMART data only on the OSD
deployed hosts.
Prerequisites
Edit online
Dashboard is installed.
All the services, monitor, manager and OSD daemons are deployed on the storage cluster.
Procedure
Edit online
3. To view the details of a specific host, click the Expand/Collapse icon on its row.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Optional: You can search for the configuration using the Search box:
4. Optional: You can filter for a specific configuration using following filters:
Service - Any, mon, mgr, osd, mds, common, mds_client, rgw, and similar filters.
Modified - yes or no
5. To view the details of the configuration, click the Expand/Collapse icon on its row.
a. In the edit dialog window, edit the required parameters and click Update.
You can view, enable or disable, and edit the manager modules of a cluster on the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Procedure
Edit online
3. To view the details of a specific manager module, click the Expand/Collapse icon on its row.
NOTE: Not all modules have configurable parameters. If a module is not configurable, the Edit button is disabled.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. The Monitors overview page displays information about the overall monitor status, as well as tables of the Monitor hosts that
are in quorum and not in quorum.
4. To see the number of open sessions, hover the cursor over the blue dotted trail.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. To view the details of a specific service, click the Expand/Collapse icon on its row.
Devices - This tab has details such as Device ID, state of health, life expectancy, device name, and the daemons on the hosts.
Attributes (OSD map) - This tab shows the cluster address, details of heartbeat, OSD state, and the other OSD attributes.
Metadata - This tab shows the details of the OSD object store, the devices, the operating system, and the kernel details.
Device health - For SMART-enabled devices, you can get the individual health status and SMART data.
Performance counter - This tab gives details of the bytes written on the devices.
Performance Details - This tab has details such as OSDs deployed, CPU utilization, RAM usage, network load, network drop
rate, and OSD disk performance statistics.
Prerequisites
Edit online
Dashboard is installed.
All the services including OSDs are deployed on the storage cluster.
Procedure
Edit online
3. To view the details of a specific OSD, click the Expand/Collapse icon on its row.
You can view additional details such as Devices, Attributes (OSD map), Metadata, Device Health, Performance counter, and
Performance Details by clicking on the respective tabs.
Monitoring HAProxy
Edit online
The Ceph Object Gateway allows you to assign many instances of the object gateway to a single zone, so that you can scale out as
load increases. Since each object gateway instance has its own IP address, you can use HAProxy to balance the load across Ceph
Object Gateway servers.
Total requests/responses.
You can also get the Grafana details by running the ceph dashboard get-grafana-api-url command.
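For example, the configured Grafana endpoint can be retrieved as follows (the URL shown is illustrative):
[ceph: root@host01 /]# ceph dashboard get-grafana-api-url
https://dashboard_url:3000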
Prerequisites
Edit online
An existing Ceph Object Gateway service, without SSL. If you want SSL service, the certificate should be configured on the
ingress service, not the Ceph Object Gateway service.
Procedure
Edit online
Syntax
Example
https://dashboard_url:3000
Example
https://dashboard_url:8443
5. Select Daemons.
Verification
Edit online
Reference
Edit online
For more information, see Configuring high availability for the Ceph Object Gateway.
The CRUSH map allows you to determine which host a specific OSD ID is running on. This is helpful if there is an issue with an OSD.
Prerequisites
Edit online
Dashboard is installed.
You can download the logs to the system or copy the logs to the clipboard as well for further analysis.
Prerequisites
Edit online
Log entries have been generated since the Ceph Monitor was last started.
NOTE: The Dashboard logging feature displays only the thirty latest high-level events. The events are stored in memory by the Ceph
Monitor. The entries disappear after restarting the Monitor. If you need to review detailed or older logs, refer to the file-based logs.
Procedure
Edit online
a. To filter by priority, click the Priority drop-down menu and select either Debug, Info, Warning, Error, or All.
c. To filter by date, click the Date field and either use the date picker to select a date from the menu, or enter a date in the
form of YYYY-MM-DD.
d. To filter by time, enter a range in the Time range fields using the HH:MM - HH:MM format. Hours must be entered
using numbers 0 to 23.
3. Click the Download icon or Copy to Clipboard icon to download the logs.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. View the pools list, which gives the details of data protection and the application for which the pool is enabled. Hover the
mouse over Usage, Read bytes, and Write bytes for the required details.
4. To view more information about a pool, click the Expand/Collapse icon on its row.
Details - View the metadata servers (MDS) and their rank, plus any standby daemons, pools and their usage, and performance
counters.
Clients - View the list of clients that have mounted the file systems.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. To view more information about the file system, click the Expand/Collapse icon on its row.
Prerequisites
Edit online
Dashboard is installed.
At least one Ceph object gateway daemon configured in the storage cluster.
Procedure
Edit online
3. To view more information about the Ceph object gateway daemon, click the Expand/Collapse icon on its row.
If you have configured multiple Ceph Object Gateway daemons, click on Sync Performance tab and view the multi-site performance
counters.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. To view more information about the images, click the Expand/Collapse icon on its row.
Management of Alerts
Edit online
As a storage administrator, you can see the details of alerts and create silences for them on the IBM Storage Ceph dashboard.
CephadmDaemonFailed
CephadmPaused
CephadmUpgradeFailed
CephDaemonCrash
CephDeviceFailurePredicted
CephDeviceFailurePredictionTooHigh
CephDeviceFailureRelocationIncomplete
CephFilesystemDamaged
CephFilesystemDegraded
CephFilesystemFailureNoStandby
CephFilesystemInsufficientStandby
CephFilesystemMDSRanksLow
CephFilesystemOffline
CephFilesystemReadOnly
CephHealthError
CephHealthWarning
CephMgrModuleCrash
CephMgrPrometheusModuleInactive
CephMonClockSkew
CephMonDiskspaceCritical
CephMonDiskspaceLow
CephMonDown
CephMonDownQuorumAtRisk
CephNodeDiskspaceWarning
CephNodeInconsistentMTU
CephNodeNetworkPacketDrops
CephNodeNetworkPacketErrors
CephNodeRootFilesystemFull
CephObjectMissing
CephOSDDown
CephOSDDownHigh
CephOSDFlapping
CephOSDFull
CephOSDHostDown
CephOSDInternalDiskSizeMismatch
CephOSDNearFull
CephOSDReadErrors
CephOSDTimeoutsClusterNetwork
CephOSDTimeoutsPublicNetwork
CephOSDTooManyRepairs
CephPGBackfillAtRisk
CephPGImbalance
CephPGNotDeepScrubbed
CephPGNotScrubbed
CephPGRecoveryAtRisk
CephPGsDamaged
CephPGsHighPerOSD
CephPGsInactive
CephPGsUnclean
CephPGUnavilableBlockingIO
CephPoolBackfillFull
CephPoolFull
CephPoolGrowthWarning
CephPoolNearFull
CephSlowOps
PrometheusJobMissing
You can also monitor alerts using simple network management protocol (SNMP) traps.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
To see the configured alerts, configure the URL to the Prometheus API. Using this API, the Ceph Dashboard UI verifies
that a new silence matches a corresponding alert.
Syntax
Example
Syntax
Example
Example
For Prometheus:
Example
For Alertmanager:
Example
For Grafana:
Example
5. Get the details of the self-signed certificate verification setting for Prometheus, Alertmanager, and Grafana:
Example
6. Optional: If the dashboard does not reflect the changes, you have to disable and then enable the dashboard:
Example
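The command bodies for the steps above are missing from this copy. A hedged sketch of the typical settings, assuming the monitoring stack runs on host01 with the default ports:
# host01 and the ports are example values
[ceph: root@host01 /]# ceph dashboard set-prometheus-api-host 'http://host01:9095'
[ceph: root@host01 /]# ceph dashboard set-alertmanager-api-host 'http://host01:9093'
[ceph: root@host01 /]# ceph dashboard set-grafana-api-url 'https://host01:3000'
[ceph: root@host01 /]# ceph dashboard set-prometheus-api-ssl-verify False
[ceph: root@host01 /]# ceph dashboard set-alertmanager-api-ssl-verify False
[ceph: root@host01 /]# ceph dashboard set-grafana-api-ssl-verify False
[ceph: root@host01 /]# ceph mgr module disable dashboard
[ceph: root@host01 /]# ceph mgr module enable dashboard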
You can configure a custom certificate with the ceph config-key set command.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Example
Example
[root@host01 internalca]# openssl ecparam -genkey -name secp384r1 -out $(date +%F).key
d. Make a request:
Example
[root@host01 internalca]# umask 077; openssl req -config openssl-san.cnf -new -sha256 -key $(date +%F).key -out $(date +%F).csr
Example
f. Sign the request as the CA:
Example
# openssl ca -extensions v3_req -in $(date +%F).csr -out $(date +%F).crt -extfile openssl-san.cnf
Example
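The remaining example commands are empty in this copy. As a hedged sketch, assuming the signed certificate and key are intended for the Ceph Dashboard itself and are saved as dashboard.crt and dashboard.key, they could be applied with ceph config-key set and picked up by restarting the dashboard module:
# dashboard.crt and dashboard.key are example file names
[root@host01 internalca]# ceph config-key set mgr/dashboard/crt -i dashboard.crt
[root@host01 internalca]# ceph config-key set mgr/dashboard/key -i dashboard.key
[root@host01 internalca]# ceph mgr module disable dashboard
[root@host01 internalca]# ceph mgr module enable dashboard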
Reference
Edit online
For example, if an OSD is down in an IBM Storage Ceph cluster, you can configure the Alertmanager to send notification on Google
chat.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
service_type: alertmanager
spec:
user_data:
default_webhook_urls:
- "_URLS_"
Example
service_type: alertmanager
spec:
user_data:
webhook_configs:
- url: 'http://127.0.0.10:8080'
Example
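The deployment command is not shown above. As a minimal sketch, assuming the specification is saved as alertmanager.yml and is visible inside the cephadm shell, apply it with the orchestrator and redeploy the daemon:
[ceph: root@host01 /]# ceph orch apply -i alertmanager.yml
[ceph: root@host01 /]# ceph orch redeploy alertmanager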
Verification
Edit online
Example
Viewing alerts
Edit online
After an alert has fired, you can view it on the IBM Storage Ceph Dashboard. You can edit the Manager module settings to send an
email when an alert fires.
Prerequisites
Edit online
Dashboard is installed.
An alert fired.
Procedure
Edit online
2. Customize the alerts module on the dashboard to get an email alert for the storage cluster:
e. In the Edit Manager module window, update the required parameters and click Update.
5. To view details of the alert, click the Expand/Collapse icon on its row.
6. To view the source of an alert, click on its row, and then click Source.
Creating a silence
Edit online
You can create a silence for an alert for a specified amount of time on the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
An alert fired.
Procedure
Edit online
6. In the Create Silence window, add the details for the Duration and click Create Silence.
Re-creating a silence
Edit online
You can re-create a silence from an expired silence on the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
An alert fired.
Procedure
Edit online
Editing a silence
Edit online
You can edit an active silence, for example, to extend the time it is active on the IBM Storage Ceph Dashboard. If the silence has
expired, you can either recreate a silence or create a new silence for the alert.
Prerequisites
Edit online
Dashboard is installed.
An alert fired.
Procedure
Edit online
7. In the Edit Silence window, update the details and click Edit Silence.
Expiring a silence
Edit online
You can expire a silence so any matched alerts will not be suppressed on the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
An alert fired.
7. In the Expire Silence dialog box, select Yes, I am sure, and then click Expire Silence.
Management of pools
Edit online
As a storage administrator, you can create, edit, and delete pools on the IBM Storage Ceph dashboard.
Creating pools
Editing pools
Deleting pools
Creating pools
Edit online
When you deploy a storage cluster without creating a pool, Ceph uses the default pools for storing data. You can create pools to
logically partition your storage objects on the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Click Create.
a. Set the name of the pool and select the pool type.
f. Optional: To see the settings for the currently selected EC profile, click the question mark.
h. Optional: Click the pencil symbol to select an application for the pool.
Reference
Edit online
Editing pools
Edit online
You can edit the pools on the IBM Storage Ceph Dashboard.
Dashboard is installed.
A pool is created.
Procedure
Edit online
5. In the Edit Pool window, edit the required parameters and click Edit Pool:
Deleting pools
Edit online
You can delete the pools on the IBM Storage Ceph Dashboard. Ensure that the value of mon_allow_pool_delete is set to true in the
Manager modules.
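As an illustrative alternative to the Manager modules page, the same flag can be set from the command line:
[ceph: root@host01 /]# ceph config set mon mon_allow_pool_delete true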
Dashboard is installed.
A pool is created.
Procedure
Edit online
Reference
Edit online
Management of hosts
Edit online
As a storage administrator, you can enable or disable maintenance mode for a host in the IBM Storage Ceph Dashboard. The
maintenance mode ensures that shutting down the host, to perform maintenance activities, does not harm the cluster.
You can also remove hosts using Start Drain and Remove options in the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
Hosts, Ceph Monitors and Ceph Manager Daemons are added to the storage cluster.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
NOTE: When a host enters maintenance, all daemons are stopped. You can check the status of the daemons under the Daemons tab
of a host.
Verification
Edit online
You get a notification that the host is successfully moved to maintenance and a maintenance label appears in the Status
column.
NOTE: If the maintenance mode fails, you get a notification indicating the reasons for failure.
Prerequisites
Edit online
Dashboard is installed.
All other prerequisite checks are performed internally by Ceph, which also handles any probable errors.
Procedure
Edit online
NOTE: You can identify the host in maintenance by checking for the maintenance label in the Status column.
After exiting maintenance mode, you need to create the required services on the host. By default, the crash and node-exporter
services are deployed.
Verification
Edit online
You get a notification that the host has been successfully moved out of maintenance and the maintenance label is removed
from the Status column.
Removing hosts
Edit online
To remove a host from a Ceph cluster, you can use Start Drain and Remove options in IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
All other prerequisite checks are performed internally by Ceph, which also handles any probable errors.
Procedure
Edit online
3. From the Hosts List, select the host you want to remove.
4. From the Edit drop-down menu, click Start Drain. This option drains all the daemons from the host.
NOTE: The _no_schedule label is automatically applied to the host, which blocks the deployment of daemons on this host.
a. Optional: to stop the draining of daemons from the host, click Stop Drain option from the Edit drop-down menu.
IMPORTANT: A host can be safely removed from the cluster after all the daemons are removed from it.
b. In the Remove Host dialog box, check Yes, I am sure. and click Remove Host.
Verification
Edit online
You get a notification after the successful removal of the host from the Hosts List.
List OSDs, their status, statistics, information such as attributes, metadata, device health, performance counters and
performance details.
Mark OSDs down, in, out, or lost; purge, reweight, scrub, deep-scrub, destroy, or delete OSDs; and select profiles to adjust backfilling
activity.
Prerequisites
Edit online
Dashboard is installed.
Hosts, Monitors and Manager Daemons are added to the storage cluster.
Procedure
Edit online
Creating an OSD
NOTE: Ensure you have an available host and a few available devices. You can check for available devices in Physical Disks
under the Cluster drop-down menu.
a. In the Create OSDs window, from Deployment Options, select one of the below options:
Throughput-optimized: Slower devices are used to store data and faster devices are used to store
journals/WALs.
b. From the Advanced Mode, you can add primary, WAL, and DB devices by clicking +Add.
DB devices: DB devices are used to store BlueStore’s internal metadata and are used only if the DB device is
faster than the primary device (for example, NVMe or SSD devices).
c. If you want to encrypt your data for security purposes, under Features, select encryption.
d. Click the Preview button and in the OSD Creation Preview dialog box, Click Create.
Editing an OSD
c. Click Update.
d. You get a notification that the flags of the OSD were updated successfully.
c. You get a notification that the scrubbing of the OSD was initiated successfully.
c. You get a notification that the deep scrubbing of the OSD was initiated successfully.
b. In the Reweight OSD dialog box, enter a value between zero and one.
c. Click Reweight.
Marking OSDs In
1. To mark the OSD in, select the OSD row that is in out status.
1. To mark the OSD lost, select the OSD in out and down status.
b. In the Mark OSD Lost dialog box, check Yes, I am sure option, and click Mark Lost.
Purging OSDs
b. In the Purge OSDs dialog box, check Yes, I am sure option, and click Purge OSD.
c. All the flags are reset and the OSD is back in in and up status.
Destroying OSDs
b. In the Destroy OSDs dialog box, check Yes, I am sure option, and click Destroy OSD.
Deleting OSDs
b. In the Delete OSDs dialog box, check the Yes, I am sure option, and click Delete OSD.
NOTE: You can preserve the OSD_ID when you have to replace the failed OSD.
Prerequisites
Edit online
Procedure
Edit online
1. On the dashboard, you can identify the failed OSDs in the following ways:
In this example, you can see that one of the OSDs is down on the landing page of the dashboard.
Apart from this, on the physical drive, you can view the LED lights blinking if one of the OSDs is down.
2. Click OSDs.
a. From the Edit drop-down menu, select Flags and select No Up and click Update.
c. In the Delete OSD dialog box, select the Preserve OSD ID(s) for replacement and Yes, I am sure check boxes.
e. Wait until the status of the OSD changes to out and destroyed.
4. Optional: If you want to change the No Up Flag for the entire cluster, in the Cluster-wide configuration drop-down menu, select
Flags.
5. Optional: If the OSDs are down due to a hard disk failure, replace the physical drive:
If the drive is hot-swappable, replace the failed drive with a new one.
If the drive is not hot-swappable and the host contains multiple OSDs, you might have to shut down the whole host and
replace the physical drive. Consider preventing the cluster from backfilling. For more information, see Stopping and
Starting Rebalancing.
When the drive appears under the /dev/ directory, make a note of the drive path.
If you want to add the OSD manually, find the OSD drive and format the disk.
Syntax
Example
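The syntax and example above are empty in this copy. A hedged sketch, assuming the replacement drive appears as /dev/sdb on host02, is to zap the device so that the orchestrator can reuse it:
# host02 and /dev/sdb are example values
[ceph: root@host01 /]# ceph orch device zap host02 /dev/sdb --force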
a. In the Primary devices dialog box, from the Hostname drop-down list, select any one filter. From Any drop-down list,
select the respective option.
NOTE: You have to select the Hostname first and then at least one filter to add the devices.
For example, from Hostname list, select Type and from Any list select hdd. Select Vendor and from Any list, select ATA.
b. Click Add.
e. You will get a notification that the OSD is created. The OSD will be in out and down status.
8. Select the newly created OSD that has out and down status.
9. Optional: If you have changed the No Up Flag before for cluster-wide configuration, in the Cluster-wide configuration menu,
select Flags.
Verification
Edit online
Verify that the OSD that was destroyed is created on the device and the OSD ID is preserved.
You can also create the Ceph Object Gateway services with Secure Sockets Layer (SSL) using the dashboard.
For example, monitoring functions allow you to view details about a gateway daemon such as its zone name, or performance graphs
of GET and PUT rates. Management functions allow you to view, create, and edit both users and buckets.
Ceph Object Gateway functions are divided between user functions and bucket functions.
Creating the Ceph Object Gateway services with SSL using the dashboard
Management of Ceph Object Gateway users
Management of Ceph Object Gateway buckets
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
Example
Example
This creates a Ceph Object Gateway user with UID dashboard for each realm in the system.
3. Optional: If you have configured a custom admin resource in your Ceph Object Gateway admin API, you also have to set the
admin resource:
Syntax
Example
4. Optional: If you are using HTTPS with a self-signed certificate, disable certificate verification in the dashboard to avoid refused
connections.
Refused connections can happen when the certificate is signed by an unknown Certificate Authority, or if the host name used
does not match the host name in the certificate.
Syntax
Example
5. Optional: If the Object Gateway takes too long to process requests and the dashboard runs into timeouts, you can set the
timeout value:
Syntax
Example
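The command bodies for the steps above are missing from this copy. A hedged sketch of the usual sequence, assuming an admin resource named admin, a self-signed certificate, and a 240-second timeout:
# admin and 240 are example values
[ceph: root@host01 /]# ceph dashboard set-rgw-credentials
[ceph: root@host01 /]# ceph dashboard set-rgw-api-admin-resource admin
[ceph: root@host01 /]# ceph dashboard set-rgw-api-ssl-verify False
[ceph: root@host01 /]# ceph dashboard set-rest-api-timeout 240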
Creating the Ceph Object Gateway services with SSL using the
dashboard
Edit online
After installing an IBM Storage Ceph cluster, you can create the Ceph Object Gateway service with SSL using two methods:
Prerequisites
Edit online
Dashboard is installed.
NOTE: Obtain the SSL certificate from a CA that matches the hostname of the gateway host. IBM recommends obtaining a certificate
from a CA that has subject alternative name (SAN) fields and a wildcard for use with S3-style subdomains.
Procedure
Edit online
3. Click +Create.
Reference
Edit online
Prerequisites
Edit online
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
a. Set the user name, full name, and edit the maximum number of buckets if required.
c. Optional: Set a custom access key and secret key by unchecking Auto-generate key.
Reference
Edit online
For more information, see Adding Ceph object gateway login credentials to the dashboard.
Prerequisites
Edit online
Procedure
Edit online
3. Click Users.
7. In the Create Subuser dialog box, enter the user name and select the appropriate permissions.
8. Check the Auto-generate secret box and then click Create Subuser.
NOTE: By selecting the Auto-generate secret checkbox, the secret key for the object gateway is generated automatically.
10. You get a notification that the user was updated successfully.
Dashboard is installed.
Procedure
Edit online
3. Click Users.
Reference
Edit online
For more information, see Adding Ceph object gateway login credentials to the dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Click Users.
7. In the Delete user dialog window, select the Yes, I am sure box and then click Delete User to save the settings.
Reference
Edit online
For more information, see Adding Ceph object gateway login credentials to the dashboard.
Prerequisites
Edit online
Dashboard is installed.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
4. In the Create Bucket window, enter a value for Name and select a user that is not suspended. Select a placement target.
NOTE: A bucket’s placement target is selected on creation and cannot be modified.
5. Optional: Enable Locking for the objects in the bucket. Locking can only be enabled while creating a bucket. Once locking is
enabled, you also have to choose the lock mode, either Compliance or Governance, and the lock retention period in either days or
years, but not both.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Click Buckets.
6. In the Edit bucket window, edit the Owner by selecting the user from the dropdown.
a. Optional: Enable Versioning if you want to enable the versioning state for all the objects in an existing bucket. To enable
versioning, you must be the owner of the bucket. If Locking is enabled during bucket creation, you cannot disable the
versioning. All objects added to the bucket receive a unique version ID. If the versioning state has not been set on a
bucket, the bucket does not have a versioning state.
b. Optional: Check Delete enabled for Multi-Factor Authentication. Multi-Factor Authentication (MFA) ensures that users need
to use a one-time password (OTP) when removing objects on certain buckets. Enter a value for Token Serial Number and Token
PIN.
NOTE: The buckets must be configured with versioning and MFA enabled which can be done through the S3 API.
Prerequisites
Edit online
Dashboard is installed.
3. Click Buckets.
6. In the Delete Bucket dialog box, select the Yes, I am sure box and then click Delete bucket to save the settings.
Prerequisites
Edit online
At least one running IBM Storage Ceph cluster deployed on both the sites.
Dashboard is installed.
The multi-site object gateway is configured on the primary and secondary sites.
Object gateway login credentials of the primary and secondary sites are added to the dashboard.
Procedure
Edit online
1. On the Dashboard landing page of the secondary site, in the vertical menu bar, click Object Gateway drop-down list.
2. Select Buckets.
3. You can see those object gateway buckets on the secondary landing page that were created for the object gateway users on
the primary site.
Prerequisites
Edit online
At least one running IBM Storage Ceph cluster deployed on both the sites.
Dashboard is installed.
The multi-site object gateway is configured on the primary and secondary sites.
Object gateway login credentials of the primary and secondary sites are added to the dashboard.
Prerequisites
Edit online
At least one running IBM Storage Ceph cluster deployed on both the sites.
Dashboard is installed.
The multi-site object gateway is configured on the primary and secondary sites.
Object gateway login credentials of the primary and secondary sites are added to the dashboard.
Procedure
Edit online
1. On the Dashboard landing page of the secondary site, in the vertical menu bar, click Object Gateway drop-down list.
2. Select Buckets.
3. You can see those object gateway buckets on the secondary landing page that were created for the object gateway users on
the primary site.
6. In the Edit Bucket window, edit the required parameters and click Edit Bucket.
Reference
Edit online
For more information on configuring multisite, see Multi-site configuration and administration.
For more information on adding object gateway login credentials to the dashboard, see Manually adding object gateway login
credentials to the Ceph dashboard.
For more information on creating object gateway users on the dashboard, see Creating object gateway users.
For more information on creating object gateway buckets on the dashboard, see Creating object gateway buckets.
For more information on system roles, see User roles and permissions.
IMPORTANT: IBM does not recommend deleting buckets of the primary site from the secondary sites.
At least one running IBM Storage Ceph cluster deployed on both the sites.
Dashboard is installed.
The multi-site object gateway is configured on the primary and secondary sites.
Object gateway login credentials of the primary and secondary sites are added to the dashboard.
Procedure
Edit online
1. On the Dashboard landing page of the primary site, in the vertical menu bar, click Object Gateway drop-down list.
2. Select Buckets.
3. You can see those object gateway buckets of the secondary site here.
6. In the Delete Bucket dialog box, select Yes, I am sure checkbox, and click Delete Bucket.
Verification
Edit online
Reference
Edit online
For more information on configuring multisite, see Multi-site configuration and administration.
For more information on adding object gateway login credentials to the dashboard, see Manually adding object gateway login
credentials to the Ceph dashboard.
For more information on creating object gateway users on the dashboard, see Creating object gateway users.
For more information on creating object gateway buckets on the dashboard, see Creating object gateway buckets.
For more information on system roles, see User roles and permissions.
You can also create, clone, copy, rollback, and delete snapshots of the images using the Ceph dashboard.
NOTE: The Block Device images table is paginated for use with 10000+ image storage clusters to reduce Block Device information
retrieval costs.
Creating images
Creating namespaces
Editing images
Copying images
Moving images to trash
Purging trash
Restoring images from trash
Deleting images
Deleting namespaces
Creating snapshots of images
Renaming snapshots of images
Protecting snapshots of images
Cloning snapshots of images
Copying snapshots of images
Unprotecting snapshots of images
Rolling back snapshots of images
Deleting snapshots of images
Creating images
Edit online
You can create block device images on the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Select Images.
4. Click Create.
Reference
Edit online
Creating namespaces
Edit online
You can create namespaces for the block device images on the IBM Storage Ceph dashboard.
Once the namespaces are created, you can give access to the users for those namespaces.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Select Images.
4. To create the namespace of the image, in the Namespaces tab, click Create.
5. In the Create Namespace window, select the pool and enter a name for the namespace.
6. Click Create.
Reference
Edit online
See the Knowledgebase article Segregate Block device images within isolated namespaces for more details.
Editing images
Edit online
You can edit block device images on the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
6. In the Edit RBD window, edit the required parameters and click Edit RBD.
Copying images
Edit online
You can copy block device images on the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
6. In the Moving an image to trash window, edit the date until which the image needs protection, and then click Move.
7. You get a notification that the image was moved to trash successfully.
Purging trash
Edit online
You can purge trash using the IBM Storage Ceph dashboard.
Prerequisites
Edit online
Dashboard is installed.
An image is trashed.
Procedure
Edit online
3. Select Images.
5. In the Purge Trash window, select the pool, and then click Purge Trash.
6. You get a notification that the pools in the trash were purged successfully.
Prerequisites
Edit online
Dashboard is installed.
An image is trashed.
Procedure
Edit online
3. Select Images.
4. To restore the image from Trash, in the Trash tab, click its row.
6. In the Restore Image window, enter the new name of the image, and then click Restore.
Deleting images
Edit online
You can delete the images only after the images are moved to trash. You can delete the cloned images and the copied images
directly without moving them to trash.
Dashboard is installed.
Procedure
Edit online
3. Select Images.
6. Optional: To remove the cloned images and copied images, select Delete from the Edit drop-down menu.
7. In the Delete RBD dialog box, click the Yes, I am sure box and then click Delete RBD to save the settings.
Reference
Edit online
For more information on creating images in an RBD pool, see Moving images to trash.
Deleting namespaces
Edit online
You can delete the namespaces of the images on the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Select Images.
4. To delete the namespace of the image, in the Namespaces tab, click its row.
5. Click Delete.
6. In the Delete Namespace dialog box, click the Yes, I am sure box and then click Delete Namespace to save the settings.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To take the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
6. In the Create RBD Snapshot dialog, enter the name and click Create RBD Snapshot:
Reference
Edit online
For more information on creating snapshots, see Creating a block device snapshot.
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To rename the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
6. In the Rename RBD Snapshot dialog box, enter the name and click Rename RBD Snapshot.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To protect the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To clone the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
6. In the Clone RBD window, edit the parameters and click Clone RBD.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
Procedure
Edit online
3. Select Images.
4. To copy the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
6. In the Copy RBD window, enter the parameters and click the Copy RBD button:
7. You get a notification that the snapshot was copied successfully. You can search for the copied image in the Images tab.
Reference
Edit online
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To unprotect the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Procedure
Edit online
3. Select Images.
4. To rollback the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
3. Select Images.
4. To delete the snapshot of the image, in the Images tab, click its row, and then click the Snapshots tab.
Reference
Edit online
You can add another layer of redundancy to Ceph block devices by mirroring data images between storage clusters. Understanding
and using Ceph block device mirroring can provide you protection against data loss, such as a site failure. There are two
configurations for mirroring Ceph block devices, one-way mirroring or two-way mirroring, and you can configure mirroring on pools
and individual images.
Mirroring view
Editing mode of pools
Adding peer in mirroring
Editing peer in mirroring
Deleting peer in mirroring
Mirroring view
Edit online
You can view the Block device mirroring on the IBM Storage Ceph Dashboard.
You can view the daemons, the site details, the pools, and the images that are configured for Block device mirroring.
Prerequisites
Edit online
Dashboard is installed.
Mirroring is configured.
Procedure
Edit online
3. Click Mirroring.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Mirroring is configured.
Procedure
Edit online
3. Click Mirroring.
6. In the Edit pool mirror mode window, select the mode from the drop-down, and then click Update. Pool is updated
successfully.
Prerequisites
Edit online
NOTE: Ensure that mirroring is enabled for the pool in which images are created.
Procedure
Edit online
Site A
2. From the Navigation menu, click the Block drop-down menu, and click Mirroring.
3. Click Create Bootstrap Token and configure the following in the window:
a. Choose the pool for mirroring for the provided site name.
b. For the selected pool, generate a new bootstrap token by clicking Generate.
d. Click Close.
c. From the Edit pool mirror mode window, select Image from the drop-down.
d. Click Update.
Site B
2. From the Navigation menu, click the Block drop-down menu, and click Mirroring.
3. From the Create Bootstrap token drop-down, select Import Bootstrap Token.
NOTE: Ensure that mirroring mode is enabled for the specific pool for which you are importing the bootstrap token.
4. In the Import Bootstrap Token window, choose the direction, and paste the token copied earlier from site A.
5. Click Submit. The peer is added and the images are mirrored in the cluster at site B.
6. Verify the health of the pool is in OK state. In the Navigation menu, under Block, select Mirroring. The health of the pool
should be OK.
Site A
b. Click Images.
c. Click Create.
d. In the Create RBD window, provide the Name, Size, and enable Mirroring.
2. Verify the image is available at both the sites. In the Navigation menu, under Block, select Images. The image in site A is
primary while the image in site B is secondary.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Mirroring is configured.
A peer is added.
Procedure
Edit online
3. Click Mirroring.
6. In the Edit pool mirror peer window, edit the parameters, and then click Submit.
Prerequisites
Edit online
Dashboard is installed.
An image is created.
Mirroring is configured.
A peer is added.
Procedure
Edit online
3. Click Mirroring.
6. In the Delete mirror peer dialog window, select the Yes, I am sure box and then click Delete mirror peer to save the settings.
Reference
Edit online
View the telemetry data that is sent to the Ceph developers on the public telemetry dashboard. This allows the community to easily
see summary statistics on how many clusters are reporting, their total capacity and OSD count, and version distribution trends.
The telemetry report is broken down into several channels, each with a different type of information. Assuming telemetry has been
enabled, you can turn on and off the individual channels. If telemetry is off, the per-channel setting has no effect.
The data reports contain information that help the developers gain a better understanding of the way Ceph is used. The data includes
counters and statistics on how the cluster has been deployed, the version of Ceph, the distribution of the hosts, and other
parameters.
IMPORTANT: The data reports do not contain any sensitive data like pool names, object names, object contents, hostnames, or
device serial numbers.
NOTE: Telemetry can also be managed by using an API. For more information, see Telemetry.
Procedure
Edit online
NOTE: For detailed information about each channel type, click More Info next to the channels.
3. Complete the Contact Information for the cluster. Enter the contact, Ceph cluster description, and organization.
Proxy: Use this to configure an HTTP or HTTPs proxy server if the cluster cannot directly connect to the configured telemetry
endpoint. Add the server in one of the following formats:
https://10.0.0.1:8080 or https://ceph:[email protected]:8080
5. Click Next. This displays the Telemetry report preview before enabling telemetry.
NOTE: The report can be downloaded and saved locally or copied to the clipboard.
7. Select I agree to my telemetry data being submitted under the Community Data License Agreement.
Deactivating telemetry
Edit online
To deactivate the telemetry module, go to Settings > Telemetry configuration and click Deactivate.
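Telemetry can also be toggled from the command line. A minimal sketch of the equivalent CLI operations:
[ceph: root@host01 /]# ceph telemetry on --license sharing-1-0
[ceph: root@host01 /]# ceph telemetry status
[ceph: root@host01 /]# ceph telemetry off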
This uses a "Day Zero", "Day One", and "Day Two" organizational methodology, providing readers with a logical progression path.
Day Zero is where research and planning are done before implementing a potential solution.
Day One is where the actual deployment, and installation of the software happens.
Day Two is where all the basic, and advanced configuration happens.
S3-compatibility: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3
RESTful API.
The Ceph Object Gateway is a service interacting with a Ceph storage cluster. Since it provides interfaces compatible with OpenStack
Swift and Amazon S3, the Ceph Object Gateway has its own user management system. Ceph Object Gateway can store data in the
same Ceph storage cluster used to store data from Ceph block device clients; however, it would involve separate pools and likely a
different CRUSH hierarchy. The S3 and Swift APIs share a common namespace, so you can write data with one API and retrieve it
with the other.
Administrative API: Provides an administrative interface for managing the Ceph Object Gateways.
Administrative API requests are done on a URI that starts with the admin resource end point. Authorization for the administrative
API mimics the S3 authorization convention. Some operations require the user to have special administrative capabilities. The
response type can be either XML or JSON by specifying the format option in the request, but defaults to the JSON format.
Reference
Edit online
Prerequisites
Edit online
Reference
Edit online
For more details about Ceph’s various internal components and the strategies around those components, see Storage
Strategies.
Storage administrators prefer that a storage cluster recovers as quickly as possible. Carefully consider bandwidth requirements for
the storage cluster network, be mindful of network link oversubscription, and segregate the intra-cluster traffic from the client-to-
cluster traffic. Also consider that network performance is increasingly important when considering the use of Solid State Disks (SSD),
flash, NVMe, and other high performing storage devices.
Ceph supports a public network and a storage cluster network. The public network handles client traffic and communication with
Ceph Monitors. The storage cluster network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic. At a
minimum, a single 10 GB Ethernet link should be used for storage hardware, and you can add additional 10 GB Ethernet links for
connectivity and throughput.
IMPORTANT:
IBM recommends allocating bandwidth to the storage cluster network such that it is a multiple of the public network bandwidth, using osd_pool_default_size as the basis for the multiple on replicated pools. IBM also recommends running the public and storage cluster networks on separate network cards.
IMPORTANT:
In the case of a drive failure, replicating 1 TB of data across a 1 GB Ethernet network takes 3 hours, and replicating 3 TB, a typical drive size, takes 9 hours. By contrast, with a 10 GB Ethernet network, the replication times would be 20 minutes and 1 hour. Remember that when a Ceph OSD fails, the storage cluster recovers by replicating the data it contained to other OSDs within
the same failure domain and device class as the failed OSD.
The failure of a larger domain such as a rack means that the storage cluster utilizes considerably more bandwidth. When building a
storage cluster consisting of multiple racks, which is common for large storage implementations, consider utilizing as much network
bandwidth between switches in a "fat tree" design for optimal performance. A typical 10 GB Ethernet switch has 48 10 GB ports and
four 40 GB ports. Use the 40 GB ports on the spine for maximum throughput. Alternatively, consider aggregating unused 10 GB ports
with QSFP+ and SFP+ cables into more 40 GB ports to connect to other rack and spine routers. Also, consider using LACP mode 4 to
bond network interfaces. Additionally, use jumbo frames, with a maximum transmission unit (MTU) of 9000, especially on the
backend or cluster network.
Most performance-related problems in Ceph usually begin with a networking issue. Simple network issues like a kinked or bent Cat-6
cable could result in degraded bandwidth. Use a minimum of 10 GB ethernet for the front side network. For large clusters, consider
using 40 GB ethernet for the backend or cluster network.
IMPORTANT:
For network optimization, IBM recommends using jumbo frames for a better CPU per bandwidth ratio, and a non-blocking network switch back-plane. IBM Storage Ceph requires the same MTU value throughout all networking devices in the communication path, end-to-end for both public and cluster networks. Verify that the MTU value is the same on all hosts and networking equipment in the environment before using an IBM Storage Ceph cluster in production.
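As an illustrative check, not taken from this guide, you can confirm the MTU on each host and verify that jumbo frames pass end to end before production use; the interface name and IP address below are examples only:
[root@host01 ~]# ip link show ens3 | grep mtu
[root@host01 ~]# ping -M do -s 8972 -c 4 192.168.122.76   # 8972 bytes = 9000 MTU minus 28 bytes of IP and ICMP headers
If the ping fails with a "message too long" error while the interface reports an MTU of 9000, a device along the path is still using a smaller MTU.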
One of the most important steps in a successful Ceph deployment is identifying a price-to-performance profile suitable for the
storage cluster’s use case and workload. It is important to choose the right hardware for the use case. For example, choosing IOPS-
optimized hardware for a cold storage application increases hardware costs unnecessarily. Whereas, choosing capacity-optimized
hardware for its more attractive price point in an IOPS-intensive workload will likely lead to unhappy users complaining about slow
performance.
Use cases, cost versus benefit performance tradeoffs, and data durability are the primary considerations that help develop a sound
storage strategy.
Use Cases
Ceph provides massive storage capacity, and it supports numerous use cases, such as:
The Ceph Block Device client is a leading storage backend for cloud platforms that provides limitless storage for volumes and
images with high performance features like copy-on-write cloning.
The Ceph Object Gateway client is a leading storage backend for cloud platforms that provides a RESTful S3-compliant and
Swift-compliant object storage for objects like audio, bitmap, video, and other data.
Faster is better. Bigger is better. High durability is better. However, there is a price for each superlative quality, and a corresponding
cost versus benefit tradeoff. Consider the following use cases from a performance perspective: SSDs can provide very fast storage
for relatively small amounts of data and journaling. Storing a database or object index can benefit from a pool of very fast SSDs, but
proves too expensive for other data. SAS drives with SSD journaling provide fast performance at an economical price for volumes and
images. SATA drives without SSD journaling provide cheap storage with lower overall performance. When you create a CRUSH
hierarchy of OSDs, you need to consider the use case and an acceptable cost versus performance tradeoff.
In large scale storage clusters, hardware failure is an expectation, not an exception. However, data loss and service interruption
remain unacceptable. For this reason, data durability is very important. Ceph addresses data durability with multiple replica copies
of an object or with erasure coding and multiple coding chunks. Multiple copies or multiple coding chunks present an additional cost
versus benefit tradeoff: it is cheaper to store fewer copies or coding chunks, but it can lead to the inability to service write requests in
a degraded state. Generally, one object with two additional copies, or two coding chunks can allow a storage cluster to service writes
in a degraded state while the storage cluster recovers.
Replication stores one or more redundant copies of the data across failure domains in case of a hardware failure. However,
redundant copies of data can become expensive at scale. For example, to store 1 petabyte of data with triple replication would
require a cluster with at least 3 petabytes of storage capacity.
Erasure coding stores data as data chunks and coding chunks. In the event of a lost data chunk, erasure coding can recover the lost
data chunk with the remaining data chunks and coding chunks. Erasure coding is substantially more economical than replication. For
example, using erasure coding with 8 data chunks and 3 coding chunks provides the same redundancy as 3 copies of the data.
However, such an encoding scheme uses approximately 1.5x the initial data stored compared to 3x with replication.
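As a rough worked example of that overhead difference, consider keeping 1 PB of data:
Replication (3x): 1 PB x 3 = 3 PB of raw capacity
Erasure coding (k=8, m=3): 1 PB x (8+3)/8 = 1.375 PB of raw capacity
The erasure-coded layout also survives the loss of any three chunks, whereas 3x replication survives the loss of two copies.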
The CRUSH algorithm aids this process by ensuring that Ceph stores additional copies or coding chunks in different locations within
the storage cluster. This ensures that the failure of a single storage device or host does not lead to a loss of all of the copies or coding
chunks necessary to preclude data loss. You can plan a storage strategy with cost versus benefit tradeoffs, and data durability in
mind, then present it to a Ceph client as a storage pool.
IMPORTANT:
ONLY the data storage pool can use erasure coding. Pools storing service data and bucket indexes use replication.
IMPORTANT:
Ceph’s object copies or coding chunks make RAID solutions obsolete. Do not use RAID, because Ceph already handles data
durability, a degraded RAID has a negative impact on performance, and recovering data using RAID is substantially slower than using
deep copies or erasure coding chunks.
Reference
By using containers, you can colocate one daemon from the following list with a Ceph OSD daemon (ceph-osd). Additionally, you can colocate the Ceph Object Gateway (radosgw), the Ceph Metadata Server (ceph-mds), and Grafana with a Ceph OSD daemon plus a daemon from the list below.
Colocating Ceph daemons can be done from the command line interface, by using the --placement option to the ceph orch
command, or you can use a service specification YAML file.
[ceph: root@host01 /]# ceph orch apply mon --placement="host1 host2 host3"
service_type: mon
placement:
hosts:
- host01
- host02
- host03
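If the service specification above is saved to a file, it can be applied with the ceph orch apply -i command; the file name here is only an example:
[ceph: root@host01 /]# ceph orch apply -i mon-spec.yaml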
The diagrams below show the difference between storage clusters with colocated and non-colocated daemons.
To the Ceph client interface that reads and writes data, a Ceph storage cluster appears as a simple pool where the client stores data.
However, the storage cluster performs many complex operations in a manner that is completely transparent to the client interface.
Ceph clients and Ceph object storage daemons, referred to as Ceph OSDs, or simply OSDs, both use the Controlled Replication Under Scalable Hashing (CRUSH) algorithm.
A CRUSH map describes a topology of cluster resources, and the map exists both on client hosts as well as Ceph Monitor hosts
within the cluster. Ceph clients and Ceph OSDs both use the CRUSH map and the CRUSH algorithm. Ceph clients communicate
directly with OSDs, eliminating a centralized object lookup and a potential performance bottleneck. With awareness of the CRUSH
map and communication with their peers, OSDs can handle replication, backfilling, and recovery—allowing for dynamic failure
recovery.
Ceph uses the CRUSH map to implement failure domains. Ceph also uses the CRUSH map to implement performance domains,
which simply take the performance profile of the underlying hardware into consideration. The CRUSH map describes how Ceph
stores data, and it is implemented as a simple hierarchy, specifically an acyclic graph, and a ruleset. The CRUSH map can support
multiple hierarchies to separate one type of hardware performance profile from another. Ceph implements performance domains
with device "classes".
Hard disk drives (HDDs) are typically appropriate for cost and capacity-focused workloads.
Throughput-sensitive workloads typically use HDDs with Ceph write journals on solid state drives (SSDs).
Workloads
IMPORTANT: Carefully consider the workload being run by IBM Storage Ceph clusters BEFORE considering what hardware to
purchase, because it can significantly impact the price and performance of the storage cluster. For example, if the workload is
capacity-optimized and the hardware is better suited to a throughput-optimized workload, then hardware will be more expensive
than necessary. Conversely, if the workload is throughput-optimized and the hardware is better suited to a capacity-optimized
workload, then the storage cluster can suffer from poor performance.
3x replication for hard disk drives (HDDs) or 2x replication for solid state drives (SSDs).
Throughput optimized: Throughput-optimized deployments are suitable for serving up significant amounts of data, such as
graphic, audio, and video content. Throughput-optimized deployments require high bandwidth networking hardware,
controllers, and hard disk drives with fast sequential read and write characteristics. If fast data access is a requirement, then
use a throughput-optimized storage strategy. Also, if fast write performance is a requirement, using Solid State Disks (SSD) for
journals will substantially improve write performance.
3x replication.
Capacity optimized: Capacity-optimized deployments are suitable for storing significant amounts of data as inexpensively as
possible. Capacity-optimized deployments typically trade performance for a more attractive price point. For example,
capacity-optimized deployments often use slower and less expensive SATA drives and co-locate journals rather than using
SSDs for journaling.
Object archive.
IMPORTANT:
IBM recommends identifying realm, zone group and zone names BEFORE creating Ceph’s storage pools. Prepend some pool names
with the zone name as a standard naming convention.
Reference
.rgw.root
.default.rgw.control
.default.rgw.meta
.default.rgw.log
.default.rgw.buckets.index
.default.rgw.buckets.data
.default.rgw.buckets.non-ec
NOTE: The .default.rgw.buckets.index pool is created only after the bucket is created in Ceph Object Gateway, while the
.default.rgw.buckets.data pool is created after the data is uploaded to the bucket.
Consider creating these pools manually so you can set the CRUSH ruleset and the number of placement groups. In a typical
configuration, the pools that store the Ceph Object Gateway’s administrative data will often use the same CRUSH ruleset, and use
fewer placement groups, because there are 10 pools for the administrative data.
IBM recommends that the .rgw.root pool and the service pools use the same CRUSH hierarchy, and use at least node as the
failure domain in the CRUSH rule. IBM recommends using replicated for data durability, and NOT erasure for the .rgw.root
pool, and the service pools.
The mon_pg_warn_max_per_osd setting warns you if too many placement groups are assigned per OSD, 300 by default. You can adjust the value to suit your needs and the capabilities of your hardware, where n is the maximum number of PGs per OSD:
mon_pg_warn_max_per_osd = n
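For example, a sketch of raising the threshold in the centralized configuration database, using the option name given above and an illustrative value, might look like the following:
[ceph: root@host01 /]# ceph config set global mon_pg_warn_max_per_osd 400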
IMPORTANT:
Garbage collection uses the .log pool with regular RADOS objects instead of OMAP. In future releases, more features will store
metadata on the .log pool. Therefore, IBM recommends using NVMe/SSD Ceph OSDs for the .log pool.
.rgw.root Pool
The pool where the Ceph Object Gateway configuration is stored. This includes realms, zone groups, and zones. By convention, its
name is not prepended with the zone name.
Service Pools
The service pools store objects related to service control, garbage collection, logging, user information, and usage. By convention,
these pool names have the zone name prepended to the pool name.
.ZONE_NAME.log : The log pool contains logs of all bucket, container, and object actions, such as create, read, update, and
delete.
.ZONE_NAME.rgw.meta : The metadata pool stores user_keys and other critical metadata.
.ZONE_NAME.meta:users.keys : The keys pool contains access keys and secret keys for each user ID.
.ZONE_NAME.meta:users.email : The email pool contains email addresses associated to a user ID.
.ZONE_NAME.meta:users.swift : The Swift pool contains the Swift subuser information for a user ID.
Reference
About pools
Index pool
When selecting OSD hardware for use with a Ceph Object Gateway--irrespective of the use case--an OSD node that has at least one
high performance drive, either an SSD or NVMe drive, is required for storing the index pool. This is particularly important when
buckets contain a large number of objects.
For IBM Storage Ceph running BlueStore, IBM recommends deploying an NVMe drive as a block.db device, rather than as a
separate pool.
Ceph Object Gateway index data is written only into an object map (OMAP). OMAP data for BlueStore resides on the block.db
device on an OSD. When an NVMe drive functions as a block.db device for an HDD OSD and when the index pool is backed by HDD
OSDs, the index data will ONLY be written to the block.db device. As long as the block.db partition or logical volume is properly sized at 4% of the block device, this configuration is all that is needed for BlueStore.
NOTE: IBM does not support HDD devices for index pools. For more information on supported configurations, see the Red Hat Ceph
Storage: Supported configurations article.
An index entry is approximately 200 bytes of data, stored as an OMAP in rocksdb. While this is a trivial amount of data, some uses of Ceph Object Gateway can result in tens or hundreds of millions of objects in a single bucket. By mapping the index pool to a CRUSH hierarchy of high performance storage media, the reduced latency provides a significant performance improvement when buckets contain very large numbers of objects.
IMPORTANT: In a production cluster, a typical OSD node will have at least one SSD or NVMe drive for storing the OSD journal and the
index pool or block.db device, using separate partitions or logical volumes on the same physical drive.
Data pool
The data pool is where the Ceph Object Gateway stores the object data for a particular storage policy. The data pool has a full
complement of placement groups (PGs), not the reduced number of PGs for service pools. Consider using erasure coding for the data
pool, as it is substantially more efficient than replication, and can significantly reduce the capacity requirements while maintaining
data durability.
To use erasure coding, create an erasure code profile. See Erasure code profiles
IMPORTANT: Choosing the correct profile is important because you cannot change the profile after you create the pool. To modify a
profile, you must create a new pool with a different profile and migrate the objects from the old pool to the new pool.
The default configuration is two data chunks(k) and two encoding chunks(m), which means only one OSD can be lost. For higher
resiliency, consider a larger number of data and encoding chunks. For example, some large scale systems use 8 data chunks and 3
encoding chunks, which allows 3 OSDs to fail without losing data.
IMPORTANT: Each data and encoding chunk SHOULD get stored on a different node or host at a minimum. For smaller storage
clusters, this makes using rack impractical as the minimum CRUSH failure domain for a larger number of data and encoding chunks.
Consequently, it is common for the data pool to use a separate CRUSH hierarchy with host as the minimum CRUSH failure domain.
IBM recommends host as the minimum failure domain. If erasure code chunks get stored on Ceph OSDs within the same host, a
host failure, such as a failed journal or network card, could lead to data loss.
To create a data pool, run the ceph osd pool create command with the pool name, the number of PGs and PGPs, the erasure
data durability method, the erasure code profile, and the name of the rule.
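A minimal sketch of those steps, assuming an erasure code profile named rgw-ec-8-3, a zone named us-east, and illustrative PG counts (adjust all names and numbers for your environment; a CRUSH rule name can be given as a final argument):
[ceph: root@host01 /]# ceph osd erasure-code-profile set rgw-ec-8-3 k=8 m=3 crush-failure-domain=host
[ceph: root@host01 /]# ceph osd pool create us-east.rgw.buckets.data 256 256 erasure rgw-ec-8-3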
NOTE: The placement group (PG) per Pool Calculator recommends a smaller number of PGs per pool for the data_extra_pool;
however, the PG count is approximately twice the number of PGs as the service pools and the same as the bucket index pool.
To create a data extra pool, run the ceph osd pool create command with the pool name, the number of PGs and PGPs, the
replicated data durability method, and the name of the rule. For example:
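A sketch of such a command, assuming a zone named us-east and a rule named rgw-service (both names are illustrative):
[ceph: root@host01 /]# ceph osd pool create us-east.rgw.buckets.non-ec 32 32 replicated rgw-service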
IMPORTANT: The default rbd pool can use the default CRUSH rule. DO NOT delete the default rule or hierarchy if Ceph clients have
used them to store client data.
Service Pools: At least one CRUSH hierarchy will be for service pools and potentially for data. The service pools include
.rgw.root and the service pools associated with the zone. Service pools typically fall under a single CRUSH hierarchy, and
use replication for data durability. A data pool may also use the CRUSH hierarchy, but the pool will usually be configured with
erasure coding for data durability.
Index: At least one CRUSH hierarchy SHOULD be for the index pool, where the CRUSH hierarchy maps to high performance
media, such as SSD or NVMe drives. Bucket indices can be a performance bottleneck. IBM recommends to use SSD or NVMe
drives in this CRUSH hierarchy. Create partitions for indices on SSDs or NVMe drives used for Ceph OSD journals. Additionally,
an index should be configured with bucket sharding.
Placement Pools: The placement pools for each placement target include the bucket index, the data bucket, and the bucket
extras. These pools can fall under separate CRUSH hierarchies. Since the Ceph Object Gateway can support multiple storage
policies, the bucket pools of the storage policies may be associated with different CRUSH hierarchies, reflecting different use
cases, such as IOPS-optimized, throughput-optimized, and capacity-optimized. The bucket index pool SHOULD use its own
CRUSH hierarchy to map the bucket index pool to higher performance storage media, such as SSD or NVMe drives.
In the following examples, the hosts named data0, data1, and data2 use extended logical names, such as data0-sas-ssd,
data0-index, and so forth in the CRUSH map, because there are multiple CRUSH hierarchies pointing to the same physical hosts.
A typical CRUSH root might represent nodes with SAS drives and SSDs for journals. For example:
##
# SAS-SSD ROOT DECLARATION
##
root sas-ssd {
id -1 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item data2-sas-ssd weight 4.000
item data1-sas-ssd weight 4.000
item data0-sas-ssd weight 4.000
}
A CRUSH root for bucket indexes SHOULD represent high performance media, such as SSD or NVMe drives. Consider creating
partitions on SSD or NVMe media that store OSD journals. For example:
##
# INDEX ROOT DECLARATION
##
root index {
id -2 # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
item data2-index weight 1.000
item data1-index weight 1.000
item data0-index weight 1.000
}
NOTE: The default rbd pool may use this rule. DO NOT delete the default rule if other pools have used it to store customer data.
For general details on CRUSH rules, see the CRUSH rules section. To manually edit a CRUSH map, see the Editing a CRUSH map.
For each CRUSH hierarchy, create a CRUSH rule. The following example illustrates a rule for the CRUSH hierarchy that will store the
service pools, including .rgw.root. In this example, the root sas-ssd serves as the main CRUSH hierarchy. It uses the name rgw-
service to distinguish itself from the default rule. The step take sas-ssd line tells the pool to use the sas-ssd root created in
Creating CRUSH roots, whose child buckets contain OSDs with SAS drives and high performance storage media, such as SSD or
NVMe drives, for journals in a high throughput hardware configuration. The type rack portion of step chooseleaf is the failure
domain. In the following example, it is a rack.
##
# SERVICE RULE DECLARATION
##
rule rgw-service {
type replicated
min_size 1
max_size 10
step take sas-ssd
step chooseleaf firstn 0 type rack
step emit
}
NOTE: In the foregoing example, if data gets replicated three times, there should be at least three racks in the cluster containing a
similar number of OSD nodes.
TIP: The type replicated setting has NOTHING to do with data durability, the number of replicas, or the erasure coding. Only
replicated is supported.
The following example illustrates a rule for the CRUSH hierarchy that will store the data pool. In this example, the root sas-ssd
serves as the main CRUSH hierarchy—the same CRUSH hierarchy as the service rule. It uses rgw-throughput to distinguish itself
from the default rule and rgw-service. The step take sas-ssd line tells the pool to use the sas-ssd root created in Creating
CRUSH roots, whose child buckets contain OSDs with SAS drives and high performance storage media, such as SSD or NVMe drives,
in a high throughput hardware configuration. The type host portion of step chooseleaf is the failure domain. In the following
example, it is a host. Notice that the rule uses the same CRUSH hierarchy, but a different failure domain.
##
# THROUGHPUT RULE DECLARATION
##
rule rgw-throughput {
type replicated
min_size 1
max_size 10
step take sas-ssd
step chooseleaf firstn 0 type host
step emit
}
NOTE: In the foregoing example, if the pool uses erasure coding with a larger number of data and encoding chunks than the default,
there should be at least as many racks in the cluster containing a similar number of OSD nodes to facilitate the erasure coding
chunks. For smaller clusters, this may not be practical, so the foregoing example uses host as the CRUSH failure domain.
The following example illustrates a rule for the CRUSH hierarchy that stores the index pool. In this example, the root index serves as
the main CRUSH hierarchy. It uses rgw-index to distinguish itself from rgw-service and rgw-throughput. The step take index line tells
the pool to use the index root created in Creating CRUSH roots, whose child buckets contain high performance storage media, such
as SSD or NVMe drives, or partitions on SSD or NVMe drives that also store OSD journals. The type rack portion of step chooseleaf is
the failure domain. In the following example, it is a rack.
##
# INDEX RULE DECLARATION
##
rule rgw-index {
type replicated
min_size 1
max_size 10
step take index
step chooseleaf firstn 0 type rack
step emit
}
Reference
CRUSH Administration
Multi-site configurations require a primary zone group and a primary zone. Additionally, each zone group requires a primary zone.
Zone groups might have one or more secondary zones.
IMPORTANT: The primary zone within the primary zone group of a realm is responsible for storing the primary copy of the realm’s
metadata, including users, quotas, and buckets. This metadata gets synchronized to secondary zones and secondary zone groups
automatically. Metadata operations issued with the radosgw-admin command line interface (CLI) MUST be issued on a node within
the primary zone of the primary zone group to ensure that they synchronize to the secondary zone groups and zones. Currently, it is
possible to issue metadata operations on secondary zones and zone groups, but it is NOT recommended because they WILL NOT be
synchronized, which can lead to fragmentation of the metadata.
The diagrams below illustrate the possible one-realm and two-realm configurations in multi-site Ceph Object Gateway environments.
The following examples are common sizes for Ceph storage clusters.
Sizing includes current needs and near future needs. Consider the rate at which the gateway client will add new data to the cluster.
That can differ from use-case to use-case. For example, recording 4k videos or storing medical images can add significant amounts
of data faster than less storage-intensive information, such as financial market data. Additionally, consider that the data durability
methods, such as replication versus erasure coding, can have a significant impact on the storage media required.
Reference
For additional information on sizing, see the Hardware section and its associated links for selecting OSD hardware.
Another factor that favors more nodes over higher storage density is erasure coding. When writing an object using erasure coding
and using node as the minimum CRUSH failure domain, the Ceph storage cluster will need as many nodes as data and coding
chunks. For example, a cluster using k=8, m=3 should have at least 11 nodes so that each data or coding chunk is stored on a
separate node.
Hot-swapping is also an important consideration. Most modern servers support drive hot-swapping. However, some hardware
configurations require removing more than one drive to replace a drive. IBM recommends avoiding such configurations, because they
can bring down more Ceph OSDs than required when swapping out failed disks.
[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
The ceph-manager daemon handles PG queries, so the cluster map should not impact performance.
Adjusting scrubbing
By default, Ceph performs light scrubbing daily and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to
ensure that PGs are storing the same object data. Over time, disk sectors can go bad irrespective of object sizes and checksums.
Deep scrubbing checks an object’s content with that of its replicas to ensure that the actual contents are the same. In this respect,
deep scrubbing ensures data integrity in the manner of fsck, but the procedure imposes an I/O penalty on the cluster. Even light
scrubbing can impact I/O.
The default settings may allow Ceph OSDs to initiate scrubbing at inopportune times, such as peak operating times or periods with
heavy loads. End users may experience latency and poor performance when scrubbing operations conflict with end user operations.
To prevent end users from experiencing poor performance, Ceph provides a number of scrubbing settings that can limit scrubbing to
periods with lower loads or during off-peak hours. See Scrubbing the OSD for more details.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours. For
example:
[osd]
osd_scrub_begin_hour = 23 # 23:00, or 11:00 PM
osd_scrub_end_hour = 6 # 06:00, or 6:00 AM
If time constraints aren’t an effective method of determining a scrubbing schedule, consider using the
osd_scrub_load_threshold. The default value is 0.5, but it could be modified for low load conditions. For example:
[osd]
osd_scrub_load_threshold = 0.25
Increase rgw_thread_pool_size
To improve scalability, you can edit the value of the rgw_thread_pool_size parameter, which is the size of the thread pool. The
new Beast front end is not restricted by the thread pool size when accepting new connections.
rgw_thread_pool_size = 512
Increase objecter_inflight_ops
To improve scalability, you can edit the value of the objecter_inflight_ops parameter, which specifies the maximum number of
unsent I/O requests allowed. This parameter is used for client traffic control.
objecter_inflight_ops = 24576
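Both of the preceding values are typically stored in the Ceph configuration database; a sketch of setting them with ceph config set, using the values shown above, might look like the following:
[ceph: root@host01 /]# ceph config set client.rgw rgw_thread_pool_size 512
[ceph: root@host01 /]# ceph config set client.rgw objecter_inflight_ops 24576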
The Ceph Object Gateway can hang if it runs out of file descriptors. You can modify the /etc/security/limits.conf file on Ceph
Object Gateway hosts to increase the file descriptors for the Ceph Object Gateway.
When running Ceph administrative commands on large storage clusters, for example, with 1024 Ceph OSDs or more, create an
/etc/security/limits.d/50-ceph.conf file on each host that runs administrative commands with the following contents:
Replace USER_NAME with the name of the non-root user account that runs the Ceph administrative commands.
NOTE: The root user’s ulimit value is already set to unlimited by default on Red Hat Enterprise Linux.
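The exact contents of these limits files depend on which limits you need to raise; a purely hypothetical entry, not taken from this guide, might look like the following, using the limits.conf format of domain, type, item, and value:
USER_NAME soft nofile 65536
USER_NAME hard nofile 65536
USER_NAME soft nproc unlimited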
Deployment
As a storage administrator, you can deploy the Ceph Object Gateway using the Ceph Orchestrator with the command line interface or
the service specification. You can also configure multi-site Ceph Object Gateways, and remove the Ceph Object Gateway using the
Ceph Orchestrator.
The cephadm command deploys the Ceph Object Gateway as a collection of daemons that manages a single-cluster deployment or a
particular realm and zone in a multi-site deployment.
NOTE: With cephadm, the Ceph Object Gateway daemons are configured using the Ceph Monitor configuration database instead of
the ceph.conf file or the command line options. If the configuration is not in the client.rgw section, then the Ceph Object
Gateway daemons start up with default settings and bind to port 80.
WARNING: If you want Cephadm to handle the setting of a realm and zone, specify the realm and zone in the service specification
during the deployment of the Ceph Object Gateway. If you want to change that realm or zone at a later point, ensure to update and
reapply the rgw_realm and rgw_zone parameters in the specification file. If you want to handle these options manually without
Cephadm, do not include them in the service specification. Cephadm still deploys the Ceph Object Gateway daemons without setting
the configuration option for which realm or zone the daemons should use. In this case, the update of the specification file is not
necessary.
Deploying the Ceph Object Gateway using the command line interface
Deploying the Ceph Object Gateway using the service specification
Deploying a multi-site Ceph Object Gateway using the Ceph Orchestrator
Removing the Ceph Object Gateway using the Ceph Orchestrator
Prerequisites
All the managers, monitors, and OSDs are deployed in the storage cluster.
Prerequisites
Log in to the Cephadm shell by using the cephadm shell to deploy Ceph Object Gateway daemons.
Procedure
Method 1:
1. You can deploy the Ceph object gateway daemons in three different ways:
Create realm, zone group, zone, and then use the placement specification with the host name:
1. Create a realm:
Syntax
Example
Syntax
Example
3. Create a zone:
Syntax
Example
Syntax
Example
Syntax
Example
[ceph: root@host01 /]# ceph orch apply rgw test --realm=test_realm --zone=test_zone --placement="2 host01 host02"
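For reference, a sketch of the realm, zone group, and zone creation steps that precede the command above, using the names from the example and an illustrative zone group named default (flags such as --master and --default depend on the deployment):
[ceph: root@host01 /]# radosgw-admin realm create --rgw-realm=test_realm --default
[ceph: root@host01 /]# radosgw-admin zonegroup create --rgw-zonegroup=default --master --default
[ceph: root@host01 /]# radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=test_zone --master --default
[ceph: root@host01 /]# radosgw-admin period update --rgw-realm=test_realm --commit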
Method 2:
Use an arbitrary service name to deploy two Ceph Object Gateway daemons for a single cluster deployment:
Syntax
Example
Method 3:
Syntax
NUMBER_OF_DAEMONS controls the number of Ceph object gateways deployed on each host. To achieve the highest
performance without incurring an additional cost, set this value to 2.
Example
[ceph: root@host01 /]# ceph orch host label add host01 rgw # the 'rgw' label can be anything
[ceph: root@host01 /]# ceph orch host label add host02 rgw
[ceph: root@host01 /]# ceph orch apply rgw foo "--placement=label:rgw count-per-host:2" --port=8000
Verification
Example
Syntax
Example
Prerequisites
Procedure
Example
2. Edit the radosgw.yml file to include the following details for the default realm, zone, and zone group:
Syntax
service_type: rgw
service_id: REALM_NAME.ZONE_NAME
placement:
hosts:
- HOST_NAME_1
- HOST_NAME_2
count-per-host: NUMBER_OF_DAEMONS
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
rgw_zonegroup: ZONE_GROUP_NAME
rgw_frontend_port: FRONT_END_PORT
networks:
- NETWORK_CIDR # Ceph Object Gateway service binds to a specific network
NUMBER_OF_DAEMONS controls the number of Ceph Object Gateways deployed on each host. To achieve the highest
performance without incurring an additional cost, set this value to 2.
Example
service_type: rgw
service_id: default
placement:
hosts:
- host01
- host02
- host03
count-per-host: 2
spec:
rgw_realm: default
rgw_zone: default
rgw_zonegroup: default
rgw_frontend_port: 1234
networks:
- 192.169.142.0/24
Example
Example
service_type: rgw
service_id: test_realm.test_zone
placement:
hosts:
- host01
- host02
- host03
count-per-host: 2
spec:
rgw_realm: test_realm
rgw_zone: test_zone
rgw_zonegroup: test_zonegroup
rgw_frontend_port: 1234
networks:
- 192.169.142.0/24
Example
NOTE: Every time you exit the shell, you have to mount the file in the container before deploying the daemon.
Syntax
Example
Verification
Example
Syntax
Example
You can configure each object gateway to work in an active-active zone configuration allowing writes to a non-primary zone. The
multi-site configuration is stored within a container called a realm.
The realm stores zone groups, zones, and a time period. The rgw daemons handle the synchronization eliminating the need for a
separate synchronization agent, thereby operating with an active-active configuration.
You can also deploy multi-site zones using the command line interface (CLI).
NOTE: The following configuration assumes at least two IBM Storage Ceph clusters are in geographically separate
locations. However, the configuration also works on the same site.
Prerequisites
At least two Ceph Object Gateway instances, one for each IBM Storage Ceph cluster.
Procedure
a. Create a realm:
Syntax
Example
If the storage cluster has a single realm, then specify the --default flag.
Syntax
Example
Syntax
Example
d. Optional: Delete the default zone, zone group, and the associated pools.
IMPORTANT: Do not delete the default zone and its pools if you are using the default zone and zone group to store
data. Also, removing the default zone group deletes the system user.
To access old data in the default zone and zonegroup, use --rgw-zone default and --rgw-zonegroup
default in radosgw-admin commands.
Example
Syntax
Example
f. Add the access key and system key to the primary zone:
Syntax
Example
Syntax
Example
h. Outside the cephadm shell, fetch the FSID of the storage cluster and the processes:
Example
Example
Syntax
Example
Syntax
Example
Syntax
Example
IMPORTANT: Do not delete the default zone and its pools if you are using the default zone and zone group to store
data. To access old data in the default zone and zonegroup, use --rgw-zone default and --rgw-zonegroup
default in radosgw-admin commands.
Example
Syntax
Example
Syntax
Example
g. Outside the Cephadm shell, fetch the FSID of the storage cluster and the processes:
Example
Syntax
Example
3. Optional: Deploy multi-site Ceph Object Gateways using the placement specification:
Syntax
Example
[ceph: root@host04 /]# ceph orch apply rgw east --realm=test_realm --zone=us-east-1 --placement="2 host01 host02"
Verification
Example
Prerequisites
Procedure
Example
Example
Syntax
Example
Verification
Syntax
ceph orch ps
Example
Reference
Deploying the Ceph object gateway using the command line interface
Basic configuration
As a storage administrator, learning the basics of configuring the Ceph Object Gateway is important. You can learn about the defaults
and the embedded web server called Beast. For troubleshooting issues with the Ceph Object Gateway, you can adjust the logging
and debugging output generated by the Ceph Object Gateway. Also, you can provide a High-Availability proxy for storage cluster
access using the Ceph Object Gateway.
Prerequisites
Procedure
1. To use Ceph with S3-style subdomains, add a wildcard to the DNS record of the DNS server that the ceph-radosgw daemon
uses to resolve domain names:
Syntax
bucket-name.domain-name.com
For dnsmasq, add the following address setting with a dot (.) prepended to the host name:
Syntax
address=/.HOSTNAME_OR_FQDN/HOST_IP_ADDRESS
Example
address=/.gateway-host01/192.168.122.75
Example
$TTL 604800
@ IN SOA gateway-host01. root.gateway-host01. (
2 ; Serial
604800 ; Refresh
86400 ; Retry
2419200 ; Expire
604800 ) ; Negative Cache TTL
;
@ IN NS gateway-host01.
@ IN A 192.168.122.113
* IN CNAME @
2. Restart the DNS server and ping the server with a subdomain to ensure that the ceph-radosgw daemon can process the
subdomain requests:
Syntax
ping mybucket.HOSTNAME
3. If the DNS server is on the local machine, you might need to modify /etc/resolv.conf by adding a nameserver entry for
the local machine.
4. Add the host name in the Ceph Object Gateway zone group:
Syntax
Example
Example
Example
Example
Syntax
Example
Example
g. Restart the Ceph Object Gateway so that the DNS setting takes effect.
Reference
IMPORTANT: Prevent unauthorized access to the .pem file, because it contains the secret key hash.
IMPORTANT: IBM recommends obtaining a certificate from a CA with the Subject Alternative Name (SAN) field, and a wildcard for
use with S3-style subdomains.
IMPORTANT: IBM recommends only using SSL with the Beast front-end web server for small to medium sized test environments. For
production environments, you must use HAProxy and keepalived to terminate the SSL connection at the HAProxy.
If the Ceph Object Gateway acts as a client and a custom certificate is used on the server, set the rgw_verify_ssl parameter to
false because injecting a custom CA to Ceph Object Gateways is currently unavailable.
Example
Prerequisites
Procedure
Example
2. Open the rgw.yml file for editing, and customize it for the environment:
Syntax
service_type: rgw
service_id: SERVICE_ID
service_name: SERVICE_NAME
placement:
hosts:
- HOST_NAME
spec:
ssl: true
rgw_frontend_ssl_certificate: CERT_HASH
Example
service_type: rgw
service_id: foo
service_name: rgw.foo
placement:
hosts:
- host01
spec:
ssl: true
rgw_frontend_ssl_certificate: |
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEA+Cf4l9OagD6x67HhdCy4Asqw89Zz9ZuGbH50/7ltIMQpJJU0
gu9ObNtIoC0zabJ7n1jujueYgIpOqGnhRSvsGJiEkgN81NLQ9rqAVaGpadjrNLcM
bpgqJCZj0vzzmtFBCtenpb5l/EccMFcAydGtGeLP33SaWiZ4Rne56GBInk6SATI/
JSKweGD1y5GiAWipBR4C74HiAW9q6hCOuSdp/2WQxWT3T1j2sjlqxkHdtInUtwOm
j5Ism276IndeQ9hR3reFR8PJnKIPx73oTBQ7p9CMR1J4ucq9Ny0J12wQYT00fmJp
-----END RSA PRIVATE KEY-----
-----BEGIN CERTIFICATE-----
MIIEBTCCAu2gAwIBAgIUGfYFsj8HyA9Zv2l600hxzT8+gG4wDQYJKoZIhvcNAQEL
BQAwgYkxCzAJBgNVBAYTAklOMQwwCgYDVQQIDANLQVIxDDAKBgNVBAcMA0JMUjEM
MAoGA1UECgwDUkhUMQswCQYDVQQLDAJCVTEkMCIGA1UEAwwbY2VwaC1zc2wtcmhj
czUtOGRjeHY2LW5vZGU1MR0wGwYJKoZIhvcNAQkBFg5hYmNAcmVkaGF0LmNvbTAe
-----END CERTIFICATE-----
3. Deploy the Ceph Object Gateway using the service specification file:
Example
IMPORTANT: Verbose logging can generate over 1 GB of data per hour. This type of logging can potentially fill up the operating
system’s disk, causing the operating system to stop functioning.
Procedure
1. Set the following parameter to increase the Ceph Object Gateway logging output:
Syntax
Example
Syntax
Example
2. Optionally, you can configure the Ceph daemons to log their output to files. Set the log_to_file, and
mon_cluster_log_to_file options to true:
Example
Reference
IBM assumes that each zone will have multiple gateway instances using a load balancer, such as high-availability (HA) Proxy and
keepalived.
IMPORTANT: IBM DOES NOT support using a Ceph Object Gateway instance to deploy both standard S3/Swift APIs and static web
hosting simultaneously.
Reference
1. S3 static web hosting uses Ceph Object Gateway instances that are separate and distinct from instances used for standard
S3/Swift API use cases.
2. Gateway instances hosting S3 static web sites should have separate, non-overlapping domain names from the standard
S3/Swift API gateway instances.
3. Gateway instances hosting S3 static web sites should use separate public-facing IP addresses from the standard S3/Swift API
gateway instances.
4. Gateway instances hosting S3 static web sites load balance, and if necessary terminate SSL, using HAProxy/keepalived.
Syntax
Example
The rgw_enable_static_website setting MUST be true. The rgw_enable_apis setting MUST enable the s3website API.
The rgw_dns_name and rgw_dns_s3website_name settings must provide their fully qualified domains. If the site uses canonical
name extensions, then set the rgw_resolve_cname option to true.
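A sketch of setting these options in the configuration database, reusing the illustrative domain names from the DNS records below:
[ceph: root@host01 /]# ceph config set client.rgw rgw_enable_static_website true
[ceph: root@host01 /]# ceph config set client.rgw rgw_enable_apis s3,s3website
[ceph: root@host01 /]# ceph config set client.rgw rgw_dns_name objects-zonegroup.domain.com
[ceph: root@host01 /]# ceph config set client.rgw rgw_dns_s3website_name objects-website-zonegroup.domain.com
[ceph: root@host01 /]# ceph config set client.rgw rgw_resolve_cname true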
objects-zonegroup.domain.com. IN A 192.0.2.10
objects-zonegroup.domain.com. IN AAAA 2001:DB8::192:0:2:10
*.objects-zonegroup.domain.com. IN CNAME objects-zonegroup.domain.com.
objects-website-zonegroup.domain.com. IN A 192.0.2.20
objects-website-zonegroup.domain.com. IN AAAA 2001:DB8::192:0:2:20
NOTE: The IP addresses in the first two lines differ from the IP addresses in the fourth and fifth lines.
If using Ceph Object Gateway in a multi-site configuration, consider using a routing solution to route traffic to the gateway closest to
the client.
The Amazon Web Service (AWS) requires static web host buckets to match the host name. Ceph provides a few different ways to
configure the DNS, and HTTPS will work if the proxy has a matching certificate.
To use AWS-style S3 subdomains, use a wildcard in the DNS entry which can redirect requests to any bucket. A DNS entry might look
like the following:
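Based on the record layout shown earlier, such a wildcard entry might look like the following (the zone group and domain are illustrative):
*.objects-website-zonegroup.domain.com. IN CNAME objects-website-zonegroup.domain.com.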
Access the bucket name, where the bucket name is bucket1, in the following manner:
http://bucket1.objects-website-zonegroup.domain.com
Ceph supports mapping domain names to buckets without including the bucket name in the request, which is unique to Ceph Object
Gateway. To use a domain name to access a bucket, map the domain name to the bucket name. A DNS entry might look like the
following:
http://www.example.com
AWS typically requires the bucket name to match the domain name. To configure the DNS for static web hosting using CNAME, the
DNS entry might look like the following:
http://www.example.com
If the DNS name contains other non-CNAME records, such as SOA, NS, MX or TXT, the DNS record must map the domain name
directly to the IP address. For example:
www.example.com. IN A 192.0.2.20
www.example.com. IN AAAA 2001:DB8::192:0:2:20
http://www.example.com
1. Create an S3 bucket. The bucket name MIGHT be the same as the website’s domain name. For example, mysite.com may
have a bucket name of mysite.com. This is required for AWS, but it is NOT required for Ceph.
2. Upload the static website content to the bucket. Contents may include HTML, CSS, client-side JavaScript, images, audio/video
content, and other downloadable files. A website MUST have an index.html file and might have an error.html file.
3. Verify the website’s contents. At this point, only the creator of the bucket has access to the contents.
Prerequisites
Capacity for at least two instances of the ingress service running on different hosts
A virtual IP address is automatically configured on one of the ingress hosts at a time, known as the primary host. The Ceph
orchestrator selects the first network interface based on existing IP addresses that are configured as part of the same subnet. In
cases where the virtual IP address does not belong to the same subnet, you can define a list of subnets for the Ceph orchestrator to
match with existing IP addresses. If the keepalived daemon and the active haproxy are not responding on the primary host, then
the virtual IP address moves to a backup host. This backup host becomes the new primary host.
WARNING: Currently, you can not configure a virtual IP address on a network interface that does not have a configured IP address.
Reference
Prerequisites
A minimum of two hosts running Red Hat Enterprise Linux 8, or higher, for installing the ingress service on.
If using a firewall, then open port 80 for HTTP and port 443 for HTTPS traffic.
Procedure
Example
2. Open the ingress.yaml file for editing. Add the following options, and add values applicable to the environment:
Syntax
service_type: ingress
service_id: SERVICE_ID
placement:
hosts:
- HOST1
- HOST2
- HOST3
spec:
backend_service: SERVICE_ID
virtual_ip: IP_ADDRESS/CIDR
frontend_port: INTEGER
monitor_port: INTEGER
virtual_interface_networks:
- IP_ADDRESS/CIDR
ssl_cert: |
service_id - Must match the existing Ceph Object Gateway service name.
Example
service_type: ingress
service_id: rgw.foo
placement:
hosts:
- host01.example.com
- host02.example.com
- host03.example.com
spec:
backend_service: rgw.foo
virtual_ip: 192.168.1.2/24
frontend_port: 8080
monitor_port: 1967
virtual_interface_networks:
- 10.10.0.0/16
ssl_cert: |
-----BEGIN CERTIFICATE-----
MIIEpAIBAAKCAQEA+Cf4l9OagD6x67HhdCy4Asqw89Zz9ZuGbH50/7ltIMQpJJU0
gu9ObNtIoC0zabJ7n1jujueYgIpOqGnhRSvsGJiEkgN81NLQ9rqAVaGpadjrNLcM
bpgqJCZj0vzzmtFBCtenpb5l/EccMFcAydGtGeLP33SaWiZ4Rne56GBInk6SATI/
JSKweGD1y5GiAWipBR4C74HiAW9q6hCOuSdp/2WQxWT3T1j2sjlqxkHdtInUtwOm
j5Ism276IndeQ9hR3reFR8PJnKIPx73oTBQ7p9CMR1J4ucq9Ny0J12wQYT00fmJp
-----END CERTIFICATE-----
-----BEGIN PRIVATE KEY-----
MIIEBTCCAu2gAwIBAgIUGfYFsj8HyA9Zv2l600hxzT8+gG4wDQYJKoZIhvcNAQEL
BQAwgYkxCzAJBgNVBAYTAklOMQwwCgYDVQQIDANLQVIxDDAKBgNVBAcMA0JMUjEM
MAoGA1UECgwDUkhUMQswCQYDVQQLDAJCVTEkMCIGA1UEAwwbY2VwaC1zc2wtcmhj
czUtOGRjeHY2LW5vZGU1MR0wGwYJKoZIhvcNAQkBFg5hYmNAcmVkaGF0LmNvbTAe
-----END PRIVATE KEY-----
Example
4. For a non-default version or a specific hot-fix scenario, configure the latest haproxy and keepalived images:
NOTE: The image names are set as default in cephadm and mgr/cephadm configuration.
Syntax
Example
5. Install and configure the new ingress service using the Ceph orchestrator:
a. On the host running the ingress service, check that the virtual IP address appears:
Example
Syntax
wget HOST_NAME
Example
If this returns an index.html with similar content as in the example below, then the HA configuration for the Ceph
Object Gateway is working properly.
Example
Reference
See the Performing a Standard RHEL Installation Guide for more details.
HAProxy/keepalived Prerequisites
Another use case for HAProxy and keepalived is to terminate HTTPS at the HAProxy server. You can use an HAProxy server to
terminate HTTPS at the HAProxy server and use HTTP between the HAProxy server and the Beast web server instances.
HAProxy/keepalived Prerequisites
Preparing HAProxy Nodes
Installing and Configuring HAProxy
Installing and Configuring keepalived
HAProxy/keepalived Prerequisites
To set up an HAProxy with the Ceph Object Gateway, you must have:
At least two Ceph Object Gateway servers within the same zone are configured to run on port 80. If you follow the simple
installation procedure, the gateway instances are in the same zone group and zone by default. If you are using a federated
architecture, ensure that the instances are in the same zone group and zone.
At least two Red Hat Enterprise Linux 8 servers for HAProxy and keepalived.
NOTE: This section assumes that you have at least two Ceph Object Gateway servers running, and that you get a valid response from
each of them when running test scripts over port 80.
Procedure
1. Install haproxy.
As root, assign the correct SELinux context and file permissions to the haproxy-http.xml file.
[root@haproxy]# cd /etc/firewalld/services
[root@haproxy]# restorecon haproxy-http.xml
[root@haproxy]# chmod 640 haproxy-http.xml
3. If you intend to use HTTPS, configure haproxy for SELinux and HTTPS.
As root, assign the correct SELinux context and file permissions to the haproxy-https.xml file.
# cd /etc/firewalld/services
# restorecon haproxy-https.xml
# chmod 640 haproxy-https.xml
5. Configure haproxy.
The global and defaults sections may remain unchanged. After the defaults section, you will need to configure frontend and
backend sections. For example:
frontend http_web
bind *:80
mode http
default_backend rgw
frontend rgw-https
bind *:443 ssl crt /etc/ssl/private/example.com.pem
default_backend rgw
6. Enable/start haproxy
Prerequisites
Procedure
1. Install keepalived:
vrrp_script chk_haproxy {
script "killall -0 haproxy" # check the haproxy process
interval 2 # every 2 seconds
weight 2 # add 2 points if OK
}
Next, the instance on the primary and backup load balancers uses eno1 as the network interface. It also assigns a virtual IP
address, that is, 192.168.1.20.
vrrp_instance RGW {
state MASTER # might not be necessary. This is on the primary LB node.
@main interface eno1
priority 100
advert_int 1
interface eno1
virtual_router_id 50
@main unicast_src_ip 10.8.128.43 80
unicast_peer {
10.8.128.53
}
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.20
}
track_script {
chk_haproxy
}
}
vrrp_instance RGW {
state BACKUP # might not be necessary?
priority 99
advert_int 1
interface eno1
virtual_router_id 50
unicast_src_ip 10.8.128.53 80
unicast_peer {
10.8.128.43
}
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.20
}
track_script {
chk_haproxy
}
}
virtual_server 192.168.1.20 80 eno1 { #populate correct interface
delay_loop 6
lb_algo wlc
lb_kind dr
persistence_timeout 600
protocol TCP
real_server 10.8.128.43 80 { # ip address of rgw2 on physical interface, haproxy listens
here, rgw listens to localhost:8080 or similar
weight 100
TCP_CHECK { # perhaps change these to a HTTP/SSL GET?
connect_timeout 3
}
}
real_server 10.8.128.53 80 { # ip address of rgw3 on physical interface, haproxy listens
here, rgw listens to localhost:8080 or similar
weight 100
TCP_CHECK { # perhaps change these to a HTTP/SSL GET?
connect_timeout 3
}
}
}
A single zone configuration typically consists of one zone group containing one zone and one or more ceph-radosgw instances
where you may load-balance gateway client requests between the instances. In a single zone configuration, typically multiple
gateway instances point to a single Ceph storage cluster. However, IBM supports several multi-site configuration options for the
Ceph Object Gateway:
Multi-zone: A more advanced configuration consists of one zone group and multiple zones, each zone with one or more
ceph-radosgw instances. Each zone is backed by its own Ceph Storage Cluster. Multiple zones in a zone group provides
disaster recovery for the zone group should one of the zones experience a significant failure. Each zone is active and may
receive write operations. In addition to disaster recovery, multiple active zones may also serve as a foundation for content
delivery networks. To configure multiple zones without replication, see Configuring multiple zones without replication
Multi-zone-group: Formerly called 'regions', the Ceph Object Gateway can also support multiple zone groups, each zone
group with one or more zones. Objects stored to zone groups within the same realm share a global namespace, ensuring
unique object IDs across zone groups and zones.
Multiple Realms: The Ceph Object Gateway supports the notion of realms, which can be a single zone group or multiple zone
groups and a globally unique namespace for the realm. Multiple realms provides the ability to support numerous
configurations and namespaces.
Prerequisites
Edit online
This guide assumes at least two Ceph storage clusters in geographically separate locations; however, the configuration can work on
the same physical site. This guide also assumes four Ceph object gateway servers named rgw1, rgw2, rgw3 and rgw4 respectively.
A multi-site configuration requires a master zone group and a master zone. Additionally, each zone group requires a master
zone. Zone groups might have one or more secondary or non-master zones.
IBM also recommends private Ethernet or Dense wavelength-division multiplexing (DWDM), because a VPN over the internet is not ideal due to the additional overhead incurred.
IMPORTANT: The master zone within the master zone group of a realm is responsible for storing the master copy of the realm’s
metadata, including users, quotas and buckets (created by the radosgw-admin CLI). This metadata gets synchronized to secondary
zones and secondary zone groups automatically. Metadata operations executed with the radosgw-admin CLI MUST be executed
on a host within the master zone of the master zone group in order to ensure that they get synchronized to the secondary zone
groups and zones. Currently, it is possible to execute metadata operations on secondary zones and zone groups, but it is NOT
recommended because they WILL NOT be synchronized, leading to fragmented metadata.
In the following examples, the rgw1 host will serve as the master zone of the master zone group; the rgw2 host will serve as the
secondary zone of the master zone group; the rgw3 host will serve as the master zone of the secondary zone group; and the rgw4
host will serve as the secondary zone of the secondary zone group.
IMPORTANT: When you have a large cluster with more Ceph Object Gateways configured in a multi-site storage cluster, IBM
recommends dedicating no more than three sync-enabled Ceph Object Gateways per site to multi-site synchronization. With more than three syncing Ceph Object Gateways, the sync rate shows diminishing returns, and the increased contention creates an incremental risk of hitting timing-related error conditions. This is due to a sync-fairness known issue, BZ#1740782. For the rest of the Ceph Object Gateways in such a configuration, which are dedicated to client I/O operations
through load balancers, run the ceph config set client.rgw.CLIENT_NODE rgw_run_sync_thread false command to
prevent them from performing sync operations, and then restart the Ceph Object Gateway.
Example
global
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 7000
user haproxy
group haproxy
daemon
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 30s
timeout server 30s
timeout http-keep-alive 10s
timeout check 10s
timeout client-fin 1s
timeout server-fin 1s
maxconn 6000
listen stats
bind 0.0.0.0:1936
maxconn 256
clitimeout 10m
srvtimeout 10m
contimeout 10m
timeout queue 10m
# JTH start
stats enable
stats hide-version
stats refresh 30s
stats show-node
## stats auth admin:password
stats uri /haproxy?stats
stats admin if TRUE
frontend main
bind *:5000
acl url_static path_beg -i /static /images /javascript /stylesheets
acl url_static path_end -i .jpg .gif .png .css .js
use_backend static if url_static
default_backend app
backend static
balance roundrobin
fullconn 6000
server app8 host01:8080 check maxconn 2000
server app9 host02:8080 check maxconn 2000
server app10 host03:8080 check maxconn 2000
backend app
balance roundrobin
fullconn 6000
server app8 host01:8080 check maxconn 2000
server app9 host02:8080 check maxconn 2000
server app10 host03:8080 check maxconn 2000
Pools
IBM recommends using the Ceph Placement Group’s per Pool Calculator to calculate a suitable number of placement groups for the
pools the radosgw daemon will create. Set the calculated values as defaults in the Ceph configuration database.
Example
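A sketch of what setting those defaults might look like, with illustrative placement group values:
[ceph: root@host01 /]# ceph config set global osd_pool_default_pg_num 64
[ceph: root@host01 /]# ceph config set global osd_pool_default_pgp_num 64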
NOTE: Making this change to the Ceph configuration will use those defaults when the Ceph Object Gateway instance creates the
pools. Alternatively, you can create the pools manually.
Pool names particular to a zone follow the naming convention ZONE_NAME.POOL_NAME. For example, a zone named us-east will
have the following pools:
.rgw.root
us-east.rgw.control
us-east.rgw.meta
us-east.rgw.log
us-east.rgw.buckets.data
us-east.rgw.buckets.non-ec
us-east.rgw.meta:users.keys
us-east.rgw.meta:users.email
us-east.rgw.meta:users.swift
us-east.rgw.meta:users.uid
Reference
Pools
Prerequisites
Procedure
Syntax
2. Rename the default zone and zonegroup. Replace ZONE_GROUP_NAME and ZONE_NAME with the zonegroup or zone name
respectively.
Syntax
3. Configure the primary zonegroup. Replace ZONE_GROUP_NAME with the zonegroup name and REALM_NAME with realm name.
Replace FQDN with the fully qualified domain name(s) in the zonegroup.
Syntax
4. Create a system user. Replace USER_ID with the username. Replace DISPLAY_NAME with a display name. It can contain
spaces.
Syntax
5. Configure the primary zone. Replace FQDN with the fully qualified domain name(s) in the zonegroup.
Syntax
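A hedged sketch of configuring the primary zone with the system user's keys; all values shown are placeholders:
radosgw-admin zone modify --rgw-zone=us-east-1 --endpoints=http://rgw1:80 --access-key=ACCESS_KEY --secret=SECRET_KEY --master --default
radosgw-admin period update --commit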
6. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
Syntax
Example
Example
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
NOTE: To add additional zones, follow the same procedure as for adding the secondary zone, using a different zone name.
IMPORTANT: You must run metadata operations, such as user creation and quotas, on a host within the master zone of the master
zonegroup. The master zone and the secondary zone can receive bucket operations from the RESTful APIs, but the secondary zone
redirects bucket operations to the master zone. If the master zone is down, bucket operations will fail. If you create a bucket using
the radosgw-admin CLI, you must run it on a host within the master zone of the master zone group so that the buckets will
synchronize with other zone groups and zones.
Prerequisites
Edit online
At least two Ceph Object Gateway instances, one for each IBM Storage Ceph cluster.
Procedure
Edit online
Example
Syntax
Example
Syntax
Example
NOTE: All zones run in an active-active configuration by default; that is, a gateway client might write data to any zone and the
zone will replicate the data to all other zones within the zone group. If the secondary zone should not accept write operations,
specify the --read-only flag to create an active-passive configuration between the master zone and the secondary zone.
Additionally, provide the access_key and secret_key of the generated system user stored in the master zone of the master
zone group.
Syntax
Example
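A hedged sketch of creating the secondary zone with the system user's credentials; the zone names, endpoint, and keys are placeholders, and --read-only can be appended for an active-passive setup as noted above:
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-2 --endpoints=http://rgw2:80 --access-key=ACCESS_KEY --secret=SECRET_KEY
radosgw-admin period update --commit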
IMPORTANT:
Do not delete the default zone and its pools if you are using the default zone and zone group to store data.
Example
6. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
Syntax
Example
Syntax
Example
9. Outside the cephadm shell, fetch the FSID of the storage cluster and the processes:
Example
Syntax
Example
IMPORTANT: Technology Preview features are not supported with IBM production service level agreements (SLAs), might not be
functionally complete, and IBM does not recommend using them for production. These features provide early access to upcoming
product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
Edit online
Procedure
Edit online
Configure the archive zone when creating a new zone by using the archive tier:
Syntax
Example
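A hedged sketch of creating an archive zone with the archive tier type; the zone group, zone name, and endpoint are placeholders:
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=archive --endpoints=http://rgw3:80 --tier-type=archive
radosgw-admin period update --commit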
Reference
Edit online
The rules within the lifecycle policy determine when and what objects to delete. For more information about lifecycle creation and
management, see Bucket lifecycle.
Prerequisites
Edit online
Procedure
Edit online
1. Set the <ArchiveZone> lifecycle policy rule. For more information about creating a lifecycle policy, see the Creating a
lifecycle management policy.
Example
Syntax
Example
{
"prefix_map": {
"": {
"status": true,
"dm_expiration": true,
"expiration": 0,
"noncur_expiration": 2,
"mp_expiration": 0,
"transitions": {},
"noncur_transitions": {}
}
},
"rule_map": [
{
"id": "Rule 1",
<1> The archive zone rule. This is an example of a lifecycle policy with an archive zone rule.
3. If the Ceph Object Gateway user is deleted, the buckets at the archive site owned by that user become inaccessible. Link those
buckets to another Ceph Object Gateway user to access the data.
Syntax
Example
[ceph: root@host01 /]# radosgw-admin bucket link --uid arcuser1 --bucket arc1-deleted-
da473fbbaded232dc5d1e434675c1068 --yes-i-really-mean-it
Additional resources
Bucket lifecycle
S3 bucket lifecycle
Prerequisites
Edit online
1. Make the secondary zone the primary and default zone. For example:
Syntax
By default, Ceph Object Gateway runs in an active-active configuration. If the cluster was configured to run in an active-
passive configuration, the secondary zone is a read-only zone. Remove the --read-only status to allow the zone to receive
write operations. For example:
Syntax
Example
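A hedged sketch of promoting the secondary zone, with us-east-2 as a placeholder zone name; the second command also clears a read-only status if one was set:
radosgw-admin zone modify --rgw-zone=us-east-2 --master --default
radosgw-admin zone modify --rgw-zone=us-east-2 --master --default --read-only=false
radosgw-admin period update --commit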
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
4. From the recovered zone, pull the realm from the current primary zone:
Syntax
Syntax
Example
Syntax
Example
8. If the secondary zone needs to be a read-only configuration, update the secondary zone:
Syntax
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Example
Syntax
Example
4. Get the JSON file with the configuration of the zone group:
Syntax
Example
a. Open the file for editing, and set the log_meta, log_data, and sync_from_all fields to false:
Example
{
"id": "72f3a886-4c70-420b-bc39-7687f072997d",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "a5e44ecd-7aae-4e39-b743-3a709acb60c5",
"zones": [
{
"id": "975558e0-44d8-4866-a435-96d3e71041db",
"name": "testzone",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 11,
"read_only": "false",
"tier_type": "",
"sync_from_all": "false",
"sync_from": []
},
{
Syntax
Example
Example
7. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
Reference
Edit online
Realms
Zone Groups
Zones
Installation
NOTE: IBM recommends that each realm has its own Ceph Object Gateway.
The access key and secret key for each data center in the storage cluster.
Each data center has its own local realm. They share a realm that replicates on both sites.
Procedure
Edit online
1. Create one local realm on the first data center in the storage cluster:
Syntax
Example
Syntax
Example
Syntax
Example
Example
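A hedged sketch of steps 1 through 4, reusing the ldc1 and ldc1z names from the deployment example below; the zone group name ldc1zg and the endpoint are assumptions:
radosgw-admin realm create --rgw-realm=ldc1 --default
radosgw-admin zonegroup create --rgw-zonegroup=ldc1zg --endpoints=http://host01:80 --rgw-realm=ldc1 --default
radosgw-admin zone create --rgw-zonegroup=ldc1zg --rgw-zone=ldc1z --master --default --endpoints=http://host01:80
radosgw-admin period update --commit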
5. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
6. You can either deploy the Ceph Object Gateway daemons with the appropriate realm and zone or update the configuration
database:
Syntax
Example
[ceph: root@host01 /]# ceph orch apply rgw rgw --realm=ldc1 --zone=ldc1z --placement="1
host01"
Syntax
Example
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
8. Create one local realm on the second data center in the storage cluster:
Syntax
Example
Syntax
Example
Syntax
Example
Example
12. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
13. You can either deploy the Ceph Object Gateway daemons with the appropriate realm and zone or update the configuration
database:
Syntax
Example
[ceph: root@host01 /]# ceph orch apply rgw rgw --realm=ldc2 --zone=ldc2z --placement="1
host01"
Syntax
Example
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
15. Create a replicated realm on the first data center in the storage cluster:
Syntax
Example
Use the --default flag to make the replicated realm default on the primary site.
Syntax
Example
Syntax
Example
18. Create a synchronization user and add the system user to the master zone for multi-site:
Syntax
Example
Example
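A hedged sketch of creating the synchronization system user and adding its keys to the master zone; the user ID and display name are placeholders, and rdc1z is the zone used in the deployment example below:
radosgw-admin user create --uid=sync.user --display-name="Sync user" --system
radosgw-admin zone modify --rgw-zone=rdc1z --access-key=ACCESS_KEY --secret=SECRET_KEY
radosgw-admin period update --commit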
20. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
21. You can either deploy the Ceph Object Gateway daemons with the appropriate realm and zone or update the configuration
database:
Syntax
Example
[ceph: root@host01 /]# ceph orch apply rgw rgw --realm=rdc1 --zone=rdc1z --placement="1
host01"
Syntax
Example
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
Example
27. Optional: If you specified the realm and zone in the service specification during the deployment of the Ceph Object Gateway,
update the spec section of the specification file:
Syntax
spec:
rgw_realm: REALM_NAME
rgw_zone: ZONE_NAME
28. You can either deploy the Ceph Object Gateway daemons with the appropriate realm and zone or update the configuration
database:
Syntax
Example
[ceph: root@host04 /]# ceph orch apply rgw rgw --realm=rdc1 --zone=rdc2z --placement="1
host04"
Syntax
Example
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Example
30. Log in as root on the endpoint for the second data center.
Syntax
Example
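As a hedged illustration, the synchronization state can be checked with:
radosgw-admin sync status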
IMPORTANT: In IBM Storage Ceph 5.3.z5, the compress-encrypted feature is displayed by the radosgw-admin sync status
command and is disabled by default. Do not enable this feature, because it is not supported until IBM Storage Ceph 6.1.z2.
32. Log in as root on the endpoint for the first data center.
Syntax
Example
34. To store and access data in the local site, create the user for local realm:
Syntax
IMPORTANT: By default, users are created under the default realm. For the users to access data in the local realm, the
radosgw-admin command requires the --rgw-realm argument.
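A hedged sketch, where the user ID and display name are placeholders and ldc1 is the local realm from the examples above:
radosgw-admin user create --uid=local-user --display-name="Local user" --rgw-realm=ldc1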
Realms
Zone Groups
Zones
Realms
Edit online
A realm represents a globally unique namespace consisting of one or more zonegroups containing one or more zones, and zones
containing buckets, which in turn contain objects. A realm enables the Ceph Object Gateway to support multiple namespaces and
their configuration on the same hardware.
A realm contains the notion of periods. Each period represents the state of the zone group and zone configuration in time. Each time
you make a change to a zonegroup or zone, update the period and commit it.
Creating a realm
Making a Realm the Default
Deleting a Realm
Getting a realm
Listing realms
Setting a realm
Listing Realm Periods
Pulling a Realm
Renaming a Realm
Creating a realm
Edit online
To create a realm, issue the realm create command and specify the realm name. If the realm is the default, specify --default.
Syntax
Example
By specifying --default, the realm will be called implicitly with each radosgw-admin call unless --rgw-realm and the realm
name are explicitly provided.
NOTE: When the realm is default, the command line assumes --rgw-realm=_REALM_NAME_ as an argument.
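A hedged sketch, using the test_realm name that appears in the realm output later in this section:
radosgw-admin realm create --rgw-realm=test_realm --default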
Deleting a Realm
Edit online
To delete a realm, run the realm delete command and specify the realm name.
Syntax
Example
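A hedged sketch with the same illustrative realm name:
radosgw-admin realm delete --rgw-realm=test_realm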
Getting a realm
Edit online
To get a realm, run the realm get command and specify the realm name.
Syntax
Example
The CLI will echo a JSON object with the realm properties.
{
"id": "0a68d52e-a19c-4e8e-b012-a8f831cb3ebc",
"name": "test_realm",
"current_period": "b0c5bbef-4337-4edd-8184-5aeab2ec413b",
"epoch": 1
}
Use > and an output file name to output the JSON object to a file.
Listing realms
Edit online
To list realms, run the realm list command:
Example
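For example:
radosgw-admin realm list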
Syntax
Example
Example
Pulling a Realm
Edit online
To pull a realm from the node containing the master zone group and master zone to a node containing a secondary zone group or
zone, run the realm pull command on the node that will receive the realm configuration.
Syntax
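A hedged sketch, where the URL and the system user's credentials are placeholders:
radosgw-admin realm pull --url=http://rgw1:80 --access-key=ACCESS_KEY --secret=SECRET_KEY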
Renaming a Realm
Edit online
A realm is not part of the period. Consequently, renaming the realm is only applied locally, and will not get pulled with realm pull.
When renaming a realm with multiple zones, run the command on each zone. To rename a realm, run the following command:
Syntax
NOTE: Do NOT use realm set to change the name parameter. That changes the internal name only. Specifying --rgw-realm
would still use the old realm name.
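A hedged sketch with illustrative old and new realm names:
radosgw-admin realm rename --rgw-realm=test_realm --realm-new-name=prod_realm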
Zone Groups
Edit online
The Ceph Object Gateway supports multi-site deployments and a global namespace by using the notion of zone groups. Formerly
called a region, a zone group defines the geographic location of one or more Ceph Object Gateway instances within one or more
zones.
NOTE: The radosgw-admin zonegroup operations can be performed on any node within the realm, because the step of updating
the period propagates the changes throughout the cluster. However, radosgw-admin zone operations MUST be performed on a
host within the zone.
Syntax
NOTE: Use zonegroup modify --rgw-zonegroup=_ZONE_GROUP_NAME_ to modify an existing zone group’s settings.
Example
NOTE: When the zonegroup is the default, the command line assumes --rgw-zonegroup=_ZONE_GROUP_NAME_ as an argument.
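A hedged sketch, using the us zone group and endpoint that appear in the output below:
radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --master --default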
Syntax
Syntax
Example
{
"default_info": "90b28698-e7c3-462c-a42d-4aa780d24eda",
"zonegroups": [
"us"
]
}
Syntax
{
"id": "90b28698-e7c3-462c-a42d-4aa780d24eda",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http://rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "9248cab2-afe7-43d8-a661-a40bf316665e",
"zones": [
{
"id": "9248cab2-afe7-43d8-a661-a40bf316665e",
"name": "us-east",
"endpoints": [
"http://rgw1"
You may only have one zone group with is_master equal to true, and it must be specified as the master_zonegroup at the end
of the zone group map. The following JSON object is an example of a default zone group map.
{
"zonegroups": [
{
"key": "90b28698-e7c3-462c-a42d-4aa780d24eda",
"val": {
"id": "90b28698-e7c3-462c-a42d-4aa780d24eda",
"name": "us",
"api_name": "us",
"is_master": "true",
"endpoints": [
"http://rgw1:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "9248cab2-afe7-43d8-a661-a40bf316665e",
"zones": [
{
"id": "9248cab2-afe7-43d8-a661-a40bf316665e",
"name": "us-east",
"endpoints": [
"http://rgw1"
],
"log_meta": "true",
"log_data": "true",
"bucket_index_max_shards": 11,
"read_only": "false"
},
{
"id": "d1024e59-7d28-49d1-8222-af101965a939",
"name": "us-west",
"endpoints": [
"http://rgw2:80"
Example
Where zonegroupmap.json is the JSON file you created. Ensure that you have zones created for the ones specified in the zone
group map. Finally, update the period.
Example
3. is_master: Determines if the zone group is the master zone group. Required.
4. endpoints: A list of all the endpoints in the zone group. For example, you may use multiple domain names to refer to the
same zone group. Remember to escape the forward slashes (/). You may also specify a port (fqdn:port) for each endpoint.
Optional.
5. hostnames: A list of all the hostnames in the zone group. For example, you may use multiple domain names to refer to the
same zone group. Optional. The rgw dns name setting will automatically be included in this list. You should restart the
gateway daemon(s) after changing this setting.
6. master_zone: The master zone for the zone group. Optional. Uses the default zone if not specified.
NOTE: You can only have one master zone per zone group.
8. placement_targets: A list of placement targets (optional). Each placement target contains a name (required) for the
placement target and a list of tags (optional) so that only users with the tag can use the placement target (i.e., the user’s
placement_tags field in the user info).
9. default_placement: The default placement target for the object index and object data. Set to default-placement by
default. You may also set a per-user default placement in the user info for each user.
To set a zone group, create a JSON object consisting of the required fields, save the object to a file, for example, zonegroup.json;
then, run the following command:
Example
IMPORTANT: The default zone group is_master setting is true by default. If you create a new zone group and want to make it
the master zone group, you must either set the default zone group is_master setting to false, or delete the default zone
group.
Example
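A hedged sketch of applying the file and committing the period, assuming the zonegroup.json file described above:
radosgw-admin zonegroup set < zonegroup.json
radosgw-admin period update --commit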
Zones
Edit online
Ceph Object Gateway supports the notion of zones. A zone defines a logical group consisting of one or more Ceph Object Gateway
instances.
Configuring zones differs from typical configuration procedures, because not all of the settings end up in a Ceph configuration file.
You can list zones, get a zone configuration, and set a zone configuration.
IMPORTANT: All radosgw-admin zone operations MUST be issued on a host that operates or will operate within the zone.
Creating a Zone
Deleting a zone
Modifying a Zone
Listing Zones
Getting a Zone
Setting a Zone
Renaming a zone
Adding a Zone to a Zone Group
Removing a Zone from a Zone Group
Creating a Zone
Edit online
To create a zone, specify a zone name. If it is a master zone, specify the --master option. Only one zone in a zone group may be a
master zone. To add the zone to a zonegroup, specify the --rgw-zonegroup option with the zonegroup name.
IMPORTANT: Zones must be created on a Ceph Object Gateway node that will be within the zone.
Syntax
Example
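A hedged sketch, using the us zone group and us-east zone names from the examples in this section:
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east --master --default
radosgw-admin period update --commit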
Deleting a zone
Edit online
To delete a zone, first remove it from the zonegroup.
Syntax
Example
Syntax
Example
IMPORTANT: Do not delete a zone without removing it from a zone group first. Otherwise, updating the period will fail.
If the pools for the deleted zone will not be used anywhere else, consider deleting the pools. Replace DELETED_ZONE_NAME
in the example below with the deleted zone’s name.
IMPORTANT: Once Ceph deletes the zone pools, it deletes all of the data within them in an unrecoverable manner. Only delete the
zone pools if Ceph clients no longer need the pool contents.
IMPORTANT: In a multi-realm cluster, deleting the .rgw.root pool along with the zone pools will remove ALL the realm information
for the cluster. Ensure that .rgw.root does not contain other active realms before deleting the .rgw.root pool.
Syntax
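A hedged sketch of the full sequence, treating the us-west zone from the zone group map above as the zone being deleted; the pool shown is one example, and each DELETED_ZONE_NAME pool would be handled the same way:
radosgw-admin zonegroup remove --rgw-zonegroup=us --rgw-zone=us-west
radosgw-admin period update --commit
radosgw-admin zone delete --rgw-zone=us-west
radosgw-admin period update --commit
ceph osd pool delete us-west.rgw.control us-west.rgw.control --yes-i-really-really-mean-it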
Modifying a Zone
IMPORTANT: Zones should be modified on a Ceph Object Gateway node that will be within the zone.
Syntax
`--access-key=<key>`
`--secret/--secret-key=<key>`
`--master`
`--default`
`--endpoints=<list>`
Example
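A hedged sketch, modifying the endpoints of the us-east zone and committing the change:
radosgw-admin zone modify --rgw-zone=us-east --endpoints=http://rgw1:80
radosgw-admin period update --commit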
Listing Zones
Edit online
As root, to list the zones in a cluster, run the following command:
Example
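For example:
radosgw-admin zone list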
Getting a Zone
Edit online
As root, to get the configuration of a zone, run the following command:
Syntax
{ "domain_root": ".rgw",
"control_pool": ".rgw.control",
"gc_pool": ".rgw.gc",
"log_pool": ".log",
"intent_log_pool": ".intent-log",
"usage_log_pool": ".usage",
"user_keys_pool": ".users",
"user_email_pool": ".users.email",
"user_swift_pool": ".users.swift",
"user_uid_pool": ".users.uid",
"system_key": { "access_key": "", "secret_key": ""},
"placement_pools": [
{ "key": "default-placement",
"val": { "index_pool": ".rgw.buckets.index",
"data_pool": ".rgw.buckets"}
}
]
}
Setting a Zone
Edit online
IMPORTANT: Zones should be set on a Ceph Object Gateway node that will be within the zone.
To set a zone, create a JSON object consisting of the pools, save the object to a file, for example, zone.json; then, run the following
command, replacing ZONE_NAME with the name of the zone:
Example
Example
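A hedged sketch, assuming the zone.json file described above and the us-east zone name:
radosgw-admin zone set --rgw-zone=us-east --infile zone.json
radosgw-admin period update --commit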
Renaming a zone
Edit online
To rename a zone, specify the zone name and the new zone name. Issue the following command on a host within the zone:
Syntax
Example
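A hedged sketch with illustrative old and new zone names:
radosgw-admin zone rename --rgw-zone=us-east --zone-new-name=us-east-1
radosgw-admin period update --commit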
Syntax
Example
Syntax
Prerequisites
Edit online
The Directory Server node’s FQDN is resolvable using DNS or the /etc/hosts file.
Register the Directory Server node to the Red Hat subscription management service.
A valid Red Hat Directory Server subscription is available in your Red Hat account.
Procedure
Edit online
Follow the instructions in Installing the Directory Server packages and Setting up a new Directory Server instance of the Red Hat
Directory Server Installation Guide.
Reference
Edit online
Configure LDAPS
Edit online
The Ceph Object Gateway uses a simple ID and password to authenticate with the LDAP server, so the connection requires an SSL
certificate for LDAP. Once the LDAP is working, configure the Ceph Object Gateway servers to trust the Directory Server’s certificate.
1. Extract/Download a PEM-formatted certificate for the Certificate Authority (CA) that signed the LDAP server’s SSL certificate.
4. Use the certutil command to add the AD CA to the store at /etc/openldap/certs. For example, if the CA is "msad-
frog-MSAD-FROG-CA", and the PEM-formatted CA file is ldap.pem, use the following command:
Example
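A hedged sketch of the certutil invocation described above; the path to the PEM file is an assumption:
# certutil -d /etc/openldap/certs -A -t "TC,," -n "msad-frog-MSAD-FROG-CA" -i /root/ldap.pem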
Example
# setsebool -P httpd_can_network_connect on
Example
Example
$ ldapwhoami -H ldaps://redhat-directory-server.example.com -d 9
The -d 9 option will provide debugging information in case something went wrong with the SSL negotiation.
Example
Procedure
Edit online
1. Create an LDAP user for the Ceph Object Gateway, and make a note of the binddn. Since the Ceph Object Gateway uses the
ceph user, consider using ceph as the username. The user needs permissions to search the directory. The Ceph Object
Gateway binds to this user as specified in rgw_ldap_binddn.
2. Test to ensure that the user creation worked. Where ceph is the user ID under People and example.com is the domain, you
can perform a search for the user.
On each gateway node, create a file for the user’s secret. For example, the secret may get stored in a file entitled /etc/bindpass.
For security, change the owner of this file to the ceph user and group to ensure it is not globally readable.
Syntax
Example
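A hedged sketch of creating and protecting the secret file; the secret value is a placeholder:
# echo 'LDAP_SECRET' > /etc/bindpass
# chown ceph:ceph /etc/bindpass
# chmod 600 /etc/bindpass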
1. Patch the bind password file to the Ceph Object Gateway container and reapply the Ceph Object Gateway specification.
Example
service_type: rgw
service_id: rgw.1
service_name: rgw.rgw.1
placement:
label: rgw
extra_container_args:
- -v
- /etc/bindpass:/etc/bindpass
1. Change the Ceph configuration with the following commands on all the Ceph nodes.
Syntax
Example
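A hedged sketch of the settings involved; the server, bind DN, and search DN values are placeholders:
ceph config set client.rgw rgw_ldap_uri ldaps://rh-directory-server.example.com:636
ceph config set client.rgw rgw_ldap_binddn "uid=ceph,ou=People,dc=example,dc=com"
ceph config set client.rgw rgw_ldap_secret /etc/bindpass
ceph config set client.rgw rgw_ldap_searchdn "ou=People,dc=example,dc=com"
ceph config set client.rgw rgw_ldap_dnattr uid
ceph config set client.rgw rgw_s3_auth_use_ldap true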
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Example
"objectclass=inetorgperson"
The Ceph Object Gateway generates the search filter with the user name from the token and the value of rgw_ldap_dnattr.
The constructed filter is then combined with the partial filter from the rgw_ldap_searchfilter value. For example, with the
user name joe and the settings above, the final search filter is:
Example
"(&(uid=joe)(objectclass=inetorgperson))"
User joe is only granted access if he is found in the LDAP directory, has an object class of inetorgperson, and specifies a
valid password.
A complete filter must contain a @USERNAME@ token, which is substituted with the user name during the authentication attempt.
The rgw_ldap_dnattr setting is not used in this case. For example, to limit valid users to a specific group, use the following
filter:
Example
"(&(uid=@USERNAME@)(memberOf=cn=ceph-users,ou=groups,dc=mycompany,dc=com))"
Syntax
export RGW_ACCESS_KEY_ID="USERNAME"
Syntax
export RGW_SECRET_ACCESS_KEY="PASSWORD"
3. Export the token. For LDAP, use ldap as the token type (ttype).
Example
Example
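A hedged sketch of generating the token once the two variables above are exported:
# radosgw-token --encode --ttype=ldap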
The result is a base-64 encoded string, which is the access token. Provide this access token to S3 clients in lieu of the access
key. The secret key is no longer required.
4. Optional: For added convenience, export the base-64 encoded string to the RGW_ACCESS_KEY_ID environment variable if the
S3 client uses the environment variable.
Example
export
RGW_ACCESS_KEY_ID="ewogICAgIlJHV19UT0tFTiI6IHsKICAgICAgICAidmVyc2lvbiI6IDEsCiAgICAgICAgInR5cGU
iOiAibGRhcCIsCiAgICAgICAgImlkIjogImNlcGgiLAogICAgICAgICJrZXkiOiAiODAwI0dvcmlsbGEiCiAgICB9Cn0K"
Procedure
Edit online
1. Use the RGW_ACCESS_KEY_ID environment variable to configure the Ceph Object Gateway client. Alternatively, you can copy
the base-64 encoded string and specify it as the access key. Following is an example of the configured S3 client.
Example
cat .aws/credentials
[default]
aws_access_key_id =
Example
3. Optional: You can also run the radosgw-admin user command to verify the user in the directory.
Example
The process for configuring Active Directory is essentially identical to Configuring LDAP and Ceph Object Gateway, but may have
some Windows-specific usage.
NOTE: Ensure that port 636 is open on the Active Directory host.
Example
The Ceph Object Gateway will bind to this user as specified in rgw_ldap_binddn.
Test to ensure that the user creation worked. Where ceph is the user ID under People and example.com is the domain, you can
perform a search for the user.
On each gateway node, create a file for the user’s secret. For example, the secret may get stored in a file entitled /etc/bindpass.
For security, change the owner of this file to the ceph user and group to ensure it is not globally readable.
Syntax
Example
Syntax
Syntax
Example
For the rgw_ldap_uri setting, substitute FQDN with the fully qualified domain name of the LDAP server. If there is more than
one LDAP server, specify each domain.
For the rgw_ldap_binddn setting, substitute BINDDN with the bind domain. With a domain of example.com and a ceph
user under users and accounts, it should look something like this:
Example
rgw_ldap_binddn "uid=ceph,cn=users,cn=accounts,dc=example,dc=com"
For the rgw_ldap_searchdn setting, substitute SEARCHDN with the search domain. With a domain of example.com and
users under users and accounts, it should look something like this:
rgw_ldap_searchdn "cn=users,cn=accounts,dc=example,dc=com"
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Syntax
export RGW_ACCESS_KEY_ID="USERNAME"
Syntax
export RGW_SECRET_ACCESS_KEY="PASSWORD"
3. Export the token. For LDAP, use ldap as the token type (ttype).
Example
Example
The result is a base-64 encoded string, which is the access token. Provide this access token to S3 clients in lieu of the access
key. The secret key is no longer required.
4. Optional: For added convenience, export the base-64 encoded string to the RGW_ACCESS_KEY_ID environment variable if the
S3 client uses the environment variable.
Example
export
RGW_ACCESS_KEY_ID="ewogICAgIlJHV19UT0tFTiI6IHsKICAgICAgICAidmVyc2lvbiI6IDEsCiAgICAgICAgInR5cGU
iOiAibGRhcCIsCiAgICAgICAgImlkIjogImNlcGgiLAogICAgICAgICJrZXkiOiAiODAwI0dvcmlsbGEiCiAgICB9Cn0K"
Procedure
Edit online
1. Use the RGW_ACCESS_KEY_ID environment variable to configure the Ceph Object Gateway client. Alternatively, you can copy
the base-64 encoded string and specify it as the access key. Following is an example of the configured S3 client:
Example
cat .aws/credentials
[default]
aws_access_key_id =
Example
3. Optional: You can also run the radosgw-admin user command to verify the user in the directory.
Example
NOTE: The member role’s read permissions only apply to objects of the project it belongs to.
admin
The admin role is reserved for the highest level of authorization within a particular scope. This usually includes all the create, read,
update, or delete operations on a resource or API.
member
The member role is not used directly by default. It provides flexibility during deployments and helps reduce responsibility for
administrators.
For example, you can override a policy for a deployment by using the default member role and a simple policy override, to allow
system members to update services and endpoints. This provides a layer of authorization between admin and reader roles.
reader
The reader role is reserved for read-only operations regardless of the scope.
WARNING: If you use a reader to access sensitive information such as image license keys, administrative image data,
administrative volume metadata, application credentials, and secrets, you might unintentionally expose sensitive information.
Hence, APIs that expose these resources should carefully consider the impact of the reader role and appropriately defer access to
the member and admin roles.
Benefits
The Ceph Object Gateway will query Keystone periodically for a list of revoked tokens.
Prerequisites
Edit online
Procedure
Edit online
# openstack service create --name=swift --description="Swift Service" object-store
Creating the service will echo the service settings.
Example
Field Value
description Swift Service
enabled True
id 37c4c0e79571404cb4644201a4a6e5ee
name swift
type object-store
Prerequisites
Edit online
Procedure
Edit online
Syntax
Replace REGION_NAME with the name of the gateway’s zone group name or region name. Replace URL with URLs appropriate
for the Ceph Object Gateway.
Field Value
adminurl http://radosgw.example.com:8080/swift/v1
id e4249d2b60e44743a67b5e5b38c18dd3
internalurl http://radosgw.example.com:8080/swift/v1
publicurl http://radosgw.example.com:8080/swift/v1
region us-west
service_id 37c4c0e79571404cb4644201a4a6e5ee
service_name swift
service_type object-store
Setting the endpoints will output the service endpoint settings.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Syntax
Showing the endpoints will echo the endpoint settings and the service settings.
Table 1. Example
Field Value
adminurl http://radosgw.example.com:8080/swift/v1
enabled True
id e4249d2b60e44743a67b5e5b38c18dd3
internalurl http://radosgw.example.com:8080/swift/v1
publicurl http://radosgw.example.com:8080/swift/v1
region us-west
service_id 37c4c0e79571404cb4644201a4a6e5ee
Prerequisites
Edit online
Procedure
Edit online
Example
2. Install Keystone's SSL certificate on the node running the Ceph Object Gateway. Alternatively, set the
rgw_keystone_verify_ssl setting to false.
Setting rgw_keystone_verify_ssl to false means that the gateway will not attempt to verify the certificate.
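A hedged sketch of disabling certificate verification:
ceph config set client.rgw rgw_keystone_verify_ssl false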
Prerequisites
Edit online
Procedure
Example
b. Set the nss_db_path setting to the path where the NSS database is stored:
Example
It is possible to configure a Keystone service tenant, user, and password for the OpenStack Identity API, similar to the
way system administrators tend to configure OpenStack services. Providing a username and password avoids providing
the shared secret to the rgw_keystone_admin_token setting.
IMPORTANT:
IBM recommends disabling authentication by admin token in production environments. The service tenant credentials should
have admin privileges.
Syntax
A Ceph Object Gateway user is mapped into a Keystone tenant. A Keystone user has different roles assigned to it on possibly
more than a single tenant. When the Ceph Object Gateway gets the ticket, it looks at the tenant, and the user roles that are
assigned to that ticket, and accepts or rejects the request according to the rgw_keystone_accepted_roles configurable.
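As a hedged illustration only, the related settings might be configured as follows; the URL, credentials, domain, project, and roles are all placeholders:
ceph config set client.rgw rgw_keystone_url http://keystone.example.com:5000
ceph config set client.rgw rgw_keystone_api_version 3
ceph config set client.rgw rgw_keystone_admin_user rgw_service_user
ceph config set client.rgw rgw_keystone_admin_password SERVICE_USER_PASSWORD
ceph config set client.rgw rgw_keystone_admin_domain default
ceph config set client.rgw rgw_keystone_admin_project service
ceph config set client.rgw rgw_keystone_accepted_roles "admin,member"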
Reference
Edit online
See the Users and Identity Management Guide for Red Hat OpenStack Platform.
Prerequisites
Edit online
Procedure
Edit online
Once you have saved the Ceph configuration file and distributed it to each Ceph node, restart the Ceph Object Gateway
instances:
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
1. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
2. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Security
Edit online
As a storage administrator, securing the storage cluster environment is important. IBM Storage Ceph provides encryption and key
management to secure the Ceph Object Gateway access point.
S3 server-side encryption
Server-side encryption requests
Configuring server-side encryption
The HashiCorp Vault
The Ceph Object Gateway and multi-factor authentication
Prerequisites
Edit online
S3 server-side encryption
Edit online
The Ceph Object Gateway supports server-side encryption of uploaded objects for the S3 application programming interface (API).
Server-side encryption means that the S3 client sends data over HTTP in its unencrypted form, and the Ceph Object Gateway stores
that data in the IBM Storage Ceph cluster in encrypted form.
NOTE: IBM does NOT support S3 object encryption of Static Large Object (SLO) or Dynamic Large Object (DLO).
IMPORTANT: To use encryption, client requests MUST send requests over an SSL connection. IBM does not support S3 encryption
from a client unless the Ceph Object Gateway uses SSL. However, for testing purposes, administrators can disable SSL during testing
by setting the rgw_crypt_require_ssl configuration setting to false at runtime, using the ceph config set client.rgw
command, and then restarting the Ceph Object Gateway instance. In a production environment, it might not be possible to send
encrypted requests over SSL. In such a case, send requests using HTTP with server-side encryption.
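A hedged sketch of relaxing the SSL requirement for testing only:
ceph config set client.rgw rgw_crypt_require_ssl false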
Customer-provided Keys
When using customer-provided keys, the S3 client passes an encryption key along with each request to read or write encrypted data.
It is the customer’s responsibility to manage those keys. Customers must remember which key the Ceph Object Gateway used to
encrypt each object.
Ceph Object Gateway implements the customer-provided key behavior in the S3 API according to the Amazon SSE-C specification.
Since the customer handles the key management and the S3 client passes keys to the Ceph Object Gateway, the Ceph Object
Gateway requires no special configuration to support this encryption mode.
When using a key management service, the secure key management service stores the keys and the Ceph Object Gateway retrieves
them on demand to serve requests to encrypt or decrypt data.
Ceph Object Gateway implements the key management service behavior in the S3 API according to the Amazon SSE-KMS
specification.
IMPORTANT: Currently, the only tested key management implementations are HashiCorp Vault, and OpenStack Barbican. However,
OpenStack Barbican is a Technology Preview and is not supported for use in production systems.
Amazon SSE-C
Amazon SSE-KMS
In this type of configuration, it is possible that SSL terminations occur both at a load balancer and between the load balancer and the
multiple Ceph Object Gateways. Communication occurs using HTTP only.
Prerequisites
Procedure
Edit online
Example
frontend rgw-https
bind *:443 ssl crt /etc/ssl/private/example.com.pem
default_backend rgw
backend rgw
balance roundrobin
mode http
server rgw1 10.0.0.71:8080 check
server rgw2 10.0.0.80:8080 check
2. Comment out the lines that allow access to the http front end and add instructions to direct HAProxy to use the https front
end instead:
Example
frontend rgw-https
bind *:443 ssl crt /etc/ssl/private/example.com.pem
http-request set-header X-Forwarded-Proto https if { ssl_fc }
http-request set-header X-Forwarded-Proto https
# here we set the incoming HTTPS port on the load balancer (eg : 443)
http-request set-header X-Forwarded-Port 443
default_backend rgw
backend rgw
balance roundrobin
mode http
server rgw1 10.0.0.71:8080 check
server rgw2 10.0.0.80:8080 check
Example
Reference
Edit online
1. The client requests the creation of a secret key from the Vault based on an object's key ID.
2. The client uploads an object with the object's key ID to the Ceph Object Gateway.
3. The Ceph Object Gateway then requests the newly created secret key from the Vault.
4. The Vault replies to the request by returning the secret key to the Ceph Object Gateway.
5. Now the Ceph Object Gateway can encrypt the object using the new secret key.
6. After encryption is done the object is then stored on the Ceph OSD.
IMPORTANT: IBM works with our technology partners to provide this documentation as a service to our customers. However, IBM
does not provide support for this product. If you need technical assistance for this product, then contact Hashicorp for support.
Prerequisites
Edit online
Reference
Edit online
The Ceph Object Gateway supports two of the HashiCorp Vault secret engines:
Key/Value version 2
Transit
Key/Value version 2
The Key/Value secret engine stores random secrets within the Vault, on disk. With version 2 of the kv engine, a key can have a
configurable number of versions. The default number of versions is 10. Deleting a version does not delete the underlying data, but
marks the data as deleted, allowing deleted versions to be undeleted. You can use the API endpoint or the destroy command to
permanently remove a version’s data. To delete all versions and metadata for a key, you can use the metadata command or the API
endpoint. The key names must be strings, and the engine will convert non-string values into strings when using the command line
interface. To preserve non-string values, provide a JSON file or use the HTTP application programming interface (API).
NOTE: For access control list (ACL) policies, the Key/Value secret engine recognizes the distinctions between the create and
update capabilities.
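As a hedged illustration of the kv version 2 behavior described above, with an assumed secret path, deleting a version, destroying a version, and removing all metadata are distinct operations:
vault kv delete secret/myproject/mybucketkey
vault kv destroy -versions=1 secret/myproject/mybucketkey
vault kv metadata delete secret/myproject/mybucketkey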
Transit
The Transit secret engine performs cryptographic functions on in-transit data. The Transit secret engine can generate hashes, can be
a source of random bytes, and can also sign and verify data. The Vault does not store data when using the Transit secret engine. The
Transit secret engine supports key derivation, by allowing the same key to be used for multiple purposes. Also, the transit secret
engine supports key versioning. The Transit secret engine supports these key types:
aes128-gcm96
AES-GCM with a 128-bit AES key and a 96-bit nonce; supports encryption, decryption, key derivation, and convergent encryption
aes256-gcm96
AES-GCM with a 256-bit AES key and a 96-bit nonce; supports encryption, decryption, key derivation, and convergent encryption
(default)
chacha20-poly1305
ChaCha20-Poly1305 with a 256-bit key; supports encryption, decryption, key derivation, and convergent encryption
ed25519
Ed25519; supports signing, signature verification, and key derivation
ecdsa-p256
ECDSA using curve P-256; supports signing and signature verification
ecdsa-p384
ECDSA using curve P-384; supports signing and signature verification
ecdsa-p521
ECDSA using curve P-521; supports signing and signature verification
rsa-3072
3072-bit RSA key; supports encryption, decryption, signing, and signature verification
rsa-4096
4096-bit RSA key; supports encryption, decryption, signing, and signature verification
See the KV Secrets Engine documentation on Vault’s project site for more information.
See the Transit Secrets Engine documentation on Vault’s project site for more information.
IMPORTANT: IBM supports using the Vault agent as the authentication method for containers; token authentication is not
supported on containers.
Vault Agent
The Vault agent is a daemon that runs on a client node and provides client-side caching, along with token renewal. The Vault agent
typically runs on the Ceph Object Gateway node. Run the Vault agent and refresh the token file. When the Vault agent is used in this
mode, you can use file system permissions to restrict who has access to the usage of tokens. Also, the Vault agent can act as a proxy
server, that is, Vault will add a token when required and add it to the requests passed to it before forwarding them to the actual
server. The Vault agent can still handle token renewal just as it would when storing a token in the filesystem. Secure the
network that the Ceph Object Gateway uses to connect to the Vault agent, for example, by configuring the Vault agent to listen
only on localhost.
Reference
Edit online
See the Vault Agent documentation on Vault’s project site for more information.
Reference
Edit online
See the Vault Enterprise Namespaces documentation on Vault’s project site for more information.
Example
[ceph: root@host03 /]# ceph config set client.rgw rgw_crypt_vault_secret_engine transit compat=0
NOTE: This will be the default in future versions and can be used with new installations.
Example
[ceph: root@host03 /]# ceph config set client.rgw rgw_crypt_vault_secret_engine transit compat=1
This enables the new engine for newly created objects and still allows the old engine to be used for the old objects. To access old
and new objects, the Vault token must have both the old and new transit policies.
To force use of only the old engine, run the following command:
Example
[ceph: root@host03 /]# ceph config set client.rgw rgw_crypt_vault_secret_engine transit compat=2
IMPORTANT: After configuring the client.rgw options, you need to restart the Ceph Object Gateway daemons for the new values
to take effect.
Reference
Edit online
See the Vault Agent documentation on Vault’s project site for more information.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
path "transit/keys/*" {
capabilities = ["read", "delete"]
}
path "transit/keys/" {
capabilities = ["list"]
}
path "transit/keys/+/rotate" {
capabilities = [ "update" ]
}
path "transit/*" {
capabilities = [ "update" ]
}
EOF
NOTE: If you have used the Transit secret engine on an older version of Ceph, the token policy is:
Example
If you are using both SSE-KMS and SSE-S3, you should point each to separate containers. You can either use separate Vault
instances, or separately mount transit instances or different branches under a common transit point. If you are not using separate
Vault instances, you can point SSE-KMS and SSE-S3 to separate containers by using rgw_crypt_vault_prefix and
rgw_crypt_sse_s3_vault_prefix. When granting Vault permissions to SSE-KMS bucket owners, do not give them
permission to SSE-S3 keys; only Ceph should have access to the SSE-S3 keys.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Syntax
Example
Customize the policy as per your use case to set up a Vault agent.
Syntax
Syntax
Example
pid_file = "/run/rgw-vault-agent-pid"
auto_auth {
method "AppRole" {
mount_path = "auth/approle"
config = {
role_id_file_path ="/usr/local/etc/vault/.rgw-ap-role-id"
secret_id_file_path ="/usr/local/etc/vault/.rgw-ap-secret-id"
remove_secret_id_file_after_reading ="false"
}
}
}
cache {
use_auto_auth_token = true
}
listener "tcp" {
address = "127.0.0.1:8100"
tls_disable = true
}
vault {
address = "https://vaultserver:8200"
}
Example
Method 2: If using token authentication, configure the following settings. NOTE: Token authentication is not supported on
IBM Storage Ceph 5.
Syntax
Example
NOTE: For security reasons, the path to the token file should only be readable by the RADOS Gateway.
4. Set the Vault secret engine to use to retrieve encryption keys, either Key/Value or Transit.
Example
Example
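A hedged sketch of selecting each engine (choose one):
ceph config set client.rgw rgw_crypt_vault_secret_engine kv
ceph config set client.rgw rgw_crypt_vault_secret_engine transit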
5. Optional: Configure the Ceph Object Gateway to access Vault within a particular namespace to retrieve the encryption keys:
Example
NOTE: Vault namespaces allow teams to operate within isolated environments known as tenants. The Vault namespaces
feature is only available in the Vault Enterprise version.
6. Optional: Restrict access to a particular subset of the Vault secret space by setting a URL path prefix, where the Ceph Object
Gateway retrieves the encryption keys from:
Example
Example
Assuming the domain name of the Vault server is vaultserver, the Ceph Object Gateway will fetch encrypted transit
keys from the following URL:
Example
http://vaultserver:8200/v1/transit
7. Optional: To use custom SSL certification to authenticate with Vault, configure the following settings:
Syntax
Example
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
1. Use the ceph config set client.rgw _OPTION_ _VALUE_ command to enable Vault as the encryption key store:
Syntax
Syntax
Syntax
Syntax
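As a hedged illustration of this step; the agent address is an assumption that matches the listener in the Vault agent configuration shown below:
ceph config set client.rgw rgw_crypt_s3_kms_backend vault
ceph config set client.rgw rgw_crypt_vault_auth agent
ceph config set client.rgw rgw_crypt_vault_addr http://127.0.0.1:8100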
Example
pid_file = "/run/kv-vault-agent-pid"
auto_auth {
method "AppRole" {
mount_path = "auth/approle"
config = {
role_id_file_path ="/root/vault_configs/kv-agent-role-id"
secret_id_file_path ="/root/vault_configs/kv-agent-secret-id"
remove_secret_id_file_after_reading ="false"
}
}
}
cache {
use_auto_auth_token = true
}
listener "tcp" {
address = "127.0.0.1:8100"
tls_disable = true
}
vault {
address = "http://10.8.128.9:8200"
}
Example
8. A token file is populated with a valid token when the Vault agent runs.
Example
10. Use the ceph config set client.rgw _OPTION_ _VALUE_ command to set the Vault namespace to retrieve the
encryption keys:
Example
11. Restrict where the Ceph Object Gateway retrieves the encryption keys from the Vault by setting a path prefix:
Example
Example
Assuming the domain name of the Vault server is vault-server, the Ceph Object Gateway will fetch encrypted transit
keys from the following URL:
Example
http://vault-server:8200/v1/transit/export/encryption-key
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
Reference
Edit online
Edit online
Configure the HashiCorp Vault Key/Value secret engine (kv) so you can create a key for use with the Ceph Object Gateway. Secrets
are stored as key-value pairs in the kv secret engine.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
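A hedged sketch of enabling the kv-v2 engine and creating a key; the secret path mirrors the myproject/mybucketkey example later in this section:
vault secrets enable -path secret kv-v2
vault kv put secret/myproject/mybucketkey key=$(openssl rand -base64 32)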
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
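A hedged sketch of enabling the transit engine, creating an exportable key named mybucketkey, and reading back the exported key, which produces output similar to the following:
vault secrets enable transit
vault write -f transit/keys/mybucketkey exportable=true
vault read transit/export/encryption-key/mybucketkey/1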
Key Value
--- -----
keys map[1:-gbTI9lNpqv/V/2lDcmH2Nq1xKn6FPDWarCmFM2aNsQ=]
name mybucketkey
type aes256-gcm96
NOTE: Providing the full key path, including the key version, is required.
NOTE: The URL is constructed using the base address, set by the rgw_crypt_vault_addr option, and the path prefix, set by the
rgw_crypt_vault_prefix option.
Prerequisites
Edit online
Procedure
Edit online
1. Upload an object using the AWS command line client and provide the Server-Side Encryption (SSE) key ID in the request:
Example
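A hedged sketch of such an upload with the AWS CLI; the endpoint, bucket, object, and file names are assumptions, and the key ID matches the NOTE below:
aws --endpoint-url=http://radosgw.example.com:8080 s3api put-object --bucket testbucket --key myobject --body ./myobject --server-side-encryption aws:kms --ssekms-key-id myproject/mybucketkey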
NOTE: In the example, the Ceph Object Gateway would fetch the secret from http://vault-
server:8200/v1/secret/data/myproject/mybucketkey
Example
NOTE: In the example, the Ceph Object Gateway would fetch the secret from
http://vaultserver:8200/v1/transit/mybucketkey
Multi-factor authentication
Creating a seed for multi-factor authentication
Creating a new multi-factor authentication TOTP token
Test a multi-factor authentication TOTP token
Resynchronizing a multi-factor authentication TOTP token
Listing multi-factor authentication TOTP tokens
Display a multi-factor authentication TOTP token
Deleting a multi-factor authentication TOTP token
Multi-factor authentication
Edit online
When a bucket is configured for object versioning, a developer can optionally configure the bucket to require multi-factor
authentication (MFA) for delete requests. Using MFA, a time-based one time password (TOTP) token is passed as a key to the x-
amz-mfa header. The tokens are generated with virtual MFA devices like Google Authenticator, or a hardware MFA device like those
provided by Gemalto.
Use radosgw-admin to assign time-based one time password tokens to a user. You must set a secret seed and a serial ID. You can
also use radosgw-admin to list, remove, and resynchronize tokens.
IMPORTANT: In a multisite environment it is advisable to use different tokens for different zones, because, while MFA IDs are set on
the user’s metadata, the actual MFA one time password configuration resides on the local zone’s OSDs.
Term Description
TOTP Time-based One Time Password.
Token serial A string that represents the ID of a TOTP token.
Token seed The secret seed that is used to calculate the TOTP. It can be hexadecimal or base32.
TOTP seconds The time resolution used for TOTP generation.
TOTP window The number of TOTP tokens that are checked before and after the current token when validating tokens.
TOTP pin The valid value of a TOTP token at a certain time.
Table: Terminology
Prerequisites
Edit online
A Linux system.
Procedure
Edit online
1. Generate a 30 character seed from the urandom Linux device file and store it in the shell variable SEED:
Example
Example
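A hedged sketch of generating and displaying the seed:
[root@host01 ~]# SEED=$(head -10 /dev/urandom | sha512sum | cut -b 1-30)
[root@host01 ~]# echo $SEED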
Configure the one-time password generator and the back-end MFA system to use the same seed.
Reference
Edit online
Prerequisites
Edit online
A secret seed for the one-time password generator and Ceph Object Gateway MFA was generated.
Procedure
Edit online
Syntax
Set USERID to the user name to set up MFA on, set SERIAL to a string that represents the ID for the TOTP token, and set SEED
to a hexadecimal or base32 value that is used to calculate the TOTP. The following settings are optional: Set the SEED_TYPE to
hex or base32, set TOTP_SECONDS to the timeout in seconds, or set TOTP_WINDOW to the number of TOTP tokens to check
before and after the current token when validating tokens.
Example
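A hedged sketch with illustrative user and serial values, reusing the SEED variable from the previous section:
radosgw-admin mfa create --uid=johndoe --totp-serial=MFAtest --totp-seed=$SEED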
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Test the TOTP token PIN to verify that TOTP functions correctly:
Syntax
Set USERID to the user name MFA is set up on, set SERIAL to the string that represents the ID for the TOTP token, and set PIN
to the latest PIN from the one-time password generator.
Example
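A hedged sketch, where the PIN value is a placeholder:
radosgw-admin mfa check --uid=johndoe --totp-serial=MFAtest --totp-pin=870305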
If this is the first time you have tested the PIN, it may fail. If it fails, resynchronize the token.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
1. Resynchronize a multi-factor authentication TOTP token in case of time skew or failed checks.
This requires passing in two consecutive pins: the previous pin, and the current pin.
Syntax
Set USERID to the user name MFA is set up on, set SERIAL to the string that represents the ID for the TOTP token, set
PREVIOUS_PIN to the user’s previous PIN, and set CURRENT_PIN to the user’s current PIN.
Example
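A hedged sketch passing the previous and current PINs as placeholders:
radosgw-admin mfa resync --uid=johndoe --totp-serial=MFAtest --totp-pin=PREVIOUS_PIN --totp-pin=CURRENT_PIN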
Syntax
Set USERID to the user name MFA is set up on, set SERIAL to the string that represents the ID for the TOTP token, and set PIN
to the user’s PIN.
Example
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
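For example, with an illustrative user name:
radosgw-admin mfa list --uid=johndoe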
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Set USERID to the user name MFA is set up on and set SERIAL to the string that represents the ID for the TOTP token.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Set USERID to the user name MFA is set up on and set SERIAL to the string that represents the ID for the TOTP token.
Example
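A hedged sketch of removing the token and then checking that it is gone:
radosgw-admin mfa remove --uid=johndoe --totp-serial=MFAtest
radosgw-admin mfa get --uid=johndoe --totp-serial=MFAtest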
Syntax
Example
Administration
Edit online
As a storage administrator, you can manage the Ceph Object Gateway using the radosgw-admin command line interface (CLI) or
using the IBM Storage Ceph Dashboard.
NOTE: Not all of the Ceph Object Gateway features are available to the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
Storage policies give Ceph Object Gateway clients a way of accessing a storage strategy, that is, the ability to target a particular type
of storage, such as SSDs, SAS drives, and SATA drives, as a way of ensuring, for example, durability, replication, and erasure coding.
For details, see the Storage Strategies
1. Create a new pool .rgw.buckets.special with the desired storage strategy. For example, a pool customized with erasure-
coding, a particular CRUSH ruleset, the number of replicas, and the pg_num and pgp_num count.
Syntax
Example
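A hedged sketch of creating the pool with erasure coding; the placement group counts are assumptions and a custom erasure-code profile or CRUSH rule can be appended as needed:
ceph osd pool create .rgw.buckets.special 16 16 erasure
ceph osd pool application enable .rgw.buckets.special rgw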
Example
{
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"master_zone": "",
"zones": [{
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 5
}],
"placement_targets": [{
"name": "default-placement",
"tags": []
}, {
"name": "special-placement",
"tags": []
}],
"default_placement": "default-placement"
}
Example
5. Get the zone configuration and store it in a file, for example, zone.json:
Example
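A hedged sketch, assuming the default zone:
radosgw-admin zone get --rgw-zone=default > zone.json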
6. Edit the zone file and add the new placement policy key under placement_pool:
Example
{
"domain_root": ".rgw",
"control_pool": ".rgw.control",
"gc_pool": ".rgw.gc",
"log_pool": ".log",
"intent_log_pool": ".intent-log",
"usage_log_pool": ".usage",
"user_keys_pool": ".users",
"user_email_pool": ".users.email",
"user_swift_pool": ".users.swift",
"user_uid_pool": ".users.uid",
"system_key": {
"access_key": "",
"secret_key": ""
},
"placement_pools": [{
"key": "default-placement",
"val": {
"index_pool": ".rgw.buckets.index",
"data_pool": ".rgw.buckets",
"data_extra_pool": ".rgw.buckets.extra"
}
}, {
"key": "special-placement",
"val": {
"index_pool": ".rgw.buckets.index",
"data_pool": ".rgw.buckets.special",
"data_extra_pool": ".rgw.buckets.extra"
}
Example
Example
Example
IMPORTANT: The bucket index does not reflect the correct state of the bucket, and listing these buckets does not correctly return
their list of objects. This affects multiple features. Specifically, these buckets are not synced in a multi-zone environment because
the bucket index is not used to store change information. IBM recommends not to use S3 object versioning on indexless buckets,
because the bucket index is necessary for this feature.
NOTE: Using indexless buckets removes the limit on the maximum number of objects in a single bucket.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Example
In this example, the buckets created in the indexless-placement target will be indexless buckets.
Example
5. Restart the Ceph Object Gateways on all nodes in the storage cluster for the change to take effect:
Syntax
Example
You can reshard a bucket index either manually offline or dynamically online.
Bucket index resharding prevents performance bottlenecks when you add a high number of objects per bucket.
You can configure bucket index resharding for new buckets or change the bucket index on the existing ones.
During dynamic bucket index resharding, all the Ceph Object Gateway buckets are checked periodically to detect buckets
that require resharding. If a bucket has grown larger than the value specified in the
rgw_max_objs_per_shard parameter, the Ceph Object Gateway reshards the bucket dynamically in the background. The
default value for rgw_max_objs_per_shard is 100k objects per shard. Dynamic bucket index resharding works as
expected on an upgraded single-site configuration without any modification to the zone or the zone group. A single-site
configuration can be any of the following:
Prerequisites
Edit online
Procedure
Edit online
Perform either of the below two steps to perform recovery of bucket indexes:
Syntax
marker is d8a347a4-99b6-4312-a5c1-75b83904b3d4.41610.2
bucket_id is d8a347a4-99b6-4312-a5c1-75b83904b3d4.41610.2
number of bucket index shards is 5
data pool is local-zone.rgw.buckets.data
NOTICE: This tool is currently considered EXPERIMENTAL.
The list of objects that we will attempt to restore can be found in "/tmp/rgwrbi-
object-list.49946".
Please review the object names in that file (either below or in another
window/terminal) before proceeding.
Type "proceed!" to proceed, "view" to view object list, or "q" to quit: view
Viewing...
Type "proceed!" to proceed, "view" to view object list, or "q" to quit: proceed!
Proceeding...
NOTICE: Bucket stats are currently incorrect. They can be restored with the following
command after 2 minutes:
radosgw-admin bucket list --bucket=bucket-large-1 --allow-unordered --max-
entries=1073741824
real 2m16.530s
user 0m1.082s
sys 0m0.870s
NOTE: The tool's scope is limited to a single site, not to multi-site deployments; that is, if you run the rgw-restore-bucket-index tool
at site-1, it does not recover objects at site-2, and vice versa. In a multi-site deployment, run the recovery tool and the object re-index
command at both sites for a bucket.
Maximum number of objects in one bucket before it needs resharding: Use a maximum of 102,400 objects per bucket
index shard. To take full advantage of resharding and maximize parallelism, provide a sufficient number of OSDs in the Ceph
Object Gateway bucket index pool. This parallelization scales with the number of Ceph Object Gateway instances, and
replaces the in-order index shard enumeration with a number sequence. The default locking timeout is extended from 60
seconds to 90 seconds.
Maximum number of objects when using sharding: Based on prior testing, the number of bucket index shards currently
supported is 65521.
You can reshard a bucket three times before the other zones catch up: Resharding is not recommended until the older
generations synchronize. Around four generations of the buckets from previous reshards are supported. Once the limit is
reached, dynamic resharding does not reshard the bucket again until at least one of the old log generations is fully trimmed.
Using the command radosgw-admin bucket reshard throws the following error:
Bucket BUCKET_NAME already has too many log generations (4) from previous reshards that peer
zones haven't finished syncing.
Resharding is not recommended until the old generations sync, but you can force a reshard with
`--yes-i-really-mean-it`.
A value greater than 0 to enable bucket sharding and to set the maximum number of shards.
Prerequisites
Edit online
Procedure
Edit online
NOTE: The maximum number of bucket index shards currently supported is 65,521.
Syntax
Example
To configure bucket index resharding for all instances of the Ceph Object Gateway, set the
rgw_override_bucket_index_max_shards parameter with the global option.
To configure bucket index resharding only for a particular instance of the Ceph Object Gateway, add
rgw_override_bucket_index_max_shards parameter under the instance.
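The following is a minimal sketch of both variants using the configuration database; the shard count and the instance name are illustrative placeholders, not values taken from this guide:
[ceph: root@host01 /]# ceph config set client.rgw rgw_override_bucket_index_max_shards 12
[ceph: root@host01 /]# ceph config set client.rgw.INSTANCE_NAME rgw_override_bucket_index_max_shards 12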
3. Restart the Ceph Object Gateways on all nodes in the cluster to take effect:
Syntax
Example
Reference
Edit online
A value greater than 0 to enable bucket sharding and to set the maximum number of shards.
NOTE: Mapping the index pool, for each zone, if applicable, to a CRUSH ruleset of SSD-based OSDs might also help with bucket index
performance. See CRUSH performance domains
Prerequisites
Edit online
Procedure
Edit online
NOTE: The maximum number of bucket index shards currently supported is 65,521.
Example
3. In the zonegroup.json file, set the bucket_index_max_shards parameter for each named zone:
Syntax
bucket_index_max_shards = VALUE
Example
bucket_index_max_shards = 12
Example
Example
Syntax
Example
Verification
Edit online
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
rgw_reshard_num_logs: The number of shards for the resharding log. The default value is 16.
rgw_reshard_bucket_lock_duration: The duration of the lock on a bucket during resharding. The default value is
360 seconds.
rgw_max_objs_per_shard: The maximum number of objects per shard. The default value is 100000 objects per
shard.
rgw_reshard_thread_interval: The maximum time between rounds of reshard thread processing. The default
value is 600 seconds.
Example
Syntax
Example
Example
Syntax
Example
WARNING: You can only cancel pending resharding operations. Do not cancel ongoing resharding operations.
Syntax
Example
Verification
Edit online
Syntax
Example
Reference
Edit online
You need to enable the resharding feature manually on the existing zones and the zone groups after upgrading the storage cluster.
NOTE: You can reshard a bucket three times before the other zones catch-up. See Limitations of bucket index resharding
NOTE: If a bucket is created and uploaded with more than the threshold number of objects for dynamic resharding, you must
continue writing I/O to the old bucket for the resharding process to begin.
Prerequisites
Edit online
All the Ceph Object Gateway daemons enabled at both the sites are upgraded to the latest version.
Procedure
Edit online
Example
If the resharding feature is not listed under enabled_features for the zonegroup, continue with the procedure.
2. Enable the resharding feature on all the zonegroups in the multi-site configuration where Ceph Object Gateway is installed:
Syntax
Example
Example
4. Enable the resharding feature on all the zones in the multi-site configuration where Ceph Object Gateway is installed:
Syntax
Example
Example
6. Verify that the resharding feature is enabled on the zones and zonegroups. Each zone lists its supported_features and
each zonegroup lists its enabled_features.
Example
"zones": [
{
"id": "505b48db-6de0-45d5-8208-8c98f7b1278d",
"name": "us_east",
"endpoints": [
"http://10.0.208.11:8080"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 11,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": "",
"supported_features": [
"resharding"
]
"default_placement": "default-placement",
"realm_id": "26cf6f23-c3a0-4d57-aae4-9b0010ee55cc",
"sync_policy": {
"groups": []
},
"enabled_features": [
"resharding"
]
Example
In this example, you can see that the resharding feature is enabled for the us zonegroup.
8. Optional: You can disable the resharding feature for the zonegroups:
a. Disable the feature on all the zonegroups in the multi-site where Ceph Object Gateway is installed:
Example
Example
Reference
Edit online
Creates a new set of bucket index objects for the specified bucket.
Links the new bucket instance with the bucket so that all new index operations go through the new bucket indexes.
Prints the old and the new bucket ID to the command output.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Verification
Edit online
Syntax
Example
Clean the stale instances manually to prevent them from negatively impacting the performance of the storage cluster.
IMPORTANT: Use this procedure only in simple deployments, not in multi-site clusters.
Prerequisites
Edit online
Procedure
Edit online
Verification
Edit online
Syntax
Example
However, for older buckets that had lifecycle policies and have undergone resharding, you can fix such buckets with the reshard
fix option.
Prerequisites
Edit online
Procedure
Edit online
Syntax
IMPORTANT: If you do not use the --bucket argument, then the command fixes lifecycle policies for all the buckets in the
storage cluster.
Example
Enabling compression
Edit online
The Ceph Object Gateway supports server-side compression of uploaded objects using any of Ceph’s compression plugins. These
include:
zlib: Supported.
snappy: Supported.
zstd: Supported.
Configuration
To enable compression on a zone’s placement target, provide the --compression=TYPE option to the radosgw-admin zone
placement modify command. The compression TYPE refers to the name of the compression plugin to use when writing new
object data.
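As an illustrative sketch, the following enables zlib compression on the default zone's default placement target; the zone and placement names shown are the defaults and should be adjusted for your deployment:
[ceph: root@host01 /]# radosgw-admin zone placement modify --rgw-zone default --placement-id default-placement --compression zlib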
Each compressed object stores the compression type. Changing the setting does not hinder the ability to decompress existing
compressed objects, nor does it force the Ceph Object Gateway to recompress existing objects.
This compression setting applies to all new objects uploaded to buckets using this placement target.
To disable compression on a zone’s placement target, provide the --compression=TYPE option to the radosgw-admin zone
placement modify command and specify an empty string or none.
Example
After enabling or disabling compression, restart the Ceph Object Gateway instance so the change will take effect.
NOTE: Ceph Object Gateway creates a default zone and a set of pools. For production deployments, see Creating a realm
Statistics
While all existing commands and APIs continue to report object and bucket sizes based on their uncompressed data, the radosgw-
admin bucket stats command includes compression statistics for all buckets.
Syntax
The size is the accumulated size of the objects in the bucket, uncompressed and unencrypted. The size_kb is the accumulated
size in kilobytes and is calculated as ceiling(size/1024). In this example, it is ceiling(1075028/1024) = 1050.
The size_actual is the accumulated size of all the objects after each object is distributed in a set of 4096-byte blocks. If a bucket
has two objects, one of size 4100 bytes and the other of 8500 bytes, the first object is rounded up to 8192 bytes and the second one
is rounded up to 12288 bytes, so their total for the bucket is 20480 bytes. The size_kb_actual is the actual size in kilobytes and is
calculated as size_actual/1024. In this example, it is 1331200/1024 = 1300.
The size_utilized is the total size of the data in bytes after it has been compressed and/or encrypted. Encryption could increase
the size of the object while compression could decrease it. The size_kb_utilized is the total size in kilobytes and is calculated as
ceiling(size_utilized/1024). In this example, it is ceiling(592035/1024)= 579.
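A trimmed sketch of the relevant usage section of the radosgw-admin bucket stats output, using the figures cited above; the bucket name is a placeholder and other fields of the full output are omitted:
[ceph: root@host01 /]# radosgw-admin bucket stats --bucket=mybucket
    "usage": {
        "rgw.main": {
            "size": 1075028,
            "size_actual": 1331200,
            "size_utilized": 592035,
            "size_kb": 1050,
            "size_kb_actual": 1300,
            "size_kb_utilized": 579
        }
    }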
User management
Edit online
Ceph Object Storage user management refers to users that are client applications of the Ceph Object Storage service; not the Ceph
Object Gateway as a client application of the Ceph Storage Cluster. You must create a user, access key, and secret to enable client
applications to interact with the Ceph Object Gateway service.
You can create, modify, view, suspend, and remove users and subusers.
IMPORTANT: When managing users in a multi-site deployment, ALWAYS issue the radosgw-admin command on a Ceph Object
Gateway node within the master zone of the master zone group to ensure that users synchronize throughout the multi-site cluster.
DO NOT create, modify, or delete users on a multi-site cluster from a secondary zone or a secondary zone group.
In addition to creating user and subuser IDs, you may add a display name and an email address for a user. You can specify a key and
secret, or generate a key and secret automatically. When generating or specifying keys, note that user IDs correspond to an S3 key
type and subuser IDs correspond to a swift key type. Swift keys also have access levels of read, write, readwrite and full.
User management command line syntax generally follows the pattern user COMMAND USER_ID where USER_ID is either the --
uid= option followed by the user's ID (S3) or the --subuser= option followed by the user name (Swift).
Syntax
Multi-tenant namespace
Create a user
Create a subuser
Get user information
Modify user information
Enable and suspend users
Remove a user
Remove a subuser
Rename a user
Create a key
Add and remove access keys
Add and remove admin capabilities
Multi-tenant namespace
Edit online
The Ceph Object Gateway supports multi-tenancy for both the S3 and Swift APIs, where each user and bucket lies under a "tenant."
Multi-tenancy prevents namespace clashes when multiple tenants use common bucket names, such as "test", "main", and so
forth.
Each user and bucket lies under a tenant. For backward compatibility, a "legacy" tenant with an empty name is added. Whenever a
bucket is referred to without explicitly specifying a tenant, the Swift API assumes the "legacy" tenant. Existing users are also
stored under the legacy tenant, so they access buckets and objects the same way as in earlier releases.
Tenants as such do not have any operations on them. They appear and disappear as needed, when users are administered. In order
to create, modify, and remove users with explicit tenants, either an additional option --tenant is supplied, or a syntax
"_TENANT_$_USER_" is used in the parameters of the radosgw-admin command.
Example
To create a user testx$tester for Swift, run one of the following commands:
Example
NOTE: The subuser with explicit tenant had to be quoted in the shell.
Create a user
Edit online
Use the user create command to create an S3-interface user. You MUST specify a user ID and a display name. You may also
specify an email address. If you DO NOT specify a key or secret, radosgw-admin will generate them for you automatically. However,
you may specify a key and/or a secret if you prefer not to use generated key/secret pairs.
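A hedged sketch of the command form that matches the user shown in the output below; the user ID, display name, and email address are illustrative:
[root@host01 ~]# radosgw-admin user create --uid=janedoe --display-name="Jane Doe" [email protected]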
Syntax
Example
{ "user_id": "janedoe",
"display_name": "Jane Doe",
"email": "[email protected]",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{ "user": "janedoe",
"access_key": "11BS02LGFB6AL6H1ADMW",
"secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": { "enabled": false,
"max_size_kb": -1,
"max_objects": -1},
"user_quota": { "enabled": false,
"max_size_kb": -1,
"max_objects": -1},
"temp_url_keys": []}
IMPORTANT: Check the key output. Sometimes radosgw-admin generates a JSON escape (\) character, and some clients
do not know how to handle JSON escape characters. Remedies include removing the JSON escape character (\), encapsulating the
string in quotes, regenerating the key to ensure that it does not have a JSON escape character, or specifying the key and secret
manually.
Create a subuser
Edit online
To create a subuser (Swift interface), you must specify the user ID (--uid=_USERNAME_), a subuser ID and the access level for the
subuser. If you DO NOT specify a key or secret, radosgw-admin generates them for you automatically. However, you can specify a
key, a secret, or both if you prefer not to use generated key and secret pairs.
NOTE: full is not readwrite, as it also includes the access control policy.
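A hedged sketch of the subuser creation command that matches the output shown below; the user and subuser IDs are illustrative:
[root@host01 ~]# radosgw-admin subuser create --uid=janedoe --subuser=janedoe:swift --access=full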
Syntax
{ "user_id": "janedoe",
"display_name": "Jane Doe",
"email": "[email protected]",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [
{ "id": "janedoe:swift",
"permissions": "full-control"}],
"keys": [
{ "user": "janedoe",
"access_key": "11BS02LGFB6AL6H1ADMW",
"secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": { "enabled": false,
"max_size_kb": -1,
"max_objects": -1},
"user_quota": { "enabled": false,
"max_size_kb": -1,
"max_objects": -1},
"temp_url_keys": []}
Example
To get information about a tenanted user, specify both the user ID and the name of the tenant.
Example
To modify subuser values, specify subuser modify and the subuser ID.
Example
To re-enable a suspended user, specify user enable and the user ID:
Remove a user
Edit online
When you remove a user, the user and subuser are removed from the system. However, you may remove only the subuser if you
wish. To remove a user (and subuser), specify user rm and the user ID.
Syntax
Example
To remove the subuser only, specify subuser rm and the subuser name.
Example
Options include:
Purge Data: The --purge-data option purges all data associated with the UID.
Purge Keys: The --purge-keys option purges all keys associated with the UID.
Remove a subuser
Edit online
When you remove a subuser, you are removing access to the Swift interface. The user remains in the system. To remove the subuser,
specify subuser rm and the subuser ID.
Syntax
Example
Options include:
Purge Keys: The --purge-keys option purges all keys associated with the UID.
Rename a user
Edit online
To change the name of a user, use the radosgw-admin user rename command. The time that this command takes depends on
the number of buckets and objects that the user has. If the number is large, IBM recommends using the command in the Screen
utility provided by the screen package.
root or sudo access to the host running the Ceph Object Gateway.
Procedure
Edit online
1. Rename a user:
Syntax
Example
{
"user_id": "user2",
"display_name": "user 2",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "user2",
"access_key": "59EKHI6AI9F8WOW8JQZJ",
"secret_key": "XH0uY3rKCUcuL73X0ftjXbZqUbk0cavD11rD8MsA"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw"
}
If a user is inside a tenant, specify both the user name and the tenant:
Syntax
Example
Syntax
Example
Syntax
Create a key
Edit online
To create a key for a user, you must specify key create. For a user, specify the user ID and the s3 key type. To create a key for a
subuser, you must specify the subuser ID and the swift key type.
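For example, a minimal sketch of creating a Swift key for the subuser shown in the output below; the subuser ID is illustrative:
[root@host01 ~]# radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret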
Example
{ "user_id": "johndoe",
"rados_uid": 0,
"display_name": "John Doe",
"email": "[email protected]",
"suspended": 0,
"subusers": [
{ "id": "johndoe:swift",
"permissions": "full-control"}],
"keys": [
{ "user": "johndoe",
"access_key": "QFAMEDSJP5DEKJO0DDXY",
"secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
"swift_keys": [
{ "user": "johndoe:swift",
"secret_key": "E9T2rUZNu2gxUjcwUBO8n/Ev4KX6/GprEuH4qhu1"}]}
--key-type=_KEY_TYPE_ specifies a key type. The options are: swift and s3.
Example
To remove an access key, you need to specify the user and the key:
Example
2. Specify the user ID and the access key from the previous step to remove the access key:
Syntax
Example
Syntax
You can add read, write, or all capabilities to users, buckets, metadata, and usage (utilization).
Syntax
--caps="[users|buckets|metadata|usage|zone]=[*|read|write|read, write]"
Example
Example
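A hedged illustration of adding and then removing a capability with the syntax above; the user ID and capability string are placeholders:
[root@host01 ~]# radosgw-admin caps add --uid=janedoe --caps="usage=read,write"
[root@host01 ~]# radosgw-admin caps rm --uid=janedoe --caps="usage=read,write"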
Role management
Edit online
As a storage administrator, you can create, delete, or update a role and the permissions associated with that role with the radosgw-
admin commands.
A role is similar to a user and has permission policies attached to it. It can be assumed by any identity. If a user assumes a role, a set
of dynamically created temporary credentials are returned to the user. A role can be used to delegate access to users, applications
and services that do not have permissions to access some S3 resources.
Creating a role
Reference
Edit online
Creating a role
Edit online
Create a role for the user with the radosgw-admin role create command. Include the assume-role-policy-doc parameter
in the command; this is the trust relationship policy document that grants an entity the permission to
assume the role.
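A hedged sketch of the command form, using the role name, path, and principal shown in the output below:
[root@host01 ~]# radosgw-admin role create --role-name=S3Access1 --path=/application_abc/component_xyz/ --assume-role-policy-doc='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}'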
Prerequisites
Edit online
An S3 bucket created.
Procedure
Edit online
Syntax
Example
{
"RoleId": "ca43045c-082c-491a-8af1-2eebca13deec",
"RoleName": "S3Access1",
"Path": "/application_abc/component_xyz/",
"Arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1",
"CreateDate": "2022-06-17T10:18:29.116Z",
"MaxSessionDuration": 3600,
"AssumeRolePolicyDocument": "{"Version":"2012-10-17","Statement":
[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":
Getting a role
Edit online
Get the information about a role with the get command.
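For example, a minimal sketch using the role name shown in the output below:
[root@host01 ~]# radosgw-admin role get --role-name=S3Access1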
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
{
"RoleId": "ca43045c-082c-491a-8af1-2eebca13deec",
"RoleName": "S3Access1",
"Path": "/application_abc/component_xyz/",
"Arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1",
"CreateDate": "2022-06-17T10:18:29.116Z",
"MaxSessionDuration": 3600,
"AssumeRolePolicyDocument": "{"Version":"2012-10-17","Statement":
[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":
["sts:AssumeRole"]}]}"
}
Reference
Edit online
Creating a role
Listing a role
Edit online
You can list the roles in the specific path with the role list command.
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
[
{
"RoleId": "85fb46dd-a88a-4233-96f5-4fb54f4353f7",
"RoleName": "kvm-sts",
"Path": "/application_abc/component_xyz/",
"Arn": "arn:aws:iam:::role/application_abc/component_xyz/kvm-sts",
"CreateDate": "2022-09-13T11:55:09.39Z",
"MaxSessionDuration": 7200,
"AssumeRolePolicyDocument": "{"Version":"2012-10-17","Statement":
[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/kvm"]},"Action":
["sts:AssumeRole"]}]}"
},
{
"RoleId": "9116218d-4e85-4413-b28d-cdfafba24794",
"RoleName": "kvm-sts-1",
"Path": "/application_abc/component_xyz/",
"Arn": "arn:aws:iam:::role/application_abc/component_xyz/kvm-sts-1",
"CreateDate": "2022-09-16T00:05:57.483Z",
"MaxSessionDuration": 3600,
"AssumeRolePolicyDocument": "{"Version":"2012-10-17","Statement":
[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/kvm"]},"Action":
["sts:AssumeRole"]}]}"
}
]
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
{
"RoleId": "ca43045c-082c-491a-8af1-2eebca13deec",
"RoleName": "S3Access1",
"Path": "/application_abc/component_xyz/",
"Arn": "arn:aws:iam:::role/application_abc/component_xyz/S3Access1",
"CreateDate": "2022-06-17T10:18:29.116Z",
"MaxSessionDuration": 3600,
"AssumeRolePolicyDocument": "{"Version":"2012-10-17","Statement":
[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":
["sts:AssumeRole"]}]}"
}
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Example
{
"Permission policy": "{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":
["s3:*"],"Resource":"arn:aws:s3:::example_bucket"}]}"
}
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
[
"Policy1"
]
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
Deleting a role
Edit online
You can delete the role only after removing the permission policy attached to it.
Prerequisites
Edit online
A role created.
An S3 bucket created.
Procedure
Edit online
Syntax
Example
Syntax
Reference
Edit online
Prerequisites
Edit online
An S3 bucket created.
A role created.
Procedure
Edit online
Syntax
Example
Verification
Edit online
Example
Quota management
Edit online
The Ceph Object Gateway enables you to set quotas on users and buckets owned by users. Quotas include the maximum number of
objects in a bucket and the maximum storage size in megabytes.
Bucket: The --bucket option allows you to specify a quota for buckets the user owns.
Maximum Objects: The --max-objects setting allows you to specify the maximum number of objects. A negative value
disables this setting.
Maximum Size: The --max-size option allows you to specify a quota for the maximum number of bytes. A negative value
disables this setting.
Quota Scope: The --quota-scope option sets the scope for the quota. The options are bucket and user. Bucket quotas
apply to buckets a user owns. User quotas apply to a user.
IMPORTANT: Buckets with a large number of objects can cause serious performance issues. The recommended maximum number
of objects in one bucket is 100,000. To increase this number, configure bucket index sharding. See Configure bucket index
resharding.
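A hedged sketch of setting and enabling a user quota; the user ID and limits are illustrative placeholders:
[root@host01 ~]# radosgw-admin quota set --quota-scope=user --uid=janedoe --max-objects=1024 --max-size=1024000
[root@host01 ~]# radosgw-admin quota enable --quota-scope=user --uid=janedoe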
Syntax
Example
A negative value for the number of objects or the maximum size means that the specific quota attribute check is disabled.
Syntax
Syntax
A negative value for NUMBER_OF_OBJECTS, MAXIMUM_SIZE_IN_BYTES, or both means that the specific quota attribute check is
disabled.
Syntax
Syntax
Syntax
To get quota settings for a tenanted user, specify the user ID and the name of the tenant:
Syntax
Syntax
Syntax
NOTE: You should run the radosgw-admin user stats command with the --sync-stats option to receive the latest data.
Quota cache
Edit online
Quota statistics are cached for each Ceph Object Gateway instance. If there are multiple instances, then the cache can keep quotas from
being perfectly enforced, as each instance will have a different view of the quotas. The options that control this are rgw bucket
quota ttl, rgw user quota bucket sync interval, and rgw user quota sync interval. The higher these values are,
the more efficient quota operations are, but the more out-of-sync multiple instances will be. The lower these values are, the closer to
perfect enforcement multiple instances will achieve. If all three are 0, then quota caching is effectively disabled, and multiple
instances will have perfect quota enforcement.
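For example, a minimal sketch of effectively disabling quota caching by setting all three options to 0 through the configuration database:
[root@host01 ~]# ceph config set client.rgw rgw_bucket_quota_ttl 0
[root@host01 ~]# ceph config set client.rgw rgw_user_quota_bucket_sync_interval 0
[root@host01 ~]# ceph config set client.rgw rgw_user_quota_sync_interval 0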
The global quota settings can be manipulated with the global quota counterparts of the quota set, quota enable, and quota
disable commands, for example:
[root@host01 ~]# radosgw-admin global quota set --quota-scope bucket --max-objects 1024
[root@host01 ~]# radosgw-admin global quota enable --quota-scope bucket
NOTE: In a multi-site configuration, where there is a realm and period present, changes to the global quotas must be committed
using period update --commit. If there is no period present, the Ceph Object Gateways must be restarted for the changes to
take effect.
Bucket management
Edit online
As a storage administrator, when using the Ceph Object Gateway you can manage buckets by moving them between users and
renaming them. You can create bucket notifications to trigger on specific events. Also, you can find orphan or leaky objects within the
Ceph Object Gateway that can occur over the lifetime of a storage cluster.
NOTE: When millions of objects are uploaded to a Ceph Object Gateway bucket with a high ingest rate, incorrect num_objects are
reported with the radosgw-admin bucket stats command. With the radosgw-admin bucket list command you can
correct the value of num_objects.
NOTE: The radosgw-admin bucket stats command does not return the Unknown error 2002 error; it explicitly translates it to
the POSIX error 2, "No such file or directory".
Renaming buckets
Moving buckets
Finding orphan and leaky objects
Managing bucket index entries
Bucket notifications
Creating bucket notifications
Reference
Edit online
Developer
Renaming buckets
Edit online
You can rename buckets. If you want to allow underscores in bucket names, then set the rgw_relaxed_s3_bucket_names option
to true.
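A minimal sketch, assuming an existing bucket named original-bucket owned by the user testuser (all names are placeholders):
[ceph: root@host01 /]# ceph config set client.rgw rgw_relaxed_s3_bucket_names true
[ceph: root@host01 /]# radosgw-admin bucket link --bucket=original-bucket --bucket-new-name=new-bucket --uid=testuser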
Prerequisites
Edit online
An existing bucket.
Procedure
Edit online
Example
Syntax
Example
Syntax
Example
Example
Moving buckets
Edit online
The radosgw-admin bucket utility provides the ability to move buckets between users. To do so, link the bucket to a new user and
change the ownership of the bucket to the new user.
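A hedged sketch of the two steps, assuming a bucket named data and a target user user2 (both are placeholders):
[ceph: root@host01 /]# radosgw-admin bucket link --uid=user2 --bucket=data
[ceph: root@host01 /]# radosgw-admin bucket chown --uid=user2 --bucket=data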
Prerequisites
Edit online
An S3 bucket.
Procedure
Edit online
Syntax
Example
Example
Syntax
Example
4. Verify that the ownership of the data bucket has been successfully changed by checking the owner line in the output of the
following command:
Example
Procedure
Edit online
Syntax
Example
Syntax
Example
4. Verify that the ownership of the data bucket has been successfully changed by checking the owner line in the output of the
following command:
Procedure
Edit online
1. Optional: If you do not already have multiple tenants, you can create them by enabling rgw_keystone_implicit_tenants
and accessing the Ceph Object Gateway from an external tenant:
Example
Access the Ceph Object Gateway from an external tenant using either the s3cmd or swift command:
Example
Or use s3cmd:
Example
The first access from an external tenant creates an equivalent Ceph Object Gateway user.
Syntax
Example
3. Verify that the data bucket has been linked to tenanted-user successfully:
Example
Syntax
Example
5. Verify that the ownership of the data bucket has been successfully changed by checking the owner line in the output of the
following command:
Example
An orphan object exists in a storage cluster and has an object ID associated with the RADOS object. However, there is no reference to
that RADOS object from the S3 object in the bucket index.
For example, if the Ceph Object Gateway goes down in the middle of an operation, this can cause some objects to become orphans.
Also, an undiscovered bug can cause orphan objects to occur.
You can see how the Ceph Object Gateway objects map to the RADOS objects. The radosgw-admin command provides a tool to
search for and produce a list of these potential orphan or leaky objects. Using the radoslist subcommand displays objects stored
within buckets, or all buckets in the storage cluster. The rgw-orphan-list script displays orphan objects within a pool.
NOTE: The radoslist subcommand is replacing the deprecated orphans find and orphans finish subcommands.
IMPORTANT: Do not use this command where indexless buckets are in use, as all the objects appear as orphaned.
IMPORTANT: An alternative way to identify orphaned objects is to run the rados -p <pool> ls | grep BUCKET_ID
command.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
NOTE: If the BUCKET_NAME is omitted, then all objects in all buckets are displayed.
Example
Example
Example
5. From the pool list, select the pool in which you want to find orphans. This script might run for a long time depending on the
objects in the cluster.
Example
Example
Available pools:
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data
rbd
default.rgw.buckets.non-ec
ma.rgw.control
ma.rgw.meta
ma.rgw.log
ma.rgw.buckets.index
ma.rgw.buckets.data
ma.rgw.buckets.non-ec
Which pool do you want to search for orphans?
IMPORTANT: A data pool must be specified when using the rgw-orphan-list command, and not a metadata pool.
Syntax
rgw-orphan-list -h
rgw-orphan-list POOL_NAME /DIRECTORY
Example
Example
[root@host01 orphans]# ls -l
Example
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.0
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.1
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.2
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.3
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.4
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.5
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.6
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.7
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.8
a9c042bc-be24-412c-9052-dda6b2f01f55.16749.1_key1.cherylf.433-bucky-4865-0.9
Syntax
Example
WARNING: Verify you are removing the correct objects. Running the rados rm command removes data from the storage
cluster.
The stats for the bucket are stored in the bucket index headers. This phase loads those headers and also iterates through all the
plain object entries in the bucket index and recalculates the stats. It then displays the actual and calculated stats in sections labeled
"existing_header" and "calculated_header" respectively, so they can be compared.
If you use the --fix option with the bucket check sub-command, it removes the "orphaned" entries from the bucket index and
also overwrites the existing stats in the header with those that it calculated. It causes all entries, including the multiple entries used
in versioning, to be listed in the output.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
2. Fix the inconsistencies in the bucket index, including removal of orphaned objects:
Syntax
Example
Bucket notifications
Edit online
Bucket notifications provide a way to send information out of the Ceph Object Gateway when certain events happen in the bucket.
Bucket notifications can be sent to HTTP, AMQP0.9.1, and Kafka endpoints. A notification entry must be created to send bucket
notifications for events on a specific bucket and to a specific topic. A bucket notification can be created on a subset of event types or
by default for all event types. The bucket notification can filter out events based on key prefix or suffix, regular expression matching
the keys, and on the metadata attributes attached to the object, or the object tags. Bucket notifications have a REST API to provide
configuration and control interfaces for the bucket notification mechanism.
NOTE: The bucket notifications API is enabled by default. If the rgw_enable_apis configuration parameter is explicitly set, ensure that
s3 and notifications are included. To verify this, run the ceph --admin-daemon /var/run/ceph/ceph-
client.rgw.NAME.asok config get rgw_enable_apis command. Replace NAME with the Ceph Object Gateway instance
name.
You can manage list, get, and remove topics for the Ceph Object Gateway buckets:
List topics: Run the following command to list the configuration of all topics:
Example
Get topics: Run the following command to get the configuration of a specific topic:
Example
Remove topics: Run the following command to remove the configuration of a specific topic:
Example
NOTE: The topic is removed even if the Ceph Object Gateway bucket is configured to that topic.
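A hedged sketch of the three operations described above; the topic name is a placeholder:
[ceph: root@host01 /]# radosgw-admin topic list
[ceph: root@host01 /]# radosgw-admin topic get --topic=mytopic
[ceph: root@host01 /]# radosgw-admin topic rm --topic=mytopic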
Prerequisites
Edit online
Root-level access.
Endpoint parameters.
IMPORTANT: IBM supports ObjectCreate events, such as put, post, multipartUpload, and copy. IBM also supports
ObjectRemove events, such as object_delete and s3_multi_object_delete.
Procedure
Edit online
Using the boto script
Example
2. Create an S3 bucket.
3. Create a python script topic.py to create an SNS topic for the http, amqp, or kafka protocol:
Example
import boto3
from botocore.client import Config
import sys

# Placeholders: replace with your gateway endpoint and user credentials
endpoint = 'http://127.0.0.1:8000'
access_key = 'ACCESS_KEY'
secret_key = 'SECRET_KEY'
# Attributes define the push endpoint (http, amqp, or kafka) for the topic
attributes = {"push-endpoint": "kafka://localhost", "verify-ssl": "False", "kafka-ack-level": "broker", "persistent": "true"}

client = boto3.client('sns',
                      endpoint_url=endpoint,
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key,
                      config=Config(signature_version='s3'))
client.create_topic(Name="mytopic", Attributes=attributes)
python3 topic.py
5. Create a python script notification.py to create S3 bucket notification for s3:objectCreate and s3:objectRemove
events:
Example
import boto3
import sys

# Placeholders: replace the endpoint, credentials, bucket name, and topic ARN with your own values
client = boto3.client('s3',
                      endpoint_url='http://127.0.0.1:8000',
                      aws_access_key_id='ACCESS_KEY',
                      aws_secret_access_key='SECRET_KEY')

# regex filter on the object name and metadata based filtering are extension to AWS S3 API
# bucket and topic should be created beforehand
topic_conf_list = [{'Id': 'test_notification',
                    'TopicArn': 'arn:aws:sns:default::mytopic',
                    'Events': ['s3:ObjectCreated:*', 's3:ObjectRemoved:*']}]
client.put_bucket_notification_configuration(
    Bucket='mybucket',
    NotificationConfiguration={'TopicConfigurations': topic_conf_list})
Example
python3 notification.py
Example
import boto3

endpoint = 'http://127.0.0.1:8000'
access_key = '0555b35654ad1656d804'
secret_key = 'h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=='
bucketname = 'mybucket'  # placeholder: the bucket whose notification configuration is printed

client = boto3.client('s3',
                      endpoint_url=endpoint,
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
print(client.get_bucket_notification_configuration(Bucket=bucketname))
a. Verify the object deletion events at the http, rabbitmq, or kafka receiver.
1. Create topic:
Syntax
Example
sample topic.json:
{"push-endpoint": "kafka://localhost","verify-ssl": "False", "kafka-ack-level": "broker",
"persistent":"true"}
ref: https://docs.aws.amazon.com/cli/latest/reference/sns/create-topic.html
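A hedged sketch of the corresponding AWS CLI call; the endpoint URL is a placeholder and the topic name matches the sample notification.json below:
aws --endpoint-url=http://127.0.0.1:8000 sns create-topic --name=test-kafka --attributes=file://topic.json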
Syntax
Example
sample notification.json
{
"TopicConfigurations": [
{
"Id": "test_notification",
"TopicArn": "arn:aws:sns:us-west-2:123456789012:test-kafka",
"Events": [
"s3:ObjectCreated:*"
]
}
]
}
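A hedged sketch of applying the sample notification.json with the AWS CLI; the endpoint URL and bucket name are placeholders:
aws --endpoint-url=http://127.0.0.1:8000 s3api put-bucket-notification-configuration --bucket=mybucket --notification-configuration=file://notification.json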
Syntax
Example
Bucket lifecycle
Edit online
As a storage administrator, you can use a bucket lifecycle configuration to manage your objects so they are stored effectively
throughout their lifetime. For example, you can transition objects to less expensive storage classes, archive them, or even delete them.
RADOS Gateway supports S3 API object expiration by using rules defined for a set of bucket objects. Each rule has a prefix, which
selects the objects, and a number of days after which objects become unavailable.
Prerequisites
Edit online
An S3 bucket created.
Access to a Ceph Object Gateway client with the AWS CLI package installed.
Procedure
Edit online
Example
Example
{
"Rules": [
{
"Filter": {
"Prefix": "images/"
},
"Status": "Enabled",
"Expiration": {
"Days": 1
},
"ID": "ImageExpiration"
}
]
}
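Assuming the configuration above is saved to a file named lifecycle.json, a hedged sketch of applying it with the AWS CLI follows; the endpoint URL, bucket name, and file name are placeholders:
aws --endpoint-url=http://host01:80 s3api put-bucket-lifecycle-configuration --bucket=testbucket --lifecycle-configuration=file://lifecycle.json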
Syntax
Example
Syntax
Example
Optional: From the Ceph Object Gateway node, log into the Cephadm shell and retrieve the bucket lifecycle configuration:
Syntax
Example
Reference
Edit online
S3 bucket lifecycle
For more information on using the AWS CLI to manage lifecycle configurations, see the Setting lifecycle configuration on a
bucket section of the Amazon Simple Storage Service documentation.
Prerequisites
Edit online
An S3 bucket created.
Access to a Ceph Object Gateway client with the AWS CLI package installed.
Procedure
Edit online
Syntax
Example
Syntax
Example
Optional: From the Ceph Object Gateway node, retrieve the bucket lifecycle configuration:
Syntax
Example
The command does not return any information if a bucket lifecycle policy is not present.
NOTE: The put-bucket-lifecycle-configuration overwrites an existing bucket lifecycle configuration. If you want to retain
any of the current lifecycle policy settings, you must include them in the lifecycle configuration file.
Prerequisites
Edit online
An S3 bucket created.
Access to a Ceph Object Gateway client with the AWS CLI package installed.
Procedure
Edit online
Example
Example
Syntax
Example
Syntax
Example
{
"Rules": [
{
"Expiration": {
"Days": 30
},
"ID": "DocsExpiration",
"Filter": {
"Prefix": "docs/"
},
"Status": "Enabled"
},
{
"Expiration": {
"Days": 1
},
"ID": "ImageExpiration",
"Filter": {
"Prefix": "images/"
},
"Status": "Enabled"
}
]
}
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Example
Example
[
    {
        "bucket": ":testbucket:8b63d584-9ea1-4cf3-8443-a6a15beca943.54187.1",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":testbucket1:8b635499-9e41-4cf3-8443-a6a15345943.54187.2",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    }
]
Syntax
Example
Example
Verification
Edit online
Reference
Edit online
S3 bucket lifecycle
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Verification
Edit online
Example
06:00-08:00
Reference
Edit online
S3 bucket lifecycle
You can create a schedule for data movement between a hot storage class and a cold storage class. You can schedule this movement
after a specified time so that the object expires and is deleted permanently. For example, you can transition objects to a storage class
30 days after you have created them, or even archive objects to a storage class one year after creating them. You can do this through a
transition rule. This rule applies to an object transitioning from one storage class to another. The lifecycle configuration contains one
or more rules using the <Rule> element.
You can migrate data between replicated pools, erasure-coded pools, replicated to erasure-coded pools, or erasure-coded to
replicated pools with the Ceph Object Gateway lifecycle transition policy.
Procedure
Edit online
Syntax
Example
Syntax
Example
3. Provide the zone placement information for the new storage class:
Syntax
Example
[ceph: root@host01 /]# radosgw-admin zone placement add --rgw-zone default --placement-id
default-placement --storage-class hot.test --data-pool test.hot.data
{
"key": "default-placement",
"val": {
"index_pool": "test_zone.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "test.hot.data"
},
"hot.test": {
"data_pool": "test.hot.data",
}
},
NOTE: Consider setting the compression_type when creating cold or archival data storage pools with write once.
Syntax
Example
[ceph: root@host01 /]# ceph osd pool application enable test.hot.data rgw
enabled application 'rgw' on pool 'test.hot.data'
6. Create a bucket:
Example
Example
Syntax
Example
Syntax
Example
10. Provide the zone placement information for the new storage class:
Syntax
Example
Syntax
Example
[ceph: root@host01 /]# ceph osd pool application enable test.cold.data rgw
enabled application 'rgw' on pool 'test.cold.data'
13. To view the zone group configuration, run the following command:
Syntax
Syntax
Example
Example
{
"ETag": "\"211599863395c832a3dfcba92c6a3b90\"",
"Size": 540,
"StorageClass": "STANDARD",
"Key": "obj1",
"VersionId": "W95teRsXPSJI4YWJwwSG30KxSCzSgk-",
"IsLatest": true,
"LastModified": "2023-11-23T10:38:07.214Z",
"Owner": {
"DisplayName": "test-user",
"ID": "test-user"
}
}
Example
{
"Rules": [
{
"Filter": {
"Prefix": ""
},
"Status": "Enabled",
"Transitions": [
{
"Days": 5,
"StorageClass": "hot.test"
},
{
"Days": 20,
"StorageClass": "cold.test"
}
],
"Expiration": {
"Days": 365
},
"ID": "double transition and expiration"
}
]
}
The lifecycle configuration example shows an object that transitions from the default
`STANDARD` storage class to the `hot.test` storage class after 5 days, transitions again to the
`cold.test` storage class after 20 days, and finally expires after 365 days in the
`cold.test` storage class.
Example
21. Verify that the object is transitioned to the given storage class:
Example
{
"ETag": "\"211599863395c832a3dfcba92c6a3b90\"",
"Size": 540,
"StorageClass": "cold.test",
"Key": "obj1",
IMPORTANT: The object version(s), not the object name, is the defining and required value for object lock to perform correctly to
support the GOVERNANCE or COMPLIANCE mode. You need to know the version of the object when it is written so that you can
retrieve it at a later time.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
NOTE: You can choose either the GOVERNANCE or COMPLIANCE mode for the RETENTION_MODE in S3 object lock, to apply
different levels of protection to any object version that is protected by object lock.
In GOVERNANCE mode, users cannot overwrite or delete an object version or alter its lock settings unless they have special
permissions.
3. Put the object into the bucket with a retention time set:
Syntax
Example
Syntax
Example
Command-line options
Example
NOTE: Using the object lock legal hold operation, you can place a legal hold on an object version, thereby preventing an object
version from being overwritten or deleted. A legal hold doesn’t have an associated retention period and hence, remains in
effect until removed.
List the objects from the bucket to retrieve only the latest version of the object:
Example
Example
Example
Usage
Edit online
The Ceph Object Gateway logs usage for each user. You can track user usage within date ranges too.
Options include:
Start Date: The --start-date option allows you to filter usage stats from a particular start date (format: yyyy-mm-
dd[HH:MM:SS]).
End Date: The --end-date option allows you to filter usage up to a particular date (format: yyyy-mm-dd[HH:MM:SS]).
Log Entries: The --show-log-entries option allows you to specify whether or not to include log entries with the usage stats
(options: true | false).
NOTE: You can specify time with minutes and seconds, but it is stored with 1 hour resolution.
Show usage
Trim usage
Show usage
Edit online
To show usage statistics, specify the usage show. To show usage for a particular user, you must specify a user ID. You may also
specify a start date, end date, and whether or not to show log entries.
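A hedged sketch of showing usage for one user over a date range; the user ID and dates are placeholders:
[root@host01 ~]# radosgw-admin usage show --uid=janedoe --start-date=2022-06-01 --end-date=2022-07-01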
Example
You may also show a summary of usage information for all users by omitting a user ID.
Example
Example
metadata
bucket index
data
Metadata
bucket: Holds a mapping between bucket name and bucket instance ID.
Syntax
Example
IMPORTANT: When using the radosgw-admin tool, ensure that the tool and the Ceph Cluster are of the same version. The use of
mismatched versions is not supported.
NOTE: A Ceph Object Gateway object might consist of several RADOS objects, the first of which is the head that contains the
metadata, such as manifest, Access Control List (ACL), content type, ETag, and user-defined metadata. The metadata is stored in
xattrs. The head might also contain up to 512 KB of object data, for efficiency and atomicity. The manifest describes how each
object is laid out in RADOS objects.
Bucket index
The map itself is kept in OMAP associated with each RADOS object. The key of each OMAP is the name of the objects, and the value
holds some basic metadata of that object, the metadata that appears when listing the bucket. Each OMAP holds a header, and we
keep some bucket accounting metadata in that header such as number of objects, total size, and the like.
NOTE: OMAP is a key-value store, associated with an object, in a way similar to how extended attributes associate with a POSIX file.
An object’s OMAP is not physically located in the object’s storage, but its precise implementation is invisible and immaterial to the
Ceph Object Gateway.
Data Objects data is kept in one or more RADOS objects for each Ceph Object Gateway object.
Account information, which has the access key in S3 or account name in Swift
At present, Ceph Object Gateway only uses account information to find out the user ID and for access control. It uses only the bucket
name and object key to address the object in a pool.
Account information
The user ID in Ceph Object Gateway is a string, typically the actual user name from the user credentials and not a hashed or mapped
identifier.
When accessing a user’s data, the user record is loaded from an object USER_ID in the default.rgw.meta pool with users.uid
namespace.
Bucket names
They are represented in the default.rgw.meta pool with root namespace. Bucket record is loaded in order to obtain a marker,
which serves as a bucket ID.
Object names
The object is located in the default.rgw.buckets.data pool. Object name is MARKER_KEY, for example
default.7593.4_image.png, where the marker is default.7593.4 and the key is image.png. These concatenated names are
not parsed and are passed down to RADOS only. Therefore, the choice of the separator is not important and causes no ambiguity. For
the same reason, slashes are permitted in object names, such as keys.
NOTE: See the user-visible, encoded class cls_user_bucket_entry and its nested class cls_user_bucket for the values of
these OMAP entries.
Objects that belong to a given bucket are listed in a bucket index. The default naming for index objects is .dir.MARKER in the
default.rgw.buckets.index pool.
Reference
Edit online
Known pools:
.rgw.root Unspecified region, zone, and global information records, one per object.
ZONE.rgw.control notify.N
Example
.bucket.meta.prodtx:test%25star:default.84099.6
.bucket.meta.testcont:default.4126.1
.bucket.meta.prodtx:testcont:default.84099.4
prodtx/testcont
prodtx/test%25star
testcont
namespace: users.uid Contains per-user information (RGWUserInfo) in USER objects and per-user lists of buckets in omaps of
USER.buckets objects. The USER might contain the tenant if non-empty.
Example
prodtx$prodt
test2.buckets
prodtx$prodt.buckets
test2
This allows Ceph Object Gateway to look up users by their access keys during authentication.
ZONE.rgw.buckets.index Objects are named .dir.MARKER, each contains a bucket index. If the index is sharded, each shard
appends the shard index after the marker.
ZONE.rgw.buckets.data Objects are named MARKER_KEY, for example, default.7593.4__shadow_.488urDFerTYXavx4yAd-Op8mxehnvTI_1.
Reference
Edit online
Garbage collection operations typically run in the background. You can configure these operations to either run continuously, or to
run only during intervals of low activity and light workloads. By default, the Ceph Object Gateway conducts GC operations
continuously. Because GC operations are a normal part of Ceph Object Gateway operations, deleted objects that are eligible for
garbage collection exist most of the time.
Prerequisites
Edit online
Procedure
Edit online
Example
NOTE: To list all entries in the queue, including unexpired entries, use the --include-all option.
The Ceph Object Gateway purges the storage space used for deleted objects after deleting the objects from the bucket index.
Similarly, the Ceph Object Gateway will delete data associated with a multi-part upload after the multi-part upload completes or
when the upload has gone inactive or failed to complete for a configurable amount of time.
Viewing the objects awaiting garbage collection can be done with the following command:
radosgw-admin gc list
Garbage collection is a background activity that runs continuously or during times of low loads, depending upon how the storage
administrator configures the Ceph Object Gateway. By default, the Ceph Object Gateway conducts garbage collection operations
continuously. Since garbage collection operations are a normal function of the Ceph Object Gateway, especially with object delete
operations, objects eligible for garbage collection exist most of the time.
Some workloads can temporarily or permanently outpace the rate of garbage collection activity. This is especially true of delete-
heavy workloads, where many objects get stored for a short period of time and then deleted. For these types of workloads, storage
administrators can increase the priority of garbage collection operations relative to other operations with the following configuration
parameters:
The rgw_gc_obj_min_wait configuration option waits a minimum length of time, in seconds, before purging a deleted
object’s data. The default value is two hours, or 7200 seconds. The object is not purged immediately, because a client might
be reading the object. Under heavy workloads, this setting can consume too much storage or have a large number of deleted
objects to purge. IBM recommends not setting this value below 30 minutes, or 1800 seconds.
The rgw_gc_processor_period configuration option is the garbage collection cycle run time. That is, the amount of time
between the start of consecutive runs of garbage collection threads. If garbage collection runs longer than this period, the
Ceph Object Gateway will not wait before running a garbage collection cycle again.
The rgw_gc_max_concurrent_io configuration option specifies the maximum number of concurrent IO operations that the
gateway garbage collection thread will use when purging deleted data. Under delete heavy workloads, consider increasing this
setting to a larger number of concurrent IO operations.
The rgw_gc_max_trim_chunk configuration option specifies the maximum number of keys to remove from the garbage
collector log in a single operation. Under delete heavy operations, consider increasing the maximum number of keys so that
more objects are purged during each garbage collection operation.
Some new configuration parameters have been added to Ceph Object Gateway to tune the garbage collection queue, as follows:
The rgw_gc_max_deferred_entries_size configuration option sets the maximum size of deferred entries in the garbage
collection queue.
The rgw_gc_max_queue_size configuration option sets the maximum queue size used for garbage collection. This value
should not be greater than osd_max_object_size minus rgw_gc_max_deferred_entries_size minus 1 KB.
The rgw_gc_max_deferred configuration option sets the maximum number of deferred entries stored in the garbage
collection queue.
NOTE: In testing, with an evenly balanced delete-write workload, such as 50% delete and 50% write operations, the storage cluster
fills completely in 11 hours. This is because Ceph Object Gateway garbage collection fails to keep pace with the delete operations.
The cluster status switches to the HEALTH_ERR state if this happens. Aggressive settings for parallel garbage collection tunables
significantly delayed the onset of storage cluster fill in testing and can be helpful for many workloads. Typical real-world storage
cluster workloads are not likely to cause a storage cluster fill primarily due to garbage collection.
Procedure
Edit online
1. Set the value of rgw_gc_max_concurrent_io to 20, and the value of rgw_gc_max_trim_chunk to 64:
Example
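A minimal sketch of this step using the configuration database (the values shown are the ones named in step 1):
[ceph: root@host01 /]# ceph config set client.rgw rgw_gc_max_concurrent_io 20
[ceph: root@host01 /]# ceph config set client.rgw rgw_gc_max_trim_chunk 64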
2. Restart the Ceph Object Gateway to allow the changed settings to take effect.
3. Monitor the storage cluster during GC activity to verify that the increased values do not adversely affect performance.
IMPORTANT: Never modify the value for the rgw_gc_max_objs option in a running cluster. You should only change this value
before deploying the RGW nodes.
Reference
Edit online
The S3 API in the Ceph Object Gateway currently supports a subset of the AWS bucket lifecycle configuration actions:
Expiration
NoncurrentVersionExpiration
AbortIncompleteMultipartUpload
rgw_lc_max_worker specifies the number of lifecycle worker threads to run in parallel. This enables the simultaneous
processing of both bucket and index shards. The default value for this option is 3.
rgw_lc_max_wp_worker specifies the number of threads in each lifecycle worker thread’s work pool. This option helps to
accelerate processing for each bucket. The default value for this option is 3.
For a workload with a large number of buckets — for example, a workload with thousands of buckets — consider increasing the value
of the rgw_lc_max_worker option.
For a workload with a smaller number of buckets but with a higher number of objects in each bucket — such as in the hundreds of
thousands — consider increasing the value of the rgw_lc_max_wp_worker option.
NOTE: Before increasing the value of either of these options, validate the current storage cluster performance and Ceph Object
Gateway utilization. IBM does not recommend assigning a value of 10 or above for either of these options.
Prerequisites
Edit online
Procedure
Edit online
1. To increase the number of threads to run in parallel, set the value of rgw_lc_max_worker to a value between 3 and 9:
Example
2. To increase the number of threads in each thread’s work pool, set the value of rgw_lc_max_wp_worker to a value between
3 and 9:
Example
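A minimal sketch of steps 1 and 2 using the configuration database; the value of 7 is an illustrative choice within the 3 to 9 range stated above:
[ceph: root@host01 /]# ceph config set client.rgw rgw_lc_max_worker 7
[ceph: root@host01 /]# ceph config set client.rgw rgw_lc_max_wp_worker 7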
3. Restart the Ceph Object Gateway to allow the changed settings to take effect.
4. Monitor the storage cluster to verify that the increased values do not adversely affect performance.
Reference
Edit online
For more information about Ceph Object Gateway lifecycle, contact IBM Support.
Testing
Edit online
As a storage administrator, you can do basic functionality testing to verify that the Ceph Object Gateway environment is working as
expected. You can use the REST interfaces by creating an initial Ceph Object Gateway user for the S3 interface, and then create a
subuser for the Swift interface.
Prerequisites
Edit online
Create an S3 user
Edit online
To test the gateway, create an S3 user and grant the user access. The man radosgw-admin command provides information on
additional command options.
NOTE: In a multi-site deployment, always create a user on a host in the master zone of the master zone group.
Prerequisites
Edit online
Procedure
Edit online
1. Create an S3 user:
Syntax
Example
2. Verify the output to ensure that the values of access_key and secret_key do not include a JSON escape character (\).
These values are needed for access validation, but certain clients cannot handle if the values include JSON escape characters.
To fix this problem, perform one of the following actions:
Regenerate the key and ensure that it does not include a JSON escape character.
NOTE: In a multi-site deployment, always create a user on a host in the master zone of the master zone group.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
Test S3 access
Edit online
You need to write and run a Python test script for verifying S3 access. The S3 access test script will connect to the radosgw, create a
new bucket, and list all buckets. The values for aws_access_key_id and aws_secret_access_key are taken from the values of
access_key and secret_key returned by the radosgw_admin command.
Prerequisites
Edit online
Procedure
Edit online
vi s3test.py
Syntax
import boto3

# Replace HOST and PORT with the gateway host name and port
endpoint = 'http://HOST:PORT'
access_key = 'ACCESS'
secret_key = 'SECRET'
s3 = boto3.client(
's3',
endpoint_url=endpoint,
aws_access_key_id=access_key,
aws_secret_access_key=secret_key
)
s3.create_bucket(Bucket='my-new-bucket')
response = s3.list_buckets()
for bucket in response['Buckets']:
print("{name}\t{created}".format(
name = bucket['Name'],
created = bucket['CreationDate']
))
Replace endpoint with the URL of the host where you have configured the gateway service, that is, the gateway
host. Ensure that the host setting resolves with DNS. Replace PORT with the port number of the gateway.
Replace ACCESS and SECRET with the access_key and secret_key values.
python3 s3test.py
my-new-bucket 2022-05-31T17:09:10.000Z
Syntax
Replace IP_ADDRESS with the public IP address of the gateway server and SWIFT_SECRET_KEY with its value from the output of
the radosgw-admin key create command issued for the swift user. Replace PORT with the port number you are using with
Beast. If you do not replace the port, it will default to port 80.
For example:
my-new-bucket
Configuration reference
Edit online
As a storage administrator, you can set various options for the Ceph Object Gateway. These options contain default values. If you do
not specify each option, then the default value is set automatically.
To set specific values for these options, update the configuration database by using the ceph config set client.rgw
_OPTION_ _VALUE_ command.
General settings
About pools
Lifecycle settings
Swift settings
Logging settings
Keystone settings
LDAP settings
General settings
Edit online
Name | Description | Type | Default
rgw_data | Sets the location of the data files for Ceph Object Gateway. | String | /var/lib/ceph/radosgw/$cluster-$id
rgw_enable_apis | Enables the specified APIs. | String | s3, s3website, swift, swift_auth, admin, sts, iam, notifications
rgw_cache_enabled | Whether the Ceph Object Gateway cache is enabled. | Boolean | true
rgw_cache_lru_size | The number of entries in the Ceph Object Gateway cache. | Integer | 10000
rgw_socket_path | The socket path for the domain socket. FastCgiExternalServer uses this socket. If you do not specify a socket path, Ceph Object Gateway will not run as an external server. The path you specify here must be the same as the path specified in the rgw.conf file. | String | N/A
rgw_host | The host for the Ceph Object Gateway instance. Can be an IP address or a hostname. | String | 0.0.0.0
rgw_port | Port the instance listens for requests. If not specified, Ceph Object Gateway runs external FastCGI. | String | None
rgw_dns_name | The DNS name of the served domain. See also the hostnames setting within zone groups. | String | None
rgw_script_uri | The alternative value for the SCRIPT_URI if not set in the request. | String | None
rgw_request_uri | The alternative value for the REQUEST_URI if not set in the request. | String | None
rgw_print_continue | Enable 100-continue if it is operational. | Boolean | true
rgw_remote_addr_param | The remote address parameter. For example, the HTTP field containing the remote address, or the X-Forwarded-For address if a reverse proxy is operational. | String | REMOTE_ADDR
rgw_op_thread_timeout | The timeout in seconds for open threads. | Integer | 600
rgw_op_thread_suicide_timeout | The timeout in seconds before a Ceph Object Gateway process dies. Disabled if set to 0. | Integer | 0
rgw_thread_pool_size | The size of the thread pool. | Integer | 512 threads
rgw_num_control_oids | The number of notification objects used for cache synchronization between different rgw instances. | Integer | 8
About pools
Edit online
Ceph zones map to a series of Ceph Storage Cluster pools.
If the user key for the Ceph Object Gateway contains write capabilities, the gateway has the ability to create pools automatically.
This is convenient for getting started. However, the Ceph Object Storage Cluster uses the placement group default values unless they
were set in the Ceph configuration file. Additionally, Ceph will use the default CRUSH hierarchy. These settings are NOT ideal for
production systems.
The default pools for the Ceph Object Gateway’s default zone include:
.rgw.root
.default.rgw.control
.default.rgw.meta
.default.rgw.log
.default.rgw.buckets.index
.default.rgw.buckets.data
.default.rgw.buckets.non-ec
The Ceph Object Gateway creates pools on a per zone basis. If you create the pools manually, prepend the zone name. The system
pools store objects related to, for example, system control, logging, and user information. By convention, these pool names have the
zone name prepended to the pool name.
.<zone-name>.log: The log pool contains logs of all bucket/container and object actions, such as create, read, update, and
delete.
.<zone-name>.rgw.meta: The metadata pool stores user_keys and other critical metadata.
.<zone-name>.meta:users.keys: The keys pool contains access keys and secret keys for each user ID.
.<zone-name>.meta:users.email: The email pool contains email addresses associated with a user ID.
.<zone-name>.meta:users.swift: The Swift pool contains the Swift subuser information for a user ID.
Lifecycle settings
Edit online
As a storage administrator, you can set various bucket lifecycle options for a Ceph Object Gateway. These options contain default
values. If you do not specify each option, then the default value is set automatically.
To set specific values for these options, update the configuration database by using the ceph config set client.rgw OPTION VALUE command.
Swift settings
Edit online
Name | Description | Type | Default
rgw_enforce_swift_acls | Enforces the Swift Access Control List (ACL) settings. | Boolean | true
rgw_swift_token_expiration | The time in seconds for expiring a Swift token. | Integer | 24 * 3600
rgw_swift_url | The URL for the Ceph Object Gateway Swift API. | String | None
rgw_swift_url_prefix | The URL prefix for the Swift API, for example, http://fqdn.com/swift. | String | swift
rgw_swift_auth_url | Default URL for verifying v1 auth tokens (if not using internal Swift auth). | String | None
rgw_swift_auth_entry | The entry point for a Swift auth URL. | String | auth
Keystone settings
Edit online
Name | Description | Type | Default
rgw_keystone_url | The URL for the Keystone server. | String | None
rgw_keystone_admin_token | The Keystone admin token (shared secret). | String | None
rgw_keystone_accepted_roles | The roles required to serve requests. | String | Member, admin
LDAP settings
Edit online
Name | Description | Type | Example
rgw_ldap_uri | A space-separated list of LDAP servers in URI format. | String | ldaps://<ldap.your.domain>
rgw_ldap_searchdn | The LDAP search domain name, also known as base domain. | String | cn=users,cn=accounts,dc=example,dc=com
rgw_ldap_binddn | The gateway will bind with this LDAP entry (user match). | String | uid=admin,cn=users,dc=example,dc=com
rgw_ldap_secret | A file containing credentials for rgw_ldap_binddn. | String | /etc/openldap/secret
rgw_ldap_dnattr | LDAP attribute containing Ceph Object Gateway user names (to form binddns). | String | uid
Block devices
Edit online
Learn to manage, create, configure, and use IBM Storage Ceph Block Devices.
Block-based storage interfaces are the most common way to store data with rotating media, such as:
Hard drives
CD/DVD discs
Floppy disks
Ceph block devices are thin-provisioned, resizable and store data striped over multiple Object Storage Devices (OSD) in a Ceph
storage cluster. Ceph block devices are also known as Reliable Autonomic Distributed Object Store (RADOS) Block Devices (RBDs).
Ceph block devices leverage RADOS capabilities such as:
Replication
Data consistency
Ceph block devices interact with OSDs by using the librbd library.
Ceph block devices deliver high performance with infinite scalability to Kernel Virtual Machines (KVMs), such as Quick Emulator
(QEMU), and cloud-based computing systems, like OpenStack, that rely on the libvirt and QEMU utilities to integrate with Ceph
block devices. You can use the same storage cluster to operate the Ceph Object Gateway and Ceph block devices simultaneously.
IMPORTANT: Using Ceph block devices requires access to a running Ceph storage cluster. For details on installing an IBM Storage
Ceph cluster, see the Installation.
NOTE: The -h option still displays help for all available commands.
Prerequisites
Edit online
1. Use the rbd help command to display help for a particular rbd command and its subcommand:
Syntax
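For example, a minimal invocation that shows help for the snap list subcommand (the subcommand choice here is illustrative):
rbd help snap list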
NOTE: You MUST create a pool first before you can specify it as a source.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
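A minimal sketch of creating and initializing a pool for RBD use, assuming a hypothetical pool named pool1:
ceph osd pool create pool1
rbd pool init pool1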
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
This example creates a 1 GB image named image1 that stores information in a pool named pool1.
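A minimal sketch of the command described by the example above, with the size given in megabytes:
rbd create --size 1024 pool1/image1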
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
1. To list block devices in the rbd pool, execute the following command:
Example
Syntax
rbd ls POOL_NAME
Example
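For example, assuming a pool named pool1:
rbd ls pool1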
Prerequisites
Edit online
Procedure
Edit online
1. To retrieve information from a particular image in the default rbd pool, run the following command:
Syntax
Example
Syntax
Example
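For example, assuming an image named image1 in the pool1 pool:
rbd info pool1/image1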
Prerequisites
Edit online
Procedure
Edit online
To increase the maximum size of a Ceph block device image for the default rbd pool:
Syntax
To decrease the maximum size of a Ceph block device image for the default rbd pool:
Syntax
Example
To increase the maximum size of a Ceph block device image for a specific pool:
Syntax
Example
To decrease the maximum size of a Ceph block device image for a specific pool:
Syntax
Example
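A minimal sketch of both operations, assuming the image1 image in the pool1 pool; sizes are in megabytes and shrinking requires the --allow-shrink option:
rbd resize --size 2048 pool1/image1
rbd resize --size 1024 pool1/image1 --allow-shrink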
Prerequisites
Edit online
Procedure
Edit online
Syntax
rbd rm IMAGE_NAME
Example
Syntax
Example
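For example, the pool-qualified form, assuming the image1 image in the pool1 pool:
rbd rm pool1/image1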
Once an image is moved to the trash, it can be removed from the trash at a later time. This helps to avoid accidental deletion.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
NOTE: You need this image ID to specify the image later if you need to use any of the trash options.
2. Execute the rbd trash list POOL_NAME for a list of IDs of the images in the trash. This command also returns the image’s
pre-deletion name. In addition, there is an optional --image-id argument that can be used with rbd info and rbd snap
commands. Use --image-id with the rbd info command to see the properties of an image in the trash, and with rbd
snap to remove an image’s snapshots from the trash.
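A minimal sketch of moving an image to the trash and listing the trash, assuming the image1 image in the pool1 pool:
rbd trash mv pool1/image1
rbd trash list pool1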
Syntax
Example
Syntax
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
[ceph: root@host01 /]# rbd trash purge schedule add --pool pool1 10m
Syntax
Example
Example
Syntax
Example
[ceph: root@host01 /]# rbd trash purge schedule remove --pool pool1 10m
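As a hedged sketch, the configured purge schedules for the pool1 pool can be reviewed as follows:
rbd trash purge schedule ls --pool pool1
rbd trash purge schedule status --pool pool1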
NOTE: The deep flatten feature can be only disabled on already existing images but not enabled. To use deep flatten, enable
it when creating images.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
2. Enable a feature:
Syntax
i. To enable the exclusive-lock feature on the image1 image in the pool1 pool:
Example
IMPORTANT: If you enable the fast-diff and object-map features, then rebuild the object map:
Syntax
3. Disable a feature:
Syntax
i. To disable the fast-diff feature on the image1 image in the pool1 pool:
Example
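A minimal sketch of the enable, rebuild, and disable steps described above, assuming the image1 image in the pool1 pool:
rbd feature enable pool1/image1 exclusive-lock
rbd object-map rebuild pool1/image1
rbd feature disable pool1/image1 fast-diff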
Also, by using metadata, you can set the RADOS Block Device (RBD) configuration parameters for particular images.
Procedure
Edit online
Syntax
Example
This example sets the last_update key to the 2021-06-06 value on the image1 image in the pool1 pool.
Syntax
Example
Syntax
Example
This example lists the metadata set for the image1 image in the pool1 pool.
Syntax
Example
This example removes the last_update key-value pair from the image1 image in the pool1 pool.
5. To override the RBD image configuration settings set in the Ceph configuration file for a particular image:
Syntax
Example
[ceph: root@host01 /]# rbd config image set pool1/image1 rbd_cache false
This example disables the RBD cache for the image1 image in the pool1 pool.
See Block device general options for a list of possible configuration options.
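A minimal end-to-end sketch of the metadata commands described in this section, assuming the image1 image in the pool1 pool:
rbd image-meta set pool1/image1 last_update 2021-06-06
rbd image-meta get pool1/image1 last_update
rbd image-meta list pool1/image1
rbd image-meta remove pool1/image1 last_update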
During this process, the source image is copied to the target image with all snapshot history and optionally with link to the source
image’s parent to help preserve sparseness. The source image is read only, the target image is writable. The target image is linked to
the source image while the migration is in progress.
You can safely run this process in the background while the new target image is in use. However, stop all clients that use the source
image before the preparation step to ensure that clients using the image are updated to point to the new target image.
IMPORTANT: The krbd kernel module does not support live migration at this time.
Prerequisites
Edit online
Procedure
Edit online
1. Prepare for migration by creating the new target image that cross-links the source and target images:
Syntax
Replace:
SOURCE_IMAGE with the name of the image to be moved. Use the POOL/IMAGE_NAME format.
TARGET_IMAGE with the name of the new image. Use the POOL/IMAGE_NAME format.
Example
2. Verify the state of the new target image, which is supposed to be prepared:
Syntax
Example
3. Optionally, restart the clients using the new target image name.
Syntax
Example
Example
6. Commit the migration by removing the cross-link between the source and target images, and this also removes the source
image:
Syntax
Example
If the source image is a parent of one or more clones, use the --force option after ensuring that the clone images are not in
use:
Example
7. If you did not restart the clients after the preparation step, restart them using the new target image name.
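A minimal sketch of the prepare and verification steps described above, assuming the source image pool1/image1 and the target image pool2/image1:
rbd migration prepare pool1/image1 pool2/image1
rbd status pool2/image1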
The rbdmap service
Edit online
The systemd unit file, rbdmap.service, is included with the ceph-common package. The rbdmap.service unit executes the
rbdmap shell script.
This script automates the mapping and unmapping of RADOS Block Devices (RBD) for one or more RBD images. The script can be run
manually at any time, but the typical use case is to automatically mount RBD images at boot time and unmount them at shutdown. The
script takes a single argument, which can be either map, for mounting, or unmap, for unmounting RBD images. The script parses a
configuration file, /etc/ceph/rbdmap by default, which can be overridden by using the RBDMAPFILE environment variable.
Each line of the configuration file corresponds to an RBD image.
IMAGE_SPEC RBD_OPTS
Where IMAGE_SPEC specifies the POOL_NAME / IMAGE_NAME, or just the IMAGE_NAME, in which case the POOL_NAME defaults
to rbd. The RBD_OPTS is an optional list of options to be passed to the underlying rbd map command. These parameters and their
values should be specified as a comma-separated string:
OPT1=VAL1,OPT2=VAL2,...,OPT_N=VAL_N
This will cause the script to issue an rbd map command like the following:
Syntax
When successful, the rbd map operation maps the image to a /dev/rbdX device, at which point a udev rule is triggered to create a
friendly device name symlink, for example, /dev/rbd/POOL_NAME/IMAGE_NAME, pointing to the real mapped device. For mounting
or unmounting to succeed, the friendly device name must have a corresponding entry in /etc/fstab file. When writing
/etc/fstab entries for RBD images, it is a good idea to specify the noauto or nofail mount option. This prevents the init system
from trying to mount the device too early, before the device exists.
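For illustration only, a hypothetical foo/bar1 entry with id and keyring options would cause the script to issue an rbd map call like the first line below, and the matching /etc/fstab entry could look like the second line; the mount point and filesystem type are assumptions:
rbd map foo/bar1 --id admin --keyring /etc/ceph/ceph.client.admin.keyring
/dev/rbd/foo/bar1 /mnt/bar1 xfs noauto,nofail 0 0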
Reference
Edit online
Edit online
Configure the rbdmap service to automatically map and mount, or unmap and unmount, RADOS Block Devices (RBD) at boot time or at shutdown, respectively.
Prerequisites
Edit online
Procedure
Edit online
Example
foo/bar1 id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
foo/bar2
id=admin,keyring=/etc/ceph/ceph.client.admin.keyring,options='lock_on_read,queue_depth=1024'
Example
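For example, to have the script run at boot and shutdown, enable the unit (a standard systemd step, shown as a sketch):
systemctl enable rbdmap.service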
Reference
Edit online
See The rbdmap service for more details on the RBD system service.
In an IBM Storage Ceph cluster, Persistent Write Log (PWL) cache provides a persistent, fault-tolerant write-back cache for librbd-
based RBD clients.
PWL cache uses a log-ordered write-back design which maintains checkpoints internally so that writes that get flushed back to the
cluster are always crash consistent. If the client cache is lost entirely, the disk image is still consistent but the data appears stale.
You can use PWL cache with persistent memory (PMEM) or solid-state disks (SSD) as cache devices.
For PMEM, the cache mode is called replica write log (RWL), and for SSD, the cache mode is called SSD. Currently, PWL cache supports
the RWL and SSD modes and is disabled by default.
PWL cache can provide high performance when the cache is not full. The larger the cache, the longer the duration of high
performance.
PWL cache provides persistence and is not much slower than RBD cache. RBD cache is faster but volatile and cannot
guarantee data order and persistence.
In a steady state, where the cache is full, performance is affected by the number of I/Os in flight. For example, PWL can
provide higher performance at low io_depth, but at high io_depth, such as when the number of I/Os is greater than 32, the
performance is often worse than that in cases without cache.
Different from RBD cache, PWL cache has non-volatile characteristics and is used in scenarios where you do not want data
loss and need performance.
RWL mode provides low latency. It has a stable low latency for burst I/Os and it is suitable for those scenarios with high
requirements for stable low latency.
RWL mode also provides a continuous and stable performance improvement in scenarios with low I/O depth or limited
inflight I/O.
The advantages of SSD mode are similar to RWL mode. SSD hardware is relatively cheap and popular, but its performance is
slightly lower than PMEM.
The underlying implementation of persistent memory (PMEM) and solid-state disks (SSD) is different, with PMEM having
higher performance. At present, PMEM can provide "persist on write" and SSD is "persist on flush or checkpoint". In future
releases, these two modes will be configurable.
When users switch frequently and open and close images repeatedly, Ceph displays poor performance. If PWL cache is
enabled, the performance is worse. Instead of setting num_jobs in a Flexible I/O (fio) test, set up multiple jobs that write to
different images.
Set the persistent write log cache options at the host level by using the ceph config set command. Set the persistent write log
cache options at the pool or image level by using the rbd config pool set or the rbd config image set commands.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
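A hedged sketch of the host-level settings, using the ceph config set command with the SSD cache mode:
ceph config set client rbd_persistent_cache_mode ssd
ceph config set client rbd_plugins pwl_cache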
ii. At the pool level, use the rbd config pool set command:
Syntax
Example
[ceph: root@host01 /]# rbd config pool set pool1 rbd_persistent_cache_mode ssd
[ceph: root@host01 /]# rbd config pool set pool1 rbd_plugins pwl_cache
iii. At the image level, use the rbd config image set command:
Syntax
Example
[ceph: root@host01 /]# rbd config image set pool1/image1 rbd_persistent_cache_mode ssd
[ceph: root@host01 /]# rbd config image set pool1/image1 rbd_plugins pwl_cache
Syntax
rbd_persistent_cache_mode CACHE_MODE
rbd_plugins pwl_cache
rbd_persistent_cache_path /PATH_TO_DAX_ENABLED_FOLDER/WRITE_BACK_CACHE_FOLDER <1>
rbd_persistent_cache_size PERSISTENT_CACHE_SIZE <2>
<1> rbd_persistent_cache_path - A file folder to cache data that must have direct access (DAX) enabled when using the
rwl mode to avoid performance degradation.
<2> rbd_persistent_cache_size - The cache size per image, with a minimum cache size of 1 GB. The larger the cache
size, the better the performance.
Example
rbd_cache false
rbd_persistent_cache_mode rwl
rbd_plugins pwl_cache
rbd_persistent_cache_path /mnt/pmem/cache/
rbd_persistent_cache_size 1073741824
Reference
Edit online
See Direct Access for files article on kernel.org for more details on using DAX.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Reference
Edit online
Prerequisites
Procedure
Edit online
Syntax
Example
A new Ceph Manager module, rbd_support, aggregates the performance metrics when enabled. The rbd command has two new
actions: iotop and iostat.
NOTE: The initial use of these actions can take around 30 seconds to populate the data fields.
Prerequisites
Edit online
Procedure
Edit online
Example
{
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator",
"pg_autoscaler",
"progress",
"rbd_support", <--
"status",
"telemetry",
"volumes"
}
Example
NOTE: The write-ops, read-ops, write-bytes, read-bytes, write-latency, and read-latency columns can be sorted dynamically
by using the right and left arrow keys.
Example
NOTE: The output from this command can be in JSON or XML format, and then can be sorted using other command-line tools.
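A minimal sketch of the two actions; the pool name pool1 in the second command is an assumption:
rbd perf image iotop
rbd perf image iostat pool1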
IMPORTANT: Currently, the krbd kernel module does not support live migration.
Prepare Migration: The first step is to create the new target image and link the target image to the source image. If the import-only
mode is not configured, the source image will also be linked to the target image and marked read-only. Attempts to read uninitialized
data extents within the target image will internally redirect the read to the source image, and writes to uninitialized extents within the
target image will internally deep copy the overlapping source image extents to the target image.
Execute Migration: This is a background operation that deep-copies all initialized blocks from the source image to the target. You
can run this step when clients are actively using the new target image.
Finish Migration: You can commit or abort the migration once the background migration process is completed. Committing the
migration removes the cross-links between the source and target images, and will remove the source image if not configured in the
import-only mode. Aborting the migration removes the cross-links and removes the target image.
Syntax
{
"type": "native",
"pool_name": "POOL_NAME",
["pool_id": "POOL_ID",] (optional, alternative to "POOL_NAME" key)
["pool_namespace": "POOL_NAMESPACE",] (optional)
"image_name": "IMAGE_NAME>",
["image_id": "IMAGE_ID",] (optional, useful if image is in trash)
"snap_name": "SNAP_NAME",
["snap_id": "SNAP_ID",] (optional, alternative to "SNAP_NAME" key)
}
Note that the native format does not include the stream object since it utilizes native Ceph operations. For example, to import from
the image rbd/ns1/image1@snap1, the source-spec could be encoded as:
Example
{
"type": "native",
"pool_name": "rbd",
"pool_namespace": "ns1",
"image_name": "image1",
"snap_name": "snap1"
}
You can use the qcow format to describe a QEMU copy-on-write (QCOW) block device. Both the QCOW v1 and v2 formats are
currently supported with the exception of advanced features such as compression, encryption, backing files, and external data files.
You can link the qcow format data to any supported stream source:
Example
{
"type": "qcow",
"stream": {
"type": "file",
"file_path": "/mnt/image.qcow"
}
}
You can use the raw format to describe a thick-provisioned, raw block device export, that is, rbd export --export-format 1
SNAP_SPEC. You can link the raw format data to any supported stream source:
Example
{
"type": "raw",
"stream": {
"type": "file",
"file_path": "/mnt/image-head.raw"
},
"snapshots": [
{
"type": "raw",
"name": "snap1",
"stream": {
"type": "file",
"file_path": "/mnt/image-snap1.raw"
}
},
] (optional oldest to newest ordering of snapshots)
}
The inclusion of the snapshots array is optional and currently only supports thick-provisioned raw snapshot exports.
You can use the file stream to import from a locally accessible POSIX file source.
Syntax
{
<format unique parameters>
"stream": {
"type": "file",
"file_path": "FILE_PATH"
}
}
For example, to import a raw-format image from a file located at /mnt/image.raw, the source-spec JSON file is:
Example
{
"type": "raw",
"stream": {
"type": "file",
"file_path": "/mnt/image.raw"
}
}
HTTP stream
You can use the HTTP stream to import from a remote HTTP or HTTPS web server.
Syntax
{
<format unique parameters>
"stream": {
"type": "http",
"url": "URL_PATH"
}
}
For example, to import a raw-format image from a file located at http://download.ceph.com/image.raw, the source-spec
JSON file is:
Example
{
"type": "raw",
"stream": {
"type": "http",
"url": "http://download.ceph.com/image.raw"
}
}
S3 stream
You can use the S3 stream to import from a remote S3 bucket.
Syntax
{
<format unique parameters>
"stream": {
"type": "s3",
"url": "URL_PATH",
"access_key": "ACCESS_KEY",
"secret_key": "SECRET_KEY"
}
}
Example
{
"type": "raw",
"stream": {
"type": "s3",
"url": "http://s3.ceph.com/bucket/image.raw",
"access_key": "NX5QOQKC6BH2IDN8HC7A",
"secret_key": "LnEsqNNqZIpkzauboDcLXLcYaWwLQ3Kop0zAnKIn"
}
}
NOTE: You cannot restart the clients using the source image as it will result in a failure.
Prerequisites
Edit online
Syntax
Example
OR
Syntax
Example
2. You can check the current state of the live migration process with the following command:
Syntax
Example
IMPORTANT: During the migration process, the source image is moved into the RBD trash to prevent mistaken usage.
Example
Example
Example
Syntax
Example
NOTE: The rbd migration prepare command accepts all the same image options as the rbd create command.
Example
NOTE: The sub-commands help users copy the image blocks. The user is not required to take any further action other than the
execute command.
Prerequisites
Edit online
One block device image with migration prepared using Live migration of images.
Procedure
Edit online
Syntax
Example
2. You can check the feedback on the progress of the migration block deep-copy process:
Syntax
Example
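A minimal sketch, assuming the prepared target image pool2/image1:
rbd migration execute pool2/image1
rbd status pool2/image1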
Prerequisites
Edit online
One block device image using Executing the live migration process.
Procedure
Edit online
Syntax
Example
Verification
Committing the live migration will remove the cross-links between the source and target images, and also removes the source image
from the source pool:
Example
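A minimal sketch of committing and then confirming that the source image is gone, assuming the target image pool2/image1 and the source pool pool1:
rbd migration commit pool2/image1
rbd ls pool1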
NOTE: You can abort only if you have not committed the live migration.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
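For example, assuming the in-progress target image pool2/image1:
rbd migration abort pool2/image1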
Verification
When the live migration process is aborted, the target image is deleted and access to the original source image is restored in the
source pool:
Image encryption
Edit online
As a storage administrator, you can set a secret key that is used to encrypt a specific RBD image. Image level encryption is handled
internally by RBD clients.
NOTE: The krbd module does not support image level encryption.
NOTE: You can use external tools such as dm-crypt or QEMU to encrypt an RBD image.
Encryption format
Encryption load
Supported formats
Adding encryption format to images and clones
Prerequisites
Edit online
Encryption format
Edit online
RBD images are not encrypted by default. You can encrypt an RBD image by formatting to one of the supported encryption formats.
The format operation persists the encryption metadata to the RBD image. The encryption metadata includes information such as the
encryption format and version, cipher algorithm and mode specifications, as well as the information used to secure the encryption
key.
The encryption key is protected by a user-kept secret, that is, a passphrase, which is never stored as persistent data in the RBD image.
The encryption format operation requires you to specify the encryption format, cipher algorithm, and mode specification as well as a
passphrase. The encryption metadata is stored in the RBD image, currently as an encryption header that is written at the start of the
raw image. This means that the effective image size of the encrypted image would be lower than the raw image size.
NOTE: Currently you can only encrypt flat RBD images. Clones of an encrypted RBD image are inherently encrypted using the same
encryption profile and passphrase.
NOTE: Any data written to the RBD image before formatting might become unreadable, even though it might still occupy storage
resources. RBD images with the journal feature enabled cannot be encrypted.
Encryption load
Edit online
By default, all RBD APIs treat encrypted RBD images the same way as unencrypted RBD images. You can read or write raw data
anywhere in the image. Writing raw data into the image might risk the integrity of the encryption format. For example, the raw data
could override the encryption metadata located at the beginning of the image. To safely perform encrypted Input/Output (I/O) or
maintenance operations on the encrypted RBD image, an additional encryption load operation must be applied immediately after
opening the image.
NOTE: Once the encryption is loaded on the RBD image, no other encryption load or format operation can be applied. Additionally,
API calls for retrieving the RBD image size using the opened image context return the effective image size. The encryption is loaded
automatically when mapping the RBD images as block devices through rbd-nbd.
NOTE: API calls for retrieving the image size and the parent overlap using the opened image context return the effective image size
and the effective parent overlap.
NOTE: If a clone of an encrypted image is explicitly formatted, flattening or shrinking of the cloned image ceases to be transparent
since the parent data must be re-encrypted according to the cloned image format as it is copied from the parent snapshot. If
encryption is not loaded before the flatten operation is issued, any parent data that was previously accessible in the cloned image
might become unreadable.
NOTE: If a clone of an encrypted image is explicitly formatted, the operation of shrinking the cloned image ceases to be transparent.
This is because, in scenarios such as the cloned image containing snapshots or the cloned image being shrunk to a size that is not
aligned with the object size, the action of copying some data from the parent snapshot, similar to flattening is involved. If encryption
is not loaded before the shrink operation is issued, any parent data that was previously accessible in the cloned image might become
unreadable.
Supported formats
Edit online
Both Linux Unified Key Setup (LUKS) 1 and 2 are supported. The data layout is fully compliant with the LUKS specification. External
LUKS compatible tools such as dm-crypt or QEMU can safely perform encrypted Input/Output (I/O) on encrypted RBD images.
Additionally, you can import existing LUKS images created by external tools, by copying the raw LUKS data into the RBD image.
Currently, only Advanced Encryption Standards (AES) 128 and 256 encryption algorithms are supported. xts-plain64 is currently the
only supported encryption mode.
To use the LUKS format, format the RBD image with the following command:
NOTE: You need to create a file named passphrase.txt and enter a passphrase. You can randomly generate the passphrase, which
might contain NULL characters. If the passphrase ends with a newline character, it will be stripped off.
Syntax
Example
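A minimal sketch, assuming the image1 image in the pool1 pool and a passphrase stored in passphrase.txt:
rbd encryption format pool1/image1 luks2 passphrase.txt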
The encryption format operation generates a LUKS header and writes it at the start of the RBD image. A single keyslot is appended to
the header. The keyslot holds a randomly generated encryption key, and is protected by the passphrase read from the passphrase
file. By default, AES-256 in xts-plain64 mode, which is the current recommended mode and the default for other LUKS tools, is used.
Adding or removing additional passphrases is currently not supported natively, but can be achieved using LUKS tools such as
cryptsetup. The LUKS header size can vary, up to 136 MiB in LUKS2, but it is usually up to 16 MiB, depending on the version of
libcryptsetup installed. For optimal performance, the encryption format sets the data offset to be aligned with the image
object size. For example, expect a minimum overhead of 8 MiB if using an image configured with an 8 MiB object size.
In LUKS1, sectors, which are the minimal encryption units, are fixed at 512 bytes. LUKS2 supports larger sectors, and for better
performance, the default sector size is set to the maximum of 4 KiB. Writes that are either smaller than a sector or not aligned
to a sector start trigger a guarded read-modify-write chain on the client, with a considerable latency penalty. A batch of
such unaligned writes can lead to I/O races that further deteriorate performance. IBM recommends avoiding RBD
encryption in cases where incoming writes cannot be guaranteed to be LUKS sector aligned.
Syntax
Example
NOTE: For security reasons, both the encryption format and encryption load operations are CPU-intensive and may take a few
seconds to complete. For encrypted I/O, assuming AES-NI is enabled, a relatively small latency of a few microseconds might be added, as well
as a small increase in CPU utilization.
Add encryption format to images and clones with the rbd encryption format command. Given a LUKS2-formatted image, you can
create both a LUKS2-formatted clone and a LUKS1-formatted clone.
Prerequisites
Edit online
A running IBM Storage Ceph cluster with Block Device (RBD) configured.
Procedure
Edit online
Syntax
Example
The rbd resize command grows the image to compensate for the overhead associated with the LUKS2 header.
2. With the LUKS2-formatted image, create a LUKS2-formatted clone with the same effective size:
Syntax
Example
Syntax
Example
Because the LUKS1 header is usually smaller than the LUKS2 header, the rbd resize command at the end shrinks the cloned image to
remove the unwanted space allowance.
4. With the LUKS1-formatted image, create a LUKS2-formatted clone with the same effective size:
Syntax
Example
Because the LUKS2 header is usually bigger than the LUKS1 header, the rbd resize command at the beginning temporarily grows the
parent image to reserve some extra space in the parent snapshot and, consequently, the cloned image. This is necessary to
make all parent data accessible in the cloned image. The rbd resize command at the end shrinks the parent image back to its
original size, to remove the unused reserved space, without impacting the parent snapshot or the cloned image.
The same applies to creating a formatted clone of an unformatted image, since an unformatted image does not have a header
at all.
Snapshot management
Edit online
As a storage administrator, being familiar with Ceph's snapshotting feature can help you manage the snapshots and clones of images
stored in the IBM Storage Ceph cluster.
NOTE: If a snapshot is taken while I/O is occurring, then the snapshot might not get the exact or latest data of the image and the
snapshot might have to be cloned to a new image to be mountable. IBM recommends stopping I/O before taking a snapshot of an
image. If the image contains a filesystem, the filesystem must be in a consistent state before taking a snapshot. To stop I/O, you can
use the fsfreeze command. For virtual machines, the qemu-guest-agent can be used to automatically freeze filesystems when
creating a snapshot.
Reference
Edit online
You might also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters:
Syntax
Example
TIP: Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
Prerequisites
Edit online
Procedure
Edit online
1. Specify the snap create option, the pool name, and the image name:
Method 1:
Syntax
Example
[root@rbd-client ~]# rbd --pool pool1 snap create --snap snap1 image1
Method 2:
Syntax
Example
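For example, the single-argument form of the same operation:
rbd snap create pool1/image1@snap1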
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
NOTE: Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it
takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to rollback an image to a
snapshot, and it is the preferred method of returning to a pre-existing state.
Prerequisites
Edit online
Procedure
Edit online
1. Specify the snap rollback option, the pool name, the image name and the snap name:
Syntax
Example
[root@rbd-client ~]# rbd --pool pool1 snap rollback --snap snap1 image1
[root@rbd-client ~]# rbd snap rollback pool1/image1@snap1
Prerequisites
Edit online
Procedure
Edit online
1. To delete a block device snapshot, specify the snap rm option, the pool name, the image name and the snapshot name:
Syntax
IMPORTANT: If an image has any clones, the cloned images retain reference to the parent image snapshot. To delete the parent
image snapshot, you must flatten the child images first.
NOTE: Ceph OSD daemons delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.
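A minimal sketch of the removal, assuming the snap1 snapshot of the image1 image in the pool1 pool:
rbd snap rm pool1/image1@snap1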
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
1. Specify the snap purge option and the image name on a specific pool:
Syntax
Example
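For example, assuming the image1 image in the pool1 pool:
rbd snap purge pool1/image1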
Prerequisites
Edit online
Procedure
Edit online
1. To rename a snapshot:
Syntax
Example
This renames snap1 snapshot of the dataset image on the data pool to snap2.
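A minimal sketch of the rename described above:
rbd snap rename data/dataset@snap1 data/dataset@snap2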
2. Run the rbd help snap rename command to display additional details on renaming snapshots.
NOTE: The terms parent and child refer to a Ceph block device snapshot (the parent) and the corresponding image cloned from the
snapshot (the child). These terms are important for the command-line usage below.
Each cloned image, the child, stores a reference to its parent image, which enables the cloned image to open the parent snapshot
and read it. This reference is removed when the clone is flattened, that is, when information from the snapshot is completely
copied to the clone.
A clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize
cloned images. There are no special restrictions with cloned images. However, the clone of a snapshot refers to the snapshot, so you
MUST protect the snapshot before you clone it.
A clone of a snapshot can be a copy-on-write (COW) or copy-on-read (COR) clone. Copy-on-write (COW) is always enabled for clones
while copy-on-read (COR) has to be enabled explicitly. Copy-on-write (COW) copies data from the parent to the clone when it writes
to an unallocated object within the clone. Copy-on-read (COR) copies data from the parent to the clone when it reads from an
unallocated object within the clone. Reading data from a clone will only read data from the parent if the object does not yet exist in
the clone. RADOS Block Device breaks up large images into multiple objects. The default object size is 4 MB, and all copy-on-write (COW)
and copy-on-read (COR) operations occur on a full object; that is, writing 1 byte to a clone results in a 4 MB object being read from
the parent and written to the clone if the destination object does not already exist in the clone from a previous COW/COR operation.
Whether or not copy-on-read (COR) is enabled, any reads that cannot be satisfied by reading an underlying object from the clone will
be rerouted to the parent. Since there is practically no limit to the number of parents, meaning that you can clone a clone, this
reroute continues until an object is found or you hit the base parent image. If copy-on-read (COR) is enabled, any reads that fail to be
satisfied directly from the clone result in a full object read from the parent and writing that data to the clone so that future reads of
the same extent can be satisfied from the clone itself without the need of reading from the parent.
This is essentially an on-demand, object-by-object flatten operation. This is especially useful when the clone is connected to its parent
over a high-latency link, that is, when the parent is in a different pool or in another geographical location. Copy-on-read (COR) reduces
the amortized latency of reads. The first few reads will have high latency because extra data is read from the
parent; for example, you read 1 byte from the clone but now 4 MB has to be read from the parent and written to the clone. However, all
future reads will be served from the clone itself.
Reference
Edit online
You can set the set-require-min-compat-client parameter to mimic or later versions of Ceph.
Example
This creates clone v2, by default. However, clients older than mimic cannot access those block device images.
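A minimal sketch of the setting described above:
ceph osd set-require-min-compat-client mimic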
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
[root@rbd-client ~]# rbd --pool pool1 snap protect --image image1 --snap snap1
[root@rbd-client ~]# rbd snap protect pool1/image1@snap1
Prerequisites
Procedure
Edit online
1. To clone a snapshot, you need to specify the parent pool, snapshot, child pool and image name:
Syntax
rbd clone --pool POOL_NAME --image PARENT_IMAGE --snap SNAP_NAME --dest-pool POOL_NAME --dest CHILD_IMAGE_NAME
rbd clone POOL_NAME/PARENT_IMAGE@SNAP_NAME POOL_NAME/CHILD_IMAGE_NAME
Example
[root@rbd-client ~]# rbd clone --pool pool1 --image image1 --snap snap2 --dest-pool pool2 --dest childimage1
[root@rbd-client ~]# rbd clone pool1/image1@snap1 pool1/childimage1
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
[root@rbd-client ~]# rbd --pool pool1 snap unprotect --image image1 --snap snap1
Procedure
Edit online
Syntax
Example
[root@rbd-client ~]# rbd --pool pool1 children --image image1 --snap snap1
[root@rbd-client ~]# rbd children pool1/image1@snap1
NOTE: If the deep flatten feature is enabled on an image, the image clone is dissociated from its parent by default.
Prerequisites
Edit online
Procedure
Edit online
1. To delete a parent image snapshot associated with child images, you must flatten the child images first:
Syntax
Example
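For example, assuming a cloned image named childimage1 in the pool1 pool:
rbd flatten pool1/childimage1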
RBD mirroring uses exclusive locks and the journaling feature to record all modifications to an image in the order in which they occur.
This ensures that a crash-consistent mirror of an image is available.
IMPORTANT: The CRUSH hierarchies supporting primary and secondary pools that mirror block device images must have the same
capacity and performance characteristics, and must have adequate bandwidth to ensure mirroring without excess latency. For
example, if you have X MB/s average write throughput to images in the primary storage cluster, the network must support N * X
throughput in the network connection to the secondary site, plus a safety factor of Y%, to mirror N images.
The rbd-mirror daemon is responsible for synchronizing images from one Ceph storage cluster to another Ceph storage cluster by
pulling changes from the remote primary image and writing those changes to the local, non-primary image. The rbd-mirror
daemon can run either on a single Ceph storage cluster for one-way mirroring or on two Ceph storage clusters for two-way mirroring
that participate in the mirroring relationship.
For RBD mirroring to work, either using one-way or two-way replication, a couple of assumptions are made:
IMPORTANT: In one-way or two-way replication, each instance of rbd-mirror must be able to connect to the other Ceph storage
cluster simultaneously. Additionally, the network must have sufficient bandwidth between the two data center sites to handle
mirroring.
One-way Replication
One-way mirroring implies that a primary image or pool of images in one storage cluster gets replicated to a secondary storage
cluster. One-way mirroring also supports replicating to multiple secondary storage clusters.
On the secondary storage cluster, the image is the non-primary replicate; that is, Ceph clients cannot write to the image. When data
is mirrored from a primary storage cluster to a secondary storage cluster, the rbd-mirror runs ONLY on the secondary storage
cluster.
You have two Ceph storage clusters and you want to replicate images from a primary storage cluster to a secondary storage
cluster.
The secondary storage cluster has a Ceph client node attached to it running the rbd-mirror daemon. The rbd-mirror
daemon will connect to the primary storage cluster to sync images to the secondary storage cluster.
Two-way replication adds an rbd-mirror daemon on the primary cluster so images can be demoted on it and promoted on the
secondary cluster. Changes can then be made to the images on the secondary cluster and they will be replicated in the reverse
direction, from secondary to primary. Both clusters must have rbd-mirror running to allow promoting and demoting images on
either cluster. Currently, two-way replication is only supported between two sites.
You have two storage clusters and you want to be able to replicate images between them in either direction.
Both storage clusters have a client node attached to them running the rbd-mirror daemon. The rbd-mirror daemon
running on the secondary storage cluster will connect to the primary storage cluster to synchronize images to secondary, and
the rbd-mirror daemon running on the primary storage cluster will connect to the secondary storage cluster to synchronize
images to primary.
Mirroring Modes
Mirroring is configured on a per-pool basis with mirror peering storage clusters. Ceph supports two mirroring modes, depending on
the type of images in the pool.
Pool Mode
All images in a pool with the journaling feature enabled are mirrored.
Image Mode
Only a specific subset of images within a pool is mirrored. You must enable mirroring for each image separately.
Image States
Images are automatically promoted to primary when mirroring is first enabled on an image. The promotion can happen:
Reference
Edit online
Journal-based mirroring
The actual image is not modified until every write to the RBD image is first recorded to the associated journal. The remote cluster
reads from this journal and replays the updates to its local copy of the image. Because each write to the RBD images results in two
writes to the Ceph cluster, write latencies nearly double with the usage of the RBD journaling image feature.
Snapshot-based mirroring
The remote cluster determines any data or metadata updates between two mirror snapshots and copies the deltas to its local copy
of the image. The RBD fast-diff image feature enables the quick determination of updated data blocks without the need to scan
the full RBD image. The complete delta between two snapshots needs to be synchronized prior to use during a failover scenario. Any
partially applied set of deltas is rolled back at the moment of failover.
NOTE: When using one-way replication you can mirror to multiple secondary storage clusters.
NOTE: Examples in this section will distinguish between two storage clusters by referring to the primary storage cluster with the
primary images as site-a, and the secondary storage cluster you are replicating the images to, as site-b. The pool name used in
these examples is called data.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
NOTE: The nodename is the host where you want to configure mirroring in the secondary cluster.
Syntax
rbd create IMAGE_NAME --size MEGABYTES --pool POOL_NAME --image-feature FEATURE FEATURE
Example
[ceph: root@site-a /]# rbd create image1 --size 1024 --pool data --image-feature
exclusive-lock,journaling
NOTE: If exclusive-lock is already enabled, use journaling as the only argument, else it returns the following error:
one or more requested features are already enabled (22) Invalid argument
ii. For existing images, use the rbd feature enable command:
Syntax
Example
iii. To enable journaling on all new images by default, set the configuration parameter using ceph config set command:
Example
4. Choose the mirroring mode, either pool or image mode, on both the storage clusters.
Syntax
Example
Syntax
Example
This example enables image mode mirroring on the pool named data.
iii. Verify that mirroring has been successfully enabled at both the sites:
Syntax
Example
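A hedged sketch of the mode selection and verification on the data pool, run on both sites; use only one of the two enable commands, depending on the chosen mode:
rbd mirror pool enable data pool
rbd mirror pool enable data image
rbd mirror pool info data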
i. Create Ceph user accounts, and register the storage cluster peer to the pool:
Syntax
rbd mirror pool peer bootstrap create --site-name *PRIMARY_LOCAL_SITE_NAME* *POOL_NAME* >
*PATH_TO_BOOTSTRAP_TOKEN*
Example
[ceph: root@rbd-client-site-a /]# rbd mirror pool peer bootstrap create --site-name site-a
data > /root/bootstrap_token_site-a
NOTE: This example bootstrap command creates the client.rbd-mirror.site-a and the client.rbd-mirror-peer
Ceph users.
ii. Copy the bootstrap token file to the site-b storage cluster.
Syntax
rbd mirror pool peer bootstrap import --site-name *SECONDARY_LOCAL_SITE_NAME* --direction rx-
only *POOL_NAME PATH_TO_BOOTSTRAP_TOKEN*
Example
[ceph: root@rbd-client-site-b /]# rbd mirror pool peer bootstrap import --site-name site-b --
direction rx-only data /root/bootstrap_token_site-a
6. To verify the mirroring status, run the following command from a Ceph Monitor node on the primary and secondary sites:
Syntax
Here, up means the rbd-mirror daemon is running, and stopped means this image is not the target for replication from
another storage cluster. This is because the image is primary on this storage cluster.
Example
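A minimal sketch of the status check, assuming the image1 image in the data pool:
rbd mirror image status data/image1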
Reference
Edit online
NOTE: When using two-way replication you can only mirror between two storage clusters.
NOTE: Examples in this section will distinguish between two storage clusters by referring to the primary storage cluster with the
primary images as site-a, and the secondary storage cluster you are replicating the images to, as site-b. The pool name used in
these examples is called data.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
Syntax
Example
NOTE: The nodename is the host where you want to configure mirroring in the secondary cluster.
Syntax
rbd create IMAGE_NAME --size MEGABYTES --pool POOL_NAME --image-feature FEATURE FEATURE
Example
[ceph: root@site-a /]# rbd create image1 --size 1024 --pool data --image-feature
exclusive-lock,journaling
NOTE: If exclusive-lock is already enabled, use journaling as the only argument, else it returns the following error:
one or more requested features are already enabled (22) Invalid argument
ii. For existing images, use the rbd feature enable command:
Syntax
Example
iii. To enable journaling on all new images by default, set the configuration parameter using ceph config set command:
Example
5. Choose the mirroring mode, either pool or image mode, on both the storage clusters.
Syntax
Example
Syntax
Example
iii. Verify that mirroring has been successfully enabled at both the sites:
Syntax
Example
i. Create Ceph user accounts, and register the storage cluster peer to the pool:
Syntax
rbd mirror pool peer bootstrap create --site-name PRIMARY_LOCAL_SITE_NAME POOL_NAME >
PATH_TO_BOOTSTRAP_TOKEN
Example
[ceph: root@rbd-client-site-a /]# rbd mirror pool peer bootstrap create --site-name site-a
data > /root/bootstrap_token_site-a
NOTE: This example bootstrap command creates the client.rbd-mirror.site-a and the client.rbd-mirror-peer
Ceph users.
ii. Copy the bootstrap token file to the site-b storage cluster.
Syntax
rbd mirror pool peer bootstrap import --site-name SECONDARY_LOCAL_SITE_NAME --direction rx-tx
POOL_NAME PATH_TO_BOOTSTRAP_TOKEN
Example
[ceph: root@rbd-client-site-b /]# rbd mirror pool peer bootstrap import --site-name site-b --
direction rx-tx data /root/bootstrap_token_site-a
NOTE: The --direction argument is optional, as two-way mirroring is the default when bootstrapping peers.
7. To verify the mirroring status, run the following command from a Ceph Monitor node on the primary and secondary sites:
Syntax
Example
Here, up means the rbd-mirror daemon is running, and stopped means this image is not the target for replication from
another storage cluster. This is because the image is primary on this storage cluster.
If images are in the state up+replaying, then mirroring is functioning properly. Here, up means the rbd-mirror daemon is
running, and replaying means this image is the target for replication from another storage cluster.
NOTE: Depending on the connection between the sites, mirroring can take a long time to sync the images.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Peer Sites:
UUID: 950ddadf-f995-47b7-9416-b9bb233f66e3
Name: b
Mirror UUID: 4696cd9d-1466-4f98-a97a-3748b6b722b3
Direction: rx-tx
Client: client.rbd-mirror-peer
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Example
This example enables image mode mirroring on the pool named data.
Reference
Edit online
NOTE: When you disable mirroring on a pool, you also disable it on any images within the pool for which mirroring was enabled
separately in image mode.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
This example enables mirroring for the image2 image in the data pool.
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
This example disables mirroring of the image2 image in the data pool.
Reference
Edit online
See Configuring Ansible inventory location in the IBM Storage Ceph Installation Guide for more details on adding clients to the
cephadm-ansible inventory.
NOTE: Do not force promote non-primary images that are still syncing, because the images will not be valid after the promotion.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
Syntax
Example
Use forced promotion when the demotion cannot be propagated to the peer Ceph storage cluster. For example, because of
cluster failure or communication outage.
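A hedged sketch of the demote, promote, and forced promote commands, assuming the image2 image in the data pool:
rbd mirror image demote data/image2
rbd mirror image promote data/image2
rbd mirror image promote --force data/image2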
Reference
Edit online
Image resynchronization
Edit online
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
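For example, assuming the image2 image in the data pool:
rbd mirror image resync data/image2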
Reference
Edit online
To recover from an inconsistent state because of a disaster, see either Recover from a disaster with one-way mirroring or
Recover from a disaster with two-way mirroring for details.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
Specify the pool name and the peer Universally Unique Identifier (UUID).
Syntax
Example
To view the peer UUID, use the rbd mirror pool info command.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
TIP: To output status details for every mirroring image in a pool, use the --verbose option.
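For example, assuming the data pool:
rbd mirror pool status data
rbd mirror pool status data --verbose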
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
This example gets the status of the image2 image in the data pool.
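A minimal sketch of the command described by the example above:
rbd mirror image status data/image2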
To implement delayed replication, the rbd-mirror daemon within the destination storage cluster should set the
rbd_mirroring_replay_delay = MINIMUM_DELAY_IN_SECONDS configuration option. This setting can either be applied
globally within the ceph.conf file utilized by the rbd-mirror daemons, or on an individual image basis.
Prerequisites
Edit online
1. To utilize delayed replication for a specific image, on the primary image, run the following rbd CLI command:
Syntax
Example
This example sets a 10 minute minimum replication delay on image vm-1 in the vms pool.
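A hedged sketch of the per-image setting described above, applied through image metadata; 600 seconds equals the 10 minute delay in the example:
rbd image-meta set vms/vm-1 conf_rbd_mirroring_replay_delay 600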
NOTE: There is no required order for restarting the instances. Restart the instance pointing to the pool with primary images followed
by the instance pointing to the mirrored pool.
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Example
This example enables snapshot-based mirroring for the mirror_image image in the mirror_pool pool.
Prerequisites
Edit online
Root-level access to the Ceph client nodes for the IBM Storage Ceph clusters.
Access to the IBM Storage Ceph cluster where a snapshot mirror will be created.
IMPORTANT: By default, a maximum of 5 image mirror-snapshots are retained. The most recent image mirror-snapshot is
automatically removed if the limit is reached. If required, the limit can be overridden through the
rbd_mirroring_max_mirroring_snapshots configuration. Image mirror-snapshots are automatically deleted when the image
is removed or when mirroring is disabled.
Syntax
Example
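A minimal sketch of creating an image mirror-snapshot, assuming the image1 image in the data pool:
rbd mirror image snapshot data/image1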
Reference
Edit online
See Mirroring Ceph block devices in the IBM Storage Block Device Guide for details.
Scheduling mirror-snapshots
Edit online
Mirror-snapshots can be automatically created when mirror-snapshot schedules are defined. The mirror-snapshot can be scheduled
globally, per-pool or per-image levels. Multiple mirror-snapshot schedules can be defined at any level but only the most specific
snapshot schedules that match an individual mirrored image will run.
Prerequisites
Edit online
Root-level access to the Ceph client nodes for the IBM Storage Ceph clusters.
Access to the IBM Storage Ceph cluster where a snapshot mirror will be created.
Procedure
Edit online
Syntax
rbd --cluster CLUSTER_NAME mirror snapshot schedule add --pool POOL_NAME --image IMAGE_NAME
INTERVAL [START_TIME]
The CLUSTER_NAME should be used only when the cluster name is different from the default name ceph. The interval can be
specified in days, hours, or minutes using d, h, or m suffix respectively. The optional START_TIME can be specified using the
ISO 8601 time format.
Example
[root@site-a ~]# rbd mirror snapshot schedule add --pool data --image image1 6h
Example
[root@site-a ~]# rbd mirror snapshot schedule add --pool data --image image1 24h 14:00:00-
05:00
Reference
Edit online
Prerequisites
Edit online
Root-level access to the Ceph client nodes for the IBM Storage Ceph clusters.
Procedure
Edit online
1. To list all snapshot schedules for a specific global, pool or image level, with an optional pool or image name:
Syntax
Additionally, the --recursive option can be specified to list all schedules at the specified level as shown below:
Example
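For example, a recursive listing for the data pool:
rbd mirror snapshot schedule ls --pool data --recursive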
Reference
Edit online
Prerequisites
Edit online
Root-level access to the Ceph client nodes for the IBM Storage Ceph clusters.
Access to the IBM Storage Ceph cluster where a snapshot mirror will be created.
Procedure
Edit online
Syntax
rbd --cluster CLUSTER_NAME mirror snapshot schedule remove --pool POOL_NAME --image IMAGE_NAME
INTERVAL START_TIME
The interval can be specified in days, hours, or minutes using d, h, m suffix respectively. The optional START_TIME can be
specified using the ISO 8601 time format.
Example
[root@site-a ~]# rbd mirror snapshot schedule remove --pool data --image image1 6h
Example
Reference
Edit online
Prerequisites
Edit online
Root-level access to the Ceph client nodes for the IBM Storage Ceph clusters.
Access to the IBM Storage Ceph cluster where a snapshot mirror will be created.
Procedure
Edit online
Syntax
rbd --cluster site-a mirror snapshot schedule status [--pool POOL_NAME] [--image IMAGE_NAME]
Example
Reference
Edit online
In the examples, the primary storage cluster is known as the site-a, and the secondary storage cluster is known as the site-b.
Additionally, the storage clusters both have a data pool with two images, image1 and image2.
Disaster recovery
Recover from a disaster with one-way mirroring
Disaster recovery
Edit online
These failures have a widespread impact, also referred to as a large blast radius, and can be caused by impacts to the power grid and
natural disasters.
Customer data needs to be protected during these scenarios. Volumes must be replicated with consistency and efficiency and also
within Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets. This solution is called a Wide Area Network-
Disaster Recovery (WAN-DR).
In such scenarios it is hard to restore the primary system and the data center. The solutions that are used to recover from these
failure scenarios are guided by the application:
Recovery Point Objective (RPO): The maximum amount of data loss that an application can tolerate in the worst case.
Recovery Time Objective (RTO): The time taken to get the application back online with the latest copy of the data available.
Reference
Edit online
See Encryption in transit to learn more about data transmission over the wire in an encrypted state.
IMPORTANT: One-way mirroring supports multiple secondary sites. If you are using additional secondary clusters, choose one of the
secondary clusters to fail over to. Synchronize from the same cluster during fail back.
Prerequisites
Procedure
Edit online
1. Stop all clients that use the primary image. This step depends on which clients use the image. For example, detach volumes
from any OpenStack instances that use the image.
2. Demote the primary images located on the site-a cluster by running the following commands on a monitor node in the
site-a cluster:
Syntax
Example
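A hedged sketch of the demotion commands, assuming the data pool and the image1 and image2 images used in these examples:
# Assumption: pool "data" with images "image1" and "image2"
[root@site-a ~]# rbd mirror image demote data/image1
[root@site-a ~]# rbd mirror image demote data/image2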
3. Promote the non-primary images located on the site-b cluster by running the following commands on a monitor node in the
site-b cluster:
Syntax
Example
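A hedged sketch of the promotion commands, again assuming the data pool and the image1 and image2 images:
# Assumption: pool "data" with images "image1" and "image2"
[root@site-b ~]# rbd mirror image promote data/image1
[root@site-b ~]# rbd mirror image promote data/image2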
4. After some time, check the status of the images from a monitor node in the site-b cluster. They should show a state of
up+stopped and be listed as primary:
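A hedged sketch of the status check, assuming the same pool and image names:
# Assumption: pool "data" with image "image1"
[root@site-b ~]# rbd mirror image status data/image1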
5. Resume the access to the images. This step depends on which clients use the image.
Reference
Edit online
See the Block Storage and Volumes chapter in the Red Hat OpenStack Platform Storage Guide.
Procedure
Edit online
2. Stop all clients that use the primary image. This step depends on which clients use the image. For example, detach volumes
from any OpenStack instances that use the image.
3. Promote the non-primary images from a Ceph Monitor node in the site-b storage cluster. Use the --force option, because
the demotion cannot be propagated to the site-a storage cluster:
Syntax
Example
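A hedged sketch of the forced promotion, assuming the data pool and the image1 and image2 images:
# Assumption: pool "data" with images "image1" and "image2"
[root@site-b ~]# rbd mirror image promote --force data/image1
[root@site-b ~]# rbd mirror image promote --force data/image2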
4. Check the status of the images from a Ceph Monitor node in the site-b storage cluster. They should show a state of up+stopping_replay, and the description should say force promoted, meaning the promotion is in an intermediate state. Wait until the state changes to up+stopped to validate the successful promotion of the site.
Example
Reference
Edit online
See Block Storage and Volumes in the Red Hat OpenStack Platform Storage Guide.
During a failback scenario, the existing peer that is inaccessible must be removed before adding a new peer to an existing cluster.
Prerequisites
Edit online
Procedure
Edit online
Example
Example
IMPORTANT: This step must be run on the peer site which is up and running.
NOTE: Multiple peers are supported only for one-way mirroring.
Syntax
Example
Syntax
Example
[ceph: root@host01 /]# rbd mirror pool peer remove pool_failback f055bb88-6253-4041-923d-08c4ecbe799a
4. Create a block device pool with the same name as its peer mirror pool.
Syntax
Example
Syntax
rbd mirror pool peer bootstrap create --site-name LOCAL_SITE_NAME POOL_NAME > PATH_TO_BOOTSTRAP_TOKEN
Example
[ceph: root@rbd-client-site-a /]# rbd mirror pool peer bootstrap create --site-name site-a data > /root/bootstrap_token_site-a
NOTE: This example bootstrap command creates the client.rbd-mirror.site-a and the client.rbd-mirror-peer Ceph users.
Syntax
rbd mirror pool peer bootstrap import --site-name LOCAL_SITE_NAME --direction rx-only POOL_NAME PATH_TO_BOOTSTRAP_TOKEN
Example
[ceph: root@rbd-client-site-b /]# rbd mirror pool peer bootstrap import --site-name site-b --direction rx-only data /root/bootstrap_token_site-a
NOTE: For one-way RBD mirroring, you must use the --direction rx-only argument, as two-way mirroring is the
default when bootstrapping peers.
6. From a monitor node in the site-a storage cluster, verify the site-b storage cluster was successfully added as a peer:
Example
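A hedged sketch of the peer verification, assuming the data pool; the output lists the mirroring mode and the configured peers:
# Assumption: pool "data"
[root@site-a ~]# rbd mirror pool info data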
Reference
Edit online
NOTE: If you have scheduled snapshots at the image level, then you need to re-add the schedule, because an image resync operation changes the RBD image ID and the previous schedule becomes obsolete.
Prerequisites
Edit online
1. Check the status of the images from a monitor node in the site-b cluster again. They should show a state of up+stopped and the description should say local image is primary:
Example
2. From a Ceph Monitor node on the site-a storage cluster determine if the images are still primary:
Syntax
Example
In the output from the commands, look for mirroring primary: true or mirroring primary: false, to determine
the state.
3. Demote any images that are listed as primary by running a command like the following from a Ceph Monitor node in the site-a storage cluster:
Syntax
Example
4. Resynchronize the images ONLY if there was a non-orderly shutdown. Run the following commands on a monitor node in the
site-a storage cluster to resynchronize the images from site-b to site-a:
Syntax
Example
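A hedged sketch of the resync commands, assuming the data pool and the image1 and image2 images:
# Assumption: pool "data" with images "image1" and "image2"
[root@site-a ~]# rbd mirror image resync data/image1
[root@site-a ~]# rbd mirror image resync data/image2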
5. After some time, ensure resynchronization of the images is complete by verifying they are in the up+replaying state. Check
their state by running the following commands on a monitor node in the site-a storage cluster:
Syntax
Example
Syntax
Example
NOTE: If there are multiple secondary storage clusters, this only needs to be done from the secondary storage cluster where it
was promoted.
7. Promote the formerly primary images located on the site-a storage cluster by running the following commands on a Ceph
Monitor node in the site-a storage cluster:
Syntax
Example
8. Check the status of the images from a Ceph Monitor node in the site-a storage cluster. They should show a status of
up+stopped and the description should say local image is primary:
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
1. Remove the site-b storage cluster as a peer from the site-a storage cluster:
[root@rbd-client ~]# rbd mirror pool peer remove data client.remote@remote --cluster local
[root@rbd-client ~]# rbd --cluster site-a mirror pool peer remove data client.site-b@site-b -n client.site-a
Syntax
Example
Edit online
As a storage administrator, use the ceph-immutable-object-cache daemons to cache the parent image content on the local
disk. This cache is in the local caching directory. Future reads on that data use the local cache.
It is a scalable, open-source, and distributed storage system. It connects to local clusters with the RADOS protocol, relying on default search paths to find ceph.conf files, monitor addresses, and authentication information, such as /etc/ceph/CLUSTER.conf, /etc/ceph/CLUSTER.keyring, and /etc/ceph/CLUSTER.NAME.keyring, where CLUSTER is the human-friendly name of the cluster, and NAME is the RADOS user to connect as, for example, client.ceph-immutable-object-cache.
Domain socket based inter-process communication (IPC): The daemon listens on a local domain socket on start-up and waits
for connections from librbd clients.
Least recently used (LRU) based promotion or demotion policy: The daemon maintains in-memory statistics of cache hits on each cache file. It demotes cold cache files if capacity reaches the configured threshold.
File-based caching store: The daemon maintains a simple file-based cache store. On promotion, the RADOS objects are fetched from the RADOS cluster and stored in the local caching directory.
When you open each cloned RBD image, librbd tries to connect to the cache daemon through its Unix domain socket. Once
successfully connected, librbd coordinates with the daemon on the subsequent reads.
If there is a read that is not cached, the daemon promotes the RADOS object to the local caching directory, so the next read on that
object is serviced from cache. The daemon also maintains simple LRU statistics so that under capacity pressure it evicts cold cache
files as needed.
Edit online
The ceph-immutable-object-cache is a daemon that caches immutable RADOS objects locally for Ceph clusters.
IMPORTANT: To use the ceph-immutable-object-cache daemon, you must be able to connect RADOS clusters.
The daemon promotes the objects to a local directory. These cache objects service the future reads. You can configure the daemon
by installing the ceph-immutable-object-cache package.
Prerequisites
Edit online
Procedure
Edit online
1. Enable the RBD shared read only parent image cache. Add the following parameters under [client] in the
/etc/ceph/ceph.conf file:
Example
[client]
rbd parent cache enabled = true
rbd plugins = parent_cache
Example
Syntax
Example
[client.ceph-immutable-object-cache.user]
key = AQCVPH1gFgHRAhAAp8ExRIsoxQK4QSYSRoVJLw==
Example
[root@ceph-host1 ]# vi /etc/ceph/ceph.client.ceph-immutable-object-cache.user.keyring
[client.ceph-immutable-object-cache.user]
key = AQCVPH1gFgHRAhAAp8ExRIsoxQK4QSYSRoVJLw==
Syntax
Example
Syntax
Example
Verification
Syntax
● [email protected]>
Loaded: loaded (/usr/lib/systemd/system/ceph-immutable-objec>
Active: active (running) since Mon 2021-04-19 13:49:06 IST; >
Main PID: 85020 (ceph-immutable-)
Tasks: 15 (limit: 49451)
Memory: 8.3M
CGroup: /system.slice/system-ceph\x2dimmutable\x2dobject\x2d>
└─85020 /usr/bin/ceph-immutable-object-cache -f --cl>
Edit online
A few important generic settings of ceph-immutable-object-cache daemons are listed.
immutable_object_cache_sock
Description
The path to the domain socket used for communication between librbd clients and the ceph-immutable-object-cache daemon.
Type
String
Default /var/run/ceph/immutable_object_cache_sock
immutable_object_cache_path
Description
The immutable object cache data directory.
Type
String
Default
/tmp/ceph_immutable_object_cache
immutable_object_cache_max_size
Description
The maximum size of the immutable object cache.
Type
Size
Default
1G
immutable_object_cache_watermark
Description The high-water mark for the cache. The value is between zero and one. If the cache size reaches this threshold the
daemon starts to delete cold cache based on LRU statistics.
Type
Float
Default 0.9
Edit online
The ceph-immutable-object-cache daemon supports throttling, which uses the settings described below.
immutable_object_cache_qos_schedule_tick_min
Description
Minimum schedule tick for immutable object cache.
Type
Milliseconds
Default
50
immutable_object_cache_qos_iops_limit
Description
User-defined immutable object cache IO operations limit per second.
Type
Integer
Default
0
immutable_object_cache_qos_iops_burst
Description
User-defined burst limit of immutable object cache IO operations.
Type
Integer
Default
0
immutable_object_cache_qos_iops_burst_seconds
Description
User-defined burst duration in seconds of immutable object cache IO operations.
Type
Seconds
Default 1
immutable_object_cache_qos_bps_limit
Description
User-defined immutable object cache IO bytes limit per second.
Type
Integer
Default
0
immutable_object_cache_qos_bps_burst
Description
User-defined burst limit of immutable object cache IO bytes.
Type
Integer
Default
0
immutable_object_cache_qos_bps_burst_seconds
Description
User-defined burst duration in seconds of immutable object cache IO bytes.
Type
Seconds
Edit online
As a storage administrator, you can access Ceph block devices through the rbd kernel module. You can map and unmap a block device, and display those mappings. Also, you can get a list of images through the rbd kernel module.
IMPORTANT: Kernel clients on Linux distributions other than Red Hat Enterprise Linux (RHEL) are permitted but not supported. If
issues are found in the storage cluster when using these kernel clients, IBM addresses them, but if the root cause is found to be on
the kernel client side, the issue will have to be addressed by the software vendor.
Creating a Ceph Block Device and using it from a Linux kernel module client
Mapping a block device
Displaying mapped block devices
Unmapping a block device
The kernel module client supports features such as Deep flatten, Layering, Exclusive lock, Object map, and Fast diff.
Creating a Ceph block device for a Linux kernel module client using dashboard
Map and mount a Ceph Block Device on Linux using the command line
Prerequisites
Edit online
Creating a Ceph block device for a Linux kernel module client using
dashboard
Edit online
You can create a Ceph block device specifically for a Linux kernel module client using the dashboard web interface by enabling only
the features it supports.
The kernel module client supports features such as Deep flatten, Layering, Exclusive lock, Object map, and Fast diff.
Prerequisites
Edit online
Procedure
Edit online
2. Click Create.
3. In the Create RBD window, enter an image name, select the RBD-enabled pool, and select the supported features:
Verification
Map and mount a Ceph Block Device on Linux using the command
line
Edit online
After mapping a Ceph block device, you can partition, format, and mount it, so that you can write files to it.
Prerequisites
Edit online
A Ceph block device for a Linux kernel module client using the dashboard is created.
1. On the Red Hat Enterprise Linux client node, enable the IBM Storage Ceph 5 Tools repository:
3. Copy the Ceph configuration file from a Monitor node to the Client node:
Syntax
Example
4. Copy the key file from a Monitor node to the Client node:
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Syntax
mkfs.xfs /dev/MAPPED_BLOCK_DEVICE_WITH_PARTITION_NUMBER
Example
Syntax
mkdir PATH_TO_DIRECTORY
Example
Syntax
Example
11. Verify that the file system is mounted and showing the correct size:
Syntax
df -h PATH_TO_DIRECTORY
Example
Reference
Edit online
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Specify a secret when using cephx authentication by either the keyring or a file containing the secret:
Syntax
or
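A hedged sketch of both forms, assuming an image named image1 in the data pool and the client.admin user; the keyring and key-file paths are illustrative:
# Assumption: image "data/image1", user "admin", illustrative key paths
[root@client ~]# rbd map data/image1 --id admin --keyring /etc/ceph/ceph.client.admin.keyring
[root@client ~]# rbd map data/image1 --id admin --keyfile /path/to/secret_file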
Prerequisites
Edit online
Procedure
Edit online
Prerequisites
Edit online
Procedure
Edit online
Example
Syntax
Example
Prerequisites
Edit online
Procedure
Edit online
import rados
import rbd

cluster = rados.Rados(conffile='my_ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')
rbd_inst = rbd.RBD()
size = 4 * 1024**3  # 4 GiB
rbd_inst.create(ioctx, 'myimage', size)
image = rbd.Image(ioctx, 'myimage')
data = 'foo' * 200
image.write(data, 0)
This writes foo to the first 600 bytes of the image. Note that the data cannot be unicode - librbd does not know how to deal with characters wider than a char.
image.close()
ioctx.close()
cluster.shutdown()
import rados
import rbd

cluster = rados.Rados(conffile='my_ceph_conf')
try:
    ioctx = cluster.open_ioctx('my_pool')
    try:
        rbd_inst = rbd.RBD()
        size = 4 * 1024**3  # 4 GiB
        rbd_inst.create(ioctx, 'myimage', size)
        image = rbd.Image(ioctx, 'myimage')
        try:
            data = 'foo' * 200
            image.write(data, 0)
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
This can be cumbersome, so the Rados, Ioctx, and Image classes can be used as context managers that close or shut down
automatically. Using them as context managers, the above example becomes:
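A sketch of the context-manager form, using the same names as the example above (this follows the upstream librbd Python binding; verify against your installed python3-rbd package):

import rados
import rbd

# Assumption: configuration file 'my_ceph.conf' and pool 'mypool' as in the earlier example
with rados.Rados(conffile='my_ceph.conf') as cluster:
    with cluster.open_ioctx('mypool') as ioctx:
        rbd_inst = rbd.RBD()
        size = 4 * 1024**3  # 4 GiB
        rbd_inst.create(ioctx, 'myimage', size)
        with rbd.Image(ioctx, 'myimage') as image:
            data = 'foo' * 200
            image.write(data, 0)  # write 600 bytes starting at offset 0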
rbd_default_format
Description The default format (2) if no other format is specified. Format 1 is the original format for a new image, which is
compatible with all versions of librbd and the kernel module, but does not support newer features like cloning. Format 2 is
supported by librbd and the kernel module since version 3.11 (except for striping). Format 2 adds support for cloning and is more
easily extensible to allow more features in the future.
Type
Integer
Default 2
rbd_default_order
Description
The default object size expressed as a number of bits, such that the object size is 2 to that power, if no other order is specified. The default of 22 corresponds to 4 MiB objects.
Type
Integer
Default 22
rbd_default_stripe_count
Description The default stripe count if no other stripe count is specified. Changing the default value requires the striping v2 feature.
Type
64-bit Unsigned Integer
Default 0
rbd_default_stripe_unit
Description The default stripe unit if no other stripe unit is specified. Changing the unit from 0 (that is, the object size) requires the
striping v2 feature.
Type
64-bit Unsigned Integer
Default
0
rbd_default_features
Description
The default features enabled when creating a block device image. This setting only applies to format 2 images. The settings are:
1: Layering support. Layering enables you to use cloning.
2: Striping v2 support. Striping spreads data across multiple objects. Striping helps with parallelism for sequential read/write workloads.
4: Exclusive locking support. When enabled, it requires a client to get a lock on an object before making a write.
8: Object map support. Block devices are thin-provisioned, meaning they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, or importing and exporting a sparsely populated image.
16: Fast-diff support. Fast-diff support depends on object map support and exclusive lock support.
32: Deep-flatten support. Deep-flatten makes rbd flatten work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, so the parent will not be deletable until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.
64: Journaling support. Journaling records all modifications to an image in the order they occur. This ensures that a crash-consistent mirror of the remote image is available locally.
The enabled features are the sum of the numeric settings.
Type
Integer
Default
61 - layering, exclusive-lock, object-map, fast-diff, and deep-flatten are enabled
IMPORTANT: The current default setting is not compatible with the RBD kernel driver nor older RBD clients.
rbd_default_map_options
Description
Most of the options are useful mainly for debugging and benchmarking. See man rbd under Map Options for details.
Type
String
Default ""
rbd_op_threads
Description
The number of block device operation threads.
Type Integer
Default 1
WARNING: Do not change the default value of rbd_op_threads because setting it to a number higher than 1 might cause data
corruption.
rbd_op_thread_timeout
Description
The timeout (in seconds) for block device operation threads.
Type Integer
Default
60
rbd_non_blocking_aio
Description If true, Ceph will process block device asynchronous I/O operations from a worker thread to prevent blocking.
Type
Boolean
Default true
rbd_concurrent_management_ops
Description
The maximum number of concurrent management operations in flight (for example, deleting or resizing an image).
Type Integer
rbd_request_timed_out_seconds
Description
The number of seconds before a maintenance request times out.
Type
Integer
Default
30
rbd_clone_copy_on_read
Description
When set to true, copy-on-read cloning is enabled.
Type
Boolean
Default
false
rbd_enable_alloc_hint
Description If true, allocation hinting is enabled, and the block device issues a hint to the OSD back end to indicate the expected object size.
Type
Boolean
Default true
rbd_skip_partial_discard
Description
If true, the block device will skip zeroing a range when trying to discard a range inside an object.
Type
Boolean
Default
false
rbd_tracing
Description
Set this option to true to enable the Linux Trace Toolkit Next Generation User Space Tracer (LTTng-UST) tracepoints. See Tracing
RADOS Block Device (RBD) Workloads with the RBD Replay Feature for details.
Type
Boolean
Default
false
rbd_validate_pool
Description
Set this option to true to validate empty pools for RBD compatibility.
Type
Boolean
Default
true
rbd_validate_names
Type
Boolean
Default true
Ceph block devices support write-back caching. To enable write-back caching, set rbd_cache = true to the [client] section of
the Ceph configuration file. By default, librbd does not perform any caching. Writes and reads go directly to the storage cluster, and
writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more
than rbd_cache_max_dirty unflushed bytes. In this case, the write triggers write-back and blocks until enough bytes are flushed.
Ceph block devices support write-through caching. You can set the size of the cache, and you can set targets and limits to switch
from write-back caching to write-through caching. To enable write-through mode, set rbd_cache_max_dirty to 0. This means
writes return only when the data is on disk on all replicas, but reads may come from the cache. The cache is in memory on the client,
and each Ceph block device image has its own. Since the cache is local to the client, there is no coherency if there are others
accessing the image. Running other file systems, such as GFS or OCFS, on top of Ceph block devices will not work with caching
enabled.
The Ceph configuration settings for Ceph block devices must be set in the [client] section of the Ceph configuration file, by
default, /etc/ceph/ceph.conf.
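For illustration, a minimal sketch of such a [client] section, using the default values described in the settings below; treat the numbers as assumptions to tune for your workload:
[client]
# Assumption: default-like values shown for illustration; adjust per workload
rbd cache = true
rbd cache size = 33554432                  # 32 MiB per image
rbd cache max dirty = 25165824             # 24 MiB; set to 0 for write-through
rbd cache target dirty = 16777216          # 16 MiB
rbd cache max dirty age = 1.0
rbd cache writethrough until flush = true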
rbd_cache
Description
Enable caching for RADOS Block Device (RBD).
Type Boolean
Required
No
Default
true
rbd_cache_size
Description
The RBD cache size in bytes.
Type
64-bit Integer
Required
No
Default
32 MiB
rbd_cache_max_dirty
Description
The dirty limit in bytes at which the cache triggers write-back. If 0, uses write-through caching.
Required No
Default
24 MiB
rbd_cache_target_dirty
Description
The dirty target before the cache begins writing data to the data storage. Does not block writes to the cache.
Type
64-bit Integer
Required
No
Constraint
Must be less than rbd cache max dirty.
Default
16 MiB
rbd_cache_max_dirty_age
Description
The number of seconds dirty data is in the cache before writeback starts.
Type Float
Required
No
Default
1.0
rbd_cache_max_dirty_object
Description The dirty limit for objects - set to 0 for auto calculate from rbd_cache_size.
Type
Integer
Default
0
rbd_cache_block_writes_upfront
Description
If true, it will block writes to the cache before the aio_write call completes. If false, it will block before the aio_completion
is called.
Type Boolean
Default
false
rbd_cache_writethrough_until_flush
Description
Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but
safe setting in case VMs running on rbd are too old to send flushes, like the virtio driver in Linux before 2.6.32.
Type
Boolean
Default
true
rbd_balance_snap_reads
Description
Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance snap reads
between the primary OSD and the replicas.
Type
Boolean
Default false
rbd_localize_snap_reads
Description
Whereas rbd_balance_snap_reads randomizes the replica used for reading a snapshot, enabling rbd_localize_snap_reads makes the block device look to the CRUSH map to find the closest or local OSD for reading the snapshot.
Type
Boolean
Default
false
rbd_balance_parent_reads
Description
Ceph typically reads objects from the primary OSD. Since reads are immutable, you may enable this feature to balance parent reads
between the primary OSD and the replicas.
Type
Boolean
Default false
rbd_localize_parent_reads
Description
Whereas rbd_balance_parent_reads randomizes the replica used for reading a parent, enabling rbd_localize_parent_reads makes the block device look to the CRUSH map to find the closest or local OSD for reading the parent.
Type
Boolean
Default
true
rbd_readahead_trigger_requests
Description
Number of sequential read requests necessary to trigger read-ahead.
Type Integer
Required
No
Default
10
rbd_readahead_max_bytes
Description
Maximum size of a read-ahead request. If zero, read-ahead is disabled.
Type
64-bit Integer
Required
No
rbd_readahead_disable_after_bytes
Description
After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the
guest OS to take over read-ahead once it is booted. If zero, read-ahead stays enabled.
Type
64-bit Integer
Required No
Default 50 MiB
rbd_blocklist_on_break_lock
Description
Whether to blocklist clients whose lock was broken.
Type
Boolean
Default true
rbd_blocklist_expire_seconds
Description
The number of seconds to blocklist - set to 0 for OSD default.
Type
Integer
Default
0
rbd_journal_order
Description
The number of bits to shift to compute the journal object maximum size. The value is between 12 and 64.
Type
32-bit Unsigned Integer
Default 24
rbd_journal_splay_width
Description
The number of active journal objects.
Type
32-bit Unsigned Integer
Default
4
rbd_journal_commit_age
Description
The commit time interval in seconds.
Type
Double Precision Floating Point Number
Default
5
rbd_journal_object_flush_interval
Description
The maximum number of pending commits per journal object.
Type Integer
Default
0
rbd_journal_object_flush_bytes
Description
The maximum number of pending bytes per journal object.
Type
Integer
Default
0
rbd_journal_object_flush_age
Description
The maximum time interval in seconds for pending commits.
Type
Double Precision Floating Point Number
Default
0
rbd_journal_pool
Description
Specifies a pool for journal objects.
Type String
Global level
Available keys
rbd_qos_bps_burst
Description
The desired burst limit of IO bytes.
Type
Integer
Default 0
rbd_qos_bps_limit
Description
The desired limit of IO bytes per second.
Type Integer
Default
0
rbd_qos_iops_burst
Description
The desired burst limit of IO operations.
Type
Integer
Default
0
rbd_qos_iops_limit
Description
The desired limit of IO operations per second.
Type Integer
Default
0
rbd_qos_read_bps_burst
Description
The desired burst limit of read bytes.
Type
Integer
Default
0
rbd_qos_read_bps_limit
Description
The desired limit of read bytes per second.
Type
Integer
Default
0
rbd_qos_read_iops_burst
Description
The desired burst limit of read operations.
Type
Integer
Default
0
rbd_qos_read_iops_limit
Description
The desired limit of read operations per second.
Type Integer
Default
0
rbd_qos_write_bps_burst
Description
The desired burst limit of write bytes.
Type
Integer
Default
0
rbd_qos_write_bps_limit
Description
The desired limit of write bytes per second.
Type
Integer
Default
0
rbd_qos_write_iops_burst
Description
The desired burst limit of write operations.
Type
Integer
Default
0
rbd_qos_write_iops_limit
Description
The desired limit of write operations per second.
Type
Integer
Default 0
Description
Set a global level configuration override.
Description
Get a global level configuration override.
Description
List the global level configuration overrides.
Description
Remove a global level configuration override.
Pool level
Description
Set a pool level configuration override.
Description
Get a pool level configuration override.
Description
List the pool level configuration overrides.
Description
Remove a pool level configuration override.
NOTE: CONFIG_ENTITY is global, client, or client id. KEY is the config key. VALUE is the config value. POOL_NAME is the name of the pool.
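As a hedged sketch of these overrides, assuming a pool named data and rbd_cache as the key being overridden:
# Assumption: pool "data", key "rbd_cache"
[root@host01 ~]# rbd config global set global rbd_cache false
[root@host01 ~]# rbd config global get global rbd_cache
[root@host01 ~]# rbd config global list global
[root@host01 ~]# rbd config global remove global rbd_cache
[root@host01 ~]# rbd config pool set data rbd_cache false
[root@host01 ~]# rbd config pool list data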
rbd_compression_hint
Description
Hint to send to the OSDs on write operations. If set to compressible and the OSD bluestore_compression_mode setting is
passive, the OSD attempts to compress data. If set to incompressible and the OSD bluestore_compression_mode setting is
aggressive, the OSD will not attempt to compress data.
Type
Enum
Required
No
Default
none
Values
none, compressible, incompressible
rbd_read_from_replica_policy
Description
The policy for determining which OSD receives read operations. If set to default, read operations are always sent to the primary OSD of the placement group. If set to balance, read operations are sent to a randomly selected OSD within the replica set. If set to localize, read operations are sent to the closest OSD as determined by the CRUSH map.
NOTE: This feature requires the storage cluster to be configured with a minimum compatible OSD release of the latest version of IBM Storage Ceph.
Type
Enum
Required
No
Default
default
Values
default, balance, localize
Developer
Edit online
Use the various application programming interfaces (APIs) for IBM Storage Ceph running on AMD64 and Intel 64 architectures.
HTTP 1.1
JSON
JWT
These standards are OpenAPI 3.0 compliant, regulating the API syntax, semantics, content encoding, versioning, authentication, and
authorization.
Prerequisites
Versioning for the Ceph API
Authentication and authorization for the Ceph API
Prerequisites
Edit online
A mandatory explicit default version for all endpoints to avoid implicit defaults.
The expected version from a specific endpoint is stated in the HTTP header.
Syntax
Accept: application/vnd.ceph.api.vMAJOR.MINOR+json
Example
Accept: application/vnd.ceph.api.v1.0+json
If the current Ceph API server is not able to address that specific version, a 415 - Unsupported Media Type
response will be returned.
Major changes are backwards incompatible. Changes might result in non-additive changes to the request, and to the
response formats for a specific endpoint.
Minor changes are backwards and forwards compatible. Changes consist of additive changes to the request or
response formats for a specific endpoint.
Before users start using the Ceph API, they need a valid JSON Web Token (JWT). The /api/auth endpoint allows you to retrieve
this token.
Example
This token must be used together with every API request by placing it within the Authorization HTTP header.
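A hedged sketch of retrieving and using the token with curl, assuming placeholder credentials and the /api/summary endpoint as an arbitrary authenticated call:
# Assumption: placeholder user, password, and host; -k skips certificate verification for self-signed certificates
curl -k -X POST "https://CEPH_MANAGER:8443/api/auth" \
     -H "Accept: application/vnd.ceph.api.v1.0+json" \
     -H "Content-Type: application/json" \
     -d '{"username": "USER", "password": "PASSWORD"}'
curl -k -X GET "https://CEPH_MANAGER:8443/api/summary" \
     -H "Accept: application/vnd.ceph.api.v1.0+json" \
     -H "Authorization: Bearer TOKEN"
In the second request, TOKEN is the token value returned by the first call.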
Reference
Edit online
IMPORTANT: If disabling SSL, then user names and passwords are sent unencrypted to the IBM Storage Ceph Dashboard.
Prerequisites
Edit online
If you use a firewall, ensure that TCP port 8443, for SSL, and TCP port 8080, without SSL, are open on the node with the active
ceph-mgr daemon.
Procedure
Edit online
Example
a. If your organization’s certificate authority (CA) provides a certificate, then set using the certificate files:
Syntax
Example
If you want to set unique node-based certificates, then add a HOST_NAME to the commands:
Example
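A hedged sketch of both certificate commands, assuming certificate and key files named dashboard.crt and dashboard.key:
# Assumption: certificate files dashboard.crt and dashboard.key in the current directory
[ceph: root@host01 /]# ceph dashboard set-ssl-certificate -i dashboard.crt
[ceph: root@host01 /]# ceph dashboard set-ssl-certificate-key -i dashboard.key
# Node-based variant with an explicit host name
[ceph: root@host01 /]# ceph dashboard set-ssl-certificate host01 -i dashboard.crt
[ceph: root@host01 /]# ceph dashboard set-ssl-certificate-key host01 -i dashboard.key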
b. Alternatively, you can generate a self-signed certificate. However, using a self-signed certificate does not provide full
security benefits of the HTTPS protocol:
WARNING: Most modern web browsers will complain about self-signed certificates, which require you to confirm
before establishing a secure connection.
Syntax
Example
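A hedged sketch of generating the self-signed certificate:
[ceph: root@host01 /]# ceph dashboard create-self-signed-cert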
This example creates a user named user1 with the administrator role.
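A hedged sketch of the user creation, assuming the password is stored in a file named password.txt (recent releases require supplying the password from a file with -i):
# Assumption: password file password.txt containing the password for user1
[ceph: root@host01 /]# ceph dashboard ac-user-create user1 -i password.txt administrator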
5. Connect to the RESTful plug-in web page. Open a web browser and enter the following URL:
Syntax
https://HOST_NAME:8443
Example
https://host01:8443
Reference
Edit online
The https://HOST_NAME:8443/doc page, where HOST_NAME is the IP address or name of the node with the running ceph-mgr instance.
Getting information
Changing Configuration
Administering the Cluster
Getting information
Edit online
This section describes how to use the Ceph API to view information about the storage cluster, Ceph Monitors, OSDs, pools, and
hosts.
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
CEPH_MANAGER_PORT with the TCP port number. The default TCP port number is 8443.
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/cluster_conf', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/cluster_conf', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Reference
Edit online
Configuration
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/cluster_conf/ARGUMENT', auth=("USER",
"PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/cluster_conf/ARGUMENT
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Reference
Edit online
Configuration
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/flags', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/flags', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/osd/flags
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Reference
Edit online
Configuration
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/crush_rule', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/crush_rule', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/crush_rule
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Reference
Edit online
CRUSH Rules
IP address
Name
Quorum status
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/monitor', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/monitor', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/monitor
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
IP address
Name
Quorum status
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/monitor/NAME', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/monitor/NAME', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/monitor/NAME
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
IP address
Its pools
Affinity
Weight
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/osd
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
IP address
Its pools
Affinity
Weight
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/ID', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/ID', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/osd/ID
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/ID/command', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/osd/ID/command', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/osd/ID/command
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Flags
Size
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/pool', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/pool', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/pool
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Flags
Size
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/pool/ID', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/pool/ID', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
https://CEPH_MANAGER:8080/api/pool/ID
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Host names
Ceph version
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/host', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/host', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
https://CEPH_MANAGER:8080/api/host
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Host names
Ceph version
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
HOST_NAME with the host name of the host listed in the hostname field
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/host/HOST_NAME', auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
HOST_NAME with the host name of the host listed in the hostname field
$ python
>> import requests
>> result = requests.get('https://CEPH_MANAGER:8080/api/host/HOST_NAME', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
Web Browser
Edit online
In the web browser, enter:
https://CEPH_MANAGER:8080/api/host/HOST_NAME
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
HOST_NAME with the host name of the host listed in the hostname field
Changing Configuration
Edit online
This section describes how to use the Ceph API to change OSD configuration options, the state of an OSD, and information about
pools.
echo -En '{"=OPTION": VALUE}' | curl --request PATCH --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/osd/flags'
Replace:
OPTION with the option to modify; pause, noup, nodown, noout, noin, nobackfill, norecover, noscrub, nodeep-scrub
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/flags', json={"OPTION": VALUE}, auth=
("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
OPTION with the option to modify; pause, noup, nodown, noout, noin, nobackfill, norecover, noscrub, nodeep-scrub
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/flags', json={"OPTION": VALUE}, auth=
("USER", "PASSWORD"), verify=False)
>> print result.json()
echo -En '{"STATE": VALUE}' | curl --request PATCH --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/osd/ID'
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
echo -En '{"STATE": VALUE}' | curl --request PATCH --data @- --silent --insecure --user USER
'https://CEPH_MANAGER:8080/api/osd/ID'
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/ID', json={"STATE": VALUE}, auth=
("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/ID', json={"STATE": VALUE}, auth=
("USER", "PASSWORD"), verify=False)
>> print result.json()
echo -En '{"reweight": VALUE}' | curl --request PATCH --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/osd/ID'
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
echo -En '{"reweight": VALUE}' | curl --request PATCH --data @- --silent --insecure --user USER
'https://CEPH_MANAGER:8080/api/osd/ID'
Python
Edit online
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/ID', json={"reweight": VALUE}, auth=
("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/osd/ID', json={"reweight": VALUE}, auth=
("USER", "PASSWORD"), verify=False)
>> print result.json()
echo -En '{"OPTION": VALUE}' | curl --request PATCH --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/pool/ID'
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
echo -En '{"OPTION": VALUE}' | curl --request PATCH --data @- --silent --insecure --user USER
'https://CEPH_MANAGER:8080/api/pool/ID'
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.patch('https://CEPH_MANAGER:8080/api/pool/ID', json={"OPTION": VALUE}, auth=
("USER, "PASSWORD"), verify=False)
>> print result.json()
echo -En '{"command": "COMMAND"}' | curl --request POST --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/osd/ID/command'
Replace:
COMMAND with the process (scrub, deep-scrub, or repair) you want to start. Verify that the process is supported on the OSD. See How can I determine what process can be scheduled on an OSD? for details.
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
$ python
>> import requests
>> result = requests.post('https://CEPH_MANAGER:8080/api/osd/ID/command', json={"command":
"COMMAND"}, auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
COMMAND with the process (scrub, deep-scrub, or repair) you want to start. Verify that the process is supported on the OSD. See How can I determine what process can be scheduled on an OSD? for details.
$ python
>> import requests
>> result = requests.post('https://CEPH_MANAGER:8080/api/osd/ID/command', json={"command":
"COMMAND"}, auth=("USER", "PASSWORD"), verify=False)
>> print result.json()
echo -En '{"name": "NAME", "pg_num": NUMBER}' | curl --request POST --data @- --silent --user USER
'https://CEPH_MANAGER:8080/api/pool'
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
echo -En '{"name": "NAME", "pg_num": NUMBER}' | curl --request POST --data @- --silent --insecure -
-user USER 'https://CEPH_MANAGER:8080/api/pool'
$ python
>> import requests
>> result = requests.post('https://CEPH_MANAGER:8080/api/pool', json={"name": "NAME", "pg_num":
NUMBER}, auth=("USER", "PASSWORD"))
>> print result.json()
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.post('https://CEPH_MANAGER:8080/api/pool', json={"name": "NAME", "pg_num":
NUMBER}, auth=("USER", "PASSWORD"), verify=False)
>> print result.json()
This request is forbidden by default. To allow it, add the following parameter to the Ceph configuration:
mon_allow_pool_delete = true
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
Python
Edit online
In the Python interpreter, enter:
Replace:
CEPH_MANAGER with the IP address or short host name of the node with the active ceph-mgr instance
$ python
>> import requests
>> result = requests.delete('https://CEPH_MANAGER:8080/api/pool/ID', auth=("USER", "PASSWORD"),
verify=False)
>> print result.json()
NOTE: IBM recommends using the command-line interface when configuring the Ceph Object Gateway.
Prerequisites
Administration operations
Administration authentication requests
Creating an administrative user
Get user information
Create a user
Modify a user
Remove a user
Create a subuser
Modify a subuser
Remove a subuser
Add capabilities to a user
Remove capabilities from a user
Create a key
Remove a key
Bucket notifications
Get bucket information
Check a bucket index
Remove a bucket
Link a bucket
Unlink a bucket
Get a bucket or object policy
Remove an object
Prerequisites
Edit online
A RESTful client.
Administration operations
Edit online
An administrative Application Programming Interface (API) request will be done on a URI that starts with the configurable admin
resource entry point. Authorization for the administrative API duplicates the S3 authorization mechanism. Some operations require
that the user holds special administrative capabilities. The response entity type, either XML or JSON, might be specified as the
format option in the request and defaults to JSON if not specified.
Example
usage=read
Most use cases for the S3 API involve using open-source S3 clients such as the AmazonS3Client in the Amazon SDK for Java or
Python Boto. These libraries do not support the Ceph Object Gateway Admin API. You can subclass and extend these libraries to
support the Ceph Admin API. Alternatively, you can create a unique Gateway client.
The CephAdminAPI example class in this section illustrates how to create an execute() method that can take request parameters,
authenticate the request, call the Ceph Admin API and receive a response.
NOTE: The CephAdminAPI class example is not supported or intended for commercial use. It is for illustrative purposes only.
The client code contains five calls to the Ceph Object Gateway to demonstrate CRUD operations:
Create a User
Get a User
Modify a User
Create a Subuser
Delete a User
To use this example, get the httpcomponents-client-4.5.3 Apache HTTP components. You can download them, for example, from http://hc.apache.org/downloads.cgi. Then unzip the tar file, navigate to its lib directory, and copy the contents to the /jre/lib/ext directory of the JAVA_HOME directory, or to a custom classpath.
As you examine the CephAdminAPI class example, notice that the execute() method takes an HTTP method, a request path, an
optional subresource, null if not specified, and a map of parameters. To execute with subresources, for example, subuser, and
key, you will need to specify the subresource as an argument in the execute() method.
1. Builds a URI.
4. Adds the Date header to the HTTP header string and the request header.
7. Makes a request.
8. Returns a response.
Building the header string
Building the header string is the portion of the process that involves Amazon’s S3 authentication procedure. Specifically, the example method does the following:
The request type should be uppercase with no leading or trailing white space. If you do not trim white space, authentication will fail.
The date MUST be expressed in GMT, or authentication will fail.
The exemplary method does not have any other headers. The Amazon S3 authentication procedure sorts x-amz headers
lexicographically. So if you are adding x-amz headers, be sure to add them lexicographically.
Once you have built the header string, the next step is to instantiate an HTTP request and pass it the URI. The exemplary method
uses PUT for creating a user and subuser, GET for getting a user, POST for modifying a user and DELETE for deleting a user.
Once you instantiate a request, add the Date header followed by the Authorization header. Amazon’s S3 authentication uses the
standard Authorization header, and has the following structure:
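Based on the base64Sha1Hmac() call shown below, the header has roughly this shape, where the second component is the base64-encoded SHA-1 HMAC of the header string computed with the admin user's secret key:
Authorization: AWS ACCESS_KEY:BASE64_ENCODED_HMAC_SHA1_SIGNATURE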
The CephAdminAPI example class has a base64Sha1Hmac() method, which takes the header string and the secret key for the
admin user, and returns a SHA1 HMAC as a base-64 encoded string. Each execute() call will invoke the same line of code to build
the Authorization header:
The following CephAdminAPI example class requires you to pass the access key, secret key, and an endpoint to the constructor. The
class provides accessor methods to change them at runtime.
Example
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.time.OffsetDateTime;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.Header;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.client.methods.HttpDelete;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.apache.http.client.utils.URIBuilder;
import java.util.Base64;
import java.util.Base64.Encoder;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.crypto.spec.SecretKeySpec;
import javax.crypto.Mac;
import java.util.Map;
import java.util.Iterator;
import java.util.Set;
import java.util.Map.Entry;
/*
* Each call must specify an access key, secret key, endpoint and format.
*/
String accessKey;
String secretKey;
String endpoint;
String scheme = "http"; //http only.
int port = 80;
/*
* A constructor that takes an access key, secret key, endpoint and format.
*/
public CephAdminAPI(String accessKey, String secretKey, String endpoint){
this.accessKey = accessKey;
this.secretKey = secretKey;
this.endpoint = endpoint;
}
/*
* Accessor methods for access key, secret key, endpoint and format.
*/
public String getEndpoint(){
return this.endpoint;
}
/*
* Takes an HTTP Method, a resource and a map of arguments and
* returns a CloseableHTTPResponse.
*/
public CloseableHttpResponse execute(String HTTPMethod, String resource,
String subresource, Map arguments) {
try {
if (subresource != null){
uri = new URIBuilder(uri)
.setCustomQuery(subresource)
.build();
}
request.append(uri);
headerString.append(HTTPMethod.toUpperCase().trim() + "\n\n\n");
headerString.append(date + "\n");
headerString.append(requestPath);
if (HTTPMethod.equalsIgnoreCase("PUT")){
httpRequest = new HttpPut(uri);
} else if (HTTPMethod.equalsIgnoreCase("POST")){
httpRequest = new HttpPost(uri);
} else if (HTTPMethod.equalsIgnoreCase("GET")){
httpRequest = new HttpGet(uri);
} else if (HTTPMethod.equalsIgnoreCase("DELETE")){
httpRequest = new HttpDelete(uri);
} else {
System.err.println("The HTTP Method must be PUT, POST, GET or DELETE.");
throw new IOException();
}
httpRequest.addHeader("Date", date);
httpRequest.addHeader("Authorization", "AWS " + this.getAccessKey()
+ ":" + base64Sha1Hmac(headerString.toString(),
this.getSecretKey()));
httpclient = HttpClients.createDefault();
/*
* Takes a uri and a secret key and returns a base64-encoded
* SHA-1 HMAC.
*/
public String base64Sha1Hmac(String uri, String secretKey) {
try {
} catch (Exception e) {
throw new RuntimeException(e);
}
}
The subsequent CephAdminAPIClient example illustrates how to instantiate the CephAdminAPI class, build a map of request
parameters, and use the execute() method to create, get, update and delete a user.
Example
import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.HttpEntity;
import org.apache.http.util.EntityUtils;
import java.util.*;
/*
* Create a user
*/
Map requestArgs = new HashMap();
requestArgs.put("access", "usage=read, write; users=read, write");
requestArgs.put("display-name", "New User");
requestArgs.put("email", "[email protected]");
requestArgs.put("format", "json");
requestArgs.put("uid", "new-user");
CloseableHttpResponse response =
adminApi.execute("PUT", "/admin/user", null, requestArgs);
System.out.println(response.getStatusLine());
HttpEntity entity = response.getEntity();
try {
System.out.println("\nResponse Content is: "
/*
* Get a user
*/
requestArgs = new HashMap();
requestArgs.put("format", "json");
requestArgs.put("uid", "new-user");
System.out.println(response.getStatusLine());
entity = response.getEntity();
try {
System.out.println("\nResponse Content is: "
+ EntityUtils.toString(entity, "UTF-8") + "\n");
response.close();
} catch (IOException e){
System.err.println ("Encountered an I/O exception.");
e.printStackTrace();
}
/*
* Modify a user
*/
requestArgs = new HashMap();
requestArgs.put("display-name", "John Doe");
requestArgs.put("email", "[email protected]");
requestArgs.put("format", "json");
requestArgs.put("uid", "new-user");
requestArgs.put("max-buckets", "100");
System.out.println(response.getStatusLine());
entity = response.getEntity();
try {
System.out.println("\nResponse Content is: "
+ EntityUtils.toString(entity, "UTF-8") + "\n");
response.close();
} catch (IOException e){
System.err.println ("Encountered an I/O exception.");
e.printStackTrace();
}
/*
* Create a subuser
*/
requestArgs = new HashMap();
requestArgs.put("format", "json");
requestArgs.put("uid", "new-user");
requestArgs.put("subuser", "foobar");
try {
System.out.println("\nResponse Content is: "
+ EntityUtils.toString(entity, "UTF-8") + "\n");
response.close();
} catch (IOException e){
System.err.println ("Encountered an I/O exception.");
e.printStackTrace();
}
/*
try {
System.out.println("\nResponse Content is: "
+ EntityUtils.toString(entity, "UTF-8") + "\n");
response.close();
} catch (IOException e){
System.err.println ("Encountered an I/O exception.");
e.printStackTrace();
}
}
}
Reference
Edit online
For a more extensive explanation of the Amazon S3 authentication procedure, consult the Signing and Authenticating REST
Requests section of Amazon Simple Storage Service documentation.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Example output
{
"user_id": "admin-api-user",
Syntax
Example
The radosgw-admin command-line interface will return the user. The "caps" field will show the capabilities you assigned to the user:
Example output
{
"user_id": "admin-api-user",
"display_name": "Admin API User",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "admin-api-user",
"access_key": "NRWGT19TWMYOB1YDBV1Y",
"secret_key": "gr1VEGIV7rxcP3xvXDFCo4UDwwl2YoNrmtRlIAty"
}
],
"swift_keys": [],
"caps": [
{
"type": "users",
"perm": "*"
}
],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
Capabilities
users=read
Syntax
Request Parameters
uid
Description
The user for which the information is requested.
Type
String
Example
foo_user
Required
Yes
Response Entities
user
Description
A container for the user data information.
Type
Container
Parent
N/A
user_id
Description
The user ID.
Type
String
Parent
user
display_name
Description
Display name for the user.
Type
String
Parent
user
suspended
Description
True if the user is suspended.
Type
Boolean
Parent
user
max_buckets
Description
The maximum number of buckets to be owned by the user.
Type
Integer
Parent
user
subusers
Description
Subusers associated with this user account.
Type
Container
Parent
user
keys
Description
S3 keys associated with this user account.
Type
Container
Parent
user
swift_keys
Description
Swift keys associated with this user account.
Type
Container
Parent
user
caps
Description
User capabilities.
Type
Container
Parent
user
Create a user
Edit online
Create a new user. By default, an S3 key pair is created automatically and returned in the response. If only an access-key or
secret-key is provided, the omitted key is generated automatically. By default, a generated key is added to the keyring
without replacing an existing key pair. If access-key is specified and refers to an existing key owned by the user, then that key is
modified.
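The following is a minimal sketch, not part of the documented procedure, showing how such a create-user request could be issued from Python. It assumes the third-party requests and requests-aws4auth packages are installed; the endpoint URL is a placeholder, and the credentials are the admin API user keys shown in the earlier output.
import requests
from requests_aws4auth import AWS4Auth

# Credentials of an admin API user that holds the users=write capability.
auth = AWS4Auth("NRWGT19TWMYOB1YDBV1Y",
                "gr1VEGIV7rxcP3xvXDFCo4UDwwl2YoNrmtRlIAty",
                "us-east-1", "s3")
params = {
    "uid": "new-user",
    "display-name": "New User",
    "email": "[email protected]",
    "format": "json",
}
# PUT /admin/user creates the user; the query string carries the request parameters below.
response = requests.put("http://rgw.example.com:8080/admin/user",
                        params=params, auth=auth)
print(response.status_code, response.json())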
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to be created.
Type
String
Example
foo_user
Required
Yes
display-name
Description
The display name of the user to be created.
Type
String
Example
foo_user
Required
Yes
email
Description
The email address associated with the user.
Type
String
Example
[email protected]
Required
No
key-type
Description
Key type to be generated, options are: swift, s3 (default).
Type
String
Example
s3
Required
No
access-key
Description
Specify access key.
Type
String
Example
ABCD0EF12GHIJ2K34LMN
Required
No
secret-key
Description
Specify secret key.
Type
String
Example
0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8
Required
No
user-caps
Description
User capabilities.
Type
String
Example
usage=read, write; users=read
Required
No
generate-key
Description
Generate a new key pair and add to the existing keyring.
Type
Boolean
Example
True
Required
No
max-buckets
Description
Specify the maximum number of buckets the user can own.
Type
Integer
Example
500
Required
No
suspended
Description
Specify whether the user should be suspended
Type
Boolean
Example
False
Required
No
Response Entities
user
Description
A container for the user data information.
Type
Container
Parent
N/A
user_id
Description
The user ID.
Type
String
Parent
user
display_name
Description
Display name for the user.
Type
String
Parent
user
suspended
Description
True if the user is suspended.
Type
Boolean
Parent
user
max_buckets
Description
The maximum number of buckets to be owned by the user.
Type
Integer
Parent
user
subusers
Description
Subusers associated with this user account.
Type
Container
Parent
user
keys
Description
S3 keys associated with this user account.
Type
Container
Parent
user
swift_keys
Description
Swift keys associated with this user account.
Type
Container
Parent
user
caps
Description
User capabilities.
Type
Container
Parent
user
If successful, the response contains the user information.
UserExists
Description
Attempt to create existing user.
Code
409 Conflict
InvalidAccessKey
Description
Invalid access key specified.
InvalidKeyType
Description
Invalid key type specified.
Code
400 Bad Request
InvalidSecretKey
Description
Invalid secret key specified.
Code
400 Bad Request
KeyExists
Description
Provided access key exists and belongs to another user.
Code
409 Conflict
EmailExists
Description
Provided email address exists.
Code
409 Conflict
InvalidCap
Description
Attempt to grant invalid admin capability.
Code
400 Bad Request
Reference
Edit online
Modify a user
Edit online
Modify an existing user.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to be modified.
Type
String
Example
foo_user
Required
Yes
display-name
Description
The display name of the user to be modified.
Type
String
Example
foo_user
Required
Yes
email
Description
The email address associated with the user.
Type
String
Example
[email protected]
Required
No
generate-key
Description
Generate a new key pair and add to the existing keyring.
Type
Boolean
Example
True
Required
No
access-key
Description
Specify access key.
Type
String
Example
ABCD0EF12GHIJ2K34LMN
Required
No
secret-key
Description
Specify secret key.
Type
String
Example
0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8
Required
No
key-type
Description
Key type to be generated, options are: swift, s3 (default).
Type
String
Example
s3
Required
No
user-caps
Description
User capabilities.
Type
String
Example
usage=read, write; users=read
Required
No
max-buckets
Description
Specify the maximum number of buckets the user can own.
Type
Integer
Example
500
Required
No
suspended
Description
Specify whether the user should be suspended
Type
Boolean
Example
False
Required
No
Response Entities
user
Description
A container for the user data information.
Type
Container
Parent
N/A
user_id
Description
The user ID.
Type
String
Parent
user
display_name
Description
Display name for the user.
Type
String
Parent
user
suspended
Description
True if the user is suspended.
Type
Boolean
Parent
user
max_buckets
Description
The maximum number of buckets to be owned by the user.
Type
Integer
Parent
user
subusers
Description
Subusers associated with this user account.
Type
Container
Parent
user
keys
Description
S3 keys associated with this user account.
Type
Container
Parent
user
swift_keys
Description
Swift keys associated with this user account.
Type
Container
Parent
user
caps
Description
User capabilities.
Type
Container
Parent
user
If successful, the response contains the user information.
InvalidAccessKey
Description
Invalid access key specified.
Code
400 Bad Request
InvalidKeyType
Description
Invalid key type specified.
Code
400 Bad Request
InvalidSecretKey
Description
Invalid secret key specified.
Code
400 Bad Request
KeyExists
Description
Provided access key exists and belongs to another user.
Code
409 Conflict
EmailExists
Description
Provided email address exists.
Code
409 Conflict
InvalidCap
Description
Attempt to grant invalid admin capability.
Code
400 Bad Request
Reference
Edit online
Modifying subusers
Remove a user
Edit online
Remove an existing user.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to be removed.
Type
String
Example
foo_user
Required
Yes
purge-data
Description
When specified the buckets and objects belonging to the user will also be removed.
Type
Boolean
Example
True
Required
No
Response Entities
None.
None.
Reference
Edit online
Create a subuser
Edit online
Create a new subuser, primarily useful for clients using the Swift API.
NOTE: Either gen-subuser or subuser is required for a valid request. In general, for a subuser to be useful, it must be granted
permissions by specifying access. As with user creation, if subuser is specified without secret, then a secret key is
automatically generated.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID under which a subuser is to be created.
Type
String
Example
foo_user
Required
Yes
subuser
Description
Specify the subuser ID to be created.
Type
String
Example
sub_foo
Required
Yes (or gen-subuser)
gen-subuser
Description
Specify the subuser ID to be created.
Type
String
Example
sub_foo
Required
Yes (or subuser)
secret-key
Description
Specify secret key.
Type
String
Example
0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8
Required
No
key-type
Description
Key type to be generated, options are: swift (default), s3.
Type
String
Example
swift
Required
No
access
Description
Set access permissions for sub-user, should be one of read, write, readwrite, full.
Type
String
Example
read
Required
No
generate-secret
Description
Generate the secret key.
Type
Boolean
Example
True
Required
No
Response Entities
subusers
Description
Subusers associated with the user account.
Type
Container
Parent
N/A
permissions
Description
Subuser access to user account.
Type
String
SubuserExists
Description
Specified subuser exists.
Code
409 Conflict
InvalidKeyType
Description
Invalid key type specified.
Code
400 Bad Request
InvalidSecretKey
Description
Invalid secret key specified.
Code
400 Bad Request
InvalidAccess
Description
Invalid subuser access specified
Code
400 Bad Request
Modify a subuser
Edit online
Modify an existing subuser.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID under which the subuser is to be modified.
Type
String
Example
foo_user
Required
Yes
subuser
Description
The subuser ID to be modified.
Type
String
Example
sub_foo
Required
No
generate-secret
Description
Generate a new secret key for the subuser, replacing the existing key.
Type
Boolean
Example
True
Required
No
secret
Description
Specify secret key.
Type
String
Example
0AbCDEFg1h2i34JklM5nop6QrSTUV+WxyzaBC7D8
Required
No
key-type
Description
Key type to be generated, options are: swift (default), s3.
Type
String
Example
swift
Required
No
access
Description
Set access permissions for sub-user, should be one of read, write, readwrite, full.
Type
String
Example
read
Required
No
Response Entities
subusers
Type
Container
Parent
N/A
id
Description
Subuser ID
Type
String
Parent
subusers
permissions
Description
Subuser access to user account.
Type
String
Parent
subusers
InvalidKeyType
Description
Invalid key type specified.
Code
400 Bad Request
InvalidSecretKey
Description
Invalid secret key specified.
Code
400 Bad Request
InvalidAccess
Description
Invalid subuser access specified
Code
400 Bad Request
Remove a subuser
Edit online
Remove an existing subuser.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID under which the subuser is to be removed.
Type
String
Example
foo_user
Required
Yes
subuser
Description
The subuser ID to be removed.
Type
String
Example
sub_foo
Required
Yes
purge-keys
Description
Remove keys belonging to the subuser.
Type
Boolean
Example
True
Required
No
Response Entities
None.
None.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to add an administrative capability to.
Type
String
Example
foo_user
Required
Yes
user-caps
Description
The administrative capability to add to the user.
Type
String
Example
usage=read, write
Required
Yes
Response Entities
user
Description
A container for the user data information.
Type
Container
Parent
N/A
user_id
Description
The user ID
Type
String
Parent
user
caps
Description
User capabilities
Type
Container
Parent
user
InvalidCap
Description
Attempt to grant invalid admin capability.
Code
400 Bad Request
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to remove an administrative capability from.
Type
String
Example
foo_user
Required
Yes
user-caps
Description
The administrative capabilities to remove from the user.
Type
String
Example
usage=read, write
Required
Yes
Response Entities
user
Description
A container for the user data information.
Type
Container
Parent
N/A
user_id
Description
The user ID.
Type
String
Parent
user
caps
Description
User capabilities.
Type
Container
Parent
user
InvalidCap
Description
Attempt to remove an invalid admin capability.
Code
400 Bad Request
NoSuchCap
Description
User does not possess specified capability.
Code
404 Not Found
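As a hedged Python sketch only (the requests and requests-aws4auth packages, endpoint, and credentials are assumptions, not part of this procedure), adding an administrative capability with this operation could look like the following; removing one uses a DELETE request against the same resource.
import requests
from requests_aws4auth import AWS4Auth

auth = AWS4Auth("ACCESS_KEY", "SECRET_KEY", "us-east-1", "s3")
# PUT /admin/user?caps adds the capability described by user-caps.
requests.put("http://rgw.example.com:8080/admin/user",
             params={"caps": "", "uid": "foo_user",
                     "user-caps": "usage=read,write", "format": "json"},
             auth=auth)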
Create a key
Edit online
Create a new key. If a subuser is specified, then by default the created keys will be of swift type. If only one of access-key or secret-
key is provided, the omitted key is generated automatically; that is, if only secret-key is specified, then access-key is
generated automatically. By default, a generated key is added to the keyring without replacing an existing key pair. If access-key is
specified and refers to an existing key owned by the user, then that key is modified. The response is a container listing all keys of the
same type as the key created.
NOTE: When creating a swift key, specifying the option access-key will have no effect. Additionally, only one swift key might be
held by each user or subuser.
Capabilities
`users=write`
Syntax
Request Parameters
uid
Description
The user ID to receive the new key.
Type
String
Example
foo_user
Required
Yes
subuser
Description
The subuser ID to receive the new key.
Type
String
Example
sub_foo
Required
No
key-type
Description
Key type to be generated, options are: swift, s3 (default).
Type
String
Example
s3
Required
No
access-key
Description
Specify access key.
Type
String
Example
AB01C2D3EF45G6H7IJ8K
Required
No
secret-key
Description
Specify secret key.
Type
String
Example
0ab/CdeFGhij1klmnopqRSTUv1WxyZabcDEFgHij
Required
No
generate-key
Description
Generate a new key pair and add to the existing keyring.
Type
Boolean
Example
True
Required
No
Response Entities
keys
Description
Keys of type created associated with this user account.
Type
Container
Parent
N/A
user
Description
The user account associated with the key.
Type
String
Parent
keys
access-key
Description
The access key.
Type
String
Parent
keys
secret-key
Description
The secret key.
Type
String
Parent
keys
InvalidAccessKey
Description
Invalid access key specified.
Code
400 Bad Request
InvalidKeyType
Description
Invalid key type specified.
Code
400 Bad Request
InvalidSecretKey
Description
Invalid secret key specified.
Code
400 Bad Request
KeyExists
Description
Provided access key exists and belongs to another user.
Code
409 Conflict
Remove a key
Edit online
Remove an existing key.
Capabilities
`users=write`
Syntax
Request Parameters
access-key
Description
The S3 access key belonging to the S3 key pair to remove.
Type
String
Example
AB01C2D3EF45G6H7IJ8K
Required
Yes
uid
Description
The user to remove the key from.
Type
String
Example
foo_user
Required
No
subuser
Description
The subuser to remove the key from.
Type
String
Example
sub_foo
key-type
Description
Key type to be removed, options are: swift, s3.
Type
String
Example
swift
Required
No
None.
Response Entities
None.
Bucket notifications
Edit online
As a storage administrator, you can use these APIs to provide configuration and control interfaces for the bucket notification
mechanism. The API topics are named objects that contain the definition of a specific endpoint. Bucket notifications associate topics
with a specific bucket. The S3 bucket operations section gives more details on bucket notifications.
NOTE: In all topic actions, the parameters are URL encoded, and sent in the message body using application/x-www-form-
urlencoded content type.
NOTE: Any bucket notification already associated with the topic needs to be re-created for the topic update to take effect.
Prerequisites
Overview of bucket notifications
Persistent notifications
Creating a topic
Getting topic information
Listing topics
Deleting topics
Using the command-line interface for topic management
Event record
Supported event types
Prerequisites
Edit online
Persistent notifications
Edit online
Persistent notifications enable reliable and asynchronous delivery of notifications from the Ceph Object Gateway to the endpoint
configured at the topic. Regular notifications are also reliable because the delivery to the endpoint is performed synchronously
during the request. With persistent notifications, the Ceph Object Gateway retries sending notifications even when the endpoint is
down or there are network issues during the operations, that is notifications are retried if not successfully delivered to the endpoint.
Notifications are sent only after all other actions related to the notified operation are successful. If an endpoint goes down for a
longer duration, the notification queue fills up and the S3 operations that have configured notifications for these endpoints will fail.
NOTE: With kafka-ack-level=none, there is no indication for message failures, and therefore messages sent while broker is
down are not retried, when the broker is up again. After the broker is up again, only new notifications are seen.
Creating a topic
Edit online
You can create topics before creating bucket notifications. A topic is a Simple Notification Service (SNS) entity and all the topic
operations, that is, create, delete, list, and get, are SNS operations. The topic needs to have endpoint parameters that are
used when a bucket notification is created. Once the request is successful, the response includes the topic Amazon Resource Name
(ARN) that can be used later to reference this topic in the bucket notification request.
NOTE: A topic_arn provides the bucket notification configuration and is generated after a topic is created.
Prerequisites
Edit online
Root-level access.
Endpoint parameters.
Syntax
POST
Action=CreateTopic
&Name=TOPIC_NAME
[&Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=EXCHANGE]
[&Attributes.entry.2.key=amqp-ack-level&Attributes.entry.2.value=none|broker|routable]
[&Attributes.entry.3.key=verify-ssl&Attributes.entry.3.value=true|false]
[&Attributes.entry.4.key=kafka-ack-level&Attributes.entry.4.value=none|broker]
[&Attributes.entry.5.key=use-ssl&Attributes.entry.5.value=true|false]
[&Attributes.entry.6.key=ca-location&Attributes.entry.6.value=FILE_PATH]
[&Attributes.entry.7.key=OpaqueData&Attributes.entry.7.value=OPAQUE_DATA]
[&Attributes.entry.8.key=push-endpoint&Attributes.entry.8.value=ENDPOINT]
[&Attributes.entry.9.key=persistent&Attributes.entry.9.value=true|false]
OpaqueData: opaque data is set in the topic configuration and added to all notifications triggered by the topic.
persistent: indicates whether notifications to this endpoint are persistent, that is, asynchronous, or not. By default,
the value is false.
HTTP endpoint:
URL: https://FQDN:PORT
verify-ssl: Indicates whether the server certificate is validated by the client or not. By default, it is true.
AMQP0.9.1 endpoint:
URL: amqp://USER:PASSWORD@FQDN:PORT.
User and password details should be provided over HTTPS, otherwise the topic creation request is rejected.
amqp-exchange: The exchanges must exist and be able to route messages based on topics. This is a mandatory
parameter for AMQP0.9.1. Different topics pointing to the same endpoint must use the same exchange.
amqp-ack-level: No end to end acknowledgment is required, as messages may persist in the broker before
being delivered into their final destination. Three acknowledgment methods exist:
NOTE: The key and value of a specific parameter do not have to reside in the same line, or in any specific
order, but must use the same index. Attribute indexing does not need to be sequential or start from any
specific value.
Kafka endpoint:
URL: kafka:USER:PASSWORD@FQDN:PORT.
use-ssl is set to false by default. If use-ssl is set to true, secure connection is used for connecting with
the broker.
If ca-location is provided, and secure connection is used, the specified CA will be used, instead of the default
one, to authenticate the broker.
User and password can only be provided over HTTPS. Otherwise, the topic creation request is rejected.
User and password may only be provided together with use-ssl, otherwise, the connection to the broker will
fail.
kafka-ack-level: no end to end acknowledgment required, as messages may persist in the broker before
being delivered into their final destination. Two acknowledgment methods exist:
<CreateTopicResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/">
<CreateTopicResult>
<TopicArn></TopicArn>
</CreateTopicResult>
<ResponseMetadata>
<RequestId></RequestId>
</ResponseMetadata>
</CreateTopicResponse>
NOTE: The topic Amazon Resource Name (ARN) in the response will have the following format:
arn:aws:sns:ZONE_GROUP:TENANT:TOPIC
Example
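The following is a minimal sketch, not the documented example, of creating a topic with the AWS SDK for Python (boto3); the gateway endpoint, credentials, region name, and push endpoint are placeholders.
import boto3

sns = boto3.client('sns',
                   endpoint_url='http://rgw.example.com:8080',
                   aws_access_key_id='MY_ACCESS_KEY',
                   aws_secret_access_key='MY_SECRET_KEY',
                   region_name='default')
# The Attributes map to the Attributes.entry.N key/value pairs in the syntax above.
response = sns.create_topic(
    Name='mytopic',
    Attributes={'push-endpoint': 'http://consumer.example.com:8080',
                'persistent': 'true'})
print(response['TopicArn'])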
Prerequisites
Edit online
Root-level access.
Endpoint parameters.
Procedure
Edit online
Syntax
POST
Action=GetTopic
&TopicArn=TOPIC_ARN
<GetTopicResponse>
<GetTopicResult>
<Topic>
<User></User>
<Name></Name>
<EndPoint>
<EndpointAddress></EndpointAddress>
<EndpointArgs></EndpointArgs>
<EndpointTopic></EndpointTopic>
<HasStoredSecret></HasStoredSecret>
<Persistent></Persistent>
</EndPoint>
<TopicArn></TopicArn>
<OpaqueData></OpaqueData>
EndpointAddress: The endpoint URL. If the endpoint URL contains user and password information, the
request must be made over HTTPS. Otherwise, the topic get request is rejected.
EndpointTopic: The topic name that is sent to the endpoint can be different from the topic name in the
example above.
HasStoredSecret: true when the endpoint URL contains user and password information.
Listing topics
Edit online
List the topics that the user has defined.
Prerequisites
Edit online
Root-level access.
Endpoint parameters.
Procedure
Edit online
Syntax
POST
Action=ListTopics
<ListTopicsResponse xmlns="https://sns.amazonaws.com/doc/2020-03-31/">
<ListTopicsResult>
<Topics>
NOTE: If endpoint URL contains user and password information, in any of the topics, the request must be made over HTTPS.
Otherwise, the topic list request is rejected.
Deleting topics
Edit online
Removing a deleted topic results in no operation and is not a failure.
Prerequisites
Edit online
Root-level access.
Endpoint parameters.
Procedure
Edit online
Syntax
POST
Action=DeleteTopic
&TopicArn=TOPIC_ARN
<DeleteTopicResponse xmlns="https://sns.amazonaws.com/doc/2020-03-31/">
<ResponseMetadata>
<RequestId></RequestId>
</ResponseMetadata>
</DeleteTopicResponse>
Prerequisites
Edit online
Syntax
Example
Syntax
Example
Syntax
Example
Event record
Edit online
An event holds information about the operation done by the Ceph Object Gateway and is sent as a payload over the chosen endpoint,
such as HTTP, HTTPS, Kafka, or AMQ0.9.1. The event record is in JSON format.
Example
{"Records":[
{
"eventVersion":"2.1",
"eventSource":"ceph:s3",
"awsRegion":"us-east-1",
"eventTime":"2019-11-22T13:47:35.124724Z",
"eventName":"ObjectCreated:Put",
"userIdentity":{
"principalId":"tester"
},
"requestParameters":{
"sourceIPAddress":""
},
"responseElements":{
"x-amz-request-id":"503a4c37-85eb-47cd-8681-2817e80b4281.5330.903595",
"x-amz-id-2":"14d2-zone1-zonegroup1"
},
"s3":{
"s3SchemaVersion":"1.0",
"configurationId":"mynotif1",
"bucket":{
"name":"mybucket1",
"ownerIdentity":{
awsRegion
Zonegroup.
eventTime
Timestamp that indicates when the event was triggered.
eventName
The type of the event.
userIdentity.principalId
The identity of the user that triggered the event.
requestParameters.sourceIPAddress
The IP address of the client that triggered the event. This field is not supported.
responseElements.x-amz-request-id
The request ID that triggered the event.
responseElements.x-amz-id-2
The identity of the Ceph Object Gateway on which the event was triggered. The identity format is RGWID-ZONE-ZONEGROUP.
s3.configurationId
The notification ID that created the event.
s3.bucket.name
The name of the bucket.
s3.bucket.ownerIdentity.principalId
The owner of the bucket.
s3.bucket.arn
Amazon Resource Name (ARN) of the bucket.
s3.bucket.id
Identity of the bucket.
s3.object.key
The object key.
s3.object.size
The size of the object.
s3.object.eTag
The object etag.
s3.object.version
The object version in a versioned bucket.
s3.object.sequencer
Monotonically increasing identifier of the change per object in the hexadecimal format.
s3.object.metadata
Any metadata set on the object.
s3.object.tags
Any tags set on the object.
s3.eventId
Unique identity of the event.
s3.opaqueData
Opaque data is set in the topic configuration and added to all notifications triggered by the topic.
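As a hedged sketch only (the port and handler are illustrative and are not part of the product), an HTTP push endpoint that receives and parses these event records could be as simple as the following.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers['Content-Length']))
        # Each delivery carries a Records array like the example above.
        for record in json.loads(body).get('Records', []):
            print(record['eventName'], record['s3']['object']['key'])
        self.send_response(200)
        self.end_headers()

HTTPServer(('', 8080), EventHandler).serve_forever()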
Reference
Edit online
s3:ObjectCreated:*
s3:ObjectCreated:Put
s3:ObjectCreated:Post
s3:ObjectCreated:Copy
s3:ObjectCreated:CompleteMultipartUpload
s3:ObjectRemoved:*
s3:ObjectRemoved:Delete
s3:ObjectRemoved:DeleteMarkerCreated
Capabilities
`buckets=read`
Syntax
Request Parameters
bucket
Description
The bucket to return info on.
Type
String
Example
foo_bucket
Required
No
uid
Description
The user to retrieve bucket information for.
Type
String
Example
foo_user
Required
No
stats
Description
Return bucket statistics.
Type
Boolean
Example
True
Required
No
Response Entities
stats
Description
Per bucket information.
Type
Container
Parent
N/A
buckets
Description
Contains a list of one or more bucket containers.
Type
Container
Parent
buckets
bucket
Description
Container for single bucket information.
Type
Container
Parent
buckets
name
Description
The name of the bucket.
Type
String
Parent
bucket
pool
Description
The pool the bucket is stored in.
Type
String
Parent
bucket
id
Description
The unique bucket ID.
Type
String
Parent
bucket
marker
Description
Internal bucket tag.
Type
String
Parent
bucket
owner
Description
The user ID of the bucket owner.
Type
String
Parent
bucket
usage
Description
Storage usage information.
Type
Container
Parent
bucket
index
Description
Status of bucket index.
Type
String
Parent
bucket
If successful, then the request returns a bucket’s container with the bucket information.
IndexRepairFailed
Description
Bucket index repair failed.
Code
409 Conflict
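As a hedged Python sketch only (the requests and requests-aws4auth packages, endpoint, and credentials are assumptions), retrieving bucket statistics with this operation could look like the following.
import requests
from requests_aws4auth import AWS4Auth

auth = AWS4Auth("ACCESS_KEY", "SECRET_KEY", "us-east-1", "s3")
# GET /admin/bucket with stats=true returns the per-bucket statistics described above.
r = requests.get("http://rgw.example.com:8080/admin/bucket",
                 params={"bucket": "foo_bucket", "stats": "true", "format": "json"},
                 auth=auth)
print(r.json())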
NOTE: To check multipart object accounting with check-objects, fix must be set to True.
Capabilities
buckets=write
Syntax
Request Parameters
bucket
Description
The bucket to return info on.
Type
String
Example
foo_bucket
Required
Yes
check-objects
Description
Check multipart object accounting.
Type
Boolean
Example
True
Required
No
fix
Description
Also fix the bucket index when checking.
Type
Boolean
Required
No
Response Entities
index
Description
Status of bucket index.
Type
String
IndexRepairFailed
Description
Bucket index repair failed.
Code
409 Conflict
Remove a bucket
Edit online
Removes an existing bucket.
Capabilities
`buckets=write`
Syntax
Request Parameters
bucket
Description
The bucket to remove.
Type
String
Example
foo_bucket
Required
Yes
purge-objects
Description
Remove a bucket’s objects before deletion.
Type
Boolean
Example
True
Required
No
Response Entities
None.
BucketNotEmpty
Description
Attempted to delete non-empty bucket.
Code
409 Conflict
ObjectRemovalFailed
Description
Unable to remove objects.
Code
409 Conflict
Link a bucket
Edit online
Link a bucket to a specified user, unlinking the bucket from any previous user.
Capabilities
`buckets=write`
Syntax
Request Parameters
bucket
Description
The bucket to link.
Type
String
Example
foo_bucket
Required
Yes
uid
Description
The user ID to link the bucket to.
Type
String
Example
foo_user
Required
Yes
bucket
Description
Container for single bucket information.
Type
Container
Parent
N/A
name
Description
The name of the bucket.
Type
String
Parent
bucket
pool
Description
The pool the bucket is stored in.
Type
String
Parent
bucket
id
Description
The unique bucket ID.
Type
String
Parent
bucket
marker
Description
Internal bucket tag.
Type
String
Parent
bucket
owner
Description
The user ID of the bucket owner.
Type
String
Parent
bucket
usage
Description
Storage usage information.
Type
Container
Parent
bucket
index
Description
Status of bucket index.
Type
String
Parent
bucket
BucketUnlinkFailed
Description
Unable to unlink bucket from specified user.
Code
409 Conflict
BucketLinkFailed
Description
Unable to link bucket to specified user.
Code
409 Conflict
Unlink a bucket
Edit online
Unlink a bucket from a specified user. Primarily useful for changing bucket ownership.
Capabilities
`buckets=write`
Syntax
bucket
Description
The bucket to unlink.
Type
String
Example
foo_bucket
Required
Yes
uid
Description
The user ID to unlink the bucket from.
Type
String
Example
foo_user
Required
Yes
BucketUnlinkFailed
Description
Unable to unlink bucket from specified user.
Code
409 Conflict
Capabilities
`buckets=read`
Syntax
Request Parameters
bucket
Description
The bucket to read the policy from.
Type
String
Example
foo_bucket
Required
Yes
object
Description
The object to read the policy from.
Type
String
Example
foo.txt
Required
No
Response Entities
policy
Description
Access control policy.
Type
Container
Parent
N/A
IncompleteBody
Description
Either bucket was not specified for a bucket policy request or bucket and object were not specified for an object policy
request.
Code
400 Bad Request
Remove an object
Edit online
Remove an existing object.
Capabilities
`buckets=write`
Syntax
Request Parameters
bucket
Description
The bucket containing the object to be removed.
Type
String
Example
foo_bucket
Required
Yes
object
Description
The object to remove
Type
String
Example
foo.txt
Required
Yes
Response Entities
None.
NoSuchObject
Description
Specified object does not exist.
Code
404 Not Found
ObjectRemovalFailed
Description
Unable to remove objects.
Code
409 Conflict
Quotas
Edit online
The administrative Operations API enables you to set quotas on users and on buckets owned by users. Quotas include the maximum
number of objects in a bucket and the maximum storage size in megabytes.
To view quotas, the user must have a users=read capability. To set, modify or disable a quota, the user must have users=write
capability.
Bucket
The bucket option allows you to specify a quota for buckets owned by a user.
Maximum Objects
The max-objects setting allows you to specify the maximum number of objects. A negative value disables this setting.
Maximum Size
The max-size option allows you to specify a quota for the maximum number of bytes. A negative value disables this setting.
Quota Scope
The quota-scope option sets the scope for the quota. The options are bucket and user.
Syntax
GET /admin/user?quota&uid=UID&quota-type=user
Syntax
PUT /admin/user?quota&uid=UID&quota-type=user
The content must include a JSON representation of the quota settings as encoded in the corresponding read operation.
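A minimal sketch of setting a user quota this way is shown below; it assumes the requests and requests-aws4auth packages, and the endpoint, credentials, and quota field names are illustrative placeholders rather than documented values.
import json
import requests
from requests_aws4auth import AWS4Auth

auth = AWS4Auth("ACCESS_KEY", "SECRET_KEY", "us-east-1", "s3")
# Assumed quota fields; mirror the JSON returned by the corresponding GET request.
quota = {"enabled": True, "max_size_kb": 10485760, "max_objects": -1}
requests.put("http://rgw.example.com:8080/admin/user",
             params={"quota": "", "uid": "foo_user", "quota-type": "user"},
             data=json.dumps(quota), auth=auth)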
Capabilities
`buckets=read`
Syntax
Request Parameters
bucket
Description
The bucket to return info on.
Type
String
Example
foo_bucket
Required
No
uid
Description
The user to retrieve bucket information for.
Type
String
Example
foo_user
Required
No
stats
Description
Return bucket statistics.
Type
Boolean
Example
True
Required
No
Response Entities
stats
Description
Per bucket information.
Type
Container
Parent
N/A
buckets
Description
Contains a list of one or more bucket containers.
Type
Container
Parent
N/A
bucket
Description
Container for single bucket information.
Type
Container
Parent
buckets
name
Description
The name of the bucket.
Type
String
Parent
bucket
pool
Description
The pool the bucket is stored in.
Type
String
Parent
bucket
id
Description
The unique bucket ID.
Type
String
Parent
bucket
marker
Description
Internal bucket tag.
Type
String
Parent
bucket
owner
Description
The user ID of the bucket owner.
Type
String
Parent
bucket
usage
Description
Storage usage information.
Type
Container
Parent
bucket
index
Description
Status of bucket index.
Type
String
Parent
bucket
If successful, then the request returns a bucket’s container with the bucket information.
IndexRepairFailed
Description
Bucket index repair failed.
Code
409 Conflict
Syntax
PUT /admin/user?quota&uid=UID&quota-type=bucket
The content must include a JSON representation of the quota settings as encoded in the corresponding read operation.
Capabilities
`usage=read`
Syntax
Request Parameters
uid
Description
The user for which the information is requested.
Type
String
Required
Yes
start
Description
The date, and optionally, the time of when the data request started. For example, 2012-09-25 16:00:00.
Type
String
Required
No
end
Description
The date, and optionally, the time of when the data request ended. For example, 2012-09-25 16:00:00.
Type
String
Required
No
show-entries
Description
Specifies whether data entries should be returned.
Type
Boolean
Required
No
show-summary
Description
Specifies whether data summary should be returned.
Type
Boolean
Required
No
Response Entities
usage
Description
A container for the usage information.
Type
Container
entries
Description
A container for the usage entries information.
Type
Container
user
Description
A container for the user data information.
Type
Container
owner
Description
The name of the user that owns the buckets.
Type
String
bucket
Description
The bucket name.
Type
String
time
Description
The lower bound of the time period for which the data is specified, rounded down to the beginning of the first relevant hour.
Type
String
epoch
Description
The time specified in seconds since 1/1/1970.
Type
String
categories
Description
A container for stats categories.
Type
Container
entry
Description
A container for stats entry.
Type
Container
category
Description
Name of the request category.
Type
String
bytes_sent
Description
Number of bytes sent by the Ceph Object Gateway.
Type
Integer
bytes_received
Description
Number of bytes received by the Ceph Object Gateway.
Type
Integer
ops
Description
Number of operations.
Type
Integer
successful_ops
Description
Number of successful operations.
Type
Integer
summary
Description
A container for the usage summary information.
Type
Container
total
Description
A container for stats summary aggregated total.
Type
Container
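As a hedged Python sketch only (the requests and requests-aws4auth packages, endpoint, and credentials are assumptions), reading the usage information described above could look like the following.
import requests
from requests_aws4auth import AWS4Auth

auth = AWS4Auth("ACCESS_KEY", "SECRET_KEY", "us-east-1", "s3")
# GET /admin/usage returns the entries and summary containers described above.
r = requests.get("http://rgw.example.com:8080/admin/usage",
                 params={"uid": "foo_user", "show-entries": "true",
                         "show-summary": "true", "format": "json"},
                 auth=auth)
print(r.json())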
Capabilities
`usage=write`
Syntax
Request Parameters
uid
Description
The user for which the information is requested.
Type
String
Example
foo_user
Required
Yes
start
Description
The date, and optionally, the time of when the data request started.
Type
String
Example
2012-09-25 16:00:00
Required
No
end
Description
The date, and optionally, the time of when the data request ended.
Type
String
Example
2012-09-25 16:00:00
Required
No
remove-all
Description
Required when uid is not specified, in order to acknowledge multi-user data removal.
Type
Boolean
Example
True
Required
No
AccessDenied
Description
Access denied.
Code
403 Forbidden
InternalError
Description
Internal server error.
Code
500 Internal Server Error
NoSuchUser
Description
User does not exist.
Code
404 Not Found
NoSuchBucket
Description
Bucket does not exist.
Code
404 Not Found
NoSuchKey
Description
No such access key.
Code
404 Not Found
Prerequisites
S3 limitations
Accessing the Ceph Object Gateway with the S3 API
S3 bucket operations
S3 object operations
S3 select operations (Technology Preview)
Prerequisites
Edit online
A RESTful client.
S3 limitations
Edit online
IMPORTANT: The following limitations should be used with caution. There are implications related to your hardware selections, so
you should always discuss these requirements with your IBM account team.
Maximum object size when using Amazon S3: Individual Amazon S3 objects can range in size from a minimum of 0B to a
maximum of 5TB. The largest object that can be uploaded in a single PUT is 5GB. For objects larger than 100MB, you should
consider using the Multipart Upload capability.
Maximum metadata size when using Amazon S3: There is no defined limit on the total size of user metadata that can be
applied to an object, but a single HTTP request is limited to 16,000 bytes.
The amount of data overhead IBM Storage cluster produces to store S3 objects and metadata: The estimate here is 200-
300 bytes plus the length of the object name. Versioned objects consume additional space proportional to the number of
versions. Also, transient overhead is produced during multi-part upload and other transactional updates, but these overheads
are recovered during garbage collection.
Reference
Edit online
Prerequisites
S3 authentication
S3 server-side encryption
S3 access control lists
Preparing access to the Ceph Object Gateway using S3
Accessing the Ceph Object Gateway using Ruby AWS S3
Accessing the Ceph Object Gateway using Ruby AWS SDK
Accessing the Ceph Object Gateway using PHP
Secure Token Service
Prerequisites
Edit online
A RESTful client.
S3 authentication
Edit online
Requests to the Ceph Object Gateway can be either authenticated or unauthenticated. Ceph Object Gateway assumes
unauthenticated requests are sent by an anonymous user. Ceph Object Gateway supports canned ACLs.
For most use cases, clients use existing open source libraries like the Amazon SDK’s AmazonS3Client for Java, and Python Boto.
With open source libraries you simply pass in the access key and secret key and the library builds the request header and
authentication signature for you. However, you can create requests and sign them too.
Example
PUT /buckets/bucket/object.mpeg HTTP/1.1
Host: cname.domain.com
Date: Mon, 2 Jan 2012 00:01:01 +0000
Content-Encoding: mpeg
Content-Length: 9999999
Authorization: AWS ACCESS_KEY:HASH_OF_HEADER_AND_SECRET
In the above example, replace ACCESS_KEY with the value for the access key ID followed by a colon (:). Replace
HASH_OF_HEADER_AND_SECRET with a hash of a canonicalized header string and the secret corresponding to the access key ID.
Normalize header
5. Ensure you have a Date header AND ensure the specified date uses GMT and not an offset.
9. Combine multiple instances of the same field name into a single field and separate the field values with a comma.
10. Replace white space and line breaks in header values with a single space.
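As a hedged Python sketch of the computation described above (the keys are placeholders, and the canonical string covers only this simple PUT with no Content-MD5, Content-Type, or x-amz headers):
import base64
import hashlib
import hmac
from email.utils import formatdate

access_key = "MY_ACCESS_KEY"
secret_key = "MY_SECRET_KEY"
date = formatdate(usegmt=True)

# Canonical string: VERB, Content-MD5, Content-Type, Date, CanonicalizedResource.
string_to_sign = "PUT\n\n\n{}\n/bucket/object.mpeg".format(date)
signature = base64.b64encode(
    hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
).decode()
print("Authorization: AWS {}:{}".format(access_key, signature))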
Reference
Edit online
For additional details, consult the Signing and Authenticating REST Requests section of Amazon Simple Storage Service
documentation.
NOTE: IBM does NOT support S3 object encryption of Static Large Object (SLO) or Dynamic Large Object (DLO).
IMPORTANT: To use encryption, client requests MUST send requests over an SSL connection. IBM does not support S3 encryption
from a client unless the Ceph Object Gateway uses SSL. However, for testing purposes, administrators can disable SSL during testing
by setting the rgw_crypt_require_ssl configuration setting to false at runtime, using the ceph config set client.rgw
command, and then restarting the Ceph Object Gateway instance.
In a production environment, it might not be possible to send encrypted requests over SSL. In such a case, send requests using
HTTP with server-side encryption.
Customer-provided Keys
When using customer-provided keys, the S3 client passes an encryption key along with each request to read or write
encrypted data. It is the customer’s responsibility to manage those keys. Customers must remember which key the Ceph
Object Gateway used to encrypt each object.
Ceph Object Gateway implements the customer-provided key behavior in the S3 API according to the Amazon SSE-C specification.
Since the customer handles the key management and the S3 client passes keys to the Ceph Object Gateway, the Ceph Object
Gateway requires no special configuration to support this encryption mode.
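A minimal boto3 sketch of a customer-provided key upload is shown below; it assumes an SSL-enabled gateway endpoint, and the endpoint, credentials, bucket, and the 256-bit key are placeholders.
import boto3

s3 = boto3.client('s3',
                  endpoint_url='https://rgw.example.com',
                  aws_access_key_id='MY_ACCESS_KEY',
                  aws_secret_access_key='MY_SECRET_KEY')
key = b'0' * 32  # 256-bit key managed by the client, never stored by the gateway
s3.put_object(Bucket='mybucket', Key='hello.txt', Body=b'Hello World!',
              SSECustomerAlgorithm='AES256', SSECustomerKey=key)
# Reads must supply the same key parameters.
obj = s3.get_object(Bucket='mybucket', Key='hello.txt',
                    SSECustomerAlgorithm='AES256', SSECustomerKey=key)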
Ceph Object Gateway implements the key management service behavior in the S3 API according to the Amazon SSE-KMS
specification.
IMPORTANT: Currently, the only tested key management implementations are HashiCorp Vault, and OpenStack Barbican. However,
OpenStack Barbican is a Technology Preview and is not supported for use in production systems.
Reference
Edit online
Amazon SSE-C
Amazon SSE-KMS
Prerequisites
Edit online
Procedure
Edit online
2. Add a wildcard to the DNS server that you are using for the gateway as mentioned in the Add a wildcard to the DNS section.
You can also set up the gateway node for local DNS caching. To do so, execute the following steps:
Replace IP_OF_GATEWAY_NODE and FQDN_OF_GATEWAY_NODE with the IP address and FQDN of the gateway node.
Replace IP_OF_GATEWAY_NODE and FQDN_OF_GATEWAY_NODE with the IP address and FQDN of the gateway node.
3. Create the radosgw user for S3 access carefully and copy the generated access_key and secret_key. You will need these
keys for S3 access and subsequent bucket management tasks. For more details, see Create an S3 user.
Prerequisites
Edit online
Internet access.
Procedure
Edit online
NOTE: The above command will install ruby and its essential dependencies like rubygems and ruby-libs. If somehow the
command does not install all the dependencies, install them separately.
Syntax
#!/usr/bin/env ruby
require 'aws/s3'
require 'resolv-replace'
AWS::S3::Base.establish_connection!(
:server => FQDN_OF_GATEWAY_NODE,
:port => 8080,
:access_key_id => MY_ACCESS_KEY,
:secret_access_key => MY_SECRET_KEY
)
Replace FQDN_OF_GATEWAY_NODE with the FQDN of the Ceph Object Gateway node.
Example
require 'aws/s3'
require 'resolv-replace'
AWS::S3::Base.establish_connection!(
:server => 'testclient.englab.pnq.redhat.com',
:port => '8080',
:access_key_id => '98J4R9P22P5CDL65HKP8',
:secret_access_key => '6C+jcaP0dp0+FZfrRNgyGA9EzRy25pURldwje049'
)
If you have provided the values correctly in the file, the output of the command will be 0.
#!/usr/bin/env ruby
load 'conn.rb'
AWS::S3::Bucket.create('my-new-bucket1')
If the output of the command is true it would mean that bucket my-new-bucket1 was created successfully.
#!/usr/bin/env ruby
load 'conn.rb'
AWS::S3::Service.buckets.each do |bucket|
puts "{bucket.name}\t{bucket.creation_date}"
end
#!/usr/bin/env ruby
load 'conn.rb'
AWS::S3::S3Object.store(
'hello.txt',
'Hello World!',
'my-new-bucket1',
:content_type => 'text/plain'
)
This will create a file hello.txt with the string Hello World!.
#!/usr/bin/env ruby
load 'conn.rb'
new_bucket = AWS::S3::Bucket.find('my-new-bucket1')
new_bucket.each do |object|
puts "{object.key}\t{object.about['content-length']}\t{object.about['last-modified']}"
end
#!/usr/bin/env ruby
load 'conn.rb'
AWS::S3::Bucket.delete('my-new-bucket1')
NOTE: Edit the create_bucket.rb file to create empty buckets, for example, my-new-bucket4, my-new-bucket5. Next,
edit the above-mentioned del_empty_bucket.rb file accordingly before trying to delete empty buckets.
#!/usr/bin/env ruby
load 'conn.rb'
#!/usr/bin/env ruby
load 'conn.rb'
AWS::S3::S3Object.delete('hello.txt', 'my-new-bucket1')
Prerequisites
Edit online
Internet access.
Procedure
Edit online
NOTE: The above command will install ruby and its essential dependencies like rubygems and ruby-libs. If somehow the
command does not install all the dependencies, install them separately.
Syntax
#!/usr/bin/env ruby
require 'aws-sdk'
require 'resolv-replace'
Aws.config.update(
endpoint: 'http://FQDN_OF_GATEWAY_NODE:8080',
access_key_id: 'MY_ACCESS_KEY',
secret_access_key: 'MY_SECRET_KEY',
force_path_style: true,
region: 'us-east-1'
)
Replace FQDN_OF_GATEWAY_NODE with the FQDN of the Ceph Object Gateway node. Replace MY_ACCESS_KEY and
MY_SECRET_KEY with the access_key and secret_key that were generated when you created the radosgw user for S3
access as mentioned in the Create an S3 user section.
Example
#!/usr/bin/env ruby
require 'aws-sdk'
require 'resolv-replace'
Aws.config.update(
endpoint: 'http://testclient.englab.pnq.redhat.com:8080',
access_key_id: '98J4R9P22P5CDL65HKP8',
secret_access_key: '6C+jcaP0dp0+FZfrRNgyGA9EzRy25pURldwje049',
force_path_style: true,
region: 'us-east-1'
)
Syntax
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.create_bucket(bucket: 'my-new-bucket2')
If the output of the command is true, this means that bucket my-new-bucket2 was created successfully.
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.list_buckets.buckets.each do |bucket|
puts "{bucket.name}\t{bucket.creation_date}"
end
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.put_object(
key: 'hello.txt',
body: 'Hello World!',
bucket: 'my-new-bucket2',
content_type: 'text/plain'
)
This will create a file hello.txt with the string Hello World!.
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.list_objects(bucket: 'my-new-bucket2').contents.each do |object|
puts "{object.key}\t{object.size}"
end
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.delete_bucket(bucket: 'my-new-bucket2')
NOTE: Edit the create_bucket.rb file to create empty buckets, for example, my-new-bucket6, my-new-bucket7. Next,
edit the above-mentioned del_empty_bucket.rb file accordingly before trying to delete empty buckets.
load 'conn.rb'
s3_client = Aws::S3::Client.new
Aws::S3::Bucket.new('my-new-bucket2', client: s3_client).clear!
s3_client.delete_bucket(bucket: 'my-new-bucket2')
#!/usr/bin/env ruby
load 'conn.rb'
s3_client = Aws::S3::Client.new
s3_client.delete_object(key: 'hello.txt', bucket: 'my-new-bucket2')
IMPORTANT: The examples given below are tested against php v5.4.16 and aws-sdk v2.8.24. DO NOT use the latest version
of aws-sdk for php as it requires php >= 5.5. php 5.5 is not available in the default repositories of RHEL 7. If you want to use
php 5.5, you will have to enable epel and other third-party repositories. Also, the configuration options for php 5.5 and the latest
version of aws-sdk are different.
Prerequisites
Edit online
Internet access.
4. Copy the extracted aws directory to the project directory. For example:
Syntax
<?php
define('AWS_KEY', 'MY_ACCESS_KEY');
define('AWS_SECRET_KEY', 'MY_SECRET_KEY');
define('HOST', 'FQDN_OF_GATEWAY_NODE');
define('PORT', '8080');
use Aws\S3\S3Client;
Replace FQDN_OF_GATEWAY_NODE with the FQDN of the gateway node. Replace PATH_TO_AWS with the absolute path to the
extracted aws directory that you copied to the php project directory.
If you have provided the values correctly in the file, the output of the command will be 0.
Syntax
<?php
include 'conn.php';
?>
Syntax
<?php
include 'conn.php';
$blist = $client->listBuckets();
echo "Buckets belonging to " . $blist['Owner']['ID'] . ":\n";
foreach ($blist['Buckets'] as $b) {
echo "{$b['Name']}\t{$b['CreationDate']}\n";
}
?>
Syntax
<?php
include 'conn.php';
$key = 'hello.txt';
$source_file = './hello.txt';
$acl = 'private';
$bucket = 'my-new-bucket3';
$client->upload($bucket, $key, fopen($source_file, 'r'), $acl);
?>
Syntax
<?php
include 'conn.php';
Syntax
<?php
include 'conn.php';
NOTE: Edit the create_bucket.php file to create empty buckets, for example, my-new-bucket4, my-new-bucket5. Next,
edit the above-mentioned del_empty_bucket.php file accordingly before trying to delete empty buckets.
IMPORTANT: Deleting a non-empty bucket is currently not supported in version 2 and newer of aws-sdk for php.
Syntax
<?php
include 'conn.php';
$client->deleteObject(array(
'Bucket' => 'my-new-bucket3',
'Key' => 'hello.txt',
));
?>
Reference
Edit online
See the Configuring and using STS Lite with Keystone section for details on STS Lite and Keystone.
See the Working around the limitations of using STS Lite with Keystone section for details on the limitations of STS Lite and
Keystone.
AssumeRole
This API returns a set of temporary credentials for cross-account access. These temporary credentials honor both the permission
policies attached to the role and the policy attached in the AssumeRole API request. The RoleArn and the RoleSessionName request
parameters are required, but the other request parameters are optional.
RoleArn
Description
The Amazon Resource Name (ARN) of the role to assume, with a length of 20 to 2048 characters.
Type
String
Required
Yes
RoleSessionName
Description
Identifies the role session name to assume. The role session name can uniquely identify a session when different principals
or different reasons assume a role. This parameter’s value has a length of 2 to 64 characters. The =, ,, ., @, and - characters
are allowed, but no spaces are allowed.
Type
String
Required
Yes
Policy
Description
An identity and access management policy (IAM) in a JSON format for use in an inline session. This parameter’s value has a
length of 1 to 2048 characters.
Type
String
Required
No
DurationSeconds
Description
The duration of the session in seconds, with a minimum value of 900 seconds to a maximum value of 43200 seconds. The
default value is 3600 seconds.
Type
Integer
Required
No
ExternalId
Description
When assuming a role for another account, provide the unique external identifier if available. This parameter’s value has a
length of 2 to 1224 characters.
Type
String
Required
No
SerialNumber
Description
A user’s identification number from their associated multi-factor authentication (MFA) device. The parameter’s value can be
the serial number of a hardware device or a virtual device, with a length of 9 to 256 characters.
Type
String
Required
No
TokenCode
Description
The value generated from the multi-factor authentication (MFA) device, if the trust policy requires MFA. If an MFA device is
required, and if this parameter’s value is empty or expired, then the AssumeRole call returns an "access denied" error message.
This parameter’s value has a fixed length of 6 characters.
Type
String
Required
No
AssumeRoleWithWebIdentity
This API returns a set of temporary credentials for users who have been authenticated by an application, such as OpenID Connect or
OAuth 2.0 Identity Provider. The RoleArn and the RoleSessionName request parameters are required, but the other request
parameters are optional.
RoleArn
Description
The Amazon Resource Name (ARN) of the role to assume, with a length of 20 to 2048 characters.
Type
String
Required
Yes
RoleSessionName
Description
Identifies the role session name to assume. The role session name can uniquely identify a session when different principals
or different reasons assume a role. This parameter’s value has a length of 2 to 64 characters. The =, ,, ., @, and - characters
are allowed, but no spaces are allowed.
Type
String
Required
Yes
Policy
Description
An identity and access management policy (IAM) in a JSON format for use in an inline session. This parameter’s value has a
length of 1 to 2048 characters.
Type
String
Required
No
DurationSeconds
Description
The duration of the session in seconds, with a minimum value of 900 seconds to a maximum value of 43200 seconds. The
default value is 3600 seconds.
Type
Integer
Required
No
ProviderId
Description
The fully qualified host component of the domain name from the identity provider. This parameter’s value is only valid for
OAuth 2.0 access tokens, with a length of 4 to 2048 characters.
Type
String
Required
No
WebIdentityToken
Description
The OpenID Connect identity token or OAuth 2.0 access token provided from an identity provider. This parameter’s value has
a length of 4 to 2048 characters.
Type
String
Required
No
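A minimal boto3 sketch of this call is shown below, patterned on the AssumeRole example later in this document; the endpoint, credentials, role ARN, and token are placeholders.
import boto3

client = boto3.client('sts',
                      aws_access_key_id='MY_ACCESS_KEY',
                      aws_secret_access_key='MY_SECRET_KEY',
                      endpoint_url='https://rgw.example.com:8080',
                      region_name='')
response = client.assume_role_with_web_identity(
    RoleArn='arn:aws:iam:::role/S3Access',
    RoleSessionName='Bob',
    DurationSeconds=3600,
    WebIdentityToken='WEB_IDENTITY_TOKEN')
print(response['Credentials']['AccessKeyId'])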
Reference
Edit online
See the Examples using the Secure Token Service APIs for more details.
NOTE: The S3 and STS APIs co-exist in the same namespace, and both can be accessed from the same endpoint in the Ceph Object
Gateway.
Prerequisites
Edit online
Procedure
Edit online
1. Set the following configuration options for the Ceph Object Gateway client:
Syntax
The rgw_sts_key is the STS key for encrypting or decrypting the session token and is exactly 16 hex characters.
Example
2. Restart the Ceph Object Gateway for the added key to take effect.
NOTE: Use the output from the ceph orch ps command, under the NAME column, to get the SERVICE_TYPE.ID information.
a. To restart the Ceph Object Gateway on an individual node in the storage cluster:
Syntax
Example
b. To restart the Ceph Object Gateways on all nodes in the storage cluster:
Syntax
Example
See Secure Token Service application programming interfaces for more details on the STS APIs.
See the The basics of Ceph configuration for more details on using the Ceph configuration database.
Prerequisites
Edit online
Procedure
Edit online
Syntax
Example
Syntax
Example
3. Add a condition to the role trust policy using the Secure Token Service (STS) API:
Syntax
"{\"Version\":\"2020-01-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":
{\"Federated\":[\"arn:aws:iam:::oidc-provider/IDP_URL\"]},\"Action\":
[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":
{\"IDP_URL:app_id\":\"AUD_FIELD\"\}\}\}\]\}"
IMPORTANT: The app_id in the syntax example above must match the AUD_FIELD field of the incoming token.
Reference
Edit online
See the Obtaining the Root CA Thumbprint for an OpenID Connect Identity Provider article on Amazon’s website.
See the Examples using the Secure Token Service APIs for more details.
Prerequisites
Edit online
Procedure
Edit online
Syntax
curl -k -v \
-X GET \
-H "Content-Type: application/x-www-form-urlencoded" \
"IDP_URL:8000/CONTEXT/realms/REALM/.well-known/openid-configuration" \
| jq .
Example
Syntax
curl -k -v \
-X GET \
-H "Content-Type: application/x-www-form-urlencoded" \
"IDP_URL/CONTEXT/realms/REALM/protocol/openid-connect/certs" \
| jq .
Example
3. Copy the result of the x5c response from the previous command and paste it into the certificate.crt file. Include -----
BEGIN CERTIFICATE----- at the beginning and -----END CERTIFICATE----- at the end.
Syntax
Example
5. Remove all the colons from the SHA1 fingerprint and use this as the input for creating the IDP entity in the IAM request.
Reference
Edit online
See the Obtaining the Root CA Thumbprint for an OpenID Connect Identity Provider article on Amazon’s website.
See the Secure Token Service application programming interfaces for more details on the STS APIs.
See the Examples using the Secure Token Service APIs for more details.
NOTE: Both S3 and STS APIs can be accessed using the same endpoint in Ceph Object Gateway.
Prerequisites
Edit online
Procedure
Edit online
1. Set the following configuration options for the Ceph Object Gateway client:
Syntax
The rgw_sts_key is the STS key for encrypting or decrypting the session token and is exactly 16 hex characters.
Example
Example
+------------+--------------------------------------------------------+
| Field | Value |
+------------+--------------------------------------------------------+
| access | b924dfc87d454d15896691182fdeb0ef |
3. Use the generated credentials to get back a set of temporary security credentials using GetSessionToken API:
Example
import boto3
access_key = 'b924dfc87d454d15896691182fdeb0ef'
secret_key = '6a2142613c504c42a94ba2b82147dc28'
client = boto3.client('sts',
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
endpoint_url='https://www.example.com/rgw',
region_name='',
)
response = client.get_session_token(
DurationSeconds=43200
)
Example
s3client = boto3.client('s3',
aws_access_key_id = response['Credentials']['AccessKeyId'],
aws_secret_access_key = response['Credentials']['SecretAccessKey'],
aws_session_token = response['Credentials']['SessionToken'],
endpoint_url='https://www.example.com/s3',
region_name='')
bucket = s3client.create_bucket(Bucket='my-new-shiny-bucket')
response = s3client.list_buckets()
for bucket in response["Buckets"]:
print "{name}\t{created}".format(
name = bucket['Name'],
created = bucket['CreationDate'],
)
Syntax
Example
Syntax
Example
Syntax
Example
d. Now another user can assume the role of the gwadmin user. For example, the gwuser user can assume the
permissions of the gwadmin user.
Example
6. Use the AssumeRole API call, providing the access_key and secret_key values from the assuming user:
Example
import boto3
access_key = '11BS02LGFB6AL6H1ADMW'
secret_key = 'vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY'
client = boto3.client('sts',
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
endpoint_url='https://www.example.com/rgw',
region_name='',
)
response = client.assume_role(
RoleArn='arn:aws:iam:::role/application_abc/component_xyz/S3Access',
RoleSessionName='Bob',
DurationSeconds=3600
)
Reference
Edit online
See the Test S3 Access for more information on installing the Boto Python module.
Prerequisites
IBM Storage Ceph 989
Edit online
Procedure
Edit online
a. Add the following four lines to the code block:
class SigV4Auth(BaseSigner):
    """
    Sign a request with Signature V4.
    """
    REQUIRES_REGION = True
b. Add the following two lines to the code block:
def _modify_request_before_signing(self, request):
    if 'Authorization' in request.headers:
        del request.headers['Authorization']
    self._set_necessary_date_headers(request)
    if self.credentials.token:
        if 'X-Amz-Security-Token' in request.headers:
            del request.headers['X-Amz-Security-Token']
        request.headers['X-Amz-Security-Token'] = self.credentials.token
Reference
Edit online
See the Test S3 Access section for more information on installing the Boto Python module.
S3 bucket operations
Edit online
As a developer, you can perform bucket operations with the Amazon S3 application programming interface (API) through the Ceph
Object Gateway.
The following table lists the Amazon S3 functional operations for buckets, along with the function's support status.
Prerequisites
S3 create bucket notifications
S3 get bucket notifications
S3 delete bucket notifications
Accessing bucket host names
S3 list buckets
S3 return a list of bucket objects
S3 create a new bucket
S3 put bucket website
S3 get bucket website
S3 delete bucket website
S3 delete a bucket
S3 bucket lifecycle
S3 GET bucket lifecycle
S3 create or replace a bucket lifecycle
S3 delete a bucket lifecycle
S3 get bucket location
S3 get bucket versioning
S3 put bucket versioning
S3 get bucket access control lists
S3 put bucket Access Control Lists
S3 get bucket cors
S3 put bucket cors
S3 delete a bucket cors
S3 list bucket object versions
S3 head bucket
S3 list multipart uploads
S3 bucket policies
S3 get the request payment configuration on a bucket
Prerequisites
Edit online
A RESTful client.
To create a bucket notification for s3:objectCreate and s3:objectRemove events, use PUT:
Example
client.put_bucket_notification_configuration(
Bucket=bucket_name,
NotificationConfiguration={
'TopicConfigurations': [
{
'Id': notification_name,
'TopicArn': topic_arn,
'Events': ['s3:ObjectCreated:*', 's3:ObjectRemoved:*']
}]})
IMPORTANT: IBM supports ObjectCreate events, such as put, post, multipartUpload, and copy. IBM also supports
ObjectRemove events, such as object_delete and s3_multi_object_delete.
Request Entities
NotificationConfiguration
Description
list of TopicConfiguration entities.
Type
Container
Required
Yes
TopicConfiguration
Description
Id, Topic, and list of Event entities.
Type
Container
Required
Yes
id
Description
Name of the notification.
Type
String
Required
Yes
Topic
Description
Topic Amazon Resource Name (ARN)
Type
String
Required
Yes
Event
Description
List of supported events. Multiple event entities can be used. If omitted, all events are handled.
Type
String
Required
No
Filter
Description
S3Key, S3Metadata and S3Tags entities.
Type
Container
Required
No
S3Key
Description
A list of FilterRule entities, for filtering based on the object key. At most, 3 entities may be in the list, for example Name
would be prefix, suffix, or regex. All filter rules in the list must match for the filter to match.
Type
Container
Required
No
S3Metadata
Description
A list of FilterRule entities, for filtering based on object metadata. All filter rules in the list must match the metadata
defined on the object. However, the object still matches if it has other metadata entries not listed in the filter.
Type
Container
Required
No
S3Tags
Description
A list of FilterRule entities, for filtering based on object tags. All filter rules in the list must match the tags defined on the
object. However, the object still matches if it has other tags not listed in the filter.
Type
Container
Required
No
S3Key.FilterRule
Description
Name and Value entities. Name is : prefix, suffix, or regex. The Value would hold the key prefix, key suffix, or a regular
expression for matching the key, accordingly.
Type
Container
Required
Yes
S3Metadata.FilterRule
Description
Name and Value entities. Name is the name of the metadata attribute for example x-amz-meta-xxx. The value is the
expected value for this attribute.
Type
Container
Required
Yes
S3Tags.FilterRule
Description
Name and Value entities. Name is the tag key, and the value is the tag value.
Type
Container
Required
Yes
HTTP response
400
Status Code
MalformedXML
Description
The XML is not well-formed.
400
Status Code
InvalidArgument
Description
Missing Id or missing or invalid topic ARN or invalid event.
404
Status Code
NoSuchBucket
Description
The bucket does not exist.
404
Status Code
NoSuchKey
Description
The topic does not exist.
Syntax
Example
Example Response
<NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<TopicConfiguration>
<Id></Id>
<Topic></Topic>
<Event></Event>
<Filter>
<S3Key>
<FilterRule>
<Name></Name>
<Value></Value>
</FilterRule>
</S3Key>
<S3Metadata>
<FilterRule>
<Name></Name>
<Value></Value>
</FilterRule>
</S3Metadata>
<S3Tags>
<FilterRule>
<Name></Name>
<Value></Value>
</FilterRule>
</S3Tags>
</Filter>
</TopicConfiguration>
</NotificationConfiguration>
NOTE: The notification subresource returns the bucket notification configuration or an empty NotificationConfiguration
element. The caller must be the bucket owner.
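The following is a minimal boto3 sketch of retrieving a bucket's notification configuration, assuming a reachable gateway endpoint at https://www.example.com/rgw and placeholder credentials and bucket name.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Returns the bucket notification configuration; the caller must be the bucket owner.
response = s3client.get_bucket_notification_configuration(Bucket='my-bucket')
for topic_configuration in response.get('TopicConfigurations', []):
    print(topic_configuration['Id'],
          topic_configuration['TopicArn'],
          topic_configuration['Events'])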
Request Entities
notification-id
Description
Name of the notification. All notifications are listed if the ID is not provided.
Type
String
NotificationConfiguration
Description
list of TopicConfiguration entities.
Type
Container
Required
Yes
TopicConfiguration
Description
Id, Topic, and list of Event entities.
Type
Container
Required
Yes
id
Description
Name of the notification.
Type
String
Required
Yes
Topic
Description
Topic Amazon Resource Name (ARN)
Type
String
Required
Yes
Event
Description
Handled event. Multiple event entities may exist.
Type
String
Required
Yes
Filter
Description
The filters for the specified configuration.
Type
Container
Required
No
HTTP response
404
Status Code
NoSuchBucket
Description
The bucket does not exist.
404
Status Code
NoSuchKey
Description
The notification does not exist, if a notification ID was provided.
NOTE: Notification deletion is an extension to the S3 notification API. Any defined notifications on a bucket are deleted when the
bucket is deleted. Deleting an unknown notification, for example a double delete, is not considered an error.
Syntax
Example
Request Entities
notification-id
Description
Name of the notification. All notifications on the bucket are deleted if the notification ID is not provided.
Type
String
HTTP response
404
Status Code
NoSuchBucket
Description
The bucket does not exist.
Example
GET /mybucket HTTP/1.1
Host: cname.domain.com
Example
GET / HTTP/1.1
Host: mybucket.cname.domain.com
TIP: IBM prefers the first method, because the second method requires expensive domain certification and DNS wild cards.
S3 list buckets
Edit online
GET / returns a list of buckets created by the user making the request. GET / only returns buckets created by an authenticated
user. You cannot make an anonymous request.
Syntax
GET / HTTP/1.1
Host: cname.domain.com
Authorization: AWS ACCESS_KEY:HASH_OF_HEADER_AND_SECRET
Response Entities
Buckets
Description
Container for list of buckets.
Type
Container
Bucket
Description
Container for bucket information.
Type
Container
Name
Description
Bucket name.
Type
String
CreationDate
Description
UTC time when the bucket was created.
Type
Date
ListAllMyBucketsResult
Description
A container for the result.
Type
Container
Owner
Description
A container for the bucket owner’s ID and DisplayName.
Type
Container
ID
Description
The bucket owner’s ID.
Type
String
DisplayName
Description
The bucket owner’s display name.
Type
String
Syntax
Parameters
prefix
Description
Only returns objects that contain the specified prefix.
Type
String
delimiter
Description
The delimiter between the prefix and the rest of the object name.
Type
String
marker
Description
A beginning index for the list of objects returned.
Type
String
max-keys
Description
The maximum number of keys to return. Default is 1000.
Type
Integer
HTTP Response
200
Status Code
Description
Buckets retrieved.
GET /BUCKET returns a container for buckets with the following fields:
ListBucketResult
Description
The container for the list of objects.
Type
Entity
Name
Description
The name of the bucket whose contents will be returned.
Type
String
Prefix
Description
A prefix for the object keys.
Type
String
Marker
Description
A beginning index for the list of objects returned.
Type
String
MaxKeys
Description
The maximum number of keys returned.
Type
Integer
Delimiter
Description
If set, objects with the same prefix will appear in the CommonPrefixes list.
Type
String
IsTruncated
Description
If true, only a subset of the bucket’s contents were returned.
Type
Boolean
CommonPrefixes
Description
If multiple objects contain the same prefix, they will appear in this list.
Type
Container
The ListBucketResult contains objects, where each object is within a Contents container.
Contents
Description
A container for the object.
Type
Object
Key
Description
The object’s key.
Type
String
LastModified
Description
The object’s last-modified date and time.
Type
Date
ETag
Description
An MD5 hash of the object. ETag is an entity tag.
Type
String
Size
Description
The object’s size.
Type
Integer
StorageClass
Description
Should always return STANDARD.
Type
String
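As a hedged illustration of the parameters and response fields described above, the following boto3 sketch lists a bucket's objects using the prefix, delimiter, and max-keys parameters; the endpoint, credentials, and bucket name are placeholder values.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# GET /BUCKET with the prefix, delimiter, and max-keys query parameters.
response = s3client.list_objects(Bucket='my-bucket',
                                 Prefix='keypre/',
                                 Delimiter='/',
                                 MaxKeys=100)

# Each entry in Contents corresponds to one Contents container in the XML response.
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'], obj['ETag'], obj['LastModified'])

# IsTruncated indicates whether only a subset of the bucket's contents was returned.
print(response['IsTruncated'])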
Constraints
Bucket names must be a series of one or more labels. Adjacent labels are separated by a single period (.). Bucket names can
contain lowercase letters, numbers, and hyphens. Each label must start and end with a lowercase letter or a number.
NOTE: The above constraints are relaxed if rgw_relaxed_s3_bucket_names is set to true. The bucket names must still be
unique, cannot be formatted as an IP address, and can contain letters, numbers, periods, dashes, and underscores, up to 255
characters long.
Syntax
Parameters x-amz-acl
Description
Canned ACLs.
Valid Values
private, public-read,public-read-write, authenticated-read
Required
No
HTTP Response
If the bucket name is unique, within constraints, and unused, the operation will succeed. If a bucket with the same name already
exists and the user is the bucket owner, the operation will succeed. If the bucket name is already in use, the operation will fail.
409
Status Code
BucketAlreadyExists
Description
Bucket already exists under different user’s ownership.
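A minimal boto3 sketch of creating a bucket with the x-amz-acl canned ACL described above; the endpoint, credentials, and bucket name are placeholders, and the name must satisfy the constraints listed earlier.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# The ACL parameter maps to the x-amz-acl request header.
s3client.create_bucket(Bucket='my-new-bucket', ACL='private')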
NOTE: The put operation requires the S3:PutBucketWebsite permission. By default, only the bucket owner can configure the
website attached to a bucket.
Syntax
PUT /BUCKET?website-configuration HTTP/1.1
Example
PUT /testbucket?website-configuration HTTP/1.1
Reference
Edit online
NOTE: Get operation requires the S3:GetBucketWebsite permission. By default, only the bucket owner can read the bucket
website configuration.
Syntax
GET /BUCKET?website-configuration HTTP/1.1
Example
GET /testbucket?website-configuration HTTP/1.1
Reference
Edit online
Syntax
DELETE /BUCKET?website-configuration HTTP/1.1
Example
DELETE /testbucket?website-configuration HTTP/1.1
Reference
Edit online
S3 delete a bucket
Edit online
Deletes a bucket. You can reuse bucket names following a successful bucket removal.
Syntax
HTTP Response
204
Description
Bucket removed.
S3 bucket lifecycle
Edit online
You can use a bucket lifecycle configuration to manage your objects so they are stored effectively throughout their lifetime. The S3
API in the Ceph Object Gateway supports a subset of the AWS bucket lifecycle actions:
Expiration
This defines the lifespan of objects within a bucket. It takes the number of days the object should live or an expiration date, at
which point the Ceph Object Gateway deletes the object. If the bucket does not have versioning enabled, the Ceph Object Gateway
deletes the object permanently. If the bucket has versioning enabled, the Ceph Object Gateway creates a delete marker for the
current version, and then deletes the current version.
NoncurrentVersionExpiration
This defines the lifespan of non-current object versions within a bucket. To use this feature, the bucket must enable
versioning. It takes the number of days a non-current object should live, at which point Ceph Object Gateway will delete the
non-current object.
AbortIncompleteMultipartUpload
This defines the number of days an incomplete multipart upload should live before it is aborted.
The lifecycle configuration contains one or more rules using the <Rule> element.
Example
<LifecycleConfiguration>
<Rule>
<Prefix/>
<Status>Enabled</Status>
<Expiration>
<Days>10</Days>
</Expiration>
</Rule>
</LifecycleConfiguration>
A lifecycle rule can apply to all or a subset of objects in a bucket based on the <Filter> element that you specify in the lifecycle
rule. You can specify a filter in several ways:
Key prefixes
Object tags
Key prefixes
You can apply a lifecycle rule to a subset of objects based on the key name prefix. For example, specifying <keypre/> would apply
to objects that begin with keypre/:
<LifecycleConfiguration>
<Rule>
<Status>Enabled</Status>
<Filter>
<Prefix>keypre/</Prefix>
</Filter>
</Rule>
</LifecycleConfiguration>
You can also apply different lifecycle rules to objects with different key prefixes:
<LifecycleConfiguration>
<Rule>
<Status>Enabled</Status>
<Filter>
Object tags
You can apply a lifecycle rule to only objects with a specific tag using the <Key> and <Value> elements:
<LifecycleConfiguration>
<Rule>
<Status>Enabled</Status>
<Filter>
<Tag>
<Key>key</Key>
<Value>value</Value>
</Tag>
</Filter>
</Rule>
</LifecycleConfiguration>
In a lifecycle rule, you can specify a filter based on both the key prefix and one or more tags. They must be wrapped in the <And>
element. A filter can have only one prefix, and zero or more tags:
<LifecycleConfiguration>
<Rule>
<Status>Enabled</Status>
<Filter>
<And>
<Prefix>key-prefix</Prefix>
<Tag>
<Key>key1</Key>
<Value>value1</Value>
</Tag>
<Tag>
<Key>key2</Key>
<Value>value2</Value>
</Tag>
...
</And>
</Filter>
</Rule>
</LifecycleConfiguration>
Additional Resources
See the S3 GET bucket lifecycle section for details on getting a bucket lifecycle.
See the S3 create or replace a bucket lifecycle section for details on creating a bucket lifecycle.
See the S3 delete a bucket lifecycle section for details on deleting a bucket lifecycle.
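A minimal boto3 sketch, assuming a placeholder endpoint, credentials, and bucket name, that applies a lifecycle configuration equivalent to the 10-day Expiration rule with a key prefix filter shown above.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Objects whose keys begin with keypre/ expire 10 days after creation.
s3client.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-keypre',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'keypre/'},
                'Expiration': {'Days': 10}
            }
        ]
    })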
Syntax
See the S3 common request headers in Appendix B for more information about common request headers.
Response
Syntax
Request Headers
content-md5
Description
A base64-encoded MD5 hash of the message.
Valid Values
A string. No defaults or constraints.
Required
No
Reference
Edit online
See the S3 common request headers section for more information on Amazon S3 common request headers.
See the S3 bucket lifecycles for more information on Amazon S3 bucket lifecycles.
Syntax
Response
Reference
Edit online
See the S3 common request headers for more information on Amazon S3 common request headers.
See the S3 common response status codes for more information on Amazon S3 common response status codes.
Syntax
Response Entities
LocationConstraint
Description
The zone group where the bucket resides; an empty string for the default zone group.
Type
String
Syntax
Enabled: Enables versioning for the objects in the bucket. All objects added to the bucket receive a unique version ID.
Suspended: Disables versioning for the objects in the bucket. All objects added to the bucket receive the version ID null.
Syntax
Example
VersioningConfiguration
Description
A container for the request.
Type
Container
Status
Description
Sets the versioning state of the bucket. Valid Values: Suspended/Enabled
Type
String
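The following hedged boto3 sketch sets the bucket versioning state described above; the endpoint, credentials, and bucket name are placeholder values.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Status may be 'Enabled' or 'Suspended'.
s3client.put_bucket_versioning(Bucket='my-bucket',
                               VersioningConfiguration={'Status': 'Enabled'})

# Read the current versioning state back.
print(s3client.get_bucket_versioning(Bucket='my-bucket').get('Status'))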
Syntax
Response Entities
AccessControlPolicy
Description
A container for the response.
Type
Container
AccessControlList
Description
A container for the ACL information.
Type
Container
Owner
Description
A container for the bucket owner’s ID and DisplayName.
Type
Container
ID
Description
The bucket owner’s ID.
Type
String
DisplayName
Description
The bucket owner’s display name.
Type
String
Grant
Description
A container for Grantee and Permission.
Type
Container
Grantee
Description
A container for the DisplayName and ID of the user receiving a grant of permission.
Type
Container
Permission
Description
The permission given to the Grantee bucket.
Type
String
Syntax
AccessControlList
Description
A container for the ACL information.
Type
Container
Owner
Description
A container for the bucket owner’s ID and DisplayName.
Type
Container
ID
Description
The bucket owner’s ID.
Type
String
DisplayName
Description
The bucket owner’s display name.
Type
String
Grant
Description
A container for Grantee and Permission.
Type
Container
Grantee
Description
A container for the DisplayName and ID of the user receiving a grant of permission.
Type
Container
Permission
Description
The permission given to the Grantee bucket.
Type
String
Syntax
Syntax
Syntax
Syntax
You can specify parameters for GET /BUCKET?versions, but none of them are required.
Parameters
prefix
Description
Returns in-progress uploads whose keys contain the specified prefix.
Type
String
delimiter
Description
The delimiter between the prefix and the rest of the object name.
Type
String
key-marker
Description
The beginning marker for the list of uploads.
max-keys
Description
The maximum number of in-progress uploads. The default is 1000.
Type
Integer
version-id-marker
Description
Specifies the object version to begin the list.
Type
String
Response Entities
KeyMarker
Description
The key marker specified by the key-marker request parameter, if any.
Type
String
NextKeyMarker
Description
The key marker to use in a subsequent request if IsTruncated is true.
Type
String
NextUploadIdMarker
Description
The upload ID marker to use in a subsequent request if IsTruncated is true.
Type
String
IsTruncated
Description
If true, only a subset of the bucket’s upload contents were returned.
Type
Boolean
Size
Description
The size of the uploaded part.
Type
Integer
DisplayName
Description
The owner’s display name.
Type
String
ID
Description
The owner’s ID.
Type
String
Owner
Description
A container for the ID and DisplayName of the user who owns the object.
Type
Container
StorageClass
Description
The method used to store the resulting object. STANDARD or REDUCED_REDUNDANCY
Type
String
Version
Description
Container for the version information.
Type
Container
versionId
Description
Version ID of an object.
Type
String
versionIdMarker
Description
The last version of the key in a truncated response.
Type
String
S3 head bucket
Edit online
Calls HEAD on a bucket to determine if it exists and if the caller has access permissions. Returns 200 OK if the bucket exists and the
caller has permissions; 404 Not Found if the bucket does not exist; and, 403 Forbidden if the bucket exists but the caller does
not have access permissions.
Syntax
Syntax
You can specify parameters for GET /BUCKET?uploads, but none of them are required.
Parameters
prefix
Description
Returns in-progress uploads whose keys contain the specified prefix.
Type
String
delimiter
Description
The delimiter between the prefix and the rest of the object name.
Type
String
key-marker
Description
The beginning marker for the list of uploads.
Type
String
max-keys
Description
The maximum number of in-progress uploads. The default is 1000.
Type
Integer
max-uploads
Description
The maximum number of multipart uploads. The range is from 1-1000. The default is 1000.
Type
Integer
version-id-marker
Description
Ignored if key-marker isn’t specified. Specifies the ID of the first upload to list in lexicographical order at or following the ID.
Type
String
Response Entities
ListMultipartUploadsResult
Description
A container for the results.
Type
Container
ListMultipartUploadsResult.Prefix
Description
Type
String
Bucket
Description
The bucket that will receive the bucket contents.
Type
String
KeyMarker
Description
The key marker specified by the key-marker request parameter, if any.
Type
String
UploadIdMarker
Description
The marker specified by the upload-id-marker request parameter, if any.
Type
String
NextKeyMarker
Description
The key marker to use in a subsequent request if IsTruncated is true.
Type
String
NextUploadIdMarker
Description
The upload ID marker to use in a subsequent request if IsTruncated is true.
Type
String
MaxUploads
Description
The max uploads specified by the max-uploads request parameter.
Type
Integer
Delimiter
Description
If set, objects with the same prefix will appear in the CommonPrefixes list.
Type
String
IsTruncated
Description
If true, only a subset of the bucket’s upload contents were returned.
Type
Boolean
Upload
Type
Container
Key
Description
The key of the object once the multipart upload is complete.
Type
String
UploadId
Description
The ID that identifies the multipart upload.
Type
String
Initiator
Description
Contains the ID and DisplayName of the user who initiated the upload.
Type
Container
DisplayName
Description
The initiator’s display name.
Type
String
ID
Description
The initiator’s ID.
Type
String
Owner
Description
A container for the ID and DisplayName of the user who owns the uploaded object.
Type
Container
StorageClass
Description
The method used to store the resulting object. STANDARD or REDUCED_REDUNDANCY
Type
String
Initiated
Description
The date and time the user initiated the upload.
Type
Date
CommonPrefixes
Description
If multiple objects contain the same prefix, they will appear in this list.
Type
Container
CommonPrefixes.Prefix
Description
The substring of the key after the prefix as defined by the prefix request parameter.
Type
String
S3 bucket policies
Edit online
The Ceph Object Gateway supports a subset of the Amazon S3 policy language applied to buckets.
Ceph Object Gateway manages S3 Bucket policies through standard S3 operations rather than using the radosgw-admin CLI tool.
Example
Limitations
s3:AbortMultipartUpload
s3:CreateBucket
s3:DeleteBucketPolicy
s3:DeleteBucket
s3:DeleteBucketWebsite
s3:DeleteObject
s3:DeleteObjectVersion
s3:GetBucketAcl
s3:GetBucketCORS
s3:GetBucketLocation
s3:GetBucketPolicy
s3:GetBucketVersioning
s3:GetBucketWebsite
s3:GetLifecycleConfiguration
s3:GetObjectAcl
s3:GetObject
s3:GetObjectTorrent
s3:GetObjectVersionAcl
s3:GetObjectVersion
s3:GetObjectVersionTorrent
s3:ListAllMyBuckets
s3:ListBucketMultiPartUploads
s3:ListBucket
s3:ListBucketVersions
s3:ListMultipartUploadParts
s3:PutBucketAcl
s3:PutBucketCORS
s3:PutBucketPolicy
s3:PutBucketRequestPayment
s3:PutBucketVersioning
s3:PutBucketWebsite
s3:PutLifecycleConfiguration
s3:PutObjectAcl
s3:PutObject
s3:PutObjectVersionAcl
NOTE: Ceph Object Gateway does not support setting policies on users, groups, or roles.
The Ceph Object Gateway uses the RGW ‘tenant’ identifier in place of the Amazon twelve-digit account ID. Ceph Object Gateway
administrators who want to use policies between Amazon Web Service (AWS) S3 and Ceph Object Gateway S3 will have to use the
Amazon account ID as the tenant ID when creating users.
With AWS S3, all tenants share a single namespace. By contrast, Ceph Object Gateway gives every tenant its own namespace of
buckets. At present, Ceph Object Gateway clients trying to access a bucket belonging to another tenant MUST address it as
tenant:bucket in the S3 request.
In the AWS, a bucket policy can grant access to another account, and that account owner can then grant access to individual users
with user permissions. Since Ceph Object Gateway does not yet support user, role, and group permissions, account owners will need
to grant access directly to individual users.
IMPORTANT: Granting an entire account access to a bucket grants access to ALL users in that account.
Ceph Object Gateway only supports the following condition keys:
aws:CurrentTime
aws:PrincipalType
aws:Referer
aws:SecureTransport
aws:SourceIp
aws:UserAgent
aws:username
Ceph Object Gateway ONLY supports the following condition keys for the ListBucket action:
s3:prefix
s3:delimiter
s3:max-keys
Impact on Swift
Ceph Object Gateway provides no functionality to set bucket policies under the Swift API. However, bucket policies that have been
set with the S3 API govern Swift as well as S3 operations.
Ceph Object Gateway matches Swift credentials against Principals specified in a policy.
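Because bucket policies are managed through standard S3 operations, a policy can be applied with boto3 as in the following sketch; the endpoint, credentials, tenant, user, and bucket names are placeholder assumptions.
Example
import json
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Grant a user in another tenant permission to list the bucket.
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'AWS': ['arn:aws:iam::tenant2:user/user2']},
        'Action': 's3:ListBucket',
        'Resource': ['arn:aws:s3:::my-bucket']
    }]
}

s3client.put_bucket_policy(Bucket='my-bucket', Policy=json.dumps(policy))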
Syntax
Syntax
Request Entities
Payer
Description
Specifies who pays for the download and request fees.
RequestPaymentConfiguration
Description
A container for Payer.
Type
Container
Extensions employed to specify an explicit tenant differ according to the protocol and authentication system used.
In the following example, a colon character separates tenant and bucket. Thus a sample URL would be:
https://rgw.domain.com/tenant:bucket
By contrast, a simple Python example separates the tenant and bucket in the bucket method itself:
Example
aws_access_key_id="_home_markdown_jenkins_workspace_Transform_in_SSEG27_5.3_developer_con_api_multi
-tenant-bucket-operations_TESTER",
aws_secret_access_key="test123",
host="rgw.domain.com",
calling_format = OrdinaryCallingFormat()
)
bucket = c.get_bucket("tenant:bucket")
NOTE: It’s not possible to use S3-style subdomains using multi-tenancy, since host names cannot contain colons or any other
separators that are not already valid in bucket names. Using a period creates an ambiguous syntax. Therefore, the bucket-in-URL-
path format has to be used with multi-tenancy.
Reference
Edit online
See the Multi Tenancy section under User Management for additional details.
Using this feature, bucket policies, access point policies, and object permissions can be overridden to allow public access. By
default, new buckets, access points, and objects do not allow public access.
The S3 API in the Ceph Object Gateway supports a subset of the AWS public access settings:
BlockPublicPolicy
This defines the setting to allow users to manage access point and bucket policies. This setting does not allow the users to
publicly share the bucket or the objects it contains. Existing access point and bucket policies are not affected by enabling this
setting. Setting this option to TRUE rejects calls to PUT access point policy for all of the bucket's same-account access points.
IMPORTANT: Apply this setting at the account level so that users cannot alter a specific bucket's block public access setting.
NOTE: The TRUE setting only works if the specified policy allows public access.
RestrictPublicBuckets
This defines the setting to restrict access to a bucket or access point with public policy. The restriction applies to only AWS
service principals and authorized users within the bucket owner's account and access point owner's account.
This blocks cross-account access to the access point or bucket, except for the cases specified, while still allowing users within the
account to manage the access points or buckets.
Enabling this setting does not affect existing access point or bucket policies. It only defines that Amazon S3 blocks public and cross-
account access derived from any public access point or bucket policy, including non-public delegation to specific accounts.
NOTE: Access control lists (ACLs) are not currently supported by IBM Storage Ceph.
Bucket policies are assumed to be public unless defined otherwise. To block public access, a bucket policy must give access only to
fixed values for one or more of the following:
NOTE: A fixed value does not contain a wildcard (*) or an AWS Identity and Access Management Policy Variable.
aws:SourceArn
aws:SourceVpc
aws:SourceVpce
aws:SourceOwner
aws:SourceAccount
s3:x-amz-server-side-encryption-aws-kms-key-id
s3:DataAccessPointArn
NOTE: When used in a bucket policy, this value can contain a wildcard for the access point name without rendering the policy public, as long as the account ID is fixed.
s3:DataAccessPointAccount
For example, the following policy is considered public because the condition value contains a wildcard:
Example
{
"Principal": "*",
"Resource": "*",
"Action": "s3:PutObject",
"Effect": "Allow",
"Condition": { "StringLike": {"aws:SourceVpc": "vpc-*"}}
}
To make a policy non-public, include any of the condition keys with a fixed value.
Example
{
"Principal": "*",
"Resource": "*",
"Action": "s3:PutObject",
Additional Resources
Edit online
For more information about creating or modifying a PublicAccessBlock see S3 PUT PublicAccessBlock.
See the Blocking public access to your Amazon S3 storage section of Amazon Simple Storage Service (S3) documentation.
S3 GET PublicAccessBlock
Edit online
To get the S3 Block Public Access feature configured, use GET and specify a destination AWS account.
Syntax
Request headers
For more information about common request headers, see S3 common request headers.
Response
Additional Resources
Edit online
For more information about the S3 Public Access Block feature, see S3 Block Public Access.
S3 PUT PublicAccessBlock
Edit online
Use this to create or modify the PublicAccessBlock configuration for an S3 bucket.
IMPORTANT: If the PublicAccessBlock configuration is different between the bucket and the account, Amazon S3 uses the most
restrictive combination of the bucket-level and account-level settings.
Syntax
Request headers
For more information about common request headers, see S3 common request headers.
Response
The response is an HTTP 200 response and is returned with an empty HTTP body.
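A minimal boto3 sketch, with placeholder endpoint, credentials, and bucket name, that sets the BlockPublicPolicy and RestrictPublicBuckets settings described above.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Apply the PublicAccessBlock configuration to the bucket.
s3client.put_public_access_block(
    Bucket='my-bucket',
    PublicAccessBlockConfiguration={
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    })

# Read the configuration back.
print(s3client.get_public_access_block(Bucket='my-bucket')['PublicAccessBlockConfiguration'])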
Additional Resources
Edit online
For more information about the S3 Public Access Block feature, see S3 Block Public Access.
S3 delete PublicAccessBlock
Edit online
Use this to delete the PublicAccessBlock configuration for an S3 bucket.
Syntax
Request headers
For more information about common request headers, see S3 common request headers.
Response
The response is an HTTP 200 response and is returned with an empty HTTP body.
Additional Resources
Edit online
For more information about the S3 Public Access Block feature, see S3 Block Public Access.
S3 object operations
Edit online
As a developer, you can perform object operations with the Amazon S3 application programming interface (API) through the Ceph
Object Gateway.
The following table lists the Amazon S3 functional operations for objects, along with each function's support status.
Feature Status
Get Object Supported
Get Object Information Supported
Put Object Lock Supported
Get Object Lock Supported
Put Object Legal Hold Supported
Get Object Legal Hold Supported
Put Object Retention Supported
Prerequisites
S3 get an object from a bucket
S3 get information on an object
S3 put object lock
S3 get object lock
S3 put object legal hold
S3 get object legal hold
S3 put object retention
S3 get object retention
S3 put object tagging
S3 get object tagging
S3 delete object tagging
S3 add an object to a bucket
S3 delete an object
S3 delete multiple objects
S3 get an object’s Access Control List (ACL)
S3 set an object’s Access Control List (ACL)
S3 copy an object
S3 add an object to a bucket using HTML forms
S3 determine options for a request
S3 initiate a multipart upload
S3 add a part to a multipart upload
S3 list the parts of a multipart upload
S3 assemble the uploaded parts
S3 copy a multipart upload
S3 abort a multipart upload
S3 Hadoop interoperability
Prerequisites
Edit online
A RESTful client.
Prerequisites
Edit online
A RESTful client.
Syntax
Syntax
Request Headers
range
Description
The range of the object to retrieve.
Valid Values
Range:bytes=beginbyte-endbyte
Required
No
if-modified-since
Description
Gets only if modified since the timestamp.
Valid Values
Timestamp
Required
No
if-match
Description
Gets only if object ETag matches ETag.
Valid Values
Entity Tag
Required
No
if-none-match
Description
Gets only if the object ETag does not match the ETag.
Valid Values
Entity Tag
Required
No
Response Headers
Content-Range
Description
Data range, will only be returned if the range header field was specified in the request.
x-amz-version-id
Description
Returns the version ID or null.
Syntax
Syntax
Request Headers
range
Description
The range of the object to retrieve.
Valid Values
Range:bytes=beginbyte-endbyte
Required
No
if-modified-since
Description
Gets only if modified since the timestamp.
Valid Values
Timestamp
Required
No
if-match
Description
Gets only if object ETag matches ETag.
Valid Values
Entity Tag
Required
No
if-none-match
Description
Gets only if the object ETag does not match the ETag.
Required
No
Response Headers
x-amz-version-id
Description
Returns the version ID or null.
IMPORTANT: Enable the object lock when creating a bucket; otherwise, the operation fails.
Syntax
Example
Request Entities
ObjectLockConfiguration
Description
A container for the request.
Type
Container
Required
Yes
ObjectLockEnabled
Description
Indicates whether this bucket has an object lock configuration enabled.
Type
String
Required
Yes
Rule
Description
The object lock rule in place for the specified bucket.
Type
Container
Required
No
DefaultRetention
Description
The default retention period applied to new objects placed in the specified bucket.
Type
Container
Required
No
Mode
Description
The default object lock retention mode. Valid values: GOVERNANCE/COMPLIANCE.
Type
Container
Required
Yes
Days
Description
The number of days specified for the default retention period.
Type
Integer
Required
No
Years
Description
The number of years specified for the default retention period.
Type
Integer
Required
No
HTTP Response
400
Status Code
MalformedXML
Description
The XML is not well-formed.
409
Status Code
InvalidBucketState
Description
The bucket object lock is not enabled.
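A hedged boto3 sketch of setting the default object lock rule described above on a bucket that was created with object lock enabled; the endpoint, credentials, and bucket name are placeholders.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# The bucket must have been created with ObjectLockEnabledForBucket=True.
s3client.put_object_lock_configuration(
    Bucket='my-locked-bucket',
    ObjectLockConfiguration={
        'ObjectLockEnabled': 'Enabled',
        'Rule': {
            'DefaultRetention': {'Mode': 'GOVERNANCE', 'Days': 30}
        }
    })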
Reference
Edit online
Syntax
Example
Response Entities
ObjectLockConfiguration
Description
A container for the request.
Type
Container
Required
Yes
ObjectLockEnabled
Description
Indicates whether this bucket has an object lock configuration enabled.
Type
String
Required
Yes
Rule
Description
The object lock rule is in place for the specified bucket.
Type
Container
Required
No
DefaultRetention
Description
The default retention period applied to new objects placed in the specified bucket.
Type
Container
Required
No
Mode
Description
The default object lock retention mode. Valid values: GOVERNANCE/COMPLIANCE.
Type
Container
Required
Yes
Days
Description
The number of days specified for the default retention period.
Type
Integer
Required
No
Years
Description
The number of years specified for the default retention period.
Type
Integer
Required
No
Reference
Edit online
Syntax
Example
Request Entities
LegalHold
Description
A container for the request.
Type
Container
Required
Yes
Status
Description
Indicates whether the specified object has a legal hold in place. Valid values: ON/OFF
Type
String
Required
Yes
Syntax
Example
Response Entities
LegalHold
Description
A container for the request.
Type
Container
Required
Yes
Status
Description
Indicates whether the specified object has a legal hold in place. Valid values: ON/OFF
Type
String
Required
Yes
Reference
Edit online
NOTE: During this period, your object is write-once-read-many (WORM) protected and cannot be overwritten or deleted.
Syntax
Example
Request Entities
Retention
Description
A container for the request.
Type
Container
Required
Yes
Mode
Description
Retention mode for the specified object. Valid values: GOVERNANCE/COMPLIANCE
Type
String
Required
Yes
RetainUntilDate
Description
Retention date. Format: 2020-01-05T00:00:00.000Z
Type
Timestamp
Required
Yes
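The following boto3 sketch, with placeholder endpoint, credentials, bucket, and key, sets the retention Mode and RetainUntilDate described above on an object in an object lock enabled bucket.
Example
from datetime import datetime, timezone
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# RetainUntilDate corresponds to the timestamp format shown above.
s3client.put_object_retention(
    Bucket='my-locked-bucket',
    Key='my-object',
    Retention={
        'Mode': 'GOVERNANCE',
        'RetainUntilDate': datetime(2030, 1, 5, tzinfo=timezone.utc)
    })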
Reference
Edit online
Syntax
Example
Response Entities
Retention
Description
A container for the request.
Type
Container
Required
Yes
Mode
Description
Retention mode for the specified object. Valid values: GOVERNANCE/COMPLIANCE
Type
String
Required
Yes
RetainUntilDate
Description
Retention date. Format: 2020-01-05T00:00:00.000Z
Type
Timestamp
Required
Yes
Reference
Edit online
Syntax
Example
Request Entities
Tagging
Description
A container for the request.
Type
Container
Required
Yes
TagSet
Type
String
Required
Yes
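A minimal boto3 sketch of tagging an object and reading the tags back; the endpoint, credentials, bucket, key, and tag values are placeholders.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# The TagSet carries one or more Key/Value pairs.
s3client.put_object_tagging(
    Bucket='my-bucket',
    Key='my-object',
    Tagging={'TagSet': [{'Key': 'key1', 'Value': 'value1'}]})

# Retrieve the tags; add VersionId=... for a specific version in a versioned bucket.
print(s3client.get_object_tagging(Bucket='my-bucket', Key='my-object')['TagSet'])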
Reference
Edit online
NOTE: For a versioned bucket, you can have multiple versions of an object in your bucket. To retrieve tags of any other version, add
the versionId query parameter in the request.
Syntax
Example
Reference
Edit online
NOTE: To delete tags of a specific object version, add the versionId query parameter in the request.
Syntax
Example
Reference
Edit online
Syntax
Request Headers
content-md5
Description
A base64-encoded MD5 hash of the message.
Valid Values
A string. No defaults or constraints.
Required
No
content-type
Description
A standard MIME type.
Valid Values
Any MIME type. Default: binary/octet-stream.
Required
No
x-amz-meta-<...>*
Description
User metadata. Stored with the object.
Valid Values
A string up to 8kb. No defaults.
Required
No
x-amz-acl
Description
A canned ACL.
Valid Values
private, public-read, public-read-write, authenticated-read
Required
No
Response Headers
x-amz-version-id
Description
Returns the version ID or null.
S3 delete an object
Edit online
Syntax
To delete an object when versioning is on, you must specify the versionId subresource and the version of the object to delete.
Syntax
Syntax
Add the versionId subresource to retrieve the ACL for a particular version:
Syntax
Response Headers
x-amz-version-id
Description
Returns the version ID or null.
Response Entities
AccessControlPolicy
Description
A container for the response.
Type
Container
AccessControlList
Description
A container for the ACL information.
Type
Container
Owner
Description
A container for the bucket owner’s ID and DisplayName.
ID
Description
The bucket owner’s ID.
Type
String
DisplayName
Description
The bucket owner’s display name.
Type
String
Grant
Description
A container for Grantee and Permission.
Type
Container
Grantee
Description
A container for the DisplayName and ID of the user receiving a grant of permission.
Type
Container
Permission
Description
The permission given to the Grantee bucket.
Type
String
Syntax
PUT /BUCKET/OBJECT?acl
Request Entities
AccessControlPolicy
Description
A container for the response.
Type
Container
AccessControlList
Description
A container for the ACL information.
Type
Container
Owner
Description
A container for the bucket owner’s ID and DisplayName.
Type
Container
ID
Description
The bucket owner’s ID.
Type
String
DisplayName
Description
The bucket owner’s display name.
Type
String
Grant
Description
A container for Grantee and Permission.
Type
Container
Grantee
Description
A container for the DisplayName and ID of the user receiving a grant of permission.
Type
Container
Permission
Description
The permission given to the Grantee bucket.
Type
String
S3 copy an object
Edit online
To copy an object, use PUT and specify a destination bucket and the object name.
Syntax
Request Headers
x-amz-copy-source
Description
The source bucket name + object name.
Valid Values
BUCKET/OBJECT
Required
Yes
x-amz-acl
Description
A canned ACL.
Valid Values
private, public-read, public-read-write, authenticated-read
Required
No
x-amz-copy-if-modified-since
Description
Copies only if modified since the timestamp.
Valid Values
Timestamp
Required
No
x-amz-copy-if-unmodified-since
Description
Copies only if unmodified since the timestamp.
Valid Values
Timestamp
Required
No
x-amz-copy-if-match
Description
Copies only if object ETag matches ETag.
Valid Values
Entity Tag
Required
No
x-amz-copy-if-none-match
Description
Copies only if the object ETag does not match the ETag.
Valid Values
Entity Tag
Required
No
Response Entities
CopyObjectResult
Description
A container for the response elements.
Type
Container
LastModified
Description
The object’s last-modified date and time.
Type
Date
Etag
Description
The ETag of the new object.
Type
String
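A hedged boto3 sketch of the copy operation; the endpoint, credentials, and the source and destination bucket and object names are placeholders. The CopySource parameter maps to the x-amz-copy-source header above.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Copy source-bucket/source-object to dest-bucket/dest-object.
response = s3client.copy_object(
    Bucket='dest-bucket',
    Key='dest-object',
    CopySource={'Bucket': 'source-bucket', 'Key': 'source-object'},
    ACL='private')

# CopyObjectResult holds the new object's ETag and LastModified values.
print(response['CopyObjectResult']['ETag'], response['CopyObjectResult']['LastModified'])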
Syntax
Syntax
Syntax
POST /BUCKET/OBJECT?uploads
Request Headers
content-md5
Description
A base64-encoded MD5 hash of the message.
Valid Values
A string. No defaults or constraints.
Required
No
content-type
Description
A standard MIME type.
Valid Values
Any MIME type. Default: binary/octet-stream.
Required
No
x-amz-meta-<...>
Description
User metadata. Stored with the object.
Valid Values
A string up to 8kb. No defaults.
Required
No
x-amz-acl
Description
A canned ACL.
Valid Values
private, public-read, public-read-write, authenticated-read
Required
No
Response Entities
InitiatedMultipartUploadsResult
Description
A container for the results.
Type
Container
Bucket
Description
The bucket that will receive the object contents.
Type
String
Key
Description
The key specified by the key request parameter, if any.
Type
String
UploadId
Description
The ID specified by the upload-id request parameter identifying the multipart upload, if any.
Type
String
Specify the uploadId subresource and the upload ID to add a part to a multi-part upload:
HTTP Response
404
Status Code
NoSuchUpload
Description
Specified upload-id does not match any initiated upload on this object.
Syntax
Response Entities
InitiatedMultipartUploadsResult
Description
A container for the results.
Type
Container
Bucket
Description
The bucket that will receive the object contents.
Type
String
Key
Description
The key specified by the key request parameter, if any.
Type
String
UploadId
Description
The ID specified by the upload-id request parameter identifying the multipart upload, if any.
Type
String
Initiator
Description
Contains the ID and DisplayName of the user who initiated the upload.
Type
Container
ID
Description
The initiator’s ID.
Type
String
DisplayName
Description
The initiator’s display name.
Type
String
Owner
Description
A container for the ID and DisplayName of the user who owns the uploaded object.
Type
Container
StorageClass
Description
The method used to store the resulting object. STANDARD or REDUCED_REDUNDANCY
Type
String
PartNumberMarker
Description
The part marker to use in a subsequent request if IsTruncated is true. Precedes the list.
Type
String
NextPartNumberMarker
Description
The next part marker to use in a subsequent request if IsTruncated is true. The end of the list.
Type
String
IsTruncated
Description
If true, only a subset of the object’s upload contents were returned.
Type
Boolean
Part
Description
A container for Key, Part, InitiatorOwner, StorageClass, and Initiated elements.
Type
Container
PartNumber
Description
The identifier of the part.
Type
Integer
ETag
Description
The part’s entity tag.
Type
String
Size
Description
The size of the uploaded part.
Type
Integer
Specify the uploadId subresource and the upload ID to complete a multi-part upload:
Syntax
Request Entities
CompleteMultipartUpload
Description
A container consisting of one or more parts.
Type
Container
Required
Yes
Part
Description
A container for the PartNumber and ETag.
Type
Container
Required
Yes
PartNumber
Description
The identifier of the part.
Type
Integer
Required
Yes
ETag
Description
The part’s entity tag.
Type
String
Response Entities
CompleteMultipartUploadResult
Description
A container for the response.
Type
Container
Location
Description
The resource identifier (path) of the new object.
Type
URI
bucket
Description
The name of the bucket that contains the new object.
Type
String
Key
Description
The object’s key.
Type
String
ETag
Description
The entity tag of the new object.
Type
String
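The following boto3 sketch ties the initiate, upload part, and complete steps together; the endpoint, credentials, bucket, and key are placeholders, and the single 5 MB part is only for illustration.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

# Initiate the multipart upload (POST /BUCKET/OBJECT?uploads).
upload = s3client.create_multipart_upload(Bucket='my-bucket', Key='big-object')
upload_id = upload['UploadId']

# Upload a single part; real uploads usually loop over several parts.
part = s3client.upload_part(Bucket='my-bucket', Key='big-object',
                            PartNumber=1, UploadId=upload_id,
                            Body=b'0' * (5 * 1024 * 1024))

# Assemble the uploaded parts by sending their PartNumber and ETag values.
s3client.complete_multipart_upload(
    Bucket='my-bucket', Key='big-object', UploadId=upload_id,
    MultipartUpload={'Parts': [{'PartNumber': 1, 'ETag': part['ETag']}]})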
Specify the uploadId subresource and the upload ID to perform a multi-part upload copy:
Syntax
Request Headers
x-amz-copy-source
Description
The source bucket name and object name.
Valid Values
BUCKET/OBJECT
x-amz-copy-source-range
Description
The range of bytes to copy from the source object.
Valid Values
Range: bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9
indicates that you want to copy the first ten bytes of the source.
Required
No
Response Entities
CopyPartResult
Description
A container for all response elements.
Type
Container
ETag
Description
Returns the ETag of the new part.
Type
String
LastModified
Description
Returns the date the part was last modified.
Type
String
Reference
Edit online
For more information about this feature, see the Amazon S3 site.
Specify the uploadId subresource and the upload ID to abort a multi-part upload:
Syntax
S3 Hadoop interoperability
Edit online
Ceph Object Gateway is fully compatible with the S3A connector that ships with Hadoop 2.7.3.
As a developer, you can use the S3 select API with high-level analytic applications like Spark-SQL to improve latency and throughput.
For example, with a CSV S3 object containing several gigabytes of data, you can extract a single column that is filtered by another
column using the following query:
Example
Currently, the S3 object must retrieve data from the Ceph OSD through the Ceph Object Gateway before filtering and extracting data.
There is improved performance when the object is large and the query is more specific.
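As a hedged sketch of calling S3 select from boto3, the following assumes a CSV object without a header row at a placeholder endpoint, bucket, and key; the query mirrors the positional-column filtering described above.
Example
import boto3

s3client = boto3.client('s3',
                        endpoint_url='https://www.example.com/rgw',
                        aws_access_key_id='ACCESS_KEY',
                        aws_secret_access_key='SECRET_KEY')

response = s3client.select_object_content(
    Bucket='my-bucket',
    Key='data.csv',
    ExpressionType='SQL',
    Expression="select _1 from s3object where int(_2) > 100;",
    InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}},
    OutputSerialization={'CSV': {}})

# The payload is an event stream of Records and Stats events.
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
    elif 'Stats' in event:
        print(event['Stats']['Details'])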
Prerequisites
S3 select content from an object
S3 supported select functions
S3 alias programming construct
S3 CSV parsing explained
Prerequisites
Edit online
A RESTful client.
NOTE: You must specify the data serialization format for the response. You must have s3:GetObject permission for this operation.
Syntax
Example
Bucket
Description
The bucket to select object content from.
Type
String
Required
Yes
Key
Description
The object key.
Length Constraints
Minimum length of 1.
Type
String
Required
Yes
SelectObjectContentRequest
Description
Root level tag for the select object content request parameters.
Type
String
Required
Yes
Expression
Description
The expression that is used to query the object.
Type
String
Required
Yes
ExpressionType
Description
The type of the provided expression for example SQL.
Type
String
Valid Values
SQL
Required
Yes
InputSerialization
Description
Describes the format of the data in the object that is being queried.
Type
String
Required
Yes
OutputSerialization
Description
Describes the format of the data returned, using a comma separator and a newline.
Type
String
Required
Yes
Response entities
If the action is successful, the service sends back HTTP 200 response. Data is returned in XML format by the service:
Payload
Description
Root level tag for the payload parameters.
Type
String
Required
Yes
Records
Description
The records event.
Type
Base64-encoded binary data object
Required
No
Stats
Description
The stats event.
Type
Long
Required
No
Example
Supported features
Reference
Edit online
timestamp(string)
Description
Converts string to the basic type of timestamp.
Supported
Currently it converts: yyyy:mm:dd hh:mi:dd
extract(date-part,timestamp)
Description
Returns integer according to date-part extract from input timestamp.
Supported
date-part: year,month,week,day.
dateadd(date-part ,integer,timestamp)
Description
Returns timestamp, a calculation based on the results of input timestamp and date-part.
Supported
date-part : year,month,day.
datediff(date-part,timestamp,timestamp)
Description
Return an integer, a calculated result of the difference between two timestamps according to date-part.
Supported
date-part : year,month,day,hours.
utcnow()
Description
Return timestamp of current time.
Aggregation
count()
Description
Returns integers based on the number of rows that match a condition if there is one.
sum(expression)
Description
Returns a summary of expression on each row that matches a condition if there is one.
avg(expression)
Description
Returns an average expression on each row that matches a condition if there is one.
max(expression)
Description
Returns the maximal result for all expressions that match a condition if there is one.
min(expression)
Description
Returns the minimal result for all expressions that match a condition if there is one.
String
substring(string,from,to)
Char_length
Description
Returns a number of characters in string. Character_length also does the same.
Trim
Description
Trims the leading or trailing characters from the target string, default is a blank character.
Upper\lower
Description
Converts characters into uppercase or lowercase.
NULL
The NULL value is missing or unknown; that is, NULL cannot produce a value in any arithmetic operation. The same applies to
arithmetic comparison: any comparison to NULL is NULL, that is, unknown.
Reference
Edit online
Example
select int(_1) as a1, int(_2) as a2, (a1+a2) as a3 from s3object where a3>100 and a3<300;
The csv-header-info is parsed when USE appears in the AWS CLI; this is the first row of the input object and contains the schema.
Currently, output serialization and compression-type are not supported. The S3 select engine has a CSV parser, which parses S3
objects:
The quote-character overrides the field-separator; that is, the field separator is any character between the quotes.
The escape character disables any special character except the row delimiter.
Reference
Edit online
The following table describes the support status for current Swift functional features:
Prerequisites
Swift API limitations
Create a Swift user
Swift authenticating a user
Swift container operations
Swift object operations
Swift temporary URL operations
Swift multi-tenancy container operations
Prerequisites
Edit online
A RESTful client.
Prerequisites
Edit online
A RESTful client.
Maximum metadata size when using Swift API: There is no defined limit on the total size of user metadata that can be
applied to an object, but a single HTTP request is limited to 16,000 bytes.
Prerequisites
Edit online
Syntax
Example
Syntax
Example
Syntax
Example Response
NOTE: You can retrieve data about Ceph’s Swift-compatible service by executing GET requests using the X-Storage-Url value
during authentication.
Reference
Edit online
Prerequisites
Swift container operations
Swift update a container’s Access Control List (ACL)
Swift list containers
Swift list a container’s objects
Swift create a container
Swift delete a container
Swift add or update the container metadata
Prerequisites
Edit online
A RESTful client.
NOTE: The Amazon S3 API uses the term bucket to describe a data container. When you hear someone refer to a bucket within the
Swift API, the term bucket might be construed as the equivalent of the term container.
One facet of object storage is that it does not support hierarchical paths or directories. Instead, it supports one level consisting of
one or more containers, where each container might have objects. The RADOS Gateway’s Swift-compatible API supports the notion
of pseudo-hierarchical containers, which is a means of using object naming to emulate a container, or directory, hierarchy without
actually implementing one in the storage system.
IMPORTANT: When uploading large objects to versioned Swift containers, use the --leave-segments option with the python-
swiftclient utility. Not using --leave-segments overwrites the manifest file. Consequently, an existing object is overwritten,
which leads to data loss.
Syntax
X-Container-Read
Description
The user IDs with read permissions for the container.
Type
Comma-separated string values of user IDs.
Required
No
X-Container-Write
Description
The user IDs with write permissions for the container.
Type
Comma-separated string values of user IDs.
Required
No
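A hedged Python sketch using the requests library to set the read and write ACL headers above on an existing container; the storage URL, auth token, container name, and user IDs are placeholder values obtained from Swift authentication.
Example
import requests

# Values returned by Swift authentication (X-Storage-Url and X-Auth-Token).
storage_url = 'https://rgw.domain.com/swift/v1'
token = 'AUTH_TOKEN'

# POST to the container path updates its metadata, including the ACL headers.
response = requests.post(
    storage_url + '/my-container',
    headers={
        'X-Auth-Token': token,
        'X-Container-Read': 'testuser',
        'X-Container-Write': 'testuser'
    })
print(response.status_code)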
Syntax
Request Parameters
limit
Description
Limits the number of results to the specified value.
Valid Values
N/A
Required
Yes
format
Description
Defines the format of the result.
Type
Integer
Valid Values
json or xml
Required
No
marker
Description
Returns a list of results greater than the marker value.
Type
String
Valid Values
N/A
Required
No
The response contains a list of containers, or returns with an HTTP 204 response code.
Response Entities
account
Description
A list for account information.
Type
Container
container
Description
The list of containers.
Type
Container
name
Description
The name of a container.
Type
String
bytes
Description
The size of the container.
Type
Integer
Syntax
Request Parameters
format
Description
Defines the format of the result.
Type
Integer
Valid Values
json or xml
Required
No
prefix
Description
Limits the result set to objects beginning with the specified prefix.
Type
String
Valid Values
N/A
Required
No
marker
Description
Returns a list of results greater than the marker value.
Type
String
Valid Values
N/A
Required
No
limit
Description
Limits the number of results to the specified value.
Type
Integer
Valid Values
0 - 10,000
delimiter
Description
The delimiter between the prefix and the rest of the object name.
Type
String
Valid Values
N/A
Required
No
path
Description
The pseudo-hierarchical path of the objects.
Type
String
Valid Values
N/A
Required
No
Response Entities
container
Description
The container.
Type
Container
object
Description
An object within the container.
Type
Container
name
Description
The name of an object within the container.
Type
String
hash
Description
A hash code of the object’s contents.
Type
String
last_modified
Description
The last time the object’s contents were modified.
Type
Date
content_type
Description
The type of content within the object.
Type
String
Syntax
Headers
X-Container-Read
Description
The user IDs with read permissions for the container.
Type
Comma-separated string values of user IDs.
Required
No
X-Container-Write
Description
The user IDs with write permissions for the container.
Type
Comma-separated string values of user IDs.
Required
No
X-Container-Meta-KEY
Description
A user-defined metadata key that takes an arbitrary string value.
Type
String
Required
No
X-Storage-Policy
Type
String
Required
No
If a container with the same name already exists, and the user is the container owner, then the operation will succeed. Otherwise, the
operation will fail.
HTTP Response
409
Status Code
BucketAlreadyExists
Description
The container already exists under a different user’s ownership.
Syntax
HTTP Response
204
Status Code
NoContent
Description
The container was removed.
Syntax
Request Headers
X-Container-Meta-KEY
Description
A user-defined metadata key that takes an arbitrary string value.
Type
String
Required
No
Prerequisites
Swift object operations
Swift get an object
Swift create or update an object
Swift delete an object
Swift copy an object
Swift get object metadata
Swift add or update object metadata
Prerequisites
Edit online
A RESTful client.
Syntax
Request Headers
range
Description
To retrieve a subset of an object’s contents, you can specify a byte range.
Type
Date
Required
No
If-Modified-Since
Description
Only copies if modified since the date and time of the source object’s last_modified attribute.
Type
Date
Required
No
If-Unmodified-Since
Description
Only copies if not modified since the date and time of the source object’s last_modified attribute.
Type
Date
Required
No
Copy-If-Match
Description
Copies only if the ETag in the request matches the source object’s ETag.
Type
ETag
Required
No
Copy-If-None-Match
Description
Copies only if the ETag in the request does not match the source object’s ETag.
Type
ETag
Required
No
Response Headers
Content-Range
Description
The range of the subset of object contents. Returned only if the range header field was specified in the request.
Syntax
Request Headers
ETag
Description
An MD5 hash of the object’s contents. Recommended.
Type
String
Valid Values
N/A
Required
No
Content-Type
Description
The type of content the object contains.
Type
String
Valid Values
N/A
Required
No
Transfer-Encoding
Description
Indicates whether the object is part of a larger aggregate object.
Type
String
Valid Values
chunked
Required
No
Syntax
For a PUT request, use the destination container and object name in the request, and the source container and object in the request
header.
For a Copy request, use the source container and object in the request, and the destination container and object in the request
header. You must have write permission on the container to copy an object. The destination object name must be unique within the
container. The request is not idempotent, so if you do not use a unique name, the request will update the destination object. You can
use pseudo-hierarchical syntax in the object name to distinguish the destination object from the source object of the same name if it
is under a different pseudo-hierarchical directory. You can include access control headers and metadata headers in the request.
Syntax
or alternatively:
Syntax
Request Headers
X-Copy-From
Description
Used with a PUT request to define the source container/object path.
Type
String
Required
Yes, if using PUT.
Destination
Description
Used with a COPY request to define the destination container/object path.
Type
String
Required
Yes, if using COPY.
If-Modified-Since
Description
Only copies if modified since the date and time of the source object’s last_modified attribute.
Type
Date
Required
No
If-Unmodified-Since
Description
Type
Date
Required
No
Copy-If-Match
Description
Copies only if the ETag in the request matches the source object’s ETag.
Type
ETag
Required
No
Copy-If-None-Match
Description
Copies only if the ETag in the request does not match the source object’s ETag.
Type
ETag
Required
No
Syntax
Syntax
Request Headers
X-Object-Meta-KEY
Description
A user-defined metadata key that takes an arbitrary string value.
Type
String
For this functionality, initially the value of X-Account-Meta-Temp-URL-Key and optionally X-Account-Meta-Temp-URL-Key-2
should be set. The Temp URL functionality relies on a HMAC-SHA1 signature against these secret keys.
The request method, such as GET or PUT
The expiry time, in the format of seconds since the epoch, that is, Unix time
The path of the object, starting from /v1 onwards
The above items are normalized with newlines appended between them, and a HMAC is generated using the SHA-1 hashing
algorithm against one of the Temp URL Keys posted earlier.
Example
import hmac
from hashlib import sha1
from time import time
method = 'GET'
host = 'https://objectstore.example.com'
duration_in_seconds = 300 # Duration for which the url is valid
expires = int(time() + duration_in_seconds)
path = '/v1/your-bucket/your-object'
key = 'secret'
hmac_body = '%s\n%s\n%s' % (method, expires, path)
sig = hmac.new(key.encode('utf-8'), hmac_body.encode('utf-8'), sha1).hexdigest()
rest_uri = "{host}{path}?temp_url_sig={sig}&temp_url_expires={expires}".format(
host=host, path=path, sig=sig, expires=expires)
print(rest_uri)
Example Output
https://objectstore.example.com/v1/your-bucket/your-object?
temp_url_sig=ff4657876227fc6025f04fcf1e82818266d022c6&temp_url_expires=1423200992
Request Headers
X-Account-Meta-Temp-URL-Key
Description
A user-defined key that takes an arbitrary string value.
Type
String
Required
Yes
X-Account-Meta-Temp-URL-Key-2
Description
A user-defined key that takes an arbitrary string value.
Type
String
Required
No
Extensions employed to specify an explicit tenant differ according to the protocol and authentication system used.
A colon character separates tenant and container, thus a sample URL would be:
Example
https://rgw.domain.com/tenant:container
By contrast, in a create_container() method, simply separate the tenant and container in the container method itself:
Example
create_container("tenant:container")
Prerequisites
Ceph summary
Authentication
Ceph File System
Storage cluster configuration
Prerequisites
Edit online
Ceph summary
Edit online
The method reference for using the Ceph RESTful API summary endpoint to display the Ceph summary details.
GET /api/summary
Description
Display a summary of Ceph details.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Authentication
Edit online
The method reference for using the Ceph RESTful API auth endpoint to initiate a session with IBM Storage Ceph.
POST /api/auth
Curl Example
Example
{
"password": "STRING",
"username": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
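A hedged Python sketch using the requests library to obtain a token from the auth endpoint and pass it as a bearer token on a later call; the dashboard URL, port, credentials, and certificate handling are placeholder assumptions for your environment.
Example
import requests

base_url = 'https://ceph-mgr.example.com:8443'
headers = {
    'Accept': 'application/vnd.ceph.api.v1.0+json',
    'Content-Type': 'application/json'
}

# POST /api/auth with the user name and password returns a token.
response = requests.post(base_url + '/api/auth',
                         json={'username': 'admin', 'password': 'PASSWORD'},
                         headers=headers,
                         verify=False)
token = response.json()['token']

# Use the token as a bearer token on later calls, for example GET /api/summary.
summary = requests.get(base_url + '/api/summary',
                       headers={**headers, 'Authorization': 'Bearer ' + token},
                       verify=False)
print(summary.status_code)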
POST /api/auth/check
Description
Check the requirement for an authentication token.
Example
{
"token": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
POST /api/auth/logout
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
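The following Python sketch illustrates the session flow documented above: authenticate, reuse the returned token as a bearer token, and log out. The dashboard address, the credentials, the versioned Accept header, and the use of verify=False for a self-signed certificate are assumptions for illustration only; the endpoints and request bodies are the ones listed in this section.
import requests

base_url = "https://dashboard-host:8443"            # hypothetical dashboard address
accept = "application/vnd.ceph.api.v1.0+json"        # versioned media type, assumed here

# POST /api/auth - obtain a token
resp = requests.post(base_url + "/api/auth",
                     json={"username": "admin", "password": "password"},
                     headers={"Accept": accept}, verify=False)
token = resp.json()["token"]
auth_headers = {"Authorization": "Bearer " + token, "Accept": accept}

# POST /api/auth/check - confirm the token is still valid
requests.post(base_url + "/api/auth/check",
              json={"token": token}, headers=auth_headers, verify=False)

# POST /api/auth/logout - end the session
requests.post(base_url + "/api/auth/logout", headers=auth_headers, verify=False)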
Reference
GET /api/cephfs
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cephfs/_FS_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
DELETE /api/cephfs/_FSID/client/_CLIENT_ID
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cephfs/_FSID/clients
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cephfs/_FSID/get_root_directory
Description
Display the root directory, which cannot be fetched using the ls_dir API call.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cephfs/_FSID/ls_dir
Description
List directories for a given path.
Parameters
Queries:
path - The string value where you want to start the listing. The default path is /, if not given.
depth - An integer value specifying the number of steps to go down the directory tree.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
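As a sketch of the path and depth queries described above, the following Python snippet lists the directories up to two levels below a starting path; the dashboard URL, token, file system ID, and path are placeholders.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",            # token from POST /api/auth
           "Accept": "application/vnd.ceph.api.v1.0+json"}
FS_ID = 1                                              # for example, taken from GET /api/cephfs

# List directories up to two levels below /volumes
resp = requests.get(BASE + "/api/cephfs/{}/ls_dir".format(FS_ID),
                    params={"path": "/volumes", "depth": 2},
                    headers=HEADERS, verify=False)
print(resp.json())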
GET /api/cephfs/_FSID/mds_counters
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cephfs/_FSID/quota
Description
Display the CephFS quotas for the given path.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/cephfs/_FSID/quota
Description
Sets the quota for a given path.
Parameters
Example
{
"max_bytes": "STRING",
"max_files": "STRING",
"path": "STRING"
}
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
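A minimal sketch of setting a CephFS quota through this endpoint, assuming a dashboard at dashboard-host, a valid bearer token, and file system ID 1; max_bytes and max_files are passed as strings, following the body shown above.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}
FS_ID = 1

# Limit /volumes/group1 to 10 GiB and 10,000 files (illustrative values)
body = {"path": "/volumes/group1",
        "max_bytes": str(10 * 1024 ** 3),
        "max_files": "10000"}
resp = requests.put(BASE + "/api/cephfs/{}/quota".format(FS_ID),
                    json=body, headers=HEADERS, verify=False)
print(resp.status_code)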
DELETE /api/cephfs/_FSID/snapshot
Description
Remove a snapshot.
Parameters
Queries:
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/cephfs/_FSID/snapshot
Description
Create a snapshot.
Parameters
name - A string value specifying the snapshot name. If no name is specified, then a name using the current time in
RFC3339 UTC format is generated.
Example
{
"name": "STRING",
"path": "STRING"
}
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/cephfs/_FSID/tree
Description
Remove a directory.
Parameters
Queries:
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/cephfs/_FSID/tree
Description
Creates a directory.
Parameters
Example
{
"path": "STRING"
}
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
GET /api/cluster_conf
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/cluster_conf
Example
{
"name": "STRING",
"value": "STRING"
}
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/cluster_conf
Example
{
"options": "STRING"
}
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cluster_conf/filter
Description
Display the storage cluster configuration by name.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/cluster_conf/_NAME
Parameters
Queries:
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/cluster_conf/_NAME
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
CRUSH rules
The method reference for using the Ceph RESTful API crush_rule endpoint to manage the CRUSH rules.
GET /api/crush_rule
Description
List the CRUSH rule configuration.
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/crush_rule
Example
{
"device_class": "STRING",
"failure_domain": "STRING",
"name": "STRING",
"root": "STRING"
}
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
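The following is a hedged example of creating a replicated rule through this endpoint with the body fields listed above; the rule name, root bucket, failure domain, and device class are illustrative values, and the dashboard address and token are placeholders.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Create a rule that places replicas on SSD OSDs, one per host
body = {"name": "ssd_by_host",
        "root": "default",
        "failure_domain": "host",
        "device_class": "ssd"}
resp = requests.post(BASE + "/api/crush_rule", json=body, headers=HEADERS, verify=False)
print(resp.status_code)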
DELETE /api/crush_rule/_NAME
Parameters
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/crush_rule/_NAME
Parameters
Example
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
GET /api/erasure_code_profile
Description
List erasure-coded profile information.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/erasure_code_profile
Example
{
"name": "STRING"
}
Status Codes
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/erasure_code_profile/_NAME
Parameters
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/erasure_code_profile/_NAME
Parameters
Example
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Feature toggles
The method reference for using the Ceph RESTful API feature_toggles endpoint to list the features of IBM Storage Ceph.
GET /api/feature_toggles
Description
List the features of IBM Storage Ceph.
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Grafana
The method reference for using the Ceph RESTful API grafana endpoint to manage Grafana.
POST /api/grafana/dashboards
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/grafana/url
Description
List the Grafana URL instance.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/grafana/validation/_PARAMS
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
GET /api/health/full
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/health/minimal
Description
Display the storage cluster’s minimal health report.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Host
The method reference for using the Ceph RESTful API host endpoint to display host, also known as node, information.
GET /api/host
Description
List the host specifications.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/host
Example
{
"hostname": "STRING",
"status": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
DELETE /api/host/_HOST_NAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/host/_HOST_NAME
Description
Displays information on the given host.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/host/_HOST_NAME
Description
Updates information for the given host. This method is only supported when the Ceph Orchestrator is enabled.
Parameters
Example
{
"force": true
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/host/HOST_NAME/daemons
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/host/HOST_NAME/devices
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/host/HOST_NAME/identify_device
Description
Identify a device by switching on the device’s light for a specified number of seconds.
Parameters
Example
{
"device": "STRING",
"duration": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
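As a sketch of the identify_device call described above, the snippet below blinks the identification light of a drive for 30 seconds; the host name, device path, duration, dashboard address, and token are placeholders, and the duration is passed as a string to match the documented body.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Blink the identification light of /dev/sdb on host01 for 30 seconds
body = {"device": "/dev/sdb", "duration": "30"}
resp = requests.post(BASE + "/api/host/host01/identify_device",
                     json=body, headers=HEADERS, verify=False)
print(resp.status_code)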
GET /api/host/HOST_NAME/inventory
Description
Display the inventory of the host.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/host/HOST_NAME/smart
Parameters
Example
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Logs
The method reference for using the Ceph RESTful API logs endpoint to display log information.
GET /api/logs/all
Description
View all the log configuration.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
GET /api/mgr/module
Description
View the list of managed modules.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/mgr/module/_MODULE_NAME
Description
Retrieve the values of the persistent configuration settings.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/mgr/module/_MODULE_NAME
Description
Set the values of the persistent configuration settings.
Parameters
Example
{
"config": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/mgr/module/MODULE_NAME/disable
Description
Disable the given Ceph Manager module.
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/mgr/module/MODULE_NAME/enable
Description
Enable the given Ceph Manager module.
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/mgr/module/MODULE_NAME/options
Description
View the options for the given Ceph Manager module.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Ceph Monitor
The method reference for using the Ceph RESTful API monitor endpoint to display information on the Ceph Monitor.
GET /api/monitor
Description
View Ceph Monitor details.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Ceph OSD
The method reference for using the Ceph RESTful API osd endpoint to manage the Ceph OSDs.
GET /api/osd
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/osd
Example
{
"data": "STRING",
"method": "STRING",
"tracking_id": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/flags
Description
View the Ceph OSD flags.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/osd/flags
Description
Sets the Ceph OSD flags for the entire storage cluster.
Parameters
IMPORTANT: You must include these four flags for a successful operation.
Example
{
"flags": [
"STRING"
]
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/flags/individual
Description
View the individual Ceph OSD flags.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/osd/flags/individual
Description
Updates the noout, noin, nodown, and noup flags for an individual subset of Ceph OSDs.
Example
{
"flags": {
"nodown": true,
"noin": true,
"noout": true,
"noup": true
},
"ids": [
1
]
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
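A minimal sketch of the per-OSD flag update shown above, setting noout on a subset of OSDs while leaving the other flags cleared; the OSD IDs, dashboard address, and token are placeholders.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Set noout on OSDs 3 and 7 only
body = {"flags": {"noout": True, "noin": False, "nodown": False, "noup": False},
        "ids": [3, 7]}
resp = requests.put(BASE + "/api/osd/flags/individual",
                    json=body, headers=HEADERS, verify=False)
print(resp.status_code)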
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/safe_to_destroy
Description
Check to see if the Ceph OSD is safe to destroy.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/osd/_SVC_ID
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/_SVC_ID
Description
Returns collected data about a Ceph OSD.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/osd/_SVC_ID
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
{
"device_class": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/osd/_SVCID/destroy
Description
Marks Ceph OSD as being destroyed. The Ceph OSD must be marked down before being destroyed. This operation keeps the
Ceph OSD identifier intact, but removes the Cephx keys, configuration key data, and lockbox keys.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/_SVCID/devices
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/_SVCID/histogram
Description
Returns the Ceph OSD histogram data.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/osd/_SVCID/mark
Description
Marks a Ceph OSD out, in, down, or lost.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
{
"action": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
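The snippet below is a sketch of marking an OSD out before maintenance using the action body documented above; the OSD ID, dashboard address, and token are placeholders, and "in", "down", and "lost" follow the same pattern.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Mark OSD 5 out
resp = requests.put(BASE + "/api/osd/5/mark",
                    json={"action": "out"}, headers=HEADERS, verify=False)
print(resp.status_code)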
POST /api/osd/_SVCID/purge
Description
Removes the Ceph OSD from the CRUSH map.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/osd/_SVCID/reweight
Description
Temporarily reweights the Ceph OSD. When a Ceph OSD is marked out, the OSD’s weight is set to 0. When the Ceph OSD is
marked back in, the OSD’s weight is set to 1.
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
{
"weight": "STRING"
}
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/osd/_SVCID/scrub
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Queries:
Example
{
"deep": true
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/osd/_SVCID/smart
Parameters
Replace SVC_ID with a string value for the Ceph OSD service identifier.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
GET /api/rgw/status
Description
Display the Ceph Object Gateway status.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/rgw/daemon
Description
Display the Ceph Object Gateway daemons.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/rgw/daemon/_SVC_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
GET /api/rgw/site
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Bucket Management
GET /api/rgw/bucket
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/rgw/bucket
Example
{
"bucket": "STRING",
"daemon_name": "STRING",
"lock_enabled": "false",
"lock_mode": "STRING",
"lock_retention_period_days": "STRING",
"lock_retention_period_years": "STRING",
"placement_target": "STRING",
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/rgw/bucket/_BUCKET
Parameters
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/rgw/bucket/_BUCKET
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/rgw/bucket/_BUCKET
Example
{
"bucket_id": "STRING",
"daemon_name": "STRING",
"lock_mode": "STRING",
"lock_retention_period_days": "STRING",
"lock_retention_period_years": "STRING",
"mfa_delete": "STRING",
"mfa_token_pin": "STRING",
"mfa_token_serial": "STRING",
"uid": "STRING",
"versioning_state": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
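As a hedged sketch of updating a bucket through this endpoint, the snippet below enables object versioning; the bucket name, owner uid, bucket instance ID, dashboard address, and token are placeholders, and the exact values accepted for versioning_state (for example Enabled or Suspended) should be confirmed against your deployment.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Enable object versioning on mybucket, owned by testuser (illustrative values)
body = {"bucket_id": "BUCKET_ID",    # current bucket instance ID, for example from GET /api/rgw/bucket/mybucket
        "uid": "testuser",
        "versioning_state": "Enabled"}
resp = requests.put(BASE + "/api/rgw/bucket/mybucket",
                    json=body, headers=HEADERS, verify=False)
print(resp.status_code)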
User Management
GET /api/rgw/user
Description
Display the Ceph Object Gateway users.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/rgw/user
Example
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/rgw/user/get_emails
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/rgw/user/_UID
Parameters
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
GET /api/rgw/user/_UID
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/rgw/user/_UID
Parameters
Example
{
"daemon_name": "STRING",
"display_name": "STRING",
"email": "STRING",
"max_buckets": "STRING",
"suspended": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/rgw/user/_UID_/capability
Parameters
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/rgw/user/_UID_/capability
Parameters
Example
{
"daemon_name": "STRING",
"perm": "STRING",
"type": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/rgw/user/_UID_/key
Parameters
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/rgw/user/_UID_/key
Parameters
Example
{
"access_key": "STRING",
"daemon_name": "STRING",
"generate_key": "true",
"key_type": "s3",
"secret_key": "STRING",
"subuser": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
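A minimal sketch of creating an S3 key pair for a user through this endpoint, letting the gateway generate the keys; the user ID, dashboard address, and token are placeholders, and the field names follow the body shown above.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Generate a new S3 key pair for user testuser
body = {"key_type": "s3", "generate_key": "true"}
resp = requests.post(BASE + "/api/rgw/user/testuser/key",
                     json=body, headers=HEADERS, verify=False)
print(resp.status_code)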
GET /api/rgw/user/_UID_/quota
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/rgw/user/_UID_/quota
Parameters
Example
{
"daemon_name": "STRING",
"enabled": "STRING",
"max_objects": "STRING",
"max_size_kb": 1,
"quota_type": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/rgw/user/_UID_/subuser
Parameters
Example
{
"access": "STRING",
"access_key": "STRING",
"daemon_name": "STRING",
"generate_secret": "true",
"key_type": "s3",
"secret_key": "STRING",
"subuser": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/rgw/user/_UID_/subuser/_SUBUSER
Parameters
Queries:
purge_keys - Set to false to not purge the keys. This only works for S3 subusers.
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
To invoke the REST admin APIs, create a user with admin caps.
Example
[root@host01 ~]# radosgw-admin --uid TESTER --display-name "TestUser" --access_key TESTER --secret test123 user create
[root@host01 ~]# radosgw-admin caps add --uid="TESTER" --caps="roles=*"
Create a role:
Syntax
POST “<hostname>?Action=CreateRole&RoleName=ROLE_NAME&Path=PATH_TO_FILE&AssumeRolePolicyDocument=TRUST_RELATIONSHIP_POLICY_DOCUMENT”
Example
POST “<hostname>?Action=CreateRole&RoleName=S3Access&Path=/application_abc/component_xyz/&AssumeRolePolicyDocument={"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}”
Example response
<role>
<id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id>
<name>S3Access</name>
<path>/application_abc/component_xyz/</path>
<arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn>
<create_date>2022-06-23T07:43:42.811Z</create_date>
<max_session_duration>3600</max_session_duration>
<assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document>
</role>
Get a role:
Syntax
POST “<hostname>?Action=GetRole&RoleName=ROLE_NAME”
Example
POST “<hostname>?Action=GetRole&RoleName=S3Access”
Example response
<role>
<id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id>
<name>S3Access</name>
<path>/application_abc/component_xyz/</path>
<arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn>
<create_date>2022-06-23T07:43:42.811Z</create_date>
<max_session_duration>3600</max_session_duration>
<assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document>
</role>
List a role:
Syntax
POST “<hostname>?Action=ListRoles&RoleName=ROLE_NAME&PathPrefix=PATH_PREFIX”
Example request
POST “<hostname>?Action=ListRoles&RoleName=S3Access&PathPrefix=/application”
Example response
<role>
<id>8f41f4e0-7094-4dc0-ac20-074a881ccbc5</id>
<name>S3Access</name>
<path>/application_abc/component_xyz/</path>
<arn>arn:aws:iam:::role/application_abc/component_xyz/S3Access</arn>
<create_date>2022-06-23T07:43:42.811Z</create_date>
<max_session_duration>3600</max_session_duration>
<assume_role_policy_document>{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER"]},"Action":["sts:AssumeRole"]}]}</assume_role_policy_document>
</role>
Update the assume role policy document of a role:
Syntax
POST “<hostname>?Action=UpdateAssumeRolePolicy&RoleName=ROLE_NAME&PolicyDocument=TRUST_RELATIONSHIP_POLICY_DOCUMENT”
Example
POST “<hostname>?Action=UpdateAssumeRolePolicy&RoleName=S3Access&PolicyDocument={"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER2"]},"Action":["sts:AssumeRole"]}]}”
Attach a permission policy to a role:
Syntax
POST “<hostname>?Action=PutRolePolicy&RoleName=ROLE_NAME&PolicyName=POLICY_NAME&PolicyDocument=TRUST_RELATIONSHIP_POLICY_DOCUMENT”
Example
POST “<hostname>?Action=PutRolePolicy&RoleName=S3Access&PolicyName=Policy1&PolicyDocument={"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:CreateBucket"],"Resource":"arn:aws:s3:::example_bucket"}]}”
List the permission policies attached to a role:
Syntax
POST “<hostname>?Action=ListRolePolicies&RoleName=ROLE_NAME”
Example
POST “<hostname>?Action=ListRolePolicies&RoleName=S3Access”
Example response
<PolicyNames>
<member>Policy1</member>
</PolicyNames>
Get the permission policy attached to a role:
Syntax
POST “<hostname>?Action=GetRolePolicy&RoleName=ROLE_NAME&PolicyName=POLICY_NAME”
Example
POST “<hostname>?Action=GetRolePolicy&RoleName=S3Access&PolicyName=Policy1”
Example response
<GetRolePolicyResult>
<PolicyName>Policy1</PolicyName>
<RoleName>S3Access</RoleName>
<Permission_policy>{"Version":"2022-06-17","Statement":[{"Effect":"Allow","Action":
["s3:CreateBucket"],"Resource":"arn:aws:s3:::example_bucket"}]}</Permission_policy>
</GetRolePolicyResult>
Delete the permission policy attached to a role:
Syntax
POST “<hostname>?Action=DeleteRolePolicy&RoleName=ROLE_NAME&PolicyName=POLICY_NAME”
Example
POST “<hostname>?Action=DeleteRolePolicy&RoleName=S3Access&PolicyName=Policy1”
Delete a role:
NOTE: You can delete a role only when it does not have any permission policy attached to it.
Syntax
POST “<hostname>?Action=DeleteRole&RoleName=ROLE_NAME"
Example
POST “<hostname>?Action=DeleteRole&RoleName=S3Access"
Reference
Ceph Orchestrator
The method reference for using the Ceph RESTful API orchestrator endpoint to display the Ceph Orchestrator status.
GET /api/orchestrator/status
Description
Display the Ceph Orchestrator status.
Example
Status Codes
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Pools
The method reference for using the Ceph RESTful API pool endpoint to manage the storage pools.
GET /api/pool
Description
Display the pool list.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/pool
Example
{
"application_metadata": "STRING",
"configuration": "STRING",
"erasure_code_profile": "STRING",
"flags": "STRING",
"pg_num": 1,
"pool": "STRING",
"pool_type": "STRING",
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
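The following is a minimal sketch of creating a replicated pool through this endpoint, using only a subset of the body fields listed above; the pool name, pg_num, dashboard address, and token are illustrative placeholders.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Create a replicated pool with 32 placement groups
body = {"pool": "rbd_pool",
        "pool_type": "replicated",
        "pg_num": 32}
resp = requests.post(BASE + "/api/pool", json=body, headers=HEADERS, verify=False)
print(resp.status_code)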
DELETE /api/pool/_POOL_NAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/pool/_POOL_NAME
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/pool/_POOL_NAME
Parameters
Example
{
"application_metadata": "STRING",
"configuration": "STRING",
"flags": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/pool/POOL_NAME/configuration
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Prometheus
The method reference for using the Ceph RESTful API prometheus endpoint to manage Prometheus.
GET /api/prometheus
Example
Status Codes
200 OK – Okay.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/prometheus/rules
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/prometheus/silence
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/prometheus/silence/_S_ID
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/prometheus/silences
Example
Status Codes
200 OK – Okay.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/prometheus/notifications
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
RBD Namespace
RBD Snapshots
RBD Trash
RBD Mirroring
RBD Images
GET /api/block/image
Description
View the RBD images.
Parameters
Queries:
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image
Example
{
"configuration": "STRING",
"data_pool": "STRING",
"features": "STRING",
"name": "STRING",
"namespace": "STRING",
"obj_size": 1,
"pool_name": "STRING",
"size": 1,
"stripe_count": 1,
"stripe_unit": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
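A hedged sketch of creating an RBD image through this endpoint with a subset of the documented body fields; the image name, pool name, size, dashboard address, and token are placeholders.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Create a 10 GiB image named image1 in the pool rbd_pool
body = {"name": "image1",
        "pool_name": "rbd_pool",
        "size": 10 * 1024 ** 3}
resp = requests.post(BASE + "/api/block/image", json=body, headers=HEADERS, verify=False)
print(resp.status_code)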
GET /api/block/image/clone_format_version
Description
Returns the RBD clone format version.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/block/image/default_features
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/block/image/_IMAGE_SPEC
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/block/image/_IMAGE_SPEC
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/block/image/_IMAGE_SPEC
Parameters
Example
{
"configuration": "STRING",
"features": "STRING",
"name": "STRING",
"size": 1
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/IMAGE_SPEC/copy
Parameters
Example
{
"configuration": "STRING",
"data_pool": "STRING",
"dest_image_name": "STRING",
"dest_namespace": "STRING",
"dest_pool_name": "STRING",
"features": "STRING",
"obj_size": 1,
"snapshot_name": "STRING",
"stripe_count": 1,
"stripe_unit": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/IMAGE_SPEC/flatten
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/IMAGE_SPEC/move_trash
Description
Move an image to the trash. Images, even ones actively in use by clones, can be moved to the trash and deleted at a later time.
Parameters
Example
{
"delay": 1
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
RBD Mirroring
GET /api/block/mirroring/site_name
Description
Display the RBD mirroring site name.
Example
Status Codes
200 OK – Okay.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/block/mirroring/site_name
Example
{
"site_name": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/mirroring/pool/POOL_NAME/bootstrap/peer
Parameters
Example
{
"direction": "STRING",
"token": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/mirroring/pool/POOL_NAME/bootstrap/token
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/block/mirroring/pool/_POOL_NAME
Description
Display the RBD mirroring summary.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/block/mirroring/pool/_POOL_NAME
Parameters
Example
{
"mirror_mode": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
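As a sketch of updating the pool mirror mode documented above, the snippet below switches a pool to per-image mirroring; the pool name, dashboard address, and token are placeholders, and the other commonly used mirror_mode values (for example pool or disabled) should be confirmed against your deployment.
import requests

BASE = "https://dashboard-host:8443"                  # hypothetical dashboard address
HEADERS = {"Authorization": "Bearer TOKEN",
           "Accept": "application/vnd.ceph.api.v1.0+json"}

# Switch the pool rbd_pool to per-image mirroring
resp = requests.put(BASE + "/api/block/mirroring/pool/rbd_pool",
                    json={"mirror_mode": "image"}, headers=HEADERS, verify=False)
print(resp.status_code)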
GET /api/block/mirroring/pool/POOL_NAME/peer
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/mirroring/pool/POOL_NAME/peer
Parameters
Example
{
"client_id": "STRING",
"cluster_name": "STRING",
"key": "STRING",
"mon_host": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/block/mirroring/pool/POOL_NAME/peer/_PEER_UUID
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
GET /api/block/mirroring/pool/POOL_NAME/peer/_PEER_UUID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/block/mirroring/pool/POOL_NAME/peer/_PEER_UUID
Parameters
Example
{
"client_id": "STRING",
"cluster_name": "STRING",
"key": "STRING",
"mon_host": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/block/mirroring/summary
Description
Display the RBD mirroring summary.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
RBD Namespace
GET /api/block/pool/POOL_NAME/namespace
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/pool/POOL_NAME/namespace
Parameters
Example
{
"namespace": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/block/pool/POOL_NAME/namespace/_NAMESPACE
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
RBD Snapshots
POST /api/block/image/IMAGE_SPEC/snap
Parameters
Example
{
"snapshot_name": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/block/image/IMAGE_SPEC/snap/_SNAPSHOT_NAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/block/image/IMAGE_SPEC/snap/_SNAPSHOT_NAME
Parameters
Example
{
"is_protected": true,
"new_snap_name": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/IMAGE_SPEC/snap/SNAPSHOT_NAME/clone
Description
Clones a snapshot to an image.
Parameters
Example
{
"child_image_name": "STRING",
"child_namespace": "STRING",
"child_pool_name": "STRING",
"configuration": "STRING",
"data_pool": "STRING",
"features": "STRING",
"obj_size": 1,
"stripe_count": 1,
"stripe_unit": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/IMAGE_SPEC/snap/SNAPSHOT_NAME/rollback
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
RBD Trash
GET /api/block/image/trash
Description
Display all the RBD trash entries, or the RBD trash details by pool name.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/trash/purge
Description
Remove all the expired images from trash.
Parameters
Queries:
Example
{
"pool_name": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/block/image/trash/_IMAGEIDSPEC
Description
Deletes an image from the trash. If the image deferment time has not expired, you cannot delete it unless you use force. An image that is actively in use by clones or has snapshots cannot be deleted.
Parameters
Queries:
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/block/image/trash/_IMAGEIDSPEC_/restore
Description
Restores an image from the trash.
Parameters
Example
{
"new_image_name": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Performance counters
The method reference for using the Ceph RESTful API perf_counters endpoint to display the various Ceph performance counters.
This reference includes all available performance counter endpoints, such as:
Ceph Manager
Ceph Monitor
Ceph OSD
TCMU Runner
GET /api/perf_counters
Description
Displays the performance counters.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/perf_counters/mds/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
Ceph Manager
GET /api/perf_counters/mgr/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Ceph Monitor
GET /api/perf_counters/mon/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Ceph OSD
GET /api/perf_counters/osd/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/perf_counters/rbd-mirror/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/perf_counters/rgw/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
TCMU Runner
GET /api/perf_counters/tcmu-runner/_SERVICE_ID
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Roles
The method reference for using the Ceph RESTful API role endpoint to manage the various user roles in Ceph.
GET /api/role
Description
Display the role list.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/role
Example
{
"description": "STRING",
"name": "STRING",
"scopes_permissions": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/role/_NAME
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/role/_NAME
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/role/_NAME
Parameters
Example
{
"description": "STRING",
"scopes_permissions": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/role/NAME/clone
Example
{
"new_name": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Edit online
Services
Edit online
The method reference for using the Ceph RESTful API service endpoint to manage the various Ceph services.
GET /api/service
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/service
Parameters
Example
{
"service_name": "STRING",
"service_spec": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/service/known_types
Description
Display a list of known service types.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/service/_SERVICE_NAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/service/_SERVICE_NAME
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/service/_SERVICE_NAME/daemons
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Edit online
Settings
Edit online
The method reference for using the Ceph RESTful API settings endpoint to manage the various Ceph settings.
GET /api/settings
Description
Display the list of available options
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/settings
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/settings/_NAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/settings/_NAME
Description
Display the given option.
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/settings/_NAME
Parameters
Example
{
"value": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Edit online
Ceph task
Edit online
The method reference for using the Ceph RESTful API task endpoint to display Ceph tasks.
GET /api/task
Description
Display Ceph tasks.
Parameters
Queries:
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Edit online
Telemetry
Edit online
The method reference for using the Ceph RESTful API telemetry endpoint to manage data for the telemetry Ceph Manager module.
PUT /api/telemetry
Description
Enables or disables the sending of collected data by the telemetry module.
Parameters
license_name - A string value, such as sharing-1-0. Make sure the user is aware of and accepts the license for
sharing telemetry data.
Example
{
"enable": true,
"license_name": "STRING"
}
Status Codes
200 OK – Okay.
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/telemetry/report
Description
Display report data on Ceph and devices.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
Reference
Edit online
For more information about managing telemetry with the Ceph dashboard, see Activating and deactivating telemetry.
Ceph users
Edit online
The method reference for using the Ceph RESTful API user endpoint to display Ceph user details and to manage Ceph user
passwords.
GET /api/user
Description
Display a list of users.
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/user
Example
{
"email": "STRING",
"enabled": true,
"name": "STRING",
"password": "STRING",
"pwdExpirationDate": "STRING",
"pwdUpdateRequired": true,
"roles": "STRING",
"username": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
DELETE /api/user/USERNAME
Parameters
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
GET /api/user/USERNAME
Parameters
Example
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
PUT /api/user/USERNAME
Parameters
Example
{
"email": "STRING",
"enabled": "STRING",
"name": "STRING",
"password": "STRING",
"pwdExpirationDate": "STRING",
"pwdUpdateRequired": true,
"roles": "STRING"
}
Status Codes
200 OK – Okay.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/user/USERNAME/change_password
Parameters
Example
{
"new_password": "STRING",
"old_password": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
POST /api/user/validate_password
Description
Checks the password to see if it meets the password policy.
Parameters
Example
{
"old_password": "STRING",
"password": "STRING",
"username": "STRING"
}
Status Codes
202 Accepted – Operation is still executing. Please check the task queue.
400 Bad Request – Operation exception. Please check the response body for details.
500 Internal Server Error – Unexpected error. Please check the response body for the stack trace.
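The following is a minimal Python sketch that exercises the two password endpoints shown above, assuming the same base URL and bearer-token login used in the role listing sketch earlier in this reference; the user name and passwords are placeholders.
import requests

BASE = "https://host01:8443"   # assumed dashboard URL
HEADERS = {"Accept": "application/vnd.ceph.api.v1.0+json",
           "Content-Type": "application/json",
           "Authorization": "Bearer TOKEN"}   # placeholder token

# Check a candidate password against the password policy.
check = requests.post(BASE + "/api/user/validate_password",
                      json={"username": "jdoe",
                            "old_password": "OLD_PASSWORD",
                            "password": "NEW_PASSWORD"},
                      headers=HEADERS, verify=False)
print(check.status_code, check.text)

# If the policy check passes, change the password for that user.
change = requests.post(BASE + "/api/user/jdoe/change_password",
                       json={"old_password": "OLD_PASSWORD",
                             "new_password": "NEW_PASSWORD"},
                       headers=HEADERS, verify=False)
print(change.status_code)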
Reference
Edit online
The AssumeRole example creates a role, assigns a policy to the role, and then assumes the role to get temporary credentials and
access S3 resources with those temporary credentials.
The AssumeRoleWithWebIdentity example authenticates users through an external application with Keycloak, an OpenID Connect
identity provider, assumes a role to get temporary credentials, and accesses S3 resources according to the permission policy of the role.
AssumeRole Example
import boto3

iam_client = boto3.client('iam',
    aws_access_key_id=ACCESS_KEY_OF_TESTER1,
    aws_secret_access_key=SECRET_KEY_OF_TESTER1,
    endpoint_url=<IAM URL>,
    region_name=''
)

policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER1\"]},\"Action\":[\"sts:AssumeRole\"]}]}"

role_response = iam_client.create_role(
    AssumeRolePolicyDocument=policy_document,
    Path='/',
    RoleName='S3Access',
)

role_policy = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Action\":\"s3:*\",\"Resource\":\"arn:aws:s3:::*\"}}"

response = iam_client.put_role_policy(
    RoleName='S3Access',
    PolicyName='Policy1',
    PolicyDocument=role_policy
)

sts_client = boto3.client('sts',
    aws_access_key_id=ACCESS_KEY_OF_TESTER2,
    aws_secret_access_key=SECRET_KEY_OF_TESTER2,
    endpoint_url=<STS URL>,
    region_name=''
)

response = sts_client.assume_role(
    RoleArn=role_response['Role']['Arn'],
    RoleSessionName='Bob',
    DurationSeconds=3600
)

s3client = boto3.client('s3',
    aws_access_key_id=response['Credentials']['AccessKeyId'],
    aws_secret_access_key=response['Credentials']['SecretAccessKey'],
    aws_session_token=response['Credentials']['SessionToken'],
    endpoint_url=<S3 URL>,
    region_name=''
)

bucket_name = 'my-bucket'
s3bucket = s3client.create_bucket(Bucket=bucket_name)
resp = s3client.list_buckets()
AssumeRoleWithWebIdentity Example
import boto3

iam_client = boto3.client('iam',
    aws_access_key_id=ACCESS_KEY_OF_TESTER1,
    aws_secret_access_key=SECRET_KEY_OF_TESTER1,
    endpoint_url=<IAM URL>,
    region_name=''
)

policy_document = "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Federated\":[\"arn:aws:iam:::oidc-provider/localhost:8080/auth/realms/demo\"]},\"Action\":[\"sts:AssumeRoleWithWebIdentity\"],\"Condition\":{\"StringEquals\":{\"localhost:8080/auth/realms/demo:app_id\":\"customer-portal\"}}}]}"

role_response = iam_client.create_role(
    AssumeRolePolicyDocument=policy_document,
    Path='/',
    RoleName='S3Access',
)

role_policy = "{\"Version\":\"2012-10-17\",\"Statement\":{\"Effect\":\"Allow\",\"Action\":\"s3:*\",\"Resource\":\"arn:aws:s3:::*\"}}"

response = iam_client.put_role_policy(
    RoleName='S3Access',
    PolicyName='Policy1',
    PolicyDocument=role_policy
)

sts_client = boto3.client('sts',
    aws_access_key_id=ACCESS_KEY_OF_TESTER2,
    aws_secret_access_key=SECRET_KEY_OF_TESTER2,
    endpoint_url=<STS URL>,
    region_name=''
)

response = sts_client.assume_role_with_web_identity(
    RoleArn=role_response['Role']['Arn'],
    RoleSessionName='Bob',
    DurationSeconds=3600,
    WebIdentityToken=<Web Token>
)

s3client = boto3.client('s3',
    aws_access_key_id=response['Credentials']['AccessKeyId'],
    aws_secret_access_key=response['Credentials']['SecretAccessKey'],
    aws_session_token=response['Credentials']['SessionToken'],
    endpoint_url=<S3 URL>,
    region_name=''
)

bucket_name = 'my-bucket'
s3bucket = s3client.create_bucket(Bucket=bucket_name)
resp = s3client.list_buckets()
Reference
Edit online
For more details on using Python's boto module, see Test S3 Access.
Troubleshooting
Edit online
Troubleshoot and resolve common problems with IBM Storage Ceph.
Initial Troubleshooting
Configuring logging
Troubleshooting networking issues
Initial Troubleshooting
Edit online
As a storage administrator, you can do the initial troubleshooting of an IBM Storage Ceph cluster before contacting IBM support. This
chapter includes the following information:
Prerequisites
Identifying problems
Diagnosing the health of a storage cluster
Understanding Ceph health
Muting health alerts of a Ceph cluster
Understanding Ceph logs
Generating an sos report
Identifying problems
Edit online
To determine possible causes of the error with the IBM Storage Ceph cluster, answer the questions in the Procedure section.
Prerequisites
Procedure
1. Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
e. Multi-site Ceph Object Gateway. See Troubleshooting a multi-site Ceph Object Gateway.
Reference
Prerequisites
Procedure
Example
Example
If the command returns HEALTH_WARN or HEALTH_ERR see Understanding Ceph health for details.
Example
4. To capture the logs of the cluster to a file, run the following commands:
Example
The logs are located by default in the /var/log/ceph/ directory. Check the Ceph logs for any error messages listed in
Understanding Ceph logs.
5. If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that
failed. See Configuring logging for details.
HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example, when the
IBM Storage Ceph cluster finishes the rebalancing process. However, consider further troubleshooting if the cluster stays in the
HEALTH_WARN state for an extended period of time.
HEALTH_ERR indicates a more serious problem that requires your immediate attention.
Use the ceph health detail and ceph -s commands to get a more detailed output.
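As a quick way to act on this output programmatically, the following minimal Python sketch runs ceph health detail with JSON output and prints each active health check; the exact JSON layout can vary slightly between releases.
import json
import subprocess

# Requires the ceph CLI and a client keyring on the node where it runs.
raw = subprocess.check_output(["ceph", "health", "detail", "--format", "json"])
health = json.loads(raw)

print("Overall status:", health.get("status"))        # HEALTH_OK, HEALTH_WARN, or HEALTH_ERR
for code, check in health.get("checks", {}).items():
    severity = check.get("severity")
    summary = check.get("summary", {}).get("message")
    print(severity, code + ":", summary)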
NOTE: A health warning is displayed if there is no mgr daemon running. If the last mgr daemon of an IBM Storage Ceph cluster
was removed, you can manually deploy a mgr daemon on a random host of the IBM Storage Ceph cluster.
See Manually deploying a mgr daemon in the IBM Storage Ceph 5.3 Administration Guide.
Reference
See Ceph Monitor error messages table in the IBM Storage Ceph Troubleshooting Guide.
See Ceph OSD error messages table in the IBM Storage Ceph Troubleshooting Guide.
See Placement group error messages table in the IBM Storage Ceph Troubleshooting Guide.
Alerts are specified using the health check codes. For example, when an OSD is brought down for maintenance, OSD_DOWN
warnings are expected. You can choose to mute the warning until the maintenance is over, because those warnings put the cluster in
HEALTH_WARN instead of HEALTH_OK for the entire duration of the maintenance.
Most health mutes also disappear if the extent of an alert gets worse. For example, if there is one OSD down, and the alert is muted,
the mute disappears if one or more additional OSDs go down. This is true for any health alert that involves a count indicating how
much or how many of something is triggering the warning or error.
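The following minimal Python sketch wraps the mute workflow described in the procedure that follows; the OSD_DOWN code, the 10-minute TTL, and the sticky flag are illustrative values only.
import subprocess

def run(*args):
    # Print and execute a ceph CLI command.
    print("+", " ".join(args))
    subprocess.check_call(args)

# Mute the OSD_DOWN warning for 10 minutes while maintenance is under way.
run("ceph", "health", "mute", "OSD_DOWN", "10m")

# Optionally keep the mute even after the alert clears (sticky mute).
run("ceph", "health", "mute", "OSD_DOWN", "10m", "--sticky")

# When maintenance is over, clear the mute explicitly.
run("ceph", "health", "unmute", "OSD_DOWN")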
Prerequisites
Procedure
Example
2. Check the health of the IBM Storage Ceph cluster by running the ceph health detail command:
Example
You can see that the storage cluster is in HEALTH_WARN status as one of the OSDs is down.
Syntax
Example
4. Optional: A health check mute can have a time to live (TTL) associated with it, such that the mute automatically expires after
the specified period of time has elapsed. Specify the TTL as an optional duration argument in the command:
Syntax
Example
Example
services:
mon: 3 daemons, quorum host01,host02,host03 (age 33h)
mgr: host01.pzhfuh(active, since 33h), standbys: host02.wsnngf, host03.xwzphg
osd: 11 osds: 10 up (since 4m), 11 in (since 5d)
data:
pools: 1 pools, 1 pgs
objects: 13 objects, 0 B
usage: 85 MiB used, 165 GiB / 165 GiB avail
pgs: 1 active+clean
In this example, you can see that the alerts OSD_DOWN and OSD_FLAG are muted and the mute is active for nine minutes.
6. Optional: You can retain the mute even after the alert is cleared by making it sticky.
Syntax
Example
Syntax
Example
Reference
See Health messages of a Ceph cluster section in the IBM Storage Ceph Troubleshooting Guide for details.
The CLUSTER_NAME.log is the main storage cluster log file that includes global events. By default, the log file name is ceph.log.
Only the Ceph Monitor nodes include the main storage cluster log.
Each Ceph OSD and Monitor has its own log file named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-
mon.HOSTNAME.log respectively.
When you increase debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.
Reference
For details about logging, see Configuring logging in the IBM Storage Ceph Troubleshooting Guide.
See Common Ceph Monitor error messages in the Ceph logs table in the IBM Storage Ceph Troubleshooting Guide.
See Common Ceph OSD error messages in the Ceph logs table in the IBM Storage Ceph Troubleshooting Guide.
Edit online
You can run the sos report command to collect the configuration details, system information, and diagnostic information of an IBM
Storage Ceph cluster running on Red Hat Enterprise Linux. The IBM Support team uses this information for further troubleshooting of the
storage cluster.
Prerequisites
Procedure
Example
NOTE: Install the sos-4.0.11.el8 package or a later version to capture the Ceph command output correctly.
2. Run the sos report command to get the system information of the storage cluster:
Example
For sos versions 4.3 and later, you need to run the following command for specific Ceph information:
In the above example, you can get the logs of the Ceph Monitor.
Reference
See the What is an sos report and how to create one in Red Hat Enterprise Linux? KnowledgeBase article for more information.
Configuring logging
Edit online
This chapter describes how to configure logging for various Ceph subsystems.
IMPORTANT: Logging is resource intensive. Also, verbose logging can generate a huge amount of data in a relatively short time. If
you are encountering problems in a specific subsystem of the cluster, enable logging only of that subsystem. See Ceph Subsystems
for more information.
In addition, consider setting up a rotation of log files. See Accelerating log rotation for details.
Once you fix any problems you encounter, change the subsystems log and memory levels to their default values. See Ceph
subsystems default logging level values for a list of all Ceph subsystems and their default values.
Using the ceph command at runtime. This is the most common approach. See Configuring logging at runtime for details.
Updating the Ceph configuration file. Use this approach if you are encountering problems when starting the cluster. See
Configuring logging in configuration file for details.
Prerequisites
Ceph subsystems
Edit online
This section contains information about Ceph subsystems and their logging levels.
Output logs that are stored by default in /var/log/ceph/ directory (log level)
In general, Ceph does not send logs stored in memory to the output logs unless:
You request it
You can set different values for each of these subsystems. Ceph logging levels operate on a scale of 1 to 20, where 1 is terse and 20
is verbose.
Use a single value for the log level and memory level to set them both to the same value. For example, debug_osd = 5 sets the
debug level for the ceph-osd daemon to 5.
To use different values for the output log level and the memory level, separate the values with a forward slash (/). For example,
debug_mon = 1/5 sets the debug log level for the ceph-mon daemon to 1 and its memory log level to 5.
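The following minimal Python sketch only illustrates how a combined value such as 1/5 is interpreted; it is not part of Ceph itself.
def split_debug_level(value):
    """Return (log_level, memory_level) for a Ceph debug setting."""
    if "/" in value:
        log_level, memory_level = value.split("/", 1)
        return int(log_level), int(memory_level)
    # A single number sets both the output log level and the memory level.
    return int(value), int(value)

print(split_debug_level("5"))     # (5, 5)  -> debug_osd = 5
print(split_debug_level("1/5"))   # (1, 5)  -> debug_mon = 1/5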
The following examples show the type of messages in the logs when you increase the verbosity for the Monitors and OSDs.
debug_ms = 5
debug_osd = 20
Reference
Prerequisites
Procedure
1. Log in to the host with a running Ceph daemon, for example, ceph-osd or ceph-mon.
2. Replace:
ID with a specific ID of the Ceph daemon. Alternatively, use * to apply the runtime setting to all daemons of a particular
type.
For example, to set the log level for the OSD subsystem on the OSD named osd.0 to 0 and the memory level to 5 (a scripted sketch follows the example):
Syntax
Example
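As an illustration, the following minimal Python sketch drives the same runtime change with the ceph tell ... injectargs form; the osd.0 target and the 0/5 levels are example values only.
import subprocess

daemon = "osd.0"          # use "osd.*" to target all OSD daemons
setting = "--debug-osd"   # subsystem to adjust
level = "0/5"             # output log level 0, memory level 5

# Equivalent to: ceph tell osd.0 injectargs '--debug-osd 0/5'
subprocess.check_call(["ceph", "tell", daemon, "injectargs", setting + " " + level])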
Reference
The Ceph Debugging and Logging Configuration Reference chapter in the IBM Storage Ceph Configuration Guide.
Prerequisites
Procedure
1. To activate Ceph debugging output (dout()) at boot time, add the debugging settings to the Ceph configuration file.
a. For subsystems common to each daemon, add the settings under the [global] section.
b. For subsystems for particular daemons, add the settings under a daemon section, such as [mon], [osd], or [mds].
Example
[global]
debug_ms = 1/5
[mon]
[osd]
debug_osd = 1/5
debug_monc = 5/20
[mds]
debug_mds = 1
Reference
Ceph subsystems
Prerequisites
Procedure
1. Add the size setting after the rotation frequency to the log rotation file:
rotate 7
weekly
size SIZE
compress
sharedscripts
rotate 7
weekly
size 500M
compress
sharedscripts
size 500M
3. Add an entry to check the /etc/logrotate.d/ceph file. For example, to instruct Cron to check /etc/logrotate.d/ceph
every 30 minutes:
Procedure
Syntax
logrotate -f
Example
Syntax
ll LOG_LOCATION
Example
3. Create a bucket.
Syntax
/usr/local/bin/s3cmd mb s3://NEW_BUCKET_NAME
Example
Syntax
tail -f LOG_LOCATION/opslog.log
Example
{"bucket":"","time":"2022-09-29T06:17:03.133488Z","time_local":"2022-09-
29T06:17:03.133488+0000","remote_addr":"10.0.211.66","user":"test1",
"operation":"list_buckets","uri":"GET /
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":232,
"bytes_received":0,"object_size":0,"total_time":9,"user_agent":"","referrer":
"","trans_id":"tx00000c80881a9acd2952a-006335385f-175e5-primary",
"authentication_type":"Local","access_key_id":"1234","temp_url":false}
{"bucket":"cn1","time":"2022-09-29T06:17:10.521156Z","time_local":"2022-09-
29T06:17:10.521156+0000","remote_addr":"10.0.211.66","user":"test1",
"operation":"create_bucket","uri":"PUT /cn1/
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":0,
"bytes_received":0,"object_size":0,"total_time":106,"user_agent":"",
"referrer":"","trans_id":"tx0000058d60c593632c017-0063353866-175e5-primary",
"authentication_type":"Local","access_key_id":"1234","temp_url":false}
Prerequisites
Prerequisites
Procedure
1. Installing the net-tools and telnet packages can help when troubleshooting network issues that can occur in a Ceph
storage cluster:
Example
2. Log into the cephadm shell and verify that the public_network parameters in the Ceph configuration file include the correct
values:
Example
3. Exit the shell and verify that the network interfaces are up:
Example
4. Verify that the Ceph nodes are able to reach each other using their short host names. Verify this on each node in the storage
cluster:
Syntax
ping SHORT_HOST_NAME
Example
5. If you use a firewall, ensure that Ceph nodes are able to reach each other on their appropriate ports. The firewall-cmd and
telnet tools can validate the port status and whether the port is open, respectively; a connectivity sketch is also shown after this procedure:
Syntax
Example
6. Verify that there are no errors on the interface counters. Verify that the network connectivity between nodes has expected
latency, and that there is no packet loss.
Syntax
ethtool -S INTERFACE
Example
Example
Example
7. For performance issues, in addition to the latency checks, use the iperf3 tool to verify the network bandwidth between all nodes
of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server
and a client.
a. Install the iperf3 package on the IBM Storage Ceph nodes you want to check the bandwidth:
Example
Example
NOTE: The default port is 5201, but can be set using the -p command argument.
Example
iperf Done.
This output shows a network bandwidth of 1.1 Gbits/second between the IBM Storage Ceph nodes, along with no
retransmissions (Retr) during the test. IBM recommends you validate the network bandwidth between all the nodes in the
storage cluster.
8. Ensure that all nodes have the same network interconnect speed. Slower attached nodes might slow down the faster
connected ones. Also, ensure that the inter switch links can handle the aggregated bandwidth of the attached nodes:
Syntax
ethtool INTERFACE
Example
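The following minimal Python sketch, referenced from step 5, checks TCP reachability of the Ceph Monitor ports (3300 for msgr2 and 6789 for msgr1) from the local node; the host names are placeholders for your cluster.
import socket

HOSTS = ["host01", "host02", "host03"]   # assumed Monitor host names
PORTS = [3300, 6789]                     # msgr2 and msgr1 Monitor ports

for host in HOSTS:
    for port in PORTS:
        try:
            # Attempt a TCP connection with a short timeout.
            with socket.create_connection((host, port), timeout=3):
                print(host + ":" + str(port), "reachable")
        except OSError as err:
            print(host + ":" + str(port), "NOT reachable:", err)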
Reference
See the What is the ethtool command and how can I use it to obtain information about my network devices and interfaces
article for details.
For details, see the What are the performance benchmarking tools available for IBM Storage Ceph? solution on the Customer
Portal.
For more information, see Knowledgebase articles and solutions related to troubleshooting networking issues.
Prerequisites
Procedure
1. Verify that the chronyd daemon is running on the Ceph Monitor hosts:
Example
Example
Example
Reference
See Clock skew section in the IBM Storage Ceph Troubleshooting Guide for further details.
Prerequisites
Prerequisites
HEALTH_WARN 1 mons down, quorum 1,2 mon.b,mon.c, mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out
of quorum)
If the ceph-mon daemon is not running, it might have a corrupted store or some other error is preventing the daemon from starting.
Also, the /var/ partition might be full. As a consequence, ceph-mon is not able to perform any operations to the store, which is located by
default at /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME/store.db, and terminates.
If the ceph-mon daemon is running but the Ceph Monitor is out of quorum and marked as down, the cause of the problem depends
on the Ceph Monitor state:
If the Ceph Monitor is in the probing state longer than expected, it cannot find the other Ceph Monitors. This problem can be
caused by networking issues, or the Ceph Monitor can have an outdated Ceph Monitor map (monmap) and be trying to reach
the other Ceph Monitors on incorrect IP addresses. Alternatively, if the monmap is up-to-date, Ceph Monitor’s clock might not
be synchronized.
If the Ceph Monitor is in the electing state longer than expected, the Ceph Monitor’s clock might not be synchronized.
If the Ceph Monitor changes its state from synchronizing to electing and back, the cluster state is advancing. This means that it
is generating new maps faster than the synchronization process can handle.
If the Ceph Monitor marks itself as the leader or a peon, then it believes itself to be in a quorum, while the remaining cluster is sure
that it is not. This problem can be caused by failed clock synchronization.
Syntax
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when
unsure.
2. If you are not able to start ceph-mon, follow the steps in The ceph-mon daemon cannot start.
3. If you are able to start the ceph-mon daemon but is marked as down, follow the steps in The ceph-mon daemon is running,
but marked as down.
2. If the log contains error messages similar to the following ones, the Ceph Monitor might have a corrupted store.
To fix this problem, replace the Ceph Monitor. See Replacing a failed monitor for details.
3. If the log contains an error message similar to the following one, the /var/ partition might be full. Delete any unnecessary
data from /var/.
IMPORTANT: Do not delete any data from the Monitor directory manually. Instead, use the ceph-monstore-tool to
compact it.
4. If you see any other error messages, open a support ticket. For more information, see Contacting IBM Support for service.
1. From the Ceph Monitor host that is out of the quorum, use the mon_status command to check its state:
2. If the status is probing, verify the locations of the other Ceph Monitors in the mon_status output.
a. If the addresses are incorrect, the Ceph Monitor has incorrect Ceph Monitor map (monmap). To fix this problem, see
Injecting a monmap.
b. If the addresses are correct, verify that the Ceph Monitor clocks are synchronized. See Clock skew for details. In addition, to
troubleshoot any networking issues, see Troubleshooting Networking issues for details.
3. If the status is electing, verify that the Ceph Monitor clocks are synchronized. See Clock skew for details.
4. If the status changes from electing to synchronizing, open a support ticket. For more information, see Contacting IBM Support
for service.
5. If the Ceph Monitor is the leader or a peon, verify that the Ceph Monitor clocks are synchronized. Open a support ticket if
synchronizing the clocks does not solve the problem. For more information, see Contacting IBM Support for service.
Reference
See Starting, Stopping, Restarting the Ceph daemons section in the IBM Storage Ceph Administration Guide.
The Using the Ceph Administration Socket section in the IBM Storage Ceph Administration Guide.
Clock skew
Edit online
A Ceph Monitor is out of quorum, and the ceph health detail command output contains error messages similar to these:
2022-05-04 07:28:32.035795 7f806062e700 0 log [WRN] : mon.a 127.0.0.1:6789/0 clock skew 0.14s > max
0.05s
2022-05-04 04:31:25.773235 7f4997663700 0 log [WRN] : message from mon.1 was stamped 0.186257s in
the future, clocks not synchronized
The clock skew error message indicates that the Ceph Monitors' clocks are not synchronized. Clock synchronization is important
because Ceph Monitors depend on time precision and behave unpredictably if their clocks are not synchronized.
The mon_clock_drift_allowed parameter determines what disparity between the clocks is tolerated. By default, this parameter
is set to 0.05 seconds.
IMPORTANT: Do not change the default value of mon_clock_drift_allowed without previous testing. Changing this value might
affect the stability of the Ceph Monitors and the Ceph Storage Cluster in general.
Possible causes of the clock skew error include network problems or problems with chrony Network Time Protocol (NTP)
synchronization if that is configured. In addition, time synchronization does not work properly on Ceph Monitors deployed on virtual
machines.
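The following minimal Python sketch, assuming chrony is the configured time source and the ceph CLI is available, prints the allowed drift from the cluster configuration next to the local chronyc tracking report so the two can be compared.
import subprocess

# Allowed clock drift between Monitors, as configured in the cluster.
allowed = subprocess.check_output(
    ["ceph", "config", "get", "mon", "mon_clock_drift_allowed"], text=True).strip()
print("mon_clock_drift_allowed =", allowed, "seconds")

# chronyc tracking reports the local clock's offset from its NTP sources.
print(subprocess.check_output(["chronyc", "tracking"], text=True))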
1. Verify that your network works correctly. For details, see Troubleshooting networking issues. If you use chrony for NTP, see
Basic chrony NTP troubleshooting section for more information.
2. If you use a remote NTP server, consider deploying your own chrony NTP server on your network. For details, see Using the
Chrony suite to configure NTP in the Configuring basic system settings for Red Hat Enterprise Linux 8.
NOTE: Ceph evaluates time synchronization every five minutes only so there will be a delay between fixing the problem and clearing
the clock skew messages.
Reference
mon.ceph1 store is getting too big! 48031 MB >= 15360 MB -- 62% avail
The Ceph Monitor store is in fact a LevelDB database that stores entries as key-value pairs. The database includes a cluster map and is
located by default at /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME/store.db.
Querying a large Monitor store can take time. As a consequence, the Ceph Monitor can be delayed in responding to client queries.
In addition, if the /var/ partition is full, the Ceph Monitor cannot perform any write operations to the store and terminates. See
Ceph Monitor is out of quorum for details on troubleshooting this issue.
du -sch /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME/store.db
Specify the name of the cluster and the short host name of the host where the ceph-mon is running.
Example
# du -sch /var/lib/ceph/mon/ceph-host1/store.db
47G /var/lib/ceph/mon/ceph-ceph1/store.db/
47G total
2. Compact the Ceph Monitor store. For details, see Compacting the Ceph Monitor Store.
Reference
State
Rank
Elections epoch
If Ceph Monitors are able to form a quorum, use mon_status with the ceph command-line utility.
If Ceph Monitors are not able to form a quorum, but the ceph-mon daemon is running, use the administration socket to execute
mon_status.
{
"name": "mon.3",
"rank": 2,
"state": "peon",
"election_epoch": 96,
"quorum": [
1,
2
],
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 1,
"fsid": "d5552d32-9d1d-436c-8db1-ab5fc2c63cd0",
"modified": "0.000000",
"created": "0.000000",
"mons": [
{
"rank": 0,
"name": "mon.1",
"addr": "172.25.1.10:6789\/0"
},
{
"rank": 1,
"name": "mon.2",
"addr": "172.25.1.12:6789\/0"
},
{
"rank": 2,
"name": "mon.3",
"addr": "172.25.1.13:6789\/0"
}
]
}
}
Leader
During the electing phase, Ceph Monitors are electing a leader. The leader is the Ceph Monitor with the highest rank, that is
the rank with the lowest value. In the example above, the leader is mon.1.
Peon
Peons are the Ceph Monitors in the quorum that are not leaders. If the leader fails, the peon with the highest rank becomes a
new leader.
Probing
A Ceph Monitor is in the probing state if it is looking for other Ceph Monitors. For example, after you start the Ceph Monitors,
they are probing until they find enough Ceph Monitors specified in the Ceph Monitor map (monmap) to form a quorum.
Synchronizing
A Ceph Monitor is in the synchronizing state if it is synchronizing with the other Ceph Monitors to join the quorum. The smaller
the Ceph Monitor store is, the faster the synchronization process. Therefore, if you have a large store, synchronization takes a
longer time.
Reference
Injecting a monmap
Edit online
If a Ceph Monitor has an outdated or corrupted Ceph Monitor map (monmap), it cannot join a quorum because it is trying to reach the
other Ceph Monitors on incorrect IP addresses.
The safest way to fix this problem is to obtain and inject the actual Ceph Monitor map from other Ceph Monitors.
NOTE: This action overwrites the existing Ceph Monitor map kept by the Ceph Monitor.
This procedure shows how to inject the Ceph Monitor map when the other Ceph Monitors are able to form a quorum, or when at least
one Ceph Monitor has a correct Ceph Monitor map. If all Ceph Monitors have a corrupted store, and therefore also a corrupted Ceph
Monitor map, see Recovering the Ceph Monitor store.
Prerequisites
Procedure
1. If the remaining Ceph Monitors are able to form a quorum, get the Ceph Monitor map by using the ceph mon getmap
command:
Example
2. If the remaining Ceph Monitors are not able to form the quorum and you have at least one Ceph Monitor with a correct Ceph
Monitor map, copy it from that Ceph Monitor:
a. Stop the Ceph Monitor which you want to copy the Ceph Monitor map from:
Syntax
For example, to stop the Ceph Monitor running on a host with the host01 short host name:
Example
Syntax
Replace ID with the ID of the Ceph Monitor which you want to copy the Ceph Monitor map from:
Example
Syntax
For example, to stop a Ceph Monitor running on a host with the host01 short host name:
Example
Syntax
Replace ID with the ID of the Ceph Monitor with the corrupted or outdated Ceph Monitor map:
Example
Example
If you copied the Ceph Monitor map from another Ceph Monitor, start that Ceph Monitor, too:
Example
Reference
Prerequisites
1. From the Monitor host, remove the Monitor store, which is located by default at /var/lib/ceph/mon/CLUSTER_NAME-
SHORT_HOST_NAME:
rm -rf /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME
Specify the short host name of the Monitor host and the cluster name. For example, to remove the Monitor store of a Monitor
running on host1 from a cluster called remote:
3. Troubleshoot and fix any problems related to the underlying file system or hardware of the Monitor host.
Reference
By using the ceph-monstore-tool when the ceph-mon daemon is not running. Use this method when the previously
mentioned methods fail to compact the Monitor store or when the Monitor is out of quorum and its log contains the Caught
signal (Bus error) error message.
IMPORTANT: Monitor store size changes when the cluster is not in the active+clean state or during the rebalancing process. For
this reason, compact the Monitor store when rebalancing is completed. Also, ensure that the placement groups are in the
active+clean state.
Prerequisites
Procedure
Syntax
2. Replace HOST_NAME with the short host name of the host where the ceph-mon is running. Use the hostname -s command
when unsure.
Example
3. Add the following parameter to the Ceph configuration under the [mon] section:
[mon]
mon_compact_on_start = true
Syntax
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when
unsure.
Example
NOTE: Before you start, ensure that you have the ceph-test package installed.
7. Verify that the ceph-mon daemon with the large store is not running. Stop the daemon if needed.
Syntax
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when
unsure.
Example
Syntax
Example
Syntax
Example
Reference
See The Ceph Monitor store is getting too big for details.
Prerequisites
Procedure
1. To resolve this situation, for each host running ceph-mgr daemons, open ports 6800-7300.
Example
IBM Storage Ceph clusters use at least three Ceph Monitors so that if one fails, it can be replaced with another one. However,
under certain circumstances, all Ceph Monitors can have corrupted stores. For example, when the Ceph Monitor nodes have
incorrectly configured disk or file system settings, a power outage can corrupt the underlying file system.
If there is corruption on all Ceph Monitors, you can recover it with information stored on the OSD nodes by using utilities called
ceph-monstore-tool and ceph-objectstore-tool.
IMPORTANT: Never restore the Ceph Monitor store from an old backup. Rebuild the Ceph Monitor store from the current cluster
state using the following steps and restore from that.
In containerized environments, this method requires attaching Ceph repositories and restoring to a non-containerized Ceph Monitor
first.
WARNING: This procedure can cause data loss. If you are unsure about any step in this procedure, contact the IBM Support for
assistance with the recovering process.
Prerequisites
The ceph-test and rsync packages are installed on the OSD and Monitor nodes.
Procedure
1. Mount all disks with Ceph data to a temporary location. Repeat this step for all OSD nodes.
Example
Syntax
Syntax
Replace OSD_ID with a numeric, space-separated list of Ceph OSD IDs on the OSD node.
Syntax
Replace OSD_ID with a numeric, space-separated list of Ceph OSD IDs on the OSD node.
IMPORTANT: Due to a bug that causes the update-mon-db command to use additional db and db.slow directories for the
Monitor database, you must also copy these directories. To do so:
a. Prepare a temporary location outside the container to mount and access the OSD database and extract the OSD maps
needed to restore the Ceph Monitor:
Syntax
Replace OSD-DATA with the Volume Group (VG) or Logical Volume (LV) path to the OSD data and OSD-ID with the ID of
the OSD.
Syntax
Replace BLUESTORE-DATABASE with the Volume Group (VG) or Logical Volume (LV) path to the BlueStore database and
OSD-ID with the ID of the OSD.
2. Use the following commands from the Ceph Monitor node with the corrupted store. Repeat them for all OSDs on all nodes.
Example
rm -rf $ms
rm -rf $db
rm -rf $db_slow
ssh -t $host <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
ceph-objectstore-tool --type bluestore --data-path \$osd --op update-mon-db --mon-store-path $ms
Example
c. Move all sst files from the db and db.slow directories to the temporary location:
Example
Example
NOTE: After using this command, only keyrings extracted from the OSDs and the keyring specified on the ceph-monstore-
tool command line are present in Ceph’s authentication database. You have to recreate or import all other keyrings, such as
clients, Ceph Manager, Ceph Object Gateway, and others, so those clients can access the cluster.
e. Back up the corrupted store. Repeat this step for all Ceph Monitor nodes:
Syntax
mv /var/lib/ceph/mon/ceph-HOSTNAME/store.db /var/lib/ceph/mon/ceph-HOSTNAME/store.db.corrupted
Replace HOSTNAME with the host name of the Ceph Monitor node.
f. Replace the corrupted store. Repeat this step for all Ceph Monitor nodes:
Syntax
g. Change the owner of the new store. Repeat this step for all Ceph Monitor nodes:
Syntax
Replace HOSTNAME with the host name of the Ceph Monitor node.
Example
Syntax
ceph -s
Replace HOSTNAME with the host name of the Ceph Monitor node.
6. Import the Ceph Manager keyring and start all Ceph Manager processes:
Syntax
Replace HOSTNAME with the host name of the Ceph Manager node.
Example
Example
Reference
For details on registering Ceph nodes to the Content Delivery Network (CDN), see Registering the IBM Storage Ceph nodes to
the CDN and attaching subscriptions section in the IBM Storage Ceph Installation Guide.
Prerequisites
Verify your network connection. See Troubleshooting networking issues for details.
Verify that Monitors have a quorum by using the ceph health command. If the command returns a health status
(HEALTH_OK, HEALTH_WARN, or HEALTH_ERR), the Monitors are able to form a quorum. If not, address any Monitor problems
first. See Troubleshooting Ceph Monitors for details. For details about ceph health see Understanding Ceph health.
Optionally, stop the rebalancing process to save time and resources. See Stopping and starting rebalancing for details.
Full OSDs
Edit online
The ceph health detail command returns an error message similar to the following one:
Ceph prevents clients from performing I/O operations on full OSD nodes to avoid losing data. It returns the HEALTH_ERR full
osds message when the cluster reaches the capacity set by the mon_osd_full_ratio parameter. By default, this parameter is set
to 0.95 which means 95% of the cluster capacity.
Scale the cluster by adding a new OSD node. This is a long-term solution recommended by IBM.
Reference
See Nearfull OSDs in IBM Storage Ceph Troubleshooting Guide for details.
See Deleting data from a full storage cluster in IBM Storage Ceph Troubleshooting Guide for details.
Backfillfull OSDs
Edit online
The ceph health detail command returns an error message similar to the following one:
health: HEALTH_WARN
3 backfillfull osd(s)
Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
When one or more OSDs have exceeded the backfillfull threshold, Ceph prevents data from rebalancing to this device. This is an early
warning that rebalancing might not complete and that the cluster is approaching full. The default for the backfillfull threshold is
90%.
ceph df
If %RAW USED is above 70-75%, you can carry out one of the following actions:
Scale the cluster by adding a new OSD node. This is a long-term solution recommended by IBM.
Increase the backfillfull ratio for the OSDs that contain the PGs stuck in backfill_toofull to allow the recovery
process to continue. Add new storage to the cluster as soon as possible or remove data to prevent filling more OSDs.
Syntax
Example
References
Nearfull OSDs
Edit online
The ceph health detail command returns an error message similar to the following one:
Ceph returns the nearfull osds message when the cluster reaches the capacity set by the mon_osd_nearfull_ratio
parameter. By default, this parameter is set to 0.85, which means 85% of the cluster capacity.
Ceph distributes data based on the CRUSH hierarchy in the best possible way but it cannot guarantee equal distribution. The main
causes of the uneven data distribution and the nearfull osds messages are:
The OSDs are not balanced among the OSD nodes in the cluster. That is, some OSD nodes host significantly more OSDs than
others, or the weight of some OSDs in the CRUSH map is not adequate to their capacity.
The Placement Group (PG) count is not appropriate for the number of OSDs, the use case, the target PGs per OSD, and the OSD
utilization.
2. Verify that you use CRUSH tunables optimal to the cluster version and adjust them if not.
b. To view how much space OSDs use on particular nodes, use the following command from the node containing nearfull
OSDs:
df
Reference
For details, see CRUSH Tunables section in the Storage Strategies Guide for IBM Storage Ceph 5.3 and How can I test the
impact CRUSH map tunable modifications will have on my PG distribution across OSDs in IBM Storage Ceph?.
Down OSDs
Edit online
The ceph health detail command returns an error similar to the following one:
One of the ceph-osd processes is unavailable due to a possible service failure or problems with communication with other OSDs. As
a consequence, the surviving ceph-osd daemons reported this failure to the Monitors.
If the ceph-osd daemon is not running, the underlying OSD drive or file system is either corrupted, or some other error, such as a
missing keyring, is preventing the daemon from starting.
In most cases, networking issues cause the situation when the ceph-osd daemon is running but still marked as down.
Replace OSD_NUMBER with the ID of the OSD that is down, for example:
a. If you are not able to start ceph-osd, follow the steps in The ceph-osd daemon cannot start.
b. If you are able to start the ceph-osd daemon but it is marked as down, follow the steps in The ceph-osd daemon is
running but still marked as down.
1. If you have a node containing a number of OSDs (generally, more than twelve), verify that the default maximum number of
threads (PID count) is sufficient. See Increasing the PID count for details.
2. Verify that the OSD data and journal partitions are mounted properly. You can use the ceph-volume lvm list command to
list all devices and volumes associated with the Ceph Storage Cluster and then manually inspect if they are mounted properly.
See the mount(8) manual page for details.
3. If you got the ERROR: missing keyring, cannot use cephx for authentication error message, the OSD is
missing a keyring.
4. If you got the ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1 error message, the
ceph-osd daemon cannot read the underlying file system. See the following steps for instructions on how to troubleshoot
and fix this error.
a. Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the
/var/log/ceph/ directory.
b. An EIO error message indicates a failure of the underlying disk. To fix this problem replace the underlying OSD disk. See
Replacing an OSD drive for details.
c. If the log includes any other FAILED assert errors, such as the following one, open a support ticket. See Contacting IBM
Support for service for details.
5. Check the dmesg output for the errors with the underlying file system or disk:
dmesg
a. The error -5 error message similar to the following one indicates corruption of the underlying XFS file system. For details
on how to fix this problem, see the What is the meaning of "xfs_log_force: error -5 returned"? solution on the IBM Customer
Portal.
b. If the dmesg output includes any SCSI error error messages, see the SCSI Error Codes Solution Finder solution to
determine the best way to fix the problem.
c. Alternatively, if you are unable to fix the underlying file system, replace the OSD drive. See Replacing an OSD drive for
details.
6. If the OSD failed with a segmentation fault, such as the following one, gather the required information and open a support
ticket. See Contacting IBM Support for service for details.
1. Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the
/var/log/ceph/ directory.
a. If the log includes error messages similar to the following ones, see Flapping OSDs.
b. If you see any other errors, open a support ticket. See Contacting IBM Support for service for details.
Reference
Flapping OSDs
Flapping OSDs
Edit online
The ceph -w | grep osds command shows OSDs repeatedly as down and then up again within a short period of time:
In addition, the Ceph log contains error messages similar to the following ones:
2022-05-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly
marked me down
Certain storage cluster operations, such as scrubbing or recovery, take an abnormal amount of time, for example, if you
perform these operations on objects with a large index or large placement groups. Usually, after these operations finish, the
flapping OSDs problem is solved.
Problems with the underlying physical hardware. In this case, the ceph health detail command also returns the slow
requests error message.
Ceph OSDs cannot manage situations where the private network for the storage cluster fails, or significant latency is on the public
client-facing network.
Ceph OSDs use the private network for sending heartbeat packets to each other to indicate that they are up and in. If the private
storage cluster network does not work properly, OSDs are unable to send and receive the heartbeat packets. As a consequence, they
report each other as being down to the Ceph Monitors, while marking themselves as up.
The following parameters in the Ceph configuration file influence this behavior:
NOTE: The flapping OSDs scenario does not include the situation when the OSD processes are started and then immediately killed.
1. Check the output of the ceph health detail command again. If it includes the slow requests error message, see Slow
requests or requests are blocked for details on how to troubleshoot this issue.
2. Determine which OSDs are marked as down and on what nodes they reside:
3. On the nodes containing the flapping OSDs, troubleshoot and fix any networking problems. For details, see Troubleshooting
networking issues.
4. Alternatively, you can temporarily force Monitors to stop marking the OSDs as down and up by setting the noup and nodown
flags:
IMPORTANT: Using the noup and nodown flags does not fix the root cause of the problem but only prevents OSDs from
flapping.
IMPORTANT: Flapping OSDs can be caused by MTU misconfiguration on Ceph OSD nodes, at the network switch level, or both. To
resolve the issue, set MTU to a uniform size on all storage cluster nodes, including on the core and access network switches with a
planned downtime. Do not tune osd heartbeat min size because changing this setting can hide issues within the network, and
it will not solve actual network inconsistency.
Reference
See Ceph heartbeat section in the IBM Storage Ceph Architecture Guide for details.
See Slow requests or requests are blocked section in the IBM Storage Ceph Troubleshooting Guide.
HEALTH_WARN 30 requests are blocked > 32 sec; 3 osds have slow requests
30 ops are blocked > 268435 sec
1 ops are blocked > 268435 sec on osd.11
1 ops are blocked > 268435 sec on osd.18
28 ops are blocked > 268435 sec on osd.39
3 osds have slow requests
In addition, the Ceph logs include an error message similar to the following ones:
2022-05-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-
An OSD with slow requests is any OSD that is not able to service the I/O operations per second (IOPS) in the queue within the time
defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
Problems with the network. These problems are usually connected with flapping OSDs. See Flapping OSDs for details.
System load
The following table shows the types of slow requests. Use the dump_historic_ops administration socket command to determine
the type of a slow request.
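The following minimal Python sketch summarizes that admin socket output by duration; osd.11 is an illustrative daemon name, the command must run on the node hosting that OSD, and the JSON field names may differ slightly between releases.
import json
import subprocess

# Query the admin socket of the OSD daemon for its recent slow operations.
raw = subprocess.check_output(["ceph", "daemon", "osd.11", "dump_historic_ops"])
report = json.loads(raw)

# Sort the recorded operations by duration to surface the slowest requests.
ops = sorted(report.get("ops", []), key=lambda op: op.get("duration", 0), reverse=True)
for op in ops[:5]:
    print(round(op.get("duration", 0), 3), "s", op.get("description", "")[:80])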
1. Determine if the OSDs with slow or block requests share a common piece of hardware, for example, a disk drive, host, rack, or
network switch.
a. Use the smartmontools utility to check the health of the disk or the logs to determine any errors on the disk.
b. Use the iostat utility to get the I/O wait report (%iowait) on the OSD disk to determine if the disk is under heavy load.
b. Use the netstat utility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any
networking issues.
4. If the OSDs share a rack, check the network switch for the rack. For example, if you use jumbo frames, verify that the NIC in
the path has jumbo frames set.
5. If you are unable to determine a common piece of hardware shared by OSDs with slow requests, or to troubleshoot and fix
hardware and networking problems, open a support ticket. See Contacting IBM support for service for details.
Reference
See Using the Ceph Administration Socket section in the IBM Storage Ceph Administration Guide for details.
NOTE: Placement groups within the stopped OSDs become degraded during troubleshooting and maintenance.
Prerequisites
Procedure
Example
Example
3. When you finish troubleshooting or maintenance, unset the noout flag to start rebalancing:
Example
Reference
Prerequisites
Procedure
Syntax
Replace PARTITION with the path to the partition on the OSD drive dedicated to OSD data. Specify the cluster name and the
OSD number.
Example
Syntax
Example
Reference
See Down OSDs in the IBM Storage Ceph Troubleshooting Guide for more details.
NOTE: Ceph can mark an OSD as down also as a consequence of networking or permissions problems.
Modern servers typically deploy with hot-swappable drives so you can pull a failed drive and replace it with a new one without
bringing down the node. The whole procedure includes these steps:
Prerequisites
Example
Example
3. Mark the OSD as out for the cluster to rebalance and copy its data to other OSDs.
Syntax
Example
NOTE: If the OSD is down, Ceph marks it as out automatically after 600 seconds when it does not receive any heartbeat
packet from the OSD based on the mon_osd_down_out_interval parameter. When this happens, other OSDs with copies
of the failed OSD data begin backfilling to ensure that the required number of copies exists within the cluster. While the cluster
is backfilling, the cluster will be in a degraded state.
Example
You should see the placement group states change from active+clean to active, some degraded objects, and finally
active+clean when migration completes.
Syntax
Example
Syntax
Example
See the documentation for the hardware node for details on replacing the physical drive.
1. If the drive is hot-swappable, replace the failed drive with a new one.
2. If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and
replace the physical drive. Consider preventing the cluster from backfilling. See Stopping and Starting Rebalancing chapter in
the IBM Storage Ceph Troubleshooting Guide for details.
3. When the drive appears under the /dev/ directory, make a note of the drive path.
4. If you want to add the OSD manually, find the OSD drive and format the disk.
1. Once the new drive is inserted, you can use the following options to deploy the OSDs:
The OSDs are deployed automatically by the Ceph Orchestrator if the --unmanaged parameter is not set.
Example
Deploy the OSDs on all the available devices with the unmanaged parameter set to true.
Example
Example
Example
Reference
See Deploying Ceph OSDs on all available devices section in the IBM Storage Ceph Operations Guide.
See Deploying Ceph OSDs on specific devices and hosts section in the IBM Storage Ceph Operations Guide.
See Down OSDs section in the IBM Storage Ceph Troubleshooting Guide.
Procedure
kernel.pid_max = 4194303
This procedure shows how to delete unnecessary data to fix this error.
NOTE: The mon_osd_full_ratio parameter sets the value of the full_ratio parameter when creating a cluster. You cannot
change the value of mon_osd_full_ratio afterward. To temporarily increase the full_ratio value, use the ceph osd
set-full-ratio command instead.
Prerequisites
Procedure
Example
IMPORTANT: IBM strongly recommends not setting the set-full-ratio to a value higher than 0.97. Setting this parameter
to a higher value makes the recovery process harder. As a consequence, you might not be able to recover full OSDs at all.
As soon as the cluster changes its state from full to nearfull, delete any unnecessary data.
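A minimal Python sketch of that temporary workaround is shown below, using the ceph osd set-full-ratio command described above; the 0.97 value is the maximum IBM recommends, and the default 0.95 is restored after the cleanup.
import subprocess

def run(*args):
    # Print and execute a ceph CLI command.
    print("+", " ".join(args))
    subprocess.check_call(args)

# Temporarily raise the full ratio so that clients can delete data again.
run("ceph", "osd", "set-full-ratio", "0.97")

# ... delete unnecessary data here and watch ceph -s until the cluster
# drops from full to nearfull ...

# Restore the default full ratio.
run("ceph", "osd", "set-full-ratio", "0.95")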
Reference
Full OSDs.
Nearfull OSDs.
NOTE: When the bucket sync status command reports that the bucket is behind on shards even though the data is consistent across the
multi-site configuration, performing additional writes to the bucket synchronizes the sync status reports and displays the message bucket is caught
up with source.
Prerequisites
meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
The shard of the mdlog was never created so there is nothing to sync.
Reference
Example
This command lists which log shards, if any, are behind their source zone.
NOTE: Sometimes you might observe recovering shards when running the radosgw-admin sync status command. For data
sync, there are 128 shards of replication logs that are each processed independently. If any of the actions triggered by these
replication log events result in any error from the network, storage, or elsewhere, those errors get tracked so the operation can retry
again later. While a given shard has errors that need a retry, radosgw-admin sync status command reports that shard as
recovering. This recovery happens automatically, so the operator does not need to intervene to resolve them.
If the sync status that you ran above reports that log shards are behind, run the following command, substituting the shard
ID for X.
Syntax
Example
The output lists which buckets are next to sync and which buckets, if any, are going to be retried due to previous errors.
Inspect the status of individual buckets with the following command, substituting the bucket id for X.
Syntax
The result shows which bucket index log shards are behind their source zone.
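A hedged example (the bucket name is a placeholder):
[ceph: root@host01 /]# radosgw-admin bucket sync status --bucket=mybucket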
A common sync error is EBUSY, which means that sync is already in progress, often on another gateway. Errors are written to the
sync error log, which can be read with the following command:
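A hedged example; the sync error log is typically read with:
[ceph: root@host01 /]# radosgw-admin sync error list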
The syncing process retries until it is successful. Errors can still occur that require intervention.
fetch_bytes measures the number of objects and bytes fetched by data sync.
Use the ceph --admin-daemon command to view the current metric data for the performance counters:
Syntax
Example
{
    "data-sync-from-us-west": {
        "fetch bytes": {
            "avgcount": 54,
            "sum": 54526039885
NOTE: You must run the ceph --admin-daemon command from the node running the daemon.
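A hedged sketch of the admin socket query (the socket path and the zone name are placeholders; the actual .asok file is located under /var/run/ceph/ on the node running the gateway):
[ceph: root@host01 /]# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.host01.rgw0.asok perf dump data-sync-from-us-west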
Reference
See Ceph performance counters in the IBM Storage Ceph Administration Guide for more information about performance
counters.
You can run the radosgw-admin data sync init command to synchronize data between the sites and then restart the Ceph
Object Gateway. This command does not touch any actual object data and initiates data sync for a specified source zone. It causes
the zone to restart a full sync from the source zone.
IMPORTANT: Contact IBM support before running the data sync init command to avoid data loss. If you run a full restart of
sync and there is a lot of data that needs to be synced from the source zone, the bandwidth consumption is high, so plan
accordingly.
NOTE: If a user accidentally deletes a bucket on the secondary site, you can use the metadata sync init command on the site to
synchronize data.
Prerequisites
Procedure
Example
Example
Example
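A hedged sketch of the full-resync flow described above (the source zone and the Ceph Object Gateway service name are placeholders):
[ceph: root@host01 /]# radosgw-admin data sync init --source-zone=us-west
[ceph: root@host01 /]# ceph orch restart rgw.myrgw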
Prerequisites
Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished.
In addition, you can list placement groups that are stuck in a state that is not optimal.
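A hedged example of listing stuck placement groups by state:
[ceph: root@host01 /]# ceph pg dump_stuck stale
[ceph: root@host01 /]# ceph pg dump_stuck inactive
[ceph: root@host01 /]# ceph pg dump_stuck unclean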
Reference
See Listing placement groups stuck in stale, inactive, or unclean state for details.
Prerequisites
The Monitor marks a placement group as stale when it does not receive any status update from the primary OSD of the placement
group’s acting set or when other OSDs report that the primary OSD is down.
Usually, PGs enter the stale state after you start the storage cluster and until the peering process completes. However, when the
PGs remain stale for longer than expected, it might indicate that the primary OSD for those PGs is down or not reporting PG
statistics to the Monitor. When the primary OSD storing stale PGs is back up, Ceph starts to recover the PGs.
The mon_osd_report_timeout setting determines how often OSDs report PG statistics to Monitors. By default, this parameter is
set to 0.5, which means that OSDs report the statistics every half a second.
1. Identify which PGs are stale and on what OSDs they are stored. The error message includes information similar to the
following example:
Example
2. Troubleshoot any problems with the OSDs that are marked as down.
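A hedged sketch of identifying stale placement groups and the OSDs that serve them (2.5 is a placeholder placement group ID):
[ceph: root@host01 /]# ceph health detail
[ceph: root@host01 /]# ceph pg map 2.5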
Reference
See the Monitoring Placement Group Sets section in the Administration Guide for IBM Storage Ceph 5.3.
In most cases, errors during scrubbing cause inconsistency within placement groups.
Example
Syntax
ceph pg deep-scrub ID
b. Search the output of the ceph -w for any messages related to that placement group:
Syntax
ceph -w | grep ID
4. If the output includes any error messages similar to the following ones, you can repair the inconsistent placement group.
Syntax
PG.ID shard OSD: soid OBJECT missing attr , missing attr _ATTRIBUTE_TYPE
PG.ID shard OSD: soid OBJECT digest 0 != known digest DIGEST, size 0 != known size SIZE
PG.ID shard OSD: soid OBJECT size 0 != known size SIZE
PG.ID deep-scrub stat mismatch, got MISMATCH
PG.ID shard OSD: soid OBJECT candidate had a read error, digest 0 != known digest DIGEST
5. If the output includes any error messages similar to the following ones, it is not safe to repair the inconsistent placement
group because you can lose data. Open a support ticket in this situation.
PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
Reference
See Listing placement group inconsistencies section in the IBM Storage Ceph Troubleshooting Guide.
See the Ceph data integrity section in the IBM Storage Ceph Architecture Guide.
See Scrubbing the OSD section in the IBM Storage Ceph Configuration Guide.
Ceph marks a placement group as unclean if it has not achieved the active+clean state for the number of seconds specified in
the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300
seconds.
If a placement group is unclean, it contains objects that are not replicated the number of times specified in the
osd_pool_default_size parameter. The default value of osd_pool_default_size is 3, which means that Ceph creates three
replicas.
Usually, unclean placement groups indicate that some OSDs might be down.
2. Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.
Reference
Ceph marks a placement group as inactive if it has not been active for the number of seconds specified in the
mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300
seconds.
Usually, inactive placement groups indicate that some OSDs might be down.
Reference
See Listing placement groups stuck in stale, inactive, or unclean state for details.
In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, a
failure of an OSD causes the peering failures.
Syntax
ceph pg ID query
Example
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2021-08-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2021-08-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2021-08-06 14:40:16.169513"}
]
}
The recovery_state section includes information on why the peering process is blocked.
If the output includes the peering is blocked due to down osds error message, see Down OSDs.
If you see any other error message, open a support ticket. See Contacting IBM Support for service for details.
Reference
See Ceph OSD peering section in the IBM Storage Ceph Administration Guide.
Unfound objects
Edit online
The ceph health command returns an error message similar to the following one, containing the unfound keyword:
An example situation
4. A peering process between osd.1 and osd.2 starts, and the objects missing on osd.1 are queued for recovery.
As a result, osd.1 knows that these objects exist, but there is no OSD that has a copy of the objects.
In this scenario, Ceph is waiting for the failed node to be accessible again, and the unfound objects block the recovery process.
Example
Syntax
ceph pg ID query
Replace ID with the ID of the placement group containing the unfound objects:
Example
The might_have_unfound section includes OSDs where Ceph tried to locate the unfound objects:
The already probed status indicates that Ceph cannot locate the unfound objects in that OSD.
The osd is down status indicates that Ceph cannot contact that OSD.
4. Troubleshoot the OSDs that are marked as down. See Down OSDs for details.
5. If you are unable to fix the problem that causes the OSD to be down, open a support ticket. See IBM support for details.
However, if a placement group stays in one of these states for a longer time than expected, it can be an indication of a larger
problem. The Monitors report when placement groups get stuck in a state that is not optimal.
The mon_pg_stuck_threshold option in the Ceph configuration file determines the number of seconds after which placement
groups are considered inactive, unclean, or stale.
The following table lists these states together with a short explanation.
Procedure
Example
Reference
See Placement Group States section in the IBM Storage Ceph Administration Guide.
Prerequisites
Procedure
Syntax
Example
Syntax
Example
The following fields are important to determine what causes the inconsistency:
name
The name of the object with inconsistent replicas.
nspace
The namespace that is a logical separation of a pool. It’s empty by default.
locator
The key that is used as the alternative of the object name for placement.
snap
The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes
its sequential ID.
version
The version ID of the object with inconsistent replicas. Each write operation to an object increments it.
errors
A list of errors that indicate inconsistencies between shards without determining which shard or shards are incorrect. See the
shard array to further investigate the errors.
data_digest_mismatch
The digest of the replica read from one OSD is different from the other OSDs.
size_mismatch
The size of a clone or the head object does not match the expectation.
read_error
This error indicates inconsistencies caused most likely by disk errors.
union_shard_error
The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate
that you have to compare the information from a faulty object with the information from the selected objects. See the shard array
to further investigate the errors.
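These fields come from the inconsistent-object listing. A hedged example of the command that produces this output (0.6 is a placeholder placement group ID):
[ceph: root@host01 /]# rados list-inconsistent-obj 0.6 --format=json-pretty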
Syntax
Example
ss_attr_missing
One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-
value pairs.
ss_attr_corrupted
One or more attributes fail to decode.
clone_missing
A clone is missing.
snapset_mismatch
The snapshot set is inconsistent by itself.
head_mismatch
The snapshot set indicates that head exists or not, but the scrub results report otherwise.
headless
The head of the snapshot set is missing.
size_mismatch
The size of a clone or the head object does not match the expectation.
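A hedged example of listing snapshot-set inconsistencies that report these errors (0.23 is a placeholder placement group ID):
[ceph: root@host01 /]# rados list-inconsistent-snapset 0.23 --format=json-pretty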
Reference
Do not repair the placement groups if the Ceph logs include the following errors:
PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
Open a support ticket instead. See Contacting IBM Support for service for details.
Prerequisites
Procedure
Syntax
ceph pg repair ID
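A hedged example (0.6 is a placeholder placement group ID):
[ceph: root@host01 /]# ceph pg repair 0.6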
Reference
The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
The pg_num and pgp_num parameters determine the PG count. These parameters are configured per each pool, and therefore, you
must adjust each pool with low PG count separately.
IMPORTANT: Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have
a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you will not be able to stop or
reverse the process and you must complete it. Consider increasing the PG count outside of business critical processing time
allocation, and alert all clients about the potential performance impact. Do not change the PG count if the cluster is in the
HEALTH_ERR state.
Prerequisites
Procedure
1. Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:
2. Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num
parameters.
3. Increase the pg_num value in small increments until you reach the desired value.
a. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you
determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Syntax
Specify the pool name and the new value, for example:
Example
Example
The PG states change from creating to active+clean. Wait until all PGs are in the active+clean state.
4. Increase the pgp_num value in small increments until you reach the desired value:
a. Determine the starting increment value. Use a very low value that is a power of two, and increase it when you
determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Syntax
Specify the pool name and the new value, for example:
The PG states change through peering, wait_backfill, backfilling, recover, and others. Wait until all PGs
are in the active+clean state.
5. Repeat the previous steps for all pools with insufficient PG count.
6. Set osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority to their default values:
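A hedged, consolidated sketch of the commands behind these steps (the pool name data, the target count 128, and the injected recovery values are placeholders; record the current values first, for example with ceph config show osd.0, so that you can restore them in step 6):
[ceph: root@host01 /]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
[ceph: root@host01 /]# ceph osd pool set data pg_num 128
[ceph: root@host01 /]# ceph osd pool set data pgp_num 128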
Reference
See Monitoring Placement Group Sets section in the IBM Storage Ceph Administration Guide.
IMPORTANT: Manipulating objects can cause unrecoverable data loss. Contact IBM support before using the ceph-objectstore-
tool utility.
Prerequisites
List objects
IMPORTANT: Manipulating objects can cause unrecoverable data loss. Contact IBM support before using the ceph-objectstore-
tool utility.
Prerequisites
Listing objects
Fixing lost objects
Listing objects
Edit online
The OSD can contain zero to many placement groups, and zero to many objects within a placement group (PG). The ceph-
objectstore-tool utility allows you to list objects stored within an OSD.
Prerequisites
Procedure
Syntax
Example
Syntax
Example
3. Identify all the objects within an OSD, regardless of their placement group:
Syntax
Example
Syntax
Example
Syntax
Example
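A hedged sketch of the listing operations (the OSD daemon must be stopped first; the data path and the placement group ID are placeholders, and in containerized deployments the command is typically run from within the OSD container or a shell with the OSD data path mounted):
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c --op list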
Prerequisites
Procedure
Syntax
Example
Syntax
Example
Syntax
Example
4. Use the ceph-objectstore-tool utility to fix lost and unfound objects. Select the appropriate circumstance:
Syntax
Example
Syntax
Example
Syntax
Example
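A hedged sketch of the fix-lost operation, run only under IBM support guidance as noted above (the data path and the placement group ID are placeholders):
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fix-lost --dry-run
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c --op fix-lost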
Remove an object
IMPORTANT: Manipulating objects can cause unrecoverable data loss. Contact IBM support before using the ceph-objectstore-
tool utility.
Prerequisites
IMPORTANT: Setting the bytes on an object can cause unrecoverable data loss. To prevent data loss, make a backup copy of the
object.
Prerequisites
Procedure
Syntax
Example
2. Find the object by listing the objects of the OSD or placement group (PG).
Syntax
Example
4. Before setting the bytes on an object, make a backup and a working copy of the object:
Syntax
Example
5. Edit the working copy object file and modify the object contents accordingly.
Syntax
Example
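A hedged sketch of backing up and writing back object bytes (the data path and placement group ID are placeholders; OBJECT stands for the JSON object identifier taken from the list operation):
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c OBJECT get-bytes > object.backup
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c OBJECT get-bytes > object.working-copy
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c OBJECT set-bytes < object.working-copy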
Removing an object
Edit online
Use the ceph-objectstore-tool utility to remove an object. By removing an object, its contents and references are removed
from the placement group (PG).
Prerequisites
Procedure
Syntax
Example
2. Remove an object:
Syntax
Example
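A hedged example (placeholders as in the previous sketches; OBJECT is the JSON object identifier from the list operation):
[ceph: root@host01 /]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c OBJECT remove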
Prerequisites
Procedure
Example
Syntax
Example
Syntax
Example
Prerequisites
Procedure
Syntax
Syntax
Example
Syntax
Example
Syntax
Example
Prerequisites
Procedure
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
Prerequisites
Procedure
Syntax
Example
Syntax
Example
Syntax
Example
Prerequisites
Procedure
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
Syntax
Example
Prerequisites
Procedure
Example
Example
mon: 5 daemons, quorum host01, host02, host04, host05 (age 30s), out of quorum: host07
Syntax
Example
Important: You get an error message if the monitor is in the same location as existing non-tiebreaker monitors:
Example
Error EINVAL: mon.host02 has location DC1, which matches mons host02 on the datacenter dividing bucket for stretch mode.
Syntax
Example
Syntax
Example
5. Once the monitor is removed from the host, redeploy the monitor:
Syntax
Example
Example
mon: 5 daemons, quorum host01, host02, host04, host05, host07 (age 15s)
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host02
disallowed_leaders host02
0: [v2:132.224.169.63:3300/0,v1:132.224.169.63:6789/0] mon.host02; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host07; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host03; crush_location {datacenter=DC2}
dumped monmap epoch 19
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host07"
Prerequisites
Procedure
Syntax
Example
Note: The new monitor has to be in a different location than existing non-tiebreaker monitors.
Example
Syntax
Example
Example
mon: 6 daemons, quorum host01, host02, host04, host05, host06 (age 30s), out of quorum: host07
Syntax
Example
Syntax
Example
Example
epoch 19
fsid 1234ab78-1234-11ed-b1b1-de456ef0a89d
last_changed 2023-01-17T04:12:05.709475+0000
created 2023-01-16T05:47:25.631684+0000
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon host06
disallowed_leaders host06
0: [v2:213.222.226.50:3300/0,v1:213.222.226.50:6789/0] mon.host06; crush_location {datacenter=DC3}
1: [v2:220.141.179.34:3300/0,v1:220.141.179.34:6789/0] mon.host04; crush_location {datacenter=DC2}
2: [v2:40.90.220.224:3300/0,v1:40.90.220.224:6789/0] mon.host01; crush_location {datacenter=DC1}
3: [v2:60.140.141.144:3300/0,v1:60.140.141.144:6789/0] mon.host02; crush_location {datacenter=DC1}
4: [v2:186.184.61.92:3300/0,v1:186.184.61.92:6789/0] mon.host05; crush_location {datacenter=DC2}
dumped monmap epoch 19
Syntax
Example
[ceph: root@host01 /]# ceph orch apply mon --placement="host01, host02, host04, host05, host06"
Prerequisites
Procedure
Example
NOTE: The recovery state puts the cluster in the HEALTH_WARN state.
2. When in recovery mode, the cluster should go back into normal stretch mode after the placement groups are healthy. If that
does not happen, you can force the stretch cluster into the healthy mode:
Example
NOTE: You can also run this command if you want to force the cross-data-center peering early and you are willing to risk data
downtime, or you have verified separately that all the placement groups can peer, even if they are not fully recovered. You
might also wish to invoke the healthy mode to remove the HEALTH_WARN state, which is generated by the recovery state.
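A hedged example of the command generally used to force the healthy state (confirm with IBM support before running it):
[ceph: root@host01 /]# ceph osd force_healthy_stretch_mode --yes-i-really-mean-it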
Prerequisites
Prerequisites
Procedure
2. Ideally, attach an sosreport to the ticket. See the What is an sosreport and how to create one in Red Hat Enterprise Linux?
solution for details.
3. If the Ceph daemons fail with a segmentation fault, consider generating a human-readable core dump file. See Generating
readable core dump files for details.
Such information speeds up the initial investigation. Also, the Support Engineers can compare the information from the core dump
files with known IBM Storage Ceph cluster issues.
Prerequisites
RHEL 8:
RHEL 9:
Once the repository is enabled, you can install the debug info packages that you need from this list of supported packages:
ceph-base-debuginfo
ceph-common-debuginfo
ceph-debugsource
ceph-fuse-debuginfo
ceph-immutable-object-cache-debuginfo
ceph-mds-debuginfo
ceph-mgr-debuginfo
ceph-mon-debuginfo
ceph-osd-debuginfo
ceph-radosgw-debuginfo
cephfs-mirror-debuginfo
Ensure that the gdb package is installed and if it is not, install it:
Example
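A hedged example of installing gdb together with a subset of the debuginfo packages listed above (choose the packages that match the failing daemon):
# dnf install gdb ceph-base-debuginfo ceph-common-debuginfo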
When a Ceph process terminates unexpectedly due to a SIGILL, SIGTRAP, SIGABRT, or SIGSEGV error.
Manually, for example when debugging issues such as Ceph processes consuming high CPU cycles or not responding.
Prerequisites
Ensure that the host has at least 8 GB of RAM. If there are multiple daemons on the host, IBM recommends more RAM.
Procedure
1. If a Ceph process terminates unexpectedly due to the SIGILL, SIGTRAP, SIGABRT, or SIGSEGV error:
a. Set the core pattern to the systemd-coredump service on the node where the container with the failed Ceph process is
running:
Example
b. Watch for the next container failure due to a Ceph process and search for the core dump file in the
/var/lib/systemd/coredump/ directory:
Example
2. To manually capture a core dump file for the Ceph Monitors and Ceph OSDs:
Syntax
podman ps
podman exec -it MONITOR_ID_OR_OSD_ID bash
Example
Example
Syntax
Replace PROCESS with the name of the running process, for example ceph-mon or ceph-osd.
Example
Syntax
gcore ID
Replace ID with the ID of the process that you got from the previous step, for example 18110:
Example
e. Verify that the core dump file has been generated correctly.
Example
f. Copy the core dump file outside of the Ceph Monitor container:
Syntax
Replace MONITOR_ID with the ID number of the Ceph Monitor and replace MONITOR_PID with the process ID number.
Example
Example
Syntax
Example
d. Exit the cephadm shell and log in to the host where the daemons are deployed:
Example
Example
Example
Example
Syntax
gcore PID
Example
i. Verify that the core dump file has been generated correctly.
Example
Syntax
Replace DAEMON_ID with the ID number of the Ceph daemon and replace PID with the process ID number.
4. To allow systemd-coredump to successfully store the core dump for a crashed ceph daemon:
a. Set DefaultLimitCORE to infinity in /etc/systemd/system.conf to allow core dump collection for a crashed process:
Syntax
# cat /etc/systemd/system.conf
DefaultLimitCORE=infinity
Syntax
c. Verify the core dump files associated with previous daemon crashes:
Syntax
# ls -ltr /var/lib/systemd/coredump/
5. Upload the core dump file for analysis to an IBM Support case. See Providing information to IBM Support engineers for details.
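A condensed, hedged sketch of the manual capture flow from steps 2 and 3 above (the container name, process ID, and paths are placeholders; gcore writes core.PID into the current working directory inside the container):
# podman ps
# podman exec -it ceph-mon-host01 /bin/bash
[root@host01 /]# dnf install procps-ng gdb
[root@host01 /]# ps aux | grep ceph-mon
[root@host01 /]# gcore 18110
[root@host01 /]# exit
# podman cp ceph-mon-host01:/core.18110 /tmp/mon.core.18110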
Table 1. Monitor
Health Code Description
DAEMON_OLD_VERSION Warns if an old version of Ceph is running on any daemon. Generates a health error if multiple
versions are detected.
MON_DOWN One or more Ceph Monitor daemons are currently down.
MON_CLOCK_SKEW The clocks on the nodes running the ceph-mon daemons are not sufficiently well synchronized. Resolve
it by synchronizing the clocks using ntpd or chrony.
Related information
Edit online
Product guides, other publications, websites, blogs, and videos that contain information related to IBM Storage Ceph.
IBM publications
IBM Storage Insights documentation
IBM Spectrum Protect documentation
External publications
Product Documentation for Red Hat Ceph Storage
Product Documentation for Red Hat Enterprise Linux
Product Documentation for Red Hat OpenShift Data Foundation
Product Documentation for Red Hat OpenStack Platform
Red Hat's Ceph team is moving to IBM, on Ceph Blog. Last updated 4 Oct 2022.
Acknowledgments
Edit online
The Ceph Storage project is seeing amazing growth in the quality and quantity of contributions from individuals and organizations in
the Ceph community. We would like to thank all members of the IBM Storage Ceph team, all of the individual contributors in the
Ceph community, and additionally, but not limited to, the contributions from organizations such as:
Intel®
Fujitsu®
UnitedStack
Yahoo™
Ubuntu Kylin
Mellanox®
CERN™
Deutsche Telekom