Ceph at CSC
#whoami
Karan Singh
• ISO27001:2013 Certification
More Information
o https://www.csc.fi/
o https://research.csc.fi/cloud-computing
CSC Cloud Offering
• Pouta Cloud Service [IaaS]
o cPouta - public cloud, general purpose
o ePouta - public cloud, purpose-built for sensitive data
Our Need for Ceph
• To build our own storage, not to buy a black box
• Remove the SPOF for storage in OpenStack
Storage Complexity
[Diagram: storage for Nova instances before Ceph: local disk on the OpenStack compute nodes; LUNs served from an enterprise array through Gateway-1 and Gateway-2; and NFS exported from the OpenStack controller]
This is why we chose Ceph
http://www.slideshare.net/ircolle/what-is-a-ceph-and-why-do-i-care-openstack-storage-colorado-openstack-meetup-october-14-2014
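For context, what this buys us in practice is that Cinder volumes and Nova ephemeral disks live directly in Ceph pools instead of behind LUN gateways or NFS mounts. A minimal sketch of the relevant OpenStack options; the pool names, user name and secret UUID are illustrative, not our exact configuration:

# cinder.conf -- RBD backend (example pool/user/secret values)
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>

# nova.conf, [libvirt] section -- boot instances straight from RBD
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf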
Ceph Infrastructure
ePouta Cloud Service
• Ansible
o End-to-end system configuration
o Network, kernel, packages, OS tuning, NTP
o Metric collection, monitoring, central logging, etc.
o Entire Ceph deployment
o Day-to-day system / Ceph administration (see the example runs below)
• Version Control
o Git, GitHub
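A couple of examples of what the Ansible layer looks like in use; the inventory path and the mons/osds group names are assumptions for illustration, and ceph-ansible's site.yml stands in for whichever playbook drives the full deployment:

# Ad-hoc checks across the fleet (assumed groups: mons, osds)
ansible all  -i inventory -m setup -a 'filter=ansible_kernel'   # confirm kernel versions everywhere
ansible mons -i inventory -m shell -a 'ceph -s'                 # cluster status from a monitor node
ansible osds -i inventory -m shell -a 'ceph --version'          # check every OSD node runs the same release
# Full end-to-end deployment / re-run of the configuration
ansible-playbook -i inventory site.yml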
Live Demo
Near Future
• CSC Espoo DC [ePouta Cloud Storage]
o Next 8-12 months -> 3 PB raw
o Introduction to storage POD layout for scalability & better failure domain
o Dedicated Monitor node
o SSD Journals
o Erasure Coding (see the sketch after this list)
• Miscellaneous
o Multi DC replication [Espoo – Kajaani]
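Of the items above, erasure coding is the one that is pure CLI work. A minimal sketch using Hammer-era syntax; the profile name, k/m values and PG counts are illustrative only:

# Create an erasure-code profile and an EC pool backed by it (example values)
ceph osd erasure-code-profile set ec-example k=4 m=2 ruleset-failure-domain=host
ceph osd erasure-code-profile get ec-example
ceph osd pool create ecpool 256 256 erasure ec-example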
Long Term
Build a Ceph environment that is
• Multi-petabyte (~10 PB usable)
• Hyper-scalable
• Multi-rack fault tolerant Storage PODs (a CRUSH sketch follows below)
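Rack-level fault tolerance ultimately comes down to the CRUSH map. A sketch of the commands involved; the bucket, host and rule names are placeholders, not our actual map:

# Model racks in CRUSH so they can act as the failure domain (placeholder names)
ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move storage-node-01 rack=rack1
# Replicated rule that spreads copies across racks rather than hosts
ceph osd crush rule create-simple replicated-racks default rack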
Disks, Nodes, Racks
[Diagram: disks aggregate into storage nodes, storage nodes aggregate into racks]
Storage POD in action
Some Recommendations
• Monitor Nodes
o Use dedicated monitor nodes; avoid sharing them with OSDs
o Use SSD for Ceph Monitor LevelDB
• OSD nodes
o Avoid overloading your SSD journals; you might not get the performance you expect (see the sketch after this list)
o Node preference:
o #1 Thin node (10-16 disks)
o #2 Thick node (16-30 disks)
o #3 Fat node (> 30 disks)
o If using fat nodes, use several of them
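On the journal point, two things we find useful; the device paths are examples and ceph-disk is the era-appropriate provisioning tool:

# Watch the shared journal SSD while the cluster is loaded; sustained high %util / await means it is the bottleneck
iostat -x 5
# When preparing an OSD, pass the journal device explicitly and keep the journals-per-SSD ratio modest
ceph-disk prepare /dev/sdd /dev/nvme0n1     # data disk first, journal device second (example paths)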
Operational Experience
• Use dedicated disks for the OS, OSD data & OSD journal (the journal can be shared)
• Plan your requirements well and choose the PG count wisely for a production cluster (see the sketch after this list)
o Increasing the PG count is one of the most intensive operations
o Decreasing the PG count is not allowed
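A sketch of the PG arithmetic and the commands involved; the pool name and counts are illustrative:

# Common rule of thumb: (number of OSDs * 100) / replica size, rounded up to a power of two
# e.g. 144 OSDs with size 3 -> 4800 -> pg_num 8192 (or 4096 to stay conservative)
ceph osd pool create volumes 4096 4096      # pg_num and pgp_num set at creation time
ceph osd pool get volumes pg_num
ceph osd pool set volumes pg_num 8192       # increases cause heavy data movement; raise in steps
ceph osd pool set volumes pgp_num 8192      # there is no way to shrink pg_num on these releases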
Operational Experience
• If you are seeing blocked ops / slow OSDs / slow requests, don't worry, you are not alone
o ceph health detail -> find the OSD -> find the node -> check "EVERYTHING" on that node -> mark it out (commands sketched below)
o If the problem shows up on most of the nodes -> check the "NETWORK"
o Interface errors, MTU, configuration, network blocking, architecture, switch logs, removing an interface, bonding
o Even a cable change worked for us (we upgraded the switch firmware and the old cable type became unsupported)
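The triage loop from the first bullet, spelled out as commands; the OSD id 42 is a placeholder:

ceph health detail | grep -i block          # which OSDs report slow / blocked requests
ceph osd find 42                            # map that OSD id to its host and CRUSH location
# on that host: dmesg, SMART, disk latency, NIC counters -- check "EVERYTHING"
ceph osd out 42                             # mark the OSD out once the disk or node is confirmed bad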
• Ceph recovery/backfilling can starve your clients of IO; you may want to throttle it, for example:
ceph tell osd.\* injectargs '--osd_recovery_max_active 1 --osd_recovery_max_single_start 1 --osd_recovery_op_priority 50 --osd_recovery_max_chunk 1048576 --osd_recovery_threads 1 --osd_max_backfills 1 --osd_backfill_scan_min 4 --osd_backfill_scan_max 8'
Operational Experience
• Increasing the filestore max_sync and min_sync values helped to a certain extent (persisted as sketched below)
o filestore_max_sync_interval = 140
o filestore_min_sync_interval = 100
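To make those values survive restarts we would also carry them in ceph.conf; a sketch, with the runtime injection alongside:

# ceph.conf
[osd]
filestore_max_sync_interval = 140
filestore_min_sync_interval = 100

# and/or apply at runtime without restarting the OSDs
ceph tell osd.\* injectargs '--filestore_max_sync_interval 140 --filestore_min_sync_interval 100'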
THANK YOU