Page 1
ZFS
The Last Word In File Systems
Jeff Bonwick Bill Moore
www.opensolaris.org/os/community/zfs
Page 2
ZFS Overview
- Pooled storage
  - Completely eliminates the antique notion of volumes
  - Does for storage what VM did for memory
- Transactional object system
  - Always consistent on disk: no fsck, ever
  - Universal: file, block, iSCSI, swap ...
- Provable end-to-end data integrity
  - Detects and corrects silent data corruption
  - Historically considered too expensive; no longer true
- Simple administration
  - Concisely express your intent
Page 3
Trouble with Existing Filesystems
- No defense against silent data corruption
  - Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently; like running a server without ECC memory
- Brutal to manage
  - Labels, partitions, volumes, provisioning, grow/shrink, /etc files ...
  - Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots ...
  - Different tools to manage file, block, iSCSI, NFS, CIFS ...
  - Not portable between platforms (x86, SPARC, PowerPC, ARM ...)
- Dog slow
  - Linear-time create, fat locks, fixed block size, naïve prefetch, dirty region logging, painful RAID rebuilds, growing backup time
Page 4
ZFS Objective
Page 5
Why Volumes Exist
- Hard: redesign filesystems to solve these problems well
- Easy: insert a shim ("volume") to cobble disks together
- Filesystem and volume manager sold as separate products
- Inherent problems in the FS/volume interface can't be fixed
[Figure: FS on a bare 1G disk; FS on a 2G concat volume (lower 1G + upper 1G); FS on a 2G stripe volume (even 1G + odd 1G); FS on a 1G mirror volume (left 1G + right 1G)]
Page 6
FS/Volume Model vs. Pooled Storage
FS/volume model:
- Abstraction: virtual disk
- Partition/volume for each FS
- Grow/shrink by hand
- Each FS has limited bandwidth
- Storage is fragmented, stranded
ZFS pooled storage:
- Abstraction: malloc/free
- No partitions to manage
- Grow/shrink automatically
- All bandwidth always available
- All storage in the pool is shared
[Figure: three separate FS-on-volume stacks vs. several ZFS filesystems drawing from a single shared storage pool]
Page 7
FS/volume I/O stack:
- FS: "Write this block, then that block, ..."
  - Loss of power = loss of on-disk consistency
  - Workaround: journaling, which is slow & complex
- Volume: write each block to each disk immediately to keep mirrors in sync
  - Loss of power = resync
  - Synchronous and slow
ZFS I/O stack:
- ZPL (ZFS POSIX Layer)
- DMU (Data Management Unit)
  - Atomic for the entire group
  - Always consistent on disk
  - No journal: not needed
- SPA (Storage Pool Allocator)
  - Schedule, aggregate, and issue I/O at will
  - No resync if power lost
  - Runs at platter speed
Page 8
Universal Storage
- ZFS dataset = up to 2^48 objects, each up to 2^64 bytes
- Snapshots, compression, encryption, end-to-end data integrity
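The "universal" part includes block datasets (zvols) used for swap or iSCSI. An illustrative sketch; the pool and volume names are placeholders, and the Solaris swap command and zvol device path are shown only as an example:

    # Create a 10GB block-device dataset (zvol)
    zfs create -V 10G tank/vol0
    # A zvol can back swap or an iSCSI target; Solaris zvol device path shown
    swap -a /dev/zvol/dsk/tank/vol0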
Page 9
Copy-On-Write Transactions
[Figure: copy-on-write transaction: 1. Initial block tree; 2. COW some blocks]
Page 10
Constant-Time Snapshots
- Actually cheaper to take a snapshot than not!
[Figure: the snapshot root and the live root both point into a shared block tree]
Page 11
ZFS snapshots (birth times):
- O(1) per-snapshot space overhead
- Unlimited snapshots
- Birth-time-pruned tree walk is O(Δ)
- Generates semantically rich object delta
- Can generate delta since any point in time
Traditional snapshots (bitmaps):
- Snapshot bitmap comparison is O(N)
- Generates unstructured block delta
- Requires some prior snapshot to exist
[Figure: block tree labeled with per-block birth times (19, 25, 37); a birth-time-pruned walk visits only blocks newer than the snapshot of interest]
Page 12
- 1 in 10^14 bits (~12TB) for desktop-class drives
- 1 in 10^15 bits (~120TB) for enterprise-class drives (allegedly)
- Bad sector every 8-20TB in practice (desktop and enterprise)
- Drive capacities doubling every 12-18 months
- Number of drives per deployment increasing
- Result: rapid increase in error rates
- Both silent and noisy data corruption becoming more common
- Cheap flash storage will only accelerate this trend
Page 13
Measurements at CERN
- Write 1MB, sleep 1 second, etc. until 1GB has been written
- Read 1MB, verify, sleep 1 second, etc.
- Ran on 3000 rack servers with HW RAID card
- After 3 weeks, found 152 instances of silent data corruption
- HW RAID only detected noisy data errors
- Need end-to-end verification to catch silent data corruption
Page 14
End-to-End Data Integrity
Disk block checksums:
- Checksum stored with data block
- Any self-consistent block will pass
- Can't detect stray writes
- Inherent FS/volume interface limitation
ZFS checksums:
- Checksum stored in parent block pointer
- Fault isolation between data and checksum
- Checksum hierarchy forms a self-validating Merkle tree
- Catches phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, and accidental overwrite (the checksum algorithm is a per-dataset property; see the example below)
[Figure: checksum stored alongside the data block vs. block pointers that carry the address and checksum of each child block]
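A hedged illustration of choosing a stronger checksum for a dataset (pool and dataset names are placeholders):

    # Checksums are a per-dataset property; sha256 is one of the supported algorithms
    zfs set checksum=sha256 tank/important
    # Per-device checksum error counters are visible in pool status
    zpool status -v tank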
Page 15
Traditional Mirroring
[Figure: 1. Application issues a read. The mirror reads the first disk, which has a corrupt block; it can't tell.]
Page 16
Self-Healing Data in ZFS
[Figure: three panels of an application reading through a ZFS mirror; the checksum exposes the corrupt copy, and ZFS returns good data from the other disk and repairs the damaged block]
Page 17
Traditional RAID-4 and RAID-5
- Partial-stripe writes require a read-modify-write cycle:
  - Read old data and old parity (two synchronous disk reads)
  - Compute new parity = new data ^ old data ^ old parity
  - Write new data and new parity
- The RAID-5 write hole:
  - Loss of power between data and parity writes will corrupt data
  - Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)
Page 18
RAID-Z
- Variable block size: 512 to 128K
- Each logical block is its own stripe, so every write is a full-stripe write
- Eliminates read-modify-write (it's fast)
- Eliminates the RAID-5 write hole (no need for NVRAM)
[Figure: RAID-Z on-disk layout across five disks, indexed by disk LBA: variable-width stripes of data blocks (D) and parity blocks (P)]
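An illustrative way to create a RAID-Z pool (device names are placeholders; raidz2 is the double-parity variant):

    # Single-parity RAID-Z group of five disks
    zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0
    # Or double parity (raidz2) to survive two simultaneous disk failures
    # zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0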
Page 19
Traditional Resilvering
- Copy one disk to the other (or XOR them together) so all copies are self-consistent, even though they're all random garbage!
- Whole-disk copy, even if the volume is nearly empty
- No checksums or validity checks along the way
- No assurance of progress until 100% complete; your root directory may be the last block copied
Page 20
Smokin' Mirrors
- Top-down resilvering
  - ZFS resilvers the storage pool's block tree from the root down
  - Most important blocks first
  - Every single block copy increases the amount of discoverable data
- Only live blocks are copied
  - No time wasted copying free space
  - Zero time to initialize a new mirror or RAID-Z group
- Dirty time logging (DTL) for transient outages
  - ZFS records the transaction group window that the device missed
  - To resilver, ZFS walks the tree and prunes where birth time < DTL
  - A five-second outage takes five seconds to repair
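Resilvering is triggered by ordinary pool operations; an illustrative sketch with placeholder device names:

    # Attach a second device to form (or widen) a mirror; ZFS resilvers it top-down
    zpool attach tank c1t0d0 c2t0d0
    # Replace a failed disk; only live blocks are copied
    zpool replace tank c3t0d0 c4t0d0
    # Watch resilver progress
    zpool status tank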
Page 21
- With increasing error rates, multiple failures can exceed RAID's ability to recover
- Silent errors compound the issue
- The filesystem block tree can become compromised
- More important blocks should be more highly replicated
Page 22
Ditto Blocks
- Extra copies of a block are stored on different devices whenever possible, and at different places on the same device otherwise (e.g. laptop drive)
- In a multi-disk pool, ZFS survives any non-consecutive disk failures
- In a single-disk pool, ZFS survives loss of up to 1/8 of the platter
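The same protection can be requested for user data via the copies property; an illustrative, hedged sketch (dataset name is a placeholder):

    # Keep two copies of every block of user data in this dataset (valid values: 1, 2, 3)
    zfs set copies=2 tank/home/photos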
Page 23
Disk Scrubbing
- Scrubs all mirror copies, all RAID-Z parity, and all ditto blocks
- Verifies each copy against its 256-bit checksum
- Repairs data as it goes
- Minimally invasive
  - Low I/O priority ensures that scrubbing doesn't get in the way
  - User-defined scrub rates coming soon
  - Gradually scrub the pool over the course of a month, a quarter, etc.
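An illustrative scrub invocation (pool name is a placeholder):

    # Kick off a background scrub of every copy of every block in the pool
    zpool scrub tank
    # Check progress and any errors found or repaired
    zpool status tank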
Page 24
ZFS Scalability
- Immense capacity
  - Moore's Law: need 65th bit in 10-15 years
  - ZFS capacity: 256 quadrillion ZB (1ZB = 1 billion TB)
  - Exceeds quantum limit of Earth-based storage
    (Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000))
- No limits on files, directory entries, etc.
- No wacky knobs (e.g. inodes/cg)
- Concurrent everything
  - Byte-range locking: parallel read/write without violating POSIX
  - Parallel, constant-time directory operations
Page 25
ZFS Performance
- Copy-on-write design
  - Turns random writes into sequential writes
  - Intrinsically hot-spot-free
- Pipelined I/O
  - Fully scoreboarded 24-stage pipeline with I/O dependency graphs
  - Maximum possible I/O parallelism
  - Priority, deadline scheduling, out-of-order issue, sorting, aggregation
- Dynamic striping across all devices
- Intelligent prefetch
- Variable block size
Page 26
Dynamic Striping
- Writes: striped across all five mirrors
- Reads: wherever the data was written
- No need to migrate existing data
  - Old data striped across mirrors 1-4
  - New data striped across mirrors 1-5
  - COW gently reallocates old data
[Figure: ZFS filesystems on a pool of four mirrors, before and after a fifth mirror is added]
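The extra stripe appears when a device is added to the live pool; an illustrative sketch with placeholder device names:

    # Add a fifth mirror; new writes immediately stripe across all five
    zpool add tank mirror c5t0d0 c6t0d0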
Page 27
Intelligent Prefetch
- Multiple independent prefetch streams: crucial for any streaming service provider
[Figure: The Matrix (2 hours, 16 minutes), with Jeff at 0:07, Bill at 0:33, and Matt at 1:42]
- Great for HPC applications: ZFS understands the matrix multiply problem
Page 28
Variable Block Size
- Large blocks: less metadata, higher bandwidth
- Small blocks: more space-efficient for small objects
- Record-structured files (e.g. databases) have natural granularity; the filesystem must match it to avoid read/modify/write (see the recordsize sketch below)
- Per-object granularity: a 37k file consumes 37k, no wasted space
- Extents don't COW or checksum nicely (too big)
- Large blocks suffice to run disks at platter speed
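A hedged illustration of matching block size to a database's record size (dataset names and the 8K value are placeholders):

    # Match the dataset's block size to the database record size so a record
    # update never turns into a read-modify-write of a larger block
    zfs set recordsize=8k tank/db
    # Streaming files keep large blocks
    zfs set recordsize=128k tank/media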
Page 29
Built-in Compression
- Transparent to all other layers
- Each block compressed independently
- All-zero blocks converted into file holes
[Figure: DMU translations are all 128k; the SPA writes each block at its compressed physical size, e.g. 37k or 69k]
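Compression is a per-dataset property; an illustrative sketch (dataset name is a placeholder):

    # Enable transparent compression for new writes in this dataset
    zfs set compression=on tank/home
    # Report the achieved compression ratio
    zfs get compressratio tank/home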
Page 30
Built-in Encryption
http://www.opensolaris.org/os/project/zfs-crypto
Page 31
ZFS Administration
- Up to 2^48 datasets per pool: filesystems, iSCSI targets, swap, etc.
- Nothing to provision!
- Hierarchical, with inherited properties
  - Per-dataset policy: snapshots, compression, backups, quotas, etc.
  - Who's using all the space? du(1) takes forever, but df(1M) is instant
  - Manage logically related filesystems as a group
  - Inheritance makes large-scale administration a snap
- Policy follows the data (mounts, shares, properties, etc.)
- Delegated administration lets users manage their own data
- ZFS filesystems are cheap: use a ton, it's OK, really!
- Online everything
Page 32
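Illustrative pool and filesystem creation (standard ZFS CLI; the pool, dataset, and device names here are placeholders, not taken from the original slides):

    # One command creates the pool and mounts its root filesystem
    zpool create tank mirror c1t0d0 c2t0d0
    # Filesystems are as cheap as directories
    zfs create tank/home
    zfs set mountpoint=/export/home tank/home
    # Children inherit the mountpoint: this one lands at /export/home/bonwick
    zfs create tank/home/bonwick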
Page 33
Setting Properties
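Illustrative property settings showing per-dataset, inherited policy (names and values are placeholders):

    # Policy is set per dataset and inherited by descendants
    zfs set compression=on tank/home
    zfs set quota=10G tank/home/bonwick
    zfs set reservation=20G tank/home/moore
    # Inspect effective values across the hierarchy
    zfs get -r quota,reservation tank/home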
Page 34
ZFS Snapshots
- Read-only point-in-time copy of a filesystem
- Instantaneous creation, unlimited number
- No additional space used: blocks copied only when they change
- Accessible through .zfs/snapshot in root of each filesystem
  - Allows users to recover files without sysadmin intervention
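An illustrative snapshot workflow (dataset names are placeholders; the path assumes the /export/home mountpoint used above):

    # Take a snapshot; it consumes no extra space until blocks diverge
    zfs snapshot tank/home/bonwick@tuesday
    # Recover a single file straight out of the snapshot directory ...
    cp /export/home/bonwick/.zfs/snapshot/tuesday/foo.c /export/home/bonwick/
    # ... or roll the whole filesystem back to the snapshot
    zfs rollback tank/home/bonwick@tuesday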
Page 35
ZFS Clones
- Writable copy of a snapshot
- Instantaneous creation, unlimited number
- Ideal for storing many private copies of mostly-shared data
  - Software installations
  - Source code repositories
  - Diskless clients
  - Zones
  - Virtual machines
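An illustrative clone (dataset names are placeholders):

    # A clone is a writable filesystem that initially shares every block with its origin snapshot
    zfs snapshot tank/ws/base@golden
    zfs clone tank/ws/base@golden tank/ws/bonwick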
Page 36
- Powered by snapshots
  - Full backup: any snapshot
  - Incremental backup: any snapshot delta
  - Very fast delta generation: cost proportional to data changed
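An illustrative backup and remote-replication sketch (host and dataset names are placeholders):

    # Full backup: serialize a snapshot and receive it into another pool
    zfs send tank/fs@monday | ssh backuphost zfs receive tank2/fs
    # Incremental backup: send only the delta between two snapshots
    zfs send -i tank/fs@monday tank/fs@tuesday | ssh backuphost zfs receive tank2/fs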
Page 37
ZFS Data Migration
- Change server from x86 to SPARC, it just works
- Adaptive endianness: neither platform pays a tax
  - Writes always use native endianness, set bit in block pointer
  - Reads byteswap only if host endianness != block endianness
- Forget about device paths, config files, /etc/vfstab, etc.
- ZFS will share/unshare, mount/unmount, etc. as necessary
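Migration is just export and import; an illustrative sketch (pool name is a placeholder):

    # On the old host
    zpool export tank
    # Move the disks, then on the new host (even one of the opposite endianness)
    zpool import tank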
Page 38
Native CIFS (SMB) Support
- NT-style ACLs
  - Allow/deny with inheritance
- Essential for proper Windows interaction; simplifies domain consolidation
- Options to control:
  - Case-insensitivity
  - Non-blocking mandatory locks
  - Unicode normalization
  - Virus scanning
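A hedged sketch of the corresponding per-dataset options; the property names below are the OpenSolaris-era ZFS properties and should be treated as assumptions to verify for a given release:

    # Case sensitivity and Unicode normalization are fixed at dataset creation
    zfs create -o casesensitivity=mixed -o normalization=formD tank/cifs
    # Non-blocking mandatory locks and virus scanning are switchable properties
    zfs set nbmand=on tank/cifs
    zfs set vscan=on tank/cifs
    # Share the dataset over SMB
    zfs set sharesmb=on tank/cifs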
Page 39
- Secure: local zones cannot even see physical devices
- Fast: snapshots and clones make zone creation instant
[Figure: one pool (tank) owned by the global zone, with datasets tank/a, tank/b1, tank/b2, and tank/c delegated to zones A, B, and C]
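A hedged sketch of delegating a dataset to a zone (zone and dataset names are placeholders; zonecfg shown in its interactive form):

    # In the global zone: create the dataset, then delegate it
    zfs create tank/b1
    zonecfg -z zoneb
    zonecfg:zoneb> add dataset
    zonecfg:zoneb:dataset> set name=tank/b1
    zonecfg:zoneb:dataset> end
    zonecfg:zoneb> commit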
Page 40
ZFS Root
- Checksums, compression, replication, snapshots, clones
- Boot from any dataset
- Take snapshot, apply patch ... rollback if there's a problem
- Create clone (instant), upgrade, boot from clone
- No extra partition
- ZFS can easily create multiple boot environments
- GRUB can easily manage them
Page 41
- ZFS was designed to run in either user or kernel context
- Nightly ztest program does all of the following in parallel:
  - Read, write, create, and delete files and directories
  - Create and destroy entire filesystems and storage pools
  - Turn compression on and off (while filesystem is active)
  - Change checksum algorithm (while filesystem is active)
  - Add and remove devices (while pool is active)
  - Change I/O caching and scheduling policies (while pool is active)
  - Scribble random garbage on one side of live mirror to test self-healing data
  - Force violent crashes to simulate power loss, then verify pool integrity
- Probably more abuse in 20 seconds than you'd see in a lifetime
- ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block
Page 42
ZFS Summary
End the Suffering. Free Your Mind.
- Simple
- Powerful
- Safe
- Fast
- Open: http://www.opensolaris.org/os/community/zfs
- Free
Page 43
- ZFS internals (snapshots, RAID-Z, dynamic striping, etc.)
- Using iSCSI, CIFS, Zones, databases, remote replication and more
- Latest news on pNFS, Lustre, and ZFS crypto projects
- ZFS ports
  - Apple Mac: http://developer.apple.com/adcnews
  - FreeBSD: http://wiki.freebsd.org/ZFS
  - Linux/FUSE: http://zfs-on-fuse.blogspot.com
  - As an appliance: http://www.nexenta.com
Page 44
ZFS
The Last Word In File Systems
Jeff Bonwick Bill Moore
www.opensolaris.org/os/community/zfs