ZFS is a new file system that provides pooled storage, transactional semantics, and end-to-end data integrity. It eliminates volumes and provides consistent storage that can detect and correct silent data corruption. ZFS uses copy-on-write transactions, constant-time snapshots, and self-healing data replication to provide a robust, scalable, and portable file system without the problems of traditional file systems.

ZFS
The Last Word In File Systems
Jeff Bonwick Bill Moore
www.opensolaris.org/os/community/zfs

ZFS Overview

- Pooled storage
  - Completely eliminates the antique notion of volumes
  - Does for storage what VM did for memory
- Transactional object system
  - Always consistent on disk: no fsck, ever
  - Universal: file, block, iSCSI, swap ...
- Provable end-to-end data integrity
  - Detects and corrects silent data corruption
  - Historically considered too expensive; no longer true
- Simple administration
  - Concisely express your intent

Trouble with Existing Filesystems

- No defense against silent data corruption
  - Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently; like running a server without ECC memory
- Brutal to manage
  - Labels, partitions, volumes, provisioning, grow/shrink, /etc files ...
  - Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots ...
  - Different tools to manage file, block, iSCSI, NFS, CIFS ...
  - Not portable between platforms (x86, SPARC, PowerPC, ARM ...)
- Dog slow
  - Linear-time create, fat locks, fixed block size, naïve prefetch, dirty region logging, painful RAID rebuilds, growing backup time

ZFS Objective: End the Suffering

- Figure out why storage has gotten so complicated
- Blow away 20 years of obsolete assumptions
- Design an integrated system from scratch

Why Volumes Exist

- In the beginning, each filesystem managed a single disk. It wasn't very big.
- Customers wanted more space, bandwidth, reliability
  - Hard: redesign filesystems to solve these problems well
  - Easy: insert a shim (volume) to cobble disks together
- An industry grew up around the FS/volume model
  - Filesystem and volume manager sold as separate products
  - Inherent problems in the FS/volume interface can't be fixed

[Diagram: FS on a 1G disk; FS on a 2G concat volume (lower 1G + upper 1G); FS on a 2G stripe volume (even 1G + odd 1G); FS on a 1G mirror volume (left 1G + right 1G)]

FS/Volume Model vs. Pooled Storage

Traditional Volumes
- Abstraction: virtual disk
- Partition/volume for each FS
- Grow/shrink by hand
- Each FS has limited bandwidth
- Storage is fragmented, stranded

ZFS Pooled Storage
- Abstraction: malloc/free
- No partitions to manage
- Grow/shrink automatically
- All bandwidth always available
- All storage in the pool is shared

[Diagram: three FS/volume pairs, each bound to its own disks, vs. three ZFS filesystems drawing from one shared ZFS storage pool]

FS/Volume Interfaces vs. ZFS

FS/Volume I/O Stack
- Block device interface (FS to volume)
  - "Write this block, then that block, ..."
  - Loss of power = loss of on-disk consistency
  - Workaround: journaling, which is slow and complex
- Block device interface (volume to disk)
  - Write each block to each disk immediately to keep mirrors in sync
  - Loss of power = resync
  - Synchronous and slow

ZFS I/O Stack
- Object-based transactions (ZPL: ZFS POSIX Layer)
  - "Make these 7 changes to these 3 objects"
  - Atomic (all-or-nothing)
- Transaction group commit (DMU: Data Management Unit)
  - Atomic for entire group
  - Always consistent on disk
  - No journal; not needed
- Transaction group batch I/O (SPA: Storage Pool Allocator)
  - Schedule, aggregate, and issue I/O at will
  - No resync if power lost
  - Runs at platter speed

Universal Storage

- DMU is a general-purpose transactional object store
  - ZFS dataset = up to 2^48 objects, each up to 2^64 bytes
- Key features common to all datasets
  - Snapshots, compression, encryption, end-to-end data integrity
- Any flavor you want: file, block, object, network (sketch below)

[Diagram: Local, NFS, and CIFS clients sit atop the ZFS POSIX Layer; iSCSI, raw, swap, dump, UFS(!), pNFS, Lustre, and databases sit atop the ZFS Volume Emulator; both run on the Data Management Unit (DMU) over the Storage Pool Allocator (SPA)]
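For example, one pool can serve filesystems and block devices side by side. A minimal sketch (the dataset name tank/vol and the 10g size are illustrative; shareiscsi was the OpenSolaris-era property for exporting a volume as an iSCSI target):

  # zfs create -V 10g tank/vol
  # zfs set shareiscsi=on tank/vol

  Note: the first command creates a block device under /dev/zvol/dsk/tank/vol; the second exports it to iSCSI initiators.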

Copy-On-Write Transactions

1. Initial block tree
2. COW some blocks
3. COW indirect blocks
4. Rewrite uberblock (atomic)

[Diagram: four panels showing the block tree at each step; modified blocks are written to new locations, and the uberblock is rewritten last in a single atomic update]

Bonus: Constant-Time Snapshots

- At end of TX group, don't free COWed blocks
  - Actually cheaper to take a snapshot than not!
- The tricky part: how do you know when a block is free?

[Diagram: snapshot root and live root sharing the unmodified blocks of the tree]

Traditional Snapshots vs. ZFS

Per-Snapshot Bitmaps
- Block allocation bitmap for every snapshot
- O(N) per-snapshot space overhead
- Limits number of snapshots
- O(N) create, O(N) delete, O(N) incremental
- Snapshot bitmap comparison is O(N)
- Generates unstructured block delta
- Requires some prior snapshot to exist

ZFS Birth Times
- Each block pointer contains child's birth time
- O(1) per-snapshot space overhead
- Unlimited snapshots
- O(1) create, O(Δ) delete, O(Δ) incremental
- Birth-time-pruned tree walk is O(Δ)
- Generates semantically rich object delta
- Can generate delta since any point in time

[Diagram: block trees for snapshot 1, snapshot 2, snapshot 3, and the live FS, with block birth times 19, 25, and 37; a summary table maps each block number to its birth time]

Trends in Storage Integrity

- Uncorrectable bit error rates have stayed roughly constant
  - 1 in 10^14 bits (~12TB) for desktop-class drives
  - 1 in 10^15 bits (~120TB) for enterprise-class drives (allegedly)
  - Bad sector every 8-20TB in practice (desktop and enterprise)
- Drive capacities doubling every 12-18 months
- Number of drives per deployment increasing
- Result: rapid increase in error rates
- Both silent and noisy data corruption becoming more common
- Cheap flash storage will only accelerate this trend

Measurements at CERN

- Wrote a simple application to write/verify a 1GB file
  - Write 1MB, sleep 1 second, etc. until 1GB has been written
  - Read 1MB, verify, sleep 1 second, etc.
- Ran on 3000 rack servers with HW RAID cards
- After 3 weeks, found 152 instances of silent data corruption
  - Previously thought everything was fine
- HW RAID only detected noisy data errors
- Need end-to-end verification to catch silent data corruption

End-to-End Data Integrity in ZFS

Disk Block Checksums
- Checksum stored with data block
- Any self-consistent block will pass
- Can't detect stray writes
- Inherent FS/volume interface limitation
- Disk checksum only validates the media:
  - Detects: bit rot
  - Misses: phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite

ZFS Data Authentication
- Checksum stored in parent block pointer
- Fault isolation between data and checksum
- Checksum hierarchy forms self-validating Merkle tree
- ZFS validates the entire I/O path (property example below):
  - Detects: bit rot, phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite

[Diagram: on the left, each data block carries its own checksum; on the right, each block pointer stores the address and checksum of its children, so every read is verified against its parent]
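The checksum algorithm is an ordinary per-dataset property; a sketch (sha256 is one of the supported values, alongside the default fletcher algorithms):

  # zfs set checksum=sha256 tank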

Traditional Mirroring

1. Application issues a read. Mirror reads the first disk, which has a corrupt block. It can't tell.
2. Volume manager passes the bad block up to the filesystem. If it's a metadata block, the filesystem panics. If not ...
3. Filesystem returns bad data to the application.

[Diagram: three panels of the application / FS / xxVM mirror stack, showing the corrupt block travel from disk to application undetected]

Self-Healing Data in ZFS

1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns known good data to the application and repairs the damaged block (see the status check below).

[Diagram: three panels of the application / ZFS mirror stack, showing the bad copy detected, the good copy returned, and the damaged block rewritten]
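Repairs like this are visible to the administrator; a sketch (pool name as in the administration examples later in this deck):

  # zpool status -v tank

  Note: the CKSUM column counts checksum errors ZFS detected on each device; with redundancy, those blocks are healed from a good copy.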

Traditional RAID-4 and RAID-5

- Several data disks plus one parity disk
  - Parity is the XOR of the data: D0 ^ D1 ^ D2 ^ D3 ^ P = 0
- Fatal flaw: partial stripe writes
  - Parity update requires read-modify-write (slow)
    - Read old data and old parity (two synchronous disk reads)
    - Compute new parity = new data ^ old data ^ old parity
    - Write new data and new parity
  - Suffers from the write hole:
    - Loss of power between data and parity writes will corrupt data; the stripe's parity = garbage
    - Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)
- Can't detect or correct silent data corruption

RAID-Z

- Dynamic stripe width
  - Variable block size: 512 bytes to 128K
  - Each logical block is its own stripe
- All writes are full-stripe writes
  - Eliminates read-modify-write (it's fast)
  - Eliminates the RAID-5 write hole (no need for NVRAM)
- Both single- and double-parity (creation commands below)
- Detects and corrects silent data corruption
  - Checksum-driven combinatorial reconstruction
- No special hardware; ZFS loves cheap disks

[Diagram: disk LBA rows 0-12 across five disks, showing variable-width stripes of parity (P) and data (D) blocks packed edge to edge; an X marks blocks on a failed device]
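Creating RAID-Z pools uses the same zpool syntax as mirrors; a sketch with illustrative device names (raidz is single-parity, raidz2 double-parity):

  # zpool create tank raidz c1d0 c2d0 c3d0 c4d0
  # zpool create tank raidz2 c1d0 c2d0 c3d0 c4d0 c5d0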

Traditional Resilvering

- Creating a new mirror (or RAID stripe):
  - Copy one disk to the other (or XOR them together) so all copies are self-consistent, even though they're all random garbage!
- Replacing a failed device:
  - Whole-disk copy, even if the volume is nearly empty
  - No checksums or validity checks along the way
  - No assurance of progress until 100% complete; your root directory may be the last block copied
- Recovering from a transient outage:
  - Dirty region logging: slow, and easily defeated by random writes

Smokin' Mirrors

- Top-down resilvering (replacement sketch below)
  - ZFS resilvers the storage pool's block tree from the root down
  - Most important blocks first
  - Every single block copy increases the amount of discoverable data
- Only copy live blocks
  - No time wasted copying free space
  - Zero time to initialize a new mirror or RAID-Z group
- Dirty time logging (for transient outages)
  - ZFS records the transaction group window that the device missed
  - To resilver, ZFS walks the tree and prunes where birth time < DTL
  - A five-second outage takes five seconds to repair
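Replacing a device kicks off exactly this kind of resilver; a sketch with illustrative device names:

  # zpool replace tank c2d0 c6d0
  # zpool status tank

  Note: zpool status reports resilver progress as ZFS walks the live block tree.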

Surviving Multiple Data Failures

- With increasing error rates, multiple failures can exceed RAID's ability to recover
  - With a big enough data set, it's inevitable
- Silent errors compound the issue
- Filesystem block tree can become compromised
- More important blocks should be more highly replicated
  - Small cost in space and bandwidth

[Diagram: a filesystem block tree with blocks marked good, damaged, and undiscoverable; damage near the root makes whole subtrees undiscoverable]

Ditto Blocks

- Data replication above and beyond mirror/RAID-Z
  - Each logical block can have up to three physical copies
    - Different devices whenever possible
    - Different places on the same device otherwise (e.g. laptop drive)
- All ZFS metadata: 2+ copies
  - Small cost in latency and bandwidth (metadata is about 1% of data)
- Explicitly settable for precious user data (example below)
- Detects and corrects silent data corruption
  - In a multi-disk pool, ZFS survives any non-consecutive disk failures
  - In a single-disk pool, ZFS survives loss of up to 1/8 of the platter
- ZFS survives failures that send other filesystems to tape
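Setting extra copies for precious user data is a one-line property change; a sketch (dataset name as in the administration examples later in this deck):

  # zfs set copies=2 tank/home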

Disk Scrubbing

- Finds latent errors while they're still correctable (commands below)
  - ECC memory scrubbing for disks
- Verifies the integrity of all data
  - Traverses pool metadata to read every copy of every block
    - All mirror copies, all RAID-Z parity, and all ditto blocks
  - Verifies each copy against its 256-bit checksum
  - Repairs data as it goes
- Minimally invasive
  - Low I/O priority ensures that scrubbing doesn't get in the way
  - User-defined scrub rates coming soon
    - Gradually scrub the pool over the course of a month, a quarter, etc.
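Scrubs are started and monitored with zpool; a sketch (pool name as in the later examples):

  # zpool scrub tank
  # zpool status tank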

ZFS Scalability

- Immense capacity (128-bit)
  - Moore's Law: need 65th bit in 10-15 years
  - ZFS capacity: 256 quadrillion ZB (1 ZB = 1 billion TB)
  - Exceeds quantum limit of Earth-based storage
    - Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)
- 100% dynamic metadata
  - No limits on files, directory entries, etc.
  - No wacky knobs (e.g. inodes/cg)
- Concurrent everything
  - Byte-range locking: parallel read/write without violating POSIX
  - Parallel, constant-time directory operations

ZFS Performance

- Copy-on-write design
  - Turns random writes into sequential writes
  - Intrinsically hot-spot-free
- Pipelined I/O
  - Fully scoreboarded 24-stage pipeline with I/O dependency graphs
  - Maximum possible I/O parallelism
  - Priority, deadline scheduling, out-of-order issue, sorting, aggregation
- Dynamic striping across all devices
- Intelligent prefetch
- Variable block size

Dynamic Striping

- Automatically distributes load across all devices
- Before adding a device (pool of four mirrors):
  - Writes: striped across all four mirrors
  - Reads: wherever the data was written
  - Block allocation policy considers: capacity, performance (latency, bandwidth), health (degraded mirrors)
- After adding mirror 5:
  - Writes: striped across all five mirrors
  - Reads: wherever the data was written
  - No need to migrate existing data
    - Old data striped across mirrors 1-4
    - New data striped across mirrors 1-5
    - COW gently reallocates old data

[Diagram: the same ZFS storage pool before and after "zpool add" of mirror 5]

Intelligent Prefetch

- Multiple independent prefetch streams
  - Crucial for any streaming service provider
  - Example: The Matrix (2 hours, 16 minutes) streamed to Jeff at 0:07, Bill at 0:33, Matt at 1:42
- Automatic length and stride detection
  - Great for HPC applications
  - ZFS understands the matrix multiply problem: row-major access over column-major storage
    - Detects any linear access pattern
    - Forward or backward
  - Example: "The Matrix" as a 10M x 10M array

Variable Block Size

- No single block size is optimal for everything
  - Large blocks: less metadata, higher bandwidth
  - Small blocks: more space-efficient for small objects
  - Record-structured files (e.g. databases) have natural granularity; the filesystem must match it to avoid read-modify-write (recordsize example below)
- Why not arbitrary extents?
  - Extents don't COW or checksum nicely (too big)
  - Large blocks suffice to run disks at platter speed
- Per-object granularity
  - A 37k file consumes 37k; no wasted space
- Enables transparent block-based compression
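Matching a database's natural granularity is a per-dataset property; a sketch (the dataset name tank/db and the 8k value are illustrative; set recordsize before loading data):

  # zfs create tank/db
  # zfs set recordsize=8k tank/db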

Built-in Compression

- Block-level compression in the SPA
  - Transparent to all other layers
  - Each block compressed independently
  - All-zero blocks converted into file holes
- LZJB and GZIP available today; more on the way (example below)

[Diagram: DMU translations are all 128k; SPA block allocations vary with compression, e.g. 128k, 37k, 69k]
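Turning compression on and checking the result; a sketch (the dataset name is illustrative; compression=on selects LZJB, and compressratio is a read-only property):

  # zfs set compression=gzip tank/fs
  # zfs get compressratio tank/fs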

Built-in Encryption

- Project page: http://www.opensolaris.org/os/project/zfs-crypto

ZFS Administration

- Pooled storage: no more volumes!
  - Up to 2^48 datasets per pool: filesystems, iSCSI targets, swap, etc.
  - Nothing to provision!
- Filesystems become administrative control points
  - Hierarchical, with inherited properties
    - Per-dataset policy: snapshots, compression, backups, quotas, etc.
    - Who's using all the space? du(1) takes forever, but df(1M) is instant
    - Manage logically related filesystems as a group
    - Inheritance makes large-scale administration a snap
  - Policy follows the data (mounts, shares, properties, etc.)
  - Delegated administration lets users manage their own data
  - ZFS filesystems are cheap; use a ton, it's OK, really!
- Online everything

Creating Pools and Filesystems

- Create a mirrored pool named "tank"

  # zpool create tank mirror c2d0 c3d0

- Create home directory filesystem, mounted at /export/home

  # zfs create tank/home
  # zfs set mountpoint=/export/home tank/home

- Create home directories for several users

  # zfs create tank/home/ahrens
  # zfs create tank/home/bonwick
  # zfs create tank/home/billm

  Note: automatically mounted at /export/home/{ahrens,bonwick,billm} thanks to inheritance

- Add more space to the pool

  # zpool add tank mirror c4d0 c5d0
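To confirm what was built, two read-only commands suffice; a sketch (output omitted):

  # zpool status tank
  # zfs list -r tank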

Setting Properties

- Automatically NFS-export all home directories

  # zfs set sharenfs=rw tank/home

- Turn on compression for everything in the pool

  # zfs set compression=on tank

- Limit Eric to a quota of 10g

  # zfs set quota=10g tank/home/eschrock

- Guarantee Tabriz a reservation of 20g

  # zfs set reservation=20g tank/home/tabriz
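Because properties inherit down the dataset tree, a recursive get shows each dataset's value and where it came from; a sketch:

  # zfs get -r compression tank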

ZFS Snapshots

- Read-only point-in-time copy of a filesystem
  - Instantaneous creation, unlimited number
  - No additional space used; blocks copied only when they change
  - Accessible through .zfs/snapshot in root of each filesystem
    - Allows users to recover files without sysadmin intervention
- Take a snapshot of Mark's home directory

  # zfs snapshot tank/home/marks@tuesday

- Roll back to a previous snapshot

  # zfs rollback tank/home/perrin@monday

- Take a look at Wednesday's version of foo.c

  $ cat ~maybee/.zfs/snapshot/wednesday/foo.c
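To see which snapshots exist under a given filesystem; a sketch:

  # zfs list -t snapshot -r tank/home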

ZFS Clones

- Writable copy of a snapshot
  - Instantaneous creation, unlimited number
- Ideal for storing many private copies of mostly-shared data
  - Software installations
  - Source code repositories
  - Diskless clients
  - Zones
  - Virtual machines
- Create a clone of your OpenSolaris source code

  # zfs clone tank/solaris@monday tank/ws/lori/fix
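If a clone becomes the version you want to keep, it can swap roles with the filesystem its origin snapshot came from; a sketch using the clone above:

  # zfs promote tank/ws/lori/fix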

ZFS Send / Receive

- Powered by snapshots
  - Full backup: any snapshot
  - Incremental backup: any snapshot delta
  - Very fast delta generation; cost proportional to data changed
- So efficient it can drive remote replication
- Generate a full backup

  # zfs send tank/fs@A >/backup/A

- Generate an incremental backup

  # zfs send -i tank/fs@A tank/fs@B >/backup/B-A

- Remote replication: send incremental once per minute

  # zfs send -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs receive -d /tank/fs
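Restoring is the mirror image; a sketch (recreates the dataset under a new name from the full backup above):

  # zfs receive tank/fs2 </backup/A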

ZFS Data Migration

- Host-neutral on-disk format
  - Change server from x86 to SPARC, it just works
  - Adaptive endianness: neither platform pays a tax
    - Writes always use native endianness, set bit in block pointer
    - Reads byteswap only if host endianness != block endianness
- ZFS takes care of everything
  - Forget about device paths, config files, /etc/vfstab, etc.
  - ZFS will share/unshare, mount/unmount, etc. as necessary
- Export pool from the old server

  old# zpool export tank

- Physically move disks and import pool to the new server

  new# zpool import tank
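Run with no pool name, import scans the attached devices and lists every pool available for import; a sketch:

  new# zpool import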

Native CIFS (SMB) Support

- NT-style ACLs
  - Allow/deny with inheritance
- True Windows SIDs, not just POSIX UID mapping
  - Essential for proper Windows interaction
  - Simplifies domain consolidation
- Options to control:
  - Case-insensitivity
  - Non-blocking mandatory locks
  - Unicode normalization
  - Virus scanning
- Simultaneous NFS and CIFS client access

ZFS and Zones (Virtualization)

- Secure: local zones cannot even see physical devices
- Fast: snapshots and clones make zone creation instant
- Dataset: logical resource in a local zone (configuration sketch below)
- Pool: physical resource in the global zone

[Diagram: Zone A (tank/a), Zone B (tank/b1, tank/b2), and Zone C (tank/c) each see only their own datasets; the pool "tank" lives in the global zone]
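Delegating a dataset to a zone is part of the zone's configuration; a sketch (the zone name zonea is illustrative; add dataset is the zonecfg resource that hands tank/a to the zone):

  global# zonecfg -z zonea
  zonecfg:zonea> add dataset
  zonecfg:zonea:dataset> set name=tank/a
  zonecfg:zonea:dataset> end
  zonecfg:zonea> commit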

ZFS Root

- Brings all the ZFS goodness to /
  - Checksums, compression, replication, snapshots, clones
  - Boot from any dataset
- Patching becomes safe
  - Take snapshot, apply patch ... rollback if there's a problem
- Live upgrade becomes fast
  - Create clone (instant), upgrade, boot from clone
- Based on new Solaris boot architecture
  - No extra partition
  - ZFS can easily create multiple boot environments
  - GRUB can easily manage them

ZFS Test Methodology

- A product is only as good as its test suite
  - ZFS was designed to run in either user or kernel context
  - Nightly ztest program does all of the following in parallel:
    - Read, write, create, and delete files and directories
    - Create and destroy entire filesystems and storage pools
    - Turn compression on and off (while filesystem is active)
    - Change checksum algorithm (while filesystem is active)
    - Add and remove devices (while pool is active)
    - Change I/O caching and scheduling policies (while pool is active)
    - Scribble random garbage on one side of live mirror to test self-healing data
    - Force violent crashes to simulate power loss, then verify pool integrity
  - Probably more abuse in 20 seconds than you'd see in a lifetime
- ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block

ZFS Summary

End the Suffering. Free Your Mind.

- Simple
  - Concisely expresses the user's intent
- Powerful
  - Pooled storage, snapshots, clones, compression, scrubbing, RAID-Z
- Safe
  - Detects and corrects silent data corruption
- Fast
  - Dynamic striping, intelligent prefetch, pipelined I/O
- Open
  - http://www.opensolaris.org/os/community/zfs
- Free

Where to Learn More

- Community: http://www.opensolaris.org/os/community/zfs
- Wikipedia: http://en.wikipedia.org/wiki/ZFS
- ZFS blogs: http://blogs.sun.com/main/tags/zfs
  - ZFS internals (snapshots, RAID-Z, dynamic striping, etc.)
  - Using iSCSI, CIFS, Zones, databases, remote replication and more
  - Latest news on pNFS, Lustre, and ZFS crypto projects
- ZFS ports
  - Apple Mac: http://developer.apple.com/adcnews
  - FreeBSD: http://wiki.freebsd.org/ZFS
  - Linux/FUSE: http://zfs-on-fuse.blogspot.com
  - As an appliance: http://www.nexenta.com


ZFS
The Last Word In File Systems
Jeff Bonwick Bill Moore
www.opensolaris.org/os/community/zfs
