ZFS Overview
Oracle Confidential 3
ZFS Features
Why ZFS?
FS/Volume Model vs. Pooled Storage
Traditional Volumes              | ZFS Pooled Storage
Abstraction: virtual disk        | Abstraction: malloc/free
Partition/volume for each FS     | No partitions to manage
Grow/shrink by hand              | Grow/shrink automatically
Each FS has limited bandwidth    | All bandwidth always available
Storage is fragmented, stranded  | All storage in the pool is shared
Self-Healing Data in ZFS
1. Application issues a read. ZFS mirror tries the first disk. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. Checksum indicates that the block is good.
3. ZFS returns known good data to the application and repairs the damaged block.
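The three steps above can be sketched as a toy model. This is only an illustration, not ZFS code: the two-way mirror is a Python list, SHA-256 stands in for whichever checksum the dataset uses, and the names checksum and self_healing_read are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Stand-in checksum (SHA-256); an assumption for this sketch."""
    return hashlib.sha256(data).digest()


def self_healing_read(mirror, expected):
    """Return the first copy that verifies and rewrite every copy that does not."""
    for copy in mirror:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            # Repair step: overwrite each side that failed verification.
            for j, other in enumerate(mirror):
                if checksum(bytes(other)) != expected:
                    mirror[j] = bytearray(good)
            return good
    raise IOError("no valid copy on any side of the mirror")


block = b"important data"
expected = checksum(block)          # checksum stored apart from the data
mirror = [bytearray(b"imp0rtant data"), bytearray(block)]  # first side is corrupt
result = self_healing_read(mirror, expected)
```

After the call, result holds the good data and the first side of the toy mirror has been rewritten with it.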
ZFS Configurations
Single/multiple devices linear/striped configuration
• Disks are concatenated to create the pool
• Dynamic striping across all the disks
• No redundancy, no parity: one disk failure fails the pool
• Normal/improved performance, no fault tolerance
• Pool capacity is the sum of the sizes of all the disks
Mirror (single/multiple) devices configuration
• Pool with mirrored disks
• Data redundancy: if one disk fails, the other provides the data.
ZFS repairs damaged blocks on the bad disk automatically
• Normal performance, better fault tolerance
• Pool capacity is the size of the smallest disk in the mirror
RAID-Z configuration
• Pool with disks striped with distributed parity
• Data redundancy: as many disks may fail as the parity level allows (one for raidz1, two for raidz2, three for raidz3); the remaining disks provide the data.
ZFS repairs damaged blocks automatically
• Normal performance, great fault tolerance
• Pool capacity is roughly the size of all disks minus the parity disks
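The capacity rules for the three configurations can be compared with a small calculation. This is a rough sketch only: it ignores metadata, padding and reservation overhead, assumes RAID-Z is limited by its smallest disk, and the function names are hypothetical.

```python
def stripe_capacity(disks):
    """Striped/linear pool: the sum of all disk sizes."""
    return sum(disks)


def mirror_capacity(disks):
    """Mirror: the size of the smallest disk in the mirror."""
    return min(disks)


def raidz_capacity(disks, parity=1):
    """RAID-Z: all disks minus the parity disks (approximation)."""
    if len(disks) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disks) - parity) * min(disks)


disks = [100, 100, 100]                         # sizes in GB, illustrative
striped = stripe_capacity(disks)                # 300
mirrored = mirror_capacity(disks)               # 100
raidz1 = raidz_capacity(disks, parity=1)        # 200
raidz2 = raidz_capacity(disks, parity=2)        # 100
```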
Properties of zfs pool
/usr/sbin/zpool get all ( RO read-only; C create-time, I import-time )
• allocated (RO): Amount of space allocated
• capacity (RO): Percentage of total capacity used
• free (RO): Size corresponding to blocks not allocated in this pool.
• size (RO): Total size of pool.
• guid (RO): Globally unique ID for the pool
• health (RO): Pool status, for example “ONLINE”, “DEGRADED” or “FAULTED”
• altroot (C/I): path. Alternate root. The value is prefixed to the mount points. The pool gets no
cache file entry.
• readonly (C/I): [on|off]. Whether to import read-only.
Properties of zfs pool - 2
/usr/sbin/zpool get all ( RO read-only; C create-time, I import-time, A anytime )
• bootfs: <dataset name>. Bootable dataset for a BE. Used by beadm/LU programs.
• cachefile (C/I): path. The file used as the cache file for storing the pool config.
Default is /etc/zfs/zpool.cache. Use zdb -U <cachefile path> for a non-default file.
• failmode (A): [wait|continue|panic]. What to do in case of catastrophic pool failure.
• autoexpand (A): [on|off]. Adjust pool size upon LUN/slice expansion.
• listsnaps (A): [on|off]. Whether 'zfs list' lists snapshots.
• version (A): [current version <= num <= software SPA version]. Pool version, also called
SPA version. It must not exceed the software SPA version. 'zpool upgrade' is the preferred way to change it.
Properties of zfs dataset
/usr/sbin/zfs get all --- RO read-only; C create-time, M mount-time, A anytime
• available (RO): Amount of space available from the pool for this dataset
• compressratio (RO): Ratio of compression. See compression.
• mounted (RO): Whether the dataset is mounted
• type (RO): Whether this dataset is a filesystem / snapshot / volume / clone.
• origin (RO): The snapshot this clone (ZFS filesystem or ZVOL) originated from, if it is a clone.
• referenced (RO): Amount of space accessible by this dataset
• used (RO): Amount of space used by this dataset and its children.
• usedby* (RO): Breakdown of used: usedbychildren, usedbydataset, usedbyrefreservation, usedbysnapshots.
• volblocksize (C): size. Block size for volumes.
• recordsize (A): size. Block size for filesystems.
• readonly (C/M): [on|off]. Whether to mount read-only.
Properties of zfs dataset - 2
/usr/sbin/zfs get all --- RO read-only; C create-time, M mount-time, A anytime
• sharenfs (A): Publish the dataset as an NFS share
• sharesmb (A): Publish the dataset as an SMB share
• snapdir (A): [hidden|visible]. Default 'hidden'. Whether the .zfs directory (with snapshots under .zfs/snapshot) is visible.
• zoned (A): Whether or not the dataset is managed by a zone.
• encryption (C): [on|off|<value>]. If set, the kind of encryption to adopt (on | aes-128-ccm
| aes-192-ccm | aes-256-ccm | aes-128-gcm | aes-192-gcm | aes-256-gcm). 'on' means
aes-128-ccm.
• compression (A): [on|off|<value>]. The compression to adopt (on | off | lzjb |
gzip | gzip-N | zle). 'on' is lzjb and gzip is gzip-6. N is 1-9, 1 being fastest and 9 giving the best compression.
• checksum (A): [on|off|<value>]. If not set to off, the checksum algorithm to adopt (on |
off | fletcher2 | fletcher4 | sha256 | sha256+mac). 'on' is currently fletcher4. When
off, only user data is left unverified. The metadata is always verified.
• atime (A): [on|off]. Whether to update access time on file access.
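To make the checksum property above more concrete, here is a minimal sketch of a Fletcher-4 style checksum as commonly described for ZFS: four 64-bit running sums over 32-bit words. It is an illustration under stated assumptions (little-endian words, input length a multiple of 4 bytes), not the production implementation, and the function name is hypothetical.

```python
import struct

MASK64 = (1 << 64) - 1  # accumulators wrap at 64 bits


def fletcher4(data: bytes):
    """Return the four running sums (a, b, c, d) over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


sums = fletcher4(struct.pack("<2I", 1, 2))  # two words: 1 and 2
```

The four sums correspond to the four checksum words a block pointer can hold.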
Virtual Devices
Internal VDEVs or Logical VDEVs
– Logical or Internal
• Conceptual grouping of physical VDEVs
• Organized as a tree
[Figure: leaf VDEVs A (disk), B (disk), C (file), D (file)]
Physical VDEV Layout
[Figure: Label L0 (256K) contains the VDEV Configuration (112K) and an Array of 128 Uberblocks (128K)]
Label Layout: VDEV Configuration
[Figure: label layout, VDEV Configuration (112K) followed by the Array of 128 Uberblocks (128K)]
Label Layout: Uberblock – uberblock_t
[Figure: label layout, VDEV Configuration (112K) followed by the Array of 128 Uberblocks (128K)]
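The sizes on the label slides imply some simple arithmetic. The sketch below assumes that the uberblock array occupies the last 128K of the 256K label, that the VDEV configuration sits directly before it, and that the array is divided evenly into 128 slots; the constant and function names are hypothetical.

```python
KIB = 1024
LABEL_SIZE = 256 * KIB
VDEV_CONFIG_SIZE = 112 * KIB
UBERBLOCK_ARRAY_SIZE = 128 * KIB
UBERBLOCK_COUNT = 128

# Space at the start of the label that the slide does not break down.
LEADING_SIZE = LABEL_SIZE - VDEV_CONFIG_SIZE - UBERBLOCK_ARRAY_SIZE
# Assumption: the array is split evenly between the slots.
UBERBLOCK_SLOT_SIZE = UBERBLOCK_ARRAY_SIZE // UBERBLOCK_COUNT


def uberblock_slot_offset(slot: int) -> int:
    """Byte offset of an uberblock slot from the start of its label."""
    if not 0 <= slot < UBERBLOCK_COUNT:
        raise ValueError("slot out of range")
    return LEADING_SIZE + VDEV_CONFIG_SIZE + slot * UBERBLOCK_SLOT_SIZE


first = uberblock_slot_offset(0)     # 128K into the label
last = uberblock_slot_offset(127)    # final 1K of the label
```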
Block Pointer – blkptr_t
Locates a block of data
• ZFS deals with data in blocks.
• A blkptr locates a block on a vdev.
• Parameters to locate a block
– vdev id, asize and offset
– Called DVA, a 16-byte structure
– First 48 bytes form 3 DVAs
• Checksum type, compression type, obj type, block level, endianness, psize / lsize.
• Birth TXG, Physical Birth TXG.
• Fill count – No. of alloc'd blks accounted for
• Checksum of the content of the block

blkptr_t layout, one 64-bit word per row (bit positions 64 56 48 40 32 24 16 8):
0  VDEV 1 | ncpy|L4T | ASIZE
1  G| OFFSET 1
2  VDEV 2 | ncpy|L4T | ASIZE
3  G| OFFSET 2
4  VDEV 3 | ncpy|L4T | ASIZE
5  G| OFFSET 3
6  BDE| LVL | TYPE | CKSUM | COMP | PSIZE | LSIZE
7  PADDING
8  PADDING
9  PHYSICAL BIRTH TXG
A  BIRTH TXG
B  FILL COUNT
C  CHECKSUM[0]
D  CHECKSUM[1]
E  CHECKSUM[2]
F  CHECKSUM[3]
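The word-level layout above can be turned into a small decoder. This sketch only splits a 128-byte block pointer into its sixteen 64-bit words and groups them as on the slide; it does not decode the bit fields inside each word. Little-endian byte order is an assumption, and BlkptrWords and split_blkptr are hypothetical names.

```python
import struct
from typing import NamedTuple


class BlkptrWords(NamedTuple):
    dvas: tuple            # three (word0, word1) pairs, words 0-5
    props: int             # word 6: BDE, LVL, TYPE, CKSUM, COMP, PSIZE, LSIZE
    padding: tuple         # words 7-8
    phys_birth_txg: int    # word 9
    birth_txg: int         # word A
    fill_count: int        # word B
    checksum: tuple        # words C-F


def split_blkptr(raw: bytes) -> BlkptrWords:
    """Group the 16 words of a block pointer without decoding bit fields."""
    if len(raw) != 128:
        raise ValueError("a block pointer is 16 words of 8 bytes")
    w = struct.unpack("<16Q", raw)  # assumption: little-endian on disk
    return BlkptrWords(
        dvas=((w[0], w[1]), (w[2], w[3]), (w[4], w[5])),
        props=w[6],
        padding=(w[7], w[8]),
        phys_birth_txg=w[9],
        birth_txg=w[0xA],
        fill_count=w[0xB],
        checksum=w[0xC:0x10],
    )


# Synthetic input: word i simply holds the value i.
bp = split_blkptr(struct.pack("<16Q", *range(16)))
```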
Block Pointer
Seeing block pointers laid out for a plain file
Creating a file with 5 blocks of 128K (default in ZFS) each
# dd if=/dev/urandom of=/tp/file bs=131072 count=5
5+0 records in
5+0 records out
# ls -li /tp/file
8 -rw-r--r-- 1 root root 655360 Feb 5 11:51 /tp/file
(The inode field is the obj id in ZFS)
# zdb -ddddd tp 8
Dataset tp [ZPL], ID 18, cr_txg 1, 673K, 8 objects, rootbp DVA[0]=<0:b2c00:200:STD:1>
DVA[1]=<0:18012c00:200:STD:1> [L0 DMU objset] fletcher4 lzjb LE contiguous unique unencrypted 2-copy
size=800L/200P birth=37129L/37129P fill=8 cksum=157453ee91:765706dc8f2:15983eda56fd3:2c3a4cdaee8af9
(Summary of dataset)
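The rootbp line in the zdb output above can be picked apart with a short script. This sketch assumes that each DVA prints as vdev (decimal), offset (hex) and asize (hex), and that the L/P sizes are hex; the trailing STD:1 fields are left uninterpreted. The function names are hypothetical.

```python
import re

ZDB_TEXT = (
    "rootbp DVA[0]=<0:b2c00:200:STD:1> DVA[1]=<0:18012c00:200:STD:1> "
    "[L0 DMU objset] fletcher4 lzjb LE size=800L/200P "
    "birth=37129L/37129P fill=8"
)


def parse_dvas(text):
    """Return vdev, offset and asize for each DVA found in the text."""
    pattern = r"DVA\[(\d+)\]=<(\d+):([0-9a-f]+):([0-9a-f]+)"
    return [
        {
            "copy": int(m.group(1)),
            "vdev": int(m.group(2)),          # assumed decimal
            "offset": int(m.group(3), 16),    # assumed hex
            "asize": int(m.group(4), 16),     # assumed hex
        }
        for m in re.finditer(pattern, text)
    ]


def parse_sizes(text):
    """Return (lsize, psize) in bytes from the size=...L/...P field."""
    m = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", text)
    return int(m.group(1), 16), int(m.group(2), 16)


dvas = parse_dvas(ZDB_TEXT)
lsize, psize = parse_sizes(ZDB_TEXT)
```

Under these assumptions the objset block is 2048 bytes logically and 512 bytes physically, which is consistent with the lzjb compression reported on the same line.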
DMU Object
zdb -dd pool/dataset / zdb -MM objset pool
• Almost everything is stored on disk as objects (exceptions include the contents of labels)
• Examples: files, directories, datasets, lists of snapshots and arrays of objects
• A dnode of 512B defines an object.
• It defines the type of object, block size, levels of indirection, max. allocated blocks, etc.
• It has up to 3 block pointers, each pointing to either an indirect block or a data block.
• It optionally has one bonus block that contains additional information about the object.
• Type DMU_OT_DNODE (0xa) is a special object pointing to an array of dnodes
The dnode – dnode_phys_t
Identifies and defines each object
• dn_type and dn_bonustype, of type dmu_object_type_t, define what the object and the bonus block contain
• Shift for indirect block size (16K = 14)
• Can support up to 7 levels.
• Can have up to 3 block pointers, each pointing to either an indirect or a data block.
• dn_maxblkid is the id of the last block in L0
• Object size in bytes = (dn_maxblkid + 1) * (dn_datablkszsec * 512)

Fields shown on the slide:
uint8_t dn_type
uint8_t dn_indblkshift
uint8_t dn_nlevels
uint8_t dn_nblkptr
uint8_t dn_bonustype
uint16_t dn_datablkszsec
uint64_t dn_maxblkid
dn_blkptr[0] dn_blkptr[1] dn_blkptr[2] (variable size)
uint64_t dn_bonus[BONUSLEN]
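The object size formula can be checked against the 5-block file created earlier with dd. The sketch below also derives how many block pointers fit in one indirect block, assuming a block pointer is 128 bytes (16 words of 64 bits); the function names are hypothetical.

```python
SECTOR = 512          # dn_datablkszsec counts 512-byte sectors
BLKPTR_SIZE = 128     # assumption: 16 words of 8 bytes per block pointer


def object_size(dn_maxblkid: int, dn_datablkszsec: int) -> int:
    """Object size in bytes = (dn_maxblkid + 1) * (dn_datablkszsec * 512)."""
    return (dn_maxblkid + 1) * (dn_datablkszsec * SECTOR)


def blkptrs_per_indirect(dn_indblkshift: int) -> int:
    """Number of block pointers that fit in one indirect block."""
    return (1 << dn_indblkshift) // BLKPTR_SIZE


# Five 128K blocks: block ids 0..4, each block is 131072 / 512 = 256 sectors.
size = object_size(dn_maxblkid=4, dn_datablkszsec=131072 // SECTOR)
fanout = blkptrs_per_indirect(14)   # 16K indirect block
```

The result matches the 655360 bytes that ls reported for /tp/file.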
The dnode – dnode_phys_t – dn_type / dn_bonus_type
type of an object – what it holds
• Some dn_type values commonly encountered:
– ZAP object (many object types implemented as ZAP)
• DMU_OT_OBJECT_DIRECTORY
• DMU_OT_DSL_PROPS
• DMU_OT_DSL_DATASET_CHILD_MAP
– DSL Directory / DSL Dataset (only part of MOS)
• DMU_OT_DSL_DIR, Describes a DSL Directory
• DMU_OT_DSL_DATASET, Describes a DSL Dataset
– ZFS Plain File / ZFS Directory (not part of MOS)
• DMU_OT_PLAIN_FILE, regular files in file-system
• DMU_OT_DIRECTORY_CONTENTS, Directories in file-system, A ZAP Object
– Config (packed nvlist)
• DMU_OT_PACKED_NVLIST
The dnode – dnode_phys_t – bonus buffer
additional information about the object
• Additional information, like notes written on the back of a sheet.
– Usually the on-disk structure pertinent to the object
– Identified by a dmu_object_type_t
– If fits, will be embedded in the dnode_phys_t
– Else, bonus blkptr points to where bonus buffer is.
– Regular files and directories keep file/directory stat information here, such as
znode_phys_t or sa_header_phys_t
• dn_bonuslen defines how long the bonus buffer is
– Valid only if bonus buffer is embedded
[Figure: objset_phys_t structure, beginning with os_meta_dnode]
Bigger Picture
VDEV to Label to active uberblock to rootbp to objset to MDN to dnodes
[Figure: on-disk chain from the physical VDEV to the dnodes.
Physical VDEV: Label L0 256K (VDEV Configuration 112K, Array of 128 Uberblocks 128K), Label L1 256K, Boot Area 3.5M (14 x 256K), Label L2 256K, Label L3 256K.
Active uberblock: uberblock_t { ub_magic = 0x00bab10c, ub_version = SPA_VERSION, ub_txg = SYNC_TXG, … }
Objset: os_zil_header, os_type = DMU_OST_META
Dnode array: Dnode, Dnode, Dnode ... Dnode, Dnode (for example Obj id 64, 65, 66, 67 ... 94, 95)]
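Finding the active uberblock among the 128 slots can be sketched as follows. This is a simplification: it keeps only the slots whose magic number matches and picks the highest ub_txg, whereas a real implementation would also verify checksums and apply further tie-breaks. The Uberblock class and the function name are hypothetical.

```python
from dataclasses import dataclass

UB_MAGIC = 0x00BAB10C  # magic value shown on the slide


@dataclass
class Uberblock:
    ub_magic: int
    ub_txg: int


def active_uberblock(slots):
    """Pick the valid uberblock with the highest transaction group number."""
    valid = [ub for ub in slots if ub.ub_magic == UB_MAGIC]
    if not valid:
        raise ValueError("no valid uberblock found")
    return max(valid, key=lambda ub: ub.ub_txg)


slots = [
    Uberblock(UB_MAGIC, 37127),
    Uberblock(0, 99999),          # bad magic: ignored despite the high txg
    Uberblock(UB_MAGIC, 37129),
    Uberblock(UB_MAGIC, 37128),
]
active = active_uberblock(slots)
```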
Bigger Picture (extension)
leading to DSL
[Figure: the physical VDEV layout (Label L0 256K with VDEV Configuration 112K and Array of 128 Uberblocks 128K, Label L1 256K, Boot Area 3.5M = 14 x 256K, Allocatable Data Area X * 256K, Label L2 256K, Label L3 256K) leads through the active uberblock_t { ub_magic = 0x00bab10c, ub_version = SPA_VERSION, ub_txg = SYNC_TXG, ub_guid_sum, blkptr_t ub_root_bp, …, ub_pool_guid, … } to the MOS objset_phys_t { os_meta_dnode (Meta Dnode, Obj id 0), ..., os_zil_header, os_type = DMU_OST_META, … } and its dnodes Obj id 1 and Obj id 2.]
• Object id 1 of MOS is an object directory ZAP Object
Object Directory: root_dataset = 2, config = 24, creation_version = 33
• The name “root_dataset” indicates the root dataset for the pool. The value indicates the object of type DMU_OT_DSL_DIR
• This Object contains all information about the DSL Directory.
• dsl_dir_phys_t’s “child_dir_zapobj” of type ZAP establishes the parent-children relationship
• dsl_dir_phys_t’s “head dataset” has the dsl_dataset for the dir's active FS
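The lookup path described above (object directory, then root DSL directory, then head dataset) can be modelled with plain dictionaries. This is a toy model, not an on-disk parser: the MOS is a dict keyed by object id, the object directory values come from the slide, and the head dataset id 18 is purely illustrative.

```python
# Toy MOS: object id -> simplified object contents.
mos = {
    1: {  # object directory ZAP, values as shown on the slide
        "type": "DMU_OT_OBJECT_DIRECTORY",
        "zap": {"root_dataset": 2, "config": 24, "creation_version": 33},
    },
    2: {  # root DSL directory; the head dataset id is an assumption
        "type": "DMU_OT_DSL_DIR",
        "dd_head_dataset_obj": 18,
    },
    18: {"type": "DMU_OT_DSL_DATASET"},
}


def root_head_dataset(mos):
    """Follow object 1 -> root_dataset -> dd_head_dataset_obj."""
    directory = mos[1]["zap"]
    dsl_dir_obj = directory["root_dataset"]
    dsl_dir = mos[dsl_dir_obj]
    if dsl_dir["type"] != "DMU_OT_DSL_DIR":
        raise ValueError("root_dataset does not point to a DSL directory")
    return dsl_dir_obj, dsl_dir["dd_head_dataset_obj"]


dir_obj, head_obj = root_head_dataset(mos)
```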
DSL Layer
Dataset and Snapshot Layer
• DSL Dataset: Represents an object set
• DSL Directory: Provides a hierarchical framework for datasets
– space accounting, estimation and enforcement
[Figure: DSL Infrastructure. A DSL Directory references a Child Dataset Information ZAP Object, a DSL Properties ZAP Object, its DSL Datasets (active, snapshot, snapshot) and its child DSL Directories (child1, child2).]
DSL dir (dsl_dir_phys_t)
Dataset and Snapshot Layer
typedef struct dsl_dir_phys {
    uint64_t dd_creation_time;        /* TS of creation of this DD */
    uint64_t dd_head_dataset_obj;     /* object containing the dsl dataset for active FS */
    uint64_t dd_parent_obj;           /* object containing parent dsl dir of this DD */
    uint64_t dd_origin_obj;           /* (just for clones) obj containing origin DS */
    uint64_t dd_child_dir_zapobj;     /* ZAP obj containing map of name of children-DD */
    uint64_t dd_used_bytes;           /* bytes used by all DS's of this DD */
    uint64_t dd_compressed_bytes;     /* compressed bytes used by all DS's of this DD */
    uint64_t dd_uncompressed_bytes;   /* uncompressed bytes used by all DS's of this DD */
    uint64_t dd_quota;                /* quota if any set for all DS's of this DD */
    uint64_t dd_reserved;             /* reservation if any set for all DS's of this DD */
    uint64_t dd_props_zapobj;         /* ZAP object containing the non-default properties */
    uint64_t dd_deleg_zapobj;
    uint64_t dd_flags;
    uint64_t dd_used_breakdown[DD_USED_NUM];
    uint64_t dd_clones;
    uint64_t dd_keychain_obj;
    uint64_t dd_pad[12];
} dsl_dir_phys_t;

typedef enum dd_used {
    DD_USED_HEAD,
    DD_USED_SNAP,
    DD_USED_CHILD,
    DD_USED_CHILD_RSRV,
    DD_USED_REFRSRV,
    DD_USED_NUM
} dd_used_t;
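The dd_used_breakdown array indexed by dd_used_t lines up with the usedby* dataset properties. The sketch below shows one plausible mapping; folding DD_USED_CHILD and DD_USED_CHILD_RSRV together into usedbychildren is an assumption, as are the sample byte counts and the Python names.

```python
from enum import IntEnum


class DDUsed(IntEnum):
    """Mirror of dd_used_t; C enums start at 0 by default."""
    HEAD = 0
    SNAP = 1
    CHILD = 2
    CHILD_RSRV = 3
    REFRSRV = 4
    NUM = 5


def usedby_properties(breakdown):
    """Map a dd_used_breakdown array to usedby* values (assumed mapping)."""
    if len(breakdown) != DDUsed.NUM:
        raise ValueError("breakdown must have DD_USED_NUM entries")
    return {
        "usedbydataset": breakdown[DDUsed.HEAD],
        "usedbysnapshots": breakdown[DDUsed.SNAP],
        "usedbychildren": breakdown[DDUsed.CHILD] + breakdown[DDUsed.CHILD_RSRV],
        "usedbyrefreservation": breakdown[DDUsed.REFRSRV],
    }


breakdown = [700, 200, 50, 25, 25]   # illustrative byte counts
props = usedby_properties(breakdown)
total = sum(props.values())          # compare against dd_used_bytes
```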
DSL Dataset (dsl_dataset_phys_t)
Dataset and Snapshot Layer
typedef struct dsl_dataset_phys {
    uint64_t ds_dir_obj;              /* object containing the dsl directory for this DS */
    uint64_t ds_prev_snap_obj;        /* object containing the dsl dataset for previous snap */
    uint64_t ds_prev_snap_txg;        /* TXG the previous snap was created in. Important. */
    uint64_t ds_next_snap_obj;        /* object containing the dsl dataset for next snap */
    uint64_t ds_snapnames_zapobj;     /* ZAP obj containing map of snapname to dsl dataset obj */
    uint64_t ds_num_children;         /* next snap OR active FS, AND one from each clone */
    uint64_t ds_creation_time;        /* TS for creation of this DS */
    uint64_t ds_creation_txg;         /* TXG this DS was created in */
    uint64_t ds_deadlist_obj;         /* object containing the dsl deadlist for this DS */
    uint64_t ds_used_bytes;           /* bytes used by objset (this DS) */
    uint64_t ds_compressed_bytes;     /* compressed bytes used by objset (this DS) */
    uint64_t ds_uncompressed_bytes;   /* uncompressed bytes used by objset (this DS) */
    uint64_t ds_unique_bytes;         /* (just for snaps) extent of divergence from active */
    uint64_t ds_fsid_guid;
    uint64_t ds_guid;
    uint64_t ds_flags;
    blkptr_t ds_bp;                   /* where the objset_phys_t is located */
    uint64_t ds_next_clones_obj;      /* ZAP obj containing list of clone dsl dataset objs */
    uint64_t ds_props_obj;
    uint64_t ds_userrefs_obj;
    uint64_t ds_shares_obj;
} dsl_dataset_phys_t;
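The ds_prev_snap_obj field links each dataset to the snapshot before it, so the snapshot history of a filesystem can be walked backwards from the active dataset. The sketch below models this with a dict; the object ids and names are hypothetical, and treating 0 as "no previous snapshot" is an assumption.

```python
# Toy DSL datasets: object id -> the fields needed for the walk.
datasets = {
    18: {"name": "tp", "ds_prev_snap_obj": 62},      # active filesystem
    62: {"name": "tp@two", "ds_prev_snap_obj": 58},
    58: {"name": "tp@one", "ds_prev_snap_obj": 0},   # oldest snapshot
}


def snapshot_chain(datasets, head_obj):
    """Return snapshot object ids from newest to oldest, starting at the head."""
    chain = []
    obj = datasets[head_obj]["ds_prev_snap_obj"]
    while obj != 0:  # assumption: 0 means there is no previous snapshot
        chain.append(obj)
        obj = datasets[obj]["ds_prev_snap_obj"]
    return chain


chain = snapshot_chain(datasets, 18)
names = [datasets[obj]["name"] for obj in chain]
```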
Bigger Picture (extension)
leading to DSL and FS Objset
[Figure: two objset_phys_t structures, each with an os_meta_dnode (Meta Dnode, Obj id 0). The first leads to dnodes Obj id 1, Obj id 2, … Obj id 18; the second leads to a dnode array (Dnode, Dnode, Dnode ... Dnode, Dnode). These objects belong to an objset of type DMU_OST_ZFS.]
DSL
Dataset and Snapshot Layer
DSL Layout for a typical ZFS hierarchy
[Figure: dsl_dir objects (rpool as root dir, ROOT as child, S11u10 as child, $ORIGIN) linked to their dsl_dataset objects: active datasets (rpool, ROOT, S11u10, $ORIGIN) and snapshots (@one, @two, @Dec, $ORIGIN@ORIGIN). S11u10 is a clone of S11u11@Aug. Legend: numbers are DSL object ids; separate link styles mark DSL dataset links and DSL dir links.]
Q&A
Appendix