Ext3/4 File Systems: Don Porter CSE 506

Ext3 and Ext4 are file systems that improve on the reliability of Ext2 by adding journaling. Ext3 uses redo logging to record file system operations in a journal and replay them if needed after a crash. This allows faster recovery than scanning the entire file system but adds complexity in ensuring data and journal consistency. Ext4 builds on Ext3 with changes like 48-bit block numbers that allow handling much larger file systems. It also uses extents to more efficiently represent large contiguous files.

Uploaded by

David Briggs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

35 views33 pages

Ext3/4 File Systems: Don Porter CSE 506

Uploaded by

David Briggs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 33

Ext3/4 file systems

Don Porter
CSE 506
Logical Diagram
[Figure: course map with today's lecture highlighted. User level: binary formats, memory allocators, threads. Kernel: system calls, RCU, file system, networking, sync, memory management, device drivers, CPU scheduler. Hardware: interrupts, disk, net, consistency.]
Ext2 review
• Very reliable, “best-of-breed” traditional file system design
• Much like the JOS file system you are building now
  • Fixed location super blocks
  • A few direct blocks in the inode, followed by indirect blocks for large files
  • Directories are a special file type with a list of file names and inode numbers
  • Etc.
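As an illustration of the direct/indirect block layout, here is a minimal sketch that reports which pointer level maps a given logical block of a file. The 4 KB block size and the helper name `locate` are assumptions made for the example; ext2 does keep 12 direct pointers in the inode, and triple-indirect blocks are left out to keep the sketch short.

```python
BLOCK_SIZE = 4096                  # assumed block size in bytes
PTRS_PER_BLOCK = BLOCK_SIZE // 4   # 32-bit block pointers per indirect block
N_DIRECT = 12                      # ext2 keeps 12 direct pointers in the inode

def locate(file_block):
    """Say which pointer level maps logical block number `file_block`."""
    if file_block < N_DIRECT:
        return ("direct", file_block)
    file_block -= N_DIRECT
    if file_block < PTRS_PER_BLOCK:
        return ("single-indirect", file_block)
    file_block -= PTRS_PER_BLOCK
    if file_block < PTRS_PER_BLOCK ** 2:
        # (slot in the double-indirect block, slot in the indirect block)
        return ("double-indirect", divmod(file_block, PTRS_PER_BLOCK))
    raise ValueError("triple-indirect blocks are omitted from this sketch")

print(locate(3))     # ('direct', 3)
print(locate(12))    # ('single-indirect', 0)
```

Small files never touch an indirect block; only files larger than 12 blocks pay for the extra lookup.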
File systems and crashes
• What can go wrong?
  • Write a block pointer in an inode before marking block as allocated in allocation bitmap
  • Write a second block allocation before clearing the first: block in 2 files after reboot
  • Allocate an inode without putting it in a directory: “orphaned” after reboot
  • Etc.
Deeper issue
• Operations like creation and deletion span multiple on-disk data structures
  • Requires more than one disk write
• Think of disk writes as a series of updates
  • System crash can happen between any two updates
  • Crash between wrong two updates leaves on-disk data structures inconsistent!
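A toy simulation can make the problem concrete. The sketch below models a file create as three separate writes (inode bitmap, inode, directory entry) and stops after a chosen number of them to mimic a crash. The in-memory "disk" dictionary and the names `new_disk`, `create_file`, and `consistent` are hypothetical, chosen only for this example.

```python
def new_disk():
    return {"inode_bitmap": set(), "inodes": {}, "directory": {}}

def create_file(disk, name, ino, crash_after=3):
    """Issue the three writes of a create, stopping after `crash_after` of them."""
    writes = [
        lambda: disk["inode_bitmap"].add(ino),              # mark inode allocated
        lambda: disk["inodes"].update({ino: {"size": 0}}),  # initialize the inode
        lambda: disk["directory"].update({name: ino}),      # link it into the directory
    ]
    for write in writes[:crash_after]:
        write()

def consistent(disk):
    """All three structures must agree on which inodes exist."""
    allocated = disk["inode_bitmap"]
    return (set(disk["inodes"]) == allocated
            and set(disk["directory"].values()) == allocated)

results = []
for crash_after in range(4):
    disk = new_disk()
    create_file(disk, "a.txt", 7, crash_after)
    results.append(consistent(disk))
print(results)  # [True, False, False, True]
```

Only the "nothing happened" and "everything happened" outcomes are consistent; a crash after one or two writes leaves the structures disagreeing.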
Atomicity
• The property that something either happens or it doesn’t
  • No partial results
• This is what you want for disk updates
  • Either the inode bitmap, inode, and directory are updated when a file is created, or none of them are
• But disks only give you atomic writes for a sector :(
• Fundamentally hard problem to prevent disk corruptions if the system crashes
fsck
• Idea: When a file system is mounted, mark the on-disk super block as mounted
  • If the system is cleanly shut down, last disk write clears this bit
• Reboot: If the file system isn’t cleanly unmounted, run fsck
• Basically, does a linear scan of all bookkeeping and checks for (and fixes) inconsistencies
fsck examples
• Walk directory tree: make sure each reachable inode is marked as allocated
• For each inode, check the reference count, make sure all referenced blocks are marked as allocated
• Double-check that all allocated blocks and inodes are reachable
• Summary: very expensive, slow scan of the entire file system
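The checks above can be sketched on a toy in-memory file system. This is only an illustration of the idea, not how the real fsck tool is structured: the function name `fsck_scan` and the dictionary/set layout of directories, inodes, and bitmaps are assumptions, and reference-count checking is omitted.

```python
def fsck_scan(root, directories, inode_blocks, inode_bitmap, block_bitmap):
    """Return a list of inconsistencies found by walking the whole tree.

    directories:  {dir_inode: {name: child_inode}}
    inode_blocks: {inode: [block numbers it references]}
    inode_bitmap, block_bitmap: sets of allocated inode / block numbers
    """
    problems = []
    # Walk the directory tree to find every reachable inode.
    reachable, stack = set(), [root]
    while stack:
        ino = stack.pop()
        if ino in reachable:
            continue
        reachable.add(ino)
        stack.extend(directories.get(ino, {}).values())
    # Reachable inodes and their blocks must be marked allocated.
    for ino in sorted(reachable):
        if ino not in inode_bitmap:
            problems.append(f"inode {ino} reachable but not allocated")
        for blk in inode_blocks.get(ino, []):
            if blk not in block_bitmap:
                problems.append(f"block {blk} of inode {ino} not allocated")
    # Allocated inodes must be reachable.
    for ino in sorted(inode_bitmap - reachable):
        problems.append(f"inode {ino} allocated but unreachable")
    return problems
```

Every inode and every referenced block is visited, which is why the cost grows with the size of the file system rather than with the amount of recent activity.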
Journaling
• Idea: Keep a log of what you were doing
  • If the system crashes, just look at data structures that might have been involved
• Limits the scope of recovery; faster fsck!
Undo vs. redo logging
• Two main choices for a journaling scheme (same in databases, etc.)
• Undo logging:
  1) Write what you are about to do (and how to undo it)
     • Synchronously
  2) Then make changes on disk
  3) Then mark the operations as complete
• If system crashes before commit record, execute undo steps
  • Undo steps MUST be on disk before any other changes! Why?
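The three undo-logging steps can be sketched with a dictionary standing in for the disk. The names `undo_update` and `undo_recover`, and the `crash_after` parameter used to simulate a crash, are hypothetical and exist only for this illustration.

```python
def undo_update(disk, log, changes, crash_after=None):
    """Apply `changes` ({block: new_value}) using undo logging.

    `crash_after` is the number of in-place writes that complete before a
    simulated crash; None means no crash.
    """
    # 1) Log the old values (how to undo) before touching the disk.
    entry = {"old": {b: disk.get(b) for b in changes}, "complete": False}
    log.append(entry)
    # 2) Make the changes in place.
    for i, (block, value) in enumerate(changes.items()):
        if crash_after is not None and i == crash_after:
            return  # simulated crash: entry is never marked complete
        disk[block] = value
    # 3) Mark the operation as complete.
    entry["complete"] = True

def undo_recover(disk, log):
    """Roll back every logged operation that never completed."""
    for entry in reversed(log):
        if not entry["complete"]:
            for block, old in entry["old"].items():
                if old is None:
                    disk.pop(block, None)
                else:
                    disk[block] = old
    log.clear()

disk = {"bitmap": "0", "inode": "empty"}
log = []
undo_update(disk, log, {"bitmap": "1", "inode": "file"}, crash_after=1)
undo_recover(disk, log)
print(disk)  # {'bitmap': '0', 'inode': 'empty'}
```

If step 2 ran before the old values were logged, a crash would leave a half-applied change with no record of how to roll it back.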
Redo logging
• Before an operation (like create)
  1) Write everything that is going to be done to the log + a commit record
     • Sync
  2) Do the updates on disk
  3) When updates are complete, mark the log entry as obsolete
• If the system crashes during (2), re-execute all steps in the log during fsck
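A matching sketch for redo logging, again with a dictionary as the disk. The names `redo_update` and `redo_replay` and the `crash_after` parameter are assumptions for the example; a real journal writes the log and the commit record to disk, which this toy version only models as a list entry.

```python
def redo_update(disk, log, changes, crash_after=None):
    """Apply `changes` ({block: new_value}) using redo logging."""
    # 1) Write everything that will be done, plus a commit record, to the log.
    entry = {"new": dict(changes), "committed": True, "obsolete": False}
    log.append(entry)
    # 2) Do the updates in place.
    for i, (block, value) in enumerate(changes.items()):
        if crash_after is not None and i == crash_after:
            return  # simulated crash during step 2
        disk[block] = value
    # 3) Updates are complete: the log entry is no longer needed.
    entry["obsolete"] = True

def redo_replay(disk, log):
    """Re-execute every committed entry that was not marked obsolete."""
    for entry in log:
        if entry["committed"] and not entry["obsolete"]:
            disk.update(entry["new"])
            entry["obsolete"] = True

disk = {"bitmap": "0", "inode": "empty"}
log = []
redo_update(disk, log, {"bitmap": "1", "inode": "file"}, crash_after=1)
redo_replay(disk, log)
print(disk)  # {'bitmap': '1', 'inode': 'file'}
```

Replay is safe to repeat because each step writes a final value rather than applying a relative change.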
Which one?
• Ext3 uses redo logging
• Tweedie says this choice is for the sake of delete
  • Intuition: It is easier to defer taking something apart than to put it back together later
  • Hard case: I delete something and reuse a block for something else before journal entry commits
• Performance: This only makes sense if data comfortably fits into memory
  • Databases use undo logging to avoid loading and writing large data sets twice
Atomicity revisited
• The disk can only atomically write one sector
• Disk and I/O scheduler can reorder requests
• Need atomic journal “commit”
Atomicity strategy
• Write a journal log entry to disk, with a transaction number (sequence counter)
• Once that is on disk, write to a global counter that indicates log entry was completely written
  • This single write is the point at which a journal entry is atomically “committed” or not
  • Sometimes called a linearization point
• Atomic: either the sequence number is written or not; sequence number will not be written until log entry on disk
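The commit rule above can be sketched as follows. The class `ToyJournal`, its fields, and the `crash_after` parameter are hypothetical; the single assignment to `committed_seq` stands in for the one-sector write that the disk performs atomically.

```python
class ToyJournal:
    def __init__(self):
        self.entries = {}        # sequence number -> blocks written so far
        self.committed_seq = 0   # global counter: one sector, written atomically
        self.next_seq = 1

    def write_entry(self, blocks, crash_after=None):
        """Write a log entry block by block, then commit it with one write."""
        seq = self.next_seq
        self.next_seq += 1
        self.entries[seq] = []
        for i, block in enumerate(blocks):
            if crash_after is not None and i == crash_after:
                return seq       # simulated crash: counter is never updated
            self.entries[seq].append(block)
        # The entire entry is on disk; this single write is the commit point.
        self.committed_seq = seq
        return seq

    def committed_entries(self):
        """Recovery only trusts entries at or below the committed counter."""
        return {s: b for s, b in self.entries.items() if s <= self.committed_seq}

journal = ToyJournal()
journal.write_entry(["bitmap", "inode"])
journal.write_entry(["inode", "dirent"], crash_after=1)
print(journal.committed_entries())  # {1: ['bitmap', 'inode']}
```

The half-written second entry is ignored on recovery because its sequence number never reached the counter.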
Batching
• This strategy requires a lot of synchronous writes
  • Synchronous writes are expensive
• Idea: let’s batch multiple little transactions into one bigger one
  • Assuming no fsync()
  • For up to 5 seconds, or until we fill up a disk block in the journal
• Then we only have to wait for one synchronous disk write!
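A small sketch of the batching policy: operations accumulate in a pending compound transaction, which is committed when it is older than a time limit or when it is full. The class name `Batcher`, the `capacity` of 4 operations (standing in for "a full journal block"), and the explicit `now` timestamps are assumptions for the example.

```python
class Batcher:
    def __init__(self, max_age=5.0, capacity=4):
        self.max_age = max_age      # seconds a batch may stay open
        self.capacity = capacity    # operations that fit in one batch
        self.pending = []
        self.opened_at = None
        self.committed = []         # one list per compound transaction
        self.sync_writes = 0        # synchronous commits issued so far

    def add(self, op, now):
        # A batch that has been open too long is committed first.
        if self.pending and now - self.opened_at >= self.max_age:
            self.commit()
        if not self.pending:
            self.opened_at = now
        self.pending.append(op)
        if len(self.pending) >= self.capacity:
            self.commit()

    def commit(self):
        if self.pending:
            self.committed.append(list(self.pending))
            self.pending.clear()
            self.sync_writes += 1

b = Batcher()
for op, t in [("a", 0), ("b", 1), ("c", 2), ("d", 6), ("e", 6.5), ("f", 7), ("g", 7.5)]:
    b.add(op, t)
print(b.sync_writes)  # 2
```

Seven small operations cost two synchronous writes instead of seven.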
Complications
• We can’t write data to disk until the journal entry is committed to disk
  • Ok, since we buffer data in memory anyway
  • But we want to bound how long we have to keep dirty data (5s by default)
• JBD adds some flags to buffer heads that transparently handle a lot of the complicated bookkeeping
  • Pins writes in memory until journal is written
  • Allows them to go to disk afterward
More complications
• We also can’t write to the in-memory version until we’ve written a version to disk that is consistent with the journal
• Example:
  • I modify an inode and write to the journal
  • Journal commits, ready to write inode back
  • I want to make another inode change
  • Cannot safely change in-memory inode until I have either written it to the file system or created another journal entry
Another example
• Suppose journal transaction 1 modifies a block, then transaction 2 modifies the same block.
• How to ensure consistency?
  • Option 1: stall transaction 2 until transaction 1 writes to fs
  • Option 2 (ext3): COW in the page cache + ordering of writes
Yet more complications
• Interaction with page reclaiming:
  • Page cache can pick a dirty page and tell fs to write it back
  • Fs can’t write it until a transaction commits
  • PFRA chose this page assuming only one write-back; must potentially wait for several
• Advanced file systems need the ability to free another page, rather than wait until all prerequisites are met
Write ordering
• Issue: if I make file 1 then file 2, can I have a situation where file 2 is on disk but not file 1?
  • Yes, theoretically
• API doesn’t guarantee this won’t happen (journal transactions are independent)
  • Implementation happens to give this property by grouping transactions into large compound transactions (buffering)
Checkpointing
• We should “garbage collect” our log once in a while
  • Specifically, once operations are safely on disk, journal transaction is obviated
  • A very long journal wastes time in fsck
• Journal hooks the associated buffer heads to track when they get written to disk
  • Advances logical start of the journal, allows reuse of those blocks
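One way to picture checkpointing is a queue of committed transactions, each tracking which of its blocks have not yet reached their home location. The sketch below is a deliberate simplification with hypothetical names (`CheckpointedJournal`, `block_written`); it treats one write of a block as covering every transaction that logged it.

```python
class CheckpointedJournal:
    def __init__(self):
        # Oldest transaction first; each is the set of blocks still
        # waiting to be written to their home location.
        self.transactions = []

    def commit(self, blocks):
        self.transactions.append(set(blocks))

    def block_written(self, block):
        """Called when a block reaches its home location on disk."""
        for txn in self.transactions:
            txn.discard(block)
        # Advance the logical start of the journal past transactions
        # whose blocks are all safely on disk.
        while self.transactions and not self.transactions[0]:
            self.transactions.pop(0)

j = CheckpointedJournal()
j.commit({1, 2})
j.commit({3})
j.block_written(3)
print(len(j.transactions))  # 2
j.block_written(1)
j.block_written(2)
print(len(j.transactions))  # 0
```

The start of the journal cannot move past the older transaction even though the newer one is fully written, since space is reclaimed in log order.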
Journaling modes
• Full data + metadata in the journal
  • Lots of data written twice, batching less effective, safer
• Ordered writes
  • Only metadata in the journal, but data blocks must be written to disk before the metadata that refers to them is committed to the journal
  • Faster than full data, but constrains write orderings (slower than metadata only)
• Metadata only: fastest, most dangerous
  • Can write data to a block before it is properly allocated to a file
Revoke records
• When replaying the journal, don’t redo these operations
  • Mostly important for metadata-only modes
• Example: Once a file is deleted and the inode is reused, revoke the creation record in the log
  • Recreating and re-deleting could lose some data written to the file
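A sketch of replay with revoke records: a first pass collects which blocks were revoked and by which transaction, and a second pass redoes logged writes except those cancelled by a revoke in the same or a later transaction. The journal layout (a list of dictionaries with `seq`, `writes`, and `revoked`) and the name `replay_with_revokes` are assumptions for illustration, not the on-disk JBD format.

```python
def replay_with_revokes(journal):
    """Replay committed transactions in order, honoring revoke records.

    journal: list of {"seq": int, "writes": {block: data}, "revoked": set()}
    Returns the resulting {block: data} image.
    """
    # Pass 1: remember the latest transaction that revoked each block.
    revoked_at = {}
    for txn in journal:
        for block in txn["revoked"]:
            revoked_at[block] = txn["seq"]
    # Pass 2: redo writes unless the block was revoked at or after this txn.
    disk = {}
    for txn in journal:
        for block, data in txn["writes"].items():
            if revoked_at.get(block, -1) >= txn["seq"]:
                continue  # stale record: the block was freed later
            disk[block] = data
    return disk

journal = [
    {"seq": 1, "writes": {5: "old directory block"}, "revoked": set()},
    {"seq": 2, "writes": {}, "revoked": {5}},   # block 5 freed, now file data
    {"seq": 3, "writes": {9: "inode table"}, "revoked": set()},
]
print(replay_with_revokes(journal))  # {9: 'inode table'}
```

Without the revoke record, replay would overwrite block 5 with stale metadata even though it now holds file data that was never journaled.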
ext3 summary
• A modest change: just tack on a journal
• Make crash recovery faster, less likely to lose data
• Surprising number of subtle issues
  • You should be able to describe them
  • And key design choices (like redo logging)
ext4
• ext3 has some limitations that prevent it from handling very large, modern data sets
  • Can’t fix without breaking backwards compatibility
  • So fork the code
• General theme: several changes to better handle larger data
  • Plus a few other goodies
Example
• Ext3 fs limited to 16 TB max size
  • 32-bit block numbers (2^32 * 4k block size), or “address” of blocks on disk
  • Can’t make bigger block numbers on disk without changing on-disk format
  • Can’t fix without breaking backwards compatibility
• Ext4: 48-bit block numbers
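The size limits follow directly from the width of a block number. A quick check of the arithmetic, assuming the 4 KB block size used on the slide (variable names are illustrative):

```python
BLOCK_SIZE = 4 * 1024        # 4 KB blocks, as on the slide
TIB = 2 ** 40                # bytes in one tebibyte
EIB = 2 ** 60                # bytes in one exbibyte

ext3_max_bytes = 2 ** 32 * BLOCK_SIZE   # 32-bit block numbers
ext4_max_bytes = 2 ** 48 * BLOCK_SIZE   # 48-bit block numbers

print(ext3_max_bytes // TIB)  # 16
print(ext4_max_bytes // EIB)  # 1
```

Sixteen extra bits of block number raise the addressable size by a factor of 65,536, from 16 TiB to 1 EiB.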
Indirect blocks vs. extents
• Instead of representing each block, represent large contiguous chunks of blocks with an extent
• More efficient for large files (both in space and disk scheduling)
• Ex: Disk sectors 50-300 represent blocks 0-250 of file
  • Vs.: Allocate and initialize 250 slots in an indirect block
  • Deletion requires marking 250 slots as free
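An extent can be modeled as a (first file block, first disk block, length) triple, and a lookup becomes a search over a short sorted list. The `Extent` tuple and the `lookup` helper below are hypothetical names for this sketch; the real ext4 extent tree is more elaborate.

```python
from bisect import bisect_right
from collections import namedtuple

# One extent maps `length` consecutive file blocks to consecutive disk blocks.
Extent = namedtuple("Extent", "file_block disk_block length")

def lookup(extents, file_block):
    """Map a logical file block to a disk block; None means a hole.

    `extents` must be sorted by file_block and must not overlap.
    """
    starts = [e.file_block for e in extents]
    i = bisect_right(starts, file_block) - 1   # last extent starting at or before
    if i < 0:
        return None
    e = extents[i]
    if file_block < e.file_block + e.length:
        return e.disk_block + (file_block - e.file_block)
    return None

# The slide's example: disk 50-300 holds file blocks 0-250, in one record.
extents = [Extent(file_block=0, disk_block=50, length=251)]
print(lookup(extents, 0))    # 50
print(lookup(extents, 250))  # 300
```

A single three-field record replaces the per-block slots an indirect block would need for the same range.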
Extents, cont.
• Worse for highly fragmented or sparse files
  • If no 2 blocks are contiguous, will have an extent for each block
  • Basically a more expensive indirect block scheme
• Proposed fix: a block-mapped extent, which essentially reverts to a more streamlined indirect block
Static inode allocations
• When you create an ext3 or ext4 file system, you create all possible inodes
• Disk blocks can either be used for data or inodes, but can’t change after creation
• If you need to create a lot of files, better make lots of inodes
• Why?
Why?
• Simplicity
  • Fixed location inodes means you can take inode number, total number of inodes, and find the right block using math
  • Dynamic inodes introduce another data structure to track this mapping, which can get corrupted on disk (losing all contained files!)
  • Bookkeeping gets a lot more complicated when blocks change type
• Downside: potentially wasted space if you guess the wrong number of files
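The "find the right block using math" step might look like the sketch below. The function name `locate_inode`, the 256-byte inode size, and the 4 KB block size are assumptions for the example; the general scheme (inode numbers start at 1 and are split evenly across block groups, each with its own inode table) matches the ext2 family.

```python
def locate_inode(ino, inodes_per_group, inode_size=256, block_size=4096):
    """Return (block group, block within that group's inode table, byte offset)."""
    index = ino - 1                                   # inode numbers start at 1
    group, index_in_group = divmod(index, inodes_per_group)
    byte_offset = index_in_group * inode_size
    block_in_table, offset_in_block = divmod(byte_offset, block_size)
    return group, block_in_table, offset_in_block

print(locate_inode(1, inodes_per_group=8192))     # (0, 0, 0)
print(locate_inode(17, inodes_per_group=8192))    # (0, 1, 0)
print(locate_inode(8194, inodes_per_group=8192))  # (1, 0, 256)
```

No lookup structure is consulted, so there is no mapping that could be lost or corrupted.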
Directory scalability
• An ext3 directory can have a max of about 32,000 subdirectories
  • Painfully slow to search: remember, this is just a simple array on disk (linear scan to lookup a file)
• Replace this in ext4 with an HTree
  • Hash-based custom BTree
  • Relatively flat tree to reduce risk of corruptions
  • Big performance wins on large directories: up to 100x
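To see why hashing helps, the sketch below compares a linear scan of directory entries with a lookup that first hashes the name to pick a small bucket. This is not the HTree layout itself, only the underlying idea; the function names, the use of CRC32 as the hash, and the bucket count of 64 are all assumptions made for the example.

```python
import zlib

def linear_lookup(entries, name):
    """Scan (name, inode) pairs in order; return (inode, comparisons made)."""
    comparisons = 0
    for entry_name, ino in entries:
        comparisons += 1
        if entry_name == name:
            return ino, comparisons
    return None, comparisons

def build_index(entries, n_buckets=64):
    """Group entries into buckets by a hash of the file name."""
    buckets = [[] for _ in range(n_buckets)]
    for name, ino in entries:
        buckets[zlib.crc32(name.encode()) % n_buckets].append((name, ino))
    return buckets

def indexed_lookup(buckets, name):
    """Hash the name to choose one bucket, then scan only that bucket."""
    bucket = buckets[zlib.crc32(name.encode()) % len(buckets)]
    return linear_lookup(bucket, name)

entries = [(f"file{i}", 100 + i) for i in range(1000)]
print(linear_lookup(entries, "file999"))  # (1099, 1000)
ino, comparisons = indexed_lookup(build_index(entries), "file999")
```

The indexed lookup compares against only the names that share a bucket with the target, so the work no longer grows linearly with directory size.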
Other goodies
• Improvements to help with locality
  • Preallocation and hints keep blocks that are often accessed together close on the disk
• Checksumming of disk blocks is a good idea
  • Especially for journal blocks
• Fsck on a large fs gets expensive
  • Put used inodes at front if possible, skip large swaths of unused inodes if possible
Summary
• ext2: Great implementation of a “classic” file system
• ext3: Add a journal for faster crash recovery and less risk of data loss
• ext4: Scale to bigger data sets, plus other features
  • Total FS size (48-bit block numbers)
  • File size/overheads (extents)
  • Directory size (HTree vs. a list)
