L2.1 File Organization
Introduction
File organization is a method of arranging the records in a file when the file is stored on disk.
Disk Failures
File Allocation
Primary Storage
- Directly accessed by the CPU
- Limited storage capacity
- Volatile memory
Secondary and Tertiary Storage
- Non-volatile memory
Data in secondary and tertiary storage cannot be processed directly by the CPU; it must first be copied into primary storage and then processed by the CPU.
Memory Hierarchy
Cache
- Static RAM (SRAM)
- Speeds up execution of program instructions
Main memory
- Dynamic RAM (DRAM)
- Main working area of the CPU
- Keeps program instructions and data
- Data is moved from main memory to cache and then to the processor
Disk
- Includes magnetic hard disks and optical discs (CD, DVD)
- Stores permanent data
Tapes
- Used for archiving and backup storage of data
Most databases permanently store data on magnetic disk secondary storage because:
- It is non-volatile
- Most databases are too large to fit in main memory
Transfer of Data between Levels
A key technique for speeding up data operations is to arrange data so that when a piece of a disk block is needed,
it is likely that other data on the same block will also be needed at about the same time.
Hardware Description of a Disk
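Magnetic disks are used for storing large amounts of data.
- The surface of each disk platter is divided into concentric tracks
- Each track is divided into smaller blocks or sectors
[Figure: layout of a sector, with the labels "Sector identifier" and "Error corrector"]
(The division of a track into equal-sized disk blocks is set by the O/S during disk formatting.)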
Example:
The hardware address of a block is a combination of a cylinder number, a track number and block
number.
The address of a buffer (a contiguous reserved area in main storage that holds a disk block) is also provided.
Disk Controller:
- Embedded in the disk unit
- Responsible for all mechanical operations of the disk and interfaces with the CPU
- Accepts read commands and write commands
Its tasks include:
1. Controlling the mechanical actuator that moves the head assembly, to position the heads at a particular radius (so that any track of one particular cylinder can be read or written).
2. Selecting a sector from among all those in the cylinder at which the heads are positioned. (It knows when the rotating spindle has reached the point where the desired sector is beginning to move under the head.)
3. Transferring bits between the desired sector and the computer's main memory.
4. Buffering an entire track or more in the local memory of the disk controller, in the hope that sectors of this track will be read soon, which avoids additional accesses to the disk.
Disk Access
The figure below shows a simple, single-processor computer. The processor communicates via a data bus with the main memory and the disk controller. A disk controller can control several disks.
To access a block:
1. The disk controller positions the head assembly at the cylinder containing the track on which the block is located. The time to do so is the seek time.
2. The disk controller waits while the first sector of the block moves under the head. This time is called rotational latency (or rotational delay). It depends on the rpm of the disk. Example: at 15000 rpm the time per rotation is 60000/15000 = 4 msec, and the average rotational delay is 2 msec (4/2).
3. The data is transferred as the sectors of the block pass under the head. This time is the block transfer time.
The total time to locate and transfer a disk block given its address is the sum of the seek time, rotational delay, and block transfer time.
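The timing rules above can be written out as a small C sketch. The function names are illustrative only and do not come from the slides; the arithmetic is the same as in the 15000 rpm example.

```c
#include <assert.h>

/* Time for one full rotation in msec, given the rotational speed in rpm. */
double rotation_time_ms(double rpm) {
    assert(rpm > 0.0);
    return 60000.0 / rpm;            /* 60,000 msec in one minute */
}

/* Average rotational delay: on average the disk turns half a rotation. */
double avg_rotational_delay_ms(double rpm) {
    return rotation_time_ms(rpm) / 2.0;
}

/* Total time to locate and transfer one block, given its address. */
double block_access_time_ms(double seek_ms, double rot_delay_ms, double btt_ms) {
    return seek_ms + rot_delay_ms + btt_ms;
}
```

For a 15000 rpm disk, rotation_time_ms gives 4 msec and avg_rotational_delay_ms gives 2 msec, matching the example.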
Activity
Suppose that a disk unit has the following parameters:
Seek time s=20 msec; Rotational delay rd=10 msec; Block transfer time btt=1 msec; Block size B=2400 bytes.
An EMPLOYEE file has the following fields: Ssn, 9 bytes; Last_name, 20 bytes; First_name, 20 bytes;
Middle_init, 1 byte; Birth_date, 10 bytes; Address, 35 bytes; Phone, 12 bytes; Supervisor_ssn, 9 bytes;
Department, 4 bytes; Job_code, 4 bytes; deletion marker, 1 byte.
The EMPLOYEE file has r=30,000 records, fixed-length format, and unspanned blocking.
Calculate
d. Calculate in msec the average time needed to search for an arbitrary record in the file, using linear search, if
the file blocks are not stored on consecutive disk blocks.
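One way to approach part d is sketched below in C. It rests on assumptions that the activity does not state: the record size is the sum of the listed field sizes (125 bytes including the deletion marker), the record being searched for exists and is equally likely to be anywhere in the file (so half the blocks are read on average), and each block read costs a full seek, rotational delay and block transfer because the blocks are not consecutive. The function names are hypothetical.

```c
#include <assert.h>

/* Blocking factor for unspanned records: whole records that fit in a block. */
long blocking_factor(long block_size, long record_size) {
    assert(record_size > 0 && block_size >= record_size);
    return block_size / record_size;
}

/* Number of blocks needed for num_records records: ceil(num_records / bfr). */
long blocks_needed(long num_records, long bfr) {
    assert(bfr > 0);
    return (num_records + bfr - 1) / bfr;
}

/* Average linear search time in msec when half of the blocks are read and
   every block read costs s + rd + btt (blocks not stored consecutively). */
double avg_linear_search_ms(long num_blocks, double s, double rd, double btt) {
    return (num_blocks / 2.0) * (s + rd + btt);
}
```

With B=2400 and R=125 this gives a blocking factor of 19, hence 1579 blocks for 30,000 records, and an average search time of (1579/2) * 31 = 24474.5 msec under the assumptions above.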
Accelerating Access to Secondary Storage
Accelerating disk access
A disk designed to take an average of, say, 10 milliseconds to access a block will not necessarily deliver data to an application within 10 milliseconds of the request being sent to the disk controller, because:
- If there is only one disk, the disk may be busy with another access from the same process or another process.
- In the worst case, requests for disk access arrive more than once every 10 milliseconds, and these requests queue up faster than the disk can serve them.
Strategies for speeding up disk access include:
1. Place blocks that are accessed together in the same cylinder (this often avoids seek time, and possibly rotational latency as well).
2. Divide the data among several smaller disks rather than one large one. This is known as striping.
3. Mirror a disk: make two or more copies of the data on different disks.
4. Use a disk scheduling algorithm to select the order in which requested blocks will be accessed.
5. Prefetch blocks into main memory in anticipation of their later use (double buffering).
Disk Failures
Disks can fail in several ways:
1. The most common form is an intermittent failure, where an attempt to read or write a sector is unsuccessful, but a repeated attempt succeeds.
2. A more serious failure is one in which a bit or bits are corrupted, and it is impossible to read a sector correctly, no matter how many times we try. This form of error is called media decay.
3. A related type of error is a write failure, where we attempt to write a sector, but we can neither write successfully nor retrieve the previously written sector. A possible cause is a power outage during the write.
4. The most serious form of disk failure is a disk crash, where the entire disk becomes unreadable, suddenly and permanently.
1. Intermittent Failures
An intermittent failure occurs when we try to read a sector, but the correct content of that sector is not delivered to the disk controller.
If the controller has a way to tell whether the sector is good or bad, then it will reissue the read request when bad data is read, until the sector is returned correctly or some preset limit is reached.
An intermittent failure can also occur when the controller attempts to write a sector, but the contents of the sector are not what was intended.
The only way to check whether the write was correct is to let the disk go round again and read the sector. If a good sector is read, then the write was correct.
2. Checksums
To determine whether a sector is good or bad, additional bits called a checksum are added to each sector.
If, on reading, we find that the checksum is not proper for the data bits, we know there is an error in reading.
A simple form of checksum is a parity bit:
- If there is an odd number of 1’s among the collection of bits, we say the bits have odd parity and add a parity bit that is 1.
- If there is an even number of 1’s among the bits, we say the bits have even parity and add a parity bit that is 0.
NB: The number of 1’s among a collection of bits and their parity bit is always even.
When reading a sector, the disk controller counts the 1’s among the data bits and the parity bit; an odd count indicates an error.
Example:
1. If the sequence of bits in a sector were 01101000, then there is an odd number of 1’s, so the parity bit is 1. If we append the parity bit, we get 011010001, which has an even number of 1’s.
2. If the given sequence of bits were 11101110, we have an even number of 1’s and the parity bit is 0. The sequence followed by its parity bit is 111011100.
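The parity rule can be sketched in C as follows. This is a minimal illustration that treats one byte as the "collection of bits"; the function names are hypothetical, and a real controller works on whole sectors and typically uses stronger checksums. A single parity bit only detects an odd number of flipped bits.

```c
#include <stdint.h>

/* Count the 1 bits in a byte. */
int count_ones(uint8_t bits) {
    int n = 0;
    while (bits) {
        n += bits & 1u;
        bits >>= 1;
    }
    return n;
}

/* Parity bit that makes the total number of 1's (data + parity) even. */
int parity_bit(uint8_t bits) {
    return count_ones(bits) % 2;
}

/* Returns 1 if the data bits plus the stored parity bit contain an odd
   number of 1's, which signals a read error; returns 0 otherwise. */
int has_parity_error(uint8_t bits, int stored_parity) {
    return (count_ones(bits) + stored_parity) % 2 != 0;
}
```

Here 01101000 is 0x68 and 11101110 is 0xEE, so parity_bit returns 1 and 0 respectively, as in the two examples.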
3. Stable Storage
Checksums help detect a media failure (a failure to read or write) but do not help correct the error.
During writing, the previous contents of a sector may be overwritten while the new contents cannot be read correctly. This leads to the loss of both the old and the new content.
Ideally, a write would be atomic: when we write a value B to a disk sector s currently containing the value A, after the write operation the sector contains either A or B.
Unfortunately, disks instead satisfy a weaker property, the Weakly-Atomic Property: after the write operation the sector will contain a value that is either A, B, or such that when it is read it gives a read error.
A policy known as stable storage deals with the problems above by pairing sectors so that each pair holds copies of the same information.
Consider two disks D1 and D2 that are mirror images (RAID-1) of each other, i.e. they have the same number of equally numbered sectors. Corresponding sectors are intended to have the same content. Here is how we write a value B to sector s:
REPEAT
write B to sector s on D1;
read sector s from D1;
UNTIL value read is without error;
REPEAT
write B to sector s on D2;
read sector s from D2;
UNTIL value read is without error;
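The write policy above can be tried out against a toy model. The sketch below is an assumption-heavy simulation, not a real disk interface: struct sim_disk models a single sector, and fails_left scripts how many writes leave the sector unreadable before one succeeds. All names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical one-sector disk used only for this simulation. */
struct sim_disk {
    int value;       /* contents of sector s */
    bool readable;   /* false if the last write left the sector corrupted */
    int fails_left;  /* scripted number of writes that will fail */
};

/* A write either stores b or leaves the sector unreadable (weakly atomic). */
void sim_write(struct sim_disk *d, int b) {
    if (d->fails_left > 0) {
        d->fails_left--;
        d->readable = false;
    } else {
        d->value = b;
        d->readable = true;
    }
}

/* A read returns false on a read error (e.g. a checksum mismatch). */
bool sim_read(const struct sim_disk *d, int *out) {
    if (!d->readable) return false;
    *out = d->value;
    return true;
}

/* REPEAT write; read; UNTIL the value read is without error. */
void write_until_good(struct sim_disk *d, int b) {
    int v;
    do {
        sim_write(d, b);
    } while (!sim_read(d, &v));
}

/* Stable-storage write: finish D1 completely before touching D2, so at
   least one copy is readable at every moment. */
void stable_write(struct sim_disk *d1, struct sim_disk *d2, int b) {
    write_until_good(d1, b);
    write_until_good(d2, b);
}
```

Because D2 is not touched until D1 holds a readable copy of B, a failure at any point leaves either the old value on D2 or the new value on D1 readable.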
Recovery from Disk Crashes
The most serious mode of failure is the disk crash (or head crash), where the data is permanently destroyed.
Schemes have been developed to reduce the loss of data caused by disk crashes.
a) Mirroring
The simplest scheme is to mirror each disk. The disk that holds the main data is called the data disk. Its copies, held on other disks whose contents are completely determined by the data disk, are called redundant disks.
Mirroring as a protection against failure is referred to as RAID level 1 (RAID: Redundant Array of Independent Disks).
b) Parity Blocks
Mirroring, as a technique to reduce the probability that a disk crash causes data loss, uses as many redundant disks as there are data disks.
Another approach, RAID level 4, uses only one redundant disk, no matter how many data disks there are.
We assume the disks are identical, so that we can number the blocks on each disk from 1 to some number n. All blocks on all disks have the same number of bits.
On the redundant disk, the ith block consists of parity checks for the ith blocks of all the data disks. That is, the jth bits of all the ith blocks, including both the data disks and the redundant disk, must have an even number of 1’s among them, and we always choose the bit of the redundant disk to make this condition true.
Example: Suppose blocks consist of only one byte (eight bits). Let there be three data disks called 1, 2 and 3, and one redundant disk called disk 4. Suppose the three data disks have the following bit sequences in their first blocks:
Disk 1: 11110000
Disk 2: 10101010
Disk 3: 00111000
Then the redundant disk will have in block 1 the parity check bits:
Disk 4: 01100010
Each bit of disk 4 is the modulo-2 sum of the corresponding bits of disks 1 to 3. The modulo-2 sum of bits is 0 if there is an even number of 1’s among those bits and 1 if there is an odd number of 1’s.
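Taking the modulo-2 sum bit by bit is the same as a bitwise exclusive-or, so the redundant block can be computed as in the following C sketch. It assumes one-byte blocks as in the example; parity_block is an illustrative name, not part of any RAID API.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity block for n one-byte data blocks: the bitwise XOR of all of them,
   which is the modulo-2 sum taken separately at each bit position. */
uint8_t parity_block(const uint8_t *data_blocks, size_t n) {
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++) {
        p ^= data_blocks[i];
    }
    return p;
}
```

For the three blocks in the example (0xF0, 0xAA, 0x38 in hexadecimal) the result is 0x62, i.e. 01100010.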
RAID level 4
Writing: When we write a new version of a block on a data disk, we must also update the corresponding block of the redundant disk.
Failure Recovery: If the redundant disk crashes, we swap in a new disk and recompute the redundant blocks. If a data disk fails, we swap in a new disk and recompute its data from the other disks.
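Both operations reduce to exclusive-or, as the C sketch below illustrates for one-byte blocks. The update shortcut (old parity XOR old data XOR new data) is a standard property of parity rather than something stated on the slide, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* New parity after one data block changes: XOR-ing the old and new data
   flips exactly the parity bits at the positions where the data changed. */
uint8_t updated_parity(uint8_t old_parity, uint8_t old_data, uint8_t new_data) {
    return (uint8_t)(old_parity ^ old_data ^ new_data);
}

/* Rebuild the block of a failed disk from the corresponding blocks of all
   n surviving disks (the remaining data disks and the redundant disk). */
uint8_t recover_block(const uint8_t *surviving, size_t n) {
    uint8_t b = 0;
    for (size_t i = 0; i < n; i++) {
        b ^= surviving[i];
    }
    return b;
}
```

With the earlier example, if disk 2 fails, XOR-ing the blocks of disks 1, 3 and 4 (11110000, 00111000, 01100010) gives back 10101010.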
RAID level 5
Whatever scheme we use for updating the disks, we need to read and write the redundant disk’s blocks.
RAID level 4 suffers from a bottleneck effect: if there are n data disks, the number of writes to the redundant disk will be n times the average number of writes to any one data disk.
In RAID level 5, each disk is treated as the redundant disk for some of the blocks.
Eg if there are n+1 disks numbered 0 through n, we could treat the ith cylinder of disk j as redundant
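One possible assignment rule, assumed here purely for illustration, is to make disk j the redundant disk for cylinder i when j is the remainder of i divided by n+1. The C sketch below encodes that assumption; the function name is hypothetical.

```c
#include <assert.h>

/* Assumed rule: with num_disks = n + 1 disks numbered 0..n, the redundant
   disk for cylinder i is disk (i mod num_disks). */
int redundant_disk_for_cylinder(int cylinder, int num_disks) {
    assert(cylinder >= 0 && num_disks > 0);
    return cylinder % num_disks;
}
```

With four disks, cylinders 0, 1, 2, 3, 4, ... get redundant disks 0, 1, 2, 3, 0, ..., so parity writes are spread evenly instead of all landing on one disk.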
Record: a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record.
Record type: a collection of field names and their corresponding data types. A data type, associated with each field, specifies the type of values a field can take.
Example:
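struct employee {
    char name[30];        /* employee name, 30 bytes */
    char SSN[9];          /* 9 characters, no room for a terminating '\0' */
    int salary;
    int jobcode;
    char department[20];  /* department name, 20 bytes */
};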
Arranging data on disk
A data element such as a tuple or object is represented by a record, which consists of consecutive bytes in a disk block.
Collections such as relations are usually represented by placing the records that represent their data elements in one or more blocks.
Fixed-length records
The simplest sort of record consists of fixed-length fields, one for each attribute of the represented tuple.
The record has a header, a fixed-length region that holds information about the record itself, such as:
1. A pointer to the schema for the data stored in the record. E.g., it could point to the schema for the relation to which the tuple belongs, helping us find the fields of the record.
3. Timestamps indicating the time the record was last modified or last read.
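Example: consider a relation whose tuples have the attributes name CHAR(30), address VARCHAR(255), gender CHAR(1), and birthdate DATE.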
The first field is for name, and this field requires 30 bytes. If we assume that all fields begin at a multiple of 4, then we allocate 32 bytes for name.
The next attribute is address. A VARCHAR attribute requires a fixed-length segment of bytes, with one more byte than the maximum length (for the string’s end marker). Thus, we need 256 bytes for address.
Attribute gender is a single byte, holding either the character ’M’ or ’F’. We allocate 4 bytes, so the next field can start at a multiple of 4.
Attribute birth date is a SQL DATE value, which is a 10-byte string. We shall allocate 12 bytes to its field, to keep whatever follows it aligned at a multiple of 4.
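The alignment arithmetic in this example can be reproduced with a short C sketch. It takes the raw field sizes from the text (30, 256, 1 and 10 bytes), rounds each up to a multiple of 4, and ignores any record header. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align. */
size_t align_up(size_t size, size_t align) {
    assert(align > 0);
    return (size + align - 1) / align * align;
}

/* Store each field's starting offset in offsets[] and return the total
   record length, with every field padded to a multiple of align. */
size_t layout_record(const size_t *field_sizes, size_t n, size_t align,
                     size_t *offsets) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        offsets[i] = pos;
        pos += align_up(field_sizes[i], align);
    }
    return pos;
}
```

For the four fields above the offsets come out as 0, 32, 288 and 292, and the record occupies 32 + 256 + 4 + 12 = 304 bytes before any header is added.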
Packing fixed-length records into blocks
Records representing tuples of a relation are stored in blocks of the disk and moved into main memory when they need to be accessed or updated.
In addition to the records, a block has a block header holding information such as:
1. Links to one or more other blocks that are part of a network of blocks, used for creating indexes to the tuples of a relation.
3. Information about which relation the tuples of this block belong to.
5. Timestamp(s) indicating the time of the block’s last modification and/or access.
File Organization
Spanned records
Suppose that the block size is B bytes and a file has fixed-length records of size R bytes, with B > R. R may not divide B exactly, leaving some unused space in each block.
To utilize the unused space, we store part of a record in one block and the rest in another block, with a pointer at the end of the first block pointing to the block that holds the remainder. Such records span more than one block.
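The effect of spanning on space usage can be estimated with the C sketch below. It is a simplification: the space taken by the pointer to the next block is ignored, and the function names are hypothetical.

```c
#include <assert.h>

/* Bytes left unused in each block when records may not span blocks. */
long unused_bytes_unspanned(long B, long R) {
    assert(R > 0 && B >= R);
    return B - (B / R) * R;
}

/* Blocks needed for r records with unspanned organization. */
long blocks_unspanned(long r, long B, long R) {
    long bfr = B / R;                /* whole records per block */
    assert(bfr > 0);
    return (r + bfr - 1) / bfr;      /* ceil(r / bfr) */
}

/* Blocks needed for r records with spanned organization, ignoring the
   per-block pointer overhead: ceil(r * R / B). */
long blocks_spanned(long r, long B, long R) {
    assert(B > 0);
    return (r * R + B - 1) / B;
}
```

For example, with B=2400, R=125 and r=30,000, unspanned blocks waste 25 bytes each and the file needs 1579 blocks, while spanning would need 1563 blocks under this simplification.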
Reset:
Sets the file pointer of an open file to the beginning of the file.
Close:
Completes the file access by releasing the buffers and performing any other needed cleanup operation
(e.g., cleanup the information of the file header which is maintained in main memory).
Find (or Locate):
Searches for the first record that satisfies a search condition.
Transfers the block containing that record into a main memory buffer (if it is not already there).
The file pointer points to the record in the buffer and it becomes the current record.
Operations on files
Read (or get):
Copies the current record from the buffer to a program variable in the user program.
This command may also advance the current record pointer to the next record in the file, which may require reading the next block from disk.
FindNext:
Searches for the next record in the file that satisfies the search condition.
Transfers the block containing that record into a main memory buffer (if it is not already there).
The record is located in the buffer and becomes the current record.
Delete
Deletes the current record and (eventually) updates the file on disk to reflect the deletion.
Modify
Modifies some field values for the current record and (eventually) updates the file on disk to reflect
the modification.
Insert
Inserts a new record in the file by locating the block where the record is to be inserted, transferring that block into a main memory buffer (if it is not already there), writing the record into the buffer, and (eventually) writing the buffer to disk to reflect the insertion.
Scan
If the file has just been opened or reset, Scan returns the first record that satisfies the search condition.
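The record-at-a-time behaviour of Reset and Scan can be illustrated with a small in-memory C sketch. An array of records stands in for the file's blocks in a buffer, and the record type, the cursor structure and the search condition are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical record type used only for this illustration. */
struct record { int ssn; int dept; };

/* A search condition is a function that accepts or rejects a record. */
typedef int (*condition_fn)(const struct record *);

struct file_cursor {
    const struct record *records;  /* stand-in for the file's records */
    size_t count;                  /* number of records in the file */
    size_t next;                   /* position where the next search starts */
};

/* Reset: move the file pointer back to the beginning of the file. */
void cursor_reset(struct file_cursor *c) {
    c->next = 0;
}

/* Scan: return the next record that satisfies cond (the first one right
   after an open or reset), or NULL when the end of the file is reached. */
const struct record *cursor_scan(struct file_cursor *c, condition_fn cond) {
    while (c->next < c->count) {
        const struct record *r = &c->records[c->next++];
        if (cond(r)) return r;     /* r becomes the current record */
    }
    return NULL;
}

/* Example search condition: the record belongs to department 5. */
int in_dept_5(const struct record *r) {
    return r->dept == 5;
}
```

Calling cursor_scan repeatedly walks through the matching records one at a time; calling cursor_reset makes the next scan start again from the first record.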