CNG351 Lecture 11 Part 2


Disk Storage, Basic File Structures and Hashing

CNG351 - Data Management and File Structures


Lecture - 11
Instructor: Dr. Yeliz Yesilada
Outline
• Disk Storage Devices
• Files of Records
• Operations on Files
• Unordered Files
• Ordered Files
• Hashed Files
– Dynamic and Extendible Hashing Techniques
• RAID Technology

CNG 351 - lecture 11 2/32


File Organisation and Access Methods
• A file organisation refers to the organisation of the
data of a file into records, blocks and access
structures; this includes the way records and blocks
are placed on the storage medium and interlinked.
• An access method, on the other hand, provides a
group of operations that can be applied to a file.
• Several access methods can be applied to a file
organisation.



Types of File Organisation
• Heap (unordered) files:
– Records are placed in no particular order
• Sequential (ordered) files:
– Records are ordered by the value of a specified field.
• Hash files:
– Records are placed on a disk according to a hash
function.



(Heap) Unordered Files
• Also called a heap or a pile file.
• Insertion: New records are inserted at the end of the file which is very
efficient.
• Searching: A linear search through the file records is necessary to
find a record. This requires reading and searching half the file
blocks on average, and is hence quite expensive.
• Deletion:
– To delete a record, the required block has to be retrieved,
the record is marked as deleted, and the block is written
back to the disk.
– Can have an extra byte or bit, called a deletion marker, stored with
each record.
– Both of these require periodic reorganisation of the file to reclaim
the unused space of deleted records.
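
The heap file operations above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not an implementation from the lecture: the names `HeapFile`, `Record` and `BLOCK_CAPACITY`, and the choice of two records per block, are assumptions made for the example.

```python
from dataclasses import dataclass, field

BLOCK_CAPACITY = 2  # records per block; an arbitrary value chosen for illustration


@dataclass
class Record:
    key: str
    deleted: bool = False  # the deletion marker stored with each record


@dataclass
class HeapFile:
    blocks: list = field(default_factory=list)  # each block is a list of Records

    def insert(self, key):
        # New records always go at the end of the file.
        if not self.blocks or len(self.blocks[-1]) == BLOCK_CAPACITY:
            self.blocks.append([])
        self.blocks[-1].append(Record(key))

    def search(self, key):
        # Linear search; returns (record or None, number of blocks read).
        for blocks_read, block in enumerate(self.blocks, start=1):
            for record in block:
                if record.key == key and not record.deleted:
                    return record, blocks_read
        return None, len(self.blocks)

    def delete(self, key):
        # Mark the record as deleted; its space is not reclaimed yet.
        record, _ = self.search(key)
        if record is not None:
            record.deleted = True
        return record is not None

    def reorganise(self):
        # Periodic reorganisation: repack live records, dropping deleted ones.
        live = [r.key for block in self.blocks for r in block if not r.deleted]
        self.blocks = []
        for key in live:
            self.insert(key)


heap = HeapFile()
for staff_no in ["SL21", "SG37", "SG14", "SA9", "SG5"]:
    heap.insert(staff_no)
```

In this model, searching for the last record inserted reads every block, while a deleted record keeps occupying its slot until `reorganise()` is called.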



Heap Files
• Reading the records in order of a particular field requires sorting
the file records.
• Either spanned or unspanned organisation can be used for an
unordered file, with either fixed-length or variable-length
records.



Ordered Files
• Also called a sequential file.
• File records are kept sorted by the values of an ordering field. If the
ordering field is also the key field, then the field is called the
ordering key.
• Insertion: expensive, because records must be inserted in the
correct order.
– It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion efficiency;
this is periodically merged with the main ordered file.
• Deletion: also an expensive operation.
• Searching: A binary search can be used to search for a record on
its ordering field value.
– This requires reading and searching about log2(b) of the b file
blocks on average, an improvement over linear search.
– Reading the records in order of the ordering field is quite
efficient.
Binary Search
• Retrieve the mid-block of the file. Check whether the required
record is in this block. If it is then no need to retrieve another
block.
• If the value of the key field in the first record on the block is
greater than the required value, the required value, if it exists,
occurs on an earlier block. Therefore, we repeat the above steps
using the lower half of the file as the new search area.
• If the value of the key field in the last record on the block is less
than the required value, the required value occurs on a later
block, and so we repeat the above steps using the upper half of the
file as the new search area.



Binary Search Example

SELECT *
FROM Staff
WHERE staffNo = 'SG21';

Block:   1     2     3     4     5     6
Key:     SA1   SG5   SG14  SG21  SL37  SL41
Probe:               (1)   (3)   (2)
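
The block-level binary search can be sketched as follows, using the six blocks from the example above. This is a minimal illustration rather than the lecture's own code: the helper `staff_key`, which splits a staff number into its letters and its numeric part so that SG5 sorts before SG14, is an assumption, as is the choice of the lower middle block when the search area has an even number of blocks.

```python
def staff_key(staff_no):
    # Hypothetical ordering key: ("SG", 21) for "SG21", so SG5 < SG14 < SG21.
    return staff_no[:2], int(staff_no[2:])


def binary_search_blocks(blocks, target):
    """Return (block_number, probes); block numbers start at 1.

    blocks is a list of blocks, each a sorted list of keys.
    block_number is None when the target is not in the file.
    """
    low, high = 0, len(blocks) - 1
    probes = []
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]           # one block access
        probes.append(mid + 1)
        if target < block[0]:         # smaller than the first key in the block
            high = mid - 1            # continue in the lower half
        elif target > block[-1]:      # larger than the last key in the block
            low = mid + 1             # continue in the upper half
        else:                         # if present at all, it is in this block
            return (mid + 1 if target in block else None), probes
    return None, probes


blocks = [[staff_key(s)] for s in ["SA1", "SG5", "SG14", "SG21", "SL37", "SL41"]]
found_in, probes = binary_search_blocks(blocks, staff_key("SG21"))
```

Under these assumptions the search for SG21 reads blocks 3, 5 and 4 in that order and finds the record in block 4.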



Advantages of Ordered Records
1. Reading the records in order of the ordering key values
becomes extremely efficient;
2. Finding the next record from the current one in order of the
ordering key usually requires no additional block access;
3. Using a search condition based on the value of ordering key
field results in faster access when the binary search technique
is used.
However,
Ordering doesn’t provide any advantages for random or ordered
access of the records based on the values of the other non-
ordering fields of the file.



Ordered Files
Example



Average Access Times
• The following table shows the average access time to
access a specific record for a given type of file
• Average access time for a file of b Blocks under
Basic file organizations:
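
As a rough numerical illustration, the two averages stated earlier in this lecture (about half the blocks for a linear search of a heap file, about log2 of the number of blocks for a binary search of an ordered file) can be computed for a given b. The function name and the dictionary labels are assumptions made for this sketch.

```python
import math


def average_block_accesses(b):
    # Average number of block reads to find one specific record
    # in a file of b blocks, using the figures quoted in the lecture.
    return {
        "heap file, linear search": b / 2,
        "ordered file, binary search": math.log2(b),
    }


costs = average_block_accesses(1024)
```

For a file of 1024 blocks this gives 512 block reads on average for the heap file against 10 for the ordered file.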



Hash File
• Records do not need to be written sequentially to the file;
• A hash function calculates the address of the block in which the
record is to be stored based on one or more fields in the record.
– Division remainder hashing function: Uses the MOD function, which
divides the field value by a fixed integer and uses the remainder as
the address.
• The base field is called the hash key.
• Records in a hash file will appear to be randomly distributed across the
available file space. For this reason, they are sometimes called
random, or direct files.
• The problem with most hashing functions is that they do not guarantee
a unique address, because the number of possible values a hash field
can take is typically much larger than the number of addresses
available for records.
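
A division-remainder hash function can be sketched as below. The function name `hash_address`, the use of the numeric part of a staff number as the field value, and the choice of three buckets are assumptions made for illustration.

```python
def hash_address(staff_no, num_buckets):
    # Take the digits of the staff number as the field value (an assumption),
    # then use the remainder after division by the number of buckets.
    number = int("".join(ch for ch in staff_no if ch.isdigit()))
    return number % num_buckets


addresses = {s: hash_address(s, 3) for s in ["SA9", "SG37", "SG5", "SG14", "SL41"]}
```

With three buckets, SG5, SG14 and SL41 all map to address 2, which shows why the function cannot guarantee a unique address.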



Hash Files
• The file blocks are divided into M equal-sized buckets, numbered
bucket0, bucket1, ..., bucketM-1.
– Typically, a bucket corresponds to one (or a fixed number of) disk
block.
• One of the file fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i=h(K),
and h is the hashing function.
• Within a bucket, records are placed in the order of arrival.
• Collisions occur when a new record hashes to a bucket that is already
full. There are several methods that can be used to manage collisions:
– Open addressing;
– Unchained overflow;
– Chained overflow;
– Multiple hashing.



1. Open Addressing
• Open addressing: Proceeding from the occupied position specified
by the hash address, the program checks the subsequent positions
in order until an unused (empty) position is found.
• For example:
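– Hash function: MOD 3 of the numeric part of the staff number field.
– Therefore, SG5 and SG14 both hash to bucket 2.
– When SL41 is inserted, the hash function also generates the address of bucket 2.
– It cannot be added to bucket 2, which is full, so the search continues from
the top of the file for the first available space (here, in bucket 1).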
Before                          After
Bucket  Records                 Bucket  Records
0       Staff SA9 record        0       Staff SA9 record
        Staff SL21 record               Staff SL21 record
1       Staff SG37 record       1       Staff SG37 record
        (empty)                         Staff SL41 record
2       Staff SG5 record        2       Staff SG5 record
        Staff SG14 record               Staff SG14 record
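
The open addressing example can be traced with a short sketch. It assumes two record slots per bucket, as the figure suggests, and it probes whole buckets rather than individual slots, which is a simplification; the names `bucket_of` and `insert_open_addressing` are made up for the example.

```python
BUCKET_CAPACITY = 2  # two record slots per bucket, as the figure suggests


def bucket_of(staff_no, num_buckets):
    # MOD of the numeric part of the staff number, e.g. "SL41" -> 41 % 3.
    return int(staff_no[2:]) % num_buckets


def insert_open_addressing(buckets, staff_no):
    # Start at the hash address and check the following buckets in order,
    # wrapping round to the top of the file, until one has a free slot.
    start = bucket_of(staff_no, len(buckets))
    for offset in range(len(buckets)):
        i = (start + offset) % len(buckets)
        if len(buckets[i]) < BUCKET_CAPACITY:
            buckets[i].append(staff_no)
            return i
    raise OverflowError("every bucket is full")


buckets = [["SA9", "SL21"], ["SG37"], ["SG5", "SG14"]]
placed_in = insert_open_addressing(buckets, "SL41")
```

SL41 hashes to bucket 2, finds it full, wraps round past the full bucket 0 and is stored in bucket 1.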
2. Unchained Overflow
• Instead of searching for a free slot, an overflow area is maintained for
collisions that cannot be placed at the hash address.

Bucket  Records                 Overflow area
                                Bucket  Records
0       Staff SA9 record        3       Staff SL41 record
        Staff SL21 record
1       Staff SG37 record       4       (empty)
        (empty)
2       Staff SG5 record
        Staff SG14 record



3. Chained Overflow
• An overflow area is maintained for collisions that
cannot be placed at the hash address. However,
each bucket has an additional field that indicates
whether a collision occurred, and if so, points to the
overflow page used.
Bucket  Records              Overflow     Overflow area
                             pointer      Bucket  Records
0       Staff SA9 record     0            3       Staff SL41 record
        Staff SL21 record
1       Staff SG37 record    0            4       (empty)
        (empty)
2       Staff SG5 record     3
        Staff SG14 record
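
Chained overflow can be sketched with one extra pointer per bucket. This is an illustrative model only: the class name `ChainedHashFile`, the two-slot buckets, and the use of `None` (rather than 0) for "no overflow bucket" are assumptions, and overflow buckets are simply appended after the primary ones.

```python
BUCKET_CAPACITY = 2  # two record slots per bucket, as the figure suggests


def bucket_of(staff_no, num_primary):
    # MOD of the numeric part of the staff number.
    return int(staff_no[2:]) % num_primary


class ChainedHashFile:
    def __init__(self, num_primary):
        self.num_primary = num_primary
        self.records = [[] for _ in range(num_primary)]  # primary buckets first
        self.overflow_ptr = [None] * num_primary         # None: no overflow yet

    def insert(self, staff_no):
        i = bucket_of(staff_no, self.num_primary)
        while len(self.records[i]) >= BUCKET_CAPACITY:
            if self.overflow_ptr[i] is None:
                # Allocate a new overflow bucket and link it to the full one.
                self.records.append([])
                self.overflow_ptr.append(None)
                self.overflow_ptr[i] = len(self.records) - 1
            i = self.overflow_ptr[i]
        self.records[i].append(staff_no)
        return i

    def search(self, staff_no):
        # Follow the chain from the hash address; returns the bucket number.
        i = bucket_of(staff_no, self.num_primary)
        while i is not None:
            if staff_no in self.records[i]:
                return i
            i = self.overflow_ptr[i]
        return None


hash_file = ChainedHashFile(3)
for s in ["SA9", "SL21", "SG37", "SG5", "SG14", "SL41"]:
    hash_file.insert(s)
```

After these insertions bucket 2 points to overflow bucket 3, which holds SL41, so a search for SL41 reads bucket 2 and then follows the pointer.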
Chained Overflow Example



4. Multiple Hashing
• The program applies a second hash function if the first results in a
collision. If another collision results, the program uses open addressing
or applies a third hash function and then uses open addressing if
necessary.
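
One way to picture multiple hashing is the sketch below, which tries a first hash function, then a second, and only then falls back to open addressing. The second function `h2` is purely hypothetical and chosen for the example; the lecture does not specify one. Keys are plain integers and buckets hold two records each, both assumptions.

```python
BUCKET_CAPACITY = 2  # assumed number of record slots per bucket


def h1(number, m):
    return number % m


def h2(number, m):
    # Hypothetical second hash function, used only if h1 collides.
    return (number // m) % m


def insert_multiple_hashing(buckets, number):
    m = len(buckets)
    candidates = [h1(number, m), h2(number, m)]
    # If both hash addresses are full, fall back to open addressing,
    # checking the buckets that follow the second address.
    candidates += [(candidates[1] + step) % m for step in range(1, m)]
    for i in candidates:
        if len(buckets[i]) < BUCKET_CAPACITY:
            buckets[i].append(number)
            return i
    raise OverflowError("no free slot in any bucket")


buckets = [[9, 21], [37], [5, 14]]
placed_in = insert_multiple_hashing(buckets, 41)
```

Here 41 collides in bucket 2 under `h1`, and `h2` sends it to bucket 1, which still has a free slot.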



Hashed Files
• To reduce overflow records, a hash file is typically kept 70-80%
full.
• The hash function h should distribute the records uniformly
among the buckets
– Otherwise, search time will be increased because many
overflow records will exist.
• Main disadvantages of static hashing (hash address space is
fixed when the file is created):
– Fixed number of buckets M is a problem if the number of
records in the file grows or shrinks.
– Ordered access on the hash key is quite inefficient (requires
sorting the records).
– It is difficult to expand or shrink the file dynamically.



Dynamic And Extendible Hashed Files
• Dynamic and Extendible Hashing Techniques
– Hashing techniques are adapted to allow the dynamic
growth and shrinking of the number of file records.
– These techniques include the following: dynamic
hashing, extendible hashing, and linear hashing.
• Both dynamic and extendible hashing use the binary
representation of the hash value h(K) in order to
access a directory.
– In dynamic hashing the directory is a binary tree.
– In extendible hashing the directory is an array of size
2^d, where d is called the global depth.



Dynamic And Extendible Hashing
• The directories can be stored on disk, and they expand or
shrink dynamically.
– Directory entries point to the disk blocks that contain the
stored records.
• An insertion in a disk block that is full causes the block to split
into two blocks and the records are redistributed among the two
blocks.
– The directory is updated appropriately.
• Dynamic and extendible hashing do not require an overflow
area.
• Linear hashing does require an overflow area but does not use
a directory.
– Blocks are split in linear order as the file expands.
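
The directory doubling and block splitting of extendible hashing can be sketched as follows. This is a simplified in-memory model under several assumptions: keys are distinct non-negative integers that serve as their own hash value, the directory is indexed by the low-order d bits of that value (some presentations use the high-order bits instead), and each bucket holds two keys. The class and attribute names are made up for the example.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # number of hash bits this bucket depends on
        self.keys = []


class ExtendibleHashFile:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]    # always 2**global_depth entries

    def _index(self, key):
        # Low-order global_depth bits of the hash value (here, the key itself).
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # No spare directory bit for this bucket: double the directory.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket on one more bit and repoint half its entries.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, entry in enumerate(self.directory):
            if entry is bucket and i & new_bit:
                self.directory[i] = sibling
        # Redistribute the old keys and the new one between the two buckets.
        pending = bucket.keys + [key]
        bucket.keys = []
        for k in pending:
            self.insert(k)


ext_file = ExtendibleHashFile()
for k in [0, 1, 2, 4, 3, 5]:
    ext_file.insert(k)
```

In this run the directory doubles twice, ending with global depth 2 and four buckets, and no overflow area is needed. The sketch does not handle shrinking or duplicate keys.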



Extendible Hashing



Parallelizing Disk Access using RAID Technology
• Secondary storage technology must take steps to
keep up in performance and reliability with
processor technology.
• A major advance in secondary storage technology is
represented by the development of RAID, which
originally stood for Redundant Arrays of
Inexpensive Disks; the I is now usually read as Independent.
• The main goal of RAID is to even out the widely
different rates of performance improvement of disks
against those in memory and microprocessors.



Trends…

• The main goal of RAID is to even out the widely different rates
of performance improvement of disks against those in memory
and microprocessors.
RAID Technology
• A natural solution is a large array of small
independent disks acting as a single higher-
performance logical disk.
• A concept called data striping is used, which utilizes
parallelism to improve disk performance.
• Data striping distributes data transparently over
multiple disks to make them appear as a single large,
fast disk.



Reliability with RAID
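• With an array of n disks, the likelihood that some disk fails is
about n times that of a single disk, so keeping only a single copy
of the data in such an array causes a significant loss of reliability.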
• An obvious solution is to employ redundancy of data
so that disk failures can be tolerated.
• One technique for introducing redundancy is called
mirroring or shadowing.
• Another technique is to store extra information that is
not normally needed but that can be used to reconstruct
the lost information.
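
A common form of such extra information is a parity block. The sketch below assumes byte-wise XOR parity over equal-sized blocks, which is one possible scheme and is not spelled out on this slide; the function names are made up for the example.

```python
from functools import reduce


def parity_block(blocks):
    # XOR the bytes at each position across all blocks (blocks of equal size).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    # XOR of the surviving data blocks and the parity gives the lost block.
    return parity_block(surviving_blocks + [parity])


data = [b"abcd", b"efgh", b"ijkl"]   # one block per data disk
parity = parity_block(data)          # stored on an extra disk
recovered = reconstruct([data[0], data[2]], parity)
```

If the disk holding the second block fails, XOR-ing the remaining blocks with the parity block yields the lost block again. This tolerates the loss of one disk only.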



Performance with RAID
• The disk arrays employ the technique of data
striping to achieve higher transfer rates.
• Bit-level data striping: consists of splitting a byte
of data and writing bit j to the jth disk.
– With 8-bit bytes, eight physical disks may be
considered as one logical disk.
• Block-level data striping: the granularity of
splitting is coarser than a bit; blocks of a file are
striped across disks.
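
Both kinds of striping can be illustrated with a small sketch. The round-robin placement of blocks and the numbering of bits from the least significant end are assumptions made for the example, as are the function names.

```python
def locate_block(logical_block, num_disks):
    # Block-level striping, assuming round-robin placement:
    # returns (disk number, block position on that disk).
    return logical_block % num_disks, logical_block // num_disks


def stripe_bits(data):
    # Bit-level striping over eight disks: bit j of every byte goes to disk j.
    # Bit 0 is taken to be the least significant bit (an assumption).
    disks = [[] for _ in range(8)]
    for byte in data:
        for j in range(8):
            disks[j].append((byte >> j) & 1)
    return disks


layout = [locate_block(i, 4) for i in range(6)]
bit_disks = stripe_bits(b"\x05")
```

With four disks, logical blocks 0 to 3 land on disks 0 to 3 and block 4 wraps back to disk 0, so consecutive blocks can be transferred in parallel.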



RAID Technology & Levels
• Different RAID organizations were defined based on different
combinations of two factors: the granularity of data interleaving
(striping) and the pattern used to compute redundant information.
– RAID level 0 has no redundant data and hence has the best write
performance, at the risk of data loss.
– RAID level 1 uses mirrored disks.
– RAID level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets of
components. Level 2 includes both error detection and correction.
– RAID level 3 uses a single parity disk, relying on the disk controller
to figure out which disk has failed.
– RAID levels 4 and 5 use block-level data striping, with level 5
distributing data and parity information across all disks.
– RAID level 6 applies the so-called P + Q redundancy scheme using
Reed-Solomon codes to protect against up to two disk failures by
using just two redundant disks.



Use of RAID Technology (contd.)
• Different RAID organizations are used in different situations:
– RAID level 1 (mirrored disks) is the easiest for rebuilding a disk from
other disks.
• It is used for critical applications like logs.
– RAID level 2 uses memory-style redundancy by using Hamming
codes, which contain parity bits for distinct overlapping subsets of
components.
• Level 2 includes both error detection and correction.
– RAID level 3 (a single parity disk, relying on the disk controller to
figure out which disk has failed) and level 5 (block-level data
striping) are preferred for large-volume storage, with level 3 giving
higher transfer rates.
• Most popular uses of the RAID technology currently are:
– Level 0 (with striping), Level 1 (with mirroring) and Level 5 with an
extra drive for parity.
• Design Decisions for RAID include:
– Level of RAID, number of disks, choice of parity schemes, and
grouping of disks for block-level striping.
Summary
• Operations on Files
• Unordered Files
• Ordered Files
• Hashed Files
– Dynamic and Extendible Hashing Techniques
• RAID Technology

