0% found this document useful (0 votes)
39 views

Lecture 16

This document provides an overview of storage and indexing concepts. It discusses different file organizations like heap files, sorted files, and indexed files. It also covers different types of indexes like hash-based and tree-based indexes. The document compares the costs of common operations like scans, searches, inserts and deletes on different file organizations and indexes. It provides assumptions used for the cost analysis and comparisons.

Uploaded by

Chandrika Surya
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views

Lecture 16

This document provides an overview of storage and indexing concepts. It discusses different file organizations like heap files, sorted files, and indexed files. It also covers different types of indexes like hash-based and tree-based indexes. The document compares the costs of common operations like scans, searches, inserts and deletes on different file organizations and indexes. It provides assumptions used for the cost analysis and comparisons.

Uploaded by

Chandrika Surya
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

Overview of Storage and Indexing

Chapter 8

Comp 521 – Files and Databases Fall 2016 1


Data on External Storage
 Solid State Disks, Secure Digital (SD) non-volatile memory:
 Block addressable storage device, relatively symmetric R/W speeds,
Access latency
 Disks: Can retrieve random page at fixed cost
 But reading consecutive pages is much cheaper than
reading them in random order
 Tapes: Can only read pages sequentially
 Cheaper than disks; used for archival storage
 File organization: Method of arranging a file of records on external
storage.
 Record id (rid) is sufficient to physically locate record
 Indexes are data structures that allow us to find the record ids of
records with given values in index search key fields
 Architecture: Buffer manager stages pages from external storage to
main memory buffer pool. File and index layers make calls to the
buffer manager.

Comp 521 – Files and Databases Fall 2016 2


Alternative File Organizations
Many alternatives exist, each ideal for some
situations, and not so good in others:
 Heap (random order) files: Suitable when typical
access is a file scan retrieving all records.
 Sorted Files: Best if records must be retrieved in
some order, or only a `range’ of records is needed.
 Indexes: Data structures to organize records via
trees or hashing.
• Like sorted files, they speed up searches for a subset of
records, based on values in certain (“search key”) fields
• Updates are much faster than in sorted files.

Comp 521 – Files and Databases Fall 2016 3


Indexes
 An index is an axillary data structure that
speeds up selections on the search key fields of
the index.
 Any subset of attributes from a relation can be a
search key.
 Search key is not necessarily a relation key (a set of
fields that uniquely identify a tuple in a relation).
 An index contains a collection of data entries,
and supports efficient retrieval of all data
entries k* with a given key value k.
 Given data entry k*, we can find record with key k
in at most one disk I/O. (Details soon …)

Comp 521 – Files and Databases Fall 2016 4


Hash-Based Index
 Place all records with a common attribute
together.
 Index is a collection of buckets.
 Bucket = primary page plus zero or
more overflow pages. key
 Buckets contain data entries. H(x)

 Hashing function, r = h(key) :


Mapping from the index’s search key to a
bucket in which the (data entry for) record r
belongs.

Comp 521 – Files and Databases Fall 2016 5


Tree-Based Index

Non-leaf
Pages

Leaf
Pages
(“Ordered” by search key)

 Leaf pages contain data entries, and are chained (prev & next)
 Non-leaf pages have index entries; only used to direct searches:
index entry

P0 K 1 P1 K 2 P 2 K m Pm

Comp 521 – Files and Databases Fall 2016 6


Alternative Data/Index Organizations
 In data entry k* we store one of the following:
 The actual data record with its key k (clustered)
 <k, rid of data record with search key value k>
 <k, list of rids of data records with search key k>
 Data organization choice is independent of the
indexing method.
 Clustered indices save on accesses, but you can only
have 1 clustered index per relation
 Unclustered alternatives tradeoff uniformity of
index entries verses size considerations
 Often, indices contains auxiliary information
Comp 521 – Files and Databases Fall 2016 7
Index Classifications
 Primary vs. Secondary: If search key contains
primary key, then it is called a primary index.
 Unique index: Search key contains a candidate key.
 Clustered vs. Unclustered:
 Clustered: tuples are sorted by search key and stored
sequentially in data blocks
 A file can be clustered on at most one search key.
 Unclustered: search keys are stored with record ids
(rids) that identify the block containing the
associated tuple

Comp 521 – Files and Databases Fall 2016 8


Clustered vs. Unclustered Index
 Index type (Hash or Tree) is independent of the data’s
organization (clustered or unclustered).
 To build clustered index, we must first sort the records (perhaps
allowing for some free space on each page for future inserts).
 Later inserts might create overflow pages. Thus, eventual order
of data records is “close to”, but not identical to, the sort order.

Index entries
CLUSTERED direct search for UNCLUSTERED
data entries

Data entries Data entries


(Index Blocks)
(Data Blocks)

Data Records Data Records


Comp 521 – Files and Databases Fall 2016 9
Costs / Benefits of Indexing
 Adding an index incurs
 Storage overhead
 Maintenance overhead
 Without indexing, searching the records of a
database for a particular record would
require on average

Number of Records * Cost to read a Record * 0.5

(assumes records are in random order)

Comp 521 – Files and Databases Fall 2016 10


Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
 B: The number of data pages
 R: Number of records per page
 D: (Average) time to read or write a block
 Measuring number of page I/O’s ignores gains of
pre-fetching a sequence of pages; thus, even I/O
cost is only approximated.
 Average-case analysis; based on several simplistic
assumptions.

 Good enough to show the overall trends!


Comp 521 – Files and Databases Fall 2016 11
Comparing File Organizations
 Heap file (random record order;
insert at eof)
 Sorted files, sorted on <age, sal>
 Clustered B+ tree file, clustered on search
key <age, sal>
 Heap file with unclustered B+ tree index
on search key <age, sal>
 Heap file with unclustered hash index
on search key <age, sal>

Comp 521 – Files and Databases Fall 2016 12


Operations to Compare
 Scan: Fetch all records from disk SELECT *
FROM Emp
 Equality search SELECT *
FROM Emp
 Range selection WHERE Age = 25 SELECT *
 Insert a record FROM Emp
WHERE Age > 30
 Delete a record
INSERT
INTO Emp(Name, Age, Salary)
VALUES(‘Jordan’, 49, 3000000)

DELETE
FROM Emp
WHERE Name =‘Bristow’

Comp 521 – Files and Databases Fall 2016 13


Assumptions in Our Analysis
 Heap Files:
 Equality selection is on key  exactly one match
 Sorted Files:
 Files compacted after deletions.
 Indexes:
 Search key overhead = 10% size of record
 Hash: No overflow buckets.
• 80% page occupancy => File size = 1.25 data size
 Tree: 67% occupancy (this is typical).
• Implies file size = 1.5 data size
• Tree Fan-out = F

Comp 521 – Files and Databases Fall 2016 14


Assumptions (contd.)
 Scans:
 Leaf levels of a tree-index are chained.
 Index data-entries plus actual file scanned for
unclustered indexes.
 Range searches:
 We use tree indexes to restrict the set of data
records fetched, but ignore hash indexes.

Comp 521 – Files and Databases Fall 2016 15


Cost of Operations
File Type Scan Equality Range Search Insert Delete
Search
Heap BD 0.5BD BD 2D Search + D

Sorted BD Dlog2B Dlog2B + Search + BD Search + BD


#matches
Clustered 1.5BD DlogF1.5B DlogF1.5B + Search + D Search + D
#matches
Unclustered BD(R+0.15) D(1+ D(1+logF0.15B+ D(logF0.15B) Search + 2D
tree index logF0.15B) #matches)

Unclustered BD(R+0.125) 2D BD 4D Search + 2D


hash index

 Several assumptions underlie these (rough) estimates!


We’ll cover them in the next few lectures.

Comp 521 – Files and Databases Fall 2016 16


Indexes and Workload
 For each query in the workload:
 Which relations does it access?
 Which attributes are retrieved?
 Which attributes are involved in selection/join conditions?
How selective are the conditions applied likely to be?
 For each update in the workload:
 Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
 The type of update (INSERT/DELETE/UPDATE), and the
attributes that are affected.

Comp 521 – Files and Databases Fall 2016 17


Index-Only Plans
 Some queries <E.dno> SELECT E.dno, COUNT(*)
Index stores a
can be answered count of tuples FROM Emp E
without with the same GROUP BY E.dno
retrieving any key
tuples from one A Tree index on
SELECT E.dno, MIN(E.sal)
or more of the <E.dno,E.sal> FROM Emp E
relations would give the
GROUP BY E.dno
involved if a anwser
suitable
index is <E. age,E.sal> SELECT AVG(E.sal)
or FROM Emp E
available.
<E.sal, E.age> WHERE E.age=25 AND
Average the E.sal BETWEEN 3000 AND 5000
index keys
Comp 521 – Files and Databases Fall 2016 18
Summary
 Alternative file organizations, each suited for
different situations.
 If selection queries are frequent, data
organization and indices are important.
 Hash-based indexes
 Sorted files
 Tree-based indexes
 An index maps search-keys to associated tuples.
 Understanding the workload of an application,
and its performance goals, is essential for a
good design.
Comp 521 – Files and Databases Fall 2016 19

You might also like