
Indexing

Uploaded by

f20211140
File Organization and Indexing

Data on External Storage

 Disks: Can retrieve random page at fixed cost


 But reading several consecutive pages is much cheaper than
reading them in random order
 Tapes: Can only read pages in sequence
 Cheaper than disks; used for archival storage
 File organization: Method of arranging a file of records on
external storage.
 Record id (rid) is sufficient to physically locate record
 Indexes are data structures that allow us to find the record
ids of records with given values in index search key fields
 Architecture: Buffer manager stages pages from external
storage to main memory buffer pool. File and index layers
make calls to the buffer manager.
Alternative File Organizations
Many alternatives exist, each ideal for some situations, and not
so good in others:
 Heap (random order) files: Suitable when typical access is
a file scan retrieving all records.
 Sorted Files: Best if records must be retrieved in some
order, or only a 'range' of records is needed.
 Indexes: Data structures to organize records via trees or
hashing.
 Like sorted files, they speed up searches for a subset
of records, based on values in certain (“search key”)
fields
 Updates are much faster than in sorted files.
Internal Schema Design
DBMS
(request stored record / stored record returned)
File Manager
(request stored block / stored block returned)
Disk Manager
(disk I/O operation / data read from disk)
Stored Database
Unordered Files
 Also called a heap or a pile file.
 New records are inserted at the end of the file.
 A linear search through the file records is necessary
to search for a record.
 This requires reading and searching half the file blocks on
the average, and is hence quite expensive.
 Record insertion is quite efficient.
 Reading the records in order of a particular field
requires sorting the file records.
Ordered Files
 Also called a sequential file.
 File records are kept sorted by the values of an ordering field.
 Insertion is expensive: records must be inserted in the correct
order.
 It is common to keep a separate unordered overflow (or
transaction) file for new records to improve insertion
efficiency; this is periodically merged with the main ordered
file.
 A binary search can be used to search for a record on its
ordering field value.
 This requires reading and searching about log2(b) of the b file
blocks on the average, an improvement over linear search.
 Reading the records in order of the ordering field is quite
efficient.
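The difference between linear and binary search can be sketched by counting block reads. This is a minimal illustration, assuming a file is modeled as a list of blocks that each hold a sorted list of keys; the helper names `linear_search_blocks` and `binary_search_blocks` and the sample file are made up for this example and are not from the slides.

```python
def linear_search_blocks(blocks, key):
    """Scan blocks in file order; return (found, blocks_read)."""
    reads = 0
    for block in blocks:
        reads += 1
        if key in block:
            return True, reads
    return False, reads


def binary_search_blocks(blocks, key):
    """Binary search over the blocks of a file sorted on the key."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]
        reads += 1                      # one block read per probe
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            return key in block, reads
    return False, reads


# Hypothetical ordered file: 1024 blocks of 10 consecutive integer keys each.
ordered_file = [list(range(b * 10, b * 10 + 10)) for b in range(1024)]
found_lin, reads_lin = linear_search_blocks(ordered_file, 10235)
found_bin, reads_bin = binary_search_blocks(ordered_file, 10235)
```

Searching for a key in the last block is the worst case for the linear scan, which reads every block, while the binary search needs only about log2 of the block count.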
Ordered Files
Average Access Times
 The following table shows the average access time to
access a specific record for a given type of file
Sequential File Organization
 Suitable for applications that require sequential
processing of the entire file
 The records in the file are ordered by a search-key
Sequential File Organization (Cont.)
 Deletion – use pointer chains
 Insertion – locate the position where the record is to be
inserted
 if there is free space, insert there
 if there is no free space, insert the record in an overflow block
 In either case, the pointer chain must be updated
 Need to reorganize the file from time to time to restore
sequential order
Multitable Clustering File Organization
Store several relations in one file using a multitable clustering
file organization
[Figure: the department relation, the instructor relation, and the multitable clustering of department and instructor]
Multitable Clustering File Organization (cont.)

 good for queries involving department ⋈ instructor (the join of
the two relations), and for queries involving one single
department and its instructors
 bad for queries involving only department
 results in variable size records
 Can add pointer chains to link records of a particular
relation
Data Dictionary Storage
The Data dictionary (also called system catalog)
stores metadata; that is, data about data, such as
 Information about relations
 names of relations

 names, types and lengths of attributes of each relation

 names and definitions of views

 integrity constraints

 User and accounting information, including passwords


 Statistical and descriptive data
 number of tuples in each relation

 Physical file organization information


 How relation is stored (sequential/hash/…)

 Physical location of relation

 Information about indices


Index structures/files

 Dense, Sparse, Primary, Secondary
 Clustered, Un-clustered files
 I/O Cost based Analysis model
Introduction
 Issue
 How to get required records efficiently
 Example
 SELECT * from R;
 SELECT * from R where A=10;
 Index is a data structure that lets us find
quickly records with given ‘search key’ value
without having to look at more than a fraction
of all records
 An index takes a value for search key and
finds records with the matching value
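As a rough illustration of the second query, the sketch below compares a full scan with a lookup through an in-memory index. The relation `R`, its attribute values, and the helper names are assumptions made for this example only; a real DBMS index lives on disk and is organized in blocks.

```python
from collections import defaultdict

# Hypothetical relation R as (rid, row) pairs; the attribute values are made up.
R = [(rid, {"A": rid % 50, "B": f"row{rid}"}) for rid in range(1000)]


def select_scan(relation, value):
    """SELECT * FROM R WHERE A = value, examining every record."""
    examined = 0
    result = []
    for rid, row in relation:
        examined += 1
        if row["A"] == value:
            result.append(rid)
    return result, examined


def build_index(relation, field):
    """Map each search key value to the rids of the records holding it."""
    index = defaultdict(list)
    for rid, row in relation:
        index[row[field]].append(rid)
    return index


index_on_A = build_index(R, "A")
scan_rids, examined = select_scan(R, 10)   # looks at all 1000 records
index_rids = index_on_A[10]                # goes straight to the matches
```

Both paths return the same record ids; the scan examines every record, while the index takes the search key value and yields only the matching ones.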
Indexing
 Can we do anything else to improve query performance other
than selecting a good file organization?
 Yes, the answer lies in indexing
 Index - a data structure that allows the DBMS to locate
particular records in a file more quickly
 Very similar to the index at the end of a book to locate various
topics covered in the book
 Types of Index
 Primary index – one primary index per file
 Clustering index – one clustering index per file – data file is ordered
on a non-key field and the index file is built on that non-key field
 Secondary index – many secondary indexes per file
 Sparse index – has only some of the search key values in the
file
 Dense index – has an index entry for every search key
value in the file
 An index file takes much less space than the
corresponding data file
 An index is especially advantageous if it can
fit in memory
 A record can be found with only one disk I/O
 An index itself can be too large to fit in the
memory
 Multi-level indexes
 Only part of index in memory
Purposes of Data Indexing

 What is Data Indexing?

 Why is it important?
How DBMS Accesses Data?
 The operations read, modify, update, and delete are used to
access data in the database.
 The DBMS must first transfer the data temporarily to a buffer
in main memory.
 Data is transferred between disk and main memory in units
called blocks.
Time Factors

 Transferring blocks of data between disk and main memory is a
very slow operation.
 Data access time is determined by the physical storage device
being used.
More Time Factors

 Querying data out of a database requires more time.
 DBMS must search among the blocks of the database file to
look for matching tuples.
Purpose of Data Indexing

 An index is a data structure that is added to a file to provide
faster access to the data.
 It reduces the number of blocks that the DBMS has to check.
Properties of Data Index

 An index entry contains a search key and a pointer.
 Search key - an attribute or set of attributes that is used to
look up the records in a file.
 Pointer - contains the address at which the data is stored on
disk.
 It can be compared to the card catalog system used in public
libraries of the past.
Two Types of Indices

 Ordered index – index entries are kept sorted on the search
key value; used to access data in sorted order of values. An
ordered index may be primary (clustering) or secondary
(non-clustering).
 Hash index – used to access data that is distributed uniformly
across a range of buckets by a hash function on the search key.
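A minimal sketch of the hashing idea, assuming integer search keys and a simple modulo hash; the bucket count, the keys, and the function names are illustrative choices, not part of the slides.

```python
NUM_BUCKETS = 8  # assumed bucket count, chosen only for illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map an integer search key to a bucket number with a modulo hash."""
    return key % num_buckets


# Fill the buckets with (key, rid) entries for 80 hypothetical keys.
buckets = [[] for _ in range(NUM_BUCKETS)]
for rid, key in enumerate(range(100, 180)):
    buckets[bucket_of(key)].append((key, rid))


def hash_lookup(key):
    """Return the rids stored under key by probing a single bucket."""
    return [rid for k, rid in buckets[bucket_of(key)] if k == key]
```

With consecutive keys the modulo hash spreads entries evenly, so an equality lookup inspects one bucket instead of the whole file. A hash index of this kind does not help with range queries, because neighbouring keys land in different buckets.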
Index
 Mechanism for efficiently locating row(s)
without having to scan entire table
 Based on a search key: rows having a
particular value for the search key attributes
can be quickly located
 Don’t confuse candidate key with search key:
 Candidate key: set of attributes; guarantees
uniqueness
 Search key: sequence of attributes; does not
guarantee uniqueness –just used for search
Indexes
 Sometimes need to retrieve records by the values in
one or more fields, e.g.,
 Find all students in the “IS” department
 Find all students with a gpa > 3
 An index on a file is a:
 Disk-based data structure
 Speeds up selections on the search key fields for the index.
 Any subset of the fields of a relation can be index search key
 Search key is not the same as (candidate) key
 (e.g. doesn’t have to be unique).
 An index
 Contains a collection of index and data entries
 Supports efficient retrieval of all records with a given search
key value k.
Basic Concepts

 Indexing is used to speed up access to desired data.
 E.g. author catalog in library
 A search key is an attribute or set of attributes used to look up
records in a file. Unrelated to keys in the db schema.
 An index file consists of records called index entries.
 An index entry for key k may consist of
 An actual data record (with search key value k)
 A pair (k, rid) where rid is a pointer to the actual data record
 A pair (k, bid) where bid is a pointer to a bucket of record pointers
 Index files are typically much smaller than the original file if the
actual data records are in a separate file.
 If the index contains the data records, there is a single file with
a special organization.

Indexing and Hashing 27


Types of index structures
 Simple indexes on sorted files
 Usually, created on primary key
 Secondary indexes on unsorted files
 Clustered indexes
 B-trees, a commonly used structure
 Hash table
Types of Indices

 The records in a file may be unordered or ordered
sequentially by some search key.
 A file whose records are unordered is called a heap file.
 If an index contains the actual data records or the records
are sorted by search key in a separate file, the index is
called clustering (otherwise non-clustering).
 In an ordered index, index entries are sorted on the
search key value. Other index structures include trees and
hash tables.



Primary Indexes (On sorted files)
 The simplest structure
 The data file is a sequential file
 The data file is sorted on a key, usually
primary key
 The index file consists of <key,pointer> pairs
 Types of indexes
 Dense: every record has an entry in the index
 Sparse: only some of the data records have
entries in the index
Types of Single-Level Indexes
 Primary Index
 Defined on an ordered data file
 The data file is ordered on a key field
 Includes one index entry for each block in the data file;
the index entry has the key field value for the first
record in the block, which is called the block anchor
 A similar scheme can use the last record in a block.
 A primary index is a nondense (sparse) index, since it
includes one entry per disk block of the data file, holding the
key of that block's anchor record, rather than one entry for
every search value.
[Figure: Primary index on the ordering key field of the file]
Index Structure
 Contains:
 Index entries
 Can contain the data tuple itself (index and table are
integrated in this case); or
 Search key value and a pointer to a row having that value;
table stored separately in this case – unintegrated index
 Location mechanism
 Algorithm + data structure for locating an index entry with a
given search key value
 Index entries are stored in accordance with the
search key value
 Entries with the same search key value are stored together
(hash, B- tree)
 Entries may be sorted on search key value (B-tree)
Index Structure
[Figure: A search key value S is passed to the location mechanism, which facilitates finding the index entry for S among the index entries. Once the index entry is found, the row (S, …) can be directly accessed.]
Dense indexes

 Every key from the data file is represented
 Entries are in the same order as that of the file
 Binary search can be used to find the required
<key, pointer>
 Number of blocks searched is about log2 n instead of n/2 on
average
 Example: 1,000,000 tuples, 10 tuples per 4096-byte
block, key field 30 bytes, pointer 8 bytes
 Data file takes 400MB of space
 Index file will take 10,000 blocks (about 40MB) with 100 entries/block
 Binary search will bring about log2 10,000 ≈ 13.3 (13 to 14)
index blocks into main memory
 Memory can also be optimized by keeping only the
most frequently searched blocks in memory
 Hence a record can typically be retrieved with fewer than 14
disk I/Os
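The arithmetic behind this example can be checked with a short script. The figures come from the slide; the variable names are ours, and rounding 107 entries per block down to 100 follows the slide's own simplification.

```python
import math

BLOCK_SIZE = 4096            # bytes per block
NUM_TUPLES = 1_000_000
TUPLES_PER_BLOCK = 10
KEY_BYTES, POINTER_BYTES = 30, 8

data_blocks = NUM_TUPLES // TUPLES_PER_BLOCK                  # 100,000 blocks
data_file_mb = data_blocks * BLOCK_SIZE / 1_000_000           # roughly 400 MB

# A 38-byte <key, pointer> entry fits 107 times into a 4096-byte block.
max_entries_per_block = BLOCK_SIZE // (KEY_BYTES + POINTER_BYTES)
entries_per_block = 100      # the slide rounds down to 100

# Dense index: one entry per tuple.
dense_index_blocks = math.ceil(NUM_TUPLES / entries_per_block)
dense_index_mb = dense_index_blocks * BLOCK_SIZE / 1_000_000  # roughly 40 MB

# Worst-case number of index blocks probed by binary search.
binary_search_reads = math.ceil(math.log2(dense_index_blocks))
```

Since log2 of 10,000 is a little over 13, the worst case rounds up to 14 probes, which is why the slide speaks of 13 to 14 blocks.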
Sparse indexes
 Useful if dense index is too large
 Uses less space at the cost of possibly more time
to search
 Generally one record per block, usually the first, is
represented
 Sparse index for previous example would take only
1000 blocks, 4MB
 But it cannot give a quick answer to the query 'does
there exist a record with key value K?'
 It requires one disk I/O, followed by a search within the
data block
 Search K: find the entry with the largest key ≤ K
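The search rule for a sparse index can be sketched with Python's `bisect` module. The tiny data file and the name `sparse_lookup` are assumptions for illustration; the index keeps one entry per block, holding the first key in that block.

```python
from bisect import bisect_right

# Hypothetical sorted data file: 5 blocks of 4 keys each.
data_blocks = [[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120],
               [130, 140, 150, 160], [170, 180, 190, 200]]

# Sparse index: (first key in block, block number), one entry per block.
sparse_index = [(block[0], i) for i, block in enumerate(data_blocks)]
index_keys = [k for k, _ in sparse_index]


def sparse_lookup(key):
    """Return (found, block_number) using the largest index key <= key."""
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return False, None            # key is smaller than every indexed key
    block_no = sparse_index[pos][1]
    # Existence can only be decided after reading the data block itself.
    return key in data_blocks[block_no], block_no
```

The lookup for 75 lands in the same block as the lookup for 70, and only the search inside that block reveals that 75 is absent. This is the reason a sparse index cannot answer an existence query from the index alone.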
Sparse Vs Dense Index
 Dense index: index entry for each data
record
 Unclustered index must be dense
 Clustered index need not be dense
 Sparse index: index entry for each block
of data file
Sparse Vs. Dense Index
[Figure: a data file with columns Id, Name, Dept, sorted on Id; a sparse, clustered index sorted on Id; and a dense, unclustered index sorted on Name]
Clustered vs. Unclustered Index

 Clustered (main) index: index entries and rows
are ordered in the same way
 An integrated storage structure is always clustered
 There can be at most one clustered index on a table
 Unclustered (secondary) index: index entries and
rows are not sorted on the same search key
 An index file might be clustered or unclustered with
respect to the storage structure it references
 There can be many secondary indices on a table
Clustering and Non-clustering
 Non-clustering indices have to be dense.
 Indices offer substantial benefits when searching for
records.
 When a file is modified, every index on the file must
be updated. Updating indices imposes overhead on
database modification.
 Sequential scan using clustering index is efficient, but
a sequential scan using a non-clustering index is
expensive – each record access may fetch a new
block from disk.



Clustered Index
 Good for range searches
 Use location mechanism to locate index
entry at start of range
 This locates first data record.
 Subsequent data records are contiguous if
index is clustered (not so if unclustered)
 Minimizes page transfers and maximizes
likelihood of cache hits
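The effect on page transfers can be illustrated by counting the distinct data blocks a range query touches. This is a toy model under stated assumptions: 10 records per block, record position divided by 10 gives the block number, and the unclustered layout is an arbitrary fixed permutation invented for the example.

```python
RECORDS_PER_BLOCK = 10


def blocks_touched(rids):
    """Count distinct data blocks holding the given record positions."""
    return len({rid // RECORDS_PER_BLOCK for rid in rids})


# 1,000 hypothetical records with keys 0..999.
# Clustered: the record with key k sits at position k (file sorted on the key).
clustered_rid = {k: k for k in range(1000)}
# Unclustered: same keys, but records are scattered by a made-up permutation.
unclustered_rid = {k: (k * 37) % 1000 for k in range(1000)}

range_keys = range(200, 300)          # range query: 200 <= key < 300
clustered_cost = blocks_touched(clustered_rid[k] for k in range_keys)
unclustered_cost = blocks_touched(unclustered_rid[k] for k in range_keys)
```

With the clustered layout the 100 matching records occupy 10 contiguous blocks; with the scattered layout the same 100 records are spread over many more blocks, so the range scan costs correspondingly more I/Os.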
Sparse Index Files

 A clustering index may be sparse.
 A sparse index contains index records for only some search-key values.
 To locate a record with search-key value k we:
 Find the index record with the largest search-key value ≤ k
 Search the file sequentially starting at the record to which
the index record points
 Less space and less maintenance overhead for insertions
and deletions.
 Generally slower than a dense index for locating records.
 Good tradeoff: sparse index with an index entry for every
block in the file, corresponding to the least search-key value in
the block.
Types of Single-Level Indexes
 Secondary Index
A secondary index provides a secondary means of
accessing a file for which some primary access already
exists.
The secondary index may be on a field which is a
candidate key and has a unique value in every record, or
a nonkey with duplicate values.
The index is an ordered file with two fields.
 The first field is of the same data type as some
nonordering field of the data file that is an indexing
field.
 The second field is either a block pointer or a record
pointer. There can be many secondary indexes (and
hence, indexing fields) for the same file.
 Includes one entry for each record in the data file;
hence, it is a dense index
[Figure: A dense secondary index (with block pointers) on a nonordering key field of a file.]
Secondary indexes
 SELECT name, address
FROM MovieStar
WHERE birthdate = DATE '1952-01-01';
 CREATE INDEX BDIndex ON MovieStar(birthdate);
 Secondary indexes are always ‘dense’
 Second level index could be ‘sparse’
 Secondary indexes are usually with duplicates
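The statements above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the rows are made up, and because SQLite has no `DATE '...'` literal the birthdate is stored and compared as an ISO-formatted string, which is an assumption of this example rather than something the slides specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MovieStar (name TEXT, address TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO MovieStar VALUES (?, ?, ?)",
    [("Star A", "Addr 1", "1952-01-01"),     # made-up rows
     ("Star B", "Addr 2", "1960-07-15"),
     ("Star C", "Addr 3", "1952-01-01")],
)
conn.execute("CREATE INDEX BDIndex ON MovieStar(birthdate)")

query = "SELECT name, address FROM MovieStar WHERE birthdate = '1952-01-01'"
rows = conn.execute(query).fetchall()

# EXPLAIN QUERY PLAN reports whether SQLite chose the index for this query;
# the last column of each plan row is the human-readable detail.
plan = " ".join(str(r[-1]) for r in conn.execute("EXPLAIN QUERY PLAN " + query))
```

Two rows share the same birthdate, which shows why a secondary index on a non-key field usually has to cope with duplicates.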
Secondary Indices Example

Secondary index on balance field of account

 Index record points to a bucket that contains
pointers to all the actual records with that particular
search-key value.
Multi-level indexes
 Used when an index is so large that even binary
search takes too many disk I/Os
 Define second level index: index on index
 This can continue to multi-level index structure
 Second and higher level indexes must be sparse
 Second level index in previous example would
take only 10 blocks, 40KB
 Search involves 2 disk I/Os and searching in the
block
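The level sizes in this example follow from repeated division by the number of entries per block. The sketch below assumes the same figures as the earlier example (100,000 data blocks, 100 entries per 4096-byte block); the function name `index_levels` is ours, and the loop continues until a level fits in one block, which goes one level further than the slide does.

```python
import math


def index_levels(entries, entries_per_block):
    """Return the blocks needed at each index level, lowest level first."""
    levels = []
    while True:
        blocks = math.ceil(entries / entries_per_block)
        levels.append(blocks)
        if blocks == 1:
            return levels
        entries = blocks      # the next level holds one entry per block here


DATA_BLOCKS = 100_000         # 1,000,000 tuples at 10 tuples per block
levels = index_levels(DATA_BLOCKS, entries_per_block=100)
level_sizes_kb = [blocks * 4096 // 1000 for blocks in levels]
```

The first level is the 1000-block sparse index and the second level is the 10-block, roughly 40KB index from the slide. A level that small can stay in memory, leaving one I/O for the first-level index block and one for the data block.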
Multilevel Index

 If an index does not fit in memory, access becomes
expensive.
 To reduce number of disk accesses to index records,
treat the index kept on disk as a sequential file and
construct a sparse index on it.
 outer index – a sparse index on main index

 inner index – the main index file

 If even outer index is too large to fit in main
memory, yet another level of index can be created,
and so on.
 Indices at all levels must be updated on insertion or
deletion from the file.
Multilevel Index (Cont.)
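[Figure: an outer index whose entries point to the index blocks (Index Block 0, Index Block 1, …) of the inner index, whose entries in turn point to the data blocks (Data Block 0, Data Block 1, …)]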

CIS552 Indexing and Hashing 50
Secondary indexes
 SELECT name, address
FROM MovieStar
WHERE birthdate = DATE '1952-01-01';
 CREATE INDEX BDIndex ON MovieStar(birthdate);

 Secondary index does not determine the
location of the record
 Secondary indexes are always ‘dense’
 Second level index could be ‘sparse’
 Secondary indexes are usually with duplicates
[Figure: Secondary index. Sorted index entries, including duplicate key values, point to records scattered across the blocks of an unsorted data file.]
 Pointers in one index block may refer to
multiple data blocks
 Results in more disk I/Os
 Unavoidable problem
 Using ‘bucket file’ between index file and data
file
 Single entry <k,p> for each value ‘k’ where p
points to location in bucket file containing all
other pointers of records with value ‘k’
 Avoids wastage of space due to multiple storage
of same value ‘k’
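A minimal sketch of the bucket-file idea, assuming an in-memory model in which the data file maps rids to key values; the sample values and the names `index_file`, `bucket_file`, and `rids_for` are illustrative only.

```python
# Hypothetical data file: rid -> search key value (unsorted, with duplicates).
data_file = {0: 20, 1: 40, 2: 10, 3: 20, 4: 50,
             5: 30, 6: 10, 7: 50, 8: 60, 9: 20}

# Bucket file: one bucket (list of rids) per distinct key value.
bucket_file = []
# Index file: a single entry <k, p> per value k, where p locates k's bucket.
index_file = {}
for rid, k in sorted(data_file.items()):
    if k not in index_file:
        index_file[k] = len(bucket_file)
        bucket_file.append([])
    bucket_file[index_file[k]].append(rid)


def rids_for(k):
    """Follow the single <k, p> index entry to its bucket of record pointers."""
    p = index_file.get(k)
    return [] if p is None else bucket_file[p]
```

Each key value is stored once in the index no matter how many records carry it, which is the space saving the slide describes.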
Definition of Bucket

 Bucket - another form of a storage unit
that can store one or more records of
information.
 Buckets are used if the search key value
cannot form a candidate key, or if the
file is not stored in search key order.
[Figure: entries in the index file point into the bucket file, whose buckets hold pointers to the records in the data file]


 Application of ‘bucket file’
 It can help answer queries efficiently using
intersection of pointer sets
 Example
 SELECT title
FROM Movie
WHERE StudioName = 'Disney' AND year = 1995;
 This reduces number of Disk I/Os
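The pointer-set intersection can be sketched with Python sets. The movie tuples below are invented for illustration, and the two dictionaries stand in for the buckets reached through the studio index and the year index.

```python
# Hypothetical Movie tuples keyed by rid: (title, studio, year).
movies = {
    1: ("Film A", "Disney", 1995),
    2: ("Film B", "Disney", 1994),
    3: ("Film C", "Fox", 1995),
    4: ("Film D", "Disney", 1995),
}

# Build one bucket of rids per studio and per year.
studio_buckets, year_buckets = {}, {}
for rid, (_, studio, year) in movies.items():
    studio_buckets.setdefault(studio, set()).add(rid)
    year_buckets.setdefault(year, set()).add(rid)

# Intersect the two pointer sets first; fetch only the surviving tuples.
matching_rids = (studio_buckets.get("Disney", set())
                 & year_buckets.get(1995, set()))
titles = sorted(movies[rid][0] for rid in matching_rids)
```

Only the two tuples present in both buckets are fetched from the data file; the Disney film from 1994 and the Fox film from 1995 are never read.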
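[Figure: the studio index entry for Disney and the year index entry for 1995 each point to a bucket (buckets for studio, buckets for year); the pointers in both buckets lead to the Movie tuples]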


Estimating Costs
 For simplicity we estimate the cost of an operation by
counting the number of blocks that are read or
written to disk.
 We ignore the possibility of blocked access which
could significantly lower the cost of I/O.
 We assume that each relation is stored in a separate
file with B blocks and R records per block.
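Under this model, the costs quoted earlier in the deck can be written as simple functions of B. The function names and the sample values of B and R are assumptions for illustration; the formulas are the ones stated for unordered and ordered files (half the blocks on average for a linear search, about log2 of the blocks for a binary search).

```python
import math


def scan_cost(B):
    """A full file scan reads every block."""
    return B


def heap_equality_cost(B):
    """Average equality search on a heap file: half the blocks."""
    return 0.5 * B


def sorted_equality_cost(B):
    """Binary search on a sorted file: about log2(B) block reads."""
    return math.log2(B)


B, R = 1000, 100              # assumed: 1000 blocks, 100 records per block
total_records = B * R
costs = {
    "scan": scan_cost(B),
    "heap equality": heap_equality_cost(B),
    "sorted equality": round(sorted_equality_cost(B), 1),
}
```

The costs are counted in blocks rather than records, so R affects only how many records the file holds, not the number of I/Os for these operations.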



Choosing Indexing Technique
 Five Factors involved when choosing the
indexing technique:
 access type
 access time
 insertion time
 deletion time
 space overhead
Indexing Definitions
 Access type - the kind of access supported, e.g. equality
search or range search.
 Access time - time required to locate the
data.
 Insertion time - time required to insert the
new data.
 Deletion time - time required to delete the
data.
 Space overhead - the additional space
occupied by the added data structure.
Index Evaluation Metrics
 Access time for:
 Equality searches – records with a specified
value in an attribute
 Range searches – records with an attribute
value falling within a specified range.
 Insertion time
 Deletion time
 Space overhead
Primary and Secondary Indices

o Indices offer substantial benefits when searching for
records.
o BUT: Updating indices imposes overhead on database
modification: when a file is modified, every index on
the file must be updated.
o Sequential scan using primary index is efficient, but a
sequential scan using a secondary index is expensive
o Each record access may fetch a new block from
disk
o Block fetch requires about 5 to 10 milliseconds,
versus about 100 nanoseconds for memory access
