Lecture12 (CNC 312)

Uploaded by

mohammed.adel
Copyright
© © All Rights Reserved

Overview of Storage and Indexing

Lecture 12

1
Motivation
❖ DBMS stores vast quantities of data.
❖ Data is stored on external storage devices and fetched into main memory as needed for processing.
❖ A page is the unit of information read from or written to disk (in a DBMS, a page may have size 8KB or more).
❖ Data on external storage devices:
▪ Disks: Can retrieve a random page at fixed cost, but reading several consecutive pages is much cheaper than reading them in random order.
▪ Tapes: Can only read pages in sequence. Cheaper than disks; used for archival storage.

❖ Cost of page I/O dominates cost of typical database operations.

2
Structure of a DBMS: Layered Architecture

❖ Layers, from top to bottom:
▪ Query Optimization and Execution
▪ Relational Operators
▪ Files and Access Methods
▪ Buffer Management
▪ Disk Space Management
▪ DB

❖ These layers must consider concurrency control and recovery.

❖ External storage access:
▪ Disk space manager manages persistent data.
▪ Buffer manager stages pages from external storage to main memory buffer pool.
▪ File and Access Methods layers make calls to buffer manager.

3
Files versus Indices

❖ File organization :
▪ Method of arranging a file of records on external storage.
▪ Record id (rid) is sufficient to physically locate record

❖ Indexes:
▪ Indexes are data structures that allow us to find the record ids of records with given values in the index search key fields.

4
File Organizations
▪ Heap (random order) files: Suitable when typical access is a file scan retrieving all records.

▪ Sorted Files: Best if records must be retrieved in some order, or only a "range" of records is needed.

▪ Indexes: Data structures to organize records to optimize certain kinds of retrieval operations.
• Speed up searches for a subset of records, based on values in certain ("search key") fields.
• Updates are much faster than in sorted files.

5
Alternatives for Data Entry k* in Index
❖ Data entry: A record stored in the index file.
▪ Given search key value k, provide for efficient retrieval of all data entries k* with value k.

❖ In a data entry k*, we can store one of:
▪ Alternative 1: Full data record with key value k, or
▪ Alternative 2: <k, rid of data record with search key value k>, or
▪ Alternative 3: <k, list of rids of data records with search key k>

❖ Choice among the above 3 alternatives is orthogonal to the indexing technique used to locate data entries.
▪ Example indexing techniques: B+ trees, hash-based structures, etc.
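
The three alternatives can be sketched in Python as follows. This is only an illustration and not part of the lecture: the record layout, the rid format (page number, slot number), and the sample values are assumptions made for the example, and age is assumed to be the search key.

```python
from collections import defaultdict

# Hypothetical data records, keyed by rid = (page number, slot number).
records = {
    (0, 0): {"name": "bob", "age": 12},
    (0, 1): {"name": "cal", "age": 11},
    (1, 0): {"name": "joe", "age": 12},
}

def key(rec):
    # Search key for this illustration: the age field.
    return rec["age"]

# Alternative 1: the data entry k* is the full data record itself.
alt1 = sorted(records.values(), key=key)

# Alternative 2: one <k, rid> pair per data record.
alt2 = sorted((key(rec), rid) for rid, rec in records.items())

# Alternative 3: one <k, list of rids> entry per distinct key value.
rids_by_key = defaultdict(list)
for rid, rec in sorted(records.items()):
    rids_by_key[key(rec)].append(rid)
alt3 = sorted(rids_by_key.items())
```

With these sample records, Alternative 2 produces three entries (one per record), while Alternative 3 produces two entries, one of which carries a list of two rids. This is the variable-length effect that comes with the more compact form.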

6
Alternatives for Data Entries
❖ Alternative 1: Full data record with key value k

▪ Index structure is a file organization for data records (instead of a Heap file or sorted file).

▪ At most one index on a given collection of data records can use Alternative 1. Otherwise, data records are duplicated, leading to redundant storage and potential inconsistency.

▪ If data records are very large, this implies size of auxiliary information in index is also large.

7
Alternatives for Data Entries
❖ Alternatives 2 (<k, rid>) and 3 (<k, list-of-rids>):
▪ Data entries typically much smaller than data records.

❖ Comparison:
▪ Both better than Alternative 1 with large data records,
especially if search keys are small.

▪ Alternative 3 is more compact than Alternative 2, but leads to variable-sized data entries even if search keys are of fixed length.

8
Index Classification

❖ Primary vs. secondary index:
▪ If the search key contains the primary key, the index is called a primary index; otherwise it is a secondary index.

❖ Clustered vs. unclustered index:
▪ If the order of data records is the same as, or "close to", the order of data entries, the index is called a clustered index; otherwise it is unclustered.

9
Index Clustered vs Unclustered

❖ Observation 1:
▪ Alternative 1 implies clustered. True ?
❖ Observation 2:
▪ In practice, clustered also implies Alternative 1 (since
sorted files are rare).
❖ Observation 3:
▪ A file can be clustered on at most one search key.
❖ Observation 4:
▪ Cost of retrieving data records through index varies
greatly based on whether index is clustered or not !!
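
Observation 4 can be illustrated by counting how many distinct data pages a range retrieval has to touch. The sketch below is not from the lecture: the page capacity, the file size, and the use of a seeded shuffle to model an unclustered layout are all assumptions, and it ignores buffering and the cost of reading the index itself.

```python
import random

RECORDS_PER_PAGE = 4   # assumed page capacity
NUM_RECORDS = 1000     # assumed file size

def pages_touched(layout, lo, hi):
    """Count distinct data pages holding records with lo <= key <= hi.

    layout[i] is the key of the record stored in slot i of the file,
    so slot i lives on page i // RECORDS_PER_PAGE.
    """
    return len({slot // RECORDS_PER_PAGE
                for slot, k in enumerate(layout) if lo <= k <= hi})

keys = list(range(NUM_RECORDS))

# Clustered: data records are stored in search-key order.
clustered = keys

# Unclustered: data records are stored in an unrelated order
# (modeled here by a shuffle with a fixed seed).
unclustered = keys[:]
random.Random(0).shuffle(unclustered)

clustered_cost = pages_touched(clustered, 100, 199)
unclustered_cost = pages_touched(unclustered, 100, 199)
```

For the 100 qualifying records, the clustered layout touches 25 consecutive pages. The shuffled layout can touch up to 100 different pages, close to one page I/O per record.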

10
Clustered vs. Unclustered Index

[Figure: CLUSTERED vs. UNCLUSTERED index. Index entries direct the search for data entries (index file); data entries point to data records (data file).]

11
Clustered vs. Unclustered Index
❖ Suppose Alternative (2) is used for data entries.
❖ Data records are stored in a Heap file.
▪ To build a clustered index, first sort the Heap file.
▪ Overflow pages may be needed for inserts.
▪ Thus, the order of data records is close to (not identical to) the sort order.

[Figure: CLUSTERED vs. UNCLUSTERED index. Index entries direct the search for data entries (index file); data entries point to data records (data file).]


12
B+ Tree Indexes

[Figure: B+ tree with non-leaf pages above leaf pages; leaf pages are sorted by search key.]

❖ Index leaf pages contain data entries, and are chained (prev & next)
❖ Index non-leaf pages have index entries; only used to direct searches:

Index entry layout: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm

13
Example B+ Tree
Root: 17 (entries < 17 to the left, entries >= 17 to the right)

Non-leaf level: [5 13] [27 30]

Leaf level: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*

Note how data entries in leaf level are sorted.

❖ Find: 29*? 28*? All > 15* and < 30*


❖ Insert/delete: Find data entry in leaf, then change it.
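
The three "Find" questions can be traced with a small Python sketch of the example tree. This is a simplified in-memory model, not a page-based implementation: the nested tuple/list representation and the function names are assumptions, and the grouping of data entries into leaf pages is inferred from the separator keys 5, 13, 27, 30.

```python
from bisect import bisect_right

# Non-leaf nodes are (keys, children) tuples; leaves are sorted lists of
# data entries. Leaf grouping is inferred from the separator keys.
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]
root = ([17], [([5, 13], leaves[0:3]), ([27, 30], leaves[3:6])])

def find_leaf(node, k):
    """Follow index entries from the root down to the leaf that may hold k."""
    while isinstance(node, tuple):
        keys, children = node
        # Entries < keys[0] are in children[0]; entries >= keys[i] go right.
        node = children[bisect_right(keys, k)]
    return node

def search(k):
    """Equality search: is data entry k* present?"""
    return k in find_leaf(root, k)

def range_search(lo, hi):
    """All data entries k* with lo < k < hi, read from consecutive leaves."""
    start = leaves.index(find_leaf(root, lo))
    out = []
    for leaf in leaves[start:]:      # stands in for following 'next' pointers
        for k in leaf:
            if k >= hi:
                return out
            if k > lo:
                out.append(k)
    return out
```

Here `search(29)` finds the entry, `search(28)` reaches the same leaf but finds nothing, and `range_search(15, 30)` descends once to the leaf that would hold 15 and then scans right, returning 16, 22, 24, 27, 29.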

14
Hash-Based Indexes
❖ Index is a collection of buckets.
▪ Bucket = primary page plus zero or more overflow pages.
▪ Buckets contain data entries.

❖ Hashing function h:
▪ h(r) = bucket in which data entry for record r belongs.
▪ h looks at search key fields of r.
▪ No need for “index entries” due to one-level index file

❖ Good for equality selections.
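
A minimal sketch of the idea in Python, assuming a fixed number of buckets, an integer search key, and a modulo hash function. None of these choices come from the lecture, and each bucket is modeled as a plain list rather than a primary page with overflow pages.

```python
NUM_BUCKETS = 4  # assumed, fixed number of buckets

def h(key):
    # Hypothetical hash function on an integer search key.
    return key % NUM_BUCKETS

# Each bucket is modeled as one list of data entries; a real bucket is a
# primary page plus zero or more overflow pages.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, rid):
    """Add the data entry <key, rid> to the bucket chosen by h."""
    buckets[h(key)].append((key, rid))

def lookup(key):
    """Equality selection: only one bucket has to be examined."""
    return [rid for k, rid in buckets[h(key)] if k == key]

# Sample data: rid i has age ages[i].
for rid, age in enumerate([12, 11, 12, 13]):
    insert(age, rid)
```

With this data, `lookup(12)` returns rids 0 and 2 after inspecting a single bucket. A range condition gets no such help, because neighboring key values land in unrelated buckets.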

15
Hash-Based Indexes

16
Understanding the Workload
❖ For each query in workload:
▪ Which relations does it access?
▪ Which attributes are retrieved?
▪ Which attributes are involved in selection/join conditions?
▪ How selective are these conditions likely to be?

❖ For each update in workload:
▪ Which attributes are involved in selection/join conditions?
▪ How selective are these conditions likely to be?
▪ The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected.

17
Choice of Indexes
❖ What indexes should we create?
▪ Which relations should have indexes?
▪ What field(s) should be the search key?
▪ Should we build several indexes?

❖ For each index, what kind of an index should it be?
▪ Clustered vs. unclustered? Hash vs. tree?
• Clustering must be used sparingly and only when justified by frequent queries that benefit from clustering.
• At most one index can be clustered.

18
Choice of Indexes: One Approach
▪ Consider the most important queries in turn.

▪ Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.

▪ Consider the impact on updates in the workload!
• Trade-off: Indexes can make queries go faster, updates slower. They require disk space, too.

▪ To do this, we must understand how a DBMS evaluates queries and creates query evaluation plans.

19
Choice of Indexes: Simple Approach

• For now, we discuss simple 1-table queries.

20
Index Selection Guidelines
❖ Attributes in WHERE clause are candidates for index keys.

▪ Exact match condition suggests hash index.

▪ Range query suggests tree index.

▪ Clustering is especially useful for range queries.

▪ Clustering can also help equality queries if there are many duplicates.

21
Index Selection Guidelines

❖ Multi-attribute search keys should be considered when the WHERE clause contains several conditions.
▪ Order of attributes is important for range queries.

❖ Try to choose indexes that benefit as many queries as possible.

❖ Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

22
Examples of Clustered Indexes
SELECT E.dno
FROM Emp E
WHERE E.age>40

❖ B+ tree index on E.dno?
❖ B+ tree index on E.age?

❖ Trade-offs:
▪ How selective is the condition? (all > 40?) or (only some > 40?)
▪ Is the index clustered?


23
Examples of Clustered Indexes
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno

❖ Consider the GROUP BY query.
▪ Index on E.age? E.dno?
❖ Issues:
▪ Use index on E.age?
• If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly.
▪ Use index on E.dno?
• Clustered E.dno index may be better!

❖ What about without WHERE condition?

24
Examples of Clustered Indexes
SELECT E.dno
FROM Emp E
WHERE E.hobby='Stamps'

❖ B+ tree index on E.hobby?
❖ NOTE: This is an equality query.
❖ NOTE: The hobby field may contain many duplicate values.

❖ Clustered or unclustered index?
❖ CONCLUDE: Clustering on E.hobby helps!

25
Indexes with Composite Search Keys
Composite Search Keys: Search on a combination of fields (sal and age).

Data records, sorted by name:
name age sal
bob  12  10
cal  11  80
joe  12  20
sue  13  75

Data entries in index, sorted by search key:
<age, sal>: 11,80  12,10  12,20  13,75
<age>:      11  12  12  13
<sal, age>: 10,12  20,12  75,13  80,11
<sal>:      10  20  75  80
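
The four data entry orderings in this example can be reproduced with a few lines of Python. The tuple representation is an assumption made for illustration; the values are the four data records from the slide.

```python
# Data records from the slide: (name, age, sal).
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries sorted by each search key. Python compares tuples
# lexicographically, which matches the ordering of composite keys.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)
age_only = sorted(age for _, age, _ in records)
sal_only = sorted(sal for _, _, sal in records)
```

Note that `age_sal` and `sal_age` hold the same pairs of values in different orders, which is why the order of fields in a composite search key matters.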

26
Equality and Composite Search Keys

❖ Equality query: Every field value is equal to a constant value.

❖ Examples:
▪ age=20
▪ sal=75
▪ age=20 and sal=75
▪ sal=75 and age=20

[Figure: data records (bob, cal, joe, sue) sorted by name, with data entries sorted by <age, sal>, <age>, <sal, age>, and <sal>.]

27
Composite Search Keys

❖ To retrieve Emp records with age=30 AND sal=4000:

▪ An index on <age,sal> would be better than an index on age or an index on sal.

28
Ranges and Composite Search Keys
❖ Range query: Some field value is not a constant but a range.

❖ Examples:
▪ age=12 and sal > 10
▪ age=12

[Figure: examples of composite key indexes using lexicographic order. Data records (bob, cal, joe, sue) sorted by name, with data entries sorted by <age, sal>, <age>, <sal, age>, and <sal>.]

29
Composite Search Keys

❖ If condition is: 20<age<30 AND 3000<sal<5000:

▪ Clustered tree index on <age,sal> or <sal,age> is best.

❖ If condition is: age=30 AND 3000<sal<5000:

▪ Clustered <age,sal> index much better than <sal,age> index!

❖ Composite indexes are larger, updated more often.
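
The second case (age=30 AND 3000<sal<5000) can be made concrete by counting how many data entries each index has to scan. The sketch below models each index as a sorted Python list; the sample Emp values are invented for illustration and clustering is not modeled.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (age, sal) values for a handful of Emp records.
emps = [(25, 3500), (30, 2000), (30, 3500), (30, 4500), (30, 6000),
        (35, 4000), (40, 4200)]

age_sal = sorted(emps)                       # entries of an <age,sal> index
sal_age = sorted((s, a) for a, s in emps)    # entries of a <sal,age> index

# Condition: age=30 AND 3000<sal<5000.

# <age,sal>: matching entries are contiguous, so two binary searches
# delimit exactly the entries to read.
lo = bisect_right(age_sal, (30, 3000))
hi = bisect_left(age_sal, (30, 5000))
scanned_age_sal = age_sal[lo:hi]
matches_age_sal = scanned_age_sal

# <sal,age>: every entry with 3000<sal<5000 must be scanned, and the
# age=30 condition is applied as a filter afterwards.
lo = bisect_right(sal_age, (3000, float("inf")))
hi = bisect_left(sal_age, (5000, float("-inf")))
scanned_sal_age = sal_age[lo:hi]
matches_sal_age = [(a, s) for s, a in scanned_sal_age if a == 30]
```

Both indexes return the same two records, but the <age,sal> index scans only those 2 entries, while the <sal,age> index scans 5 entries and discards 3 of them.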

30
Index-Only Plans
❖ Answer a query without retrieving actual tuples …

❖ Is that possible?

❖ Yes, if an index with suitable information is available.

❖ Why is it a good idea?


31
Index-Only Plans

❖ A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available.

▪ Index on <E.dno>:
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno

▪ Tree index on <E.dno,E.eid>:
SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno=E.dno

▪ Index on <E.dno>:
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno

▪ Tree index on <E.dno,E.sal>:
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno

▪ Tree index on <E.age,E.sal> or <E.sal,E.age>:
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
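
As a sketch of why the COUNT(*) query needs no Emp tuples, the Python fragment below computes the grouped counts from the data entries of an <E.dno> index alone. The (dno, rid) entries and the rid strings are hypothetical, and Alternative 2 with entries kept in dno order is assumed.

```python
from itertools import groupby

# Hypothetical data entries of an index on <E.dno>, using Alternative 2:
# (dno, rid) pairs kept in dno order.
dno_index = [(1, "r7"), (1, "r2"), (2, "r5"), (3, "r1"), (3, "r4"), (3, "r9")]

def count_by_dno(index_entries):
    """SELECT E.dno, COUNT(*) FROM Emp E GROUP BY E.dno, from the index alone.

    The rids are never followed, so no Emp data page is read.
    """
    return [(dno, sum(1 for _ in group))
            for dno, group in groupby(index_entries, key=lambda e: e[0])]

result = count_by_dno(dno_index)
```

The function only ever looks at the key part of each data entry. Because the entries are already grouped by dno, one pass over the index is enough and no sorting is needed.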
32
Index-Only Plans
+ Does index-only evaluation make sense?

▪ Index on <E.dno>?
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno

▪ Index on <E.dno>? <E.sal>? <E.dno,E.sal>?
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno

▪ Tree index on <E.age,E.sal> or <E.sal,E.age>?
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND E.sal BETWEEN 3000 AND 5000
33
Index-Only Plans : Multi-Key Index

PROS:
+ The chance for index-only evaluation is increased.

CONS:
- Index size is larger.
- Index must be updated when any field in its search key changes.

34
Index-Only Plans

SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age=30
GROUP BY E.dno

❖ Tree index on <dno,age>, or on <age,dno>?

❖ Which is better?

35
Index-Only Plans
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age=30
GROUP BY E.dno

❖ Tree index on <dno,age>, or on <age,dno>?
❖ Which is better?

❖ What if we consider the second query?

SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>30
GROUP BY E.dno

36
