Discussion Session Week 7: Database Indices
Example 1
Assume the table created by the following statement:
CREATE TABLE customer (
    id     SERIAL PRIMARY KEY,
    SSN    INTEGER NOT NULL UNIQUE,
    gender VARCHAR(6) NOT NULL CHECK (gender = 'MALE' OR gender = 'FEMALE'),
    city   TEXT NOT NULL
);
Example 1
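Assume the following numbers of distinct values for the attributes of the customer table:
Attribute   Distinct values
Id          4M
SSN         4M
Gender      2 (MALE or FEMALE, per the CHECK constraint)
City        40K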
Example 1a
Consider the following prepared query :
SELECT *
FROM customer
WHERE SSN = ?
And the following information we have about the system (DA =
disk access) :
Disk page size : 4KB
Assume each tuple is 100 bytes (40 tuples per page)
Index lookup cost : 0DA (index in main memory)
Page access cost : 1DA
Assume tuples not clustered
Index on ID always exists
Example 1a
Cost without index:
Only one tuple is in the answer, but without an index we need to scan the entire relation to find it.
4M tuples / 40 tuples per page = 100k page accesses = 100k DA
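The 100k figure follows from the table size and the page capacity. Below is a minimal Python sketch of that arithmetic, assuming the customer table holds 4M tuples (one per distinct id). The name `full_scan_cost` is illustrative and does not come from the slides. The last assignment applies the same cost model to an index on SSN, under the assumption that a UNIQUE attribute matches exactly one tuple.

```python
import math

TUPLES_IN_CUSTOMER = 4_000_000  # assumed: one tuple per distinct id
TUPLES_PER_PAGE = 40            # 4KB page, 100-byte tuples (from the slide)


def full_scan_cost(num_tuples: int, tuples_per_page: int) -> int:
    """Disk accesses needed to read every page of a relation once."""
    return math.ceil(num_tuples / tuples_per_page)


scan_da = full_scan_cost(TUPLES_IN_CUSTOMER, TUPLES_PER_PAGE)

# Assumption: SSN is UNIQUE, so the in-memory index (0 DA) points at
# exactly one tuple, and fetching that tuple costs one page access.
ssn_index_da = 1

print(f"scan: {scan_da} DA, SSN index: {ssn_index_da} DA")
```

Under these assumptions the index on SSN replaces a 100k DA scan with a single page access.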
Example 1b
What about this prepared query :
SELECT *
FROM customer
WHERE gender = ?
Assume same information about the system.
Should we use an index on the gender attribute to answer this query more efficiently?
Example 1b
Cost without index :
We still need to scan the entire relation:
100k page accesses = 100k DA
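To answer the question about the gender index, the scan cost can be compared with an estimate of the index cost. The sketch below is not taken from the slides. It assumes 4M customer tuples, 2 distinct gender values (from the CHECK constraint), uniformly distributed values, and one page access per matching tuple because tuples are not clustered. The helper name `unclustered_index_cost` is hypothetical.

```python
TUPLES_IN_CUSTOMER = 4_000_000  # assumed: one tuple per distinct id
TUPLES_PER_PAGE = 40            # 4KB page, 100-byte tuples
DISTINCT_GENDERS = 2            # MALE or FEMALE, per the CHECK constraint


def unclustered_index_cost(num_tuples: int, distinct_values: int) -> int:
    """Expected page accesses for an equality lookup through an in-memory
    index when matching tuples are not clustered: one access per match.
    Assumes values are uniformly distributed."""
    return num_tuples // distinct_values


scan_da = TUPLES_IN_CUSTOMER // TUPLES_PER_PAGE
index_da = unclustered_index_cost(TUPLES_IN_CUSTOMER, DISTINCT_GENDERS)
use_index = index_da < scan_da

print(f"scan: {scan_da} DA, gender index: {index_da} DA, use index: {use_index}")
```

With half of the tuples matching, the index estimate is 2M DA against 100k DA for the scan, so under this model the index on gender does not help.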
Example 1c
What about this prepared query :
SELECT *
FROM customer
WHERE city = ?
Assume same information about the
system.
Should we use an index on the city
attribute to answer this query more
efficiently?
Example 1c
Cost without index :
We scan the entire relation:
100k page accesses = 100k DA
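For the city query the same comparison comes out the other way. The following sketch assumes 4M customer tuples, 40K distinct cities, uniformly distributed values, and one page access per matching unclustered tuple. It also derives a break-even rule from those assumptions: the index wins when T / V < T / (tuples per page), that is, when the attribute has more distinct values than a page has tuples. The function name `index_beats_scan` is illustrative.

```python
TUPLES_IN_CUSTOMER = 4_000_000  # assumed: one tuple per distinct id
TUPLES_PER_PAGE = 40            # 4KB page, 100-byte tuples
DISTINCT_CITIES = 40_000        # from the table statistics

scan_da = TUPLES_IN_CUSTOMER // TUPLES_PER_PAGE        # full relation scan
city_index_da = TUPLES_IN_CUSTOMER // DISTINCT_CITIES  # one DA per matching tuple


def index_beats_scan(distinct_values: int,
                     tuples_per_page: int = TUPLES_PER_PAGE) -> bool:
    """Under the uniform, unclustered model the index is cheaper exactly
    when the attribute has more distinct values than tuples fit on a page."""
    return distinct_values > tuples_per_page


print(f"scan: {scan_da} DA, city index: {city_index_da} DA")
print("city index helps:", index_beats_scan(DISTINCT_CITIES))
```

The estimate is about 100 DA with the index against 100k DA for the scan, so an index on city pays off under these assumptions.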
Example 2
Assume we now have another relation:
CREATE TABLE sales (
    id          SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id) NOT NULL,
    product     TEXT NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount > 0 AND amount <= 4000)
);
Assume the company does not allow sales of more than 4000 products at a time.
Example 2
Assume the following numbers of distinct values for the attributes of the sales table:
Attribute     Distinct values
ID            40M
Customer_id   4M
Product       4K
Amount        4K
Example 2
Consider the following prepared query :
SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id
Recall the information we have about the system (DA = disk access) :
Disk page size : 4KB
Assume each tuple is 100 bytes for both relations (40 tuples per page)
Index lookup cost : 0DA (index in main memory)
Page access cost : 1DA
Assume tuples not clustered
Index on ID always exists
Page write cost : 1DA
Example 2
Recall the best alternative, the sort-merge join:
Recall from the lecture notes: given a join of tables R and S, sort-merge join takes only 2 reads of R and S and one write of the equivalent amount of data.
Recall: #pagesInCustomer = 100k, #pagesInSales = 1M.
Cost of sort-merge join:
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (100k + 1M)
= 3300k DA
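The sort-merge cost above can be written as a small function. This is a sketch of the slide's formula only (2 reads and 1 write of both relations), not a model of a real sort-merge implementation. The function name `sort_merge_join_cost` is illustrative, and the page counts assume 40 tuples per page.

```python
TUPLES_PER_PAGE = 40  # 4KB page, 100-byte tuples


def sort_merge_join_cost(pages_r: int, pages_s: int,
                         reads: int = 2, writes: int = 1) -> int:
    """Page accesses for a sort-merge join under the lecture's model:
    every page of both relations is read twice and written once."""
    return (reads + writes) * (pages_r + pages_s)


pages_customer = 4_000_000 // TUPLES_PER_PAGE   # 100k pages
pages_sales = 40_000_000 // TUPLES_PER_PAGE     # 1M pages

cost = sort_merge_join_cost(pages_customer, pages_sales)
print(f"sort-merge join: {cost} DA")
```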
Example 2
Assume the index is on
sales.customer_id :
For each tuple in customer, we join with
tuples from sales that match the
customer tuple id.
#pageAccesses = T(Customer) x (T(Sales) / V(Sales, customer_id))
= 4M x (40M / 4M) = 40M DA
Index on sales.customer_id is much worse than sort-merge join (40M DA vs. 3300k DA).
Example 2
Assume the index is on customer.id :
For each tuple in sales, we join with the tuple from customer that is referenced by the sales tuple's customer_id.
#pageAccesses = T(Sales) x (T(Customer) / V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id is also worse than sort-merge join.
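Both index-based estimates in Example 2 use the same formula: the number of outer tuples times the expected number of matching inner tuples, with one page access per match. A minimal sketch follows, assuming uniformly distributed join values. The function name `index_join_cost` is hypothetical, and the sort-merge figure is copied from the earlier slide.

```python
T_CUSTOMER = 4_000_000           # tuples in customer
T_SALES = 40_000_000             # tuples in sales
V_SALES_CUSTOMER_ID = 4_000_000  # distinct customer_id values in sales
V_CUSTOMER_ID = 4_000_000        # distinct id values in customer
SORT_MERGE_DA = 3_300_000        # sort-merge cost from the earlier slide


def index_join_cost(outer_tuples: int, inner_tuples: int,
                    inner_distinct: int) -> int:
    """Page accesses for an index-based join: for each outer tuple, fetch
    every matching inner tuple at one page access each (index lookup is free)."""
    return outer_tuples * (inner_tuples // inner_distinct)


# Outer relation customer, probing the index on sales.customer_id.
via_sales_index = index_join_cost(T_CUSTOMER, T_SALES, V_SALES_CUSTOMER_ID)
# Outer relation sales, probing the index on customer.id.
via_customer_index = index_join_cost(T_SALES, T_CUSTOMER, V_CUSTOMER_ID)

for name, da in [("index on sales.customer_id", via_sales_index),
                 ("index on customer.id", via_customer_index),
                 ("sort-merge", SORT_MERGE_DA)]:
    print(f"{name}: {da} DA")
```

Either index gives 40M DA, roughly twelve times the sort-merge cost, because every tuple of the outer relation triggers at least one random page access.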
Example 3
Now assume we slightly change the previous query by adding a selection:
SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id AND
C.id = 12345 -- Yannis's ID
Example 3
If we use the sort-merge join:
Recall: #pagesInSales = 1M.
Given that only one tuple from customer is selected and the selection happens before the join, we have #pagesInCustomer = 1.
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (1 + 1M)
≈ 3M DA
Example 3
Assume the index is on
sales.customer_id :
There is only one tuple in customer after
the selection.
#pageAccesses = T(Customer) x (T(Sales) / V(Sales, customer_id))
= 1 x (40M / 4M) = 10 DA
Index on sales.customer_id is much, much better than sort-merge join if we have a very selective selection.
Example 3
Assume the index is on customer.id :
There is no selection over sales.
#pageAccesses = T(Sales) x (T(Customer) / V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id is not affected by the selection on the customer table.
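The three Example 3 estimates can be placed side by side to pick the cheapest plan. The sketch below only re-evaluates the formulas from the slides, with the selection reducing customer to one tuple (one page). The plan labels and the `plans` dictionary are illustrative and not part of the original material.

```python
PAGES_SALES = 1_000_000          # 40M sales tuples at 40 tuples per page
T_SALES = 40_000_000             # tuples in sales
T_CUSTOMER = 4_000_000           # tuples in customer
V_SALES_CUSTOMER_ID = 4_000_000  # distinct customer_id values in sales
V_CUSTOMER_ID = 4_000_000        # distinct id values in customer
SELECTED_CUSTOMERS = 1           # C.id = 12345 keeps a single tuple

plans = {
    # (2 reads + 1 write) x (1 customer page + all sales pages)
    "sort-merge": 3 * (SELECTED_CUSTOMERS + PAGES_SALES),
    # one selected customer tuple x matching sales tuples per customer
    "index on sales.customer_id":
        SELECTED_CUSTOMERS * (T_SALES // V_SALES_CUSTOMER_ID),
    # every sales tuple still probes customer: the selection does not help
    "index on customer.id": T_SALES * (T_CUSTOMER // V_CUSTOMER_ID),
}

best = min(plans, key=plans.get)
for name, da in sorted(plans.items(), key=lambda item: item[1]):
    print(f"{name}: {da} DA")
print("cheapest plan:", best)
```

Under these assumptions the index on sales.customer_id needs about 10 DA, compared with roughly 3M DA for sort-merge and 40M DA for the index on customer.id.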