
Discussion Session Week 7: Database Indices

The document discusses database indexing and works through examples to determine when an index benefits a query. In Example 1, an index on SSN helps a query that searches on SSN, but an index on gender does not, since about half the rows match. An index on city helps, since only about 100 of the 4M rows match. In Example 2, index-based joins on the key columns cost more than a sort-merge join. In Example 3, an index helps once there is a selective filter on one of the tables.

Uploaded by

Chochunder
Copyright
© All Rights Reserved

Discussion Session Week 7

Database Indices

Example 1
Assume the table created by the following statement:
CREATE TABLE customer (
    id     SERIAL PRIMARY KEY,
    SSN    INTEGER NOT NULL UNIQUE,
    gender VARCHAR(6) NOT NULL CHECK (gender = 'MALE' OR gender = 'FEMALE'),
    city   TEXT NOT NULL
);

Example 1
Assume the following information for the customer table:

id    SSN    gender    city
4M    4M     2         40K

The number below each field is the number of distinct values for that attribute. Notice that since id is a primary key, the table itself has 4M tuples.

Example 1a
Consider the following prepared query:

SELECT *
FROM customer
WHERE SSN = ?

And the following information we have about the system (DA = disk access):
Disk page size: 4KB
Each tuple is 100 bytes (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Tuples are not clustered
An index on id always exists

Should we use an index on the SSN attribute to answer this query more efficiently?

Example 1a
Cost without index:
Only one tuple is in the answer, but we need to scan the entire relation to find it.
4M tuples / 40 tuples per page = 100k page accesses = 100k DA

Cost with index:
The index lookup finds the correct page right away.
T(customer)/V(customer,SSN) = 4M/4M = 1 DA
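The two estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes the slide's cost model (uniformly distributed values, one page access per matching tuple, index lookups free).

```python
# Figures taken from Example 1a; function names are illustrative only.
T_CUSTOMER = 4_000_000      # T(customer): number of tuples
TUPLES_PER_PAGE = 40        # 4KB page, 100-byte tuples (as assumed in the slides)
V_SSN = 4_000_000           # V(customer, SSN): distinct SSN values


def full_scan_cost(tuples: int, tuples_per_page: int) -> int:
    """Disk accesses needed to read every page of the relation once."""
    return -(-tuples // tuples_per_page)  # ceiling division


def unclustered_index_cost(tuples: int, distinct_values: int) -> int:
    """Expected number of matching tuples, each assumed to cost one page access."""
    return -(-tuples // distinct_values)


scan_da = full_scan_cost(T_CUSTOMER, TUPLES_PER_PAGE)
ssn_index_da = unclustered_index_cost(T_CUSTOMER, V_SSN)
print(f"scan: {scan_da} DA, SSN index: {ssn_index_da} DA")
```

Under these assumptions the scan costs 100k DA and the index lookup costs 1 DA, matching the slide.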

Example 1b
What about this prepared query:

SELECT *
FROM customer
WHERE gender = ?

Assume the same information about the system.
Should we use an index on the gender attribute to answer this query more efficiently?

Example 1b
Cost without index:
Still need to scan the entire relation.
100k page accesses = 100k DA

Cost with index (unclustered tuples):
Half the tuples are in the answer (50% chance for each tuple), and each one costs a page access.
T(customer)/V(customer,gender) = 4M/2 = 2M DA

Cost with index (clustered tuples):
Pages are accessed at most once, therefore #pagesAccessed = 100k DA.

Index beneficial: No, in all cases.
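To make the comparison concrete, the sketch below recomputes the three figures for the gender query. It is only an illustration under the slide's assumptions; the variable names are hypothetical, and the "page-capped" figure simply encodes the slide's statement that no page is read more than once.

```python
# Figures from Example 1b; names are illustrative, not from the slides.
T_CUSTOMER = 4_000_000
PAGES_CUSTOMER = T_CUSTOMER // 40   # 100,000 pages at 40 tuples per page
V_GENDER = 2                        # MALE / FEMALE

scan_da = PAGES_CUSTOMER

# Unclustered: every matching tuple is assumed to cost its own page access.
unclustered_da = T_CUSTOMER // V_GENDER

# The slide's second case: a page is never fetched twice, so the cost
# cannot exceed the number of pages in the relation.
page_capped_da = min(unclustered_da, PAGES_CUSTOMER)

# How many times more expensive the unclustered index plan is than a scan.
slowdown = unclustered_da / scan_da
print(scan_da, unclustered_da, page_capped_da, slowdown)
```

With these numbers the unclustered index plan is 20 times more expensive than the scan, and the page-capped plan only ties it.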

Example 1c
What about this prepared query:

SELECT *
FROM customer
WHERE city = ?

Assume the same information about the system.
Should we use an index on the city attribute to answer this query more efficiently?

Example 1c
Cost without index:
Scan the entire relation.
100k page accesses = 100k DA

Cost with index:
About 100 tuples are in the answer (one city out of 40k, roughly 0.0025% of the tuples).
T(customer)/V(customer,city) = 4M/40k = 100 DA

Index beneficial: Yes.
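Examples 1a to 1c suggest a simple rule of thumb under this cost model: the unclustered index costs T/V accesses and the scan costs T/40, so the index wins exactly when V exceeds the number of tuples per page. The sketch below encodes that observation; it is a derived illustration with a hypothetical function name, and it only holds for uniformly distributed values and unclustered tuples.

```python
TUPLES_PER_PAGE = 40  # from the system description in Example 1a


def index_beats_scan(distinct_values: int,
                     tuples_per_page: int = TUPLES_PER_PAGE) -> bool:
    """True when T/V < T/tuples_per_page, which simplifies to V > tuples_per_page."""
    return distinct_values > tuples_per_page


# Distinct-value counts for the customer table, taken from the slides.
distinct_counts = {"SSN": 4_000_000, "gender": 2, "city": 40_000}
verdicts = {attr: index_beats_scan(v) for attr, v in distinct_counts.items()}
print(verdicts)
```

The verdicts agree with the slides: indexes on SSN and city pay off, an index on gender does not.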

Example 2
Assume we now have another relation:
CREATE TABLE sales (
    id          SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    product     TEXT NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount > 0 AND amount <= 4000)
);
Assume the company does not allow sales of more than 4000 products at a time.

Example 2
Assume the following information for the sales table:

id     customer_id    product    amount
40M    4M             4K         4K

Again, the number below each field is the number of distinct values for that attribute. Since customer_id is a foreign key, there cannot be more than 4M distinct values. There are 40M tuples in this table.

Example 2
Consider the following prepared query:

SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id

Recall the information we have about the system (DA = disk access):
Disk page size: 4KB
Each tuple is 100 bytes for both relations (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Tuples are not clustered
An index on id always exists
Page write cost: 1 DA

Should we use an index on the customer_id attribute of sales or the id attribute of customer to answer this query more efficiently?

Example 2
Recall the best alternative, the sort-merge join:
Recall from the lecture notes: given a join of tables R and S, sort-merge join takes only 2 reads of R and S and one write of the equivalent amount of data.
Recall: #pagesInCustomer = 100k, #pagesInSales = 1M.
Cost of sort-merge join:
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (100k + 1M)
= 3300k DA
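The sort-merge estimate can be written as a small helper. This is a sketch of the slide's simplified formula only (two read passes and one write pass over both relations), not a model of a real sort-merge implementation; the function and parameter names are hypothetical.

```python
def sort_merge_cost(pages_r: int, pages_s: int,
                    read_passes: int = 2, write_passes: int = 1) -> int:
    """Disk accesses for the slide's simplified sort-merge join model."""
    return (read_passes + write_passes) * (pages_r + pages_s)


# Page counts derived from the slides: 40 tuples per page.
pages_customer = 4_000_000 // 40     # 100k pages
pages_sales = 40_000_000 // 40       # 1M pages

sort_merge_da = sort_merge_cost(pages_customer, pages_sales)
print(f"sort-merge join: {sort_merge_da} DA")
```

This gives 3.3M DA, the baseline that the index-based plans are compared against.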

Example 2
Assume the index is on sales.customer_id:
For each tuple in customer, we join with the tuples from sales that match the customer tuple's id.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 4M x (40M / 4M) = 40M DA
Index on sales.customer_id is much worse than sort-merge join.

Example 2
Assume the index is on customer.id:
For each tuple in sales, we join with the tuple from customer that is referred to by the sales tuple's customer_id.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id is also worse than sort-merge join.
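Both index-based plans follow the same pattern: probe the index once per outer tuple and pay one page access per matching inner tuple. The sketch below applies that pattern in both directions and compares the result with the 3300k DA sort-merge figure. The function name is hypothetical, and the model ignores the cost of reading the outer relation, as the slides do.

```python
def index_join_cost(outer_tuples: int, inner_tuples: int,
                    inner_distinct: int) -> int:
    """One page access per matching inner tuple, for every outer tuple."""
    matches_per_probe = inner_tuples // inner_distinct
    return outer_tuples * matches_per_probe


T_CUSTOMER, T_SALES = 4_000_000, 40_000_000
SORT_MERGE_DA = 3 * (100_000 + 1_000_000)   # figure from the sort-merge slide

# Index on sales.customer_id: customer is the outer relation.
via_sales_index = index_join_cost(T_CUSTOMER, T_SALES, 4_000_000)
# Index on customer.id: sales is the outer relation.
via_customer_index = index_join_cost(T_SALES, T_CUSTOMER, 4_000_000)

print(via_sales_index, via_customer_index, SORT_MERGE_DA)
```

Either direction costs 40M DA, roughly twelve times the sort-merge estimate.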

Example 3
Now assume we change the previous query slightly by adding a selection:

SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id AND
C.id = 12345 -- Yannis' ID

Important to know: the selection will happen first on most database systems.
Should we use an index on the customer_id attribute of sales or the id attribute of customer to answer this query more efficiently?

Example 3
If we use the sort-merge join:
Recall: #pagesInSales = 1M.
Given that only one tuple from customer is selected and the selection happens before the join, we have #pagesInCustomer = 1.
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (1 + 1M)
≈ 3M DA

Example 3
Assume the index is on sales.customer_id:
There is only one tuple in customer after the selection.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 1 x (40M / 4M) = 10 DA
Index on sales.customer_id is much, much better than sort-merge join if we have a very selective selection.

Example 3
Assume the index is on customer.id:
There is no selection over sales.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id is not affected by the selection on the customer table.
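Putting the three Example 3 plans side by side shows how much the selection changes the picture. The sketch below recomputes each estimate and picks the cheapest; the plan labels and variable names are illustrative, and the numbers rest on the slides' assumptions (selection applied first, one page access per matching tuple).

```python
# Figures from the slides; names are illustrative only.
T_SALES = 40_000_000
PAGES_SALES = 1_000_000
V_SALES_CUSTOMER_ID = 4_000_000
SELECTED_CUSTOMERS = 1      # C.id = 12345 matches a single customer tuple
SELECTED_PAGES = 1          # that tuple sits on one page

plans = {
    # (2 reads + 1 write) over the selected customer page and all of sales
    "sort-merge join": 3 * (SELECTED_PAGES + PAGES_SALES),
    # one index probe per selected customer, about 10 matching sales each
    "index on sales.customer_id":
        SELECTED_CUSTOMERS * (T_SALES // V_SALES_CUSTOMER_ID),
    # every sales tuple still probes customer.id (1 match per probe)
    "index on customer.id": T_SALES * 1,
}

best_plan = min(plans, key=plans.get)
print(best_plan, plans[best_plan])
```

Under these assumptions the index on sales.customer_id is the cheapest plan at 10 DA, while the other two plans stay in the millions.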
