Data Warehousing: Lecture No 04
Data Warehousing
By: Dr. Syed Aun Irtaza
De-normalization
Normalization
Normalization is a process that "improves" a database design for OLTP systems by generating relations that are simple and stable in structure.
Normalization
What are the goals of normalization?
Eliminate redundant data.
Ensure data dependencies make sense.
Normalization: 1NF
The table FIRST contains only atomic values, but also contains redundant data.

FIRST
SID  Degree  Campus     Course  Marks
1    BS      Islamabad  CS-101  30
1    BS      Islamabad  CS-102  20
1    BS      Islamabad  CS-103  40
1    BS      Islamabad  CS-104  20
1    BS      Islamabad  CS-105  10
1    BS      Islamabad  CS-106  10
2    MS      Lahore     CS-101  30
2    MS      Lahore     CS-102  40
3    MS      Lahore     CS-102  20
4    BS      Islamabad  CS-102  20
4    BS      Islamabad  CS-104  30
4    BS      Islamabad  CS-105  40
Normalization: 1NF
Anomalies
INSERT: a student with SID 5, who got admission in a different campus (say, Karachi), cannot be added until the student registers for a course.

FIRST is in 1NF but not in 2NF, because Degree and Campus are functionally dependent only on the SID part of the composite key (SID, Course). This can be illustrated by listing the functional dependencies in the table.

To transform the table FIRST into 2NF, we move the columns SID, Degree and Campus to a new table called REGISTRATION. The column SID becomes the primary key of this new table.
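The decomposition described above can be sketched in Python, using a few sample rows from the FIRST table (tuples stand in for table rows; this is an illustration, not a database implementation):

```python
# Sketch: decomposing FIRST (1NF) into REGISTRATION and PERFORMANCE (2NF).
# Each row is (SID, Degree, Campus, Course, Marks); a subset of the slide's data.
first = [
    (1, "BS", "Islamabad", "CS-101", 30),
    (1, "BS", "Islamabad", "CS-102", 20),
    (2, "MS", "Lahore", "CS-101", 30),
]

# REGISTRATION keeps the columns that depend only on SID; SID becomes its PK.
registration = {sid: (degree, campus) for sid, degree, campus, _, _ in first}

# PERFORMANCE keeps the columns that depend on the full key (SID, Course).
performance = [(sid, course, marks) for sid, _, _, course, marks in first]

print(registration)  # {1: ('BS', 'Islamabad'), 2: ('MS', 'Lahore')}
```

Each SID now appears exactly once in REGISTRATION, removing the Degree/Campus redundancy.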
Normalization: 2NF

REGISTRATION (SID is now a PK)
SID  Degree  Campus
1    BS      Islamabad
2    MS      Lahore
3    MS      Lahore
4    BS      Islamabad
5    PhD     Peshawar

PERFORMANCE
SID  Course  Marks
1    CS-101  30
1    CS-102  20
1    CS-103  40
1    CS-104  20
1    CS-105  10
1    CS-106  10
2    CS-101  30
2    CS-102  40
3    CS-102  20
4    CS-102  20
4    CS-104  30
4    CS-105  40
Normalization: 3NF

REGISTRATION
SID  Degree  Campus
1    BS      Islamabad
2    MS      Lahore
3    MS      Lahore
4    BS      Islamabad
5    PhD     Peshawar

STUDENT_CAMPUS
SID  Campus
1    Islamabad
2    Lahore
3    Lahore
4    Islamabad
5    Peshawar

CAMPUS_DEGREE
Campus     Degree
Islamabad  BS
Lahore     MS
Peshawar   PhD
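The 3NF step above can be sketched the same way, as a minimal illustration of removing the transitive dependency SID → Campus → Degree:

```python
# Sketch: splitting REGISTRATION (2NF) into STUDENT_CAMPUS and CAMPUS_DEGREE (3NF).
# Each row is (SID, Degree, Campus), taken from the slide's table.
registration = [
    (1, "BS", "Islamabad"),
    (2, "MS", "Lahore"),
    (5, "PhD", "Peshawar"),
]

student_campus = {sid: campus for sid, _, campus in registration}
campus_degree = {campus: degree for _, degree, campus in registration}

print(student_campus)  # {1: 'Islamabad', 2: 'Lahore', 5: 'Peshawar'}
print(campus_degree)   # {'Islamabad': 'BS', 'Lahore': 'MS', 'Peshawar': 'PhD'}
```

Note that this relies on Campus → Degree holding in the data, as it does in the slide's tables.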
Normalization: 3NF
Conclusions:
In 1st Normal Form, multidimensional data cubes are flattened into simple data lists.
What is De-normalization?
It is not chaos; it is more like a "controlled crash", with the aim of enhancing performance without loss of information.
Why De-normalization in DSS?
Bringing dispersed but related data items "close" together.
How does De-normalization improve performance?
4 Guidelines for De-normalization
1. Carefully do a cost-benefit analysis (frequency of use, additional storage, join time).
2. Do a data requirement and storage analysis.
3. Weigh against the maintenance issue of the redundant data (triggers used).
4. When in doubt, don't denormalize.
Areas for Applying De-Normalization Techniques
Five principal De-normalization techniques
1. Collapsing Tables.
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.
2. Splitting Tables (horizontal and vertical splitting).
3. Pre-Joining.
4. Adding Redundant Columns.
5. Derived Attributes.
Collapsing Tables

[Diagram: two tables sharing the key ColA, one with ColB and one with ColC, are collapsed (denormalized) into a single table (ColA, ColB, ColC).]

Reduced indexing.
Splitting Tables

[Diagram: Table (ColA, ColB, ColC) is split two ways. Vertical split: Table_v1 (ColA, ColB) and Table_v2 (ColA, ColC). Horizontal split: Table_h1 and Table_h2, each holding a subset of the rows.]
Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon common column values. Example: campus-specific queries.
Splitting Tables: Horizontal splitting
ADVANTAGES
◦ Enhanced security of data.
◦ Organizing tables differently for different queries.
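A horizontal split on the Campus column can be sketched as follows (sample rows from the REGISTRATION table; the dict-per-campus partitioning is illustrative only):

```python
# Sketch: horizontal split of REGISTRATION on the Campus column, so that
# campus-specific queries scan only one (smaller) table.
registration = [
    (1, "BS", "Islamabad"),
    (2, "MS", "Lahore"),
    (3, "MS", "Lahore"),
    (4, "BS", "Islamabad"),
]

splits = {}
for row in registration:
    splits.setdefault(row[2], []).append(row)  # one "table" per campus value

# A campus-specific query now touches only the matching split:
print(len(splits["Lahore"]))  # 2 rows scanned instead of 4
```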
Splitting Tables: Vertical Splitting
Infrequently accessed columns become extra "baggage", thus degrading performance.
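A vertical split can be sketched the same way (the member rows, column choice, and "hot"/"cold" labels here are hypothetical):

```python
# Sketch: vertical split of a wide member table. The frequently used "hot"
# columns go to one narrow table; the rarely used columns go to a second
# table sharing the same primary key (member ID).
members = [
    ("M1", "Ali", "Islamabad", "long, infrequently used profile text ..."),
    ("M2", "Sana", "Lahore", "another long profile ..."),
]

hot = [(mid, name) for mid, name, _, _ in members]            # frequent queries
cold = [(mid, city, blob) for mid, _, city, blob in members]  # rare queries

# Frequent queries scan only the narrow table; the shared key (mid) lets us
# re-join with `cold` when the extra columns are actually needed.
print(hot[0])  # ('M1', 'Ali')
```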
Pre-joining …
Identify frequent joins and append the tables
together in the physical data model.
Pre-Joining…

Master
Sale_ID  Sale_date  Sale_person

  1 : M  (normalized)

Detail
Tx_ID  Sale_ID  Item_ID  Item_Qty  Sale_Rs

[The normalized Master and Detail tables are pre-joined into a single denormalized table.]
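The pre-join can be sketched with the Master/Detail columns from the slide (the sample values, date, and salesperson name are hypothetical):

```python
# Sketch: pre-joining the sales Master (1) and Detail (M) tables into one
# denormalized table, so the frequent join is paid once at load time.
master = {101: ("2024-01-05", "Aslam")}  # Sale_ID -> (Sale_date, Sale_person)
detail = [
    (1, 101, "I-7", 3, 500),  # (Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs)
    (2, 101, "I-9", 1, 120),
]

prejoined = [
    (tx, sale, master[sale][0], master[sale][1], item, qty, rs)
    for tx, sale, item, qty, rs in detail
]

print(prejoined[0])  # (1, 101, '2024-01-05', 'Aslam', 'I-7', 3, 500)
```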
Pre-Joining: Typical Scenario
Typical of Market basket query
Join ALWAYS required
Tables could be millions of rows
Adding Redundant Columns…
Columns can also be moved, instead of making them redundant. This is very similar to pre-joining, as discussed earlier.
EXAMPLE
Frequent referencing of a code in one table and the corresponding description in another table, where a join is required.
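The code/description case can be sketched as follows (the course code is reused from the earlier tables; the description text and student IDs are hypothetical):

```python
# Sketch: adding a redundant Description column to avoid a join against the
# code table on every query.
codes = {"CS-101": "Intro to Computing"}  # code -> description lookup table
enrolments = [("S1", "CS-101"), ("S2", "CS-101")]

# De-normalized: the description is copied into each referencing row.
enrolments_denorm = [(sid, code, codes[code]) for sid, code in enrolments]

print(enrolments_denorm[0])  # ('S1', 'CS-101', 'Intro to Computing')
```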
Issues of De-normalization
◦ Storage
◦ Performance
◦ Ease-of-use
◦ Maintenance
Industry Characteristics: Master-to-Detail Ratios
Health care: 1:2 ratio.
Storage Issues: Pre-joining Facts
◦ Assume a 1:2 record count ratio between claim master and detail for a health-care application.
◦ Assume 10 million members (20 million records in claim detail).
◦ Assume a 10-byte member_ID.
◦ Assume a 40-byte header for the master and a 60-byte header for the detail table.
Storage Issues: Pre-joining (Calculations)
With normalization:
Total space used = 10M × 40 + 20M × 60 bytes = 1.6 GB
After denormalization:
Total space used = (60 + 40 − 10) × 20M bytes = 1.8 GB
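The arithmetic can be checked directly (sizes in bytes, straight from the slide's assumptions):

```python
# Verifying the slide's storage figures for pre-joining.
master_rows, detail_rows = 10_000_000, 20_000_000
master_hdr, detail_hdr, member_id = 40, 60, 10  # bytes

normalized = master_rows * master_hdr + detail_rows * detail_hdr

# Pre-joined: every detail row carries the master columns, minus the
# member_ID join key that would otherwise be stored twice.
denormalized = (master_hdr + detail_hdr - member_id) * detail_rows

print(normalized / 1e9, denormalized / 1e9)  # 1.6 1.8 (GB)
```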
Performance Issues: Pre-joining
Consider the query “How many members were
paid claims during last year?”
With normalization:
Simply count the number of records in the
master table.
After denormalization:
The member_ID would be repeated, hence
need a count distinct. This will cause sorting
on a larger table and degraded performance.
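The difference can be sketched with toy member lists (IDs hypothetical):

```python
# Sketch: counting paid members before and after pre-joining.
master = ["M1", "M2", "M3"]                 # one row per member with a claim
prejoined = ["M1", "M1", "M2", "M2", "M3"]  # member_ID repeats per claim line

paid_normalized = len(master)            # cheap row count on the master table
paid_denormalized = len(set(prejoined))  # needs de-duplication (sort/hash)

print(paid_normalized, paid_denormalized)  # 3 3 -- same answer, more work
```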
Performance Issues: Adding redundant columns
Continuing with the previous health-care example, assume a 60-byte detail table record and a 10-byte Sale_Person column.
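Using the 20 million detail records assumed earlier, the cost of copying the 10-byte column into the 60-byte detail row works out as follows (a rough estimate that ignores indexing and page overheads):

```python
# Rough storage cost of one redundant 10-byte column in the detail table.
detail_rows, detail_hdr, sale_person = 20_000_000, 60, 10

extra = detail_rows * sale_person  # bytes added by the redundant copy
growth = sale_person / detail_hdr  # per-row growth factor

print(extra / 1e9, round(growth * 100, 1))  # 0.2 (GB) 16.7 (% wider rows)
```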
Other Issues: Adding redundant columns
Other issues include an increase in table size, maintenance overhead, and loss of information.
Ease of use Issues: Horizontal Splitting
Horizontal splitting is a divide-and-conquer technique that exploits parallelism. The "conquer" part of the technique is about combining the results.
Ease of use Issues: Horizontal Splitting
Round robin and random splitting:
◦ Guarantee good data distribution.
◦ Almost impossible to reverse (or
undo).
◦ Not pre-defined.
Ease of use Issues: Horizontal Splitting
Range and expression splitting:
◦ Can facilitate partition elimination
with a smart optimizer.
◦ Generally lead to "hot spots" (uneven distribution of data).
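Partition elimination can be sketched as follows (the Marks ranges and the assignment of PERFORMANCE rows to partitions are hypothetical):

```python
# Sketch: range splitting lets the optimizer skip partitions that cannot
# contain qualifying rows (partition elimination).
partitions = {
    (0, 25): [("CS-102", 20), ("CS-104", 20)],   # Marks in [0, 25)
    (25, 50): [("CS-101", 30), ("CS-103", 40)],  # Marks in [25, 50)
}

def query_marks_over(threshold):
    # Scan only partitions whose upper bound exceeds the threshold.
    scanned = [rows for (lo, hi), rows in partitions.items() if hi > threshold]
    return [r for rows in scanned for r in rows if r[1] > threshold]

print(query_marks_over(25))  # -> [('CS-101', 30), ('CS-103', 40)]
```

Only one of the two partitions is touched by this query; round-robin or random splits cannot offer this shortcut.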
Performance Issues: Horizontal Splitting

[Figure: workload across processors P1–P4, showing the "hot spot" caused by the dramatic cancellation of airline reservations after 9/11.]
Performance issues: Vertical Splitting Facts
Example: Consider a 100-byte header for the member table, such that 20 bytes provide complete coverage for 90% of the queries.
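A back-of-the-envelope reading of these numbers (assuming I/O is proportional to bytes scanned):

```python
# Rough I/O benefit of splitting off the 20 "hot" bytes of a 100-byte row.
full_row, hot_row, coverage = 100, 20, 0.90

io_ratio = hot_row / full_row  # bytes scanned per row after the split
print(io_ratio, coverage)      # 0.2 -- 90% of queries do ~1/5 of the I/O
```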