Data Warehousing: Lecture No 04

The document discusses normalization and denormalization in data warehousing. It explains the goals and levels of normalization from 1NF to 3NF and provides examples to illustrate how normalization removes anomalies and improves database design. While normalization is important, the document notes that a balance needs to be struck between normalization and denormalization. Denormalization is discussed as a controlled process done to enhance query performance in data warehousing without losing information.


Lecture No 04

Data Warehousing
By: Dr. Syed Aun Irtaza
De-normalization
Data warehousing
Normalization
Normalization
Normalization is a process that “improves” a
database design for OLTP systems by generating
relations that are simple and stable in structure.

The objective of normalization:


“to create relations where every dependency is
on the key, the whole key, and nothing but the
key”.

3
Normalization
What are the goals of normalization?
 Eliminate redundant data.
 Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?

Should we always follow the purist’s approach to normalization?
NO
4
Normalization
Consider a student database system to be developed for a multi-campus
university, such that each campus specializes in one degree program,
i.e. BS, MS or PhD.

SID: Student ID
Degree: Registered as BS or MS student
Campus: City where campus is located
Course: Course taken
Marks: Score out of max of 50

SID  Degree  Campus     Course  Marks
1    BS      Islamabad  CS-101  30
1    BS      Islamabad  CS-102  20
1    BS      Islamabad  CS-103  40
1    BS      Islamabad  CS-104  20
1    BS      Islamabad  CS-105  10
1    BS      Islamabad  CS-106  10
2    MS      Lahore     CS-101  30
2    MS      Lahore     CS-102  40
3    MS      Lahore     CS-102  20
4    BS      Islamabad  CS-102  20
4    BS      Islamabad  CS-104  30
4    BS      Islamabad  CS-105  40
5
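To make the running example concrete, here is a minimal sketch that loads the table above into an in-memory database (assuming SQLite via Python's sqlite3; the table is named FIRST, as on the following slides). It is reused in the later normalization sketches.

import sqlite3

# In-memory database holding the unnormalized table FIRST from the slide.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE FIRST (SID INTEGER, Degree TEXT, Campus TEXT, Course TEXT, Marks INTEGER)")
rows = [
    (1, "BS", "Islamabad", "CS-101", 30), (1, "BS", "Islamabad", "CS-102", 20),
    (1, "BS", "Islamabad", "CS-103", 40), (1, "BS", "Islamabad", "CS-104", 20),
    (1, "BS", "Islamabad", "CS-105", 10), (1, "BS", "Islamabad", "CS-106", 10),
    (2, "MS", "Lahore", "CS-101", 30),    (2, "MS", "Lahore", "CS-102", 40),
    (3, "MS", "Lahore", "CS-102", 20),    (4, "BS", "Islamabad", "CS-102", 20),
    (4, "BS", "Islamabad", "CS-104", 30), (4, "BS", "Islamabad", "CS-105", 40),
]
con.executemany("INSERT INTO FIRST VALUES (?, ?, ?, ?, ?)", rows)
# Degree and Campus are repeated on every row of a student's courses.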
Normalization: 1NF
The table FIRST only contains atomic values, BUT it also contains
redundant data.

FIRST
SID  Degree  Campus     Course  Marks
1    BS      Islamabad  CS-101  30
1    BS      Islamabad  CS-102  20
1    BS      Islamabad  CS-103  40
1    BS      Islamabad  CS-104  20
1    BS      Islamabad  CS-105  10
1    BS      Islamabad  CS-106  10
2    MS      Lahore     CS-101  30
2    MS      Lahore     CS-102  40
3    MS      Lahore     CS-102  20
4    BS      Islamabad  CS-102  20
4    BS      Islamabad  CS-104  30
4    BS      Islamabad  CS-105  40
6
Normalization: 1NF
Anomalies

INSERT: A student with SID 5 who got admission at a different campus,
say Karachi, cannot be added until the student registers for a course.

DELETE: If a student graduates and his/her corresponding records are
deleted, then all information about that student is lost.

UPDATE: If a student, say SID = 1, migrates from the Islamabad campus
to the Lahore campus, then six rows would have to be updated with this
new information.
7
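A quick illustration of the UPDATE anomaly, continuing the sqlite3 sketch above (connection con with the table FIRST): moving student 1 to Lahore has to rewrite every one of that student's rows.

# Continuing from the earlier sketch (connection con with table FIRST):
cur = con.execute("UPDATE FIRST SET Campus = 'Lahore' WHERE SID = 1")
print(cur.rowcount)  # 6 -- six rows changed to record a single real-world fact
con.execute("UPDATE FIRST SET Campus = 'Islamabad' WHERE SID = 1")  # put the data back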
Normalization: 2NF
Every non-key column is fully dependent on the PK

FIRST is in 1NF but not in 2NF because Degree and Campus are
functionally dependent only on the column SID of the composite key
(SID, Course). This can be illustrated by listing the functional
dependencies in the table:

SID —> Campus, Degree
Campus —> Degree
(SID, Course) —> Marks

(SID and Campus on their own are NOT unique.)

To transform the table FIRST into 2NF we move the columns SID, Degree
and Campus to a new table called REGISTRATION. The column SID
becomes the primary key of this new table.

8
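A sketch of this decomposition in SQL, continuing the sqlite3 example (primary keys are noted in comments, since CREATE TABLE ... AS SELECT does not declare them):

# Continuing from the earlier sketch: split FIRST into two 2NF tables.
con.execute("""CREATE TABLE REGISTRATION AS
               SELECT DISTINCT SID, Degree, Campus FROM FIRST""")   # key: SID
con.execute("""CREATE TABLE PERFORMANCE AS
               SELECT SID, Course, Marks FROM FIRST""")             # key: (SID, Course)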
Normalization: 2NF
REGISTRATION (SID is now the PK)
SID  Degree  Campus
1    BS      Islamabad
2    MS      Lahore
3    MS      Lahore
4    BS      Islamabad
5    PhD     Peshawar

PERFORMANCE
SID  Course  Marks
1    CS-101  30
1    CS-102  20
1    CS-103  40
1    CS-104  20
1    CS-105  10
1    CS-106  10
2    CS-101  30
2    CS-102  40
3    CS-102  20
4    CS-102  20
4    CS-104  30
4    CS-105  40

PERFORMANCE is in 2NF as (SID, Course) uniquely identifies Marks.
9
Normalization: 2NF
Presence of modification anomalies for tables in
2NF. For the table REGISTRATION, they are:

 INSERT: Until a student gets registered in a degree program, that
program cannot be offered!

 DELETE: Deleting a student's row from REGISTRATION also destroys
the other facts stored in that row (which degree is offered at which
campus).

Why are there anomalies?

The table is in 2NF but NOT in 3NF.


10
Normalization: 3NF
All columns must be dependent only on the primary key.
Table PERFORMANCE is already in 3NF. The non-key column, Marks, is
fully dependent upon the primary key (SID, Course).

REGISTRATION is in 2NF but not in 3NF because it contains a transitive
dependency.

A transitive dependency occurs when a non-key column, which is itself
determined by the primary key, is the determinant of other columns.

The concept of a transitive dependency can be illustrated by showing
the functional dependencies in REGISTRATION:

REGISTRATION.SID —> REGISTRATION.Degree
REGISTRATION.SID —> REGISTRATION.Campus
REGISTRATION.Campus —> REGISTRATION.Degree

Note that REGISTRATION.Degree is determined both by the primary key
SID and by the non-key column Campus.
11
Normalization: 3NF

To transform REGISTRATION into 3NF, we create a new table called
CAMPUS_DEGREE and move the columns Campus and Degree into it.

Degree is deleted from the original table, Campus is left behind to
serve as a foreign key to CAMPUS_DEGREE, and the original table is
renamed to STUDENT_CAMPUS to reflect its semantic meaning.

12
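Continuing the same sketch, the 3NF step removes the transitive dependency SID -> Campus -> Degree (the slide's fifth student at Peshawar is not in the smaller sample data loaded earlier, so fewer campuses appear here):

# Continuing the sketch: split REGISTRATION to remove Campus -> Degree.
con.execute("""CREATE TABLE STUDENT_CAMPUS AS
               SELECT DISTINCT SID, Campus FROM REGISTRATION""")      # key: SID
con.execute("""CREATE TABLE CAMPUS_DEGREE AS
               SELECT DISTINCT Campus, Degree FROM REGISTRATION""")   # key: Campus
# Campus in STUDENT_CAMPUS now acts as a foreign key into CAMPUS_DEGREE.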
Normalization: 3NF

REGISTRATION (before the split)
SID  Degree  Campus
1    BS      Islamabad
2    MS      Lahore
3    MS      Lahore
4    BS      Islamabad
5    PhD     Peshawar

STUDENT_CAMPUS
SID  Campus
1    Islamabad
2    Lahore
3    Lahore
4    Islamabad
5    Peshawar

CAMPUS_DEGREE
Campus     Degree
Islamabad  BS
Lahore     MS
Peshawar   PhD

13
Normalization: 3NF

Removal of anomalies and improvement in queries as follows:

 INSERT: Able to first offer a degree program, and then have
students register in it.

 UPDATE: Migrating a student between campuses by changing a single
row.

 DELETE: Deleting information about a course without deleting facts
about all other columns in the record.
14
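The same operations, sketched against the 3NF tables above, now each touch exactly one row (the new Karachi row is illustrative):

# Offer a degree program before any student registers for it (INSERT anomaly gone).
con.execute("INSERT INTO CAMPUS_DEGREE VALUES ('Karachi', 'BS')")

# Migrate student 1 between campuses with a single-row update (UPDATE anomaly gone).
cur = con.execute("UPDATE STUDENT_CAMPUS SET Campus = 'Lahore' WHERE SID = 1")
print(cur.rowcount)  # 1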
Normalization

Conclusions:

 Normalization guidelines are cumulative.

 Generally it is a good idea to only ensure 2NF.

 3NF comes at the cost of simplicity and performance.

 There is a 4NF, which allows no multi-valued dependencies.

 There is also a 5NF.
15
Striking a balance between “good” & “evil”
[Figure: a spectrum between de-normalization and normalization.
De-normalization end: one big flat file (Flat Table), Data Lists,
Data Cubes. Normalization end: 1st Normal Form, 2nd Normal Form,
3rd Normal Form, 4+ Normal Forms (too many tables).]

16
What is De-normalization?
 It is not chaos, more like a “controlled crash” with the aim of
performance enhancement without loss of information.

 Normalization is a rule of thumb in DBMS, but in DSS ease of use is
achieved by way of de-normalization.

 De-normalization comes in many flavors, such as combining tables,
splitting tables, adding data etc., but all done very carefully.

17
Why De-normalization In DSS?
 Bringing “close” dispersed but related data items.

 Query performance in DSS is significantly dependent on the physical
data model.

 Very early studies showed performance differences of orders of
magnitude for different numbers of de-normalized tables and rows per
table.

 The level of de-normalization should be carefully considered.

18
How De-normalization improves performance?

De-normalization specifically improves performance by:

 Reducing the number of tables and hence the reliance on joins,
which consequently speeds up performance,

 Reducing the number of joins required during query execution, or

 Reducing the amount of I/O operations.

19
4 Guidelines for De-normalization
1. Carefully do a cost-benefit analysis
(frequency of use, additional storage,
join time).
2. Do a data requirement and storage
analysis.
3. Weigh against the maintenance issue of
the redundant data (triggers used).
4. When in doubt, don’t denormalize.

20
Areas for Applying De-Normalization Techniques

 Dealing with the abundance of star schemas.

 Fast access of time series data for analysis.

 Fast aggregate (sum, average etc.) results and complicated
calculations.

 Multidimensional analysis (e.g. geography) in a complex hierarchy.

 Dealing with few updates but many join queries.

De-normalization will ultimately affect the database size and query
performance.

21
Five principal De-normalization techniques

1. Collapsing Tables.
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.

2. Splitting Tables (Horizontal/Vertical Splitting).

3. Pre-Joining.

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance etc).

22
Collapsing Tables
Normalized (two tables):   ColA ColB   and   ColA ColC
Denormalized (collapsed):  ColA ColB ColC

 Reduced storage space.

 Reduced update time.

 Does not change the business view.

 Reduced foreign keys.

 Reduced indexing.

23
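A sketch of collapsing two tables in a 1:1 relationship, using the column names from the figure (SQLite assumed; the table names T1, T2 and T_COLLAPSED are illustrative, and in practice the normalized tables would already hold data):

import sqlite3

con = sqlite3.connect(":memory:")
# Normalized: two tables in a 1:1 relationship, joined on ColA.
con.execute("CREATE TABLE T1 (ColA INTEGER PRIMARY KEY, ColB TEXT)")
con.execute("CREATE TABLE T2 (ColA INTEGER PRIMARY KEY, ColC TEXT)")

# Denormalized: collapse them into one table so the join disappears.
con.execute("""CREATE TABLE T_COLLAPSED AS
               SELECT T1.ColA, T1.ColB, T2.ColC
               FROM T1 JOIN T2 ON T1.ColA = T2.ColA""")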
Splitting Tables
Table:            ColA ColB ColC

Vertical split:   Table_v1 (ColA ColB)       Table_v2 (ColA ColC)

Horizontal split: Table_h1 (ColA ColB ColC)  Table_h2 (ColA ColB ColC)

24
Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon
common column values. Example: Campus
specific queries.

GOAL

 Spreading rows for exploiting parallelism.

 Grouping data to avoid unnecessary query load in the WHERE clause.

25
Splitting Tables: Horizontal splitting
ADVANTAGES
 Enhanced security of data.
 Organizing tables differently for different queries.
 Graceful degradation of the database in case of table damage.
 Fewer rows result in faster indexing.

26
Splitting Tables: Vertical Splitting
 Infrequently accessed columns become extra “baggage”, thus
degrading performance.

 Very useful for rarely accessed large text columns with large
headers.

 Header size is reduced, allowing more rows per block, thus reducing
I/O.

 Splitting and distributing into separate files with the primary key
repeated in each.

27
Pre-joining …
 Identify frequent joins and append the tables together in the
physical data model.

 Generally used for 1:M relationships such as master-detail, where
referential integrity (RI) is assumed to exist.

 Additional space is required, as the master information is repeated
in the new pre-joined table.

28
Pre-Joining…
Normalized (1:M):
  Master: Sale_ID, Sale_date, Sale_person
  Detail: Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs

Denormalized (pre-joined):
  Tx_ID, Sale_ID, Sale_date, Sale_person, Item_ID, Item_Qty, Sale_Rs

29
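A sketch of the pre-join in SQL, using the column names from the figure (SQLite assumed; the table names Master, Detail and Prejoined are illustrative):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Master (Sale_ID INTEGER PRIMARY KEY, Sale_date TEXT, Sale_person TEXT)")
con.execute("""CREATE TABLE Detail (Tx_ID INTEGER PRIMARY KEY, Sale_ID INTEGER,
                                    Item_ID TEXT, Item_Qty INTEGER, Sale_Rs REAL)""")

# Pre-joined table: the master columns are repeated on every detail row.
con.execute("""CREATE TABLE Prejoined AS
               SELECT d.Tx_ID, d.Sale_ID, m.Sale_date, m.Sale_person,
                      d.Item_ID, d.Item_Qty, d.Sale_Rs
               FROM Detail d JOIN Master m ON d.Sale_ID = m.Sale_ID""")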
Pre-Joining: Typical Scenario
Typical of a market-basket query
Join ALWAYS required
Tables could be millions of rows

Squeeze Master into Detail

Repetition of facts. How much?
Detail is typically 3-4 times the size of Master


30
Adding Redundant Columns…
Before:  Table_1 (ColA, ColB)             Table_2 (ColA, ColC, ColD, …, ColZ)
After:   Table_1’ (ColA, ColB, ColC)      Table_2 (ColA, ColC, ColD, …, ColZ)

31
Adding Redundant Columns…
Columns can also be moved, instead of making
them redundant. Very similar to pre-joining as
discussed earlier.

EXAMPLE
Frequent referencing of a code in one table and the corresponding
description in another table.
 A join is required.

 To eliminate the join, a redundant attribute is added to the target
entity, even though it is functionally independent of the primary key
(see the sketch below).
32
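A sketch of the code/description case (hypothetical table and column names such as TX, PRODUCT, Prod_Code and Prod_Desc; SQLite assumed):

import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical names: a transaction table holding a code, and a lookup table with its description.
con.execute("CREATE TABLE TX (Tx_ID INTEGER PRIMARY KEY, Prod_Code TEXT)")
con.execute("CREATE TABLE PRODUCT (Prod_Code TEXT PRIMARY KEY, Prod_Desc TEXT)")

# Denormalized: carry the description redundantly in TX so the lookup join disappears.
con.execute("ALTER TABLE TX ADD COLUMN Prod_Desc TEXT")
con.execute("""UPDATE TX SET Prod_Desc =
               (SELECT Prod_Desc FROM PRODUCT
                WHERE PRODUCT.Prod_Code = TX.Prod_Code)""")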
Derived Attributes: Example
Business Data Model:  #SID, DoB, Degree, Course, Grade, Credits
DWH Data Model:       #SID, DoB, Degree, Course, Grade, Credits, GP, Age

DoB: Date of Birth
GP and Age are derived attributes: calculated once, used frequently.

Age is also a derived attribute, calculated as Current_Date – DoB
(and recalculated periodically).

The GP (Grade Point) column in the data warehouse data model is
included as a derived value. The formula for calculating this field
is Grade * Credits.
33
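A small sketch of how the two derived attributes could be computed at load/refresh time (formulas as on the slide; the sample grade and credit values are hypothetical):

from datetime import date

def grade_point(grade: float, credits: float) -> float:
    # GP as on the slide: Grade * Credits, calculated once when the row is loaded.
    return grade * credits

def age_in_years(dob: date, as_of: date) -> int:
    # Age derived from DoB (Current_Date - DoB), refreshed periodically rather than per query.
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))

print(grade_point(4.0, 3))                           # 12.0 (hypothetical grade and credits)
print(age_in_years(date(2000, 5, 1), date.today()))  # recomputed on each periodic refresh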
Issues of Denormalization

 Storage
 Performance
 Ease-of-use

 Maintenance

34
Industry Characteristics
Master:Detail Ratios
 Health care 1:2 ratio

 Video Rental 1:3 ratio

 Retail 1:30 ratio

35
Storage Issues: Pre-joining Facts
 Assume 1:2 record count ratio between claim
master and detail for health-care application.
 Assume 10 million members (20 million records
in claim detail).
 Assume 10 byte member_ID.
 Assume 40 byte header for master and 60 byte
header for detail tables.

36
Storage Issues: Pre-joining (Calculations)
With normalization:
Total space used = 10M x 40 + 20M x 60 bytes = 1.6 GB

After denormalization:
Total space used = (60 + 40 – 10) x 20M bytes = 1.8 GB

Net result is 12.5% additional space required in raw data table size
for the database.

37
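A back-of-the-envelope check of these figures (pure arithmetic, using the assumptions from the previous slide):

master_rows, detail_rows = 10_000_000, 20_000_000   # 10 M members, 20 M claim detail records
master_hdr, detail_hdr, key = 40, 60, 10            # header and member_ID sizes in bytes

normalized = master_rows * master_hdr + detail_rows * detail_hdr    # 1.6e9 bytes = 1.6 GB
denormalized = (master_hdr + detail_hdr - key) * detail_rows        # 1.8e9 bytes = 1.8 GB
print(normalized, denormalized)
print((denormalized - normalized) / normalized)                     # 0.125 -> 12.5% extra space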
Performance Issues: Pre-joining
Consider the query “How many members were
paid claims during last year?”

With normalization:
Simply count the number of records in the
master table.

After denormalization:
The member_ID would be repeated, hence
need a count distinct. This will cause sorting
on a larger table and degraded performance.

38
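The two queries, sketched side by side (hypothetical table and column names such as claim_master, claim_prejoined and claim_year; the point is the extra DISTINCT, not the exact schema):

# Normalized: one row per member in the claim master table, so a plain COUNT suffices.
q_normalized = "SELECT COUNT(*) FROM claim_master WHERE claim_year = 2023"

# Denormalized (pre-joined): member_ID repeats once per detail row, so the same question
# needs COUNT(DISTINCT ...), which forces a sort/hash over a much larger table.
q_denormalized = "SELECT COUNT(DISTINCT member_ID) FROM claim_prejoined WHERE claim_year = 2023"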
Performance Issues: Adding redundant columns
Continuing with the previous Health-Care
example, assuming a 60 byte detail table and 10
byte Sale_Person.

◦ Copying the Sale_Person to the detail table results in all scans
taking 16% longer than previously.

◦ Justifiable only if a significant portion of queries benefit by
accessing the denormalized detail table.

◦ Need to look at the cost-benefit trade-off for each denormalization
decision.

39
Other Issues: Adding redundant columns
Other issues include increase in table size, maintenance and loss of
information:

 The size of the largest table (i.e. the transaction table)
increases by the size of the Sale_Person key.
 For the example being considered, the detail table size increases
from 1.2 GB to 1.32 GB.

 If the Sale_Person key changes (e.g. a new 12-digit NID), then the
updates have to be reflected all the way to the transaction table.

 In the absence of a 1:M relationship, column movement will actually
result in loss of data.

40
Ease of use Issues: Horizontal Splitting
Horizontal splitting is a Divide & Conquer technique that exploits
parallelism. The conquer part of the technique is about combining the
results.

Let's see how it works for hash-based splitting/partitioning (a small
sketch follows).

 Assuming uniform hashing, hash splitting supports even data
distribution across all partitions in a pre-defined manner.

 However, hash-based splitting is not easily reversible to eliminate
the split.

41
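A minimal sketch of hash-based splitting (the partition function and partition names are illustrative; zlib.crc32 is used because it is deterministic across runs, unlike Python's built-in hash() for strings):

import zlib

def partition_of(member_id: str, n_partitions: int = 4) -> int:
    # Deterministic hash split: the same key always lands in the same partition,
    # and a reasonably uniform hash spreads rows evenly across all partitions.
    return zlib.crc32(member_id.encode()) % n_partitions

for member_id in ["M-1001", "M-1002", "M-1003", "M-1004", "M-1005"]:
    print(member_id, "->", f"claim_detail_p{partition_of(member_id)}")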
Ease of use Issues: Horizontal Splitting

42
Ease of use Issues: Horizontal Splitting
 Round robin and random splitting:
◦ Guarantee good data distribution.
◦ Almost impossible to reverse (or
undo).
◦ Not pre-defined.

43
Ease of use Issues: Horizontal Splitting
 Range and expression splitting:
◦ Can facilitate partition elimination
with a smart optimizer.
◦ Generally lead to "hot spots” (uneven
distribution of data).

44
Performance Issues: Horizontal Splitting

[Figure: year-based splitting, with years 1998-2001 assigned to
processors P1-P4. The dramatic cancellation of airline reservations
after 9/11 resulted in a “hot spot” on one partition.]

45
Performance issues: Vertical Splitting Facts
Example: Consider a 100 byte header for the member table such that 20
bytes provide complete coverage for 90% of the queries.

Split the member table into two parts as follows:

1. Frequently accessed portion of the table (20 bytes), and
2. Infrequently accessed portion of the table (80+ bytes).

Why 80+? Because the primary key (member_id) must be present in both
tables to allow the split to be eliminated (undone), so its bytes are
repeated in the infrequently accessed part.
46
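A sketch of the split (hypothetical column names; the key point is that member_id appears in both halves so the split can be undone by a join):

import sqlite3

con = sqlite3.connect(":memory:")
# Hot part: the ~20 bytes that cover 90% of the queries.
con.execute("CREATE TABLE member_hot  (member_id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
# Cold part: the remaining 80+ bytes (80 plus the repeated member_id).
con.execute("CREATE TABLE member_cold (member_id INTEGER PRIMARY KEY, address TEXT, notes TEXT)")

# Recombining (eliminating the split) is a join on the repeated primary key.
q_recombine = """SELECT h.member_id, h.name, h.status, c.address, c.notes
                 FROM member_hot h JOIN member_cold c ON h.member_id = c.member_id"""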
Performance issues: Vertical Splitting Good vs. Bad

Scanning the table for the most frequently used queries will be 500%
faster with vertical splitting, since only the 20-byte portion has to
be read instead of the full 100 bytes.

Ironically, for the “infrequently” accessed queries the performance
will be inferior as compared to the un-split table because of the
join overhead.

47
