
Data Warehousing & Data Mining (SE-409)
Lecture-2
Introduction and Background

Dr. Huma
Software Engineering Department
University of Engineering and Technology, Taxila

Normalization
What is normalization?
What are the goals of normalization?
• Eliminate redundant data.
• Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?
Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your database:

Rule 1: Single Valued Attributes

Each column of your table should be single valued, which means it should not contain multiple values. We will explain this with the help of an example later; let's see the other rules for now.

Rule 2: Attribute Domain should not change

This is more of a "common sense" rule. In each column, the stored values must be of the same kind or type.

For example: if you have a column dob to save dates of birth of a set of people, then you must not save 'names' of some of them in that column along with the 'date of birth' of others. It should hold only 'date of birth' for all the records/rows.
Rule 3: Unique name for Attributes/Columns

This rule expects each column in a table to have a unique name, to avoid confusion when retrieving data or performing any other operation on the stored data. If two or more columns had the same name, the DBMS would be left confused.

Rule 4: Order doesn't matter

This rule says that the order in which you store the data in your table doesn't matter.

Time for an Example

Here is our table, with some sample data added to it.
roll_no   name   subject
101       Akon   OS, CN
103       Ckon   Java
102       Bkon   C, C++

This table is not in the First Normal Form: the subject column holds multiple values in a single row.
How to solve this Problem?
It's very simple: all we have to do is break the values into atomic values.
Here is our updated table; it now satisfies the First Normal Form.

roll_no   name   subject
101       Akon   OS
101       Akon   CN
103       Ckon   Java
102       Bkon   C
102       Bkon   C++
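A minimal SQL sketch of this 1NF design (hypothetical table name and types, columns as in the example above):

-- 1NF: every column atomic; one subject per row
CREATE TABLE student_subject (
    roll_no  INTEGER     NOT NULL,
    name     VARCHAR(50) NOT NULL,
    subject  VARCHAR(50) NOT NULL,  -- a single value, never 'OS, CN'
    PRIMARY KEY (roll_no, subject)  -- one row per student-subject pair
);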

Second Normal Form
• For a table to be in the Second Normal Form, it should be in the First Normal Form and it should not have Partial Dependency.
• Partial Dependency exists when, for a composite primary key, an attribute in the table depends only on a part of the primary key and not on the complete primary key.
• To remove Partial Dependency, we can divide the table: remove the attribute that is causing the partial dependency and move it to some other table where it fits in well.
Let's create another table for Subject, which will have subject_id and subject_name fields, with subject_id as the primary key.

subject_id   subject_name
1            Java
2            C++
3            Php
Let's create another table, Score, to store the marks obtained by students in the respective subjects. We will also save the name of the teacher who teaches that subject along with the marks.

score_id   student_id   subject_id   marks   teacher
1          10           1            70      Java Teacher
2          10           2            75      C++ Teacher
3          11           1            80      Java Teacher
In the Score table we are saving the student_id to know whose marks these are, and the subject_id to know which subject the marks are for.

Together, student_id + subject_id forms a Candidate Key (learn about Database Keys) for this table, which can be the Primary Key.

Confused how this combination can be a primary key?

See, if I ask you to get me the marks of the student with student_id 10, can you get them from this table? No, because you don't know for which subject. And if I give you the subject_id, you would not know for which student. Hence we need student_id + subject_id to uniquely identify any row.
But where is the Partial Dependency?
• If you look at the Score table, we have a column named teacher which is dependent only on the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
• As we just discussed, the primary key for this table is a composite of two columns, student_id & subject_id, but the teacher's name depends only on the subject (hence the subject_id) and has nothing to do with student_id.
• This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key.
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the Subject table. Hence, the Subject table will become:

subject_id   subject_name   teacher
1            Java           Java Teacher
2            C++            C++ Teacher
3            Php            Php Teacher
And our Score table is now in the Second Normal Form, with no partial dependency:

score_id   student_id   subject_id   marks
1          10           1            70
2          10           2            75
3          11           1            80
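A minimal SQL sketch of the resulting 2NF schema (hypothetical types; teacher now lives with the key it fully depends on):

-- teacher depends on the whole key of this table
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    subject_name VARCHAR(50) NOT NULL,
    teacher      VARCHAR(50) NOT NULL
);

-- no non-key column here depends on only part of (student_id, subject_id)
CREATE TABLE score (
    score_id   INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL,
    subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
    marks      INTEGER NOT NULL
);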

Third Normal Form (3NF)

Requirements for Third Normal Form. For a table to be in the Third Normal Form:
• It should be in the Second Normal Form.
• And it should not have Transitive Dependency.
• By transitive functional dependency, we mean we have the following relationships in the table: B is functionally dependent on A, and C is functionally dependent on B. In this case, C is transitively dependent on A via B.
• 3rd Normal Form Example
• Consider the following example:
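A minimal sketch, assuming a hypothetical student table in which zip determines city:

-- zip_code holds the zip -> city dependency on its own
CREATE TABLE zip_code (
    zip  VARCHAR(10) PRIMARY KEY,
    city VARCHAR(50)
);

-- A non-3NF design would be student(student_id, name, zip, city):
-- student_id -> zip and zip -> city make city transitively
-- dependent on student_id via zip.
-- 3NF: keep only zip in student and join when city is needed.
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       VARCHAR(50),
    zip        VARCHAR(10) REFERENCES zip_code(zip)
);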

Striking a balance between “good” & “evil”

The design space runs between the two extremes:

Normalization side:     4+ Normal Forms → 3rd Normal Form → 2nd Normal Form → 1st Normal Form (extreme: too many tables)
De-normalization side:  Data Cubes → Data Lists → Flat Table (extreme: one big flat file)
What is De-normalization?
• It is not chaos, more like a "controlled crash", with the aim of performance enhancement without loss of information.
• Normalization is a rule of thumb in DBMS, but in a Decision Support System ease of use is achieved by way of de-normalization.
• De-normalization comes in many flavors, such as combining tables, splitting tables, adding data etc., but all done very carefully.
Why De-normalization in DSS?
• Brings "close" dispersed but related data items.
• Query performance in DSS depends significantly on the physical data model.
• Very early studies showed performance differences of orders of magnitude for different numbers of de-normalized tables and rows per table.
• The level of de-normalization should be carefully considered.
How does De-normalization improve performance?
De-normalization specifically improves performance by either:
• Reducing the number of tables, and hence the reliance on joins, which consequently speeds up performance;
• Reducing the number of joins required during query execution; or
• Reducing the number of rows to be retrieved from the Primary Data Table.
Areas for Applying De-normalization Techniques
• Dimensional modelling, where the two major schemas one comes across (the star schema and the snowflake schema) are de-normalized by design.
• Fast access to time-series data for analysis. Time has a hierarchy (days → weeks → months → years ...), and fully normalized time data would collapse under such queries.
• Fast aggregate results (sum, average etc.) and complicated calculations.
• Multidimensional analysis (e.g. geography) in a complex hierarchy.
• Workloads with few updates but many join queries (in contrast to OLTP).
De-normalization will ultimately affect the database size (redundancy increases) and query performance.
Five principal De-normalization techniques
1. Collapsing Tables
   - Two entities with a One-to-One relationship.
   - Two entities with a Many-to-Many relationship.
2. Splitting Tables (Horizontal/Vertical Splitting).
3. Pre-Joining (One-to-Many relationships).
4. Adding Redundant Columns (Reference Data).
5. Derived Attributes (Summary, Total, Balance etc.).
De-normalization Techniques

Collapsing Tables

normalized:    Table_1 (ColA, ColB) + Table_2 (ColA, ColC)
denormalized:  one table (ColA, ColB, ColC)

• Reduced storage space.
• Reduced update time.
• Does not change the business view.
• Reduced foreign keys.
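A sketch of the collapse in SQL, using the column names from the diagram (table names hypothetical; assumes a DBMS that supports CREATE TABLE ... AS):

-- one pre-collapsed table replaces the join of the two 1:1 tables
CREATE TABLE collapsed AS
SELECT t1.ColA, t1.ColB, t2.ColC
FROM   Table_1 t1
JOIN   Table_2 t2 ON t2.ColA = t1.ColA;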

Splitting Tables

Vertical split:    Table (ColA, ColB, ColC) → Table_v1 (ColA, ColB) + Table_v2 (ColA, ColC)
Horizontal split:  Table (ColA, ColB, ColC) → Table_h1 (ColA, ColB, ColC) + Table_h2 (ColA, ColB, ColC), split by rows
Splitting Tables: Horizontal splitting...
Breaks a table into multiple tables based upon common column values. Example: campus-specific queries.

GOAL
• Spreading rows to take advantage of parallelism.
• Grouping data to avoid unnecessary query load from the WHERE clause.
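A sketch of a campus-specific horizontal split (hypothetical student table with a campus column; assumes CREATE TABLE ... AS support):

-- each campus gets its own table with identical columns
CREATE TABLE student_taxila AS
SELECT * FROM student WHERE campus = 'Taxila';

CREATE TABLE student_lahore AS
SELECT * FROM student WHERE campus = 'Lahore';

-- a campus-specific query now scans only the relevant rows
SELECT COUNT(*) FROM student_taxila;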

Splitting Tables: Horizontal splitting
ADVANTAGES
• Enhanced security of data.
• Tables can be organized differently for different queries.
• Graceful degradation of the database in case of table damage.
Splitting Tables: Vertical Splitting
• Infrequently accessed columns become extra "baggage", thus degrading performance.
• Very useful for rarely accessed large text columns with large headers.
• Header size is reduced, allowing more rows per block, thus reducing I/O.
• For an end user, the split appears as a single table through a view.
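A sketch of a vertical split plus the view that re-joins the halves (hypothetical product table; big_description stands in for the rarely accessed large text column):

-- hot, narrow columns stay together: more rows per block
CREATE TABLE product_main (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(50),
    price      DECIMAL(10,2)
);

-- the cold, wide column moves out, keyed identically
CREATE TABLE product_text (
    product_id      INTEGER PRIMARY KEY REFERENCES product_main(product_id),
    big_description TEXT  -- large-text type varies by DBMS (e.g. CLOB)
);

-- end users still see one table
CREATE VIEW product AS
SELECT m.product_id, m.name, m.price, t.big_description
FROM   product_main m
LEFT JOIN product_text t ON t.product_id = m.product_id;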

Pre-Joining...
• Identify frequent joins and append the tables together in the physical data model.
• Generally used for 1:M relationships such as master-detail. Referential integrity (RI) is assumed to exist.
• Additional space is required, as the master information is repeated in the new header table.
Pre-Joining...

normalized (1:M):
  Master:  Sale_ID, Sale_date, Sale_person
  Detail:  Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs

denormalized:
  Tx_ID, Sale_ID, Sale_date, Sale_person, Item_ID, Item_Qty, Sale_Rs
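A sketch of building the pre-joined table (column names from the diagram; table names hypothetical):

-- master columns are repeated on every matching detail row
CREATE TABLE sale_prejoined AS
SELECT d.Tx_ID, d.Sale_ID, m.Sale_date, m.Sale_person,
       d.Item_ID, d.Item_Qty, d.Sale_Rs
FROM   detail d
JOIN   master m ON m.Sale_ID = d.Sale_ID;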

Adding Redundant Columns...

Table_1 (ColA, ColB)                   →  Table_1' (ColA, ColB, ColC)
Table_2 (ColA, ColC, ColD, ..., ColZ)  →  Table_2 unchanged (ColC is copied into Table_1)
Columns can also be moved, instead of being made redundant. Very similar to pre-joining, as discussed earlier.

EXAMPLE
Frequent referencing of a code in one table and the corresponding description in another table.
• A join is required.
• To eliminate the join, a redundant attribute is added to the target entity, even though it is functionally independent of the primary key.
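A sketch of the code/description case (hypothetical txn and status_codes tables):

-- before: every report joins to fetch the description
-- SELECT t.txn_id, s.description
-- FROM txn t JOIN status_codes s ON s.code = t.status_code;

-- after: copy the description onto the transaction rows
ALTER TABLE txn ADD COLUMN status_desc VARCHAR(50);
UPDATE txn
SET    status_desc = (SELECT s.description
                      FROM   status_codes s
                      WHERE  s.code = txn.status_code);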
Redundant Columns: Surprise

Note that this actually increases storage space (the header size increases) and increases update overhead (queries get fast, but a few operations get slow).
Derived Attributes: Example

Business Data Model:  #SID, DoB, Degree, Course, Grade, Credits
DWH Data Model:       #SID, DoB, Degree, Course, Grade, Credits, GP, Age

(DoB: Date of Birth)

Derived attributes are calculated once and used frequently:
• Age is a derived attribute, calculated as Current_Date - DoB (recalculated periodically).
• The GP (Grade Point) column in the data warehouse data model is included as a derived value. The formula for calculating this field is Grade * Credits.
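A sketch of materializing the two derived columns at load time (hypothetical student_fact table; formulas from the slide, and date arithmetic is DBMS-specific):

-- computed once during the load, then read many times
UPDATE student_fact
SET    gp = grade * credits;

-- Age = Current_Date - DoB, refreshed periodically; e.g. in PostgreSQL:
-- UPDATE student_fact
-- SET    age = EXTRACT(YEAR FROM age(CURRENT_DATE, dob));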
Online Analytical Processing (OLAP)

DWH & OLAP

• Relationship between DWH & OLAP
• Data Warehouse & OLAP go together.
• Analysis supported by OLAP
Supporting the human thought process

THOUGHT PROCESS → QUERY SEQUENCE

1. An enterprise-wide fall in profit? → What was the quarterly sales during last year?
2. Profit is down by a large percentage consistently during last quarter only; the rest is OK. → What was the quarterly sales at regional level during last year?
3. What is special about last quarter? → What was the quarterly sales at product level during last year? What was the monthly sale for last quarter, grouped by products?
4. Products alone are doing OK, but the North region is most problematic. → What was the monthly sale for last quarter, grouped by region?
5. OK, so the problem is the high cost of products purchased in the north. → What was the monthly sale of products in the north at store level, grouped by products purchased?

How many such query sequences can be programmed in advance?


Analysis of last example
• Analysis is ad-hoc [no predefined sequence of queries]
• Analysis is interactive (user driven) [content changes with each click: thought-process continuity]
• Analysis is iterative
  – Answer to one question leads to a dozen more
• Analysis is directional (more in subsequent slides)
  – Drill Down [into details: Year -> Month -> Week]
  – Roll Up
  – Pivot
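A sketch of one directional step in SQL (hypothetical sales table): drilling down re-aggregates the same measure at a finer level of the time hierarchy, and rolling up does the reverse:

-- rolled-up level: sales by year
SELECT year, SUM(sale_amount) AS total
FROM   sales
GROUP  BY year;

-- drill down: the same measure by year and month
SELECT year, month, SUM(sale_amount) AS total
FROM   sales
GROUP  BY year, month;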
Challenges...
• Not feasible to write predefined queries.
  – Fails to remain user-driven (becomes programmer-driven).
  – Fails to remain ad-hoc, and hence is not interactive.
• Enable ad-hoc query support?
  – A business user cannot build his/her own queries (does not know SQL, and should not need to know it).
  – On-the-go SQL generation and execution is too slow.
Challenges
• Contradiction
  – We want to compute answers in advance, but we don't know the questions.
• Solution
  – Compute answers to "all" possible "queries". But how?
  – NOTE: Queries are multidimensional aggregates at some level.
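A sketch of what "all possible queries" means in SQL, where the DBMS supports the standard CUBE operator (hypothetical sales table): one statement pre-computes the aggregate for every combination of the listed dimensions:

-- computes SUM(sale_amount) for (province, product),
-- (province), (product), and the grand total ()
SELECT province, product, SUM(sale_amount) AS total
FROM   sales
GROUP  BY CUBE (province, product);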

“All” possible queries (level aggregates)

ALL
  Province:  Frontier ... Punjab
  Division:  Mardan ... Peshawar, Lahore ... Multan
  District:  Peshawar, Lahore
  City:      Lahore ... Gujranwala
  Zone:      Defense ... Gulberg


OLAP: Facts & Dimensions

• FACTS: Quantitative values (numbers) or "measures".
  – e.g., units sold, sales $, kg etc.
• DIMENSIONS: Descriptive categories.
  – e.g., time, geography, product etc.
  – Dimensions are often organized in hierarchies representing levels of detail in the data (e.g., week, month, quarter, year, decade etc.).
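A minimal star-schema sketch of facts and dimensions (hypothetical names):

-- dimension: descriptive categories, carrying the time hierarchy
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    day      DATE,
    week     INTEGER,
    month    INTEGER,
    quarter  INTEGER,
    year     INTEGER
);

-- fact: quantitative measures, keyed by their dimensions
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER,
    units_sold INTEGER,
    sales_amt  DECIMAL(12,2)
);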

Where does OLAP fit in?

Transaction Data → Data Loading (ELT) → Data Cube (MOLAP) → OLAP → Presentation Tools / Reports → Decision Maker
OLTP vs. OLAP

Feature                          OLTP                              OLAP
Level of data                    Detailed                          Aggregated
Amount of data per transaction   Small                             Large
Views                            Pre-defined [programmer]          User-defined
Typical write operation          Update, insert, delete            Bulk insert
"Age" of data                    Current (60-90 days)              Historical, 5-10 years and also current [Active DW]
Number of users                  High                              Low-Med
Tables                           Flat tables [highly normalized]   Multi-dimensional tables
Database size                    Med (10^9 B - 10^12 B)            High (10^12 B - 10^15 B)
Query optimizing                 Requires experience               Already "optimized"
Data availability                High                              Low-Med
OLAP FASMI Test
• Fast: Delivers information to the user at a fairly constant rate. Most queries are answered in under five seconds.
• Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.
• Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.
• Multi-dimensional: The essential characteristic of OLAP.
• Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside, and is not limited by volume.

...from the OLAP Report by Pendse and Creeth.
