
Data Warehousing & Data Mining (SE-409)
Lecture-2
Introduction and Background

Dr. Huma
Software Engineering Department
University of Engineering and Technology, Taxila

Normalization
What is normalization?
What are the goals of normalization?
• Eliminate redundant data.
• Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?
Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your database:

Rule 1: Single Valued Attributes

Each column of your table should be single valued, which means it should not contain multiple values. We will explain this with the help of an example later; let's see the other rules for now.

Rule 2: Attribute Domain should not change

This is more of a "common sense" rule. In each column, the stored values must be of the same kind or type.

For example: if you have a column dob to save dates of birth of a set of people, then you must not save 'names' of some of them in that column along with the 'date of birth' of others. It should hold only 'date of birth' for all the records/rows.
Rule 3: Unique name for Attributes/Columns

This rule expects each column in a table to have a unique name, to avoid confusion when retrieving data or performing any other operation on the stored data. If two or more columns had the same name, the DBMS would be left confused.

Rule 4: Order doesn't matter

This rule says that the order in which you store the data in your table doesn't matter.

Time for an Example

Here is our table, with some sample data added to it.
roll_no   name   subject
101       Akon   OS, CN
103       Ckon   Java
102       Bkon   C, C++

This table is not in the First Normal Form: the subject column holds multiple values in a single row.
How to solve this Problem?
It's very simple: all we have to do is break the values into atomic values.
Here is our updated table; it now satisfies the First Normal Form.

roll_no   name   subject
101       Akon   OS
101       Akon   CN
103       Ckon   Java
102       Bkon   C
102       Bkon   C++
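A minimal SQL sketch of this 1NF design (hypothetical table name and types, columns as in the example above):

-- 1NF: every column atomic; one subject per row
CREATE TABLE student_subject (
    roll_no  INTEGER     NOT NULL,
    name     VARCHAR(50) NOT NULL,
    subject  VARCHAR(50) NOT NULL,  -- a single value, never 'OS, CN'
    PRIMARY KEY (roll_no, subject)  -- one row per student-subject pair
);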

Second Normal Form
• For a table to be in the Second Normal Form, it should be in the First Normal Form and it should not have Partial Dependency.
• Partial Dependency exists when, for a composite primary key, an attribute in the table depends only on a part of the primary key and not on the complete primary key.
• To remove Partial Dependency, we can divide the table: remove the attribute that is causing the partial dependency and move it to some other table where it fits in well.
Let's create another table for Subject, which will have subject_id and subject_name fields, with subject_id as the primary key.

subject_id   subject_name
1            Java
2            C++
3            Php
Let's create another table, Score, to store the marks obtained by students in the respective subjects. We will also save the name of the teacher who teaches that subject along with the marks.

score_id   student_id   subject_id   marks   teacher
1          10           1            70      Java Teacher
2          10           2            75      C++ Teacher
3          11           1            80      Java Teacher
In the Score table we are saving the student_id to know whose marks these are, and the subject_id to know which subject the marks are for.

Together, student_id + subject_id forms a Candidate Key (learn about Database Keys) for this table, which can be the Primary Key.

Confused how this combination can be a primary key?

See, if I ask you to get me the marks of the student with student_id 10, can you get them from this table? No, because you don't know for which subject. And if I give you the subject_id, you would not know for which student. Hence we need student_id + subject_id to uniquely identify any row.
But where is the Partial Dependency?
• If you look at the Score table, we have a column named teacher which is dependent only on the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
• As we just discussed, the primary key for this table is a composite of two columns, student_id & subject_id, but the teacher's name depends only on the subject (hence the subject_id) and has nothing to do with student_id.
• This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key.
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the Subject table. Hence, the Subject table will become:

subject_id   subject_name   teacher
1            Java           Java Teacher
2            C++            C++ Teacher
3            Php            Php Teacher
And our Score table is now in the Second Normal Form, with no partial dependency:

score_id   student_id   subject_id   marks
1          10           1            70
2          10           2            75
3          11           1            80
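A minimal SQL sketch of the resulting 2NF schema (hypothetical types; teacher now lives with the key it fully depends on):

-- teacher depends on the whole key of this table
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    subject_name VARCHAR(50) NOT NULL,
    teacher      VARCHAR(50) NOT NULL
);

-- no non-key column here depends on only part of (student_id, subject_id)
CREATE TABLE score (
    score_id   INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL,
    subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
    marks      INTEGER NOT NULL
);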

Third Normal Form (3NF)

Requirements for Third Normal Form. For a table to be in the Third Normal Form:
• It should be in the Second Normal Form.
• And it should not have Transitive Dependency.
• By transitive functional dependency, we mean we have the following relationships in the table: B is functionally dependent on A, and C is functionally dependent on B. In this case, C is transitively dependent on A via B.
• 3rd Normal Form Example
• Consider the following example:
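A minimal sketch, assuming a hypothetical student table in which zip determines city:

-- zip_code holds the zip -> city dependency on its own
CREATE TABLE zip_code (
    zip  VARCHAR(10) PRIMARY KEY,
    city VARCHAR(50)
);

-- A non-3NF design would be student(student_id, name, zip, city):
-- student_id -> zip and zip -> city make city transitively
-- dependent on student_id via zip.
-- 3NF: keep only zip in student and join when city is needed.
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       VARCHAR(50),
    zip        VARCHAR(10) REFERENCES zip_code(zip)
);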

Striking a balance between “good” & “evil”

The design space runs between the two extremes:

Normalization side:     4+ Normal Forms → 3rd Normal Form → 2nd Normal Form → 1st Normal Form (extreme: too many tables)
De-normalization side:  Data Cubes → Data Lists → Flat Table (extreme: one big flat file)
What is De-normalization?
• It is not chaos, more like a "controlled crash", with the aim of performance enhancement without loss of information.
• Normalization is a rule of thumb in DBMS, but in a Decision Support System ease of use is achieved by way of de-normalization.
• De-normalization comes in many flavors, such as combining tables, splitting tables, adding data etc., but all done very carefully.
Why De-normalization in DSS?
• Brings "close" dispersed but related data items.
• Query performance in DSS depends significantly on the physical data model.
• Very early studies showed performance differences of orders of magnitude for different numbers of de-normalized tables and rows per table.
• The level of de-normalization should be carefully considered.
How does De-normalization improve performance?
De-normalization specifically improves performance by either:
• Reducing the number of tables, and hence the reliance on joins, which consequently speeds up performance;
• Reducing the number of joins required during query execution; or
• Reducing the number of rows to be retrieved from the Primary Data Table.
Areas for Applying De-normalization Techniques
• Dimensional modelling, where the two major schemas one comes across (the star schema and the snowflake schema) are de-normalized by design.
• Fast access to time-series data for analysis. Time has a hierarchy (days → weeks → months → years ...), and fully normalized time data would collapse under such queries.
• Fast aggregate results (sum, average etc.) and complicated calculations.
• Multidimensional analysis (e.g. geography) in a complex hierarchy.
• Workloads with few updates but many join queries (in contrast to OLTP).
De-normalization will ultimately affect the database size (redundancy increases) and query performance.
Five principal De-normalization techniques
1. Collapsing Tables
   - Two entities with a One-to-One relationship.
   - Two entities with a Many-to-Many relationship.
2. Splitting Tables (Horizontal/Vertical Splitting).
3. Pre-Joining (One-to-Many relationships).
4. Adding Redundant Columns (Reference Data).
5. Derived Attributes (Summary, Total, Balance etc.).
De-normalization Techniques

Collapsing Tables

normalized:    Table_1 (ColA, ColB) + Table_2 (ColA, ColC)
denormalized:  one table (ColA, ColB, ColC)

• Reduced storage space.
• Reduced update time.
• Does not change the business view.
• Reduced foreign keys.
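A sketch of the collapse in SQL, using the column names from the diagram (table names hypothetical; assumes a DBMS that supports CREATE TABLE ... AS):

-- one pre-collapsed table replaces the join of the two 1:1 tables
CREATE TABLE collapsed AS
SELECT t1.ColA, t1.ColB, t2.ColC
FROM   Table_1 t1
JOIN   Table_2 t2 ON t2.ColA = t1.ColA;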

Splitting Tables

Vertical split:    Table (ColA, ColB, ColC) → Table_v1 (ColA, ColB) + Table_v2 (ColA, ColC)
Horizontal split:  Table (ColA, ColB, ColC) → Table_h1 (ColA, ColB, ColC) + Table_h2 (ColA, ColB, ColC), split by rows
Splitting Tables: Horizontal splitting...
Breaks a table into multiple tables based upon common column values. Example: campus-specific queries.

GOAL
• Spreading rows to take advantage of parallelism.
• Grouping data to avoid unnecessary query load from the WHERE clause.
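A sketch of a campus-specific horizontal split (hypothetical student table with a campus column; assumes CREATE TABLE ... AS support):

-- each campus gets its own table with identical columns
CREATE TABLE student_taxila AS
SELECT * FROM student WHERE campus = 'Taxila';

CREATE TABLE student_lahore AS
SELECT * FROM student WHERE campus = 'Lahore';

-- a campus-specific query now scans only the relevant rows
SELECT COUNT(*) FROM student_taxila;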

Splitting Tables: Horizontal splitting
ADVANTAGES
• Enhanced security of data.
• Tables can be organized differently for different queries.
• Graceful degradation of the database in case of table damage.
Splitting Tables: Vertical Splitting
• Infrequently accessed columns become extra "baggage", thus degrading performance.
• Very useful for rarely accessed large text columns with large headers.
• Header size is reduced, allowing more rows per block, thus reducing I/O.
• For an end user, the split appears as a single table through a view.
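A sketch of a vertical split plus the view that re-joins the halves (hypothetical product table; big_description stands in for the rarely accessed large text column):

-- hot, narrow columns stay together: more rows per block
CREATE TABLE product_main (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(50),
    price      DECIMAL(10,2)
);

-- the cold, wide column moves out, keyed identically
CREATE TABLE product_text (
    product_id      INTEGER PRIMARY KEY REFERENCES product_main(product_id),
    big_description TEXT  -- large-text type varies by DBMS (e.g. CLOB)
);

-- end users still see one table
CREATE VIEW product AS
SELECT m.product_id, m.name, m.price, t.big_description
FROM   product_main m
LEFT JOIN product_text t ON t.product_id = m.product_id;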

Pre-Joining...
• Identify frequent joins and append the tables together in the physical data model.
• Generally used for 1:M relationships such as master-detail. Referential integrity (RI) is assumed to exist.
• Additional space is required, as the master information is repeated in the new header table.
Pre-Joining...

normalized (1:M):
  Master:  Sale_ID, Sale_date, Sale_person
  Detail:  Tx_ID, Sale_ID, Item_ID, Item_Qty, Sale_Rs

denormalized:
  Tx_ID, Sale_ID, Sale_date, Sale_person, Item_ID, Item_Qty, Sale_Rs
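A sketch of building the pre-joined table (column names from the diagram; table names hypothetical):

-- master columns are repeated on every matching detail row
CREATE TABLE sale_prejoined AS
SELECT d.Tx_ID, d.Sale_ID, m.Sale_date, m.Sale_person,
       d.Item_ID, d.Item_Qty, d.Sale_Rs
FROM   detail d
JOIN   master m ON m.Sale_ID = d.Sale_ID;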

Adding Redundant Columns...

Table_1 (ColA, ColB)                   →  Table_1' (ColA, ColB, ColC)
Table_2 (ColA, ColC, ColD, ..., ColZ)  →  Table_2 unchanged (ColC is copied into Table_1)
Columns can also be moved, instead of being made redundant. Very similar to pre-joining, as discussed earlier.

EXAMPLE
Frequent referencing of a code in one table and the corresponding description in another table.
• A join is required.
• To eliminate the join, a redundant attribute is added to the target entity, even though it is functionally independent of the primary key.
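A sketch of the code/description case (hypothetical txn and status_codes tables):

-- before: every report joins to fetch the description
-- SELECT t.txn_id, s.description
-- FROM txn t JOIN status_codes s ON s.code = t.status_code;

-- after: copy the description onto the transaction rows
ALTER TABLE txn ADD COLUMN status_desc VARCHAR(50);
UPDATE txn
SET    status_desc = (SELECT s.description
                      FROM   status_codes s
                      WHERE  s.code = txn.status_code);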
Redundant Columns: Surprise

Note that this actually increases storage space (the header size increases) and increases update overhead (queries get fast, but a few operations get slow).
Derived Attributes: Example

Business Data Model:  #SID, DoB, Degree, Course, Grade, Credits
DWH Data Model:       #SID, DoB, Degree, Course, Grade, Credits, GP, Age

(DoB: Date of Birth)

Derived attributes are calculated once and used frequently:
• Age is a derived attribute, calculated as Current_Date - DoB (recalculated periodically).
• The GP (Grade Point) column in the data warehouse data model is included as a derived value. The formula for calculating this field is Grade * Credits.
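A sketch of materializing the two derived columns at load time (hypothetical student_fact table; formulas from the slide, and date arithmetic is DBMS-specific):

-- computed once during the load, then read many times
UPDATE student_fact
SET    gp = grade * credits;

-- Age = Current_Date - DoB, refreshed periodically; e.g. in PostgreSQL:
-- UPDATE student_fact
-- SET    age = EXTRACT(YEAR FROM age(CURRENT_DATE, dob));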
Online Analytical Processing (OLAP)

DWH & OLAP

• Relationship between DWH & OLAP
• Data Warehouse & OLAP go together.
• Analysis supported by OLAP
Supporting the human thought process

THOUGHT PROCESS → QUERY SEQUENCE

1. An enterprise-wide fall in profit? → What was the quarterly sales during last year?
2. Profit is down by a large percentage consistently during last quarter only; the rest is OK. → What was the quarterly sales at regional level during last year?
3. What is special about last quarter? → What was the quarterly sales at product level during last year? What was the monthly sale for last quarter, grouped by products?
4. Products alone are doing OK, but the North region is most problematic. → What was the monthly sale for last quarter, grouped by region?
5. OK, so the problem is the high cost of products purchased in the north. → What was the monthly sale of products in the north at store level, grouped by products purchased?

How many such query sequences can be programmed in advance?


Analysis of last example
• Analysis is ad-hoc [no predefined sequence of queries]
• Analysis is interactive (user driven) [content changes with each click: thought-process continuity]
• Analysis is iterative
  – Answer to one question leads to a dozen more
• Analysis is directional (more in subsequent slides)
  – Drill Down [into details: Year -> Month -> Week]
  – Roll Up
  – Pivot
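A sketch of one directional step in SQL (hypothetical sales table): drilling down re-aggregates the same measure at a finer level of the time hierarchy, and rolling up does the reverse:

-- rolled-up level: sales by year
SELECT year, SUM(sale_amount) AS total
FROM   sales
GROUP  BY year;

-- drill down: the same measure by year and month
SELECT year, month, SUM(sale_amount) AS total
FROM   sales
GROUP  BY year, month;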
Challenges...
• Not feasible to write predefined queries.
  – Fails to remain user-driven (becomes programmer-driven).
  – Fails to remain ad-hoc, and hence is not interactive.
• Enable ad-hoc query support?
  – A business user cannot build his/her own queries (does not know SQL, and should not need to know it).
  – On-the-go SQL generation and execution is too slow.
Challenges
• Contradiction
  – We want to compute answers in advance, but we don't know the questions.
• Solution
  – Compute answers to "all" possible "queries". But how?
  – NOTE: Queries are multidimensional aggregates at some level.
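A sketch of what "all possible queries" means in SQL, where the DBMS supports the standard CUBE operator (hypothetical sales table): one statement pre-computes the aggregate for every combination of the listed dimensions:

-- computes SUM(sale_amount) for (province, product),
-- (province), (product), and the grand total ()
SELECT province, product, SUM(sale_amount) AS total
FROM   sales
GROUP  BY CUBE (province, product);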

“All” possible queries (level aggregates)

ALL
  Province:  Frontier ... Punjab
  Division:  Mardan ... Peshawar, Lahore ... Multan
  District:  Peshawar, Lahore
  City:      Lahore ... Gujranwala
  Zone:      Defense ... Gulberg


OLAP: Facts & Dimensions

• FACTS: Quantitative values (numbers) or "measures".
  – e.g., units sold, sales $, kg etc.
• DIMENSIONS: Descriptive categories.
  – e.g., time, geography, product etc.
  – Dimensions are often organized in hierarchies representing levels of detail in the data (e.g., week, month, quarter, year, decade etc.).
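A minimal star-schema sketch of facts and dimensions (hypothetical names):

-- dimension: descriptive categories, carrying the time hierarchy
CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    day      DATE,
    week     INTEGER,
    month    INTEGER,
    quarter  INTEGER,
    year     INTEGER
);

-- fact: quantitative measures, keyed by their dimensions
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER,
    units_sold INTEGER,
    sales_amt  DECIMAL(12,2)
);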

Where does OLAP fit in?

Transaction Data → Data Loading (ELT) → Data Cube (MOLAP) → OLAP → Presentation Tools / Reports → Decision Maker
OLTP vs. OLAP

Feature                          OLTP                              OLAP
Level of data                    Detailed                          Aggregated
Amount of data per transaction   Small                             Large
Views                            Pre-defined [programmer]          User-defined
Typical write operation          Update, insert, delete            Bulk insert
"Age" of data                    Current (60-90 days)              Historical, 5-10 years and also current [Active DW]
Number of users                  High                              Low-Med
Tables                           Flat tables [highly normalized]   Multi-dimensional tables
Database size                    Med (10^9 B - 10^12 B)            High (10^12 B - 10^15 B)
Query optimizing                 Requires experience               Already "optimized"
Data availability                High                              Low-Med
OLAP FASMI Test
• Fast: Delivers information to the user at a fairly constant rate. Most queries are answered in under five seconds.
• Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.
• Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.
• Multi-dimensional: The essential characteristic of OLAP.
• Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside, and is not limited by volume.

...from the OLAP Report by Pendse and Creeth.
