
Data Warehousing & DATA MINING (SE-409) : Lecture-2

Here are the key rules for first normal form (1NF): - Each column should contain a single value (atomicity) - no repeating groups of values. - The domain (set of possible values) of each column should be well-defined and not change for different rows. - Each row must be uniquely identifiable by its primary key. - Columns have unique names to avoid confusion. The goal of 1NF is to eliminate repeating groups and ensure each cell contains a single value. If a table follows these rules, it is considered to be in first normal form. Normalization helps reduce data redundancy and ensures data dependencies make logical sense.

Data Warehousing & Data Mining (SE-409)
Lecture-2
Introduction and Background

Huma Ayub
Software Engineering department

University of Engineering and Technology, Taxila


How is it Different?
• Starts with a 6x12 availability requirement ... but 7x24 usually becomes the goal.
  - Decision makers typically don’t work 24 hours a day, 7 days a week. An ATM (OLTP) system does.
  - Once decision makers start using the DWH and gaining its benefits, they start liking it...
  - They start using the DWH more often, until they want it available 100% of the time.
  - For businesses across the globe, 50% of the world may be sleeping at any one time, but the businesses are up 100% of the time.
  - 100% availability is not a minor task; loading strategies, refresh rates, etc. need to be taken into account.
DWH-Ahsan Abdullah
How is it Different?
• Does not follow the traditional development model

Requirements → Program

Classical SDLC
  - Requirements gathering
  - Analysis
  - Design
  - Programming
  - Testing
  - Integration
  - Implementation
How is it Different?
• Does not follow the traditional development model

DWH → Program → Requirements

DWH SDLC (CLDS)
  - Implement warehouse
  - Integrate data
  - Test for bias
  - Program w.r.t. data
  - Design DSS system
  - Analyze results
  - Understand requirements
Data Warehouse Vs. OLTP

OLTP (On Line Transaction Processing)


Select tx_date, balance from tx_table
Where account_ID = 23876;

Data Warehouse Vs. OLTP

DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between 30 and 40 and
Education = 'graduate' and
customer_table.CustID =
tx_table.Customer_ID;

Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)

OLTP                                DWH
Primary key used                    Primary key NOT used
No concept of primary index         Primary index used
Few rows returned                   Many rows returned
May use a single table              Uses multiple tables
High selectivity of query           Low selectivity of query
Indexing on primary key (unique)    Indexing on primary index (non-unique)

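The contrast between the two query styles can be tried on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the two slide queries; the column types and every sample row are invented for illustration and are not from the lecture.

```python
import sqlite3

# In-memory toy database. Schema details and all rows are assumptions
# made up for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (
    CustID INTEGER PRIMARY KEY, age INTEGER, sal INTEGER,
    gender TEXT, education TEXT
);
CREATE TABLE tx_table (
    account_ID INTEGER, Customer_ID INTEGER, tx_date TEXT, balance INTEGER
);
""")
conn.executemany("INSERT INTO customer_table VALUES (?, ?, ?, ?, ?)", [
    (1, 32, 50000, "F", "graduate"),
    (2, 38, 62000, "M", "graduate"),
    (3, 51, 70000, "M", "graduate"),
    (4, 35, 41000, "F", "undergraduate"),
])
conn.executemany("INSERT INTO tx_table VALUES (?, ?, ?, ?)", [
    (23876, 1, "2024-01-05", 1200),
    (23877, 2, "2024-01-06", 3400),
    (23877, 2, "2024-01-09", 2900),
    (23878, 3, "2024-01-07", 800),
    (23879, 4, "2024-01-08", 150),
])

# OLTP style: one table, lookup by a specific account, few rows back.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = 23876"
).fetchall()

# DWH style: a join across tables with a range predicate, many rows back.
dwh_rows = conn.execute("""
    SELECT balance, age, sal, gender
    FROM customer_table JOIN tx_table
      ON customer_table.CustID = tx_table.Customer_ID
    WHERE age BETWEEN 30 AND 40 AND education = 'graduate'
""").fetchall()

print(len(oltp_rows), "row(s) from the OLTP-style query")
print(len(dwh_rows), "row(s) from the DWH-style query")
```

Even at this tiny scale the OLTP-style lookup touches a single account, while the DWH-style query returns every transaction of every matching customer. On real data the gap would be far larger.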
Putting the pieces together

[Figure: multi-tier data warehouse architecture]
Tier 0, Data sources: semistructured sources, www data, archived data, operational databases
Tier 1, Data Warehouse Server: Extract, Transform, Load (ETL); warehouse; metadata; data marts
Tier 2, OLAP Servers: MOLAP, ROLAP
Tier 3, Clients: query/reporting, analysis, and data mining tools
Users: IT users, business users

Types & Typical Applications of DWH

Types of data warehouse

• Financial
• Telecommunication
• Insurance
• Human Resource
• Global
• Exploratory

Types of data warehouse
Financial
  - The first data warehouse that an organization builds. This is appealing because:
      - It is the nerve center, so it is easy to get attention.
      - Most organizations start work from the smallest data set [due to risk factor, more complexity].
      - It touches all aspects of an organization, with a common denominator, i.e. money.

Types of data warehouse
Telecommunication
  - Dominated by the sheer volume of data.
  - Many ways to accommodate call-level detail:
      - Keeping only a few months of call-level detail,
      - Storing lots of call-level detail scattered over different storage media,
      - Storing only selective call-level detail, etc.
  - Unfortunately, for many kinds of processing, working at an aggregate level is simply not possible.

Types of data warehouse
Insurance
Insurance data warehouses are similar to other data warehouses, BUT with a few exceptions.
  - Stored data is very, very old and is used for actuarial processing (risk assessment).
  - A typical business may have changed dramatically over the last 40-50 years, but not insurance.
  - In retailing or telecom there are a few important dates, but in the insurance environment there are many dates of many kinds.

Types of data warehouse
Insurance
Insurance data warehouses are similar to other data warehouses, BUT with a few exceptions.
  - Long operational business cycles, measured in years, and processing times in months. Thus the operating speed is different.
  - Transactions are not gathered and processed, but are kept in a kind of “frozen” state.
  - Thus a unique approach to design and implementation is required.

Typical Applications
Impact on organization’s core business is to streamline
and maximize profitability.

• Fraud detection.
• Profitability analysis.
• Direct mail/database marketing.
• Credit risk prediction.
• Customer retention modeling.
• Yield management.
• Inventory management.

Typical Applications
Fraud detection

• By observing data usage patterns.


• People have typical purchase patterns.
• Deviation from patterns.
• Certain cities notorious for fraud.
• Certain items bought by stolen cards.
• Similar behavior for stolen phone cards.

Typical Applications
Profitability Analysis
• Every bank knows whether it is profitable or not.
• But banks don’t know which customers are profitable.
• Typically more than 50% are NOT profitable.
• The question is: which ones?
• Balance is not enough; transactional behavior is the key.
• Restructure products and pricing strategies.
• Life-time profitability models (next 3-5 years).
Typical Applications
Direct mail marketing

• Targeted marketing.
• Offering high bandwidth package NOT to all
users.
• Know from call detail records of web surfing.
• Saves marketing expense, saving pennies.

Typical Applications
Credit risk prediction

• Who should get a loan?


• Customer separation i.e. stable vs. rolling.
• Qualitative decision making NOT subjective.
• Different interest rates for different customers.
• Do not fund bad customers on the basis of good ones.

Normalization

Normalization
What is normalization?
What are the goals of normalization?
 Eliminate redundant data.
 Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?

Rules for First Normal Form
The first normal form expects you to follow a few simple rules while designing your
database, and they are:

Rule 1: Single Valued Attributes


Each column of your table should be single valued, which means it should not
contain multiple values. We will explain this with the help of an example later;
let's see the other rules first.

Rule 2: Attribute Domain should not change


This is more of a "Common Sense" rule. In each column the values stored must be
of the same kind or type.

For example: if you have a column dob to store the dates of birth of a set of people,
then you must not store the names of some of them in that column alongside the
dates of birth of others. It should hold only a date of birth for every record/row.
Rules for First Normal Form
Rule 3: Unique name for Attributes/Columns
This rule expects that each column in a table should have a unique name. This is to
avoid confusion at the time of retrieving data or performing any other operation on
the stored data.
If two or more columns have the same name, the DBMS cannot tell them apart.

Rule 4: Order doesn't matter


This rule says that the order in which you store the data in your table doesn't matter.

Time for an Example


Here is our table, with some sample data added to it.

roll_no   name   subject
101       Akon   OS, CN
103       Ckon   Java
102       Bkon   C, C++

The subject column holds more than one value for roll numbers 101 and 102, which violates Rule 1.

How to solve this problem?
It's very simple: all we have to do is break the values into atomic values.
Here is our updated table, which now satisfies the First Normal Form.

roll_no   name   subject
101       Akon   OS
101       Akon   CN
103       Ckon   Java
102       Bkon   C
102       Bkon   C++
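The split into atomic values can be sketched in a few lines of Python. This is only an illustration: the function name to_first_normal_form and the assumption that subjects are separated by commas are not part of the lecture.

```python
# Rows as they appear in the unnormalized table above.
unnormalized = [
    (101, "Akon", "OS, CN"),
    (103, "Ckon", "Java"),
    (102, "Bkon", "C, C++"),
]


def to_first_normal_form(rows):
    """Emit one (roll_no, name, subject) row per subject so every cell is atomic.

    Assumes the multi-valued subject cell is comma-separated.
    """
    atomic = []
    for roll_no, name, subjects in rows:
        for subject in subjects.split(","):
            atomic.append((roll_no, name, subject.strip()))
    return atomic


normalized = to_first_normal_form(unnormalized)
for row in normalized:
    print(row)
```

The result has five rows, matching the updated table. Note that roll_no alone no longer identifies a row, because each student now appears once per subject.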
Second Normal Form
• For a table to be in the Second Normal form, it
should be in the First Normal form and it should not
have Partial Dependency.
• Partial Dependency exists, when for a composite
primary key, any attribute in the table depends only
on a part of the primary key and not on the complete
primary key.
• To remove Partial dependency, we can divide the
table, remove the attribute which is causing partial
dependency, and move it to some other table where
it fits in well.
Let's create another table for Subject, which will have subject_id and subject_name fields,
with subject_id as the primary key.

subject_id   subject_name
1            Java
2            C++
3            Php

Let's create another table, Score, to store the marks obtained by students
in the respective subjects. We will also save the name of the
teacher who teaches that subject along with the marks.

score_id   student_id   subject_id   marks   teacher
1          10           1            70      Java Teacher
2          10           2            75      C++ Teacher
3          11           1            80      Java Teacher
In the Score table we save the student_id to know which student the marks
belong to, and the subject_id to know which subject the marks are for.

Together, student_id + subject_id form a candidate key for this table,
which can serve as the primary key.

Confused about how this combination can be a primary key?


See, if I ask you to get me marks of student with student_id 10, can you
get it from this table? No, because you don't know for which subject.
And if I give you subject_id, you would not know for which student.
Hence we need student_id + subject_id to uniquely identify any row.
But where is Partial Dependency?
• If you look at the Score table, there is a column named teacher that
depends only on the subject: for Java it's Java Teacher, for C++ it's C++
Teacher, and so on.
• As we just discussed, the primary key for this table is a composite of
two columns, student_id and subject_id, but the teacher's name depends
only on the subject, hence on subject_id, and has nothing to do with
student_id.
• This is Partial Dependency, where an attribute in a
table depends on only a part of the primary key and
not on the whole key.
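The partial dependency can be checked mechanically against the sample rows. The sketch below uses a hypothetical helper, determines, which tests whether a set of columns fixes the value of another column in the given rows. A sample can only refute a dependency, never prove it for all future data, so treat the result as a hint rather than a guarantee.

```python
from collections import defaultdict

# Score rows from the slide: (score_id, student_id, subject_id, marks, teacher)
score = [
    (1, 10, 1, 70, "Java Teacher"),
    (2, 10, 2, 75, "C++ Teacher"),
    (3, 11, 1, 80, "Java Teacher"),
]

# Column positions in each tuple.
STUDENT_ID, SUBJECT_ID, MARKS, TEACHER = 1, 2, 3, 4


def determines(rows, lhs, rhs):
    """Return True if, within these rows, the columns in `lhs` fix column `rhs`."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in lhs)
        seen[key].add(row[rhs])
    # The dependency holds if every key maps to exactly one value.
    return all(len(values) == 1 for values in seen.values())


# teacher depends on subject_id alone: a partial dependency on the key.
teacher_by_subject = determines(score, [SUBJECT_ID], TEACHER)
# marks needs the whole key (student_id, subject_id).
marks_by_subject = determines(score, [SUBJECT_ID], MARKS)
marks_by_student = determines(score, [STUDENT_ID], MARKS)
marks_by_full_key = determines(score, [STUDENT_ID, SUBJECT_ID], MARKS)

print(teacher_by_subject, marks_by_subject, marks_by_student, marks_by_full_key)
```

Here teacher is fixed by subject_id alone, while marks is fixed only by the full composite key, which is exactly the situation the second normal form rules out.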
How to remove Partial Dependency?
There can be many different solutions for this, but our objective is to remove the teacher's
name from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the
Subject table. Hence, the Subject table will become:

subject_id   subject_name   teacher
1            Java           Java Teacher
2            C++            C++ Teacher
3            Php            Php Teacher

And our Score table is now in the second normal form, with no partial dependency.

score_id   student_id   subject_id   marks
1          10           1            70
2          10           2            75
3          11           1            80
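To see that nothing is lost by moving teacher into the Subject table, the two tables can be joined back together. The following is a minimal sketch using Python's built-in sqlite3 module. The column types, the foreign key, and the UNIQUE constraint on (student_id, subject_id) are assumptions added for illustration; the slides give only column names and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject (
    subject_id   INTEGER PRIMARY KEY,
    subject_name TEXT,
    teacher      TEXT
);
CREATE TABLE score (
    score_id   INTEGER PRIMARY KEY,
    student_id INTEGER,
    subject_id INTEGER REFERENCES subject(subject_id),
    marks      INTEGER,
    UNIQUE (student_id, subject_id)  -- the candidate key discussed above
);
""")
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)", [
    (1, "Java", "Java Teacher"),
    (2, "C++", "C++ Teacher"),
    (3, "Php", "Php Teacher"),
])
conn.executemany("INSERT INTO score VALUES (?, ?, ?, ?)", [
    (1, 10, 1, 70),
    (2, 10, 2, 75),
    (3, 11, 1, 80),
])

# Joining on subject_id reconstructs the original five-column Score rows.
rows = conn.execute("""
    SELECT score.score_id, score.student_id, score.subject_id,
           score.marks, subject.teacher
    FROM score JOIN subject ON score.subject_id = subject.subject_id
    ORDER BY score.score_id
""").fetchall()

for row in rows:
    print(row)
```

Each teacher name is now stored once per subject instead of once per score, so changing a subject's teacher means updating a single row.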
Third Normal Form (3NF)

• Requirements for Third Normal Form


• For a table to be in the third normal form,
  • It should be in the Second Normal form.
  • And it should not have Transitive Dependency.
• By transitive functional dependency, we mean
we have the following relationships in the
table: A is functionally dependent on B, and B
is functionally dependent on C. In this case, A
is transitively dependent on C via B.
• 3rd Normal Form Example
• Consider the following example:
