Data Mining: Concepts & Techniques
Motivation:
Necessity is the Mother of Invention
Data explosion problem
Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Data mining: the core of the knowledge discovery process
Data Selection
Select the information about people who have subscribed to a
magazine
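The selection step can be sketched in a few lines. This is a minimal illustration only: the record layout and the field names (client_id, name, magazine) are assumptions, not part of the original example.

```python
# Hypothetical operational records; the field names are assumptions.
customers = [
    {"client_id": 1, "name": "Johnson", "magazine": "car"},
    {"client_id": 2, "name": "Clinton", "magazine": None},
    {"client_id": 3, "name": "King", "magazine": "comics"},
]


def select_subscribers(records):
    """Keep only the records of people who hold a magazine subscription."""
    return [r for r in records if r["magazine"] is not None]


subscribers = select_subscribers(customers)
```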
Cleaning
Pollution: typing errors, people moving without notifying a change of
address, people giving incorrect information about themselves
Pattern Recognition Algorithms
Cleaning
Lack of domain consistency
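Two of the cleaning problems above can be sketched as follows: duplicate records for a person who moved, and a field holding a value outside its meaningful domain. The field names, the sample records, and the placeholder date are all assumptions made for illustration. Matching on a normalized name plus birth date is one simple heuristic, not the method the slides prescribe.

```python
from datetime import date

# Assumed sentinel that an entry system stores when the birth date is unknown.
PLACEHOLDER_DATE = date(1901, 1, 1)

# Hypothetical records: the same person appears twice after moving house.
records = [
    {"name": "Johnson", "address": "1 Downing Street",
     "birth_date": date(1961, 4, 13), "last_seen": date(1994, 4, 15)},
    {"name": "johnson ", "address": "2 Boulevard",
     "birth_date": date(1961, 4, 13), "last_seen": date(1996, 6, 21)},
    {"name": "Clinton", "address": "3 High Road",
     "birth_date": PLACEHOLDER_DATE, "last_seen": date(1995, 1, 1)},
]


def deduplicate(rows):
    """Keep the most recent record per (normalized name, birth date)."""
    latest = {}
    for row in rows:
        key = (row["name"].strip().lower(), row["birth_date"])
        if key not in latest or row["last_seen"] > latest[key]["last_seen"]:
            latest[key] = row
    return list(latest.values())


def enforce_domain(rows):
    """Replace the placeholder birth date with None (explicitly unknown)."""
    return [
        {**row, "birth_date": None
         if row["birth_date"] == PLACEHOLDER_DATE else row["birth_date"]}
        for row in rows
    ]


cleaned = enforce_domain(deduplicate(records))
```

De-duplication runs first so that the original birth date is still available as part of the matching key.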
Enrichment
Need extra information about the clients consisting of date of birth,
income, amount of credit, and whether or not an individual owns a
car or a house
Enrichment
The new information needs to be easily joined to the existing
client records
Extract more knowledge
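Joining the extra information to the client records amounts to a left join on a shared key. A minimal sketch, assuming both sources carry a client_id field; the key, the field names, and the sample values are hypothetical.

```python
# Fields expected from the external source (assumed names).
EXTRA_FIELDS = ("birth_date", "income", "credit", "car_owner", "house_owner")

# Hypothetical existing client records.
clients = [
    {"client_id": 1, "name": "Johnson"},
    {"client_id": 2, "name": "Clinton"},
]

# Hypothetical purchased data, indexed by the shared key.
extra_by_id = {
    1: {"birth_date": "1961-04-13", "income": 18500, "credit": 17000,
        "car_owner": "no", "house_owner": "no"},
}


def enrich(client_rows, extra):
    """Left join: every client is kept; missing extra fields become None."""
    enriched = []
    for client in client_rows:
        info = extra.get(client["client_id"], {})
        enriched.append({**client, **{f: info.get(f) for f in EXTRA_FIELDS}})
    return enriched


enriched = enrich(clients, extra_by_id)
```

Clients without a match keep their record, with the new fields set to None, so a later step can decide whether they carry enough information to be of value.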
Coding
We select only those records that have enough information to be
of value (row)
Project the fields in which we are interested (column)
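Row selection and column projection can be sketched as one small function. The field names and the choice of which fields count as "enough information" are assumptions for illustration.

```python
# Hypothetical enriched records; None marks a missing value.
records = [
    {"client_id": 1, "name": "Johnson", "income": 18500, "credit": 17000},
    {"client_id": 2, "name": "Clinton", "income": None, "credit": None},
    {"client_id": 3, "name": "King", "income": 36000, "credit": 26600},
]


def select_and_project(rows, required, keep):
    """Drop rows with a missing required field, then keep only chosen columns."""
    complete = [r for r in rows if all(r.get(f) is not None for f in required)]
    return [{f: r.get(f) for f in keep} for r in complete]


table = select_and_project(
    records,
    required=("income", "credit"),
    keep=("client_id", "income", "credit"),
)
```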
Coding
Code the information which is too detailed
Address to region
Birth date to age
Divide income by 1000
Divide credit by 1000
Convert cars yes-no to 1-0
Convert purchase date to month numbers starting from
1990
The way in which we code the information will
determine the type of patterns we find
Coding has to be performed repeatedly in order to get the best
results
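The coding steps listed above can be sketched as a single record transformation. Several details are assumptions: the address-to-region lookup table, the reference date used to compute age, the field names, and the convention that January 1990 is month 1.

```python
from datetime import date

# Hypothetical lookup table from address to region number.
REGION_BY_ADDRESS = {"1 Downing Street": 1, "2 Boulevard": 2}
# Assumed reference date for turning a birth date into an age.
REFERENCE_DATE = date(1996, 1, 1)


def age_on(birth, ref):
    """Age in whole years on the reference date."""
    return ref.year - birth.year - ((ref.month, ref.day) < (birth.month, birth.day))


def month_number(d, base_year=1990):
    """Months counted from the base year; January of base_year is month 1."""
    return (d.year - base_year) * 12 + d.month


def code_record(r):
    """Apply the coding steps to one record."""
    return {
        "client_id": r["client_id"],
        "region": REGION_BY_ADDRESS.get(r["address"]),
        "age": age_on(r["birth_date"], REFERENCE_DATE),
        "income": r["income"] / 1000,
        "credit": r["credit"] / 1000,
        "car_owner": 1 if r["car_owner"] == "yes" else 0,
        "month": month_number(r["purchase_date"]),
        "magazine": r["magazine"],
    }


# Hypothetical input record.
raw = {
    "client_id": 1, "address": "1 Downing Street",
    "birth_date": date(1961, 4, 13), "income": 18500, "credit": 17000,
    "car_owner": "no", "purchase_date": date(1994, 4, 15), "magazine": "car",
}
coded = code_record(raw)
```

Because coding is performed repeatedly, keeping each step in its own small function makes it easy to change one step (for example, the region lookup) and rerun the rest.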
Coding
We are interested in the relationships between readers of
different magazines
Perform flattening operation
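Flattening turns one record per subscription into one record per client, with one 0/1 field per magazine. A minimal sketch; the magazine list and the field names are assumptions.

```python
# Hypothetical set of magazines; each becomes one 0/1 column.
MAGAZINES = ("car", "house", "sports", "music", "comics")

# Hypothetical input: one row per (client, magazine) subscription.
subscriptions = [
    {"client_id": 1, "magazine": "car"},
    {"client_id": 1, "magazine": "music"},
    {"client_id": 2, "magazine": "comics"},
]


def flatten(rows):
    """Return one row per client with a 0/1 indicator for every magazine."""
    flat = {}
    for row in rows:
        client = flat.setdefault(
            row["client_id"],
            {"client_id": row["client_id"], **{m: 0 for m in MAGAZINES}},
        )
        client[row["magazine"]] = 1
    return list(flat.values())


flat_table = flatten(subscriptions)
```

With all magazines of a client in a single row, relationships between readers of different magazines can be examined field against field.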
Data mining
We may find the following rules
A customer with credit > 13000 and aged between 22 and 31 who
has subscribed to a comics magazine at time T will very likely
subscribe to a car magazine five years later
The number of house magazines sold to customers with credit
between 12000 and 31000 living in region 4 is increasing
A customer with credit between 5000 and 10000 who reads a
comics magazine will very likely become a customer with
credit between 12000 and 31000 who reads a sports and a house
magazine after 12 years
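The slides do not say how such rules are found or scored. One common way to check how often a candidate rule holds is to compute its support and confidence over the coded records. The sketch below assumes that measure, uses invented records, and introduces a hypothetical car_later field marking a car magazine subscription five years on. Credit is in thousands, as after the coding step.

```python
def support_and_confidence(rows, antecedent, consequent):
    """Support: share of all rows matching both parts of the rule.
    Confidence: share of antecedent matches that also match the consequent."""
    matched = [r for r in rows if antecedent(r)]
    both = [r for r in matched if consequent(r)]
    support = len(both) / len(rows) if rows else 0.0
    confidence = len(both) / len(matched) if matched else 0.0
    return support, confidence


def young_comics_reader_with_credit(r):
    """Rule condition: credit > 13 (thousands), aged 22 to 31, reads comics."""
    return r["credit"] > 13 and 22 <= r["age"] <= 31 and r["comics"] == 1


def subscribes_to_car_later(r):
    """Rule outcome: car magazine subscription five years later (assumed field)."""
    return r["car_later"] == 1


# Invented records for illustration only.
rows = [
    {"credit": 14, "age": 25, "comics": 1, "car_later": 1},
    {"credit": 15, "age": 30, "comics": 1, "car_later": 1},
    {"credit": 20, "age": 22, "comics": 1, "car_later": 1},
    {"credit": 16, "age": 28, "comics": 1, "car_later": 0},
    {"credit": 9, "age": 40, "comics": 0, "car_later": 0},
]

support, confidence = support_and_confidence(
    rows, young_comics_reader_with_credit, subscribes_to_car_later
)
```

A rule described as "very likely" corresponds to a confidence close to 1; the threshold is a judgment call and is not given in the slides.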
Business-Question-Driven Process
[Figure: layered diagram, top to bottom]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA