DM-Unit 1
DATA MINING
Why Data Mining?
Evolution of Database
Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
– Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
Large-scale Data is Everywhere!
There has been enormous data growth in both commercial and scientific
databases due to advances in data generation and collection technologies.
New mantra
– Gather whatever data you can whenever and wherever possible.
Expectations
– Gathered data will have value either for the purpose collected or for a
purpose not envisioned.
[Figure panels: E-Commerce, Cyber Security, Traffic Patterns, Social Networking: Twitter]
Why Data Mining? Commercial
Viewpoint
Why Data Mining? Scientific Viewpoint
Great Opportunities to Solve Society’s Major
Problems
Improving health care and reducing costs Predicting the impact of climate change
What Is Data Mining?
[Figure: knowledge discovery process. Recoverable labels: Task-relevant Data, Data Cleaning, Data Integration, Databases]
Example: A Web Mining
Framework
[Figure: layers ordered by increasing potential to support business decisions. Recoverable labels: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
Data Mining: Confluence of Multiple
Disciplines
Why Confluence of Multiple
Disciplines?
Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as
terabytes of data
High-dimensionality of data
– Micro-array may have tens of thousands of dimensions
High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
New and sophisticated applications
Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or
future values of other variables.
Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Data Mining Models and Tasks
Data Mining: Classification Schemes
General functionality
[Figure: a data set surrounded by four task types: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Predictive Modeling: Classification
Find a model for the class attribute as a function of the values of the
other attributes.

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: model for predicting credit worthiness, drawn as a decision tree. Tests shown: Employed (Yes / No), Education ({High school, Undergrad} vs. Graduate), and Number of years; leaves are labeled Yes / No.]
Classification Example
[Figure: Training Set → Learn Classifier → Model]
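A minimal sketch of what "a model" means here: a function from the other attributes to the class attribute. The training records are copied from the Refund / Marital Status / Taxable Income table earlier in these slides. The tree in predict_cheat is a hypothetical, hand-written model (including its 80K threshold); it is not produced by a learning algorithm and does not appear in the slides.

```python
# Training records copied from the Refund / Marital Status / Taxable Income
# table earlier in these slides (income in thousands).
TRAINING_SET = [
    # (tid, refund, marital_status, taxable_income_k, cheat)
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def predict_cheat(refund, marital_status, taxable_income_k):
    """Hypothetical hand-written decision tree (an assumption, not learned)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: the 80K threshold is an illustrative assumption.
    return "Yes" if taxable_income_k >= 80 else "No"


def training_accuracy(records):
    """Fraction of records whose predicted class matches the recorded class."""
    correct = sum(
        1
        for _, refund, status, income, cheat in records
        if predict_cheat(refund, status, income) == cheat
    )
    return correct / len(records)


if __name__ == "__main__":
    print("training accuracy:", training_accuracy(TRAINING_SET))
    # Applying the model to a new, unlabeled record (made up for illustration).
    print("new record:", predict_cheat("No", "Divorced", 110))
```

Accuracy on the training set says little about new data; a held-out test set would be needed to judge the model.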
Examples of Classification Task
Classification: Application 1
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
Use credit card transactions and the information
on its account-holder as attributes.
– When does a customer buy, what do they buy, how often do they
pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Application 2
Classification: Application 3
Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
Segment the image.
Measure image attributes (features) - 40 of them per
object.
Model the class based on these features.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Regression
Clustering
Applications of Cluster Analysis
Understanding
– Customer profiling for targeted
marketing
– Group related documents for
browsing
– Group genes and proteins that
have similar functionality
– Group stocks with similar price
fluctuations
Summarization
– Reduce the size of large data
sets
Courtesy: Michael Eisen
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary
Production (NPP) into clusters that reflect the Northern and Southern
Hemispheres.
[Figure: map with longitude (-180 to 180) on the horizontal axis; legend: Land Cluster 2, Ice or No NPP, Sea Cluster 2, Sea Cluster 1]
Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
Collect different attributes of customers based on
their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.
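The "find clusters of similar customers" step can be sketched with K-means, the algorithm named in the sea-surface example above. This is a minimal standard-library sketch; the customer points (two already-scaled numeric attributes) and the choice of initial centroids are made-up assumptions for illustration.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means, until assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignment, centroids


if __name__ == "__main__":
    # Hypothetical customers described by two scaled attributes.
    customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
                 (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    labels, centers = kmeans(customers, [customers[0], customers[3]])
    print(labels)
    print(centers)
```

K-means results depend on the initial centroids and on attribute scaling, so real customer attributes would normally be standardized first.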
Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
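Document similarity "based on the important terms" is often measured with cosine similarity between term-count vectors; a clustering algorithm can then group documents using that measure. The sketch below shows only the similarity step. The four short documents are invented, and whitespace tokenization with raw counts (no stop-word removal or weighting) is a simplifying assumption.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical documents, for illustration only.
    docs = [
        "stock market price rise",
        "stock price fall market",
        "gene protein function",
        "protein gene expression",
    ]
    vectors = [term_vector(d) for d in docs]
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

Documents 0 and 1 share three of four terms and score high, while the finance and biology documents share none and score zero.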
Association Rule Discovery:
Definition
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
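The two rules can be checked against the five transactions by computing support and confidence. This is a minimal sketch; the function names are illustrative, and the definitions used are the usual ones (support = fraction of transactions containing all items of the rule, confidence = #(antecedent and consequent) / #(antecedent)).

```python
# The five market-basket transactions from the table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


if __name__ == "__main__":
    for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
        print(sorted(lhs), "-->", sorted(rhs),
              "support:", support(lhs, rhs, TRANSACTIONS),
              "confidence:", round(confidence(lhs, rhs, TRANSACTIONS), 3))
```

For {Milk} --> {Coke}, Milk appears in four transactions and three of them also contain Coke, giving support 3/5 and confidence 3/4. For {Diaper, Milk} --> {Beer} the values are 2/5 and 2/3.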
Association Analysis: Applications
Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
Medical Informatics
– Rules are used to find combinations of patient
symptoms and test results associated with certain
diseases
Primitives that Define a Data Mining
Task
Task-relevant data
– Database or data warehouse name
– Database tables or data warehouse cubes
– Condition for data selection
– Relevant attributes or dimensions
– Data grouping criteria
Type of knowledge to be mined
– Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness
Measure
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
Primitive 5: Presentation of Discovered
Patterns
Motivating Challenges
Scalability
High Dimensionality
Non-traditional Analysis
Major Issues in Data Mining
Performance Issues
– Efficiency and scalability
Huge amount of data
Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
Divide the data into partitions and process them in parallel
Incorporate database updates without having to mine the entire data again from
scratch
Diversity of Database Types
– Other databases that contain complex data objects, multimedia data,
spatial data, etc.
– Expect to have different DM systems for different kinds of data
– Heterogeneous databases and global information systems
Web mining becomes a very challenging and fast-evolving field in data mining
Summary of Points
KDD Process: Several Key
Steps
Learning the application domain
– relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
– summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Are All the “Discovered” Patterns
Interesting?
Data mining may generate thousands of patterns: Not all of them are
interesting
– Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
– A pattern is interesting if it is easily understood by humans, valid on new
or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Find All and Only Interesting
Patterns?
Other Pattern Mining Issues
A typical DM System Architecture
Architecture: Typical Data Mining
System
[Figure: layered system architecture. Recoverable labels: Pattern Evaluation, Data Mining Engine, Knowledge-Base, Database or Data Warehouse Server]

– An attribute is also known as a variable, field, characteristic,
dimension, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample,
entity, or instance

[Table: objects as rows, attributes as columns; rows recovered here]
Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
A More Complete View of Data
Measurement of Length
The way you measure an attribute may not match the
attribute's properties.
[Figure: objects A, B, C, D, ... measured on two different scales]
– Left scale (values 5, 7, 8, 10, 15): preserves only the ordering
property of length.
– Right scale (values 1, 2, 3, 4, 5): preserves the ordering and
additivity properties of length.
Types of Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as important
Words present in documents
Items present in customer transactions
Why Data Preprocessing?
Why can Data be
Incomplete?
Data Cleaning
Data Quality
Noise
For objects, noise is an extraneous object
For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen
Causes?
Missing Values
Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and
weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to
children)
Handling missing values
– Eliminate data objects or variables
– Estimate missing values
Example: time series of temperature
Example: census results
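The two handling options can be sketched on a temperature time series. Linear interpolation between the nearest known neighbours is one common way to estimate a missing reading; choosing it here is an assumption, since the slides do not name an estimation method. The readings are made up, and None marks a missing value.

```python
def drop_missing(series):
    """Option 1: eliminate the objects whose value is missing."""
    return [v for v in series if v is not None]


def interpolate_missing(series):
    """Option 2: estimate each missing value by linear interpolation
    between the nearest known values on either side."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            continue  # leading or trailing gaps are left as missing
        delta = series[right] - series[left]
        filled[i] = series[left] + delta * (i - left) / (right - left)
    return filled


if __name__ == "__main__":
    # Hypothetical hourly temperatures; None marks a missing reading.
    temperatures = [20.0, None, 24.0, None, None, 30.0]
    print(drop_missing(temperatures))
    print(interpolate_missing(temperatures))
```

Interpolation suits smoothly varying series such as temperature; for survey-style data such as a census, other estimates (for example, values from similar records) are usually more appropriate.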
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc.
Days aggregated into weeks, months, or years
– More “stable” data
Aggregated data tends to have less variability
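A small sketch of the "days aggregated into weeks" case, comparing the spread of daily values with the spread of weekly averages. The 28 daily figures are invented for illustration, and averaging (rather than summing) within each week is an assumption.

```python
from statistics import mean, pstdev


def aggregate(values, group_size):
    """Average consecutive, non-overlapping groups of `group_size` values."""
    return [
        mean(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


if __name__ == "__main__":
    # Hypothetical daily measurements for four weeks (7 values per week).
    daily = [
        12, 15, 9, 14, 20, 25, 22,
        11, 16, 10, 13, 21, 24, 23,
        13, 14, 8, 15, 19, 26, 21,
        10, 17, 11, 12, 22, 23, 24,
    ]
    weekly = aggregate(daily, 7)
    print("objects:", len(daily), "->", len(weekly))
    print("std of daily values:   ", round(pstdev(daily), 2))
    print("std of weekly averages:", round(pstdev(weekly), 2))
```

The aggregated series has fewer objects (data reduction) and a smaller standard deviation, at the cost of losing the within-week detail.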
Example: Precipitation in Australia
Sampling …
Types of Sampling
Simple Random Sampling
– There is an equal probability of selecting any
particular item
– Sampling without replacement
As each item is selected, it is removed from the population
– Sampling with replacement
Objects are not removed from the population as they are
selected for the sample.
In sampling with replacement, the same object can be
picked up more than once
Stratified sampling
– Split the data into several partitions; then draw
random samples from each partition
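The three schemes map directly onto Python's standard random module. This is a minimal sketch; the helper names, the population, and the stratum key are illustrative assumptions.

```python
import random


def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the population: no repeats."""
    return rng.sample(items, n)


def sample_with_replacement(items, n, rng):
    """Items stay in the population, so the same one can be picked again."""
    return rng.choices(items, k=n)


def stratified_sample(items, key, n_per_stratum, rng):
    """Split items into partitions by `key`, then draw a simple random
    sample (without replacement) from each partition."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    population = list(range(20))
    print(sample_without_replacement(population, 5, rng))
    print(sample_with_replacement(population, 5, rng))
    # Hypothetical imbalanced data: 5 "rare" objects and 50 "common" ones.
    labeled = [("rare", i) for i in range(5)] + [("common", i) for i in range(50)]
    print(stratified_sample(labeled, key=lambda x: x[0], n_per_stratum=3, rng=rng))
```

Stratified sampling guarantees that the rare group is represented, which simple random sampling of the same size may not.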
Sample Size
What sample size is necessary to get at least one
object from each of 10 equal-sized groups?
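One way to answer this is to compute the probability that a random sample of a given size covers all 10 groups. The sketch below uses inclusion-exclusion and assumes each draw falls in any group with equal probability, independently of earlier draws (sampling with replacement, or groups large enough that removal does not matter). The function names and the 90% target are illustrative.

```python
from math import comb


def prob_all_groups_sampled(sample_size, groups=10):
    """Probability that `sample_size` independent, uniform draws hit every
    one of `groups` equal-sized groups (inclusion-exclusion over the
    number of groups that are missed)."""
    total = 0.0
    for k in range(groups + 1):
        total += (-1) ** k * comb(groups, k) * ((groups - k) / groups) ** sample_size
    return min(1.0, max(0.0, total))  # clamp tiny floating-point overshoot


def smallest_sample_size(target_prob, groups=10):
    """Smallest sample size whose coverage probability reaches
    `target_prob`; the target must be below 1 for the loop to end."""
    n = groups
    while prob_all_groups_sampled(n, groups) < target_prob:
        n += 1
    return n


if __name__ == "__main__":
    for n in (10, 20, 40, 60):
        print(n, round(prob_all_groups_sampled(n), 4))
    print("size for 90% coverage:", smallest_sample_size(0.9))
```

With exactly 10 draws the probability is only 10!/10^10, so the sample must be several times larger than the number of groups before full coverage becomes likely.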
Curse of Dimensionality
When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required
by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or
reduce noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
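A minimal sketch of the idea behind PCA: find the direction of greatest variance and project the data onto it. To stay within the standard library it finds the top eigenvector of the covariance matrix by power iteration, which is a stand-in chosen for this example; library implementations typically use an eigen-decomposition or SVD instead. The data points are invented.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, projections): the unit vector of greatest
    variance and each centered row's coordinate along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration; assumes the start vector is not orthogonal to the
    # top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance at all
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


if __name__ == "__main__":
    # Hypothetical 2-D points lying close to the line y = 2x.
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
    direction, coords = first_principal_component(data)
    print("direction:", [round(x, 3) for x in direction])
    print("1-D coordinates:", [round(x, 3) for x in coords])
```

Keeping only the 1-D coordinates reduces two attributes to one while retaining most of the variance in this data.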
Dimensionality Reduction: PCA
Feature Subset Selection
Discretization
[Figure: histogram of Counts (0 to 20) against Petal Length (0 to 8)]
Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class
Labels
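Two common ways to discretize a continuous attribute without class labels are equal-width and equal-frequency binning. The slide content under this heading did not survive extraction, so picking these two methods is an assumption; the values below are made up and include one outlier to show how the methods differ.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width and
    return the bin index (0..k-1) of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would land in bin k, so clamp it into bin k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins so that each holds roughly the same number of values
    (ties are split by position, a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


if __name__ == "__main__":
    # Hypothetical one-dimensional data with an outlier at 100.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print("equal width:    ", equal_width_bins(values, 3))
    print("equal frequency:", equal_frequency_bins(values, 3))
```

The outlier stretches the equal-width intervals so that almost every value falls in the first bin, while equal-frequency binning still yields three bins of three values each.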
Binarization
Net Primary
Production (NPP)
is a measure of
plant growth used
by ecosystem
scientists.