Chapter 1
[Figure: the KDD process, with stages labeled Databases, Data Integration, Data Cleaning, Task-relevant Data, Data Mining, and Pattern Evaluation]
Large-scale Data is Everywhere!
§ There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
§ New mantra
  § Gather whatever data you can whenever and wherever possible.
§ Expectations
  § Gathered data will have value either for the purpose collected or for a purpose not envisioned.
[Figures: E-Commerce; Cyber Security; Social Networking: Twitter; Traffic Patterns]
– purchases at department/grocery stores, e-commerce
◆ Amazon handles millions of visits/day
– Bank/Credit Card transactions
● Computers have become cheaper and more powerful
● Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
[Figures: Improving health care and reducing costs; Predicting the impact of climate change]
● Prediction Methods
– Use some variables to predict unknown or
future values of other variables.
● Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
[Figure: sample data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat]

[Figure: a model for the class Credit Worthy, with node labels Employed (Yes/No), Education ({High school, Undergrad} vs. Graduate), and Number of years]

Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set → Learn Classifier → Model
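The flow from training set to learned model to test set can be sketched in a few lines. This is a minimal illustration, not the model shown in the slides: it assumes a one-attribute threshold rule (a decision stump) on the number of years at the present address, learned from the four visible training rows and applied to the three visible test rows. The names `learn_stump` and `predict` are hypothetical.

```python
# Visible rows from the slide: (employed, education, years_at_address[, credit_worthy])
TRAIN = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]
TEST = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]


def learn_stump(rows):
    """Pick the threshold t on years (column 2) with the fewest training errors
    for the rule: predict "Yes" if years >= t, otherwise "No"."""
    best_t, best_err = None, len(rows) + 1
    for t in sorted({r[2] for r in rows}):
        errors = sum(("Yes" if r[2] >= t else "No") != r[3] for r in rows)
        if errors < best_err:
            best_t, best_err = t, errors
    return best_t


def predict(threshold, row):
    """Apply the learned rule to one unlabeled row."""
    return "Yes" if row[2] >= threshold else "No"


threshold = learn_stump(TRAIN)                        # learn the model
predictions = [predict(threshold, r) for r in TEST]   # apply it to the test set
print(threshold, predictions)
```

On these rows the stump settles on a threshold of 5 years and labels the test rows Yes, No, No. A real classifier would normally consider all attributes, not only one.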
● Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
◆ Use credit card transactions and information about the account holder as attributes.
– When does a customer buy, what do they buy, how often do they pay on time, etc.
◆ Label past transactions as fraud or fair transactions. This forms the class attribute.
◆ Learn a model for the class of the transactions.
◆ Use this model to detect fraud by observing credit card transactions on an account.
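The approach above (label past transactions, learn a model, apply it to new transactions) might look like the following sketch. It assumes a 1-nearest-neighbor rule, which the slide does not prescribe, and the attributes (amount, hour of day) and every data value are hypothetical, invented only to make the example runnable.

```python
import math

# Hypothetical labeled history: (amount_usd, hour_of_day) -> class attribute.
HISTORY = [
    ((12.5, 13), "fair"),
    ((40.0, 18), "fair"),
    ((950.0, 3), "fraud"),
    ((25.0, 9), "fair"),
    ((1200.0, 2), "fraud"),
]


def classify(transaction, history=HISTORY):
    """Return the label of the closest past transaction (1-nearest neighbor)."""
    _, label = min(history, key=lambda item: math.dist(item[0], transaction))
    return label


print(classify((1000.0, 4)))   # a large purchase at night
print(classify((30.0, 12)))    # a small purchase at midday
```

Here the "model" is just the stored labeled history. Because the amount is on a much larger scale than the hour, it dominates the distance, so a practical version would rescale the attributes first.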
01/17/2018 Introduction to Data Mining, 2nd Edition 14
Classification: Application 2
[Figure: Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres. Horizontal axis: longitude, -180 to 180. Legend entries (Cluster): Land Cluster 2, Ice or No NPP, Sea Cluster 2, Sea Cluster 1.]
Clustering: Application 1
● Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
◆ Collect different attributes of customers based on their geographical and lifestyle-related information.
◆ Find clusters of similar customers.
◆ Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
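The segmentation steps above can be sketched with K-means, the algorithm named earlier in the deck. This is a minimal sketch under stated assumptions: the customer attributes (age, monthly spend) and all values are hypothetical, the initial centroids are simply the first k points, and quality is approximated by comparing distances within and across clusters on the same attributes, whereas the slide suggests using buying patterns.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def mean_pairwise_distance(points, labels, same_cluster):
    """Average distance over pairs that are (or are not) in the same cluster."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for j, q in enumerate(points)
             if i < j and (labels[i] == labels[j]) == same_cluster]
    return sum(dists) / len(dists)


# Hypothetical customers: (age, monthly spend in hundreds of dollars).
customers = [(25, 2), (60, 9), (27, 3), (62, 8), (24, 2.5), (58, 9.5)]
labels, centroids = kmeans(customers, k=2)
within = mean_pairwise_distance(customers, labels, same_cluster=True)
between = mean_pairwise_distance(customers, labels, same_cluster=False)
print(labels, within < between)
```

A good segmentation shows a small within-cluster distance relative to the between-cluster distance; the same comparison could be run on held-out buying-pattern attributes.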
Clustering: Application 2
● Document Clustering:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
● Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
● Medical Informatics
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases
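To see why the two discovered rules stand out, it may help to score them on the five market-basket transactions. The sketch below uses support and confidence, the standard rule measures; the slide itself does not name them, so treating them as the selection criteria is an assumption, and the function names are hypothetical.

```python
# The five transactions from the slide, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", confidence(lhs, rhs))
```

On this data, {Milk} --> {Coke} has support 3/5 and confidence 3/4, and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.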
The KDD Process
[Figure: stages labeled Databases, Data Integration, Data Cleaning, Task-relevant Data, Data Mining, and Pattern Evaluation]
DATA
– An attribute is also known as a variable, field, characteristic, dimension, or feature
[Figure: record data table whose rows are labeled Objects; sample row: 4 Yes Married 120K No]
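[Figure: document-term matrix. Column labels as extracted (rotated in the original, so their order may not match the columns below): timeout, season, coach, game, score, play, team, win, ball, lost]
Document 1   3 0 5 0 2 6 0 2 0 2
Document 2   0 7 0 2 1 0 0 3 0 0
Document 3   0 1 0 0 1 2 2 0 3 0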
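A document-term matrix like the one above stores, for each document, how many times each vocabulary term occurs. The following is a minimal sketch of how such a matrix could be built; the two example documents, the shortened vocabulary, and the function name `document_term_matrix` are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def document_term_matrix(documents, vocabulary):
    """Return one row per document: the count of each vocabulary term, in order."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())   # term -> frequency in this document
        rows.append([counts[term] for term in vocabulary])
    return rows


# Hypothetical inputs, chosen only to illustrate the representation.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
documents = [
    "team play game team score",
    "coach ball coach",
]
matrix = document_term_matrix(documents, vocabulary)
for row in matrix:
    print(row)
```

Each row is a term-frequency vector, so documents become points in a space with one dimension per term.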
Transaction Data
● Sequences of transactions
[Figure: a sequence of transactions, with labels "Items/Events" and "An element of the sequence"]
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
● Spatio-Temporal Data
[Figure: Average Monthly Temperature of land and ocean]
Data Quality