Data Mining Tutorial: Gregory Piatetsky-Shapiro, KDnuggets
Tutorial
Gregory Piatetsky-Shapiro
KDnuggets
© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
Trends leading to Data Flood
More data is generated:
Web, text, images …
Business transactions, calls, ...
Scientific data: astronomy, biology, etc.
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo ~ 100 TB (Largest Data Warehouse)
3. AT&T ~ 94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
Data Growth
Data Growth Rate
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996
Related Fields
Data Mining and Knowledge Discovery draws on:
Machine Learning
Visualization
Statistics
Databases
Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery:
integrates theory and heuristics
focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Knowledge Discovery Process flow, according to CRISP-DM
[Figure: CRISP-DM process flow diagram, with an added Monitoring step]
See www.crisp-dm.org for more information
Continuous monitoring and improvement is an addition to CRISP
Historical Note:
Many Names of Data Mining
Data Fishing, Data Dredging: 1960-
used by statisticians (as a bad name)
Some Definitions
Attribute or Field
measuring aspects of the Instance, e.g. temperature
Class (Label)
grouping of instances, e.g. days good for playing
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
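As a rough illustration of learning a predictor from pre-labeled instances, the sketch below fits a one-level decision tree (a "stump") on a single numeric attribute. The temperature values, the class labels, and the helper names learn_stump and predict are all invented for this example; they are not taken from the tutorial, and a real decision tree learner would split on many attributes and use a criterion such as information gain.

```python
# Hypothetical pre-labeled instances: (temperature in degrees F, class label).
# These values are invented for illustration; they are not from the tutorial.
training = [
    (64, "play"), (68, "play"), (70, "play"), (72, "play"),
    (80, "no play"), (83, "no play"), (85, "no play"),
]

def learn_stump(instances):
    """Learn a one-level decision tree: pick the threshold and the pair of
    labels that misclassify the fewest training instances.
    Assumes at least two distinct attribute values."""
    values = sorted({x for x, _ in instances})
    labels = sorted({y for _, y in instances})
    best = None  # (errors, threshold, label_if_at_or_below, label_if_above)
    # Candidate thresholds are midpoints between consecutive observed values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    (low_label if x <= threshold else high_label) != y
                    for x, y in instances
                )
                candidate = (errors, threshold, low_label, high_label)
                if best is None or candidate < best:
                    best = candidate
    return best[1:]

def predict(stump, x):
    """Apply the learned rule to a new, unlabeled instance."""
    threshold, low_label, high_label = stump
    return low_label if x <= threshold else high_label

stump = learn_stump(training)
print(stump)
print(predict(stump, 66), predict(stump, 90))
```

On this toy data the learned rule separates the two classes at a threshold between 72 and 80, so a new day at 66 degrees is predicted as "play" and one at 90 degrees as "no play".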
Clustering
Find “natural” grouping of instances given un-labeled data
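One common way to look for such groupings is k-means, shown here as a minimal sketch. The slide does not name an algorithm, so k-means is a choice made for this example; the 2-D points are invented, and taking the first k points as initial centroids is a simplification that keeps the run deterministic (practical implementations usually initialize randomly or with k-means++).

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points. Initial centroids are simply the first
    k points, a simplification that keeps this sketch deterministic."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical unlabeled instances forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
print(centroids)
```

No class labels are supplied anywhere; the two groups emerge only from the distances between instances, which is the difference from classification.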
Association Rules & Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
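The counts and the rule percentage on this slide can be checked with a short script. The sketch below counts, by brute force, how many transactions contain each itemset and then computes the confidence of Milk => Bread as support(Milk, Bread) / support(Milk). The function names support_counts and confidence are made up for this example, and enumerating every subset only works for tiny data; algorithms such as Apriori prune the search instead.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from the slide (TID -> items).
transactions = {
    1: {"MILK", "BREAD", "EGGS"},
    2: {"BREAD", "SUGAR"},
    3: {"BREAD", "CEREAL"},
    4: {"MILK", "BREAD", "SUGAR"},
    5: {"MILK", "CEREAL"},
    6: {"BREAD", "CEREAL"},
    7: {"MILK", "CEREAL"},
    8: {"MILK", "BREAD", "CEREAL", "EGGS"},
    9: {"MILK", "BREAD", "CEREAL"},
}

def support_counts(transactions, max_size=3):
    """Count how many transactions contain each itemset of up to max_size items."""
    counts = Counter()
    for items in transactions.values():
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[frozenset(itemset)] += 1
    return counts

def confidence(counts, antecedent, consequent):
    """Confidence of antecedent => consequent: support(A and C) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return counts[both] / counts[a]

counts = support_counts(transactions)
milk_bread = counts[frozenset({"MILK", "BREAD"})]
conf = confidence(counts, {"MILK"}, {"BREAD"})
print(f"support(Milk, Bread) = {milk_bread}")      # 4
print(f"confidence(Milk => Bread) = {conf:.1%}")   # 66.7%
```

Milk appears in 6 transactions and Milk with Bread in 4, so the confidence is 4/6, which the slide shows truncated to 66%.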
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Summarization
Data Mining Central Quest