Data Mining

Tutorial
Gregory Piatetsky-Shapiro
KDnuggets

© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples

Trends leading to Data Flood
• More data is generated:
  - Web, text, images, …
  - Business transactions, calls, ...
  - Scientific data: astronomy, biology, etc.
• More data is captured:
  - Storage technology faster and cheaper
  - DBMS can handle bigger DB

Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (Largest Data Warehouse)
3. AT&T, ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp

Data Growth

In 2 years (2003 to 2005), the size of the largest database TRIPLED!

Data Growth Rate

• Twice as much information was created in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a human

Knowledge Discovery is NEEDED to make sense and use of data.
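
The growth figures on this slide and the previous one can be turned into compound annual rates. The sketch below is only an arithmetic illustration; the helper name `annual_growth_rate` is made up for this example, and it assumes steady compound growth over the period. Doubling over three years works out to about 26% per year, in line with the rounded "~30%" quoted above.

```python
def annual_growth_rate(factor, years):
    """Compound annual growth rate implied by growing `factor`-fold over `years` years."""
    return factor ** (1 / years) - 1

# Information created doubled from 1999 to 2002 (3 years).
print(f"{annual_growth_rate(2, 3):.1%}")  # 26.0%

# The largest database tripled from 2003 to 2005 (2 years).
print(f"{annual_growth_rate(3, 2):.1%}")  # 73.2%
```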

Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
• valid
• novel
• potentially useful
• and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996

Related Fields

[Diagram: Data Mining and Knowledge Discovery shown in relation to Machine Learning, Visualization, Statistics, and Databases]

Statistics, Machine Learning and Data Mining
• Statistics:
  - more theory-based
  - more focused on testing hypotheses
• Machine learning:
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics, areas not part of data mining
• Data Mining and Knowledge Discovery:
  - integrates theory and heuristics
  - focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
• Distinctions are fuzzy

Knowledge Discovery Process flow, according to CRISP-DM

[Figure: CRISP-DM process flow diagram, with an added Monitoring step]

See www.crisp-dm.org for more information.

Continuous monitoring and improvement is an addition to CRISP.

Historical Note: Many Names of Data Mining
• Data Fishing, Data Dredging (1960-)
  - used by statisticians (as a bad name)
• Data Mining (1990-)
  - used in the DB community and in business
• Knowledge Discovery in Databases (1989-)
  - used by the AI and Machine Learning community
• Also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery are used interchangeably
Data Mining Tasks

Some Definitions

• Instance (also Item or Record):
  - an example, described by a number of attributes
  - e.g. a day can be described by temperature, humidity and cloud status
• Attribute or Field:
  - measures an aspect of the Instance, e.g. temperature
• Class (Label):
  - a grouping of instances, e.g. days good for playing
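
These definitions map naturally onto a small record type. The sketch below is one possible representation, not something taken from the tutorial: the `Day` class, its field names, and the sample values are all hypothetical and chosen only to mirror the slide's example of a day described by temperature, humidity and cloud status.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Day:
    """One instance (record). Each field below is an attribute."""
    temperature: float          # attribute (hypothetical unit: degrees Celsius)
    humidity: float             # attribute (hypothetical unit: percent)
    cloud_status: str           # attribute
    good_for_playing: Optional[bool] = None  # class label; None means unlabeled

# A pre-labeled instance and an unlabeled one (values are made up).
labeled = Day(temperature=22.0, humidity=60.0, cloud_status="sunny", good_for_playing=True)
unlabeled = Day(temperature=18.0, humidity=85.0, cloud_status="overcast")

# Attribute names are every field except the class label.
attribute_names = [f.name for f in fields(Day) if f.name != "good_for_playing"]
print(attribute_names)  # ['temperature', 'humidity', 'cloud_status']
```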

Major Data Mining Tasks
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationships
• …
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...
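
As a rough illustration of learning from pre-labeled instances, the sketch below trains a one-rule classifier, which is effectively a one-level decision tree: it picks the single attribute whose per-value majority class makes the fewest training errors. This is a deliberately simple stand-in and not a method prescribed by the tutorial; the training data, attribute names, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical pre-labeled instances: (attributes, class label).
TRAINING = [
    ({"outlook": "sunny", "windy": False}, "no"),
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": True}, "yes"),
    ({"outlook": "rainy", "windy": False}, "yes"),
]

def train_one_rule(instances):
    """Return (attribute, value -> class rule, default class) with the fewest training errors."""
    best = None
    for attr in instances[0][0]:
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for attrs, label in instances:
            counts[attrs[attr]][label] += 1
        # Predict the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[attrs[attr]] for attrs, label in instances)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    # Fall back to the overall majority class for unseen attribute values.
    default = Counter(label for _, label in instances).most_common(1)[0][0]
    return best[0], best[1], default

def predict(model, attrs):
    """Classify a new instance with the learned rule."""
    attr, rule, default = model
    return rule.get(attrs[attr], default)

model = train_one_rule(TRAINING)
print(model[0])                                             # outlook
print(predict(model, {"outlook": "sunny", "windy": False}))  # no
```

On this toy data the rule on `outlook` misclassifies one training instance, while a rule on `windy` would misclassify two, so `outlook` is selected.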

Clustering
Find “natural” groupings of instances given unlabeled data
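
One common way to look for such groupings is k-means, which the slide does not name; it is used here only as a familiar example. The sketch below is a minimal version of Lloyd's algorithm under stated assumptions: the points, the initial centroids, and the fixed iteration count are all made up for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Minimal k-means: assign each point to its nearest centroid, then move each centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of its cluster (kept as-is if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled 2-D instances forming two visible groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])  # the three points near (1, 1.5)
print(clusters[1])  # the three points near (8.5, 8.5)
```

No class labels are supplied anywhere; the grouping comes only from distances between instances.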

Association Rules & Frequent Itemsets

Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (confidence 4/6 ≈ 67%)
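
The counts and the rule percentage can be checked directly against the nine transactions listed on this slide. The sketch below uses a brute-force count over all itemsets of a given size, which is fine for a toy table but is not an efficient frequent-itemset algorithm such as Apriori; the function names are hypothetical.

```python
from itertools import combinations

# The nine transactions from the slide, as sets of items.
TRANSACTIONS = [
    {"milk", "bread", "eggs"},
    {"bread", "sugar"},
    {"bread", "cereal"},
    {"milk", "bread", "sugar"},
    {"milk", "cereal"},
    {"bread", "cereal"},
    {"milk", "cereal"},
    {"milk", "bread", "cereal", "eggs"},
    {"milk", "bread", "cereal"},
]

def support_count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)

def frequent_itemsets(size, min_count, transactions=TRANSACTIONS):
    """Brute force: count every itemset of the given size and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, size):
        count = support_count(combo, transactions)
        if count >= min_count:
            result[combo] = count
    return result

def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)

print(support_count({"milk", "bread"}))           # 4
print(round(confidence({"milk"}, {"bread"}), 2))  # 0.67
```

The rule Milk => Bread holds in 4 of the 6 transactions that contain milk, which is where the percentage on the slide comes from.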

Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice" way

Summarization

• Describe features of the selected group
• Use natural language and graphics
• Usually in combination with deviation detection or other methods

Example: Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...

Data Mining Central Quest

Find true patterns and avoid overfitting
(finding seemingly significant but really random patterns due to searching too many possibilities)
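
The danger of searching too many possibilities can be shown with purely random data. In the sketch below, which is an illustrative simulation and not part of the original tutorial, both the class labels and 1000 candidate attributes are coin flips, so no attribute has any real predictive power. The function name, sample sizes, and seed are arbitrary assumptions.

```python
import random

def best_random_attribute(n_instances=20, n_attributes=1000, seed=0):
    """Search many random binary attributes for the one that best matches random labels.

    Returns (accuracy of the best attribute on the searched data,
             accuracy of that same attribute on fresh holdout data).
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_instances)]
    holdout_labels = [rng.randint(0, 1) for _ in range(n_instances)]
    best_train, best_holdout = 0.0, 0.0
    for _ in range(n_attributes):
        train_values = [rng.randint(0, 1) for _ in range(n_instances)]
        holdout_values = [rng.randint(0, 1) for _ in range(n_instances)]
        train_acc = sum(v == y for v, y in zip(train_values, labels)) / n_instances
        if train_acc > best_train:
            best_train = train_acc
            best_holdout = sum(v == y for v, y in zip(holdout_values, holdout_labels)) / n_instances
    return best_train, best_holdout

train_acc, holdout_acc = best_random_attribute()
print(f"best attribute on searched data: {train_acc:.0%}")
print(f"same attribute on fresh data:    {holdout_acc:.0%}")
```

Because the winner is chosen from 1000 candidates, its accuracy on the searched data looks far better than chance, yet on fresh data it should be expected to do no better than about 50%. Evaluating on data that was not used in the search is what exposes the pattern as random.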