0% found this document useful (0 votes)

73 views89 pages

Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets

data mining

Uploaded by

marj000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

73 views89 pages

Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets

data mining

Uploaded by

marj000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 89

Data Mining

Tutorial
Gregory Piatetsky-Shapiro
KDnuggets

2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples

2006 KDnuggets 2
Trends leading to Data Flood
More data is generated:
Web, text, images
Business transactions,
calls, ...
Scientific data:
astronomy, biology, etc
More data is captured:
Storage technology
faster and cheaper
DBMS can handle bigger
DB

2006 KDnuggets 3
Largest Databases in 2005
Winter Corp. 2005
Commercial Database
Survey:
1. Max Planck Inst. for
Meteorology , 222 TB
2. Yahoo ~ 100 TB (Largest
Data Warehouse)
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
3. AT&T ~ 94 TB

2006 KDnuggets 4
Data Growth

In 2 years (2003 to 2005),

the size of the largest database TRIPLED!

2006 KDnuggets 5
Data Growth Rate
Twice as much information was created in
2002 as in 1999 (~30% growth rate)
Other growth rate estimates even higher
Very little data will ever be looked at by a
human

Knowledge Discovery is NEEDED to make

sense and use of data.

2006 KDnuggets 6
Knowledge Discovery
Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data
Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, (Chapter 1), AAAI/MIT Press 1996

2006 KDnuggets 7
Related Fields

Machine Visualization
Learning
Data Mining and
Knowledge Discovery

Statistics Databases

2006 KDnuggets 8
Statistics, Machine Learning
and
Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics areas not part of data
mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery, including data
cleaning, learning, and integration and visualization of results
Distinctions are fuzzy

2006 KDnuggets 9
Knowledge Discovery Process
flow, according to CRISP-DM

see
Monitoring www.crisp-dm.org
for more
information

Continuous
monitoring and
improvement is
an addition to CRISP

2006 KDnuggets 10
Historical Note:
Many
Data Names
Fishing, of Data
Data Dredging: 1960-
Mining
used by statisticians (as bad name)

Data Mining :1990 --

used in DB community, business

Knowledge Discovery in Databases (1989-)

used by AI, Machine Learning Community
also Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery
are used interchangeably
2006 KDnuggets 11
Data Mining Tasks

2006 KDnuggets
Some Definitions
Instance (also Item or Record):
an example, described by a number of attributes,
e.g. a day can be described by temperature,
humidity and cloud status
Attribute or Field
measuring aspects of the Instance, e.g.
temperature
Class (Label)
grouping of instances, e.g. days good for playing

2006 KDnuggets 13
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships

2006 KDnuggets 14
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances

Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...

2006 KDnuggets 15
Clustering
Find natural grouping of
instances given un-labeled data

2006 KDnuggets 16
Association Rules &
Frequent Itemsets
Transactions
Frequent Itemsets:

Milk, Bread (4)

Bread, Cereal (3)
Milk, Bread, Cereal (2)

Rules:
Milk => Bread (66%)

2006 KDnuggets 17
Visualization & Data Mining
Visualizing the data
to facilitate human
discovery

Presenting the
discovered results in
a visually "nice" way
2006 KDnuggets 18
Summarization

Describe features of the

selected group
Use natural language
and graphics
Usually in Combination
with Deviation detection
or other methods

Average length of stay in this study area rose 45.7 percent,

from 4.3 days to 6.2 days, because ...
2006 KDnuggets 19
Data Mining Central Quest

Find true patterns

and avoid overfitting

(finding seemingly signifcant

but really random patterns due
to searching too many possibilites)
2006 KDnuggets 20
Classification
Methods

2006 KDnuggets
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances

Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...

Given a set of points from classes

what is the class of new point ?
2006 KDnuggets 22
Classification: Linear
Regression

Linear Regression
w0 + w1 x + w2 y >= 0

Regression computes
wi from data to
minimize squared
error to fit the data
Not flexible enough

2006 KDnuggets 23
Regression for Classification
Any regression technique can be used for
classification
Training: perform a regression for each class, setting
the output to 1 for training instances that belong to
class, and 0 for those that dont
Prediction: predict class corresponding to model with
largest output value (membership value)

For linear regression this is known as multi-

response linear regression

2006 KDnuggets 24
Classification: Decision Trees

if X > 5 then blue

else if Y > 3 then blue
Y else if X > 2 then green
else blue

2 5 X

2006 KDnuggets 25
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g.,
Color=red.
A leaf node represents a class label or class label
distribution.
At each node, one attribute is chosen to split
training examples into distinct classes as much as
possible
A new instance is classified by following a matching
path to a leaf node.

2006 KDnuggets 26
Weather Data: Play or not
Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
Note:
overcast hot high false Yes Outlook is the
rain mild high false Yes Forecast,
rain cool normal false Yes no relation to
rain cool normal true No Microsoft
overcast cool normal true Yes email program
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
2006 KDnuggets 27
Example Tree for Play?

Outlook

sunny
overcast rain

Humidity Yes
Windy

high normal true false

No Yes No Yes

2006 KDnuggets 28
Classification: Neural Nets

Can select more

complex regions
Can be more
accurate
Also can overfit the
data find patterns
in random noise

2006 KDnuggets 29
Classification: other
approaches
Nave Bayes
Rules
Support Vector Machines
Genetic Algorithms

See www.KDnuggets.com/software/

2006 KDnuggets 30
Evaluation

2006 KDnuggets
Evaluating which method works
the best for classification
No model is uniformly the best
Dimensions for Comparison
speed of training
speed of model application
noise tolerance
explanation ability

Best Results: Hybrid, Integrated models

2006 KDnuggets 32
Comparison of Major
Classification Approaches

A hybrid method will have higher accuracy

2006 KDnuggets 33
Evaluation of Classification
Models
How predictive is the model we learned?
Error on the training data is not a good
indicator of performance on future data
The new data will probably not be exactly the
same as the training data!
Overfitting fitting the training data too
precisely - usually leads to poor results on
new data

2006 KDnuggets 34
Evaluation issues
Possible evaluation measures:
Classification Accuracy
Total cost/benefit when different errors involve
different costs
Lift and ROC curves
Error in numeric predictions

How reliable are the predicted results ?

2006 KDnuggets 35
Classifier error rate
Natural performance measure for
classification problems: error rate
Success: instances class is predicted correctly
Error: instances class is predicted incorrectly
Error rate: proportion of errors made over the
whole set of instances
Training set error rate: is way too
optimistic!
you can find patterns even in random data

2006 KDnuggets 36
Evaluation on LARGE data

If many (>1000) examples are

available, including >100 examples
from each class
A simple evaluation will give useful results
Randomly split data into training and test sets
(usually 2/3 for train, 1/3 for test)
Build a classifier using the train set and
evaluate it using the test set

2006 KDnuggets 37
Classification Step 1:
Split THE
data PAST
into train and test
setsResults Known
+
+ Training set
-
-
+
Data

Testing set

2006 KDnuggets 38
Classification Step 2:
BuildTHEaPAST
model on a training
set Results Known
+
+ Training set
-
-
+
Data

Model Builder

Testing set

2006 KDnuggets 39
Classification Step 3:
Evaluate on test set (Re-
train?)
Results Known
+
+ Training set
-
-
+
Data

Model Builder
Evaluate
Predictions
+
Y N
-
+
Testing set -

2006 KDnuggets 40
Unbalanced data
Sometimes, classes have very unequal
frequency
Attrition prediction: 97% stay, 3% attrite (in a month)
medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% dont buy, 1% buy
Security: >99.99% of Americans are not terrorists

Similar situation with multiple classes

Majority class classifier can be 97% correct, but
useless

2006 KDnuggets 41
Handling unbalanced data
how?
If we have two classes that are very
unbalanced, then how can we
evaluate our classifier method?

2006 KDnuggets 42
Balancing unbalanced data,
1
With two classes, a good approach is to
build BALANCED train and test sets, and
train model on a balanced set
randomly select desired number of minority
class instances
add equal number of randomly selected majority
class
How do we generalize balancing to
multiple classes?

2006 KDnuggets 43
Balancing unbalanced data,
2
Generalize balancing to multiple
classes
Ensure that each class is represented with
approximately equal proportions in train
and test

2006 KDnuggets 44
A note on parameter tuning
It is important that the test data is not used in any
way to create the classifier
Some learning schemes operate in two stages:
Stage 1: builds the basic structure
Stage 2: optimizes parameter settings

The test data cant be used for parameter tuning!

Proper procedure uses three sets: training data,
validation data, and test data
Validation data is used to optimize parameters

2006 KDnuggets 45
Making the most of the data
Once evaluation is complete, all the data
can be used to build the final classifier
Generally, the larger the training data the
better the classifier (but returns diminish)
The larger the test data the more accurate
the error estimate

2006 KDnuggets 46
Classification:
Train, Validation, Test split
Results Known
+
Training set Model
+
-
-
Builder
+
Data
Evaluate
Model Builder
Predictions
+
-
Y N +
Validation set -

+
- Final Evaluation
+
Final Test Set Final Model -

2006 KDnuggets 47
Cross-validation
Cross-validation avoids overlapping test sets
First step: data is split into k subsets of equal size
Second step: each subset in turn is used for
testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the
cross-validation is performed
The error estimates are averaged to yield an
overall error estimate
2006 KDnuggets 48
Cross-validation example:
Break up data into groups of the same size

Hold aside one group for testing and use the rest to build
model
Test

Repeat

49
2006 KDnuggets 49
More on cross-validation
Standard method for evaluation: stratified ten-
fold cross-validation
Why ten? Extensive experiments have shown
that this is the best choice to get an accurate
estimate
Stratification reduces the estimates variance
Even better: repeated stratified cross-validation
E.g. ten-fold cross-validation is repeated ten times
and results are averaged (reduces the variance)

2006 KDnuggets 50
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than
number of prospects

Typical Applications
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...

2006 KDnuggets 51
Direct Marketing Evaluation
Accuracy on the entire dataset is not
the right measure
Approach
develop a target model
score all prospects and rank them by decreasing
score
select top P% of prospects for action

How do we decide what is the best subset

of prospects ?
2006 KDnuggets 52
Model-Sorted List
Use a model to assign score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list
No Scor Targe CustI Ag
e t D e 3 hits in top 5% of
1 0.97 Y 1746 the list
2 0.95 N 1024 If there 15 targets
3 0.94 Y 2478 overall, then top 5
4 0.93 Y 3820 has 3/15=20% of
targets
5 0.92 N 4897

99 0.11 N 2734
100 0.06 N 2422
2006 KDnuggets 53
CPH (Cumulative Pct
Hits)

Cumulative % Hits
Definition:
CPH(P,M)
= % of all targets
in the first P%
of the list scored
by model M
CPH frequently
called Gains
Pct list
5% of random list have 5% of targets

2006 KDnuggets 54
CPH: Random List vs
Model-ranked list

Cumulative % Hits

Pct list
5% of random list have 5% of targets,
but 5% of model ranked list have 21% of targets
CPH(5%,model)=21%.

2006 KDnuggets 55
Lift
Lift(P,M) = CPH(P,M) / P

Lift (at 5%)

= 21% / 5%
= 4.2
better
than random

Note: Some authors

use Lift for what
we call CPH.
P -- percent of the list

2006 KDnuggets 56
Lift a measure of model
quality
Lift helps us decide which models are better
If cost/benefit values are not available or
changing, we can use Lift to select a better
model.
Model with the higher Lift curve will
generally be better

2006 KDnuggets 57
Clustering

2006 KDnuggets
Clustering
Unsupervised learning:
Finds natural grouping of
instances given un-labeled data

2006 KDnuggets 59
Clustering Methods
Many different method and algorithms:
For numeric and/or symbolic data
Deterministic vs. probabilistic
Exclusive vs. overlapping
Hierarchical vs. flat
Top-down vs. bottom-up

2006 KDnuggets 60
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across
clusters

2006 KDnuggets 61
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = A(X) A(Y)

Several numeric attributes:

Distance(X,Y) = Euclidean distance between X,Y

Nominal attributes: distance is set to 1 if

values are different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary

2006 KDnuggets 62
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster
center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2,3 until convergence
(change in cluster assignments less than a
threshold)
2006 KDnuggets 63
K-means example, step 1

1
c1
Y
Pick 3 c2
initial
cluster
centers
(randomly)
c3

X
2006 KDnuggets 64
K-means example, step 2

c1
Y

c2
Assign
each point
to the closest
cluster
center c3

X
2006 KDnuggets 65
K-means example, step 3

c1 c1
Y

Move c2
each cluster
center c3
c2
to the mean
of each cluster c3

X
2006 KDnuggets 66
K-means example, step 4a

Reassign c1
points Y
closest to a
different new
cluster center
c3
Q: Which c2
points are
reassigned?

X
2006 KDnuggets 67
K-means example, step 4b

Reassign c1
points Y
closest to a
different new
cluster center
c3
Q: Which c2
points are
reassigned?

X
2006 KDnuggets 68
K-means example, step 4c

1
c1
Y
3
A: three
points with 2
animation c3
c2

X
2006 KDnuggets 69
K-means example, step 4d

c1
Y
re-compute
cluster
means c3
c2

X
2006 KDnuggets 70
K-means example, step 5

c1
Y

c2
move cluster
centers to c3
cluster means

X
2006 KDnuggets 71
Data Mining
Applications

2006 KDnuggets
Problems Suitable for Data-
Mining
require knowledge-based decisions
have a changing environment
have sub-optimal current methods
have accessible, sufficient, and relevant
data
provides high payoff for the right
decisions!

2006 KDnuggets 73
Major Application Areas for
Data Mining
Advertising
Solutions
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web

2006 KDnuggets 74
Application: Search Engines
Before Google, web search engines used
mainly keywords on a page results were
easily subject to manipulation
Google's early success was partly due to its
algorithm which uses mainly links to the page
Google founders Sergey Brin and Larry Page
were students at Stanford in 1990s
Their research in databases and data mining
led to Google

2006 KDnuggets 75
Microarrays: Classifying
Leukemia
Leukemia: Acute Lymphoblastic (ALL) vs
Acute Myeloid (AML), Golub et al, Science,
v.286, 1999
72 examples (38 train, 34 test), about 7,000
genes
ALL AML

Visually similar, but genetically very different

Best Model: 97% accuracy,

1 error (sample suspected
mislabelled)
2006 KDnuggets 76
Microarray Potential
Applications
New and better molecular diagnostics
Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip,
based on Affymetrix technology
New molecular targets for therapy
few new drugs, large pipeline,
Improved treatment outcome
Partially depends on genetic signature
Fundamental Biological Discovery
finding and refining biological pathways
Personalized medicine ?!

2006 KDnuggets 77
Application:
Direct Marketing and
Most major direct marketing companies are
CRM
using modeling and data mining
Most financial companies are using
customer modeling
Modeling is easier than changing customer
behaviour
Example
Verizon Wireless reduced customer attrition rate
from 2% to 1.5%, saving many millions of $
2006 KDnuggets 78
Application: e-Commerce

Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y

Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping,

2006 KDnuggets 79
Application:
Security and Fraud
Detection
Credit Card Fraud Detection
over 20 Million credit cards
protected by Neural networks (Fair,
Isaac)
Securities Fraud Detection
NASDAQ KDD system

Phone fraud detection

AT&T, Bell Atlantic, British
Telecom/MCI

2006 KDnuggets 80
Data Mining, Privacy, and
Security
TIA: Terrorism (formerly Total) Information
Awareness Program
TIA program closed by Congress in 2003 because
of privacy concerns
However, in 2006 we learn that NSA is
analyzing US domestic call info to find
potential terrorists
Invasion of Privacy or Needed Intelligence?

2006 KDnuggets 81
Criticism of Analytic
Approaches to Threat
Detection:
Data Mining will
be ineffective - generate millions of false
positives
and invade privacy

First, can data mining be effective?

2006 KDnuggets 82
Can Data Mining and
Statistics be Effective for
Threat Detection?
Criticism: Databases have 5% errors, so
analyzing 100 million suspects will generate 5
million false positives
Reality: Analytical models correlate many
items of information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot
After 30 throws, one biased coin will stand out
with high probability.
Can identify 19 biased coins out of 100 million
with sufficient number of throws

2006 KDnuggets 83
Another Approach: Link
Analysis

Can find unusual patterns in the network structure

2006 KDnuggets 84
Analytic technology can be
effective
Data Mining is just one additional tool to
help analysts
Combining multiple models and link
analysis can reduce false positives
Today there are millions of false positives
with manual analysis
Analytic technology has the potential to
reduce the current high rate of false
positives
2006 KDnuggets 85
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anon. ID
Give randomized outputs
Multi-party computation distributed data

Bayardo & Srikant, Technological Solutions for

Protecting Privacy, IEEE Computer, Sep 2003

2006 KDnuggets 86
The Hype Curve for
Data Mining and Knowledge
Discovery
Over-inflated
expectations
Growing acceptance
and mainstreaming
rising
expectations

Disappointment

2005

2006 KDnuggets 87
Summary
Data Mining and Knowledge Discovery are
needed to deal with the flood of data
Knowledge Discovery is a process !
Avoid overfitting (finding random patterns
by searching too many possibilities)

2006 KDnuggets 88
Additional Resources
www.KDnuggets.com
data mining software, jobs, courses, etc

www.acm.org/sigkdd
ACM SIGKDD the professional society for
data mining

2006 KDnuggets 89

A Comparison of Machine Learning Algorithms For Customer Churn Prediction
No ratings yet
A Comparison of Machine Learning Algorithms For Customer Churn Prediction
6 pages
Step-By-Step-Diabetes-Classification-Knn-Detailed-Copy1 - Jupyter Notebook
No ratings yet
Step-By-Step-Diabetes-Classification-Knn-Detailed-Copy1 - Jupyter Notebook
12 pages
CS583 Unsupervised Learning
No ratings yet
CS583 Unsupervised Learning
95 pages
Building Recommendation System Using Movielens Data
No ratings yet
Building Recommendation System Using Movielens Data
6 pages
Lecture 3 Data Mining
No ratings yet
Lecture 3 Data Mining
30 pages
Diabetes Prediction Using Data Mining
No ratings yet
Diabetes Prediction Using Data Mining
17 pages
Data Mining Model Performance of Sales Predictive Algorithms Based On Rapidminer Workflows
No ratings yet
Data Mining Model Performance of Sales Predictive Algorithms Based On Rapidminer Workflows
18 pages
E-Commerce Customer Prediction
No ratings yet
E-Commerce Customer Prediction
5 pages
DATA Mining
No ratings yet
DATA Mining
21 pages
Data Mining
No ratings yet
Data Mining
49 pages
What Is A DSS?: Decision Support Systems Concepts, Methodologies, and Technologies: An Overview
No ratings yet
What Is A DSS?: Decision Support Systems Concepts, Methodologies, and Technologies: An Overview
9 pages
Modelling & Neural Network Grade 9
0% (1)
Modelling & Neural Network Grade 9
71 pages
CH 6
No ratings yet
CH 6
72 pages
Outline: Problem Statement Definitions & Examples Strategies
No ratings yet
Outline: Problem Statement Definitions & Examples Strategies
7 pages
Real-World Use of Big Data in Telecommunications
No ratings yet
Real-World Use of Big Data in Telecommunications
20 pages
Abell Model-Business Modeling - (Chapter 2 MSO)
No ratings yet
Abell Model-Business Modeling - (Chapter 2 MSO)
35 pages
E-Commerce Recommendation System With Reverse Image Search
No ratings yet
E-Commerce Recommendation System With Reverse Image Search
47 pages
A Survey On Data Mining
No ratings yet
A Survey On Data Mining
4 pages
Big Data For Marketing Resource Reallocation
No ratings yet
Big Data For Marketing Resource Reallocation
31 pages
Project Proposal 260 Copy
No ratings yet
Project Proposal 260 Copy
38 pages
Recommendation System in Python
No ratings yet
Recommendation System in Python
13 pages
Process Science in Action
No ratings yet
Process Science in Action
20 pages
Deep Learning and CNNFYTGS5101-Guoyangxie
No ratings yet
Deep Learning and CNNFYTGS5101-Guoyangxie
42 pages
Market Basket Analysis and Advanced Data Mining: Professor Amit Basu
No ratings yet
Market Basket Analysis and Advanced Data Mining: Professor Amit Basu
24 pages
Introduction To Big Data & Basic Data Analysis
No ratings yet
Introduction To Big Data & Basic Data Analysis
47 pages
Applications of Data Mining in The Banking Sector
No ratings yet
Applications of Data Mining in The Banking Sector
8 pages
Data Mining in Medicine
No ratings yet
Data Mining in Medicine
42 pages
Supervised and Unsupervised Learning
No ratings yet
Supervised and Unsupervised Learning
12 pages
L1 2 Introduction
No ratings yet
L1 2 Introduction
79 pages
Assignment 1&2
No ratings yet
Assignment 1&2
4 pages
Fuzzy C-Means Clustering - MATLAB FCM
0% (1)
Fuzzy C-Means Clustering - MATLAB FCM
6 pages
Data Transformation and Arima Models A S
No ratings yet
Data Transformation and Arima Models A S
8 pages
Web Analytics, Web Mining, and Social Analytics
No ratings yet
Web Analytics, Web Mining, and Social Analytics
53 pages
4 Data Mining & Preprocessing L 11,12,13,14,15,16
No ratings yet
4 Data Mining & Preprocessing L 11,12,13,14,15,16
100 pages
Augmented Analytics
No ratings yet
Augmented Analytics
8 pages
Bagging+Boosting+Gradient Boosting
100% (1)
Bagging+Boosting+Gradient Boosting
48 pages
Business Analytics
No ratings yet
Business Analytics
42 pages
7 Verification Validation
No ratings yet
7 Verification Validation
36 pages
Introduction To Data Mining: Saeed Salem Department of Computer Science North Dakota State University Cs - Ndsu.edu/ Salem
No ratings yet
Introduction To Data Mining: Saeed Salem Department of Computer Science North Dakota State University Cs - Ndsu.edu/ Salem
30 pages
Data Mart Info
No ratings yet
Data Mart Info
5 pages
DataMining S
No ratings yet
DataMining S
103 pages
12 Classification
No ratings yet
12 Classification
16 pages
ABP DWDM UNIT 4 Classification 1
No ratings yet
ABP DWDM UNIT 4 Classification 1
51 pages
BSBFIM601 Powerpoint Presentation
No ratings yet
BSBFIM601 Powerpoint Presentation
94 pages
RMM Unit-I Introdution To Data Mining
No ratings yet
RMM Unit-I Introdution To Data Mining
129 pages
Business Intelligence Overview
No ratings yet
Business Intelligence Overview
15 pages
BDM Unit I Slides Part 1
No ratings yet
BDM Unit I Slides Part 1
27 pages
Data Mining For Customer Segmentation
No ratings yet
Data Mining For Customer Segmentation
13 pages
Unit 2 - Data Munging PDF
No ratings yet
Unit 2 - Data Munging PDF
54 pages
House Price Prediction Using Machine Learning: Bachelor of Technology
No ratings yet
House Price Prediction Using Machine Learning: Bachelor of Technology
20 pages
Adoption of BI in SMEs PDF
No ratings yet
Adoption of BI in SMEs PDF
22 pages
Image Classification
No ratings yet
Image Classification
18 pages
PSSC Maths Statistics Project Handbook Eff08 PDF
No ratings yet
PSSC Maths Statistics Project Handbook Eff08 PDF
19 pages
Future Skills - An Introduction, General Overview of The Future Skills Sub-Sector-1
No ratings yet
Future Skills - An Introduction, General Overview of The Future Skills Sub-Sector-1
15 pages
An Introduction To Data Mining
No ratings yet
An Introduction To Data Mining
47 pages
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
From Everand
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
Janet Laane Effron
No ratings yet
The Concise Guide to the Internet of Things for Executives
From Everand
The Concise Guide to the Internet of Things for Executives
alasdair gilchrist
4/5 (7)
Single customer view Second Edition
From Everand
Single customer view Second Edition
Gerardus Blokdyk
No ratings yet
Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets
No ratings yet
Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets
89 pages
Dmtut
No ratings yet
Dmtut
88 pages
Aiml
No ratings yet
Aiml
8 pages
Machine Learning Marking Criteria Portfolio Part 3
No ratings yet
Machine Learning Marking Criteria Portfolio Part 3
1 page
ML Practice Questions
No ratings yet
ML Practice Questions
6 pages
Assignment PDF
0% (1)
Assignment PDF
5 pages
Bussin
No ratings yet
Bussin
81 pages
Clustering-Based Undersampling With Random Over Sampling Examples and Support Vector Machine For Imbalanced Classification of Breast Cancer Diagnosis
No ratings yet
Clustering-Based Undersampling With Random Over Sampling Examples and Support Vector Machine For Imbalanced Classification of Breast Cancer Diagnosis
12 pages
DEA-RNN A Hybrid Deep Learning Approach
No ratings yet
DEA-RNN A Hybrid Deep Learning Approach
93 pages
Spam Detection Synopsis
No ratings yet
Spam Detection Synopsis
8 pages
Diabetes Detection Using Deep Learning Algorithms: ICT Express November 2018
No ratings yet
Diabetes Detection Using Deep Learning Algorithms: ICT Express November 2018
5 pages
Demo
No ratings yet
Demo
29 pages
Failure Prediction in The Refinery Piping System Using Machine Learning Algorithms
No ratings yet
Failure Prediction in The Refinery Piping System Using Machine Learning Algorithms
10 pages
Titanic Survival Prediction Assignment
No ratings yet
Titanic Survival Prediction Assignment
3 pages
Assignment 1 Face Recognition Updated
No ratings yet
Assignment 1 Face Recognition Updated
3 pages
Research Article: Improved KNN Algorithm Based On Preprocessing of Center in Smart Cities
No ratings yet
Research Article: Improved KNN Algorithm Based On Preprocessing of Center in Smart Cities
10 pages
Data Mining Overview: by Dr. Sunil D. Lakdawala
No ratings yet
Data Mining Overview: by Dr. Sunil D. Lakdawala
52 pages
Fuzzy Logic
No ratings yet
Fuzzy Logic
102 pages
Bayesian Classifier Implementation Using MATLAB
No ratings yet
Bayesian Classifier Implementation Using MATLAB
21 pages
Fairness Lectures-21
No ratings yet
Fairness Lectures-21
63 pages
Stress Detection Report
No ratings yet
Stress Detection Report
11 pages
B20-ml Basedbotnet Attack in IoT Devices
No ratings yet
B20-ml Basedbotnet Attack in IoT Devices
66 pages
Basic 7: Data and Information What Is Data?
No ratings yet
Basic 7: Data and Information What Is Data?
3 pages
Sign Languages To Speech Conversion Prototype Using The SVM Classifier
No ratings yet
Sign Languages To Speech Conversion Prototype Using The SVM Classifier
5 pages
3 DSeismic Waveform Classification
No ratings yet
3 DSeismic Waveform Classification
5 pages
Anatomy of The Infamous Artificial: Example by Caesar Ogole
No ratings yet
Anatomy of The Infamous Artificial: Example by Caesar Ogole
33 pages
ML Unit 5
No ratings yet
ML Unit 5
20 pages
ccs341 Data Warehousing Lab Manual2021
No ratings yet
ccs341 Data Warehousing Lab Manual2021
50 pages
Real Time Finger Tracking and Contour Detection For Gesture Recognition Using Opencv
No ratings yet
Real Time Finger Tracking and Contour Detection For Gesture Recognition Using Opencv
4 pages
User Simulation for Evaluating Information Access Systems
No ratings yet
User Simulation for Evaluating Information Access Systems
246 pages
Data Science - UNIT-3 - Notes
No ratings yet
Data Science - UNIT-3 - Notes
32 pages
Pam Clustering Technique
No ratings yet
Pam Clustering Technique
10 pages

Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets

Uploaded by

Data Mining Tutorial: Gregory Piatetsky-Shapiro Kdnuggets

Uploaded by

Data Mining

In 2 years (2003 to 2005),

Knowledge Discovery is NEEDED to make

Data Mining :1990 --

Knowledge Discovery in Databases (1989-)

Milk, Bread (4)

Describe features of the

Average length of stay in this study area rose 45.7 percent,

Find true patterns

(finding seemingly signifcant

Given a set of points from classes

For linear regression this is known as multi-

if X > 5 then blue

high normal true false

Can select more

Best Results: Hybrid, Integrated models

A hybrid method will have higher accuracy

How reliable are the predicted results ?

If many (>1000) examples are

Similar situation with multiple classes

The test data cant be used for parameter tuning!

How do we decide what is the best subset

Lift (at 5%)

Note: Some authors

Several numeric attributes:

Nominal attributes: distance is set to 1 if

Visually similar, but genetically very different

Best Model: 97% accuracy,

Phone fraud detection

First, can data mining be effective?

Can find unusual patterns in the network structure

Bayardo & Srikant, Technological Solutions for

You might also like