Classification and Prediction

The document discusses classification and prediction, describing classification as predicting categorical class labels by constructing a model based on training data, while regression models continuous functions. It covers issues in classification like data preparation and model evaluation, and describes decision tree induction as a method for classification that generates trees to partition data based on attribute tests at internal nodes.

Classification and Prediction
 What is classification? What is regression?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Scalable decision tree induction
Classification vs. Prediction
 Classification:
  predicts categorical class labels
  constructs a model from the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
 Regression:
  models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications:
  credit approval
  target marketing
  medical diagnosis
  treatment effectiveness analysis
Why Classification? A Motivating Application
 Credit approval
  A bank wants to classify its customers according to whether they are expected to pay back their approved loans
  The history of past customers is used to train the classifier
  The classifier provides rules that identify potentially reliable future customers
 Classification rule:
  IF age = "31…40" AND income = high THEN credit_rating = excellent
 Future customers
  Paul: age = 35, income = high ⇒ excellent credit rating
  John: age = 20, income = medium ⇒ fair credit rating
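A minimal sketch of this rule in code; the fallback to a "fair" rating for customers the rule does not cover is an assumption for illustration, not stated on the slide:

```python
def credit_rating(age, income):
    """Apply the slide's classification rule.

    Customers not matched by the rule fall back to a default class
    of 'fair' (an illustrative assumption)."""
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"

print(credit_rating(35, "high"))    # Paul  -> excellent
print(credit_rating(20, "medium"))  # John  -> fair
```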
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
  The known label of each test sample is compared with the classified result from the model
  The accuracy rate is the percentage of test set samples that are correctly classified by the model
  The test set is independent of the training set; otherwise over-fitting will occur
Classification Process (1): Model Construction

Training data are fed into a classification algorithm, which produces the classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting model:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
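As a sketch, the learned rule can be replayed against the training data to confirm that it reproduces every known label:

```python
# Training set from the slide: (name, rank, years, tenured)
train = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# The model fits the training data: every training label is reproduced
assert all(tenured(rank, years) == label for _, rank, years, label in train)
print("rule reproduces all 6 training labels")
```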
Classification Process (2): Use the Model in Prediction

The classifier is first applied to testing data to estimate its accuracy, then to unseen data:

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) ⇒ Tenured?
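Continuing the sketch, applying the same rule (IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’) to the testing data yields the accuracy estimate, after which the model can be used on the unseen sample:

```python
# Testing set from the slide: (name, rank, years, known label)
test = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Mellisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def predict(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Compare known labels with the model's output on the independent test set
correct = sum(predict(rank, years) == label for _, rank, years, label in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.0%}")   # Mellisa is misclassified: 75%

# Unseen sample: (Jeff, Professor, 4)
print(predict("Professor", 4))        # yes
```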
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data are classified based on the training set
 Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction (1): Data Preparation
 Data cleaning
  Preprocess data in order to reduce noise and handle missing values
 Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
 Data transformation
  Generalize and/or normalize data
  e.g., numerical attribute income ⇒ categorical {low, medium, high}
  e.g., normalize all numerical attributes to [0, 1)
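Both transformations can be sketched as follows; the income cut-offs and the scaling trick that keeps values strictly below 1 are illustrative assumptions, not prescribed by the slide:

```python
def normalize(values):
    """Min-max scale numeric values into [0, 1)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # avoid division by zero for constant columns
    # Inflate the denominator slightly so the maximum stays strictly below 1
    return [(v - lo) / (span * (1 + 1e-9)) for v in values]

def discretize_income(income):
    """Generalize numeric income into {low, medium, high} (cut-offs assumed)."""
    if income < 30_000:
        return "low"
    if income < 70_000:
        return "medium"
    return "high"

incomes = [25_000, 48_000, 95_000]
print([discretize_income(i) for i in incomes])  # ['low', 'medium', 'high']
print(normalize(incomes))
```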
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
 Predictive accuracy
 Speed
  time to construct the model
  time to use the model
 Robustness
  handling noise and missing values
 Scalability
  efficiency for disk-resident databases
 Interpretability
  understanding and insight provided by the model
 Goodness of rules (quality)
  decision tree size
  compactness of classification rules
Classification by Decision Tree Induction
 Decision tree
  A flow-chart-like tree structure
  Each internal node denotes a test on an attribute
  Each branch represents an outcome of the test
  Leaf nodes represent class labels or class distributions
 Decision tree generation consists of two phases
  Tree construction
  At the start, all the training examples are at the root
  Partition the examples recursively based on selected attributes
  Tree pruning
  Identify and remove branches that reflect noise or outliers
 Use of a decision tree: classifying an unknown sample
  Test the attribute values of the sample against the decision tree
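The construction phase can be illustrated with a toy recursive splitter. This sketch selects attributes by information gain (as in Quinlan's ID3) and omits the pruning phase; the dataset and attribute names are illustrative:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the class label distribution over `rows`."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def best_attribute(rows, attrs, target):
    """Pick the attribute with the highest information gain."""
    def gain(a):
        values = Counter(r[a] for r in rows)
        remainder = sum(
            (n / len(rows)) * entropy([r for r in rows if r[a] == v], target)
            for v, n in values.items())
        return entropy(rows, target) - remainder
    return max(attrs, key=gain)

def build_tree(rows, attrs, target):
    """Tree construction: recursively partition on selected attributes."""
    labels = {r[target] for r in rows}
    if len(labels) == 1:                      # pure node -> leaf
        return labels.pop()
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    a = best_attribute(rows, attrs, target)   # attribute test at an internal node
    rest = [x for x in attrs if x != a]
    return (a, {v: build_tree([r for r in rows if r[a] == v], rest, target)
                for v in sorted({r[a] for r in rows})})

# Tiny illustrative dataset
data = [
    {"income": "high", "student": "no",  "buys": "no"},
    {"income": "high", "student": "yes", "buys": "yes"},
    {"income": "low",  "student": "no",  "buys": "yes"},
    {"income": "low",  "student": "yes", "buys": "yes"},
]
print(build_tree(data, ["income", "student"], "buys"))
# ('income', {'high': ('student', {'no': 'no', 'yes': 'yes'}), 'low': 'yes'})
```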
Training Dataset

This follows an example from Quinlan's ID3:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”

age?
├── <=30  → student?
│           ├── no  → no
│           └── yes → yes
├── 31…40 → yes
└── >40   → credit_rating?
            ├── excellent → no
            └── fair      → yes
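Written out as nested attribute tests, the tree above classifies an unknown sample as follows (a direct transcription of this one tree, not a general implementation):

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree: test each attribute at an internal node."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31…40":
        return "yes"
    # age > 40: credit rating decides
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))      # yes
print(buys_computer(">40", "no", "excellent"))   # no
```

This tree reproduces the label of every row in the training dataset above.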
Scalable Decision Tree Induction Methods
 SLIQ (EDBT’96 — Mehta et al.)
  Builds an index for each attribute; only the class list and the current attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
  Constructs an attribute-list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
  Integrates tree splitting and tree pruning: stops growing the tree earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  Builds an AVC-list (attribute, value, class label)
 BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
  Uses bootstrapping to create several small samples
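RainForest's idea is that the (attribute value, class label) counts for one attribute fit in memory even when the full dataset does not; a minimal sketch of building such a count table (the row data are illustrative):

```python
from collections import defaultdict

def avc_set(rows, attr, target):
    """Count (attribute value, class label) pairs for a single attribute."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in rows:          # one sequential scan; only the counts are kept
        counts[r[attr]][r[target]] += 1
    return {v: dict(c) for v, c in counts.items()}

rows = [
    {"age": "<=30",  "buy": "no"},
    {"age": "<=30",  "buy": "yes"},
    {"age": "31…40", "buy": "yes"},
    {"age": ">40",   "buy": "no"},
]
print(avc_set(rows, "age", "buy"))
# {'<=30': {'no': 1, 'yes': 1}, '31…40': {'yes': 1}, '>40': {'no': 1}}
```

Split criteria such as information gain need only these counts, not the raw tuples, which is what makes the method scale to disk-resident data.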
