Classification and Prediction (Module 3)

Classification and Prediction

Classification and Prediction

 What is classification? What is prediction?


 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Other Classification Methods
 Prediction
 Classification accuracy
 Summary

2
Classification vs. Prediction

 Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
 Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
 Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis

3
Definition

 CLASSIFICATION

– Classification is to build structures from examples of


past decisions that can be used to make decisions for
unseen cases.

– Classification is a form of data analysis that can be used


to extract models describing important data classes.

– The classification task concentrates on predicting the value
of the decision class for an object with unknown class,
from a predefined set of class values, given the values of
some attributes of the object

4
Classification in Literature

 Classification has been an essential theme in machine learning and
statistics research
– Often referred to as supervised learning.
– Decision trees, Bayesian classification, neural networks, k-nearest
neighbors, etc.
– Tree-pruning, Boosting, bagging techniques
 Efficient and scalable classification methods
– SLIQ, SPRINT, RainForest, BOAT, etc.
 Classification of semi-structured and non-structured data
– Classification by clustering association rules (ARCS)
– Association-based classification
– Web document classification
– Text Categorization

5
Classification: Formal Definition

 Given a collection of records (training set )


– Each record contains a set of attributes, one of the attributes
is the class.
 Find a model for class attribute as a function
of the values of other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.

6
Illustrating Classification Task

Training Set (input to the learning algorithm; Induction -> Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (Apply Model -> Deduction):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

7
Classification—A Two-Step
Process
 Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
 Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting will
occur

8
Classification—A Two-Step
Process
 Learning/Training (Model Construction)
– Using a classification algorithm, a model is built by analyzing a set of training
database objects.
– The model is represented as classification rules, decision trees, etc.

 Testing/Evaluation
– The model is tested on a different data set (the test data set), whose class
labels were not seen during training, and the classification accuracy is estimated:
• The known label of each test sample is compared with the classification result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by
the model
• The test set must be independent of the training set; otherwise over-fitting will occur

 Model Usage (Decision Making)
– If the accuracy of the model is considered acceptable, the model can be used
to classify/predict future data objects for which the class label is unknown.

9
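The two-step process above can be sketched in a few lines. This is a minimal illustration, not the slides' method: the "model" is just the majority class of the training set, and the data is an invented toy example; what matters is the shape of the two steps (learn on the training set, estimate accuracy on a held-out test set).

```python
# Step 1 (learning) and Step 2 (testing) of the two-step process, with a
# trivial majority-class "model". Data is a toy assumption for illustration.
from collections import Counter

train = [("yes",), ("no",), ("no",), ("no",)]  # training class labels
test = [("no",), ("no",), ("yes",)]            # held-out test class labels

# Step 1: the "model" here is simply the majority class of the training set.
majority = Counter(label for (label,) in train).most_common(1)[0][0]

# Step 2: accuracy = fraction of test samples whose known label matches
# the model's prediction.
accuracy = sum(1 for (label,) in test if label == majority) / len(test)
print(majority, round(accuracy, 2))  # no 0.67
```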
Evaluation of Classification Systems

 Training Set: examples with class values for learning.
 Test Set: examples with class values for evaluating.
 Evaluation: Hypotheses are used to infer the classification of examples
in the test set; the inferred classification is compared to the known
classification.
 Accuracy: percentage of examples in the test set that are classified
correctly.

(Figure: predicted vs. actual classes, distinguishing true positives, false
positives, and false negatives.)
10
Sample Dataset

Name     Rank            Year  Dean
Ali      Assistant Prof  2     No
Mohd     Assistant Prof  3     No
Qasem    Assistant Prof  7     Yes
Azeem    Associate Prof  7     No
Hasan    Professor       2     Yes
Azmi     Associate Prof  7     Yes
Hamidah  Assistant Prof  6     No
Lim      Professor       5     Yes
Ahmad    Assistant Prof  7     Yes
Fatimah  Associate Prof  3     No

Conditional attributes: Name, Rank, Year. Class/decision attribute: Dean.

Decision Table (Historical Data)

11
Model Construction

Train Dataset -> Classification Algorithm -> Classifier (Model)

Name     Rank            Year  Dean
Mohd     Assistant Prof  3     No
Qasem    Assistant Prof  7     Yes
Hasan    Professor       2     Yes
Azmi     Associate Prof  7     Yes
Hamidah  Assistant Prof  6     No
Fatimah  Associate Prof  3     No

Resulting model:
IF Rank = ‘Professor’ OR Years > 6
THEN Dean = ‘Yes’

12
Evaluate and Use the Model

Rule: IF Rank = ‘Professor’ OR Years > 6 THEN Dean = ‘Yes’

Evaluation phase (test dataset, labels known):

Name   Rank            Year  Dean (actual)  Dean (predicted)
Ali    Assistant Prof  2     No             No
Azeem  Associate Prof  7     No             Yes
Lim    Professor       5     Yes            Yes
Ahmad  Assistant Prof  7     Yes            Yes

Computed accuracy: 3 of 4 correct = 75%

Use phase (unseen future data, label unknown):

Name   Rank  Year  Dean
Ramli  Prof  4     ?

13
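The 75% figure can be reproduced directly. A minimal sketch applying the learned rule to the test table from this slide:

```python
# Evaluation phase: apply the rule
# "IF Rank = 'Professor' OR Years > 6 THEN Dean = 'Yes'"
# to the test set from the slide and compute the accuracy (3 of 4 = 75%).
test_set = [
    ("Ali", "Assistant Prof", 2, "No"),
    ("Azeem", "Associate Prof", 7, "No"),
    ("Lim", "Professor", 5, "Yes"),
    ("Ahmad", "Assistant Prof", 7, "Yes"),
]

def predict(rank, years):
    return "Yes" if rank == "Professor" or years > 6 else "No"

correct = sum(1 for _, rank, years, dean in test_set
              if predict(rank, years) == dean)
accuracy = 100 * correct / len(test_set)
print(accuracy)  # 75.0  (Azeem is the one misclassified sample)
```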
Use the Model in Prediction

Test data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Classifier + unseen data: (Jeff, Professor, 4) -> Tenured?

14
Supervised vs. Unsupervised Learning

 Supervised learning (classification)


– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
 Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the aim
of establishing the existence of classes or clusters in the data

15
Issues regarding classification and
prediction

(1): Data Preparation
 Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
 Data transformation
– Generalize and/or normalize data

16
(2): Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
– time to construct the model
– time to use the model
 Robustness
– handling noise and missing values
 Scalability
– efficiency in disk-resident databases
 Interpretability:
– understanding and insight provided by the model
 Goodness of rules
– decision tree size
– compactness of classification rules

17
Classification Techniques

 Decision Tree based Methods


 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines

18
Classification by Decision Tree Induction
 Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
 Decision tree generation consists of two phases
– Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
– Tree pruning
Identify and remove branches that reflect noise or outliers
 Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

19
Example 1: Training Dataset

This follows an example from Quinlan’s ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

20
Output: A Decision Tree for
“buys_computer”

age?
|-- <=30   -> student?
|             no  -> no
|             yes -> yes
|-- 31…40  -> yes
|-- >40    -> credit_rating?
              excellent -> no
              fair      -> yes
21
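Classifying an unseen sample with this tree is a walk from the root to a leaf. A minimal sketch, encoding the tree above as nested dicts (the sample values are an illustrative assumption):

```python
# The buys_computer tree, encoded as nested dicts:
# attribute -> {branch value -> subtree or leaf label}.
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(node, sample):
    while isinstance(node, dict):        # descend until we hit a leaf label
        attr = next(iter(node))          # attribute tested at this node
        node = node[attr][sample[attr]]  # follow the matching branch
    return node

# Hypothetical unseen sample: young student with fair credit rating.
sample = {"age": "<=30", "income": "medium", "student": "yes",
          "credit_rating": "fair"}
print(classify(tree, sample))  # yes
```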
Example 2 of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is
continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):

Refund?
|-- Yes -> NO
|-- No  -> MarSt?
           |-- Single, Divorced -> TaxInc?
           |                      < 80K -> NO
           |                      > 80K -> YES
           |-- Married -> NO


22
Example 3 of a Decision Tree

Another model for the same training data as in Example 2:

MarSt?
|-- Married          -> NO
|-- Single, Divorced -> Refund?
                        |-- Yes -> NO
                        |-- No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!

23
Decision Tree Classification Task

Training Set (Tree Induction algorithm; Induction -> Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (Apply Model -> Deduction, using the learned Decision Tree):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

24
Apply Model to Test Data

Start from the root of the tree.

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
|-- Yes -> NO
|-- No  -> MarSt?
           |-- Single, Divorced -> TaxInc? (< 80K -> NO, > 80K -> YES)
           |-- Married -> NO

25
Apply Model to Test Data

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Traversal: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat to “No”.

30
Decision Tree Induction

 Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT

32
General Structure of Hunt’s Algorithm

 Let Dt be the set of training records that reach a node t
 General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a
leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default
class, yd
– If Dt contains records that belong to more than one class, use an
attribute test to split the data into smaller subsets. Recursively
apply the procedure to each subset.

(Illustrated on the Refund / Marital Status / Taxable Income / Cheat
training data from the earlier examples.)

33
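The three cases of Hunt's procedure map directly onto a small recursive function. A minimal sketch: the attribute to split on is taken in a fixed order rather than chosen by a real selection measure, and the four-record table is a cut-down assumption in the spirit of the Refund/Cheat data:

```python
# Hunt's recursive procedure: pure node -> leaf, empty node -> default leaf,
# mixed node -> split on an attribute and recurse. Attribute choice is
# simplified to a fixed order (not information gain or gini).
from collections import Counter

def hunt(records, attrs, default="No"):
    if not records:                      # empty D_t -> leaf with default class
        return default
    labels = [r["Cheat"] for r in records]
    if len(set(labels)) == 1:            # pure D_t -> leaf with that class
        return labels[0]
    if not attrs:                        # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attrs[0], attrs[1:]
    branches = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        branches[value] = hunt(subset, rest, default)
    return (attr, branches)             # internal node: (test attribute, branches)

# Toy subset of the Refund/Cheat table (an assumption for illustration).
records = [
    {"Refund": "Yes", "Marital": "Single", "Cheat": "No"},
    {"Refund": "No", "Marital": "Married", "Cheat": "No"},
    {"Refund": "No", "Marital": "Divorced", "Cheat": "Yes"},
    {"Refund": "No", "Marital": "Single", "Cheat": "Yes"},
]
tree = hunt(records, ["Refund", "Marital"])
print(tree)
```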
Example: C4.5

 Simple depth-first construction.


 Uses Information Gain
 Sorts Continuous Attributes at each node.
 Needs entire data to fit in memory.
 Unsuitable for Large Datasets.
– Needs out-of-core sorting.

 You can download the software from:


http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

34
Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)


– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
– There are no samples left

35
Attribute Selection Measure

 Information gain (ID3/C4.5)


– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
 Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each attribute
– May need other tools, such as clustering, to get the possible split
values
– Can be modified for categorical attributes

36
37
Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain
 Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information needed to decide if an arbitrary example
in S belongs to P or N is defined as

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

38
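The formula above translates into a small helper; with the buys_computer class counts (9 "yes", 5 "no") it reproduces the 0.940 bits used on the later slides.

```python
# Expected information I(p, n) for a two-class set, as defined above.
from math import log2

def info(p, n):
    total = p + n
    terms = []
    for count in (p, n):
        if count:                       # 0 * log2(0) is taken as 0
            f = count / total
            terms.append(-f * log2(f))
    return sum(terms)

print(round(info(9, 5), 3))  # 0.94, the I(9,5) used on the next slides
```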
Information Gain in Decision Tree
Induction

 Assume that using attribute A a set S will be partitioned into sets
{S1, S2, …, Sv}
– If Si contains pi examples of P and ni examples of N, the entropy,
or the expected information needed to classify objects in all
subtrees Si, is

E(A) = sum over i = 1..v of ((pi + ni) / (p + n)) * I(pi, ni)

 The encoding information that would be gained by branching on A:

Gain(A) = I(p, n) - E(A)

39
Attribute Selection by Information Gain
Computation

 Class P: buys_computer = “yes” (9 samples)
 Class N: buys_computer = “no” (5 samples)

I(p, n) = I(9, 5) = 0.940

 Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
30…40   4   0   0
>40     3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Hence Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
40
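The E(age) and Gain(age) numbers on this slide can be checked mechanically from the per-partition counts:

```python
# E(age) as the weighted sum of I(pi, ni) over the three age partitions,
# then Gain(age) = I(9, 5) - E(age), as on the slide.
from math import log2

def info(p, n):
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

partitions = [(2, 3), (4, 0), (3, 2)]     # (pi, ni) for <=30, 30...40, >40
total = sum(p + n for p, n in partitions)  # 14 samples in all
e_age = sum((p + n) / total * info(p, n) for p, n in partitions)
gain_age = info(9, 5) - e_age
# E(age) is about 0.694 and Gain(age) about 0.247 bits; the commonly quoted
# 0.246 comes from subtracting the rounded values 0.940 - 0.694.
print(e_age, gain_age)
```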
41
Gini Index (IBM IntelligentMiner)

 If a data set T contains examples from n classes, the gini index
gini(T) is defined as

gini(T) = 1 - sum over j = 1..n of pj^2

where pj is the relative frequency of class j in T.

 If a data set T is split into two subsets T1 and T2 with sizes N1
and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

 The attribute that provides the smallest gini_split(T) is chosen to
split the node (need to enumerate all possible splitting points
for each attribute).
42
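The two formulas translate into short helpers; the class counts below are illustrative assumptions, chosen to show a pure node, a maximally impure node, and a binary split:

```python
# gini(T) from class counts, and the weighted gini of a binary split,
# exactly as defined above.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([5, 5]))    # 0.5  (maximally impure two-class node)
print(gini([10, 0]))   # 0.0  (pure node)
print(gini_split([4, 0], [1, 5]))  # weighted impurity of a candidate split
```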
Extracting Classification Rules from
Trees

 Represent the knowledge in the form of IF-THEN rules


 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
 Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”

43
Avoid Overfitting in
Classification
 The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise
or outliers
– Result is in poor accuracy for unseen samples
 Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
Difficult to choose an appropriate threshold

– Postpruning: Remove branches from a “fully grown” tree—get


a sequence of progressively pruned trees
Use a set of data different from the training data to decide
which is the “best pruned tree”

44
Approaches to Determine the Final Tree
Size

 Separate training (2/3) and testing (1/3) sets


 Use cross validation, e.g., 10-fold cross validation
 Use all the data for training
– but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
 Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
45
Enhancements to basic decision tree induction

 Allow for continuous-valued attributes


– Dynamically define new discrete-valued attributes that partition
the continuous attribute value into a discrete set of intervals
 Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
 Attribute construction
– Create new attributes based on existing ones that are sparsely
represented
– This reduces fragmentation, repetition, and replication

46
Classification in Large Databases

 Classification—a classical problem extensively studied by


statisticians and machine learning researchers
 Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
 Why decision tree induction in data mining?
– relatively faster learning speed (than other classification methods)
– convertible to simple and easy to understand classification rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods

47
Scalable Decision Tree Induction Methods in
Data Mining Studies

 SLIQ (EDBT’96 — Mehta et al.)


– builds an index for each attribute and only class list and the current
attribute list reside in memory
 SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
 PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stop growing the tree
earlier
 RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that determine
the quality of the tree
– builds an AVC-list (attribute, value, class label)

48
Presentation of Classification Results

49
Bayesian Classification: Why?

 Probabilistic learning: Calculate explicit probabilities for


hypothesis, among the most practical approaches to certain
types of learning problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured

50
Bayesian Theorem

 Given training data D, the posterior probability of a hypothesis h,
P(h|D), follows from Bayes’ theorem:

P(h|D) = P(D|h) P(h) / P(D)

 MAP (maximum a posteriori) hypothesis:

h_MAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) P(h)

 Practical difficulty: requires initial knowledge of many
probabilities; significant computational cost

51
Bayesian classification

The classification problem may be formalized


using a-posteriori probabilities:
 P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.

 E.g. P(class=N | outlook=sunny,windy=true,…)

 Idea: assign to sample X the class label C such


that P(C|X) is maximal

54
Estimating a-posteriori probabilities

 Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
 P(X) is constant for all classes
 P(C) = relative freq of class C samples
 C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
 Problem: computing P(X|C) is unfeasible!

55
Naïve Bayesian Classification

 Naïve assumption: attribute independence


P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
 If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class C
 If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density
function
 Computationally easy in both cases

56
Play-tennis example: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Priors: P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
              P(overcast|p) = 4/9  P(overcast|n) = 0
              P(rain|p) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
              P(mild|p) = 4/9      P(mild|n) = 2/5
              P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:     P(high|p) = 3/9      P(high|n) = 4/5
              P(normal|p) = 6/9    P(normal|n) = 2/5
windy:        P(true|p) = 3/9      P(true|n) = 3/5
              P(false|p) = 6/9     P(false|n) = 2/5

57
Play-tennis example: classifying X

 An unseen sample X = <rain, hot, high, false>

 P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
 P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

 Sample X is classified in class n (don’t play)

58
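The computation on this slide can be reproduced directly from the estimated conditional probabilities:

```python
# Naive-Bayes scores for X = <rain, hot, high, false>, using the priors and
# per-attribute conditional probabilities from the play-tennis slide.
probs = {
    "p": {"prior": 9/14, "rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9},
    "n": {"prior": 5/14, "rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5},
}
x = ["rain", "hot", "high", "false"]

scores = {}
for cls, p in probs.items():
    score = p["prior"]
    for value in x:            # independence assumption: multiply per attribute
        score *= p[value]
    scores[cls] = score

print(round(scores["p"], 6), round(scores["n"], 6))  # 0.010582 0.018286
print(max(scores, key=scores.get))                   # n  (don't play)
```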
The independence hypothesis…

 … makes computation possible


 … yields optimal classifiers when satisfied
 … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
 Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at the time,
considering most important attributes first

59
Other Classification Methods

 k-nearest neighbor classifier


 case-based reasoning
 Genetic algorithm
 Rough set approach
 Fuzzy set approaches
 Neural Network

60
Instance-Based Methods

 Instance-based learning:
– Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
 Typical approaches
– k-nearest neighbor approach
Instances represented as points in a Euclidean space.

– Locally weighted regression


Constructs local approximation

– Case-based reasoning
Uses symbolic representations and knowledge-based
inference

61
The k-Nearest Neighbor Algorithm

 All instances correspond to points in the n-D space.


 The nearest neighbor are defined in terms of Euclidean
distance.
 The target function could be discrete- or real- valued.
 For discrete-valued, the k-NN returns the most
common value among the k training examples nearest
to xq.
 Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples.

(Figure: 2-D plot of + and - training examples around a query point xq.)

62
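The algorithm above fits in a few lines: compute Euclidean distances to all training points, take the k nearest, and return the most common label among them. The training points below are an illustrative assumption:

```python
# k-NN for a discrete-valued target: Euclidean distance in n-D space,
# majority vote among the k nearest training examples.
from math import dist
from collections import Counter

# Hypothetical labeled points: a "+" cluster near (1, 1.5), a "-" cluster
# near (5.5, 5).
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((5.0, 5.0), "-"),
         ((6.0, 5.5), "-"), ((5.5, 4.5), "-")]

def knn(xq, k=3):
    nearest = sorted(train, key=lambda t: dist(xq, t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn((1.2, 1.5)))  # +
print(knn((5.4, 5.0)))  # -
```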
Discussion on the k-NN Algorithm

 The k-NN algorithm for continuous-valued target functions


– Calculate the mean values of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to
their distance to the query point xq
giving greater weight to closer neighbors, e.g. w = 1 / d(xq, xi)^2
– Similarly for real-valued target functions
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: distance between neighbors could
be dominated by irrelevant attributes.
– To overcome it, axes stretch or elimination of the least relevant
attributes.
63
Case-Based Reasoning

 Also uses: lazy evaluation + analyze similar instances


 Difference: Instances are not “points in a Euclidean space”
 Example: Water faucet problem in CADET (Sycara et al’92)
 Methodology
– Instances represented by rich symbolic descriptions (e.g., function
graphs)
– Multiple retrieved cases may be combined
– Tight coupling between case retrieval, knowledge-based reasoning,
and problem solving
 Research issues
– Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases

64
Remarks on Lazy vs. Eager Learning

 Instance-based learning: lazy evaluation


 Decision-tree and Bayesian classification: eager evaluation
 Key differences
– Lazy method may consider query instance xq when deciding how to
generalize beyond the training data D
– Eager method cannot since they have already chosen global approximation
when seeing the query
 Efficiency: Lazy - less time training but more time predicting
 Accuracy
– Lazy method effectively uses a richer hypothesis space since it uses many
local linear functions to form its implicit global approximation to the target
function
– Eager: must commit to a single hypothesis that covers the entire instance
space

65
Genetic Algorithms

 GA: based on an analogy to biological evolution


 Each rule is represented by a string of bits
 An initial population is created consisting of randomly
generated rules
– e.g., IF A1 and Not A2 then C2 can be encoded as 100
 Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules and
their offspring
 The fitness of a rule is represented by its classification
accuracy on a set of training examples
 Offspring are generated by crossover and mutation

66
Rough Set Approach

 Rough sets are used to approximately or “roughly”


define equivalent classes
 A rough set for a given class C is approximated by two
sets: a lower approximation (certain to be in C) and an
upper approximation (cannot be described as not
belonging to C)
 Finding the minimal subsets (reducts) of attributes (for
feature reduction) is NP-hard but a discernibility matrix
is used to reduce the computation intensity

67
Rough Set based Classification (Pawlak)

Pipeline: Dataset -> Generate Reducts -> Set of Reducts -> Generate Rules
-> Set of Rules (Classifier). The set of reducts is highly influenced by
the number of attributes.

DS = {U, A} (Decision Table)
A = {a1, a2, a3, a4, a5, dec} (Set of Attributes)
U = {x1, x2, x3, x4, x5} (Set of Objects)

Reduct: a minimum set of attributes that represents DS.
Core: the set of attributes that exist in all reducts of DS.
68
Mining Classification Rules: An
Example

Decision System (DS):

     a1  a2  a3  a4  a5  dec
x1   2   1   3   2   1   1
x2   1   1   2   3   2   2
x3   3   1   3   2   1   1
x4   1   1   3   3   3   2
x5   2   3   3   3   3   2

Reducts Set of DS: {a4}, {a1}, {a2}

Set of Rules (Classifier):
a4(2) => dec(1)
a1(1) => dec(2)
a1(3) => dec(1)
a2(3) => dec(2)

Discernibility Matrix Modulo Decision (entries only for object pairs with
different decision values):

      x2               x4            x5
x1    {a1,a3,a4,a5}    {a1,a4,a5}    {a2,a4,a5}
x3    {a1,a3,a4,a5}    {a1,a4,a5}    {a1,a2,a4,a5}
69
Fuzzy Set
Approaches

 Fuzzy logic uses truth values between 0.0 and 1.0 to


represent the degree of membership (such as using
fuzzy membership graph)
 Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories {low,
medium, high} with fuzzy values calculated
 For a given new sample, more than one fuzzy value may
apply
 Each applicable rule contributes a vote for membership
in the categories
 Typically, the truth values for each predicted category are summed

70
Prediction

71
What Is Prediction?

 Prediction is similar to classification


– First, construct a model
– Second, use model to predict unknown value
Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
 Prediction is different from classification
– Classification refers to predict categorical class label
– Prediction models continuous-valued functions

72
Predictive Modeling in
Databases
 Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
 One can only predict value ranges or category distributions
 Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
 Determine the major factors which influence the prediction
– Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
 Multi-level prediction: drill-down and roll-up analysis

73
Regression Analysis and Log-Linear Models in
Prediction

 Linear regression: Y = alpha + beta * X
– The two parameters alpha and beta specify the line and are to be
estimated using the data at hand
– e.g. by applying the least squares criterion to the known values
Y1, Y2, …, X1, X2, …
 Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
 Log-linear models:
– The multi-way table of joint probabilities is approximated by a
product of lower-order tables
– Probability: p(a, b, c, d) = alpha_ab * beta_ac * chi_ad * delta_bcd
74
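The least-squares estimates for Y = alpha + beta*X have a closed form: beta = cov(X, Y)/var(X) and alpha = mean(Y) - beta*mean(X). A minimal sketch on invented data points that lie exactly on Y = 1 + 2X:

```python
# Closed-form least-squares fit of Y = alpha + beta * X.
# The data is an illustrative assumption (exactly on the line Y = 1 + 2X).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
       / sum((x - mx) ** 2 for x in xs)
alpha = my - beta * mx
print(alpha, beta)  # 1.0 2.0
```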
Locally Weighted Regression

 Construct an explicit approximation to f over a local region
surrounding query instance xq.
 Locally weighted linear regression:
– The target function f is approximated near xq using the linear
function:

f^(x) = w0 + w1 a1(x) + … + wn an(x)

– minimize the squared error over the k nearest neighbors of xq,
with a distance-decreasing weight K:

E(xq) = (1/2) * sum over x in k nearest neighbors of xq of
        (f(x) - f^(x))^2 * K(d(xq, x))

– the gradient descent training rule:

delta wj = eta * sum over x in k nearest neighbors of xq of
           K(d(xq, x)) * (f(x) - f^(x)) * aj(x)

 In most cases, the target function is approximated
by a constant, linear, or quadratic function.
75
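A minimal sketch of the constant-approximation case the slide mentions: a kernel-weighted average over the k nearest neighbors. The Gaussian kernel and the sample points are illustrative choices, not prescribed by the slide.

```python
import math

def kernel(d, width=1.0):
    """Distance-decreasing weight K(d)."""
    return math.exp(-(d / width) ** 2)

def lw_predict(xq, samples, k=3):
    """samples: list of (x, f(x)) pairs; return the K-weighted
    average of the k samples nearest to the query instance xq."""
    nearest = sorted(samples, key=lambda s: abs(s[0] - xq))[:k]
    weights = [kernel(abs(x - xq)) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Illustrative samples from f(x) = x^2; prediction at xq = 2.0
pred = lw_predict(2.0, [(0, 0.0), (1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0)])
```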
Prediction: Numerical Data

76
Prediction: Categorical Data

77
Classification Accuracy: Estimating Error
Rates

 Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set(1/3)
– used for data set with large number of samples
 Cross-validation
– divide the data set into k subsamples
– use k-1 subsamples as training data and one sub-sample as
test data --- k-fold cross-validation
– for data set with moderate size
 Bootstrapping (leave-one-out)
– for small size data

78
Classification Accuracy as an Efficiency
Measure
Confusion Matrix
• A confusion matrix contains information about actual and
predicted classifications done by a classification system.
• The following table shows the confusion matrix for a two-class
classifier.
• The entries in the confusion matrix have the following meaning:

• a is the number of correct predictions that an instance is negative,
• b is the number of incorrect predictions that an instance is positive,
• c is the number of incorrect predictions that an instance is negative,
and
• d is the number of correct predictions that an instance is positive.
• Classification Accuracy (AC) = (a + d) / (a + b + c + d)

                       Predicted
                  Negative   Positive
Actual Negative       a          b
       Positive       c          d

Fig: confusion matrix

79
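Given the entries a, b, c, d defined above, accuracy and the related per-class rates can be computed as in this sketch (the counts are illustrative):

```python
# Metrics from a two-class confusion matrix with entries a, b, c, d
# as defined above (a = true negatives, b = false positives,
# c = false negatives, d = true positives).
def confusion_metrics(a, b, c, d):
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,         # AC = (a + d) / (a + b + c + d)
        "true_positive_rate": d / (c + d),   # recall on the positive class
        "false_positive_rate": b / (a + b),
        "precision": d / (b + d),
    }

m = confusion_metrics(a=50, b=10, c=5, d=35)  # illustrative counts
```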
Confusion Matrix for the Iris Dataset

- Data Set : 150 objects
- Training Dataset : 105 objects (70%)
- Testing Dataset : 45 objects (30%)
- Classes : 3 (Iris 1, Iris 2, Iris 3)

                         Predicted
              Iris 1   Iris 2   Iris 3   Accuracy %
Actual
  Iris 1        14        0        0       100.00
  Iris 2         1       15        3        78.95
  Iris 3         0        1       11        91.67
Accuracy %     93.3     93.75    78.57      88.89

Table : Confusion Matrix for the Iris
Dataset
80
Approaches to Evaluating Classification
Algorithms

 Random Train and Test Approach (Holdout)


 K-Fold Cross Validation Approach (Rotation Estimation)
 Bootstrap Approaches

81
Train and Test (Holdout) approach

Dataset → Random Splitter → Training DS → Data Mining Task → Patterns
                          → Test DS → Pattern Evaluation

Train : 70%
Test : 30%

82
Example

Full dataset:

Name      Rank             Year   Dean
Mohd      Assistant Prof    3     No
Ali       Assistant Prof    2     No
Qasem     Assistant Prof    7     Yes
Hasan     Professor         2     Yes
Azmi      Associate Prof    7     Yes
Hamedah   Assistant Prof    6     No
Azeem     Associate Prof    7     No
Fatimah   Associate Prof    3     No
Lim       Professor         5     Yes
Ahmad     Assistant Prof    7     Yes

Train Dataset:

Name      Rank             Year   Dean
Mohd      Assistant Prof    3     No
Qasem     Assistant Prof    7     Yes
Hasan     Professor         2     Yes
Azmi      Associate Prof    7     Yes
Hamedah   Assistant Prof    6     No
Fatimah   Associate Prof    3     No

Test Dataset:

Name      Rank             Year   Dean
Ali       Assistant Prof    2     No
Azeem     Associate Prof    7     No
Lim       Professor         5     Yes
Ahmad     Assistant Prof    7     Yes

83
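The random splitter from the holdout diagram can be sketched as below, using the example records from this slide (the 70/30 ratio follows the earlier slide; the seed is an arbitrary choice for reproducibility):

```python
import random

def holdout_split(dataset, train_frac=0.7, seed=42):
    """Randomly partition a dataset into (training DS, test DS)."""
    shuffled = dataset[:]                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = [("Mohd", "Assistant Prof", 3, "No"), ("Ali", "Assistant Prof", 2, "No"),
           ("Qasem", "Assistant Prof", 7, "Yes"), ("Hasan", "Professor", 2, "Yes"),
           ("Azmi", "Associate Prof", 7, "Yes"), ("Hamedah", "Assistant Prof", 6, "No"),
           ("Azeem", "Associate Prof", 7, "No"), ("Fatimah", "Associate Prof", 3, "No"),
           ("Lim", "Professor", 5, "Yes"), ("Ahmad", "Assistant Prof", 7, "Yes")]
train, test = holdout_split(records)  # 7 records for training, 3 for testing
```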
K-Fold Cross Validation

4-Fold Cross
Validation

84
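The fold rotation behind k-fold cross-validation can be sketched as follows: each fold serves once as the test set while the remaining k-1 folds form the training set (round-robin fold assignment here is one arbitrary way to partition the indices):

```python
def kfold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # round-robin folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds if f is not folds[i] for j in f]
        yield train_idx, test_idx

splits = list(kfold_splits(n_samples=8, k=4))  # 4-fold, as in the diagram
```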
Boosting and Bagging

 Boosting increases classification accuracy


– Applicable to decision trees or Bayesian classifiers
 Learn a series of classifiers, where each
classifier in the series pays more attention to
the examples misclassified by its predecessor
 Boosting requires only linear time and
constant space

85
Boosting Technique (II) — Algorithm

 Assign every example an equal weight 1/N


 For t = 1, 2, …, T Do
– Obtain a hypothesis (classifier) h(t) under w(t)
– Calculate the error of h(t) and re-weight the
examples based on the error
– Normalize w(t+1) to sum to 1
 Output a weighted sum of all the hypotheses,
with each hypothesis weighted according to its
accuracy on the training set

86
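The loop above can be sketched with AdaBoost-style weight updates on a toy 1-D dataset. The threshold "stump" classifiers and all data are illustrative, and the slide's generic algorithm does not prescribe this particular error and weight formula:

```python
import math

def boost(points, labels, thresholds, rounds=3):
    """labels are +1/-1; each hypothesis is a stump h(x) = +1 if x > t else -1."""
    n = len(points)
    w = [1.0 / n] * n                       # equal initial weights 1/N
    ensemble = []
    for _ in range(rounds):
        # obtain the stump with the lowest weighted error under w(t)
        best = min(thresholds, key=lambda t: sum(
            wi for wi, x, y in zip(w, points, labels) if (1 if x > t else -1) != y))
        err = sum(wi for wi, x, y in zip(w, points, labels)
                  if (1 if x > best else -1) != y)
        if err == 0 or err >= 0.5:
            ensemble.append((best, 1.0))
            break
        alpha = 0.5 * math.log((1 - err) / err)   # more accurate => bigger vote
        ensemble.append((best, alpha))
        # up-weight misclassified examples, down-weight the rest, then normalize
        w = [wi * math.exp(-alpha * y * (1 if x > best else -1))
             for wi, x, y in zip(w, points, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all hypotheses."""
    score = sum(a * (1 if x > t else -1) for t, a in ensemble)
    return 1 if score > 0 else -1

model = boost([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1], thresholds=[1.5, 3.5, 5.5])
```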
Summary
 Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
 Classification is probably one of the most widely used data
mining techniques with a lot of extensions
 Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
 Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc.

87
References (I)

 C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
 L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
 P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for
scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD'95), pages 39-44, Montreal, Canada, August 1995.
 U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
 J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-
427, New York, NY, August 1998.
 M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research
Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

88
References (II)

 J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic


interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
 M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France,
March 1996.
 S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey. Data Mining and Knowledge Discovery 2(4): 345-389, 1998.
 J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
 R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August
1998.
 J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept.
1996.
 S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufman, 1991.

89
Thank you

90
