Supervised Learning Algorithms
Algorithms
Bayesian classification
Rule-based classification
Classification by backpropagation
Prediction
Unsupervised classification
What is Classification?
In Machine Learning:
If forecasting a discrete value → Classification
If forecasting a continuous value → Prediction
Classification Example
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Age is numeric, Car-type is a categorical attribute
Class label indicates whether the person bought the product
Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Regression (Prediction) Example
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Spent indicates how much the person spent during a recent visit to the web site
Dependent attribute is numerical

Age  Car  Spent
20   M    $200
30   M    $150
25   T    $300
30   S    $220
40   S    $400
20   T    $80
30   M    $100
25   M    $125
40   M    $500
20   S    $420
Supervised and Unsupervised
Supervised Classification = Classification
We know the class labels and the number of classes
(Figure: the training data with class labels is used to learn a model, which may be expressed as classification rules, e.g. IF Income = ‘High’ OR Age > 30 THEN Class = ‘Good’, OR as a decision tree, OR as a mathematical formula.)
Classification is a 3-step process
2. Model Evaluation (Accuracy):
Estimate accuracy rate of the model based on a test set.
The known label of test sample is compared with the classified
result from the model.
Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
The test set is independent of the training set; otherwise over-fitting will occur.
2. Classification Process (Accuracy Evaluation)
(Figure: test data is fed into the classification model and the predicted class is compared with the known class label.)
Classification is a three-step process
3. Model Use (Classification):
The model is used to classify unseen objects.
Give a class label to a new tuple
Predict the value of an actual attribute (prediction)
3. Classification Process (Use)
(Figure: an unseen tuple is fed to the classification model, which assigns it a class label.)
However, if we have two classes and half of the examples belong to one class and half to the other, then entropy is high (equal to 1).
Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Entropy of heterogeneous data
Information Gain (IG)
Entropy(S, A) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} I(D_j)
Gain(S, A) = Entropy(S) - Entropy(S, A)
Calculate Entropy & Information Gain to build a Decision Tree
Step 1: Let's calculate Entropy for the entire sample
Step 2: Calculate Entropy for each column:
Entropy(S, A) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} I(D_j)
Step 3: Calculate Information Gain
Information Gain from all attributes
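A minimal sketch of steps 1–3 in plain Python, using the Age/Car training data from the classification example earlier in this section (column names Age, Car, Class are taken from that table):

import math
from collections import Counter

def entropy(labels):
    """Step 1: Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attribute, target="Class"):
    """Steps 2-3: expected entropy after splitting on `attribute`, subtracted from Entropy(S)."""
    total_entropy = entropy([r[target] for r in rows])
    expected = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        expected += (len(subset) / len(rows)) * entropy(subset)   # weight |Dj| / |D|
    return total_entropy - expected

# The training data from the classification example slide
rows = [
    {"Age": 20, "Car": "M", "Class": "Yes"}, {"Age": 30, "Car": "M", "Class": "Yes"},
    {"Age": 25, "Car": "T", "Class": "No"},  {"Age": 30, "Car": "S", "Class": "Yes"},
    {"Age": 40, "Car": "S", "Class": "Yes"}, {"Age": 20, "Car": "T", "Class": "No"},
    {"Age": 30, "Car": "M", "Class": "Yes"}, {"Age": 25, "Car": "M", "Class": "Yes"},
    {"Age": 40, "Car": "M", "Class": "Yes"}, {"Age": 20, "Car": "S", "Class": "No"},
]
print(round(info_gain(rows, "Car"), 3))   # information gain of splitting on Car-type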
How does the tree look initially
Build Decision Tree – but what next?
Build Decision Tree – next is here
How does the tree look finally?
Decision rules
Sample Decision Tree
(Figure: scatter plot of customers by Age (20–80) against Income (2,000–10,000), labeled Excellent customers vs. Fair customers, with an initial tree: root split on Income at 6K; the >= 6K branch predicts YES and the < 6K branch splits further on Age.)
Sample Decision Tree
(Figure: the same Age/Income scatter plot, now with the refined tree.)
Income < 6K → NO
Income >= 6K and Age < 50 → NO
Income >= 6K and Age >= 50 → YES
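As a quick illustration, the refined sample tree above corresponds directly to decision rules; a short sketch in Python (attribute names taken from the figure):

def classify(income, age):
    """Decision rules read off the refined sample tree."""
    if income < 6000:
        return "NO"
    if age < 50:
        return "NO"
    return "YES"

print(classify(income=7000, age=55))  # YES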
Decision-Tree Classification Methods
1. Tree construction
At the start, all the training examples are at the root.
Partition examples recursively based on selected attributes.
2. Tree pruning
Aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data, thereby improving classification accuracy.
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Nominal attribute, multi-way split: CarType → Family / Sports / Luxury
Nominal attribute, binary split: CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
Size → Small / Medium / Large
Binary split: Size → {Small, Medium} vs. {Large}, OR {Medium, Large} vs. {Small}
What about Size → {Small, Large} vs. {Medium}? (this grouping does not preserve the order)
Splitting Based on Continuous Attributes
Binary split: Taxable Income > 80K? → Yes / No
Multi-way split: Taxable Income discretized into ranges, e.g. < 10K, ..., > 80K
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
(Example splits: Customers on Income (< 10k / >= 10k) or on Age (young / old).)
Algorithm for Decision Tree Induction
Basic algorithm
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized
in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
There are no remaining attributes for further partitioning
There are no samples left
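A compact sketch of this basic top-down induction loop, assuming categorical attributes and rows represented as dicts (the same format as the earlier entropy example); a real implementation would add pruning and discretization of continuous attributes:

import math
from collections import Counter

def build_tree(rows, attributes, target="Class"):
    """Top-down recursive divide-and-conquer induction (ID3-style sketch)."""
    if not rows:
        return None                                    # no samples left
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:        # pure node or no attributes remaining
        return Counter(labels).most_common(1)[0][0]    # leaf: (majority) class label

    def expected_entropy(attr):                        # heuristic: expected info after splitting on attr
        score = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            counts = Counter(subset)
            score += (len(subset) / len(rows)) * -sum(
                (c / len(subset)) * math.log2(c / len(subset)) for c in counts.values())
        return score

    best = min(attributes, key=expected_entropy)       # equivalent to maximizing information gain
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

Called with the rows and attribute list from the earlier sketch, this returns nested dicts keyed by the chosen attribute and its values, with class labels at the leaves.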
Classification Algorithms
ID3
Uses information gain
C4.5
Uses Gain Ratio
Decision Tree Induction: Training Dataset
(Figure: decision tree learned from the training dataset, with root test age? and branches <=30, 31..40, >40 leading to leaves no / yes / yes.)
Let D, a data partition, be a training set of class-labeled tuples.
Suppose the class label attribute has m distinct values defining m distinct classes.
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
The term |D_j| / |D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.
The information gain measure is biased toward tests with many outcomes. That is,
it prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier, such as product
ID.
Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0.
(A test on income splits the data into three partitions, namely low, medium and high, containing four, six and four tuples respectively.)
Ex. SplitInfo_A(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557
gain_ratio(income) = Gain(income) / SplitInfo(income) = 0.029 / 1.557 ≈ 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
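A quick numeric check of this split-info and gain-ratio arithmetic in plain Python (Gain(income) = 0.029 is taken from the example above):

import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions induced by A."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes)

si = split_info([4, 6, 4])        # income -> low, medium, high
print(round(si, 3))               # 1.557
print(round(0.029 / si, 3))       # gain ratio for income: 0.019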
Information gain: biased towards multi-valued attributes
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
Challenge with Decision Tree models
Random Forest helps overcome this challenge
Let's understand what overfitting is
How overfitting causes a challenge in Decision Trees
Random Forest to the rescue
Low Bias: if the decision tree is grown to its complete depth, it fits the training dataset very closely, so the training error will be very small.
High Variance: whenever we get new test data, such a deep decision tree is prone to give a larger amount of error.
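A minimal sketch of this idea, assuming scikit-learn is available; the synthetic dataset and parameter values are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown single tree: low bias on the training data, high variance on new data
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

# Random forest: many trees built on bootstrap samples with random feature subsets, combined by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("tree   train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test accuracy:", forest.score(X_train, y_train), forest.score(X_test, y_test))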
How does Random Forest work in Regression?
How does Random Forest work in Classification?
Benefits of Random Forest
Use cases of Random Forest
Naïve Bayes classifier
Background
Classification algorithms that differentiate between classes on the basis of definite decision boundaries.
Classification algorithms that learn boundaries between classes.
Classification algorithms that construct decision boundaries separating classes are called discriminative models.
Background
What is Probability?
Probability explained through an example
John’s emails have multiple occurrences of the word ‘Lottery’. Let’s analyze them closely.
Analyze Emails with word “lottery”
Let us consider two simple events in Emails
Appearance of “lottery” in spam and genuine emails
Compute probability of word ‘lottery’ appearing in emails
Let us explore different types of probabilities…
Types of Probabilities: Joint Probability
Venn Diagram for representing count of events
Let us compute joint probability of word ‘lottery’ appearing in spam
Types of Probabilities: Marginal Probability
Types of Probabilities: Conditional Probability
Predicts X belongs to Ci if the probability P(Ci|X) is the highest among all the
P(Ck|X) for all the k classes.
Practical difficulty: requires initial knowledge of many probabilities.
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
        = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
        = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j) \, P(v_j)

P(C \mid X) = \frac{P(C) \, P(X \mid C)}{P(X)}
P(yes) = 9/14
P(no) = 5/14
Bayesian Classifier – Probabilities for the weather data
Frequency Tables
Likelihood Tables
Bayesian Classifier – Predicting a new day
P(outlook=overcast|No)=0
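A small sketch of how such a prediction is computed with Bayes' rule, using the priors above and the outlook likelihoods from the weather-data likelihood table later in this section (sunny: 2/9 for yes, 3/5 for no); the zero likelihood P(outlook=overcast | No) = 0 noted above is the usual motivation for Laplace smoothing (adding 1 to every count):

# Which class is more probable for a day with outlook = sunny?
p_yes, p_no = 9/14, 5/14                 # priors P(Play=yes), P(Play=no)
p_sunny_yes, p_sunny_no = 2/9, 3/5       # likelihoods P(sunny|yes), P(sunny|no)

p_sunny = p_sunny_yes * p_yes + p_sunny_no * p_no   # evidence P(sunny)
print("P(yes | sunny) =", round(p_sunny_yes * p_yes / p_sunny, 3))   # 0.4
print("P(no  | sunny) =", round(p_sunny_no  * p_no  / p_sunny, 3))   # 0.6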
P(C_k \mid X) = \frac{P(X \mid C_k) \, P(C_k)}{P(X)}
Likelihood: P(X \mid C_k)
Continuous variable: P(x \mid C) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
Bayesian Classifier – Dealing with numeric attributes
EXAMPLE-I
Department   status   age        salary
sales        senior   31...35    41K...45K
sales        junior   26...30    26K...30K
sales        junior   31...35    31K...35K
systems      junior   21...25    31K...35K
systems      senior   31...35    66K...70K
systems      junior   26...30    31K...35K
systems      senior   41...45    66K...70K
marketing    senior   26...30    46K...50K
marketing    junior   31...35    41K...45K
secretary    senior   46...50    41K...45K
secretary    junior   26...30    26K...30K
Given a data tuple having the values “sunny”, 66, 89 and “true” for the attributes outlook, temp.,
humidity and windy respectively, what would be a naive Bayesian classification of the Play for the
given tuple?
Example- continuous attributes
The numeric weather data with summary statistics
Outlook:       sunny 2 (yes) / 3 (no),  overcast 4 / 0,  rainy 3 / 2
Temperature:   yes: 83, 70, 68, 64, 69, 75, 75, 72, 81   no: 85, 80, 65, 72, 71
Humidity:      yes: 86, 96, 80, 65, 70, 80, 70, 90, 75   no: 85, 90, 70, 95, 91
Windy:         false 6 (yes) / 2 (no),  true 3 / 3
Play:          yes 9,  no 5

Likelihoods and summary statistics:
Outlook:       sunny 2/9, 3/5   overcast 4/9, 0/5   rainy 3/9, 2/5
Temperature:   mean 73 (yes), 74.6 (no)   std dev 6.2 (yes), 7.9 (no)
Humidity:      mean 79.1 (yes), 86.2 (no)   std dev 10.2 (yes), 9.7 (no)
Windy:         false 6/9, 2/5   true 3/9, 3/5
Play:          9/14, 5/14
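A sketch of the full naive Bayesian computation for the query tuple asked about earlier (outlook=sunny, temp=66, humidity=89, windy=true), combining the nominal likelihoods with the Gaussian density using the means and standard deviations from the table above:

import math

def gaussian(x, mean, std):
    """P(x | C) modeled with the normal density N(mean, std^2)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Values read off the likelihood table above
score_yes = (9/14) * (2/9) * gaussian(66, 73.0, 6.2) * gaussian(89, 79.1, 10.2) * (3/9)
score_no  = (5/14) * (3/5) * gaussian(66, 74.6, 7.9) * gaussian(89, 86.2, 9.7) * (3/5)

total = score_yes + score_no
print("P(play=yes | X) ~", round(score_yes / total, 3))
print("P(play=no  | X) ~", round(score_no  / total, 3))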
ANN vs. Brain
ANN: simple (relatively few neurons and connections)
Brain: complex (about 10^11 neurons and 10^15 connections)
(Figure: a feed-forward network with inputs x1 ... x5, an input layer, a hidden layer, and an output layer.)
θ_j is the bias of the unit. The bias acts as a threshold, used to adjust the output along with the weighted sum of the inputs to the neuron. The bias is therefore a constant that helps the model fit the given data better.
ANN
(Figure: a single-node "black box" network with inputs X1, X2, X3, each with weight 0.3, feeding an output node Y with threshold t = 0.4.)

X1 X2 X3  Y
1  0  0   0
1  0  1   1
1  1  0   1
1  1  1   1
0  0  1   0
0  1  0   0
0  1  1   1
0  0  0   0
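The output in this example is simply a thresholded weighted sum; a short sketch in Python that reproduces the table:

def output(x1, x2, x3, w=0.3, t=0.4):
    """Y = 1 if the weighted sum of the inputs exceeds the threshold t, else 0."""
    return 1 if w * x1 + w * x2 + w * x3 > t else 0

for row in [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1),
            (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0)]:
    print(row, "->", output(*row))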
(Figure: multi-layer feed-forward network with inputs x1 ... x5, a hidden layer, and an output y; weights w_ij connect unit i to unit j.)
Net input to unit j: I_j = \sum_i w_{ij} O_i + \theta_j
Output of unit j (sigmoid): O_j = \frac{1}{1 + e^{-I_j}}
How A Multi-Layer Neural Network Works?
The inputs to the network correspond to the attributes measured for each
training tuple
Inputs are fed simultaneously into the units making up the input layer
The weighted outputs of the last hidden layer are input to units making up the
output layer, which gives out the network's prediction
(Figure: errors are propagated backwards from the output layer through the hidden layer toward the input layer.)
For a unit j in the output layer: Err_j = O_j (1 - O_j)(T_j - O_j)
For a unit j in a hidden layer: Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}
Weight update: w_{ij} = w_{ij} + (l) \, Err_j \, O_i
Bias update: \theta_j = \theta_j + (l) \, Err_j
where (l) is the learning rate.
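A minimal sketch of one forward pass and one backpropagation update for a single training tuple, assuming a tiny 3-2-1 network; the initial weights, input, target, and learning rate l are illustrative values only (NumPy required):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

l = 0.5                                   # learning rate (illustrative)
x = np.array([1.0, 0.0, 1.0])             # input tuple
t = np.array([1.0])                       # target output T_j

W1 = np.array([[0.2, -0.3], [0.4, 0.1], [-0.5, 0.2]])   # input -> hidden weights w_ij
b1 = np.array([-0.4, 0.2])                               # hidden biases theta_j
W2 = np.array([[-0.3], [-0.2]])                          # hidden -> output weights w_jk
b2 = np.array([0.1])                                     # output bias

# Forward pass: I_j = sum_i w_ij O_i + theta_j,  O_j = sigmoid(I_j)
h = sigmoid(x @ W1 + b1)
o = sigmoid(h @ W2 + b2)

# Backpropagate the error
err_o = o * (1 - o) * (t - o)             # output unit: Err_j = O_j(1-O_j)(T_j - O_j)
err_h = h * (1 - h) * (W2 @ err_o)        # hidden unit: Err_j = O_j(1-O_j) * sum_k Err_k w_jk

# Update weights and biases: w_ij += l * Err_j * O_i,  theta_j += l * Err_j
W2 += l * np.outer(h, err_o)
b2 += l * err_o
W1 += l * np.outer(x, err_h)
b1 += l * err_h
print("output after one update:", sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2))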
Strength
High tolerance to noisy data, as well as the ability to classify patterns on which they have not been trained.
They are well-suited for continuous-valued inputs and outputs, unlike most decision
tree algorithms.
They have been successful on a wide array of real-world data, including
handwritten character recognition, pathology and laboratory medicine, and training
a computer to pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be
used to speed up the computation process.
These above factors contribute toward the usefulness of neural networks for
classification and prediction in machine learning.
Lazy: less time in training but more time in predicting, so lazy learners can be computationally expensive at prediction time.
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified.
Typical approaches
Case-based reasoning
Uses symbolic representations and knowledge-based inference.
The k-Nearest Neighbor Algorithm
The k-nearest-neighbor method was first described in the early 1950s.
It has since been widely used in the area of pattern recognition.
(b) What would be the class assigned to this test instance for K=3?
KNN assigns a test instance the target class associated with the
majority of the test instance’s K nearest neighbors. For K=3, this test
instance would be predicted negative. Out of its three nearest
neighbors, two are negative and one is positive.
Advantages of KNN
Example of application of KNN
KNN (K Nearest Neighbor) in a nutshell
How does KNN work?
Let’s take a simple example of Classification
Step 1: Build neighborhood
Step 2: Find distance from query point to each point in neighborhood
FYI: Distance measures for continuous data
Step 3: Assign to class
Classification with KNN: Loan default data
Step 1: Build neighborhood
Classification with KNN: Build neighborhood
Step 2: Measure distance from each data point
Step 2: Graphical representation of distance
Step 3: Assign to class based on majority vote
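A compact sketch of these three steps for classification, assuming small in-memory lists of points; the loan-default feature values below are illustrative, not taken from the slides:

import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (feature_vector, class_label) pairs."""
    # Steps 1-2: compute the distance from the query point to every training point
    distances = sorted(
        (math.dist(query, features), label) for features, label in training)
    # Step 3: majority vote among the k nearest neighbors
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Illustrative loan data: (age, income in thousands) -> default?
training = [((25, 40), "default"), ((35, 60), "no default"), ((45, 80), "no default"),
            ((22, 35), "default"), ((52, 110), "no default")]
print(knn_classify((30, 45), training, k=3))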
KNN for Regression: Let’s work with the Loan data set
Step 1: Define ‘K’ (number of neighbors)
Step 2: Measure distance from each data point
Regression with KNN: Predict income of query point
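For regression the same neighborhood is used, but the prediction is the average of the neighbors' target values rather than a majority vote; a short sketch with illustrative numbers:

import math

def knn_regress(query, training, k=3):
    """training: list of (feature_vector, numeric_target) pairs; returns mean of k nearest targets."""
    distances = sorted((math.dist(query, features), target) for features, target in training)
    nearest = [target for _, target in distances[:k]]
    return sum(nearest) / len(nearest)

# Illustrative loan data: (age, loan amount in thousands) -> income in thousands
training = [((25, 40), 30), ((35, 60), 55), ((45, 80), 70), ((22, 35), 28), ((52, 110), 95)]
print(knn_regress((30, 45), training, k=3))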
What should be the value of K?
Case Study: Identify whether a website is malicious or not
Identify whether a website is malicious or not: Data Attributes
Metrics for Performance Evaluation of a Classifier
Focus on the predictive capability of a model
Rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:
                              PREDICTED CLASS
                              Class=Yes (Positive)    Class=No (Negative)
ACTUAL   Class=Yes (Positive)    a (TP)                  b (FN)
CLASS    Class=No  (Negative)    c (FP)                  d (TN)
False positive rate (FP rate) = c / (c + d) = FP / (FP + TN)
Contd…
Precision (P) = a / (a + c) = TP / (TP + FP)
Example
Suppose we train a model to predict whether an email is Spam or
Not Spam. After training the model, we apply it to a test set of 500
new email messages (also labeled) and the model produces the
contingency matrix below.
(a) Compute the precision of this model with respect to the Spam class.
Precision with respect to SPAM = # correctly predicted as SPAM / #
predicted as SPAM
= 70 / (70 + 10) = 70 / 80 = 0.875.
Contd…
(b) Compute the recall of this model with respect to the Spam class.
High recall and low precision with respect to SPAM would mean: the model filters out all the SPAM emails, but also incorrectly classifies some genuine emails as SPAM, i.e. false positives (genuine emails falsely rejected).
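A small sketch of both computations; the true-positive and false-positive counts come from the example above, while the false-negative count needed for recall is not given in the text, so a placeholder value is used purely for illustration:

def precision(tp, fp):
    """Fraction of messages predicted as SPAM that really are SPAM."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual SPAM messages that the model catches."""
    return tp / (tp + fn)

tp, fp = 70, 10    # from the example: 70 correctly predicted SPAM, 10 genuine predicted as SPAM
fn = 20            # hypothetical count of SPAM predicted as Not Spam (not given in the text)

print("precision =", precision(tp, fp))   # 0.875
print("recall    =", recall(tp, fn))      # depends on the assumed fn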
End of Presentation