Supervised Learning
Algorithms
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Model selection
Summary
Objectives
Learn basic techniques for data classification and prediction.
Understand the difference between the following:
supervised classification
prediction
unsupervised classification
What is Classification?
The goal of data classification is to organize and categorize data
in distinct classes.
A model is first created.
The model is then used to classify new data.
Given the model, a class can be predicted for new
data.
Classification = prediction for discrete and nominal values
What is Prediction?
The goal of prediction is to forecast or deduce the value of an attribute based
on values of other attributes.
A model is first created based on the data distribution.
The model is then used to predict future or unknown values
In Machine Learning:
If forecasting a discrete value → Classification
If forecasting a continuous value → Prediction
Classification Example
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Age is numeric, Car-type is a categorical attribute
Class label indicates whether the person bought the product
Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Regression (Prediction) Example
Example training database
Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
Spent indicates how much the person spent during a recent visit to the web site
Dependent attribute is numerical

Age  Car  Spent
20   M    $200
30   M    $150
25   T    $300
30   S    $220
40   S    $400
20   T    $80
30   M    $100
25   M    $125
40   M    $500
20   S    $420
Supervised and Unsupervised
Supervised Classification = Classification
We know the class labels and the number of classes
Unsupervised Classification = Clustering
We do not know the class labels and may not know the
number of classes
Preparing Data Before Classification
Data transformation:
Discretization of continuous data
Normalization to [-1..1] or [0..1]
Data Cleaning:
Smoothing to reduce noise
Relevance Analysis:
Feature selection to eliminate irrelevant attributes
Applications
Credit approval
Target marketing
Medical diagnosis
Defective parts identification in manufacturing
Crime zoning
Treatment effectiveness analysis
Classification is a 3-step process
1. Model construction (Learning):
Each tuple is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label.
The set of all tuples used for construction of the model is
called training set.
The model is represented in one of the following forms:
Classification rules (IF-THEN statements)
Decision tree
Mathematical formulae
1. Classification Process (Learning)
Training Data (class label: Credit rating):

Name   Income  Age       Credit rating
Samir  Low     <30       bad
Ahmed  Medium  [30..40]  good
Salah  High    <30       good
Ali    Medium  >40       good
Sami   Low     [30..40]  good
Emad   Medium  <30       bad

Classification Method → Classification Model, e.g.:
IF Income = 'High' OR Age > 30 THEN Class = 'Good'
OR a Decision Tree
OR a Mathematical Formula
Classification is a 3-step process
2. Model Evaluation (Accuracy):
Estimate accuracy rate of the model based on a test set.
The known label of test sample is compared with the classified
result from the model.
Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
Test set is independent of the training set; otherwise over-fitting will
occur.
2. Classification Process (Accuracy Evaluation)
Test Data (known class vs. model prediction):

Name   Income  Age       Credit rating  Model
Naser  Low     <30       Bad            Bad
Lutfi  Medium  <30       Bad            good
Adel   High    >40       good           good
Fahd   Medium  [30..40]  good           good

Accuracy = 75% (3 of the 4 test samples are classified correctly)
Classification is a three-step process
3. Model Use (Classification):
The model is used to classify unseen objects.
Give a class label to a new tuple
Predict the value of an attribute (prediction)
3. Classification Process (Use)
A new, unseen tuple is fed to the Classification Model:

Name   Income  Age   Credit rating
Adham  Low     <30   ?
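As a concrete illustration of the three steps, here is a minimal sketch (not part of the original slides). It assumes scikit-learn and pandas are available; the tiny credit-rating data simply mirrors the example tables above, and the encoder/model choices are just one reasonable setup.

```python
# Minimal sketch of the 3-step classification process (illustrative only).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction (learning) from a labeled training set
train = pd.DataFrame({
    "Income": ["Low", "Medium", "High", "Medium", "Low", "Medium"],
    "Age":    ["<30", "30-40", "<30", ">40", "30-40", "<30"],
    "Credit": ["bad", "good", "good", "good", "good", "bad"],
})
enc = OrdinalEncoder()
X_train = enc.fit_transform(train[["Income", "Age"]])
model = DecisionTreeClassifier().fit(X_train, train["Credit"])

# Step 2: model evaluation on an independent test set
test = pd.DataFrame({
    "Income": ["Low", "Medium", "High", "Medium"],
    "Age":    ["<30", "<30", ">40", "30-40"],
    "Credit": ["bad", "bad", "good", "good"],
})
X_test = enc.transform(test[["Income", "Age"]])
print("accuracy:", accuracy_score(test["Credit"], model.predict(X_test)))

# Step 3: model use -- classify a previously unseen tuple
new_tuple = pd.DataFrame({"Income": ["Low"], "Age": ["<30"]})
print("predicted class:", model.predict(enc.transform(new_tuple))[0])
```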
Classification Methods
Decision Tree Induction
Neural Networks
Bayesian Classification
Association-Based Classification
K-Nearest Neighbour
Case-Based Reasoning
Genetic Algorithms
Rough Set Theory
Fuzzy Sets
Etc.
Comparing Classification and Prediction Methods
Accuracy: the ability of the model to correctly predict the
class label of new or previously unseen data.
classifier accuracy: predicting the class label of
new or previously unseen data.
predictor accuracy: predicting the value of the target
attribute for new or previously unseen data.
Speed (computational cost)
time to construct the model (training time)
time to use the model (classification/prediction
time)
Comparing Classification and Prediction Methods
Robustness: handling noise and missing values
(ability of model to make correct predictions)
Scalability: the ability to construct the model
efficiently given large amounts of data.
Interpretability: the level of understanding and insight
provided by the model (classifier or predictor).
Other measures, e.g., goodness of rules, such as
decision tree size.
Decision Tree
What is a Decision Tree?
A decision tree is a flow-chart-like tree structure.
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf node represents class label
A quick recap of Linear Regression – Linear models
Can Linear Regression help us in this scenario?
How does a Decision Tree come to the rescue?
Simplicity
Feature selection
Handling different types of data
What is a tree?
What is a Decision Tree (DT)?
Use cases of Decision Trees
What is the root node of a Decision Tree?
What is a decision node of a Decision Tree?
What are the leaf nodes of a Decision Tree?
CART (Classification and Regression Tree)
CART: Classification Tree
CART: Regression Tree
How to build a Decision Tree?
ID3 (Iterative Dichotomiser 3)
What is Entropy?
Entropy: It is used to measure disorder in the system. If, in a particular node,
all examples are positive OR all examples are negative (i.e. all examples belong to
the same class), then it is a homogeneous set of examples and entropy is low.
However, if we have two classes and half of the examples belong to one
class and half belong to the other class, then entropy is high.

Entropy(S) = − Σ_{i=1..m} pi · log2(pi)
Entropy of heterogeneous data
Information Gain (IG)

Entropy(S, A) = Σ_{j=1..v} (|Dj| / |D|) · I(Dj)

Gain(S, A) = Entropy(S) − Entropy(S, A)
Calculate Entropy & Information
Gain to build a Decision Tree
Step 1: Let’s calculate Entropy for
entire sample
Step 2: Calculate Entropy for each
column
Entropy(S, A) = Σ_{j=1..v} (|Dj| / |D|) · I(Dj)
Step 3: Calculate Information Gain
Information Gain from all
attributes
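The slides compute these quantities by hand; the following is a small Python sketch (not part of the original material) of the same entropy and information-gain formulas, with a tiny made-up dataset only to show the call.

```python
# Illustrative helpers implementing the entropy and information-gain formulas above.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_j |Dj|/|D| * Entropy(Dj),
    where the Dj are the partitions of the rows by the value of `attribute`."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for value, count in Counter(r[attribute] for r in rows).items():
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += (count / n) * entropy(subset)
    return total - remainder

# Tiny made-up example of the call
rows = [
    {"age": "<=30", "buys": "no"}, {"age": "<=30", "buys": "no"},
    {"age": "31..40", "buys": "yes"}, {"age": ">40", "buys": "yes"},
]
print(information_gain(rows, "age", "buys"))
```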
How does the tree look initially
Build Decision Tree –But what
next?
Build Decision Tree –next is here
How does the tree look finally?
Decision rules
Sample Decision Tree
Excellent customers
Fair customers
[Scatter plot: customers plotted by Income (x-axis: 2000, 6000, 10000) and Age (y-axis: 20, 50, 80)]
Tree: Income < 6K → No;  Income >= 6K → YES
Sample Decision Tree
[Same scatter plot of Age vs. Income, now partitioned by two splits]
Tree: Income < 6K → NO;  Income >= 6K → test Age: Age < 50 → NO, Age >= 50 → Yes
Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually
consists of two phases:
1. Tree construction
At the start, all the training examples are at the root.
Examples are partitioned recursively based on selected
attributes.
2. Tree pruning
Aims at removing tree branches that may reflect
noise in the training data and lead to errors when
classifying test data, thereby improving classification accuracy.
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
    CarType → {Family}, {Sports}, {Luxury}
Binary split: Divides values into two subsets.
Need to find the optimal partitioning.
    CarType → {Sports, Luxury} vs. {Family}   OR   CarType → {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
    Size → {Small}, {Medium}, {Large}
Binary split: Divides values into two subsets.
Need to find the optimal partitioning.
    Size → {Small, Medium} vs. {Large}   OR   Size → {Medium, Large} vs. {Small}
    (a split such as Size → {Small, Large} vs. {Medium} breaks the ordering of the ordinal attribute)
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical
attribute
Static – discretize once at the beginning
Dynamic – ranges can be found by equal
interval bucketing, equal frequency bucketing
(percentiles), or clustering.
Binary Decision: (A < v) or (A ≥ v)
consider all possible splits and find the best cut
Splitting Based on Continuous Attributes
(i) Binary split:     Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
Tree Induction
Greedy strategy.
Split the records based on an attribute test that
optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
How to determine the Best Split
Excellent customers vs. fair customers
[Two candidate splits of the "Customers" node: by Income (<10K vs. >=10K) or by Age (young vs. old); the split that better separates excellent from fair customers is preferred]
Algorithm for Decision Tree Induction
Basic algorithm
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized
in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
There are no remaining attributes for further partitioning
There are no samples left
Classification Algorithms
ID3
Uses information gain
C4.5
Uses Gain Ratio
Decision Tree Induction: Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?        no → no;  yes → yes
  31..40: yes
  >40:    credit rating?  excellent → no;  fair → yes
ID3
Attribute Selection Measure: Information
Gain
Notations:
Let D, the data partition, be a training set of
class-labeled tuples.
Suppose the class label attribute has m distinct
values defining m distinct classes, Ci (for i = 1,
….. , m).
Let Ci,D be the set of tuples of class Ci in D. Let
| D | and |Ci,D | denote the number of tuples in
D and Ci,D, respectively.
Attribute Selection Measure:
Information Gain
Select the attribute with the highest information gain for
current node
Let pi be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |Ci, D|/|D|
Expected information needed to classify a given tuple in D:
(log function base 2 is used since the info. is encoded in bits.)
Info(D) = − Σ_{i=1..m} pi · log2(pi)
Info (D) is just the average amount of information needed to identify the class
label of a tuple in D. Info (D) is also known as the entropy of D.
Now, suppose we were to partition the tuples in D on some attribute A having
v distinct values, (a1, a2, …. , av), as observed from the training data. If A is
discrete-valued, then it gives the v outcomes of a test on A. Attribute A can be
used to split D into v partitions or subsets, (D1, D2, ….., Dv).
Attribute Selection Measure:
Information Gain
Information needed (after using A to split D into v
partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) · I(Dj)
The term |Dj| / |D| acts as the weight of the j th partition. Info A(D) is the
expected information required to classify a tuple from D based on the
partitioning by A.
Information gained by branching on attribute A
Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
In the training data set, the class label attribute,
buys_computer, has two distinct values (namely,
{yes, no}); therefore, there are two distinct classes (m = 2).
Let class P correspond to yes and class N correspond to no.
There are 9 samples of class yes and 5 samples of class
no.
To compute the information gain of each attribute, we first
use Equation 1, to compute the expected information
needed to classify a given sample.
Attribute Selection: Information Gain
Class P: buys_computer = "yes" (9 samples)
Class N: buys_computer = "no" (5 samples)

Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

The term (5/14)·I(2,3) means that "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

Hence Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
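These numbers can be reproduced directly from the class counts in the table. A short sketch (not from the slides), in plain Python 3:

```python
# Reproducing the slide's numbers for Gain(age) from the raw class counts.
from math import log2

def info(*counts):
    """I(c1, c2, ...) = -sum_i p_i * log2(p_i) for the given class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

info_D = info(9, 5)                                                   # 0.940
info_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)  # 0.694
gain_age = info_D - info_age                                          # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```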
Final decision tree:
age?
  <=30:   student?        no → no;  yes → yes
  31..40: yes
  >40:    credit rating?  excellent → no;  fair → yes
Why are decision tree classifiers so popular?
The construction of decision tree classifiers does not require any
domain knowledge or parameter setting, and therefore is appropriate
for knowledge discovery.
Decision trees can handle high dimensional data.
Representation of acquired knowledge in tree form is generally easy
for humans to understand.
The learning and classification steps of decision tree induction are
simple and fast.
In general, decision tree classifiers have good accuracy. However,
successful use may depend on the data at hand.
Decision tree induction algorithms have been used for classification in
many application areas, such as medicine, manufacturing and
production, financial analysis, and molecular biology.
Decision trees are the basis of several commercial rule induction
systems.
Gain Ratio for Attribute Selection
The information gain measure is biased toward tests with many outcomes. That is,
it prefers to select attributes having a large number of values.
For example, consider an attribute that acts as a unique identifier, such as product
ID.
A split on product ID would result in a large number of partitions (as many as
there are values), each one containing just one tuple.
Because each partition is pure, the information required to classify data set D
based on this partitioning would be Infoproduct ID(D) = 0.
Therefore, the information gained by partitioning on this attribute is maximal.
Clearly, such a partitioning is useless for classification.
Gain Ratio for Attribute Selection
C4.5, a successor of ID3, uses an extension to information gain known as gain
ratio.
(normalization to information gain)
SplitInfo_A(D) = − Σ_{j=1..v} (|Dj| / |D|) · log2(|Dj| / |D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

(A test on income splits the data into three partitions, namely low, medium and high, containing four, six and four tuples.)

Ex. SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
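The same gain-ratio computation as a small Python sketch (not from the slides); the partition sizes 4, 6, 4 and Gain(income) = 0.029 come from the slides above.

```python
# Computing SplitInfo and GainRatio for the income attribute (illustrative only).
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j |Dj|/|D| * log2(|Dj|/|D|)."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s)

si_income = split_info([4, 6, 4])   # ~1.557
gain_income = 0.029                 # from the information-gain slide
print("GainRatio(income) =", round(gain_income / si_income, 3))
```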
Comparing Attribute Selection Measures
Information gain:
biased towards multi valued attributes
Gain ratio:
tends to prefer unbalanced splits in which one
partition is much smaller than the others
Challenge with Decision Tree
models
Random Forest helps overcome this
challenge
Let's understand what overfitting is
How overfitting causes a challenge in Decision Trees
Random Forest to the rescue
R2 is a measure of the goodness of fit of a model.
Random Forest
What is Bagging?
Types of Ensemble learning
What is ensemble learning?
Ensemble
Why Random Forest is called
Random?
Row level randomness in Random
Forest
Column level randomness in
Random Forest
Example
RS: Row sample
FS: Feature/Column sample
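As a hedged illustration (not from the slides), the two kinds of randomness map onto scikit-learn's RandomForestClassifier parameters; the dataset here is synthetic.

```python
# Row-level (RS) and column-level (FS) randomness in a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # row-level randomness: each tree sees a bootstrap sample
    max_features="sqrt",  # column-level randomness: random feature subset per split
    random_state=0,
).fit(X, y)

print("training accuracy:", forest.score(X, y))
```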
Low Bias: if we grow the decision tree to its complete depth, it fits the training dataset
very closely, so the training error will be very low.
High Variance: whenever we get new test data, such a fully grown decision tree is prone to give
a large amount of error.
How does Random Forest work in
Regression?
How does Random Forest work in
Classification?
Benefits of Random Forest
Use cases of Random Forest
Naïve Bayes classifier
Naïve Bayes classifier
Background
Classification algorithms that differentiate between classes on the basis of
definite decision boundaries.
Classification algorithms that learn boundaries between classes.
Classification algorithms that construct decision boundaries separating
classes are called discriminative models.
Background
What if we differentiate between two classes by
analyzing probability distribution of data…
What is Naïve Bayes?
A Naive Bayes classifier is an algorithm that learns
the probability that an object with certain
features belongs to a particular group or class.
Where is Naïve Bayes used?
Advantages of Naïve Bayes
Basics of Probability
What is Probability?
What is Probability?
Probability explained through an
example
John’s emails have multiple occurrences
of the word ‘Lottery’. Let’s analyze them
closely..
Analyze Emails with word “lottery”
Let us consider two simple events..
Let us consider two simple events in
Emails
Appearance of “lottery” in spam and
genuine emails
Compute probability of word ‘lottery’
appearing in emails
Let us explore different types of
probabilities…
Types of Probabilities: Joint Probability
Types of Probabilities: Joint Probability
Venn Diagram for representing count of
events
Let us compute joint probability of word
‘lottery’ appearing in spam
Types of Probabilities: Marginal
Probability
Types of Probabilities: Marginal
Probability
Types of Probabilities: Conditional
Probability
Types of Probabilities: Conditional
Probability
Probability of an event given that another event has already occurred
is called conditional probability.
Conditional Probability
For example, suppose you go out for lunch at the same
place and time every Friday and you are served lunch
within 15 minutes with probability 0.9. However, given
that you notice that the restaurant is exceptionally busy,
the probability of being served lunch within 15 minutes
may reduce to 0.7. This is the conditional probability of
being served lunch within 15 minutes given that the
restaurant is exceptionally busy.
The usual notation for "event A occurs given that event B
has occurred" is "A | B" (A given B). The symbol | is a
vertical line and does not imply division.
P(A | B) denotes the probability that event A will occur
given that event B has occurred already.
Conditional Probability
A rule that can be used to determine a conditional
probability from unconditional probabilities is:
P(A | B) = P(A ∩ B) / P(B)
where:
P(A | B) = the (conditional) probability that event A will
occur given that event B has occurred already.
P(A ∩ B) = the (unconditional) probability that event A and
event B both occur.
P(B) = the (unconditional) probability that event B occurs.
Naive Bayes classifier
Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes’ Theorem.
It is based on the simplifying assumption that the attribute
values are conditionally independent given the class.
A naive Bayes classifier assumes that the presence (or
absence) of a particular feature of a class is unrelated to
the presence (or absence) of any other feature, given the
class variable.
Naive Bayes classifier
For example, a fruit may be considered to be an apple if it
is red, round, and about 4" in diameter. A naive Bayes
classifier considers all these features to contribute
independently to the probability that this fruit is an apple,
whether or not they're in fact related to each other or to
the existence of the other features.
This significantly reduces the computation cost, since
calculating each of the P(ai | vj) requires only a
frequency count over the tuples in the training data with
class value equal to vj.
Bayes Theorem : Basics
Let X be a data sample : class label is unknown
Let H be a hypothesis that X belongs to a specified class C
For classification problems, we want to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income or any other
information, for that matter.
P(H|X) (posterior probability), the probability that the
hypothesis H holds given the observed data sample X
For example, suppose our world of data tuples is confined to customers
described by the attributes age and income, respectively,
and that X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer
given that we know the customer’s age and income.
Bayesian Theorem
Given data X, posteriori probability of a hypothesis H, P(H|X), follows the
Bayes theorem
P(H | X) = P(X | H) · P(H) / P(X)
P(X|H) is the descriptor posterior probability of X conditioned on H.
That is, it is the probability that a customer, X, is 35 years old and
earns $40,000, given that we know the customer will buy a computer.
Predicts X belongs to Ci if the probability P(Ci|X) is the highest among all the
P(Ck|X) for all the k classes.
Practical difficulty: require initial knowledge of many probabilities.
Bayesian Theorem
Assume a target function f : X → Y (a function f with domain X and
codomain Y). The elements of X are called arguments of f. For each
argument x, the corresponding unique y in the codomain is called the
function value at x, or the image of x under f.
Each instance X is described by attributes <a1, a2, a3, …, an>.
The most probable value of f(X) is vMAP.
Using Bayes' theorem we can write the expression as:

vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
     = argmax_{vj ∈ V} P(a1, a2, …, an | vj) · P(vj) / P(a1, a2, …, an)
     = argmax_{vj ∈ V} P(a1, a2, …, an | vj) · P(vj)

With the naive conditional-independence assumption this becomes:

vMAP = argmax_{vj ∈ V} P(vj) · Π_i P(ai | vj)

The denominator does not depend on the choice of vj and thus can be
omitted from the argmax argument.
Bayesian Theorem
In mathematics, argmax stands for
the argument of the maximum, that is to say,
the set of points of the given argument for which
the given function attains its maximum value.
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n- dimensional
attribute vector X = (x1, x2,…, xn), showing n
measurements made on the tuple from n attributes.
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes’ theorem
P(Ci | X) = P(X | Ci) · P(Ci) / P(X)
Since P(X) is constant for all classes, only
P(Ci | X) ∝ P(X | Ci) · P(Ci)
needs to be maximized.
Bayesian Classifier – Basic Equation
P(C | X) = P(C) · P(X | C) / P(X)

P(C): class prior probability            P(X | C): descriptor posterior probability (likelihood)
P(C | X): class posterior probability    P(X): descriptor prior probability
Naïve Bayesian Classifier: Training Dataset
Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
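A small sketch (not from the slides) reproducing this naive Bayes computation from the conditional probabilities listed above, in plain Python 3:

```python
# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit=fair).
p_yes, p_no = 9/14, 5/14

# P(attribute value | class), taken from the counts listed above
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)

score_yes = likelihood_yes * p_yes   # ~0.028
score_no  = likelihood_no * p_no     # ~0.007
print("buys_computer =", "yes" if score_yes > score_no else "no")
```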
Training Data
Outlook Temp Humidity Windy Play?
sunny hot high FALSE No
sunny hot high TRUE No
overcast hot high FALSE Yes
rainy mild high FALSE Yes
rainy cool normal FALSE Yes
rainy cool Normal TRUE No
overcast cool Normal TRUE Yes
sunny mild High FALSE No
sunny cool Normal FALSE Yes
rainy mild Normal FALSE Yes
sunny mild normal TRUE Yes
overcast mild High TRUE Yes
overcast hot Normal FALSE Yes
rainy mild high TRUE No
P(yes) = 9/14
P(no) = 5/14
Bayesian Classifier – Probabilities for the weather data
Frequency Tables
Outlook  | No | Yes      Temp. | No | Yes      Humidity | No | Yes      Windy | No | Yes
Sunny    | 3  | 2        Hot   | 2  | 2        High     | 4  | 3        False | 2  | 6
Overcast | 0  | 4        Mild  | 2  | 4        Normal   | 1  | 6        True  | 3  | 3
Rainy    | 2  | 3        Cool  | 1  | 3

Likelihood Tables
Outlook  | No  | Yes     Temp. | No  | Yes     Humidity | No  | Yes     Windy | No  | Yes
Sunny    | 3/5 | 2/9     Hot   | 2/5 | 2/9     High     | 4/5 | 3/9     False | 2/5 | 6/9
Overcast | 0/5 | 4/9     Mild  | 2/5 | 4/9     Normal   | 1/5 | 6/9     True  | 3/5 | 3/9
Rainy    | 2/5 | 3/9     Cool  | 1/5 | 3/9
Bayesian Classifier – Predicting a new day
Outlook  Temp.  Humidity  Windy  Play
sunny    cool   high      true   ?        ← new day X: which class?

P(yes|X) = p(sunny|yes) × p(cool|yes) × p(high|yes) × p(true|yes) × p(yes)
         = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053  =>  0.0053 / (0.0053 + 0.0206) = 0.205
P(no|X)  = p(sunny|no) × p(cool|no) × p(high|no) × p(true|no) × p(no)
         = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206  =>  0.0206 / (0.0053 + 0.0206) = 0.795
Bayesian Classifier – zero frequency problem
What if a descriptor value doesn’t occur with every class value
P(outlook=overcast|No)=0
Remedy: add 1 to the count for every descriptor-class combination
(Laplace Estimator)
Outlook  | No  | Yes      Temp. | No  | Yes      Humidity | No  | Yes      Windy | No  | Yes
Sunny    | 3+1 | 2+1      Hot   | 2+1 | 2+1      High     | 4+1 | 3+1      False | 2+1 | 6+1
Overcast | 0+1 | 4+1      Mild  | 2+1 | 4+1      Normal   | 1+1 | 6+1      True  | 3+1 | 3+1
Rainy    | 2+1 | 3+1      Cool  | 1+1 | 3+1
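A minimal sketch of the Laplace estimator (not from the slides): add 1 to every descriptor-class count so no conditional probability becomes zero. The Outlook-given-No counts are taken from the tables above.

```python
# Add-1 (Laplace) smoothing of conditional probabilities.
outlook_counts_no = {"sunny": 3, "overcast": 0, "rainy": 2}

def laplace_likelihoods(counts, alpha=1):
    """P(value | class) with add-alpha smoothing over the value counts."""
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (c + alpha) / total for v, c in counts.items()}

print(laplace_likelihoods(outlook_counts_no))
# P(overcast | No) is now 1/8 instead of 0/5
```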
Bayesian Classifier – General Equation
P(Ck | X) = P(X | Ck) · P(Ck) / P(X)

Likelihood: P(X | Ck)
Continuous variable (Gaussian density):
P(x | C) = 1 / (2πσ²)^(1/2) · exp( −(x − μ)² / (2σ²) )
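The Gaussian likelihood as a small helper function (not from the slides); as an illustration it is applied to the temperature statistics for Play = yes (mean 73, std dev 6.2) given later in the numeric weather example.

```python
# Gaussian likelihood for a continuous attribute.
from math import exp, pi, sqrt

def gaussian_likelihood(x, mean, std):
    """P(x | C) for a normally distributed attribute."""
    return (1.0 / (sqrt(2 * pi) * std)) * exp(-((x - mean) ** 2) / (2 * std ** 2))

print(gaussian_likelihood(66, mean=73.0, std=6.2))   # p(temp = 66 | yes)
```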
Bayesian Classifier – Dealing with numeric attributes
EXAMPLE-I
Department status age salary
Sales senior 31. . .35 41K.. .45K
Sales junior 26. . .30 26K.. .30K
Sales junior 31. . .35 31K.. .35K
systems junior 21. . .25 31K.. .35K
systems senior 31. . .35 66K.. .70K
systems junior 26. . .30 31K.. .35K
systems senior 41. . .45 66K.. .70K
marketing senior 26. . .30 46K.. .50K
marketing junior 31. . .35 41K.. .45K
secretary senior 46. . .50 41K.. .45K
secretary junior 26. . .30 26K.. .30K
Define Bayesian Classification .Given a data tuple having the values
“systems”, “26. . . 30”, and “41K.. .45K” for the attributes department, age,
and salary, respectively, what would be a naive Bayesian classification of the
status for the given data tuple ?
Example- continuous attributes
Consider the training dataset as shown in below table. Let Play be the class label attribute. There
are two distinct classes, namely, yes and no and two numeric attributes namely “temp” and
“humidity”.
Outlook Temp Humidity Windy Play?
sunny 85 85 FALSE No
sunny 80 90 TRUE No
overcast 83 86 FALSE Yes
rainy 70 96 FALSE Yes
rainy 68 80 FALSE Yes
rainy 65 70 TRUE No
overcast 64 65 TRUE Yes
sunny 72 95 FALSE No
sunny 69 70 FALSE Yes
rainy 75 80 FALSE Yes
sunny 75 70 TRUE Yes
overcast 72 90 TRUE Yes
overcast 81 75 FALSE Yes
rainy 71 91 TRUE No
Given a data tuple having the values “sunny”, 66, 89 and “true” for the attributes outlook, temp.,
humidity and windy respectively, what would be a naive Bayesian classification of the Play for the
given tuple?
Example- continuous attributes
The numeric weather data with summary statistics
Outlook (counts / likelihoods):
  sunny:    yes 2 (2/9),  no 3 (3/5)
  overcast: yes 4 (4/9),  no 0 (0/5)
  rainy:    yes 3 (3/9),  no 2 (2/5)

Temperature:
  yes values: 83, 70, 68, 64, 69, 75, 75, 72, 81   → mean 73,   std dev 6.2
  no  values: 85, 80, 65, 72, 71                   → mean 74.6, std dev 7.9

Humidity:
  yes values: 86, 96, 80, 65, 70, 80, 70, 90, 75   → mean 79.1, std dev 10.2
  no  values: 85, 90, 70, 95, 91                   → mean 86.2, std dev 9.7

Windy (counts / likelihoods):
  false: yes 6 (6/9), no 2 (2/5)
  true:  yes 3 (3/9), no 3 (3/5)

Play: yes 9 (9/14), no 5 (5/14)
Artificial Neural Networks
(ANN)
Neural Networks -Origin
Best learning system known to us?
How does the brain work?
In the brain, a neuron has three principal components:
1. Dendrites:- that carry electrical signals into the cell body.
2. Cell Body:- effectively sums and thresholds these incoming signals.
3. Axon:- is a single long fiber that carries the signal from the cell body out to other
neurons.
4. The point of contact between an axon of one cell and a dendrite of another cell is
called a ‘synapse’
Background: ANN Vs Brain
ANN                                        Brain
Simple (few neurons and connections)       Complex (~10^11 neurons and ~10^15 connections)
Dedicated to a specific purpose            Generalized for all purposes
Response time is fast (nanoseconds)        Response time is slow (milliseconds)
Design is regular                          Design is arbitrary
Activities are synchronous                 Activities are asynchronous
What is a neural network?
Perceptron
Perceptrons can only model linearly separable functions.
We need to use a multi-layer perceptron to tackle non-linear problems.
Perceptron
Activation Functions
Multi Layer Perceptron
Multi Layer Perceptron
General Structure of ANN
[Diagram: inputs x1 … x5 feed the input layer, which connects to a hidden layer and then to the output layer]
θj is the bias of the unit. The bias acts as a threshold, which is used to adjust
the output along with the weighted sum of the inputs to the neuron. Therefore the
bias is a constant which helps the model fit the given data as well as possible.
ANN
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Black box: inputs X1, X2, X3 → output Y]
Output Y is 1 if at least two of the three inputs are equal to 1.
ANN
[Perceptron diagram: input nodes X1, X2, X3, each connected with weight 0.3 to the output node Y, which has threshold t = 0.4]

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 − 0.4 > 0)
where I(z) = 1 if z is true, 0 otherwise
Artificial Neural Networks
Model is an assembly of inter-connected nodes and weighted links
Output node sums up each of its input values according to the weights of its links
The output node's sum is compared against some threshold t
[Diagram: inputs X1, X2, X3 with weights w1, w2, w3 feeding an output node Y with threshold t]

Perceptron Model:
Y = I( Σ_i wi·Xi − t )
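A tiny sketch (not from the slides) of the perceptron from the previous example, with the weights 0.3, 0.3, 0.3 and threshold t = 0.4, in plain Python 3:

```python
# Perceptron with a step (indicator) activation.
def perceptron(x1, x2, x3, weights=(0.3, 0.3, 0.3), t=0.4):
    """Y = 1 if the weighted sum of the inputs exceeds the threshold t."""
    s = weights[0] * x1 + weights[1] * x2 + weights[2] * x3
    return 1 if s - t > 0 else 0

# Reproduces the truth table: Y is 1 when at least two inputs are 1
for row in [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 1)]:
    print(row, "->", perceptron(*row))
```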
Given the net input Ij to unit j, the output Oj of unit j is computed as
Oj = 1 / (1 + e^(−Ij)).
This function is also referred to as a squashing function, because it maps a large
input domain onto the smaller range of 0 to 1.
Where Do The Weights Come From?
Where Do The Weights Come From?
How Do Perceptrons Learn?
Learning Algorithms:
Back propagation for classification
What is backpropagation
Backpropagation is a neural network learning algorithm.
There are many different kinds of neural networks and
neural network algorithms.
The most popular neural network algorithm is
backpropagation, which gained repute in the 1980s.
A multilayer feed-forward network is the type of neural network
on which the backpropagation algorithm performs learning.
Backpropagation learns a set of weights that fits the
training data so as to minimize the mean squared distance
between the network's class prediction and the known
target value of the tuples.
Major Steps for Back Propagation Network
Constructing a network
input data representation
selection of number of layers, number of nodes in each
layer.
Training the network using training data
Pruning the network
Interpret the results
A Multi-Layer Feed-Forward Neural Network
[Diagram: inputs x1 … x5 → input layer → hidden layer (weights wij) → output layer → y]

Net input to unit j:   Ij = Σ_i wij·Oi + θj
Output of unit j:      Oj = 1 / (1 + e^(−Ij))
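A minimal forward-pass sketch of these two formulas (not from the slides); NumPy is assumed, and the weights and bias are made-up illustrative values.

```python
# One forward pass through a single unit.
import numpy as np

def unit_output(inputs, weights, bias):
    """Oj = sigmoid(Ij) with Ij = sum_i wij * Oi + theta_j."""
    net_input = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-net_input))

x = np.array([1.0, 0.0, 1.0])          # example input tuple
w = np.array([0.2, -0.3, 0.4])         # hypothetical weights into unit j
print(unit_output(x, w, bias=-0.4))    # output O_j of the unit
```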
How A Multi-Layer Neural Network Works?
The inputs to the network correspond to the attributes measured for each
training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the
output layer, which gives out the network's prediction
The network is feed-forward in that none of the weights cycles back to an
input unit or to an output unit of a previous layer
Defining a Network Topology
First decide the network topology: # of units in the input layer, # of
hidden layers (if > 1), # of units in each hidden layer, and # of units in
the output layer
Normalizing the input values for each attribute measured in the training
tuples to [0.0—1.0]
One input unit per domain value
Output, if for classification and more than two classes, one output unit
per class is used
If a trained network's accuracy is unacceptable,
repeat the training process with a different network topology or a
different set of initial weights
Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target
value
Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
Backpropagation: Algorithm
Backpropagation
[Same multi-layer network diagram: x1 … x5 → input layer → hidden layer → output layer → y]

For an output unit j:  Errj = Oj·(1 − Oj)·(Tj − Oj)     (Oj = generated value, Tj = correct target value)
For a hidden unit j:   Errj = Oj·(1 − Oj)·Σ_k Errk·wjk
Weight update:         wij = wij + (l)·Errj·Oi
Bias update:           θj = θj + (l)·Errj
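A hedged sketch (not from the slides) of the update rules for a single output unit; the numbers are illustrative only, not taken from the worked example that follows.

```python
# Backpropagation updates for one output unit, following the formulas above.
def backprop_output_unit(o_j, target, inputs, weights, bias, lr):
    """Update the weights and bias feeding output unit j."""
    err_j = o_j * (1 - o_j) * (target - o_j)            # Err_j for an output unit
    new_weights = [w + lr * err_j * o_i for w, o_i in zip(weights, inputs)]
    new_bias = bias + lr * err_j                         # theta_j update
    return err_j, new_weights, new_bias

err, w, b = backprop_output_unit(
    o_j=0.6, target=1.0, inputs=[0.5, 0.2],
    weights=[0.3, -0.1], bias=0.05, lr=0.9)
print(err, w, b)
```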
Example - Sample calculations for learning by
the backpropagation algorithm
The figure shows a multilayer feed-forward neural network.
Let the learning rate be 0.9.
The first training tuple is X = (1, 0, 1), whose class label is 1.
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters, e.g., the network topology or ``structure."
Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of ``hidden units" in the network
Strength
High tolerance to noisy data as well as their ability to classify patterns on which
they have not been trained.
They are well-suited for continuous-valued inputs and outputs, unlike most decision
tree algorithms.
They have been successful on a wide array of real-world data, including
handwritten character recognition, pathology and laboratory medicine, and training
a computer to pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be
used to speed up the computation process.
These above factors contribute toward the usefulness of neural networks for
classification and prediction in machine learning.
K Nearest Neighbor
Lazy vs. Eager Learning
The classification methods —decision tree
induction, Bayesian classification, classification by
backpropagation, support vector machines—are
all examples of eager learners.
Lazy vs. Eager Learning
Lazy vs. eager learning
Lazy learning (instance-based learning): Simply stores
training data (or only minor processing) and waits until it is given a
test tuple.
Eager learning : Given a set of training set, constructs a
classification model before receiving new (e.g., test) data to
classify.
We can think of the learned model as being ready and
eager to classify previously unseen tuples.
Lazy: less time in training but more time in predicting, so lazy learners
can be computationally expensive at classification time.
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified.
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean
space.
Case-based reasoning
Uses symbolic representations and knowledge-
based inference.
The k-Nearest Neighbor Algorithm
The k-nearest-neighbor method was first described in the early 1950s.
It has since been widely used in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is,
by comparing a given test tuple with training tuples that are similar to
it.
The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space.
In this way, all of the training tuples are stored in an n-dimensional
pattern space.
When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are closest to
the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple.
The k-Nearest Neighbor Algorithm
“Closeness” is defined in terms of a distance metric, such
as Euclidean distance. The Euclidean distance between
two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( Σ_{i=1..n} (x1i − x2i)² )
For k-nearest-neighbor classification, the unknown tuple is
assigned the most common class among its k nearest
neighbors. When k = 1, the unknown tuple is assigned the
class of the training tuple that is closest to it in pattern
space.
Nearest neighbor classifiers can also be used for
prediction, that is, to return a real-valued prediction for a
given unknown tuple.
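A bare-bones k-nearest-neighbor sketch (not from the slides) using the Euclidean distance above and a majority vote; the tiny training set is made up for illustration.

```python
# Minimal k-nearest-neighbor classifier.
from collections import Counter
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    """train: list of (feature_tuple, label); returns the majority label of the
    k training tuples closest to the query point."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((5.0, 5.0), "-"),
         ((6.0, 5.5), "-"), ((5.5, 6.5), "-")]
print(knn_predict(train, query=(5.2, 5.1), k=3))   # -> "-"
```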
Example-
Instance-Based Classification
A KNN classifier assigns a test instance the majority class associated
with its K nearest training instances. Distance between instances is
measured using the Euclidean distance.
Suppose we have the following training set of positive (+) and
negative (-) instances and a single test instance (o).
All instances are projected onto a vector space of two real-valued
features: X and Y.
Contd…
(a) What would be the class assigned to this test instance for K=1 .
KNN assigns a test instance the target class associated with the
majority of the test instance’s K nearest neighbors. For K=1, this test
instance would be predicted negative because its single nearest
neighbor is negative.
(b) What would be the class assigned to this test instance for K=3.
KNN assigns a test instance the target class associated with the
majority of the test instance’s K nearest neighbors. For K=3, this test
instance would be predicted negative. Out of its three nearest
neighbors, two are negative and one is positive.
Advantages of KNN
Example of application of KNN
KNN(K Nearest Neighbor) in a
nutshell
How does KNN work?
Let’s take a simple example of
Classification
Step 1: Build neighborhood
Step 2: Find distance from query point to
each point in neighborhood
FYI: Distance measures for
continuous data
Step 3: Assign to class
Classification with KNN: Loan
default data
Step 1: Build neighborhood
Classification with KNN: Build
neighborhood
Step 2: Measure distance from each
data point
Step 2: Graphical representation of
distance
Step 3: Assign to class based on
majority vote
KNN for Regression: Let's work
with the Loan data set
Step 1: Define ‘K’(number of
neighbors)
Step 2: Measure distance from each
data point
Regression with KNN: Predict
income of Query point
What should be the value of K?
Case Study: Identify whether a
website is malicious or not
Identify whether a website is
malicious or not: Data Attributes
Metrics for Performance Evaluation of
Classifier
Focus on the predictive capability of a model
Rather than how fast it takes to classify or build models,
scalability, etc.
Confusion Matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes      a           b
CLASS    Class=No       c           d

a: TP (true positive)   b: FN (false negative)
c: FP (false positive)  d: TN (true negative)
Metrics for Performance Evaluation of
Classifier
                              PREDICTED CLASS
                              Class=Yes (Positive)   Class=No (Negative)
ACTUAL   Class=Yes (Positive)       a                      b
CLASS    Class=No  (Negative)       c                      d
The entries in the confusion matrix have the
following meaning :
a is the number of correct predictions that an instance is positive,
b is the number of incorrect predictions that an instance is negative,
c is the number of incorrect predictions that an instance is positive, and
d is the number of correct predictions that an instance is negative.
Metrics for Performance Evaluation of
Classifier
The accuracy (AC) is the proportion of the total number of
predictions that were correct. It is determined using the
equation:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Consider a 2-class problem
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
Accuracy is misleading because model does not detect any
class 1 example
Contd…
The recall or true positive rate (TP) is the proportion of
positive cases that were correctly identified, as calculated
using the equation:
TP rate (recall) = a / (a + b) = TP / (TP + FN)
The false positive rate (FP) is the proportion of negative
cases that were incorrectly classified as positive, as
calculated using the equation:
FP rate = c / (c + d) = FP / (FP + TN)
Contd…
The true negative rate (TN) is defined as the
proportion of negative cases that were classified
correctly, as calculated using the equation:
TN rate = d / (c + d) = TN / (TN + FP)
The false negative rate (FN) is the proportion of
positive cases that were incorrectly classified as
negative, as calculated using the equation:
FN rate = b / (a + b) = FN / (FN + TP)
Contd…..
The precision (P) is the proportion of the
predicted positive cases that were correct, as
calculated using the equation:
P = a / (a + c) = TP / (TP + FP)
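The measures above can be computed together from the confusion-matrix entries. This is a small sketch (not from the slides); the counts used correspond to the spam example that follows (70 true positives, 40 false negatives, 10 false positives, 380 true negatives out of 500 emails).

```python
# Accuracy, recall, false-positive rate and precision from confusion-matrix counts.
def classification_metrics(a, b, c, d):
    """a = TP, b = FN, c = FP, d = TN."""
    return {
        "accuracy":  (a + d) / (a + b + c + d),
        "recall":    a / (a + b),        # true positive rate
        "fp_rate":   c / (c + d),
        "precision": a / (a + c),
    }

print(classification_metrics(a=70, b=40, c=10, d=380))
```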
Example
Suppose we train a model to predict whether an email is Spam or
Not Spam. After training the model, we apply it to a test set of 500
new email messages (also labeled) and the model produces the
contingency matrix below.
(a) Compute the precision of this model with respect to the Spam class.
Precision with respect to SPAM = # correctly predicted as SPAM / #
predicted as SPAM
= 70 / (70 + 10) = 70 / 80.
Cond…
(b) Compute the recall of this model with respect to the Spam class.
recall with respect to SPAM = # correctly predicted as SPAM / # truly
SPAM
= 70 / (70 + 40) = 70 / 110.
High precision and low recall with respect to SPAM: whatever
the model classifies as SPAM is probably SPAM. However, many emails
that are truly SPAM are misclassified as NOT SPAM, i.e. false
negatives.
High recall and low precision with respect to SPAM: the model
catches nearly all the SPAM emails, but also incorrectly classifies some genuine
emails as SPAM, i.e. false positives.
End of Presentation