
Supervised Learning Algorithms


 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by back propagation
 Support Vector Machines (SVM)
 Model selection
 Summary


Objectives
 Learn basic techniques for data classification and prediction.
 Understand the difference between the following ways of learning from data:
 supervised classification
 prediction
 unsupervised classification
What is Classification?
 The goal of data classification is to organize and categorize data into distinct classes.
 A model is first created.
 The model is then used to classify new data.
 Given the model, a class can be predicted for new data.
 Classification = prediction of discrete and nominal values

What is Prediction?
 The goal of prediction is to forecast or deduce the value of an attribute based
on values of other attributes.
 A model is first created based on the data distribution.
 The model is then used to predict future or unknown values

 In Machine Learning
 If forecasting discrete value  Classification
 If forecasting continuous value  Prediction
Classification Example
 Example training database
 Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
 Age is numeric, Car-type is a categorical attribute
 Class label indicates whether the person bought the product
 Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No
Regression (Prediction) Example
 Example training database
 Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
 Spent indicates how much the person spent during a recent visit to the web site
 Dependent attribute is numerical

Age  Car  Spent
20   M    $200
30   M    $150
25   T    $300
30   S    $220
40   S    $400
20   T    $80
30   M    $100
25   M    $125
40   M    $500
20   S    $420
Supervised and Unsupervised
 Supervised Classification = Classification
 We know the class labels and the number of classes
 Unsupervised Classification = Clustering
 We do not know the class labels and may not know the number of classes
Preparing Data Before Classification
 Data transformation:
 Discretization of continuous data
 Normalization to [-1..1] or [0..1]
 Data cleaning:
 Smoothing to reduce noise
 Relevance analysis:
 Feature selection to eliminate irrelevant attributes
Applications
 Credit approval
 Target marketing
 Medical diagnosis
 Defective parts identification in manufacturing
 Crime zoning
 Treatment effectiveness analysis
Classification is a 3-step process

 1. Model construction (Learning):
 Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
 The set of all tuples used for construction of the model is called the training set.
 The model is represented in one of the following forms:
 Classification rules (IF-THEN statements)
 Decision trees
 Mathematical formulae
1. Classification Process (Learning)

Training data (class label = Credit rating):

Name   Income  Age       Credit rating
Samir  Low     <30       bad
Ahmed  Medium  [30..40]  good
Salah  High    <30       good
Ali    Medium  >40       good
Sami   Low     [30..40]  good
Emad   Medium  <30       bad

A classification method applied to the training data produces a classification model, e.g.:

IF Income = 'High' OR Age > 30 THEN Class = 'Good'

or a decision tree, or a mathematical formula.
Classification is a 3-step process
2. Model Evaluation (Accuracy):
 Estimate the accuracy rate of the model based on a test set.
 The known label of each test sample is compared with the classified result from the model.
 The accuracy rate is the percentage of test set samples that are correctly classified by the model.
 The test set must be independent of the training set, otherwise over-fitting will occur.
2. Classification Process (Accuracy Evaluation)

Test data (known Credit rating) compared with the model's prediction:

Name   Income  Age       Credit rating  Model prediction
Naser  Low     <30       bad            bad
Lutfi  Medium  <30       bad            good
Adel   High    >40       good           good
Fahd   Medium  [30..40]  good           good

Accuracy = 75%
Classification is a 3-step process
3. Model Use (Classification):
 The model is used to classify unseen objects.
 Give a class label to a new tuple.
 Predict the value of an attribute (prediction).
3. Classification Process (Use)

New tuple presented to the classification model:

Name   Income  Age  Credit rating
Adham  Low     <30  ?
Classification Methods
 Decision Tree Induction
 Neural Networks
 Bayesian Classification
 Association-Based Classification
 K-Nearest Neighbour
 Case-Based Reasoning
 Genetic Algorithms
 Rough Set Theory
 Fuzzy Sets
 Etc.
Comparing Classification and Prediction Methods

 Accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
 Classifier accuracy: predicting the class label of new or previously unseen data.
 Predictor accuracy: guessing the value of the predicted attribute for new or previously unseen data.
 Speed (computational cost)
 Time to construct the model (training time)
 Time to use the model (classification/prediction time)



Comparing Classification and Prediction Methods

 Robustness: handling noise and missing values (the ability of the model to still make correct predictions).
 Scalability: the ability to construct the model efficiently given large amounts of data.
 Interpretability: the level of understanding and insight provided by the model (classifier or predictor).
 Other measures, e.g., goodness of rules, such as decision tree size.



Decision Tree
Example 3
Example 4
What is a Decision Tree?
 A decision tree is a flow-chart-like tree structure.
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf node represents class label
A quick recap of Linear Regression
–Linear models
Can Linear Regression help us in
this scenario?
How does Decision Tree come to
the rescue?
Simplicity
Feature selection
Handling different types of data
What is tree ?
What is Decision Tree(DT) ?
Use cases of Decision Tree
What is Root node of Decision
Tree?
What is decision node of decision
tree ?
What are leaf nodes of Decision
Tree?
CART(Classification and Regression
Tree)
CART(Classification and Regression
Tree)
CART: Classification Tree
CART: Regression Tree
How to build Decision Tree?
ID3( Iterative Dichotomiser 3)
What is Entropy?
Entropy is used to measure disorder in the system. If, in a particular node, all examples are positive OR all examples are negative (i.e. all examples belong to the same class), then it is a homogeneous set of examples and entropy is low.

However, if we have two classes and half of the examples belong to one class and half belong to the other class, then entropy is high.
Entropy(S) = − Σ_{i=1..m} pi log2(pi)
Entropy of heterogeneous data
Information Gain(IG)

Entropy(S, A) = Σ_{j=1..v} (|Dj| / |D|) · I(Dj)

Gain(S, A) = Entropy(S) − Entropy(S, A)
Calculate Entropy & Information
Gain to build a Decision Tree
Step 1: Let’s calculate Entropy for
entire sample
Step 2: Calculate Entropy for each
column
Entropy(S, A) = Σ_{j=1..v} (|Dj| / |D|) · I(Dj)
Step 3: Calculate Information Gain
Information Gain from all
attributes
How does the tree look initially
Build Decision Tree –But what
next?
Build Decision Tree –next is here
How does the tree look finally?
Decision rules
Sample Decision Tree

[Scatter plot: customers plotted by Income (2000-10000) and Age (20-80); excellent customers vs. fair customers.]

Income >= 6K -> YES
Income < 6K  -> split further on Age

Sample Decision Tree

Income < 6K  -> NO
Income >= 6K -> Age < 50  -> NO
                Age >= 50 -> YES
Decision-Tree Classification Methods

 The basic top-down decision tree generation approach usually consists of two phases:

1. Tree construction
 At the start, all the training examples are at the root.
 Examples are partitioned recursively based on selected attributes.

2. Tree pruning
 Aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data -> improves classification accuracy.
How to Specify Test Condition?
 Depends on attribute types
 Nominal
 Ordinal
 Continuous

 Depends on number of ways to split


 2-way split
 Multi-way split
Splitting Based on Nominal Attributes

 Multi-way split: Use as many partitions as distinct values.

CarType: Family / Sports / Luxury

 Binary split: Divides values into two subsets. Need to find the optimal partitioning.

CarType: {Sports, Luxury} vs {Family}   OR   {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

 Multi-way split: Use as many partitions as distinct values.

Size: Small / Medium / Large

 Binary split: Divides values into two subsets. Need to find the optimal partitioning.

Size: {Small, Medium} vs {Large}   OR   {Medium, Large} vs {Small}   OR   {Small, Large} vs {Medium}
Splitting Based on Continuous Attributes

 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static: discretize once at the beginning
 Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
 Binary decision: (A < v) or (A >= v)
 Consider all possible splits and find the best cut.
Splitting Based on Continuous Attributes

(i) Binary split:     Taxable Income > 80K?  -> Yes / No
(ii) Multi-way split: Taxable Income?  -> < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
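As a rough illustration of the binary-decision case, the sketch below (Python; the income values, labels and helper names are hypothetical, not from the slides) evaluates every candidate cut point A < v by weighted entropy and keeps the best one:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try a cut midway between each pair of adjacent distinct sorted values;
    return (best_cut, weighted_entropy_after_split)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < cut]
        right = [lab for v, lab in pairs if v >= cut]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if w < best[1]:
            best = (cut, w)
    return best

# Hypothetical taxable-income values with a Yes/No class label.
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
label = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_binary_split(income, label))
```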


Tree Induction

 Greedy strategy.
 Split the records based on an attribute test that
optimizes certain criterion.

 Issues
 Determine how to split the records
 How to specify the attribute test condition?

 How to determine the best split?

 Determine when to stop splitting


How to determine the Best Split

[Figure: a population of excellent and fair customers can be split either on Income (<10k vs >=10k) or on Age (young vs old); which split separates the classes better?]
Algorithm for Decision Tree Induction

 Basic algorithm
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are discretized
in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
 There are no remaining attributes for further partitioning
 There are no samples left
Classification Algorithms

 ID3
 Uses information gain

 C4.5
 Uses Gain Ratio
Decision Tree Induction: Training Dataset

age income student credit_rating buys_computer


<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no



Output: A Decision Tree for "buys_computer"

age?
 <=30   -> student?
            no  -> no
            yes -> yes
 31..40 -> yes
 >40    -> credit_rating?
            excellent -> no
            fair      -> yes



ID3
Attribute Selection Measure: Information Gain
 Notations:
 Let D, the data partition, be a training set of class-labeled tuples.
 Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m).
 Let Ci,D be the set of tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.



Attribute Selection Measure: Information Gain

 Select the attribute with the highest information gain for the current node.
 Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
 Expected information needed to classify a given tuple in D (the log function base 2 is used since the information is encoded in bits):

Info(D) = − Σ_{i=1..m} pi log2(pi)

 Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Info(D) is also known as the entropy of D.
 Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, (a1, a2, ..., av), as observed from the training data. If A is discrete-valued, then it gives the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, (D1, D2, ..., Dv).



Attribute Selection Measure: Information Gain

 Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)

 The term |Dj| / |D| acts as the weight of the j-th partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A.

 Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)



Attribute Selection: Information Gain

 In the training data set, the class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (m = 2).
 Let class P correspond to yes and class N correspond to no.
 There are 9 samples of class yes and 5 samples of class no.
 To compute the information gain of each attribute, we first use the Info(D) equation above to compute the expected information needed to classify a given sample.



Attribute Selection: Information Gain

 Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Next, compute the expected information requirement for each attribute. For age:

age      pi  ni  I(pi, ni)
<=30     2   3   0.971
31...40  4   0   0
>40      3   2   0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048


 Final decision tree:

age?
 <=30   -> student?
            no  -> no
            yes -> yes
 31..40 -> yes
 >40    -> credit_rating?
            excellent -> no
            fair      -> yes



Why are decision tree classifiers so popular?

 The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for knowledge discovery.
 Decision trees can handle high dimensional data.
 Representation of acquired knowledge in tree form is generally easy for humans to understand.
 The learning and classification steps of decision tree induction are simple and fast.
 In general, decision tree classifiers have good accuracy. However, successful use may depend on the data at hand.
 Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, and molecular biology.
 Decision trees are the basis of several commercial rule induction systems.
Gain Ratio for Attribute Selection

The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values.

For example, consider an attribute that acts as a unique identifier, such as product ID.

A split on product ID would result in a large number of partitions (as many as there are values), each one containing just one tuple.

Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_productID(D) = 0.

Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.
Gain Ratio for Attribute Selection

 C4.5, a successor of ID3, uses an extension to information gain known as gain ratio (a normalization of information gain):

SplitInfo_A(D) = − Σ_{j=1..v} (|Dj| / |D|) · log2(|Dj| / |D|)

 GainRatio(A) = Gain(A) / SplitInfo_A(D)

 Example: a test on income splits the data into three partitions, namely low, medium and high, containing four, six and four tuples respectively.

SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557

 gain_ratio(income) = 0.029 / 1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the splitting attribute.


Comparing Attribute Selection Measures

 Information gain:
 biased towards multi valued attributes
 Gain ratio:
 tends to prefer unbalanced splits in which one
partition is much smaller than the others
Challenge with Decision Tree
models
Random Forest helps overcome this
challenge
Let’s know what overfitting is
How
How overfitting causes challenge in
Decision Tree
Random Forest to the rescue

R2 is a measure of the goodness of fit of a model.


Random Forest
What is Bagging?
What is Bagging?
Types of Ensemble learning
What is ensemble learning?
Ensemble
Why Random Forest is called
Random?
Row level randomness in Random
Forest
Column level randomness in
Random Forest
Example

 RS: Row sample


 FS: Feature/Column
sample

 Low Bias: if we grow a decision tree to its complete depth, it fits the training dataset very closely, so the training error is very low.
 High Variance: on new test data, such a fully grown decision tree is prone to a much larger amount of error.
How does Random Forest work in
Regression?
How does Random Forest work in
Classification?

Benefits of Random Forest
Use cases of Random Forest

Naïve Bayes classifier
Naïve Bayes classifier
Background
 Classification algorithms that differentiate between classes on the basis of definite decision boundaries.
 Classification algorithms that learn boundaries between classes.
 Classification algorithms that construct decision boundaries separating classes are called discriminative models.
Background

 What if we differentiate between two classes by


analyzing probability distribution of data…
What is Naïve Bayes?

 A Naive Bayes classifier is an algorithm that learns the probability that an object with certain features belongs to a particular group or class.
Where is Naïve Bayes used?
Advantages of Naïve Bayes
Basics of Probability

 What is Probability?
What is Probability?
Probability explained through an
example
John’s emails have multiple occurrences
of the word ‘Lottery’. Let’s analyze them
closely..
Analyze Emails with word “lottery”
Let us consider two simple events..
Let us consider two simple events in
Emails
Appearance of “lottery” in spam and
genuine emails
Compute probability of word ‘lottery’
appearing in emails
Let us explore different types of
probabilities…
Types of Probabilities: Joint Probability
Types of Probabilities: Joint Probability
Venn Diagram for representing count of
events
Let us compute joint probability of word
‘lottery’ appearing in spam
Types of Probabilities: Marginal
Probability
Types of Probabilities: Marginal
Probability
Types of Probabilities: Conditional
Probability
Types of Probabilities: Conditional
Probability

Probability of an event given that another event has already occurred


is called conditional probability.
Conditional Probability
 For example, suppose you go out for lunch at the same
place and time every Friday and you are served lunch
within 15 minutes with probability 0.9. However, given
that you notice that the restaurant is exceptionally busy,
the probability of being served lunch within 15 minutes
may reduce to 0.7. This is the conditional probability of
being served lunch within 15 minutes given that the
restaurant is exceptionally busy.
 The usual notation for "event A occurs given that event B
has occurred" is "A | B" (A given B). The symbol | is a
vertical line and does not imply division.
 P(A | B) denotes the probability that event A will occur
given that event B has occurred already.
Conditional Probability
 A rule that can be used to determine a conditional probability from unconditional probabilities is:

P(A | B) = P(A ∩ B) / P(B)

 where:
 P(A | B) = the (conditional) probability that event A will occur given that event B has occurred already.
 P(A ∩ B) = the (unconditional) probability that event A and event B both occur.
 P(B) = the (unconditional) probability that event B occurs.
Naive Bayes classifier

 Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
 Bayesian classification is based on Bayes' Theorem.
 It is based on the simplifying assumption that the attribute values are conditionally independent.
 A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.



Naive Bayes classifier
 For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. A naive Bayes classifier considers all these features to contribute independently to the probability that this fruit is an apple, whether or not they are in fact related to each other or to the existence of the other features.

 This significantly reduces the computation cost, since calculating each of the P(ai | vj) requires only a frequency count over the tuples in the training data with class value equal to vj.
Bayes Theorem: Basics

 Let X be a data sample: class label is unknown.
 Let H be a hypothesis that X belongs to a specified class C.
 For classification problems, we want to determine P(H|X), the probability that the hypothesis holds given the observed data sample X.
 P(H) (prior probability): the initial probability.
 E.g., X will buy a computer, regardless of age, income or any other information, for that matter.
 P(H|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X.
 For example, suppose our world of data tuples is confined to customers described by the attributes age and income,
 and that X is a 35-year-old customer with an income of $40,000.
 Suppose that H is the hypothesis that our customer will buy a computer.
 Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.



Bayesian Theorem
 Given data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:

P(H | X) = P(X | H) P(H) / P(X)

 P(X|H) is the descriptor posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
 Predict that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes.
 Practical difficulty: requires initial knowledge of many probabilities.



Bayesian Theorem
 Assume a target function f : X -> Y (a function f with domain X and codomain Y). The elements of X are called arguments of f. For each argument x, the corresponding unique y in the codomain is called the function value at x, or the image of x under f.
 Each instance X is described by attributes <a1, a2, a3, ..., an>.
 The most probable value of f(X) is vMAP.
 Using Bayes theorem we can write the expression as:

vMAP = argmax_{vj in V} P(vj | a1, a2, ..., an)
     = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = argmax_{vj in V} P(a1, a2, ..., an | vj) P(vj)

Assuming the attributes are conditionally independent given the class (the naive Bayes assumption), this becomes:

vMAP = argmax_{vj in V} P(vj) · Π_i P(ai | vj)

 The denominator does not depend on the choice of vj and thus it can be omitted from the argmax argument.
Bayesian Theorem

 In mathematics, argmax stands for


the argument of the maximum, that is to say,
the set of points of the given argument for which
the given function attains its maximum value.
Towards Naïve Bayesian Classifier
 Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn), showing n measurements made on the tuple from n attributes.
 Suppose there are m classes C1, C2, ..., Cm.
 Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
 This can be derived from Bayes' theorem:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

 Since P(X) is constant for all classes, only

P(Ci | X) ∝ P(X | Ci) P(Ci)

needs to be maximized.
Bayesian Classifier – Basic Equation

P(C | X) = P(C) P(X | C) / P(X)

where P(C) is the class prior probability, P(X | C) is the descriptor posterior probability (likelihood), P(C | X) is the class posterior probability, and P(X) is the descriptor prior probability.
Naïve Bayesian Classifier: Training Dataset

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
Naïve Bayesian Classifier: An Example
 P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
        P(buys_computer = "no") = 5/14 = 0.357

 Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

 X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):        P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)·P(Ci):  P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
                P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007

Therefore, X belongs to class ("buys_computer = yes").
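The arithmetic above can be checked with a few lines of Python (a sketch written for this text; the fractions are read directly off the training table):

```python
# Class priors and conditional probabilities from the buys_computer training data.
p_yes, p_no = 9 / 14, 5 / 14

# age<=30, income=medium, student=yes, credit_rating=fair, given each class
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

score_yes = likelihood_yes * p_yes
score_no = likelihood_no * p_no
print(round(likelihood_yes, 3), round(likelihood_no, 3))  # 0.044 0.019
print(round(score_yes, 3), round(score_no, 3))            # 0.028 0.007
print("yes" if score_yes > score_no else "no")            # yes
```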


Training Data

Outlook   Temp  Humidity  Windy  Play?
sunny     hot   high      FALSE  No
sunny     hot   high      TRUE   No
overcast  hot   high      FALSE  Yes
rainy     mild  high      FALSE  Yes
rainy     cool  normal    FALSE  Yes
rainy     cool  normal    TRUE   No
overcast  cool  normal    TRUE   Yes
sunny     mild  high      FALSE  No
sunny     cool  normal    FALSE  Yes
rainy     mild  normal    FALSE  Yes
sunny     mild  normal    TRUE   Yes
overcast  mild  high      TRUE   Yes
overcast  hot   normal    FALSE  Yes
rainy     mild  high      TRUE   No

P(yes) = 9/14
P(no) = 5/14
Bayesian Classifier – Probabilities for the weather data

Frequency Tables

Outlook  | No Yes    Temp. | No Yes    Humidity | No Yes    Windy | No Yes
Sunny    |  3   2    Hot   |  2   2    High     |  4   3    False |  2   6
Overcast |  0   4    Mild  |  2   4    Normal   |  1   6    True  |  3   3
Rainy    |  2   3    Cool  |  1   3

Likelihood Tables

Outlook  | No  Yes    Temp. | No  Yes    Humidity | No  Yes    Windy | No  Yes
Sunny    | 3/5 2/9    Hot   | 2/5 2/9    High     | 4/5 3/9    False | 2/5 6/9
Overcast | 0/5 4/9    Mild  | 2/5 4/9    Normal   | 1/5 6/9    True  | 3/5 3/9
Rainy    | 2/5 3/9    Cool  | 1/5 3/9
Bayesian Classifier – Predicting a new day

X = (Outlook = sunny, Temp. = cool, Humidity = high, Windy = true), Play = ?

P(yes|X) ∝ p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)
         = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053  =>  0.0053 / (0.0053 + 0.0206) = 0.205

P(no|X) ∝ p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)
        = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206  =>  0.0206 / (0.0053 + 0.0206) = 0.795


Bayesian Classifier – zero frequency problem

 What if a descriptor value doesn't occur with every class value?
  P(outlook = overcast | No) = 0
 Remedy: add 1 to the count for every descriptor-class combination (Laplace Estimator)

Outlook  | No  Yes    Temp. | No  Yes    Humidity | No  Yes    Windy | No  Yes
Sunny    | 3+1 2+1    Hot   | 2+1 2+1    High     | 4+1 3+1    False | 2+1 6+1
Overcast | 0+1 4+1    Mild  | 2+1 4+1    Normal   | 1+1 6+1    True  | 3+1 3+1
Rainy    | 2+1 3+1    Cool  | 1+1 3+1
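A short sketch of this remedy (Python, illustrative helper names only): add one to each descriptor-class count before converting counts to probabilities, so that no conditional probability is ever exactly zero.

```python
def smoothed_likelihoods(counts, num_values):
    """Laplace estimator: add 1 to each count; the denominator grows by
    the number of distinct values of the descriptor."""
    total = sum(counts.values()) + num_values
    return {value: (c + 1) / total for value, c in counts.items()}

# Outlook counts within the "No" class from the weather data above.
outlook_given_no = {"sunny": 3, "overcast": 0, "rainy": 2}
print(smoothed_likelihoods(outlook_given_no, num_values=3))
# {'sunny': 0.5, 'overcast': 0.125, 'rainy': 0.375} -- no zero probabilities left
```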
Bayesian Classifier – General Equation

P(Ck | X) = P(X | Ck) P(Ck) / P(X)

Likelihood: P(X | Ck)

For a continuous variable, a Gaussian likelihood is commonly used:

P(x | C) = 1 / sqrt(2π σ²) · exp( −(x − μ)² / (2σ²) )
Bayesian Classifier – Dealing with numeric attributes
EXAMPLE-I
Department status age salary
Sales senior 31. . .35 41K.. .45K
Sales junior 26. . .30 26K.. .30K
Sales junior 31. . .35 31K.. .35K
systems junior 21. . .25 31K.. .35K
systems senior 31. . .35 66K.. .70K
systems junior 26. . .30 31K.. .35K
systems senior 41. . .45 66K.. .70K
marketing senior 26. . .30 46K.. .50K
marketing junior 31. . .35 41K.. .45K
secretary senior 46. . .50 41K.. .45K
secretary junior 26. . .30 26K.. .30K

 Define Bayesian Classification .Given a data tuple having the values


“systems”, “26. . . 30”, and “41K.. .45K” for the attributes department, age,
and salary, respectively, what would be a naive Bayesian classification of the
status for the given data tuple ?
Example- continuous attributes
 Consider the training dataset as shown in below table. Let Play be the class label attribute. There
are two distinct classes, namely, yes and no and two numeric attributes namely “temp” and
“humidity”.

Outlook Temp Humidity Windy Play?


sunny 85 85 FALSE No
sunny 80 90 TRUE No
overcast 83 86 FALSE Yes
rainy 70 96 FALSE Yes
rainy 68 80 FALSE Yes
rainy 65 70 TRUE No
overcast 64 65 TRUE Yes
sunny 72 95 FALSE No
sunny 69 70 FALSE Yes
rainy 75 80 FALSE Yes
sunny 75 70 TRUE Yes
overcast 72 90 TRUE Yes
overcast 81 75 FALSE Yes
rainy 71 91 TRUE No

 Given a data tuple having the values “sunny”, 66, 89 and “true” for the attributes outlook, temp.,
humidity and windy respectively, what would be a naive Bayesian classification of the Play for the
given tuple?
Example – continuous attributes
The numeric weather data with summary statistics:

Outlook (counts / likelihoods):
  sunny:    yes 2 (2/9), no 3 (3/5)
  overcast: yes 4 (4/9), no 0 (0/5)
  rainy:    yes 3 (3/9), no 2 (2/5)

Temperature:
  yes: 83 70 68 64 69 75 75 72 81    mean 73,   std dev 6.2
  no:  85 80 65 72 71                mean 74.6, std dev 7.9

Humidity:
  yes: 86 96 80 65 70 80 70 90 75    mean 79.1, std dev 10.2
  no:  85 90 70 95 91                mean 86.2, std dev 9.7

Windy:
  false: yes 6 (6/9), no 2 (2/5)
  true:  yes 3 (3/9), no 3 (3/5)

Play: yes 9 (9/14), no 5 (5/14)

Artificial Neural Networks
(ANN)
Neural Networks -Origin
Best learning system known to us?
How does the brain work?

In the brain, a neuron has three principal components:

1. Dendrites: carry electrical signals into the cell body.
2. Cell body: effectively sums and thresholds these incoming signals.
3. Axon: a single long fiber that carries the signal from the cell body out to other neurons.

The point of contact between an axon of one cell and a dendrite of another cell is called a 'synapse'.
Background: ANN Vs Brain

ANN                                              Brain
It is simple (few neurons in connection)         It is complex (10^11 neurons and 10^15 connections)
It is dedicated to a specific purpose            It is generalized for all purposes
Response time is fast (may be in nanoseconds)    Response time is slow (may be in milliseconds)
Design is regular                                Design is arbitrary
Activities are synchronous                       Activities are asynchronous


What is a neural network?
Perceptron

Perceptrons can only model linearly separable functions.

We need to use multi-layer perceptron to tackle non-linear problems.


Perceptron
Activation Functions
Multi Layer Perceptron
Multi Layer Perceptron
General Structure of ANN

[Figure: inputs x1 ... x5 feed an input layer, which feeds a hidden layer, which feeds an output layer.]

θj is the bias of unit j. The bias acts as a threshold, which is used to adjust the output along with the weighted sum of the inputs to the neuron. The bias is therefore a constant that helps the model fit the given data as well as possible.
ANN

X1 X2 X3  Y
1  0  0   0
1  0  1   1
1  1  0   1
1  1  1   1
0  0  1   0
0  1  0   0
0  1  1   1
0  0  0   0

[Black box with inputs X1, X2, X3 and output Y.]

Output Y is 1 if at least two of the three inputs are equal to 1.
ANN

X1 X2 X3  Y
1  0  0   0
1  0  1   1
1  1  0   1
1  1  1   1
0  0  1   0
0  1  0   0
0  1  1   1
0  0  0   0

[Perceptron: inputs X1, X2, X3 with weights 0.3 each and threshold t = 0.4.]

Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0)

where I(z) = 1 if z is true, 0 otherwise.
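The hand-built perceptron above can be verified directly; this minimal Python sketch (written for this text) reproduces the truth table:

```python
def perceptron(x1, x2, x3, w=0.3, t=0.4):
    """Fires (returns 1) when the weighted sum of the inputs exceeds the threshold."""
    return 1 if (w * x1 + w * x2 + w * x3 - t) > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            print(x1, x2, x3, "->", perceptron(x1, x2, x3))
# The output is 1 exactly when at least two of the three inputs are 1.
```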
Artificial Neural Networks

 The model is an assembly of inter-connected nodes and weighted links.
 The output node sums up each of its input values according to the weights of its links.
 The output node's sum is compared against some threshold t.

Perceptron model:

Y = I( Σ_i wi Xi − t > 0 )

Given the net input Ij to unit j, the output Oj of unit j is computed as

Oj = 1 / (1 + e^(−Ij))

This function is also referred to as a squashing function, because it maps a large input domain onto the smaller range of 0 to 1.
Where Do The Weights Come From?
Where Do The Weights Come From?
How Do Perceptrons Learn?
Learning Algorithms:
Back propagation for classification
What is backpropagation
 Backpropagation is a neural network learning algorithm.
 There are many different kinds of neural networks and neural network algorithms.
 The most popular neural network algorithm is backpropagation, which gained repute in the 1980s.
 Multilayer feed-forward networks are the type of neural network on which the backpropagation algorithm performs.
 Backpropagation learns a set of weights that fits the training data so as to minimize the mean squared distance between the network's class prediction and the known target value of the tuples.
Major Steps for Back Propagation Network
 Constructing a network

 input data representation


 selection of number of layers, number of nodes in each
layer.
 Training the network using training data
 Pruning the network
 Interpret the results
A Multi-Layer Feed-Forward Neural Network

[Figure: inputs x1 ... x5 feed the input layer; weighted connections wij feed the hidden layer; the hidden layer feeds the output layer, which produces y.]

Ij = Σ_i wij Oi + θj

Oj = 1 / (1 + e^(−Ij))
How A Multi-Layer Neural Network Works?

 The inputs to the network correspond to the attributes measured for each
training tuple

 Inputs are fed simultaneously into the units making up the input layer

 They are then weighted and fed simultaneously to a hidden layer

 The number of hidden layers is arbitrary, although usually only one

 The weighted outputs of the last hidden layer are input to units making up the
output layer, which gives out the network's prediction

 The network is feed-forward in that none of the weights cycles back to an


input unit or to an output unit of a previous layer
Defining a Network Topology

 First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer.
 Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0].
 One input unit per domain value.
 For the output: if used for classification with more than two classes, one output unit per class is used.
 Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
Backpropagation
 Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
 For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target
value
 Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
 Steps
 Initialize weights and biases in the network
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error (by updating weights and biases)
 Terminating condition (when error is very small, etc.)
Backpropagation: Algorithm
Backpropagation

Errors are propagated backwards from the output layer:

For an output unit j:   Errj = Oj (1 − Oj) (Tj − Oj)        (Tj = correct target value, Oj = generated value)

For a hidden unit j:    Errj = Oj (1 − Oj) Σ_k Errk wjk

Weight update:          wij = wij + (l) Errj Oi

Bias update:            θj = θj + (l) Errj

where l is the learning rate.
Example - Sample calculations for learning by the backpropagation algorithm

The figure shows a multilayer feed-forward neural network.
Let the learning rate be 0.9.
The first training tuple is X = (1, 0, 1), whose class label is 1.
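The slide's figure (with its initial weights) is not reproduced here, so the sketch below uses illustrative initial weights and biases chosen for this text; it only demonstrates one forward pass and one backward pass of the update rules above for X = (1, 0, 1), target 1, learning rate 0.9.

```python
from math import exp

def sigmoid(i):
    return 1.0 / (1.0 + exp(-i))

l = 0.9                      # learning rate
x = [1, 0, 1]                # training tuple
target = 1                   # known class label

# Illustrative initial weights/biases (assumed, since the original figure is not shown):
# two hidden units (h1, h2) and one output unit.
w_hidden = [[0.2, -0.3],     # weights from x1 to h1, h2
            [0.4, 0.1],      # from x2
            [-0.5, 0.2]]     # from x3
b_hidden = [-0.4, 0.2]
w_out = [-0.3, -0.2]         # weights from h1, h2 to the output unit
b_out = 0.1

# Forward pass: net input, then squashing function, layer by layer.
o_hidden = [sigmoid(sum(x[i] * w_hidden[i][j] for i in range(3)) + b_hidden[j]) for j in range(2)]
o_out = sigmoid(sum(o_hidden[j] * w_out[j] for j in range(2)) + b_out)

# Backward pass: errors from the update rules on the previous slide.
err_out = o_out * (1 - o_out) * (target - o_out)
err_hidden = [o_hidden[j] * (1 - o_hidden[j]) * err_out * w_out[j] for j in range(2)]

# Weight and bias updates.
w_out = [w_out[j] + l * err_out * o_hidden[j] for j in range(2)]
b_out += l * err_out
w_hidden = [[w_hidden[i][j] + l * err_hidden[j] * x[i] for j in range(2)] for i in range(3)]
b_hidden = [b_hidden[j] + l * err_hidden[j] for j in range(2)]

print("network output:", round(o_out, 3))
print("updated output-layer weights:", [round(w, 3) for w in w_out])
```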
Neural Network as a Classifier
 Weakness
 Long training time
 Require a number of parameters, e.g., the network topology or ``structure."
 Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of ``hidden units" in the network

 Strength
 High tolerance to noisy data as well as their ability to classify patterns on which
they have not been trained.
 They are well-suited for continuous-valued inputs and outputs, unlike most decision
tree algorithms.
 They have been successful on a wide array of real-world data, including
handwritten character recognition, pathology and laboratory medicine, and training
a computer to pronounce English text.
 Neural network algorithms are inherently parallel; parallelization techniques can be
used to speed up the computation process.

These above factors contribute toward the usefulness of neural networks for
classification and prediction in machine learning.



K Nearest Neighbor
Lazy vs. Eager Learning

 The classification methods —decision tree


induction, Bayesian classification, classification by
backpropagation, support vector machines—are
all examples of eager learners.
Lazy vs. Eager Learning
 Lazy vs. eager learning

 Lazy learning (instance-based learning): Simply stores


training data (or only minor processing) and waits until it is given a
test tuple.
 Eager learning : Given a set of training set, constructs a
classification model before receiving new (e.g., test) data to
classify.
We can think of the learned model as being ready and
eager to classify previously unseen tuples.

 Lazy: less time in training but more time in predicting so lazy learners
can be computationally expensive.
Lazy Learner: Instance-Based Methods

 Instance-based learning:
 Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified.
 Typical approaches

 k-nearest neighbor approach


 Instances represented as points in a Euclidean
space.

 Case-based reasoning
 Uses symbolic representations and knowledge-

based inference.
The k-Nearest Neighbor Algorithm
 The k-nearest-neighbor method was first described in the early 1950s.
 It has since been widely used in the area of pattern recognition.

 Nearest-neighbor classifiers are based on learning by analogy, that is,


by comparing a given test tuple with training tuples that are similar to
it.

 The training tuples are described by n attributes.

 Each tuple represents a point in an n-dimensional space.


 In this way, all of the training tuples are stored in an n-dimensional
pattern space.
 When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are closest to
the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple.
The k-Nearest Neighbor Algorithm

 "Closeness" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = sqrt( Σ_{i=1..n} (x1i − x2i)² )

 For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
 Nearest neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple.
Example-
Instance-Based Classification
 A KNN classifier assigns a test instance the majority class associated
with its K nearest training instances. Distance between instances is
measured using the Euclidean distance.
 Suppose we have the following training set of positive (+) and
negative (-) instances and a single test instance (o).
 All instances are projected onto a vector space of two real-valued
features: X and Y.
Contd…
(a) What would be the class assigned to this test instance for K=1?
KNN assigns a test instance the target class associated with the majority of the test instance's K nearest neighbors. For K=1, this test instance would be predicted negative because its single nearest neighbor is negative.

(b) What would be the class assigned to this test instance for K=3?
KNN assigns a test instance the target class associated with the majority of the test instance's K nearest neighbors. For K=3, this test instance would be predicted negative. Out of its three nearest neighbors, two are negative and one is positive.
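A compact sketch of the procedure (Python, written for this text; the two-feature training points and the query point are hypothetical): compute Euclidean distances to every training instance, take the k closest, and vote.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k):
    """train is a list of (features, label); return the majority label
    among the k training instances closest to query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical positive/negative instances in a 2-D feature space (X, Y).
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((3.0, 4.0), "-"),
         ((5.0, 7.0), "-"), ((3.5, 5.0), "-"), ((4.5, 5.0), "-")]
query = (3.0, 3.5)

print(knn_predict(train, query, k=1))   # class of the single nearest neighbour
print(knn_predict(train, query, k=3))   # majority class among the three nearest
```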
Advantages of KNN
Advantages of KNN
Advantages of KNN
Example of application of KNN
Example of application of KNN
KNN(K Nearest Neighbor) in a
nutshell
How does KNN work?
Let’s take a simple example of
Classification
Step 1: Build neighborhood
Step 2: Find distance from query point to
each point in neighborhood
FYI: Distance measures for
continuous data
Step 3: Assign to class
Classification with KNN: Loan
default data
Step 1: Build neighborhood
Classification with KNN: Build
neighborhood
Step 2: Measure distance from each
data point
Step 2: Graphical representation of
distance
Step 3: Assign to class based on
majority vote
KNN for Regression: Let’s work
withLoan data set
Step 1: Define ‘K’(number of
neighbors)
Step 2: Measure distance from each
data point
Regression with KNN: Predict
income of Query point
What should be the value of K?
What should be the value of K?
Case Study: Identify whether a
website is malicious or not
Identify whether a website is
malicious or not: Data Attributes
Metrics for Performance Evaluation of Classifier
 Focus on the predictive capability of a model
 Rather than how fast it takes to classify or build models, scalability, etc.
 Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a           b
CLASS    Class=No      c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation of Classifier

                                  PREDICTED CLASS
                                  Class=Yes (Positive)   Class=No (Negative)
ACTUAL   Class=Yes (Positive)     a                      b
CLASS    Class=No (Negative)      c                      d

 The entries in the confusion matrix have the following meaning:
 a is the number of correct predictions that an instance is positive,
 b is the number of incorrect predictions that an instance is negative,
 c is the number of incorrect predictions that an instance is positive, and
 d is the number of correct predictions that an instance is negative.
Metrics for Performance Evaluation of Classifier

 The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

 Consider a 2-class problem:
 Number of Class 0 examples = 9990
 Number of Class 1 examples = 10
 If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
 Accuracy is misleading because the model does not detect any class 1 example
Contd…
 The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

TP rate = a / (a + b) = TP / (TP + FN)

 The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation:

FP rate = c / (c + d) = FP / (FP + TN)
Contd…

 The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using the equation:

TN rate = d / (d + c) = TN / (TN + FP)

 The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:

FN rate = b / (b + a) = FN / (FN + TP)
Contd…

 The precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

P = a / (c + a) = TP / (FP + TP)
Example
 Suppose we train a model to predict whether an email is Spam or Not Spam. After training the model, we apply it to a test set of 500 new email messages (also labeled) and the model produces the contingency matrix below.

                    Predicted Spam   Predicted Not Spam
Actual Spam         70               40
Actual Not Spam     10               380

(a) Compute the precision of this model with respect to the Spam class.
Precision with respect to SPAM = # correctly predicted as SPAM / # predicted as SPAM
= 70 / (70 + 10) = 70 / 80.
Contd…
(b) Compute the recall of this model with respect to the Spam class.

Recall with respect to SPAM = # correctly predicted as SPAM / # truly SPAM
= 70 / (70 + 40) = 70 / 110.

 High precision and low recall with respect to SPAM: whatever the model classifies as SPAM is probably SPAM. However, many emails that are truly SPAM are misclassified as NOT SPAM, i.e. false negatives (false acceptance).

 High recall and low precision with respect to SPAM: the model filters all the SPAM emails, but also incorrectly classifies some genuine emails as SPAM, i.e. false positives (false rejection).
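These quantities follow directly from the confusion-matrix counts; a small Python sketch for the spam example above (TP = 70, FP = 10, FN = 40, TN = 380):

```python
# Confusion-matrix counts for the Spam class from the example above.
tp, fp, fn, tn = 70, 10, 40, 380

precision = tp / (tp + fp)
recall = tp / (tp + fn)                     # true positive rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision = {precision:.3f}")       # 0.875
print(f"recall    = {recall:.3f}")          # 0.636
print(f"accuracy  = {accuracy:.3f}")        # 0.900
```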
End of Presentation
