Supervised Learning
Department of Information
Technology
Mekdela Amba University
2024
Topics
• Introduction to Supervised Learning
• Classification
o Decision tree
o Bayes classification
o Bayesian Belief Networks
o SVM
o KNN
o ANN
o Ensemble
• Regression-continuous value prediction
• Time series data prediction
2
Supervised Learning
• A supervised scenario is characterized by the concept of
a teacher or supervisor, whose main task is to provide
the agent with a precise measure of its error (directly
comparable with output values)
• The goal is to infer a function or mapping from training
data that is labeled.
• The training data consist of input vector X and output
vector Y of labels or tags.
• Based on the training set, the algorithm generalizes so that it
responds correctly to all possible inputs; this is why it is also
called learning from examples.
3
Supervised Learning
Dataset Example:
• Weather information of the last 14 days
4
Supervised Learning
• A data set is denoted in the form D = {(x_i, y_i)}, i = 1, …, N,
o where the inputs are the x_i, the outputs are the y_i, and N is the
number of observations.
• Generalization: the algorithm should produce sensible outputs for
inputs that were not encountered during learning.
Supervised learning is categorized into two types:
o Classification: data is classified into one of two or more classes
o Regression: a task of predicting continuous quantity.
5
Classification
Classification:
o It is a systematic approach to building classification models
from an input data set.
o It is the task of assigning a new object to one of
several predefined categories.
• Examples of classification algorithms: decision tree
classifiers, rule-based classifiers, neural networks,
support vector machines, naive Bayes classifiers etc.
• Each technique employs a learning algorithm to identify
a model that best fits the relationship between the
attribute set and class label of the input data.
6
Classification
• A learned model should accurately predict the class labels
of previously unseen records.
• Example:
o assigning a given email into Spam or non-spam category
o assign a bank loans applicants as safe or risky for the bank
• Classification is a two-step process consisting of:
o a learning step (where a classification model is
constructed using training set), and
o a classification step (where the model is used to
predict class labels for given data)
7
Decision Tree(DT)
• Decision tree (DT) is a statistical model that
builds classification models in the form of a
tree structure.
• This model classifies data in a dataset by
flowing through a query structure from the
root until it reaches the leaf, which
represents one class.
• The root represents the attribute that plays
a main role in classification, and the leaf
represents the class.
o Given an input, at each node, a test is applied
and one of the branches is taken depending on
the outcome.
8
Decision Tree
• DT learning is supervised, because it constructs DT from class-
labeled training tuples.
• During the late 1970s and early 1980s, J. Ross Quinlan, a
researcher in machine learning, developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser). Quinlan later
presented C4.5 (a successor of ID3), which became a benchmark
to which newer supervised learning algorithms are often
compared.
• The statistical measures used to select the attribute that best splits
the dataset with respect to the given classes are information gain and
gain ratio.
• Both measures have a close relationship with another concept
called entropy.
9
How does a decision tree work?
• The DT algorithm adopts a greedy divide-and-conquer approach:
o Attributes can be categorical (or continuous) values
o Tree is constructed in a top-down recursive manner/no backtracking
o At start, all the training examples are at the root
o Examples are partitioned recursively based on selected attributes
o Attributes are selected on the basis of an impurity function (e.g.,
information gain)
• Conditions for stopping partitioning
o All examples for a given node belong to the same class
o There are no remaining attributes for further partitioning – the majority
class becomes the leaf (majority voting)
o There are no examples left for partition
10
How does a decision tree work?
GrowTree(TrainingData D)
    Partition(D);

Partition(Data D)
    if (all points in D belong to the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use best split found to partition D into D1 and D2;
    Partition(D1);
    Partition(D2);
14
Choose an attribute to partition data
• The key to building a decision tree is deciding which attribute to
choose in order to branch.
• The objective is to reduce impurity or uncertainty in data
as much as possible.
o A subset of data is pure if all instances belong to the same
class.
• The best attribute is selected for splitting the training
examples using a Goodness function.
o The best attribute:
separates the classes of the training examples fastest, and
yields the smallest tree
15
Decision Tree…
Entropy:
• It is a degree of randomness of elements or it is a measure of
impurity or uncertainty.
• This measure is based on the work of Claude Shannon in information theory,
which studied the value or “information content” of messages.
• Shannon entropy quantifies this uncertainty in terms of the expected
value of the information present in the message.
16
Decision Tree
• Suppose a set D contains a total of N examples, of which p are
positive and n are negative. The entropy of D is then given by:
  Entropy(D) = -(p/N) log2(p/N) - (n/N) log2(n/N)
17
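A small sketch (not part of the slides): the two-class entropy above as a Python function; the counts passed to it are illustrative.

import math

# Two-class entropy: Entropy(D) = -(p/N) log2(p/N) - (n/N) log2(n/N)
def entropy(p, n):
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # treat 0 * log2(0) as 0
            q = count / total
            result -= q * math.log2(q)
    return result

print(entropy(9, 5))    # ~0.940 for a 9-vs-5 split
print(entropy(7, 7))    # 1.0 -> maximum impurity
print(entropy(14, 0))   # 0.0 -> pure subset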
Decision tree…
Information Gain:
•It is defined as the difference between the original
information requirement (i.e. based on just the proportion
of classes) and the new requirement (i.e. obtained after
partitioning on attribute A)
• Select the attribute with the highest information gain, i.e., the one
that creates the smallest average disorder:
First, compute the disorder using entropy:
– the expected information needed to classify objects into classes
Second, measure the information gain:
– calculate how much the disorder of a set would be reduced
by knowing the value of a particular attribute.
18
Decision Tree …
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - Σ_{i=1..m} p_i log2(p_i)
19
Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that “age <=30” covers 5 out of the 14 samples,
with 2 yes’es and 3 no’s. Hence:

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Training data:
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
20
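The numbers above can be checked with a short script. This is a hedged sketch (not from the slides) that recomputes Info(D), Info_age(D), and Gain(age) from the buys_computer table.

import math
from collections import Counter, defaultdict

# (age, buys_computer) pairs copied from the training data above
data = [
    ("<=30", "no"), ("<=30", "no"), ("31...40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31...40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31...40", "yes"),
    ("31...40", "yes"), (">40", "no"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

labels = [y for _, y in data]
info_d = entropy(labels)                                   # Info(D)     ~0.940

groups = defaultdict(list)
for age, y in data:
    groups[age].append(y)
info_age = sum(len(ys) / len(data) * entropy(ys)           # Info_age(D) ~0.694
               for ys in groups.values())

gain_age = info_d - info_age                               # Gain(age)
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
# 0.94 0.694 0.247 (the slide rounds 0.940 - 0.694 to 0.246)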
Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute.
Must determine the best split point for A (typically the midpoint between
adjacent values, in sorted order, that gives the minimum expected
information requirement).
C4.5 uses the gain ratio to normalize the information gain:
  GainRatio(A) = Gain(A) / SplitInfo(A)
24
Decision tree
• Information gain of each attribute (worked-example slides 25 to 27: calculations not reproduced)
27
The best decision tree?
is_sunburned:
  Hair colour = red → Sunburned
  Hair colour = brown → None
  Hair colour = blonde → ?
28
The best Decision Tree
• This is the simplest and optimal one possible and it makes a lot of sense.
• It classifies 4 of the people on just the hair colour alone.
is_sunburned:
  Hair colour = red → Sunburned
  Hair colour = brown → None
  Hair colour = blonde → Lotion used?
      no  → Sunburned
      yes → None
29
Decision Tree: Rule Extraction from Trees
30
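The figure for this slide is not reproduced. As an illustration, one IF-THEN rule can be read off each root-to-leaf path of the sunburn tree shown earlier:
  IF hair colour = red THEN sunburned
  IF hair colour = brown THEN none
  IF hair colour = blonde AND lotion used = no THEN sunburned
  IF hair colour = blonde AND lotion used = yes THEN none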
Avoid overfitting in classification
Overfitting: A tree may overfit the training data
o Good accuracy on training data but poor on test data
o Symptoms: tree too deep and too many branches, some may reflect
anomalies due to noise or outliers
32
Avoid overfitting in classification
• Post-pruning: Remove branches or sub-trees from a “fully
grown” tree.
• A sub-tree at a given node is pruned by removing its
branches and replacing it with a leaf.
• The leaf is labeled with the most frequent class among the
sub-tree being replaced.
C4.5 uses a statistical method to estimate the error at each node for
pruning.
33
Avoid overfitting in classification
• An unpruned decision tree and a pruned version of it.
34
Decision Tree
• The benefits of having a decision tree are as follows:
o It does not require any domain knowledge.
o It is easy to comprehend.
o The learning and classification steps of a
decision tree are simple and fast.
35
Exercises
• Suppose that you have a free afternoon and you are thinking about
whether or not to go and play tennis. How would you decide?
o The goal is to predict when this player will play tennis.
o The following training data examples are prepared for the classifier.
36
Bayesian Classification
• Bayesian classifiers are statistical classifiers which combine prior
knowledge of the classes with new evidence gathered from data.
• They can predict class membership probabilities
o the probability that a given tuple belongs to a particular class.
• For each new sample they provide a probability that the sample
belongs to a class (for all classes)
• Bayesian classification is based on Bayes’ theorem
Provides practical learning algorithms:
• Probabilistic learning: calculate explicit probabilities for hypotheses, e.g., Naïve Bayes.
• The Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes’
theorem with a strong (naïve) independence assumption between the features.
37
Bayes’ Theorem: Basics
• Let X be a data tuple; in Bayesian terms, X is considered the “evidence”.
• It is described by measurements made on a set of n attributes.
• Let H be a hypothesis that X belongs to class C.
• P(H|X) is the posterior probability of H conditioned on X.
• For example, suppose our world of data tuples is confined to
customers described by the attributes age and income, and that X is a
35-year-old customer with an income of $40,000.
o Suppose that H is the hypothesis that our customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer
given that we know the customer’s age and income.
38
Bayes’ Theorem: Basics
• P(H) is the prior probability of H: the probability that any given
customer will buy a computer, regardless of age, income, or any other
information for that matter.
• P(X) is the prior probability of X. Using our example, it is the
probability that a person from our set of customers is 35 years old
and earns $40,000.
• P(X|H) is the posterior probability of X conditioned on H. That
is, it is the probability that a customer, X, is 35 years old and earns
$40,000, given that we know the customer will buy a computer.
• How are these probabilities estimated?
• Bayes’ theorem is useful in that it provides a way of calculating the
posterior probability P(H|X) from P(X), P(H), and P(X|H):
  P(H|X) = P(X|H) P(H) / P(X)   (posterior = likelihood x prior / evidence)
39
Bayes Classification
Example :
• A doctor knows that meningitis causes stiff neck 50% of
the time
• Prior probability of any patient having meningitis is
1 / 50,000
• Prior probability of any patient having stiff neck is 1 / 20
• If a patient has stiff neck, what’s the probability he/she
has meningitis?
40
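Applying Bayes’ theorem with the numbers above (a worked step, not shown on the original slide):
  P(meningitis | stiff neck) = P(stiff neck | meningitis) x P(meningitis) / P(stiff neck)
                             = (0.5 x 1/50,000) / (1/20)
                             = 0.0002
So even given a stiff neck, the probability of meningitis remains very small.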
Bayes Classification
Example: A medical cancer diagnosis problem. There are two possible outcomes
of a diagnosis: +ve, -ve. We know 0.8% of the world population has cancer. The test
gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.
If a patient’s test returns +ve, should we diagnose the patient as having cancer?
Given:
P(cancer) = 0.008          P(no-cancer) = 0.992
P(+ve|cancer) = 0.98       P(-ve|cancer) = 0.02
P(+ve|no-cancer) = 0.03    P(-ve|no-cancer) = 0.97
Using Bayes’ formula:
o P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
o P(no-cancer|+ve) = P(+ve|no-cancer) x P(no-cancer) / P(+ve) = (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
So, the patient most likely does not have cancer.
41
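A hedged sketch (not part of the slides) that carries the calculation above through the normalization by P(+ve):

# Bayes' rule for the cancer-test example above
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

num_cancer = p_pos_given_cancer * p_cancer              # 0.00784
num_no_cancer = p_pos_given_no_cancer * p_no_cancer     # 0.02976

p_pos = num_cancer + num_no_cancer                      # P(+ve), by total probability
print(num_cancer / p_pos)       # ~0.21 -> P(cancer | +ve)
print(num_no_cancer / p_pos)    # ~0.79 -> P(no-cancer | +ve): most likely no cancer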
General Bayes Theorem
• Consider each attribute & class label as random variables
• Given a record with attributes (A1, A2,…,An)
o Goal is to predict class C
o we want to find the value of C that maximizes P(C| A1, A2,…,An )
43
Naïve Bayes Classifier: Training Dataset
• Example (from the buys_computer data): P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
• Conditional probabilities estimated from the play-tennis training data (P = Play = yes, N = Play = no):

Outlook      P    N      Humidity   P    N
sunny        2/9  3/5    high       3/9  4/5
overcast     4/9  0      normal     6/9  1/5
rain         3/9  2/5

Temperature  P    N      Windy      P    N
hot          2/9  2/5    Strong     3/9  3/5
mild         4/9  2/5    Weak       6/9  2/5
cool         3/9  1/5
47
Play-tennis example
Based on the model created, predict Play Tennis or Not for the
following unseen sample
(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Choose the class C in {yes, no} that maximizes
  P(C) P(Outlook = sunny | C) P(Temp = cool | C) P(Humidity = high | C) P(Wind = strong | C)
Working:
  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
  P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206
Answer: PlayTennis = no
49
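A hedged sketch (not from the slides) that scores the query above using the conditional probabilities from the previous slide:

# Naive Bayes scoring for (sunny, cool, high, strong)
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "wind=strong": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "wind=strong": 3/5},
}
query = ["outlook=sunny", "temp=cool", "humidity=high", "wind=strong"]

scores = {}
for c in priors:
    score = priors[c]
    for feature in query:
        score *= cond[c][feature]    # naive (conditional independence) assumption
    scores[c] = score

print(scores)                        # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))   # 'no' -> PlayTennis = no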
SVM—Support Vector Machines
• Figures on slides 50 to 52 (not reproduced): a non-linear SVM and its transformation into a linear hyperplane.
SVM—History and Applications
54
SVM—When Data Is Linearly Separable
Let the data D be (X1, y1), …, (X|D|, y|D|), where the Xi are the training tuples with
associated class labels yi. Each yi can take one of two values, +1 or -1, corresponding
to the classes buys_computer = yes and buys_computer = no, respectively.
There are infinitely many lines (hyperplanes) separating the two classes, but we want to
find the best one (the one that minimizes classification error on unseen data).
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal
hyperplane (MMH).
55
Linear SVM: Separable Case
56
Linear SVM
• A two-dimensional training set
consisting of squares and
circles.
• A decision boundary that
bisects the training examples
into their respective classes is
illustrated with a solid line
• If we label all the squares as
class +1 and all the circles as
class -1, then we can predict
the class label y for any test
example z in the following way:
  y = +1 if w·z + b > 0, and y = -1 if w·z + b < 0
57
Margin of a Linear Classifier
(Figures on slides 58 and 59 not reproduced.)
59
Why Is SVM Effective on High Dimensional
Data?
The complexity of trained classifier is characterized by the #
of support vectors rather than the dimensionality of the data
The support vectors are the essential or critical training
examples —they lie closest to the decision boundary
If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
The number of support vectors found can be used to compute
an (upper) bound on the expected error rate of the SVM
classifier, which is independent of the data dimensionality
Thus, an SVM with a small number of support vectors can
have good generalization, even when the dimensionality of
the data is high
60
Support Vector Machines
• Figure: two candidate decision boundaries, B1 and B2, with their margins delimited by b11, b12 and b21, b22; SVM prefers the boundary with the larger margin.
62
Linear SVM: Nonseparable Case
What if the problem is not linearly
separable?
63
Linear SVM: Nonseparable Case
What if the problem is not linearly
separable?
Introduce slack variables ξ_i ≥ 0.
Need to minimize:
  L(w) = ||w||^2 / 2 + C ( Σ_{i=1..N} ξ_i )^k
Subject to:
  w · x_i + b ≥ 1 - ξ_i   if y_i = +1
  w · x_i + b ≤ -1 + ξ_i  if y_i = -1
65
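A hedged sketch, not the constrained formulation above: the same soft-margin idea can be trained by sub-gradient descent on the equivalent hinge-loss objective ||w||^2/2 + C Σ max(0, 1 - y_i(w·x_i + b)). The toy 2-D data below is made up.

import numpy as np

rng = np.random.default_rng(0)
# Two slightly overlapping blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for epoch in range(200):
    margins = y * (X @ w + b)
    viol = margins < 1                 # points inside the margin or misclassified
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

pred = np.sign(X @ w + b)
print("training accuracy:", (pred == y).mean())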
Attribute Transformation
66
Learning a Nonlinear SVM Model
67
Ensemble Methods
(Introductory slides 68 to 72: figures not reproduced.)
72
Methods for Constructing an Ensemble Classifier
The ensemble of classifiers can be constructed in many
ways:
By manipulating the training set:
• In this approach, multiple training sets are created by
resampling the original data according to some
sampling distribution.
• A classifier is then built from each training set using a
particular learning algorithm.
• Bagging and Boosting are two examples of ensemble
methods that manipulate their training sets.
73
Methods for Constructing an Ensemble Classifier
By manipulating the input features:
• A subset of input features is chosen to form each training
set.
• The subset can be either chosen randomly or based on
the recommendation of domain experts.
• Random forest is an ensemble method that manipulates
its input features and uses decision trees as its base
classifiers
74
Methods for Constructing an Ensemble Classifier
By manipulating the learning algorithm
• Many learning algorithms can be manipulated in such a
way that applying the algorithm several times on the
same training data may result in different models.
• For example, an artificial neural network can produce
different models by changing its network topology or the
initial weights of the links between neurons.
75
Bagging(Bootstrap Aggregation)
76
Bagging
• It is a technique that repeatedly samples (with
replacement) from a data set.
• These samples are similar since they are all drawn from the same
original data, but they are also slightly different due to chance.
• A learning algorithm is an unstable algorithm if small changes in the
training set cause large differences in the generated learner, namely,
the learning algorithm has high variance.
• Bagging improves the generalization error by reducing the variance of
the base classifiers.
77
Bagging
• Assume that we have a training set:
78
Bagging
• The performance of bagging depends on the stability of
the base classifier
• Bagging uses the bootstrap to generate n training sets, trains n base
learners on them, and then, during testing, combines their outputs.
• Fit classification or regression models to bootstrap samples from the
data and combine them by voting (classification) or averaging (regression).
79
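A hedged, self-contained sketch of the bagging procedure just described, using a trivial 1-D threshold “stump” as the base learner and made-up data:

import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, y):
    # pick the (threshold, sign) with the lowest training error
    best = (0.0, 1, np.inf)
    for t in np.unique(X):
        for s in (1, -1):
            err = np.mean(np.where(X > t, s, -s) != y)
            if err < best[2]:
                best = (t, s, err)
    return best[:2]

def bagging(X, y, n_estimators=25):
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
        models.append(fit_stump(X[idx], y[idx]))
    return models

def predict(models, X):
    votes = np.array([np.where(X > t, s, -s) for t, s in models])
    return np.sign(votes.sum(axis=0))           # combine by majority vote

X = rng.normal(0, 1, 200)
y = np.where(X + rng.normal(0, 0.5, 200) > 0, 1, -1)   # noisy labels
print("training accuracy:", (predict(bagging(X, y), X) == y).mean())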
Random Forest
(Slides 80 to 83: content not reproduced.)
83
Boosting
• Boosting is a process that uses a set of machine learning algorithms to
combine weak learners into a strong learner in order to increase the
accuracy of the model.
How do boosting algorithms work?
• The basic principle behind boosting algorithms is to generate multiple
weak learners and combine their predictions to form one strong rule.
Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.
Step 2: Falsely predicted observations are passed to the next base learner
with higher weights on these incorrect predictions.
Step 3: Repeat Step 2 until the algorithm can correctly classify the output.
84
Types of Boosting
1. Adaptive Boosting (AdaBoost)
o Follows the generic boosting procedure described above, re-weighting the observations after each round (a sketch follows below).
85
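A hedged sketch of AdaBoost following the three generic steps above (equal initial weights, then re-weighting the mistakes). The decision-stump base learner and the data are illustrative, not from the slides.

import numpy as np

def fit_stump(X, y, w):
    # weighted axis-aligned stump: best (feature, threshold, sign)
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                err = w[np.where(X[:, j] > t, s, -s) != y].sum()
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def stump_predict(model, X):
    j, t, s = model
    return np.where(X[:, j] > t, s, -s)

def adaboost(X, y, rounds=30):
    w = np.full(len(X), 1 / len(X))            # Step 1: equal weights
    ensemble = []
    for _ in range(rounds):
        j, t, s, err = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this learner in the final vote
        w *= np.exp(-alpha * y * stump_predict((j, t, s), X))   # Step 2: up-weight mistakes
        w /= w.sum()
        ensemble.append((alpha, (j, t, s)))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(m, X) for a, m in ensemble))

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 2))
y = np.sign(X[:, 0] + X[:, 1])                 # diagonal boundary one stump cannot capture
y[y == 0] = 1
print("training accuracy:", (predict(adaboost(X, y), X) == y).mean())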
Types of Boosting
2. Gradient Boosting
86
XGBoost
87
k-Nearest Neighbor Classification (kNN)
KNN stores all available cases and
classifies new cases based on a similarity
measure.
Unlike all the previous learning methods,
kNN does not build a model from the
training data; because of this it is called a
lazy learner.
To classify a test instance d, define its
k-neighborhood: the k training instances closest to d.
K in KNN is a parameter that refers to the
number of nearest neighbors to include in the vote.
88
k-Nearest Neighbor Classification (kNN)
Classifying an unknown record requires three things:
– The set of labeled records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute the distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class
label of the unknown record (e.g., by taking a majority vote)
How do we choose K?
90
When do we use KNN Algorithms?
91
How does KNN Algorithm Works?
92
Example
• We have data from a questionnaire survey (asking people’s opinion) and
objective testing, with two attributes (acid durability and strength),
to classify whether a special paper tissue is good or not. Here are four
training samples (an illustrative sketch follows below).
96
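Since the training table itself is not reproduced on the slide, the numbers below are illustrative. The sketch follows the kNN procedure described earlier (compute distances, take the k nearest, vote), with acid durability and strength as the two attributes:

import math
from collections import Counter

# Illustrative training samples: ((acid durability, strength), class)
train = [((7, 7), "bad"), ((7, 4), "bad"), ((3, 4), "good"), ((1, 4), "good")]

def knn_predict(query, train, k=3):
    # 1. Euclidean distance from the query to every labeled record
    dists = [(math.dist(query, x), label) for x, label in train]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(dists)[:k]
    # 3. Majority vote over their class labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((3, 7), train, k=3))   # 'good'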
Artificial Neural Network (ANN)
• A neural network has different layers connected to each other and is
modeled on the structure and functions of the human brain. It learns
from huge volumes of data and uses complex algorithms to train the
network.
• ANN structure
97
Artificial Neural Network (ANN)
• The figure (biological neuron vs. artificial neuron) shows how similar they are.
98
Application of ANN
• Handwriting recognition: used to convert handwritten characters into
digital characters that the system can recognize.
• Stock exchange prediction: an ANN can examine many factors and predict
prices on a daily basis, helping stock brokers.
• Image analysis: e.g., to identify the age of a person.
Future of ANN:
• More personalized choices for users and customers all over the world.
• Hyper-intelligent virtual assistants will make life easier.
• ANNs will be used in fields such as medicine, agriculture, etc.
99
How does an ANN work?
(Worked-example slides 100 to 103: figures not reproduced.)
103
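Because the worked figures are not reproduced, here is a hedged sketch of a single forward pass through a tiny 2-3-1 network with sigmoid activations; the weights are random placeholders, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.8])                        # input features
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output

h = sigmoid(W1 @ x + b1)        # hidden-layer activations (weighted sum + activation)
y_hat = sigmoid(W2 @ h + b2)    # network output, e.g., a class probability
print(y_hat)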
Example
(Slides 104 to 106: worked-example figures not reproduced.)
106
Loss Function
107
Gradient Descent
• Gradient descent is an iterative method for finding the minimum of a
function by repeatedly moving in the direction of the negative gradient.
• The plot of weight versus loss illustrates this: each step moves the
weight downhill on the loss curve.
108
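A minimal sketch of gradient descent on a one-weight squared-error loss; the data point, learning rate, and loss are illustrative choices, not from the slides.

# Minimize L(w) = (w*x - y)^2 by stepping against the gradient dL/dw
x, y = 2.0, 8.0          # a single (input, target) pair
w, lr = 0.0, 0.05        # initial weight and learning rate

for step in range(50):
    y_hat = w * x
    grad = 2 * (y_hat - y) * x    # dL/dw
    w -= lr * grad                # move downhill on the loss curve
print(w)                          # approaches y / x = 4.0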
Backpropagation
(Slides 109 to 111: figures not reproduced.)
111
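A hedged sketch (the figures are not reproduced) of backpropagation for a tiny 2-3-1 sigmoid network, combining the forward pass and gradient-descent steps above; all values are toy.

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x, t = np.array([0.5, 0.8]), 1.0                 # one input and its target
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
lr = 0.5

for epoch in range(500):
    # forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # backward pass: propagate the error of L = (y - t)^2 / 2 layer by layer
    delta_out = (y - t) * y * (1 - y)            # dL/d(net input of the output unit)
    delta_hid = (W2.T @ delta_out) * h * (1 - h) # dL/d(net input of each hidden unit)
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(y[0])                                      # moves toward the target 1.0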
Popular Neural Network
112
Feed Forward NN
113
Convolutional Neural Network (CNN)
• How does a CNN recognize an image, e.g., an image of a bird?
(Slides 114 to 119: figures not reproduced.)
119
Why Recurrent Neural Network?
(Slides 120 and 121: figures not reproduced.)
121
Application of Recurrent Neural Network
(Slides 122 to 125: figures not reproduced.)
125
Types of Recurrent Neural Networks
(Slides 126 and 127: figures not reproduced.)
127
Neural Network as a Classifier
Strength
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on an array of real-world data, e.g., hand-written letters
o Algorithms are inherently parallel
o Techniques have recently been developed for the extraction of rules from trained neural networks
Weakness
o Long training time
o Require a number of parameters typically best determined empirically, e.g., the network topology or “structure”
o Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
128
Quiz
129