
Machine Learning:

Supervised Learning

Department of Information
Technology
Mekdela Amba University
2024
Topics
• Introduction to Supervised Learning

• Classification
o Decision tree
o Bayes classification
o Bayesian Belief Networks
o SVM
o KNN
o ANN
o Ensemble
• Regression-continuous value prediction
• Time series data prediction
2
Supervised Learning
• A supervised scenario is characterized by the concept of
a teacher or supervisor, whose main task is to provide
the agent with a precise measure of its error (directly
comparable with output values)
• The goal is to infer a function or mapping from training
data that is labeled.
• The training data consist of input vectors X and the corresponding output labels (tags) Y.
• Based on the training set, the algorithm generalizes to respond correctly to all possible inputs; this is why it is also called learning from examples.

3
Supervised Learning

Dataset Example:
• Weather information for the last 14 days.
• Whether a match was played or not on each of those days.
• Then predict whether the game will happen or not if the weather condition is (Outlook = Rain, Humidity = High, Wind = Weak), using a supervised learning model.

4
Supervised Learning
• A data set is denoted in the form D = {(x_i, y_i)}
o where the inputs are x_i, the outputs are y_i, and i = 1 to N, with N the number of observations.
• Generalization: the algorithm should produce sensible outputs for inputs that were not encountered during learning.
Supervised learning is categorized into two types:
o Classification: data is classified into one of two or more classes
o Regression: the task of predicting a continuous quantity.

5
Classification
Classification:
o It is a systematic approach to building classification models
from an input data set.
o It is the task of assigning a new object to one of
several predefined categories.
• Examples of classification algorithms: decision tree
classifiers, rule-based classifiers, neural networks,
support vector machines, naive Bayes classifiers etc.
• Each technique employs a learning algorithm to identify
a model that best fits the relationship between the
attribute set and class label of the input data.

6
Classification
• A learned model should accurately predict the class labels of previously unseen records.
• Example:
o assigning a given email to the spam or non-spam category
o assigning a bank loan applicant as safe or risky for the bank
• Classification is a two-step process consisting of:
o a learning step (where a classification model is
constructed using training set), and
o a classification step (where the model is used to
predict class labels for given data)

7
Decision Tree(DT)
• Decision tree (DT) is a statistical model that builds classification models in the form of a tree structure.
• This model classifies data in a dataset by
flowing through a query structure from the
root until it reaches the leaf, which
represents one class.
• The root represents the attribute that plays
a main role in classification, and the leaf
represents the class.
o Given an input, at each node, a test is applied
and one of the branches is taken depending on
the outcome.
8
Decision Tree
• DT learning is supervised, because it constructs DT from class-
labeled training tuples.
• During the late 1970s and early 1980s, J. Ross Quinlan, a
researcher in machine learning, developed a decision tree
algorithm known as ID3 (Iterative Dichotomiser). Quinlan later
presented C4.5 (a successor of ID3), which became a benchmark
to which newer supervised learning algorithms are often
compared.
• The statistical measures used to select the attribute that best splits the dataset (in terms of the given classes) are information gain and gain ratio.
• Both measures have a close relationship with another concept
called entropy.

9
How does a decision tree work?
• The DT algorithm adopts a greedy divide-and-conquer approach:
o Attributes can be categorical (or continuous) values
o Tree is constructed in a top-down recursive manner/no backtracking
o At start, all the training examples are at the root
o Examples are partitioned recursively based on selected attributes
o Attributes are selected on the basis of an impurity function (e.g.,
information gain)
• Conditions for stopping partitioning
o All examples for a given node belong to the same class
o There are no remaining attributes for further partitioning – the majority class becomes the leaf (majority voting)
o There are no examples left to partition

10
How does a decision tree work?

11
How does a decision tree work?

12
How does a decision tree work?

13
How does a decision tree work?
GrowTree(TrainingData D)
    Partition(D);

Partition(Data D)
    if (all points in D belong to the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use the best split found to partition D into D1 and D2;
    Partition(D1);
    Partition(D2);
14
Choose an attribute to partition data
• The key to building a decision tree is deciding which attribute to choose in order to branch.
• The objective is to reduce impurity or uncertainty in data
as much as possible.
o A subset of data is pure if all instances belong to the same
class.
• The best attribute is selected for splitting the training
examples using a Goodness function.
o The best attribute:
 separates the classes of the training examples fastest, and
 yields the smallest tree

15
Decision Tree…
Entropy:
• It is the degree of randomness of the elements, i.e., a measure of impurity or uncertainty.
• This measure is based on Claude Shannon's work on information theory, which studied the value or "information content" of messages.
• Shannon entropy quantifies this uncertainty as the expected value of the information present in the message:

Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|, and m is the number of class labels

16
Decision Tree
• Suppose a set D contains a total of N examples, of which p are positive and n are negative. The entropy is given by:

Entropy(D) = -(p/N) log2(p/N) - (n/N) log2(n/N)

• Some useful properties of the entropy:
o Entropy(D) = 0 means that all the examples in D belong to the same class.
o Entropy(D) = 1 means that half the examples in D are of one class and half are of the opposite class.

17
Decision tree…
Information Gain:
• It is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on attribute A).
• Select the attribute with the highest information gain, i.e., the one that creates the smallest average disorder:
 First, compute the disorder using Entropy:
– the expected information needed to classify objects into
classes
 Second, measure the Information Gain
– calculate by how much the disorder of a set would reduce
by knowing the value of a particular attribute.
18
Decision Tree …
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:

Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

• Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

• Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)

19
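The gain computation above can be written as a short plain-Python sketch (illustrative only, not from the slides; the function names info, info_a and gain are chosen here for readability):

from collections import Counter
from math import log2

def info(labels):
    # Expected information (entropy) of a list of class labels: -sum p_i log2 p_i
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_a(attribute_values, labels):
    # Expected information after partitioning on one attribute, Info_A(D)
    n = len(labels)
    total = 0.0
    for v in set(attribute_values):
        subset = [y for x, y in zip(attribute_values, labels) if x == v]
        total += (len(subset) / n) * info(subset)
    return total

def gain(attribute_values, labels):
    # Information gained by branching on the attribute: Info(D) - Info_A(D)
    return info(labels) - info_a(attribute_values, labels)

For the buys_computer data on the next slide, gain(age_column, class_column) evaluates to about 0.246, matching the hand calculation.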
Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples)
 Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age       p_i   n_i   I(p_i, n_i)
<=30      2     3     0.971
31...40   4     0     0
>40       3     2     0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence:

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Training data:
age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
20
Computing Information-Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute
 Must determine the best split point for A
 Sort the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values is considered as a possible split point
 (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
 The point with the minimum expected information requirement for A is selected as the split point for A
21
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of values.
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem of information gain.
SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

 GainRatio(A) = Gain(A) / SplitInfo_A(D)
 Ex.

 gain_ratio(income) = 0.029/1.557 = 0.019

 The attribute with the maximum gain ratio is selected as the


splitting attribute
Gini Index (CART, IBM IntelligentMiner)

 If a data set D contains examples from n classes, the gini index, gini(D), is defined as

gini(D) = 1 - Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class j in D
 If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2)

 Reduction in impurity:

Δgini(A) = gini(D) - gini_A(D)

 The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
23
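As a minimal sketch (assuming a binary split into D1 and D2, as in CART; not from the slides), the Gini computations can be written as:

from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    # Weighted Gini index of a binary split of D into D1 and D2
    n = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / n) * gini(d1_labels) + (len(d2_labels) / n) * gini(d2_labels)

# Reduction in impurity for a candidate split:
# delta_gini = gini(d1_labels + d2_labels) - gini_split(d1_labels, d2_labels)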
Example-Decision tree
The problem of "Sunburn": You want to predict whether another person is likely to get sunburned if they go back to the beach.
How can you do this? Data collected: predict based on the observed properties of the people.
Name Hair Height Weight Lotion Result
Sarah Blonde Average Light No Sunburned
Dana Blonde Tall Average Yes None
Alex Brown Short Average Yes None
Annie Blonde Short Average No Sunburned
Emily Red Average Heavy No Sunburned
Pete Brown Tall Heavy No None
John Brown Average Heavy No None
Kate Blonde Short Light Yes None

24
Decision tree

25
Decision Tree

Test expected information for each attribute


Hair 0.50
height 0.69
weight 0.94
lotion 0.61

26
Decision tree
• Information gain of each attribute

Gain(hair) = 0.954 - 0.50 = 0.454


Gain(height) = 0.954 - 0.69 =0.264
Gain(weight) = 0.954 - 0.94 =0.014
Gain (lotion) = 0.954 - 0.61 =0.344
Which decision variable maximises the Info Gain?

27
The best decision tree?

(Partial decision tree after splitting on hair colour:)
  Hair colour = red → Sunburned
  Hair colour = brown → None
  Hair colour = blonde → ?
      Sunburned = Sarah, Annie
      None = Dana, Kate

• Once we have finished with hair colour we then need to


calculate the remaining branches of the decision tree.
• Which attributes is better to classify the remaining ?

28
The best Decision Tree
• This is the simplest and optimal one possible and
it makes a lot of sense.
• It classifies 4 of the people on just the hair colour
alone.
(Final decision tree:)
  Hair colour = red → Sunburned
  Hair colour = brown → None
  Hair colour = blonde → Lotion used?
      Lotion used = no → Sunburned
      Lotion used = yes → None

29
Decision Tree: Rule Extraction from Trees

 A decision tree does its own feature extraction.
 You can view a decision tree as an IF-THEN-ELSE statement which tells us whether someone will suffer from sunburn:

If (hair-colour = "red") then
    return (sunburned = yes)
else if (hair-colour = "blonde" and lotion-used = "no") then
    return (sunburned = yes)
else
    return (sunburned = no)

30
Avoid overfitting in classification
Overfitting: A tree may overfit the training data
o Good accuracy on training data but poor on test data
o Symptoms: tree too deep and too many branches, some may reflect
anomalies due to noise or outliers

Two approaches to avoid overfitting


• Pre-pruning: Halt tree construction early
o A tree is “pruned” by halting its construction early (e.g., by
deciding not to further split or partition the subset of training
tuples at a given node).
o Difficult to decide because we do not know what may happen
subsequently if we keep growing the tree.
o Upon halting, the node becomes a leaf. The leaf may hold the
most frequent class among the subset tuples or the probability
distribution of those tuples
31
Avoid overfitting in classification
• Training Set and Test Set errors for decision trees

32
Avoid overfitting in classification
• Post-pruning: Remove branches or sub-trees from a “fully
grown” tree.
• A sub-tree at a given node is pruned by removing its
branches and replacing it with a leaf.
• The leaf is labeled with the most frequent class among the
sub-tree being replaced.

 C4.5 uses a statistical method to estimate the errors at each node for pruning.

33
Avoid overfitting in classification
• An unpruned decision tree and a pruned version of it.

34
Decision Tree
• The benefits of having a decision tree are as follows:
o It does not require any domain knowledge.
o It is easy to comprehend.
o The learning and classification steps of a
decision tree are simple and fast.

35
Exercises
• Suppose that you have a free afternoon and you are thinking whether or not to go and play tennis. How would you decide?
o The goal is to predict when this player will play tennis.
o The following training data examples are prepared for the classifier.

36
Bayesian Classification
• Bayesian classifiers are statistical classifiers which combine prior knowledge of the classes with new evidence gathered from data.
• They can predict class membership probabilities
o the probability that a given tuple belongs to a particular class.
• For each new sample they provide a probability that the sample
belongs to a class (for all classes)
• Bayesian classification is based on Bayes’ theorem
Provides practical learning algorithms:
• Probabilistic learning: Calculate explicit probabilities for hypothesis. E.g. Naïve
Bayes
• Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes
theorem with strong (naïve) independence assumption between the features.

37
Bayes’ Theorem: Basics
• Let X be a data tuple and it is considered “Evidence” .
• It is described by measurements made on a set of n attributes.
• Let H be a hypothesis that X belongs to class C.
• P(H|X) is the posterior probability of H conditioned on X.
• For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000.
o Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.

• P(H) is the prior probability, or a priori probability, of H.

38
Bayes’ Theorem: Basics
• P(H) is the probability that any given customer will buy a
computer, regardless of age , income or any other information for
that matter.
• P(X) is the prior probability of X. Using our example, it is the
probability that a person from our set of customers is 35 years old
and earns $40,000.
• P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
• How are these probabilities estimated?
• Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(X), P(H), and P(X|H):

P(H|X) = P(X|H) × P(H) / P(X)        (Posterior = Likelihood × Prior / Evidence)
39
Bayes Classification
Example :
• A doctor knows that meningitis causes stiff neck 50% of
the time
• Prior probability of any patient having meningitis is
1 / 50,000
• Prior probability of any patient having stiff neck is 1 / 20
• If a patient has stiff neck, what’s the probability he/she
has meningitis?

40
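A worked solution, using Bayes' theorem and only the figures stated above:

P(meningitis | stiff neck) = P(stiff neck | meningitis) × P(meningitis) / P(stiff neck)
                           = 0.5 × (1/50,000) / (1/20)
                           = 0.0002

So even with a stiff neck, the probability of meningitis is only 0.02%.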
Bayes Classification
Example : A medical cancer diagnosis problem. There are 2 possible outcomes
of a diagnosis: +ve, -ve. We know 0.8% of world population has cancer. Test gives
correct +ve result 98% of the time and gives correct –ve result 97% of the time. If a
patient’s test returns +ve, should we diagnose the patient as having cancer?
Given
P(cancer) = 0.008 p(no-cancer) = 0.992
P(+ve|cancer) = 0.98 P(-ve|cancer) = 0.02
P(+ve|no-cancer) = 0.03 P(-ve|no-cancer) = 0.97
Using Bayes' formula:
o P(cancer|+ve) = P(+ve|cancer) × P(cancer) / P(+ve) = (0.98 × 0.008) / P(+ve) = 0.0078 / P(+ve)
o P(no-cancer|+ve) = P(+ve|no-cancer) × P(no-cancer) / P(+ve) = (0.03 × 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer.

41
General Bayes Theorem
• Consider each attribute & class label as random variables
• Given a record with attributes (A1, A2,…,An)
o Goal is to predict class C
o we want to find the value of C that maximizes P(C| A1, A2,…,An )

• Can we estimate P(C| A1, A2,…,An ) directly from data?


o Approach: compute the posterior probability P(C | A1, A2, …, An) for all values
of C using the Bayes theorem
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) × P(C) / P(A1, A2, …, An)

o Choose the value of C that maximizes P(C | A1, A2, …, An)
o Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) × P(C)
• How do we estimate P(A1, A2, …, An | C)?
42
Naïve Bayes Classifier
• Assume independence among attributes Ai when class is
given:
o P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
Can estimate P(Ai| Cj) for all Ai and Cj.
o A new point is classified to Cj if P(Cj) × Π_i P(Ai | Cj) is maximal:

C_NaiveBayes = argmax_j [ P(Cj) × Π_i P(Ai | Cj) ]
43
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
44
Naïve Bayes Classifier: An Example
 P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
 Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
 X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class ("buys_computer = yes")
(The training data is the same buys_computer table shown on the previous slide.)
45
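The calculation above can be reproduced with a small count-based sketch (illustrative only; the attribute layout follows the buys_computer table and, as in the slide, no smoothing is applied):

from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)                 # counts of each class C_i
    cond_counts = defaultdict(Counter)             # value counts per (attribute index, class)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond_counts[(i, c)][v] += 1
    return class_counts, cond_counts

def predict_nb(x, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                             # P(C_i)
        for i, v in enumerate(x):
            score *= cond_counts[(i, c)][v] / cc   # P(A_i = v | C_i)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

Calling predict_nb(("<=30", "medium", "yes", "fair"), ...) on a model trained from the 14 rows returns "yes" with a score of about 0.028, matching the hand calculation above.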
Example. ‘Play Tennis’ data
• Suppose that you have a free afternoon and you are thinking whether or not to go and play tennis. How would you decide?
Based on the following training data, predict when this player will play tennis.
Day Outlook Temperature Humidity Wind Play Tennis
Day1 Sunny Hot High Weak No
Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rain Mild High Weak Yes
Day5 Rain Cool Normal Weak Yes
Day6 Rain Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rain Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rain Mild High Strong No
46
Naive Bayes Classifier

 Given a training set, we can compute the


• Where, P(P) is the probability of playing
probabilities
P(P) = 9/14 tennis = Yes and P(n) is the probability of
P(N) = 5/14 playing tennis = No

Outlook P N Humidity P N
sunny 2/9 3/5 high 3/9 4/5
overcast 4/9 0 normal 6/9 1/5
rain 3/9 2/5
Tempreature Windy
hot 2/9 2/5 Strong 3/9 3/5
mild 4/9 2/5 Weak 6/9 2/5
cool 3/9 1/5

47
Play-tennis example
 Based on the model created, predict Play Tennis or Not for the following unseen sample:
(Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)

C_NB = argmax_{C in [yes, no]} P(C) Π_t P(a_t | C)
     = argmax_{C in [yes, no]} P(C) P(Outlook = sunny | C) P(Temp = cool | C) P(Hum = high | C) P(Wind = strong | C)

Working:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
 Answer: PlayTennis = no

• More example: What if the following test data is given:
X = <Outlook = rain, Temperature = hot, Humidity = high, Wind = weak>
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence, therefore loss of accuracy
 Practically, dependencies exist among variables
 E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
 Dependencies among these cannot be modeled by Naïve Bayes
Classifier
 How to deal with these dependencies? Bayesian Belief Networks

49
SVM—Support Vector Machines

 A classification method for both linear and nonlinear data.
 It transforms nonlinear training data into a higher dimension.
 In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary").
 SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

50
Non-linear SVM

51
Transformed into linear hyperplane

52
SVM—History and Applications

 Vapnik and colleagues (1992)—


groundwork from Vapnik & Chervonenkis’
statistical learning theory in 1960s
 Features: training can be slow but
accuracy is high owing to their ability to
model complex nonlinear decision
boundaries (margin maximization)
 Used for: classification (SVM) and numeric prediction (SVR)
 Applications:
53
SVM—General Philosophy

(Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors define the margin.)

54
SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the
class labels yi , each yi can take one of the two values (+1 or -1) corresponding to the classes
buys-computer = yes and buys-computer = no, respectively.
There are infinite lines (hyperplanes) separating the two classes but we want to find the best
one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)

55
Linear SVM: Separable Case

56
Linear SVM
• A two-dimensional training set
consisting of squares and
circles.
• A decision boundary that
bisects the training examples
into their respective classes is
illustrated with a solid line
• If we label all the squares as
class +1 and all the circles as
class -1, then we can predict
the class label y for any test
example z in the following way:
57
Margin of a Linear Classifier

58
Margin of a Linear Classifier

59
Why Is SVM Effective on High Dimensional
Data?
 The complexity of trained classifier is characterized by the #
of support vectors rather than the dimensionality of the data
 The support vectors are the essential or critical training
examples —they lie closest to the decision boundary
 If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
 The number of support vectors found can be used to compute
an (upper) bound on the expected error rate of the SVM
classifier, which is independent of the data dimensionality
 Thus, an SVM with a small number of support vectors can
have good generalization, even when the dimensionality of
the data is high
60
Support Vector Machines
(Figure: two candidate decision boundaries, B1 and B2, separating the same training data.)
 Which one is better? B1 or B2?


 How do you define better?

61
Support Vector Machines
(Figure: decision boundaries B1 and B2 with their margins b11–b12 and b21–b22; B1 has the wider margin.)
 Find the hyperplane that maximizes the margin => B1 is better than B2

62
Linear SVM: Nonseparable Case
 What if the problem is not linearly
separable?

63
Linear SVM: Nonseparable Case
 What if the problem is not linearly separable?
 Introduce slack variables ξ_i
 Need to minimize:

L(w) = ||w||² / 2 + C ( Σ_{i=1}^{N} ξ_i )^k

 Subject to:

w · x_i + b >= 1 - ξ_i    if y_i = +1
w · x_i + b <= -1 + ξ_i   if y_i = -1

 If k is 1 or 2, this leads to the same objective function as linear SVM but with different constraints
64
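A minimal soft-margin SVM sketch with scikit-learn (an assumption; the library is not mentioned in the slides). The parameter C plays the role of the penalty constant in the objective above: a small C allows a wider margin with more slack, a large C penalizes training errors more heavily.

from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy training tuples
y = [-1, -1, +1, +1]                   # class labels +1 / -1 as in the text

clf = SVC(kernel="linear", C=1.0)      # kernel="rbf" would give a nonlinear boundary
clf.fit(X, y)
print(clf.support_vectors_)            # the "essential" training tuples
print(clf.predict([[0.9, 0.2]]))       # predicted class for a new point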
Nonlinear Support Vector Machines
 What if decision boundary is not linear?
• The data set is generated in
such a way that all the
circles are clustered near the
center of the diagram and all
the squares are distributed
farther away from the
center.
• Instances of the data set can
be classified using the
following equation

65
Attribute Transformation

A nonlinear transformation Φ is needed to map the


data from its original feature space into a new
space where the decision boundary becomes linear

66
Learning a Nonlinear SVM Model

67
Ensemble Methods

68
Ensemble Methods

69
Ensemble Methods

70
Ensemble Methods

71
Ensemble Methods

72
Methods for Constructing an Ensemble Classifier
The ensemble of classifiers can be constructed in many
ways:
By manipulating the training set:
• In this approach, multiple training sets are created by
resampling the original data according to some
sampling distribution.
• A classifier is then built from each training set using a
particular learning algorithm.
• Bagging and Boosting are two examples of ensemble
methods that manipulate their training sets.

73
Methods for Constructing an Ensemble Classifier
By manipulating the input features:
• A subset of input features is chosen to form each training
set.
• The subset can be either chosen randomly or based on
the recommendation of domain experts.
• Random forest is an ensemble method that manipulates
its input features and uses decision trees as its base
classifiers

74
Methods for Constructing an Ensemble Classifier
By manipulating the learning algorithm
• Many learning algorithms can be manipulated in such a
way that applying the algorithm several times on the
same training data may result in different models.
• For example, an artificial neural network can produce
different models by changing its network topology or the
initial weights of the links between neurons.

75
Bagging(Bootstrap Aggregation)

76
Bagging
• It is a technique that repeatedly samples (with
replacement) from a data set.
• These samples are similar since they are all drawn from the same original data, but they are also slightly different due to chance.
• A learning algorithm is an unstable algorithm if small changes in the training set cause a large difference in the generated learner, i.e., the learning algorithm has high variance.
• Bagging improves generalization error by reducing the
variance of the base classifiers.

77
Bagging
• Assume that we have a training set:

• We generate, say, B = 3 datasets by bootstrapping:

78
Bagging
• The performance of bagging depends on the stability of
the base classifier
• Bagging uses bootstrap to generate n number of training
sets then trains n base-learners and then, during testing,
takes an average.
• Fit classification or regression models to bootstrap samples from the data and combine them by voting (classification) or averaging (regression).
79
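A minimal bagging sketch (assuming scikit-learn decision trees as the base learners and NumPy arrays with integer class labels; the names bagging_fit and bagging_predict are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, random_state=0):
    rng = np.random.default_rng(random_state)
    models, n = [], len(X)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)              # bootstrap sample: n rows drawn with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])  # one row of predictions per base classifier
    # majority vote per test point (for regression, average the predictions instead)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage: models = bagging_fit(np.array(X_train), np.array(y_train)); bagging_predict(models, np.array(X_test))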
Random Forest

80
Random Forest

In this random forest, two decision trees generate class B


then the output become class B
81
Random Forest
• Random forests can be built using bagging in tandem with random
selection of attributes and samples of datasets.
• It combines the predictions made by multiple decision trees (base-learner models), where each tree is generated based on the values of an independent set of random vectors.
• During classification, each tree votes and the most popular class is
returned.
• Random forests are comparable in accuracy to AdaBoost, yet are
more robust to errors and outliers
• For each tree grown on a bootstrap sample, the error rate for
observations left out of the bootstrap sample is called the out-of-bag
error rate.
• Overfitting is not a problem
82
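A minimal sketch with scikit-learn's RandomForestClassifier (an assumption; the toy data from make_classification simply stands in for a real training set). Setting oob_score=True reports the out-of-bag estimate mentioned above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)   # toy data stand-in
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)              # each tree is grown on a bootstrap sample with random attribute subsets at each split
print(rf.oob_score_)      # out-of-bag accuracy estimate
print(rf.predict(X[:5]))  # each tree votes; the most popular class is returned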
Random Forest
Random forest for Spam classification

83
Boosting
• Boosting is a process that uses a set of machine learning algorithms to combine weak learners into strong learners in order to increase the accuracy of the model.
How do boosting algorithms work?
• The basic principle behind boosting algorithms is to generate multiple weak learners and combine their predictions to form one strong rule.
Step 1: The base algorithm reads the data and assigns equal weight to each sample observation.
Step 2: False predictions are passed to the next base learner, with a higher weight placed on these incorrect predictions.
Step 3: Repeat Step 2 until the algorithm can correctly classify the output (see the sketch below).
84
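A minimal AdaBoost sketch with scikit-learn (an assumption; the toy data stands in for a real training set). The default weak learners are decision stumps, and misclassified samples receive higher weight in each round, as in the steps above:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)   # toy data stand-in
boost = AdaBoostClassifier(n_estimators=50, random_state=0)               # 50 boosting rounds
boost.fit(X, y)
print(boost.predict(X[:5]))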
Types of Boosting
1. Adaptive Boosting(AdaBoost)
o Which is similar the previous boosting concepts

85
Type of Boosting
2. Gradient Boosting

86
XGBoost

87
k-Nearest Neighbor Classification (kNN)
 KNN stores all available cases and classifies new cases based on a similarity measure.
 Unlike all the previous learning methods, kNN does not build a model from the training data; because of this it is called a lazy learner.
 To classify a test instance d, define its k-neighborhood.
 K in KNN is a parameter that refers to the number of nearest neighbors included in the majority voting process.
88
k-Nearest Neighbor Classification (kNN)
Unknown record  Requires three things
– The set of labeled records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve
 To classify an unknown record:
– Compute distance to other training
records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the class
label of unknown record (e.g., by
taking majority vote)
How do we choose K?

90
When do we use KNN Algorithms?

91
How does the KNN Algorithm Work?

92
Example
• We have data from a questionnaire survey (asking people's opinions) and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
7                                7                       Bad
7                                4                       Bad
3                                4                       Good
1                                4                       Good

• Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.
o Without undertaking another expensive survey, guess the goodness of the new tissue. Use squared Euclidean distance for the similarity measurement and K = 3.
93
Solution

X1 = Acid Durability   X2 = Strength   Squared distance to     Rank of minimum   Included in   Y = Category
(seconds)              (kg/m2)         query instance (3, 7)   distance          3-NNs?        of NN
7                      7               16                      3                 Yes           Bad
7                      4               25                      4                 No            -
3                      4               9                       1                 Yes           Good
1                      4               13                      2                 Yes           Good

• Use a simple majority of the categories of the nearest neighbors as the prediction value of the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 is included in the Good category.
94
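The example can be reproduced with a few lines of plain Python (an illustrative sketch using the four training samples and K = 3 from the slides):

from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)

def sq_dist(a, b):
    # squared Euclidean distance between two attribute vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

neighbors = sorted(train, key=lambda item: sq_dist(item[0], query))[:3]    # 3 nearest neighbors
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0] # majority vote
print(prediction)   # "Good" (two Good neighbors vs. one Bad)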
k-Nearest Neighbor Classification (kNN)

 kNN can deal with complex and arbitrary decision boundaries.
 Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and in many cases as accurate as more elaborate methods.
 kNN is slow at classification time.
 kNN does not produce an understandable model.
95
Artificial Neural Network (ANN)
• The study of artificial neural networks (ANN) was inspired by
attempts to simulate biological neural systems.
• The human brain consists primarily of nerve cells called neurons,
linked together with other neurons via strands of fiber called
axons.
• Analogous to human brain structure, an ANN is composed of an
interconnected assembly of nodes and directed links with Weights.

96
Artificial Neural Network (ANN)
• A neural network has different layers connected to each other and is modeled on the structure and functions of the human brain. It learns from huge volumes of data and uses complex algorithms to train a neural net.
• ANN structure

97
Artificial Neural Network (ANN)
• You can see how similar they are:

VS

Biological Neural Network Artificial neural Network

98
Application of ANN
• Handwriting recognition- used to convert handwritten
characters into digital characters that the system can recognize.
• Stock Exchange prediction- ANN can examine a lot of factors and
predict the prices on a daily basis helping the stock brokers.
• Image analysis - e.g., to identify the age of a person
Future of ANN:
• More personalized choices for users and customers all over the world
• Hyper-intelligent virtual assistants will make life easier
• ANN will be used in fields such as medicine, agriculture, etc.

99
How does ANN work?

100
How does ANN work?

101
How does ANN work?

102
How does ANN work?

103
Example

104
Example

105
Example

106
Loss Function

107
Gradient Descent
• Gradient descent is a method of finding the minimum of a function by iteratively stepping in the direction that decreases it.
• The graph plots weight versus loss.

108
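A minimal gradient descent sketch for a single weight (illustrative, not from the slides): the weight is repeatedly moved a small step against the gradient of the loss.

def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)   # step opposite to the gradient
    return w

# Example: loss L(w) = (w - 3)^2 has gradient 2*(w - 3); the minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))   # converges to about 3.0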
Backpropagation

109
BackPropagation

110
Backpropagation

111
Popular Neural Network

112
Feed Forward NN

113
Convolutional Neural Network(CNN)

114
Convolutional Neural Network(CNN)

115
How does CNN recognize an image?

116
Convolutional Neural Network(CNN)
• How does CNN recognize an image of a bird?

117
Convolutional Neural Network(CNN)
• How does CNN recognize an image of a bird?

118
Convolutional Neural Network(CNN)

119
Why Recurrent Neural Network?

120
Why Recurrent Neural Network?

121
Application of Recurrent Neural Network

122
Application of Recurrent Neural Network

123
Application of Recurrent Neural Network

124
Application of Recurrent Neural Network

125
Types of Recurrent Neural Networks

126
Types of Recurrent Neural Networks

127
Neural Network as a Classifier
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on an array of real-world data, e.g., hand-written letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from
trained neural networks
 Weakness
 Long training time
 Require a number of parameters typically best determined empirically,
e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic meaning behind
the learned weights and of “hidden units” in the network

128
Quiz

129
