AI Chapter 3 Part 3

The document discusses different machine learning algorithms, including support vector machines (SVM), ensemble methods, random forests, and k-nearest neighbors (KNN). SVM searches for the optimal separating hyperplane between classes, while ensemble methods such as bagging and boosting combine multiple models to improve performance. Random forests build decision trees on randomly selected subsets of features and data.

Artificial Intelligence

Institute of Technology
University of Gondar
Biomedical Engineering Department

By Ewunate Assaye (MSc.)


Supervised Learning

Outline:
» SVM

» Ensemble Methods

» Random Forest

» KNN

SVM—Support Vector Machines

» Classification method for both linear and nonlinear data.


» Transforms nonlinear training data into a higher dimension
» In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary")
» SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

Non-linear SVM

Transformed into linear hyperplane

SVM—History and Applications

» Vapnik and colleagues (1992)—groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s

» Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)

» Used for: classification (SVM) and numeric prediction (SVR)

» Applications:

  o handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

SVM—General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors define the margins]
SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple with associated class label yi. Each yi can take one of two values, +1 or -1, corresponding to the classes buys-computer = yes and buys-computer = no, respectively.

There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).

SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
Linear SVM: Separable Case

» A linear SVM is a classifier that searches for a hyperplane with the largest margin

Linear decision boundary

» Consider a binary classification problem consisting of N training examples

» Each example is denoted by a tuple (xi, yi), i = 1, 2, …, N

  ✓ where xi = (xi1, xi2, …, xid)^T corresponds to the attribute set of the ith example. By convention, let yi ∈ {-1, 1} denote its class label.

» The decision boundary of a linear classifier can be written in the following form:

  ✓ w · x + b = 0

» where w and b are parameters of the model


Linear SVM

» A two-dimensional training set consisting of squares and circles.

» A decision boundary that bisects the training examples into their respective classes is illustrated with a solid line.

» If we label all the squares as class +1 and all the circles as class -1, then we can predict the class label y for any test example z in the following way:

  ✓ y = +1 if w · z + b > 0, and y = -1 if w · z + b < 0
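» As a small illustration (not part of the original slides), the sign rule above can be coded directly; the values of w and b below are hypothetical placeholders.

```python
import numpy as np

w = np.array([1.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias term

def predict(z):
    """Return +1 (square) or -1 (circle) depending on the sign of w . z + b."""
    return 1 if np.dot(w, z) + b > 0 else -1

print(predict(np.array([2.0, 0.5])))   # +1
print(predict(np.array([0.0, 2.0])))   # -1
```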
Margin of a Linear Classifier

» The margin of the decision boundary is given by the distance between two parallel hyperplanes

» We can rescale the parameters w and b of the decision boundary so that the two parallel hyperplanes bi1 and bi2 can be expressed as follows:

  o bi1 : w · x + b = 1

  o bi2 : w · x + b = -1

» To compute the margin, let x1 be a data point located on bi1 and x2 be a data point located on bi2

» By substituting these points into the two equations, the margin d can be computed by subtracting the second equation from the first:

  o w · (x1 - x2) = 2, so d = 2 / ||w||

Learning a linear SVM model

» The training phase of SVM involves estimating the parameters w and b of the decision boundary from the training data.

» The parameters must be chosen in such a way that the following two conditions are met:

  o w · xi + b ≥ 1 if yi = 1

  o w · xi + b ≤ -1 if yi = -1

» Together these conditions can be written as yi(w · xi + b) ≥ 1; maximizing the margin is then equivalent to minimizing ||w||² / 2 subject to these constraints.
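» A minimal sketch (not the lecture's own code) of how these quantities can be obtained in practice with scikit-learn; the toy data below is an assumed, linearly separable example, and a very large C approximates the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: class +1 on the right, class -1 on the left (illustrative values)
X = np.array([[2.0, 1.0], [3.0, 2.0], [3.0, 0.5],
              [-2.0, 1.0], [-3.0, 0.0], [-2.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]                   # parameters of w . x + b = 0
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)   # distance between w.x+b = +1 and w.x+b = -1

print("w =", w, " b =", b)
print("margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```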
Support Vector Machines
[Figure: two candidate decision boundaries, B1 and B2, that both separate the training set]

» Which one is better? B1 or B2?

» How do you define better?
Support Vector Machines
[Figure: decision boundaries B1 and B2 with their margins, bounded by the parallel hyperplanes b11, b12 and b21, b22]

» Find the hyperplane that maximizes the margin => B1 is better than B2


Linear SVM: Nonseparable Case

» What if the problem is not linearly separable?

Linear SVM: Nonseparable Case

» What if the problem is not linearly separable?


o Introduce slack variables ξi

  ✓ Need to minimize:

      L(w) = ||w||² / 2 + C (Σ_{i=1}^{N} ξi)^k

  ✓ Subject to:

      w · xi + b ≥ 1 - ξi   if yi = 1
      w · xi + b ≤ -1 + ξi  if yi = -1

  ✓ If k is 1 or 2, this leads to the same objective function as the linear SVM but with different constraints
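» A hedged sketch of the role of the penalty term C: it weights the slack sum, so a small C tolerates more margin violations while a large C penalizes them heavily. The overlapping data and the C values below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes: not linearly separable
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors={clf.n_support_.sum()}, "
          f"training accuracy={clf.score(X, y):.2f}")
```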

Nonlinear Support Vector Machines

» What if the decision boundary is not linear?

  o The data set is generated in such a way that all the circles are clustered near the center of the diagram and all the squares are distributed farther away from the center.

  o Instances of the data set can be classified by an equation of their distance from the center, so the true decision boundary is a circle rather than a straight line.

Attribute Transformation

A nonlinear transformation Φ is needed to map the data from its original feature
space into a new space where the decision boundary becomes linear
Learning a Nonlinear SVM Model

» A nonlinear mapping Φ(x) is used to transform a given data set; after the transformation, we need to construct a linear decision boundary that separates the instances into their respective classes.

» The linear decision boundary in the transformed space has the following form:

  o w · Φ(x) + b = 0

» The learning task for a nonlinear SVM can be formalized as the following optimization problem:

  o minimize ||w||² / 2 subject to yi (w · Φ(xi) + b) ≥ 1 for all i
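» In software the mapping Φ is usually applied implicitly through a kernel function. A minimal sketch follows; the data set (which mimics the circles-near-the-center example) and the hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class near the center, the other farther away, as in the slide's example
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = SVC(kernel="linear").fit(X_train, y_train)           # no good linear boundary
rbf_clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)  # implicit nonlinear mapping

print("linear kernel accuracy:", linear_clf.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_clf.score(X_test, y_test))
```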
Ensemble Methods

Methods for Constructing an Ensemble Classifier

» The ensemble of classifiers can be constructed in many ways:

By manipulating the training set:

» In this approach, multiple training sets are created by resampling the original data
according to some sampling distribution.

» A classifier is then built from each training set using a particular learning algorithm.

» Bagging and Boosting are two examples of ensemble methods that manipulate their
training sets.
Methods for Constructing an Ensemble Classifier

By manipulating the input features:

» A subset of input features is chosen to form each training set.

» The subset can be either chosen randomly or based on the recommendation of domain experts.

» Random forest is an ensemble method that manipulates its input features and uses decision trees as its base classifiers.
Methods for Constructing an Ensemble Classifier

By manipulating the learning algorithm

» Many learning algorithms can be manipulated in such a way that applying the
algorithm several times on the same training data may result in different models.

» For example, an artificial neural network can produce different models by changing its network topology or the initial weights of the links between neurons.
Bagging (Bootstrap Aggregation)

» Bagging is a technique that repeatedly samples (with replacement) from a data set.

» These samples are similar since they are all drawn from the same original data, but they also differ slightly due to chance.

» A learning algorithm is unstable if small changes in the training set cause large differences in the generated learner; that is, the learning algorithm has high variance.

» Bagging improves generalization error by reducing the variance of the base classifiers.
Bagging

» Assume that we have a training set.

» We generate, say, B = 3 data sets by bootstrapping (sampling with replacement), as sketched below:
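» A minimal sketch of the bootstrapping step, using an assumed toy training set of eight values:

```python
import numpy as np

rng = np.random.default_rng(42)
training_set = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])  # assumed toy data

B = 3
for b in range(B):
    # Sample indices with replacement: each bootstrap set has the same size as the original
    idx = rng.integers(0, len(training_set), size=len(training_set))
    print(f"bootstrap sample {b + 1}:", training_set[idx])
```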

Bagging

» The performance of bagging depends on the stability of the base classifier.

» Bagging uses the bootstrap to generate n training sets, trains n base learners, and then combines their outputs during testing.

» In other words: fit classification or regression models to bootstrap samples from the data and combine them by voting (classification) or averaging (regression).
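» A hedged sketch of bagging in scikit-learn; the breast-cancer data set is an illustrative assumption, and BaggingClassifier combines its trees by voting (its default base estimator is a decision tree).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0)  # default base: decision tree

print("single tree  accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```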
Random Forest

In this random forest, two of the decision trees predict class B, so by majority vote the ensemble output becomes class B.
Random Forest

» Random forests can be built using bagging in tandem with random selection of attributes: each tree is grown on a bootstrap sample of the data using a random subset of the features.

» A random forest combines the predictions made by multiple decision trees (base learner models), where each tree is generated based on the values of an independent set of random vectors.

» During classification, each tree votes and the most popular class is returned.

» Random forests are comparable in accuracy to AdaBoost, yet are more robust to errors and outliers.

» For each tree grown on a bootstrap sample, the error rate on the observations left out of that bootstrap sample is called the out-of-bag (OOB) error rate.

» Adding more trees does not cause a random forest to overfit, so overfitting is not a serious problem in practice.
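» A minimal sketch of a random forest with the out-of-bag estimate mentioned above; the data set and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # random subset of features considered at each split
    oob_score=True,        # evaluate each tree on the samples it did not see
    random_state=0,
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("OOB error   :", 1.0 - forest.oob_score_)
```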


Random Forest

[Figure: random forest applied to spam classification]
Boosting

» Boosting is a process that uses a set of machine learning algorithms to combine weak learners into a strong learner in order to increase the accuracy of the model.

How do boosting algorithms work?

» The basic principle behind boosting algorithms is to generate multiple weak learners and combine their predictions to form one strong rule.

Step 1: The base algorithm reads the data and assigns an equal weight to each observation.

Step 2: Incorrectly predicted observations are passed to the next base learner with higher weights on these incorrect predictions.

Step 3: Repeat Step 2 until the ensemble can classify the training data with sufficient accuracy (a sketch follows below).
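» A hedged sketch of this procedure with AdaBoost in scikit-learn; the data set and the number of estimators are illustrative assumptions, and the default base learner is a depth-1 decision tree (a stump).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)                     # a single weak learner
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)  # 100 reweighted stumps

print("single stump accuracy:", cross_val_score(stump, X, y, cv=5).mean())
print("AdaBoost accuracy    :", cross_val_score(boosted, X, y, cv=5).mean())
```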
Types of Boosting

1. Adaptive Boosting (AdaBoost)

   o AdaBoost follows the general boosting procedure described above, increasing the weights of misclassified observations at each round.

Types of Boosting

2. Gradient Boosting

   o Each new learner is fit to the residual errors of the current ensemble; XGBoost is a widely used gradient boosting implementation.
k-Nearest Neighbor Classification (kNN)

» KNN stores all available cases and classifies new cases based on a similarity measure.
» Unlike all the previous learning methods, kNN does not build a model from the training data; for this reason it is called a lazy learner.
» To classify a test instance d, its k-neighborhood (the k closest training instances) is examined.
» K in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process.

k-Nearest Neighbor Classification (kNN)

» Requires three things:

  – The set of labeled records

  – A distance metric to compute the distance between records

  – The value of k, the number of nearest neighbors to retrieve

» To classify an unknown record (see the sketch below):

  – Compute its distance to the training records

  – Identify the k nearest neighbors

  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
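» A minimal sketch of this procedure with scikit-learn; the iris data set and k = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)   # "training" only stores the labeled records

print("test accuracy:", knn.score(X_test, y_test))
```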
How do we choose K?
When do we use the KNN algorithm?
How does the KNN algorithm work?
Example

» We have data from a questionnaire survey (asking people's opinions) and objective testing with two attributes (acid durability and strength) to classify whether a special paper tissue is good or not. Here are four training samples.

  X1 = Acid Durability (seconds)   X2 = Strength (kg/m2)   Y = Classification
                7                            7                    Bad
                7                            4                    Bad
                3                            4                    Good
                1                            4                    Good

» Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7.

  o Without undertaking another expensive survey, can we guess the quality of the new tissue? Use the squared Euclidean distance as the similarity measure and K = 3.
Solution

  X1 = Acid Durability   X2 = Strength   Squared distance to       Rank (by      Included in   Y = Category
  (seconds)              (kg/m2)         query instance (3, 7)     min. dist.)   3-NN?         of NN

      7                      7           (7-3)² + (7-7)² = 16          3             Yes           Bad

      7                      4           (7-3)² + (4-7)² = 25          4             No            -

      3                      4           (3-3)² + (4-7)² =  9          1             Yes           Good

      1                      4           (1-3)² + (4-7)² = 13          2             Yes           Good

» Use a simple majority of the categories of the nearest neighbors as the predicted value for the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 belongs to the Good category.
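» The same result can be reproduced with a few lines of NumPy (a sketch using exactly the four training samples and the query (3, 7) from the example above):

```python
import numpy as np

X = np.array([[7, 7], [7, 4], [3, 4], [1, 4]])   # training samples from the table
y = np.array(["Bad", "Bad", "Good", "Good"])     # their labels
query = np.array([3, 7])
K = 3

sq_dist = ((X - query) ** 2).sum(axis=1)   # [16, 25, 9, 13]
nearest = np.argsort(sq_dist)[:K]          # indices of the 3 closest samples
labels, counts = np.unique(y[nearest], return_counts=True)

print("squared distances:", sq_dist)
print("3-NN labels      :", y[nearest])
print("prediction       :", labels[np.argmax(counts)])   # Good
```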
k-Nearest Neighbor Classification (kNN)

» kNN can deal with complex and arbitrary decision boundaries.
» Despite its simplicity, researchers have shown that the classification accuracy of kNN can be quite strong and, in many cases, as accurate as more elaborate methods.
» kNN is slow at classification time.
» kNN does not produce an understandable model.

Assignment 2

1. Write Python implementations of SVM, clustering, and value-based machine learning methods.

Submit via [email protected] before July 12, 2022


Quiz 3
