
Module 6

Learning from Examples


Forms of Learning – Dimensionality reduction - Regression – Statistical Methods:
Naïve Bayes, Nearest Neighbor, Decision Trees – Random Forest, Clustering,
Ensemble Learning, Case studies – Machine Learning in Signal Processing,
Intelligent Antenna.
Machine learning
• Machine learning is analogous to human learning from past experiences.
• A computer does not have “experiences”.
• A computer system learns from data, which represent some “past experiences” of an application domain.
• Machine learning is about building machines that can adapt and learn from experience without being explicitly programmed.
• The goal is to learn a target function that can be used to predict the values of a class attribute.
Machine learning
In machine learning,
• There is a learning algorithm.
• Data, called the training data set, is fed to the learning algorithm.
• The learning algorithm draws inferences (conclusions) from the training data set.
• It generates a model, which is a function that maps inputs to outputs.
Machine learning Algorithms
There are three types of machine learning algorithms.
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised vs. Unsupervised Learning
• Supervised learning: Classification is seen as supervised learning from
examples.
• Supervision: The data (observations, measurements, etc.) are labeled with pre-
defined classes.
• Test data are classified into these classes.
• Unsupervised learning (clustering)
• Class labels of the data are unknown
• Given a set of data, the task is to establish the existence of classes or clusters in
the data

Supervised Learning

In this type of machine learning algorithm,


• The training data set is a labeled data set.
• In other words, the training data set contains the input value (X) and
target value (Y).
• The learning algorithm generates a model.
• Then, a new data set consisting of only the input values is fed to the model.
• The model then generates the target values based on its learning.
Supervised learning process: Two steps
 Learning (training): Learn a model using the training data
 Testing: Test the model using unseen test data to assess the model
accuracy

Accuracy = Number of correct classifications / Total number of test cases
Application - 1
• A credit card company receives thousands of applications for new cards.
Each application contains information about an applicant,
• age
• Marital status
• annual salary
• outstanding debts
• credit rating
• etc.
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: Accuracy
An example: loan application data, with each record labeled as approved or not.
Application - 2
• An emergency room in a hospital measures 17 variables (e.g.,
blood pressure, age, etc) of newly admitted patients.
• A decision is needed: whether to put a new patient in an intensive-
care unit.
• Due to the high cost of ICU, those patients who may survive less
than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate them from
low-risk patients.
Types of Supervised Learning Algorithm

• There are two types of supervised learning algorithm.


• Classification
• Regression
Unsupervised Learning

• In this type of machine learning algorithm,


• The training data set is an unlabeled data set.
• In other words, the training data set contains only the input value (X)
and not the target value (Y).
• Based on the similarity between data, it tries to draw inference from
the data such as finding patterns or clusters.
Overview of Classification

• The primary task performed by classifiers is to assign class labels to new observations.
• Most classification methods are supervised: they start with a training set of pre-labeled observations and learn how the attributes of those observations contribute to the classification of future unlabeled observations.
• For example, existing marketing, sales, and customer demographic data
can be used to develop a classifier to assign a “purchase” or “no
purchase” label to potential future customers.
• Classification is widely used for prediction purposes.
• Classification can help health care professionals diagnose heart disease patients.
• Based on an e-mail’s content, e-mail providers also use classification to decide
whether the incoming e-mail messages are spam.
• The two fundamental classification methods are Decision trees and Naive
Bayes.
Decision Trees

• A decision tree (also called prediction tree) uses a tree structure to specify
sequences of decisions and consequences.
• Given input X = {x1,x2,….xn}, the goal is to predict a response or output
variable ‘Y’. Each member of the set {x1,x2,….xn} is called an input
variable.
• The prediction can be achieved by constructing a decision tree with test
points and branches.
• At each test point, a decision is made to pick a specific branch and
traverse down the tree. Eventually, a final point is reached, and a
prediction can be made.
• Each test point in a decision tree involves testing a particular input
variable (or attribute), and each branch represents the decision being
made.
• Due to its flexibility and easy visualization, decision trees are commonly
deployed in data mining applications for classification purposes.
Decision Trees
• The input values of a decision tree can be categorical or continuous.
• A decision tree employs a structure of test points (called nodes) and branches,
which represent the decision being made.
• A node without further branches is called a leaf node. The leaf nodes return
class labels and, in some implementations, they return the probability scores.
• A decision tree can be converted into a set of decision rules.
• For example, income and mortgage_amount are input variables, and the
response is the output variable default with a probability score.
IF income < $50,000 AND mortgage_amount > $100K THEN default =
True WITH PROBABILITY 75%
• Decision trees can be easily represented in a visual way, and the corresponding
decision rules are quite straightforward.
• Additionally, because the result is a series of logical if-then statements, there is
no underlying assumption of a linear (or nonlinear) relationship between the
input variables and the response variable.
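To make the if-then idea concrete, below is a minimal sketch (not from the slides) that fits a small decision tree with scikit-learn on a made-up income/mortgage dataset and prints the learned rules; the feature values, labels, and thresholds are hypothetical.

```python
# A minimal sketch: learning if-then rules like the example above with
# scikit-learn's DecisionTreeClassifier on a tiny, made-up dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [income, mortgage_amount], label 1 = default, 0 = no default
X = [
    [30000, 120000], [45000, 150000], [40000, 110000],   # lower income, higher mortgage
    [80000, 90000],  [95000, 60000],  [70000, 40000],    # higher income, lower mortgage
]
y = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned tree as human-readable decision rules.
print(export_text(clf, feature_names=["income", "mortgage_amount"]))

# Predict for a new applicant: income $48,000, mortgage $105,000.
print(clf.predict([[48000, 105000]]))        # class label
print(clf.predict_proba([[48000, 105000]]))  # probability scores at the leaf
```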
Decision Trees
• Decision trees have two varieties:
• Classification trees
• Regression trees
• Classification trees are usually applied to output variables that are
categorical—often binary—in nature, such as yes or no, purchase or
not purchase, and so on.
• Regression trees are applied to output variables that are numeric or
continuous, such as the predicted price of a consumer good or the
likelihood that a subscription will be purchased.
Overview of a Decision Tree
• The following diagram shows an example of using a decision tree to
predict whether customers will buy a product.
• The term branch refers to the outcome of a decision and is
visualized as a line connecting two nodes. If a decision is numerical,
the “greater than” branch is usually placed on the right, and the
“less than” branch is placed on the left.
• Depending on the nature of the variable, one of the branches may
need to include an “equal to” component.
Example of a Decision Tree – Will a Customer Buy a Product?

In the example, the root node splits into two branches with a Gender test.
The right branch contains all those records with the variable Gender equal to Male, and the left
branch contains all those records with the variable Gender equal to Female to create the depth 1
internal nodes.
Each internal node effectively acts as the root of a sub-tree, and a test for each node is determined
independently of the other internal nodes.
The left-hand side (LHS) internal node splits on a question based on the Income variable to create
leaf nodes at depth 2, whereas the right-hand side (RHS) splits on a question on the Age variable.
The decision tree shows that females with income less than or equal to $45,000 and males 40 years
old or younger are classified as people who would purchase the product. In traversing this tree, age
does not matter for females, and income does not matter for males.
Decision Tree
• Internal nodes are the decision or test points. Each internal node refers to an
input variable or an attribute. The top internal node is called the root.
• The decision tree in the above example is a binary tree in that each internal
node has no more than two branches. The branching of a node is referred to as
a split.
• Sometimes decision trees may have more than two branches stemming from a
node. For example, if an input variable Weather is categorical and has three
choices— Sunny, Rainy, and Snowy— the corresponding node Weather in the
decision tree may have three branches labeled as Sunny, Rainy, and Snowy,
respectively.
• The depth of a node is the minimum number of steps required to reach the
node from the root. In the above example, nodes Income and Age have a depth
of one, and the four nodes on the bottom of the tree have a depth of two.
• Leaf nodes are at the end of the last branches on the tree. They represent class
labels—the outcome of all the prior decisions. The path from the root to a leaf
node contains a series of decisions made at various internal nodes.
Applications
• Decision trees are used to classify animals (for example, cold-blooded or warm-blooded,
mammal or not mammal).
• Another example is a checklist of symptoms during a doctor’s evaluation of a
patient.
• The artificial intelligence engine of a video game commonly uses decision trees
to control the autonomous actions of a character in response to various
scenarios.
• Retailers can use decision trees to segment customers or predict response rates
to marketing and promotions.
• Financial institutions can use decision trees to help decide if a loan application
should be approved or denied. In the case of loan approval, computers can use
the logical if-then statements to predict whether the customer will default on
the loan.
• For customers with a clear (strong) outcome, no human interaction is required;
for observations that may not generate a clear response, a human is needed for
the decision.
Construction – General Algorithm

• The objective of a decision tree algorithm is to construct a tree T from a training set S.
• If all the records in S belong to some class C, or if S is sufficiently pure (greater than a preset threshold), then
that node is considered a leaf node and assigned the label C.
• The purity of a node is defined as the probability that a record in the node belongs to the corresponding class.
• If not all the records in S belong to class C or if S is not sufficiently pure, the algorithm selects the next most
informative attribute A, and partitions S according to A‘s values.
• The algorithm constructs sub-trees T1,T2….. for the subsets of S recursively until one of the following criteria
is met:
• All the leaf nodes in the tree satisfy the minimum purity threshold.
• The tree cannot be further split with the preset minimum purity threshold.
• Any other stopping criterion is satisfied (such as the maximum depth of the tree).
• The first step in constructing a decision tree is to choose the most informative attribute. A common way to
identify the most informative attribute is to use entropy-based methods, which are used by decision tree
learning algorithms such as ID3 (or Iterative Dichotomiser 3).
• The entropy methods select the most informative attribute based on two basic measures:
• Entropy, which measures the impurity of an attribute
• Information gain, which measures the purity of an attribute
• At each split, the decision tree algorithm picks the most informative attribute out of the remaining attributes.
• The extent to which an attribute is informative is determined by measures such as entropy and information
gain.
Construction – General Algorithm
• Given a class X and its labels x ϵ X, let p(x) be the probability of label x and H(X) be
the entropy of X. H(X) is defined as

H(X) = − Σ_{x ϵ X} p(x) log2 p(x)
• Entropy H(X) is zero when p(x) is either zero or one.


• For a binary classification (true or false), H(X) is zero if the probability p(x) of
each label x is either zero or one.
• On the other hand, H(X) achieves the maximum entropy when all the
class labels are equally probable.
• For a binary classification, H(X) = 1, if the probability of all class labels is
50/50.
• The maximum entropy increases as the number of possible outcomes
increases.
Example
• As an example of a binary random variable, consider tossing a coin with
known probabilities of coming up heads or tails.
• Let x=1 represent ‘head’ and x=0 represent ‘tail’.
• The entropy of the unknown result of the next toss is maximized when the
coin is fair.
• That is, when heads and tails have equal probability, i.e p(x=1) = p(x=0) = 0.5,
the entropy is calculated as
H(X) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1.
• When the probability of tossing a head is equal to 0 or 1, the entropy is
minimized to 0.
• Therefore, the entropy for a completely pure variable is 0 and is 1 for a set
with equal occurrences for both the classes (head and tail, or yes and no).
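The entropy calculation in the coin example can be reproduced in a few lines of Python; the helper below is a small sketch assuming entropy is measured in bits (log base 2), as in the formula above.

```python
# Computes H(X) = -sum p(x) log2 p(x) for the coin example.
import math

def entropy(probabilities):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin -> 1.0 (maximum for two outcomes)
print(entropy([1.0, 0.0]))   # completely pure variable -> 0.0
print(entropy([0.9, 0.1]))   # skewed coin -> about 0.47
```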
Naïve Bayes Classifier
• Naive Bayes is a probabilistic classification method based on Bayes’ theorem (or Bayes’ law).
• Bayes’ theorem gives the relationship between the probabilities of two events and their
conditional probabilities.
• A naive Bayes classifier assumes that the presence or absence of a particular feature of a class is
unrelated to the presence or absence of other features.
• For example, an object can be classified based on its attributes such as shape, color, and weight.
• A classification for an object that is spherical, yellow, and less than 60 grams in weight may be a
tennis ball.
• Even if these features depend on each other or upon the existence of the other features, a
naïve Bayes classifier considers all these properties to contribute independently to the
probability that the object is a tennis ball.
• The input variables are generally categorical, but variations of the algorithm can accept
continuous variables.
• There are also ways to convert continuous variables into categorical ones.
• This process is often referred to as the discretization of continuous variables.
• In the tennis ball example, a continuous variable such as weight can be grouped into intervals to
be converted into a categorical variable.
Applications
• Naive Bayes classifiers are easy to implement and can execute efficiently
even without prior knowledge of the data.
• They are among the most popular algorithms for classifying text
documents.
• Spam filtering is a classic use case of naïve Bayes text classification.
• Bayesian spam filtering has become a popular mechanism to distinguish
spam e-mail from legitimate e-mail.
• Naive Bayes classifiers can also be used for fraud detection.
• In the domain of auto insurance, for example, based on a training set with
attributes such as driver’s rating, vehicle age, vehicle price, historical
claims by the policy holder, police report status, and claim genuineness,
Naive Bayes can provide probability-based classification of whether a new
claim is genuine.
Bayes Theorem
• The conditional probability of event C occurring, given that event A
has already occurred, is denoted as P(C|A), which can be found using
the formula

P(C|A) = P(A|C) P(C) / P(A)

where C is the class label, C ϵ {C1, C2, …, Cn}, and A is the set of observed
attributes, A = {A1, A2, …, An}.
• Bayes’ theorem relates the prior probabilities P(C) and P(A) to the conditional
probabilities of C given A and of A given C, namely P(C|A) and P(A|C).
Naïve Bayes Classifier
Prior Probability:
P(PlayTennis = Yes) = 9/14 = 0.643
P(PlayTennis = No) = 5/14 = 0.357

Conditional Probabilities:
Naïve Bayes Classifier

P(No) > P(Yes)


Hence the New Instance can be classified as No
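As an illustration only, the sketch below applies the naive Bayes decision rule (prior times the product of class-conditional probabilities; the largest score wins). The priors match the slide above, but the conditional probability values are placeholders, since the full PlayTennis tables are not reproduced here.

```python
# Score each class C by P(C) * product of P(A_i | C) and pick the largest.
priors = {"Yes": 9/14, "No": 5/14}

# conditionals[class][attribute value] = P(attribute value | class)  (placeholder values)
conditionals = {
    "Yes": {"Outlook=Sunny": 2/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Wind=Strong": 3/5},
}

def naive_bayes_score(cls, observed_attributes):
    score = priors[cls]
    for attr in observed_attributes:
        score *= conditionals[cls][attr]
    return score  # proportional to P(cls | attributes); P(A) cancels when comparing classes

new_instance = ["Outlook=Sunny", "Wind=Strong"]
scores = {c: naive_bayes_score(c, new_instance) for c in priors}
print(scores, "->", max(scores, key=scores.get))   # the class with the larger score
```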
Naïve Bayes Classifier
Solution:
Use the m-estimate of probabilities:
P = (n_c + m·p) / (n + m)
where
n = total number of training examples of the class,
n_c = number of those examples that have the given attribute value,
p = prior estimate of the probability,
m = equivalent sample size (a constant).
In the absence of other information, assume a
uniform prior: p = 1/k,
where k is the number of values that the
attribute can take.
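A small sketch of the m-estimate as described above, assuming the standard form (n_c + m·p) / (n + m); the counts passed in are hypothetical.

```python
def m_estimate(n_c, n, m, k):
    """n_c: count of class examples with the attribute value,
    n: total examples in the class, m: equivalent sample size,
    k: number of values the attribute can take (uniform prior p = 1/k)."""
    p = 1.0 / k
    return (n_c + m * p) / (n + m)

# Even when an attribute value never occurs in a class (n_c = 0),
# the estimate stays non-zero, avoiding a zero product in naive Bayes.
print(m_estimate(n_c=0, n=5, m=3, k=3))  # 0.125 instead of 0
```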
Naïve Bayes Classifier
K-Nearest Neighbour Classifier (KNN)
• The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning
method.
• The K-NN algorithm assumes similarity between the new case and the
available cases and puts the new case into the category that is most similar to
the available categories.
• K-NN algorithm stores all the available data and classifies a new data point
based on the similarity.
• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs an action
on it at classification time.
• At the training phase, the KNN algorithm just stores the dataset; when it gets
new data, it classifies that data into the category most similar to it.
Working of KNN
The K-NN working can be explained on the basis of the below
algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to the
training data points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
Step-4: Among these K neighbors, count the number of data points in
each category.
Step-5: Assign the new data point to the category for which the number
of neighbors is maximum.
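These steps can be implemented directly with NumPy; the sketch below uses made-up training points and a majority vote among the K nearest neighbours.

```python
import numpy as np
from collections import Counter

# Hypothetical training data with two categories, "A" and "B"
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9]])
y_train = np.array(["A", "A", "B", "B", "A"])

def knn_predict(x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: majority vote among those neighbours
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.4, 1.5]), k=3))  # -> "A"
```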
Working of KNN
How to choose the value of k for KNN Algorithm?
• The value of k, which defines the number of neighbors considered, is
crucial in the KNN algorithm.
• The value of k in the k-nearest neighbors (k-NN) algorithm should
be chosen based on the input data.
• If the input data has more outliers or noise, a higher value of k
would be better.
• It is recommended to choose an odd value for k to avoid ties in
classification.
• Cross-Validation methods can help in selecting the best k value
for the given dataset.
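As a sketch of the cross-validation approach, the code below (assuming scikit-learn and its bundled Iris dataset, used here only for illustration) scores odd values of k and keeps the best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores_per_k = {}
for k in range(1, 16, 2):                       # odd values of k to avoid ties
    knn = KNeighborsClassifier(n_neighbors=k)
    scores_per_k[k] = cross_val_score(knn, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores_per_k, key=scores_per_k.get)
print(scores_per_k)
print("best k:", best_k)
```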
K- Nearest Neighbour (KNN) Classifier
K- Nearest Neighbor (KNN) Regression
Linear Regression
• Linear regression is one of the most popular machine learning
algorithms.
• It is a statistical method used for predictive analysis.
• Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product
price, etc.
• The linear regression algorithm models a linear relationship between
a dependent variable (y) and one or more independent variables (x).
• Because the relationship is linear, the model describes how the value
of the dependent variable changes with the value of the independent
variable.
• The linear regression model provides a sloped straight line
representing the relationship between the variables.
Linear Regression
Mathematically, linear regression is represented as:
Y = mX + b
where
Y = dependent variable (target variable)
X = independent variable (predictor variable)
m = slope of the line (how much Y changes for a unit change in X)
b = intercept (the value of Y when X is 0)
Types of Linear Regression
Simple Linear Regression:
If a single independent variable is used to predict the value of
a numerical dependent variable, then a Linear Regression
algorithm is called Simple Linear Regression.

Multiple Linear regression:


If more than one independent variable is used to predict the
value of a numerical dependent variable, then a Linear
Regression algorithm is called Multiple Linear Regression.

A linear line showing the relationship between the dependent


and independent variables is called a regression line.
Finding the best fit line
The main goal is to find the best fit line (ie) the error between
predicted values and actual values should be minimized.
The best fit line will have the least error.
The different values of the weights or coefficients of the line (a0, a1)
give different regression lines, and the cost function is used to
estimate the coefficient values for the best fit line.
The cost function is used to optimize the regression coefficients or
weights; it measures how well a linear regression model is performing.
The cost function measures the accuracy of the mapping function,
which maps the input variable to the output variable.
Cost Function
For linear regression, the Mean Squared Error (MSE) cost function
is used, which is the average of the squared errors between the
predicted values and the actual values.
It can be written as:
MSE = (1/N) Σ (Yi − (a1·xi + a0))²
where
N = total number of observations
Yi = actual value
(a1·xi + a0) = predicted value
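A short sketch of evaluating this cost for a candidate line; the data points and coefficients are made up for illustration.

```python
import numpy as np

def mse_cost(a0, a1, x, y):
    """Mean squared error of the line y_hat = a1*x + a0 against the actual y."""
    predictions = a1 * x + a0
    return np.mean((y - predictions) ** 2)

# Hypothetical data points
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

print(mse_cost(a0=0.0, a1=2.0, x=x, y=y))  # cost of the candidate line y = 2x
```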
Linear Regression
The formula used for linear regression is y = a + bx
Linear Regression
• For example, suppose we have the following dataset with the weight
and height of seven individuals:

Using linear regression, for a person who weighs 170 pounds, how
tall would we expect them to be?
Linear Regression
Y = 32.7830 + 0.2001x
For a person who weighs 170 pounds
Y = 32.7830 + 0.2001 * 170
Y=32.7830+34.017
Y=66.8 inches
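The same kind of fit can be reproduced with NumPy's least-squares routine. The weight/height values below are assumed stand-ins for the seven-person table (which is not reproduced in the slides), so treat the fitted coefficients as illustrative.

```python
import numpy as np

weights = np.array([140, 155, 159, 179, 192, 200, 212])  # pounds (assumed data)
heights = np.array([60, 62, 67, 70, 71, 72, 75])          # inches (assumed data)

# Fit y = intercept + slope * x by least squares
slope, intercept = np.polyfit(weights, heights, deg=1)
print(intercept, slope)                      # fitted a and b in y = a + bx

predicted_height = intercept + slope * 170   # expected height for a 170 lb person
print(predicted_height)
```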
Overview of Clustering
• Clustering is one of the unsupervised learning algorithms for grouping
similar objects.
• In machine learning, unsupervised learning refers to the problem of
finding hidden structure (making inferences) within unlabeled data.
• Clustering groups data instances that are similar to (near) each other into one
cluster and data instances that are very different (far away) from each other into
different clusters.
• Clustering techniques are unsupervised in the sense that the data
scientist does not determine, in advance, the labels to apply to the
clusters.
• The structure of the data describes the objects of interest and
determines how best to group the objects.
Overview of Clustering
• Clustering is a method used for exploratory analysis of the data.
• In clustering, there are no predictions made. Rather, clustering
methods find the similarities between objects according to the object
attributes and group the similar objects into clusters.
• Clustering techniques are utilized in marketing, economics, and various
branches of science. A popular clustering method is k-means.
Use Cases

• Clustering is often used as a lead-in to classification.


• Once the clusters are identified, labels can be applied to each cluster to
classify each group based on its characteristics.
• Clustering is primarily an exploratory technique to discover hidden
structures of the data, possibly as a prelude to more focused analysis or
decision processes.
• Some specific applications of k-means are image processing, medical,
and customer segmentation.
Use Cases - Customer Segmentation

• Marketing and sales groups use k-means to better identify customers


who have similar behaviors and spending patterns.
• For example, a wireless provider may look at the following customer
attributes: monthly bill, number of text messages, data volume
consumed, minutes used during various daily periods, and years as a
customer.
• The wireless company could then look at the naturally occurring
clusters and consider tactics to increase sales or reduce the customer
churn rate, the proportion of customers who end their relationship
with a particular company.
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a
vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The K-means algorithm is an iterative algorithm that partitions the dataset into K pre-defined,
distinct, non-overlapping subgroups (clusters), where each data point belongs to only one
group.
• The k-means algorithm partitions the similar data points into fixed number (k) of clusters in a
dataset.
• Each cluster has a cluster center, called centroid.
• k is specified by the user
• A cluster refers to a collection of data points aggregated together because of certain
similarities.
• The ‘K’ in K-means refers to the number of centroids in the dataset; the algorithm allocates every
data point to the nearest cluster while keeping the clusters around the centroids as compact as possible.
• The ‘means’ in K-means refers to averaging of the data, that is, finding the centroid.
Cluster and Centroid
• A cluster is represented by a single point, known as centroid (or cluster center) of
the cluster.

• The centroid is computed as the mean of all data points in a cluster: Cj = (1/|Cj|) Σ_{xi ϵ Cj} xi
• A centroid is the imaginary or real location representing the center of the cluster.

• The data points in each cluster are as similar as possible according to a similarity
measure such as Euclidean-based distance or correlation-based distance.
• The less variation within the clusters, the more homogeneous (similar) the data
points are within the same cluster.
K-Means Clustering
• Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects based
on the objects’ proximity to the center of the k groups.
• The center is determined as the arithmetic average (mean) of each cluster’s n-
dimensional vector of attributes.
• The following diagram illustrates three clusters of objects with two attributes.
• Each object in the dataset is represented by a small dot color-coded to the closest large
dot, the mean of the cluster.
K-means Algorithm – Working Principle

• K-means algorithm starts with a first group of randomly selected


centroids, which are used as the beginning points for every cluster,
and then performs iterative (repetitive) calculations to optimize the
positions of the centroids.
• It halts creating and optimizing clusters when either:
• The centroids have stabilized — there is no change in their values because
the clustering has been successful.
• The defined number of iterations has been achieved.
K-means Algorithm – Working Principle
• To illustrate the method to find k clusters from a collection of M objects
with n attributes, the two- dimensional case (n = 2) is examined.
• It is much easier to visualize the k-means method in two dimensions.
• In the two-dimensional scenario, each object has two attributes, and each
object corresponds to the point (xi, yi), where x and y denote the
two attributes and i = 1, 2, …, M.
• For a given cluster of m points (m ≤ M), the point that corresponds to
the cluster’s mean is called a centroid. A centroid refers to a point that
corresponds to the center of mass for an object.
K-means Algorithm – Working Principle
The K-means algorithm to find k clusters can be described using the four steps.
1. Choose the value of k and the k initial guesses for the centroids.
2. Compute the distance from each data point (xi,yi) to each centroid. Assign each data point to the closest
centroid. This association defines the first k clusters.
In two dimensions, the distance, d, between any two points, (x1, y1) and (x2, y2), is expressed using the
Euclidean distance measure:
d = sqrt((x1 − x2)² + (y1 − y2)²)
3. Compute the centroids of the clusters by taking the average of all the data points that belong to each cluster. The
centroid (xc, yc) of m points in a k-means cluster is calculated as
xc = (1/m) Σ xi,   yc = (1/m) Σ yi
where (xc, yc) is the ordered pair of the arithmetic means of the coordinates of the m points in the cluster. In this
step, a centroid is computed for each of the k clusters.
4. Repeat Steps 2 and 3 until the algorithm converges to an answer.
1. Assign each point to the closest centroid computed in Step 3.
2. Compute the centroid of newly defined clusters.
3. Repeat until the algorithm converges to the final answer
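A minimal sketch of these four steps using scikit-learn's KMeans on a small, made-up two-attribute dataset; the points and the choice k = 2 are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
    [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],   # another natural group
])

# Step 1: choose k; Steps 2-4: iterate assignment and centroid updates until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)        # the k centroids after convergence
print(kmeans.labels_)                 # cluster assignment of each training point
print(kmeans.predict([[1.0, 1.0]]))   # assign a new point to its closest centroid
```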
Convergence
• Convergence is reached when the computed centroids do not change or
the centroids and the assigned points oscillate back and forth from one
iteration to the next. The latter case can occur when there are one or
more points that are equal distances from the computed centroid.
• To generalize the prior algorithm to n dimensions, suppose there are M
objects, where each object is described by n attributes or property values
(p1,p2,…pn) . Then object ‘i’ is described by (pi1,pi2,…pin) for i = 1,2,…,
M. In other words, there is a matrix with M rows corresponding to the M
objects and n columns to store the attribute values.
• For a given point pi, at (pi1, pi2, …, pin), and a centroid q, located at (q1, q2, …, qn), the distance, d,
between pi and q is expressed as
d(pi, q) = sqrt((pi1 − q1)² + (pi2 − q2)² + … + (pin − qn)²)
• The centroid q of a cluster of m points is calculated component-wise as the mean of their coordinates:
qj = (1/m) Σ_{i=1..m} pij, for j = 1, 2, …, n
Example- K means Clustering
Reasons to Choose and Cautions
• K-means is a simple and straightforward method for defining clusters.
Once clusters and their associated centroids are identified, it is easy to
assign new objects (for example, new customers) to a cluster based on
the object’s distance from the closest centroid. Because the method is
unsupervised, k-means helps to eliminate subjectivity from the
analysis.
• Although k-means is considered as an unsupervised method, there are
still several decisions that the practitioner must make:
• What object attributes should be included in the analysis?
• What unit of measure (for example, miles or kilometers) should be used for
each attribute?
• Do the attributes need to be rescaled so that one attribute does not have a
disproportionate effect on the results?
• What other considerations might apply?
K-Means Clustering Example
• Consider height and weight information. Using these two variables,
group the objects based on height and weight information.

From the above chart, we expect two visible clusters/segments, which we want
the K-means algorithm to identify.
K-Means Clustering Example
INTRODUCTION
Model Ensembles

Rather than creating a single model, ensemble methods generate a set of models and then make
predictions by aggregating the outputs of these models.
A prediction model that is composed of a set of models is called a model ensemble.
In the context of ensemble models, each model should make predictions independently
of the other models in the ensemble.
Given a large population of independent models, an ensemble can be very accurate
even if the individual models in the ensemble perform only marginally better than
random guessing.
INTRODUCTION
Model Ensembles

There are two defining characteristics of ensemble models:


1. They build multiple different models from the same dataset by inducing each
model using a modified version of the dataset.
2. They make a prediction by aggregating the predictions of the different models
in the ensemble. For categorical target features, this can be done using different
types of voting mechanisms, and for continuous target features, this can be done
using a measure of the central tendency of the different model predictions, such as the
mean or the median.
INTRODUCTION
Boosting

There are two standard approaches to creating ensembles: boosting and bagging.
When we use boosting, each new model added to an ensemble is biased to pay
more attention to instances that previous models misclassified. This is done by
incrementally adapting the dataset used to train the models.
To do this we use a weighted dataset where each instance has an associated
weight wi ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
These weights are used as a distribution over which the dataset is sampled to
create a replicated training set, in which the number of times an instance is
replicated is proportional to its weight.
INTRODUCTION
Boosting

Boosting works by iteratively creating models and adding them to the ensemble.
The iteration stops when a predefined number of models have been added.
During each iteration the algorithm does the following:
1. Induces a model using the weighted dataset and calculates the total error, ε,
in the set of predictions made by the model for the instances in the training
dataset. The ε value is calculated by summing the weights of the training
instances for which the predictions made by the model are incorrect.
INTRODUCTION
Boosting
2. Increases the weights for the instances misclassified by the model using
w[i] ← w[i] × 1 / (2ε)
and decreases the weights for the instances correctly classified by the model
using
w[i] ← w[i] × 1 / (2(1 − ε))
3. Calculates a confidence factor, α, for the model such that α increases as ε
decreases. A common way to calculate the confidence factor is
α = (1/2) × ln((1 − ε) / ε)
INTRODUCTION
Boosting

Once the set of models has been created, the ensemble makes predictions using a
weighted aggregate of the predictions made by the individual models.
The weights used in this aggregation are the confidence factors associated with
each model.
For categorical target features, the ensemble returns the majority target level
using a weighted vote, and for continuous target features, the ensemble returns
the weighted mean.
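For illustration, a boosting ensemble of this kind can be built with scikit-learn's AdaBoostClassifier; the dataset (the breast cancer data bundled with scikit-learn) and the parameters below are assumptions for the sketch, not part of the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base model is a depth-1 decision tree ("decision stump");
# each new stump focuses on instances the previous ones got wrong.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)

print(booster.score(X_test, y_test))  # accuracy of the weighted ensemble vote
```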
INTRODUCTION
Bagging
When we use bagging (or bootstrap aggregating), each model in the
ensemble is trained on a random sample of the dataset where,
importantly, each random sample is the same size as the dataset and
sampling with replacement is used. These random samples are known as
bootstrap samples, and one model is induced from each bootstrap
sample.
The reason that we sample with replacement is that this will result in
duplicates within each of the bootstrap samples, and consequently,
every bootstrap sample will be missing some of the instances from the
dataset.
As a result, each bootstrap sample will be different, and this means that
models trained on different bootstrap samples will also be different.
INTRODUCTION
Bagging
Decision tree induction algorithms are particularly well suited to use with
bagging.
This is because decision trees are very sensitive to changes in the dataset: a
small change in the dataset can result in a different feature being selected to
split the dataset at the root, or high up in the tree, and this can have a ripple
effect throughout the subtrees under this node.
Frequently, when bagging is used with decision trees, the sampling process is
extended so that each bootstrap sample only uses a randomly selected subset
of the descriptive features in the dataset. This sampling of the feature set is
known as subspace sampling.
Subspace sampling further encourages the diversity of the trees within the
ensemble and has the advantage of reducing the training time for each tree.
INTRODUCTION
Bagging
Figure 4.20 illustrates the process of creating a model ensemble
using bagging and
subspace sampling.
The combination of bagging, subspace sampling, and decision trees
is known as a random forest model.
Once the individual models have been induced, the ensemble makes
predictions by returning the majority vote or the median depending
on the type of prediction required.
For continuous target features, the median is preferred to the mean
because the mean is more heavily affected by outliers.
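A random forest as described above (bagging plus subspace sampling over decision trees) can be sketched with scikit-learn; the Iris dataset and parameter choices are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of bootstrap samples / trees in the ensemble
    max_features="sqrt",    # subspace sampling: random subset of features per split
    bootstrap=True,         # sample the training set with replacement
    random_state=0,
)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # majority-vote accuracy on held-out data
```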