Module 6
Supervised Learning
Application - 1
• A credit card company receives thousands of applications for new cards.
Each application contains information about an applicant, such as:
• Age
• Marital status
• Annual salary
• Outstanding debts
• Credit rating
• etc.
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: Accuracy
An example: loan application data, where each record carries the class label "Approved" or "Not approved".
Application - 2
• An emergency room in a hospital measures 17 variables (e.g., blood
pressure, age, etc.) of newly admitted patients.
• A decision is needed: whether to put a new patient in an intensive-
care unit.
• Due to the high cost of ICU, those patients who may survive less
than a month are given higher priority.
• Problem: to predict high-risk patients and discriminate them from
low-risk patients.
Types of Supervised Learning Algorithms
• A decision tree (also called prediction tree) uses a tree structure to specify
sequences of decisions and consequences.
• Given an input X = {x1, x2, …, xn}, the goal is to predict a response or output
variable Y. Each member of the set {x1, x2, …, xn} is called an input
variable.
• The prediction can be achieved by constructing a decision tree with test
points and branches.
• At each test point, a decision is made to pick a specific branch and
traverse down the tree. Eventually, a final point is reached, and a
prediction can be made.
• Each test point in a decision tree involves testing a particular input
variable (or attribute), and each branch represents the decision being
made.
• Due to their flexibility and ease of visualization, decision trees are commonly
deployed in data mining applications for classification purposes.
Decision Trees
• The input values of a decision tree can be categorical or continuous.
• A decision tree employs a structure of test points (called nodes) and branches,
which represent the decision being made.
• A node without further branches is called a leaf node. Leaf nodes return
class labels and, in some implementations, probability scores as well.
• A decision tree can be converted into a set of decision rules.
• For example, suppose income and mortgage_amount are input variables, and the
response is the output variable default, reported with a probability score. A
corresponding decision rule might read (see the sketch below):
IF income < $50,000 AND mortgage_amount > $100K THEN default =
True WITH PROBABILITY 75%
• Decision trees can be easily represented in a visual way, and the corresponding
decision rules are quite straightforward.
• Additionally, because the result is a series of logical if-then statements, there is
no underlying assumption of a linear (or nonlinear) relationship between the
input variables and the response variable.
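The decision rule above maps directly onto an if-then statement. A minimal Python sketch (the thresholds and the 75% score are taken from the example above; the behaviour of the remaining branches is not specified in the example, so a placeholder is returned):

def default_rule(income, mortgage_amount):
    # IF income < $50,000 AND mortgage_amount > $100K THEN default = True (probability 0.75)
    if income < 50_000 and mortgage_amount > 100_000:
        return True, 0.75
    # The other branches of the tree are not given in the example.
    return False, None

print(default_rule(income=42_000, mortgage_amount=150_000))  # (True, 0.75)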
Decision Trees
• Decision trees have two varieties:
• Classification trees
• Regression trees
• Classification trees are usually applied to output variables that are
categorical—often binary—in nature, such as yes or no, purchase or
not purchase, and so on.
• Regression trees are applied to output variables that are numeric or
continuous, such as the predicted price of a consumer good or the
likelihood that a subscription will be purchased.
Overview of a Decision Tree
• The following diagram shows an example of using a decision tree to
predict whether customers will buy a product.
• The term branch refers to the outcome of a decision and is
visualized as a line connecting two nodes. If a decision is numerical,
the “greater than” branch is usually placed on the right, and the
“less than” branch is placed on the left.
• Depending on the nature of the variable, one of the branches may
need to include an “equal to” component.
Example of a Decision Tree – Customer Buys a Product
In the example, the root node splits into two branches with a Gender test.
The right branch contains all those records with the variable Gender equal to Male, and the left
branch contains all those records with the variable Gender equal to Female to create the depth 1
internal nodes.
Each internal node effectively acts as the root of a sub-tree, and a test for each node is determined
independently of the other internal nodes.
The left-hand side (LHS) internal node splits on a question based on the Income variable to create
leaf nodes at depth 2, whereas the right-hand side (RHS) splits on a question on the Age variable.
The decision tree shows that females with income less than or equal to $45,000 and males 40 years
old or younger are classified as people who would purchase the product. In traversing this tree, age
does not matter for females, and income does not matter for males.
Decision Tree
• Internal nodes are the decision or test points. Each internal node refers to an
input variable or an attribute. The top internal node is called the root.
• The decision tree in the above example is a binary tree in that each internal
node has no more than two branches. The branching of a node is referred to as
a split.
• Sometimes decision trees may have more than two branches stemming from a
node. For example, if an input variable Weather is categorical and has three
choices— Sunny, Rainy, and Snowy— the corresponding node Weather in the
decision tree may have three branches labeled as Sunny, Rainy, and Snowy,
respectively.
• The depth of a node is the minimum number of steps required to reach the
node from the root. In the above example, nodes Income and Age have a depth
of one, and the four nodes on the bottom of the tree have a depth of two.
• Leaf nodes are at the end of the last branches on the tree. They represent class
labels—the outcome of all the prior decisions. The path from the root to a leaf
node contains a series of decisions made at various internal nodes.
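As an illustration, the example tree above (Gender at the root, Income on the left subtree, Age on the right) can be encoded and traversed with a few lines of Python. The structure and split values ($45,000 and 40 years) come from the example; the dictionary layout is just one possible representation:

# Example tree: root tests Gender, then Income (Female branch) or Age (Male branch).
tree = {
    "attribute": "Gender",
    "branches": {
        "Female": {"attribute": "Income",
                   "test": lambda income: "Buy" if income <= 45_000 else "Not buy"},
        "Male":   {"attribute": "Age",
                   "test": lambda age: "Buy" if age <= 40 else "Not buy"},
    },
}

def predict(record):
    """Traverse from the root to a leaf node and return its class label."""
    node = tree["branches"][record["Gender"]]
    return node["test"](record[node["attribute"]])

print(predict({"Gender": "Female", "Income": 40_000, "Age": 55}))  # Buy
print(predict({"Gender": "Male",   "Income": 90_000, "Age": 52}))  # Not buy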
Applications
• Decision trees are used to classify animals (e.g., cold-blooded or warm-blooded,
mammal or not mammal).
• Another example is a checklist of symptoms during a doctor’s evaluation of a
patient.
• The artificial intelligence engine of a video game commonly uses decision trees
to control the autonomous actions of a character in response to various
scenarios.
• Retailers can use decision trees to segment customers or predict response rates
to marketing and promotions.
• Financial institutions can use decision trees to help decide if a loan application
should be approved or denied. In the case of loan approval, computers can use
the logical if-then statements to predict whether the customer will default on
the loan.
• For customers with a clear (strong) outcome, no human interaction is required;
for observations that may not generate a clear response, a human is needed for
the decision.
Construction – General Algorithm
• The objective of a decision tree algorithm is to construct a tree T from a training set S.
• If all the records in S belong to some class C, or if S is sufficiently pure (greater than a preset threshold), then
that node is considered a leaf node and assigned the label C.
• The purity of a node is defined as the proportion (probability) of its records that belong to the corresponding class.
• If not all the records in S belong to class C, or if S is not sufficiently pure, the algorithm selects the next most
informative attribute A and partitions S according to A's values.
• The algorithm constructs sub-trees T1,T2….. for the subsets of S recursively until one of the following criteria
is met:
• All the leaf nodes in the tree satisfy the minimum purity threshold.
• The tree cannot be further split with the preset minimum purity threshold.
• Any other stopping criterion is satisfied (such as the maximum depth of the tree).
• The first step in constructing a decision tree is to choose the most informative attribute. A common way to
identify the most informative attribute is to use entropy-based methods, which are used by decision tree
learning algorithms such as ID3 (or Iterative Dichotomiser 3).
• The entropy methods select the most informative attribute based on two basic measures:
• Entropy, which measures the impurity of an attribute
• Information gain, which measures the purity of an attribute
• At each split, the decision tree algorithm picks the most informative attribute out of the remaining attributes.
• The extent to which an attribute is informative is determined by measures such as entropy and information
gain.
Construction – General Algorithm
• Given a class X and its label x ∈ X, let p(x) be the probability of x and H(X) the
entropy of X. H(X) is defined as
H(X) = − Σ_{x ∈ X} p(x) log2 p(x)
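A short Python sketch of these two measures, computed from lists of class labels (the 9 "yes" / 5 "no" toy split below is illustrative, not taken from the module):

import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum over x of p(x) * log2 p(x), over the class labels in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_groups):
    """Entropy of the parent node minus the weighted entropy of the child nodes."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - remainder

# Toy example: 9 'yes' / 5 'no' labels split by a hypothetical attribute.
parent = ["yes"] * 9 + ["no"] * 5
split  = [["yes"] * 6 + ["no"] * 1, ["yes"] * 3 + ["no"] * 4]
print(round(entropy(parent), 3))                  # ≈ 0.940
print(round(information_gain(parent, split), 3))  # ≈ 0.152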
Naïve Bayes Classifier
Conditional probabilities:
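In standard notation, with C a class label and a1, …, an the attribute values of an instance, the conditional probability (Bayes' theorem) and the resulting naïve Bayes decision rule are:

P(C | a1, …, an) = P(a1, …, an | C) · P(C) / P(a1, …, an)

Under the "naïve" conditional-independence assumption, the classifier predicts the class C that maximizes

P(C) · Π_{i=1..n} P(ai | C)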
Linear Regression
• The formula used for linear regression is y = a + bx (equivalently ŷ = a0 + a1·x, with intercept a0 = a and slope a1 = b).
• The line is fitted by minimizing the mean squared error (MSE) between the actual and predicted values:
MSE = (1/N) Σ_{i=1..N} (Yi − (a1·xi + a0))²
where
N = total number of observations
Yi = actual value
(a1·xi + a0) = predicted value
Linear Regression
• For example, suppose we have the following dataset with the weight
and height of seven individuals:
Using linear regression: for a person who weighs 170 pounds, how
tall would we expect them to be?
Linear Regression
Y = 32.7830 + 0.2001x
For a person who weighs 170 pounds:
Y = 32.7830 + 0.2001 × 170
Y = 32.7830 + 34.017
Y ≈ 66.8 inches
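A minimal sketch of this fit in Python, assuming NumPy is available. The seven weight/height pairs below are hypothetical stand-ins (not the module's original table), chosen so that the least-squares fit approximately reproduces the equation above:

import numpy as np

# Hypothetical weights (lb) and heights (in) standing in for the seven-person table.
weight = np.array([140, 155, 159, 179, 192, 200, 212], dtype=float)
height = np.array([60, 62, 67, 70, 71, 72, 75], dtype=float)

# Least-squares fit of height = a + b * weight; polyfit returns [slope, intercept].
b, a = np.polyfit(weight, height, deg=1)
print(f"height ≈ {a:.4f} + {b:.4f} * weight")

# Prediction for a 170-pound person (compare with the slide's ≈ 66.8 inches).
print(a + b * 170)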
Overview of Clustering
• Clustering is one of the unsupervised learning algorithms for grouping
similar objects.
• In machine learning, unsupervised learning refers to the problem of
finding hidden structure (making inferences) within unlabeled data.
• Clustering groups data instances that are similar to (near) each other into one
cluster, and data instances that are very different from (far away from) each
other into different clusters.
• Clustering techniques are unsupervised in the sense that the data
scientist does not determine, in advance, the labels to apply to the
clusters.
• The structure of the data describes the objects of interest and
determines how best to group the objects.
Overview of Clustering
• Clustering is a method used for exploratory analysis of the data.
• In clustering, there are no predictions made. Rather, clustering
methods find the similarities between objects according to the object
attributes and group the similar objects into clusters.
• Clustering techniques are utilized in marketing, economics, and various
branches of science. A popular clustering method is k-means.
Use Cases
• The data points in each cluster are as similar as possible according to a similarity
measure such as Euclidean-based distance or correlation-based distance.
• The less variation within the clusters, the more homogeneous (similar) the data
points are within the same cluster.
K-Means Clustering
• Given a collection of objects each with n measurable attributes, k-means is an
analytical technique that, for a chosen value of k, identifies k clusters of objects based
on the objects’ proximity to the center of the k groups.
• The center is determined as the arithmetic average (mean) of each cluster’s n-
dimensional vector of attributes.
• The following diagram illustrates three clusters of objects with two attributes.
• Each object in the dataset is represented by a small dot color-coded to the closest large
dot, the mean of the cluster.
K-means Algorithm – Working Principle
3. Compute the centroids for the clusters by taking the average of the all data points that belong to each cluster. The
centroid (xc,yc) of m points in a k-means cluster is calculated as follows
where (xc,yc) is the ordered pair of the arithmetic means of the coordinates of the m points in the cluster. In this
step, a centroid is computed for each of the k clusters.
4. Repeat Steps 2 and 3 until the algorithm converges to an answer.
1. Assign each point to the closest centroid computed in Step 3.
2. Compute the centroid of newly defined clusters.
3. Repeat until the algorithm converges to the final answer
Convergence
• Convergence is reached when the computed centroids do not change or
the centroids and the assigned points oscillate back and forth from one
iteration to the next. The latter case can occur when there are one or
more points that are equal distances from the computed centroid.
• To generalize the prior algorithm to n dimensions, suppose there are M
objects, where each object is described by n attributes or property values
(p1,p2,…pn) . Then object ‘i’ is described by (pi1,pi2,…pin) for i = 1,2,…,
M. In other words, there is a matrix with M rows corresponding to the M
objects and n columns to store the attribute values.
• For a given point pi located at (pi1, pi2, …, pin) and a centroid q located at (q1, q2,
…, qn), the distance d between pi and q is expressed as
d(pi, q) = sqrt( Σ_{j=1..n} (pij − qj)² )
From the chart above, we expect that there are two visible clusters/segments,
and we want the k-means algorithm to identify them (see the sketch below).
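A minimal sketch of the full loop in Python (NumPy assumed); the six 2-D points below form two obvious groups and are hypothetical stand-ins for the chart's data:

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Assign points to the nearest centroid, recompute each centroid as the
    mean of its assigned points, and repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = None
    for _ in range(n_iter):
        # Euclidean distance d(pi, q) from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                               # converged: assignments unchanged
        assignment = new_assignment
        for j in range(k):                      # centroid = mean of cluster j's points
            members = points[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
centers, labels = kmeans(pts, k=2)
print(centers)   # one centroid near (1, 1), the other near (8, 8)
print(labels)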
K-Means Clustering Example
Model Ensembles
Rather than creating a single model, ensemble methods generate a set of models and then make
predictions by aggregating the outputs of these models.
A prediction model that is composed of a set of models is called a model ensemble.
In the context of ensemble models, each model should make predictions independently
of the other models in the ensemble.
Given a large population of independent models, an ensemble can be very accurate
even if the individual models in the ensemble perform only marginally better than
random guessing.
Model Ensembles
There are two standard approaches to creating ensembles: boosting and bagging.
When we use boosting, each new model added to an ensemble is biased to pay
more attention to instances that previous models misclassified. This is done by
incrementally adapting the dataset used to train the models.
To do this, we use a weighted dataset where each instance has an associated
weight wi ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
These weights are used as a distribution over which the dataset is sampled to
create a replicated training set, in which the number of times an instance is
replicated is proportional to its weight.
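A small Python illustration of this weighted sampling (NumPy assumed; n and the index array are hypothetical):

import numpy as np

n = 10                                   # hypothetical number of training instances
weights = np.full(n, 1.0 / n)            # each weight wi starts at 1/n
rng = np.random.default_rng(42)

# Draw n instance indices with probability proportional to the weights; instances
# with larger weights tend to be replicated several times in the training set.
replicated_indices = rng.choice(n, size=n, replace=True, p=weights)
print(replicated_indices)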
Boosting
Boosting works by iteratively creating models and adding them to the ensemble.
The iteration stops when a predefined number of models have been added.
During each iteration the algorithm does the following:
1. Induces a model using the weighted dataset and calculates the total error, ε,
in the set of predictions made by the model for the instances in the training
dataset. The ε value is calculated by summing the weights of the training
instances for which the predictions made by the model are incorrect.
2. Increases the weights for the instances misclassified by the model, using
w[i] ← w[i] × (1 / (2ε)),
and decreases the weights for the instances correctly classified by the model, using
w[i] ← w[i] × (1 / (2(1 − ε))).
Once the set of models has been created, the ensemble makes predictions using a
weighted aggregate of the predictions made by the individual models.
The weights used in this aggregation are the confidence factors associated with
each model.
For categorical target features, the ensemble returns the majority target level
using a weighted vote, and for continuous target features, the ensemble returns
the weighted mean.
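A sketch of one boosting weight update and a weighted vote in Python. The 1/(2ε) and 1/(2(1−ε)) factors follow the update rule above; the ½·ln((1−ε)/ε) confidence factor is one common choice and is an assumption here, as are all the toy arrays:

import numpy as np

def boosting_iteration(weights, is_correct):
    """One weight update: is_correct says whether the current model classified
    each instance correctly. Returns new weights and the model's confidence."""
    epsilon = weights[~is_correct].sum()            # total error = sum of misclassified weights
    confidence = 0.5 * np.log((1 - epsilon) / epsilon)
    new_w = np.where(is_correct,
                     weights * (1 / (2 * (1 - epsilon))),   # decrease correct instances
                     weights * (1 / (2 * epsilon)))         # increase misclassified instances
    return new_w, confidence

def weighted_vote(predictions, confidences):
    """Weighted majority vote for a categorical target.
    predictions: (n_models, n_instances) array of class labels."""
    labels = np.unique(predictions)
    scores = np.array([np.where(predictions == label, confidences[:, None], 0).sum(axis=0)
                       for label in labels])
    return labels[scores.argmax(axis=0)]

w = np.full(5, 0.2)
correct = np.array([True, True, False, True, True])
w, conf = boosting_iteration(w, correct)
print(w, conf)   # the misclassified instance now carries weight 0.5

preds = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
confs = np.array([0.9, 0.4, 0.4])
print(weighted_vote(preds, confs))   # [1 0 1]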
Bagging
When we use bagging (or bootstrap aggregating), each model in the
ensemble is trained on a random sample of the dataset where,
importantly, each random sample is the same size as the dataset and
sampling with replacement is used. These random samples are known as
bootstrap samples, and one model is induced from each bootstrap
sample.
The reason that we sample with replacement is that this will result in
duplicates within each of the bootstrap samples, and consequently,
every bootstrap sample will be missing some of the instances from the
dataset.
As a result, each bootstrap sample will be different, and this means that
models trained on different bootstrap samples will also be different.
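A small Python illustration of drawing one bootstrap sample (NumPy assumed; the dataset is just a hypothetical array of instance ids):

import numpy as np

rng = np.random.default_rng(7)
dataset = np.arange(10)                      # hypothetical dataset of 10 instance ids

# A bootstrap sample: same size as the dataset, drawn with replacement, so some
# instances appear more than once and others are missing entirely.
bootstrap_sample = rng.choice(dataset, size=len(dataset), replace=True)
print(np.sort(bootstrap_sample))             # duplicates are visible here
print(np.setdiff1d(dataset, bootstrap_sample))  # instances left out of this sample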
Bagging
Decision tree induction algorithms are particularly well suited to use with
bagging.
This is because decision trees are very sensitive to changes in the dataset: a
small change in the dataset can result in a different feature being selected to
split the dataset at the root, or high up in the tree, and this can have a ripple
effect throughout the subtrees under this node.
Frequently, when bagging is used with decision trees, the sampling process is
extended so that each bootstrap sample only uses a randomly selected subset
of the descriptive features in the dataset. This sampling of the feature set is
known as subspace sampling.
Subspace sampling further encourages the diversity of the trees within the
ensemble and has the advantage of reducing the training time for each tree.
Bagging
Figure 4.20 illustrates the process of creating a model ensemble using bagging and subspace sampling.
The combination of bagging, subspace sampling, and decision trees
is known as a random forest model.
Once the individual models have been induced, the ensemble makes
predictions by returning the majority vote or the median depending
on the type of prediction required.
For continuous target features, the median is preferred to the mean
because the mean is more heavily affected by outliers.
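A minimal sketch using scikit-learn's RandomForestClassifier, which combines bagging with subspace sampling (scikit-learn samples the feature subset at each split rather than once per bootstrap sample, but the idea is the same); the toy data below is hypothetical:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 instances, 5 descriptive features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # a simple synthetic target

# Bagging + subspace sampling: each tree is trained on a bootstrap sample and
# considers a random subset of the features ('sqrt' of 5 ≈ 2) at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
forest.fit(X, y)
print(forest.predict(X[:5]))                  # majority vote over the 100 trees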