
BCSE352E - Essentials of Data Analytics

1
Topics in Module-2 Classification

Classification
• Logistic Regression

• Decision Trees

• Naïve Bayes-conditional probability

• Random Forest

• SVM Classifier 2
Module-2 Introduction to Classification
What is Classification?

Classification segregates vast quantities of data into discrete values, i.e. distinct classes such as 0/1, True/False, or a pre-defined output label.

• The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then classifies new
observations into a number of classes or groups.

3
Module-2 Introduction to Classification
Types of Classification?
• Types of Classifiers: The algorithm that implements classification on a dataset is known as a
classifier. There are two types of classification:
• Binary Classifier: If the classification problem has only two possible outcomes, it is called
binary classification. Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, it is called
multi-class classification. Examples: classification of types of crops, classification of
types of music.
• Types of learners: In the classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the
test dataset. Classification is then done on the basis of the most related data stored in the
training dataset. Lazy learners take less time in training but more time for predictions.
• Example: K-NN algorithm, Case-based reasoning
• Eager Learners: An eager learner develops a classification model from the training dataset
before receiving a test dataset. Opposite to lazy learners, eager learners take more time in
learning and less time in prediction.
• Example: Decision Trees, Naïve Bayes, ANN.
4
Module-2 Introduction to Classification
Types of Classification?
• Types of classification:
• Supervised: The set of possible classes is known in advance.
• Unsupervised: Set of possible classes is not known. After classification we can try to assign a
name to that class. Unsupervised classification is called clustering.
• Types of Classification algorithms: Classification algorithms can be further divided into two
main categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification

5
Module-2 Introduction to Classification
Evaluation of Classification model?
• Log loss or cross-entropy loss:
• It is used for evaluating the performance of a classifier whose output is a probability value
between 0 and 1.
• For a good binary classification model, the value of log loss should be near 0.
• The value of log loss increases as the predicted value deviates from the actual value.
• A lower log loss indicates a higher accuracy of the model.
• Confusion Matrix:
• The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
• It is also known as the error matrix.
• The matrix summarizes the prediction results, giving the total numbers of correct and
incorrect predictions.
• AUC-ROC curve:
• ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under
the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• To visualize the performance of a multi-class classification model, we use the AUC-ROC curve.
• The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive
Rate) on the X-axis; a short sketch of these metrics follows.
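A minimal scikit-learn sketch of these three metrics; the labels and predicted probabilities below are made up for illustration only.

```python
# Sketch: evaluating a binary classifier with log loss, a confusion matrix,
# and ROC-AUC using scikit-learn (labels/probabilities are illustrative).
from sklearn.metrics import log_loss, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_prob = [0.1, 0.4, 0.8, 0.9, 0.65, 0.3, 0.55, 0.2]   # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # threshold at 0.5

print("Log loss:", log_loss(y_true, y_prob))          # near 0 for a good model
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))      # area under the TPR-vs-FPR curve
```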
Module-2 Topic-1: Logistic Regression
Logistic Regression
Logistic regression is an extension of simple linear regression for the case where the
dependent variable is dichotomous or binary in nature, a situation in which simple linear
regression cannot be used.
Logistic regression is the statistical technique used to predict the relationship
between two or more predictors (independent variables) and a predicted
variable (the dependent variable) where the dependent variable is binary.
Logistic regression estimates the
probability of an event
occurring, based on a given
dataset of independent variables.
Since the outcome is a probability,
the dependent variable is
bounded between 0 and 1.
Logistic Regression –An Illustration
7
Module-2 Topic-1: Logistic Regression
Logistic Regression
• Logistic regression estimates the probability of a certain event occurring using
the odds ratio, by calculating the logarithm of the odds.
• It uses maximum likelihood estimation (MLE) to fit the model, transforming the probability of
an event occurring into its odds; the result is a nonlinear model.

• The odds ratio is the probability of occurrence of a particular event over the
probability of non-occurrence, i.e. the probability of success divided by the probability of
failure; it provides an estimate of the magnitude of the relationship between binary variables.
8
Module-2 Topic-1: Logistic Regression
Logistic Regression

9
Module-2 Topic-1: Logistic Regression
Example #1

Example #2

10
Module-2 Topic-1: Logistic Regression
What Logistic Regression predicts?
• Probability of Y occurring given known values for X(s).
• In Logistic Regression, the Dependent Variable is transformed into the natural log of the
odds. This is called logit (short for logistic probability unit).

• Probabilities, which range between 0.0 and 1.0, are transformed into odds that range between
0 and infinity; the predicted probability is then recovered by applying a sigmoid function to a
linear combination of the input features, which maps it back into the range 0 to 1.
• If the probability of membership in the modeled category is above
some cut point (the default is 0.50), the subject is predicted to be a member
of the modeled group. Example: defaults on their payment.
• If the probability is below the cut point, the subject is predicted to be a
member of the other group. Example: does not default on their payment.
• For any given case, logistic regression computes the probability that a case
with a particular set of values for the independent variable is a member of
the modeled category.
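As a small numeric illustration (the probability value is assumed for the example), a probability is converted to odds and to the log of the odds, and the 0.50 cut point decides predicted group membership:

```python
# Sketch: probability -> odds -> logit, and the default 0.50 cut point.
import math

p = 0.70                       # assumed probability of the modeled category
odds = p / (1 - p)             # odds range from 0 to infinity
logit = math.log(odds)         # natural log of the odds

predicted_member = p >= 0.50   # default cut point: predicted member of the modeled group
print(odds, logit, predicted_member)   # 2.333..., 0.847..., True

# Going back: the sigmoid maps a linear combination of inputs into (0, 1)
sigmoid = 1 / (1 + math.exp(-logit))
print(sigmoid)                 # recovers 0.70
```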
11
Module-2 Topic-1: Logistic Regression
Logistic Regression

12
Module-2 Topic-1: Logistic Regression
Logistic Regression

13
Module-2 Topic-1: Logistic Regression
Logistic Regression

14
Module-2 Topic-1: Logistic Regression
Assumptions and their explanations for Logistic Regression
• No outliers in the data. An outlier can be identified by analyzing the
independent variables
• No correlation (multi-collinearity) between the independent variables.
Measure how well the algorithm performs using the weights on the hypothesis function:
• where g is the logistic function giving the sigmoid curve; the values on the y-axis lie
between 0 and 1 and the curve crosses 0.5 at the axis.
• The classes can be divided into positive or negative. The output is the probability of the
positive class and lies between 0 and 1.
• The output of the hypothesis function is interpreted as positive if it is ≥ 0.5, otherwise negative.
• Loss Function:
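A hedged sketch of the hypothesis and loss just described, assuming the usual form h(x) = g(w·x + b) with g the sigmoid and binary cross-entropy (log loss) as the loss; the weights and data below are illustrative only.

```python
# Sketch: logistic hypothesis h(x) = g(w.x + b) and the log loss on a tiny dataset.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])                             # assumed weights
b = 0.1                                               # assumed intercept
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 1.5]])    # toy inputs
y = np.array([0, 1, 0])                               # toy labels

h = sigmoid(X @ w + b)              # predicted P(y = 1), lies between 0 and 1
y_hat = (h >= 0.5).astype(int)      # positive if h >= 0.5, otherwise negative

# Binary cross-entropy (log loss): lower is better, near 0 for a good model
loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print(h, y_hat, loss)
```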

15
Module-2 Topic-1: Logistic Regression
Types of Logistic Regression
• Binary logistic regression: In this approach,
the response or dependent variable is
dichotomous in nature—i.e. it has only two
possible outcomes (e.g. 0 or 1).
• Multinomial logistic regression: In this type
of logistic regression model, the dependent
variable has three or more possible
outcomes; however, these values have no
specified order. For example, Type-A, Type-
B, Type-C
• Ordinal logistic regression: This type of
logistic regression model is leveraged when
the response variable has three or more
possible outcomes, but in this case, these
values do have a defined order. Examples of
ordinal responses include grading scales from
A to F or rating scales from 1 to 5.

16
Module-2 Topic-1: Logistic Regression
Advantages and Disadvantages of Logistic Regression

17
Module-2 Topic-1: Logistic Regression
Applications of Logistic Regression

18
Module-2 Topic-1: Logistic Regression
Logistic Regression-Solved Example#1
A dataset consists of women and men Instagram users with a sample size of
1069. Let p be the probability of using Instagram and x indicate gender
(x = 1 for women, x = 0 for men). The sample proportion of women who are
Instagram users is given as 61.08%, and the sample proportion for men is 43.98%.
The difference is 0.170951, and the 95% confidence interval is (0.111429, 0.2292).
Establish a logistic regression model that specifies the relationship between p and x.
Odds = p / (1 − p)
Solution
Logistic regression equation for women: log(p / (1 − p)) = b0 + b1

Logistic regression equation for men: log(p / (1 − p)) = b0

19
Logistic Regression-Solved Example#1 (Contd.)
Odds for women = 0.6108 / (1 − 0.6108) = 1.5694

Odds for men = 0.4398 / (1 − 0.4398) = 0.7851

Log of odds for women = log(1.5694) = 0.4507 = b0 + b1
Log of odds for men = log(0.7851) = −0.2419 = b0

b0 = −0.2419
Slope b1 = log(odds for women) − log(odds for men) = 0.4507 − (−0.2419) = 0.6926
Best-fit regression equation: y = log(p / (1 − p)) = −0.2419 + 0.6926x
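The arithmetic above can be reproduced in a few lines of Python (natural logarithms, matching the 0.4507 and −0.2419 values on the slide):

```python
# Sketch: reproducing the Instagram example with natural logs.
import math

p_women, p_men = 0.6108, 0.4398

odds_women = p_women / (1 - p_women)              # ~1.5694
odds_men = p_men / (1 - p_men)                    # ~0.7851

b0 = math.log(odds_men)                           # ~-0.2419  (x = 0 for men)
b1 = math.log(odds_women) - math.log(odds_men)    # slope ~0.6926

print(odds_women, odds_men, b0, b1)
# Best-fit equation: log(p / (1 - p)) = -0.2419 + 0.6926 * x
```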

20
Note: For deciding the signs of the terms in the logistic regression line, use scattergrams of
each predictor against the response; a predictor whose scattergram shows a positive correlation
takes a positive coefficient, and one showing a negative correlation takes a negative coefficient.

[Figure: three scattergrams illustrating positive correlation, negative correlation, and no correlation]

Best fit regression line:
y = b0 + b1·x1 − b2·x2 + b3·x3 − …, where the sign of each term follows the corresponding
correlation (the intercept b0 may also be negative).

21
Module-2 Topic-1: Logistic Regression
Logistic Regression-Solved Example#2
A dataset consists of 5 students' statistics on hours of study vs. whether the student will
pass or fail. Assume that we are using a logistic regression classifier. The
optimizer gives the odds of passing the course as log(odds) = −64 + 2 × (hours of study).

Hours of study(X) 29 15 33 28 39
Result (Y) Pass/Fail 0 0 1 1 1
(i) Calculate the probability of passing for a student who has completed 33 hours of study.
(ii) Calculate at least how many hours of study a student must put in to have a probability of
passing of more than 95%.
Solution
(i) Probability of passing P = 1 / (1 + e^(−Z)), where Z = log(odds) = −64 + 2 × hours
Substituting 33 hours of study, we get Z = −64 + 2 × 33 = 2
Probability of passing P = 1 / (1 + e^(−2)) ≈ 0.88
22
Module-2 Topic-1: Logistic Regression
Logistic Regression-Solved Example#2
(ii) Calculate at least how many hours of study a student must put in to have a probability of
passing of more than 95%.
Solution
(ii) Probability of passing P = 0.95 = 1 / (1 + e^(−Z))

0.95 (1 + e^(−Z)) = 1  ⇒  e^(−Z) = 1/0.95 − 1 = 0.0526
−Z = ln(0.0526)  ⇒  Z = 2.94
log(odds) = −64 + 2 × hours, so Z = −64 + 2 × hours
2.94 = −64 + 2 × hours  ⇒  66.94 = 2 × hours
Hours = 66.94 / 2 = 33.47
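The same computation in Python, using log(odds) = −64 + 2 × hours from the problem statement:

```python
# Sketch: hours-of-study example, with log(odds) = -64 + 2 * hours.
import math

def p_pass(hours):
    z = -64 + 2 * hours                 # log-odds of passing
    return 1 / (1 + math.exp(-z))       # sigmoid

print(p_pass(33))                       # part (i): ~0.88

# Part (ii): smallest number of hours with P > 0.95
target = 0.95
z = math.log(target / (1 - target))     # ~2.94
hours = (z + 64) / 2                    # ~33.47
print(z, hours)
```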
23
Module-2 Topic-2: Decision Tree
Decision tree
• A Decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label.

24
Module-2 Topic-2: Decision Tree
Decision tree
• A tree can be “learned” by splitting the
source set into subsets based on an
attribute value test.
• This process is repeated on each derived
subset in a recursive manner called
recursive partitioning.
• The recursion is complete when all
records in the subset at a node have the same value of
the target variable, or when splitting no
longer adds value to the predictions.
• The construction of a decision tree
classifier does not require any domain
knowledge or parameter setting, and
therefore is appropriate for exploratory
knowledge discovery.
25
Module-2 Topic-2: Decision Tree
Decision tree
• Decision trees can handle high-
dimensional data. In general, a decision tree
classifier has good accuracy.
• Decision tree induction is a typical
inductive approach to learn knowledge on
classification.
• Decision trees classify instances by
sorting them down the tree from the root
to some leaf node, which provides the
classification of the instance.
• An instance is classified by starting at the
root node of the tree, testing the attribute
specified by this node, then moving down
the tree branch corresponding to the value
of the attribute.
26
Module-2 Topic-2: Decision Tree
Decision tree
Strength:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
Disadvantage:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a decision
tree is computationally expensive. At each node, each candidate splitting field must be
sorted before its best split can be found. In some algorithms, combinations of fields are
used and a search must be made for optimal combining weights. Pruning algorithms can
also be expensive since many candidate sub-trees must be formed and compared.
27
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Consider a dataset based on which we will determine whether to play football or
not.

There are 4 independent variables - Outlook, Temperature, Humidity, and Wind to determine
the dependent variable-whether to play football or not.
28
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
1 Calculation of Information gain (the difference between the parent entropy and the average weighted
entropy) and Entropy (which determines how a decision tree chooses to split the data); a small
sketch of these formulas follows this slide.
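A minimal sketch of the entropy and information-gain formulas used in these calculations. The class counts below (9 "play" vs. 5 "don't play", and the Outlook split) follow the commonly used 14-row weather/play dataset and are assumptions, not the slide's exact table.

```python
# Sketch: entropy and information gain for one candidate split (assumed counts).
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Parent node: assumed 9 "play" vs 5 "don't play" examples
parent = entropy([9, 5])                          # ~0.940

# Candidate split on Outlook (assumed counts per branch: [play, don't play])
branches = [[2, 3],   # Sunny
            [4, 0],   # Overcast
            [3, 2]]   # Rain
n = sum(sum(b) for b in branches)

weighted = sum(sum(b) / n * entropy(b) for b in branches)  # average weighted entropy
info_gain = parent - weighted                              # ~0.247
print(parent, weighted, info_gain)
```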

29
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
2 Calculation of Information gain(difference between parent entropy and average weighted entropy)
and Entropy (determines how a decision tree chooses to split data)

30
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
3 Calculation of Information gain(difference between parent entropy and average weighted entropy)
and Entropy (determines how a decision tree chooses to split data)

31
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
4 Initial Decision tree diagram

32
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature, Humidity or Wind has higher information gain.
5

33
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature, Humidity or Wind has higher information gain.
5

34
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature, Humidity or Wind has higher information gain.
5

35
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature, Humidity or Wind has higher information gain.
5

36
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature, Humidity or Wind has higher information gain.
5

37
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether Temperature or Humidity has higher information gain.
6

38
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether humidity is normal or high based on higher information gain.
6

39
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether wind is strong or not based on higher information gain.
6

40
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether wind is strong or not based on higher information gain.
6

41
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
decision tree whether wind is strong or not based on higher information gain.
6

42
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
6 Final Decision tree

43
Module-2 Topic-3: Naïve Bayes-conditional probability
An Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
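The same calculation in a couple of lines of Python:

```python
# Sketch: Bayes theorem for the stiff-neck / meningitis example.
p_s_given_m = 0.5          # P(stiff neck | meningitis)
p_m = 1 / 50000            # prior P(meningitis)
p_s = 1 / 20               # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002
```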

44
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model
•Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps
in building fast machine learning models that can make quick predictions.
•It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape,
and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature
individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
•Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
•The formula for Bayes' theorem is given as: P(A|B) = P(B|A) P(A) / P(B)
• Where,
• P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
• P(B) is Marginal Probability: Probability of Evidence.
• P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
45
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
If the weather is sunny, should the player play or not?

46
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
Step-1 Frequency table for the Weather Conditions:

Step-2 Likelihood table weather condition:

47
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
Step-3 Applying Bayes Theorem
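A hedged sketch of the three steps in code. The frequency counts below follow the commonly used 14-day weather table (3 of 10 "Yes" days and 2 of 4 "No" days are Sunny) and are assumptions, since the slide's table is shown as an image.

```python
# Sketch: Naive Bayes for "should the player play if the weather is sunny?"
# Counts are assumed (classic 14-day weather dataset), not the slide's exact table.
n_yes, n_no = 10, 4
sunny_yes, sunny_no = 3, 2

p_yes, p_no = n_yes / 14, n_no / 14                 # priors P(Yes), P(No)
p_sunny = (sunny_yes + sunny_no) / 14               # evidence P(Sunny)

p_sunny_given_yes = sunny_yes / n_yes               # likelihoods
p_sunny_given_no = sunny_no / n_no

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # ~0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # ~0.40
print(p_yes_given_sunny, p_no_given_sunny)
# P(Yes | Sunny) > P(No | Sunny), so the player can play on a sunny day.
```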

48
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model

49
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification Model
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis

50
Module-2 Topic-3: Naïve Bayes-conditional probability
Summary of Naïve Bayes Classification Model
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• Bayes’ rule can be turned into a classifier
• Maximum A Posteriori (MAP) hypothesis estimation incorporates prior
knowledge; Max Likelihood (ML) doesn’t
• Naive Bayes Classifier is a simple but effective Bayesian classifier for
vector data (i.e. data with several attributes) that assumes that attributes
are independent given the class.
• Bayesian classification is a generative approach to classification
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.

51
Random Forest Module-2 Topic-4: Random Forest

• Random forests (RF) are a


combination of tree predictors
such that each tree depends on
the values of a random vector
sampled independently and
with the same distribution for
all trees in the forest.
• The generalization error of a
forest of tree classifiers depends
on the strength of the
individual trees in the forest and
the correlation between them.
• Improvements in classification accuracy have resulted from
growing an ensemble of trees and letting them vote for
the most popular class.
• To grow these ensembles, often random vectors are
generated that govern the growth of each tree in the
ensemble. 52
Random Forest
Module-2 Topic-4: Random Forest

• Random forest is identified as a collection of


decision trees. Each tree estimates a
classification, and this is called a “vote”. Ideally,
we consider each vote from every tree and choose
the most voted classification (majority voting).
• Random forests follow the same bagging process
as bagging of decision trees, but each time a split is to be
performed, the search for the split variable is
limited to a random subset of m of the p
attributes (variables or features) aka Split-
Attribute Randomization :
• classification trees: m = √p
• regression trees: m = p/3
• Random Forests produce many unique trees.

53
Module-2 Topic-4: Random Forest
Why Random Forest?
• The random forest algorithm is suitable for both classification and regression tasks.
• It gives a higher accuracy through cross validation.
• Standard decision trees often have high variance and low bias, with a
high chance of overfitting (with ‘deep’ trees that have many nodes).
• With a random forest, the bias remains low and the variance is reduced,
so we decrease the chances of overfitting.
• A random forest classifier can handle missing values and maintain accuracy
for a large proportion of the data.
• Adding more trees does not cause the model to overfit.
• It has the ability to work on large data sets with higher dimensionality.

54
Module-2 Topic-4: Random Forest
Bagging: Bootstrap Aggregating (wisdom of the crowd)
1. Sample records with
replacement ("bootstrap"
the training data)
Sampling is the process of selecting a
subset of items from a vast collection of
items.
Bootstrap = Sampling with replacement.
It means a data point in a drawn sample
can reappear in future drawn samples as
well.

2. Fit an overgrown tree to


each resampled data set

3. Average predictions
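A minimal sketch of these three steps, assuming a generic numeric feature matrix X and target y (NumPy arrays); the helper name bagged_predict is just for illustration.

```python
# Sketch: bagging = bootstrap resampling + overgrown trees + averaged predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)      # 1. sample rows with replacement ("bootstrap")
        tree = DecisionTreeRegressor()        # 2. fit an overgrown (unpruned) tree
        tree.fit(X[idx], y[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)             # 3. average the predictions
```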
55
Module-2 Topic-4: Random Forest

How is a Random Forest created?


• A random forest consists of decision trees.
A decision tree consists of
• decision nodes
the top decision node is called the root node
• terminal nodes or leaf nodes

• A selection of data and features is used for each tree


For every decision tree
• a sample of the training data is used
• a sample of the features (√nfeatures up to 30 – 40%) is used

https://victorzhou.com/blog/intro-to-random-forests/
56
Module-2 Topic-4: Random Forest

For an individual Decision Tree


• Find the best split in the data.
This is the root node (decision node)

• Find in the first branch for that part of the data again the best split.
That is the first sub node (decision node)

• Continue creating decision nodes until splitting doesn’t improve the situation
The average value of the target variable is assigned to the leaf (terminal node)

• Continue until there are only leaf nodes left or until a minimum value is
reached

• Now the decision tree can be used to do a prediction based on the input
features
57
Module-2 Topic-4: Random Forest

Determine the best split


• Determine the mean of all datapoints

• Determine the mean square error of all datapoints

• Apply a split. Try at every datapoint

• Calculate the mean of all datapoints on


each side of the split

• Calculate the MSE on each side and take


a weighted average of MSEs on all sides

• Lowest (weighted averaged) MSE is best split
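A sketch of this split search for a single numeric feature, using the weighted-MSE rule described above on made-up data:

```python
# Sketch: find the best split point of one feature by weighted MSE (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # feature values
y = np.array([1.1, 0.9, 1.0, 3.9, 4.1, 4.0])      # target values

def mse(v):
    return np.mean((v - v.mean()) ** 2) if len(v) else 0.0

best = None
for split in x:                                    # try a split at every datapoint
    left, right = y[x <= split], y[x > split]
    if len(left) == 0 or len(right) == 0:
        continue
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(y)
    if best is None or weighted < best[1]:
        best = (split, weighted)                   # lowest weighted MSE = best split

print(best)                                        # splits between 3.0 and 4.0 here
```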


58
Module-2 Topic-4: Random Forest

Feature Importance
• Feature importance is calculated as
• the reduction in sum of squared errors whenever a variable is chosen to split
• weighted by the probability of reaching that node.

• The node probability


can be calculated by the number of samples that reach the node,
divided by the total number of samples.

• The higher the value the more important the feature.

• However
• the variable importance measures are not reliable in situations where potential predictor variables vary in
their scale of measurement or their number of categories
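In scikit-learn, the impurity-based importances of a fitted forest are exposed directly; a short sketch on synthetic data (feature names and data are illustrative only):

```python
# Sketch: impurity-based feature importance from a fitted random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                             # three synthetic features
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500)    # feature 0 matters most

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["f0", "f1", "f2"], rf.feature_importances_):
    print(name, round(imp, 3))                            # higher value = more important
```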
59
Module-2 Topic-4: Random Forest
Processing the ensemble of trees called

The Random Forest


• Take a set of variables
• Run them through every decision tree
• Determine a predicted target variable for each of the
trees
• Average the result of all trees

60
Module-2 Topic-4: Random Forest

How to evaluate the model?


• Split the data in a train set and test set
• Train the model using the trainset.
• Test predictions using the test set.
• Vary the train and test set contents to confirm results

• Compare the test set and the train set


• Determine the Coefficient of Determination for both sets
The proportion of the variance in the dependent variable that is predictable from the
independent variable(s)
How well are observed outcomes replicated by the model
• If R2 is different for both sets, the test or train set is probably biased

• Determine the accuracy of the predictions


by comparing predicted results of the test set with the actual results of the test set
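A sketch of this train/test workflow, assuming generic arrays X and y are available; the function name evaluate is illustrative.

```python
# Sketch: train/test split, R^2 on both sets, and comparison of the two.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def evaluate(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    rf = RandomForestRegressor(random_state=seed).fit(X_tr, y_tr)

    r2_train = r2_score(y_tr, rf.predict(X_tr))   # variance explained on seen data
    r2_test = r2_score(y_te, rf.predict(X_te))    # variance explained on unseen data
    # A large gap between the two suggests a biased split or overfitting.
    return r2_train, r2_test
```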
61
Random Forest : Tuning Module-2 Topic-4: Random Forest

• Bagging introduces randomness into the rows of the data.

• Random forest introduces randomness into the rows and columns of the data

• Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
62
Module-2 Topic-4: Random Forest

Random Forest - Strengths & Weaknesses

63
Module-2 Topic-4: Random Forest

Applications
• Banking Industry
• Credit Card Fraud Detection
• Customer Segmentation
• Predicting Loan Defaults on LendingClub.com
• Healthcare and Medicine
• Cardiovascular Disease Prediction
• Diabetes Prediction
• Breast Cancer Prediction
• Stock Market
• Stock Market Prediction
• Stock Market Sentiment Analysis
• Bitcoin Price Detection
• E-Commerce
• Product Recommendation
• Price Optimization
• Search Ranking

64
Module-2 Topic-4: Random Forest
Case Study
• Let us consider the example of the Boston Housing dataset. This is a well-
known dataset of information about different houses in Boston. For each
house, 13 values are known, such as the crime rate in that area,
industrialization value, average age of residents, and so on. Our task is to
train a model to predict the value of a house given these values.
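A hedged sketch of the modeling step. The Boston Housing dataset has been removed from recent scikit-learn releases, so this assumes X (the 13 features per house) and y (house values) have already been loaded by some other means; max_depth=2 mirrors the shallow four-leaf tree described later in the case study, and the exact error figures will differ from the slides.

```python
# Sketch: single shallow decision tree vs. random forest on Boston-style data.
# Assumes X (13 features per house) and y (house values) are already loaded.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def compare_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    for name, model in [("tree", tree), ("forest", forest)]:
        mae_tr = mean_absolute_error(y_tr, model.predict(X_tr))
        mae_te = mean_absolute_error(y_te, model.predict(X_te))
        print(name, "train MAE:", mae_tr, "test MAE:", mae_te)
```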

65
Module-2 Topic-4: Random Forest
Case Study
• Let us consider the example of the Boston Housing dataset.

66
Case Study
• Training the dataset
• We can note that of the 13 original
features, this decision tree has used
only LSTAT (the percentage of the
population in low income groups) and
RM (average number of rooms per
dwelling) to generate a prediction.
• The four leaf nodes show us that this
single tree classifier can produce four
possible outputs: $30k, $44k, $22k and
$14k, even though we are solving a
regression problem and the true
number could be one of many
continuous values.
• This simple decision tree has a mean
absolute error of $3.6k on the training
set, and $3.8k on the test set. This
means that although it is not a
powerful model, it performs similarly
on seen and unseen data, and so it has
generalized well and has not overfit the
training data.
67
Module-2 Topic-4: Random Forest
Case Study
• Making a random forest ensemble
model

68
Module-2 Topic-4: Random Forest
Case Study
• Performance of the model

69
Case Study
• Feature Importance in RF

70
Module-2 Topic-4: Random Forest
Random forest Vs Decision tree

71
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.

72
Module-2 Topic-5: SVM Classifier
Terminologies in Support Vector Machine (SVM)
• Hyperplane: There can be multiple
lines/decision boundaries to segregate the
classes in n-dimensional space, but we need
to find out the best decision boundary that
helps to classify the data points. This best
boundary is known as the hyperplane of
SVM.
• The dimension of the hyperplane depends on
the number of features present in the dataset; with 2
features, the hyperplane is a straight
line.
• Support Vectors: The data points or
vectors that are the closest to the
hyperplane, and which affect the position of
the hyperplane, are termed Support
Vectors.
Module-2 Topic-5: SVM Classifier
Terminologies in Support Vector Machine (SVM)
• The distance between the support vectors and the hyperplane is called the margin.
• The goal of SVM is to maximize this margin.
• The hyperplane with maximum margin is called the optimal hyperplane.
[Figure: positive hyperplane, negative hyperplane, and the line representing the decision boundary, ax + by − c = 0]
74
Module-2 Topic-5: SVM Classifier
Types of Support Vector Machine (SVM)
• Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such data
is termed as linearly separable data.

• Non-linear SVM: Non-linear SVM is used for non-linearly separated data, i.e. a
dataset that cannot be classified by using a straight line.
75
Module-2 Topic-5: SVM Classifier
Types of Support Vector Machine (SVM)

76
Module-2 Topic-5: SVM Classifier
An example for SVM
Data: <xi, yi>, i = 1, .., l, where xi ∈ R^d and yi ∈ {−1, +1}

[Figure: points plotted with Humidity on the x-axis and Temperature on the y-axis; one class = play
tennis, the other = do not play tennis, separated by a line with f(x) = +1 on one side and
f(x) = −1 on the other]

The separating boundary can be expressed as w·x + b = 0 (remember the equation for a hyperplane
from algebra!). All hyperplanes in R^d are parameterized by a vector w and a constant b.
Our aim is to find such a hyperplane, f(x) = sign(w·x + b), that correctly classifies our data.
Module-2 Topic-5: SVM Classifier
Formulation of Margin
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1

H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = −1
The points on the planes H1 and H2 are the Support Vectors.

d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point

The margin of a separating hyperplane is d+ + d−.
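A sketch tying this to code: fitting a linear SVM on made-up 2-D data and reading off w, b, the support vectors, and the margin d+ + d−, which for the canonical hyperplanes H1 and H2 equals 2 / ‖w‖.

```python
# Sketch: linear SVM on toy data; margin between H1 and H2 equals 2 / ||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],        # negative class (y = -1)
              [4, 4], [5, 4], [4, 5]])       # positive class (y = +1)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin

w = clf.coef_[0]                             # normal vector of the hyperplane
b = clf.intercept_[0]
margin = 2 / np.linalg.norm(w)               # d+ + d-

print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("margin =", margin)
```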
78
Module-2 Topic-5: SVM Classifier
Decision on margin for SVM

79
Module-2 Topic-5: SVM Classifier

Maximizing the margin

80
Module-2 Topic-5: SVM Classifier
Maximizing the margin

81
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Illustration

82
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
Suppose, we have positively labeled data points

And we have negatively labeled data points

1 By inspection, it should be obvious that there are three


support vectors

83
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
2 The hyperplane driving SVM is given as

84
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
4

85
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1

86
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
Suppose, we have positively labeled data points

And we have negatively labeled data points

1 Nonlinear mapping from input space into some feature


space

Substituting the labelled points in the above feature space

87
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
2 There are two support vectors

3 The hyperplane driving SVM is given as

88
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
4 The above equation reduces to

89
Non Linear Support Vector Machine (SVM)-Solved Example#2
Module-2 Topic-5: SVM Classifier

90
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Pros and Cons
Advantages:
•Effective in high dimensional spaces.
•Still effective in cases where number of dimensions is
greater than the number of samples.
•Uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
•Versatile: different Kernel functions can be specified for
the decision function. Common kernels are provided, but it
is also possible to specify custom kernels.

Disadvantages:
•If the number of features is much greater than the number
of samples, avoiding over-fitting through the choice of kernel function
and regularization term is crucial.
•SVMs do not directly provide probability estimates; these
are calculated using an expensive five-fold cross-validation.
91
Summary Module-2 Summary

• Logistic regression: Modeling the probability that the response Y belongs to a


particular category, using a logistic function, on the basis of single or multiple
variables.
• Bayes’ theorem for classification: Bayes’ classifier using conditional independence
• Decision trees and random forests: A non-parametric, ‘information-based learning’
approach which is easy to interpret.
• Hyperplane for classification: maximal margin classifier and SVC.
• Support Vector Machines (SVMs): Extension of SVC to handle ‘non-linear
boundaries’ between classes. Uses kernels for computational efficiency. RBF kernel
exhibits ‘local behavior’.
• Random forests are an effective tool in prediction. Forests give results competitive
with boosting and adaptive bagging, yet do not progressively change the training set.
Random inputs and random features produce good results in classification- less so in
regression. For larger data sets, we can gain accuracy by combining random features
with boosting. 92
