Module 2
Data Analytics
Topics in Module-2: Classification
• Logistic Regression
• Decision Trees
• Random Forest
• SVM Classifier
Module-2 Introduction to Classification
What is Classification?
• The Classification algorithm is a Supervised Learning technique used to identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups.
Module-2 Introduction to Classification
Types of Classification
• Types of Classifiers: The algorithm that implements classification on a dataset is known as a classifier. There are two types of Classifications:
• Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
• Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier. Examples: classification of types of crops, classification of types of music.
• Types of learners: In classification problems, there are two types of learners:
• Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. Classification is then done on the basis of the most closely related data stored in the training dataset. It takes less time in training but more time for predictions.
• Examples: K-NN algorithm, Case-based reasoning
• Eager Learners: Eager learners develop a classification model based on the training dataset before receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less time in prediction.
• Examples: Decision Trees, Naïve Bayes, ANN.
Module-2 Introduction to Classification
Types of Classification
• Types of classification:
• Supervised: The set of possible classes is known in advance.
• Unsupervised: The set of possible classes is not known in advance; after classification we can try to assign a name to each class. Unsupervised classification is also called clustering.
• Types of Classification algorithms: Classification algorithms can be further divided into two main categories:
• Linear Models
• Logistic Regression
• Support Vector Machines
• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification
Module-2 Introduction to Classification
Evaluation of Classification Models
• Log loss or cross-entropy loss:
• It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
• For a good binary classification model, the value of log loss should be near 0.
• The value of log loss increases as the predicted value deviates from the actual value.
• A lower log loss represents higher accuracy of the model.
• Confusion Matrix:
• The confusion matrix provides us a matrix/table as output and describes the performance of the
model.
• It is also known as the error matrix.
• The matrix summarizes the prediction results, showing the total numbers of correct and incorrect predictions.
• AUC-ROC curve:
• ROC stands for Receiver Operating Characteristic curve and AUC stands for Area Under the Curve.
• It is a graph that shows the performance of the classification model at different thresholds.
• To visualize the performance of a multi-class classification model, we use the AUC-ROC curve.
• The ROC curve plots TPR (True Positive Rate) on the Y-axis against FPR (False Positive Rate) on the X-axis.
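A minimal sketch of computing these three measures (assuming scikit-learn; the toy labels and probabilities below are illustrative, not from the slides):

```python
# Sketch: log loss, confusion matrix and AUC for a binary classifier (scikit-learn).
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score

y_true = [0, 0, 1, 1, 1]                         # actual classes
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9]              # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # default 0.5 cut point

print("Log loss:", log_loss(y_true, y_prob))            # near 0 for a good model
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))             # area under the ROC curve
```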
Module-2 Topic-1: Logistic Regression
Logistic Regression
Logistic regression is an extension of simple linear regression, used when the dependent variable is dichotomous (binary) in nature and simple linear regression cannot be applied.
Logistic regression is the statistical technique used to predict the relationship between two or more predictors (independent variables) and a predicted variable (the dependent variable), where the dependent variable is binary.
Logistic regression estimates the probability of an event occurring, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
Logistic Regression –An Illustration
Module-2 Topic-1: Logistic Regression
Logistic Regression
• Logistic regression estimates the probability of a certain event occurring using the odds of the event, by modelling the logarithm of the odds.
• It uses Maximum Likelihood Estimation (MLE) to fit the model; the relationship between the probability of the event and the predictors is nonlinear.
Module-2 Topic-1: Logistic Regression
Example #1
Example #2
Module-2 Topic-1: Logistic Regression
What does Logistic Regression predict?
• Probability of Y occurring given known values for X(s).
• In Logistic Regression, the Dependent Variable is transformed into the natural log of the
odds. This is called logit (short for logistic probability unit).
• The probabilities, which range between 0.0 and 1.0, are transformed into odds that range between 0 and infinity; the model applies a sigmoid function to a linear combination of the input features to map the result back into the range 0 to 1.
• If the probability of membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. Example: defaults on their payment.
• If the probability is below the cut point, the subject is predicted to be a member of the other group. Example: does not default on their payment.
• For any given case, logistic regression computes the probability that a case
with a particular set of values for the independent variable is a member of
the modeled category.
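A small illustrative sketch of this probability-to-class mapping (assuming scikit-learn; the data below is synthetic, not from the slides):

```python
# Sketch: predicted probability -> group membership via the default 0.50 cut point.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                   # two independent variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic binary outcome

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X[:5])[:, 1]        # P(Y = 1 | X) for five cases
pred = (proba >= 0.5).astype(int)               # member of modeled group if P >= 0.50
print(proba, pred)
```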
Module-2 Topic-1: Logistic Regression
Assumptions for Logistic Regression (with explanation)
• No outliers in the data. An outlier can be identified by analyzing the independent variables.
• No correlation (multicollinearity) between the independent variables.
• The model measures how well it performs using a hypothesis function of the weighted inputs, h(x) = g(w·x), where g is the logistic (sigmoid) function g(z) = 1 / (1 + e^(−z)).
• On the sigmoid curve, the values on the y-axis lie between 0 and 1, and the curve crosses 0.5 at z = 0.
• The classes can be divided into positive or negative; the output is interpreted as the probability of the positive class, since it lies between 0 and 1.
• The output of the hypothesis function is interpreted as positive if it is ≥ 0.5, otherwise negative.
• Loss Function:
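The formula itself is not reproduced in the text; the standard binary cross-entropy (log loss) form used for logistic regression is:

```latex
J(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-h\big(x^{(i)}\big)\big) \Big]
```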
Module-2 Topic-1: Logistic Regression
Types of Logistic Regression
• Binary logistic regression: In this approach,
the response or dependent variable is
dichotomous in nature—i.e. it has only two
possible outcomes (e.g. 0 or 1).
• Multinomial logistic regression: In this type
of logistic regression model, the dependent
variable has three or more possible
outcomes; however, these values have no
specified order. For example, Type-A, Type-
B, Type-C
• Ordinal logistic regression: This type of logistic regression model is leveraged when the response variable has three or more possible outcomes and these values have a defined order. Examples of ordinal responses include grading scales from A to F or rating scales from 1 to 5.
Module-2 Topic-1: Logistic Regression
Advantages and Disadvantages of Logistic Regression
Module-2 Topic-1: Logistic Regression
Applications of Logistic Regression
Module-2 Topic-1: Logistic Regression
Logistic Regression - Solved Example #1
A dataset consists of women and men Instagram users with a sample size of 1069. Let p be the probability of being an Instagram user and let x indicate gender (x = 1 for women, x = 0 for men). The sample proportion of women who are Instagram users is 61.08%, and the sample proportion for men is 43.98%. The difference is 0.170951, and the 95% confidence interval is (0.111429, 0.2292). Establish a logistic regression model that specifies the relationship between p and x.
Odds = p / (1 − p)
Solution
Logistic regression equation: log(odds) = log(p / (1 − p)) = b0 + b1·x
Logistic Regression-Solved Example#1 (Contd.)
Odds for women = 0.6108 / (1 − 0.6108) = 1.5694, so log(odds for women) = ln(1.5694) = 0.4507
Odds for men = 0.4398 / (1 − 0.4398) = 0.7851, so b0 = log(odds for men) = ln(0.7851) = −0.2419
Slope b1 = log(odds for women) − log(odds for men) = 0.4507 − (−0.2419) = 0.6926
Best-fit regression equation: log(p / (1 − p)) = −0.2419 + 0.6926·x
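A quick numerical check of this example (plain Python; the proportions are those given above):

```python
# Numerical check of Solved Example #1 (proportions taken from the slide).
import math

p_women, p_men = 0.6108, 0.4398
odds_women = p_women / (1 - p_women)             # ≈ 1.5694
odds_men = p_men / (1 - p_men)                   # ≈ 0.7851

b0 = math.log(odds_men)                          # intercept ≈ -0.2419 (x = 0 for men)
b1 = math.log(odds_women) - math.log(odds_men)   # slope ≈ 0.6926
print(f"log(odds) = {b0:.4f} + {b1:.4f} * x")
```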
Note: For deciding the sign of the coefficients in the logistic regression line, scatterplots (scattergrams) of Y against X can be used: a positive correlation corresponds to positive coefficients and a negative correlation to negative coefficients. (The slide shows three illustrative scatterplots.)
Module-2 Topic-1: Logistic Regression
Logistic Regression-Solved Example#2
A dataset consists of 5 students' statistics on hours of study vs. whether the student passes or fails. Assume that we are using a logistic regression classifier, and that the fitted model gives the log-odds of passing the course as log(odds) = −64 + 2 × hours.
Hours of study (X): 29, 15, 33, 28, 39
Result (Y), Pass = 1 / Fail = 0: 0, 0, 1, 1, 1
(i) Calculate the probability of passing for a student who has completed 33 hours of study.
(ii) Calculate at least how many hours of study are needed for the probability of passing to exceed 95%.
Solution
(i) Probability of passing: P = 1 / (1 + e^(−Z)), where Z = −64 + 2 × hours.
Substituting 33 hours of study, we get Z = −64 + 2 × 33 = 2.
Probability of passing: P = 1 / (1 + e^(−2)) ≈ 0.88.
Module-2 Topic-1: Logistic Regression
Logistic Regression-Solved Example#2
(ii) Calculate at least how many hours of study are needed for the probability of passing to exceed 95%.
Solution
Set P = 0.95 in P = 1 / (1 + e^(−Z)):
0.95 (1 + e^(−Z)) = 1  ⟹  e^(−Z) = (1 − 0.95) / 0.95  ⟹  Z = ln(0.95 / 0.05) ≈ 2.94
Since Z = log(odds) = −64 + 2 × hours:
2.94 = −64 + 2 × hours  ⟹  66.94 = 2 × hours  ⟹  Hours = 66.94 / 2 = 33.47
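A quick numerical check of both parts (plain Python; the coefficients −64 and 2 are those used in the slide's model):

```python
# Numerical check of Solved Example #2 with the model log(odds) = -64 + 2 * hours.
import math

def p_pass(hours):
    z = -64 + 2 * hours
    return 1 / (1 + math.exp(-z))

print(p_pass(33))                        # part (i): ≈ 0.88

target = 0.95                            # part (ii): invert the sigmoid
z = math.log(target / (1 - target))      # ≈ 2.94
print((z + 64) / 2)                      # ≈ 33.47 hours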
Module-2 Topic-2: Decision Tree
Decision tree
• A Decision tree is a flowchart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node
(terminal node) holds a class label.
Module-2 Topic-2: Decision Tree
Decision tree
• A tree can be “learned” by splitting the
source set into subsets based on an
attribute value test.
• This process is repeated on each derived
subset in a recursive manner called
recursive partitioning.
• The recursion is completed when all records in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
• The construction of a decision tree
classifier does not require any domain
knowledge or parameter setting, and
therefore is appropriate for exploratory
knowledge discovery.
Module-2 Topic-2: Decision Tree
Decision tree
• Decision trees can handle high-dimensional data. In general, a decision tree classifier has good accuracy.
• Decision tree induction is a typical
inductive approach to learn knowledge on
classification.
• Decision trees classify instances by
sorting them down the tree from the root
to some leaf node, which provides the
classification of the instance.
• An instance is classified by starting at the
root node of the tree, testing the attribute
specified by this node, then moving down
the tree branch corresponding to the value
of the attribute.
Module-2 Topic-2: Decision Tree
Decision tree
Strengths:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
Disadvantages:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a decision tree is computationally expensive: at each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Consider a dataset based on which we will determine whether to play football or not.
There are 4 independent variables - Outlook, Temperature, Humidity, and Wind - to determine the dependent variable: whether to play football or not.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 1: Calculation of Information Gain (the difference between the parent entropy and the average weighted entropy) and Entropy (which determines how a decision tree chooses to split the data).
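A minimal sketch of these two quantities (plain Python). The class counts below - 9 "yes" and 5 "no" overall, with the usual Outlook split - are the commonly used textbook values and are an assumption, since the slide's table is not reproduced in the text:

```python
# Sketch: entropy and information gain for a candidate split (illustrative counts).
import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

parent = entropy(9, 5)                     # ≈ 0.940 (whole dataset)
splits = [(2, 3), (4, 0), (3, 2)]          # Outlook = Sunny, Overcast, Rain
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in splits)
print("Information gain(Outlook) =", parent - weighted)   # ≈ 0.247
```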
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 2: Calculation of Information Gain and Entropy (continued).
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 3: Calculation of Information Gain and Entropy (continued).
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 4: Initial decision tree diagram.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 5: Grow the decision tree by checking whether Temperature, Humidity, or Wind has the higher information gain.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 6: Extend the decision tree by checking whether Temperature or Humidity has the higher information gain.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 6 (continued): Split the tree on whether Humidity is normal or high, based on the higher information gain.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 6 (continued): Split the tree on whether Wind is strong or not, based on the higher information gain.
Module-2 Topic-2: Decision Tree
Decision tree Solved Example
Step 6 (final): The final decision tree.
Module-2 Topic-3: Naïve Bayes-conditional probability
An Example of Bayes Theorem
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
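A one-line check of this computation (plain Python):

```python
# Bayes' theorem check: P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5      # stiff neck given meningitis
p_m = 1 / 50000        # prior probability of meningitis
p_s = 1 / 20           # prior probability of stiff neck
print(p_s_given_m * p_m / p_s)   # 0.0002
```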
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
•Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
•The formula for Bayes' theorem is given as: P(A | B) = P(B | A) P(A) / P(B)
• Where,
• P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
• P(B) is Marginal Probability: Probability of Evidence.
• P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
• P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
If the weather is sunny, should the player play or not?
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
Step-1 Frequency table for the Weather Conditions:
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model Solved Example#1
Step-3 Applying Bayes Theorem
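The frequency and likelihood tables for this example appear as images on the slides; below is a minimal sketch of the same calculation in plain Python. The counts used (10 "Yes" and 4 "No" days, of which 3 and 2 respectively are sunny) are illustrative assumptions, not values read from the slide:

```python
# Sketch of "play if sunny?" with Bayes' theorem; the counts are illustrative assumptions.
n_yes, n_no = 10, 4                       # days with Play = Yes / No
sunny_yes, sunny_no = 3, 2                # sunny days among them
total = n_yes + n_no

p_yes, p_no = n_yes / total, n_no / total          # priors
p_sunny = (sunny_yes + sunny_no) / total           # evidence P(Sunny)

p_yes_given_sunny = (sunny_yes / n_yes) * p_yes / p_sunny   # ≈ 0.60
p_no_given_sunny = (sunny_no / n_no) * p_no / p_sunny       # ≈ 0.40
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Do not play")
```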
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification model
Module-2 Topic-3: Naïve Bayes-conditional probability
Naïve Bayes Classification Model
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features.
Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis
Module-2 Topic-3: Naïve Bayes-conditional probability
Summary of Naïve Bayes Classification Model
• Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
• Bayes’ rule can be turned into a classifier
• Maximum A Posteriori (MAP) hypothesis estimation incorporates prior
knowledge; Max Likelihood (ML) doesn’t
• Naive Bayes Classifier is a simple but effective Bayesian classifier for
vector data (i.e. data with several attributes) that assumes that attributes
are independent given the class.
• Bayesian classification is a generative approach to classification
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
Module-2 Topic-4: Random Forest
Random Forest
Module-2 Topic-4: Random Forest
Why Random Forest?
• The random forest algorithm is suitable for both classification and regression tasks.
• It gives a higher accuracy through cross validation.
• Standard decision trees often have high variance and low bias, with a high chance of overfitting (with "deep" trees containing many nodes).
• With a Random Forest, the bias remains low and the variance is reduced, so the chance of overfitting decreases.
• A random forest classifier can handle missing values and maintain accuracy for a large proportion of the data.
• Adding more trees does not cause the model to overfit.
• It has the ability to work on a large dataset with high dimensionality.
Module-2 Topic-4: Random Forest
Bagging (Bootstrap Aggregating): wisdom of the crowd
1. Sample records with replacement ("bootstrap" the training data).
Sampling is the process of selecting a subset of items from a larger collection of items. Bootstrap = sampling with replacement: a data point in a drawn sample can reappear in future drawn samples as well.
2. Train one tree on each bootstrap sample.
3. Average the predictions.
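A minimal sketch of these three steps done by hand (assuming NumPy and scikit-learn decision trees; the data is synthetic and all names are illustrative):

```python
# Sketch of bagging: bootstrap samples -> one tree per sample -> average the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)         # synthetic data

trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))                # 1. sample WITH replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx])) # 2. one tree per sample

x_new = np.array([[0.5]])
print(np.mean([t.predict(x_new) for t in trees]))             # 3. average the predictions
```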
Module-2 Topic-4: Random Forest
https://victorzhou.com/blog/intro-to-random-forests/
Module-2 Topic-4: Random Forest
• For the data in each branch, again find the best split; that gives the first sub-node (decision node).
• Continue creating decision nodes until splitting no longer improves the result; the average value of the target variable is assigned to the leaf (terminal node).
• Continue until only leaf nodes are left or until a minimum node size is reached.
• The decision tree can now be used to make a prediction based on the input features.
Module-2 Topic-4: Random Forest
Feature Importance
• Feature importance is calculated as
• the reduction in sum of squared errors whenever a variable is chosen to split
• weighted by the probability of reaching that node.
• However, the variable importance measures are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.
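A short sketch of reading impurity-based feature importances from a fitted forest (assuming scikit-learn; the data is synthetic and the feature names are illustrative):

```python
# Sketch: impurity-based feature importances from a random forest (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)   # feature f2 is noise

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["f0", "f1", "f2"], forest.feature_importances_):
    print(name, round(imp, 3))   # weighted impurity reduction attributed to each feature
```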
Module-2 Topic-4: Random Forest
Processing the ensemble of trees.
Module-2 Topic-4: Random Forest
• Random forest introduces randomness into the rows and columns of the data
• Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
Module-2 Topic-4: Random Forest
Module-2 Topic-4: Random Forest
Applications
• Banking Industry
• Credit Card Fraud Detection
• Customer Segmentation
• Predicting Loan Defaults on LendingClub.com
• Healthcare and Medicine
• Cardiovascular Disease Prediction
• Diabetes Prediction
• Breast Cancer Prediction
• Stock Market
• Stock Market Prediction
• Stock Market Sentiment Analysis
• Bitcoin Price Detection
• E-Commerce
• Product Recommendation
• Price Optimization
• Search Ranking
Module-2 Topic-4: Random Forest
Case Study
• Let us consider the example of the Boston Housing dataset. This is a well-
known dataset of information about different houses in Boston. For each
house, 13 values are known, such as the crime rate in that area,
industrialization value, average age of residents, and so on. Our task is to
train a model to predict the value of a house given these values.
Module-2 Topic-4: Random Forest
Case Study
• Let us consider the example of the Boston Housing dataset.
Case Study
• Training a single decision tree on the dataset
• We can note that of the 13 original
features, this decision tree has used
only LSTAT (the percentage of the
population in low income groups) and
RM (average number of rooms per
dwelling) to generate a prediction.
• The four leaf nodes show us that this single tree can produce only four possible outputs: $30k, $44k, $22k and $14k, even though we are solving a regression problem and the true value could be one of many continuous values.
• This simple decision tree has a mean
absolute error of $3.6k on the training
set, and $3.8k on the test set. This
means that although it is not a
powerful model, it performs similarly
on seen and unseen data, and so it has
generalized well and has not overfit the
training data.
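A rough sketch of this case study (assuming scikit-learn and pandas; the Boston dataset has been removed from recent scikit-learn releases, so a local CSV with the 13 feature columns and a MEDV price column is assumed, and the file and column names are illustrative):

```python
# Sketch: one shallow decision tree vs. a random forest on Boston-style housing data.
# Assumes a local "boston.csv" with 13 feature columns and a MEDV target column.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("boston.csv")                              # illustrative file name
X, y = df.drop(columns="MEDV"), df["MEDV"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)   # at most 4 leaves
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, model in [("single tree", tree), ("random forest", forest)]:
    print(name, mean_absolute_error(y_te, model.predict(X_te)))
```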
Module-2 Topic-4: Random Forest
Case Study
• Making a random forest ensemble
model
Module-2 Topic-4: Random Forest
Case Study
• Performance of the model
Case Study
• Feature Importance in RF
Module-2 Topic-4: Random Forest
Random forest Vs Decision tree
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
Module-2 Topic-5: SVM Classifier
Terminologies in Support Vector Machine (SVM)
• Hyperplane: There can be multiple
lines/decision boundaries to segregate the
classes in n-dimensional space, but we need
to find out the best decision boundary that
helps to classify the data points. This best
boundary is known as the hyperplane of
SVM.
• The dimension of the hyperplane depends on the number of features present in the dataset; with 2 features, the hyperplane is a straight line.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors.
Module-2 Topic-5: SVM Classifier
Terminologies in Support Vector Machine (SVM)
• The distance between the support vectors and the hyperplane is called the margin.
• The goal of SVM is to maximize this margin.
• The hyperplane with the maximum margin is called the optimal hyperplane.
(The figure shows the positive hyperplane, the negative hyperplane, and the decision boundary, a line of the form ax + by − c = 0.)
Module-2 Topic-5: SVM Classifier
Types of Support Vector Machine (SVM)
• Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such data
is termed as linearly separable data.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means a dataset that cannot be classified by using a straight line.
Module-2 Topic-5: SVM Classifier
Types of Support Vector Machine (SVM)
Module-2 Topic-5: SVM Classifier
An example for SVM
Data: <xi, yi>, i = 1, ..., l, where xi ∈ R^d and yi ∈ {−1, +1}.
(In the illustration, the two features are Temperature and Humidity, with yi = +1 meaning "play tennis" and yi = −1 meaning "do not play tennis".)
All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (the equation for a hyperplane from algebra).
Our aim is to find such a hyperplane, f(x) = sign(w·x + b), that correctly classifies our data.
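A minimal sketch of fitting such a linear decision function (assuming scikit-learn; the two-feature play-tennis data below is synthetic and illustrative):

```python
# Sketch: linear SVM decision function f(x) = sign(w·x + b), assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[30, 40], [32, 35], [28, 45], [25, 80], [24, 90], [22, 85]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])      # +1 = play tennis, -1 = do not play

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]       # hyperplane parameters: w·x + b = 0
print(w, b)
print(np.sign(X @ w + b))                    # same labels as clf.predict(X)
```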
Module-2 Topic-5: SVM Classifier
Formulation of Margin
Define the hyperplanes H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
H1 is the plane xi·w + b = +1 and H2 is the plane xi·w + b = −1; the points lying on H1 and H2 are the support vectors.
Module-2 Topic-5: SVM Classifier
Module-2 Topic-5: SVM Classifier
Maximizing the margin
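The algebra on this slide is shown as an image; the standard margin-maximization formulation it refers to is:

```latex
\text{margin} = \frac{2}{\lVert w \rVert}, \qquad
\max_{w,b}\ \frac{2}{\lVert w \rVert}
\;\Longleftrightarrow\;
\min_{w,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(x_i \cdot w + b) \ge 1 \ \ \forall i
```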
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Illustration
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
Suppose we have positively labeled data points.
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
Step 2: The hyperplane defining the SVM is given as follows.
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Solved Example#1
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
Suppose we have positively labeled data points.
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
Step 2: There are two support vectors.
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
Step 4: The above equation reduces to the following.
Module-2 Topic-5: SVM Classifier
Non Linear Support Vector Machine (SVM)-Solved Example#2
Module-2 Topic-5: SVM Classifier
Support Vector Machine (SVM)-Pros and Cons
Advantages:
•Effective in high dimensional spaces.
•Still effective in cases where number of dimensions is
greater than the number of samples.
•Uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
•Versatile: different Kernel functions can be specified for
the decision function. Common kernels are provided, but it
is also possible to specify custom kernels.
Disadvantages:
•If the number of features is much greater than the number of samples, avoiding over-fitting by choosing suitable kernel functions and a regularization term is crucial.
•SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
Module-2 Summary