Unit-2: Logistic Regression
Logistic regression
• Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning
technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete
value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except for how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within a range of 0 and 1.
The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value, which separates the two classes: values above the threshold tend towards 1, and values below the threshold tend towards 0.
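As a minimal sketch of this mapping (assuming NumPy and the common 0.5 threshold, which is an illustrative choice rather than part of the notes):

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1), giving the S-shaped curve.
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    # Values at or above the threshold map to class 1, values below to class 0.
    return (sigmoid(z) >= threshold).astype(int)

# Illustrative raw linear scores (w.x + b)
scores = np.array([-2.0, 0.0, 3.0])
print(sigmoid(scores))        # probabilities strictly between 0 and 1
print(predict_class(scores))  # [0 1 1] with the 0.5 threshold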
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variables should not have multicollinearity.
Type of Logistic Regression:
• On the basis of the categories, Logistic Regression
can be classified into three types:
• Binomial: In binomial Logistic regression, there can
be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
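A brief sketch of the binomial case with scikit-learn; the dataset, the scaling step and the default solver settings below are assumptions made for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary (0/1) target, so this is the binomial case.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling just helps the solver converge; LogisticRegression does the classification.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:3]))  # probabilistic values between 0 and 1
print(clf.predict(X_test[:3]))        # thresholded class labels (0 or 1)
print(clf.score(X_test, y_test))      # classification accuracy

For the multinomial and ordinal cases, the model would instead be fit against a target with three or more (unordered or ordered) categories.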
Perceptron
• A perceptron is the simplest model of an Artificial Neural Network. It consists of a single artificial neuron with the Heaviside step function as the activation function.
The perceptron is a linear binary classifier. The training phase of the perceptron performs multiple iterations over the training data points.
A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class.
Perceptron Learning Algorithm
• where f_i is the probability of the input belonging to class i.
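A minimal sketch of the perceptron learning rule with the Heaviside step activation; the learning rate, epoch count and the toy AND data are illustrative assumptions:

import numpy as np

def heaviside(z):
    # Heaviside step activation: 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                  # multiple iterations over the training points
        for xi, target in zip(X, y):
            error = target - heaviside(np.dot(w, xi) + b)
            w += lr * error * xi             # weights change only when the prediction is wrong
            b += lr * error
    return w, b

# Toy linearly separable data: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(heaviside(X @ w + b))  # expected: [0 0 0 1]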
Exponential kernel
The exponential kernel is closely related to the Gaussian kernel, with only the square of the
norm left out. It is also a radial basis function kernel.
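As a sketch of the difference between the two kernels (σ is an assumed bandwidth parameter):

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel: uses the squared Euclidean norm
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def exponential_kernel(x, y, sigma=1.0):
    # Exponential kernel: same form, but with the square of the norm left out
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(gaussian_kernel(x, y), exponential_kernel(x, y))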
Model selection and feature selection.
Model selection
• Given a set of models, choose the model that is expected to give the best results.
• Choosing among different learning algorithms, e.g. choosing kNN over other classification algorithms.
• Choosing parameters within the same learning model, e.g. choosing the value of k in kNN.
Feature Selection- Selecting a useful subset from all the features.
Why Feature Selection?
• Some algorithms scale (computationally) poorly with increased dimension
• Irrelevant features can confuse some algorithms
• Redundant features adversely affect regularization
• Removal of features can increase (relative) margin (and generalization)
• Reduces data set and resulting model size
• Note: Feature Selection is different from Feature Extraction. The latter transforms the original features to get a small set of new features.
How?
• Remove a binary feature if nearly all of its values are the same.
• Use some criteria to rank features and keep the top-ranked features.
• Wrapper Methods: require repeated runs of the learning algorithm with different sets of features.
A short sketch of the first two approaches is given below.
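The sketch below uses scikit-learn; the dataset, the variance threshold and the choice of k are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# 1. Drop (near-)constant features: a variance threshold stands in for
#    "remove a binary feature if nearly all of its values are the same".
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# 2. Rank features by a univariate criterion (ANOVA F-score) and keep the top k.
selector = SelectKBest(score_func=f_classif, k=2)
X_top = selector.fit_transform(X, y)

print(X.shape, X_var.shape, X_top.shape)  # feature counts before and after selection
print(selector.scores_)                   # the per-feature ranking scores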
Combining classifiers: Bagging, Boosting (the AdaBoost algorithm), Ensemble Models
Bagging
• Its objective is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is used to train its own decision tree, so we get an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier.
Steps:
• 1. Given the observations and features in the training data set, a sample from the training data set is taken randomly with replacement.
• 2. A subset of features is selected randomly, and whichever feature gives the best split is used to split the node iteratively.
• 3. The tree is grown to the largest extent possible.
• 4. The above steps are repeated a number of times, and the prediction is given based on the aggregation of predictions from all of the trees.
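A compact sketch of these steps with scikit-learn; the dataset, the number of trees and the random seed are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample (drawn randomly with replacement);
# the ensemble aggregates the predictions of all trees.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))

Adding the random feature subset at each split (step 2) on top of bagging is what a random forest does; scikit-learn's RandomForestClassifier could be swapped in here for that behaviour.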
Advantages:
• Reduces over-fitting of the model
• Handles higher dimensionality data very well
• Maintains accuracy for missing data
Disadvantages:
• Since the final prediction is based on averaging the predictions from the subset trees, it won't give precise values for the classification and regression model.
Boosting
• It is used to create a collection of predictors. Learners are trained sequentially, with early learners fitting simple models to the data, and the data is then analysed for errors. Consecutive trees are fit and, at every step, the goal is to improve the accuracy over the prior tree. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. This process converts weak learners into a better performing model.
Steps:
• 1. Draw a random subset of training samples without replacement from the training set to train the first weak learner.
• 2. Draw a second random training subset without replacement from the training set, add to it a portion of the samples that were previously misclassified, and use it to train the second weak learner.
• 3. Find the training samples d3 in the training set D on which the first two weak learners disagree, and use them to train a third weak learner.
• 4. Combine all the weak learners via majority voting.
Advantages
• Supports different loss function
• Works well with interactions.
Disadvantages
• Prone to over-fitting
• Requires careful tuning of different hyper-parameters
Adaboost
• Weak models are added sequentially, trained using the
weighted training data.
• The training weights are updated giving more weight to
incorrectly predicted instances, and less weight to correctly
predicted instances.
• The process continues until a pre-set number of weak
learners have been created (a user parameter) or no further
improvement can be made on the training dataset.
• Once completed, you are left with a pool of weak learners
each with a stage value.
• A stage value is calculated for the trained model which
provides a weighting for any predictions that the model
makes.
• Predictions are made by calculating the weighted average of
the weak classifiers.
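A short sketch of AdaBoost with decision-stump weak learners in scikit-learn; the stump depth, number of rounds and learning rate are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are added sequentially; after each round the sample weights of
# misclassified points are increased, and each weak learner gets a stage value
# that weights its vote in the final prediction.
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner (decision stump)
    n_estimators=100,                     # pre-set number of weak learners
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
print(ada.estimator_weights_[:5])  # the stage values of the first few weak learners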
Evaluating and debugging learning algorithms, Classification errors
Evaluating your machine learning algorithm is an essential part of any project. Your model may give satisfying results when evaluated using one metric, say accuracy_score, but poor results when evaluated against other metrics such as logarithmic_loss. Most of the time we use classification accuracy to measure the performance of our model; however, it is not enough to truly judge the model. In this section, we will cover the different types of evaluation metrics available.
Classification Accuracy
• Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
Logarithmic Loss
• Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for all the samples. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)   (summing i over the N samples and j over the M classes)
where,
y_ij indicates whether sample i belongs to class j or not
p_ij indicates the probability of sample i belonging to class j
Log Loss has no upper bound and it exists on the range [0, ∞). A Log Loss nearer to 0 indicates higher accuracy, whereas a Log Loss far from 0 indicates lower accuracy.
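A quick sketch of both metrics with scikit-learn; the label and probability arrays are purely illustrative:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([0, 1, 1, 0, 1])      # illustrative ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])      # hard predictions, used for accuracy
y_prob = np.array([[0.9, 0.1],          # per-class probabilities, used for log loss
                   [0.2, 0.8],
                   [0.6, 0.4],
                   [0.7, 0.3],
                   [0.1, 0.9]])

print(accuracy_score(y_true, y_pred))   # correct predictions / total samples = 0.8
print(log_loss(y_true, y_prob))         # nearer to 0 means better probability estimates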
Naive Bayes,
• It is a supervised learning algorithm which is based on Bayes' theorem and is used for solving classification problems. It is mainly used in text classification, which involves a high-dimensional training dataset. It is one of the simplest and most effective classification algorithms. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Examples are spam filtering, sentiment analysis, and classifying articles. It assumes that the occurrence of a certain feature is independent of the occurrence of other features, and it uses Bayes' theorem:
p(Ck|x) = p(Ck) p(x|Ck) / p(x)
Naive Bayes classifier is based on Bayes theorem which says that
P(H|E) = P(E|H) * P(H) / P(E)
where H is some hypothesis based on some evidence E e.g.
evidence=fever, hypothesis=dengue.
• P(E), P(H) and P(E|H) are prior probabilities which are used to calculate the conditional (posterior) probability P(H|E).
In Naive Bayes, we have to predict the class (C) of an example(X), so
the equation can be re-written as
• P(C|X) = P(X|C) * P(C) / P(X)
We have to build a classifier using the above training set, i.e. we have to calculate the probabilities P(C), P(X|C) and P(X). As we have only two classes in our training dataset, P(C) is P(yes) and P(no). The test instance is X = (sunny, cool, high, true).
Case I : Yes
P(yes|sunny,cool,high,true) = P(yes) * P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes) / (P(sunny) * P(cool) * P(high) * P(true))
Case II : No
P(no|sunny,cool,high,true) = P(no) * P(sunny|no) * P(cool|no) * P(high|no) * P(true|no) / (P(sunny) * P(cool) * P(high) * P(true)) = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 / ΠP(X)
Result:
As P(X) is the same in both equations, we can ignore it, giving
P(yes|sunny,cool,high,true) = 0.00529
P(no|sunny,cool,high,true) = 0.02057
As P(no|sunny,cool,high,true) > P(yes|sunny,cool,high,true), therefore we assign label "no" to it.
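A minimal sketch of the same calculation pattern in Python; only the "no"-class numbers stated above are reused, and the "yes" class would be computed identically from its own counts in the training table:

from math import prod

def unnormalized_posterior(prior, likelihoods):
    # P(C|X) is proportional to P(C) * Π P(x_i|C); P(X) is dropped because it is the
    # same for every class and does not change which class scores higher.
    return prior * prod(likelihoods)

# P(no), P(sunny|no), P(cool|no), P(high|no), P(true|no) from the worked example above
p_no = unnormalized_posterior(5 / 14, [3 / 5, 1 / 5, 4 / 5, 3 / 5])
print(round(p_no, 5))  # 0.02057, matching P(no|sunny,cool,high,true) above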