
Unit-2

Logistic regression
• Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning
technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete
value, such as Yes or No, 0 or 1, True or False. However, rather than
giving the exact value 0 or 1, it gives probabilistic values which lie
between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in
how the two are used: Linear Regression is used for solving
regression problems, whereas Logistic Regression is used for
solving classification problems.
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value to a value within the range 0 to 1.
The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit,
so the mapping forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function
or the logistic function.
In logistic regression, we use the concept of a threshold value to decide between 0 and 1:
values above the threshold are mapped to 1, and values below the threshold are mapped to 0.
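For illustration, here is a minimal Python sketch (not from the original slides) of the sigmoid mapping and the threshold rule described above; the example scores are made up.

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    # Values whose probability is above the threshold map to 1, below it to 0
    return (sigmoid(z) >= threshold).astype(int)

scores = np.array([-3.0, -0.2, 0.0, 1.5, 4.0])   # hypothetical linear scores w.x + b
print(sigmoid(scores))        # probabilities between 0 and 1
print(predict_class(scores))  # [0 0 1 1 1]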
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variables should not exhibit multicollinearity.
Type of Logistic Regression:
• On the basis of the categories, Logistic Regression
can be classified into three types:
• Binomial: In binomial Logistic regression, there can
be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression,
there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dog", or
"sheep".
• Ordinal: In ordinal Logistic regression, there can be 3
or more possible ordered types of the dependent
variable, such as "Low", "Medium", or "High".
Perceptron
• A perceptron is the simplest model of an Artificial
Neural Network. It consists of a single artificial
neuron with the Heaviside step function as the
activation function.

The perceptron is a linear binary classifier. The training phase of the perceptron performs multiple
iterations over the training data points.
A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers
decide whether an input, usually represented by a vector of feature values, belongs to a specific class.
Perceptron Learning Algorithm

• Set all the weights to zero
• Until all the instances in the training data are
classified correctly:
• For each instance I in the training data:
• If I is classified incorrectly by the perceptron:
• If I belongs to the first class, add it to the weight
vector
• else subtract it from the weight vector
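A minimal Python sketch of this perceptron learning rule, assuming a small linearly separable toy dataset with labels +1 and -1 (the data below is made up):

import numpy as np

def train_perceptron(X, y, max_epochs=100):
    # Append a constant column so the bias is learned as an extra weight
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])              # set all the weights to zero
    for _ in range(max_epochs):           # until all instances are classified correctly
        errors = 0
        for xi, target in zip(X, y):
            if np.sign(w @ xi) != target: # instance classified incorrectly
                w += target * xi          # add for the first class (+1), subtract for the other (-1)
                errors += 1
        if errors == 0:
            break
    return w

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # should match y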
Exponential family
Generative learning algorithms,
Generative Adversarial Networks (GANs) are a powerful class of neural networks
that are used for unsupervised learning. It was developed and introduced by Ian J.
Goodfellow in 2014. GANs are made up of a system of two competing
neural network models which are able to analyze,
capture and copy the variations within a dataset. Generative approaches try to build a
model of the positives and a model of the negatives. You can think of a model as a
“blueprint” for a class. A decision boundary is formed where one model becomes
more likely. As these create models of each class they can be used for generation.
To create these models, a generative learning algorithm learns the joint probability
distribution P(x, y).
The joint probability can be written as:
P(x, y) = P(x | y) . P(y) ….(i)
Also, using Bayes’ Rule we can write:
P(y | x) = P(x | y) . P(y) / P(x) ….(ii)
Since, to predict a class label y, we are only interested in the arg max , the
denominator can be removed from (ii).
Hence to predict the label y from the training example x, generative models evaluate:
f(x) = argmax_y P(y | x) = argmax_y P(x | y) . P(y)
The most important part in the above is P(x | y). This is what allows the model to be
generative! P(x | y) means – what x (features) are there given class y.
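As a rough illustration of this decision rule f(x) = argmax_y P(x | y) . P(y), here is a tiny Python sketch with made-up priors and per-class likelihoods for a single categorical feature:

priors = {"spam": 0.4, "ham": 0.6}                  # assumed P(y)
likelihoods = {                                     # assumed P(x | y)
    "spam": {"offer": 0.7, "no_offer": 0.3},
    "ham":  {"offer": 0.1, "no_offer": 0.9},
}

def predict(x):
    # Evaluate P(x | y) * P(y) for every class and return the argmax
    scores = {y: likelihoods[y][x] * priors[y] for y in priors}
    return max(scores, key=scores.get)

print(predict("offer"))     # spam: 0.7 * 0.4 = 0.28 beats ham: 0.1 * 0.6 = 0.06
print(predict("no_offer"))  # ham: 0.9 * 0.6 = 0.54 beats spam: 0.3 * 0.4 = 0.12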
Different types of GANs:
GANs are now a very active topic of research and there have been many different types of GAN
implementation. Some of the important ones that are actively being used currently are described
below:
• Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are
simple multi-layer perceptrons. In a vanilla GAN the algorithm is really simple: it tries to
optimize the GAN objective using stochastic gradient descent.
• Conditional GAN (CGAN): CGAN can be described as a deep learning method in which
some conditional parameters are put into place. In CGAN, an additional parameter 'y' is
added to the Generator for generating the corresponding data. Labels are also put into the
input to the Discriminator in order for the Discriminator to help distinguish the real data from
the fake generated data.
• Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and most
successful implementations of GAN. It is composed of ConvNets in place of multi-layer
perceptrons. The ConvNets are implemented without max pooling, which is replaced
by strided convolutions. Also, the layers are not fully connected.
• Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image
representation consisting of a set of band-pass images, spaced an octave apart, plus a low-
frequency residual. This approach uses multiple Generator and Discriminator
networks at different levels of the Laplacian Pyramid. This approach is mainly used because
it produces very high-quality images. The image is down-sampled at first at each layer of the
pyramid and then it is again up-scaled at each layer in a backward pass where the image
acquires some noise from the Conditional GAN at these layers until it reaches its original
size.
• Super Resolution GAN (SRGAN): SRGAN as the name suggests is a way of designing a
GAN in which a deep neural network is used along with an adversarial network in order to
produce higher-resolution images. This type of GAN is particularly useful in optimally up-
scaling native low-resolution images to enhance their details while minimizing errors in doing so.
Gaussian/Linear discriminant analysis
• LDA is a generative learner as it makes assumption about the data distribution.
• LDA makes some simplifying assumptions about your data:
• That your data is Gaussian, that each variable is shaped like a bell curve when
plotted.
• That each attribute has the same variance, i.e. the values of each variable vary
around the mean by the same amount on average. With these assumptions, the
LDA model estimates the mean and variance from your data for each class.
• The discriminant function used by LDA is:
f_i = μ_i * C^-1 * x_k^T – 0.5 * μ_i * C^-1 * μ_i^T + ln(p_i)
• where,
• f_i is the discriminant score, i.e. the probability of the input belonging to class i
• μ_i is the mean of the features for class i
• C^-1 is the inverse of the pooled covariance matrix
• x_k is the object which is to be classified
• p_i is the prior probability of class i
• We assign the object k with features x_k to the group i that has maximum f_i
Discriminative learners, in contrast:
• model the decision boundary between the classes
• learn the conditional probability distribution p(y|x)
• assume some functional form for p(y|x)
• estimate the parameters of p(y|x) directly from the training data
• examples: logistic regression, support vector machines, traditional neural
networks
Steps:
• Train on the data and obtain a discriminant function, which tells
which class a data point has a higher probability of
belonging to
• Compute μ and σ for each class, then calculate the
probability that a data point belongs to each class; the class
with the highest probability is chosen
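A minimal sketch of these steps using scikit-learn's LinearDiscriminantAnalysis (assuming scikit-learn is available); the two-class data below is synthetic:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),    # class 0 samples
               rng.normal(3.0, 1.0, size=(50, 2))])   # class 1 samples
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()   # estimates the class means and pooled covariance
lda.fit(X, y)
print(lda.predict([[0.5, 0.2], [2.8, 3.1]]))          # class with the highest probability
print(lda.predict_proba([[0.5, 0.2], [2.8, 3.1]]))    # class membership probabilities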
Support vector machines: Optimal hyperplane,
• Support Vector Machine or SVM is one of the most
popular Supervised Learning algorithms, which is
used for Classification as well as Regression
problems. However, primarily, it is used for
Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best
line or decision boundary that can segregate n-
dimensional space into classes so that we can
easily put the new data point in the correct
category in the future. This best decision boundary
is called a hyperplane.
• SVM chooses the extreme points/vectors that help
in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm
is termed a Support Vector Machine.
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly
separable data, which means that if a dataset can be
classified into two classes by using a single straight
line, then such data is termed linearly separable
data, and the classifier used is called a Linear SVM
classifier.
• Non-linear SVM: Non-Linear SVM is used for non-
linearly separable data, which means that if a dataset
cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier
used is called a Non-linear SVM classifier.

• Non-linearly separable data can be made linearly separable by adding a new dimension, e.g. z = x^2 + y^2 (a small sketch follows below).
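A minimal sketch of that idea, using synthetic points on two concentric rings: they are not linearly separable in (x, y), but the added feature z = x^2 + y^2 separates them with a simple threshold.

import numpy as np

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 100)
radius = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)])  # inner and outer ring
x, y = radius * np.cos(angles), radius * np.sin(angles)
labels = np.array([0] * 50 + [1] * 50)

z = x ** 2 + y ** 2   # the new dimension
print(np.all(z[labels == 0] < 4.0), np.all(z[labels == 1] > 4.0))  # True True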


• Hyperplane: There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space, but we need to find out
the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
• The dimensionality of the hyperplane depends on the number of features present in the
dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3
features, the hyperplane is a 2-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e. the maximum
distance between the hyperplane and the nearest data points of each class.
• A hyperplane in an n-dimensional Euclidean space is a flat, n-1
dimensional subset of that space that divides the space into two
disconnected parts. There can be multiple lines/decision boundaries to
segregate the classes in n-dimensional space. The SVM algorithm finds the points of both
classes that lie closest to the boundary; these points are called the support vectors. The
distance between these vectors and the hyperplane is called the margin, and the goal of SVM
is to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
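A minimal sketch of fitting a maximum-margin (linear) SVM with scikit-learn (assumed available); the toy data is synthetic:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, size=(40, 2)),
               rng.normal(2.0, 0.5, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0)   # linear SVM; C controls how soft the margin is
clf.fit(X, y)
print(clf.support_vectors_)                     # the extreme points that define the hyperplane
print(clf.predict([[-1.5, -1.0], [1.8, 2.2]]))  # classify new points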
Kernels.
• SVM algorithms use a set of mathematical functions that are known as kernels.
• The function of a kernel is to take data as input and transform it into the required form.
• Different SVM algorithms use different types of kernel functions. These functions
can be of different types, for example linear, nonlinear, polynomial, radial basis
function (RBF), and sigmoid.
• The most widely used kernel function is the RBF kernel, because it has a localized and finite
response along the entire x-axis.
• kernel functions return the inner product between two points in a suitable feature
space, thus defining a notion of similarity, with little computational cost even in
very high-dimensional spaces
Radial Basis Function (RBF) kernel
• A radial basis function is a real-valued function whose value depends only
on the distance from the origin. Any function that satisfies the property
ϕ(x)=ϕ(||x||) is a radial function.
• There are various types of RBF: Gaussian, Multi-quadratic, Inverse quadratic,
etc.
Gaussian Kernel
• The Gaussian kernel is an example of RBF kernel. The adjustable parameter
sigma plays a major role in the performance of the kernel, and should be
carefully tuned to the problem at hand. If sigma is over-estimated, the exponential will
behave almost linearly and the higher-dimensional projection will start
to lose its non-linear power. On the other hand, if it is under-estimated, the
function will lack regularization and the decision boundary will be highly
sensitive to noise in the training data.
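For reference, a minimal sketch of the Gaussian (RBF) kernel, using the common form k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)); the sigma values below are arbitrary:

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # Similarity decays with squared distance; sigma controls the kernel width
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(gaussian_kernel(a, a))             # 1.0 for identical points
print(gaussian_kernel(a, b, sigma=1.0))  # smaller value for distant points
print(gaussian_kernel(a, b, sigma=5.0))  # a larger sigma flattens the kernel (more "linear")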

Exponential kernel
The exponential kernel is closely related to the Gaussian kernel, with only the square of the
norm left out. It is also a radial basis function kernel.
Model selection and feature selection.
Model selection
• Given a set of models, choose the model that is expected to give the best results.
• Choosing among different learning algorithms, e.g. choosing kNN over other
classification algorithms.
• Choosing parameters within the same learning model, e.g. choosing the value of k in kNN.
Feature Selection- Selecting a useful subset from all the features.
Why Feature Selection?
• Some algorithms scale (computationally) poorly with increased dimension
• Irrelevant features can confuse some algorithms
• Redundant features adversely affect regularization
• Removal of features can increase (relative) margin (and generalization)
• Reduces data set and resulting model size
• Note: Feature Selection is different from Feature Extraction; the latter transforms the
original features to obtain a small set of new features.
How?
• Remove a binary feature if nearly all of its values are the same.
• Use some criterion to rank features and keep the top-ranked features.
• Wrapper Methods: require repeated runs of the learning algorithm with different
sets of features.
Combining classifiers: Bagging, Boosting (the AdaBoost algorithm), Ensemble Models
Bagging
• Its objective is to create several subsets of data from training
sample chosen randomly with replacement. Each collection of
subset data is used to train their decision trees. We get an
ensemble of different models. The average of all the predictions
from the different trees is used, which is more robust than a single
decision tree classifier.
Steps:
• 1. From a training data set with a number of observations and features,
a sample is taken randomly with replacement.
• 2. A subset of features is selected randomly, and whichever
feature gives the best split is used to split the node iteratively.
• 3. The tree is grown to its largest extent.
• 4. The above steps are repeated a number of times, and the prediction is given
based on the aggregation of predictions from the resulting trees.
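A minimal sketch of bagging with scikit-learn (assumed available); BaggingClassifier's default base learner is a decision tree, and the dataset is a built-in toy set:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(n_estimators=50,   # number of bootstrap samples / trees
                        bootstrap=True,    # sample with replacement
                        random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))           # aggregated (majority-vote) accuracy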

Advantages:
• Reduces over-fitting of the model
• Handles higher dimensionality data very well
• Maintains accuracy for missing data
Disadvantages:
• Since the final prediction is based on the mean of the predictions from the subset
trees, it may not give precise values for the classification and
regression model.
Boosting
• It is used to create a collection of predictors. Learners
are trained sequentially, with early learners fitting
simple models to the data and then analysing the data for
errors. Consecutive trees are fit, and at every step the
goal is to improve on the accuracy of the prior tree.
When an input is misclassified by a hypothesis, its
weight is increased so that the next hypothesis is more
likely to classify it correctly. This process converts weak
learners into a better-performing model.
Steps:
• 1. Draw a random subset of training samples without replacement
from the training set to train a first weak learner.
• 2. Draw a second random training subset without replacement from
the training set, add a portion of the samples that were
previously misclassified, and use it to train a second weak learner.
• 3. Find the training samples d3 in the training set D on which the first
two weak learners disagree, and use them to train a third weak learner.
• 4. Combine all the weak learners via majority voting.

Advantages
• Supports different loss function
• Works well with interactions.

Disadvantages
• Prone to over-fitting
• Requires careful tuning of different hyper-parameters
Adaboost
• Weak models are added sequentially, trained using the
weighted training data.
• The training weights are updated giving more weight to
incorrectly predicted instances, and less weight to correctly
predicted instances.
• The process continues until a pre-set number of weak
learners have been created (a user parameter) or no further
improvement can be made on the training dataset.
• Once completed, you are left with a pool of weak learners
each with a stage value.
• A stage value is calculated for the trained model which
provides a weighting for any predictions that the model
makes.
• Predictions are made by calculating the weighted average of
the weak classifiers.
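A minimal sketch of AdaBoost with scikit-learn (assumed available); the default weak learner is a one-level decision tree (a decision stump), and the dataset is a built-in toy set:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)             # weak models added sequentially on reweighted data
print(ada.estimator_weights_[:5])     # stage values (weights) of the first few weak learners
print(ada.score(X_test, y_test))      # weighted-vote accuracy on held-out data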
Evaluating and debugging learning algorithms,
Classification errors.
Evaluating your machine learning algorithm is an essential part of
any project. Your model may give you satisfying results when
evaluated using one metric, say accuracy_score, but may give poor
results when evaluated against other metrics such
as logarithmic_loss. Most of the time we
use classification accuracy to measure the performance of our
model; however, it is not enough to truly judge our model. This
section covers different types of evaluation metrics.
Classification Accuracy
• Classification Accuracy is what we usually mean, when we use
the term accuracy. It is the ratio of number of correct predictions
to the total number of input samples.
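For example, a quick sketch with made-up labels and predictions:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # hypothetical predictions
accuracy = np.mean(y_true == y_pred)    # correct predictions / total samples
print(accuracy)                         # 5 out of 6 correct -> 0.833...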
Logarithmic Loss
• Logarithmic Loss, or Log Loss, works by penalising
false classifications. It works well for multi-class
classification. When working with Log Loss, the
classifier must assign a probability to each class for all
the samples. Suppose there are N samples
belonging to M classes; then the Log Loss is
calculated as
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
(summing over samples i = 1..N and classes j = 1..M)

where,
y_ij, indicates whether sample i belongs to class j or not
p_ij, indicates the probability of sample i belonging to class j
Log Loss has no upper bound and it exists on the range [0, ∞). A Log Loss nearer to 0 indicates
higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy.
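A minimal sketch of this calculation with hypothetical one-hot labels y_ij and predicted probabilities p_ij for N = 3 samples and M = 3 classes:

import numpy as np

y = np.array([[1, 0, 0],                 # y_ij: whether sample i belongs to class j
              [0, 1, 0],
              [0, 0, 1]])
p = np.array([[0.8, 0.1, 0.1],           # p_ij: predicted probability of class j for sample i
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])

eps = 1e-15                              # avoid log(0)
log_loss = -np.mean(np.sum(y * np.log(np.clip(p, eps, 1.0)), axis=1))
print(log_loss)                          # about 0.36; closer to 0 means better probabilities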
Naive Bayes,
• It is a supervised learning algorithm based on Bayes'
theorem and used for solving classification problems. It is mainly
used in text classification involving high-dimensional
training datasets. It is one of the simplest and most effective classification
algorithms. It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object. Examples are spam
filtering, sentiment analysis, and classifying articles. It
assumes that the occurrence of a certain feature is independent of
the occurrence of other features, and it uses Bayes' theorem:
p(Ck|x) = p(Ck) * p(x|Ck) / p(x)
Naive Bayes classifier is based on Bayes theorem which says that
P(H|E) = P(E|H) * P(H) / P(E)
where H is some hypothesis based on some evidence E e.g.
evidence=fever, hypothesis=dengue.
• P(E), P(H) and P(E|H) are probabilities estimated from the data, which are used to
calculate the conditional probability P(H|E).
In Naive Bayes, we have to predict the class (C) of an example (X), so
the equation can be re-written as
• P(C|X) = P(X|C) * P(C) / P(X)
We have to build a classifier using the training set, i.e. we have to calculate the
probabilities P(C), P(X|C) and P(X). As we have only two classes in our training
dataset, P(C) is either P(yes) or P(no). The new example to classify will be (sunny, cool, high, true).

P(C) = number of examples belonging to class C / total number of examples
P(yes) = 9/14
P(no) = 5/14
P(X) = number of examples having X / total number of examples
P(sunny) = 5/14
P(overcast) = 4/14
P(rainy) = 5/14
P(hot) = 4/14
P(mild) = 6/14
P(cool) = 4/14
P(high) = 7/14
P(normal) = 7/14
P(false) = 8/14
P(true) = 6/14
P(X|C) = number of times X is associated
with C / number of examples belonging to
class C
P(sunny|yes) = 2/9, P(sunny|no) = 3/5
P(overcast|yes) = 4/9, P(overcast|no) =
0/5
P(rainy|yes) = 3/9, P(rainy|no) = 2/5
P(hot|yes) = 2/9, P(hot|no) = 2/5
P(mild|yes) = 4/9, P(mild|no) = 2/5
P(cool|yes) = 3/9, P(cool|no) = 1/5
P(high|yes) = 3/9, P(high|no) = 4/5
P(normal|yes) = 6/9, P(normal|no) = 1/5
P(false|yes) = 6/9, P(false|no) = 2/5
P(true|yes) = 3/9, P(true|no) = 3/5
We have obtained all three sets of
probabilities from the training dataset.
Now we want to classify a new,
unclassified example.
Let the example be {sunny, cool, high, true};
we have to predict its class. The class can be
predicted using the formula
P(C|X) = [P(C) * ΠP(X|C)] / ΠP(X)
Case I: Yes
P(yes|sunny,cool,high,true) = P(yes) * P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes) /
[P(sunny) * P(cool) * P(high) * P(true)]
= (9/14 * 2/9 * 3/9 * 3/9 * 3/9) / ΠP(X)

Case II : No
P(no|sunny,cool,high,true) = P(no) * P(sunny|no) * P(cool|no) * P(high|no) * P(true|no) /
[P(sunny) * P(cool) * P(high) * P(true)]
= (5/14 * 3/5 * 1/5 * 4/5 * 3/5) / ΠP(X)
Result:
As ΠP(X) is the same in both equations, we can ignore it, giving
P(yes|sunny,cool,high,true) = 0.00529
P(no|sunny,cool,high,true) = 0.02057
As P(no|sunny,cool,high,true) > P(yes|sunny,cool,high,true), therefore we assign label "no" to it.
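A small Python sketch that reproduces the hand calculation above; the probability tables are copied from the worked example:

prior = {"yes": 9/14, "no": 5/14}
likelihood = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5},
}

x = ["sunny", "cool", "high", "true"]    # the new example to classify
scores = {}
for c in prior:
    score = prior[c]                     # P(C)
    for feature in x:
        score *= likelihood[c][feature]  # multiplied by each P(X|C)
    scores[c] = score

print(scores)                            # roughly {'yes': 0.00529, 'no': 0.02057}
print(max(scores, key=scores.get))       # 'no'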
