Chapter 2 - Logistic Regression

Supervised Learning:

Logistic Regression
Support Vector Machine
Decision Tree
Logistic regression: Introduction
• In classification, we seek to identify the categorical class Ck associated with
a given input vector x.
Vehicle features / budget: Buy / Not buy?          0: "Negative Class"
Online transactions: Fraudulent (Yes / No)?        1: "Positive Class"

• In order to predict the correct value of Y for a given value of X, we need:


1. Data (samples, combination of X and Y)
2. Model (function to represent relationship X & Y)
3. Cost function (how well our model approximates training samples)
4. Optimization (find parameters of model to minimize cost function)



Logistic regression – data
• In univariate logistic regression there is a single independent variable (x), and
the model assumes a linear relationship between x and the log-odds of the
dependent variable (y).

Marks scored in entrance examination    Admitted / Not admitted to University
20                                      Not Admitted
60                                      Admitted
36                                      Admitted
32                                      Not Admitted
30                                      Not Admitted
80                                      Admitted
38                                      Admitted


Logistic regression: Hypothesis
• The hypothesis used in Linear Regression predicts continuous values, whereas the
Logistic Regression hypothesis should predict discrete values.

[Figure: malignant (1 = Yes, 0 = No) plotted against tumor size, with a threshold on the fitted curve]

Threshold the classifier output hθ(x) at 0.5:

If hθ(x) ≥ 0.5, predict "y = 1"
If hθ(x) < 0.5, predict "y = 0"


Logistic regression – decision boundary
Decision Boundary

[Figure: two classes in the (x1, x2) plane separated by a straight line]

Predict "y = 1" if θ0 + θ1x1 + θ2x2 ≥ 0, i.e. if θᵀx ≥ 0


Logistic regression – decision boundary
Non-linear decision boundaries

[Figure: circular decision boundary of radius 1 centred at the origin of the (x1, x2) plane]

Predict "y = 1" if θ0 + θ1x1² + θ2x2² ≥ 0, e.g. -1 + x1² + x2² ≥ 0 (a unit circle)


Logistic regression - hypothesis
• The sigmoid function is also called a squashing function, as its domain is
the set of all real numbers and its range is (0, 1).
• We need 0 ≤ hθ(x) ≤ 1, so we use hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(-z)).

• For a given input, the hypothesis always predicts a value between 0 and 1:
if hθ(x) < 0.5 then consider the prediction to be 0
else if hθ(x) ≥ 0.5 then consider the prediction to be 1
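A minimal NumPy sketch of this hypothesis; the parameter values and the single feature (marks scored) are illustrative assumptions, not taken from the slides:

import numpy as np

def sigmoid(z):
    # Squashing function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Logistic regression hypothesis h_theta(x) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

theta = np.array([-4.0, 0.1])   # illustrative parameters [theta_0, theta_1]
x = np.array([1.0, 60.0])       # x_0 = 1 (intercept term), x_1 = marks scored
p = h(theta, x)                 # estimated probability that y = 1
print(p, "-> predict y =", 1 if p >= 0.5 else 0)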
Logistic regression – hypothesis

Training set: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}

m examples,  y ∈ {0, 1}


Logistic regression - hypothesis
Interpretation of Hypothesis Output
hθ(x) = estimated probability that y = 1 on input x

Example: If hθ(x) = 0.7,
tell the patient that there is a 70% chance of the tumor being malignant.

hθ(x) = P(y = 1 | x; θ)
"probability that y = 1, given x, parameterized by θ"


Logistic regression – cost
function
If y = 1, the cost is -log(hθ(x)); it is zero at hθ(x) = 1 and grows without bound as hθ(x) → 0.
If y = 0, the cost is -log(1 - hθ(x)); it is zero at hθ(x) = 0 and grows without bound as hθ(x) → 1.


Logistic regression – cost
function
Logistic regression cost function:

J(θ) = -(1/m) Σ_{i=1..m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]

To fit parameters θ: minimize J(θ).
To make a prediction given a new x: output hθ(x) = 1 / (1 + e^(-θᵀx)).
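A short NumPy sketch of this cost function, using the admission data from the earlier slide (the trial parameter values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# marks scored (with an intercept column of ones) and admitted (1) / not admitted (0)
X = np.array([[1, 20], [1, 60], [1, 36], [1, 32], [1, 30], [1, 80], [1, 38]], dtype=float)
y = np.array([0, 1, 1, 0, 0, 1, 1], dtype=float)
print(cost(np.array([-4.0, 0.12]), X, y))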


Logistic regression -
optimization
Gradient Descent

Want min_θ J(θ):
Repeat {
    θj := θj - α ∂J(θ)/∂θj
} (simultaneously update all θj)

Writing out the derivative, the update becomes:
Repeat {
    θj := θj - α (1/m) Σ_{i=1..m} (hθ(x^(i)) - y^(i)) xj^(i)
} (simultaneously update all θj)
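A minimal batch gradient-descent sketch of these updates on the same admission data; the feature scaling, learning rate and iteration count are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, iterations=10000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)              # predictions for all m examples
        grad = (X.T @ (h - y)) / m          # (1/m) * sum (h - y) * x_j
        theta = theta - alpha * grad        # simultaneous update of all theta_j
    return theta

marks = np.array([20, 60, 36, 32, 30, 80, 38], dtype=float)
X = np.column_stack([np.ones_like(marks), marks / 100.0])   # scale marks to help convergence
y = np.array([0, 1, 1, 0, 0, 1, 1], dtype=float)
theta = gradient_descent(X, y)
print(theta, sigmoid(X @ theta).round(2))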


Logistic regression – multiclass

Email foldering/tagging: Work, Friends, Family, Hobby

Medical diagnosis: Not ill, Cold, Flu

Weather: Sunny, Cloudy, Rain, Snow


Logistic regression – multiclass

Binary classification:                        Multi-class classification:
[Figure: two classes in the (x1, x2) plane]   [Figure: three classes in the (x1, x2) plane]


Logistic regression – multiclass
One-vs-all (one-vs-rest):

[Figure: the three-class problem in the (x1, x2) plane is turned into three binary problems,
one per class]

Class 1: hθ^(1)(x)
Class 2: hθ^(2)(x)
Class 3: hθ^(3)(x)

hθ^(i)(x) = P(y = i | x; θ),  i = 1, 2, 3


Logistic regression – multiclass
One-vs-all

Train a logistic regression classifier hθ^(i)(x) for each
class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the
class i that maximizes hθ^(i)(x).
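A minimal scikit-learn sketch of one-vs-rest logistic regression on a toy three-class problem; the data is randomly generated purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
# three illustrative clusters in the (x1, x2) plane, one per class
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

# one binary logistic regression classifier h^(i)(x) is trained per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

x_new = [[2.5, 0.2]]
print(ovr.predict_proba(x_new))   # probability estimate from each binary classifier
print(ovr.predict(x_new))         # pick the class i that maximizes h^(i)(x)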


Supervised
Learning:
Regularization
Regularization
• Bias explains how much the model under-fits the training data (error from
overly simple assumptions).
• Variance explains the difference in the predictions made on train data and
test data (error from over-fitting the training data).
• In an ideal situation, we need to reduce both bias and variance, which is
where regularization comes in.


Bias and Variance
Regularization – bias-variance
tradeoff

[Figures: bias-variance tradeoff and regularization illustrations]
Regularization
• Lasso Regression (Least Absolute Shrinkage and
Selection Operator) adds the "absolute value of magnitude" of the
coefficients as a penalty term to the loss function.

• If lambda is zero then we get back Ordinary Least
Squares, whereas a very large value will drive the coefficients to
zero and hence the model will under-fit.

• The key difference between these techniques is that Lasso
shrinks the less important features' coefficients to zero,
thus removing some features altogether. So this works
well for feature selection in case we have a huge number
of features.
Regularization
A regression model that uses the L1 regularization technique is
called Lasso Regression, and a model that uses L2 is
called Ridge Regression.
• Ridge regression adds the "squared magnitude" of the coefficients
as a penalty term to the loss function; this squared term is the
L2 regularization element.

• If lambda is zero then we get back Ordinary Least Squares.

• If lambda is very large then it will add too much weight and
lead to under-fitting. That said, it is important how
lambda is chosen. This technique works very well to
avoid the over-fitting issue.
Regularization
• As we can see from the formulas of L1 and L2 regularization,
L1 regularization adds a penalty term to the cost function
using the absolute values of the weight (Wj) parameters,
while L2 regularization adds the squared values of the
weights (Wj) to the cost function.

• L1 regularization helps in feature selection by eliminating
the features that are not important. This is helpful when
the number of features is large.

• With L1 regularization, the estimates tend toward the median of
the data.

• With L2 regularization, the loss minimized in the gradient
calculation step pulls the estimates toward the average (mean) of
the data.
Regularization
• L1 Regularization aka Lasso Regularization:
• This adds regularization terms to the model that are a function of the absolute values of
the coefficients of the parameters.
• The coefficients of the parameters can be driven to zero during the
regularization process. Hence this technique can be used for feature selection and
for generating a more parsimonious model.
• L2 Regularization aka Ridge Regularization:
• This adds regularization terms to the model that are a function of the squares of the
coefficients of the parameters. Coefficients of the parameters can approach zero but
never become exactly zero, and hence L2 does not remove features.
• Combination of the above two, such as Elastic Net:
• This adds regularization terms to the model that are a combination of both L1 and
L2 regularization. A code sketch of all three penalties follows below.
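A minimal scikit-learn sketch of the three penalties; the synthetic data and the alpha (lambda) values are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # 10 features, only the first 2 matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)             # L2: shrinks coefficients, none exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)             # L1: drives unimportant coefficients to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))                # most coefficients are exactly 0 -> feature selection
print(np.round(enet.coef_, 2))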
Support Vector
Machines
SVM - Introduction

• Support Vector Machine (SVM) is a supervised learning
algorithm developed by Vladimir Vapnik; it was first
introduced by Vapnik, Boser and Guyon in COLT-92 (1992).
• Support Vector Machine (SVM) is a relatively
simple supervised machine learning algorithm used for
classification and/or regression.
It is more often preferred for classification but is sometimes very
useful for regression as well.
• Basically, SVM finds a hyper-plane that creates a boundary
between the types of data.
• The two possible results of each classifier are:
• The data point belongs to that class, OR
• The data point does not belong to that class.
SVM - Introduction

• The aim of a support vector machine algorithm is to find the best


possible line, or decision boundary, that separates the data points of
different data classes.
• This boundary is called a hyperplane when working in high-dimensional
feature spaces.
• The idea is to maximize the margin, which is the distance between the
hyperplane and the closest data points of each category, thus making it
easy to distinguish data classes.
• During the training phase, SVMs use a mathematical formulation to find
the optimal hyperplane in a higher-dimensional space, often called
the kernel space.
SVM - Introduction

• We are given a set of n points (vectors) x1, x2, ..., xn such that
each xi is a vector of length m and each belongs to one of two
classes, which we label "+1" and "-1":

    xi ∈ R^m,  yi ∈ {+1, -1}  for all i

• So our training set is (x1, y1), (x2, y2), ..., (xn, yn), and the decision
function will be

    f(x) = sign(w·x + b)

• We want to find a separating hyperplane w·x + b = 0
that separates these points into the two classes, "the positives" (class
"+1") and "the negatives" (class "-1"), assuming that they are linearly
separable.
SVM – Separating Hyperplane

[Figure: points with yi = +1 and yi = -1 in the (x1, x2) plane, a separating
hyperplane w·x + b = 0, and the classifier f(x) = sign(w·x + b)]

But there are many possibilities
for such hyperplanes!
SVM – Separating Hyperplane

Which one should we choose?

Yes, there are many possible separating hyperplanes.
It could be this one, or this, or this, or maybe...!
SVM – Choosing a separating hyperplane

• Suppose we choose a hyperplane that is
close to some sample xi.
• Now suppose we have a new point x' that should be in class
"-1" and is close to xi. Using our classification function
f(x) = sign(w·x + b), this point is misclassified!

Poor generalization!
(Poor performance on unseen data)
SVM – Choosing a separating hyperplane

• The hyperplane should be as far as possible from any sample point.
• This way, new data that is close to the old samples will be classified
correctly.

Good generalization!
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case

• The SVM idea is to maximize the distance between the hyperplane
and the closest sample point.
• For the optimal hyperplane:
  the distance to the closest negative point = the distance to the closest positive point.
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case

SVM's goal is to maximize the margin, which is twice the distance "d"
between the separating hyperplane and the closest sample.

Why is it the best?
• Robust to outliers, as we saw, and thus strong generalization
ability.
• It has proved itself to have better
performance on test data in both
practice and in theory.
SVM – Choosing a separating hyperplane
The SVM approach: linearly separable case

Support vectors are the samples closest to the separating hyperplane.

Oh! So this is where
the name came from!

[Figure: the margin of width 2d, with the support vectors lying on its boundaries]

We will see later that the
optimal hyperplane is
completely defined by
the support vectors.
SVM – Margin Decision Boundary

• The decision boundary should be as far away from
the data of both classes as possible.
• We should maximize the margin, m.
• The distance between the origin and the line wᵀx = -b is |b| / ||w||.

[Figure: Class 1 and Class 2 separated by the decision boundary, with margin m]
SVM - Example

[Worked example: figures only]
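A minimal scikit-learn sketch of a linear SVM on toy, linearly separable data (the points are randomly generated for illustration); it also reports the support vectors and the margin discussed above:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two illustrative linearly separable clusters, labelled -1 and +1
X = np.vstack([rng.normal([-2, -2], 0.5, (20, 2)), rng.normal([2, 2], 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1000).fit(X, y)   # large C -> (nearly) hard margin

w, b = clf.coef_[0], clf.intercept_[0]         # separating hyperplane w.x + b = 0
print("support vectors:\n", clf.support_vectors_)
print("margin = 2d =", 2.0 / np.linalg.norm(w))
print(clf.predict([[1.5, 1.0]]))               # f(x) = sign(w.x + b)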
Decision Tree and
Naïve Bayesian
Decision Tree
Issues Regarding Classification and
Prediction

Preparing the Data for Classification and Prediction


• Data cleaning: This refers to the preprocessing of data in order
to remove or reduce noise and the treatment of missing values.
• Relevance analysis: Many of the attributes in the data may be
redundant. Correlation analysis can be used to identify whether
any two given attributes are statistically related.
• Data transformation and reduction: The data may be
transformed by normalization, particularly when neural
networks or methods involving distance measurements are used
in the learning step. Normalization involves scaling all values for
a given attribute so that they fall within a small specified range,
such as 0.0 to 1.0.
Comparing Classification and Prediction
Methods

• Accuracy: The accuracy of a classifier refers to the ability of a


given classifier to correctly predict the class label of new or
previously unseen data (i.e., tuples without class label
information).
• Speed: This refers to the computational costs involved in
generating and using the given classifier or predictor
• Robustness: This is the ability of the classifier or predictor to
make correct predictions given noisy data or data with missing
values.
• Scalability: This refers to the ability to construct the classifier
or predictor efficiently given large amounts of data.
Classification by decision tree Induction

• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected
attributes
• Tree pruning
• Identify and remove branches that reflect noise or
outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the
decision tree
Training Dataset

[Table: training tuples with attributes age, income, student and credit limit, and the
class label buys_computer (9 "yes" tuples, 5 "no" tuples); used in the worked example below]


Algorithm

• Basic algorithm (a greedy algorithm)


• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Attribute Selection Measure

• Information gain (ID3/C4.5)
  • All attributes are assumed to be categorical
  • Can be modified for continuous-valued attributes
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and
    n elements of class N
  – The amount of information needed to decide whether an arbitrary
    example in S belongs to P or N is defined as

    I(p, n) = -(p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))
Information Gain in Decision Tree
Induction
• Assume that using attribute A, a set S will be partitioned into
  sets {S1, S2, ..., Sv}
• If Si contains pi examples of P and ni examples of N, the
  entropy, or the expected information needed to classify
  objects in all subtrees Si, is

    E(A) = Σ_{i=1..v} ((pi + ni) / (p + n)) · I(pi, ni)

• The encoding information that would be gained by branching
  on A is

    Gain(A) = I(p, n) - E(A)
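A minimal Python sketch of these three quantities; the function names are illustrative:

import math

def info(p, n):
    # I(p, n): expected information to classify an example as P or N (0*log2(0) treated as 0)
    total = p + n
    return sum(-(c / total) * math.log2(c / total) for c in (p, n) if c)

def entropy_after_split(partitions, p, n):
    # E(A): weighted information over the partitions {S1, ..., Sv},
    # each partition given as a (pi, ni) pair of class counts
    return sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in partitions)

def gain(partitions, p, n):
    # Gain(A) = I(p, n) - E(A)
    return info(p, n) - entropy_after_split(partitions, p, n)

# e.g. 9 examples of P and 5 of N, as in the worked example later in the chapter
print(round(info(9, 5), 3))   # ~ 0.940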
Gain (A)

• In other words, Gain(A) tells us how much would be gained by
branching on A.

• It is the expected reduction in the information requirement
caused by knowing the value of A.

• The attribute A with the highest information gain, Gain(A), is
chosen as the splitting attribute at node N.

• This is equivalent to saying that we want to partition on the
attribute A that would do the "best classification":

    Gain(A) = I(p, n) - E(A)


Information Gain (A)

But how can we compute the information
gain of an attribute that is continuous-
valued, unlike in the example?

• Suppose, instead, that we have an attribute A that is continuous-
valued, rather than discrete-valued.
• (For example, suppose that instead of the discretized version of age
from the example, we have the raw values for this attribute.)
• For such a scenario, we must determine the "best" split-point for
A, where the split-point is a threshold on A.


Information Gain in Decision Tree
Induction
• The algorithm is called with three parameters: D, attribute list, and
Attribute selection method.

• We refer to D as a data partition. Initially, it is the complete set of


training tuples and their associated class labels.

• The parameter attribute list is a list of attributes describing the


tuples.

• Attribute selection method specifies a heuristic procedure for


selecting the attribute that “best” discriminates the given tuples
according to class.



Attribute Selection Measures

• Is a heuristic for selecting the splitting criterion that “best”


separates a given data partition, D, of class-labeled training tuples
into individual classes.

• If we were to split D into smaller partitions according to the


outcomes of the splitting criterion, ideally each partition would be
pure.

• Attribute selection measures are also known as splitting rules


because they determine how the tuples at a given node are to be
split.



Implementation

The class label attribute, buys computer, has two distinct values
(namely, yes, no)

Therefore, there are two distinct classes (that is, m = 2).


Let class C1 correspond to yes.
Let class C2 correspond to no.

There are
Nine tuples of class yes
Five tuples of class no.

A (root) node N is created for the tuples in D.

To find the splitting criterion for these tuples, we must compute


the information gain of each attribute.
Implementation

Information gain will be calculated using the following


formula

Gain(A) = Info(D) - Info_A(D)

Info(D) = -(p / (p+n)) log2(p / (p+n)) - (n / (p+n)) log2(n / (p+n))

Info_A(D) = Σ_{i=1..v} ((pi + ni) / (p + n)) · Info(pi, ni)
Implementation

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14)
        = 0.410 + 0.530
        = 0.940
Now,
-> gain for age:

I(p1, n1) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
I(p2, n2) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0
I(p3, n3) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971

E(Age) = ((2+3) / (9+5)) * 0.971 + ((4+0) / (9+5)) * 0 + ((3+2) / (9+5)) * 0.971
       = 0.694

Gain(Age) = Info(D) - E(Age) = 0.940 - 0.694 = 0.246
Implementation

Now,
-> gain for income:

I(p1, n1) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
I(p2, n2) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918
I(p3, n3) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.811

E(Income) = ((2+2) / (9+5)) * 1 + ((4+2) / (9+5)) * 0.918 + ((3+1) / (9+5)) * 0.811
          = 0.911

Gain(Income) = Info(D) - E(Income) = 0.940 - 0.911 = 0.029
Implementation

Now,
-> gain for student:

I(p1, n1) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.592
I(p2, n2) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985

E(Student) = ((6+1) / (9+5)) * 0.592 + ((3+4) / (9+5)) * 0.985
           = 0.789

Gain(Student) = Info(D) - E(Student) = 0.940 - 0.789 = 0.151
Implementation

Now,
-> gain for credit limit:

I(p1, n1) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811
I(p2, n2) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1

E(Credit_Limit) = ((6+2) / (9+5)) * 0.811 + ((3+3) / (9+5)) * 1
                = 0.892

Gain(Credit_Limit) = Info(D) - E(Credit_Limit) = 0.940 - 0.892 = 0.048
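A short Python check of these four gains, built only from the per-value class counts shown above; the helper functions repeat the earlier sketch so the snippet runs on its own, and small differences from the slide numbers come from rounding the intermediate values:

import math

def info(p, n):
    # I(p, n), with 0 * log2(0) treated as 0
    return sum(-(c / (p + n)) * math.log2(c / (p + n)) for c in (p, n) if c)

def gain(partitions, p=9, n=5):
    # Gain(A) = Info(D) - Info_A(D), for the 9 "yes" / 5 "no" training tuples
    info_a = sum(((pi + ni) / (p + n)) * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - info_a

print(round(gain([(2, 3), (4, 0), (3, 2)]), 3))  # age          ~ 0.247
print(round(gain([(2, 2), (4, 2), (3, 1)]), 3))  # income       ~ 0.029
print(round(gain([(6, 1), (3, 4)]), 3))          # student      ~ 0.152
print(round(gain([(6, 2), (3, 3)]), 3))          # credit limit ~ 0.048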
Implementation

Since Gain(Age) is the maximum of all, we choose Age as the
splitting attribute. The root node AGE has three branches:
Age = youth, Age = mid_age (all tuples in class C1) and Age = senior.

Age = youth partition:
Income    Student    Credit    Class
High      No         Fair      No
High      No         Exclnt    No
Med       No         Fair      No
Low       Yes        Fair      Yes
Med       Yes        Exclnt    Yes

Age = senior partition:
Income    Student    Credit    Class
Med       No         Fair      Yes
Low       Yes        Fair      Yes
Low       Yes        Exclnt    No
Med       Yes        Fair      Yes
Med       No         Exclnt    No
Implementation

The resulting decision tree:

AGE
  Age = youth   -> test Student
      Student = Yes -> Class : C1
      Student = No  -> Class : C2
  Age = mid_age -> Class : C1
  Age = senior  -> test Credit_limit
      Credit_limit = fair      -> Class : C1
      Credit_limit = excellent -> Class : C2
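A minimal scikit-learn sketch of building such a tree with the entropy (information gain) criterion. It uses only the ten youth/senior tuples listed above, one-hot encoded with pandas, purely to illustrate the API; the column names follow the slides:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    # age,     income, student, credit,   buys_computer
    ("youth",  "high", "no",  "fair",   "no"),
    ("youth",  "high", "no",  "exclnt", "no"),
    ("youth",  "med",  "no",  "fair",   "no"),
    ("youth",  "low",  "yes", "fair",   "yes"),
    ("youth",  "med",  "yes", "exclnt", "yes"),
    ("senior", "med",  "no",  "fair",   "yes"),
    ("senior", "low",  "yes", "fair",   "yes"),
    ("senior", "low",  "yes", "exclnt", "no"),
    ("senior", "med",  "yes", "fair",   "yes"),
    ("senior", "med",  "no",  "exclnt", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
y = df["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))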
Bayesian Classifiers :

• “What are Bayesian classifiers?” Bayesian classifiers are statistical


classifiers.
• They can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.
• Let X be a data tuple. In Bayesian terms, X is considered
“evidence.” As usual, it is described by measurements made on a
set of n attributes.
• Let H be some hypothesis, such as that the data tuple X belongs
to a specified class C.
• For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the
"evidence" or observed data tuple X.
• In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute description of X.
Bayesian Classifiers :

• Naive Bayesian classifiers assume that the
effect of an attribute value on a given class is
independent of the values of the other attributes.
This assumption is called class conditional
independence.

• Dependencies can exist between variables.
Bayesian belief networks specify joint
conditional probability distributions.
They allow class conditional independencies to be
defined between subsets of variables.
They provide a graphical model of causal
relationships, on which learning can be performed.
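A minimal scikit-learn sketch of a naive Bayesian classifier on categorical data; it reuses the illustrative youth/senior tuples from the decision-tree sketch, so the data and the query tuple are assumptions for demonstration only:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [
    ["youth",  "high", "no",  "fair"],
    ["youth",  "high", "no",  "exclnt"],
    ["youth",  "med",  "no",  "fair"],
    ["youth",  "low",  "yes", "fair"],
    ["youth",  "med",  "yes", "exclnt"],
    ["senior", "med",  "no",  "fair"],
    ["senior", "low",  "yes", "fair"],
    ["senior", "low",  "yes", "exclnt"],
    ["senior", "med",  "yes", "fair"],
    ["senior", "med",  "no",  "exclnt"],
]
y = ["no", "no", "no", "yes", "yes", "yes", "yes", "no", "yes", "no"]

enc = OrdinalEncoder()                    # map category strings to integer codes
X = enc.fit_transform(X_raw)

nb = CategoricalNB().fit(X, y)            # estimates P(class) and P(attribute value | class)

x_new = enc.transform([["youth", "med", "yes", "fair"]])
print(nb.predict(x_new))                  # most probable class
print(nb.predict_proba(x_new))            # class membership probabilities P(H|X)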
Naïve Bayesian Example:

[Worked example: figures only]
