
3-LG Eval

The document outlines the key concepts of logistic regression in machine learning, including model definition, loss functions, and parameter optimization. It explains how logistic regression can be used for classification problems, providing examples such as house price prediction and credit card application acceptance. Additionally, it discusses the importance of regularization and gradient descent for optimizing the model's performance.

Uploaded by

vinay

EECS 836: Machine Learning

Zijun Yao
Assistant Professor, EECS Department
The University of Kansas

1
Anonymous Feedback (active till Feb 3)

2
Agenda

• Logistic Regression model


• Model definition

• Loss function

• Optimizing parameters

• Model Evaluation
• Metrics

• Methods

3
House price prediction - regression

[Figure: features x (size of house, # of bedrooms, ...) → f → target y (price of house, a scalar value)]

4
Linear regression recap
• The problem of predicting continuous values is called a regression problem
• Given
• Data
• Corresponding labels

• Find a continuous function that models the continuous points


Model definition

𝑤0 is bias 𝑏, 𝑥0 is 1 for all data


Loss function
$L(\mathbf{w}, b) = \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$
5
House price prediction - classification

[Figure: features x (size of house, # of bedrooms, ...) → f → target y (trend of price, a class: goes up or goes down)]

Example target values (trend of price): Down, Up, Down, Down, Up, Up, Down, Up

6
Linear classifiers - 3 steps
• Model definition:
If $z(\mathbf{x}) > 0$, output = class 1; else, output = class 0

• Loss function: how good is a classifier?

$L(f) = \sum_i \delta\left(f(\mathbf{x}^{(i)}) \neq y^{(i)}\right)$: the number of times $f(\mathbf{x})$ gets incorrect results on the training data.

• Find the best classifier parameters: optimization algorithm

7
Step 1: Logistic regression definition
• Weight all features using linear regression
𝑧 = 𝑤0 𝑥0 + 𝑤1 𝑥1 + 𝑤2 𝑥2 + ⋯ + 𝑤𝑑 𝑥𝑑 (𝑤0 is bias 𝑏, 𝑥0 is 1)

• Pass real value 𝑧 to decision function for confidence of classification

Sigmoid function: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

Decision: $\hat{y} = 1$ if $z \ge 0$ (positive class), $\hat{y} = 0$ if $z < 0$ (negative class)
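The two pieces of Step 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sigmoid` and `predict_class` are assumptions, not names from the slides, and the branch on the sign of `z` is just a common way to avoid overflow in `exp`.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_class(z: float) -> int:
    """Hard decision: positive class (1) when z >= 0, negative class (0) otherwise."""
    return 1 if z >= 0 else 0
```

Note that `z >= 0` is the same condition as `sigmoid(z) >= 0.5`, so thresholding the probability at 0.5 gives the same classes.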

8
Probability
• Many problems require a probability estimate as output 𝑦ො
• Credit card application example
• Probability of accepting application 𝑝(𝑎𝑐𝑐𝑒𝑝𝑡|𝑎𝑔𝑒, 𝑖𝑛𝑐𝑜𝑚𝑒)
• 𝑧 = 𝑎𝑔𝑒 + 1.25 × 𝑖𝑛𝑐𝑜𝑚𝑒 − 80

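• Probability $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

age   income (k)   z       ŷ = σ(z)   class
47    55           35.75   0.9999     1
42    30           -0.5    0.3775     0

[Figure: applicants plotted by income (x-axis) and age (y-axis), with the points (55, 47) and (30, 42) marked]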
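The rows of the table can be recomputed directly from the formula $z = age + 1.25 \times income - 80$. A small sketch, where the function name `accept_probability` is an assumption made for illustration:

```python
import math

def accept_probability(age: float, income_k: float):
    """Return (z, sigma(z)) for the credit card example on this slide."""
    z = age + 1.25 * income_k - 80
    return z, 1.0 / (1.0 + math.exp(-z))

z1, p1 = accept_probability(47, 55)  # first row of the table
z2, p2 = accept_probability(42, 30)  # second row of the table
for z, p in [(z1, p1), (z2, p2)]:
    print(z, p, int(p >= 0.5))  # z, probability, predicted class
```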

9
Interpretation of logistic regression
• 𝑓(𝒙) estimates the probability of class
• Example: cancer diagnosis from tumor size
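$\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumor size} \end{bmatrix}$, $f(\mathbf{x}) = 0.7$
The probability of this patient having a malignant tumor is 70%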

• Properties – probabilities of all classes sum up to 1

10
Interpretation of logistic regression
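• Learns the odds of the positive class: $\frac{\hat{y}}{1 - \hat{y}}$, where $\hat{y} = \frac{1}{1 + e^{-z}}$

• Take the log of the odds: $\log\frac{\hat{y}}{1 - \hat{y}} = z = \mathbf{w}^T\mathbf{x}$

• The logistic regression model assumes that the log odds is a linear function of $\mathbf{x}$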

11
Decision boundary

• For class 0, $\mathbf{w}^T\mathbf{x}$ should take large negative values
• For class 1, $\mathbf{w}^T\mathbf{x}$ should take large positive values

• Set a threshold:
• Predict $y = 1$ if $f(\mathbf{x}) \ge 0.5$
• Predict $y = 0$ if $f(\mathbf{x}) < 0.5$

12
Logistic regression as an artificial neuron:
Connection to neural networks

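[Figure: an artificial neuron. The inputs $x_1, \dots, x_d$ are weighted by $w_1, \dots, w_d$ and summed with the bias $w_0$ (or $b$) to give $z$; the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ then outputs $p_{w,b}(C_1|x)$]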
13
Step 2: Goodness of a function
Learning logistic regression model
• How are the parameters of the model (the weights w) learned?
• We want to learn parameters $\mathbf{w}$ that make $\hat{y}$ for each training example as close as possible to the true class $y$

$D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$

$\hat{y} = \sigma(\mathbf{w}^T\mathbf{x})$

• Loss function: how close the classifier output ($\hat{y} = \sigma(\mathbf{w}^T\mathbf{x})$) is to the correct output ($y$, which is 0 or 1)
$\mathcal{L}(y, \hat{y})$ = how much the predicted class $\hat{y}$ differs from the true $y$
14
Loss function in probability

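• Maximum likelihood estimate given training data:

$D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$

• Likelihood of a single data sample (given parameters)

• How likely are the features to produce the observed label?

$p(y|\mathbf{x}) = \begin{cases} \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 1 \\ 1 - \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 0 \end{cases}$

• The likelihood of the entire data set is the product over all samples: $\prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)})$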


15
Loss function in probability

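• Maximum likelihood estimate given training data:

$D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$

• Likelihood of a single data sample (given parameters)

• How likely are the features to produce the observed label?

$p(y|\mathbf{x}) = \begin{cases} \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 1 \\ \frac{1}{1 + e^{\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 0 \end{cases}$

• The likelihood of the entire data set is the product over all samples: $\prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)})$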


16
Deriving the loss function

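• The likelihood of the data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ given the parameters $\mathbf{w}$ is $\prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)}; \mathbf{w})$

$p(y|\mathbf{x}) = \begin{cases} \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 1 \text{ (probability if class 1)} \\ \frac{1}{1 + e^{\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 0 \text{ (probability if class 0)} \end{cases}$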

17
Deriving the loss function

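• The likelihood of the data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ given the parameters $\mathbf{w}$ is $\prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)}; \mathbf{w})$

$p(y|\mathbf{x}) = \begin{cases} \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 1 \text{ (probability if class 1)} \\ \frac{1}{1 + e^{\mathbf{w}^T\mathbf{x}}}, & \text{if } y = 0 \text{ (probability if class 0)} \end{cases}$

• Take the log of both sides: the log-likelihood is $\sum_{i=1}^{n} \log p(y^{(i)}|\mathbf{x}^{(i)}; \mathbf{w})$

Maximizing the likelihood of the data gives the best logistic regression model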
18
Deriving the loss function

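• Loss function (by adding a negative sign to the log-likelihood):

$L(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log p(y = 1|\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log p(y = 0|\mathbf{x}^{(i)}) \right]$

• Substitute the probability $p$ with the decision function $\sigma$ of $\mathbf{w}^T\mathbf{x}$:

$L(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log \sigma(\mathbf{w}^T\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\left(1 - \sigma(\mathbf{w}^T\mathbf{x}^{(i)})\right) \right]$

• Substitute the decision function $\sigma$ with the sigmoid function $1/(1 + e^{-\mathbf{w}^T\mathbf{x}})$

• Finally, this shows how the loss function is determined by the parameters $\mathbf{w}$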
19
Deriving the loss function

• Loss function (by adding a negative sign to likelihood)

That's it: we have the cross-entropy loss for classification!
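The link between the likelihood and the cross-entropy loss can be checked numerically. The sketch below uses hypothetical toy data and weights (`X`, `y`, `w` are made up for illustration) and compares the negative log of the product likelihood with the summed cross-entropy.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w, X, y):
    """Product over samples of p(y | x; w)."""
    total = 1.0
    for x_i, y_i in zip(X, y):
        p1 = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total *= p1 if y_i == 1 else 1.0 - p1
    return total

def cross_entropy(w, X, y):
    """Sum over samples of -[y log f(x) + (1 - y) log(1 - f(x))]."""
    loss = 0.0
    for x_i, y_i in zip(X, y):
        f = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        loss -= y_i * math.log(f) + (1 - y_i) * math.log(1.0 - f)
    return loss

# Hypothetical toy data: x_0 = 1 is the bias feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
y = [1, 0, 1]
w = [0.1, 0.8]
print(-math.log(likelihood(w, X, y)), cross_entropy(w, X, y))
```

The two printed numbers agree up to floating-point error, which is the statement that maximizing the likelihood and minimizing the cross-entropy pick the same parameters.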
20
Interpreting the loss function
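• Cross-entropy loss for classification, where $f(\mathbf{x})$ gives the prediction

• Loss of a single data instance $i$:
$L\left(f(\mathbf{x}^{(i)}), y^{(i)}\right) = -\left[ y^{(i)} \log f(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\left(1 - f(\mathbf{x}^{(i)})\right) \right]$

[Figure: plots of the logarithm function $\log_e(\cdot)$]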

21
Interpreting the Loss ..
• Loss of a single data instance

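When $y = 1$ (loss $= -\log f(\mathbf{x})$)
• loss = 0 if the prediction is correct ($f(\mathbf{x}) = 1$)
• As $f(\mathbf{x}) \to 0$, loss $\to \infty$
• Intuition: larger mistakes should get larger penalties, e.g., predict $f(\mathbf{x}) = 0$, but $y = 1$

[Figure: loss plotted against $f(\mathbf{x})$]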

22
Interpreting the Loss ..
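• Loss of a single data instance

When $y = 0$ (loss $= -\log(1 - f(\mathbf{x}))$)
• loss = 0 if the prediction is correct ($f(\mathbf{x}) = 0$)
• As $(1 - f(\mathbf{x})) \to 0$, loss $\to \infty$
• Larger mistakes get larger penalties as well, e.g., predict $f(\mathbf{x}) = 1$, but $y = 0$

[Figure: loss plotted against $f(\mathbf{x})$]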
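The behavior described on the last two slides can be seen by evaluating the loss of one instance at a few predictions. A minimal sketch, assuming the helper name `instance_loss` and predictions strictly between 0 and 1:

```python
import math

def instance_loss(f_x: float, y: int) -> float:
    """Cross-entropy loss of one instance; f_x must lie strictly between 0 and 1."""
    return -math.log(f_x) if y == 1 else -math.log(1.0 - f_x)

# The loss for y = 1 grows as the prediction moves toward 0, and vice versa for y = 0.
for f_x in (0.99, 0.5, 0.01):
    print(f_x, instance_loss(f_x, 1), instance_loss(f_x, 0))
```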

23
Regularized logistic regression
• Add a regularization term to constrain model complexity
• Prevent overfitting as we do for linear regression
• L2 norm ($\|\cdot\|_2$): the square root of the sum of the squared vector values (Euclidean distance).
• Measure the size of parameter vector 𝒘

• Overall loss function for optimization

24
Step 3: Gradient descent for optimization
• To find the optimal weights, minimize the loss function we've defined for the model:
$\mathbf{w} = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} L\left(\sigma(\mathbf{w}^T \mathbf{x}^{(i)}), y^{(i)}\right)$

• Learn by the gradient descent method
• Choose a starting point $\mathbf{w}^0$
• Repeat until convergence
• Compute the gradient
• Update the parameters (update for $j = 0 \dots d$)
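The loop above can be sketched in plain Python. This is a sketch under stated assumptions rather than the course's implementation: the toy data, the learning rate `alpha`, the fixed iteration count used in place of a convergence test, and the optional L2 strength `lam` (penalty `lam * ||w||^2` with the bias excluded) are all choices made for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, alpha=0.3, n_iters=5000, lam=0.0):
    """Batch gradient descent on the mean cross-entropy loss.

    Each row of X must start with x_0 = 1 so that w[0] acts as the bias.
    lam is an assumed L2 penalty strength; lam = 0 means no regularization.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d  # starting point w^0
    for _ in range(n_iters):
        # Gradient of the mean loss: (1/n) * sum_i (f(x_i) - y_i) * x_ij
        grad = [0.0] * d
        for x_i, y_i in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i))) - y_i
            for j in range(d):
                grad[j] += err * x_i[j] / n
        for j in range(1, d):
            grad[j] += 2.0 * lam * w[j]  # gradient of the assumed L2 term
        # Simultaneous update for j = 0 ... d
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical 1-D toy data: class 1 for larger feature values.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
preds = [1 if sigmoid(w[0] + w[1] * x[1]) >= 0.5 else 0 for x in X]
```

The gradient used here has the same form as the update rule in the comparison slides that follow, with the step size scaled by 1/n because the objective is the mean loss.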

25
Logistic regression vs. linear regression

Step 1:
• Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$, output between 0 and 1
• Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$, output is a real value

Step 2:

Step 3:

26
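Logistic regression vs. linear regression

Step 1:
• Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$, output between 0 and 1
• Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$, output is a real value

Step 2:
• Logistic regression: training data $(x, y)$ with $y$ = 1 for class 1, 0 for class 0; $L(f) = \sum_n L\left(f(x^{(n)}), y^{(n)}\right)$
• Linear regression: training data $(x, y)$ with $y$ a real number; $L(f) = \sum_n \left(f(x^{(n)}) - y^{(n)}\right)^2$ (SSE loss)

Cross-entropy loss:
$L\left(f(x^{(i)}), y^{(i)}\right) = -\left[ y^{(i)} \log f(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)})\right) \right]$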

27
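Logistic regression vs. linear regression

Step 1:
• Logistic regression: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$, output between 0 and 1
• Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$, output is a real value

Step 2:
• Logistic regression: training data $(x, y)$ with $y$ = 1 for class 1, 0 for class 0; $L(f) = \sum_n L\left(f(x^{(n)}), y^{(n)}\right)$
• Linear regression: training data $(x, y)$ with $y$ a real number; $L(f) = \sum_n \left(f(x^{(n)}) - y^{(n)}\right)^2$

Step 3:
• Logistic regression: $w_i \leftarrow w_i - \alpha \sum_n \left(f_{w,b}(x^{(n)}) - y^{(n)}\right) x_i^{(n)}$
• Linear regression: $w_i \leftarrow w_i - \alpha \sum_n \left(f_{w,b}(x^{(n)}) - y^{(n)}\right) x_i^{(n)}$
28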
Demo
class sklearn.linear_model.SGDClassifier(loss='log', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
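1. Prepare training and test data

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X: feature matrix, y: class labels (assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

2. Specify the model and train it

clf = LogisticRegression(fit_intercept=True)
clf.fit(X_train, y_train)

3. Make predictions and evaluate the model

y_predict = clf.predict(X_test)
accuracy_score(y_test, y_predict)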
29
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
Iris sample data set
• Iris Plant data set.
• Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
• From the statistician Ronald Fisher
• Three flower types (classes):
• Setosa
• Versicolour
• Virginica
• Four (non-class) attributes
• Sepal width and length
• Petal width and length
Data example
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
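The three demo steps can be run end to end on this data set. The sketch below assumes scikit-learn is installed and uses its bundled copy of Iris (`load_iris`) instead of downloading the UCI file; `max_iter=1000` is an assumed setting chosen so the default solver has room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Prepare training and test data (150 flowers, 4 attributes, 3 classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Specify the model and train it
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 3. Make predictions and evaluate the model
y_predict = clf.predict(X_test)
acc = accuracy_score(y_test, y_predict)
print(acc)
```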
30

Agenda

• Logistic Regression model


• Model definition

• Loss function

• Optimizing parameters

• Model Evaluation
• Metrics

• Methods

31
Model evaluation
• Metrics for Performance Evaluation
• How to evaluate the predictive capability of a model?

• Methods for Evaluation Process


• How to obtain reliable estimates?

32
Metrics for regression task
• How close is your prediction to the target value?

• Mean squared error (MSE): $\frac{1}{n}\sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$. Gives larger penalties to big prediction errors by squaring them
• Mean absolute error (MAE): $\frac{1}{n}\sum_{i=1}^{n} \left|y^{(i)} - \hat{y}^{(i)}\right|$. Treats all errors the same

o A single MSE or MAE value can NOT tell how well the model performs
o But it can be used to compare against other models

33
Metrics for regression task
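• How close is your prediction to the target value?

R-squared ($R^2$), also called the coefficient of determination:
$R^2 = 1 - \frac{\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_i \left(y^{(i)} - \bar{y}\right)^2}$ (numerator: error of the prediction; denominator: error of always predicting the mean $\bar{y}$)

A goodness-of-fit measure that shows the percentage of the target variation that a linear model explains

o More informative than MSE and MAE because it shows a percentage rather than an absolute value (with arbitrary range)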
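The three regression metrics can be computed in a few lines. A minimal sketch using the standard definitions; the target and prediction values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - (squared error of the prediction) / (squared error of predicting the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
print(mse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```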
34
Metrics for classification task
• Confusion Matrix (binary classification):

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

35
Metrics for performance evaluation
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

• Most widely-used metric:

$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
36
Limitation of accuracy
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10

• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
• Accuracy is misleading because the model does not detect any class 1 example

37
Cost matrix

                      PREDICTED CLASS
         C(i|j)       Class=Yes     Class=No
ACTUAL   Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS    Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of classifying a class j example as class i

38
Weighted accuracy

CONFUSION MATRIX
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

COST MATRIX
                      PREDICTED CLASS
         C(i|j)       Class=Yes           Class=No
ACTUAL   Class=Yes    w1 = C(Yes|Yes)     w2 = C(No|Yes)
CLASS    Class=No     w3 = C(Yes|No)      w4 = C(No|No)

$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
39
Computing weighted accuracy
Cost matrix
                 PREDICTED CLASS
         C(i|j)  +      -
ACTUAL   +       1      100
CLASS    -       1      1

Model M1
                 PREDICTED CLASS
                 +      -
ACTUAL   +       150    40
CLASS    -       60     250

Accuracy = 80%
Weighted Accuracy = 400/4460 ≈ 8.97%

Model M2
                 PREDICTED CLASS
                 +      -
ACTUAL   +       250    45
CLASS    -       5      200

Accuracy = 90%
Weighted Accuracy = 450/4955 ≈ 9.08%
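The two weighted accuracies can be recomputed from the tables. A small sketch, where the dictionary layout keyed by a, b, c, d (TP, FN, FP, TN) is an assumption made for illustration:

```python
def weighted_accuracy(confusion, cost):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d), with keys a, b, c, d."""
    num = cost["a"] * confusion["a"] + cost["d"] * confusion["d"]
    den = sum(cost[k] * confusion[k] for k in "abcd")
    return num / den

cost = {"a": 1, "b": 100, "c": 1, "d": 1}       # a false negative costs 100
m1 = {"a": 150, "b": 40, "c": 60, "d": 250}     # model M1
m2 = {"a": 250, "b": 45, "c": 5, "d": 200}      # model M2
wa1 = weighted_accuracy(m1, cost)
wa2 = weighted_accuracy(m2, cost)
print(wa1, wa2)
```

With this cost matrix the 10-point gap in plain accuracy between M1 and M2 shrinks to about a tenth of a point, because both models make a similar number of the expensive false negatives.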
40
Precision-Recall
Count
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

$\text{Precision } (p) = \frac{a}{a + c} = \frac{TP}{TP + FP}$

$\text{Recall } (r) = \frac{a}{a + b} = \frac{TP}{TP + FN}$

$\text{F-measure } (F) = \frac{1}{\left(\frac{1/r + 1/p}{2}\right)} = \frac{2rp}{r + p} = \frac{2a}{2a + b + c} = \frac{2TP}{2TP + FP + FN}$

Assumption: the class YES is the one we care about.

Precision is biased towards C(Yes|Yes) & C(Yes|No)
Recall is biased towards C(Yes|Yes) & C(No|Yes)
F-measure is biased towards all except C(No|No)
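As a quick check of these formulas, the sketch below applies them to the counts of model M1 from the weighted-accuracy slide (TP = 150, FN = 40, FP = 60). The function name `precision_recall_f` is an assumption.

```python
def precision_recall_f(tp: int, fn: int, fp: int):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

p, r, f = precision_recall_f(tp=150, fn=40, fp=60)
print(p, r, f)
```

None of the three values depends on the number of true negatives, which is why these metrics stay informative when the negative class is very large.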
41
ROC (Receiver Operating Characteristic)
• ROC curve plots the trade-off between TPR and FPR of a classifier
• Changing threshold of the model to classify data (e.g., 0.5 for sigmoid function)

Look at the positive predictions of the classifier and compute:

• $TPR = \frac{TP}{TP + FN}$: what fraction of positive instances are predicted correctly?
• $FPR = \frac{FP}{FP + TN}$: what fraction of negative instances are predicted incorrectly?

                  Prediction
                  Yes       No
Actual   Yes      a (TP)    b (FN)
         No       c (FP)    d (TN)
42
ROC curve
- Data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TPR = 0.5, FPR = 0.12

[Figure: ROC curve with False Positive (FPR) on the x-axis and True Positive (TPR) on the y-axis; changing t traces the model's curve, and the diagonal corresponds to a random guess. $TPR = \frac{TP}{TP + FN}$, $FPR = \frac{FP}{FP + TN}$]

(TPR=0, FPR=0): Model predicts every instance to be a negative class
(TPR=1, FPR=1): Model predicts every instance to be a positive class
(TPR=1, FPR=0): The perfect model with zero misclassifications
43
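Sweeping the threshold t and recording (FPR, TPR) at each value traces out the curve. A minimal sketch with hypothetical scores and labels, using the slide's rule that a point with score greater than t is classified as positive:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Each distinct score is tried as a threshold; -inf classifies everything as positive.
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
```

The first point is the "everything negative" corner and the last is the "everything positive" corner listed above.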
Model evaluation
• Metrics for Performance Evaluation
• How to evaluate the performance of a model?

• Methods for Evaluation Process


• How to obtain reliable estimates?

44
Methods for evaluation process
• How to obtain a reliable estimate of performance?

• Performance of a model may depend on other factors besides the learning algorithm:
• Class distribution
• Cost of misclassification
• Size of training and test sets

45
Methods of estimation
• Holdout
• Reserve two disjoint sets: training/testing (80%/20%)
• A single split may be biased; repeated holdout by random subsampling reduces this
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Guarantees that each record is used the same number of times for
training and testing
• Bootstrap
• Sampling with replacement of size N
• Repeat b times
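The bootstrap item can be sketched as follows. Testing on the records that were never drawn (the "out-of-bag" records) is a common convention and an assumption here, since the slide does not say which records are used for testing; the function name and seed are also illustrative.

```python
import random

def bootstrap_splits(n: int, b: int, seed: int = 0):
    """Yield b (train, out_of_bag) index lists; train has n draws with replacement."""
    rng = random.Random(seed)
    for _ in range(b):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        out_of_bag = [i for i in range(n) if i not in chosen]
        yield train, out_of_bag

for train, out_of_bag in bootstrap_splits(n=10, b=3):
    print(sorted(train), out_of_bag)
```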
46
Holdout evaluation

[Figure: training examples (features + labels) are used to "learn" a model; the model maps the holdout examples (features) to predicted labels, which are compared with the true labels to estimate performance]

• Train the model on the training set (examples not held out)
• Evaluate the model on the holdout set (aka test set)
• A small holdout set taken from the training set is the validation set, used for monitoring overfitting
47
Cross-validation
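• Why: the data in a single test set may be too different from the data in the training set

[Figure: the examples are split into folds 1 to $K$; $K - 1$ folds are used for training and the remaining fold for testing]

• Repeat for all $K$ ways of splitting into $K - 1$ training folds vs. 1 test fold

• Gives a better estimate of generalization performance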


48
Cross-validation
Common choices of $K$:
• 10-fold cross-validation: very common

[Figure: the data is partitioned into D1 ... D10; each round holds out a different partition for testing]

• $M$-fold cross-validation (where $M$ is #examples):
• Aka "leave one out" cross-validation: each round trains on $M - 1$ examples and tests on the remaining one
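The fold construction can be sketched in plain Python. The function name `k_fold_indices`, the shuffling, and the seed are assumptions made for illustration; setting `k` equal to the number of examples gives leave-one-out.

```python
import random

def k_fold_indices(n: int, k: int, seed: int = 0):
    """Yield (train, test) index lists; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

for train, test in k_fold_indices(n=10, k=5):
    print(sorted(train), sorted(test))
```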
49
Variations on cross-validation
• Repeated cross-validation
• Perform cross-validation a number of times
• Gives an estimate of the variance of the generalization error
• Stratified cross-validation
• Guarantee the same percentage of class labels in training and test
• Important when classes are imbalanced and the sample is small

50
Dealing with class imbalance
• If the class we are interested in is very rare, then the classifier
will ignore it.
• The class imbalance problem
• Solution
• We can modify the optimization criterion by using a cost sensitive
metric
• We can balance the class distribution
• Sample from the larger class so that the size of the two classes is the same
• Replicate the data of the class of interest so that the classes are balanced
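The two balancing strategies in the last bullets can be sketched as follows. The function names, the random choices, and the toy lists are assumptions; real pipelines usually rebalance only the training split.

```python
import random

def undersample(majority, minority, rng):
    """Sample from the larger class so both classes have the same size."""
    return rng.sample(majority, len(minority)) + list(minority)

def oversample(majority, minority, rng):
    """Replicate records of the rare class until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

rng = random.Random(0)
majority = [("neg", i) for i in range(100)]  # hypothetical class 0 records
minority = [("pos", i) for i in range(5)]    # hypothetical class 1 records
small = undersample(majority, minority, rng)
large = oversample(majority, minority, rng)
print(len(small), len(large))
```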

51
Learning curve
Learning curve shows how accuracy changes with varying sample size

Requires a sampling schedule for creating the learning curve

Effect of small sample size:
- Bias in the estimate
- Poor model
- Variance of estimate
- Poor training data

52
