3-LG Eval
Zijun Yao
Assistant Professor, EECS Department
The University of Kansas
1
Anonymous Feedback (active till Feb 3)
2
Agenda
• Loss function
• Optimizing parameters
• Model Evaluation
• Metrics
• Methods
3
House price prediction - regression
[Diagram: features 𝒙 (size of house, # of bedrooms, …) → 𝑓 → target 𝑦 (price of house, a scalar value)]
4
Linear regression recap
• The problem of predicting continuous values is called a regression problem
• Given
• Data
• Corresponding labels
$L(\mathbf{w}, b) = \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$
5
House price prediction - classification
[Diagram: features 𝒙 (size of house, # of bedrooms, …) → 𝑓 → target 𝑦 (trend of price, a class: goes up or goes down)]
Example target column (Trend of Price): Down, Up, Down, Down, Up, Up, Down, Up
6
Linear classifiers - 3 steps
• Model definition:
If 𝑧(𝒙) > 0: output = class 1
Else: output = class 0
7
Step 1: Logistic regression definition
• Weight all features using linear regression
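$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$ (where $w_0$ is the bias $b$ and $x_0 = 1$)
• Sigmoid function: $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
• Predicted class: $\hat{y} = \begin{cases} 1, & z \ge 0 \text{ (positive class)} \\ 0, & z < 0 \text{ (negative class)} \end{cases}$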
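A minimal sketch of Step 1 in Python, assuming the weight vector already includes the bias as its first entry and every input starts with x0 = 1. The function names and the example weights are illustrative only, not part of the slides.

```python
import math


def sigmoid(z):
    """Sigmoid 1 / (1 + e^-z), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_score(w, x):
    """z = w0*x0 + w1*x1 + ... + wd*xd, where x[0] is expected to be 1."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def predict_class(w, x):
    """Class 1 if z >= 0 (equivalently sigmoid(z) >= 0.5), otherwise class 0."""
    return 1 if linear_score(w, x) >= 0 else 0


# Hypothetical weights: bias w0 = -1.0, one feature weight w1 = 2.0
w = [-1.0, 2.0]
for x in ([1.0, 0.25], [1.0, 1.0]):
    z = linear_score(w, x)
    print(x, z, sigmoid(z), predict_class(w, x))
```

Thresholding z at 0 and thresholding the sigmoid output at 0.5 give the same class, because the sigmoid is increasing and sigmoid(0) = 0.5.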
8
Probability
• Many problems require a probability estimate as the output $\hat{y}$
• Credit card application example
• Probability of accepting application 𝑝(𝑎𝑐𝑐𝑒𝑝𝑡|𝑎𝑔𝑒, 𝑖𝑛𝑐𝑜𝑚𝑒)
• 𝑧 = 𝑎𝑔𝑒 + 1.25 × 𝑖𝑛𝑐𝑜𝑚𝑒 − 80
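• Probability $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

age   income (k)   z       ŷ = σ(z)   class
47    55           35.75   0.9999     1
42    30           -0.5    0.3775     0

[Figure: the two applicants plotted as (income, age) points (55, 47) and (30, 42); axes: Income (k) and Age]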
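The two table rows can be reproduced with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the 0.5 probability threshold for the class column is an assumption consistent with the z ≥ 0 rule from Step 1.

```python
import math


def accept_probability(age, income_k):
    """Return (z, sigma(z)) for z = age + 1.25 * income - 80, income in thousands."""
    z = age + 1.25 * income_k - 80
    return z, 1.0 / (1.0 + math.exp(-z))


for age, income_k in [(47, 55), (42, 30)]:
    z, p = accept_probability(age, income_k)
    predicted_class = int(p >= 0.5)  # same decision as z >= 0
    print(age, income_k, z, p, predicted_class)
```

For the first applicant the probability differs from 1 by less than 1e-15, so it is effectively 1; the table shows it cut off at four decimals.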
9
Interpretation of logistic regression
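• 𝑓(𝒙) estimates the probability of the positive class (𝑦 = 1) for input 𝒙
• Example: cancer diagnosis from tumor size
$\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumor size} \end{bmatrix}$
$f(\mathbf{x}) = 0.7$
The probability that this patient has a malignant tumor is 70%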
10
Interpretation of logistic regression
• Learns the (log-)odds of the positive class as a linear function of the features
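A short worked derivation of the odds interpretation, assuming $p = \sigma(z)$ with $z = \mathbf{w}^{\top}\mathbf{x}$ as defined in Step 1 (the symbol $p$ is introduced here only for readability):

```latex
\begin{align*}
p &= \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad 1 - p = \frac{e^{-z}}{1 + e^{-z}} \\
\text{odds} &= \frac{p}{1 - p} = e^{z} \\
\log \frac{p}{1 - p} &= z = \mathbf{w}^{\top}\mathbf{x}
\end{align*}
```

So the linear score $z$ is the log-odds of the positive class: increasing $z$ by 1 multiplies the odds by $e$.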
11
Decision boundary
12
Logistic regression as an artificial neuron:
Connection to neural networks
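[Figure: inputs $x_1, \dots, x_d$ are multiplied by weights $w_1, \dots, w_d$ and summed together with the bias $w_0$ (or $b$) to give $z$; the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ then outputs $p_{w,b}(C_1 \mid x)$]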
13
Step 2: Goodness of a function
Learning logistic regression model
• How are the parameters of the model (the weights w) learned?
• We want to learn parameters $\mathbf{w}$ that make $\hat{y}$ for each training example as close as possible to the true class $y$
$\hat{y} = \sigma(\mathbf{w}^{\top}\mathbf{x})$
• Loss function: how close the classifier output ($\hat{y} = \sigma(\mathbf{w}^{\top}\mathbf{x})$) is to the correct output ($y$, which is 0 or 1)
$\mathcal{L}(y, \hat{y})$ = how much the predicted output $\hat{y}$ differs from the true $y$
14
Loss function in probability
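• The likelihood of the data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ given the parameters $\mathbf{w}$ is $\prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w})$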
17
Deriving the loss function
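• The likelihood of the data $D = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})\}$ given the parameters $\mathbf{w}$ is $\prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w})$
• Maximizing the likelihood of the data gives the best logistic regression model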
18
Deriving the loss function
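The derivation steps for this slide did not survive extraction. What follows is a sketch of the standard derivation, assuming each label is a Bernoulli variable with $p(y = 1 \mid \mathbf{x}; \mathbf{w}) = \hat{y} = \sigma(\mathbf{w}^{\top}\mathbf{x})$; the $\frac{1}{n}$ averaging is chosen to match the objective used in Step 3.

```latex
\begin{align*}
p(y \mid \mathbf{x}; \mathbf{w}) &= \hat{y}^{\,y} \, (1 - \hat{y})^{1 - y} \\
\log \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w})
  &= \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)}
     + (1 - y^{(i)}) \log\left(1 - \hat{y}^{(i)}\right) \right] \\
L(\mathbf{w}) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)}
     + (1 - y^{(i)}) \log\left(1 - \hat{y}^{(i)}\right) \right]
\end{align*}
```

Maximizing the likelihood is therefore the same as minimizing $L(\mathbf{w})$, the negative log-likelihood (also called log loss or cross-entropy).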
21
Interpreting the loss
• Loss of a single data instance
When 𝑦 = 1
• loss = 0 if the prediction is correct (𝑓(𝒙) = 1)
• As 𝑓(𝒙) ⟶ 0, loss ⟶ ∞
[Figure: loss plotted against 𝑓(𝒙)]
The intuition is that larger mistakes should get larger penalties
• e.g., predicting 𝑓(𝒙) = 0 when 𝑦 = 1
22
Interpreting the loss
• Loss of a single data instance
When 𝑦 = 0
• loss = 0 if the prediction is correct (𝑓(𝒙) = 0)
• As (1 − 𝑓(𝒙)) ⟶ 0, loss ⟶ ∞
[Figure: loss plotted against 𝑓(𝒙)]
Larger mistakes get larger penalties here as well
• e.g., predicting 𝑓(𝒙) = 1 when 𝑦 = 0
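The two cases can be checked numerically. This sketch assumes the per-instance loss is the log loss, that is −log 𝑓(𝒙) when 𝑦 = 1 and −log(1 − 𝑓(𝒙)) when 𝑦 = 0; the function name is illustrative.

```python
import math


def log_loss_single(y, f_x):
    """Log loss of one instance: -log(f_x) if y == 1, -log(1 - f_x) if y == 0."""
    prob_of_true_class = f_x if y == 1 else 1.0 - f_x
    return -math.log(prob_of_true_class)


# The loss grows as the prediction moves away from the true label y = 1
for f_x in (0.99, 0.5, 0.01):
    print(f"y=1, f(x)={f_x}: loss={log_loss_single(1, f_x):.3f}")
```

The function is undefined when the predicted probability of the true class is exactly 0, which corresponds to the loss going to infinity on the slides.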
23
Regularized logistic regression
• Add a regularization term to constrain model complexity
• Prevent overfitting as we do for linear regression
• L2 norm ($\|\cdot\|_2$): the square root of the sum of the squared vector entries (the Euclidean length of the vector)
• Measure the size of parameter vector 𝒘
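A minimal sketch of how an L2 penalty could be added to a data loss. The regularization strength `lam` and the choice to penalize the squared norm are assumptions made for illustration; the slides do not give the exact form.

```python
import math


def l2_norm(w):
    """Square root of the sum of squared entries of w."""
    return math.sqrt(sum(w_j ** 2 for w_j in w))


def regularized_loss(data_loss, w, lam):
    """Data loss plus lam * ||w||_2^2 (lam is a hypothetical strength parameter)."""
    return data_loss + lam * l2_norm(w) ** 2


w = [3.0, 4.0]
print(l2_norm(w))                      # length of the weight vector
print(regularized_loss(0.5, w, 0.01))  # penalized objective
```

A larger `lam` pushes the optimizer toward smaller weights, which constrains model complexity.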
24
Step 3: Gradient descent for optimization
• To find the optimal weights: minimize the loss function we’ve
defined for the model
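$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} L\left(\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)}), y^{(i)}\right)$
$w_j \leftarrow w_j - \alpha \, \frac{\partial}{\partial w_j} \frac{1}{n} \sum_{i=1}^{n} L\left(\sigma(\mathbf{w}^{\top} \mathbf{x}^{(i)}), y^{(i)}\right)$, update for $j = 0, \dots, d$ ($\alpha$ is the learning rate)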
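A small batch gradient descent sketch for logistic regression in plain Python. It assumes the log loss, whose gradient with respect to each weight is the average of (𝑓(𝒙) − 𝑦) times the corresponding feature. The toy data, learning rate, and iteration count are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent on the mean log loss; each row of X starts with x0 = 1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        # Prediction error f(x) - y for every training example
        errors = [
            sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i))) - y_i
            for x_i, y_i in zip(X, y)
        ]
        # Gradient for weight j: average of error * x_j over the training set
        grad = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / n for j in range(d)]
        # Update all weights j = 0 ... d from the same gradient
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, grad)]
    return w


# Toy 1-D problem: small feature values are class 0, large values are class 1
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
preds = [1 if sum(w_j * x_j for w_j, x_j in zip(w, x_i)) >= 0 else 0 for x_i in X]
print(w, preds)
```

All weights are updated from one gradient computed at the current `w`, so no weight sees a partially updated vector.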
25
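Logistic regression vs. linear regression
Step 2 (loss, summed over the 𝑛 training examples):
• Logistic regression: negative log-likelihood (log loss)
• Linear regression: SSE loss
Step 3 (gradient descent update):
• Logistic regression: $w_i \leftarrow w_i - \alpha \sum_{n} \left(f_{w,b}(x^{(n)}) - y^{(n)}\right) x_i^{(n)}$
• Linear regression: $w_i \leftarrow w_i - \alpha \sum_{n} \left(f_{w,b}(x^{(n)}) - y^{(n)}\right) x_i^{(n)}$
The two update rules have the same form; only $f_{w,b}$ differs (a sigmoid of the linear score for logistic regression, the linear score itself for linear regression).
28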
Demo
class sklearn.linear_model.SGDClassifier(loss='log', *, penalty='l2', alpha=0.0001, l1_ratio=0.15,
    fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None,
    random_state=None, learning_rate='optimal', eta0=0.0, power_t=0.5, early_stopping=False,
    validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False)
class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0,
    fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs',
    max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
1. Prepare training and test data
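Only the first step of the demo is preserved here. The sketch below shows one plausible way the full workflow could look with scikit-learn. The synthetic dataset, the 80%/20% split, and the follow-on steps (fit, predict, score) are assumptions, not the original demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Prepare training and test data (synthetic stand-in for a real dataset)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit a logistic regression model (L2-regularized by default, strength set by C)
model = LogisticRegression(C=1.0)
model.fit(X_train, y_train)

# 3. Predict classes and class probabilities on the held-out test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# 4. Evaluate on data the model did not see during training
acc = accuracy_score(y_test, y_pred)
print(f"test accuracy: {acc:.3f}")
```

`SGDClassifier` with a logistic loss would fit the same kind of model by stochastic gradient descent. Note that the name of that loss option has changed across scikit-learn versions, so check the installed version's documentation.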
31
Model evaluation
• Metrics for Performance Evaluation
• How to evaluate the predictive capability of a model?
32
Metrics for regression task
• How close is your prediction to the target value?
o A single result on its own does NOT tell you how well the model performs
o But it can be used to compare against other models
33
Metrics for regression task
• How close is your prediction to the target value?
Mean
A goodness-of-fit measure that shows the percentage of the target variation that a linear model explains
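The metric names on these slides are cut off. As an assumption, the sketch below uses two common choices that fit the descriptions: mean squared error as an error measure, and R² as the goodness-of-fit measure. The example prices are made up.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variation explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical house prices (in thousands) and model predictions
y_true = [200.0, 300.0, 400.0, 500.0]
y_pred = [210.0, 290.0, 410.0, 490.0]
print(mean_squared_error(y_true, y_pred), r_squared(y_true, y_pred))
```

The error value depends on the units of the target, so it is mainly useful for comparing models on the same data. R² is unit-free and is easier to read on its own.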
35
Metrics for performance evaluation
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a (TP)       b (FN)
CLASS     Class=No      c (FP)       d (TN)
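A minimal sketch of how the four cells a, b, c, d could be counted from true and predicted labels. The function name and the example labels are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for a binary problem."""
    a = b = c = d = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            a += 1  # true positive
        elif actual == positive:
            b += 1  # false negative
        elif predicted == positive:
            c += 1  # false positive
        else:
            d += 1  # true negative
    return a, b, c, d


y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No"]
a, b, c, d = confusion_counts(y_true, y_pred)
print(a, b, c, d)
```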
37
Cost matrix
PREDICTED CLASS
38
Weighted accuracy

Confusion matrix:
                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a (TP)       b (FN)
CLASS     Class=No      c (FP)       d (TN)

Weights (one per cell of the cost matrix):
                        PREDICTED CLASS
                        Class=Yes           Class=No
ACTUAL    Class=Yes     w1: C(Yes|Yes)      w2: C(No|Yes)
CLASS     Class=No      w3: C(Yes|No)       w4: C(No|No)

$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
39
Computing weighted accuracy
Cost matrix C(i|j):
                    PREDICTED CLASS
                    +        -
ACTUAL CLASS   +    1        100
               -    1        1
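A sketch of the weighted accuracy computation using the cost matrix above as the weights (w1 = 1, w2 = 100, w3 = 1, w4 = 1). The confusion-matrix counts are hypothetical, since the model results from the original slide are not preserved.

```python
def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d), with a=TP, b=FN, c=FP, d=TN."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)


# Hypothetical counts: 2 TP, 1 FN, 1 FP, 4 TN
a, b, c, d = 2, 1, 1, 4
plain = weighted_accuracy(a, b, c, d, 1, 1, 1, 1)    # ordinary accuracy
costly = weighted_accuracy(a, b, c, d, 1, 100, 1, 1)  # false negatives weighted 100x
print(plain, costly)
```

With equal weights the formula reduces to ordinary accuracy. With a weight of 100 on false negatives, a single missed positive pulls the score down sharply.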
$\text{TPR} = \frac{TP}{TP + FN}$
44
Methods for evaluation process
• How to obtain a reliable estimate of performance?
45
Methods of estimation
• Holdout
• Reserve two disjoint sets: training/testing (e.g., 80%/20%)
• A single split may be biased; repeat the holdout with random subsampling
• Cross validation
• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n
• Guarantees that each record is used the same number of times for
training and testing
• Bootstrap
• Draw a sample of size N with replacement
• Repeat b times
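The cross-validation and bootstrap splits above can be sketched with index lists in plain Python. The function names and the fixed seeds are illustrative choices.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


def bootstrap_indices(n, seed=0):
    """Draw n indices with replacement; some records repeat, others are left out."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(train_idx), sorted(test_idx))
print(bootstrap_indices(10))
```

Setting k = n gives leave-one-out. For the bootstrap, calling `bootstrap_indices` with b different seeds gives the b repetitions.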
46
Holdout evaluation
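[Figure: holdout evaluation. The training examples (features + labels) are used to "learn" a model; the model predicts labels for the features of the holdout examples, and the predicted labels are compared with the true labels to estimate performance (Eval).]
[Figure: K-fold cross-validation. The examples are split into Fold 1, Fold 2, …, Fold 𝐾; 𝐾−1 folds are used for training and the remaining fold for testing.]
[Figure: 10-fold example with partitions D1 … D10, repeated once per round; each round tests on one partition and trains on the others.]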
50
Dealing with class imbalance
• If the class we are interested in is very rare, the classifier may simply ignore it.
• This is the class imbalance problem
• Solutions
• Modify the optimization criterion by using a cost-sensitive metric
• Balance the class distribution
• Undersample: sample from the larger class so that the two classes are the same size
• Oversample: replicate the data of the class of interest so that the classes are balanced
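The two balancing strategies can be sketched as follows, assuming the records are already separated into a larger (majority) list and a smaller (minority) list. The function names and toy data are illustrative.

```python
import random


def undersample(majority, minority, seed=0):
    """Keep a random subset of the larger class, the same size as the smaller one."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def oversample(majority, minority, seed=0):
    """Replicate randomly chosen minority records until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


majority = list(range(90))     # 90 records of the common class
minority = list(range(90, 100))  # 10 records of the rare class
maj_u, min_u = undersample(majority, minority)
maj_o, min_o = oversample(majority, minority)
print(len(maj_u), len(min_u), len(maj_o), len(min_o))
```

Resampling is normally applied to the training data only, so that the test set keeps the real class distribution.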
51
Learning curve
A learning curve shows how accuracy changes as the training sample size varies
52