Lecture 7: Loss Function and Regularization
• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge) regularization (a.k.a. weight decay)
– Penalty on the sum of squares of the individual weights
J = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2, \qquad f(x_i) = \sum_{j=0}^{n} w_j\, x_i^{\,j}
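As a minimal NumPy sketch of this objective (the function and variable names here are illustrative, not from the lecture):

    import numpy as np

    def ridge_objective(w, x, y, lam):
        # Polynomial model f(x_i) = sum_{j=0..n} w_j * x_i**j
        n = len(w) - 1
        X = np.vander(x, N=n + 1, increasing=True)   # columns x**0 ... x**n
        residuals = y - X @ w
        mse = np.mean(residuals ** 2)
        # Tikhonov / L2 / ridge penalty; the bias w_0 is excluded, as in the sum above
        penalty = 0.5 * lam * np.sum(w[1:] ** 2)
        return mse + penalty

Minimizing this over w shrinks the coefficients toward zero without setting them exactly to zero.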
Coefficient shrinkage using ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on the sum of absolute values of the weights (see the sketch below)
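A comparable sketch of the lasso (L1-penalized) objective, again with illustrative names; unlike the ridge penalty, minimizing this tends to drive some coefficients exactly to zero:

    import numpy as np

    def lasso_objective(w, X, y, lam):
        # Squared error plus an L1 (sparsity-inducing) penalty on the weights
        residuals = y - X @ w
        return np.mean(residuals ** 2) + lam * np.sum(np.abs(w))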
Lasso coefficient paths with decreasing λ
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to the coefficient shrinkage path of ridge
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
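For reference, the three thresholding rules compared by Fan and Li can be sketched as below (my reading of the paper; λ is the threshold and a > 2 is the SCAD shape parameter):

    import numpy as np

    def hard_threshold(z, lam):
        # Hard thresholding: kill small coefficients, keep large ones unchanged
        return z * (np.abs(z) > lam)

    def soft_threshold(z, lam):
        # Soft thresholding (lasso): every surviving coefficient is shrunk by lam
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def scad_threshold(z, lam, a=3.7):
        # SCAD: soft-thresholds small coefficients, leaves large ones unchanged
        # (a = 3.7 is the value suggested by Fan and Li)
        az = np.abs(z)
        small = np.sign(z) * np.maximum(az - lam, 0.0)        # |z| <= 2*lam
        mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)  # 2*lam < |z| <= a*lam
        return np.where(az <= 2 * lam, small,
                        np.where(az <= a * lam, mid, z))

This is the point of the slide: SCAD and hard thresholding do not alter large coefficients, whereas soft thresholding (lasso) shrinks everything by λ.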
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100.
– For those genes sharing the same biological “pathway”, the correlations among them can be high.
• LASSO limitations
– If p > n, the lasso selects at most n variables; the number of selected variables is bounded by the sample size.
– Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics, 2011
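A minimal sketch of the elastic net objective, which simply adds the L2 penalty to the lasso objective above (this plain parameterization is illustrative; libraries such as scikit-learn use an equivalent alpha/l1_ratio form):

    import numpy as np

    def elastic_net_objective(w, X, y, lam1, lam2):
        # L1 term induces sparsity; L2 term stabilizes groups of correlated variables
        residuals = y - X @ w
        return (np.mean(residuals ** 2)
                + lam1 * np.sum(np.abs(w))
                + lam2 * np.sum(w ** 2))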
A family of loss functions
Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
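A sketch of the general loss, based on my reading of Barron's paper: a shape parameter α interpolates between familiar losses, and a scale c sets the size of the quadratic bowl:

    import numpy as np

    def general_robust_loss(x, alpha, c=1.0):
        # Barron's general robust loss for alpha not in {0, 2}; the L2 (alpha = 2),
        # Cauchy (alpha = 0) and Welsch (alpha -> -inf) losses arise as limits.
        z = (x / c) ** 2
        b = abs(alpha - 2.0)
        return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

For example, alpha = 1 gives a smoothed-L1 (Charbonnier-like) loss, while alpha = -2 gives the Geman-McClure loss.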
Contents
• Revisiting MSE and L2 regularization
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss
• Cosine similarity
• Ranking
– Point-wise
– Pair-wise
• φ(z) = (1 − z)+, e^(−z), or log(1 + e^(−z)) (see the sketch below)
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
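A sketch of the three pair-wise surrogates φ(z) listed above, applied to the score difference z = s(positive) − s(negative) (function names are mine):

    import numpy as np

    def hinge_pairwise(z):      # phi(z) = (1 - z)_+
        return np.maximum(1.0 - z, 0.0)

    def exp_pairwise(z):        # phi(z) = exp(-z)
        return np.exp(-z)

    def logistic_pairwise(z):   # phi(z) = log(1 + exp(-z))
        return np.log1p(np.exp(-z))

Each penalizes pairs in which the negative item is scored close to, or above, the positive item.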
Dropout: Drop a unit out to prevent co-adaptation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Make other features unreliable to break co-adaptation
• Equivalent to adding noise
• Train several (dropped-out) architectures within one architecture (O(2^n) of them; see the sketch below)
• Average architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
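As a minimal sketch of (inverted) dropout on a layer's activations, assuming NumPy and a keep probability p; rescaling by 1/p at training time is what lets a single unscaled forward pass at test time approximate the average over the O(2^n) thinned networks:

    import numpy as np

    def dropout(activations, p=0.5, train=True, rng=None):
        # Drop each unit with probability 1 - p and rescale the survivors,
        # so the expected activation is unchanged.
        rng = np.random.default_rng() if rng is None else rng
        if not train:
            return activations   # inverted dropout: no extra scaling at test time
        mask = rng.random(activations.shape) < p
        return activations * mask / p

(The original paper instead keeps activations unscaled during training and multiplies the weights by p at test time; the two are equivalent in expectation.)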
Model averaging
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, dropout leads to sparse activations
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with dropout, even if you increase the number of neurons
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.