
Machine Learning Techniques - Week 11

December 20, 2024


Contents
1. Binary Classification Algorithms through Loss Functions
   1.1. Introduction
   1.2. Motivation
   1.3. The Classification Problem
   1.4. Linear Models for Classification
        1.4.1. Performance Evaluation
   1.5. The NP-Hardness of Classification
   1.6. The Loss Function View
        1.6.1. Squared Loss (Regression for Classification)
        1.6.2. Comparison of Loss Functions
        1.6.3. Why Squared Loss is Suboptimal
   1.7. Other Surrogate Loss Functions
   1.8. Conclusion

2. SVM and Logistic Loss
   2.1. Introduction
   2.2. Revisiting Zero-One Loss and NP-Hardness
   2.3. Support Vector Machines and Hinge Loss
        2.3.1. Formulation of SVMs
        2.3.2. Hinge Loss Interpretation
        2.3.3. Geometric Interpretation
   2.4. Logistic Regression and Logistic Loss
        2.4.1. Maximum Likelihood Formulation
        2.4.2. Logistic Loss Interpretation
   2.5. Comparing Loss Functions
        2.5.1. Visual Comparison
        2.5.2. Discussion
   2.6. Conclusion

3. Perceptron and Boosting Loss
   3.1. Introduction
   3.2. Perceptron Algorithm: A Gradient Descent Perspective
        3.2.1. Connection to Gradient Descent
        3.2.2. Subgradient Descent Interpretation
        3.2.3. Stochastic Gradient Descent View
   3.3. Boosting and Exponential Loss
        3.3.1. Coordinate Descent Perspective
   3.4. Convex Surrogates for Loss Functions
        3.4.1. Examples of Convex Loss Functions
        3.4.2. Why Convexity Matters
   3.5. Beyond Convex Surrogates: Neural Networks
        3.5.1. Neural Networks and Perceptron
        3.5.2. Implications for Binary Classification
   3.6. Conclusion

4. Neural Networks
   4.1. Introduction
   4.2. Revisiting the Perceptron
   4.3. Neural Networks: Extending Perceptrons
        4.3.1. Structure of a Neural Network
        4.3.2. Mathematical Representation
        4.3.3. Introducing Non-Linearity
        4.3.4. Role of Non-Linearity
   4.4. Training Neural Networks
   4.5. Challenges and Implications
   4.6. Conclusion

5. Backpropagation
   5.1. Introduction
   5.2. Defining the Loss Function
        5.2.1. Loss Function for Regression
        5.2.2. Loss Function for Classification
   5.3. Optimizing Neural Networks
        5.3.1. Gradient Descent
        5.3.2. Backpropagation
   5.4. Architectures and Parameter Complexity
   5.5. Challenges in Neural Network Training
        5.5.1. Non-Convexity
        5.5.2. Overfitting
        5.5.3. Computational Complexity
   5.6. Applications to Structured and Unstructured Data
   5.7. Conclusion

6. Concluding Remarks: Foundations and Frontiers in Machine Learning
   6.1. Introduction
   6.2. Summary of Topics Covered
        6.2.1. Unsupervised Learning
        6.2.2. Supervised Learning
   6.3. Topics Beyond the Course Scope
        6.3.1. Semi-Supervised and Self-Supervised Learning
        6.3.2. Sequential Decision Making
        6.3.3. Deployable AI and Real-World Considerations
   6.4. Advanced Topics in Modern Machine Learning
   6.5. Conclusion

Appendices

A. Proof of Convergence of Perceptron Algorithm with Boundary Conditions
   A.1. Introduction
        A.1.1. Definitions
        A.1.2. Key Boundary Conditions
        A.1.3. Proof Steps
        A.1.4. Example Dataset with Boundary Analysis

B. Problem connecting γ, mistakes (M) and R
   B.1. Problem Statement
        B.1.1. Key Equations and Concepts
        B.1.2. Analysis of Options
        B.1.3. Conclusion
        B.1.4. Conclusion

C. The Dual Problem in Optimization
   C.1. Introduction to the Dual Problem
        C.1.1. Primal Problem Formulation
        C.1.2. Lagrangian Function
        C.1.3. Dual Function
        C.1.4. Dual Problem Formulation
        C.1.5. Weak and Strong Duality
        C.1.6. Advantages of the Dual Problem
        C.1.7. Example: Quadratic Programming
        C.1.8. Conclusion

D. Lagrangian Multipliers and Their Role in Support Vector Machines
   D.1. Introduction to Lagrangian Multipliers
        D.1.1. KKT Conditions and Optimality
        D.1.2. Lagrangian Multipliers in the SVM Context
        D.1.3. Optimization with Lagrangian Multipliers
        D.1.4. Role of Lagrangian Multipliers in SVM
        D.1.5. Complementary Slackness in SVM
        D.1.6. Conclusion

E. Hinge Loss vs. Logistic Loss
   E.1. Hinge Loss
   E.2. Logistic Loss
   E.3. Comparison and Contrast
        E.3.1. Visual Comparison
        E.3.2. Role of C

F. Decision Tree
   F.1. Introduction
   F.2. Why Decision Trees?
        F.2.1. Advantages
        F.2.2. Disadvantages
   F.3. Decision Stumps
        F.3.1. Example
        F.3.2. Choosing the Split
   F.4. Information Gain
   F.5. Building a Bigger Tree
   F.6. Real-Valued Features
   F.7. Categorical Features
   F.8. Managing Complexity
   F.9. Conclusion

G. Loss Functions
1. Binary Classification Algorithms through Loss Functions
1.1. Introduction
Binary classification is a fundamental problem in supervised learning, where the objective is to predict one of two possible labels (+1 or −1) for
a given input. Over the years, numerous algorithms have been developed for binary classification, including logistic regression, support vector
machines (SVMs), decision trees, boosting, and perceptrons. This chapter explores why there are so many algorithms for binary classification and
provides a unified framework for understanding them through the lens of loss functions.

1.2. Motivation
Unlike regression, where a single method (linear regression) can often suffice with modifications like regularization, binary classification involves a
variety of approaches. This variety arises because solving the classification problem directly, as formulated, is computationally hard (NP-hard). To
address this, different algorithms employ surrogate loss functions that approximate the original problem in computationally efficient ways.

1.3. The Classification Problem


Given a dataset {(x1 , y1 ), . . . , (xn , yn )}, where xi ∈ Rd represents input features and yi ∈ {+1, −1} represents labels, the goal is to learn a
hypothesis h such that:

h : Rd → {+1, −1}.
We evaluate the performance of h using a zero-one loss function:

L(h, (x, y)) = I(h(x) ≠ y),


where I is the indicator function:


I(condition) = 1 if the condition is true, and 0 otherwise.
The total loss over the dataset is:
Σᵢ₌₁ⁿ I(h(xi) ≠ yi).

Directly minimizing this loss is NP-hard because the indicator function is non-convex and discontinuous. Thus, practical algorithms rely on
surrogate loss functions.

0-1 Loss
The 0-1 loss function, also known as zero-one loss or misclassification loss, assigns a loss of 0 if a prediction matches the actual outcome
and a loss of 1 if it does not. It’s used in classification tasks to measure the accuracy of predictions:
L(h(x), y) = 0 if h(x) = y, and 1 if h(x) ≠ y.

This loss function does not account for the degree of error, only whether an error occurred.
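As a concrete illustration, here is a short Python sketch (the function and variable names are mine, not the lecture's) that computes the average zero-one loss over a small set of ±1 labels:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Average 0-1 loss: the fraction of predictions that disagree with the labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Labels in {+1, -1}; two of the four predictions are wrong.
y_true = np.array([+1, -1, +1, +1])
y_pred = np.array([+1, +1, +1, -1])
print(zero_one_loss(y_true, y_pred))   # 0.5
```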

1.4. Linear Models for Classification


A common approach in binary classification is to restrict h to a class of linear functions. Specifically, we define:

h(x) = sign(wᵀx + b),


where w ∈ Rd is a weight vector and b ∈ R is a bias term. The decision boundary for such models is a hyperplane defined by wᵀx + b = 0.

1.4.1. Performance Evaluation


The zero-one loss for linear models can be reformulated as:


L(h, (x, y)) = I(y(wᵀx + b) ≤ 0).


Here, the term y(wᵀx + b) captures the signed margin, which indicates whether the prediction agrees with the true label y.

1.5. The NP-Hardness of Classification


Minimizing the zero-one loss over a dataset is challenging due to its non-convexity. Even for linear models, the optimization problem:
min_{w,b} Σᵢ₌₁ⁿ I(yi (wᵀxi + b) ≤ 0)

is NP-hard. Algorithms overcome this by replacing the zero-one loss with surrogate loss functions that are easier to minimize.

1.6. The Loss Function View


Different binary classification algorithms can be understood by examining the surrogate loss functions they minimize. Consider a general formulation:

Loss for a single point: L(G, (x, y)) = f (G(x) · y),


where G(x) represents a score produced by the model (e.g., G(x) = wᵀx + b) and f is the surrogate loss function. The zero-one loss corresponds
to f (z) = I(z ≤ 0).

1.6.1. Squared Loss (Regression for Classification)


One naive approach is to use the squared loss, commonly employed in regression:

L(G, (x, y)) = (G(x) − y)².


For y ∈ {+1, −1}, this can be rewritten as:

L(G, (x, y)) = (G(x) · y − 1)².


While this loss is convex and easy to optimize, it has significant drawbacks for classification. Even if G(x) predicts the correct label, the squared
loss penalizes large values of G(x) · y, which can lead to poor performance in the presence of outliers.

1.6.2. Comparison of Loss Functions


To visualize the differences between the zero-one loss and the squared loss, consider their behaviors as functions of G(x) · y:

• The zero-one loss is 1 for G(x) · y ≤ 0 and 0 otherwise.
• The squared loss grows quadratically as G(x) · y deviates from 1.
Figure 1.1 illustrates these differences.

Figure 1.1.: Comparison of zero-one loss and squared loss as functions of G(x) · y.

1.6.3. Why Squared Loss is Suboptimal


The squared loss does not align well with the zero-one loss for classification tasks. It penalizes large values of G(x) · y, even when the prediction
is correct. This makes it sensitive to outliers and can result in poor decision boundaries.

1.7. Other Surrogate Loss Functions


To address the limitations of the squared loss, other surrogate loss functions have been proposed. These include:

• Hinge Loss (SVM): f (z) = max(0, 1 − z)

• Logistic Loss: f (z) = log(1 + e−z )

• Exponential Loss (Boosting): f (z) = e−z

Each of these loss functions is designed to approximate the zero-one loss more closely while remaining convex or amenable to optimization.
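For concreteness, these surrogate losses can be written as functions of the margin z = y · G(x). The following hedged Python sketch (function names are mine) evaluates each of them on a few margin values:

```python
import numpy as np

def zero_one(z):      # the ideal but non-convex reference loss
    return (z <= 0).astype(float)

def hinge(z):         # SVM surrogate
    return np.maximum(0.0, 1.0 - z)

def logistic(z):      # logistic-regression surrogate
    return np.log1p(np.exp(-z))

def exponential(z):   # boosting surrogate
    return np.exp(-z)

z = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])   # margins y * G(x)
for name, f in [("0-1", zero_one), ("hinge", hinge),
                ("logistic", logistic), ("exponential", exponential)]:
    print(name, np.round(f(z), 3))
```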


1.8. Conclusion
This chapter introduced the classification problem and explained why it necessitates a variety of algorithms. By adopting a loss function perspective,
we saw how different algorithms approximate the zero-one loss using surrogate loss functions. While the squared loss is computationally efficient,
its poor alignment with the classification objective highlights the importance of selecting appropriate surrogate losses. In the next chapter, we will
delve deeper into specific algorithms like SVMs, logistic regression, and boosting, and analyze their loss functions in detail.

2. SVM and Logistic Loss
2.1. Introduction
Binary classification is a cornerstone of machine learning, where the task is to classify inputs into one of two possible categories. This chapter
delves into how various algorithms approach binary classification through the lens of loss functions. It discusses how the inherent complexity of
minimizing the zero-one loss leads to the development of surrogate loss functions, enabling practical algorithms like Support Vector Machines
(SVMs) and logistic regression.

2.2. Revisiting Zero-One Loss and NP-Hardness


Given a dataset {(x1 , y1 ), . . . , (xn , yn )}, where xi ∈ Rd represents the input features and yi ∈ {+1, −1} represents the labels, the zero-one loss
for a hypothesis h is defined as:

L(h, (x, y)) = I(h(x) ≠ y),


where I(condition) is the indicator function:
I(condition) = 1 if the condition is true, and 0 otherwise.
The total loss over the dataset is given by:
Σᵢ₌₁ⁿ I(h(xi) ≠ yi).

This formulation is non-convex and discontinuous, making it computationally hard to minimize (NP-hard). Consequently, different algorithms
replace the zero-one loss with surrogate loss functions that are easier to optimize.


2.3. Support Vector Machines and Hinge Loss


2.3.1. Formulation of SVMs
The soft-margin SVM aims to find a hyperplane that maximizes the margin while allowing for some misclassification. The optimization problem is:
min_{w,ξ} (1/2)‖w‖² + C Σᵢ₌₁ⁿ ξi,

subject to:
yi (wᵀxi + b) ≥ 1 − ξi,   ξi ≥ 0   ∀i.

Here:
• w is the weight vector,
• b is the bias term,
• ξi are slack variables allowing misclassification,
• C controls the trade-off between maximizing the margin and minimizing the slack.

Hinge Loss
The hinge loss function, often used in support vector machines, is defined as:

L(y, f (x)) = max(0, 1 − y · f (x))

where:

• y ∈ {−1, 1} is the true label,

• f (x) is the predicted score or output from the model.

2.3.2. Hinge Loss Interpretation


Rewriting the constraints reveals the hinge loss perspective. The slack variable ξi can be expressed as:

ξi = max(0, 1 − yi (wᵀxi + b)).

Thus, the optimization problem becomes:


min_w (1/2)‖w‖² + C Σᵢ₌₁ⁿ max(0, 1 − yi (wᵀxi)).

Here, the term max(0, 1 − yi (wᵀxi)) is the hinge loss for a single data point. It penalizes points that lie within the margin or on the wrong side
of the hyperplane.

2.3.3. Geometric Interpretation


The hinge loss can be plotted as a function of z = y · (wᵀx). It is defined as:

Hinge Loss: ℓ(z) = max(0, 1 − z).

• For z ≥ 1, the loss is zero, indicating correct classification with sufficient margin.

• For z < 1, the loss increases linearly, penalizing points based on their proximity to the decision boundary.

The hinge loss is convex, making it computationally feasible to minimize, unlike the zero-one loss.
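To make the link between the hinge loss and optimization concrete, the following hedged Python sketch runs plain subgradient descent on the regularized hinge objective. It is an illustration under simplifying assumptions (no bias term, fixed step size, toy data), not the SVM solver itself:

```python
import numpy as np

def hinge_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i) by subgradient descent.
    X: (n, d) inputs, y: labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                  # points inside the margin or misclassified
        grad = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = hinge_svm_subgradient(X, y)
print(np.sign(X @ w))   # should recover the labels on this toy data
```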

2.4. Logistic Regression and Logistic Loss


2.4.1. Maximum Likelihood Formulation
Logistic regression models the probability of y given x using a sigmoid function:
P(y = 1 | x) = σ(wᵀx),   where σ(z) = 1 / (1 + e−z).
The log-likelihood for a dataset is:
log L(w) = Σᵢ₌₁ⁿ [ yi log σ(wᵀxi) + (1 − yi) log(1 − σ(wᵀxi)) ].

Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood:


min_w Σᵢ₌₁ⁿ [ −yi log σ(wᵀxi) − (1 − yi) log(1 − σ(wᵀxi)) ].

2.4.2. Logistic Loss Interpretation


By re-expressing the labels yi ∈ {+1, −1}, the logistic loss for a single data point becomes:

ℓ(z) = log(1 + e−z),


where z = y · (wᵀx).
The logistic loss is convex and differentiable, making it amenable to optimization. Unlike the hinge loss, it penalizes all points, even those
correctly classified, but the penalty decreases exponentially as z increases.
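Analogously, here is a hedged sketch of full-batch gradient descent on the logistic loss in its ±1-label form (the names and toy data are illustrative):

```python
import numpy as np

def logistic_loss_gd(X, y, lr=0.1, epochs=500):
    """Minimize sum_i log(1 + exp(-y_i * w.x_i)) by gradient descent; y in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        z = y * (X @ w)
        # d/dw log(1 + e^{-z_i}) = -y_i * x_i * sigmoid(-z_i)
        grad = (-(y / (1.0 + np.exp(z))))[:, None] * X
        w -= lr * grad.sum(axis=0)
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = logistic_loss_gd(X, y)
print(np.sign(X @ w))   # predicted labels
```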

2.5. Comparing Loss Functions


The following loss functions are commonly used in binary classification:

• Zero-One Loss: I(z ≤ 0).
• Hinge Loss (SVM): max(0, 1 − z).
• Logistic Loss (logistic regression; the negative log conditional likelihood): log(1 + e−z).

2.5.1. Visual Comparison

Figure 2.1 compares these loss functions as a function of z = y · (wᵀx).

Figure 2.1.: Hinge Loss vs. Logistic Loss (with the 0-1 loss shown for reference).

2.5.2. Discussion

• The zero-one loss is ideal but computationally intractable.
• The hinge loss is a convex approximation and enforces a margin, making it suitable for SVMs.
• The logistic loss is smooth and differentiable, aligning well with probabilistic interpretations.

2.6. Conclusion
This chapter illustrated how different algorithms for binary classification address the NP-hardness of the zero-one loss using surrogate loss func-
tions. SVMs rely on hinge loss, while logistic regression uses logistic loss. Both approaches provide computationally efficient and effective solutions
to binary classification problems. The choice of loss function significantly influences the algorithm’s behavior and performance, emphasizing the
importance of selecting the right surrogate for the problem at hand.

3. Perceptron and Boosting Loss
3.1. Introduction
Binary classification algorithms, ranging from perceptrons to support vector machines (SVMs) and boosting, rely on minimizing loss functions to
improve classification performance. This chapter examines the perceptron algorithm, boosting, and the broader implications of using convex and
non-convex surrogate loss functions. We explore their mathematical foundations, interpretations, and connections with modern advancements
such as neural networks.

3.2. Perceptron Algorithm: A Gradient Descent Perspective


The perceptron algorithm operates on a simple update rule when a misclassification occurs. At iteration t, given a data point (xt , yt ), the perceptron
updates the weight vector wt as:

wt+1 = wt + xt yt .

3.2.1. Connection to Gradient Descent


The perceptron update rule can be interpreted as a gradient descent step. Consider the hinge loss, a convex surrogate used in SVMs:

Lhinge(w; x, y) = max(0, 1 − y · wᵀx).


The hinge loss has the following piecewise gradient with respect to w:
∇Lhinge(w; x, y) = −yx if y · wᵀx < 1, and 0 if y · wᵀx ≥ 1.


The perceptron mimics gradient descent on a modified hinge loss where the margin condition is simplified. For the perceptron, the loss is
implicitly defined as:

Lperceptron(w; x, y) = max(0, −y · wᵀx).

3.2.2. Subgradient Descent Interpretation


When y · wᵀx = 0, the perceptron loss max(0, −y · wᵀx) is non-differentiable. In such cases, subgradient descent is used. The subdifferential at this point is the set of vectors −α yx for α ∈ [0, 1]. The perceptron selects the subgradient −yx, making the update rule:

wt+1 = wt − ηt · (−yx), where ηt = 1 is the fixed step size used by the perceptron.

3.2.3. Stochastic Gradient Descent View


The perceptron processes one data point at a time, akin to stochastic gradient descent (SGD). For each misclassified point, it performs a gradient
descent step with a constant step size. This interpretation places the perceptron within the broader framework of loss minimization algorithms.
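A minimal perceptron loop, written as a hedged sketch (the stopping rule and names are mine), makes this interpretation concrete: each misclassified point triggers one constant-step update, exactly like an SGD step with step size 1.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron as SGD with step size 1 on the loss max(0, -y * w.x).
    X: (n, d) inputs, y: labels in {+1, -1}. Returns (w, mistake_count)."""
    n, d = X.shape
    w = np.zeros(d)
    mistakes = 0
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (or exactly on the boundary)
                w += yi * xi            # the perceptron update w <- w + y x
                mistakes += 1
                updated = True
        if not updated:                 # a full clean pass: the data are separated
            break
    return w, mistakes

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([+1, -1, -1, +1])
print(perceptron(X, y))
```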

3.3. Boosting and Exponential Loss


Boosting combines weak classifiers to create a strong ensemble classifier. It optimizes a different loss function, known as the exponential loss:

Lexp (h; x, y) = e−y·h(x) .


Here, h(x) is the ensemble classifier's output and y ∈ {+1, −1} is the label.
The exponential loss penalizes misclassified points more heavily, encouraging the algorithm to focus on these points in subsequent iterations.

3.3.1. Coordinate Descent Perspective


Boosting algorithms, such as AdaBoost, can be viewed as performing coordinate descent on the exponential loss. At each iteration, a new weak
classifier is added to the ensemble, aiming to minimize the exponential loss. While we did not prove the correctness of boosting in this discussion,
it aligns well with minimizing this loss function.
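To illustrate this view, the following hedged AdaBoost-style sketch uses one-feature threshold stumps as weak classifiers, adds one per round, and re-weights the data with the exponential factor. It is a simplified illustration (all names and the toy data are mine), not a full treatment of boosting:

```python
import numpy as np

def adaboost_stumps(X, y, rounds=5):
    """AdaBoost with threshold stumps; y in {+1, -1}."""
    n, d = X.shape
    weights = np.full(n, 1.0 / n)
    ensemble = []                                    # (alpha, feature, threshold, sign)
    for _ in range(rounds):
        best = None
        for j in range(d):                           # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] >= thr, 1, -1)
                    err = weights[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # weight of the new weak classifier
        ensemble.append((alpha, j, thr, s))
        weights *= np.exp(-alpha * y * pred)         # exponential re-weighting of points
        weights /= weights.sum()
    return ensemble

def adaboost_predict(ensemble, X):
    score = np.zeros(len(X))
    for alpha, j, thr, s in ensemble:
        score += alpha * s * np.where(X[:, j] >= thr, 1, -1)
    return np.sign(score)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.5], [4.0, 1.5]])
y = np.array([-1, -1, +1, +1])
print(adaboost_predict(adaboost_stumps(X, y), X))    # should match y on this toy data
```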


3.4. Convex Surrogates for Loss Functions


The success of algorithms like SVMs, logistic regression, and boosting arises from their use of convex surrogate loss functions. Convexity ensures
that these functions are easier to optimize than the zero-one loss.

3.4.1. Examples of Convex Loss Functions


• Hinge Loss (SVM): ℓ(z) = max(0, 1 − z).

• Logistic Loss: ℓ(z) = log(1 + e−z).

• Exponential Loss: ℓ(z) = e−z.

These loss functions serve as surrogates for the zero-one loss, providing smooth and differentiable approximations.

3.4.2. Why Convexity Matters


For a convex function, every local minimum is a global minimum, which makes optimization computationally tractable. Gradient-based optimization methods, such as SGD, are effective for minimizing convex functions.

3.5. Beyond Convex Surrogates: Neural Networks


While convex surrogates dominate traditional algorithms, modern machine learning often employs non-convex loss functions, such as those in
neural networks. Neural networks optimize highly non-convex functions, yet achieve remarkable success in practice.

3.5.1. Neural Networks and Perceptron


Neural networks extend the perceptron concept by introducing multiple layers and non-linear activation functions. Despite their non-convex
nature, advancements in optimization techniques allow neural networks to perform well.


3.5.2. Implications for Binary Classification


Neural networks do not rely on convexity. Instead, they achieve good solutions by exploring the parameter space and finding local minima that
generalize well to unseen data.

3.6. Conclusion
This chapter explored various algorithms for binary classification through the lens of loss functions. The perceptron was reinterpreted as performing
subgradient descent on a modified hinge loss. Boosting was connected to the exponential loss, highlighting its focus on misclassified points. We
concluded by emphasizing the importance of convex surrogates in traditional algorithms and introduced the broader possibilities of non-convex
optimization in neural networks. This discussion provides a foundation for understanding the diverse approaches to binary classification and the
trade-offs involved.

4. Neural Networks

4.1. Introduction

In previous discussions, we explored various algorithms for supervised learning, particularly binary classification, through the lens of loss function
minimization combined with regularization. This framework allowed us to understand algorithms such as support vector machines (SVMs), logistic
regression, and perceptron. However, these methods rely heavily on convex surrogate loss functions, which ensure computational tractability.
This chapter introduces neural networks, a family of algorithms inspired by perceptrons that do not necessarily rely on convex loss functions.
Neural networks are a foundational component of modern machine learning, particularly deep learning, and provide a powerful approach to
modeling complex relationships between input and output.

4.2. Revisiting the Perceptron

The perceptron algorithm predicts outputs by learning a weight vector w ∈ Rd and applying the rule:

ŷ = sign(wᵀx),

where x ∈ Rd is the input vector. This can be visualized as follows:

• The input x = [x1, x2, . . . , xd] is represented as a set of nodes.
• Each node connects to an output node with weights w = [w1, w2, . . . , wd].
• The weighted sum wᵀx determines the output after applying the sign function.
While perceptrons are effective for linearly separable data, they cannot model non-linear relationships. Neural networks generalize this concept
by introducing non-linearity through hidden layers and activation functions.


4.3. Neural Networks: Extending Perceptrons


4.3.1. Structure of a Neural Network
A simple neural network consists of:

1. An input layer representing the features x = [x1, x2, . . . , xd].
2. One or more hidden layers, where each node (neuron) computes a weighted sum followed by a non-linear activation.
3. An output layer that combines outputs from the last hidden layer to produce the final prediction.
For simplicity, consider a neural network with one hidden layer containing k neurons. Each neuron in the hidden layer computes:

zi = wiᵀx   and   hi = a(zi),

where wi ∈ Rd are the weights for the i-th neuron, and a(·) is the activation function, introducing non-linearity.
The output layer computes:

ŷ = woutᵀ h,

where h = [h1, h2, . . . , hk]ᵀ is the vector of activations from the hidden layer, and wout ∈ Rk are the weights of the output layer.

4.3.2. Mathematical Representation


Given an input x ∈ Rd , the prediction ŷ for a neural network can be written as:

ŷ = Σᵢ₌₁ᵏ wout,i · a(wiᵀx).

Here, the network parameters are the hidden layer weights w1, w2, . . . , wk ∈ Rd and the output layer weights wout ∈ Rk.
The task of training a neural network involves learning these weights to minimize a loss function over the dataset.

4.3.3. Introducing Non-Linearity


Linear models like perceptrons assume that the relationship between input and output can be captured by a linear function. Neural networks
overcome this limitation by introducing non-linear activation functions in the hidden layers. Popular activation functions include:


1. Sigmoid Function
a(z) = 1 / (1 + e−z).
The sigmoid function maps z ∈ R to the range (0, 1), and is differentiable, which facilitates optimization.

2. Rectified Linear Unit (ReLU)


a(z) = max(0, z).
ReLU introduces non-linearity by outputting z if z > 0 and 0 otherwise. It is computationally efficient and widely used in deep learning.

Comparison of Activation Functions Figure 4.1 illustrates the sigmoid and ReLU functions.

Figure 4.1.: Comparison of Sigmoid and ReLU activation functions.

4.3.4. Role of Non-Linearity


Without non-linear activation functions, the neural network would reduce to a linear model, as linear combinations of linear functions remain linear.
Non-linearity enables neural networks to approximate complex mappings between input and output.
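Putting the pieces together, a one-hidden-layer forward pass can be sketched in a few lines of Python (ReLU is chosen here for illustration, and the weights are random placeholders rather than trained values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W_hidden, w_out):
    """One-hidden-layer network: y_hat = w_out . a(W_hidden @ x)."""
    h = relu(W_hidden @ x)     # hidden activations h_i = a(w_i . x)
    return w_out @ h           # linear output layer

rng = np.random.default_rng(0)
d, k = 3, 4                    # input dimension and number of hidden neurons
x = np.array([0.5, -1.0, 2.0])
W_hidden = rng.normal(size=(k, d))
w_out = rng.normal(size=k)
print(forward(x, W_hidden, w_out))
```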

4.4. Training Neural Networks


Training a neural network involves learning the weights w1 , w2 , . . . , wk and wout to minimize a loss function. The loss function measures the
discrepancy between the predicted outputs ŷ and the true labels y. Common loss functions include:

• Mean Squared Error (MSE): Used for regression tasks.


L = (1/n) Σᵢ₌₁ⁿ (ŷi − yi)².


• Cross-Entropy Loss: Used for classification tasks.


L = −(1/n) Σᵢ₌₁ⁿ [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ].

The optimization is typically performed using gradient-based methods, such as stochastic gradient descent (SGD), where gradients are computed
via backpropagation.
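A hedged NumPy sketch of the two losses (function names are mine):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error for regression."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(y_prob, y_true, eps=1e-12):
    """Binary cross-entropy; y_prob are predicted probabilities, y_true in {0, 1}."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(mse_loss(np.array([1.2, 0.8]), np.array([1.0, 1.0])))
print(cross_entropy_loss(np.array([0.9, 0.2]), np.array([1, 0])))
```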

4.5. Challenges and Implications


Neural networks face several challenges:

1. Non-Convexity: The loss function is non-convex due to the non-linear activations, leading to multiple local minima.
2. Overfitting: Neural networks can overfit the training data, requiring regularization techniques such as dropout or weight decay.
3. Computational Complexity: Training deep networks requires significant computational resources.
Despite these challenges, neural networks excel in tasks involving large datasets and complex relationships, such as image recognition and natural
language processing.

4.6. Conclusion
Neural networks extend the perceptron by introducing hidden layers and non-linear activation functions, enabling them to model complex, non-
linear relationships. While they deviate from convex optimization principles, their flexibility and scalability make them indispensable in modern
machine learning. This chapter provides an introduction to neural networks, serving as a foundation for further exploration in deep learning.

5. Backpropagation
5.1. Introduction
Neural networks represent a significant extension of classical machine learning algorithms, enabling the modeling of complex, non-linear relation-
ships in data. Training a neural network involves defining an appropriate loss function, optimizing the network’s parameters, and handling challenges
posed by non-convexity and high-dimensional parameter spaces. This chapter elaborates on the training process, introduces backpropagation,
and discusses its application in regression and classification problems.

5.2. Defining the Loss Function


To train a neural network, we first define a loss function that quantifies the error between the predicted outputs and the true labels. The choice
of loss function depends on the task at hand—regression or classification.

5.2.1. Loss Function for Regression


For regression problems, the loss function is typically the Mean Squared Error (MSE). Let the neural network be parameterized by Θ, which includes
all weights from the input to the output layer. For an input xi , the network’s output is denoted as NN(xi ; Θ). The loss for a single data point is:

Li(Θ) = (NN(xi; Θ) − yi)²,
where yi is the true label. The total loss over the dataset is:
L(Θ) = (1/n) Σᵢ₌₁ⁿ (NN(xi; Θ) − yi)².

This loss function captures the squared deviation between the predicted and actual values.


5.2.2. Loss Function for Classification


For binary classification problems, the output of the neural network is interpreted as the probability of y = 1 given x. This is typically achieved
using a sigmoid activation function in the output layer:

P(y = 1 | x; Θ) = σ(woutᵀ h),
where σ(z) = 1 / (1 + e−z) is the sigmoid function, and h is the vector of activations from the last hidden layer. The cross-entropy loss is used to
compare the predicted probabilities with the true labels:
L(Θ) = −(1/n) Σᵢ₌₁ⁿ [ yi log ŷi + (1 − yi) log(1 − ŷi) ],

where ŷi = P (y = 1 | xi ; Θ).

5.3. Optimizing Neural Networks


The parameters Θ are optimized to minimize the chosen loss function. Optimization is performed using gradient-based methods such as Stochastic
Gradient Descent (SGD).

5.3.1. Gradient Descent


Gradient descent updates the parameters Θ iteratively:

Θ ← Θ − η∇Θ L(Θ),
where η > 0 is the learning rate, and ∇Θ L(Θ) is the gradient of the loss with respect to the parameters.

5.3.2. Backpropagation
Backpropagation is the algorithm used to compute gradients efficiently in neural networks. It exploits the chain rule of differentiation to propagate
errors backward through the network. For a parameter wij in layer l, the gradient is computed as:


∂L/∂w_ij^(l) = δ_j^(l) · h_i^(l−1),

where h_i^(l−1) is the activation of the i-th neuron in layer l − 1, and δ_j^(l) is the error term for the j-th neuron in layer l, computed recursively.
The backpropagation algorithm involves the following steps:

1. Perform a forward pass to compute the network's predictions.
2. Compute the loss and the gradient of the output layer.
3. Propagate the error backward through the network using the chain rule.
4. Update the parameters using gradient descent.
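The following hedged sketch implements these four steps for a one-hidden-layer regression network with a sigmoid hidden layer and squared-error loss. The architecture, names, and toy data are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_hidden_layer(X, y, k=5, lr=0.1, epochs=2000, seed=0):
    """Backpropagation for y_hat = w_out . sigmoid(W @ x) with mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.5, size=(k, d))        # hidden-layer weights
    w_out = rng.normal(scale=0.5, size=k)         # output-layer weights
    for _ in range(epochs):
        # 1. forward pass
        H = sigmoid(X @ W.T)                      # (n, k) hidden activations
        y_hat = H @ w_out                         # (n,) predictions
        # 2. gradient of the loss at the output
        delta_out = 2.0 * (y_hat - y) / n         # dL/dy_hat for the MSE loss
        # 3. backward pass via the chain rule
        grad_w_out = H.T @ delta_out
        delta_hidden = np.outer(delta_out, w_out) * H * (1 - H)   # sigmoid' = h(1 - h)
        grad_W = delta_hidden.T @ X
        # 4. gradient-descent update
        w_out -= lr * grad_w_out
        W -= lr * grad_W
    return W, w_out

# Toy regression data: y is (approximately) x1 - x2.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=50)
W, w_out = train_one_hidden_layer(X, y)
print(np.mean((sigmoid(X @ W.T) @ w_out - y) ** 2))   # training MSE after fitting
```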

5.4. Architectures and Parameter Complexity


The number of parameters in a neural network increases with the number of layers and neurons per layer. For a network with L hidden layers, the parameters are Θ = {W^(1), W^(2), . . . , W^(L), Wout}, where W^(l) are the weights connecting layers l and l + 1.
The number of parameters grows rapidly with the network's depth and width (linearly in the depth and quadratically in the layer widths), making optimization challenging.
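A quick way to see this growth is to count the weight-matrix entries layer by layer; the helper below is a hedged sketch and the layer sizes are arbitrary examples:

```python
def count_parameters(layer_sizes):
    """Number of weights in a fully connected network with the given layer widths
    (input, hidden ..., output); bias terms are ignored for simplicity."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([784, 100, 10]))        # 784*100 + 100*10 = 79,400
print(count_parameters([784, 500, 500, 10]))   # 784*500 + 500*500 + 500*10 = 647,000
```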

5.5. Challenges in Neural Network Training


5.5.1. Non-Convexity
Neural network loss functions are highly non-convex due to the non-linear activation functions. As a result, gradient descent typically converges to
a local minimum or saddle point rather than the global minimum. Despite this, neural networks perform remarkably well in practice, often finding
solutions that generalize effectively.

5.5.2. Overfitting
With large parameter spaces, neural networks are prone to overfitting, particularly on small datasets. Regularization techniques such as dropout,
weight decay, and data augmentation are commonly used to mitigate this issue.

5.5.3. Computational Complexity


Training deep networks requires significant computational resources, including GPUs and distributed computing frameworks. Efficient implemen-
tations of backpropagation and gradient descent are essential for scalability.


5.6. Applications to Structured and Unstructured Data


Neural networks excel in tasks involving unstructured data, such as:

• Images: Convolutional Neural Networks (CNNs) extract hierarchical features from images.
• Time-Series Data: Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks handle sequential dependencies.
• Text: Transformers and attention mechanisms model long-range dependencies in textual data.
For structured data, classical algorithms like SVMs and gradient boosting often remain competitive.

5.7. Conclusion
Neural networks represent a powerful approach to modeling complex relationships, particularly for unstructured data. While their training involves
challenges such as non-convexity and overfitting, techniques like backpropagation and regularization ensure effective optimization. The versatility
of neural networks makes them indispensable in modern machine learning, particularly in fields like computer vision, natural language processing,
and speech recognition.

6. Concluding Remarks: Foundations and Frontiers in Machine
Learning
6.1. Introduction
This chapter concludes the course by summarizing the foundational topics we have covered and outlining advanced concepts and areas for further
exploration. The journey through unsupervised learning, supervised learning, and an introduction to neural networks has provided a robust base
for understanding machine learning algorithms and their applications. Additionally, we touch on topics beyond the scope of this course, such as
semi-supervised learning, fairness, explainability, and privacy, offering a glimpse into the future challenges and opportunities in machine learning.

6.2. Summary of Topics Covered


6.2.1. Unsupervised Learning
Unsupervised learning involves extracting patterns or representations from data without explicit labels. The key areas covered include:

• Representation Learning:
– Principal Component Analysis (PCA): A linear technique to reduce the dimensionality of data while preserving variance.
– Kernel PCA: An extension of PCA to capture non-linear relationships using kernel functions.

• Clustering:
– K-Means Algorithm: A partitioning-based clustering technique minimizing intra-cluster variance.
– Spectral Clustering: An extension of k-means utilizing graph-based representations and eigenvalue decompositions.

• Density Estimation:


– Maximum Likelihood and Bayesian Methods: Approaches to estimate data distributions.


– Mixture Models: Using Gaussian Mixture Models (GMMs) and the Expectation-Maximization (EM) algorithm to handle multi-modal
data distributions.

6.2.2. Supervised Learning


Supervised learning aims to model the relationship between inputs and labeled outputs. Key algorithms studied include:

• Regression:
– Ordinary Least Squares (OLS): Minimizing the squared error between predictions and labels.
– Ridge Regression and Lasso: Regularized versions of OLS to prevent overfitting and encourage sparsity.

• Classification:
– K-Nearest Neighbors (KNN): A non-parametric method based on proximity.
– Decision Trees: A hierarchical, rule-based classifier.
– Logistic Regression: A probabilistic classifier using the logistic function.
– Support Vector Machines (SVMs): Maximizing the margin between classes with extensions for non-linear decision boundaries via
kernels.
– Bagging and Boosting: Ensemble methods to improve stability and performance.

• Neural Networks:
– Introduced as an extension of perceptrons with non-linear activations.
– Discussed the basics of backpropagation and its role in optimizing neural networks.

6.3. Topics Beyond the Course Scope


While this course laid a strong foundation, several important topics were beyond its scope:


6.3.1. Semi-Supervised and Self-Supervised Learning


• Semi-Supervised Learning: Combines labeled and unlabeled data to improve model performance, leveraging both supervised and unsuper-
vised techniques.

• Self-Supervised Learning: Particularly useful in deep learning, it involves generating pseudo-labels from data itself, enabling unsupervised
pretraining.

6.3.2. Sequential Decision Making


Sequential decision making involves learning from data presented over time. Unlike the static datasets used in this course, here decisions must be
made in real time with feedback. This includes:

• Reinforcement Learning

• Multi-Armed Bandit Problems

6.3.3. Deployable AI and Real-World Considerations


Deploying machine learning systems in practice raises unique challenges:

• Fairness: Ensuring the algorithm does not reinforce societal biases present in training data.

• Explainability: Enabling algorithms to provide human-interpretable explanations for their decisions, crucial for domains like healthcare and
autonomous vehicles.

• Privacy: Developing algorithms that respect data privacy, using techniques like differential privacy.

• Edge AI: Designing lightweight models for deployment on resource-constrained devices, such as mobile phones.

• Continual Learning: Adapting models dynamically as new data becomes available, addressing changing data distributions.


6.4. Advanced Topics in Modern Machine Learning


The state-of-the-art in machine learning focuses on:

• Deep Learning: Advanced neural network architectures for tasks involving unstructured data (e.g., images, text, and audio).

• Transfer Learning: Leveraging pre-trained models to accelerate learning in new tasks.

• Attention Mechanisms and Transformers: Revolutionizing natural language processing and sequence modeling.

• Distributed Learning: Scaling learning algorithms across large datasets and computational infrastructures.

6.5. Conclusion
This course has equipped you with the foundational concepts of machine learning, from regression and classification to clustering and neural
networks. As you move forward, you are encouraged to explore advanced courses and research in areas such as deep learning, reinforcement
learning, and Deployable AI. The topics discussed here form the bedrock for understanding modern AI systems and tackling real-world challenges.
Thank you for your participation, and we wish you success in your journey through machine learning!

A. Proof of Convergence of Perceptron Algorithm with
Boundary Conditions
A.1. Introduction
The Perceptron algorithm guarantees convergence if the data is linearly separable. This proof includes boundary conditions using γ, R, and mistake
count.

A.1.1. Definitions
• Margin (γ): The minimum distance between the decision boundary defined by w∗ and any data point:
  γ = minᵢ yi (w∗ · xi) / ‖w∗‖.
• Maximum Norm (R): The maximum Euclidean norm of any input vector:
  R = maxᵢ ‖xi‖.

• Number of Mistakes (M ): The total number of weight updates performed by the algorithm.

A.1.2. Key Boundary Conditions


1. The dot product of the weight vector with the optimal weight vector grows at least linearly with the number of updates:
   w^(t+1) · w∗ ≥ M γ.

2. The squared norm of the weight vector grows at most linearly with the number of updates:
   ‖w^(t+1)‖² ≤ M R².


A.1.3. Proof Steps


Step 1: Progress in the Direction of w∗

Each mistake leads to a weight update:
w^(t+1) = w^(t) + yi xi.

The dot product with w∗ after the update is:
w^(t+1) · w∗ = (w^(t) · w∗) + yi (xi · w∗).

Since yi (xi · w∗) ≥ γ (using the margin definition with ‖w∗‖ normalized to 1), the dot product grows as:
w^(t+1) · w∗ ≥ w^(t) · w∗ + γ.

After M updates:
w · w∗ ≥ M γ.

Step 2: Bounding the Norm of w

The squared norm of w after an update is:
‖w^(t+1)‖² = ‖w^(t)‖² + ‖yi xi‖² + 2 yi (w^(t) · xi).

Since ‖yi xi‖ = ‖xi‖ ≤ R, and the cross term 2 yi (w^(t) · xi) ≤ 0 because the point was misclassified, the norm satisfies:
‖w^(t+1)‖² ≤ ‖w^(t)‖² + R².

After M updates:
‖w‖² ≤ M R².

Step 3: Combining the Results

Using the Cauchy-Schwarz inequality:
w · w∗ ≤ ‖w‖ ‖w∗‖.

From Steps 1 and 2:
M γ ≤ √M · R ‖w∗‖.

Dividing both sides by √M and then by γ gives:
√M ≤ R ‖w∗‖ / γ.


Squaring both sides:

M ≤ R² ‖w∗‖² / γ².

This upper bound ensures that the number of updates M is finite.

A.1.4. Example Dataset with Boundary Analysis


Consider the dataset:
Input: x1 = [2, 1], x2 = [1, −1], x3 = [−1, −2], x4 = [−2, 1].
Labels: y1 = +1, y2 = −1, y3 = −1, y4 = +1.
Margin and Maximum Norm:
• Compute R: ‖xi‖ = √(x1² + x2²), so R = √5 (the maximum of all input norms).

• Assume γ = 1 for simplicity.

Iterative Updates:

1. Initialize w = [0, 0], b = 0.

2. Update weights for each misclassified point (one pass in the order x1, x2, x3, x4):

   a) x1 is misclassified (y1(w · x1 + b) = 0): update to w = [2, 1], b = 1.
   b) x2 is misclassified: update to w = [1, 2], b = 0.
   c) x3 is correctly classified: no update.
   d) x4 is misclassified (y4(w · x4 + b) = 0): update to w = [−1, 3], b = 1.

   A second pass over the data makes no further mistakes, so the algorithm terminates with w = [−1, 3], b = 1.

Verification of Bound:

• Number of updates: M = 3.

• Verify the bound: M ≤ R²/γ² = 5/1 = 5, which is satisfied.
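The trace above can be checked mechanically. Here is a hedged Python sketch that runs the mistake-driven updates on this dataset and compares the mistake count against the R²/γ² bound (taking γ = 1 as assumed above):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([+1, -1, -1, +1])

w, b, mistakes = np.zeros(2), 0.0, 0
for _ in range(10):                       # a few passes suffice on this data
    clean_pass = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # misclassified (or on the boundary)
            w, b = w + yi * xi, b + yi
            mistakes += 1
            clean_pass = False
    if clean_pass:
        break

R = np.max(np.linalg.norm(X, axis=1))     # sqrt(5)
gamma = 1.0                               # assumed, as in the text
print(w, b, mistakes)                     # final separator and mistake count
print(mistakes <= (R / gamma) ** 2)       # True: M <= R^2 / gamma^2
```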

B. Problem connecting γ, mistakes (M ) and R
B.1. Problem Statement
The Perceptron algorithm is applied to a dataset where:

• Maximum length of data points (R) is 4.

• Margin (Γ) of the optimal separator is 1.

• The algorithm has made 10 mistakes at some point during execution.

• The task is to determine which of the following squared lengths of the weight vector (‖w‖²) can be valid in the 11th iteration:

  (a) 90
  (b) 150
  (c) 190

B.1.1. Key Equations and Concepts


1. The squared norm of the weight vector is bounded as:
   γ² · M² ≤ ‖w‖² ≤ M · R².
2. Substituting the values:

• Margin (γ): 1,

• Maximum length of data points (R): 4,

• Mistake count (M ): 11 (at the 11th iteration).


The bounds for ‖w‖² become:

γ² · M² = 1² · 11² = 121,
M · R² = 11 · 4² = 176.

Therefore:
121 ≤ ‖w‖² ≤ 176.
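A quick check of these bounds (a hedged sketch; the values are those given in the problem):

```python
gamma, R, M = 1, 4, 11
lower, upper = (gamma * M) ** 2, M * R ** 2
print(lower, upper)                                          # 121 176
print([v for v in (90, 150, 190) if lower <= v <= upper])    # [150]
```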

B.1.2. Analysis of Options


1. Option (a): 90. Since 90 < 121, this does not satisfy the lower bound for ‖w‖².

2. Option (b): 150. Since 121 ≤ 150 ≤ 176, this is a valid value for ‖w‖².

3. Option (c): 190. Since 190 > 176, this exceeds the upper bound for ‖w‖² and is invalid.

B.1.3. Conclusion
The valid squared length of the weight vector in the 11th iteration is:

(b) 150.

Options 90 and 190 are invalid because they fall outside the bounds 121 ≤ ‖w‖² ≤ 176.

B.1.4. Conclusion
The Perceptron algorithm converges after a finite number of updates. The mistake count is bounded by R²/γ², demonstrating the efficiency of the algorithm.

C. The Dual Problem in Optimization
C.1. Introduction to the Dual Problem
In optimization, many problems involve minimizing or maximizing a function subject to constraints. The dual problem is an alternative represen-
tation of the original optimization problem, known as the primal problem. Solving the dual problem provides valuable insights and sometimes
computational advantages.
The dual problem arises naturally when applying Lagrangian methods to incorporate constraints into the objective function. It involves expressing
the primal problem in terms of dual variables (Lagrange multipliers), leading to a new optimization problem.

C.1.1. Primal Problem Formulation


Consider the general form of a constrained optimization problem (the primal problem):

min_w f(w)   subject to   gi(w) ≤ 0,  i = 1, . . . , m,   (C.1)

where:

• f (w) is the objective function to be minimized.

• gi (w) are constraint functions.

C.1.2. Lagrangian Function


To incorporate the constraints into the objective, we define the Lagrangian function:
L(w, α) = f(w) + ∑_{i=1}^{m} αi gi(w),   (C.2)


where:

• α = [α1, . . . , αm]ᵀ are non-negative Lagrange multipliers, i.e., αi ≥ 0.

• f (w) is the original objective function.

• gi (w) are the constraint functions.

The Lagrangian combines the objective function and the constraints into a single function.

C.1.3. Dual Function


The dual function is derived by minimizing the Lagrangian with respect to the primal variables w:

q(α) = inf_w L(w, α).   (C.3)

Key properties of the dual function:

• For any α ≥ 0, q(α) provides a lower bound on the optimal value of the primal problem.

• The dual function is concave in α, even if the primal problem is not convex.

C.1.4. Dual Problem Formulation


The dual problem is to maximize the dual function q(α) over all non-negative values of α:

max_{α≥0} q(α).   (C.4)

The constraints αi ≥ 0 ensure that the dual variables remain non-negative.


C.1.5. Weak and Strong Duality


Weak Duality: The value of the dual function at any α ≥ 0 is a lower bound for the primal objective:

q(α) ≤ f (w∗ ), (C.5)


where w∗ is the optimal solution to the primal problem.
Strong Duality: If the primal problem is convex and satisfies certain regularity conditions (such as Slater’s condition), then the optimal values
of the primal and dual problems are equal:

q(α∗ ) = f (w∗ ). (C.6)


Here, α∗ and w∗ are the optimal solutions of the dual and primal problems, respectively.

Slater’s Condition
For the primal problem to satisfy Slater’s condition:

1. Convexity: The objective function f (x) must be convex, and the inequality constraint functions gi (x) must also be convex.

2. Feasibility: There must exist a point x0 in the feasible region (the domain of the problem) such that:

gi (x0 ) < 0, ∀i,

(strict inequality for all inequality constraints).

3. Equality Constraints: If equality constraints hj (x) = 0 exist, they must be affine (linear).

In simpler terms, Slater’s condition requires the existence of a strictly feasible point for the inequality constraints, i.e., a point that satisfies
all constraints but does so with strict inequality for the inequalities.

C.1.6. Advantages of the Dual Problem


The dual problem provides several advantages:

1. Simplified Constraints: The dual problem often has simpler constraints compared to the primal problem.


2. Reduced Dimensionality: If the primal problem involves many variables but few constraints, the dual problem involves fewer variables.

3. Insights into Optimal Solutions: The dual variables α often provide meaningful interpretations, such as the sensitivity of the objective to
constraint violations.

4. Kernelization: For certain problems, the dual formulation depends on dot products, enabling the use of kernel functions to handle non-linear
cases.

C.1.7. Example: Quadratic Programming


Consider a quadratic programming problem:

min_{w,b} (1/2)‖w‖²   subject to   yi(wᵀxi + b) ≥ 1, ∀i.   (C.7)

The Lagrangian is:

L(w, b, α) = (1/2)‖w‖² + ∑_{i=1}^{n} αi (1 − yi(wᵀxi + b)).   (C.8)

Minimizing L with respect to w and b, and maximizing with respect to α, yields the dual problem:

max_{α≥0} ∑_{i=1}^{n} αi − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj (xi · xj),   subject to   ∑_{i=1}^{n} αi yi = 0.   (C.9)

This formulation depends only on dot products, enabling the use of kernel functions in non-linear cases.
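
As an illustration, the dual (C.9) can be solved numerically for a small linearly separable dataset. The sketch below (Python with NumPy/SciPy; the data points are made up for the example) minimizes the negative dual objective under αi ≥ 0 and ∑ αi yi = 0, then recovers the primal weights:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -2.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (y[:, None] * X) @ (y[:, None] * X).T     # K[i, j] = y_i y_j (x_i . x_j)

def neg_dual(alpha):                          # negative of the dual objective
    return 0.5 * alpha @ K @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, None)] * len(y),                       # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                             # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                # bias recovered from support vectors
print("alpha =", np.round(alpha, 3), " w =", np.round(w, 3), " b =", round(b, 3))
```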

C.1.8. Conclusion
The dual problem provides a powerful framework for solving constrained optimization problems, often simplifying computation and offering deeper
insights. By leveraging duality, we can address complex problems like support vector machines and kernelized learning in a principled and efficient
manner.

D. Lagrangian Multipliers and Their Role in Support Vector
Machines
D.1. Introduction to Lagrangian Multipliers
In constrained optimization problems, the goal is to optimize an objective function f (w) subject to constraints gi (w) ≤ 0. Lagrangian multipliers
provide a systematic method to incorporate these constraints into the optimization process by defining a single function called the Lagrangian
function.
The general form of a constrained optimization problem is:

min_w f(w)   (D.1)
subject to gi(w) ≤ 0,  i = 1, 2, . . . , m.   (D.2)

The Lagrangian function L(w, α) is constructed as:


L(w, α) = f(w) + ∑_{i=1}^{m} αi gi(w),   (D.3)

where:

• αi ≥ 0 are the Lagrange multipliers.

• gi (w) are the constraint functions.

The Lagrangian function combines the objective and constraints into a single expression. By solving for the optimal values of w and α, we ensure
that the constraints are satisfied at the optimal solution.


D.1.1. KKT Conditions and Optimality


For convex problems, the optimal solution satisfies the Karush-Kuhn-Tucker (KKT) conditions:

1. Primal feasibility: gi (w) ≤ 0.

2. Dual feasibility: αi ≥ 0.

3. Complementary slackness: αi gi (w) = 0 ∀i.

4. Stationarity: ∇w L(w, α) = 0.
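
These conditions can be verified on a tiny made-up problem. A sketch: minimize f(w) = w² subject to g(w) = 1 − w ≤ 0, whose optimum is w∗ = 1 with multiplier α∗ = 2:

```python
w_star, alpha_star = 1.0, 2.0           # candidate primal/dual solution

g = 1.0 - w_star                        # constraint value g(w*)
grad_L = 2.0 * w_star - alpha_star      # d/dw [w^2 + alpha (1 - w)] at w*

print("primal feasibility      g(w*) <= 0   :", g <= 0)
print("dual feasibility        alpha* >= 0  :", alpha_star >= 0)
print("complementary slackness alpha*g = 0  :", alpha_star * g == 0)
print("stationarity            dL/dw = 0    :", grad_L == 0)
```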

D.1.2. Lagrangian Multipliers in the SVM Context


The support vector machine (SVM) formulation is a quadratic programming problem with constraints. For the hard-margin SVM, the primal problem
is:

min_{w,b} (1/2)‖w‖²   (D.4)
subject to yi(wᵀxi + b) ≥ 1, ∀i.   (D.5)

Here:

• (1/2)‖w‖² is the objective function; minimizing the norm of w maximizes the margin.

• yi(wᵀxi + b) ≥ 1 ensures that all points are correctly classified with a margin of at least 1.

The constraints can be incorporated using Lagrangian multipliers αi ≥ 0, leading to the Lagrangian:

L(w, b, α) = (1/2)‖w‖² − ∑_{i=1}^{n} αi [yi(wᵀxi + b) − 1].   (D.6)


D.1.3. Optimization with Lagrangian Multipliers


To find the optimal solution, we minimize L with respect to the primal variables w and b, and maximize with respect to the dual variables α. This
leads to the dual problem:
max_{α≥0} ∑_{i=1}^{n} αi − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} αi αj yi yj (xi · xj),   (D.7)
subject to:
∑_{i=1}^{n} αi yi = 0.   (D.8)

D.1.4. Role of Lagrangian Multipliers in SVM


The Lagrangian multipliers αi have a direct interpretation in the SVM:
• If αi > 0, the point xi is a support vector. These points lie on the margin or are misclassified in the case of soft-margin SVMs.
• If αi = 0, the point xi does not influence the decision boundary and lies outside the margin.
• The support vectors determine the optimal decision boundary, and w can be expressed as:
w = ∑_{i∈SV} αi yi xi.   (D.9)
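
In practice these multipliers can be inspected after training. A hedged sketch with scikit-learn (the toy data are made up; a large C approximates the hard-margin problem): dual_coef_ stores αi yi for the support vectors, from which w can be reconstructed as in (D.9).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.0, -1.0], [-2.0, -2.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

alpha_y = clf.dual_coef_.ravel()              # alpha_i * y_i, support vectors only
print("support vector indices:", clf.support_)
print("alpha_i * y_i         :", alpha_y)

w = alpha_y @ clf.support_vectors_            # w = sum_{i in SV} alpha_i y_i x_i
print("w from multipliers:", w, " w from sklearn:", clf.coef_.ravel())
```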

D.1.5. Complementary Slackness in SVM


Complementary slackness implies:

αi [yi(wᵀxi + b) − 1] = 0,   ∀i.   (D.10)
This condition indicates:
• If αi > 0, the point xi lies exactly on the margin, i.e., yi(wᵀxi + b) = 1.
• If αi = 0, the point is either correctly classified and outside the margin or irrelevant to the decision boundary.


D.1.6. Conclusion
Lagrangian multipliers play a crucial role in SVMs by incorporating margin constraints into the objective function. The dual formulation derived
using Lagrangian multipliers enables efficient optimization and kernelization, while complementary slackness ensures that only support vectors
contribute to the decision boundary. This elegant framework underpins the success of SVMs in both linear and non-linear classification tasks.

E. Hinge Loss vs. Logistic Loss
E.1. Hinge Loss
Hinge loss is primarily used in Support Vector Machines (SVMs), especially for classification tasks where the goal is to maximize the margin between
classes. It penalizes misclassified points and those inside the margin.
Equation:
L(y, f (x)) = max(0, 1 − y · f (x))
where:
• y is the true label (−1 or +1)

• f (x) is the output of the classifier, typically w · x + b


Explanation:
• If y · f(x) ≥ 1, the point is correctly classified and outside the margin, so the loss is 0.
• If y · f(x) < 1, the point is either misclassified or within the margin, incurring a loss.

E.2. Logistic Loss


Also known as binary cross-entropy loss, logistic loss is commonly used in logistic regression for binary classification. It models probabilities and is
suited for scenarios where the decision boundary isn’t as clear-cut as in SVM.
Equation:
L(y, p) = − (y log(p) + (1 − y) log(1 − p))
where:
• y is the true label (0 or 1)

• p is the predicted probability of the positive class, often given by σ(f(x)) = 1/(1 + e^(−f(x)))


Explanation:
• The loss increases as the predicted probability p deviates from the true label y.
• It is continuous and differentiable, making it suitable for gradient-based optimization methods.

E.3. Comparison and Contrast


Similarities:
• Both are used for classification problems.

• Both aim to penalize incorrect predictions, though in different ways.


Differences:
• Nature of Loss:
– Hinge loss focuses on the margin and classification correctness, being zero for correct classifications outside the margin.
– Logistic loss deals with probabilities, penalizing based on how far the predicted probability is from the true class.

• Output:
– Hinge loss works with a signed distance from the decision boundary.
– Logistic loss deals with probability outputs.

• Shape:
– Hinge loss has a piecewise linear form, with a flat section at zero for correct classifications.
– Logistic loss is smooth and continuous, with no flat regions.

• Use Cases:
– Hinge loss is tailored for SVM, emphasizing a large margin.
– Logistic loss is preferred in logistic regression, focusing on probabilistic interpretations.

• Gradient Properties:
– Hinge loss has a constant gradient where it’s non-zero, which can be less sensitive to outliers.
– Logistic loss provides a gradient that smoothly decreases as the prediction moves towards the correct classification.
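
The contrast is easy to see numerically. A short sketch (Python) evaluating both losses as a function of the margin m = y · f(x), writing the logistic loss in its margin form log(1 + e^(−m)):

```python
import numpy as np

def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def logistic(m):
    # log loss in margin form; equals cross-entropy with p = sigmoid(f(x))
    return np.log1p(np.exp(-m))

for m in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"margin {m:+.1f}: hinge = {hinge(m):.3f}, logistic = {logistic(m):.3f}")
```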


E.3.1. Visual Comparison


In conclusion, while both losses aim to separate classes, hinge loss emphasizes the margin and exactness of classification, whereas logistic loss focuses on modeling the probability of class membership, offering a softer, more probabilistic approach to classification.

• Hinge loss has a "hinge" at y · f(x) = 1, where the loss starts to increase linearly as the margin shrinks.

• Logistic loss is a smooth curve whose value increases as the predicted probability diverges from the true label, with no sharp points or flat regions.

Figure E.1.: Hinge loss vs. logistic loss: the SVM hinge loss, the logistic-regression log loss (negative log conditional likelihood), and the 0-1 loss, plotted against y · f(x).

E.3.2. Role of C

In the soft-margin SVM, the parameter C controls the trade-off between a wide margin and penalties for margin violations: a large C penalizes violations heavily and fits the training data more closely, while a small C tolerates more violations in exchange for a wider margin, as illustrated in Figure E.2.


Figure E.2.: Role of C

F. Decision Tree
F.1. Introduction
A lot of nature books include diagrams that help identify animals. These diagrams often resemble decision trees, or dichotomous classifiers. To
classify a new example (e.g., a new animal), we start at the root node of the tree. At each node, there is a question about a specific feature of the
example. The outgoing edges are labeled with all possible answers[1].
We choose the correct answer for our example and follow the corresponding edge to a new node (one of the current node’s children). That new
node contains another question, and the process repeats. Eventually, we reach a leaf, which provides the predicted class for the example.
If we do not reach a given node in our path through the tree, we never ask that node's question; for example, we never ask whether an iguana has wings. Often, an example reaches only a small fraction of the nodes in the tree: in a balanced tree of height h, a path from root to leaf visits roughly h nodes out of a total of roughly 2^h − 1.

F.2. Why Decision Trees?


Decision trees have both advantages and disadvantages compared to other learning methods.

F.2.1. Advantages
• Interpretability: Decision trees provide clear explanations for every prediction made as they rely on a simple logical function of the input
features.

• Minimal Data Preprocessing: They require minimal data preprocessing (e.g., no need to scale or normalize features).

• Flexibility: They can handle both numerical and categorical data.


Figure F.1.: Decision Tree

F.3. Decision Stumps

Figure F.2.: Decision Stump

F.2.2. Disadvantages
• Accuracy: Decision trees often do not achieve the same prediction performance as less interpretable models.

• Overfitting: Large trees can overfit the training data, especially when the dataset is small or noisy.

• Complexity: If input features are not interpretable (e.g., individual pixels in an image), the tree itself loses interpretability.

Accuracy can be improved by using ensembles of decision trees, such as Random Forests or Gradient Boosted Trees. However, this reduces
interpretability when multiple trees are combined.

F.3. Decision Stumps


A decision stump is a decision tree with just one node. It is one of the simplest classifiers, as the entire decision is based on a single feature.


F.3.1. Example
Consider predicting whether an animal is a parrot based on two features: whether it flies and whether it likes crackers. Here is some training data:

flies crackers parrot


T T T
F F F
T F T
F T F
F F T

F.3.2. Choosing the Split


We choose a feature to split on (e.g., flies). If we split based on flies, the data is divided as follows:

flies = T: {T, T }, flies = F: {F, F, T }.


The predictions for each bucket will maximize the likelihood of the training data, resulting in the following stump:

If flies = T, predict parrot = T ; If flies = F, predict parrot = F.


Alternatively, we could split on crackers, resulting in a different stump.

F.4. Information Gain


Information gain measures how much uncertainty (entropy) about the target variable S is reduced after splitting the data based on a particular
feature. In simpler terms, it tells us how much “useful information” a feature provides about the target when used to split the dataset.
To evaluate which decision stump is better, we use information gain, based on the reduction in entropy after the split. Entropy is calculated as:
X
H(S) = − pi log2 (pi ),
i

where pi is the proportion of examples in class i.


For a dataset split based on feature A, the information gain is:

IG(S, A) = H(S) − ∑_{v∈values(A)} (|S_v| / |S|) · H(S_v).

Information Gain for the Split on flies


Step 1: Total Entropy (H(S))
The formula for entropy is:

H(S) = − ∑_i p_i log₂(p_i),

where p_i is the proportion of instances belonging to class i.
From the dataset:
flies crackers parrot
T T T
F F F
T F T
F T F
F F T
• Total examples: 5.
• parrot = T : 3 instances.
• parrot = F : 2 instances.
• The probabilities are: P(parrot = T) = 3/5, P(parrot = F) = 2/5.

Substituting into the entropy formula:

H(S) = −(3/5 · log₂(3/5) + 2/5 · log₂(2/5)).

Using log₂(3/5) ≈ −0.736 and log₂(2/5) ≈ −1.322, we compute:

H(S) = −(3/5 · (−0.736) + 2/5 · (−1.322)) ≈ 0.970.


Step 2: Conditional Entropy (H(S|flies))


We split the dataset based on the value of flies:

• For flies = T : 2 examples, both parrot = T .

• For flies = F : 3 examples (parrot = T : 1, parrot = F : 2).

Subset 1 (flies = T ): The probabilities are:

P (parrot = T |flies = T ) = 1, P (parrot = F |flies = T ) = 0.

Entropy for this subset:


H(S|flies = T ) = −(1 · log2 1 + 0 · log2 0) = 0.

Subset 2 (flies = F): The probabilities are:

P(parrot = T | flies = F) = 1/3,  P(parrot = F | flies = F) = 2/3.

Entropy for this subset:

H(S | flies = F) = −(1/3 · log₂(1/3) + 2/3 · log₂(2/3)).

Using log₂(1/3) ≈ −1.585 and log₂(2/3) ≈ −0.585:

H(S | flies = F) = −(1/3 · (−1.585) + 2/3 · (−0.585)) ≈ 0.918.

Weighted Average: The weights are proportional to the sizes of the subsets: flies = T accounts for 2 of the 5 examples, flies = F for 3 of the 5.
The conditional entropy is:

H(S | flies) = 2/5 · H(S | flies = T) + 3/5 · H(S | flies = F).

Substituting:

H(S | flies) = 2/5 · 0 + 3/5 · 0.918 ≈ 0.551.


Step 3: Information Gain


The information gain is the reduction in entropy:
IG(S, flies) = H(S) − H(S|flies).

Substituting the values:


IG(S, flies) = 0.970 − 0.551 = 0.419 bits.

Final Answer:
The information gain for the split on flies is approximately 0.419 bits.

Information Gain for the Split on crackers


Step 1: Total Entropy (H(S))
The total entropy of the dataset is the same as before:

H(S) = −(3/5 · log₂(3/5) + 2/5 · log₂(2/5)) ≈ 0.970.

Step 2: Conditional Entropy (H(S|crackers))


We split the dataset based on the value of crackers:
• For crackers = T: 2 examples (parrot = T: 1, parrot = F: 1).
• For crackers = F: 3 examples (parrot = T: 2, parrot = F: 1).


Subset 1 (crackers = T): The probabilities are:

P(parrot = T | crackers = T) = 1/2,  P(parrot = F | crackers = T) = 1/2.

Entropy for this subset:

H(S | crackers = T) = −(1/2 · log₂(1/2) + 1/2 · log₂(1/2)).

Since log₂(1/2) = −1, we get:

H(S | crackers = T) = −(1/2 · (−1) + 1/2 · (−1)) = 1.

Subset 2 (crackers = F): The probabilities are:

P(parrot = T | crackers = F) = 2/3,  P(parrot = F | crackers = F) = 1/3.

Entropy for this subset:

H(S | crackers = F) = −(2/3 · log₂(2/3) + 1/3 · log₂(1/3)).

Using log₂(2/3) ≈ −0.585 and log₂(1/3) ≈ −1.585, we compute:

H(S | crackers = F) = −(2/3 · (−0.585) + 1/3 · (−1.585)) ≈ 0.918.

Weighted Average: The weights are proportional to the sizes of the subsets: crackers = T accounts for 2 of the 5 examples, crackers = F for 3 of the 5.
The conditional entropy is:

H(S | crackers) = 2/5 · H(S | crackers = T) + 3/5 · H(S | crackers = F).

Substituting:

H(S | crackers) = 2/5 · 1 + 3/5 · 0.918 ≈ 0.951.


Step 3: Information Gain


The information gain is:
IG(S, crackers) = H(S) − H(S | crackers).
Substituting the values:
IG(S, crackers) = 0.970 − 0.951 ≈ 0.019 bits.

Final Answer:
The information gain for the split on crackers is approximately 0.019 bits, far less than the 0.419 bits obtained by splitting on flies, so flies is the better feature for the first split.
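
The two computations above can be reproduced in a few lines. A sketch (Python; small differences in the third decimal place come from rounding the logarithms by hand):

```python
import math
from collections import Counter

data = [  # (flies, crackers, parrot)
    ("T", "T", "T"), ("F", "F", "F"), ("T", "F", "T"),
    ("F", "T", "F"), ("F", "F", "T"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature_idx, label_idx=2):
    labels = [r[label_idx] for r in rows]
    gain = entropy(labels)                      # H(S)
    for value in {r[feature_idx] for r in rows}:
        subset = [r[label_idx] for r in rows if r[feature_idx] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

print("IG(flies)    =", round(info_gain(data, 0), 3))   # ~0.420
print("IG(crackers) =", round(info_gain(data, 1), 3))   # ~0.020
```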

F.5. Building a Bigger Tree


When a decision stump is insufficient, we can extend it by splitting further on the remaining data at each node. For example:
• Split first on flies.
• For examples where flies = F, split again on crackers.
The new tree can be evaluated by calculating the information gain at each step.

F.6. Real-Valued Features


For real-valued features, we choose a threshold θ and split the dataset into:
feature ≤ θ and feature > θ.
Only thresholds between adjacent values in the sorted dataset need to be considered, reducing the computational complexity.
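
A small sketch of this idea (the feature values are made up): candidate thresholds are the midpoints between adjacent distinct values of the sorted feature.

```python
import numpy as np

values = np.array([2.3, 5.1, 5.1, 1.7, 4.0])       # one real-valued feature
distinct = np.unique(values)                       # sorted, duplicates removed
thresholds = (distinct[:-1] + distinct[1:]) / 2.0  # midpoints between neighbours
print(thresholds)                                  # [2.0, 3.15, 4.55]
```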

F.7. Categorical Features


For categorical features with many values, sorting the feature values by P (class | feature = value) allows us to treat the feature as real-valued.


F.8. Managing Complexity


To prevent overfitting, decision trees can use two strategies:

• Early Stopping: Stop splitting when further splits do not improve validation accuracy.

• Pruning: Grow the tree fully, then simplify it by merging nodes that do not improve accuracy on a holdout set.
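
Both strategies are available off the shelf. A hedged sketch with scikit-learn (the dataset choice is arbitrary): max_depth acts as a simple form of early stopping, while ccp_alpha drives cost-complexity pruning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for params in ({"max_depth": None}, {"max_depth": 3}, {"ccp_alpha": 0.01}):
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    print(params, "leaves:", tree.get_n_leaves(),
          "test accuracy:", round(tree.score(X_te, y_te), 3))
```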

F.9. Conclusion
Decision trees are a versatile and interpretable model. However, they can overfit or underperform compared to more complex models. Techniques
like pruning, early stopping, and ensemble methods improve their performance.

G. Loss Functions in Machine Learning
G.1. Introduction
Loss functions play a critical role in machine learning by quantifying the difference between predicted and actual values. They guide optimiza-
tion algorithms in adjusting model parameters to minimize errors, thereby improving performance. Loss functions can be broadly classified into
regression, classification, and hybrid categories. This essay discusses some of the most commonly used loss functions in detail.

G.2. Loss Functions for Regression


G.2.1. Mean Squared Error (MSE)
The Mean Squared Error (MSE) measures the average squared difference between predicted and actual values:
L_MSE(y, ŷ) = (1/n) ∑_{i=1}^{n} (yi − ŷi)²

MSE penalizes large errors more significantly, making it sensitive to outliers. It is widely used in regression tasks.

G.2.2. Mean Absolute Error (MAE)


The Mean Absolute Error (MAE) calculates the average absolute difference:
L_MAE(y, ŷ) = (1/n) ∑_{i=1}^{n} |yi − ŷi|

Unlike MSE, MAE is robust to outliers but may lead to slower convergence during optimization.


G.2.3. Huber Loss


Huber Loss combines the strengths of MSE and MAE, making it robust to outliers while retaining differentiability:

L_Huber(y, ŷ) = (1/2)(y − ŷ)²   if |y − ŷ| ≤ δ,   and   δ|y − ŷ| − (1/2)δ²   if |y − ŷ| > δ.
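
A compact sketch (NumPy; the numbers are made up, with one outlier) of the three regression losses, showing how the outlier inflates MSE far more than MAE or Huber loss:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    quadratic = 0.5 * r ** 2                  # used near zero residual
    linear = delta * r - 0.5 * delta ** 2     # used in the tails
    return np.mean(np.where(r <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 10.0])      # last target behaves like an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print("MSE:", mse(y_true, y_pred), "MAE:", mae(y_true, y_pred),
      "Huber:", huber(y_true, y_pred))
```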

G.3. Loss Functions for Classification


G.3.1. Cross-Entropy Loss
The Cross-Entropy Loss is commonly used in classification tasks involving probabilistic outputs:
L_CE(y, ŷ) = −(1/n) ∑_{i=1}^{n} ∑_{k=1}^{K} y_{i,k} log(ŷ_{i,k})

Here, K is the number of classes, yi,k is the true label, and ŷi,k is the predicted probability for class k.

G.3.2. Hinge Loss


Hinge Loss is often used with Support Vector Machines (SVMs):

LHinge (y, f (x)) = max(0, 1 − y · f (x))

It emphasizes correct classification by maintaining a margin between classes.

G.3.3. Logistic Loss


The Logistic Loss, used in logistic regression, is defined as:
L_Logistic(y, ŷ) = −(1/n) ∑_{i=1}^{n} [yi log(ŷi) + (1 − yi) log(1 − ŷi)]

This loss function is foundational in binary classification tasks.


G.4. Loss Functions for Hybrid Tasks


G.4.1. Exponential Loss
Exponential Loss is widely used in boosting algorithms such as AdaBoost:

LExponential (y, f (x)) = e−y·f (x)

This loss function focuses on reducing the impact of misclassified samples during training.

G.5. Conclusion
Loss functions are integral to machine learning, shaping how models learn from data. Choosing an appropriate loss function depends on the nature
of the problem, the type of data, and the desired trade-offs between accuracy, robustness, and interpretability. As machine learning continues to
evolve, developing new and optimized loss functions remains an active area of research.

Bibliography
[1] Decision Tree. URL: https://www.cs.cmu.edu/~aarti/Class/10701_Spring21/Lecs/decision-trees.pdf.

