Machine Learning Techniques - Week 11: December 20, 2024
Contents
2.5.2. Discussion
2.6. Conclusion
4. Neural Networks
4.1. Introduction
4.2. Revisiting the Perceptron
4.3. Neural Networks: Extending Perceptrons
4.3.1. Structure of a Neural Network
4.3.2. Mathematical Representation
4.3.3. Introducing Non-Linearity
4.3.4. Role of Non-Linearity
4.4. Training Neural Networks
4.5. Challenges and Implications
4.6. Conclusion
5. Backpropagation
5.1. Introduction
Appendices
F. Decision Tree
F.1. Introduction
F.2. Why Decision Trees?
F.2.1. Advantages
F.2.2. Disadvantages
F.3. Decision Stumps
F.3.1. Example
F.3.2. Choosing the Split
F.4. Information Gain
F.5. Building a Bigger Tree
F.6. Real-Valued Features
F.7. Categorical Features
F.8. Managing Complexity
F.9. Conclusion
G. Loss Functions in Machine Learning
1. Binary Classification Algorithms through Loss Functions
1.1. Introduction
Binary classification is a fundamental problem in supervised learning, where the objective is to predict one of two possible labels (+1 or −1) for
a given input. Over the years, numerous algorithms have been developed for binary classification, including logistic regression, support vector
machines (SVMs), decision trees, boosting, and perceptrons. This chapter explores why there are so many algorithms for binary classification and
provides a unified framework for understanding them through the lens of loss functions.
1.2. Motivation
Unlike regression, where a single method (linear regression) can often suffice with modifications like regularization, binary classification involves a
variety of approaches. This variety arises because solving the classification problem directly, as formulated, is computationally hard (NP-hard). To
address this, different algorithms employ surrogate loss functions that approximate the original problem in computationally efficient ways.
Formally, a classifier is a function $h : \mathbb{R}^d \to \{+1, -1\}$.
We evaluate the performance of h using a zero-one loss function:
$$\mathbb{I}(\text{condition}) = \begin{cases} 1 & \text{if condition is true,} \\ 0 & \text{otherwise.} \end{cases}$$
The total loss over the dataset is:
$$\sum_{i=1}^{n} \mathbb{I}(h(x_i) \neq y_i).$$
Directly minimizing this loss is NP-hard because the indicator function is non-convex and discontinuous. Thus, practical algorithms rely on
surrogate loss functions.
0-1 Loss
The 0-1 loss function, also known as zero-one loss or misclassification loss, assigns a loss of 0 if a prediction matches the actual outcome
and a loss of 1 if it does not. It’s used in classification tasks to measure the accuracy of predictions:
$$L(h(x), y) = \begin{cases} 0, & \hat{y} = y \\ 1, & \hat{y} \neq y \end{cases}$$
where $\hat{y} = h(x)$.
This loss function does not account for the degree of error, only whether an error occurred.
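As a minimal sketch (not from the original notes; the labels below are invented for illustration), the total zero-one loss can be computed directly:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Total zero-one loss: the number of misclassified points."""
    return int(np.sum(y_true != y_pred))

# Toy labels in {+1, -1}; the values are illustrative only.
y_true = np.array([+1, -1, +1, +1, -1])
y_pred = np.array([+1, +1, +1, -1, -1])
print(zero_one_loss(y_true, y_pred))  # 2 mistakes
```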
1.5. The NP-Hardness of Classification
Minimizing the zero-one loss exactly over a dataset is NP-hard. Algorithms overcome this by replacing the zero-one loss with surrogate loss functions that are easier to minimize.
While this loss is convex and easy to optimize, it has significant drawbacks for classification. Even if G(x) predicts the correct label, the squared
loss penalizes large values of G(x) · y, which can lead to poor performance in the presence of outliers.
Figure 1.1.: Comparison of zero-one loss and squared loss as functions of G(x) · y.
Each of these loss functions is designed to approximate the zero-one loss more closely while remaining convex or amenable to optimization.
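To make the comparison in Figure 1.1 concrete, here is a small sketch (added for illustration) that evaluates both losses as functions of the margin $m = G(x) \cdot y$, using the identity $(G(x) - y)^2 = (1 - m)^2$ for $y \in \{+1, -1\}$:

```python
import numpy as np

def zero_one(m):
    """Zero-one loss as a function of the margin m = G(x) * y."""
    return (m <= 0).astype(float)

def squared(m):
    """Squared loss (G(x) - y)^2 rewritten as (1 - m)^2 for y in {+1, -1}."""
    return (1.0 - m) ** 2

# Note how the squared loss grows again for m > 1, i.e. for confidently
# correct predictions, which is the drawback discussed above.
margins = np.linspace(-2, 3, 11)
for m in margins:
    print(f"margin={m:+.1f}  zero-one={zero_one(np.array(m)):.0f}  squared={squared(m):.2f}")
```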
1.8. Conclusion
This chapter introduced the classification problem and explained why it necessitates a variety of algorithms. By adopting a loss function perspective,
we saw how different algorithms approximate the zero-one loss using surrogate loss functions. While the squared loss is computationally efficient,
its poor alignment with the classification objective highlights the importance of selecting appropriate surrogate losses. In the next chapter, we will
delve deeper into specific algorithms like SVMs, logistic regression, and boosting, and analyze their loss functions in detail.
2. SVM and Logistic Loss
2.1. Introduction
Binary classification is a cornerstone of machine learning, where the task is to classify inputs into one of two possible categories. This chapter
delves into how various algorithms approach binary classification through the lens of loss functions. It discusses how the inherent complexity of
minimizing the zero-one loss leads to the development of surrogate loss functions, enabling practical algorithms like Support Vector Machines
(SVMs) and logistic regression.
The zero-one objective is non-convex and discontinuous, making it computationally hard to minimize (NP-hard). Consequently, different algorithms
replace the zero-one loss with surrogate loss functions that are easier to optimize.
The soft-margin SVM solves:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to:
$$y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i.$$
Here:
• $w$ is the weight vector,
• $b$ is the bias term,
• $\xi_i$ are slack variables allowing misclassification,
• $C$ controls the trade-off between maximizing the margin and minimizing the slack.
Hinge Loss
The hinge loss function, often used in support vector machines, is defined as:
$$L(y, f(x)) = \max(0, 1 - y \cdot f(x)),$$
where $y \in \{+1, -1\}$ is the true label and $f(x)$ is the classifier's real-valued output.
$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^\top x_i)).$$
Here, the term max(0, 1 − yi (w> xi )) is the hinge loss for a single data point. It penalizes points that lie within the margin or on the wrong side
of the hyperplane.
• For $z = y_i(w^\top x_i) \geq 1$, the loss is zero, indicating correct classification with sufficient margin.
• For z < 1, the loss increases linearly, penalizing points based on their proximity to the decision boundary.
The hinge loss is convex, making it computationally feasible to minimize, unlike the zero-one loss.
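As an illustrative sketch (the data and parameter values here are invented), the regularized hinge objective above can be evaluated with a few lines of NumPy:

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Regularized hinge objective: 0.5*||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Illustrative data (no bias term, matching the formulation above).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
y = np.array([+1, +1, -1])
w = np.array([0.5, 0.5])
print(svm_objective(w, X, y, C=1.0))
```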
2.4. Logistic Regression and Logistic Loss
Logistic regression minimizes the negative log-likelihood of the data:
$$\min_{w} \; \sum_{i=1}^{n} \left[ -y_i \log \sigma(w^\top x_i) - (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right],$$
where the labels are encoded as $y_i \in \{0, 1\}$ and $\sigma$ is the sigmoid function.
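A corresponding sketch for the logistic objective (again with invented data, and labels encoded in $\{0, 1\}$ as in the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Negative log-likelihood with labels y in {0, 1}."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
y = np.array([1, 1, 0])
w = np.array([0.5, 0.5])
print(logistic_loss(w, X, y))
```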
2.5.2. Discussion
• The zero-one loss is ideal but computationally intractable.
• The hinge loss is a convex approximation and enforces a margin, making it suitable for SVMs.

Figure 2.1.: Hinge Loss vs. Logistic Loss.
2.6. Conclusion
This chapter illustrated how different algorithms for binary classification address the NP-hardness of the zero-one loss using surrogate loss func-
tions. SVMs rely on hinge loss, while logistic regression uses logistic loss. Both approaches provide computationally efficient and effective solutions
to binary classification problems. The choice of loss function significantly influences the algorithm’s behavior and performance, emphasizing the
importance of selecting the right surrogate for the problem at hand.
3. Perceptron and Boosting Loss
3.1. Introduction
Binary classification algorithms, ranging from perceptrons to support vector machines (SVMs) and boosting, rely on minimizing loss functions to
improve classification performance. This chapter examines the perceptron algorithm, boosting, and the broader implications of using convex and
non-convex surrogate loss functions. We explore their mathematical foundations, interpretations, and connections with modern advancements
such as neural networks.
On each mistake at step $t$, the perceptron updates its weights as:
$$w_{t+1} = w_t + y_t x_t.$$
The perceptron mimics gradient descent on a modified hinge loss where the margin condition is simplified. For the perceptron, the loss on a single example is implicitly
$$L(w; x, y) = \max(0, -y(w^\top x)),$$
and a subgradient step on a misclassified point takes the form
$$w_{t+1} = w_t - \eta_t \cdot (-yx),$$
where $\eta_t = 1$ is a fixed step size in the perceptron.
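The following toy implementation (an illustration, not code from the course) runs this update on linearly separable data:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron as subgradient descent on max(0, -y * (w . x)), step size 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:   # mistake (or exactly on the boundary)
                w = w + y_t * x_t      # update rule w_{t+1} = w_t + y_t x_t
                mistakes += 1
        if mistakes == 0:              # converged on linearly separable data
            break
    return w

# Linearly separable toy data; the values are illustrative only.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```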
3.4. Convex Surrogates for Loss Functions
These loss functions serve as surrogates for the zero-one loss, providing smooth and differentiable approximations.
3.6. Conclusion
This chapter explored various algorithms for binary classification through the lens of loss functions. The perceptron was reinterpreted as performing
subgradient descent on a modified hinge loss. Boosting was connected to the exponential loss, highlighting its focus on misclassified points. We
concluded by emphasizing the importance of convex surrogates in traditional algorithms and introduced the broader possibilities of non-convex
optimization in neural networks. This discussion provides a foundation for understanding the diverse approaches to binary classification and the
trade-offs involved.
4. Neural Networks
4.1. Introduction
In previous discussions, we explored various algorithms for supervised learning, particularly binary classification, through the lens of loss function
minimization combined with regularization. This framework allowed us to understand algorithms such as support vector machines (SVMs), logistic
regression, and perceptron. However, these methods rely heavily on convex surrogate loss functions, which ensure computational tractability.
This chapter introduces neural networks, a family of algorithms inspired by perceptrons, that do not necessarily rely on convex loss functions.
Neural networks are a foundational component of modern machine learning, particularly deep learning, and provide a powerful approach to
modeling complex relationships between input and output.
The perceptron algorithm predicts outputs by learning a weight vector $w \in \mathbb{R}^d$ and applying the rule:
$$\hat{y} = \text{sign}(w^\top x),$$
where $x \in \mathbb{R}^d$ is the input vector. This can be visualized as follows:
• The input $x = [x_1, x_2, \dots, x_d]$ is represented as a set of nodes.
• Each node connects to an output node with weights $w = [w_1, w_2, \dots, w_d]$.
• The weighted sum $w^\top x$ determines the output after applying the sign function.
While perceptrons are effective for linearly separable data, they cannot model non-linear relationships. Neural networks generalize this concept
by introducing non-linearity through hidden layers and activation functions.
Each hidden neuron computes $h_i = a(w_i^\top x)$, where:
• $w_i \in \mathbb{R}^d$ are the weights for the $i$-th neuron,
• $a(\cdot)$ is the activation function, introducing non-linearity.
The output layer computes:
$$\hat{y} = w_{\text{out}}^\top h,$$
where:
• $h = [h_1, h_2, \dots, h_k]^\top$ is the vector of activations from the hidden layer,
• $w_{\text{out}} \in \mathbb{R}^k$ are the weights of the output layer.
Substituting, the overall network output is:
$$\hat{y} = \sum_{i=1}^{k} w_{\text{out},i} \cdot a(w_i^\top x).$$
Here, the network parameters are:
1. Hidden layer weights: $w_1, w_2, \dots, w_k \in \mathbb{R}^d$,
2. Output layer weights: $w_{\text{out}} \in \mathbb{R}^k$.
The task of training a neural network involves learning these weights to minimize a loss function over the dataset.
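As a sketch of the forward pass just described (dimensions and weights are arbitrary; ReLU is chosen as the activation purely for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W_hidden, w_out, activation=relu):
    """One-hidden-layer network: y_hat = sum_i w_out[i] * a(w_i . x)."""
    h = activation(W_hidden @ x)  # hidden activations, shape (k,)
    return w_out @ h              # scalar output

rng = np.random.default_rng(0)
d, k = 3, 4                          # input dimension and hidden width (illustrative)
W_hidden = rng.normal(size=(k, d))   # rows are the hidden weights w_i
w_out = rng.normal(size=k)
x = rng.normal(size=d)
print(forward(x, W_hidden, w_out))
```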
4.4. Training Neural Networks
1. Sigmoid Function
$$a(z) = \frac{1}{1 + e^{-z}}.$$
The sigmoid function maps z ∈ R to the range (0, 1), and is differentiable, which facilitates optimization.
Comparison of Activation Functions. Figure 4.1 illustrates the sigmoid and ReLU functions.
The optimization is typically performed using gradient-based methods, such as stochastic gradient descent (SGD), where gradients are computed
via backpropagation.
4.6. Conclusion
Neural networks extend the perceptron by introducing hidden layers and non-linear activation functions, enabling them to model complex, non-
linear relationships. While they deviate from convex optimization principles, their flexibility and scalability make them indispensable in modern
machine learning. This chapter provides an introduction to neural networks, serving as a foundation for further exploration in deep learning.
5. Backpropagation
5.1. Introduction
Neural networks represent a significant extension of classical machine learning algorithms, enabling the modeling of complex, non-linear relation-
ships in data. Training a neural network involves defining an appropriate loss function, optimizing the network’s parameters, and handling challenges
posed by non-convexity and high-dimensional parameter spaces. This chapter elaborates on the training process, introduces backpropagation,
and discusses its application in regression and classification problems.
For regression, the per-example loss is the squared error:
$$L_i(\Theta) = (NN(x_i; \Theta) - y_i)^2,$$
where yi is the true label. The total loss over the dataset is:
$$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} (NN(x_i; \Theta) - y_i)^2.$$
This loss function captures the squared deviation between the predicted and actual values.
For binary classification, the network outputs a probability
$$P(y = 1 \mid x; \Theta) = \sigma(w_{\text{out}}^\top h),$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, and $h$ is the vector of activations from the last hidden layer. The cross-entropy loss is used to
compare the predicted probabilities with the true labels:
$$L(\Theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right],$$
where $\hat{y}_i = P(y = 1 \mid x_i; \Theta)$ is the predicted probability.
Parameters are updated by gradient descent:
$$\Theta \leftarrow \Theta - \eta \nabla_\Theta L(\Theta),$$
where $\eta > 0$ is the learning rate, and $\nabla_\Theta L(\Theta)$ is the gradient of the loss with respect to the parameters.
5.3.2. Backpropagation
Backpropagation is the algorithm used to compute gradients efficiently in neural networks. It exploits the chain rule of differentiation to propagate
errors backward through the network. For a parameter wij in layer l, the gradient is computed as:
$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} h_i^{(l-1)},$$
where:
• $h_i^{(l-1)}$ is the activation of the $i$-th neuron in layer $l - 1$,
• $\delta_j^{(l)}$ is the error term for the $j$-th neuron in layer $l$, computed recursively.
The backpropagation algorithm involves the following steps (a minimal sketch follows the list):
1. Perform a forward pass to compute the network's predictions.
2. Compute the loss and the gradient of the output layer.
3. Propagate the error backward through the network using the chain rule.
4. Update the parameters using gradient descent.
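A minimal sketch of these four steps for a one-hidden-layer regression network with sigmoid activations and squared-error loss (all names, dimensions, and values here are illustrative):

```python
import numpy as np

def train_step(x, y, W1, w2, lr=0.1):
    """One backpropagation step for a one-hidden-layer regression network."""
    # 1. Forward pass.
    z1 = W1 @ x                       # pre-activations, shape (k,)
    h = 1.0 / (1.0 + np.exp(-z1))     # sigmoid activations
    y_hat = w2 @ h                    # prediction

    # 2. Loss gradient at the output: L = (y_hat - y)^2.
    delta_out = 2.0 * (y_hat - y)

    # 3. Backward pass via the chain rule.
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)   # sigmoid derivative h(1-h)
    grad_W1 = np.outer(delta_hidden, x)

    # 4. Gradient descent update.
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
    return W1, w2, (y_hat - y) ** 2

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)
x, y = rng.normal(size=3), 1.5
for _ in range(100):
    W1, w2, loss = train_step(x, y, W1, w2)
print(loss)  # the loss on this single example should shrink substantially
```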
5.5.2. Overfitting
With large parameter spaces, neural networks are prone to overfitting, particularly on small datasets. Regularization techniques such as dropout,
weight decay, and data augmentation are commonly used to mitigate this issue.
5.7. Conclusion
Neural networks represent a powerful approach to modeling complex relationships, particularly for unstructured data. While their training involves
challenges such as non-convexity and overfitting, techniques like backpropagation and regularization ensure effective optimization. The versatility
of neural networks makes them indispensable in modern machine learning, particularly in fields like computer vision, natural language processing,
and speech recognition.
6. Concluding Remarks: Foundations and Frontiers in Machine
Learning
6.1. Introduction
This chapter concludes the course by summarizing the foundational topics we have covered and outlining advanced concepts and areas for further
exploration. The journey through unsupervised learning, supervised learning, and an introduction to neural networks has provided a robust base
for understanding machine learning algorithms and their applications. Additionally, we touch on topics beyond the scope of this course, such as
semi-supervised learning, fairness, explainability, and privacy, offering a glimpse into the future challenges and opportunities in machine learning.
• Representation Learning:
– Principal Component Analysis (PCA): A linear technique to reduce the dimensionality of data while preserving variance.
– Kernel PCA: An extension of PCA to capture non-linear relationships using kernel functions.
• Clustering:
– K-Means Algorithm: A partitioning-based clustering technique minimizing intra-cluster variance.
– Spectral Clustering: An extension of k-means utilizing graph-based representations and eigenvalue decompositions.
• Density Estimation:
• Regression:
– Ordinary Least Squares (OLS): Minimizing the squared error between predictions and labels.
– Ridge Regression and Lasso: Regularized versions of OLS to prevent overfitting and encourage sparsity.
• Classification:
– K-Nearest Neighbors (KNN): A non-parametric method based on proximity.
– Decision Trees: A hierarchical, rule-based classifier.
– Logistic Regression: A probabilistic classifier using the logistic function.
– Support Vector Machines (SVMs): Maximizing the margin between classes with extensions for non-linear decision boundaries via
kernels.
– Bagging and Boosting: Ensemble methods to improve stability and performance.
• Neural Networks:
– Introduced as an extension of perceptrons with non-linear activations.
– Discussed the basics of backpropagation and its role in optimizing neural networks.
6.3. Topics Beyond the Course Scope
• Self-Supervised Learning: Particularly useful in deep learning, it involves generating pseudo-labels from data itself, enabling unsupervised
pretraining.
• Reinforcement Learning
• Fairness: Ensuring the algorithm does not reinforce societal biases present in training data.
• Explainability: Enabling algorithms to provide human-interpretable explanations for their decisions, crucial for domains like healthcare and
autonomous vehicles.
• Privacy: Developing algorithms that respect data privacy, using techniques like differential privacy.
• Edge AI: Designing lightweight models for deployment on resource-constrained devices, such as mobile phones.
• Continual Learning: Adapting models dynamically as new data becomes available, addressing changing data distributions.
• Deep Learning: Advanced neural network architectures for tasks involving unstructured data (e.g., images, text, and audio).
• Attention Mechanisms and Transformers: Revolutionizing natural language processing and sequence modeling.
• Distributed Learning: Scaling learning algorithms across large datasets and computational infrastructures.
6.5. Conclusion
This course has equipped you with the foundational concepts of machine learning, from regression and classification to clustering and neural
networks. As you move forward, you are encouraged to explore advanced courses and research in areas such as deep learning, reinforcement
learning, and Deployable AI. The topics discussed here form the bedrock for understanding modern AI systems and tackling real-world challenges.
Thank you for your participation, and we wish you success in your journey through machine learning!
A. Proof of Convergence of Perceptron Algorithm with
Boundary Conditions
A.1. Introduction
The Perceptron algorithm guarantees convergence if the data is linearly separable. This proof includes boundary conditions using γ, R, and mistake
count.
A.1.1. Definitions
• Margin ($\gamma$): The minimum distance between the decision boundary defined by $w^*$ and any data point:
$$\gamma = \min_i \frac{y_i (w^* \cdot x_i)}{\|w^*\|}.$$
• Maximum Norm ($R$): The maximum Euclidean norm of any input vector:
$$R = \max_i \|x_i\|.$$
• Number of Mistakes (M ): The total number of weight updates performed by the algorithm.
The proof rests on two facts:
1. The inner product with the separator grows linearly with the number of mistakes:
$$w^{(t+1)} \cdot w^* \geq M \gamma \|w^*\|.$$
2. The norm of the weight vector grows at most quadratically with the maximum norm of the inputs:
$$\|w^{(t+1)}\|^2 \leq M R^2.$$
Combining the two inequalities yields the mistake bound $M \leq R^2/\gamma^2$.
Iterative Updates:
Verification of Bound:
• Number of updates M = 4.
• Verify the bound: $M \leq R^2/\gamma^2 = 5/1 = 5$, consistent with $M = 4$.
B. Problem connecting γ, mistakes (M ) and R
B.1. Problem Statement
The Perceptron algorithm is applied to a dataset where:
• Margin ($\gamma$): 1.
• The task is to determine which of the following squared lengths of the weight vector ($\|w\|^2$) can be valid in the 11th iteration:
1. (a) 90
2. (b) 150
3. (c) 190
1. Option (a): 90. Since $90 < 121$, this falls below the lower bound and is invalid.
2. Option (b): 150. Since $121 \leq 150 \leq 176$, this is a valid value for $\|w\|^2$.
3. Option (c): 190. Since $190 > 176$, this exceeds the upper bound for $\|w\|^2$ and is invalid.
B.1.3. Conclusion
The valid squared length of the weight vector in the 11th iteration is:
(b) 150.
Options 90 and 190 are invalid because they fall outside the bounds $121 \leq \|w\|^2 \leq 176$.
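For reference, these bounds follow from the two standard perceptron inequalities; written out (with $R^2 = 16$ inferred here from the stated upper bound of 176):

```latex
% After M mistakes (margin gamma defined relative to a unit-norm separator):
%   alignment:  w . w^* >= M*gamma   =>   ||w||   >= M*gamma
%   growth:     ||w||^2 <= M*R^2
\[
  M^2\gamma^2 \;\le\; \|w\|^2 \;\le\; M R^2,
  \qquad
  M = 11,\ \gamma = 1,\ R^2 = 16
  \;\Longrightarrow\;
  121 \;\le\; \|w\|^2 \;\le\; 176.
\]
```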
B.1.4. Conclusion
The Perceptron algorithm converges after a finite number of updates. The mistake count is bounded by $R^2/\gamma^2$, demonstrating the efficiency of the algorithm.
C. The Dual Problem in Optimization
C.1. Introduction to the Dual Problem
In optimization, many problems involve minimizing or maximizing a function subject to constraints. The dual problem is an alternative represen-
tation of the original optimization problem, known as the primal problem. Solving the dual problem provides valuable insights and sometimes
computational advantages.
The dual problem arises naturally when applying Lagrangian methods to incorporate constraints into the objective function. It involves expressing
the primal problem in terms of dual variables (Lagrange multipliers), leading to a new optimization problem.
The general primal problem is
$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \dots, m,$$
where:
• $f(x)$ is the objective function,
• $g_i(x)$ are the inequality constraint functions.
The Lagrangian is
$$L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x),$$
where $\alpha_i \geq 0$ are the Lagrange multipliers (dual variables), and the dual function is $q(\alpha) = \min_x L(x, \alpha)$.
The Lagrangian combines the objective function and the constraints into a single function.
• For any α ≥ 0, q(α) provides a lower bound on the optimal value of the primal problem.
• The dual function is concave in α, even if the primal problem is not convex.
Slater’s Condition
For the primal problem to satisfy Slater’s condition:
1. Convexity: The objective function f (x) must be convex, and the inequality constraint functions gi (x) must also be convex.
2. Feasibility: There must exist a point $x_0$ in the feasible region (the domain of the problem) such that $g_i(x_0) < 0$ for all $i$.
3. Equality Constraints: If equality constraints hj (x) = 0 exist, they must be affine (linear).
In simpler terms, Slater’s condition requires the existence of a strictly feasible point for the inequality constraints, i.e., a point that satisfies
all constraints but does so with strict inequality for the inequalities.
1. Simplified Constraints: The dual problem often has simpler constraints compared to the primal problem.
2. Reduced Dimensionality: If the primal problem involves many variables but few constraints, the dual problem involves fewer variables.
3. Insights into Optimal Solutions: The dual variables α often provide meaningful interpretations, such as the sensitivity of the objective to
constraint violations.
4. Kernelization: For certain problems, the dual formulation depends on dot products, enabling the use of kernel functions to handle non-linear
cases.
Minimizing L with respect to w and b, and maximizing with respect to α, yields the dual problem:
$$\max_{\alpha \geq 0} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j). \tag{C.9}$$
This formulation depends only on dot products, enabling the use of kernel functions in non-linear cases.
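For completeness, the stationarity conditions used to eliminate the primal variables are the standard ones:

```latex
% Setting the Lagrangian's partial derivatives to zero:
\[
  \frac{\partial L}{\partial w} = 0
  \;\Longrightarrow\;
  w = \sum_{i=1}^{n} \alpha_i y_i x_i,
  \qquad
  \frac{\partial L}{\partial b} = 0
  \;\Longrightarrow\;
  \sum_{i=1}^{n} \alpha_i y_i = 0.
\]
% Substituting w back into L yields the dual objective (C.9); the second
% condition becomes an additional constraint of the dual problem.
```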
C.1.8. Conclusion
The dual problem provides a powerful framework for solving constrained optimization problems, often simplifying computation and offering deeper
insights. By leveraging duality, we can address complex problems like support vector machines and kernelized learning in a principled and efficient
manner.
D. Lagrangian Multipliers and Their Role in Support Vector
Machines
D.1. Introduction to Lagrangian Multipliers
In constrained optimization problems, the goal is to optimize an objective function f (w) subject to constraints gi (w) ≤ 0. Lagrangian multipliers
provide a systematic method to incorporate these constraints into the optimization process by defining a single function called the Lagrangian
function.
The general form of a constrained optimization problem is:
$$\min_w f(w) \quad \text{subject to} \quad g_i(w) \leq 0, \; i = 1, \dots, n,$$
and the associated Lagrangian is
$$L(w, \alpha) = f(w) + \sum_{i=1}^{n} \alpha_i g_i(w),$$
where $\alpha_i \geq 0$ are the Lagrange multipliers.
The Lagrangian function combines the objective and constraints into a single expression. By solving for the optimal values of w and α, we ensure
that the constraints are satisfied at the optimal solution.
At the optimum, the Karush-Kuhn-Tucker (KKT) conditions hold:
1. Primal feasibility: $g_i(w) \leq 0$.
2. Dual feasibility: $\alpha_i \geq 0$.
3. Complementary slackness: $\alpha_i \, g_i(w) = 0$.
4. Stationarity: $\nabla_w L(w, \alpha) = 0$.
For the hard-margin SVM, the primal problem is:
$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \tag{D.4}$$
$$\text{subject to} \quad y_i(w^\top x_i + b) \geq 1, \quad \forall i. \tag{D.5}$$
Here:
• $\frac{1}{2}\|w\|^2$ is the objective function, minimizing the norm of $w$ to maximize the margin.
• $y_i(w^\top x_i + b) \geq 1$ ensures that all points are correctly classified with a margin of at least 1.
The constraints can be incorporated using Lagrangian multipliers αi ≥ 0, leading to the Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^\top x_i + b) - 1 \right]. \tag{D.6}$$
Complementary slackness requires:
$$\alpha_i \left[ y_i(w^\top x_i + b) - 1 \right] = 0, \quad \forall i. \tag{D.10}$$
This condition indicates:
• If αi > 0, the point xi lies exactly on the margin, yi (w> xi + b) = 1.
• If αi = 0, the point is either correctly classified and outside the margin or irrelevant to the decision boundary.
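This behavior can be observed numerically. The following sketch (an illustration using scikit-learn, not part of the original notes) fits a nearly hard-margin linear SVM and inspects which points receive nonzero multipliers:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable dataset; values are invented for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates the hard margin
clf.fit(X, y)
print(clf.support_vectors_)  # only the margin points (alpha_i > 0) appear here
print(clf.dual_coef_)        # the corresponding alpha_i * y_i values
```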
D.1.6. Conclusion
Lagrangian multipliers play a crucial role in SVMs by incorporating margin constraints into the objective function. The dual formulation derived
using Lagrangian multipliers enables efficient optimization and kernelization, while complementary slackness ensures that only support vectors
contribute to the decision boundary. This elegant framework underpins the success of SVMs in both linear and non-linear classification tasks.
E. Hinge Loss vs. Logistic Loss
E.1. Hinge Loss
Hinge loss is primarily used in Support Vector Machines (SVMs), especially for classification tasks where the goal is to maximize the margin between
classes. It penalizes misclassified points and those inside the margin.
Equation:
L(y, f (x)) = max(0, 1 − y · f (x))
where:
• y is the true label (−1 or +1)
E.2. Logistic Loss
Logistic loss (cross-entropy) is used in logistic regression and compares a predicted probability with the true label:
$$L(y, p) = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$
where:
• $y$ is the true label (0 or 1 in this encoding)
• $p$ is the predicted probability of the positive class, often given by $\sigma(f(x)) = \frac{1}{1 + e^{-f(x)}}$
Explanation:
• The loss increases as the predicted probability $p$ deviates from the true label $y$.
• It's continuous and differentiable, making it suitable for gradient-based optimization methods.
E.3. Comparison and Contrast
• Output:
– Hinge loss works with a signed distance from the decision boundary.
– Logistic loss deals with probability outputs.
• Shape:
– Hinge loss has a piecewise linear form, with a flat section at zero for correct classifications.
– Logistic loss is smooth and continuous, with no flat regions.
• Use Cases:
– Hinge loss is tailored for SVM, emphasizing a large margin.
– Logistic loss is preferred in logistic regression, focusing on probabilistic interpretations.
• Gradient Properties:
– Hinge loss has a constant gradient where it’s non-zero, which can be less sensitive to outliers.
– Logistic loss provides a gradient that smoothly decreases as the prediction moves towards the correct classification.
• Logistic regression minimizes the log loss (the negative log conditional likelihood).
• The hinge loss has a "hinge" at $y \cdot f(x) = 1$, where the loss starts to increase linearly.

Figure E.1.: Hinge Loss vs. Logistic Loss (plotted together with the 0-1 loss for reference).
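A small plotting sketch (invented for illustration; scaling the logistic loss by $\log 2$ is a common normalization so that it equals 1 at zero margin) reproduces the qualitative picture in Figure E.1:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 3, 500)            # margin m = y * f(x)
zero_one = (m <= 0).astype(float)
hinge = np.maximum(0.0, 1.0 - m)
logistic = np.log2(1.0 + np.exp(-m))   # base-2 scaling: passes through 1 at m = 0

plt.plot(m, zero_one, label="0-1 loss")
plt.plot(m, hinge, label="hinge loss")
plt.plot(m, logistic, label="logistic loss")
plt.xlabel("margin y * f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```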
F. Decision Tree
F.1. Introduction
A lot of nature books include diagrams that help identify animals. These diagrams often resemble decision trees, or dichotomous classifiers. To
classify a new example (e.g., a new animal), we start at the root node of the tree. At each node, there is a question about a specific feature of the
example. The outgoing edges are labeled with all possible answers[1].
We choose the correct answer for our example and follow the corresponding edge to a new node (one of the current node’s children). That new
node contains another question, and the process repeats. Eventually, we reach a leaf, which provides the predicted class for the example.
If we do not reach a given node in our path through the tree, we never ask that node’s question. For example, we never ask if an iguana has
wings. Often, an example will reach only a small fraction of the nodes in the tree. In a balanced tree of height $h$, a path from root to leaf visits approximately $h$ nodes, out of a total of $2^h - 1$.
F.2.1. Advantages
• Interpretability: Decision trees provide clear explanations for every prediction made as they rely on a simple logical function of the input
features.
• Minimal Data Preprocessing: They require minimal data preprocessing (e.g., no need to scale or normalize features).
F.2.2. Disadvantages
• Accuracy: Decision trees often do not achieve the same prediction performance as less interpretable models.
• Overfitting: Large trees can overfit the training data, especially when the dataset is small or noisy.
• Complexity: If input features are not interpretable (e.g., individual pixels in an image), the tree itself loses interpretability.
Accuracy can be improved by using ensembles of decision trees, such as Random Forests or Gradient Boosted Trees. However, this reduces
interpretability when multiple trees are combined.
F.3. Decision Stumps
F.3.1. Example
Consider predicting whether an animal is a parrot based on two features: whether it flies and whether it likes crackers. Here is some training data:
F.4. Information Gain
Weighted Average: The weights are proportional to the sizes of the subsets:
• flies = T: 2 examples out of 5.
• flies = F: 3 examples out of 5.
The conditional entropy is:
$$H(S \mid \text{flies}) = \frac{2}{5} \cdot H(S \mid \text{flies}=T) + \frac{3}{5} \cdot H(S \mid \text{flies}=F).$$
Substituting:
$$H(S \mid \text{flies}) = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot 0.918 \approx 0.551.$$
Using $\log_2 \frac{3}{5} \approx -0.736$ and $\log_2 \frac{2}{5} \approx -1.322$, the entropy of the full dataset is
$$H(S) = -\left( \frac{3}{5} \cdot (-0.736) + \frac{2}{5} \cdot (-1.322) \right) \approx 0.970.$$
Final Answer: The information gain of splitting on flies is approximately
$$IG(S, \text{flies}) = H(S) - H(S \mid \text{flies}) \approx 0.970 - 0.551 = 0.419 \text{ bits}.$$
For the subset with crackers = T:
$$P(\text{parrot}=T \mid \text{crackers}=T) = \frac{1}{2}, \qquad P(\text{parrot}=F \mid \text{crackers}=T) = \frac{1}{2}.$$
Entropy for this subset:
$$H(S \mid \text{crackers}=T) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right).$$
Since $\log_2 \frac{1}{2} = -1$, we get:
$$H(S \mid \text{crackers}=T) = -\left( \frac{1}{2} \cdot (-1) + \frac{1}{2} \cdot (-1) \right) = 1.$$
For the subset with crackers = F:
$$P(\text{parrot}=T \mid \text{crackers}=F) = \frac{2}{3}, \qquad P(\text{parrot}=F \mid \text{crackers}=F) = \frac{1}{3}.$$
Entropy for this subset:
$$H(S \mid \text{crackers}=F) = -\left( \frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3} \right).$$
Using $\log_2 \frac{2}{3} \approx -0.585$ and $\log_2 \frac{1}{3} \approx -1.585$, we compute:
$$H(S \mid \text{crackers}=F) = -\left( \frac{2}{3} \cdot (-0.585) + \frac{1}{3} \cdot (-1.585) \right) \approx 0.918.$$
Weighted Average: The weights are proportional to the sizes of the subsets:
• crackers = T: 2 examples out of 5.
• crackers = F: 3 examples out of 5.
The conditional entropy is:
$$H(S \mid \text{crackers}) = \frac{2}{5} \cdot H(S \mid \text{crackers}=T) + \frac{3}{5} \cdot H(S \mid \text{crackers}=F).$$
Substituting:
$$H(S \mid \text{crackers}) = \frac{2}{5} \cdot 1 + \frac{3}{5} \cdot 0.918 \approx 0.951.$$
Final Answer: The information gain of splitting on crackers is approximately
$$IG(S, \text{crackers}) = H(S) - H(S \mid \text{crackers}) \approx 0.970 - 0.951 \approx 0.02 \text{ bits},$$
far less than the gain from splitting on flies.
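The computations above can be checked with a short script. The five-example dataset below is one assignment consistent with the counts used in this section (the actual table from the notes is not reproduced here):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, feature):
    """Information gain of splitting `labels` on a binary `feature`."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Five examples matching the counts above: 3 parrots, 2 non-parrots.
parrot   = np.array([1, 1, 1, 0, 0])
flies    = np.array([1, 1, 0, 0, 0])   # flies = T subset is pure (2 parrots)
crackers = np.array([1, 0, 0, 1, 0])   # crackers = T: one of each class
print(info_gain(parrot, flies))     # ~0.42 bits
print(info_gain(parrot, crackers))  # ~0.02 bits
```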
F.8. Managing Complexity
• Early Stopping: Stop splitting when further splits do not improve validation accuracy.
• Pruning: Grow the tree fully, then simplify it by merging nodes that do not improve accuracy on a holdout set.
F.9. Conclusion
Decision trees are a versatile and interpretable model. However, they can overfit or underperform compared to more complex models. Techniques
like pruning, early stopping, and ensemble methods improve their performance.
G. Loss Functions in Machine Learning
G.1. Introduction
Loss functions play a critical role in machine learning by quantifying the difference between predicted and actual values. They guide optimiza-
tion algorithms in adjusting model parameters to minimize errors, thereby improving performance. Loss functions can be broadly classified into
regression, classification, and hybrid categories. This appendix discusses some of the most commonly used loss functions in detail.
MSE penalizes large errors more significantly, making it sensitive to outliers. It is widely used in regression tasks.
Unlike MSE, MAE is robust to outliers but may lead to slower convergence during optimization.
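A quick sketch (assuming the standard definitions $MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$, since the formulas themselves are not reproduced above) shows the sensitivity difference on data with one outlier:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 10.0])

mse = np.mean((y_true - y_pred) ** 2)   # dominated by the squared outlier error
mae = np.mean(np.abs(y_true - y_pred))  # grows only linearly with the outlier
print(mse, mae)
```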
For multi-class classification, the categorical cross-entropy loss is
$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}.$$
Here, $K$ is the number of classes, $y_{i,k}$ is the true (one-hot) label, and $\hat{y}_{i,k}$ is the predicted probability for class $k$.
G.4. Loss Functions for Hybrid Tasks
This loss function focuses on reducing the impact of misclassified samples during training.
G.5. Conclusion
Loss functions are integral to machine learning, shaping how models learn from data. Choosing an appropriate loss function depends on the nature
of the problem, the type of data, and the desired trade-offs between accuracy, robustness, and interpretability. As machine learning continues to
evolve, developing new and optimized loss functions remains an active area of research.
Bibliography
[1] Decision Tree. URL: https://www.cs.cmu.edu/~aarti/Class/10701_Spring21/Lecs/decision-trees.pdf.