Unit 4 ML
Here's an explanation of the chapters and sections from the given references:
3.1: Concept Learning and the General-to-Specific Ordering: Introduces the idea of
concept learning, where the goal is to find a target concept from examples. It
explains how concepts can be learned by generalizing from specific instances.
3.2: Finding Hypotheses: This section discusses how to form hypotheses that are
consistent with the training examples. It covers learning algorithms that search the
hypothesis space for hypotheses that fit the examples.
3.4: The Decision Tree Learning Algorithm: Introduces decision tree learning, a key
algorithm in machine learning. It focuses on how decision trees are built and
optimized to make decisions based on input features.
6.1: Linear Models for Regression: Discusses linear regression models, which are
used for predicting continuous values by establishing a linear relationship between
input variables and the output.
6.2: Logistic Regression: Explains logistic regression, which is used for binary
classification tasks. It covers the application of the sigmoid function to predict class
probabilities.
6.7: Support Vector Machines (SVMs): Introduces SVMs, a powerful algorithm for
classification. It explains how SVMs work by finding the hyperplane that best
separates the classes.
4.3.1: Overfitting and Underfitting: Discusses the concepts of overfitting (model
too complex) and underfitting (model too simple), and how to find the balance
between the two to ensure good model performance.
4.3.3: Bias-Variance Tradeoff: Explains the tradeoff between bias and variance, two
key sources of error in machine learning models.
6.1: Linear Models for Regression: Similar to Mitchell's explanation, this section
focuses on linear regression and the assumption of a linear relationship between
dependent and independent variables.
6.2: Logistic Regression: Covers logistic regression in detail, including the use of the
logistic function to model binary outcomes.
6.7: Support Vector Machines (SVMs): In this section, SVMs are introduced as a
powerful method for both classification and regression problems.
6.9: Neural Networks: Explores neural networks, with a focus on how the
architecture of neurons and layers enables complex pattern recognition in machine
learning tasks.
2. Chapter 9 (9.1, 9.2, 9.3, 9.4):
9.2: Bias-Variance Tradeoff: Covers the concept of bias and variance in model
errors, explaining how to manage the tradeoff to ensure that models generalize well
to new data.
9.3: Model Complexity and Regularization: Discusses the role of model complexity
in overfitting and underfitting, and introduces regularization techniques to prevent
overfitting.
9.4: Decision Trees and Random Forests: Explains decision trees and introduces the
random forest algorithm as an ensemble method to improve predictive accuracy.
3. Chapter 5 (5.1):
This breakdown provides a detailed understanding of the chapters and sections based on
the respective references.
The explanation provided gives a detailed overview of Decision Trees. Here's a structured
breakdown of the key points:
Structure: Each internal node represents a decision or test based on a feature, while each
branch represents the outcome of that decision. Leaf nodes represent either class labels
(in classification) or predicted values (in regression).
Decision Nodes: Nodes that perform a test on a feature and split the data into subsets
based on the outcome of that test.
Leaf Nodes: The final nodes that represent the predicted output. For classification, it's
the class label; for regression, it's the predicted value, usually the mean of the target
values in the subset.
Branches: The connections between nodes that represent the flow of control, guiding
the path from one decision to the next.
How the Tree Is Built:
1. Select the Best Feature: Choose the feature (and split point) whose split produces the
purest subsets. Purity refers to how homogeneous the target variable is within each subset.
In classification, a pure node would mostly contain data points from a single class.
2. Split the Data: Partition the data based on the selected feature.
3. Repeat Recursively: Keep splitting each resulting subset until either:
Each subset contains data points that all belong to the same class, or
A stopping criterion (e.g., maximum depth, minimum samples per leaf) is met.
4. Classify (or Predict): Classify new data by following the decision path from the root to
the appropriate leaf.
Entropy: A measure of disorder or impurity in the dataset. The goal is to reduce
entropy with each split.
Information Gain: The reduction in entropy after splitting the data on a feature. The
feature with the highest information gain is chosen.
Gini Index: A measure of impurity in a dataset, used by the CART algorithm. The feature that
minimizes the Gini index is chosen for splitting.
Variance Reduction: In regression tasks, the goal is to minimize variance within the subsets. A
feature's variance reduction indicates how well it splits the data for prediction purposes (a
small sketch of the entropy-based criteria follows this list).
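To make the split criteria above concrete, here is a minimal Python sketch (not from the source) of computing entropy and information gain for a candidate split; the toy label array is purely illustrative.

```python
# A minimal sketch (not from the source) of the entropy / information-gain
# calculation used to pick the best split, given a small numpy label array.
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_labels, right_labels):
    """Reduction in entropy achieved by splitting `labels` into two subsets."""
    n = len(labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(labels) - weighted

# Example: a candidate binary split on a toy set of class labels
y = np.array([0, 0, 1, 1, 1, 0])
print(information_gain(y, y[:3], y[3:]))
```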
Pruning: A technique used to reduce the complexity of the tree and prevent overfitting.
There are two types of pruning:
Pre-pruning (Early Stopping): Limits the size of the tree during training by setting
constraints like maximum depth or minimum samples per leaf.
Post-pruning: After building the tree, branches that do not improve the model’s
performance are removed.
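As an illustration of both pruning styles, the sketch below uses scikit-learn's DecisionTreeClassifier, where pre-pruning corresponds to constraints such as max_depth and min_samples_leaf, and post-pruning to cost-complexity pruning via ccp_alpha; the dataset and parameter values are placeholders, not recommendations from the source.

```python
# Hedged illustration of pre- and post-pruning with scikit-learn; the data and
# parameter values are placeholders.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning (early stopping): constrain the tree while it is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the tree, then remove weak branches via cost-complexity
# pruning. A larger ccp_alpha removes more branches.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```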
Advantages:
Interpretability: Decision trees are easy to understand and interpret, and their structure
can be visualized.
Handling Non-Linearity: They don’t require a linear relationship between features and
the target.
No Data Normalization: Unlike many other algorithms, decision trees don't require feature
scaling or normalization.
Disadvantages:
Overfitting: Decision trees can overfit if they are too deep or complex.
Instability: Small changes in the data can lead to a completely different tree being
generated.
Bias towards Dominant Classes: In imbalanced datasets, decision trees may favor the
majority class, leading to biased predictions.
This summary encapsulates the working principles, structure, construction, and evaluation of
decision trees, along with the challenges like overfitting and techniques like pruning to
mitigate them.
Naïve Bayes Classifier
The Naïve Bayes classifier applies Bayes' Theorem to compute the posterior probability of a
class Ck given a feature vector X:
P(Ck | X) = P(X | Ck) · P(Ck) / P(X)
Where:
P(Ck | X) is the posterior probability of class Ck given the features.
P(X | Ck) is the likelihood of the features given the class.
P(Ck) is the prior probability of the class.
P(X) is the evidence (the probability of the features), which acts as a normalizing constant.
Under the "naive" assumption that the features are conditionally independent given the class,
the likelihood factorizes as:
P(X | Ck) = ∏_{i=1}^{n} P(xi | Ck)
Where xi is the value of the i-th feature, and Ck is the class label.
Class Prior Probabilities P(Ck): The prior probability of each class can be estimated as
the relative frequency of each class in the training data. If there are m instances in total
and Nk of them belong to class Ck, then:
P(Ck) = Nk / m
Conditional Probabilities P (xi ∣ Ck ): For each feature xi , the likelihood P (xi ∣ Ck ) is
estimated from the training data. The estimation depends on the type of feature:
P(xi | Ck) = (1 / √(2π σk²)) · exp( −(xi − μk)² / (2σk²) )
Where μk and σk are the mean and standard deviation of feature xi for class Ck.
Multinomial Naive Bayes: Used for text classification tasks where features are
represented as word counts or term frequencies. The likelihood P(xi | Ck) is modeled
using a multinomial distribution over the counts.
Bernoulli Naive Bayes: Suitable for binary/boolean features, where each feature can
either be present or absent (e.g., whether a word appears in a document). The likelihood
P (xi ∣ Ck ) is modeled using a Bernoulli distribution.
Gaussian Naive Bayes: Suitable for continuous features, where each feature is assumed
to follow a Gaussian (normal) distribution. The likelihood P(xi | Ck) is modeled using the
Gaussian density shown above.
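For illustration, the three variants map directly onto scikit-learn's GaussianNB, MultinomialNB and BernoulliNB classes; the toy arrays below are made up.

```python
# Hedged sketch of the three Naive Bayes variants in scikit-learn; the toy
# arrays are illustrative only.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian NB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.0]]))

# Word-count features -> Multinomial NB
X_counts = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 2, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# Binary presence/absence features -> Bernoulli NB
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))
```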
4. Advantages and Disadvantages of Naive Bayes
Advantages:
Simple and Fast: Naïve Bayes is easy to implement, and the training process is
computationally efficient, even with large datasets.
Scalability: It works well with high-dimensional data, like text data, and can handle large
datasets.
Works Well with Small Datasets: Even with small amounts of data, Naïve Bayes can
provide good classification results, particularly in text classification.
Handles Missing Data: It can handle missing values by ignoring the feature in the
likelihood computation if necessary.
Disadvantages:
Strong Independence Assumption: The Naïve Bayes classifier assumes that features are
independent given the class label, which is often not the case in real-world data. This can
reduce performance if the features are highly correlated.
Poor Performance with Highly Dependent Features: If features are strongly dependent
on each other (e.g., in image classification tasks), Naïve Bayes may perform poorly
compared to more complex models.
Zero Probability Problem: If a feature value was not observed in the training data for a
particular class, the probability of that feature given the class will be zero, leading to a
zero likelihood for the entire instance. This issue can be handled using Laplace
smoothing.
With Laplace (add-one) smoothing, the conditional probability is estimated as:
P(xi | Ck) = (count(xi, Ck) + 1) / (count(Ck) + V)
Where:
count(xi, Ck) is the count of feature value xi in class Ck.
count(Ck) is the total count of feature occurrences in class Ck.
V is the number of possible values the feature can take (e.g., the vocabulary size for text).
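A minimal sketch of the smoothed estimate above, with made-up counts; smoothed_likelihood is a hypothetical helper, not a library function.

```python
# Minimal sketch of Laplace (add-one) smoothing for a categorical feature.
# The counts below are made up for illustration.
def smoothed_likelihood(count_xi_ck, count_ck, V):
    """P(x_i | C_k) with add-one smoothing over V possible feature values."""
    return (count_xi_ck + 1) / (count_ck + V)

# A value never seen with class C_k still gets a small non-zero probability:
print(smoothed_likelihood(0, 50, V=3))   # 1 / 53 instead of 0
print(smoothed_likelihood(20, 50, V=3))  # (20 + 1) / (50 + 3)
```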
Summary
The Naïve Bayes classifier is a probabilistic model based on Bayes' Theorem that simplifies
classification by assuming feature independence. It is fast, scalable, and effective for tasks
like text classification, but its assumption of conditional independence can limit its
performance when features are strongly correlated. It is particularly powerful with small
datasets and can handle missing values, but it may struggle with highly dependent features.
Laplace smoothing can address the zero probability problem encountered during training.
Logistic Regression
Logistic Regression is a foundational algorithm for binary classification tasks. Despite its
name, it is a classification technique, not a regression technique. It models the probability
that a given input belongs to a particular class using a linear combination of input features.
The model uses the logistic (sigmoid) function to convert the linear output into a
probability.
The logistic (sigmoid) function is an S-shaped curve that maps any real-valued input to a
value between 0 and 1, making it ideal for modeling probabilities in binary classification. It is
defined as:
σ(z) = 1 / (1 + e^(−z))
Where:
σ(z) is the output of the logistic function, representing the predicted probability of the
positive class.
z = β0 + β1x1 + β2x2 + … + βpxp is the linear combination of the input features.
The logistic function maps z (which can range from −∞ to +∞) to a probability value
between 0 and 1. This output represents the probability that an instance belongs to the
positive class (Y = 1):
P (Y = 1 ∣ X) = σ(z)
P (Y = 0 ∣ X) = 1 − P (Y = 1 ∣ X)
The model parameters β0, β1, …, βp determine the position and orientation of the
decision boundary. The goal is to choose these parameters such that the model predicts the
class probabilities accurately.
To estimate the parameters, Maximum Likelihood Estimation (MLE) is used, where the
likelihood function is maximized. The likelihood function for logistic regression is:
L(β0, β1, …, βp) = ∏_{i=1}^{n} [ P(Yi = 1 | Xi)^(yi) · (1 − P(Yi = 1 | Xi))^(1−yi) ]
Where yi ∈ {0, 1} is the observed class label of the i-th training instance and Xi is its
feature vector.
Once the parameters are estimated, the logistic regression model computes the probability
P (Y = 1 ∣ X) for an input vector X = (x1 , x2 , … , xp ). If this probability exceeds a
chosen threshold (usually 0.5), the model classifies the instance as belonging to the positive
class (class 1). Otherwise, it classifies it as class 0.
Class Prediction = 1 if σ(z) > 0.5, otherwise 0
The threshold σ(z) = 0.5 corresponds to z = 0, which defines the decision boundary
separating the feature space into two regions: one where the model predicts class 1
and the other where it predicts class 0.
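A minimal sketch of this prediction rule; the fitted coefficients and the input instance below are made up for illustration.

```python
# Minimal sketch of the logistic prediction rule; the coefficients and the
# input instance are made up for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0 = -1.5                       # intercept (illustrative)
beta = np.array([0.8, -0.4, 2.1])  # feature coefficients (illustrative)
x = np.array([1.0, 2.0, 0.5])      # one input instance

z = beta0 + beta @ x
p = sigmoid(z)                     # P(Y = 1 | X)
prediction = 1 if p > 0.5 else 0
print(p, prediction)
```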
4. Model Evaluation
To evaluate the performance of a logistic regression model, several metrics are used, such
as:
Accuracy: The proportion of correctly classified instances among all instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
Precision: Proportion of true positives among all predicted positives.
Precision = TP / (TP + FP)
Recall: Proportion of true positives among all actual positives.
Recall = TP / (TP + FN)
F1 Score: The harmonic mean of precision and recall, balancing the trade-off between
the two.
F1 = 2 · (Precision · Recall) / (Precision + Recall)
ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the true
positive rate vs. the false positive rate at various thresholds, and the Area Under the
Curve (AUC) provides an aggregate measure of model performance across all
thresholds.
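These metrics can be computed with scikit-learn; in the sketch below the true labels and predicted probabilities are made up.

```python
# Hedged sketch of the evaluation metrics above using scikit-learn; the labels
# and predicted probabilities are made up.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3]   # P(Y = 1 | X)
y_pred = [1 if p > 0.5 else 0 for p in y_prob]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))   # uses the raw scores
```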
To prevent overfitting, especially when the model has many features or when features are
highly correlated, regularization can be applied:
L2 Regularization (Ridge): Adds a penalty term proportional to the square of the
coefficients:
L2_penalty = λ ∑_{j=1}^{p} βj²
Where λ is the regularization parameter.
L1 Regularization (Lasso): Adds a penalty term proportional to the absolute value of the
coefficients:
L1_penalty = λ ∑_{j=1}^{p} |βj|
L1 regularization can lead to sparse models where some coefficients are driven to zero,
effectively performing feature selection.
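For illustration, scikit-learn's LogisticRegression applies this kind of regularization through the C parameter, which is the inverse of the regularization strength (roughly C = 1/λ); the data and values below are placeholders.

```python
# Hedged sketch of L2- and L1-regularized logistic regression in scikit-learn.
# Note that scikit-learn's C is the inverse of the regularization strength
# (roughly C = 1/lambda); the values and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

ridge_like = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
lasso_like = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

# L1 can drive some coefficients exactly to zero (implicit feature selection).
print((lasso_like.coef_ == 0).sum(), "coefficients set to zero by L1")
```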
When there are more than two classes, multinomial logistic regression (or softmax
regression) is used. Instead of predicting the probability of a binary outcome, the model
predicts the probability of each of the K possible classes using the softmax function:
P(Y = k | X) = e^(βk·X) / ∑_{j=1}^{K} e^(βj·X)
Where βk is the coefficient vector for class k and K is the total number of classes.
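A minimal numpy sketch of the softmax computation above; the class scores are made up.

```python
# Minimal sketch of the softmax function used in multinomial logistic
# regression; the class scores below are made up.
import numpy as np

def softmax(scores):
    """Convert K real-valued class scores (beta_k . X) into probabilities."""
    exp_scores = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])   # beta_k . X for K = 3 classes
probs = softmax(scores)
print(probs, probs.sum())            # probabilities sum to 1
```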
K-Nearest Neighbor (KNN) is a simple, non-parametric algorithm used for classification tasks.
It works by assigning the class of a data point based on the classes of its k -nearest neighbors
in the feature space.
1. Distance Metric: To classify a data point, KNN calculates the distance between the point
and all others in the training set, usually using the Euclidean distance:
distance(X, X′) = √( ∑_{i=1}^{p} (xi − x′i)² )
2. Neighbor Selection: The k training points with the smallest distances to the input point
are selected.
3. Class Assignment: The class of the input point is assigned based on the majority class of
the k-nearest neighbors.
Advantages:
No training phase.
Disadvantages:
Prediction can be slow on large datasets, since distances to all stored training points must
be computed, and performance is sensitive to the choice of k and to irrelevant features.
The K-Nearest Neighbors (K-NN) classifier is a simple, intuitive machine learning algorithm
used for both classification and regression tasks. It is based on the idea that similar
instances should have similar outcomes. The K-NN algorithm classifies a data point based on
the majority class of its 'K' nearest neighbors in the feature space. It is an example of an
instance-based learning method, where the model does not generalize the training data but
instead memorizes it and makes predictions only when new data is encountered.
If K=1, the classifier will assign the class label of the nearest neighbor to the test
instance. For larger K values, the algorithm uses a majority vote of the K nearest
neighbors.
To find the nearest neighbors, a distance metric is used to measure how far apart two
instances are in the feature space. The most common distance metric is Euclidean distance,
but other metrics such as Manhattan distance, Minkowski distance, or cosine similarity can
also be used, depending on the problem.
For example, the Euclidean distance between two data points x = (x1, x2, …, xn) and
y = (y1, y2, …, yn) is:
Distance(x, y) = √( ∑_{i=1}^{n} (xi − yi)² )
Once the distances between the test instance and all training instances are computed, the
algorithm selects the K nearest neighbors—those with the smallest distances.
For classification, the predicted class label of the test instance is determined by a majority
vote among the K nearest neighbors. The class that occurs most frequently among the
neighbors is assigned to the test instance.
For instance, if K = 5 and the classes of the nearest neighbors are {A, A, B, A, C}, the test
instance will be classified as class A because it is the majority class among the 5 nearest
neighbors.
Step 5: Output
The predicted class label for the test instance is returned as the output.
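A minimal end-to-end sketch of these steps in numpy; the training points, labels and query point are made up.

```python
# Minimal sketch of the K-NN steps above (distances, nearest neighbours,
# majority vote); the training data and query point are made up.
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
y_train = np.array(["A", "A", "B", "B", "A"])
x_query = np.array([1.4, 1.6])
k = 3

# Step 2: Euclidean distance to every training instance
distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

# Step 3: indices of the k smallest distances
nearest = np.argsort(distances)[:k]

# Step 4: majority vote among the k nearest neighbours
prediction = Counter(y_train[nearest]).most_common(1)[0][0]
print(prediction)   # -> "A"
```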
Advantages:
Simple and Intuitive: K-NN is easy to understand and implement. It does not require
training in the traditional sense, as it is an instance-based algorithm.
Non-Parametric: K-NN does not assume any underlying data distribution, which makes
it versatile and applicable to many types of data.
Versatile: It works well for both binary and multiclass classification problems and can
handle continuous or categorical data.
Disadvantages:
Computationally Expensive: Distances to every stored training instance must be computed at
prediction time, which is slow for large datasets. This issue is exacerbated in high-dimensional
datasets, leading to the curse of dimensionality.
Memory Intensive: Since K-NN is a lazy learner, it requires storing the entire training
dataset in memory during prediction, which can be problematic for large datasets.
Sensitive to Irrelevant Features: The performance of K-NN can degrade in the presence
of irrelevant or unimportant features, particularly in high-dimensional spaces.
Choice of K: The value of K plays a crucial role. A small K value may lead to overfitting,
while a large K value may result in underfitting by oversmoothing the decision
boundary.
Small K (e.g., K=1): A small value makes the model sensitive to noise and overfits,
leading to highly irregular decision boundaries.
Large K: A large K can smooth out the decision boundary and improve generalization.
However, if K is too large, the model may underfit.
Optimal K Selection: The best value for K is typically determined using cross-validation.
By testing various K values and selecting the one that minimizes the error rate on a
validation set, we can improve model performance.
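A hedged sketch of choosing K by cross-validation with scikit-learn; the candidate K values and the dataset are placeholders.

```python
# Hedged sketch of selecting K via cross-validation; the candidate K values
# and dataset are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9, 11]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("Best K:", best_k, "CV accuracy:", round(best_score, 3))
```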
5. Curse of Dimensionality
In high-dimensional datasets, the distance between all points tends to converge, making the
concept of "nearest neighbors" less meaningful. This phenomenon, known as the curse of
dimensionality, can lead to poor performance in K-NN classification.
To mitigate this, dimensionality reduction (e.g., PCA) or feature selection can be applied
before running K-NN, and features should be scaled to comparable ranges.
Let’s consider a dataset of patients where features like age, cholesterol levels, and blood
pressure are used to classify whether a patient has heart disease (binary classification: 0 =
No, 1 = Yes).
1. Train the model by storing the labeled training data.
2. Compute distances between the test instance and all training data points.
3. Select the K nearest neighbors and assign the majority class among their labels.
For example, if K=3, and the nearest neighbors have labels {1, 0, 1}, the predicted
label for the new patient would be 1 (likely to have heart disease).
This simplicity and intuitive nature make K-NN a powerful tool for classification, though it
requires careful handling of the choice of K and dimensionality to optimize its performance.
1. The Perceptron
The Perceptron is one of the simplest models in neural networks, designed for binary
classification tasks, where data points are classified into one of two classes (e.g., 0 or 1). It
serves as a foundation for more complex models like the Multilayer Perceptron (MLP).
Inputs (x₁, x₂, ..., xp): These represent the features of the dataset, where p is the number
of features.
Weights (w₁, w₂, ..., wp): Each input is assigned a weight that determines the importance
of the feature.
Bias (b): This term allows the model to make predictions even when all inputs are zero.
Activation Function (Step function): The Perceptron uses a step function as the
activation function, which outputs either 0 or 1 based on the weighted sum of inputs.
Mathematical Formulation:
Z = w₁x₁ + w₂x₂ + … + wpxp + b
If Z ≥ 0, the output is 1; otherwise, the output is 0.
The training process adjusts the weights and bias to minimize the prediction error. The key
steps are:
1. Initialize the weights and bias (e.g., to zeros or small random values).
2. Compute the output (ŷ) using the current weights and bias.
3. If the prediction is incorrect, update the weights and bias using the rule:
wᵢ ← wᵢ + η(y - ŷ)xᵢ
b ← b + η(y - ŷ)
where η is the learning rate.
4. Repeat the process until the Perceptron converges to a set of weights that correctly
classify all data (or for a pre-set number of iterations).
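A minimal numpy sketch of this training loop on a small linearly separable problem (logical AND); the data, learning rate and number of epochs are illustrative.

```python
# Minimal sketch of the Perceptron learning rule above; the toy dataset
# (logical AND) and learning rate are illustrative.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND targets
w = np.zeros(2)                                  # weights
b = 0.0                                          # bias
eta = 0.1                                        # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        z = w @ xi + b
        y_hat = 1 if z >= 0 else 0               # step activation
        # Update only changes anything when the prediction is wrong
        w += eta * (target - y_hat) * xi
        b += eta * (target - y_hat)

print(w, b)   # weights and bias that implement logical AND
```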
The MLP is an extension of the Perceptron and allows the model to capture more complex
relationships in the data, including non-linear decision boundaries. The MLP consists of
multiple layers of neurons, forming a feedforward network.
Components of MLP:
Input Layer: Receives the feature values and passes them to the first hidden layer.
Hidden Layers: One or more layers that process the input data. These layers are fully
connected to the previous and next layers, enabling the network to learn complex data
representations.
Output Layer: Produces the final output prediction (for binary classification, this is
typically a value between 0 and 1, representing the probability of class 1).
Activation Functions:
Sigmoid function: Maps the input to a value between 0 and 1, often used in the output
layer for binary classification.
ReLU (Rectified Linear Unit): A non-linear function that outputs 0 for negative inputs
and the input value for positive inputs. It's widely used in hidden layers as it speeds up
training and reduces the risk of vanishing gradients.
Forward propagation is the process of computing the output of the MLP by passing the input
through the layers:
1. Input Layer: The input features are passed to the first hidden layer.
2. Hidden Layers: Each hidden neuron computes a weighted sum of the inputs, applies the
activation function, and passes the result to the next layer.
3. Output Layer: The final result is computed by passing the result of the last hidden layer
through the output layer.
In matrix form, the hidden activations can be written as a = f(W₁x + b₁) and the output as
ŷ = g(W₂a + b₂), where W₁ and W₂ are the weight matrices for the hidden and output layers,
b₁ and b₂ are the bias vectors, and f and g are the activation functions.
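A minimal numpy sketch of forward propagation through one hidden layer using the W₁/W₂ notation above; all weights, biases and the input are made up.

```python
# Minimal sketch of forward propagation in a one-hidden-layer MLP, using the
# W1 / W2 notation above; all weights, biases and the input are made up.
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])        # input features (p = 3)

W1 = np.array([[ 0.2, -0.4, 0.1],     # hidden layer: 2 neurons x 3 inputs
               [-0.5,  0.3, 0.8]])
b1 = np.array([0.1, -0.2])

W2 = np.array([[0.7, -1.1]])          # output layer: 1 neuron x 2 hidden units
b2 = np.array([0.05])

a1 = relu(W1 @ x + b1)                # hidden activations
y_hat = sigmoid(W2 @ a1 + b2)         # predicted probability of class 1
print(y_hat)
```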
In summary, the Perceptron is a basic binary classifier with a simple structure, while the
Multilayer Perceptron is a more powerful, multi-layered neural network that can handle more
complex data and learning tasks.
Neural Networks are computational models inspired by the structure and function of the
human brain. They are designed to recognize patterns and make decisions based on data.
These networks consist of interconnected neurons or nodes, which process and learn from
input data, enabling them to perform tasks like pattern recognition and decision-making in
machine learning.
A typical network is organized into three types of layers:
Input layer: Receives the raw input features.
Hidden layers: These layers process the data using neurons that apply weighted sums
and activation functions.
Output layer: Produces the final prediction (e.g., class probabilities or a continuous value).
The learning process in neural networks involves two key stages: Forward Propagation and
Backward Propagation (Backpropagation).
1. Forward Propagation
Forward propagation is the process by which input data moves through the network, from
the input layer to the output layer, producing the final prediction.
1. Input Layer: Data enters the network, typically in the form of raw features (e.g., images,
text, etc.).
2. Weighted Sum: The input data is multiplied by weights associated with the connections
between neurons. For each neuron, the weighted sum of inputs is calculated:
z = w ⋅ x + b,
Where z is the weighted sum, w is the weight vector, x is the input vector, and b
is the bias term.
3. Activation Function: The weighted sum is passed through an activation function, which
introduces non-linearity into the network. Common activation functions include:
ReLU (Rectified Linear Unit): Outputs the input if positive; otherwise, outputs zero.
Tanh (Hyperbolic Tangent): Maps inputs between -1 and 1.
4. Hidden Layers: The process of calculating the weighted sum and applying the activation
function is repeated for each hidden layer. The output of each hidden layer becomes the
input for the next layer.
5. Output Layer: The final layer produces the output. For classification tasks, this might be
a probability distribution (e.g., using the softmax function), and for regression tasks, it
may be a real value.
In forward propagation, data flows from the input layer through hidden layers and finally to
the output layer, where the prediction is made.
2. Backward Propagation (Backpropagation)
Backpropagation adjusts the weights and biases by propagating the error backwards through
the network:
1. Compute the Loss: After forward propagation, we calculate the loss (or error), which
quantifies the difference between the predicted output and the actual target. Common
loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy loss
for classification.
2. Compute Gradients of Loss with Respect to Weights: Using the chain rule of calculus,
we calculate how much each weight contributed to the loss. This step involves
computing the gradient of the loss function with respect to each weight in the network:
∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w), where L is the loss, a is the output of the neuron
(after activation), z is the weighted sum, and w is the weight.
3. Update Weights and Biases: The gradients are used to update the weights and biases.
Typically, an optimization algorithm like Gradient Descent or its variants (e.g.,
Stochastic Gradient Descent (SGD) or Adam) is used to adjust the weights by
subtracting a portion of the gradient:
w ← w - η * ∇L(w),
Where η is the learning rate (a small positive scalar) and ∇L(w) is the gradient of
the loss with respect to the weight.
4. Repeat: Forward and backward propagation steps are repeated for many iterations (or
epochs). With each iteration, the weights are adjusted to minimize the loss, allowing the
network to learn from the data and improve predictions.
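A minimal sketch of the full cycle (forward pass, loss, gradients, update) for a single logistic neuron trained with binary cross-entropy and gradient descent; the data and learning rate are made up, and a real network would repeat the backward step layer by layer.

```python
# Minimal sketch of the forward pass, loss, gradient and weight update for a
# single logistic neuron (binary cross-entropy loss); data and learning rate
# are made up for illustration.
import numpy as np

X = np.array([[0.0, 0.1], [0.9, 1.0], [0.1, 0.2], [1.0, 0.8]])
y = np.array([0, 1, 0, 1])
w, b, eta = np.zeros(2), 0.0, 0.5

for epoch in range(200):
    # Forward propagation
    z = X @ w + b
    y_hat = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Backward propagation: dL/dw and dL/db for sigmoid + cross-entropy
    error = y_hat - y
    grad_w = X.T @ error / len(y)
    grad_b = error.mean()

    # Gradient descent update
    w -= eta * grad_w
    b -= eta * grad_b

print(loss, w, b)
```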
By iterating through these steps, neural networks learn from data, adjusting their internal
parameters (weights and biases) to improve their predictions over time. The combination of
forward and backward propagation enables neural networks to learn complex patterns and
relationships in data, making them powerful tools for a wide range of machine learning
tasks.
What is SVM?
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms,
primarily used for classification tasks, though it can also be applied to regression problems.
The objective of SVM is to find the best decision boundary, called a hyperplane, that
separates data points belonging to different classes in a higher-dimensional space.
SVM works by identifying the extreme data points (or vectors) that are closest to the
hyperplane. These points are called support vectors, and they are critical in defining the
optimal hyperplane. The algorithm aims to maximize the margin between the classes by
positioning the hyperplane in a way that maximizes the separation.
Goal of SVM
Decision Boundary (Hyperplane): The key goal of SVM is to identify the hyperplane that
best divides the data points into two classes, ensuring that the margin between the
closest points of each class is as large as possible.
Support Vectors: These are the data points that are closest to the hyperplane and are
used to define it. The position of the hyperplane depends on the support vectors.
Types of SVM
1. Linear SVM:
Linear SVM is used when the dataset can be divided by a straight line or hyperplane.
In such cases, the data is called linearly separable, and the classifier used is the
Linear SVM classifier.
A linear SVM works by finding a straight line (or hyperplane in higher dimensions)
that separates the data points belonging to two classes.
Linear Hyperplane Equation: The equation of the hyperplane can be written as:
W · X + b = 0
Where W is the weight vector (normal to the hyperplane), X is the input feature
vector, and b is the bias term (which controls the distance from the origin to the
hyperplane).
The distance between a data point xi and the hyperplane is calculated as:
distance = |W · xi + b| / ||W||
where ||W|| represents the Euclidean norm (length) of the weight vector W.
2. Non-linear SVM:
Non-linear SVM is used for datasets that cannot be separated by a straight line. In
such cases, the SVM uses kernel functions to map the data into a higher-
dimensional space, making it linearly separable.
The classifier used for this type of data is the Non-linear SVM classifier.
Purpose of Kernel: The kernel function performs a transformation on the input space so
that the problem becomes linearly separable in the higher-dimensional space.
Common Kernels:
Linear Kernel: Directly uses the input space without transformation (for linearly
separable data).
Radial Basis Function (RBF) Kernel: Uses a Gaussian function to transform the data,
widely used for complex problems.
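To illustrate the kernel idea, the sketch below computes the RBF (Gaussian) kernel value between two points without ever constructing the higher-dimensional mapping; gamma and the points are made up.

```python
# Minimal sketch of the RBF (Gaussian) kernel value between two points;
# gamma and the points are made up for illustration.
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2): similarity in the implicit
    higher-dimensional feature space, computed without mapping explicitly."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print(rbf_kernel(a, a))   # 1.0 -> identical points are maximally similar
print(rbf_kernel(a, b))   # < 1.0 -> similarity decays with distance
```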
Advantages of SVM
1. High-Dimensional Performance: SVM performs well in high-dimensional spaces,
making it suitable for tasks like image classification and gene expression analysis.
2. Nonlinear Capability: Through kernel functions, SVM can handle complex non-linear
relationships, allowing it to work well on non-linearly separable data.
3. Outlier Resilience: The soft margin feature allows SVM to be robust against outliers,
which is useful in tasks like spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM can be extended to both binary and multiclass
classification problems, making it versatile for various classification tasks.
5. Memory Efficiency: SVM only needs to store the support vectors, making it more
memory-efficient than other algorithms, especially when the dataset is sparse.
Disadvantages of SVM
1. Slow Training: SVM can be computationally expensive and slow when working with large
datasets, especially for data mining tasks.
2. Parameter Tuning Difficulty: Selecting the appropriate kernel function and tuning
hyperparameters like the C parameter and kernel-specific parameters require careful
experimentation.
3. Noise Sensitivity: SVM can struggle when the dataset has noise or when there are many
overlapping classes, affecting its performance.
4. Feature Scaling Sensitivity: SVM is sensitive to the scaling of features, and performance
can degrade if features are not properly scaled.
How SVM Works:
1. Finding the Hyperplane: The algorithm searches for the hyperplane that maximizes the
margin between the two classes, with the support vectors defining its position.
2. Use of Kernels: For non-linear data, SVM employs kernel functions to transform the data
into a higher-dimensional space where a linear separation is possible.
3. Training: The algorithm trains by adjusting the hyperplane based on the support vectors
and optimizing the margin.
4. Prediction: Once the model is trained, it can predict new data points by determining
which side of the hyperplane they fall on.
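A hedged end-to-end sketch of this workflow with scikit-learn, including the feature scaling that SVM is sensitive to; the dataset, kernel choice and hyperparameters are placeholders.

```python
# Hedged sketch of the SVM workflow above with scikit-learn; the dataset,
# kernel choice and hyperparameters are placeholders, not recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features (SVM is sensitive to feature scales), then fit an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("Support vectors stored:", model[-1].support_vectors_.shape[0])
```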
SVM is powerful, particularly in high-dimensional spaces and with non-linear data, but
requires careful tuning and may not scale well for very large datasets.