ML 4
1. Classification:
Classification is a supervised learning task where the goal is to assign input data into
predefined categories or classes.
Support Vector Machine (SVM):
Working: SVM finds the optimal hyperplane that separates data points of different
classes by maximizing the margin between them.
Kernel Trick: Allows SVM to work with non-linear data by transforming it into a higher-
dimensional space (e.g., polynomial or radial basis function (RBF) kernel).
Advantages: Effective for high-dimensional data.
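As a quick illustration, here is a minimal scikit-learn sketch of SVM classification with the RBF kernel. The synthetic two-moons dataset and the parameter values are illustrative assumptions, not part of the notes.

```python
# Illustrative sketch: SVM classification with the RBF kernel using scikit-learn.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data (two interleaving half-moons).
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel lets the SVM learn a non-linear decision boundary;
# C controls the margin/misclassification trade-off, gamma the kernel width.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```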
2. Ensemble Learning:
Ensemble learning combines multiple models (often called "weak learners") to create a
stronger model.
Bagging (Bootstrap Aggregating):
Concept: Each weak learner is trained on a random subset of data sampled with
replacement (bootstrap), and their predictions are aggregated (e.g., by voting or averaging).
Boosting:
Concept: Models are trained sequentially, with each new model correcting the errors of
the previous one. Weights are assigned to misclassified samples to emphasize them.
Random Forest:
Working: An extension of bagging where each decision tree in the ensemble is trained
on a random subset of features in addition to data.
AdaBoost:
Working: Assigns weights to training instances and iteratively adjusts them, focusing on
difficult-to-classify samples.
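A minimal sketch of a Random Forest in scikit-learn, assuming a synthetic dataset; the parameter values are illustrative only. It shows the two sources of randomness described above: bootstrap samples and random feature subsets.

```python
# Illustrative sketch: Random Forest = bagging of trees plus random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # each split considers a random subset of features
    bootstrap=True,        # each tree is trained on a bootstrap sample of the data
    random_state=0,
)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
print("Largest feature importance:", rf.feature_importances_.max().round(3))
```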
3. Binary-vs-Multiclass Classification:
Binary Classification: Classifies data into two categories (e.g., spam vs. non-spam).
Balanced Classification: All classes have roughly the same number of samples.
Imbalanced Classification: Some classes have significantly fewer samples than others,
which may bias the model towards the majority class. Techniques like oversampling
(SMOTE) or class weighting can be used to address this issue.
1. One-vs-One (OvO):
Trains one binary classifier for every pair of classes; the final prediction is the class that
wins the most pairwise comparisons.
2. One-vs-All (OvA):
Trains one binary classifier per class, distinguishing that class from all others.
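A short sketch of both strategies using scikit-learn's meta-estimators; the iris dataset and the logistic-regression base model are assumptions chosen for brevity.

```python
# Illustrative sketch: multiclass strategies with scikit-learn's meta-estimators.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)
base = LogisticRegression(max_iter=1000)

# OvO: one binary classifier per pair of classes -> k(k-1)/2 classifiers.
ovo = OneVsOneClassifier(base).fit(X, y)

# OvA (one-vs-rest): one binary classifier per class -> k classifiers.
ova = OneVsRestClassifier(base).fit(X, y)

print("OvO classifiers:", len(ovo.estimators_))   # 3 for iris (3 classes)
print("OvA classifiers:", len(ova.estimators_))   # 3 for iris
```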
Basic Metrics:
1. Accuracy:
$\text{Accuracy} = \dfrac{\text{Correct Predictions}}{\text{Total Predictions}}$
2. Precision:
$\text{Precision} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$
3. Recall (Sensitivity):
$\text{Recall} = \dfrac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$
4. F-Score:
$F_1 = 2 \times \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Advanced Metrics:
1. Cross-Validation:
Evaluates the model on several train/test splits of the data to obtain a more reliable performance estimate.
Macro-Averaging: Calculates metrics for each class and takes the average.
Use Case: Gives equal weight to all classes, regardless of their size.
4. Micro-Average F-Score:
Computed from precision and recall obtained by aggregating true positives, false positives, and false negatives across all classes, so larger classes dominate.
5. Macro-Average F-Score:
The unweighted average of the per-class F-scores.
Use Case: Highlights poor performance in minority classes.
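A small sketch of these metrics with scikit-learn; the toy label vectors are made up for illustration.

```python
# Illustrative sketch: basic and averaged metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

print("Accuracy         :", accuracy_score(y_true, y_pred))
print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("Recall    (macro):", recall_score(y_true, y_pred, average="macro"))
print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))  # aggregates TP/FP/FN over all classes
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
```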
If you have specific concepts you’d like to dive deeper into, feel free to ask!
k-Nearest Neighbors (KNN)
How KNN Works:
1. Choose the Value of k:
Decide how many neighbors to consider for each query point.
2. Calculate Distance:
Compute the distance between the query point and all points in the dataset, e.g. using the Euclidean distance:
$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
3. Find the Nearest Neighbors:
Select the k data points closest to the query point based on the calculated distances.
4. Vote or Average:
In classification, the majority class among the neighbors is assigned to the query point.
In regression, the average of the neighbor values is used as the prediction.
5. Classify or Predict:
Return the resulting class label (or predicted value) for the query point.
Example of KNN
Problem Statement
Classify a fruit as an Apple or an Orange based on two features:
1. Weight (grams).
2. Size (cm).
Dataset
| Fruit | Weight (grams) | Size (cm) |
| Apple | 150 | 7 |
| Orange | 200 | 9 |
Query Point
A new fruit whose weight and size are measured and compared against the dataset.
Steps:
1. Calculate Distances:
Using Euclidean distance:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Compute distances between the query point and all data points:
| Fruit | Weight (grams) | Size (cm) | Distance to Query Point |
2. Identify the Nearest Neighbors (k = 3):
Apple: 2 neighbors.
Orange: 1 neighbor.
3. Classify:
By majority vote, the query point is classified as an Apple.
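A sketch of the fruit example with scikit-learn's KNeighborsClassifier. The notes preserve only two dataset rows and no query point, so the extra Apple row and the query values below are hypothetical additions for the demo.

```python
# Illustrative sketch of the fruit example with KNeighborsClassifier (k = 3).
from sklearn.neighbors import KNeighborsClassifier

X = [[150, 7],    # Apple (from the notes)
     [140, 6.5],  # Apple (hypothetical extra row, added for the demo)
     [200, 9]]    # Orange (from the notes)
y = ["Apple", "Apple", "Orange"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Hypothetical query point: 160 g, 7.5 cm -> 2 Apple neighbors vs. 1 Orange.
print(knn.predict([[160, 7.5]]))  # ['Apple']
```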
Advantages of KNN
1. Simple to understand and implement.
Limitations
1. Computational Complexity: Slow for large datasets as distance computation is required
for all points.
2. Choice of k : Small k can lead to overfitting; large k may oversimplify the decision
boundary.
Applications
1. Recommendation Systems: Suggesting products or movies.
If you'd like to visualize the working or need further clarification, let me know!
In k-Nearest Neighbour (k-NN), the choice of distance metric plays a critical role in
determining the neighbors of a given query point. The most commonly used distance
metrics include:
1. Euclidean Distance
Formula:
$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Description: Measures the straight-line distance between two points in Euclidean space.
Characteristics:
Sensitive to the scale of the features, so normalization is often applied first.
Example: For (1, 2) and (4, 6): $d = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5$
2. Manhattan Distance
Formula:
$d = \sum_{i=1}^{n} |x_i - y_i|$
Description: Measures the sum of absolute differences between the coordinates of two
points.
Characteristics:
Less sensitive to outliers than Euclidean distance; suited to grid-like feature spaces.
Example: For (1, 2) and (4, 6): $d = |4 - 1| + |6 - 2| = 3 + 4 = 7$
3. Minkowski Distance
Formula:
$d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
p = 1: Manhattan Distance.
p = 2: Euclidean Distance.
p > 2: Places more emphasis on larger differences.
Example: With p = 3, for (1, 2) and (4, 6):
$d = \left( |4 - 1|^3 + |6 - 2|^3 \right)^{1/3} = (27 + 64)^{1/3} = 91^{1/3} \approx 4.50$
4. Chebyshev Distance
Formula:
$d = \max_{i} |x_i - y_i|$
Characteristics:
Considers only the largest coordinate difference between the two points (the "chessboard" distance).
Example: For (1, 2) and (4, 6): $d = \max(|4 - 1|, |6 - 2|) = 4$
5. Hamming Distance
Formula:
$d = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i)$
Characteristics:
Used for categorical or binary data; the vectors must be of equal length.
Example: For binary vectors (1, 1, 0, 1) and (1, 0, 0, 1):
$d = \mathbf{1}(1 \neq 1) + \mathbf{1}(1 \neq 0) + \mathbf{1}(0 \neq 0) + \mathbf{1}(1 \neq 1) = 0 + 1 + 0 + 0 = 1$
6. Cosine Distance
Formula:
$\text{Cosine Similarity} = \dfrac{x \cdot y}{\|x\| \, \|y\|}, \qquad \text{Cosine Distance} = 1 - \text{Cosine Similarity}$
Description: Measures the cosine of the angle between two vectors. Focuses on
orientation rather than magnitude.
Characteristics:
Useful for text data or high-dimensional sparse data (e.g., TF-IDF vectors).
Example: For vectors (1, 2) and (2, 3):
$\text{Cosine Similarity} = \dfrac{(1)(2) + (2)(3)}{\sqrt{1^2 + 2^2} \, \sqrt{2^2 + 3^2}} = \dfrac{8}{\sqrt{5}\sqrt{13}} \approx 0.992$
7. Mahalanobis Distance
Formula:
$d = \sqrt{(x - y)^T S^{-1} (x - y)}$
Description: Measures the distance between a point and a distribution, considering the
correlations between features.
Characteristics:
Accounts for correlations between features and is scale-invariant; S is the covariance matrix of the data.
8. Bray-Curtis Distance
Formula:
$d = \dfrac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$
Description: Measures dissimilarity between two vectors, often used in ecological
studies.
Characteristics:
Bounded between 0 and 1 for non-negative data.
The choice of distance metric should align with the nature of the data and the problem at
hand.
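A small sketch computing most of the metrics above with SciPy for the points (1, 2) and (4, 6) used in the examples; Mahalanobis is omitted because it needs a covariance matrix estimated from a dataset.

```python
# Illustrative sketch: the distance metrics above, computed with SciPy.
import numpy as np
from scipy.spatial import distance

x, y = np.array([1, 2]), np.array([4, 6])

print("Euclidean      :", distance.euclidean(x, y))          # 5.0
print("Manhattan      :", distance.cityblock(x, y))          # 7
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))     # ~4.50
print("Chebyshev      :", distance.chebyshev(x, y))          # 4
print("Cosine distance:", distance.cosine(x, y))             # ~0.008 (1 - similarity)
print("Bray-Curtis    :", distance.braycurtis(x, y))         # 7 / 13 ~ 0.538

# Hamming: SciPy returns the *fraction* of differing positions, not the count.
a, b = [1, 1, 0, 1], [1, 0, 0, 1]
print("Hamming fraction:", distance.hamming(a, b))            # 0.25 -> count = 0.25 * 4 = 1
```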
Advantages of KNN
1. Simplicity:
KNN is simple to implement and understand, making it a good starting point for
beginners in machine learning.
2. No Training Phase:
KNN is a lazy learning algorithm, meaning there’s no explicit training phase. The
computation happens at the time of prediction.
3. Adaptability:
4. Non-Parametric:
KNN does not make any assumptions about the underlying data distribution, which
makes it versatile.
5. Interpretable:
Easy to understand how predictions are made by looking at the nearest neighbors.
6. Effective with Large, Well-Labeled Datasets:
Works well with a large, well-labeled dataset, as it uses actual data points for making
decisions.
7. Incremental Learning:
KNN can easily adapt to new data since no retraining is required; you just add the new
data to the dataset.
Disadvantages of KNN
1. Computationally Expensive:
For large datasets, the prediction phase can be slow because it involves calculating
distances to all other data points.
2. Sensitive to Feature Scaling:
Distance-based algorithms are sensitive to the scale of features; a feature with larger
values can dominate the distance calculation, so normalization or standardization is required.
3. Choice of k :
A small k can lead to overfitting (noisy decision boundaries), while a large k can oversimplify the model (it may fail to capture local patterns).
5. Sensitive to Noise:
KNN can misclassify due to noisy or irrelevant features, as all features contribute equally
unless weighted.
6. Imbalanced Data:
For imbalanced datasets, the majority class can dominate the predictions, leading to
biased results.
7. No Model Interpretability:
While predictions are straightforward, KNN does not provide an explicit model or
coefficients to interpret.
8. Scalability:
As the size of the dataset grows, the algorithm becomes slower and may require
optimizations (e.g., KD-trees or Ball trees).
Summary Table
| Advantages | Disadvantages |
| Simple to implement, with no explicit training phase | Predictions are slow on large datasets |
| Non-parametric: no assumptions about the data distribution | Sensitive to feature scaling, noise, and irrelevant features |
| Adapts to new data without retraining | Requires a careful choice of k and struggles with imbalanced data |
When to Use KNN:
When you need a simple baseline model to compare against more complex algorithms.
Would you like guidance on optimizing KNN for specific use cases?
i) Separating Hyperplane
$w \cdot x + b = 0$
Where:
w: Weight vector (normal to the hyperplane).
x: Input feature vector.
b: Bias term.
Example: In 2D space:
For a binary classification problem, a straight line (hyperplane) separates the two
classes.
ii) Margin
Definition: The margin is the distance between the separating hyperplane and the
closest data points from either class (called support vectors).
Purpose:
In SVM, the goal is to maximize this margin while ensuring the correct classification
of data points. This maximized margin creates a decision boundary that generalizes
better to unseen data.
Types of Margins:
Hard Margin: No misclassifications allowed; possible only when the data is perfectly linearly separable.
Soft Margin: Allows some misclassifications (controlled by the regularization parameter C) to handle noisy or overlapping data.
Mathematical Representation: If $x_i$ are the data points and $y_i \in \{-1, +1\}$ are the corresponding labels, correct classification requires $y_i (w \cdot x_i + b) \ge 1$, and the margin width is $\frac{2}{\|w\|}$.
The factor 2 accounts for the margin on both sides of the hyperplane.
Key Concept:
Support vectors are the data points that lie on the margin boundaries. These points
determine the hyperplane.
Example: In 2D space:
Visual Representation
In a 2D space:
Margin: The gap between the hyperplane and the nearest points from both classes.
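A minimal sketch, assuming scikit-learn and a synthetic blob dataset, that fits a linear SVM and inspects the hyperplane, support vectors, and margin width 2/||w||.

```python
# Illustrative sketch: hyperplane, margin, and support vectors of a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly linearly separable blobs in 2-D.
X, y = make_blobs(n_samples=60, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1000)  # large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane: w.x + b = 0 with w =", w, "and b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
```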
Kernel Methods in SVM
Kernel methods are techniques that enable Support Vector Machines (SVM) to solve non-
linear problems by mapping the input data into a higher-dimensional space where a linear
separating hyperplane can be found. This process is known as the kernel trick.
In many cases, data is not linearly separable in its original feature space. Kernels
implicitly transform the data into a higher-dimensional space where it becomes linearly
separable.
2. Kernel Trick
Instead of explicitly computing the mapping $\phi(x)$ into the higher-dimensional space, a kernel function computes the inner product in that space directly:
$K(x, y) = \phi(x) \cdot \phi(y)$
Here:
$\phi$: The implicit mapping to the higher-dimensional feature space.
$K(x, y)$: The kernel function evaluated on the original inputs.
1. Linear Kernel
Formula:
K(x, y) = x ⋅ y
Use Case:
Suitable when the data is linearly separable in the original feature space.
Advantages:
Simple and fast, with no additional kernel parameters to tune.
2. Polynomial Kernel
Formula:
$K(x, y) = (x \cdot y + c)^d$
Where:
c: Constant term.
d: Degree of the polynomial.
Use Case:
Useful for problems with complex but polynomial relationships between classes.
Advantages:
Captures feature interactions up to degree d.
3. RBF (Gaussian) Kernel
Formula:
$K(x, y) = \exp\!\left(-\dfrac{\|x - y\|^2}{2\sigma^2}\right)$
Where:
$\sigma$: Bandwidth parameter that controls the width of the Gaussian.
Advantages:
Popular for general-purpose use cases.
4. Sigmoid Kernel
Formula:
K(x, y) = tanh(α(x ⋅ y) + c)
Where:
α: Scale parameter.
c: Offset parameter.
Use Case:
Sometimes used to mimic neural-network-like structures (it resembles a perceptron activation).
Advantages:
Captures non-linear decision boundaries.
Comparison of Kernels
| Kernel | Use Case | Advantages | Limitations |
| Linear | When the data is linearly separable in the original space | Simple and fast | Fails for non-linear data |
| RBF (Gaussian) | When the decision boundary is highly non-linear | Versatile, works well in most cases | Requires tuning of σ |
| Sigmoid | When modeling neural network-like structures | Captures non-linear boundaries | May not perform well in all cases |
Advantages of Kernel Methods
1. Flexibility: Can handle linear and non-linear problems.
Limitations of Kernel Methods
Parameter Tuning: Requires careful tuning of kernel parameters (e.g., σ for RBF, d for
polynomial).
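A short sketch comparing the four kernels on one non-linear toy dataset (concentric circles); the dataset and parameters are illustrative assumptions.

```python
# Illustrative sketch: comparing SVM kernels on the same non-linear dataset.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} kernel: mean CV accuracy = {acc:.3f}")
```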
Would you like a detailed example of how a specific kernel works in an SVM?
Support Vector Regression (SVR)
1. ε-Insensitive Margin:
SVR defines a margin (ε) around the predicted function within which deviations are
considered acceptable.
The goal is to find a function f (x) such that most data points lie within this margin.
2. Objective:
No penalty is applied for errors within the margin (∣y − f (x)∣ ≤ ϵ).
For errors outside the margin, the model applies a penalty proportional to the
deviation.
4. Optimization Problem:
Minimize:
$\dfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$
Subject to:
$y_i - (w \cdot x_i + b) \le \epsilon + \xi_i$
$(w \cdot x_i + b) - y_i \le \epsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0$
Where:
w: Weight vector.
b: Bias term.
C : Regularization parameter.
ξ, ξ ∗ : Slack variables for handling violations.
Steps in SVR
1. Choose a Kernel Function:
2. Set the Parameters:
Choose the regularization parameter C and the margin width ε.
3. Train the Model:
4. Make Predictions:
Problem:
Predict house prices based on features like size, number of rooms, etc.
Dataset (Simplified):
| Size (sq. ft.) | Rooms | Price ($) |
| 1000 | 3 | 300,000 |
| 1500 | 4 | 450,000 |
| 2000 | 5 | 500,000 |
Steps:
1. Kernel Selection:
Use a linear kernel if the relationship between features and price is approximately
linear.
2. Set Parameters:
Set the margin to ε = 10,000 (the acceptable deviation in predicted price).
3. Train the Model:
Find the hyperplane that predicts house prices while ensuring most data points lie
within ±$10,000 of the predicted value.
4. Prediction:
For a house of 1800 sq. ft. with 4 rooms, the model predicts a price of $475,000.
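A minimal SVR sketch on the toy housing data above. With only three samples and no feature scaling, the exact prediction depends heavily on C and ε, so the output is indicative rather than the $475,000 quoted above.

```python
# Illustrative sketch: linear SVR on the toy housing data from the notes.
from sklearn.svm import SVR

X = [[1000, 3], [1500, 4], [2000, 5]]   # [size (sq. ft.), rooms]
y = [300_000, 450_000, 500_000]         # price ($)

# epsilon defines the ±$10,000 tube in which errors are not penalized.
svr = SVR(kernel="linear", C=1000.0, epsilon=10_000)
svr.fit(X, y)
print(svr.predict([[1800, 4]]))  # predicted price for a 1800 sq. ft., 4-room house
```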
SVR Output
Graphical Representation:
Margin: A band of width 2ϵ around the regression line where deviations are acceptable.
Advantages of SVR
1. Handles Non-linear Relationships:
2. Robustness:
3. Flexibility:
Disadvantages of SVR
1. Parameter Tuning:
2. Computational Complexity:
3. Sensitive to Outliers:
Would you like a Python implementation or visualization of SVR?
Ensemble Learning
Ensemble learning is a machine learning technique that combines the predictions of multiple
models (often called base learners or weak learners) to improve performance. The idea is
that a group of models working together can produce more accurate and robust predictions
than a single model alone.
1. Diversity:
Ensemble learning relies on combining models that make different errors (i.e., they
are diverse) to reduce overall error.
2. Base Learners:
These are the individual models that make up the ensemble. They can be:
Homogeneous: All base learners use the same algorithm (e.g., all decision trees).
Heterogeneous: Base learners use different algorithms (e.g., a decision tree, an SVM, and a logistic regression model).
3. Combination Strategy:
Predictions are combined by majority voting (classification), by averaging (regression), or by feeding them into a meta-model.
4. Bias-Variance Trade-off:
Bagging techniques mainly reduce variance, while boosting techniques reduce bias.
1. Bagging (Bootstrap Aggregating)
Idea:
Create multiple random subsets of the training data by sampling with replacement (bootstrap).
Train a model on each subset and aggregate their predictions (e.g., by averaging or
voting).
2. Boosting
Idea:
Sequentially train models, where each new model focuses on correcting the errors
made by the previous ones.
Popular Algorithms:
AdaBoost.
Gradient Boosting.
3. Stacking
Idea:
Train multiple base learners and use their predictions as inputs to a meta-model (or
second-level model) that makes the final prediction.
Popular Meta-Models:
Logistic regression (for classification) or linear regression (for regression) are common choices.
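A minimal stacking sketch with scikit-learn's StackingClassifier; the base learners and meta-model shown are illustrative choices.

```python
# Illustrative sketch: stacking with scikit-learn's StackingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(random_state=0)),
]
# The meta-model (here logistic regression) learns from the base learners' predictions.
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
print("Stacking CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```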
Advantages of Ensemble Learning:
2. Robustness:
3. Versatility:
4. Reduces Overfitting:
Disadvantages of Ensemble Learning:
2. Time-Consuming:
3. Overfitting Risk:
Example: Voting Classifier
Suppose you want to classify emails as spam or not spam using three algorithms: Logistic
Regression, Decision Tree, and Naive Bayes. Using ensemble learning, you can:
1. Train all three models on the same training data.
2. Combine their predictions with majority voting.
If Logistic Regression and Decision Tree predict "spam," but Naive Bayes predicts "not
spam," the final output will be "spam" based on the majority.
Why Use Ensemble Learning?
1. Improved Accuracy
Individual models (base learners) may have limited predictive power or may overfit the
training data.
Ensembles aggregate the predictions of multiple models, reducing errors and improving
overall accuracy.
2. Reducing Overfitting
Techniques like bagging (e.g., Random Forest) create diverse models by training on
different subsets of the data. This reduces the likelihood of overfitting, especially in high-
variance models like decision trees.
3. Reducing Underfitting
Boosting methods (e.g., AdaBoost, Gradient Boosting) focus on improving weak learners
by iteratively correcting their errors. This reduces bias and helps prevent underfitting.
4. Robustness to Noise
Ensemble models are less sensitive to noise and outliers in the data because the
aggregation process minimizes the impact of any single model's mistakes.
5. Balancing Bias and Variance
Ensembles strike a balance between bias (error due to overly simple models) and
variance (error due to overly complex models): bagging primarily reduces variance, while boosting primarily reduces bias.
6. Increased Generalization
By combining diverse models, ensembles tend to generalize better to unseen data than any single base learner.
8. Scalability
Ensemble techniques can be used with distributed computing to train large models
efficiently on big data.
2. Real-World Applications:
Fraud detection, image recognition, recommendation systems, and more use
ensembles for reliable predictions.
3. Imbalanced Data:
Analogy: Imagine asking several different people to predict tomorrow's weather.
Individually, they may not always be correct, but by aggregating their predictions
(majority voting), you are more likely to get an accurate forecast. This concept mirrors
how ensemble learning works by combining the strengths of diverse models.
Conclusion
Ensemble learning is used because it improves accuracy, reduces both overfitting and underfitting, is robust to noise, and generalizes better than any individual model.
Would you like to explore a specific ensemble technique, like bagging or boosting?
i) Simple Ensemble Methods
Simple ensemble methods combine predictions from base models in straightforward ways,
such as averaging or voting. These methods are easy to implement and computationally
efficient.
1. Majority Voting (for classification):
Each base model predicts a class, and the final prediction is the class with the most
votes.
Example: If three models predict Yes, Yes, No, the final output is Yes.
2. Averaging (for regression):
The final prediction is the average of the predictions made by all base models.
Example: If three models predict prices as $100, $110, and $120, the final output is
(100 + 110 + 120)/3 = 110.
3. Weighted Voting or Averaging:
Each model's vote or prediction is weighted, typically according to its performance.
Example: A high-performing model may have more influence on the final prediction.
4. Bagging:
Build multiple models using different subsets of the training data (sampled with
replacement), then aggregate their predictions.
ii) Advanced Ensemble Methods
Advanced ensemble methods use more sophisticated techniques, often involving sequential
or layered modeling, to improve performance.
1. Boosting:
Models are trained sequentially, with each new model focusing on correcting the
errors of the previous ones.
Example Algorithms:
AdaBoost: Adjusts weights of incorrectly classified samples.
2. Stacking:
Base models make predictions, and these predictions are used as inputs for the
meta-model.
3. Blending:
Similar to stacking, but the meta-model is trained on predictions made on a held-out validation set.
4. Random Forest:
An ensemble of decision trees built with bagging plus random feature selection at each split.
Use Case: Robust to overfitting and versatile for classification and regression.
Use Cases: Simple methods suit small datasets and quick tasks; advanced methods suit complex datasets where higher accuracy is needed.
Would you like examples or Python code for any of these methods?
Differentiate between bagging and boosting in table
format
| Aspect | Bagging | Boosting |
| Data Sampling | Multiple subsets of data are sampled with replacement. | Trains sequentially on the full dataset, focusing on misclassified samples. |
| Model Independence | All models are trained independently. | Models are trained sequentially, and each model depends on the previous one. |
| Focus on Errors | Equal weight given to all data points. | Misclassified samples are given higher weights in subsequent iterations. |
| Training Strategy | Parallel (models can be trained simultaneously). | Sequential (models are trained one after another). |
| Computational Cost | Relatively low, as models are trained in parallel. | Higher, due to sequential training and weight adjustments. |
| Weaknesses | May not reduce bias effectively. | Sensitive to noisy data and outliers, which can lead to overfitting. |
Summary
Bagging is ideal for reducing variance and improving model stability.
Boosting is better for addressing bias and building more accurate models on complex
data.
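A small sketch contrasting the two approaches with the same base learner (a decision tree); the dataset and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch: bagging vs. boosting with the same base learner (a decision tree).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=2)

# Bagging: independent trees on bootstrap samples (mainly reduces variance).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=2)

# Boosting: shallow trees trained sequentially on re-weighted samples (mainly reduces bias).
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=2)

print("Bagging  CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
print("Boosting CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```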
Would you like a deeper explanation of a specific example (e.g., Random Forest or
AdaBoost)?