ML Exam Preparation Tips
Best of luck, Abhishek! 😊 How are you feeling about it? Let me know if you need any last-
minute clarifications or quick revision tips—I’m here to help!
Sure, Abhishek! Share your syllabus, and I'll suggest questions along with their answers for
your quick revision. 😊
TCS 509: Machine Learning (syllabus extract)

Unit II (10 hours)
Introduction to Exploratory Data Analysis (EDA): steps in EDA; data types: numerical data (discrete data, continuous data) and categorical data.
Data transformation: transformation techniques such as performing data deduplication, replacing values, discretization and binning. Introduction to missing data, handling missing data.
Data visualization using Matplotlib and Seaborn.

Unit III (10 hours)
Supervised learning algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machine, K-Nearest Neighbours, CN2 Algorithm, Naive Bayes.

Unit IV (8 hours)
Clustering: K-means, silhouette scores, hierarchical clustering, fuzzy c-means, DBSCAN.
Dimensionality reduction: low variance filter, high correlation filter, backward feature elimination, forward feature selection, Principal Component Analysis, projection methods.

Unit V
Model evaluation and selection: cross-validation, model evaluation metrics, model selection, and hyperparameter tuning.
Hyperparameter optimization techniques: manual search, random search, grid search.
Case study in Python for hyperparameter tuning.
Here’s a concise list of possible questions along with their answers, tailored to your syllabus
for quick revision:
1. What is the mean?
Mean: Average of a dataset.
2. What is an outlier?
A data point that differs significantly from the rest of the dataset.
What is Scikit-learn?
It provides efficient tools for data mining and machine learning in Python.
Matplotlib: General-purpose plotting.
Unit V: Model Evaluation and Hyperparameter Optimization
19. What is cross-validation?
A technique to assess model performance by splitting data into training and testing
subsets.
Let’s dive deep into the key concepts from your syllabus for a comprehensive review.
$$\text{Mean} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example: For [2, 3, 7], Mean = (2 + 3 + 7)/3 = 4.
Median:
The middle value when the data is ordered.
2. Outliers
Data points that differ significantly from the rest of the dataset; they can distort summary statistics such as the mean.
3. Standard Deviation
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
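As a quick check of these formulas, here is a minimal NumPy sketch (illustrative values, not from the original notes):

python
import numpy as np

data = np.array([2, 3, 7])

mean = data.mean()          # (2 + 3 + 7) / 3 = 4.0
median = np.median(data)    # middle value of the sorted data = 3.0
std_dev = data.std()        # population standard deviation (divides by N)

print(mean, median, std_dev)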
Supervised Learning:
Learn from labeled data. Examples: Linear regression, classification.
Unsupervised Learning:
Find patterns in unlabeled data. Examples: Clustering, PCA.
Reinforcement Learning:
Learn via rewards/punishments. Example: AlphaGo.
1. Steps in EDA
3. Data Visualization
Unit III: Supervised Learning Algorithms
1. Linear Regression
$$Y = \beta_0 + \beta_1 X + \epsilon$$
$\beta_0$: Intercept.
$\beta_1$: Slope.
2. Logistic Regression
3. Decision Trees
4. Random Forest
Separates data using a hyperplane with maximum margin.
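For quick revision, here is a minimal scikit-learn sketch (assumed Iris data, not from the original notes) showing that these supervised algorithms share the same fit/predict interface; CN2 is omitted, as it is not part of scikit-learn:

python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": GaussianNB(),
}

# Each model is trained on the training split and scored on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))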
1. K-means Clustering
3. Recalculate centroids.
1. Standardize data.
3. Find eigenvalues/eigenvectors.
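A minimal scikit-learn sketch of the K-means and PCA steps above (random toy data, illustrative only):

python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy dataset: 100 samples, 4 features

# K-means: choose k, assign points to the nearest centroid, recalculate centroids until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10], kmeans.cluster_centers_.shape)

# PCA: standardize the data, then project onto the top-2 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)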
1. Cross-validation
2. Model Evaluation Metrics
3. Hyperparameter Tuning
Cross-Validation Overview
Cross-validation is a model validation technique used to evaluate the performance of a
machine learning model on an independent dataset. It splits the data into training and
testing subsets multiple times to reduce bias and ensure the model generalizes well to
unseen data.
K-Fold Cross-Validation
1. Process:
2. Advantages:
Uses the entire dataset for both training and testing (each data point is tested
exactly once).
3. Use Case:
General model evaluation when the class distribution is reasonably balanced.
Stratified K-Fold Cross-Validation
1. Process:
Similar to k-fold cross-validation but ensures that the class distribution in each fold
is approximately the same as the original dataset.
Particularly useful for imbalanced datasets where certain classes have fewer
samples.
2. Advantages:
Preserves the class ratio in both training and testing datasets, preventing biased
evaluation.
3. Use Case:
Aspect | K-Fold | Stratified K-Fold
Class Distribution | Random, may vary between folds. | Maintains the same class ratio as in the original dataset.
When to Use Each
K-Fold Cross-Validation:
For a general performance estimate when the class distribution is fairly balanced.
Stratified K-Fold Cross-Validation:
When you want to ensure the minority class is adequately represented in every fold.
Example
K-Fold Cross-Validation:
Dataset: [1, 2, 3, 4, 5, 6] with k = 3.
Folds (without shuffling):
Fold 1: Test = [1, 2], Train = [3, 4, 5, 6]
Fold 2: Test = [3, 4], Train = [1, 2, 5, 6]
Fold 3: Test = [5, 6], Train = [1, 2, 3, 4]
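A small scikit-learn sketch of these splits (toy labels added only to illustrate the stratified variant):

python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])   # labels are only needed for the stratified variant

# Plain k-fold: consecutive chunks of the data become the test fold in turn
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("KFold test fold:", X[test_idx].ravel())

# Stratified k-fold: each test fold keeps roughly the same class ratio as y
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    print("Stratified test fold labels:", y[test_idx])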
It seems the accuracy for the fifth fold is missing. Let’s proceed assuming the missing value
needs to be addressed. You can provide the fifth fold's accuracy if available, or we’ll calculate
based on an example value.
If the fifth fold's accuracy is provided, let’s denote it as A5 . The accuracies for the folds are:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(A_i - \mu)^2}$$
Where:
μ = Average accuracy.
Example Calculation
Let’s assume A5 = 81% (you can replace it with the actual value).
1. Accuracies: 78%, 82%, 80%, 79%, 81%
2. Sum of Accuracies:
78 + 82 + 80 + 79 + 81 = 400
3. Average Accuracy (μ):
$$\mu = \frac{400}{5} = 80\%$$
4. Compute Variance:
Variance (σ²) formula:
$$\sigma^2 = \frac{1}{5}\left[(78-80)^2 + (82-80)^2 + (80-80)^2 + (79-80)^2 + (81-80)^2\right]$$
$$= \frac{1}{5}\left[(-2)^2 + (2)^2 + (0)^2 + (-1)^2 + (1)^2\right] = \frac{1}{5}\left[4 + 4 + 0 + 1 + 1\right] = \frac{10}{5} = 2$$
5. Standard Deviation (σ):
$$\sigma = \sqrt{2} \approx 1.41\%$$
Final Results:
Average Accuracy: 80%
Standard Deviation: √2 ≈ 1.41%
If you have the fifth fold's actual accuracy, I can redo the calculations. Let me know! 😊
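The same calculation as a short NumPy sketch (using the assumed fifth-fold accuracy of 81%):

python
import numpy as np

accuracies = np.array([78, 82, 80, 79, 81])   # fold accuracies in %

mean_acc = accuracies.mean()                  # 80.0
std_acc = accuracies.std()                    # population std: sqrt(2) ≈ 1.41

print(f"Average accuracy: {mean_acc}%  Standard deviation: {std_acc:.2f}%")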
problem's requirements and help in:
2. Optimizing Models: Identify areas where the model needs improvement (e.g., reducing
false positives or false negatives).
1. Precision
Definition:
Precision measures the proportion of true positives (TP) among all predicted positives (TP + FP).
$$\text{Precision} = \frac{TP}{TP + FP}$$
Interpretation:
High precision means fewer false positives.
Use Case:
Precision is critical when the cost of false positives is high.
Example: Spam detection (avoiding misclassifying important emails as spam).
2. Recall
Definition:
Recall measures the proportion of true positives (TP) among all actual positives (TP + FN).
$$\text{Recall} = \frac{TP}{TP + FN}$$
Interpretation:
High recall means fewer false negatives.
Use Case:
Recall is crucial when missing a positive instance has a severe consequence.
Example: Disease detection (missing a positive case could lead to serious health risks).
3. F1-Score
Definition:
F1-score is the harmonic mean of precision and recall, balancing the two.
$$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Interpretation:
A high F1-score indicates a good balance between precision and recall.
Use Case:
Use F1-score when both false positives and false negatives are equally problematic.
Example: Fraud detection (where both missing fraud and falsely flagging legitimate
transactions are costly).
Comparison Table
Metric | Focus | Strength | Weakness | When to Use
Precision | Accuracy of positive predictions | Good when false positives matter. | May ignore false negatives. | High-cost false positives (e.g., spam detection).
Recall | Coverage of actual positives | Good when false negatives matter. | May ignore false positives. | High-cost false negatives (e.g., disease detection).
F1-Score | Balance of precision and recall | Balances both error types. | Hides which error type dominates. | When false positives and false negatives are equally costly (e.g., fraud detection).
Example Scenario
Consider a binary classification problem: detecting fraudulent transactions.
Recall Focus: If you aim to identify all fraudulent transactions, even if it means falsely
flagging some legitimate ones (reduce false negatives).
F1-Score Focus: If both outcomes (false positives and false negatives) are equally
problematic.
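A minimal scikit-learn sketch of these metrics (labels invented for illustration):

python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# 1 = fraudulent, 0 = legitimate (toy labels for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))            # harmonic mean of the two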
What is Overfitting?
Overfitting occurs when a machine learning model learns the training data too well,
including its noise and outliers. This results in a model that performs well on the training
data but poorly on unseen data (test or validation set).
Characteristics of Overfitting:
3. Overly complex models that capture noise instead of the underlying patterns.
1. K-Fold Cross-Validation:
Each fold is used once as a validation set, while the remaining k − 1 folds are used
for training.
The model is evaluated on the average performance across all folds, reducing the
likelihood of overfitting to a single validation set.
2. Benefits:
Scenario:
You are training a polynomial regression model to predict house prices based on square
footage.
1. Dataset:
Training set: [(1000, 150), (1200, 200), (1500, 300), (2000, 400)]
2. Model Choices:
Test Performance: The polynomial performs poorly because it captures noise rather
than the trend.
4. Using Cross-Validation:
The high-degree polynomial model will perform inconsistently across folds because
it overfits specific subsets of training data.
A simpler linear model will have more consistent performance across all folds,
indicating better generalization.
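A minimal sketch of this comparison using the four points above (scikit-learn assumed, illustrative only):

python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Square footage vs. price, from the scenario above
X = np.array([[1000], [1200], [1500], [2000]])
y = np.array([150, 200, 300, 400])

linear = LinearRegression()
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())

# Compare the cross-validated scores: a model that fits its training folds almost
# perfectly but scores much worse (or erratically) on the validation folds is overfitting.
print("Linear CV scores:    ", cross_val_score(linear, X, y, cv=2))
print("Polynomial CV scores:", cross_val_score(poly, X, y, cv=2))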
2. Regularization: Add penalties for large weights (e.g., L1/L2 regularization in linear
models).
3. Prune the Model: Reduce the complexity of decision trees or neural networks.
4. More Training Data: Helps the model learn the true data distribution.
5. Dropout (Neural Networks): Randomly drop neurons during training to prevent over-
reliance on specific features.
Conclusion
Cross-validation helps identify overfitting by testing the model's ability to generalize across
different splits of the dataset. For example, if a high-degree polynomial model performs well
on the training folds but poorly on the validation folds, it indicates overfitting. Regularization
and simpler models, combined with cross-validation, can help build robust models that
generalize well.
A Decision Tree is a supervised learning algorithm used for classification and regression
tasks. It models decisions and their possible consequences as a tree structure. Each internal
node represents a test on a feature, each branch represents an outcome of the test, and
each leaf node represents a predicted label or value.
2. Splitting Criteria: Uses metrics like Gini Index, Information Gain, or Mean Squared Error
(for regression) to decide the best split.
3. Advantages: Easy to understand, requires little preprocessing, and works with both
categorical and numerical data.
Aspect | Decision Tree | Random Forest
Definition | A single tree structure for decision-making. | An ensemble of multiple decision trees.
Training Speed | Faster, as only one tree is built. | Slower, due to multiple tree constructions.
1. Overfitting:
Trees can grow very deep, capturing noise and outliers in the training data.
Solution: Prune the tree (post-pruning or pre-pruning) or set a maximum depth.
2. High Variance:
Decision trees are sensitive to small changes in the data, leading to different tree
structures.
Solution: Use ensemble methods like Random Forest to average out predictions and
reduce variance.
3. Bias in Splitting:
Splitting criteria like Information Gain can favor features with more levels, causing
bias.
Solution: Use splitting criteria like Gini Index or Chi-Square that are less biased.
Decision trees struggle with categorical data with many levels or missing values.
Solution: Use one-hot encoding for categorical data and imputation techniques for
missing values.
Solution: Use ensemble methods like Gradient Boosting for smoother outputs.
Limit tree depth, set minimum samples per leaf, or define a maximum number of
nodes.
2. Pruning Techniques:
Pre-Pruning: Stop growing the tree early based on conditions (e.g., minimum
samples per node).
3. Bagging and Random Forest:
Random Forest uses bootstrapped datasets and random feature selection for
training.
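A minimal scikit-learn sketch (breast-cancer dataset assumed) comparing an unrestricted tree, a pre-pruned tree, and a Random Forest:

python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Unrestricted tree vs. a pre-pruned tree (limited depth, minimum samples per leaf)
full_tree = DecisionTreeClassifier(random_state=0)
pruned_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)

# Bagging of many trees trained on bootstrapped samples with random feature selection
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("Full tree", full_tree), ("Pruned tree", pruned_tree), ("Random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))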
Conclusion
Decision Trees are powerful but can face challenges like overfitting, high variance, and bias.
Random Forest is an effective ensemble technique that overcomes many of these issues. By
employing techniques like pruning, regularization, and ensemble methods, we can harness
the simplicity of Decision Trees while ensuring better generalization and robustness.
1. ReLU (Rectified Linear Unit)
Equation:
$$f(x) = \max(0, x)$$
Working:
Returns the input unchanged for positive values and 0 for any negative input.
Graph:
Suitability:
Advantages:
3. Sparse activation: Most neurons are inactive (output 0), leading to efficient
computations.
Disadvantages:
1. Can cause dying ReLU problem: Neurons may output 0 for all inputs if weights are
poorly initialized.
Use Case: Hidden layers in deep neural networks (e.g., CNNs, RNNs).
2. Sigmoid
Equation:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Working:
Smooth curve that maps large negative inputs close to 0 and large positive inputs close
to 1.
Graph:
Suitability:
Advantages:
1. Suitable for binary classification tasks as it outputs probabilities.
Disadvantages:
1. Vanishing gradient problem: Gradients become very small for large positive or
negative inputs.
3. Softmax
Equation:
$$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
Working:
Graph:
Suitability:
Advantages:
Disadvantages:
Use Case: Final layer in multi-class classification networks (e.g., object detection, text
classification).
Comparison Table
Activation Function | Equation | Range | Use Case | Key Feature
ReLU | f(x) = max(0, x) | [0, ∞) | Hidden layers of deep networks. | Sparse activation, cheap to compute.
Sigmoid | f(x) = 1 / (1 + e^(-x)) | (0, 1) | Binary classification output. | Outputs probabilities; prone to vanishing gradients.
Softmax | f(x_i) = e^(x_i) / Σ_j e^(x_j) | (0, 1), sums to 1 | Multi-class classification output. | Produces probability distributions across classes.
Conclusion
ReLU is preferred for hidden layers due to efficiency and sparse activations.
Softmax is ideal for multi-class classification tasks, where probabilities across classes are
needed.
Would you like Python code examples to visualize these activation functions? 😊
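For reference, a minimal NumPy sketch of the three activation functions (not part of the original conversation):

python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0, x)

def sigmoid(x):
    # 1 / (1 + e^-x), squashes inputs into (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # e^{x_i} / sum_j e^{x_j}; subtracting max(x) improves numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x), sigmoid(x), softmax(x), sep="\n")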
(i) Naïve Bayes
Overview:
Naïve Bayes is a probabilistic classifier based on Bayes' Theorem and the assumption of conditional independence between features.
Mathematical Equation:
The main idea behind Naïve Bayes is to calculate the posterior probability of a class given the
features (data point). Using Bayes' Theorem:
$$P(C_k \mid X) = \frac{P(X \mid C_k) \cdot P(C_k)}{P(X)}$$
Where:
$P(C_k \mid X)$ is the posterior probability of class $C_k$ given the features $X$.
$P(X \mid C_k)$ is the likelihood of the features given the class.
$P(C_k)$ is the prior probability of the class.
$P(X)$ is the probability of the features $X$ (constant across all classes and can be ignored for classification).
Since the Naïve Bayes classifier assumes that the features are conditionally independent, the
likelihood can be factored as:
$$P(X \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$
Example:
Imagine we want to classify whether an email is spam or not (binary classification). Features
could be words like "free", "offer", and "win", and classes could be "spam" or "not spam".
For a given email with features X = ["free", "offer"], the Naïve Bayes classifier computes
the posterior probability of each class by combining the prior probabilities of the classes and
the likelihood of observing these words given the classes.
(ii) Backpropagation
Overview:
Backpropagation is the algorithm used to train neural networks: it propagates the prediction error backwards through the network, computes gradients of the loss with respect to each weight, and updates the weights to reduce the loss.
Working:
1. Forward Pass: Input data is passed through the network, and predictions are made.
2. Compute Loss: The difference between the predicted output and the actual target is
computed using a loss function (e.g., Mean Squared Error, Cross-Entropy).
3. Backward Pass:
Compute the gradient of the loss with respect to each weight using the chain rule of calculus. This helps understand how each weight affects the final prediction.
4. Update Weights:
Update the weights in the opposite direction of the gradient (gradient descent) to reduce the loss.
Mathematical Equation:
1. Loss function:
$$L = \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
2. Gradient calculation:
The gradient of the loss with respect to weights is calculated as:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$
3. Weight update:
$$w = w - \eta \frac{\partial L}{\partial w}$$
Example:
For a neural network predicting the output y for an input X (using weights w ),
backpropagation will:
1. Perform a forward pass to compute the predicted output.
2. Calculate the loss (error) between the predicted and actual output.
3. Compute gradients of the loss with respect to each weight by applying the chain rule from the output layer back to the input layer.
4. Update the weights in the direction that reduces the loss (gradient descent).
This process continues iteratively until the model reaches a minimum loss or stops
improving.
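A tiny NumPy sketch of this loop for a single-weight linear model (made-up data): forward pass, loss, gradient via the chain rule, weight update:

python
import numpy as np

# Toy data generated from y = 3x (so the ideal weight is 3)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

w = 0.0          # single weight, no bias
eta = 0.01       # learning rate

for epoch in range(200):
    y_hat = w * X                               # forward pass
    loss = 0.5 * np.sum((y - y_hat) ** 2)       # L = 1/2 * sum (y - y_hat)^2
    grad = np.sum((y_hat - y) * X)              # dL/dw via the chain rule
    w = w - eta * grad                          # gradient descent update

print(round(w, 3), round(loss, 6))              # w approaches 3, loss approaches 0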
Summary:
Naïve Bayes is a probabilistic classifier based on Bayes' Theorem and the assumption of
conditional independence between features.
1. Training Phase:
The KNN algorithm does not explicitly learn from the training data. It simply stores
the entire training dataset, which is why it is called a lazy learner.
2. Prediction Phase (Classification):
1. Calculate the distance between the test point and all points in the training set using a distance metric (like Euclidean distance).
2. Sort the distances in ascending order.
3. Select the top k neighbors (the data points with the smallest distance).
4. Classify the test point by majority voting among the k nearest neighbors. The
class with the most votes is assigned to the test point.
For Regression:
The process is similar, but instead of voting for a class label, KNN calculates the average (or weighted average) of the output values (target variables) of the k nearest neighbors.
Steps in the KNN Algorithm:
1. Step 1: Choose the number of neighbors, k.
2. Step 2: Calculate the distance between the test point and all the points in the training dataset. Common distance metrics:
Euclidean Distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan Distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
3. Step 3: Sort the distances in ascending order and select the top k nearest neighbors.
4. Step 4: For classification, assign the class that is most frequent among the k nearest
neighbors.
5. Step 5: Return the predicted class label for the test data.
Training Data:
Height (cm) | Weight (kg) | Class
150 | 55 | Class A
160 | 60 | Class A
170 | 65 | Class B
180 | 70 | Class B
165 | 63 | Class A
175 | 68 | Class B
Test Point:
Height: 167 cm
Weight: 64 kg
1. Calculate the Euclidean Distance between the test point (167, 64) and each point in the
training set:
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
2. Compute Distances:
Height (cm) | Weight (kg) | Class | Distance from Test Point (167, 64)
150 | 55 | Class A | √(17² + 9²) = √370 ≈ 19.235
160 | 60 | Class A | √(7² + 4²) = √65 ≈ 8.062
170 | 65 | Class B | √(3² + 1²) = √10 ≈ 3.162
180 | 70 | Class B | √(13² + 6²) = √205 ≈ 14.318
165 | 63 | Class A | √(2² + 1²) = √5 ≈ 2.236
175 | 68 | Class B | √(8² + 4²) = √80 ≈ 8.944
3. Sort the distances in ascending order.
4. Select the top k = 3 nearest neighbors:
The 3 nearest neighbors are: Class A (2.236), Class B (3.162), and Class A (8.062).
5. Majority Voting:
Class A receives 2 votes and Class B receives 1 vote, so the test point (167, 64) is classified as Class A.
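The same worked example as a minimal scikit-learn sketch:

python
from sklearn.neighbors import KNeighborsClassifier

# Training data from the table above: [height (cm), weight (kg)]
X_train = [[150, 55], [160, 60], [170, 65], [180, 70], [165, 63], [175, 68]]
y_train = ["A", "A", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)

print(knn.predict([[167, 64]]))      # -> ['A'] by majority vote
print(knn.kneighbors([[167, 64]]))   # distances and indices of the 3 nearest neighbors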
Advantages of KNN:
1. Simplicity: Easy to understand and implement.
2. No Training Phase: KNN is a lazy learner, meaning it doesn't require a training phase.
Disadvantages of KNN:
1. Computationally Expensive: Requires calculating the distance to every point in the
training set for each test point, which can be slow for large datasets.
3. Sensitive to Irrelevant Features: Irrelevant features can distort the distance metric.
Conclusion:
The K-Nearest Neighbor algorithm is intuitive and effective for classification and regression
tasks, particularly when you have labeled data and want a simple model. However, for large
datasets or high-dimensional data, alternative algorithms like decision trees or support
vector machines may be more efficient.
libraries involved in its implementation with the help of its
mathematical suitability
YOLOv3 Algorithm: YOLOv3 is applied to detect human figures (or people) in the
input image. YOLOv3 provides both bounding boxes around detected objects
(people) and confidence scores, indicating the probability that the object is indeed a
person.
3. Distance Calculation:
Once the positions of all detected persons are known, the program calculates the
Euclidean distance between every pair of centroids using the formula:
$$D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
where (x1, y1) and (x2, y2) are the centroids of two people. The distance D represents the separation between the two detected people in the frame.
5. Real-time Monitoring:
Alerts or visual markers (like bounding boxes in red) are shown to highlight
violations.
Purpose: OpenCV is the primary library used for image and video processing. It
helps in loading images, reading video streams, and displaying the results.
Mathematical Suitability:
Used for drawing bounding boxes around detected objects (people) and
calculating Euclidean distances.
Example:
python
import cv2

# Load an image and draw a green bounding box with corners (x1, y1) and (x2, y2)
image = cv2.imread("image.jpg")
cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
Purpose: YOLOv3 is used for detecting objects in images and videos. It detects
people as one of the classes, and provides bounding boxes and class probabilities
for each detected object.
Mathematical Suitability:
YOLOv3 divides the image into a grid and predicts bounding boxes and class
probabilities for each grid cell. It uses a single convolutional neural network
(CNN) to predict the bounding boxes, class labels, and confidence scores.
The predicted bounding boxes are scaled back to the original image size for
accurate detection.
Example:
python
import cv2

# Load the pre-trained YOLOv3 network and collect the names of its output layers
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]
Mathematical Suitability:
Example:
python
import numpy as np

# Euclidean distance between two centroids (x1, y1) and (x2, y2)
distance = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
Purpose: TensorFlow or Keras can be used if you need to train or fine-tune the
YOLOv3 model on custom datasets. While pre-trained models are often used, you
may train the model to detect specific classes of interest.
Mathematical Suitability:
TensorFlow and Keras are deep learning frameworks that use backpropagation
and optimization techniques for training neural networks. They implement
convolutional layers, activation functions, and loss functions that are part of
YOLOv3.
Purpose: Matplotlib is used to visualize the results, such as displaying the image or
video with the bounding boxes and alerts for social distancing violations.
Mathematical Suitability:
Matplotlib is useful for plotting bounding boxes, the distances between people,
and other visual elements on the image to easily analyze results.
Example:
python
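# Illustrative sketch (assumed; the original example was not shown): display the
# annotated frame, where `image` is the frame with bounding boxes already drawn on it.
import cv2
import matplotlib.pyplot as plt

image = cv2.imread("image.jpg")                       # e.g., the frame loaded/annotated earlier
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))    # OpenCV uses BGR; Matplotlib expects RGB
plt.title("Social distancing check")
plt.axis("off")
plt.show()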
Purpose: Pandas can be used to store the results in a structured way, such as
tracking the distance violations over time or logging the people detected in a
particular frame.
Mathematical Suitability:
The Euclidean distance formula used in the program is suited for calculating the
straight-line distance between two centroids in a 2D space (image coordinates).
By using NumPy, this can be efficiently calculated for each pair of detected
individuals.
2. Bounding Box Analysis:
The bounding box is directly tied to the object detection process, and calculating the
centroid from the coordinates is mathematically simple and requires only basic
arithmetic.
3. Real-time Detection:
The algorithm ensures that bounding boxes and distances are calculated efficiently
on each frame.
Example Workflow:
1. Load YOLOv3 Model: Load the pre-trained YOLOv3 model and configuration files.
2. Read Video Stream: Capture frames from a video stream (e.g., from a camera or pre-
recorded video).
3. Detect People: Run YOLOv3 on each frame to obtain bounding boxes and centroids for the detected persons.
4. Calculate Distances: Calculate the Euclidean distance between each pair of detected persons.
5. Check Social Distancing: Flag any pair where the distance is less than the predefined
threshold (e.g., 1.5 meters).
6. Display Results: Show the frame with bounding boxes, flags, and alerts indicating
violations.
This approach is effective for real-time monitoring and can be used in surveillance systems
to enforce social distancing rules.
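A compact sketch of the distance-checking step in this workflow (hypothetical helper function, with assumed centroids and pixel threshold):

python
import numpy as np
from itertools import combinations

def find_violations(centroids, min_distance=100):
    """Return index pairs of people closer than `min_distance` (in pixels)."""
    violations = []
    for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(centroids), 2):
        d = np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)   # Euclidean distance between centroids
        if d < min_distance:
            violations.append((i, j))
    return violations

# Assumed centroids of detected people (pixel coordinates)
centroids = [(120, 300), (180, 310), (600, 320)]
print(find_violations(centroids))   # -> [(0, 1)]: the first two people are too close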
what is eda
Exploratory Data Analysis (EDA) is a critical initial step in the data analysis process where
data scientists and analysts explore the data, summarize its main characteristics, and
visualize its patterns, relationships, and anomalies before applying any statistical modeling
or machine learning algorithms. The purpose of EDA is to better understand the dataset,
uncover underlying patterns, detect outliers, and test assumptions that can inform the next
steps of data preparation and modeling.
Steps in EDA:
1. Data Collection:
The first step in EDA is to collect the dataset from various sources, such as databases, CSV files, APIs, or real-time data streams.
2. Data Cleaning:
3. Data Transformation:
4. Summary Statistics:
Compute basic statistics to understand the data distribution and central tendencies.
Descriptive Statistics: Mean, median, mode, standard deviation, variance, min, max,
skewness, and kurtosis.
Example: Finding the average age of customers, or the standard deviation of income
levels.
5. Visualization:
Multivariate Visualization: Exploring the relationship between multiple variables
using 3D scatter plots or heatmaps.
6. Outlier Detection:
Identify outliers that deviate significantly from the general trend of the data, which
could indicate errors, rare events, or interesting insights.
7. Correlation Analysis:
Examine relationships between variables, for example with a correlation matrix or heatmap.
8. Hypothesis Testing:
Test assumptions about the data using statistical tests, such as t-tests or chi-square tests, to validate or reject hypotheses.
9. Feature Engineering:
Purpose of EDA:
1. Uncover Patterns and Trends:
Identify hidden patterns, correlations, and trends that can inform your modeling
process.
2. Test Assumptions:
Check if your data meets the assumptions required by the modeling techniques,
such as normality, linearity, or homoscedasticity.
3. Prepare the Data:
EDA helps in transforming and cleaning the data to make it suitable for predictive modeling by identifying and addressing any data quality issues early on.
4. Feature Selection:
Select or engineer relevant features that will improve the model’s accuracy.
5. Outlier Detection:
Spot unusual observations early so they can be investigated or treated before modeling.
Techniques Used in EDA:
1. Univariate Analysis:
Categorical Variables: Use bar charts, pie charts, and frequency tables.
2. Bivariate Analysis:
Use scatter plots, correlation heatmaps, and pair plots to visualize and quantify
relationships.
3. Multivariate Analysis:
Use techniques like 3D scatter plots, parallel coordinate plots, and principal
component analysis (PCA) for dimensionality reduction.
4. Dimensionality Reduction:
If the dataset has many features, dimensionality reduction techniques such as PCA
(Principal Component Analysis) or t-SNE can help visualize and understand high-
dimensional data.
Common Tools and Libraries for EDA in Python:
1. Pandas:
Provides DataFrame structures and functions (e.g., describe(), info(), isnull()) for loading, cleaning, and summarizing tabular data.
2. NumPy:
Provides fast numerical arrays and vectorized operations that underpin most EDA computations.
3. Matplotlib and Seaborn:
Seaborn is built on top of Matplotlib and provides a high-level interface for drawing
attractive and informative statistical graphics.
4. Scipy:
5. Plotly:
A powerful library for interactive plots and visualizations, useful for creating
dashboards and web-based visualizations.
Example of EDA:
Let’s say you are working on a dataset of house prices. Here’s an outline of how EDA might
look for this dataset:
1. Data Cleaning: Remove or fill missing values (e.g., for missing prices or square footage).
2. Summary Statistics: Find the average, median, and standard deviation of house prices,
square footage, etc.
3. Visualization:
Use a histogram to visualize the distribution of house prices.
Create a scatter plot to examine the relationship between square footage and price.
4. Correlation Analysis: Check if there’s a strong correlation between variables like square
footage and price.
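A minimal pandas/Seaborn sketch of these steps (the file name house_prices.csv and the columns price and sqft are assumed for illustration):

python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names for a house-price dataset
df = pd.read_csv("house_prices.csv")

# 1. Data cleaning: fill missing square footage with the median value
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# 2. Summary statistics: mean, std, quartiles for numeric columns
print(df[["price", "sqft"]].describe())

# 3. Visualization: distribution of prices and price vs. square footage
sns.histplot(df["price"])
plt.show()
sns.scatterplot(x="sqft", y="price", data=df)
plt.show()

# 4. Correlation analysis
print(df[["price", "sqft"]].corr())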
Conclusion:
EDA is a critical process in any data science or machine learning project. It helps you
understand the dataset, identify important features, uncover patterns, detect anomalies, and
prepare the data for further analysis or modeling. Through visualization and statistical
analysis, EDA serves as a foundation for more sophisticated analysis or predictive modeling
techniques.