2494508-Machine_Learning_Module_Notes
4. Autonomous Vehicles:
● Example: Self-driving cars.
● How it works: Machine learning models process data from various sensors (such
as cameras and LiDAR) to make decisions, recognize objects, and navigate
safely. These models continuously adapt to changing road conditions and learn
from real-world driving experiences.
5. Fraud Detection in Finance:
● Example: Identifying fraudulent transactions in credit card systems.
● How it works: Machine learning algorithms analyze patterns in transaction data
to identify anomalies that may indicate fraudulent activity. Models can learn from
historical fraud cases and adapt to new types of fraud.
Machine learning can be categorized into different types based on the learning style,
or the way the algorithm learns from data. The three main types of machine learning
are supervised learning, unsupervised learning, and reinforcement learning.
1. Supervised Learning:
Definition:
Supervised learning involves training a model on a labeled dataset, where the input data
is paired with the corresponding target labels. The goal is for the model to learn the
mapping from inputs to outputs.
Key Characteristics:
● Requires labeled training data (input-output pairs).
● The model learns a mapping from inputs to known outputs.
● Common tasks include classification and regression.
Examples:
● Spam email detection (classification) and house price prediction (regression).
2. Unsupervised Learning:
Definition:
Unsupervised learning involves training a model on an unlabeled dataset, where the
algorithm must find patterns or relationships in the data without explicit guidance on
the desired outputs.
Key Characteristics:
● Works with unlabeled data; no target outputs are provided.
● The algorithm discovers structure, groupings, or patterns on its own.
● Common tasks include clustering and dimensionality reduction.
Examples:
● Customer segmentation with clustering and dimensionality reduction with PCA.
3. Reinforcement Learning:
Definition:
Reinforcement learning involves an agent learning to make decisions by interacting with
an environment. The agent receives feedback in the form of rewards or penalties based
on the actions it takes.
Key Characteristics:
● The algorithm learns through trial and error by interacting with an environment.
● The agent receives feedback in the form of rewards or penalties.
● The goal is for the agent to learn a policy that maximizes cumulative rewards
over time.
Examples:
● Game-playing agents and robotics control, where the agent improves its behavior from reward feedback.
Simple Linear Regression
Model Representation:
Y = b0 + b1 · X
where Y is the dependent variable, X is the independent variable, b0 is the intercept, and b1 is the slope (coefficient) of the line.
Scenario:
Consider a real estate scenario where you want to predict house prices based on the
size of the house (in square feet). Here, house price (Y) is the dependent variable, and
house size (X) is the independent variable.
Dataset:
House Size (sq ft)    Price ($)
1600                  300,000
1700                  320,000
1875                  370,000
...                   ...
Objective:
Predict the price of a house (dependent variable) based on its size (independent
variable).
Steps:
Initialize Parameters:
● Initialize b0 and b1 with some values (these will be adjusted during
training).
Calculate Predictions:
● Use the current values of b0 and b1 to make predictions for each house
size.
Calculate Error:
● Compare the predicted prices with the actual prices and calculate the
error (the difference between predicted and actual values).
Update Parameters:
● Adjust b0 and b1 to minimize the error. This is typically done using
optimization algorithms like gradient descent.
Repeat:
● Repeat the prediction, error-calculation, and parameter-update steps until the model provides accurate predictions (a small sketch follows below).
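To make these steps concrete, here is a minimal gradient-descent sketch in Python. It is not part of the original notes: the data values come from the table above, and the learning rate and iteration count are illustrative choices.
import numpy as np

# Illustrative data from the table above: size (sq ft) and price ($)
X = np.array([1600.0, 1700.0, 1875.0])
y = np.array([300_000.0, 320_000.0, 370_000.0])

# Scale the feature so a simple fixed learning rate converges
X_scaled = (X - X.mean()) / X.std()

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    y_pred = b0 + b1 * X_scaled            # calculate predictions
    error = y_pred - y                     # calculate error
    b0 -= lr * error.mean()                # update parameters (gradient descent)
    b1 -= lr * (error * X_scaled).mean()

print("b0 =", b0, "b1 =", b1)  # intercept and slope in the scaled-feature space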
Outcome:
After training the model, you obtain the values of b0 and b1 that best fit the data. The
final linear equation can be used to predict house prices based on house size.
Final Model:
Price = b0 + b1 · Size
In this example, the Simple Linear Regression algorithm helps in establishing a linear
relationship between house size and price, enabling predictions for new houses based
on their size. The goal is to find the best-fitting line that minimizes prediction errors.
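For comparison, the same model can be fitted directly with scikit-learn. This is a minimal sketch; the three data points are taken from the table above and the 1800 sq ft query is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# House sizes (sq ft) and prices ($) from the table above
X = np.array([[1600], [1700], [1875]])
y = np.array([300_000, 320_000, 370_000])

model = LinearRegression().fit(X, y)
print("b0 (intercept):", model.intercept_)
print("b1 (slope)    :", model.coef_[0])

# Predict the price of a new 1800 sq ft house
print("Predicted price:", model.predict([[1800]])[0])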
1. R-squared (R²):
Definition:
R-squared measures the proportion of the variance in the dependent variable that is
predictable from the independent variables. It ranges from 0 to 1, where 0 indicates that
the model does not explain any variability, and 1 indicates that the model explains all
the variability.
Interpretation:
● R² = 1: The model perfectly explains the variability in the dependent variable.
● R² = 0: The model does not explain any variability.
Formula:
R² = 1 − (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.
2. Adjusted R-squared:
Definition:
Adjusted R-squared is a modified version of R-squared that adjusts for the number of
independent variables in the model. It penalizes the inclusion of unnecessary variables
that do not contribute significantly to the explanation of variability.
Formula:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
● n: Number of observations.
● k: Number of independent variables.
3. Mean Squared Error (MSE):
Definition:
Mean Squared Error measures the average of the squared differences between
predicted and actual values. It is sensitive to outliers since it squares the errors.
Formula:
MSE = (1/n) Σ (y_i − ŷ_i)²
4. Root Mean Squared Error (RMSE):
Definition:
Root Mean Squared Error is the square root of the MSE, which puts the error back in the same units as the dependent variable.
Formula:
RMSE = √MSE
5. Mean Absolute Error (MAE):
Definition:
Mean Absolute Error measures the average of the absolute differences between predicted and actual values and is less sensitive to outliers than MSE.
Formula:
MAE = (1/n) Σ |y_i − ŷ_i|
Comparison:
● MSE and RMSE: Both measure the average squared difference between
predicted and actual values, but RMSE is more interpretable as it is in the same
unit as the dependent variable.
● MAE: Represents the average absolute difference, providing a clearer picture of
the average magnitude of errors.
These metrics help assess the accuracy and goodness of fit of regression models. It's
common to use a combination of these metrics to get a comprehensive understanding
of the model's performance. Lower values of MSE, MAE, and RMSE, and higher values of
R-squared and Adjusted R-squared are generally desirable.
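As a quick illustration, these metrics can be computed with scikit-learn. The sketch below uses made-up actual and predicted house prices purely for demonstration.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted house prices
y_true = np.array([300_000, 320_000, 370_000, 410_000])
y_pred = np.array([305_000, 318_000, 365_000, 400_000])

mse = mean_squared_error(y_true, y_pred)
print("R-squared:", r2_score(y_true, y_pred))
print("MSE      :", mse)
print("RMSE     :", np.sqrt(mse))
print("MAE      :", mean_absolute_error(y_true, y_pred))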
Logistic Regression
Logistic Regression is a statistical method used for binary classification problems,
where the dependent variable is categorical and has two classes (usually coded as 0
and 1). Despite its name, logistic regression is primarily used for classification rather
than regression.
Key Concepts:
Log Odds:
● The logistic regression equation models the log-odds (logit function) of
the probability. The odds ratio represents the odds of the event occurring
compared to the odds of it not occurring.
Sigmoid Function:
● The sigmoid function is used to transform the log-odds into a probability
value between 0 and 1.
P(Y=1) = 1 / (1 + e^(−(log-odds)))
Decision Boundary:
● The decision boundary is the threshold at which we classify instances into
one of the two classes. By default, this threshold is set at 0.5, but it can be
adjusted based on the specific requirements of the problem.
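A small sketch of the sigmoid transform and the default 0.5 decision boundary; the log-odds values below are made up for illustration.
import numpy as np

def sigmoid(z):
    # Transform log-odds into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

log_odds = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
probabilities = sigmoid(log_odds)
predicted_class = (probabilities >= 0.5).astype(int)  # default threshold of 0.5

print(probabilities)
print(predicted_class)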
Training Procedure:
The model coefficients are typically estimated by maximum likelihood, in practice by minimizing the log-loss (cross-entropy) with an optimization algorithm such as gradient descent.
Evaluation:
Logistic regression models are evaluated using metrics such as accuracy, precision,
recall, F1 score, and the area under the Receiver Operating Characteristic (ROC) curve.
Logistic regression is widely used in various domains, including finance, healthcare, and many other fields.
When evaluating the performance of a Logistic Regression model, several metrics are
commonly used to assess its effectiveness in binary classification tasks. Here are
some key performance metrics for Logistic Regression:
1. Accuracy:
Definition:
Accuracy measures the overall correctness of the model by calculating the ratio of
correctly predicted instances to the total number of instances.
Accuracy is a useful metric when the classes are balanced. However, in imbalanced
datasets, where one class significantly outnumbers the other, accuracy alone may not
provide a complete picture of the model's performance.
2. Precision:
Definition:
Precision measures the accuracy of positive predictions. It calculates the ratio of true
positive predictions to the total number of positive predictions made by the model.
Use Case:
Precision is important when the cost of false positives is high. For example, in a spam
email classification task, high precision ensures that legitimate emails are not
incorrectly marked as spam.
3. Recall (Sensitivity):
Definition:
Recall measures the model's ability to find all actual positive instances. It calculates the ratio of true positive predictions to the total number of actual positives.
Use Case:
Recall is crucial when the cost of false negatives is high. In medical diagnosis, for instance, high recall ensures that actual positive cases are not missed.
4. F1 Score:
Definition:
The F1 score is the harmonic mean of precision and recall. It provides a balanced
measure of a model's performance.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Use Case:
The F1 score is useful when you need a balance between precision and recall, particularly on imbalanced datasets.
5. ROC Curve and AUC:
Definition:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate at different classification thresholds, and the Area Under the Curve (AUC) summarizes this curve in a single value.
Use Case:
ROC curves and AUC are beneficial for assessing how well the model separates classes
across different threshold values. A model with a higher AUC is generally considered
better at distinguishing between positive and negative instances.
6. Confusion Matrix:
A confusion matrix provides a detailed breakdown of the model's predictions, including
true positives, true negatives, false positives, and false negatives. It is a helpful tool for
understanding where the model may be making errors.
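These metrics are all available in scikit-learn; the sketch below uses small made-up label and probability arrays purely for illustration.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical true labels, predicted labels, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))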
CROSS VALIDATION
Cross-validation is a statistical technique used to evaluate the performance and
generalizability of a machine learning model. It involves dividing the dataset into
multiple subsets, training the model on some of these subsets, and then testing it on
the remaining subsets. Cross-validation helps assess how well a model will generalize
to an independent dataset by simulating its performance on different subsets of the
data.
Types of Cross-Validation:
K-Fold Cross-Validation:
● Procedure:
● Divide the dataset into K equally sized folds.
● Train the model K times, each time using K-1 folds for training and
the remaining fold for testing.
● Calculate the average performance across all K iterations.
● Advantages:
● Provides a robust estimate of the model's performance.
● Reduces the impact of dataset variability.
● Considerations:
● Computationally more expensive than a single train-test split.
Stratified K-Fold Cross-Validation:
● Procedure:
● Similar to K-Fold, but ensures that each fold maintains the same
class distribution as the original dataset.
● Advantages:
● Particularly useful when dealing with imbalanced datasets with
unequal class proportions.
● Helps ensure that each fold is representative of the overall class
distribution.
Leave-One-Out Cross-Validation (LOOCV):
● Procedure:
● Each observation serves as a test set exactly once, with the rest of
the data used for training.
● Advantages:
● Provides a comprehensive evaluation, especially for small datasets.
● Each model is trained on nearly all available data.
● Considerations:
● Computationally expensive for large datasets.
● Can be sensitive to outliers.
Leave-P-Out Cross-Validation:
● Procedure:
● Similar to LOOCV but leaves out P observations for testing.
● Generalizes LOOCV by allowing the exclusion of more than one
observation at a time.
● Advantages:
● Less computationally intensive than LOOCV but more robust than
K-Fold for small datasets.
Time Series Cross-Validation:
● Procedure:
● Splits the dataset into training and test sets in a way that respects
the temporal order of observations.
● Successive time periods are used for training, and the model is
tested on the most recent period.
● Advantages:
● Suitable for time series data where past observations influence
future observations.
● Helps evaluate how well the model generalizes to unseen future
data.
Repeated K-Fold Cross-Validation:
● Procedure:
● Repeats K-Fold cross-validation multiple times with different
random splits.
● Advantages:
● Reduces the variability in the evaluation by averaging results across
multiple runs.
● Considerations:
● May be computationally expensive.
Each type of cross-validation has its advantages and is suitable for different scenarios.
The choice depends on factors such as dataset size, distribution, and the specific goals
of the analysis. Cross-validation is a crucial step in model evaluation, helping to ensure
that the performance metrics are representative and robust.
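A minimal scikit-learn sketch comparing plain and stratified K-Fold cross-validation; the Iris dataset and logistic regression model are chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV; the stratified variant preserves class proportions in every fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("K-Fold scores           :", cross_val_score(model, X, y, cv=kf))
print("Stratified K-Fold scores:", cross_val_score(model, X, y, cv=skf))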
Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the combination of model settings that are not learned from the data (such as tree depth or the number of neighbors) that gives the best validation performance. It is commonly done with grid search or random search combined with cross-validation.
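A short sketch of hyperparameter tuning with grid search and cross-validation, using a Decision Tree (introduced in the next section) as the model; the parameter grid and dataset are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, scoring each with 5-fold cross-validation
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score  :", search.best_score_)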
Decision Trees
Decision Trees are versatile and widely used machine learning algorithms that can
be applied to both classification and regression tasks. They work by recursively
partitioning the dataset into subsets based on the values of input features, making
decisions at each node of the tree.
Decision Tree Classifier:
How it Works:
Splitting:
● At each node of the tree, the algorithm selects the feature that best splits
the data into subsets.
● The split is determined based on a criterion such as Gini impurity, entropy,
or information gain.
Recursive Partitioning:
● The process is repeated recursively for each subset, creating a tree
structure.
● The tree continues to grow until a stopping criterion is met, such as
reaching a maximum depth or a minimum number of samples per leaf.
Leaf Nodes:
● The final nodes, called leaf nodes, represent the predicted class.
Example:
Consider a dataset of emails labeled as spam or not spam. The Decision Tree could
learn rules such as "If the email contains the word 'free' and is from an unknown sender,
classify as spam."
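A minimal scikit-learn sketch of a Decision Tree classifier; the Iris dataset stands in for the spam example, since the notes' email data is not available.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Limit the depth to keep the tree small and interpretable
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf))  # the learned if/else splitting rules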
Decision Tree Regressor:
How it Works:
Splitting:
● Similar to the classifier, the algorithm selects the feature that best splits
the data based on a criterion like mean squared error or mean absolute
error.
Recursive Partitioning:
● The tree grows by recursively splitting the data into subsets based on the
selected features.
● The process continues until a stopping criterion is met.
Leaf Nodes:
● The final nodes (leaf nodes) represent the predicted continuous value.
Example:
Consider a dataset of houses with features like size, number of bedrooms, and location.
The Decision Tree Regressor could learn rules such as "If the house size is less than
1500 square feet and it has 2 bedrooms, predict the price as $200,000."
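A small sketch of a Decision Tree regressor on made-up house data; the feature values and prices are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative houses: [size (sq ft), bedrooms] -> price ($)
X = np.array([[1400, 2], [1600, 3], [1800, 3], [2200, 4], [2600, 4]])
y = np.array([200_000, 300_000, 330_000, 410_000, 480_000])

reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Predicted price for a new 1500 sq ft, 2-bedroom house
print(reg.predict([[1500, 2]]))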
Key Characteristics:
Interpretability:
● Decision Trees are inherently interpretable, and the learned rules can be
easily understood.
Handling Nonlinear Relationships:
● Decision Trees can capture complex relationships in the data, including
nonlinear patterns.
Overfitting:
● Without proper constraints, Decision Trees can be prone to overfitting,
capturing noise in the training data.
Ensemble Methods:
● Decision Trees are often used as building blocks in ensemble methods
like Random Forests and Gradient Boosted Trees to improve predictive
performance.
Categorical and Numerical Features:
● Decision Trees can handle both categorical and numerical features.
Decision Trees are versatile and powerful, but it's essential to tune hyperparameters or
use ensemble methods to mitigate overfitting. They are widely used in various domains,
including finance, healthcare, and natural language processing.
Ensemble Techniques
1. Bagging (Bootstrap Aggregating):
Objective:
Bagging aims to reduce variance and overfitting by training multiple base models on different bootstrap samples of the data and aggregating their predictions.
How it Works:
Bootstrap Sampling:
● Create multiple subsets of the training dataset by randomly sampling with
replacement (bootstrap sampling).
Model Training:
● Train a separate base model (e.g., Decision Tree) on each subset.
Aggregation:
● Combine the predictions of individual models through averaging (for
regression) or voting (for classification).
Random Forest:
● Random Forest is the best-known example of bagging, combining bootstrapped samples with random feature selection (covered in detail below).
Advantages:
● Reduces variance and overfitting compared to a single base model.
● Individual models can be trained independently, even in parallel.
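A minimal bagging sketch using scikit-learn's BaggingClassifier, whose default base model is a decision tree; the breast cancer dataset is used only for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 50 decision trees, each trained on a bootstrap sample of the training data
bagging = BaggingClassifier(n_estimators=50, random_state=42)
print("Mean CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())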
2. Boosting:
Objective:
Boosting aims to improve model accuracy and focus on instances that previous models
find challenging.
How it Works:
Sequential Training:
● Train a series of base models sequentially.
● Emphasize the training of instances that the previous models found
difficult.
Weighted Voting:
● Assign weights to each model's prediction based on its accuracy.
● Models that perform well receive higher weights during the aggregation.
Popular Boosting Algorithms:
● AdaBoost (Adaptive Boosting): Re-weights training instances so that later learners focus on examples earlier learners misclassified.
● Gradient Boosting: Another widely used boosting algorithm that builds trees sequentially and corrects errors made by previous trees.
Advantages:
Ensemble techniques are widely used in practice and have led to the development of
sophisticated algorithms like Random Forest, Gradient Boosting Machines (GBM),
XGBoost, and LightGBM. They are effective across various types of machine learning
tasks, including classification, regression, and anomaly detection.
Random Forest
It is an ensemble learning algorithm that belongs to the bagging family. It is primarily
used for classification and regression tasks. The Random Forest algorithm builds
multiple decision trees during training and combines their predictions to provide a more
accurate and robust result. The name "Random Forest" comes from the use of
randomness in both the construction of individual trees and the aggregation of their
predictions.
How it Works:
Bootstrapped Sampling:
● Create multiple random subsets of the training dataset by sampling with
replacement (bootstrapped samples).
● Each subset is used to train a decision tree.
Random Feature Selection:
● At each node of a decision tree, a random subset of features is considered
for splitting. This introduces additional diversity among the trees.
Tree Construction:
● Grow each decision tree until a stopping criterion is met (e.g., maximum
depth, minimum samples per leaf).
Voting:
● For classification tasks, each tree in the forest "votes" for a class.
● The class with the most votes becomes the final prediction.
Advantages:
Example:
Consider a dataset of emails labeled as spam or not spam. The Random Forest
algorithm creates a forest of decision trees, each trained on a different subset of emails
and considering a random subset of features for splitting. When a new email needs to
be classified, each tree votes on whether it is spam or not, and the majority vote determines the final prediction. A minimal scikit-learn sketch (assuming X_train, X_test, y_train, y_test come from a prior train/test split):
from sklearn.ensemble import RandomForestClassifier

# Create the classifier, then fit it on the training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
predictions = rf_classifier.predict(X_test)
Random Forest is a powerful and widely used algorithm, known for its versatility and
ability to handle various types of data. It is suitable for both small and large datasets
and is commonly employed in practice for tasks such as image classification, fraud
detection, and bioinformatics.
Boosting
One of the most popular boosting algorithms is AdaBoost (Adaptive Boosting), and
another widely used algorithm is Gradient Boosting. Let's explore the basic concepts of
boosting:
AdaBoost (Adaptive Boosting):
How it Works:
Initialization:
● Assign equal weights to all training instances.
Sequential Training:
● Train a weak learner (e.g., a decision tree) on the training data, and
calculate the error.
● Increase the weights of misclassified instances, making them more
important for the next model.
● Repeat this process for a specified number of iterations (or until a certain
level of accuracy is achieved).
Weighted Voting:
● Combine the predictions of each weak learner using weighted voting.
● Assign higher weights to models with lower error rates.
● The final model is a weighted sum of the weak learners.
Advantages:
Gradient Boosting:
How it Works:
Initialization:
● Initialize the model with a constant value (e.g., the mean of the target
variable).
Sequential Training:
● Train a weak learner (e.g., a decision tree) to predict the residuals (the
differences between the true and predicted values).
● Update the model by adding the predictions from the new weak learner
multiplied by a learning rate.
● Repeat this process for a specified number of iterations.
Combination:
● The final model is the sum of the initial model and the contributions of
each weak learner.
Advantages:
Hyperparameters:
● Number of Estimators: The number of weak learners to train.
● Learning Rate: Shrinks the contribution of each weak learner.
● Maximum Depth of Trees: The maximum depth of each decision tree.
● Subsample: The fraction of training instances used to train each weak learner.
Both AdaBoost and Gradient Boosting are powerful algorithms that are widely used in
practice. They are effective in a variety of tasks, including classification, regression, and
ranking. The choice between the two often depends on the characteristics of the data
and the specific goals of the problem.
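A brief side-by-side sketch of both boosting algorithms in scikit-learn; the dataset and hyperparameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, subsample=0.8, random_state=42)

print("AdaBoost mean CV accuracy         :", cross_val_score(ada, X, y, cv=5).mean())
print("Gradient Boosting mean CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())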
K-Nearest Neighbors (KNN)
It is a simple and intuitive supervised machine learning algorithm used for both
classification and regression tasks. The fundamental idea behind KNN is to predict the
class or value of a new data point based on the majority class or average of its k-
nearest neighbors in the feature space. In other words, KNN makes predictions based
on the similarity of instances in the input space.
KNN for Classification (KNN Classifier):
Training:
● Store all training instances in the feature space.
Prediction:
● Given a new instance, find the k-nearest neighbors based on a distance
metric (commonly Euclidean distance).
● Assign the class label that is most frequent among the k-nearest
neighbors.
Hyperparameter:
● K: The number of nearest neighbors used in the vote.
Example:
Consider a dataset of points in a 2D plane, where each point is labeled as either red or
blue. To predict the label of a new point, KNN would find the k-nearest neighbors
(nearest points in terms of distance) and assign the label based on the majority class
among those neighbors.
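A minimal KNN classification sketch with feature scaling; the dataset and the choice of K = 5 are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so they contribute equally to the distance calculation
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

print("Test accuracy:", knn.score(scaler.transform(X_test), y_test))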
KNN for Regression (KNN Regressor):
Training:
● Store all training instances in the feature space along with their
corresponding target values.
Prediction:
● Given a new instance, find the k-nearest neighbors based on a distance
metric.
● Assign the predicted value as the average (or weighted average) of the
target values of the k-nearest neighbors.
Hyperparameter:
● K: The number of nearest neighbors whose target values are averaged.
Example:
In a regression scenario, consider a dataset of houses with features like size, number of
bedrooms, and target values representing the prices. To predict the price of a new
house, KNN would find the k-nearest neighbors and assign the predicted price as the
average of the prices of those neighbors.
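A small KNN regression sketch on made-up house data; the feature values, prices, and K = 3 are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative houses: [size (sq ft), bedrooms] -> price ($)
X = np.array([[1400, 2], [1600, 3], [1800, 3], [2200, 4], [2600, 4]])
y = np.array([200_000, 300_000, 330_000, 410_000, 480_000])

knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Predicted price = average price of the 3 nearest houses
print(knn_reg.predict([[1700, 3]]))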
Distance Metric:
The choice of the distance metric is crucial in KNN, and it depends on the nature of the
data. Common distance metrics include:
● Euclidean distance
● Manhattan distance
● Minkowski distance
Considerations:
Scaling:
● Feature scaling is often important to ensure that all features contribute
equally to the distance calculation.
Choice of K:
● The value of K affects the smoothness of the decision boundary. A
smaller K may lead to a more complex and potentially noisy decision
boundary, while a larger K may result in a smoother decision boundary but
may be too influenced by distant points.
Computational Complexity:
● As the number of training instances grows, the computational cost of
finding the nearest neighbors increases.
High-Dimensional Data:
● KNN may be less effective in high-dimensional spaces due to the "curse of
dimensionality."
Local Patterns:
● KNN tends to work well when the decision boundary is highly irregular and
when the decision is based on local patterns in the data.
KNN is a non-parametric and instance-based learning algorithm, meaning it doesn't
make assumptions about the underlying data distribution and retains all training
instances for prediction. It is a versatile algorithm but may not perform well in certain
scenarios, especially with large datasets or high-dimensional data.
Support Vector Machine (SVM)
It is a powerful supervised machine learning algorithm that can be used for both
classification and regression tasks. SVM is particularly effective in high-dimensional
spaces and is widely used in various domains, including image classification, text
classification, and bioinformatics. The primary objective of SVM is to find a hyperplane
that best separates the data into different classes while maximizing the margin
between the classes.
1. Linear SVM:
Objective:
For a binary classification problem, Linear SVM aims to find a hyperplane that separates
instances of one class from instances of the other class, with the maximum margin.
How it Works:
Hyperplane:
● A hyperplane is a decision boundary that divides the feature space into
two classes.
● In a two-dimensional space, the hyperplane is a line; in three dimensions,
it's a plane, and so on.
Margin:
● The margin is the distance between the hyperplane and the nearest data
point from either class.
● SVM seeks to maximize this margin, providing a robust decision boundary.
Support Vectors:
● Support vectors are the data points that lie closest to the hyperplane and
are crucial in defining the margin.
● These instances influence the position and orientation of the hyperplane.
Mathematical Formulation:
y = sign(w · x + b)
Hyperparameter:
● C: Regularization parameter that controls the trade-off between a wider margin and fewer misclassifications on the training data.
2. Non-Linear SVM (Kernel SVM):
Objective:
When the data is not linearly separable, SVM can use a kernel trick to map the original
feature space into a higher-dimensional space, making it possible to find a hyperplane
that separates the classes.
Kernel Trick:
● A kernel function computes the inner product of two data points in the higher-dimensional space implicitly, without explicitly performing the transformation.
Common Kernels:
● Linear Kernel: K(x, x′) = x · x′
● Polynomial Kernel: K(x, x′) = (x · x′ + c)^d
● Radial Basis Function (RBF) Kernel: K(x, x′) = exp(−‖x − x′‖² / (2σ²))
Hyperparameters for Non-Linear SVM:
● C: Regularization parameter (as in the linear case).
● Kernel parameters: for example, the degree d for the polynomial kernel and σ (or gamma) for the RBF kernel.
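A minimal non-linear SVM sketch with an RBF kernel and feature scaling; the dataset and hyperparameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standardize features (SVM is scale-sensitive), then fit an RBF-kernel SVM
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm.fit(X_train, y_train)

print("Test accuracy:", svm.score(X_test, y_test))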
3. Support Vector Regression (SVR):
Objective:
SVR aims to find a hyperplane that best fits the data within a certain margin (epsilon-
tube) around the predicted values.
How it Works:
Epsilon-Tube:
● Defines a margin within which deviations from the predicted values are
tolerated.
● Instances outside this margin contribute to the loss function.
Mathematical Formulation:
y = w · x + b
● y: Predicted value.
● w: Weight vector.
● x: Input features.
● b: Bias term.
● ϵ: Deviation tolerance.
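A brief SVR sketch on synthetic one-dimensional data; the data, kernel, and epsilon value are illustrative.
import numpy as np
from sklearn.svm import SVR

# Synthetic data: y is roughly 2x plus a little noise
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=50)

# epsilon defines the tube within which deviations are not penalized
svr = SVR(kernel='linear', C=1.0, epsilon=0.2).fit(X, y)
print(svr.predict([[5.0]]))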
Considerations:
Sensitivity to Feature Scaling:
● SVM is sensitive to the scale of input features, so it's essential to
standardize or normalize the data.
Choice of Kernel:
● The choice of the kernel function and its parameters can significantly
impact SVM performance.
Interpretability:
● SVM can be less interpretable than simpler models due to the complexity
of the hyperplane.
Computational Complexity:
● Training an SVM can be computationally intensive, especially with large
datasets.
SVM is a versatile and powerful algorithm, and its effectiveness often depends on the
characteristics of the data and the choice of hyperparameters. It is widely used in
practice and has been extended to handle multi-class classification and other complex
tasks.
Principal Component Analysis (PCA)
PCA is an unsupervised dimensionality-reduction technique widely used in exploratory data
analysis. The primary goal of PCA is to transform high-dimensional data into a new
lower-dimensional space while retaining as much of the original variability as
possible. This is achieved by identifying and capturing the principal components of the
data.
Key Concepts:
Principal Components:
● Principal components are linear combinations of the original features that
capture the most significant variability in the data.
● The first principal component (PC1) corresponds to the direction of
maximum variance, and subsequent components (PC2, PC3, etc.) capture
orthogonal directions of decreasing variance.
Variance and Information Preservation:
● PCA aims to retain the most important information in the data by
maximizing the variance along the principal components.
● The cumulative explained variance is often used to assess how much of
the total variance in the data is retained by including a certain number of
principal components.
Orthogonality:
● Principal components are orthogonal to each other, meaning they are
uncorrelated.
● This orthogonality property allows PCA to decorrelate the features and
reduce multicollinearity in the data.
Steps in PCA:
Standardization:
● Standardize the data by centering the features around their mean and
scaling them to have unit variance. This step is crucial to ensure that all
features contribute equally to the PCA.
Covariance Matrix:
● Compute the covariance matrix of the standardized data. The covariance
matrix provides information about the relationships between pairs of
features.
Eigendecomposition:
● Perform eigendecomposition on the covariance matrix to obtain the
eigenvectors and eigenvalues.
● Eigenvectors represent the directions (principal components), and
eigenvalues represent the magnitude of the variance along those
directions.
Select Principal Components:
● Sort the eigenvectors based on their corresponding eigenvalues in
descending order.
● Choose the top k eigenvectors to form the new subspace (where k is the desired dimensionality of the reduced data).
Projection:
● Project the original data onto the selected principal components to obtain
the reduced-dimensional representation.
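The same steps are bundled in scikit-learn's PCA; a minimal sketch follows, where the Iris dataset and the choice of 2 components are illustrative.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize, then project the 4-dimensional data onto 2 principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced data shape      :", X_reduced.shape)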
Applications of PCA:
Dimensionality Reduction:
● PCA is commonly used to reduce the dimensionality of datasets with a
large number of features, making subsequent analyses more manageable.
Noise Reduction:
● By focusing on the principal components with the highest variance, PCA
helps filter out noise and retain the essential information.
Visualization:
● PCA can be used to visualize high-dimensional data in a lower-
dimensional space (e.g., 2D or 3D), making it easier to interpret and
understand.
Feature Engineering:
● In some cases, the principal components themselves can be used as new
features that capture essential patterns in the data.
Limitations:
Linearity Assumption:
● PCA assumes linear relationships between features, which may limit its
effectiveness in capturing complex nonlinear patterns.
Interpretability:
● The principal components may not always have a clear interpretation in
terms of the original features.
Sensitivity to Outliers:
● PCA is sensitive to outliers, and extreme values can influence the principal
components.
PCA is a valuable tool for exploratory data analysis, feature extraction, and dimensionality reduction.
K-Means Clustering
K-Means clustering is an unsupervised learning algorithm that partitions a dataset
into K distinct, non-overlapping subsets (clusters). The algorithm aims to group similar
data points together and assign them to the same cluster while minimizing the within-
cluster variance. Each cluster is represented by its centroid, which is the mean of the
data points assigned to it.
How it Works:
● Initialization: Choose K initial centroids (e.g., randomly or with k-means++ initialization).
● Assignment: Assign each data point to the cluster whose centroid is closest. Distance is typically measured with the Euclidean distance.
● Update: Recompute each centroid as the mean of its assigned points, and repeat the assignment and update steps until the centroids stop changing.
● Objective: The algorithm minimizes the sum of squared distances between each data point and its assigned cluster centroid. This is also known as the inertia (within-cluster sum of squares).
Outliers:
● K-Means is sensitive to outliers, which can pull centroids away from the bulk of the data.
Choosing K:
The number of clusters K must be chosen before running the algorithm. Common methods for selecting K include the elbow method, silhouette score, and cross-validation.
● Elbow Method:
● Plot inertia against K and look for the "elbow" point where the rate of decrease in inertia slows down.
● Silhouette Score:
● Measures how similar each point is to its own cluster compared with other clusters; higher average scores indicate better-separated clusters.
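A small sketch of the elbow method on synthetic data; the data generation mirrors the clustering example below, and the range of K values is illustrative.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

# Fit K-Means for several values of K and plot the inertia to find the "elbow"
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)]
plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.show()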
Example (a minimal sketch; the synthetic data and KMeans setup shown here are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

np.random.seed(42)
# Synthetic data: three well-separated 2D clusters
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Plot the points colored by cluster and mark each centroid with a red "X"
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
In this example, K-Means is applied to synthetic data with three well-separated clusters.
The algorithm correctly identifies and assigns data points to the clusters. The red "X"
markers indicate the cluster centroids.