Module3 DS PPT
Modeling, What Is Machine Learning?, Overfitting and Underfitting, Correctness, The Bias-
Variance Tradeoff, Feature Extraction and Selection, k-Nearest Neighbors, The Model,
Example: The Iris Dataset, The Curse of Dimensionality, Naive Bayes, A Really Dumb Spam
Filter, A More Sophisticated Spam Filter, Implementation, Testing Our Model, Using Our
Model, Simple Linear Regression, The Model, Using Gradient Descent, Maximum
Likelihood Estimation, Multiple Regression, The Model, Further Assumptions of the Least
Squares Model, Fitting the Model, Interpreting the Model, Goodness of Fit, Digression: The
Bootstrap, Standard Errors of Regression Coefficients, Regularization, Logistic Regression,
The Problem, The Logistic Function, Applying the Model, Goodness of Fit, Support Vector
Machines.
Text Book : Chapters 11, 12, 13, 14, 15 and 16
Modeling
What is a model? It’s simply a specification of a mathematical (or probabilistic) relationship
that exists between different variables.
Examples
• The business model: takes inputs like “number of users,” “ad revenue per user,” and
“number of employees” and outputs your annual profit for the next several years.
The business model is probably based on simple mathematical relationships: profit is
revenue minus expenses, revenue is units sold times average price, and so on.
• The recipe model: This model relates inputs like “number of eaters” and “hungriness” to
quantities of ingredients needed. It is probably based on trial and error: someone went into a
kitchen and tried different combinations of ingredients until they found one they liked.
• The poker model: Here each player’s “win probability” is estimated in real time based on a
model that takes into account the cards that have been revealed so far and the distribution
of cards in the deck. It is based on probability theory, the rules of poker, and some
reasonably innocuous assumptions about the random process by which cards are dealt.
• A model is an explicit description of patterns within the data in the form of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
• The key to this definition is that the system should learn by itself, without explicit
programming.
• Everyone has her own exact definition, but we’ll use machine learning to refer to creating and
using models that are learned from data.
• In other contexts this might be called predictive modeling or data mining, but we will stick
with machine learning.
• Typically, our goal will be to use existing data to develop models that we can use to predict
various outcomes for new data, such as:
• Whether an email message is spam or not
• Whether a credit card transaction is fraudulent
• Which advertisement a shopper is most likely to click on
• Which cricket team is going to win IPL 2025
• Supervised models (in which there is a set of data labeled with the correct answers to learn
from)
• Unsupervised models (in which there are no such labels). There are various other types, like:
• Semi-supervised (in which only some of the data are labeled)
• Online (in which the model needs to continuously adjust to newly arriving data)
• Reinforcement (in which, after making a series of predictions, the model gets a signal indicating
how well it did)
• Supervised learning uses labelled data
• Unsupervised learning uses unlabeled data
• Semi-supervised algorithms use unlabelled data by assigning a pseudo-label. Then, the labelled
and pseudo-labelled dataset can be combined.
• Reinforcement learning mimics human learning. Just as human beings use their eyes and ears to
perceive the world and then take actions, reinforcement learning allows an agent to interact with the
environment and receive rewards for its actions.
Overfitting and Underfitting
Overfitting:
• A common danger in machine learning is overfitting.
• Producing a model that performs well on the data you train it on but generalizes poorly to
any new data.
• This could involve learning noise in the data. Or it could involve learning to identify specific
inputs rather than whatever factors are actually predictive for the desired output.
Underfitting:
• Producing a model that doesn’t perform well even on the training data, although typically
when this happens you decide your model isn’t good enough and keep looking for a better
one.
Correctness
Imagine building a model to make a binary judgment. Is this email spam? Should we
hire this candidate? Is this air traveler secretly a terrorist?
Given a set of labeled data and such a predictive model, every data point lies in one of
four categories:
True positive
“This message is spam, and we correctly predicted spam.”
False positive (Type 1 error)
“This message is not spam, but we predicted spam.”
False negative (Type 2 error)
“This message is spam, but we predicted not spam.”
True negative
“This message is not spam, and we correctly predicted not spam.”
We often represent these as counts in a confusion matrix:
                    Spam                Not spam
Predict spam        True positive       False positive
Predict not spam    False negative      True negative
These days approximately 5 babies out of 1,000 are named Luke. And the lifetime prevalence
of leukemia is about 1.4%, or 14 out of every 1,000 people.
If we believe these two factors are independent and apply my “Luke is for leukemia” test to 1
million people, we’d expect to see a confusion matrix like:

                    Leukemia    No leukemia    Total
Named "Luke"        70          4,930          5,000
Not named "Luke"    13,930      981,070        995,000
Total               14,000      986,000        1,000,000

We can use these counts to compute statistics about model performance. For example,
accuracy is the fraction of correct predictions:
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

assert accuracy(70, 4930, 13930, 981070) == 0.98114
It’s common to look at the combination of precision and recall.
Precision measures how accurate our positive predictions were:
def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

assert precision(70, 4930, 13930, 981070) == 0.014
And recall measures what fraction of the positives our model identified:
def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

assert recall(70, 4930, 13930, 981070) == 0.005
Sometimes precision and recall are combined into the F1 score, which is defined as:
def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
This is the harmonic mean of precision and recall and necessarily lies between them.
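For the Luke/leukemia counts above, both precision and recall are poor, so the F1 score is tiny as well (a quick check, reusing the functions defined above):

print(f1_score(70, 4930, 13930, 981070))   # approximately 0.0074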
The Bias-Variance Tradeoff
• Another way of thinking about the overfitting problem is as a tradeoff between bias
and variance.
• Both are measures of what would happen if you were to retrain your model many
times on different sets of training data (from the same larger population).
• For example, the degree 0 model in “Overfitting and Underfitting” will
make a lot of mistakes for pretty much any training set (drawn from the same population),
which means that it has a high bias.
• However, any two randomly chosen training sets should give pretty similar models (since
any two randomly chosen training sets should have pretty similar average values).
• So we say that it has a low variance. High bias and low variance typically correspond to
underfitting.
Bias: Bias is the error due to overly simplistic models. A model with high bias makes strong
assumptions and pays little attention to the details of the data, leading to systematic errors.
High bias can cause underfitting, where the model is too simple to capture the patterns in the
data.
Variance: Variance refers to the model's sensitivity to small fluctuations in the training data. A
model with high variance pays too much attention to the training data and can capture noise as
if it were a pattern.
High variance can cause overfitting, where the model becomes too complex and fails to
generalize to new data.
Tradeoff:
• Low bias, high variance: The model fits the training data well but may not generalize to
unseen data (overfitting).
• High bias, low variance: The model is too simple to capture the true patterns in the data
(underfitting).
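As a rough illustration, the sketch below (made-up data, using NumPy's polyfit) refits a degree-0 and a degree-9 polynomial on fresh noisy samples drawn from the same linear relationship. The degree-0 fits change little between samples but always miss the trend (high bias, low variance), while the degree-9 coefficients swing wildly between samples (low bias, high variance):

# Hypothetical illustration of bias vs. variance with polynomial fits.
import numpy as np

rng = np.random.default_rng(0)

def sample_fit(degree):
    # Draw a fresh 20-point training set from the same population and fit a polynomial.
    x = rng.uniform(-1, 1, 20)
    y = 2 * x + rng.normal(0, 0.3, 20)      # the true relationship is linear
    return np.polyfit(x, y, degree)

# Degree-0 fits (a constant): relatively similar across training sets (low variance)
# but far from the true line everywhere (high bias -> underfitting).
print([np.round(sample_fit(0), 2) for _ in range(3)])

# Degree-9 fits: coefficients change wildly between training sets (high variance),
# even though each one fits its own training data closely (overfitting).
print([np.round(sample_fit(9), 2) for _ in range(3)])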
Feature Extraction and Selection
• As has been mentioned, when your data doesn’t have enough features, your model is
likely to underfit. And when your data has too many features, it’s easy to overfit. But
what are features, and where do they come from?
• Features are whatever inputs we provide to our model.
• In the simplest case, features are simply given to you. If you want to predict someone’s
salary based on her years of experience, then years of experience is the only feature
you have.
• Feature extraction involves transforming raw data into a set of features that better represent
the underlying structure or patterns in the data.
• The goal is to reduce the dimensionality of the data while retaining important information.
• Feature selection involves selecting the most important features from the dataset that
contribute the most to the prediction outcome.
• It reduces overfitting, speeds up the learning process, and improves model interpretability.
• By performing feature extraction and selection, you ensure that the model is not
overwhelmed with irrelevant or redundant data, which improves accuracy and
generalization.
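A brief sketch of the two ideas with scikit-learn, using the Iris data that appears later in this module; PCA and SelectKBest are just one possible choice of extraction and selection methods:

# Feature extraction (PCA) vs. feature selection (SelectKBest) -- illustrative sketch.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature extraction: combine the 4 original features into 2 new components.
X_pca = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 original features most related to the target.
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)   # (150, 2) (150, 2)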
K Nearest Neighbor (KNN) Algorithm
• K Nearest Neighbor (KNN) is a supervised machine learning algorithm used for
classification and regression.
• It works by finding the 'k' nearest data points (neighbors) to a new point, based on a chosen
distance metric (usually Euclidean distance), and making a decision (classifying or
predicting) based on the majority label of the nearest neighbors.
• In classification, the majority class among the neighbors determines the class of the new
point. For regression, the average of the values of the nearest neighbors is taken.
• Nearest neighbors is one of the simplest predictive models.
• It makes no mathematical assumptions, and it doesn’t require any sort of heavy machinery
• The only things it requires are:
• Some notion of distance
• An assumption that points that are close to one another are similar
# K-Nearest Neighbors algorithm using the Iris dataset
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a k-NN classifier (k = 3 is an illustrative choice) and evaluate it on the test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Naive Bayes: A Really Dumb Spam Filter
• Let S be the event “the message is spam” and B the event “the message contains the word
bitcoin.” Bayes’s theorem gives the probability that a message is spam, given that it contains
the word bitcoin:
P(S|B) = P(B|S) P(S) / [P(B|S) P(S) + P(B|¬S) P(¬S)]
• The numerator is the probability that a message is spam and contains bitcoin, while the
denominator is just the probability that a message contains bitcoin.
• If we have a large collection of messages we know are spam, and a large collection of
messages we know are not spam, then we can easily estimate P(B|S) and P(B|¬S).
If we further assume that any message is equally likely to be spam or not spam (so that
P(S) = P(¬S) = 0.5), then:
P(S|B) = P(B|S) / [P(B|S) + P(B|¬S)]
For example, if 50% of spam messages have the word bitcoin, but only 1% of nonspam
messages do, then the probability that any given bitcoin-containing email is spam is:
0.5 / (0.5 + 0.01) ≈ 98%
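A minimal sketch of this calculation in Python; the helper name and the 50%/1% figures are the illustrative values above, not estimates from real data:

# Spam probability for a bitcoin-containing message, assuming P(S) = P(not S) = 0.5,
# so the priors cancel out of Bayes's theorem.
def p_spam_given_bitcoin(p_bitcoin_given_spam: float,
                         p_bitcoin_given_ham: float) -> float:
    return p_bitcoin_given_spam / (p_bitcoin_given_spam + p_bitcoin_given_ham)

print(p_spam_given_bitcoin(0.5, 0.01))   # ~0.98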
Regression
• Regression analysis is a supervised learning method for predicting continuous variables.
• The most significant difference between regression and classification is that while regression
helps predict a continuous quantity, classification predicts discrete class labels.
• Regression is used to predict continuous variables or quantitative variables such as price and
revenue. The main concern of regression analysis is to answer questions such as
1. What is the relationship between variables?
2. What is the strength of the relationship?
3. What is the nature of the relationship, such as linear or nonlinear?
4. What is the relevance of each attribute?
5. What is the contribution of each attribute?
• Types of regression method
1. Linear regression
2. Multiple regression
3. Polynomial regression
4. Logistic regression
5. Lasso and Ridge regression method
Simple Linear Regression
• This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable.
• The equation for simple linear regression is:
Y = β0 + β1x + ε
where
Y is the dependent variable
x is the independent variable
β0 is the intercept
β1 is the slope
ε is the error
• The goal of the algorithm is to find the best-fit line equation that can predict the values
based on the independent variable.
• To achieve the best-fit regression line, the model aims to predict the target value Ŷ such
that the error between the predicted value Ŷ and the true value Y is minimum.
• So it is very important to update the parameters (the intercept β0 and the slope β1) to reach
the values that minimize the error between the predicted y values and the true y values.
• In linear regression, the Mean Squared Error (MSE) cost function is employed, which
calculates the average of the squared errors between the predicted values ŷᵢ and the actual
values yᵢ:
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
• The purpose is to determine the optimal values of the intercept β0 and the coefficient β1 of
the input feature, providing the best-fit line for the given data points.
• A linear regression model can be trained using the optimization algorithm gradient descent,
by iteratively modifying the model’s parameters to reduce the mean squared error of the
model on a training dataset.
• To update β0 and β1 so as to reduce the cost function (minimizing the MSE) and achieve
the best-fit line, the model uses gradient descent.
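A minimal gradient descent sketch for simple linear regression; the data, learning rate, and iteration count are illustrative assumptions, not values from the text:

# Gradient descent for simple linear regression y = b0 + b1 * x, minimizing the MSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])     # roughly y = 1 + 2x

b0, b1 = 0.0, 0.0          # initial intercept and slope
lr = 0.01                  # learning rate (illustrative)

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Gradients of the MSE with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 2), round(b1, 2))   # should be close to the least-squares fit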
Assumptions of Simple Linear Regression
1. Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the
independent variable(s) in a linear fashion.
2. Independence: The observations in the dataset are independent of each other. This
means that the value of the dependent variable for one observation does not depend on
the value of the dependent variable for another observation.
3. Homoscedasticity: Across all levels of the independent variable(s), the variance of the
errors is constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not constant, then
linear regression will not be an accurate model.
4. Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then
linear regression will not be an accurate model.
Multiple Linear Regression
Multiple linear regression extends the simple linear regression by allowing more than one
independent variable and one dependent variable. The equation for multiple linear regression
is:
Y = β0 + β1x1 + β2x2 + … + βnxn + ε
where
Y is the dependent variable
x1, x2, …, xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
ε is the error
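A brief sketch of fitting a multiple linear regression with scikit-learn; the two-feature dataset below is made up for illustration:

# Fitting a multiple linear regression Y = b0 + b1*x1 + b2*x2 with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])   # two independent variables
y = np.array([8.1, 7.4, 17.2, 16.4, 26.1])               # roughly 1 + 2*x1 + 2.5*x2 plus noise

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)     # estimate of b0
print("slopes:", model.coef_)             # estimates of b1, b2
print("prediction for x1=6, x2=5:", model.predict([[6, 5]]))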
Need for fitting the model
1. Estimating the coefficients: The primary goal is to estimate the regression coefficients that
quantify the relationship between the dependent variable and each independent variable.
2. Making predictions: To make predictions about the dependent variable based on new data
for the independent variables.
3. Understanding the relationships: Fitting the model helps in understanding the strength and
direction of the relationship between the dependent variable and each independent variable.
4. Assessing the model fit: Enables the evaluation of how well the model explains the
variation in the dependent variable.
5. Hypothesis testing: Enables hypothesis tests on the regression coefficients.
6. Identifying multicollinearity: Fitting the model and examining the Variance Inflation Factor
(VIF) helps detect multicollinearity, which occurs when independent variables are highly
correlated with each other; this can make it difficult to determine the individual effect of
each variable on the dependent variable.
7. Overfitting: Overfitting occurs when the model fits the training data too closely, capturing
noise or random fluctuations that do not represent the true underlying relationship between
variables. This can lead to poor generalization performance on new, unseen data.
Goodness of fit test
• In the context of linear regression, a goodness of fit test is used to evaluate how well the
regression model explains the observed data.
• It helps determine how closely the predicted values from the regression line match the actual
data points.
• Several metrics and tests are commonly used to assess the goodness of fit in linear
regression:
1. R-Squared (Coefficient of Determination)
• Measures the proportion of the variance in the dependent variable that is explained by the
independent variable(s).
Formula:
R² = 1 − SSres / SStot
Where:
SSres = sum of squares of residuals (differences between observed and predicted values)
SStot = total sum of squares (variation of the observed data from its mean)
• Value ranges from 0 to 1. A value of 1 indicates a perfect fit, and 0 means the model
explains none of the variance.
2. Adjusted R-Squared
• Adjusted for the number of predictors in the model. It penalizes adding
unnecessary independent variables.
• Formula:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
Where:
n represents the number of data points in our dataset,
k represents the number of independent variables, and
R² represents the R-squared value determined by the model.
3. Residual Standard Error (RSE)
• Measures the standard deviation of the residuals (differences between actual and
predicted values).
4. F-Test
• Compares the full model against a model with no predictors to see if at least one predictor
has a statistically significant relationship with the dependent variable.
• Hypotheses: Null Hypothesis (H₀): none of the predictors are significant. Alternative
Hypothesis (H₁): at least one predictor is significant.
• A high F-statistic (and low p-value) indicates that the model is a good fit.
5. Mean Squared Error (MSE)
• Measures the average squared difference between observed and predicted values.
6. Root Mean Squared Error (RMSE)
• The square root of the MSE, giving the error in the same units as the dependent variable.
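A short sketch computing several of these metrics with scikit-learn and NumPy; the small dataset is made up for illustration:

# Goodness-of-fit metrics for a fitted linear regression (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)
n, k = X.shape                                # n data points, k predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")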
Digression
• In data science, digression can refer to a departure from the main objective of analysis or
modeling.
• It can occur during various stages of a project such as data collection, preprocessing, analysis,
modelling and interpretation.
• It typically happens when a data scientist or analyst diverts attention to unrelated or less
important aspects of the project, which can slow down progress or cause confusion.
• In multiple regression, digression can refer to straying from the main goals of building and
interpreting the regression model. This can happen in several ways, often due to focusing on
irrelevant or secondary aspects of the data or analysis.
• Digressions can slow down or obscure the path to an accurate model.
Common Examples of Digression in Multiple Regression:
1. Including Irrelevant Predictors: A digression occurs when too many variables are included
in the model, especially those that have no significant relationship with the dependent
variable.
Example: In a model predicting house prices, including variables like the owner’s favorite
color or whether the house is near a bakery could be irrelevant and complicate the model
unnecessarily.
2. Over-Emphasizing Collinearity: Collinearity occurs when independent variables are highly
correlated with each other, which can inflate standard errors and make estimates unreliable.
Focusing too much on this issue without considering how to address it effectively (e.g.,
through regularization techniques) can lead to digression.
Example: Spending too much time diagnosing collinearity without taking practical steps like
removing one of the correlated variables, using principal component analysis (PCA), or
applying ridge regression to mitigate the problem.
3. Excessive Model Complexity: Adding too many variables or interaction terms in the model
might create overfitting, which means the model fits the training data too closely and
performs poorly on unseen data.
This is a form of digression, as it moves away from the goal of building a generalized model.
Example: Including numerous interaction terms or polynomial terms that add little predictive
power but overcomplicate the model.
4. Overly Complicated Diagnostics: Diagnostic tests like residual plots, influence diagnostics,
or hypothesis tests are essential for understanding model fit and assumptions. However,
focusing excessively on minor deviations or running unnecessary tests can divert attention
from model improvement.
Example: Spending too much time on minor outliers or very small violations of assumptions
(e.g., minor deviations from normality in residuals) that are not significantly impacting the
model’s predictive accuracy.
5. Inappropriate Interpretation of Coefficients: Another form of digression happens when
too much emphasis is placed on interpreting individual coefficients without considering the
overall model’s performance.
In multiple regression, individual predictor coefficients must be interpreted with caution,
especially if there is multicollinearity or interaction terms involved.
Example: Trying to draw detailed conclusions about each predictor’s coefficient when the
predictors are highly correlated, making the interpretations less reliable.
How to Avoid Digression in Multiple Regression:
1. Feature Selection: Use systematic methods like stepwise regression, LASSO, or
regularization to select the most relevant predictors. Avoid including too many irrelevant
predictors that may dilute the model's effectiveness.
2. Simplification: Keep the model as simple as possible without sacrificing performance.
Focus on the most important predictors that have a clear relationship with the outcome.
3. Model Diagnostics: Perform essential diagnostic checks, such as checking for
multicollinearity, heteroscedasticity, and residual patterns, but avoid focusing too much on
minor or non-critical issues.
4. Cross-Validation: Use cross-validation to assess the generalizability of the model to unseen
data. Avoid overfitting by ensuring the model performs well on both training and test
datasets (a short sketch follows this list).
5. Practical and Statistical Significance: Pay attention to both practical relevance and
statistical significance when selecting predictors. Not every statistically significant predictor
needs to be included if it doesn't add value to the model.
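A brief sketch of points 1 and 4 above, using LASSO for feature selection and cross-validation to check generalization; the synthetic dataset and the alpha value are illustrative assumptions:

# LASSO-based feature selection plus cross-validation -- illustrative sketch.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, only 4 of which actually drive the response.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print("Nonzero coefficients:", (lasso.coef_ != 0).sum())   # irrelevant features tend to shrink to 0

# 5-fold cross-validation estimates how well the model generalizes to unseen data.
scores = cross_val_score(lasso, X, y, cv=5, scoring="r2")
print("Mean cross-validated R^2:", scores.mean().round(3))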
Regularization
• Regularization in machine learning refers to techniques used to prevent overfitting by
penalizing complex models.
• Overfitting occurs when a model learns not only the underlying patterns in the data but also
the noise, leading to poor generalization to unseen data.
Common Types of Regularization:
1. L1 Regularization (Lasso): Adds the absolute values of the coefficients to the cost
function. Tends to produce sparse models, meaning some feature weights may become
zero, effectively performing feature selection.
2. L2 Regularization (Ridge): Adds the square of the magnitudes of the coefficients to the
cost function. Shrinks the coefficients, preventing large values, but does not zero them out.
3. Dropout (specific to neural networks): Randomly drops a fraction of neurons during each
iteration of training. Prevents co-adaptation of neurons and reduces the chances of
overfitting.
4. Early Stopping: Stops the training process once the model's performance on a validation
set starts to degrade, preventing overfitting.
Why Regularization Works
• Regularization techniques work by adding a penalty term to the cost function that
discourages large weights.
• This pushes the model towards simpler hypotheses, making it less likely to overfit to
noise or irrelevant details in the training data.
• The hyperparameter 𝜆 controls the strength of the regularization.
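A short sketch comparing ordinary least squares, Ridge (L2), and Lasso (L1) in scikit-learn; the alpha parameter plays the role of λ, and the synthetic data and alpha values are illustrative:

# Ridge (L2) vs. Lasso (L1) regularization -- illustrative sketch.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # alpha is the regularization strength (lambda)
lasso = Lasso(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))   # shrunk toward zero, none exactly zero
print("Lasso coefficients:", lasso.coef_.round(2))   # typically some driven exactly to zero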
Logistic Regression
• Linear regression predicts the numerical response but is not suitable for predicting
categorical variables.
• When categorical variables are involved, it is called a classification problem. Logistic
regression is suitable for binary classification problems, where the goal is to predict the
probability of one of two possible outcomes.
• Hence the output is a categorical variable. Examples include whether an email is spam or not,
or whether a student passes or fails based on the marks secured.
• Linear regression generates values in the range −∞ to +∞, whereas the probability of the
response variable lies between 0 and 1. Hence there must be a mapping function from
(−∞, +∞) to [0, 1].
• The core of the mapping function in logistic regression is the sigmoid function.
• A sigmoid function is an ‘S’-shaped function that yields a value between 0 and 1; in logistic
regression this is the logistic function σ(x) = 1 / (1 + e^(−x)).
• Odds are defined as the ratio of the probability of an event happening to the probability of it
not happening:
odds = p / (1 − p)
Applications:
Binary classification: Predicting outcomes such as spam vs. not spam, disease vs. no
disease, etc
Probability estimation: Logistic regression not only classifies but also provides a
probability score for how confident the model is in its prediction.
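A minimal sketch of both uses with scikit-learn; the marks/pass-fail data is made up for illustration:

# Logistic regression for binary classification and probability estimation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Marks secured (feature) and pass/fail outcome (label), made up for illustration.
marks = np.array([[25], [35], [45], [50], [55], [60], [70], [80]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = LogisticRegression().fit(marks, passed)

print(clf.predict([[48]]))          # predicted class (0 = fail, 1 = pass)
print(clf.predict_proba([[48]]))    # probabilities of [fail, pass]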
Support Vector Machines (SVM)
(Figures: a separating hyperplane; a nonseparable one-dimensional dataset; the same dataset
becoming separable in a higher dimension.)
Key Concepts of SVM:
Hyperplane:
• In SVM, the goal is to find the best hyperplane that separates data points belonging to
different classes.
• In a two-dimensional space, this is a line, but in higher dimensions, it becomes a plane or
a hyperplane.
Support Vectors:
• These are the data points that are closest to the hyperplane. The position of these points
determines the hyperplane’s orientation and location.
• The algorithm focuses on these critical points.
Margin:
• The margin is the distance between the hyperplane and the nearest support vectors of
each class.
• SVM tries to maximize this margin, providing a robust classification boundary.
Linear and Non-Linear SVM:
• Linear SVM: Used when data is linearly separable, i.e., when a straight line or a flat
hyperplane can clearly separate the classes.
• Non-Linear SVM: For cases where the data isn't linearly separable, SVM uses a kernel trick
to project the data into higher dimensions where a linear separator can be found.
Kernels:
• Kernels help SVM handle non-linearly separable data by transforming it into a higher-
dimensional space.
• Popular kernel functions include:
Linear Kernel: Used when the data is linearly separable.
Polynomial Kernel: Maps the data into a higher-dimensional space using polynomial
functions.
Radial Basis Function (RBF) Kernel: Also known as the Gaussian kernel, it is commonly
used for complex datasets.
C Parameter:
• This controls the trade-off between maximizing the margin and minimizing the classification
error.
• A larger C aims for a smaller margin with fewer misclassifications, while a smaller C allows
more flexibility and a larger margin.
SVM for Classification Example:
In a simple binary classification problem, given a dataset with two classes, SVM will:
1. Find the hyperplane that best separates the classes.
2. Maximize the margin between the support vectors of the two classes.
3. Predict the class of a new data point based on which side of the hyperplane it falls.
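A brief sketch of these steps with scikit-learn's SVC; the two-class blob data and the choice of kernel and C are illustrative:

# Training a linear SVM classifier on a simple two-class dataset.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two clusters of points, one per class (illustrative data).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# C controls the margin/misclassification trade-off; kernel="rbf" would handle non-linear cases.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Number of support vectors per class:", clf.n_support_)
print("Predicted class for a new point:", clf.predict([[0.0, 2.0]]))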
Applications of SVM:
• Text and hypertext categorization
• Image classification
• Bioinformatics (e.g., protein classification)
• Handwriting recognition
Some Questions from module 3.