
Module 3: Machine Learning

Modeling, What Is Machine Learning?, Overfitting and Underfitting, Correctness, The Bias-
Variance Tradeoff, Feature Extraction and Selection, k-Nearest Neighbors, The Model,
Example: The Iris Dataset, The Curse of Dimensionality, Naive Bayes, A Really Dumb Spam
Filter, A More Sophisticated Spam Filter, Implementation, Testing Our Model, Using Our
Model, Simple Linear Regression, The Model, Using Gradient Descent, Maximum
Likelihood Estimation, Multiple Regression, The Model, Further Assumptions of the Least
Squares Model, Fitting the Model, Interpreting the Model, Goodness of Fit, Digression: The
Bootstrap, Standard Errors of Regression Coefficients, Regularization, Logistic Regression,
The Problem, The Logistic Function, Applying the Model, Goodness of Fit, Support Vector
Machines.
Text Book : Chapters 11, 12, 13, 14, 15 and 16
Modeling
What is a model? It’s simply a specification of a mathematical (or probabilistic) relationship
that exists between different variables.
Examples
• The business model: takes inputs like “number of users,” “ad revenue per user,” and
“number of employees” and outputs your annual profit for the next several years.
The business model is probably based on simple mathematical relationships: profit is
revenue minus expenses, revenue is units sold times average price, and so on.

• The recipe model: This model relates inputs like “number of eaters” and “hungriness” to
quantities of ingredients needed. It is probably based on trial and error—someone went in a
kitchen and tried different combinations of ingredients until they found one they liked.

• The poker model: Here each player’s “win probability” is estimated in real time based on a
model that takes into account the cards that have been revealed so far and the distribution
of cards in the deck. It is based on probability theory, the rules of poker, and some
reasonably innocuous assumptions about the random process by which cards are dealt.
• A model is an explicit description of patterns within the data in the form of:

1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters

What is Machine Learning?


Machine learning is an important sub-branch of Artificial Intelligence (AI). A frequently
quoted definition of machine learning was by Arthur Samuel, one of the pioneers of Artificial
Intelligence.
“Machine learning is the field of study that gives computers the ability to learn without being
explicitly programmed.”

• The key to this definition is that the system should learn by itself without explicit
programming.
• Everyone has her own exact definition, but we’ll use machine learning to refer to creating and
using models that are learned from data.
• In other contexts this might be called predictive modeling or data mining, but we will stick
with machine learning.
• Typically, our goal will be to use existing data to develop models that we can use to predict
various outcomes for new data, such as:
• Whether an email message is spam or not
• Whether a credit card transaction is fraudulent
• Which advertisement a shopper is most likely to click on
• Which cricket team is going to win IPL 2025

• Supervised models (in which there is a set of data labeled with the correct answers to learn
from)
• Unsupervised models (in which there are no such labels)
There are various other types as well, such as:
• Semi-supervised (in which only some of the data are labeled)
• Online (in which the model needs to continuously adjust to newly arriving data)
• Reinforcement (in which, after making a series of predictions, the model gets a signal
indicating how well it did)
• Supervised learning uses labelled data
• Unsupervised learning uses unlabeled data
• Semi-supervised algorithms use unlabelled data by assigning a pseudo-label. Then, the labelled
and pseudo-labelled dataset can be combined.
• Reinforcement learning mimics human beings: just as human beings use their ears and eyes to
perceive the world and take actions, reinforcement learning allows an agent to interact with the
environment to get rewards.
Overfitting and Underfitting
Overfitting:
• A common danger in machine learning is overfitting.
• Producing a model that performs well on the data you train it on but generalizes poorly to
any new data.
• This could involve learning noise in the data. Or it could involve learning to identify specific
inputs rather than whatever factors are actually predictive for the desired output.
Underfitting:
• Producing a model that doesn’t perform well even on the training data, although typically
when this happens you decide your model isn’t good enough and keep looking for a better
one.
Correctness
Imagine building a model to make a binary judgment. Is this email spam? Should we
hire this candidate? Is this air traveler secretly a terrorist?
Given a set of labeled data and such a predictive model, every data point lies in one of
four categories:
True positive
“This message is spam, and we correctly predicted spam.”
False positive (Type 1 error)
“This message is not spam, but we predicted spam.”
False negative (Type 2 error)
“This message is spam, but we predicted not spam.”
True negative
“This message is not spam, and we correctly predicted not spam.”
We often represent these as counts in a confusion matrix:
                     Spam              Not spam
Predict spam         True positive     False positive
Predict not spam     False negative    True negative
Consider a (deliberately silly) test that predicts leukemia exactly when a person is named
Luke. These days approximately 5 babies out of 1,000 are named Luke, and the lifetime
prevalence of leukemia is about 1.4%, or 14 out of every 1,000 people.
If we believe these two factors are independent and apply this “Luke is for leukemia” test to 1
million people, we’d expect to see a confusion matrix like:

              Leukemia    No leukemia       Total
“Luke”              70          4,930       5,000
Not “Luke”      13,930        981,070     995,000
Total           14,000        986,000   1,000,000

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

assert accuracy(70, 4930, 13930, 981070) == 0.98114
It’s common to look at the combination of precision and recall.
Precision measures how accurate our positive predictions were:
def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

assert precision(70, 4930, 13930, 981070) == 0.014

And recall measures what fraction of the positives our model identified:
def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

assert recall(70, 4930, 13930, 981070) == 0.005
Sometimes precision and recall are combined into the F1 score, which is defined as:
def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
This is the harmonic mean of precision and recall and necessarily lies between them.
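As a quick check, here is a minimal usage sketch with the counts from the leukemia confusion
matrix above:

# Using the functions defined above on the "Luke is for leukemia" counts
p = precision(70, 4930, 13930, 981070)    # 0.014
r = recall(70, 4930, 13930, 981070)       # 0.005
print(f1_score(70, 4930, 13930, 981070))  # about 0.0074, between p and r

Both precision and recall are terrible here, and the F1 score makes that visible even though the
plain accuracy looked impressive.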
The Bias-Variance Tradeoff
• Another way of thinking about the overfitting problem is as a tradeoff between bias
and variance.
• Both are measures of what would happen if you were to retrain your model many
times on different sets of training data (from the same larger population).
• For example, the degree 0 model in “Overfitting and Underfitting” will
make a lot of mistakes for pretty much any training set (drawn from the same population),
which means that it has a high bias.
• However, any two randomly chosen training sets should give pretty similar models (since
any two randomly chosen training sets should have pretty similar average values).
• So we say that it has a low variance. High bias and low variance typically correspond to
underfitting.
Bias: Bias is the error due to overly simplistic models. A model with high bias makes strong
assumptions and pays little attention to the details of the data, leading to systematic errors.
High bias can cause underfitting, where the model is too simple to capture the patterns in the
data.

Variance: Variance refers to the model's sensitivity to small fluctuations in the training data. A
model with high variance pays too much attention to the training data and can capture noise as
if it were a pattern.
High variance can cause overfitting, where the model becomes too complex and fails to
generalize to new data.

Tradeoff:
• Low bias, high variance: The model fits the training data well but may not generalize to
unseen data (overfitting).
• High bias, low variance: The model is too simple to capture the true patterns in the data
(underfitting).
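A minimal sketch of the tradeoff (the toy data, polynomial degrees, and noise level are
assumptions chosen for illustration): a degree-0 polynomial underfits, while a degree-10
polynomial chases noise.

# Minimal bias-variance sketch, assuming numpy and scikit-learn are available
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, 20)   # noisy underlying pattern

for degree in (0, 1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))   # training R^2 grows with degree,
                                       # but the degree-10 fit mostly memorizes noise

Retraining the degree-0 model on different samples gives nearly identical (bad) fits, which is
high bias / low variance; retraining the degree-10 model gives wildly different curves, which is
low bias / high variance.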
Feature Extraction and Selection
• As has been mentioned, when your data doesn’t have enough features, your model is
likely to underfit. And when your data has too many features, it’s easy to overfit. But
what are features, and where do they come from?
• Features are whatever inputs we provide to our model.
• In the simplest case, features are simply given to you. If you want to predict someone’s
salary based on her years of experience, then years of experience is the only feature
you have.
• Feature extraction involves transforming raw data into a set of features that better represent
the underlying structure or patterns in the data.
• The goal is to reduce the dimensionality of the data while retaining important information.
• Feature selection involves selecting the most important features from the dataset that
contribute the most to the prediction outcome.
• It reduces overfitting, speeds up the learning process, and improves model interpretability.
• By performing feature extraction and selection, you ensure that the model is not
overwhelmed with irrelevant or redundant data, which improves accuracy and
generalization.
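A minimal sketch of both ideas using scikit-learn and the Iris data (used later in this module);
PCA here stands in for feature extraction and SelectKBest for feature selection, as one of many
possible choices.

# Feature extraction (PCA) vs. feature selection (SelectKBest) on Iris
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

X_extracted = PCA(n_components=2).fit_transform(X)            # 4 features -> 2 new components
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep the 2 most informative originals

print(X.shape, X_extracted.shape, X_selected.shape)           # (150, 4) (150, 2) (150, 2)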
K Nearest Neighbor (KNN) Algorithm
• K Nearest Neighbor (KNN) is a supervised machine learning algorithm used for
classification and regression.
• It works by finding the 'k' nearest data points (neighbors) to a new point, based on a chosen
distance metric (usually Euclidean distance), and making a decision (classifying or
predicting) based on the majority label of the nearest neighbors.
• In classification, the majority class among the neighbors determines the class of the new
point. For regression, the average of the values of the nearest neighbors is taken.
• Nearest neighbors is one of the simplest predictive models.
• It makes no mathematical assumptions, and it doesn’t require any sort of heavy machinery
• The only things it requires are:
• Some notion of distance
• An assumption that points that are close to one another are similar
#K-Nearest Neighbors Algorithm using Iris dataset
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN classifier
k = 3  # number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'KNN Classification accuracy: {accuracy:.4f}')
The Curse of Dimensionality
• The curse of dimensionality refers to various challenges and phenomena that arise when
analyzing and organizing data in high-dimensional spaces (with many features).
• As the number of dimensions (features) increases, several issues can make standard data
analysis techniques less effective or computationally infeasible.
• The k-nearest neighbors algorithm runs into trouble in higher dimensions
• Points in high-dimensional spaces tend not to be close to one another at all.
• One way to see this is by randomly generating pairs of points in the d-dimensional “unit cube”
in a variety of dimensions, and calculating the distances between them.
Generating random points should be second nature by now:

import random
import math
from typing import List

Vector = List[float]

def distance(v: Vector, w: Vector) -> float:
    # Euclidean distance between two points (covered in the linear algebra material)
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

def random_point(dim: int) -> Vector:
    return [random.random() for _ in range(dim)]

as is writing a function to generate the distances:

def random_distances(dim: int, num_pairs: int) -> List[float]:
    return [distance(random_point(dim), random_point(dim))
            for _ in range(num_pairs)]

For every dimension from 1 to 100, we’ll compute 10,000 distances and use those to compute
the average distance between points and the minimum distance between points in each
dimension:

import tqdm

dimensions = range(1, 101)
avg_distances = []
min_distances = []

random.seed(0)
for dim in tqdm.tqdm(dimensions, desc="Curse of Dimensionality"):
    distances = random_distances(dim, 10000)      # 10,000 random pairs
    avg_distances.append(sum(distances) / 10000)  # track the average
    min_distances.append(min(distances))          # track the minimum
As the number of dimensions increases, the average distance between points increases.
But what’s more problematic is the ratio between the closest distance and the average
distance (Figure 2):
• In low-dimensional datasets, the closest points tend to be much closer than average.
• But two points are close only if they’re close in every dimension.
• When you have a lot of dimensions, it’s likely that the closest points aren’t much closer than
average, so two points being close doesn’t mean very much
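A quick check of this claim, continuing from the loop above:

# Ratio of the closest distance to the average distance, per dimension
min_avg_ratio = [min_dist / avg_dist
                 for min_dist, avg_dist in zip(min_distances, avg_distances)]
print(min_avg_ratio[0], min_avg_ratio[-1])   # near 0 in 1 dimension, near 1 in 100 dimensions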
• A different way of thinking about the problem involves the sparsity of higher-dimensional
spaces.
• If you pick 50 random numbers between 0 and 1, you’ll probably get a pretty good sample
of the unit interval (see the figure below).

Fifty random points in one dimension

If you pick 50 random points in the unit square, you’ll get less coverage:

Fifty random points in two dimensions

And in three dimensions, even less:

Fifty random points in three dimensions

• You can see already that there are starting to be large empty spaces with no points near
them.
• In more dimensions, unless you get exponentially more data, those large empty spaces
represent regions far from all the points you want to use in your predictions.
• So if you’re trying to use nearest neighbors in higher dimensions, it’s probably a good idea
to do some kind of dimensionality reduction first.

Naive Bayes
• Naive Bayes algorithm is a simple yet powerful classification technique based on Bayes'
Theorem
Some Applications:
• Spam detection: Naive Bayes is widely used in email spam filtering.
• Text classification: It’s commonly used for document classification, sentiment analysis, etc.
• Medical diagnosis: Used to predict diseases based on symptoms.

A Really Dumb Spam Filter


• "A Really Dumb Spam Filter" might refer to a simple or overly basic spam filtering
approach, which is likely to be ineffective at distinguishing spam from legitimate messages.
• Imagine a “universe” that consists of receiving a message chosen randomly from all
possible messages.
• Let S be the event “the message is spam” and B be the event “the message contains the
word bitcoin.”
• Bayes’s theorem tells us that the probability that the message is spam conditional on
containing the word bitcoin is:
P(S|B) = [P(B|S) P(S)] / [P(B|S) P(S) + P(B|¬S) P(¬S)]

• The numerator is the probability that a message is spam and contains bitcoin, while the
denominator is just the probability that a message contains bitcoin.

• If we have a large collection of messages we know are spam, and a large collection of
messages we know are not spam, then we can easily estimate P(B|S) and P(B|¬S).

If we further assume that any message is equally likely to be spam or not spam (so that
P(S) = P(¬S) = 0.5), then:
P(S|B) = P(B|S) / [P(B|S) + P(B|¬S)]

For example, if 50% of spam messages have the word bitcoin, but only 1% of nonspam
messages do, then the probability that any given bitcoin-containing email is spam is:

0.5 / (0.5 + 0.01) ≈ 98%
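A minimal sketch of this calculation (plain arithmetic, using the example's assumed rates
P(B|S) = 0.5, P(B|¬S) = 0.01, and equal priors):

def p_spam_given_bitcoin(p_b_given_spam: float, p_b_given_not_spam: float) -> float:
    # Bayes's theorem with P(S) = P(not S) = 0.5, so the priors cancel
    return p_b_given_spam / (p_b_given_spam + p_b_given_not_spam)

print(p_spam_given_bitcoin(0.5, 0.01))   # about 0.98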
Regression
• Regression analysis is a supervised learning method for predicting continuous variables.
• The most significant difference between regression and classification is that while regression
helps predict a continuous quantity, classification predicts discrete class labels.
• Regression is used to predict continuous variables or quantitative variables such as price and
revenue. The main concern of regression analysis is to answer questions such as
1. What is the relationship between variables?
2. What is the strength of the relationship?
3. What is the nature of the relationship, such as linear or nonlinear?
4. What is the relevance of each attribute?
5. What is the contribution of each attribute?
• Types of regression method
1. Linear regression
2. Multiple regression
3. Polynomial regression
4. Logistic regression
5. Lasso and Ridge regression method
Simple Linear Regression
• This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable.
• The equation for simple linear regression is:
Y=β0+β1x+ε
where
Y is the dependent variable
x is the independent variable
β0 is the intercept
β1 is the slope
ε is the error

• The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
• To achieve the best-fit regression line, the model aims to predict the target value Ŷ such
that the error between the predicted value Ŷ and the true value Y is minimized.
• So it is very important to update the values of β0 and β1 to reach the values that minimize
the error between the predicted y value and the true y value.

• In linear regression, the Mean Squared Error (MSE) cost function is employed, which
calculates the average of the squared errors between the predicted values ŷᵢ and the actual
values yᵢ: MSE = (1/n) Σ (yᵢ − ŷᵢ)².
• The purpose is to determine the optimal values for the intercept β0 and the coefficient of
the input feature β1, providing the best-fit line for the given data points.

• A linear regression model can be trained using the optimization algorithm gradient descent,
which iteratively modifies the model’s parameters to reduce the mean squared error of the
model on a training dataset.

• To update β0 and β1 so as to reduce the cost function (minimizing the MSE) and achieve
the best-fit line, the model uses gradient descent.
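A minimal gradient descent sketch for simple linear regression (the toy data, learning rate, and
iteration count are assumptions for illustration):

import numpy as np

# Toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

b0, b1 = 0.0, 0.0   # intercept and slope, initialized at zero
lr = 0.01           # learning rate

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Gradients of the MSE cost with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(b0, b1)       # should end up close to 1 and 2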
Assumptions of Simple Linear Regression
1. Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the
independent variable(s) in a linear fashion.

2. Independence: The observations in the dataset are independent of each other. This
means that the value of the dependent variable for one observation does not depend on
the value of the dependent variable for another observation.

3. Homoscedasticity: Across all levels of the independent variable(s), the variance of the
errors is constant. This indicates that the amount of the independent variable(s) has no
impact on the variance of the errors. If the variance of the residuals is not constant, then
linear regression will not be an accurate model.
4. Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then
linear regression will not be an accurate model.
Multiple Linear Regression
Multiple linear regression extends the simple linear regression by allowing more than one
independent variable and one dependent variable. The equation for multiple linear regression
is:
Y = β0 + β1x1 + β2x2 + … + βnxn + ε
where
Y is the dependent variable
x1, x2, …, xn are the independent variables
β0 is the intercept
β1, β2, …, βn are the slopes
ε is the error
Need for fitting the model
1. Estimating the coefficients: The primary goal is to estimate the regression coefficients that
quantify the relationship between the dependent variable and each independent variable.
2. Making predictions: To make predictions about the dependent variable based on new data
for the independent variables (see the sketch after this list).
3. Understanding the relationships: Fitting the model helps in understanding the strength and
direction of the relationship between the dependent variable and each independent variable.
4. Assessing the model fit: Enables the evaluation of how well the model explains the
variation in the dependent variable.
5. Hypothesis testing: Allows hypothesis tests on the regression coefficients.
6. Identifying multicollinearity: Fitting the model and examining the Variance Inflation Factor
(VIF) helps detect multicollinearity, which occurs when independent variables are highly
correlated with each other; this can make it difficult to determine the individual effect
of each variable on the dependent variable.
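A minimal sketch of fitting a multiple linear regression with scikit-learn (the synthetic data and
its coefficients are assumptions chosen for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                                  # three independent variables
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of beta_0 (about 4)
print(model.coef_)        # estimates of beta_1, beta_2, beta_3 (about 2, -1, 0)
print(model.score(X, y))  # R^2 on the training data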

Assumptions of Multiple Linear Regression


For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression
apply. In addition to these, a few more considerations are:
5. No multicollinearity: There is no high correlation between the independent variables.
This indicates that there is little or no correlation between the independent variables.
6. Additivity: The model assumes that the effect of changes in a predictor variable on the
response variable is consistent regardless of the values of the other variables. This
assumption implies that there is no interaction between variables in their effects on the
dependent variable.
7. Feature selection: In multiple linear regression, it is essential to carefully select the
independent variables that will be included in the model. Including irrelevant or redundant
variables may lead to overfitting and complicate the interpretation of the model.

8. Overfitting: Overfitting occurs when the model fits the training data too closely, capturing
noise or random fluctuations that do not represent the true underlying relationship between
variables. This can lead to poor generalization performance on new, unseen data.
Goodness of fit test
• In the context of linear regression, a goodness of fit test is used to evaluate how well the
regression model explains the observed data.
• It helps determine how closely the predicted values from the regression line match the actual
data points.
• Several metrics and tests are commonly used to assess the goodness of fit in linear
regression:
1. R-Squared (Coefficient of Determination)
• Measures the proportion of the variance in the dependent variable that is explained by the
independent variable(s).
Formula:
R² = 1 − SS_res / SS_tot
Where:
SS_res = sum of squares of residuals (difference between observed and predicted values)
SS_tot = total sum of squares (variation of the observed data from its mean)
• The value ranges from 0 to 1. A value of 1 indicates a perfect fit, and 0 means the model
explains none of the variance.
2. Adjusted R-Squared
• Adjusted for the number of predictors in the model. It penalizes adding unnecessary
independent variables.
• Formula:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
Where:
n represents the number of data points in our dataset,
k represents the number of independent variables, and
R² is the R-squared value determined by the model.
3. Residual Standard Error (RSE)
• Measures the standard deviation of the residuals (differences between actual and predicted
values).
4. F-Test
• Compares the full model against a model with no predictors to see if at least one predictor
has a statistically significant relationship with the dependent variable.
• Hypotheses:
Null Hypothesis (H₀): None of the predictors are significant.
Alternative Hypothesis (H₁): At least one predictor is significant.
• A high F-statistic (and low p-value) indicates that the model is a good fit.
5. Mean Squared Error (MSE)
• Measures the average squared difference between observed and predicted values.
6. Root Mean Squared Error (RMSE)
• The square root of MSE, giving the error in the same units as the dependent variable.
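A minimal sketch computing several of these metrics from a model's predictions (y and y_pred
are assumed to be NumPy arrays of actual and predicted values, and k the number of
independent variables):

import numpy as np

def goodness_of_fit(y: np.ndarray, y_pred: np.ndarray, k: int) -> dict:
    n = len(y)
    ss_res = np.sum((y - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = ss_res / n
    return {"R2": r2, "Adjusted R2": adj_r2, "MSE": mse, "RMSE": np.sqrt(mse)}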
Digression
• In data science, digression can refer to a departure from the main objective of analysis or
modeling.
• It can occur during various stages of a project such as data collection, preprocessing, analysis,
modelling and interpretation.
• It typically happens when a data scientist or analyst diverts attention to unrelated or less
important aspects of the project, which can slow down progress or cause confusion.
• In multiple regression, digression can refer to straying from the main goals of building and
interpreting the regression model. This can happen in several ways, often due to focusing on
irrelevant or secondary aspects of the data or analysis.
• Digressions can slow down or obscure the path to an accurate model.
Common Examples of Digression in Multiple Regression:
1. Including Irrelevant Predictors: A digression occurs when too many variables are included
in the model, especially those that have no significant relationship with the dependent
variable.
Example: In a model predicting house prices, including variables like the owner’s favorite
color or whether the house is near a bakery could be irrelevant and complicate the model
unnecessarily.
2. Over-Emphasizing Collinearity: Collinearity occurs when independent variables are highly
correlated with each other, which can inflate standard errors and make estimates unreliable.
Focusing too much on this issue without considering how to address it effectively (e.g.,
through regularization techniques) can lead to digression.
Example: Spending too much time diagnosing collinearity without taking practical steps like
removing one of the correlated variables, using principal component analysis (PCA), or
applying ridge regression to mitigate the problem.
3. Excessive Model Complexity: Adding too many variables or interaction terms in the model
might create overfitting, which means the model fits the training data too closely and
performs poorly on unseen data.
This is a form of digression, as it moves away from the goal of building a generalized model.
Example: Including numerous interaction terms or polynomial terms that add little predictive
power but overcomplicate the model.
4. Overly Complicated Diagnostics: Diagnostic tests like residual plots, influence diagnostics,
or hypothesis tests are essential for understanding model fit and assumptions. However,
focusing excessively on minor deviations or running unnecessary tests can divert attention
from model improvement.
Example: Spending too much time on minor outliers or very small violations of assumptions
(e.g., minor deviations from normality in residuals) that are not significantly impacting the
model’s predictive accuracy.
5. Inappropriate Interpretation of Coefficients: Another form of digression happens when
too much emphasis is placed on interpreting individual coefficients without considering the
overall model’s performance.
In multiple regression, individual predictor coefficients must be interpreted with caution,
especially if there is multicollinearity or interaction terms involved.
Example: Trying to draw detailed conclusions about each predictor’s coefficient when the
predictors are highly correlated, making the interpretations less reliable.
How to Avoid Digression in Multiple Regression:
1. Feature Selection: Use systematic methods like stepwise regression, LASSO, or
regularization to select the most relevant predictors. Avoid including too many irrelevant
predictors that may dilute the model's effectiveness.
2. Simplification: Keep the model as simple as possible without sacrificing performance.
Focus on the most important predictors that have a clear relationship with the outcome.
3. Model Diagnostics: Perform essential diagnostic checks, such as checking for
multicollinearity, heteroscedasticity, and residual patterns, but avoid focusing too much on
minor or non-critical issues.
4. Cross-Validation: Use cross-validation to assess the generalizability of the model to unseen
data. Avoid overfitting by ensuring the model performs well on both training and test
datasets.
5. Practical and Statistical Significance: Pay attention to both practical relevance and
statistical significance when selecting predictors. Not every statistically significant predictor
needs to be included if it doesn't add value to the model.
Regularization
• Regularization in machine learning refers to techniques used to prevent overfitting by
penalizing complex models.
• Overfitting occurs when a model learns not only the underlying patterns in the data but also
the noise, leading to poor generalization to unseen data.
Common Types of Regularization:
1. L1 Regularization (Lasso): Adds the absolute values of the coefficients to the cost
function. Tends to produce sparse models, meaning some feature weights may become
zero, effectively performing feature selection.
2. L2 Regularization (Ridge):
Adds the square of the magnitudes of the coefficients to the cost function. Shrinks the
coefficients, preventing large values but does not zero them out.
3. Dropout (specific to neural networks):
Randomly drops a fraction of neurons during each iteration of training. Prevents co-
adaptation of neurons and reduces the chances of overfitting.
4. Early Stopping: Stops the training process once the model's performance on a validation
set starts to degrade, preventing overfitting.
Why Regularization Works

• Regularization techniques work by adding a penalty term to the cost function that
discourages large weights.
• This pushes the model towards simpler hypotheses, making it less likely to overfit to
noise or irrelevant details in the training data.
• The hyperparameter 𝜆 controls the strength of the regularization.
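A minimal sketch of L1 and L2 regularization with scikit-learn (the synthetic data and the
alpha values are assumptions; scikit-learn's alpha plays the role of λ):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # 10 features, but only 2 actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: many coefficients driven to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrunk, but not zeroed out

print(lasso.coef_)
print(ridge.coef_)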
Logistic Regression

• Linear regression predicts a numerical response but is not suitable for predicting
categorical variables.
• When categorical variables are involved, it is called a classification problem. Logistic
regression is suitable for binary classification problems, where the goal is to predict the
probability of one of two possible outcomes.
• Hence the output is a categorical variable. Examples include whether an email is spam or
not, or whether a student passes or fails based on the marks secured.
• A linear regression output ranges from −∞ to +∞, whereas the probability of the response
variable ranges between 0 and 1. Hence there must be a mapping function from −∞ to +∞
onto the interval 0 to 1.
• The core of the mapping function in logistic regression is the sigmoidal function.
• A sigmoidal function is an ‘S’-shaped function that yields a value between 0 and 1; in
logistic regression this is the logistic function, whose inverse is the logit function.
• Odds are defined as the ratio of the probability of an event to the probability of the event
not happening:
odds = p / (1 − p)
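A minimal sketch of the logistic (sigmoid) function and its relationship to the logit:

import math

def sigmoid(z: float) -> float:
    # Maps any real number into the interval (0, 1)
    return 1 / (1 + math.exp(-z))

def logit(p: float) -> float:
    # Log-odds: the inverse of the sigmoid
    return math.log(p / (1 - p))

print(sigmoid(0))            # 0.5
print(logit(sigmoid(2.0)))   # recovers 2.0 (up to floating-point error)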
Applications:
Binary classification: Predicting outcomes such as spam vs. not spam, disease vs. no
disease, etc
Probability estimation: Logistic regression not only classifies but also provides a
probability score for how confident the model is in its prediction.

Multiclass Classification: Although logistic regression is mainly used for binary


classification, it can be extended to multiclass classification problems using strategies
like One-vs-Rest (OvR) or Softmax Regression.
# Python code to train a regularized logistic regression classifier on the Iris dataset using sklearn.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a logistic regression model with regularization
C = 1e4  # inverse of regularization strength (larger C means weaker regularization)
model = LogisticRegression(C=C, max_iter=200, solver='lbfgs', multi_class='auto', random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Logistic regression classification accuracy: {accuracy:.4f}')
Support Vector Machines
• A Support Vector Machine (SVM) is a supervised learning algorithm commonly used for
classification and regression tasks in machine learning.
• It is particularly well-suited for binary classification problems but can be adapted for
multiclass tasks as well.
• An alternative approach to classification is to just look for the hyperplane that “best”
separates the classes in the training data.
• This is the idea behind the support vector machine, which finds the hyperplane that
maximizes the distance to the nearest point in each class

A separating hyperplane
A nonseparable one-dimensional dataset
Dataset becomes separable in higher dimensions
Key Concepts of SVM:
Hyperplane:
• In SVM, the goal is to find the best hyperplane that separates data points belonging to
different classes.
• In a two-dimensional space, this is a line, but in higher dimensions, it becomes a plane or
a hyperplane.
Support Vectors:
• These are the data points that are closest to the hyperplane. The position of these points
determines the hyperplane’s orientation and location.
• The algorithm focuses on these critical points.
Margin:
• The margin is the distance between the hyperplane and the nearest support vectors of
each class.
• SVM tries to maximize this margin, providing a robust classification boundary.
Linear and Non-Linear SVM:
• Linear SVM: Used when data is linearly separable, i.e., when a straight line or a flat
hyperplane can clearly separate the classes.
• Non-Linear SVM: For cases where the data isn't linearly separable, SVM uses a kernel trick
to project the data into higher dimensions where a linear separator can be found.
Kernels:
• Kernels help SVM handle non-linearly separable data by transforming it into a higher-
dimensional space.
• Popular kernel functions include:
  - Linear kernel: used when the data is linearly separable.
  - Polynomial kernel: maps the data into a higher-dimensional space using polynomial
    functions.
  - Radial Basis Function (RBF) kernel: also known as the Gaussian kernel, commonly used
    for complex datasets.
C Parameter:
• This controls the trade-off between maximizing the margin and minimizing the classification
error.
• A larger C aims for a smaller margin with fewer misclassifications, while a smaller C allows
more flexibility and a larger margin.
SVM for Classification Example:
In a simple binary classification problem, given a dataset with two classes, SVM will:
1. Find the hyperplane that best separates the classes.
2. Maximize the margin between the support vectors of the two classes.
3. Predict the class of a new data point based on which side of the hyperplane it falls.
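A minimal SVM classification sketch with scikit-learn, mirroring the Iris examples used earlier
in this module (the RBF kernel and C value are illustrative choices):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load and split the Iris dataset, as in the earlier examples
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RBF-kernel SVM; C controls the margin vs. misclassification trade-off
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
print(f'SVM classification accuracy: {accuracy_score(y_test, y_pred):.4f}')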

Applications of SVM:
• Text and hypertext categorization
• Image classification
• Bioinformatics (e.g., protein classification)
• Handwriting recognition
Some Questions from module 3.

1. What is machine learning? Discuss the different types of machine learning.


2. Explain underfitting and overfitting in detail.
3. Explain K-Nearest Neighbors Algorithm using Iris dataset
4. Explain Naïve Bayes Algorithm in the context of classification with functions
5. Explain the various parameters used in checking the correctness of prediction of Machine
Learning Model
6. Discuss the need for feature extraction and feature selection
7. Explain support vector machine in detail.
8. What is regression? What are the types of regression? Explain the multiple regression in detail.
9. Discuss the goodness of fit test in detail.
10.Discuss Digression in detail.
11. Explain the bias variance tradeoff
