Module4 CSE3190 FDA Updated
Linear least squares is a statistical method used to find the best-fitting
line or curve to a set of data points.
It's a powerful technique for understanding relationships between
variables and making predictions.
Here's how it works:
Data Points: You have a set of data points, each with an x-value and a
corresponding y-value.
Line of Best Fit: You want to find a line (or curve) that best represents
the trend in the data.
Residuals: For each data point, calculate the vertical distance between
the point and the line. These distances are called residuals.
Minimizing the Sum of Squared Residuals: The goal of linear least
squares is to find the line that minimizes the sum of the squares of these
residuals.
Why Square the Residuals?
Squaring the residuals ensures that positive and negative deviations
from the line both contribute positively to the sum, so they cannot
cancel each other out.
Key Terms:
•Method of Least Squares: Another name for linear least
squares, emphasizing the minimization of squared residuals.
•Line of Best Fit: The line that best fits the data points,
according to the least squares criterion.
•Ordinary Least Squares (OLS): The most common form
of linear least squares, typically applied when the errors are
assumed to be independent with constant variance (and often
normally distributed).
•Linear Regression: A statistical method used to model the
relationship between a dependent variable and one or more
independent variables.
What is the Least Square Method?
• The Least Square Method is used to derive a generalized linear
equation between two variables, where the values of the
independent and dependent variables are represented as the x
and y coordinates in a 2D Cartesian coordinate system.
• Initially, known values are marked on a plot. The plot obtained
at this point is called a scatter plot.
• Then, represent all the marked points as a straight line or
a linear equation. The equation of such a line is obtained with
the help of the Least Square method.
• This is done to estimate the value of the dependent variable at
values of the independent variable where it was not initially
known.
• This is what allows us to make predictions for the value of the
dependent variable.
Least Square Method Definition: The Least Squares method is a statistical
technique used to find the equation of the best-fitting curve or line for a set
of data points by minimizing the sum of the squared differences between
the observed values and the values predicted by the model.
Formula for Least Square Method
The Least Square Method formula is used to find the best-fitting line through
a set of data points. For a simple linear regression, which is a line of the
form y = mx + c, where y is the dependent variable, x is the independent
variable, m is the slope of the line, and c is the y-intercept, the formulas
to calculate the slope (m) and intercept (c) of the line are derived from
the following equations:
1. Slope (m) Formula: m = [n(∑xy) − (∑x)(∑y)] / [n(∑x²) − (∑x)²]
2. Intercept (c) Formula: c = [(∑y) − m(∑x)] / n
Where:
n is the number of data points,
∑xy is the sum of the products of each pair of x and y values,
∑x is the sum of all x values,
∑y is the sum of all y values,
∑x² is the sum of the squares of the x values.
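A minimal R sketch of these raw-sum formulas (using the data from the solved
Problem 1 later in this module; c0 is used for the intercept because c is
reserved for R's built-in c() function):

# Slope and intercept from the raw-sum formulas above
x <- c(1, 2, 4, 6, 8)
y <- c(3, 4, 8, 10, 15)
n <- length(x)
m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
c0 <- (sum(y) - m * sum(x)) / n              # intercept
cat("slope =", m, "intercept =", c0, "\n")   # slope ≈ 1.677, intercept ≈ 0.957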
The steps to find the line of best fit using the least square method are:
• Step 1: Denote the independent variable values as xi and the
dependent ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c,
where m is the slope of the line and c represents the intercept of the
line on the Y-axis.
• Step 4: The slope m can be calculated from the following formula:
m = [Σ(X – xi) × (Y – yi)] / Σ(X – xi)²
• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
• Thus, we obtain the line of best fit as y = mx + c, where values of m
and c can be calculated from the formulae defined above.
• These formulas are used to calculate the parameters of the line that
best fits the data according to the criterion of the least squares,
minimizing the sum of the squared differences between the observed
values and the values predicted by the linear model.
Least Square Method Graph
• Let us have a look at how the data points and the line of best fit obtained from the
Least Square method look when plotted on a graph.
• The red points in the above plot represent the data points for the sample data
available. Independent variables are plotted as x-coordinates and dependent ones
are plotted as y-coordinates.
• The equation of the line of best fit obtained from the Least Square method is
plotted as the red line in the graph.
• The above graph shows how the Least Square method helps us find a line
that best fits the given data points, which can then be used to make
predictions about the value of the dependent variable where it is not
initially known.
Limitations of the Least Square Method
• The Least Square method assumes that the data is evenly
distributed and doesn’t contain any outliers for deriving a line
of best fit.
• Because the residuals are squared, outliers exert a
disproportionately large influence on the fitted line, so this
method doesn’t provide accurate results for unevenly
distributed data or for data containing outliers.
Least Square Method Solved Examples
Problem 1: Find the line of best fit for the following data
points using the Least Square method: (x,y) = (1,3), (2,4),
(4,8), (6,10), (8,15).
Solution:
• Take x as the independent variable and y as the dependent
variable.
• Calculate the means of the x and y values, denoted by X and Y
respectively:
X = (1+2+4+6+8)/5 = 4.2
Y = (3+4+8+10+15)/5 = 8
xi    yi    X – xi    Y – yi    (X – xi)(Y – yi)    (X – xi)²
1     3     3.2       5         16                  10.24
2     4     2.2       4         8.8                 4.84
4     8     0.2       0         0                   0.04
6     10    -1.8      -2        3.6                 3.24
8     15    -3.8      -7        26.6                14.44
Sum                             55                  32.8
• The slope of the line of best fit can be calculated from the
formula as follows:
m = [Σ(X – xi)(Y – yi)] / Σ(X – xi)²
m = 55/32.8 ≈ 1.68
• The intercept is calculated from the formula as follows:
c = Y – mX
c = 8 – 1.68 × 4.2 ≈ 0.94
• Thus, the equation of the line of best fit becomes:
y = 1.68x + 0.94
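As a quick cross-check, the same fit can be reproduced in R with lm(); its
unrounded coefficients (intercept ≈ 0.957, slope ≈ 1.677) agree with the hand
calculation up to rounding:

x <- c(1, 2, 4, 6, 8)
y <- c(3, 4, 8, 10, 15)
fit <- lm(y ~ x)   # least-squares fit of y = c + m*x
coef(fit)          # (Intercept) ≈ 0.957, slope on x ≈ 1.677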
Problem 2: Find the line of best fit for the following
data of heights and weights of students of a school
using the Least Square method:
Height (in centimeters): [160, 162, 164, 166, 168]
What Is a Line of Best Fit?
• The line of best fit, also known as a trend line or linear regression
line, is a straight line that is used to approximate the relationship
between two variables in a set of data points on a scatter plot.
• This line attempts to show the pattern within the data by
minimizing the total distance between itself and all the data
points.
• The line of best fit is calculated using the statistical method of
linear regression.
• The line of best fit minimizes the distance between the observed
data points and the line itself, indicating the overall data trend.
Line of Best Fit in Regression
• Line of best fit is a straight line that best represents the
relationship between two variables in a dataset.
• It is used in statistics to summarize and analyze the
relationship between the variables.
• The line of best fit is often determined through
regression analysis, a statistical technique used to
model the relationship between variables.
• Regression analysis helps quantify the strength and
direction of the relationship, allowing for predictions
and further analysis.
Line of Best Fit in Statistics
• In statistics, the line of best fit, also known as the trend
line, is a straight line that best represents the data points
on a scatter plot.
• This line attempts to show the relationship between two
variables by minimizing the distances between the points
and the line itself, specifically the vertical distances.
• The process of finding this line involves a method
called linear regression, where the goal is to minimize the
sum of the squares of these vertical distances, a technique
often referred to as the "least squares" method.
Line of Best Fit Formula
• The line of best fit is calculated using the least squares method,
which minimizes the sum of the squares of the vertical distances
between the observed data points and the line.
• The formula for the equation of the line of best fit is:
y = mx + b
where:
• m is Slope of Line
• b is Y-Intercept
How to Calculate the Line of Best Fit?
• Calculating the line of best fit involves finding the slope and y-
intercept of the line that minimizes the overall distance between
the line and the data points. A regression with two independent
variables, for example, is solved using the formula below (see the
sketch after this list):
y = c + b1(x1) + b2(x2)
where,
• y is Dependent Variable
• c is Constant
• b1 is First Regression Coefficient
• x1 is First Independent Variable
• b2 is Second Regression Coefficient
• x2 is Second Independent Variable
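A minimal sketch of fitting this two-predictor form in R with lm(); the data
vectors here are made up purely for illustration:

# Hypothetical data with two independent variables
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 5)
y  <- c(5, 6, 11, 11, 16)
fit <- lm(y ~ x1 + x2)   # fits y = c + b1*x1 + b2*x2 by least squares
coef(fit)                # returns c (intercept), b1 and b2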
• To find the line of best fit, you can use various statistical
software or programming languages like Python or R which
have built-in functions for regression analysis.
• Alternatively, you can manually calculate the line's parameters
using statistical formulas.
• The line of best fit is a key concept in statistics, showcasing the
relationship between two variables within a dataset.
• It serves as a foundational tool in regression analysis, enabling
the prediction of one variable's value based on another.
• By determining the line of best fit, the aim is to minimize the
vertical distances between the line and the data points,
providing the closest approximation of the overall trend.
Example: Consider a dataset representing the relationship between the number of
hours studied and the score achieved on a test:
Hours Studied    Test Score
2                65
3                70
4                75
5                80
6                85
By calculating the line of best fit for this data, we can predict the test score based on
the number of hours studied. For this data the points lie exactly on a line, so the least
squares fit is:
y = mx + b
Test Score = 5 × Hours Studied + 55
where:
• m = 5
• b = 55
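For example, a student who studies 4 hours is predicted to score
5 × 4 + 55 = 75, which matches the observed value in the table.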
Is a Line of Best Fit Always Straight?
• Line of best fit is typically assumed to be straight in linear regression
analysis.
• However, in more complex regression techniques like polynomial
regression, the line of best fit can take on curved forms to better fit
the data.
• Thus, the line of best fit is not always required to be straight.
Line of Best Fit Examples
Example 1: For the following data, the equation of the line of best fit is
y = -3.2x + 880. Find how many people will attend the show if the ticket
price is $15.

Ticket price (x) (in $)           12     15     18     20
Number of people attending (y)    780    800    820    830
Solution:
Substitute the ticket price x = 15 into the equation of the line of best fit:
y = -3.2x + 880
y = -3.2 × 15 + 880 = 832
832 people are predicted to attend the show if the ticket price is $15.
Example 2: For the following data, the equation of the line of best fit is
y = -6.5x + 1350. Find how many people will purchase chocolate if the
price of chocolate is $30.

Chocolate price (x)                          $25    $30    $35    $40
Number of people purchasing chocolate (y)    480    600    720    840
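Solution:
Substitute the chocolate price x = 30 into the equation of the line of best fit:
y = -6.5x + 1350
y = -6.5 × 30 + 1350 = 1155
1155 people are predicted to purchase chocolate if the price is $30.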
• Reliability of the line of best fit depends on the strength of the
relationship between the variables and the variability of the data.
Generally, if the data points closely follow the trend of the line,
predictions tend to be more accurate.
• Goodness-of-fit of the line of best fit is often assessed using
metrics like the coefficient of determination (R-squared). A
higher R-squared value indicates a better fit, meaning the line
explains more of the variability in the data.
• The line of best fit is used in various fields such as economics,
engineering, biology and the social sciences. Common applications
include forecasting sales based on historical data, analyzing
trends in stock prices, and predicting crop yields.
Linear Regression analysis
• It is a statistical method that is used for predictive analysis.
• Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence the name linear regression.
• Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Linear Regression analysis
• Types of Linear Regression
Simple Linear Regression: If a single independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Simple Linear Regression.
Multiple Linear regression: If more than one independent variable is used
to predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.
• Model Performance: R-squared method:
R-squared is a statistical measure that determines the goodness of fit.
A high R-squared value indicates a small difference between the
predicted values and the actual values, and hence represents a good model.
It can be calculated from the formula:
R² = 1 − (SSres / SStot)
where SSres is the sum of squared residuals and SStot is the total sum
of squares, Σ(yi − ȳ)².
Simple Linear regression
• Models the relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is linear or a
sloped straight line.
• Simple Linear regression algorithm has mainly two objectives:
Model the relationship between the two variables. Eg: Income and expenditure,
experience and Salary, etc.
Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
• The Simple Linear Regression model can be represented using the equation below:
y = a0 + a1x + ε
a0 = the intercept of the regression line (obtained by putting x = 0)
a1 = the slope of the regression line, which is either increasing or decreasing
ε = the error term (negligible for a good model)
Multiple linear regression
• Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable
and more than one independent variable.
• For MLR, the dependent or target variable (Y) must be continuous/real, but
the predictor or independent variables may be of continuous or categorical form.
• Each feature variable must model a linear relationship with the dependent
variable.
• MLR tries to fit a regression line (a hyperplane) through a multidimensional
space of data points.
• Example: Prediction of CO2 emission based on engine size and number of
cylinders in a car.
Multiple linear regression
• MLR equation:
o In Multiple Linear Regression, the target variable (Y) is a linear
combination of multiple predictor variables x1, x2, x3, ..., xn:
Y = b0 + b1x1 + b2x2 + b3x3 + ⋯ + bnxn
where, Y = output/response variable
b0 = intercept (constant term)
b1, b2, b3, ..., bn = coefficients for the independent variables
x1, x2, x3, ..., xn = the independent/feature variables
• Assumptions for Multiple Linear Regression:
A linear relationship should exist between the Target and predictor
variables.
The regression residuals must be normally distributed.
MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
Evaluation metrics for regression model
• In regression problems, the prediction error is used to define the model
performance. The prediction error is also referred to as residuals and it is
defined as the difference between the actual and predicted values.
• Residuals are important when determining the quality of a model.
• Residual = actual value − predicted value, i.e.,
error (e) = y − ŷ
• We can technically inspect all residuals to judge the model’s accuracy, but this
does not scale if we have thousands or millions of data points. That’s why we
have summary measurements that take our collection of residuals and condense
them into a single value representing our model's predictive ability.
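A short sketch of computing residuals in R, reusing the Problem 1 data from
earlier in this module:

x <- c(1, 2, 4, 6, 8)
y <- c(3, 4, 8, 10, 15)
fit <- lm(y ~ x)
e <- y - predict(fit)   # residual = actual − predicted, one per data point
print(e)                # residuals(fit) returns the same values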
Evaluation metrics for regression model – R Squared
• The R² score is a metric that describes the performance of your model in
relative terms, rather than reporting the loss in an absolute sense.
• With the help of R squared we get a baseline model (the mean line) to
compare against, which none of the other metrics provides.
• This is analogous to the fixed 0.5 threshold used in classification
problems. Essentially, R² measures how much better the regression line is
than a simple mean line.
Evaluation metrics for regression model – R Squared
• Now, how do you interpret the R² score? If the R² score is zero, the
regression line performs no better than the mean line: the ratio of their
squared errors is 1, and 1 − 1 = 0.
• In this case the two lines effectively coincide, meaning model performance
is at its worst; the model gains nothing from the input features.
• The second case is when the R² score is 1: the ratio term is zero, which
happens only when the regression line makes no mistakes at all, i.e., a
perfect fit. In the real world this is not achievable.
• We can conclude that as the regression line moves towards perfection, the
R² score moves towards one, and model performance improves.
• The normal case is an R² score between zero and one, such as 0.8, which
means the model is able to explain 80 per cent of the variance in the data.
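A minimal sketch of computing R² by hand in R and comparing it with the value
lm() reports, again reusing the Problem 1 data:

x <- c(1, 2, 4, 6, 8)
y <- c(3, 4, 8, 10, 15)
fit <- lm(y ~ x)
ss_res <- sum((y - predict(fit))^2)   # squared errors of the regression line
ss_tot <- sum((y - mean(y))^2)        # squared errors of the mean line
r2 <- 1 - ss_res / ss_tot
c(manual = r2, from_summary = summary(fit)$r.squared)   # the two should match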
Tools for Model Testing:
• Statistical Software:
• R
• Python (with libraries like Statsmodels, Scikit-learn)
• SPSS
• SAS
• Visualization Tools:
• Python's Matplotlib and Seaborn
• R's ggplot2
Implementations of Linear Least Squares
1. R: The lm() function in R provides an implementation of
LLS.
2. Python: The scikit-learn library in Python provides an
implementation of LLS.
3. Excel: Excel provides a built-in function LINEST() for
LLS.
Logistic Regression
• Logistic regression, also known as a logit model, is a statistical
analysis method to predict a binary outcome, such as yes or no,
based on prior observations of a dataset.
• A logistic regression model predicts a dependent data variable by
analyzing the relationship between one or more existing
independent variables.
• For example, logistic regression could be used to predict whether a
political candidate will win or lose an election or whether a high
school student will be admitted to a particular college. These binary
outcomes enable straightforward decisions between two
alternatives.
Some Key Points:
• Logistic regression predicts the output of a categorical
dependent variable. Therefore, the outcome must be a
categorical or discrete value.
• The outcome can be Yes or No, 0 or 1, True or False, etc., but instead
of giving the exact values 0 and 1, logistic regression gives
probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we
fit an “S” shaped logistic function, which predicts two
maximum values (0 or 1).
How does logistic regression work?
• Logistic regression finds the best possible fit between the predictor
and target variables to predict the probability of the target variable
belonging to a labeled class/category.
• It performs regression on the probabilities of the outcome being a
category. It uses a sigmoid function (the cumulative distribution
function of the logistic distribution) to transform the right-hand
side of that equation.
y_predictions = logistic_cdf(intercept + slope * features)
(CDF = cumulative distribution function)
• Again, the model uses optimization to try and find the best possible
values of intercept and slope.
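A minimal sketch of this transformation in R; plogis() is the logistic CDF,
and the intercept/slope values below are purely hypothetical:

sigmoid <- function(z) 1 / (1 + exp(-z))    # logistic CDF, equivalent to plogis(z)
intercept <- -4; slope <- 1.5               # hypothetical coefficients
features <- c(1, 2, 3, 4, 5)                # hypothetical predictor values
p <- sigmoid(intercept + slope * features)  # predicted probabilities in (0, 1)
round(p, 3)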
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified
into three types:
• Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or
more possible unordered types of the dependent variable, such as
“cat”, “dog”, or “sheep”.
• Ordinal: In ordinal Logistic regression, there can be 3 or more
possible ordered types of dependent variables, such as “low”,
“Medium”, or “High”.
Logistic Regression Packages
• In R, there are two popular workflows for modeling logistic
regression: base-R and tidymodels.
• The base-R workflow is simpler and includes functions like
glm() and summary() to fit the model and generate a model
summary.
• The tidymodels workflow allows easier management of multiple
models and a consistent interface for working with different model
types.
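A minimal tidymodels sketch for comparison, assuming the tidymodels packages
are installed; the data frame mirrors the base-R example that follows:

library(tidymodels)
data <- data.frame(
  Hours_Studied = c(2, 3, 5, 7, 1, 4, 6, 8, 2.5, 3.5),
  Pass = factor(c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1))   # parsnip expects a factor outcome
)
model <- logistic_reg() %>%      # model specification (classification by default)
  set_engine("glm") %>%          # same underlying fit as base-R glm()
  fit(Pass ~ Hours_Studied, data = data)
tidy(model)                      # coefficient table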
Problem: How does the code model the relationship between the hours
a student studies and their likelihood of passing, predict whether a
student will pass or fail, and evaluate the model's accuracy by comparing
predictions to actual results? Execute the same in R studio using an
example.
• Understanding the Relationship:
The code analyzes how the number of hours a student studies affects the
chance of passing an exam. It tries to find a pattern or rule that connects
these two.
• Making Predictions:
Based on the pattern it finds, the code guesses whether a student will pass
or fail based on how many hours they studied.
• Checking Accuracy:
It checks how good its guesses are by comparing them to the actual
results (whether students really passed or failed).
Code:
# Simulated dataset
data <- data.frame(
  Hours_Studied = c(2, 3, 5, 7, 1, 4, 6, 8, 2.5, 3.5),
  Pass = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 1)
)
# View the dataset
print(data)
# Fit logistic regression model
model <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
# View model summary
summary(model)
# Predict probabilities
data$Predicted_Prob <- predict(model, type = "response")
Contd..(Code)
# Convert probabilities to binary predictions (threshold = 0.5)
data$Predicted_Class <- ifelse(data$Predicted_Prob > 0.5, 1, 0)
# View predictions
print(data)
# Calculate accuracy
correct_predictions <- sum(data$Predicted_Class == data$Pass)
accuracy <- correct_predictions / nrow(data)
# Print accuracy
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
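As an optional follow-up not shown on the slide, a confusion matrix from base
R's table() gives a fuller picture than accuracy alone:

# Cross-tabulate predicted classes against actual outcomes
table(Predicted = data$Predicted_Class, Actual = data$Pass)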
Explanation
data.frame: Creates a dataset with two columns:
• Hours_Studied: Independent variable (number of hours a student studied).
• Pass: Dependent variable (binary outcome: 1 for pass, 0 for fail).
glm: Fits a Generalized Linear Model.
Pass ~ Hours_Studied: Specifies the relationship between Pass (dependent
variable) and Hours_Studied (independent variable).
family = binomial: Indicates logistic regression, as the response variable is binary.
summary(model): Outputs the regression coefficients, significance levels, and
goodness-of-fit metrics for the logistic regression model.
predict(model, type = "response"): Generates predicted probabilities for Pass
(values between 0 and 1) based on the logistic regression model. These
probabilities are stored in a new column, Predicted_Prob.
Contd..(Explanation)
• ifelse(condition, value_if_true, value_if_false): Converts
probabilities into binary classes based on a threshold (0.5 here). If
Predicted_Prob > 0.5, the student is predicted to pass (1); otherwise,
to fail (0).
• sum(condition): Counts how many predictions (Predicted_Class)
match the actual outcomes (Pass).
• accuracy: Calculates the proportion of correct predictions out of the
total number of rows in the dataset.
• round(accuracy * 100, 2): Converts the accuracy to a percentage with
2 decimal places.
• paste(): Combines text and the calculated accuracy into a printable
statement.
Output
(The R console output of the code above was shown here on the original slide.)