Module 4 (4CS1201)
Regression
• Simple linear regression models the relationship between one independent variable and a dependent variable with the straight-line equation:
Y = β0 + β1X
where:
• Y is the dependent variable
• X is the independent variable
• β0 is the intercept
• β1 is the slope
• For example, suppose we have the following
dataset with the weight and height of seven
individuals:
Let weight be the predictor variable and let height be the response variable.
If we graph these two variables using a scatterplot, with weight on the x-axis and
height on the y-axis, here’s what it would look like:
From the scatterplot we can clearly see that as weight increases,
height tends to increase as well, but to actually quantify this
relationship between weight and height, we need to use linear
regression.
Implementation
• Simple Linear Regression in Machine learning - Javatpoint
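As a minimal sketch of the weight/height example above, the following fits a simple linear regression with scikit-learn. The seven weight and height values are made up for illustration, not the slide's actual dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weight (kg) and height (cm) values for seven individuals
X = np.array([[55], [62], [68], [74], [80], [85], [92]])  # predictor: weight
y = np.array([160, 165, 167, 172, 175, 178, 183])         # response: height

model = LinearRegression()
model.fit(X, y)

print("Intercept (β0):", model.intercept_)
print("Slope (β1):", model.coef_[0])

# Predict the height of a new individual weighing 70 kg
print("Predicted height at 70 kg:", model.predict([[70]])[0])
```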
Multiple Linear Regression
• Multiple linear regression is used to estimate the relationship
between two or more independent variables and one dependent
variable. You can use multiple linear regression when you want to
know:
1. How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
• The multiple linear regression equation takes the form Y = B0 + B1X1 + B2X2 + … + BnXn + e. Below is a list of what each variable represents:
• Y = the dependent or response variable. This is the variable you are looking to
predict.
• B0 = This is the y-intercept, which is the value of y when all other parameters are
set to 0 (independent variables and error term).
• B1 = the coefficient of the first independent variable (X1) in your model. It can be interpreted as the effect that changing the value of the independent variable has on the predicted y value, holding all else equal: when X1 goes up by one unit, the predicted y goes up by B1.
• “…” = the additional variables you have in your model.
• e = the model error (residual) term. It captures the variation in y that the independent variables do not explain.
• Steps Involved in any Multiple Linear Regression Model
• Step #1: Data Pre-processing
1. Importing the Libraries.
2. Importing the Data Set.
3. Encoding the Categorical Data.
4. Avoiding the Dummy Variable Trap.
5. Splitting the Data Set into Training Set and Test Set.
• Step #2: Fitting Multiple Linear Regression to the Training Set
• Step #3: Predicting the Test Set Results
Implementation
• Multiple Linear Regression With scikit-learn – GeeksforGeeks
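Below is a minimal sketch of the three steps with scikit-learn (the encoding and dummy-variable steps are skipped because the features here are already numeric). The rainfall/temperature/fertilizer values and crop yields are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical data: columns are rainfall (mm), temperature (°C), fertilizer (kg)
X = np.array([
    [100, 20, 5], [120, 22, 6], [90, 19, 4], [130, 24, 7],
    [110, 21, 5], [140, 25, 8], [95, 20, 4], [125, 23, 6],
])
y = np.array([3.1, 3.6, 2.8, 4.0, 3.3, 4.4, 2.9, 3.8])  # crop yield (t/ha)

# Step 1: split the data set into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: fit multiple linear regression to the training set
model = LinearRegression().fit(X_train, y_train)

# Step 3: predict the test set results
print("Intercept (B0):", model.intercept_)
print("Coefficients (B1..B3):", model.coef_)
print("Test predictions:", model.predict(X_test))
```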
Polynomial Regression
• Suppose we have a dataset whose points are arranged non-linearly. If we try to cover it with a linear model, we can clearly see that the line hardly covers any data point. On the other hand, a curve, which is what the Polynomial model produces, is suitable to cover most of the data points.
• Hence, if the dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.
• Steps :
• Data Preparation: Like any machine learning task, you need to prepare your
dataset. This involves cleaning the data, handling missing values, and splitting it
into training and testing sets.
• Feature Engineering: In polynomial regression, you might need to create
additional features by raising the original features (independent variables) to
different powers. For example, if you have a feature x, you might create new
features up to the desired degree.
• Model Selection: Choose the degree of the polynomial that best fits your data.
• Model Fitting: Once you've chosen the degree of the polynomial, fit the
polynomial regression model to your training data. This involves estimating the
coefficients of the polynomial terms that minimize the error between the
predicted and actual values.
• Model Evaluation: Evaluate the performance of your model using metrics like
Mean Squared Error (MSE), Root Mean Squared Error(RMSE),R-squared, etc.,
on the testing dataset.
• Prediction: Use the trained model to make predictions on new, unseen data.
Applications of polynomial regression:
• Curve Fitting: Polynomial regression is often used in curve fitting applications
where the relationship between variables is non-linear. For example, in physics,
it can be used to fit data to equations describing physical phenomena such as
projectile motion or the behavior of a spring.
• Finance: In finance, polynomial regression can be used to model the relationship
between factors such as interest rates, economic indicators, and stock prices.
For instance, it can be employed to analyze the behavior of stock prices over time,
taking into account non-linear patterns.
• Economics: Polynomial regression can be applied in economics to study the
relationship between economic variables like GDP, inflation rates,
unemployment rates, and other factors affecting economic growth. It allows
economists to capture non-linear trends in the data.
• Environmental Science: In environmental science, polynomial regression can be
used to analyze trends in environmental data such as temperature changes,
pollution levels, or species population dynamics. It helps researchers understand
the complex relationships between various environmental factors.
• Medicine and Biology: Polynomial regression is used in medical research and biology to model the relationship
between variables such as dosage and response in drug trials, growth patterns of organisms, or disease
progression. It enables researchers to identify non-linear relationships in biological data.
• Marketing and Sales: In marketing and sales, polynomial regression can be employed to analyze consumer
behavior, sales trends, and market demand. It helps businesses understand the non-linear relationship
between factors like advertising expenditure and sales revenue.
• Signal Processing: Polynomial regression can be used in signal processing applications such as noise filtering,
audio and image processing, and signal reconstruction. It helps in capturing non-linear patterns in signals and
extracting meaningful information.
• Geology and Geophysics: In geology and geophysics, polynomial regression can be utilized to analyze
geological data such as seismic measurements, rock properties, or soil composition. It assists in
understanding the non-linear relationships between geological variables.
• Quality Control and Manufacturing: Polynomial regression can be applied in manufacturing processes for
quality control and process optimization. It helps in modeling the relationship between process parameters
and product quality, identifying non-linear patterns affecting manufacturing outcomes.
• Astronomy and Astrophysics: Polynomial regression is used in astronomy and astrophysics to analyze
observational data, model celestial phenomena, and predict astronomical events. It helps researchers
understand the complex relationships between astronomical variables.
Implementation
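A minimal sketch of the polynomial-regression steps above using scikit-learn's PolynomialFeatures; the non-linear dataset is generated synthetically for illustration, and degree 2 is an assumed choice of polynomial degree:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Synthetic non-linear data: y roughly follows a quadratic in x
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.3, 30)

# Feature engineering + model fitting: create degree-2 polynomial
# features, then fit an ordinary linear regression on them
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Model evaluation (a full workflow would evaluate on a held-out test set)
pred = model.predict(X)
print("MSE:", mean_squared_error(y, pred))

# Prediction on new, unseen data
print("Prediction at x = 1.5:", model.predict([[1.5]])[0])
```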
Logistic Regression
• Logistic Regression is used to predict a categorical dependent variable from one or more independent variables. On the basis of the categories, it can be classified into three types:
• Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “Low”, “Medium”, or “High”.
Implementation
• Logistic Regression in Machine Learning - Javatpoint
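A minimal sketch of binomial logistic regression with scikit-learn; the hours-studied versus pass/fail data is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Predicted class and probability of passing for 4.5 hours of study
print("Predicted class:", clf.predict([[4.5]])[0])
print("P(pass):", clf.predict_proba([[4.5]])[0][1])
```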
Evaluation Metrics for Regression
• Mean Absolute Error (MAE)
Mean Absolute Error (MAE) measures the average absolute difference between a dataset’s actual values and its predicted values. The formula to calculate MAE for a dataset with “n” data points is:
MAE = (1/n) Σ |xi − yi|, where the sum runs over i = 1 to n
Where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
• Mean Squared Error (MSE)
A popular metric in statistics and machine learning is the Mean Squared Error (MSE). It measures the average of the squared differences between a dataset’s actual values and predicted values. MSE is frequently utilized in regression problems and is used to assess how well predictive models work.
For a dataset containing “n” data points, the MSE calculation formula is:
MSE = (1/n) Σ (xi − yi)², where the sum runs over i = 1 to n
where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
R-squared (R²) Score
A statistical metric frequently used to assess the goodness of fit of a regression model is the R-squared (R²) score, also referred to as the coefficient of determination. It quantifies the proportion of the dependent variable’s variance that is explained by the model’s independent variables. R² is a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.
R² = 1 − (SSR / SST)
Where:
R² is the R-squared score.
SSR represents the sum of squared residuals between the predicted values and the actual values.
SST represents the total sum of squares, which measures the total variance in the dependent variable.
• Root Mean Squared Error (RMSE)
RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression analysis and machine learning to measure the accuracy or goodness of fit of a predictive model, especially when the predictions are continuous numerical values. The RMSE quantifies how well the predicted values from a model align with the actual observed values in the dataset.
The formula for RMSE for a dataset with “n” data points is:
RMSE = √MSE = √( (1/n) Σ (xi − yi)² ), where the sum runs over i = 1 to n
Where:
xi represents the actual or observed value for the i-th data point.
yi represents the predicted value for the i-th data point.
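A minimal sketch computing all four metrics with scikit-learn and NumPy, using small hypothetical actual/predicted arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([3.0, 5.0, 2.5, 7.0, 4.5])     # xi: observed values
predicted = np.array([2.8, 5.4, 2.9, 6.6, 4.3])  # yi: model predictions

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)               # RMSE is the square root of MSE
r2 = r2_score(actual, predicted)  # 1 - SSR/SST

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```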