Assignment 2

The document outlines an assignment on simple linear regression, explaining its theory, calculations for estimating coefficients, and making predictions using the Boston Housing Dataset. It details the steps involved in exploratory data analysis, feature selection, and regression modeling, along with performance metrics for the dataset. The conclusion highlights the significant impact of certain features on house prices and suggests future improvements for model generalizability.


ASSIGNMENT NO. 2
AIM: Assignment on Linear Regression

PREREQUISITE: Python programming

THEORY:

Simple Linear Regression

When we have a single input attribute (x) and we want to use linear regression, this is called
simple linear regression. If we had multiple input attributes (e.g. x1, x2, x3), it would be called
multiple linear regression. The procedure for simple linear regression is simpler than that for
multiple linear regression, so it is a good place to start.
In this section we are going to create a simple linear regression model from our training data,
then make predictions on that same training data to get an idea of how well the model learned
the relationship in the data.
With simple linear regression we want to model our data as follows:

y = B0 + B1 * x

This is a line where y is the output variable we want to predict, x is the input variable we know
and B0 and B1 are coefficients that we need to estimate that move the line around.
Technically, B0 is called the intercept because it determines where the line intercepts the y-axis.
In machine learning we can call this the bias, because it is added to offset all predictions that we
make. The B1 term is called the slope because it defines the slope of the line or how x translates
into a y value before we add our bias.
The goal is to find the best estimates for the coefficients to minimize the errors in predicting y
from x.
Simple regression is great, because rather than having to search for values by trial and error or
calculate them analytically using more advanced linear algebra, we can estimate them directly
from our data.
We can start off by estimating the value for B1 as:

B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)

Where mean() is the average value for the variable in our dataset. The xi and yi refer to the fact
that we need to repeat these calculations across all values in our dataset, and i refers to the i'th
value of x or y. We can calculate B0 using B1 and some statistics from our dataset, as follows:

B0 = mean(y) - B1 * mean(x)
Not that bad, right? We can calculate these right in our spreadsheet.

Estimating Slope (B1)

Let's start with the top part of the equation, the numerator. First we need to calculate the mean
values of x and y. The mean is calculated as sum(x) / n, where n is the number of values (5 in
this case). The mean values of our x and y variables are:
𝑋̅ = 3 , 𝑌̅ = 2.8
We now have the parts for calculating the numerator. All we need to do is multiply the error of
each x from its mean by the error of each y from its mean, and sum these products.

Summing these products across the dataset gives us a numerator of 8.


Now we need to calculate the bottom part of the equation for calculating B1, or the denominator.
This is calculated as the sum of the squared differences of each x value from the mean.
We have already calculated the difference of each x value from the mean, all we need to do is
square each value and calculate the sum.
Calculating the sum of these squared values gives us a denominator of 10. Now we can calculate
the value of our slope:
B1 = 8 / 10 = 0.8
Estimating Intercept (B0)

This is much easier, as we already know the values of all of the terms involved:
𝐵0 = 𝑌̅ − (𝐵1 ∗ 𝑋̅)
or
B0 = 2.8 − 0.8 * 3 = 0.4
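The whole slope-and-intercept calculation can be reproduced in a few lines of Python. The five-point dataset below is an assumption (the specific values are not given in the text), chosen because it reproduces every intermediate value in the walkthrough: means of 3 and 2.8, a numerator of 8, and a denominator of 10.

```python
# Assumed example dataset: matches the means (3, 2.8), numerator (8),
# and denominator (10) computed in the walkthrough above.
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Numerator: sum of products of the deviations of x and y from their means.
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: sum of squared deviations of x from its mean.
denominator = sum((xi - mean_x) ** 2 for xi in x)

b1 = numerator / denominator   # slope: 8 / 10 = 0.8
b0 = mean_y - b1 * mean_x      # intercept: 2.8 - 0.8 * 3 = 0.4
print(b0, b1)
```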
Making Predictions
We now have the coefficients for our simple linear regression equation.


y = B0 + B1 * x
or
y = 0.4 + 0.8 * x
Let’s try out the model by making predictions for our training data.

We can plot these predictions as a line with our data. This gives us a visual idea of how well the
line models our data.

Estimating Error

We can calculate an error score for our predictions, called the Root Mean Squared Error or
RMSE:

RMSE = sqrt( sum((pi - yi)^2) / n )

Where sqrt() is the square root function, pi is the predicted value, yi is the actual value, i is the
index of a specific instance, and n is the number of predictions, because we must calculate the
error across all predicted values.


First we must calculate the difference between each model prediction and the actual y value.
We can then easily square each of these error values (error * error, or error^2).

The sum of these squared errors is 2.4 units; dividing by n = 5 and taking the square root gives
us RMSE = 0.692. In other words, each prediction is on average wrong by about 0.692 units.
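As a sketch, the predictions and RMSE can be computed as follows, again assuming the five-point dataset (x = 1, 2, 4, 3, 5; y = 1, 3, 3, 2, 5) that is consistent with the means and totals reported in this walkthrough:

```python
import math

x = [1, 2, 4, 3, 5]   # assumed example dataset (see note above the block)
y = [1, 3, 3, 2, 5]
b0, b1 = 0.4, 0.8     # coefficients estimated earlier

# Predict y for every training x using y = B0 + B1 * x.
predictions = [b0 + b1 * xi for xi in x]

# Square each prediction error, sum them, divide by n, take the square root.
squared_errors = [(p - yi) ** 2 for p, yi in zip(predictions, y)]
rmse = math.sqrt(sum(squared_errors) / len(y))
print(round(rmse, 3))  # matches the ~0.692 reported above
```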

The Boston Housing Dataset contains information about housing prices in different areas of
Boston, influenced by various factors like crime rate, proximity to employment centers, and
pollution levels. The key steps involved in the analysis include:

● Exploratory Data Analysis (EDA): Identifying patterns and correlations between features
such as crime rate (CRIM), average number of rooms (RM), and pollution levels (NOX).
● Feature Selection & Engineering: Determining which features contribute most to housing
prices.
● Regression Modeling: Using machine learning algorithms (e.g., linear regression, decision
trees, or random forests) to predict median house prices (MEDV).
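The regression-modeling step can be sketched with ordinary least squares. The feature matrix below is synthetic stand-in data (the real CRIM/RM/NOX features and MEDV target would come from the Boston Housing data itself), so the fitted numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in feature matrix (think CRIM, RM, NOX) and target (MEDV);
# generated from known coefficients so the fit can be checked.
X = rng.normal(size=(100, 3))
true_coef = np.array([-1.0, 2.0, 0.5])
y = X @ true_coef + 0.4 + rng.normal(scale=0.1, size=100)

# Multiple linear regression via least squares: prepend a column of ones
# so the first fitted value is the intercept (B0), the rest are slopes.
A = np.hstack([np.ones((100, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(2))  # intercept followed by one weight per feature
```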

The performance metrics for this dataset are:

● Mean absolute error (MAE): 3.1627
● Mean squared error (MSE): 21.5174
● R-squared (R²): 0.7112
● Root mean squared error (RMSE): 4.6387
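All four reported metrics can be reproduced from any pair of actual and predicted value lists. The numbers below are hypothetical MEDV values (in $1000s) used only to illustrate the formulas; note that, consistent with the report, the RMSE is the square root of the MSE (√21.5174 ≈ 4.6387):

```python
import math

# Hypothetical actual vs. predicted MEDV values, for illustration only;
# the assignment's reported numbers come from its own train/test split.
y_true = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred = [26.0, 22.1, 33.0, 35.9, 34.8]
n = len(y_true)

mae = sum(abs(p - a) for p, a in zip(y_pred, y_true)) / n
mse = sum((p - a) ** 2 for p, a in zip(y_pred, y_true)) / n
rmse = math.sqrt(mse)                      # RMSE is always sqrt(MSE)

# R² = 1 - (residual sum of squares / total sum of squares)
mean_a = sum(y_true) / n
ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
ss_tot = sum((a - mean_a) ** 2 for a in y_true)
r2 = 1 - ss_res / ss_tot

print(round(mae, 3), round(mse, 3), round(rmse, 3), round(r2, 3))
```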


REFERENCES:

1. Mitchell, T. M., Machine Learning, McGraw-Hill (1997), 1st Edition.

2. Alpaydin, E., Introduction to Machine Learning, MIT Press (2014), 3rd Edition.

3. https://medium.com/analytics-vidhya/understanding-the-linear-regression-808c1f6941c0

CONCLUSION:

The analysis of the Boston Housing Dataset reveals that features like the number of rooms (RM),
crime rate (CRIM), and lower-status population (LSTAT) significantly impact house prices.
Regression models can effectively predict median house values, but the dataset has limitations in
generalizability. Future improvements could include using more advanced models and
incorporating real-world economic factors for better generalizability.
