Assignment 2
AIM: Assignment on Linear Regression
THEORY:
When we have a single input attribute (x) and we want to use linear regression, this is called
simple linear regression.
If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear
regression. The procedure for simple linear regression is different from and simpler than that for
multiple linear regression, so it is a good place to start.
In this section we are going to create a simple linear regression model from our training data,
then make predictions for our training data to get an idea of how well the model learned the
relationship in the data.
With simple linear regression we want to model our data as follows:
y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we know
and B0 and B1 are coefficients that we need to estimate that move the line around.
Technically, B0 is called the intercept because it determines where the line intercepts the y-axis.
In machine learning we can call this the bias, because it is added to offset all predictions that we
make. The B1 term is called the slope because it defines the slope of the line or how x translates
into a y value before we add our bias.
The goal is to find the best estimates for the coefficients to minimize the errors in predicting y
from x.
Simple linear regression is great because, rather than having to search for coefficient values by
trial and error or calculate them analytically using more advanced linear algebra, we can estimate
them directly from our data.
We can start off by estimating the value for B1 as:
B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
Where mean() is the average value for the variable in our dataset. The xi and yi refer to the fact
that we need to repeat these calculations across all values in our dataset, and i refers to the i'th
value of x or y. We can calculate B0 using B1 and some statistics from our dataset, as follows:
B0 = mean(y) - B1 * mean(x)
Not that bad, right? We can calculate these coefficients right in our spreadsheet.
Estimating Slope (B1)
Let's start with the top part of the equation, the numerator. First we need to calculate the mean
value of x and y. The mean is calculated as sum(x) / n, where n is the number of values (5 in this
case). Let's calculate the mean value of our x and y variables:
mean(x) = 3, mean(y) = 2.8
We now have the parts for calculating the numerator. All we need to do is multiply the error for
each x with the error for each y and calculate the sum of these products.
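To make the arithmetic concrete, here is a minimal Python sketch of the coefficient estimation.
The five x and y values are illustrative assumptions (they are not listed explicitly in this
write-up), chosen so that mean(x) = 3 and mean(y) = 2.8 as stated above.

# Simple linear regression coefficients estimated directly from the data.
# The data values below are assumed for illustration; they reproduce the
# means quoted in the text (mean(x) = 3, mean(y) = 2.8).
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Numerator: sum of products of the x and y deviations from their means.
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: sum of squared deviations of x from its mean.
denominator = sum((xi - mean_x) ** 2 for xi in x)

B1 = numerator / denominator   # slope
B0 = mean_y - B1 * mean_x      # intercept (bias)
print(B0, B1)                  # 0.4 0.8 for this illustrative data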
Putting these estimates together, our simple linear regression model becomes:
y = B0 + B1 * x
or
y = 0.4 + 0.8 * x
Let’s try out the model by making predictions for our training data.
We can plot these predictions as a line with our data. This gives us a visual idea of how well the
line models our data.
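Below is a minimal sketch of making predictions and plotting them against the training data,
assuming matplotlib is available and reusing B0, B1, x and y from the previous snippet.

import matplotlib.pyplot as plt

# Sort x so the fitted line is drawn left to right, and predict y for each value.
xs = sorted(x)
predictions = [B0 + B1 * xi for xi in xs]

plt.scatter(x, y, label="training data")                # actual points
plt.plot(xs, predictions, color="red", label="model")   # fitted line
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()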
Estimating Error
We can calculate an error score for our predictions called the Root Mean Squared Error or RMSE:
RMSE = sqrt( sum((pi - yi)^2) / n )
Where sqrt() is the square root function, p is the predicted value, y is the actual value, i is the
index for a specific instance and n is the number of predictions, because we must calculate the
error across all predicted values.
First we must calculate the difference between each model prediction and the actual y value. We
can easily calculate the square of each of these error values (error * error or error^2).
The sum of these squared errors is 2.4 units; dividing by n and taking the square root gives us:
RMSE = 0.692
Or, each prediction is on average wrong by about 0.692 units.
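A short sketch of the RMSE calculation, continuing from the illustrative predictions above:

from math import sqrt

# Squared error for each prediction, then the mean, then the square root.
predictions = [B0 + B1 * xi for xi in x]
squared_errors = [(p - yi) ** 2 for p, yi in zip(predictions, y)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 3))   # about 0.692 for the illustrative data above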
The Boston Housing Dataset contains information about housing prices in different areas of
Boston, influenced by various factors like crime rate, proximity to employment centers, and
pollution levels. The key steps involved in the analysis include preparing the data, fitting a
regression model, and evaluating its predictions, as sketched below.
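The exact pipeline used for the Boston analysis is not reproduced here; the following is a minimal
sketch with pandas and scikit-learn, assuming the data is available as a local CSV file
(boston.csv is a hypothetical path) with a MEDV column holding the median house value.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the Boston Housing data; the file name and column layout are assumed.
data = pd.read_csv("boston.csv")
X = data.drop(columns=["MEDV"])   # features such as CRIM, RM, LSTAT, ...
y = data["MEDV"]                  # target: median house value

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a multiple linear regression model and evaluate it with RMSE.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
rmse = mean_squared_error(y_test, predictions) ** 0.5
print("RMSE:", rmse)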
REFERENCES:
1. Mitchell, T. M., Machine Learning, McGraw Hill (1997), 1st Edition.
2. Alpaydin, E., Introduction to Machine Learning, MIT Press (2014), 3rd Edition.
3. https://medium.com/analytics-vidhya/understanding-the-linear-regression-808c1f6941c0
CONCLUSION:
The analysis of the Boston Housing Dataset reveals that features like the number of rooms (RM),
crime rate (CRIM), and lower status population (LSTAT) significantly impact house prices.
Regression models can effectively predict median house values, but the dataset has limitations in
generalizability. Future improvements could include using advanced models and incorporating
real-world economic factors for more accurate and generalizable predictions.