Regression Stat Assignment
May 4, 2024
Hours   Sulfate
70      5.77
80      5.64
90      5.39
110     5.09
130     4.87
150     4.6
160     4.5
170     4.36
180     4.27
Let's represent the data table as two NumPy arrays for further computation:
[3]: x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                         110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])
#Question 2: Prepare a plot showing - 1. the data points and 2. the regression curve in the original coordinates.
First, plot the data points as is:
[12]: plt.scatter(x_hours, y_sulfate, color='black', label='Data Points')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')
Here, for the regression curve we use the curve_fit() function from scipy.optimize. We
need to assume a functional form for the curve, e.g. sinusoidal, linear (y = mx + c), or
exponential.
In this dataset, the points resemble a decaying exponential, so we assume the function
f(x) = a * e ^ (bx) + c
We then pass this function and the data points to curve_fit(), which returns the fitted
constants a, b and c.
[ ]: constants, _ = curve_fit(e_to_the_power, x_hours, y_sulfate)
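The model function e_to_the_power is not defined in the cells shown above. A minimal self-contained sketch of the fit, assuming the exponential form f(x) = a * e^(bx) + c stated in the text, and an initial guess p0 that is my assumption (without it, curve_fit starts at (1, 1, 1) and exp(180) overflows):

```python
import numpy as np
from scipy.optimize import curve_fit

def e_to_the_power(x, a, b, c):
    # Assumed model from the text: f(x) = a * e^(b*x) + c
    return a * np.exp(b * x) + c

x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                    90, 110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                      6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09,
                      4.87, 4.6, 4.5, 4.36, 4.27])

# A rough decay guess (assumption) keeps the optimizer from overflowing.
constants, _ = curve_fit(e_to_the_power, x_hours, y_sulfate,
                         p0=(10.0, -0.05, 4.0))
print(constants)  # fitted a, b, c
```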
[18]: plt.scatter(x_hours, y_sulfate, color='black', label='Data Points')
plt.plot(x_hours, e_to_the_power(x_hours, *constants), label='Regression Curve')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')
plt.legend()
#Question 3: Plot the residual against the fitted values in log-log and in original coordinates.
The residual is the difference between the observed value of the dependent variable (in this case,
the sulfate concentration) and the value predicted by the regression model. In other words, it
represents the error or deviation of each data point from the fitted regression line or curve.
[19]: # slope, intercept, log_hours and log_sulfate come from the
# log-log linear fit in Question 1 (not shown here)
regression_line_log = slope * log_hours + intercept
residual_log = log_sulfate - regression_line_log
Now we plot the residuals against the fitted values:
[21]: plt.scatter(regression_line_log, residual_log, color='black')
plt.xlabel('Fitted Values (Log)')
plt.ylabel('Residual (Log)')
plt.title('Plot 5: Residual vs Fitted Values (Log-Log)')
For the original coordinates it is the same idea: the residual is the difference between the
observed data points and the fitted curve. We then plot it against the fitted values:
[24]: fit_curve = e_to_the_power(x_hours, *constants)
residual_original = y_sulfate - fit_curve
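The plotting cell for the original coordinates does not appear above. A sketch mirroring Plot 5; the fit is reproduced so the snippet stands alone (in the notebook, fit_curve and residual_original come from the cell above), and the headless Agg backend plus the initial guess p0 are my assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def e_to_the_power(x, a, b, c):
    # Exponential model from the text: f(x) = a * e^(b*x) + c
    return a * np.exp(b * x) + c

x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                    90, 110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                      6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09,
                      4.87, 4.6, 4.5, 4.36, 4.27])

constants, _ = curve_fit(e_to_the_power, x_hours, y_sulfate,
                         p0=(10.0, -0.05, 4.0))
fit_curve = e_to_the_power(x_hours, *constants)
residual_original = y_sulfate - fit_curve

plt.scatter(fit_curve, residual_original, color='black')
plt.axhline(0, linestyle='--', linewidth=1)  # zero-residual reference line
plt.xlabel('Fitted Values')
plt.ylabel('Residual')
plt.title('Plot 6: Residual vs Fitted Values (Original Coordinates)')
plt.savefig('plot6_residuals.png')
```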
#Question 4: Use your plots to explain whether your regression is good or bad and why.
In Plot 5, the regression line we’ve calculated shows residuals distributed around zero, indicating
a good fit. Random scattering suggests the model captures data variation well, making it a strong
predictor.
In contrast, Plot 6 displays residuals clustered away from zero, particularly around values like 4-6.
This indicates a poor fit, with systematic errors in predictions. The model struggles to accurately
forecast values, making it unreliable for future predictions.
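The visual judgment above can be backed with a number. A small sketch of the coefficient of determination R^2 (1.0 is a perfect fit; values near 0 or below mean the model does no better than predicting the mean); the helper name and the example values are illustrative, not from the assignment:

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative example: a perfect prediction and a slightly-off one.
y = np.array([3.0, 5.0, 7.0, 9.0])
print(r_squared(y, y))                                  # → 1.0
print(r_squared(y, np.array([3.5, 5.0, 7.0, 8.5])))     # → 0.975
```

Applying the same helper to residual_log and residual_original would quantify which of the two fits is stronger.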