
Regression Stat Assignment

Uploaded by

Pritom Das

copy-of-regression-stat-assignment

May 4, 2024

#Statistics Assignment: Generating Regression Model


[2]: import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
from scipy.optimize import curve_fit

Solve the problems: At http://www.statsci.org/data/general/brunhild.html, you will find a dataset
that measures the concentration of a sulfate in the blood of a baboon named Brunhilda as a function
of time. Build a linear regression of the log of the concentration against the log of time.
(a) Prepare a plot showing
1. the data points and
2. the regression line in log-log coordinates.
(b) Prepare a plot showing
1. the data points and
2. the regression curve in the original coordinates.
(c) Plot the residual against the fitted values in log-log and in original coordinates.
(d) Use your plots to explain whether your regression is good or bad and why.
The dataset at http://www.statsci.org/data/general/brunhild.html measures the concentration
of a sulfate in the blood of a baboon named Brunhilda as a function of time. The
data table is presented here:
Hours Sulfate
2 15.11
4 11.36
6 9.77
8 9.09
10 8.48
15 7.69
20 7.33
25 7.06
30 6.7
40 6.43
50 6.16
60 5.99

70 5.77
80 5.64
90 5.39
110 5.09
130 4.87
150 4.6
160 4.5
170 4.36
180 4.27
Let's represent the data table as two NumPy arrays for the calculations that follow.
[3]: x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180])

y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])

#Question a: Prepare a plot showing
1. the data points and
2. the regression line in log-log coordinates.
[4]: log_hours = np.log(x_hours)
log_sulfate = np.log(y_sulfate)

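[5]: # linregress returns the slope, intercept, correlation coefficient, p-value and standard error
slope, intercept, r_value, p_value, std_err = linregress(log_hours, log_sulfate)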

[10]: plt.scatter(log_hours, log_sulfate, color='black', label='Data Points')
plt.plot(log_hours, slope * log_hours + intercept, label='Regression Line')
plt.xlabel('Log_Hours')
plt.ylabel('Log_Sulfate')
plt.title('Log-Log Plot of Sulfate Concentration vs. Hours')
plt.legend()  # display the labels defined above
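As a cross-check on the log-log fit, the sketch below refits the same data with `np.polyfit` and compares the result with `linregress`. It also reads the correlation coefficient from the result object to get R². The variable names `fit`, `poly_slope`, `poly_intercept` and `r_squared` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import linregress

# Data copied from the table above
x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])

log_hours = np.log(x_hours)
log_sulfate = np.log(y_sulfate)

# Fit once with linregress, keeping the whole result object
fit = linregress(log_hours, log_sulfate)

# Independent degree-1 least-squares fit; polyfit returns the highest power first
poly_slope, poly_intercept = np.polyfit(log_hours, log_sulfate, 1)

# Squared correlation coefficient of the log-log fit
r_squared = fit.rvalue ** 2

print(f"linregress: slope={fit.slope:.4f}, intercept={fit.intercept:.4f}")
print(f"polyfit:    slope={poly_slope:.4f}, intercept={poly_intercept:.4f}")
print(f"R^2 (log-log) = {r_squared:.4f}")
```

A high R² alone does not show that the straight line is the right shape; the residual plots later in the assignment are the better diagnostic for that.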

#Question b: Prepare a plot showing
1. the data points and
2. the regression curve in the original coordinates.
First, plot the data points as is:
[12]: plt.scatter(x_hours, y_sulfate, color='black', label='Data Points')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')

[12]: Text(0.5, 1.0, 'Plot of Sulfate Concentration vs. Time')

Here, for the regression curve we use the curve_fit() function from scipy.optimize. We
need to assume the type of curve, e.g. sine, tangent, linear (y = mx + c) or exponential.
In this dataset, the points resemble a decaying exponential function, so we
assume the function to be
f(x) = a * e ^ (-bx) + c

[15]: def e_to_the_power(x, a, b, c):
    return a * np.exp(-b * x) + c

Now we pass this function and the data points to curve_fit(). It returns the fitted
constants (and their covariance matrix, which we discard). In this case the constants
are a, b and c in
f(x) = a * e ^ (-bx) + c
[ ]: constants, _ = curve_fit(e_to_the_power, x_hours, y_sulfate)
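When `p0` is omitted, `curve_fit` starts every parameter at 1, which can be far from a sensible decay rate for time values up to 180. The sketch below passes an explicit starting guess and compares the fitted sum of squared errors with that of a constant-mean model. The function name `exp_decay` and the guess `p0 = (10.0, 0.1, 5.0)` are assumptions read off the data by eye, not values from the original notebook.

```python
import numpy as np
from scipy.optimize import curve_fit

# Data copied from the table above
x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180], dtype=float)
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])


def exp_decay(x, a, b, c):
    """Same model as e_to_the_power: a * exp(-b * x) + c."""
    return a * np.exp(-b * x) + c


# Assumed starting guess: amplitude ~10, decay rate ~0.1 per hour, floor ~5
p0 = (10.0, 0.1, 5.0)
params, cov = curve_fit(exp_decay, x_hours, y_sulfate, p0=p0)

# Sum of squared errors of the fit versus a constant-mean baseline
sse = np.sum((y_sulfate - exp_decay(x_hours, *params)) ** 2)
sst = np.sum((y_sulfate - y_sulfate.mean()) ** 2)

print("fitted a, b, c:", params)
print(f"SSE = {sse:.4f}, SST = {sst:.4f}")
```

Different starting guesses can lead the optimiser to different local solutions, so it is worth trying more than one.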

Plotting the regression curve:

[18]: plt.scatter(x_hours, y_sulfate, color='black', label='Data Points')
plt.plot(x_hours, e_to_the_power(x_hours, *constants), label='Regression Curve')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')
plt.legend()  # display the labels defined above

[18]: Text(0.5, 1.0, 'Plot of Sulfate Concentration vs. Time')
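The log-log regression line also defines a curve in the original coordinates without any further fitting: if log(y) = m * log(t) + k, then y = e^k * t^m. The sketch below computes that back-transformed power-law curve as an alternative to the exponential model above. The names `power_law`, `fitted_original` and `residual_power_law` are illustrative only.

```python
import numpy as np
from scipy.stats import linregress

# Data copied from the table above
x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180], dtype=float)
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])

# Linear regression of log(sulfate) on log(hours)
fit = linregress(np.log(x_hours), np.log(y_sulfate))


def power_law(t):
    """Back-transform the log-log line: y = exp(intercept) * t ** slope."""
    return np.exp(fit.intercept) * t ** fit.slope


# Fitted values and residuals in the original coordinates
fitted_original = power_law(x_hours)
residual_power_law = y_sulfate - fitted_original

print("fitted values:", np.round(fitted_original, 2))
print("residuals:    ", np.round(residual_power_law, 2))
```

Plotting `fitted_original` against `x_hours` with `plt.plot` would overlay this curve on the scatter of the raw data.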

#Question c: Plot the residual against the fitted values in log-log and in original coordinates.
The residual is the difference between the observed value of the dependent variable (in this case,
the sulfate concentration) and the value predicted by the regression model. In other words, it
represents the error or deviation of each data point from the fitted regression line or curve.
[19]: regression_line_log = slope * log_hours + intercept
residual_log = log_sulfate - regression_line_log

Now we plot the residuals against the fitted values:
[21]: plt.scatter(regression_line_log, residual_log, color='black')
plt.xlabel('Fitted Values (Log)')
plt.ylabel('Residual (Log)')
plt.title('Plot 5: Residual vs Fitted Values (Log-Log)')

[21]: Text(0.5, 1.0, 'Plot 5: Residual vs Fitted Values (Log-Log)')
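For a least-squares line with an intercept, the residuals always sum to zero and are uncorrelated with the fitted values. A residual plot therefore cannot show an overall tilt or offset; what matters is curvature or a changing spread. The sketch below checks both properties numerically for the log-log fit. The names `fit`, `fitted_log`, `residual_sum` and `residual_corr` are illustrative.

```python
import numpy as np
from scipy.stats import linregress

# Data copied from the table above
x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])

log_hours = np.log(x_hours)
log_sulfate = np.log(y_sulfate)

fit = linregress(log_hours, log_sulfate)
fitted_log = fit.slope * log_hours + fit.intercept
residual_log = log_sulfate - fitted_log

# Both quantities are zero up to floating-point rounding
residual_sum = residual_log.sum()
residual_corr = np.corrcoef(fitted_log, residual_log)[0, 1]

print(f"sum of residuals = {residual_sum:.2e}")
print(f"corr(fitted, residual) = {residual_corr:.2e}")
```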

For the original coordinates the residual is defined the same way: the difference between the
observed data points and the fitted curve. We plot it against the fitted values.
[24]: fit_curve = e_to_the_power(x_hours, *constants)
residual_original = y_sulfate - fit_curve

[26]: plt.scatter(fit_curve, residual_original, color='black')
plt.xlabel('Fitted Values (Original)')
plt.ylabel('Residual (Original)')
plt.title('Plot 6: Residual vs Fitted Values (Original)')

[26]: Text(0.5, 1.0, 'Plot 6: Residual vs Fitted Values (Original)')

#Question d: Use your plots to explain whether your regression is good or bad and why.
In Plot 5, the regression line we’ve calculated shows residuals distributed around zero, indicating
a good fit. Random scattering suggests the model captures data variation well, making it a strong
predictor.
In contrast, Plot 6 displays residuals clustered away from zero, particularly around values like 4-6.
This indicates a poor fit, with systematic errors in predictions. The model struggles to accurately
forecast values, making it unreliable for future predictions.
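One way to back up a visual reading of a residual plot is to count runs of same-signed residuals in time order. A few long runs point to a systematic pattern, while many short runs are closer to random scatter. The sketch below applies this to the log-log fit. The helper `count_sign_runs` is a hypothetical name introduced here, and the check is a rough heuristic under these assumptions, not a formal test.

```python
import numpy as np
from scipy.stats import linregress


def count_sign_runs(residuals):
    """Count consecutive runs of same-signed residuals (exact zeros are ignored)."""
    signs = np.sign(np.asarray(residuals, dtype=float))
    signs = signs[signs != 0]
    if signs.size == 0:
        return 0
    # A new run starts wherever the sign differs from the previous one
    return 1 + int(np.sum(signs[1:] != signs[:-1]))


# Data copied from the table above, already sorted by time
x_hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90,
                    110, 130, 150, 160, 170, 180])
y_sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06, 6.7,
                      6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6,
                      4.5, 4.36, 4.27])

log_hours = np.log(x_hours)
log_sulfate = np.log(y_sulfate)
fit = linregress(log_hours, log_sulfate)
residual_log = log_sulfate - (fit.slope * log_hours + fit.intercept)

runs_log = count_sign_runs(residual_log)
print(f"log-log fit: {runs_log} sign runs over {residual_log.size} residuals")
```

The same helper can be applied to `residual_original` from the exponential fit to compare the two models on equal terms.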
