Recipe-5-Identifying-a-linear-relationship - Ipynb - Colab
Linear models assume that the independent variables X take a linear relationship with the dependent variable Y. If this assumption is not met, the
model may perform poorly. In this recipe, we will learn how to visualize the linear relationship between X and Y.
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
boston.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2
print(boston_dataset.DESCR)
.. _boston_dataset:
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
https://colab.research.google.com/drive/1DGz3fUiQttJal20_p9Q-zT5USLKXIt1k#printMode=true 1/6
8/14/24, 11:59 AM Recipe-5-Identifying-a-linear-relationship.ipynb - Colab
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machi
n = 200 # in the book we pass directly 200 within brackets, without defining n
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2
          x          y
0 -0.417482  -1.271561
1  0.706032   7.990600
2  1.915985  19.848687
3 -2.141755 -21.928903
4  0.719057   5.579070
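The cell that assembles x and y into the `data` frame is not visible in this export. The following is a minimal sketch of how it could look, together with a numerical companion to the scatter plot. The seed value is an arbitrary assumption (the notebook's own seed is not shown), and the use of `np.polyfit` and the Pearson correlation is an illustration, not necessarily what the recipe itself does.

```python
import numpy as np
import pandas as pd

# Assumption: arbitrary seed, chosen only to make the sketch reproducible.
np.random.seed(29)

n = 200
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2  # true slope 10, Gaussian noise with std 2

# Two-column frame with the same column names as the output shown above.
data = pd.DataFrame({"x": x, "y": y})

# Least-squares straight line; for deg=1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(data["x"], data["y"], deg=1)

# Pearson correlation: close to 1 when the points lie near a rising line.
r = data["x"].corr(data["y"])

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

A fitted slope near 10 and a correlation near 1 are the numerical counterpart of a scatter plot whose points fall along a straight line.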
plt.ylabel('Target')
plt.xlabel('Independent variable')
<seaborn.axisgrid.FacetGrid at 0xc2b631fa20>
<seaborn.axisgrid.FacetGrid at 0xc2b639d2e8>
Linear relationships can also be assessed by evaluating the residuals. Residuals are the differences between the values estimated by the linear
model and the real outputs. If the relationship is linear, the residuals should be normally distributed, centered around zero, and show no pattern across the values of X.
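The cell that fits the model and computes the residuals is not visible in this export. The following is a minimal sketch of that step on simulated data, assuming a simple least-squares line fitted with `np.polyfit` as a stand-in for whichever estimator the notebook uses. The seed and the variable names `pred` and `error` are assumptions made for illustration.

```python
import numpy as np

# Assumption: arbitrary seed; simulated data built as in the recipe (slope 10, noise std 2).
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = x * 10 + rng.standard_normal(200) * 2

# Fit a straight line by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

# Residuals: real output minus estimated output.
# Flipping the sign would not change any of the diagnostics below.
error = y - pred

# With an intercept in the model, least-squares residuals average to zero
# and are uncorrelated with x by construction.
print("mean of residuals:", error.mean())
print("std of residuals:", error.std())
print("correlation with x:", np.corrcoef(x, error)[0, 1])
```

Because the simulated relationship is truly linear, the spread of the residuals should be close to the noise level used to generate y, and a scatter plot of `error` against `x` should show no visible structure.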
# SIMULATED DATA
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')
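# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(error, bins=30, kde=True)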
plt.xlabel('Residuals')
Text(0.5, 0, 'Residuals')
plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')
Text(0.5, 0, 'LSTAT')
<matplotlib.axes._subplots.AxesSubplot at 0xc2b6e0c2b0>
For this particular case, the residuals are centered around zero, but they are not homogeneously distributed across the values of LSTAT. Both the largest
and the smallest values of LSTAT show larger residuals. In addition, the histogram shows that the residuals do not follow a strictly
Gaussian distribution.
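The pattern described above, with larger residuals at both ends of the predictor, can also be checked numerically. The sketch below is an illustration only: it uses synthetic data with a curved relationship as a stand-in for LSTAT and MEDV (the Boston data is not loaded here), and the split into three equal-sized groups is a choice made for this example, not part of the recipe.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic, deliberately non-linear data; not the Boston dataset.
rng = np.random.default_rng(0)
x = rng.uniform(1, 40, 500)                             # hypothetical predictor
y = 50 * np.exp(-0.05 * x) + rng.standard_normal(500)   # curved relationship plus noise

# Fit a straight line even though the true relationship is curved.
slope, intercept = np.polyfit(x, y, deg=1)
error = y - (slope * x + intercept)

# Split the predictor into three equal-sized groups and average the residuals in each.
bins = pd.qcut(x, q=3, labels=["low", "mid", "high"])
by_bin = pd.Series(error).groupby(bins, observed=True).mean()

print(by_bin)
```

Overall the residuals still average to zero, but the group means differ in sign: the straight line underestimates the target at low and high values of the predictor and overestimates it in the middle. A systematic pattern of this kind suggests that the relationship is not linear.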