
Recipe 5: Identifying a linear relationship

Linear models assume that the independent variables X have a linear relationship with the dependent variable Y. If this assumption is not met, the
model may perform poorly. In this recipe, we will learn how to visualize the linear relationship between X and Y.
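
For reference (an addition, not part of the original notebook), the model being assumed is the familiar straight-line relationship with Gaussian noise:

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

The plots below are visual diagnostics for whether this form is plausible for a given pair of variables.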

import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# the dataset for the demo
from sklearn.datasets import load_boston

# for linear regression
from sklearn.linear_model import LinearRegression

# load the Boston house price dataset from scikit-learn
boston_dataset = load_boston()

# create a dataframe with the independent variables
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)

# add the target
boston['MEDV'] = boston_dataset.target

boston.head()

   CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  MEDV
0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
1  0.02731   0.0  7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
2  0.02729   0.0  7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
3  0.03237   0.0  2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
4  0.06905   0.0  2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2
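
A note added here: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, owing to ethical concerns about the B variable. If you are on a recent version, a possible workaround (a sketch, assuming internet access and that the OpenML copy keeps the same column names, including the MEDV target) is to fetch the data from OpenML:

# sketch for scikit-learn >= 1.2, where load_boston is no longer available
from sklearn.datasets import fetch_openml

# the returned frame already includes the MEDV target column
boston = fetch_openml(name="boston", version=1, as_frame=True).frame

# on OpenML, CHAS and RAD arrive as categorical columns;
# cast them back to numeric so the rest of the recipe runs unchanged
boston["CHAS"] = boston["CHAS"].astype(float)
boston["RAD"] = boston["RAD"].astype(float)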

# this is the information about the Boston house price dataset

# get familiar with the variables before continuing with
# the notebook

# the aim is to predict the "median value of the houses",
# the MEDV column of this dataset

# and we have variables with characteristics about
# the homes and the neighbourhoods

print(boston_dataset.DESCR)

.. _boston_dataset:

Boston house prices dataset


---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):


- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.


https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic


prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

# I will create a dataframe with a variable x that
# follows a normal distribution and shows a
# linear relationship with y

# this will provide the expected plots,
# i.e., how the plots should look if the
# linear assumption is met

np.random.seed(29)  # for reproducibility

n = 200  # in the book we pass 200 directly within the brackets, without defining n
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2

data = pd.DataFrame([x, y]).T
data.columns = ['x', 'y']
data.head()

          x          y
0 -0.417482  -1.271561
1  0.706032   7.990600
2  1.915985  19.848687
3 -2.141755 -21.928903
4  0.719057   5.579070

Linear relationships can be assessed by scatter plots.
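
As a quick numeric complement (not part of the original recipe), the Pearson correlation coefficient summarizes the strength of a linear association; values close to 1 or -1 point to a strong linear relationship:

# hedged addition: Pearson correlation as a numeric summary of linearity
print(data['x'].corr(data['y']))             # simulated data: close to 1 by construction
print(boston['LSTAT'].corr(boston['MEDV']))  # Boston data: a clear negative association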

# for the simulated data

# this is how the scatter plot looks when
# there is a linear relationship between X and Y

sns.lmplot(x="x", y="y", data=data, order=1)

# order=1 indicates that we want seaborn to
# estimate a linear model (the line in the plot below)
# between x and y

plt.ylabel('Target')
plt.xlabel('Independent variable')


Text(0.5, 6.79999999999999, 'Independent variable')

# now we make a scatter plot for the Boston
# house price dataset

# we plot the variable LSTAT (% lower status of the population)
# vs the target MEDV (median value of the house)

sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)

<seaborn.axisgrid.FacetGrid at 0xc2b631fa20>

Although not perfect, the relationship is fairly linear.

# now we plot CRIM (per capita crime rate by town)
# vs the target MEDV (median value of the house)

sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)


<seaborn.axisgrid.FacetGrid at 0xc2b639d2e8>

Linear relationships can also be assessed by evaluating the residuals. Residuals are the differences between the values estimated by the linear
model and the true outputs. If the relationship is linear, the residuals should be normally distributed and centered around zero.
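
In symbols (again an addition for reference), the residual of observation $i$ is

$e_i = y_i - \hat{y}_i, \quad \hat{y}_i = \beta_0 + \beta_1 x_i$

so a well-behaved residual cloud scatters randomly around zero with no visible pattern.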

# SIMULATED DATA

# step 1: build a linear model

# call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(data['x'].to_frame(), data['y'])

# step 2: obtain the predictions
pred = linreg.predict(data['x'].to_frame())

# step 3: calculate the residuals
error = data['y'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')

Text(0, 0.5, 'Real value')

# step 4: observe the distribution of the residuals

# residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution

# we plot the error terms vs the independent variable x
# error values should be around 0 and homogeneously distributed

plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')


Text(0.5, 0, 'Independent variable x')

# step 5: observe the distribution of the errors

# plot a histogram of the residuals
# they should follow a Gaussian distribution
# centered around 0

sns.distplot(error, bins=30)
plt.xlabel('Residuals')

Text(0.5, 0, 'Residuals')
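
A side note not in the original notebook: sns.distplot was deprecated in seaborn 0.11. In current versions the equivalent plot can be drawn with histplot (a sketch, assuming seaborn >= 0.11):

# equivalent call for seaborn >= 0.11, where distplot is deprecated;
# kde=True overlays the density estimate that distplot drew by default
sns.histplot(error, bins=30, kde=True)
plt.xlabel('Residuals')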

# now we do the same for the variable LSTAT of the Boston
# house price dataset from sklearn

# call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])

# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())

# calculate the residuals
error = boston['MEDV'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')


Text(0, 0.5, 'MEDV')


# residuals plot

# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution

plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')

Text(0.5, 0, 'LSTAT')

# plot a histogram of the residuals

# they should follow a Gaussian distribution
sns.distplot(error, bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0xc2b6e0c2b0>

For this particular case, the residuals are centered around zero, but they are not homogeneously distributed across the values of LSTAT: the
residuals are larger at the low and high ends of LSTAT. In addition, the histogram shows that the residuals do not follow a strictly Gaussian
distribution. Both observations suggest that the relationship between LSTAT and MEDV is not perfectly linear.
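
To complement the visual checks with something not included in the recipe, a Q-Q plot compares the residual quantiles against those of a normal distribution; points far from the straight line flag non-Gaussian behavior (a sketch, assuming scipy is available):

# hedged sketch: Q-Q plot of the residuals against a theoretical normal
# distribution; systematic departures from the straight line indicate
# that the residuals are not Gaussian
import scipy.stats as stats

stats.probplot(error, dist="norm", plot=plt)
plt.show()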
