
Recipe 5: Identifying a linear relationship

Linear models assume that the independent variables X have a linear relationship with the dependent variable Y. If this assumption is not met, the
model may perform poorly. In this recipe, we will learn how to visualize the linear relationship between X and Y.
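
For reference (an addition, not part of the original notebook), the model being assumed is the familiar straight-line relationship with Gaussian noise:

$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$

The plots below are visual diagnostics for whether this form is plausible for a given pair of variables.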

import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# the dataset for the demo
from sklearn.datasets import load_boston

# for linear regression
from sklearn.linear_model import LinearRegression

# load the Boston house price dataset from scikit-learn
boston_dataset = load_boston()

# create a dataframe with the independent variables
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)

# add the target
boston['MEDV'] = boston_dataset.target

boston.head()

   CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX    PTRATIO  B       LSTAT  MEDV
0  0.00632  18.0  2.31   0.0   0.538  6.575  65.2  4.0900  1.0  296.0  15.3     396.90  4.98   24.0
1  0.02731   0.0  7.07   0.0   0.469  6.421  78.9  4.9671  2.0  242.0  17.8     396.90  9.14   21.6
2  0.02729   0.0  7.07   0.0   0.469  7.185  61.1  4.9671  2.0  242.0  17.8     392.83  4.03   34.7
3  0.03237   0.0  2.18   0.0   0.458  6.998  45.8  6.0622  3.0  222.0  18.7     394.63  2.94   33.4
4  0.06905   0.0  2.18   0.0   0.458  7.147  54.2  6.0622  3.0  222.0  18.7     396.90  5.33   36.2
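
A note added here: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, owing to ethical concerns about the B variable. If you are on a recent version, a possible workaround (a sketch, assuming internet access and that the OpenML copy keeps the same column names, including the MEDV target) is to fetch the data from OpenML:

# sketch for scikit-learn >= 1.2, where load_boston is no longer available
from sklearn.datasets import fetch_openml

# the returned frame already includes the MEDV target column
boston = fetch_openml(name="boston", version=1, as_frame=True).frame

# on OpenML, CHAS and RAD arrive as categorical columns;
# cast them back to numeric so the rest of the recipe runs unchanged
boston["CHAS"] = boston["CHAS"].astype(float)
boston["RAD"] = boston["RAD"].astype(float)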

# this is the information about the Boston house price dataset

# get familiar with the variables before continuing with
# the notebook

# the aim is to predict the "median value of the houses",
# the MEDV column of this dataset

# and we have variables with characteristics about
# the homes and the neighbourhoods

print(boston_dataset.DESCR)

.. _boston_dataset:

Boston house prices dataset


---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):


- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.


https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic


prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

# I will create a dataframe with a variable x that
# follows a normal distribution and shows a
# linear relationship with y

# this will provide the expected plots,
# i.e., how the plots should look if the
# linear assumption is met

np.random.seed(29)  # for reproducibility

n = 200  # in the book we pass 200 directly within the brackets, without defining n
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2

data = pd.DataFrame([x, y]).T
data.columns = ['x', 'y']
data.head()

          x          y
0 -0.417482  -1.271561
1  0.706032   7.990600
2  1.915985  19.848687
3 -2.141755 -21.928903
4  0.719057   5.579070

Linear relationships can be assessed by scatter plots.
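
As a quick numeric complement (not part of the original recipe), the Pearson correlation coefficient summarizes the strength of a linear association; values close to 1 or -1 point to a strong linear relationship:

# hedged addition: Pearson correlation as a numeric summary of linearity
print(data['x'].corr(data['y']))             # simulated data: close to 1 by construction
print(boston['LSTAT'].corr(boston['MEDV']))  # Boston data: a clear negative association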

# for the simulated data

# this is how the scatter plot looks when
# there is a linear relationship between X and Y

sns.lmplot(x="x", y="y", data=data, order=1)

# order=1 indicates that we want seaborn to
# estimate a linear model (the line in the plot below)
# between x and y

plt.ylabel('Target')
plt.xlabel('Independent variable')


Text(0.5, 6.79999999999999, 'Independent variable')

# now we make a scatter plot for the Boston
# house price dataset

# we plot the variable LSTAT (% lower status of the population)
# vs the target MEDV (median value of the house)

sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)

<seaborn.axisgrid.FacetGrid at 0xc2b631fa20>

Although not perfect, the relationship is fairly linear.

# now we plot CRIM (per capita crime rate by town)
# vs the target MEDV (median value of the house)

sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)


<seaborn.axisgrid.FacetGrid at 0xc2b639d2e8>

Linear relationships can also be assessed by evaluating the residuals. Residuals are the differences between the values estimated by the linear
model and the true outputs. If the relationship is linear, the residuals should be normally distributed and centered around zero.
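
In symbols (again an addition for reference), the residual of observation $i$ is

$e_i = y_i - \hat{y}_i, \quad \hat{y}_i = \beta_0 + \beta_1 x_i$

so a well-behaved residual cloud scatters randomly around zero with no visible pattern.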

# SIMULATED DATA

# step 1: build a linear model

# call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(data['x'].to_frame(), data['y'])

# step 2: obtain the predictions
pred = linreg.predict(data['x'].to_frame())

# step 3: calculate the residuals
error = data['y'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')

Text(0, 0.5, 'Real value')

# step 4: observe the distribution of the residuals

# residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution

# we plot the error terms vs the independent variable x
# error values should be around 0 and homogeneously distributed

plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')


Text(0.5, 0, 'Independent variable x')

# step 5: observe the distribution of the errors

# plot a histogram of the residuals
# they should follow a Gaussian distribution
# centered around 0

sns.distplot(error, bins=30)
plt.xlabel('Residuals')

Text(0.5, 0, 'Residuals')
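
A side note not in the original notebook: sns.distplot was deprecated in seaborn 0.11. In current versions the equivalent plot can be drawn with histplot (a sketch, assuming seaborn >= 0.11):

# equivalent call for seaborn >= 0.11, where distplot is deprecated;
# kde=True overlays the density estimate that distplot drew by default
sns.histplot(error, bins=30, kde=True)
plt.xlabel('Residuals')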

# now we do the same for the variable LSTAT of the Boston
# house price dataset from sklearn

# call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])

# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())

# calculate the residuals
error = boston['MEDV'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')


Text(0, 0.5, 'MEDV')


# residuals plot

# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution

plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')

Text(0.5, 0, 'LSTAT')

# plot a histogram of the residuals

# they should follow a Gaussian distribution
sns.distplot(error, bins=30)

<matplotlib.axes._subplots.AxesSubplot at 0xc2b6e0c2b0>

For this particular case, the residuals are centered around zero, but they are not homogeneously distributed across the values of LSTAT: the
residuals are larger at the low and high ends of LSTAT. In addition, the histogram shows that the residuals do not follow a strictly Gaussian
distribution. Both observations suggest that the relationship between LSTAT and MEDV is not perfectly linear.
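
To complement the visual checks with something not included in the recipe, a Q-Q plot compares the residual quantiles against those of a normal distribution; points far from the straight line flag non-Gaussian behavior (a sketch, assuming scipy is available):

# hedged sketch: Q-Q plot of the residuals against a theoretical normal
# distribution; systematic departures from the straight line indicate
# that the residuals are not Gaussian
import scipy.stats as stats

stats.probplot(error, dist="norm", plot=plt)
plt.show()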
