Project Linear Regression

1. The document loads data on housing from a CSV file and performs exploratory data analysis including generating histograms, pair plots, and correlation heatmaps.
2. The data is split into training and test sets and a linear regression model is trained to predict housing prices.
3. The model is evaluated on the test set and achieves a mean absolute error of around $82,000 and root mean squared error of around $102,000.
4. The model is used to predict the price of a sample house with given attributes.

Uploaded by

Calo Soft

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('USA_Housing.csv')

In [3]:
df.head()

Out[3]:

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price  Address
0      79545.458574             5.682861                   7.009188                          4.09     23086.800503  1.059034e+06  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1      79248.642455             6.002900                   6.730821                          3.09     40173.072174  1.505891e+06  188 Johnson Views Suite 079\nLake Kathleen, CA...
2      61287.067179             5.865890                   8.512727                          5.13     36882.159400  1.058988e+06  9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3      63345.240046             7.188236                   5.586729                          3.26     34310.242831  1.260617e+06  USS Barnett\nFPO AP 44820
4      59982.197226             5.040555                   7.839388                          4.23     26354.109472  6.309435e+05  USNS Raymond\nFPO AE 09386

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

In [5]:
df.describe()
Out[5]:

       Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  Avg. Area Number of Bedrooms  Area Population         Price
count       5000.000000          5000.000000                5000.000000                   5000.000000      5000.000000  5.000000e+03
mean       68583.108984             5.977222                   6.987792                      3.981330     36163.516039  1.232073e+06
std        10657.991214             0.991456                   1.005833                      1.234137      9925.650114  3.531176e+05
min        17796.631190             2.644304                   3.236194                      2.000000       172.610686  1.593866e+04
25%        61480.562388             5.322283                   6.299250                      3.140000     29403.928702  9.975771e+05
50%        68804.286404             5.970429                   7.002902                      4.050000     36199.406689  1.232669e+06
75%        75783.338666             6.650808                   7.665871                      4.490000     42861.290769  1.471210e+06
max       107701.748378             9.519088                  10.759588                      6.500000     69621.713378  2.469066e+06

In [6]:
df.columns
Out[6]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [7]:
#EDA

In [8]:
sns.pairplot(df)
Out[8]:

<seaborn.axisgrid.PairGrid at 0x7f10a32c6450>
In [9]:
sns.distplot(df['Price'])

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1098999310>

In [10]:
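# Correlate only the numeric columns; 'Address' holds strings and cannot be correlated.
sns.heatmap(df.select_dtypes(include='number').corr(), annot = True)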
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f109a4dba90>

In [11]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 101)
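
With test_size = 0.4 and 5000 rows, 2000 rows are held out, which matches the "Length: 2000" reported for y_test in this notebook. The block below is a minimal standard-library sketch of what a shuffled holdout split does. It is not scikit-learn's implementation, the helper name holdout_split is hypothetical, and the same seed will not reproduce the row selection that train_test_split makes.

```python
import random

def holdout_split(n_rows, test_size, seed):
    """Shuffle row indices, then split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is repeatable
    n_test = round(n_rows * test_size)    # number of rows held out for evaluation
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(n_rows=5000, test_size=0.4, seed=101)
print(len(train_idx), len(test_idx))  # 3000 2000
```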

In [14]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.95, random_state = 101)

In [15]:
#X_test

In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
lm = LinearRegression()

In [18]:
lm.fit(X_train, y_train)
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [19]:
#Model Evaluation

In [20]:
predictions = lm.predict(X_test)

In [21]:
predictions
Out[21]:

array([1260960.70567629, 827588.75560301, 1742421.24254363, ...,

372191.40626868, 1365217.15140901, 1914519.54178955])

In [22]:
y_test
Out[22]:
1718 1.251689e+06
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
...
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
Name: Price, Length: 2000, dtype: float64

In [23]:
plt.scatter(y_test, predictions)
Out[23]:
<matplotlib.collections.PathCollection at 0x7f1091e384d0>

In [24]:
sns.distplot((y_test-predictions), bins = 50)

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1095039450>

In [25]:
lm.intercept_
Out[25]:

-2640159.796853739

In [26]:
coef_df = pd.DataFrame(lm.coef_, X.columns, columns = ['Coeff'])
In [27]:
coef_df
Out[27]:

Coeff

Avg. Area Income 21.528276

Avg. Area House Age 164883.282027

Avg. Area Number of Rooms 122368.678027

Avg. Area Number of Bedrooms 2233.801864

Area Population 15.150420
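
The raw coefficients are expressed in different units (dollars per dollar of income, per year of house age, per room, and so on), so their magnitudes are not directly comparable. One rough way to compare them is to multiply each coefficient by the standard deviation of its feature. The sketch below does this using the coefficients above and the std row of the df.describe() output. It assumes the full-dataset standard deviations are close enough to those of the training split, so treat the result as an approximation.

```python
# Coefficients copied from Out[27]; standard deviations from the df.describe() output.
coeff = {
    'Avg. Area Income': 21.528276,
    'Avg. Area House Age': 164883.282027,
    'Avg. Area Number of Rooms': 122368.678027,
    'Avg. Area Number of Bedrooms': 2233.801864,
    'Area Population': 15.150420,
}
std = {
    'Avg. Area Income': 10657.991214,
    'Avg. Area House Age': 0.991456,
    'Avg. Area Number of Rooms': 1.005833,
    'Avg. Area Number of Bedrooms': 1.234137,
    'Area Population': 9925.650114,
}

# Approximate change in predicted price for a one-standard-deviation increase.
effect = {name: coeff[name] * std[name] for name in coeff}

for name, value in sorted(effect.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<30} {value:>12,.0f}")
```

On this scale income has the largest effect and the number of bedrooms the smallest, even though income has one of the smallest raw coefficients.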

In [28]:
from sklearn import metrics

In [29]:
metrics.mean_absolute_error(y_test, predictions)
Out[29]:
82288.22251914928

In [30]:
metrics.mean_squared_error(y_test, predictions)#MSE
Out[30]:
10460958907.208244

In [31]:
np.sqrt(metrics.mean_squared_error(y_test, predictions)) #RMSE
Out[31]:
102278.82922290538
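
The three metrics above are related: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which puts it back in dollars. The sketch below computes them by hand with the standard library. The toy actual and predicted lists are made-up values for illustration only, and the last lines check that the RMSE reported above is the square root of the MSE reported above.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical toy values, not taken from the housing data.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]

toy_mae = mae(actual, predicted)
toy_mse = mse(actual, predicted)
toy_rmse = math.sqrt(toy_mse)
print(toy_mae, toy_mse, toy_rmse)

# Consistency check on the notebook's own numbers: RMSE equals sqrt(MSE).
reported_mse = 10460958907.208244
reported_rmse = 102278.82922290538
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE, which is consistent with the values of about 102,279 and 82,288 above.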

In [32]:
df.columns
Out[32]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [33]:
single_data = pd.DataFrame([
{'Avg. Area Income':65000,
'Avg. Area House Age':10,
'Avg. Area Number of Rooms':7,
'Avg. Area Number of Bedrooms':4,
'Area Population':100000
}
])

In [34]:
single_data
Out[34]:
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population

0 65000 10 7 4 100000

In [35]:
lm.predict(single_data)
Out[35]:
array([2788568.88485117])
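
A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below reproduces the prediction above by hand from the intercept in Out[25] and the coefficients in Out[27]. Because the displayed coefficients are rounded to six decimals, the result matches only to within a few cents.

```python
# Intercept from Out[25] and coefficients from Out[27] (rounded as displayed).
intercept = -2640159.796853739
coeff = {
    'Avg. Area Income': 21.528276,
    'Avg. Area House Age': 164883.282027,
    'Avg. Area Number of Rooms': 122368.678027,
    'Avg. Area Number of Bedrooms': 2233.801864,
    'Area Population': 15.150420,
}

# Feature values of the sample house from In [33].
sample = {
    'Avg. Area Income': 65000,
    'Avg. Area House Age': 10,
    'Avg. Area Number of Rooms': 7,
    'Avg. Area Number of Bedrooms': 4,
    'Area Population': 100000,
}

# price = intercept + sum(coefficient * feature value)
manual_prediction = intercept + sum(coeff[name] * sample[name] for name in coeff)
print(round(manual_prediction, 2))
```

Note that this sample lies outside the data the model saw: a house age of 10 and a population of 100000 both exceed the maxima in the df.describe() output (about 9.52 and 69622), so the prediction is an extrapolation and should be read with caution.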
