Project Linear Regression

1. The document loads housing data from a CSV file and performs exploratory data analysis, including a histogram, pair plots, and a correlation heatmap.
2. The data is split into training and test sets, and a linear regression model is trained to predict housing prices.
3. The model is evaluated on the test set, where it achieves a mean absolute error of about $82,000 and a root mean squared error of about $102,000.
4. The model is used to predict the price of a sample house with given attributes.

Uploaded by

Calo Soft
Copyright
© All Rights Reserved

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('USA_Housing.csv')

In [3]:
df.head()

Out[3]:

   Avg. Area     Avg. Area  Avg. Area Number  Avg. Area Number  Area          Price         Address
   Income        House Age  of Rooms          of Bedrooms       Population
0  79545.458574  5.682861   7.009188          4.09              23086.800503  1.059034e+06  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1  79248.642455  6.002900   6.730821          3.09              40173.072174  1.505891e+06  188 Johnson Views Suite 079\nLake Kathleen, CA...
2  61287.067179  5.865890   8.512727          5.13              36882.159400  1.058988e+06  9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3  63345.240046  7.188236   5.586729          3.26              34310.242831  1.260617e+06  USS Barnett\nFPO AP 44820
4  59982.197226  5.040555   7.839388          4.23              26354.109472  6.309435e+05  USNS Raymond\nFPO AE 09386

In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB

In [5]:
df.describe()
Out[5]:

       Avg. Area      Avg. Area    Avg. Area Number  Avg. Area Number  Area          Price
       Income         House Age    of Rooms          of Bedrooms       Population
count  5000.000000    5000.000000  5000.000000       5000.000000       5000.000000   5.000000e+03
mean   68583.108984   5.977222     6.987792          3.981330          36163.516039  1.232073e+06
std    10657.991214   0.991456     1.005833          1.234137          9925.650114   3.531176e+05
min    17796.631190   2.644304     3.236194          2.000000          172.610686    1.593866e+04
25%    61480.562388   5.322283     6.299250          3.140000          29403.928702  9.975771e+05
50%    68804.286404   5.970429     7.002902          4.050000          36199.406689  1.232669e+06
75%    75783.338666   6.650808     7.665871          4.490000          42861.290769  1.471210e+06
max    107701.748378  9.519088     10.759588         6.500000          69621.713378  2.469066e+06

In [6]:
df.columns
Out[6]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [7]:
# Exploratory data analysis (EDA)

In [8]:
sns.pairplot(df)
Out[8]:

<seaborn.axisgrid.PairGrid at 0x7f10a32c6450>
In [9]:
sns.distplot(df['Price'])

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1098999310>

In [10]:
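# Correlate the numeric columns only; the text column 'Address' makes df.corr() raise in pandas 2.0+
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)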
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f109a4dba90>
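
Each cell of the heatmap is a Pearson correlation coefficient between two columns, which is the default method of `DataFrame.corr`. As a minimal sketch of what that number measures, the helper below computes it in plain Python. The function name `pearson` and the toy values are assumptions made for illustration; they are not taken from USA_Housing.csv.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only (not from the housing data)
income = [50.0, 60.0, 70.0, 80.0]
price = [1.0, 1.3, 1.5, 1.9]

r = pearson(income, price)
print(round(r, 3))
```

A value near +1 means the two columns rise together almost linearly, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.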

In [11]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# Hold out 40% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
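
With `test_size=0.4` and 5,000 rows, 2,000 rows are held out for testing (the `y_test` output in this notebook reports `Length: 2000`) and 3,000 remain for training. The sketch below illustrates the general shuffle-then-slice idea using only the standard library. It is not scikit-learn's implementation, the helper name `simple_split` is hypothetical, and it will not reproduce the same row assignment as `random_state=101`.

```python
import random

def simple_split(n_rows, test_size, seed):
    """Shuffle row indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_test = int(round(n_rows * test_size))    # 0.4 of 5000 rows -> 2000 test rows
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = simple_split(5000, 0.4, seed=101)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters for the same reason `random_state` does in the notebook: rerunning the cell gives the same split, so the reported error metrics stay comparable between runs.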

In [14]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=101)

In [15]:
#X_test

In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
lm = LinearRegression()

In [18]:
lm.fit(X_train, y_train)
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
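
`LinearRegression` fits an intercept and one coefficient per feature by ordinary least squares. The sketch below shows the same idea with `numpy.linalg.lstsq` on a small synthetic data set. The data, the true weights (3, 2, -1), and the names `X_demo`, `y_demo`, and `X_design` are all assumptions made for illustration, and this is not a description of the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): y = 3 + 2*x1 - 1*x2 + small noise
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the first fitted weight acts as the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: minimizes the sum of squared residuals
weights, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

intercept, coefs = weights[0], weights[1:]
print(intercept, coefs)
```

The recovered intercept and coefficients should land close to the values used to generate the data, which is the same role `lm.intercept_` and `lm.coef_` play in the notebook.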

In [19]:
# Model evaluation

In [20]:
predictions = lm.predict(X_test)

In [21]:
predictions
Out[21]:

array([1260960.70567629,  827588.75560301, 1742421.24254363, ...,
        372191.40626868, 1365217.15140901, 1914519.54178955])

In [22]:
y_test
Out[22]:
1718 1.251689e+06
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
...
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
Name: Price, Length: 2000, dtype: float64

In [23]:
plt.scatter(y_test, predictions)
Out[23]:
<matplotlib.collections.PathCollection at 0x7f1091e384d0>

In [24]:
sns.distplot((y_test-predictions), bins = 50)

/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1095039450>

In [25]:
lm.intercept_
Out[25]:

-2640159.796853739

In [26]:
coef_df = pd.DataFrame(lm.coef_, X.columns, columns = ['Coeff'])
In [27]:
coef_df
Out[27]:

Coeff

Avg. Area Income 21.528276

Avg. Area House Age 164883.282027

Avg. Area Number of Rooms 122368.678027

Avg. Area Number of Bedrooms 2233.801864

Area Population 15.150420
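
Each coefficient is the change in predicted price for a one-unit increase in that feature with the others held fixed, so the raw magnitudes are not directly comparable: income is measured in dollars, house age in years. One rough way to put them on a common footing is to multiply each coefficient by the standard deviation of its column. The sketch below does that with the values printed in this notebook. Note the assumption: the standard deviations come from `df.describe()` on all 5,000 rows, not from the training split, and the name `effect_per_std` is hypothetical.

```python
# Coefficients copied from coef_df
coeffs = {
    'Avg. Area Income': 21.528276,
    'Avg. Area House Age': 164883.282027,
    'Avg. Area Number of Rooms': 122368.678027,
    'Avg. Area Number of Bedrooms': 2233.801864,
    'Area Population': 15.150420,
}

# Standard deviations copied from the `std` row of df.describe()
stds = {
    'Avg. Area Income': 10657.991214,
    'Avg. Area House Age': 0.991456,
    'Avg. Area Number of Rooms': 1.005833,
    'Avg. Area Number of Bedrooms': 1.234137,
    'Area Population': 9925.650114,
}

# Predicted price change for a one-standard-deviation increase in each feature
effect_per_std = {name: coeffs[name] * stds[name] for name in coeffs}

for name, effect in sorted(effect_per_std.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {effect:12,.0f}")
```

On this scale, income has the largest effect even though its raw coefficient is small, and the number of bedrooms has by far the smallest.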

In [28]:
from sklearn import metrics

In [29]:
metrics.mean_absolute_error(y_test, predictions)
Out[29]:
82288.22251914928

In [30]:
metrics.mean_squared_error(y_test, predictions)  # MSE
Out[30]:
10460958907.208244

In [31]:
np.sqrt(metrics.mean_squared_error(y_test, predictions)) #RMSE
Out[31]:
102278.82922290538
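
The three metrics summarise the same residuals in different ways: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which brings it back to dollars. The sketch below writes out the definitions in plain Python on made-up toy values, then checks the figures reported above against each other and against the mean price of 1.232073e+06 from `df.describe()`. The helper names `mae`, `mse`, and `rmse` are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Toy values for illustration only
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))

# Figures reported in this notebook
reported_mae = 82288.22251914928
reported_mse = 10460958907.208244
mean_price = 1.232073e+06

print(math.sqrt(reported_mse))    # should agree with the reported RMSE
print(reported_mae / mean_price)  # MAE as a fraction of the mean price
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE. Relative to the mean price, the reported MAE works out to a little under 7%.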

In [32]:
df.columns
Out[32]:
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')

In [33]:
single_data = pd.DataFrame([
{'Avg. Area Income':65000,
'Avg. Area House Age':10,
'Avg. Area Number of Rooms':7,
'Avg. Area Number of Bedrooms':4,
'Area Population':100000
}
])

In [34]:
single_data
Out[34]:
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population

0 65000 10 7 4 100000

In [35]:
lm.predict(single_data)
Out[35]:
array([2788568.88485117])
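
A linear model's prediction is the intercept plus the sum of each coefficient times its feature value, so the result above can be checked by hand. The sketch below uses the intercept and the coefficients as printed earlier in this notebook. Because the printed coefficients are rounded to six decimals, the manual result is expected to differ from `lm.predict` by a few cents. The names `manual_prediction`, `observed_max`, and `extrapolated` are hypothetical.

```python
intercept = -2640159.796853739  # lm.intercept_

# Coefficients copied from coef_df (rounded as printed)
coeffs = {
    'Avg. Area Income': 21.528276,
    'Avg. Area House Age': 164883.282027,
    'Avg. Area Number of Rooms': 122368.678027,
    'Avg. Area Number of Bedrooms': 2233.801864,
    'Area Population': 15.150420,
}

# The sample house from single_data
house = {
    'Avg. Area Income': 65000,
    'Avg. Area House Age': 10,
    'Avg. Area Number of Rooms': 7,
    'Avg. Area Number of Bedrooms': 4,
    'Area Population': 100000,
}

# Intercept plus the weighted sum of the feature values
manual_prediction = intercept + sum(coeffs[name] * house[name] for name in coeffs)
print(f"{manual_prediction:,.2f}")

# Column maxima copied from the `max` row of df.describe()
observed_max = {
    'Avg. Area Income': 107701.748378,
    'Avg. Area House Age': 9.519088,
    'Avg. Area Number of Rooms': 10.759588,
    'Avg. Area Number of Bedrooms': 6.500000,
    'Area Population': 69621.713378,
}

# Features where the sample house lies above anything seen in the data
extrapolated = [name for name in house if house[name] > observed_max[name]]
print(extrapolated)
```

The second check flags that the sample's house age and population both exceed the largest values in the data set, so this prediction is an extrapolation and should be read with more caution than the test-set errors alone would suggest.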
