ML LinearRegression
February 9, 2023
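The export starts at cell 4, so the import cells are not shown. A minimal set of imports consistent with the names used in the cells below (`pd`, `sns`, `plt`, `LinearRegression`, `metrics`) might look like this. The `train_test_split` import is an assumption, since the split cells are also missing.

```python
# Aliases inferred from the names used in the later cells.
import pandas as pd                # pd.read_csv
import seaborn as sns              # sns.pairplot, sns.heatmap
import matplotlib.pyplot as plt    # plt.scatter
from sklearn.linear_model import LinearRegression
from sklearn import metrics        # metrics.mean_absolute_error, metrics.mean_squared_error

# Assumption: not visible in the export, but X_train/X_test must come from somewhere.
from sklearn.model_selection import train_test_split
```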
[4]: house=pd.read_csv('USA_Housing.csv')
[5]: house.head(3)
[5]: Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \
0 79545.458574 5.682861 7.009188
1 79248.642455 6.002900 6.730821
2 61287.067179 5.865890 8.512727
Address
0 208 Michael Ferry Apt. 674\nLaurabury, NE 3701…
1 188 Johnson Views Suite 079\nLake Kathleen, CA…
2 9127 Elizabeth Stravenue\nDanieltown, WI 06482…
[6]: house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
[7]: house.describe()
[7]: Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \
count 5000.000000 5000.000000 5000.000000
mean 68583.108984 5.977222 6.987792
std 10657.991214 0.991456 1.005833
min 17796.631190 2.644304 3.236194
25% 61480.562388 5.322283 6.299250
50% 68804.286404 5.970429 7.002902
75% 75783.338666 6.650808 7.665871
max 107701.748378 9.519088 10.759588
[8]: sns.pairplot(house)
[9]: sns.histplot(house['Price'],kde=True)
[9]: <AxesSubplot:xlabel='Price'>
[10]: house.corr(numeric_only=True)
Price
Avg. Area Income 0.639734
Avg. Area House Age 0.452543
Avg. Area Number of Rooms 0.335664
Avg. Area Number of Bedrooms 0.171071
Area Population 0.408556
Price 1.000000
[11]: sns.heatmap(house.corr(numeric_only=True),annot=True)
[11]: <AxesSubplot:>
[13]: house.columns
[13]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')
[14]: X=house[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
[15]: y=house[['Price']]
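Cells 16 to 18 are missing from this export, yet `X_train`, `y_train`, `X_test` and `y_test` are used below, so a split step must have run in between. Here is a minimal sketch of such a split with scikit-learn's `train_test_split`, on synthetic stand-in data. The `test_size` and `random_state` values are assumptions, not the notebook's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X (5 feature columns) and y (one target column).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=(100, 1))

# test_size=0.3 and random_state=101 are assumed values for illustration only.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` makes the split reproducible, so the fitted coefficients and metrics do not change between runs.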
[19]: lm=LinearRegression()
[20]: lm.fit(X_train,y_train)
[20]: LinearRegression()
[21]: print(lm.intercept_)
[-2640159.79685191]
[22]: lm.coef_
[24]: lmcoeftransp
[24]: array([[2.15282755e+01],
[1.64883282e+05],
[1.22368678e+05],
[2.23380186e+03],
[1.51504200e+01]])
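The column-shaped array above suggests that `lmcoeftransp` holds the transpose of `lm.coef_`, though cell 23, where it is defined, is missing from this export. When the target is passed as a two-dimensional column, as `house[['Price']]` is, scikit-learn stores `coef_` with shape `(n_targets, n_features)`, so transposing gives one row per feature. Below is a small sketch on synthetic data, with hypothetical names `X_demo`, `y_demo`, `lm_demo` and `coef_transposed`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 3*x0 - 2*x1 + 5, with y kept as a 2-D column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = (3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5).reshape(-1, 1)

lm_demo = LinearRegression().fit(X_demo, y_demo)

print(lm_demo.coef_.shape)         # one row per target, one column per feature
coef_transposed = lm_demo.coef_.T  # one row per feature
print(coef_transposed.shape)
print(lm_demo.intercept_)          # 1-element array, like the [-2640159.79...] output above
```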
[25]: X.columns
[25]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')
[26]: X_train.columns
[26]: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population'],
dtype='object')
[28]: cdf
[28]: Coeff
Avg. Area Income 21.528276
Avg. Area House Age 164883.282027
Avg. Area Number of Rooms 122368.678027
Avg. Area Number of Bedrooms 2233.801864
Area Population 15.150420
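Cell 27, which builds `cdf`, is not in this export. One plausible construction, assuming it pairs the transposed coefficients with the feature names, is sketched below using the values printed above. Each coefficient reads as the change in predicted price for a one-unit increase in that feature with the others held fixed. For example, the prediction rises by about 21.53 per additional unit of average area income.

```python
import numpy as np
import pandas as pd

feature_names = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
]

# Transposed coefficients as printed in cell 24 (one row per feature).
coef_column = np.array([
    [2.15282755e+01],
    [1.64883282e+05],
    [1.22368678e+05],
    [2.23380186e+03],
    [1.51504200e+01],
])

# Assumed construction: in the notebook this would likely use
# lm.coef_.T and X.columns instead of the literals above.
cdf_demo = pd.DataFrame(coef_column, index=feature_names, columns=["Coeff"])
print(cdf_demo)
```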
PREDICTIONS
[29]: predictions = lm.predict(X_test)
[30]: predictions
[30]: array([[1260960.70567626],
[ 827588.75560352],
[1742421.24254328],
…,
[ 372191.40626952],
[1365217.15140895],
[1914519.54178824]])
[31]: y_test
[31]: Price
1718 1.251689e+06
2511 8.730483e+05
345 1.696978e+06
2521 1.063964e+06
54 9.487883e+05
… …
1776 1.489520e+06
4269 7.777336e+05
1661 1.515271e+05
2410 1.343824e+06
2302 1.906025e+06
[32]: #Now we want to know how far the predictions are from the actual values
[33]: plt.scatter(y_test,predictions)
[34]: #Next, plot a histogram of the residuals (residual = actual value - predicted value)
[35]: <AxesSubplot:>
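A residual is the difference between an actual value and the model's prediction for it. The input of cell 35 is missing from this export, so the exact plotting call is unknown. The sketch below only shows how residuals could be computed and binned, using the first three `y_test` and `predictions` values printed above (the `y_test` values are rounded as displayed).

```python
import numpy as np

# First three rows of y_test and predictions as printed in cells 31 and 30.
actual = np.array([1.251689e+06, 8.730483e+05, 1.696978e+06])
predicted = np.array([1260960.70567626, 827588.75560352, 1742421.24254328])

# Residual = actual - predicted; a positive value means the model under-predicted.
residuals = actual - predicted
print(residuals)

# A histogram of residuals is a count of residuals per bin.
counts, bin_edges = np.histogram(residuals, bins=2)
print(counts, bin_edges)
```

If the residuals of the full test set are roughly bell-shaped and centered on zero, that is usually taken as a sign that a linear model is a reasonable fit.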
[36]: #Evaluating metrics
[38]: metrics.mean_absolute_error(y_test,predictions)
[38]: 82288.22251914957
[39]: metrics.mean_squared_error(y_test,predictions)
[39]: 10460958907.209501
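For reference, both metrics reduce to simple averages over the residuals: MAE is the mean of the absolute errors and MSE is the mean of the squared errors. Here is a small numpy sketch on made-up numbers (not the housing data), with hypothetical names `y_true` and `y_pred`:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [ 1.,  0., -2., -1.]
mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4
print(mae, mse)
```

MAE stays in the units of the target (here, price), while MSE is in squared units and penalizes large errors more heavily.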
[40]: #RMSE
[47]: RresultMSE
[47]: 102278.82922291153
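The cells that compute `RresultMSE` (41 to 46) are missing from this export, but the printed value matches the square root of the MSE from cell 39. Here is a quick check using only the numbers shown above:

```python
import math

mse = 10460958907.209501   # output of cell 39
rmse = math.sqrt(mse)      # RMSE is the square root of MSE
print(rmse)

# Compare against the value printed for RresultMSE in cell 47.
matches_cell_47 = math.isclose(rmse, 102278.82922291153, rel_tol=1e-9)
print(matches_cell_47)
```

Taking the square root brings the error back to the units of `Price`, which makes it easier to compare with the MAE.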
91.76824009649201 %