The document outlines a data analysis process using a housing dataset, including data loading, cleaning, and preparation for modeling. It employs libraries like pandas, numpy, and sklearn to handle data, perform transformations, and prepare for linear regression analysis. The analysis includes handling missing values, feature engineering, and visualizing correlations among features.


ex2TP1

November 5, 2024

[59]: import pandas as pd


import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"]=(20,10)
import seaborn as sns

[60]: import sklearn


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

[61]: df1=pd.read_csv("housing.csv")
df1.head()

[61]: longitude latitude housing_median_age total_rooms total_bedrooms \


0 -122.23 37.88 41.0 880.0 129.0
1 -122.22 37.86 21.0 7099.0 1106.0
2 -122.24 37.85 52.0 1467.0 190.0
3 -122.25 37.85 52.0 1274.0 235.0
4 -122.25 37.85 52.0 1627.0 280.0

population households median_income median_house_value ocean_proximity


0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 496.0 177.0 7.2574 352100.0 NEAR BAY
3 558.0 219.0 5.6431 341300.0 NEAR BAY
4 565.0 259.0 3.8462 342200.0 NEAR BAY

[62]: df1.shape

[62]: (20640, 10)

[63]: df1.isnull().sum()

[63]: longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64

[64]: df2=df1.dropna()
df2.isnull().sum()

[64]: longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
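
Dropping the 207 incomplete rows is the simplest choice; an alternative, not used in this notebook, is to impute the missing total_bedrooms values. A minimal sketch with scikit-learn's SimpleImputer (median strategy assumed):

from sklearn.impute import SimpleImputer

# Alternative to dropna(): fill missing total_bedrooms with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = df1.copy()
df_imputed[["total_bedrooms"]] = imputer.fit_transform(df_imputed[["total_bedrooms"]])
df_imputed.isnull().sum()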

[65]: from sklearn.model_selection import train_test_split

x=df2.drop(['median_house_value'],axis=1)
y=df2['median_house_value']

[66]: x

[66]: longitude latitude housing_median_age total_rooms total_bedrooms \


0 -122.23 37.88 41.0 880.0 129.0
1 -122.22 37.86 21.0 7099.0 1106.0
2 -122.24 37.85 52.0 1467.0 190.0
3 -122.25 37.85 52.0 1274.0 235.0
4 -122.25 37.85 52.0 1627.0 280.0
… … … … … …
20635 -121.09 39.48 25.0 1665.0 374.0
20636 -121.21 39.49 18.0 697.0 150.0
20637 -121.22 39.43 17.0 2254.0 485.0
20638 -121.32 39.43 18.0 1860.0 409.0
20639 -121.24 39.37 16.0 2785.0 616.0

population households median_income ocean_proximity
0 322.0 126.0 8.3252 NEAR BAY
1 2401.0 1138.0 8.3014 NEAR BAY
2 496.0 177.0 7.2574 NEAR BAY
3 558.0 219.0 5.6431 NEAR BAY
4 565.0 259.0 3.8462 NEAR BAY
… … … … …
20635 845.0 330.0 1.5603 INLAND
20636 356.0 114.0 2.5568 INLAND
20637 1007.0 433.0 1.7000 INLAND
20638 741.0 349.0 1.8672 INLAND
20639 1387.0 530.0 2.3886 INLAND

[20433 rows x 9 columns]

[67]: y

[67]: 0 452600.0
1 358500.0
2 352100.0
3 341300.0
4 342200.0

20635 78100.0
20636 77100.0
20637 92300.0
20638 84700.0
20639 89400.0
Name: median_house_value, Length: 20433, dtype: float64

[68]: x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.20)
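
Without a fixed seed the split changes on every run, so the numbers below are not exactly reproducible. A hedged variant (the seed value 42 is an arbitrary example):

# Reproducible 80/20 split; random_state is illustrative, any fixed integer works
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)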

[69]: train_data=x_train.join(y_train)
#train_data
#train_data['ocean_proximity'].unique()

[70]: train_data.hist()

[70]: array([[<Axes: title={'center': 'longitude'}>,


<Axes: title={'center': 'latitude'}>,
<Axes: title={'center': 'housing_median_age'}>],
[<Axes: title={'center': 'total_rooms'}>,
<Axes: title={'center': 'total_bedrooms'}>,
<Axes: title={'center': 'population'}>],
[<Axes: title={'center': 'households'}>,
<Axes: title={'center': 'median_income'}>,
<Axes: title={'center': 'median_house_value'}>]], dtype=object)

[71]: train_data.corr(numeric_only=True)
plt.figure()
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()

[72]: train_data['total_rooms']=np.log(train_data['total_rooms']+1)
train_data['total_bedrooms']=np.log(train_data['total_bedrooms']+1)
train_data['population']=np.log(train_data['population']+1)
train_data['households']=np.log(train_data['households']+1)
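
The same transformation can be written with np.log1p, which computes log(1 + x) directly and is slightly more accurate near zero. An equivalent sketch (an alternative to the cell above, not meant to be run in addition to it):

# log1p(x) == log(x + 1); apply to the same four skewed count columns
for col in ['total_rooms', 'total_bedrooms', 'population', 'households']:
    train_data[col] = np.log1p(train_data[col])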

[73]: train_data.hist()

[73]: array([[<Axes: title={'center': 'longitude'}>,


<Axes: title={'center': 'latitude'}>,
<Axes: title={'center': 'housing_median_age'}>],
[<Axes: title={'center': 'total_rooms'}>,
<Axes: title={'center': 'total_bedrooms'}>,
<Axes: title={'center': 'population'}>],
[<Axes: title={'center': 'households'}>,
<Axes: title={'center': 'median_income'}>,
<Axes: title={'center': 'median_house_value'}>]], dtype=object)

[74]: train_data.ocean_proximity.value_counts()

[74]: ocean_proximity
<1H OCEAN 7217
INLAND 5170
NEAR OCEAN 2118
NEAR BAY 1837
ISLAND 4
Name: count, dtype: int64

[75]: #pd.get_dummies(train_data.ocean_proximity).astype(int)
#train_data.join(pd.get_dummies(train_data.ocean_proximity).astype(int)).drop(['ocean_proximity'],axis=1)

dummies = pd.get_dummies(train_data['ocean_proximity']).astype(int)
train_data = train_data.join(dummies).drop(['ocean_proximity'], axis=1)
train_data

[75]: longitude latitude housing_median_age total_rooms total_bedrooms \


8424 -118.36 33.93 27.0 8.399760 7.116394
19741 -122.57 39.90 15.0 8.262043 6.698268
19155 -122.71 38.37 16.0 7.764721 5.846439
7751 -118.15 33.92 28.0 6.946014 5.533389
17400 -120.44 34.93 15.0 6.767343 5.501258
… … … … … …
17208 -119.72 34.43 36.0 7.053586 5.736572
19855 -119.45 36.35 22.0 7.509335 5.811141
19380 -120.85 37.77 52.0 6.079933 4.406719
17254 -119.71 34.42 39.0 7.067320 5.777652
8760 -118.44 33.81 33.0 8.292799 6.898715

population households median_income median_house_value <1H OCEAN \


8424 8.114025 7.015712 3.1656 204500.0 1
19741 7.437206 6.442540 2.4555 55600.0 0
19155 6.922644 5.855072 5.6018 253000.0 1
7751 6.816736 5.505332 2.5875 161200.0 1
17400 7.033506 5.537334 2.0995 87500.0 1
… … … … … …
17208 6.257668 5.720312 2.6014 320600.0 1
19855 6.981935 5.645447 2.3365 69600.0 0
19380 5.288267 4.234107 1.8625 85400.0 0
17254 6.408529 5.758902 2.1600 259100.0 1
8760 7.407318 6.837333 5.0106 500001.0 0

INLAND ISLAND NEAR BAY NEAR OCEAN


8424 0 0 0 0
19741 1 0 0 0
19155 0 0 0 0
7751 0 0 0 0
17400 0 0 0 0
… … … … …
17208 0 0 0 0
19855 1 0 0 0
19380 1 0 0 0
17254 0 0 0 0
8760 0 0 0 1

[16346 rows x 14 columns]
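
Because pd.get_dummies is applied separately to the train and test splits later on, a category missing from one split would produce mismatched columns (the tiny ISLAND class is the obvious risk here). A more robust sketch, assuming scikit-learn's OneHotEncoder fitted on the training data only:

from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training categories, then reuse it for the test set
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_ohe = encoder.fit_transform(x_train[['ocean_proximity']])
test_ohe = encoder.transform(x_test[['ocean_proximity']])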

[76]: plt.figure()
sns.heatmap(train_data.corr(),annot=True,cmap="YlGnBu")
plt.show()

[77]: plt.figure()
sns.scatterplot(x='longitude', y='latitude', data=train_data, hue="median_house_value", palette="coolwarm")

[77]: <Axes: xlabel='longitude', ylabel='latitude'>

[78]: train_data['bedrooms_ratio'] = train_data['total_bedrooms'] / train_data['total_rooms']
train_data['household_rooms'] = train_data['total_rooms'] / train_data['households']

[79]: plt.figure()
sns.heatmap(train_data.corr(),annot=True,cmap="YlGnBu")
plt.show()

[80]: x_train=train_data.drop(['median_house_value'],axis=1)
y_train=train_data['median_house_value']

# data normalization
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

model=LinearRegression()
model.fit(x_train_scaled, y_train)

[80]: LinearRegression()
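
The scaling and the regression can also be combined into a single estimator with a scikit-learn Pipeline, which guarantees the scaler is fitted on the training data only. A sketch, not part of the original notebook:

from sklearn.pipeline import make_pipeline

# StandardScaler + LinearRegression as one estimator
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(x_train, y_train)
pipe.score(x_train, y_train)  # R² on the training data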

[81]: test_data=x_test.join(y_test)
test_data['total_rooms']=np.log(test_data['total_rooms']+1)
test_data['total_bedrooms']=np.log(test_data['total_bedrooms']+1)
test_data['population']=np.log(test_data['population']+1)
test_data['households']=np.log(test_data['households']+1)

test_data = test_data.join(pd.get_dummies(test_data['ocean_proximity']).astype(int)).drop(['ocean_proximity'], axis=1)

test_data['bedrooms_ratio'] = test_data['total_bedrooms'] / test_data['total_rooms']
test_data['household_rooms'] = test_data['total_rooms'] / test_data['households']

#pd.get_dummies(test_data['ocean_proximity'])

[82]: test_data

[82]: longitude latitude housing_median_age total_rooms total_bedrooms \


10713 -117.84 33.66 5.0 6.501290 5.147494
20468 -118.71 34.27 26.0 6.898715 5.411646
17148 -122.20 37.43 38.0 8.196161 6.270988
7645 -118.27 33.81 10.0 7.540090 6.349139
17440 -120.44 34.66 22.0 8.080856 6.309918
… … … … … …
14905 -117.06 32.60 24.0 6.993015 5.594711
15752 -122.44 37.77 52.0 8.153637 6.694562
8273 -118.16 33.77 29.0 8.032360 6.668228
1746 -122.35 37.96 34.0 7.264730 5.817111
15696 -122.45 37.79 52.0 7.458763 6.180017

population households median_income median_house_value <1H OCEAN \


10713 5.953243 5.147494 4.5833 230400.0 1
20468 6.579251 5.451038 3.1630 179400.0 1
17148 7.208600 6.278521 7.3681 500001.0 0
7645 7.478735 6.317165 3.9286 114000.0 1
17440 7.461640 6.366470 4.5417 142400.0 0
… … … … … …
14905 6.999422 5.509388 2.4191 107300.0 0
15752 7.325808 6.656727 3.6186 500001.0 0
8273 7.286876 6.602588 2.8750 232500.0 0
1746 7.149132 5.768321 2.5461 93900.0 0
15696 6.595781 6.063785 1.4804 425000.0 0

INLAND ISLAND NEAR BAY NEAR OCEAN bedrooms_ratio household_rooms


10713 0 0 0 0 0.791765 1.263001
20468 0 0 0 0 0.784443 1.265578
17148 0 0 0 1 0.765113 1.305429
7645 0 0 0 0 0.842051 1.193588
17440 0 0 0 1 0.780848 1.269284
… … … … … … …
14905 0 0 0 1 0.800043 1.269291

15752 0 0 1 0 0.821052 1.224872
8273 0 0 0 1 0.830170 1.216547
1746 0 0 1 0 0.800733 1.259419
15696 0 0 1 0 0.828558 1.230051

[4087 rows x 16 columns]

[136]: x_test, y_test = test_data.drop(['median_house_value'],axis=1), test_data['median_house_value']

x_test_scaled = scaler.transform(x_test)
#y_test_scaled = scaler.transform(y_test.values.reshape(-1, 1))
'''
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
'''

[84]: x_test

[84]: longitude latitude housing_median_age total_rooms total_bedrooms \


10713 -117.84 33.66 5.0 6.501290 5.147494
20468 -118.71 34.27 26.0 6.898715 5.411646
17148 -122.20 37.43 38.0 8.196161 6.270988
7645 -118.27 33.81 10.0 7.540090 6.349139
17440 -120.44 34.66 22.0 8.080856 6.309918
… … … … … …
14905 -117.06 32.60 24.0 6.993015 5.594711
15752 -122.44 37.77 52.0 8.153637 6.694562
8273 -118.16 33.77 29.0 8.032360 6.668228
1746 -122.35 37.96 34.0 7.264730 5.817111
15696 -122.45 37.79 52.0 7.458763 6.180017

population households median_income <1H OCEAN INLAND ISLAND \


10713 5.953243 5.147494 4.5833 1 0 0
20468 6.579251 5.451038 3.1630 1 0 0
17148 7.208600 6.278521 7.3681 0 0 0
7645 7.478735 6.317165 3.9286 1 0 0
17440 7.461640 6.366470 4.5417 0 0 0
… … … … … … …
14905 6.999422 5.509388 2.4191 0 0 0
15752 7.325808 6.656727 3.6186 0 0 0
8273 7.286876 6.602588 2.8750 0 0 0
1746 7.149132 5.768321 2.5461 0 0 0
15696 6.595781 6.063785 1.4804 0 0 0

NEAR BAY NEAR OCEAN bedrooms_ratio household_rooms


10713 0 0 0.791765 1.263001

20468 0 0 0.784443 1.265578
17148 0 1 0.765113 1.305429
7645 0 0 0.842051 1.193588
17440 0 1 0.780848 1.269284
… … … … …
14905 0 1 0.800043 1.269291
15752 1 0 0.821052 1.224872
8273 0 1 0.830170 1.216547
1746 1 0 0.800733 1.259419
15696 1 0 0.828558 1.230051

[4087 rows x 15 columns]

[138]: model.score(x_train_scaled, y_train)

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[138], line 1
----> 1 model.score(x_train_scaled, y_train)

AttributeError: 'Sequential' object has no attribute 'score'

[86]: y_pred = model.predict(x_test_scaled)

[87]: mse = mean_squared_error(y_test, y_pred)


mse

[87]: 57926172223.61929
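
r2_score was imported earlier but never used. A short sketch that reports the RMSE (in the same unit as median_house_value, so easier to read than the raw MSE) and the R² of the linear model:

# RMSE in dollars and coefficient of determination on the test set
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.0f}  R²: {r2:.3f}")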

[88]: from sklearn.ensemble import RandomForestRegressor


forest=RandomForestRegressor()
forest.fit(x_train,y_train)

[88]: RandomForestRegressor()

[89]: forest.score(x_test,y_test)

[89]: 0.8146615773444194
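
To see which variables drive the random forest, one could inspect its feature_importances_ attribute; a minimal sketch:

# Rank the training features by their importance in the fitted forest
importances = pd.Series(forest.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))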

[91]: import keras


from keras.models import Sequential
from keras.layers import Dense

[92]: # define the model
model = Sequential([Dense(1, input_dim=x_train.shape[1], activation='linear')])
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()

C:\Users\king\anaconda3\Lib\site-packages\keras\src\layers\core\dense.py:87:
UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When
using Sequential models, prefer using an `Input(shape)` object as the first
layer in the model instead.
super().__init__(activity_regularizer=activity_regularizer, **kwargs)
Model: "sequential"

Layer (type)                     Output Shape                  Param #
dense (Dense)                    (None, 1)                     16

Total params: 16 (64.00 B)

Trainable params: 16 (64.00 B)

Non-trainable params: 0 (0.00 B)
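
The UserWarning above can be avoided by declaring the input shape with an explicit Input layer, as Keras recommends. An equivalent sketch of the same single-neuron linear model:

from keras.layers import Input

# Same model, but with an explicit Input layer instead of input_dim (silences the warning)
model = Sequential([Input(shape=(x_train.shape[1],)),
                    Dense(1, activation='linear')])
model.compile(optimizer='adam', loss='mean_squared_error')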

[93]: train_5 = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5)

# Training for 10 epochs
train_10 = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10)

# Training for 15 epochs
train_15 = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=15)

Epoch 1/5
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 55905394688.0000 - val_loss: 57862823936.0000
Epoch 2/5
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - loss: 55015342080.0000 - val_loss: 57816014848.0000
Epoch 3/5
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - loss: 55397969920.0000 - val_loss: 57769230336.0000
Epoch 4/5
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 55470444544.0000 - val_loss: 57722441728.0000
Epoch 5/5
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 55485304832.0000 - val_loss: 57675702272.0000
Epoch 1/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 56076288000.0000 - val_loss: 57629020160.0000
Epoch 2/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - loss: 55215300608.0000 - val_loss: 57582419968.0000
Epoch 3/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 55182422016.0000 - val_loss: 57535737856.0000
Epoch 4/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 54860214272.0000 - val_loss: 57489137664.0000
Epoch 5/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 55305433088.0000 - val_loss: 57442570240.0000
Epoch 6/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54938718208.0000 - val_loss: 57396002816.0000
Epoch 7/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 55876726784.0000 - val_loss: 57349505024.0000
Epoch 8/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 55152832512.0000 - val_loss: 57302986752.0000
Epoch 9/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 54960181248.0000 - val_loss: 57256488960.0000
Epoch 10/10
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - loss: 56382586880.0000 - val_loss: 57210064896.0000
Epoch 1/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 55094513664.0000 - val_loss: 57163624448.0000
Epoch 2/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 55323426816.0000 - val_loss: 57117188096.0000
Epoch 3/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 55270674432.0000 - val_loss: 57070817280.0000
Epoch 4/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 53296803840.0000 - val_loss: 57024409600.0000
Epoch 5/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54269583360.0000 - val_loss: 56978063360.0000
Epoch 6/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54459072512.0000 - val_loss: 56931782656.0000
Epoch 7/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54719565824.0000 - val_loss: 56885473280.0000
Epoch 8/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - loss: 53834076160.0000 - val_loss: 56839192576.0000
Epoch 9/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - loss: 54277263360.0000 - val_loss: 56793022464.0000
Epoch 10/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 54326079488.0000 - val_loss: 56746799104.0000
Epoch 11/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - loss: 55075880960.0000 - val_loss: 56700678144.0000
Epoch 12/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54168629248.0000 - val_loss: 56654499840.0000
Epoch 13/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - loss: 55217995776.0000 - val_loss: 56608378880.0000
Epoch 14/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54757396480.0000 - val_loss: 56562302976.0000
Epoch 15/15
511/511 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 54337257472.0000 - val_loss: 56516194304.0000

[94]: def plot_performance(train, epochs):
    plt.plot(train.history['loss'], label='train loss')
    plt.plot(train.history['val_loss'], label='validation loss')
    plt.title(f'Loss over {epochs} epochs')
    plt.xlabel('Epochs')
    plt.ylabel('Mean Squared Error')
    plt.legend()
    plt.show()

[95]: plot_performance(train_5, 5)
plot_performance(train_10, 10)
plot_performance(train_15, 15)

[Figures: training and validation loss curves for the 5-, 10-, and 15-epoch runs]
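
The loss curves above are nearly flat: the network is trained on the raw features and a target in the hundreds of thousands, so Adam's default step size barely moves the single neuron's weights. A hedged sketch that reuses the scaled features and the commented-out y_scaler idea from cell [136] (assumes a freshly compiled model):

# Standardize the target as well, then retrain on the standardized features
y_scaler = StandardScaler()
y_train_s = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_s = y_scaler.transform(y_test.values.reshape(-1, 1))

model = Sequential([Dense(1, input_dim=x_train.shape[1], activation='linear')])
model.compile(optimizer='adam', loss='mean_squared_error')
train_scaled = model.fit(x_train_scaled, y_train_s, validation_data=(x_test_scaled, y_test_s), epochs=15)
plot_performance(train_scaled, 15)
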
[96]: model.summary()

Model: "sequential"

Layer (type)                     Output Shape                  Param #
dense (Dense)                    (None, 1)                     16

Total params: 50 (204.00 B)

Trainable params: 16 (64.00 B)

Non-trainable params: 0 (0.00 B)

Optimizer params: 34 (140.00 B)

[ ]:
