Data Science Project

The document summarizes the steps taken to clean a housing data set. It reads in a CSV file, gets information about the data, describes it, finds and drops null values, and fills null values using mean, mode, and other imputation methods. The goal is to clean the raw housing data for further analysis by handling missing data.

12/4/23, 7:28 PM Data Cleaning Project 1st Draft.ipynb - Colaboratory

import pandas as pd

Read the CSV File

df = pd.read_csv('/content/Housing_Data_Set (1).csv')

df

output price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefar

0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 y

1 12250000.0 8960 4.0 4 4.0 yes no no no yes 3

2 12250000.0 9960 3.0 2 2.0 yes no yes no no 2 y

3 12215000.0 7500 4.0 2 2.0 yes no yes no yes 3 y

4 11410000.0 7420 4.0 1 2.0 yes yes yes no yes 2

... ... ... ... ... ... ... ... ... ... ... ...

df.head()

price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea

0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 yes

1 12250000.0 8960 4.0 4 4.0 yes no no no yes 3 no

Getting information about the Dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 538 non-null float64
1 area 545 non-null int64
2 bedrooms 544 non-null float64
3 bathrooms 545 non-null int64
4 stories 543 non-null float64
5 mainroad 545 non-null object
6 guestroom 543 non-null object
7 basement 545 non-null object
8 hotwaterheating 537 non-null object
9 airconditioning 544 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
13 Date 542 non-null object
dtypes: float64(3), int64(3), object(8)
memory usage: 59.7+ KB

Describing the Dataset

df.describe()

https://colab.research.google.com/drive/1HU684UqeRwNdBFPV2X1bDAnDeztx5GVD#scrollTo=W2c02AzKM3kZ&printMode=true

price area bedrooms bathrooms stories parking

count 5.380000e+02 545.000000 544.000000 545.000000 543.000000 545.000000

mean 4.779255e+06 5150.541284 2.966912 1.286239 2.069982 0.693578

std 1.876768e+06 2170.141023 0.737579 0.502470 4.996187 0.861586

min 1.750000e+06 1650.000000 1.000000 1.000000 1.000000 0.000000

25% 3.438750e+06 3600.000000 2.000000 1.000000 1.000000 0.000000

50% 4.340000e+06 4600.000000 3.000000 1.000000 2.000000 0.000000

75% 5.796000e+06 6360.000000 3.000000 2.000000 2.000000 1.000000

max 1.330000e+07 16200.000000 6.000000 4.000000 110.000000 3.000000

Finding the Null Values in the Dataset

df.isnull().sum()

price 7
area 0
bedrooms 1
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
Date 3
dtype: int64
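
The counts above can also be read as a share of the rows, which helps when deciding whether a column is worth filling. A minimal sketch on a small made-up frame; the `toy` frame and its values are illustrative assumptions, not the housing data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 400.0],
    "guestroom": ["yes", "no", None, "no"],
    "area": [10, 20, 30, 40],
})

null_counts = toy.isnull().sum()         # missing values per column
null_share = toy.isnull().mean() * 100   # same information as a percentage of rows

# Put both views side by side
summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```

In the housing data the largest count is 8 of 545 rows (hotwaterheating), so every column is missing well under 2% of its values.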

Dropping the unnecessary columns in the Dataset

df = df.drop(columns = 'Date')

df.head()

price area bedrooms bathrooms stories mainroad guestroom basement hotw

0 13300000.0 7420 4.0 2 3.0 yes no no

1 12250000.0 8960 4.0 4 4.0 yes no no

2 12250000.0 9960 3.0 2 2.0 yes no yes

3 12215000.0 7500 4.0 2 2.0 yes no yes

Fill the null values in the Dataset using the mean of the column

df['price'] = df['price'].fillna(df['price'].mean())

df.isnull().sum()

price 0
area 0
bedrooms 1
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
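
How the mean fill behaves can be seen on a few made-up rows. This is a sketch under assumptions: the `toy` frame and its values are hypothetical, and the median comparison is a suggestion rather than a step from the notebook. It is included because the describe() output above shows a maximum of 110 for stories, and a value like that pulls the mean upward.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 110.0 mimics the extreme "stories" value seen in describe()
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 500.0],
    "stories": [1.0, 2.0, np.nan, 110.0],
})

# mean() skips NaN by default: (100 + 300 + 500) / 3 = 300
price_mean = toy["price"].mean()

# Assign the filled column back instead of calling fillna(inplace=True) on a selection
toy["price"] = toy["price"].fillna(price_mean)

# With an extreme value present, mean and median differ a lot
stories_mean = toy["stories"].mean()      # (1 + 2 + 110) / 3
stories_median = toy["stories"].median()  # middle value of 1, 2, 110

print(toy["price"].tolist())
print(stories_mean, stories_median)
```

Assigning the result back to the column avoids the chained-assignment warning that recent pandas versions raise for `inplace=True` on a column selection.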

df['price'] = df['price'].fillna(df['price'].mean())

df

price area bedrooms bathrooms stories mainroad guestroom basement ho

0 13300000.0 7420 4.0 2 3.0 yes no no

1 12250000.0 8960 4.0 4 4.0 yes no no

2 12250000.0 9960 3.0 2 2.0 yes no yes

3 12215000.0 7500 4.0 2 2.0 yes no yes

4 11410000.0 7420 4.0 1 2.0 yes yes yes

... ... ... ... ... ... ... ... ...

540 1820000.0 3000 2.0 1 1.0 yes no yes

541 1767150.0 2400 3.0 1 1.0 no no no

542 1750000.0 3620 2.0 1 1.0 yes no no

543 1750000.0 2910 3.0 1 1.0 no no no

544 1750000.0 3850 3.0 1 2.0 yes no no

df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].mean())

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 2
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64

df['stories'] = df['stories'].fillna(df['stories'].mean())

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 2
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64

Fill the null values by getting the Mode of the Column

df['guestroom'] = df['guestroom'].fillna(df['guestroom'].mode().iloc[0])

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0

guestroom 0
basement 0
hotwaterheating 8
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64

df['hotwaterheating'] = df['hotwaterheating'].fillna(df['hotwaterheating'].mode().iloc[0])

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 1
parking 0
prefarea 0
furnishingstatus 0
dtype: int64

df['airconditioning'] = df['airconditioning'].fillna(df['airconditioning'].mode().iloc[0])

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
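
The three mode fills above follow the same pattern, so they can be written as one loop. A minimal sketch on made-up yes/no columns; the `toy` frame and its values are hypothetical, and only the column names are borrowed from the dataset:

```python
import pandas as pd

# Hypothetical toy frame with missing yes/no answers
toy = pd.DataFrame({
    "guestroom": ["no", "yes", None, "no"],
    "airconditioning": ["yes", None, "yes", "no"],
})

for col in ["guestroom", "airconditioning"]:
    # mode() ignores missing values and returns a Series; take its first entry
    most_common = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```

If two values are tied for most frequent, `mode()` returns both in sorted order, so `.iloc[0]` picks the first one alphabetically.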

Sorting the Dataset according to the "furnishingstatus" column

df = df.sort_values('furnishingstatus')

df.head()

price area bedrooms bathrooms stories mainroad guestroom basement ho

0 13300000.0 7420 4.0 2 3.0 yes no no

365 3703000.0 5450 2.0 1 1.0 yes no no

124 5950000.0 6525 3.0 2 4.0 yes no no

362 3710000.0 4050 2.0 1 1.0 yes no no

Resetting the index of the Dataset

df = df.reset_index()

df


index price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking

0 0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2

1 365 3703000.0 5450 2.0 1 1.0 yes no no no no 0

2 124 5950000.0 6525 3.0 2 4.0 yes no no no no 1

3 362 3710000.0 4050 2.0 1 1.0 yes no no no no 0

4 128 5873000.0 5500 3.0 1 3.0 yes yes no no yes 1

... ... ... ... ... ... ... ... ... ... ... ... ...

540 405 3465000.0 3060 3.0 1 1.0 yes no no no no 0

541 406 3465000.0 5320 2.0 1 1.0 yes no no no no 1


542 408 3430000.0 4000 2.0 1 1.0 yes no no no no 0

543 410 3430000.0 3850 3.0 1 1.0 yes no no no no 0

544 544 1750000.0 3850 3.0 1 2.0 yes no no no no 0

df = df.drop(columns = 'index')

df.head()

price area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea

0 13300000.0 7420 4.0 2 3.0 yes no no no yes 2 yes

1 3703000.0 5450 2.0 1 1.0 yes no no no no 0 no

2 5950000.0 6525 3.0 2 4.0 yes no no no no 1 no

3 3710000.0 4050 2.0 1 1.0 yes no no no no 0 no
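
The sort, reset, and drop steps above can be combined. A minimal sketch on a hypothetical three-row frame, assuming the old row numbers are not needed afterwards:

```python
import pandas as pd

# Hypothetical toy frame; only the column names follow the dataset
toy = pd.DataFrame({
    "price": [300, 100, 200],
    "furnishingstatus": ["unfurnished", "furnished", "semi-furnished"],
})

# Sort by the text column, then renumber the rows from 0.
# drop=True discards the old index instead of storing it in a new "index" column,
# so no separate drop(columns='index') step is needed.
toy_sorted = toy.sort_values("furnishingstatus").reset_index(drop=True)
print(toy_sorted)
```

The result is the same as calling `reset_index()` and then dropping the `index` column, in one step.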

Checking if the Dataset is clean or not

df.isnull().sum()

price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64

EDA (Exploratory Data Analysis) of the Dataset

import matplotlib.pyplot as plt

plt.scatter(df["price"], df["area"])


<matplotlib.collections.PathCollection at 0x7a4a6102f010>
import seaborn as sns

sns.pairplot(df)


<seaborn.axisgrid.PairGrid at 0x7a4a6109f880>

Getting the Correlation of the Data and plotting it on the Heatmap

correlation_matrix = df[['price', 'area']].corr()

sns.set(style="darkgrid")

sns.heatmap(correlation_matrix, annot=True, cmap='magma', fmt=".2f", linewidths=.5)

<Axes: >
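
The heatmap above covers only price and area. The same `corr()` call can be widened to every numeric column. A minimal sketch on made-up numbers; the `toy` frame is hypothetical, and `numeric_only=True` assumes pandas 1.5 or newer:

```python
import pandas as pd

# Hypothetical toy frame: parking falls exactly as price rises
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "area": [10.0, 20.0, 30.0, 45.0],
    "parking": [3, 2, 1, 0],
    "mainroad": ["yes", "no", "yes", "yes"],
})

# numeric_only=True leaves out text columns such as "mainroad"
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as `correlation_matrix`.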

sns.lmplot(x='price', y='area', data=df)

<seaborn.axisgrid.FacetGrid at 0x7a4a5c071e40>
