Exercise2 Solution

IE0005 Exercise solutions 2

Uploaded by

Derrick
Copyright
© © All Rights Reserved

Exercise 2 : Basic Statistics

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python
Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Problem 1 : Data Preparation


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

houseData = pd.read_csv('train.csv')
houseData.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg
1   2          20       RL         80.0     9600   Pave   NaN      Reg
2   3          60       RL         68.0    11250   Pave   NaN      IR1
3   4          70       RL         60.0     9550   Pave   NaN      IR1
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities  ...  PoolArea PoolQC Fence MiscFeature MiscVal  MoSold  \
0         Lvl    AllPub  ...         0    NaN   NaN         NaN       0       2
1         Lvl    AllPub  ...         0    NaN   NaN         NaN       0       5
2         Lvl    AllPub  ...         0    NaN   NaN         NaN       0       9
3         Lvl    AllPub  ...         0    NaN   NaN         NaN       0       2
4         Lvl    AllPub  ...         0    NaN   NaN         NaN       0      12

   YrSold SaleType SaleCondition  SalePrice
0    2008       WD        Normal     208500
1    2007       WD        Normal     181500
2    2008       WD        Normal     223500
3    2006       WD       Abnorml     140000
4    2008       WD        Normal     250000

[5 rows x 81 columns]

You may get information about the data types using dtypes.

houseData.dtypes

Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object

You may also get more information about the dataset using info().

houseData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
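
The Non-Null Count column above shows that several variables have missing values (for example, Alley has only 91 non-null entries out of 1460). As a minimal sketch, the missing entries can also be counted directly. The small DataFrame `toy` below is a made-up stand-in for houseData, and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for houseData; the values are illustrative, not from train.csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing entries in each column
missing = toy.isnull().sum()
print(missing)

# Keep only the columns that actually have missing entries
print(missing[missing > 0])
```

On the real data, the same two lines applied to houseData should agree with 1460 minus the Non-Null Count reported by info().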

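Note that there are 35 int64 and 3 float64 variables in the dataset.
Extract the 35 int64 variables by filtering the variables using their dtypes. The 3 float64 variables (LotFrontage, MasVnrArea, GarageYrBlt) contain missing values and are left out here.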

houseDataNum = houseData.loc[:, houseData.dtypes == np.int64]


print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64

Data dims : (1460, 35)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotArea 1460 non-null int64
3 OverallQual 1460 non-null int64
4 OverallCond 1460 non-null int64
5 YearBuilt 1460 non-null int64
6 YearRemodAdd 1460 non-null int64
7 BsmtFinSF1 1460 non-null int64
8 BsmtFinSF2 1460 non-null int64
9 BsmtUnfSF 1460 non-null int64
10 TotalBsmtSF 1460 non-null int64
11 1stFlrSF 1460 non-null int64
12 2ndFlrSF 1460 non-null int64
13 LowQualFinSF 1460 non-null int64
14 GrLivArea 1460 non-null int64
15 BsmtFullBath 1460 non-null int64
16 BsmtHalfBath 1460 non-null int64
17 FullBath 1460 non-null int64
18 HalfBath 1460 non-null int64
19 BedroomAbvGr 1460 non-null int64
20 KitchenAbvGr 1460 non-null int64
21 TotRmsAbvGrd 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 GarageArea 1460 non-null int64
25 WoodDeckSF 1460 non-null int64
26 OpenPorchSF 1460 non-null int64
27 EnclosedPorch 1460 non-null int64
28 3SsnPorch 1460 non-null int64
29 ScreenPorch 1460 non-null int64
30 PoolArea 1460 non-null int64
31 MiscVal 1460 non-null int64
32 MoSold 1460 non-null int64
33 YrSold 1460 non-null int64
34 SalePrice 1460 non-null int64
dtypes: int64(35)
memory usage: 399.3 KB

That was a very Pythonic way of implementing the dtypes filter.


There is a much cleaner way of doing it in Pandas, as follows.

houseDataNum = houseData.select_dtypes(include = np.int64)


print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64

Data dims : (1460, 35)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotArea 1460 non-null int64
3 OverallQual 1460 non-null int64
4 OverallCond 1460 non-null int64
5 YearBuilt 1460 non-null int64
6 YearRemodAdd 1460 non-null int64
7 BsmtFinSF1 1460 non-null int64
8 BsmtFinSF2 1460 non-null int64
9 BsmtUnfSF 1460 non-null int64
10 TotalBsmtSF 1460 non-null int64
11 1stFlrSF 1460 non-null int64
12 2ndFlrSF 1460 non-null int64
13 LowQualFinSF 1460 non-null int64
14 GrLivArea 1460 non-null int64
15 BsmtFullBath 1460 non-null int64
16 BsmtHalfBath 1460 non-null int64
17 FullBath 1460 non-null int64
18 HalfBath 1460 non-null int64
19 BedroomAbvGr 1460 non-null int64
20 KitchenAbvGr 1460 non-null int64
21 TotRmsAbvGrd 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 GarageArea 1460 non-null int64
25 WoodDeckSF 1460 non-null int64
26 OpenPorchSF 1460 non-null int64
27 EnclosedPorch 1460 non-null int64
28 3SsnPorch 1460 non-null int64
29 ScreenPorch 1460 non-null int64
30 PoolArea 1460 non-null int64
31 MiscVal 1460 non-null int64
32 MoSold 1460 non-null int64
33 YrSold 1460 non-null int64
34 SalePrice 1460 non-null int64
dtypes: int64(35)
memory usage: 399.3 KB
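
If the float64 variables should be kept as well, select_dtypes also accepts a list of dtypes, or the string "number" for all numeric dtypes. A minimal sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration, not part of the exercise):

```python
import numpy as np
import pandas as pd

# Toy stand-in for houseData with one column of each kind
toy = pd.DataFrame({
    "LotArea": np.array([8450, 9600, 11250], dtype="int64"),
    "LotFrontage": np.array([65.0, 80.0, np.nan], dtype="float64"),
    "MSZoning": ["RL", "RL", "RL"],
})

# Only the int64 columns, as in the exercise
ints_only = toy.select_dtypes(include="int64")

# Both int64 and float64 columns
ints_and_floats = toy.select_dtypes(include=["int64", "float64"])

# Every numeric column, whatever its exact dtype
all_numeric = toy.select_dtypes(include="number")

print(list(ints_only.columns))
print(list(ints_and_floats.columns))
print(list(all_numeric.columns))
```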

Read data_description.txt (from the Kaggle data folder) to identify the actual Numeric
variables.
Note that this table is created manually, and this is my interpretation. Feel free to choose your
own.

Variable Observation
Id Numeric, but simply an index
MSSubClass Categorical, numeric encoding
LotArea Numeric Variable
OverallQual Categorical : Ordinal 1-to-10
OverallCond Categorical : Ordinal 1-to-10
YearBuilt Time Stamp, not just numeric
YearRemodAdd Time Stamp, not just numeric
BsmtFinSF1 Numeric Variable
BsmtFinSF2 Numeric Variable
BsmtUnfSF Numeric Variable
TotalBsmtSF Numeric Variable
1stFlrSF Numeric Variable
2ndFlrSF Numeric Variable
LowQualFinSF Numeric Variable
GrLivArea Numeric Variable
BsmtFullBath Numeric Variable
BsmtHalfBath Numeric Variable
FullBath Numeric Variable
HalfBath Numeric Variable
BedroomAbvGr Numeric Variable
KitchenAbvGr Numeric Variable
TotRmsAbvGrd Numeric Variable
Fireplaces Numeric Variable
GarageCars Numeric Variable
GarageArea Numeric Variable
WoodDeckSF Numeric Variable
OpenPorchSF Numeric Variable
EnclosedPorch Numeric Variable
3SsnPorch Numeric Variable
ScreenPorch Numeric Variable
PoolArea Numeric Variable
MiscVal Numeric Variable
MoSold Time Stamp, not just numeric
YrSold Time Stamp, not just numeric
SalePrice Numeric Variable

Drop the non-Numeric variables (axis = 1) from the DataFrame to obtain a pure Numeric
DataFrame. We keep Id for record-keeping, even though it is simply an index.

houseDataNum = houseDataNum.drop(['MSSubClass', 'OverallQual', 'OverallCond',
                                  'YearBuilt', 'YearRemodAdd', 'MoSold', 'YrSold'],
                                 axis = 1)
houseDataNum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 LotArea 1460 non-null int64
2 BsmtFinSF1 1460 non-null int64
3 BsmtFinSF2 1460 non-null int64
4 BsmtUnfSF 1460 non-null int64
5 TotalBsmtSF 1460 non-null int64
6 1stFlrSF 1460 non-null int64
7 2ndFlrSF 1460 non-null int64
8 LowQualFinSF 1460 non-null int64
9 GrLivArea 1460 non-null int64
10 BsmtFullBath 1460 non-null int64
11 BsmtHalfBath 1460 non-null int64
12 FullBath 1460 non-null int64
13 HalfBath 1460 non-null int64
14 BedroomAbvGr 1460 non-null int64
15 KitchenAbvGr 1460 non-null int64
16 TotRmsAbvGrd 1460 non-null int64
17 Fireplaces 1460 non-null int64
18 GarageCars 1460 non-null int64
19 GarageArea 1460 non-null int64
20 WoodDeckSF 1460 non-null int64
21 OpenPorchSF 1460 non-null int64
22 EnclosedPorch 1460 non-null int64
23 3SsnPorch 1460 non-null int64
24 ScreenPorch 1460 non-null int64
25 PoolArea 1460 non-null int64
26 MiscVal 1460 non-null int64
27 SalePrice 1460 non-null int64
dtypes: int64(28)
memory usage: 319.5 KB
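
As a side note, drop returns a new DataFrame rather than modifying the original, which is why the result is assigned back to houseDataNum above. Pandas also accepts the columns keyword in place of axis = 1. A minimal sketch on a made-up three-column DataFrame (`toy` is an assumption for illustration; its values are the first rows shown by head() earlier):

```python
import pandas as pd

# Toy stand-in for houseDataNum
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "YrSold": [2008, 2007, 2008],
    "SalePrice": [208500, 181500, 223500],
})

# Two equivalent ways of dropping a column
by_axis = toy.drop(["YrSold"], axis=1)
by_keyword = toy.drop(columns=["YrSold"])

print(list(by_axis.columns))
print(by_axis.equals(by_keyword))

# The original DataFrame is left untouched
print(list(toy.columns))
```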

Problem 2 : Statistical Summary


Extract just one variable, SalePrice, from the DataFrame.

saleprice = pd.DataFrame(houseDataNum['SalePrice'])
print("Data type : ", type(saleprice))
print("Data dims : ", saleprice.size)
saleprice.head()

Data type : <class 'pandas.core.frame.DataFrame'>


Data dims : 1460

SalePrice
0 208500
1 181500
2 223500
3 140000
4 250000
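
Note that size is the total number of elements, whereas shape gives the (rows, columns) dimensions. Wrapping the column in pd.DataFrame keeps it as a one-column DataFrame instead of a Series. A minimal sketch using the five SalePrice values shown above (`toy` is a hypothetical name for this small example):

```python
import pandas as pd

# The five SalePrice values printed by head() above
toy = pd.DataFrame({"SalePrice": [208500, 181500, 223500, 140000, 250000]})

print(toy.shape)  # (5, 1) : rows and columns
print(toy.size)   # 5      : total number of elements

# Single brackets give a Series, double brackets give a DataFrame
as_series = toy["SalePrice"]
as_frame = toy[["SalePrice"]]
print(type(as_series))
print(type(as_frame))
```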

Summary Statistics of saleprice, followed by Statistical Visualizations on the variable.

saleprice.describe()

SalePrice
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000

f = plt.figure(figsize=(24, 4))
sb.boxplot(data=saleprice, orient = "h", color = "cornflowerblue")

<AxesSubplot:>
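
The box plot can be read against the describe() output: the box spans the 25% and 75% quartiles, and under the conventional 1.5 x IQR rule (to my understanding, the default in Seaborn's boxplot) points beyond the fences are drawn as outliers. A minimal worked sketch using the quartiles printed above; the variable names are my own.

```python
# Quartiles of SalePrice, copied from the describe() output above
q1 = 129975.0
q3 = 214000.0

# Interquartile range and the 1.5 x IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print("IQR         :", iqr)          # 84025.0
print("Lower fence :", lower_fence)  # 3937.5
print("Upper fence :", upper_fence)  # 340037.5
```

Since the minimum (34900) is above the lower fence and the maximum (755000) is far above the upper fence, outliers are expected only on the high side under this rule.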

f = plt.figure(figsize=(24, 12))
sb.histplot(data=saleprice, x = "SalePrice", color = "royalblue")

<AxesSubplot:xlabel='SalePrice', ylabel='Count'>

f = plt.figure(figsize=(24, 12))
sb.violinplot(data=saleprice, orient='h')

<AxesSubplot:>

Summary Statistics of LotArea, followed by Statistical Visualizations on the variable.


lotarea = pd.DataFrame(houseDataNum['LotArea'])
print("Data type : ", type(lotarea))
print("Data dims : ", lotarea.size)
lotarea.head()

Data type : <class 'pandas.core.frame.DataFrame'>


Data dims : 1460

LotArea
0 8450
1 9600
2 11250
3 9550
4 14260

lotarea.describe()

LotArea
count 1460.000000
mean 10516.828082
std 9981.264932
min 1300.000000
25% 7553.500000
50% 9478.500000
75% 11601.500000
max 215245.000000

f = plt.figure(figsize=(24, 4))
sb.boxplot(data=lotarea, orient = "h")

<AxesSubplot:>

f = plt.figure(figsize=(24, 12))
sb.histplot(data=lotarea, x = "LotArea", color = "brown")

<AxesSubplot:xlabel='LotArea', ylabel='Count'>

f = plt.figure(figsize=(24, 12))
sb.violinplot(data=lotarea, orient='h')

<AxesSubplot:>

Extract two variables from the DataFrame -- SalePrice and LotArea -- and check their
mutual relationship.

saleprice = pd.DataFrame(houseDataNum['SalePrice'])
lotarea = pd.DataFrame(houseDataNum['LotArea'])

# Set up matplotlib figure with six subplots (2 rows x 3 columns)


f, axes = plt.subplots(2, 3, figsize=(24, 12))

# Plot the basic uni-variate figures for SalePrice


sb.boxplot(data=saleprice, orient = "h", ax = axes[0,0])
sb.histplot(data=saleprice, ax = axes[0,1])
sb.violinplot(data=saleprice, ax = axes[0,2])

# Plot the basic uni-variate figures for LotArea


sb.boxplot(data=lotarea, orient = "h", ax = axes[1,0])
sb.histplot(data=lotarea, ax = axes[1,1])
sb.violinplot(data=lotarea, ax = axes[1,2])

<AxesSubplot:>

jointDF = pd.concat([lotarea, saleprice], axis = 1).reindex(lotarea.index)
sb.jointplot(data=jointDF, x = 'LotArea', y = 'SalePrice', height = 16)

<seaborn.axisgrid.JointGrid at 0x1cd4be7f5e0>

# Calculate the correlation between the two columns/variables
jointDF.corr()

LotArea SalePrice
LotArea 1.000000 0.263843
SalePrice 0.263843 1.000000
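
By default, corr() computes the Pearson correlation coefficient. As a minimal sketch of what that number means, the coefficient can be reproduced by hand with NumPy. The example uses only the first five rows shown by head() earlier, so its value will differ from the 0.263843 obtained on the full dataset; `toy`, `r_manual` and `r_pandas` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# First five rows of LotArea and SalePrice, as printed by head() earlier
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

x = toy["LotArea"].to_numpy(dtype=float)
y = toy["SalePrice"].to_numpy(dtype=float)

# Pearson r : sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by Pandas
r_pandas = toy.corr().loc["LotArea", "SalePrice"]

print(r_manual, r_pandas)
```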

sb.heatmap(jointDF.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")

<AxesSubplot:>
