
Exercise 1 : Data Acquisition

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation

In [1]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Problem 1 : Kaggle

a) Import the “train.csv” data you downloaded in Jupyter Notebook.


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

Note : header is an optional input parameter to the function read_csv .

If you do not input a header value, it will default to infer , taking (generally) the first row of the CSV
file as column names.
If you set header = None then it will understand that there are no column names in the CSV file, and
every row contains just data.
If you set header = 0 , it will understand that you want the 0-th row (first row) of the CSV file to be
considered as column names.
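For instance, a short illustrative sketch (applied to the same train.csv purely to show the difference; note that passing header = None to a file that does have a header row will pull that row in as data):

    df_default   = pd.read_csv('train.csv')                # header = 'infer' : first row becomes the column names
    df_no_header = pd.read_csv('train.csv', header=None)   # no column names : columns are labelled 0, 1, 2, ...
    df_row_zero  = pd.read_csv('train.csv', header=0)      # explicitly use row 0 as the column names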

Check any function definition in Jupyter Notebook by running function_name? , e.g., try running
pd.read_csv? in a cell.
In [2]: houseData = pd.read_csv('train.csv', header='infer') # can use header = 2, 3, etc. if needed
houseData.head(5) # print the first 5 rows

Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns

b) How many observations (rows) and variables (columns) are in the above
dataset?

Check the “shape”


It is super simple to get the dimensions of the dataset using shape .

Note that shape is an attribute/variable stored inside the DataFrame class of pandas.
Find out all the stored attributes by checking the documentation on Pandas DataFrame :

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [3]: houseData.shape

Out[3]: (1460, 81)

c) What are the data types (“dtypes”) – Numeric/Categorical – of the variables


(columns) in the dataset?
You may get information about the data types using dtypes . This is another attribute.

In [4]: houseData.dtypes

Out[4]: Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object

d) What does the .info() method do? Use the .info() method on the imported
dataset to check this out.
You may also get more information about the dataset using info() .

Note that info() is a method/function stored inside the DataFrame class of pandas.
Find out all the stored methods by checking the documentation on Pandas DataFrame :

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
In [5]: houseData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

e) What does the .describe() method do? Use the .describe() method on the
imported dataset to check.
describe() provides you with statistical information about the data. This is another method.

In [6]: houseData.describe()

Out[6]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt Year

count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1

mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1

std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904

min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1

25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1

50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1

75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2

max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2

8 rows × 38 columns

Observation : Why are there fewer variables in describe() than in info() ?

describe() provides the basic statistics, but only for the Numeric variables. Be careful, though: a variable
that looks numeric may actually be categorical, as levels of categorical variables are often encoded as
numbers. Pandas does not know that -- it follows duck-typing.

In Exercise 2, you will explicitly go over each variable, read its description, and understand its true
meaning, to judge whether a variable that looks like a number should really be considered numeric. This is an
important part of data preparation before you go ahead with exploratory data analysis in Exercise 3.
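As a quick cross-check (an added sketch, not part of the original exercise), describe() can also summarise the non-numeric columns if asked explicitly:

    # Summary of the object (string) columns only : count, number of unique levels, top level, and its frequency
    houseData.describe(include = 'object')

    # Or summarise every column, numeric and non-numeric alike
    houseData.describe(include = 'all')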
Problem 2 : Wikipedia
As the dataset is in a table format within an HTML website, we may use the read_html function from
Pandas.
Same as read_csv , there are multiple optional input parameters to this function. Try running
pd.read_html? in a cell to see them.

a) Import the Wikipedia page in Jupyter Notebook

In [7]: medal_html = pd.read_html('https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table')

b) How many tables are in this Wikipedia page? Check the “len” of the
imported data/page to find this out.
Check that the imported data is a list , and note its len . This tells us how many tables there are.

In [8]: print("Data type : ", type(medal_html))


print("HTML tables length : ", len(medal_html))

Data type : <class 'list'>


HTML tables length : 7

c) Which one is the actual “2016 Summer Olympics medal table”? Explore all
tables in the data to know.
Check each table in the dataset to identify the one that we want to extract. Note that this is a standard
python list where each element of the list is a pandas DataFrame . That is, every single table from the
HTML document (the webpage) has been parsed into an individual DataFrame .

In [9]: medal_html[2] # vary the index from 0 to 1, 2, 3 etc. to check each table parsed from the page

Out[9]:
Rank NOC Gold Silver Bronze Total

0 1 United States 46 37 38 121

1 2 Great Britain 27 23 17 67

2 3 China 26 18 26 70

3 4 Russia 19 17 20 56

4 5 Germany 17 10 15 42

... ... ... ... ... ... ...

82 78 Nigeria 0 0 1 1

83 78 Portugal 0 0 1 1

84 78 Trinidad and Tobago 0 0 1 1

85 78 United Arab Emirates 0 0 1 1

86 Totals (86 entries) Totals (86 entries) 306 307 359 972

87 rows × 6 columns

Just by exploring each element in the list above, you will find that the actual data table corresponding to
the 2016 Summer Olympics medal table is located at index 2 of the list .
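Rather than changing the index by hand, a small loop (a sketch reusing the same medal_html list) prints the shape and the first few column names of every parsed table, which makes the medal table easy to spot:

    for i, table in enumerate(medal_html):
        print(i, table.shape, list(table.columns)[:4])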

d) Extract the main table, “2016 Summer Olympics medal table”, and store it
as a new Pandas DataFrame.
Assign it to a new DataFrame , as follows, so that we can use it for further exploration. Check the basic
information too.

In [10]: medalTable = medal_html[2]


medalTable.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 87 non-null object
1 NOC 87 non-null object
2 Gold 87 non-null int64
3 Silver 87 non-null int64
4 Bronze 87 non-null int64
5 Total 87 non-null int64
dtypes: int64(4), object(2)
memory usage: 4.2+ KB

e) Extract the TOP 20 countries from the medal table, as above, and store
these rows as a new DataFrame.
The DataFrame seems to have 87 rows/countries. Extract the top 20 rows of the DataFrame to capture
the TOP 20 countries in the medal tally. There are plenty of ways to do this. You may use the standard
.iloc[] method to index the specific rows you want to extract. You may also use .head(20) .
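For instance, the head() alternative is a one-liner (a sketch equivalent to the iloc version used in the next cell):

    medalData = medalTable.head(20)   # same result as medalTable.iloc[:20]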
In [11]: medalData = medalTable.iloc[:20]
medalData

Out[11]:
Rank NOC Gold Silver Bronze Total

0 1 United States 46 37 38 121

1 2 Great Britain 27 23 17 67

2 3 China 26 18 26 70

3 4 Russia 19 17 20 56

4 5 Germany 17 10 15 42

5 6 Japan 12 8 21 41

6 7 France 10 18 14 42

7 8 South Korea 9 3 9 21

8 9 Italy 8 12 8 28

9 10 Australia 8 11 10 29

10 11 Netherlands 8 7 4 19

11 12 Hungary 8 3 4 15

12 13 Brazil* 7 6 6 19

13 14 Spain 7 4 6 17

14 15 Kenya 6 6 1 13

15 16 Jamaica 6 3 2 11

16 17 Croatia 5 3 2 10

17 18 Cuba 5 2 4 11

18 19 New Zealand 4 9 5 18

19 20 Canada 4 3 15 22

Observation : If you can just change the 2016 part within the URL
'https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table' , you should be able to
fetch any other Summer Olympic dataset similarly. Try with 2012 , 2008 etc. Can this be done for any
year? What about 1980 ?

Interesting : If the 2016 part can be changed this way, you may also try to write a loop to iterate over a
list of years [2016, 2012, 2008, 2004, 2000] , and fetch all the tables within the loop. Try it out --
should be fun! This is a bonus problem, and will be discussed in the next Review Session.

More Interesting : Any structured website can be scraped for tables in the same way. However, what
would you do for data that are not in a table format? Can you extract the name of the movie, its rating and
genres from https://www.imdb.com/title/tt0441773/ using some
other library in python? Give it a shot! :-)

Bonus Problems A
To download the data: https://www.kaggle.com/datasets/uciml/adult-census-income

A. Download the "Census Income" dataset (source :
https://archive.ics.uci.edu/ml/datasets/Census+Income) from the UCI
Machine Learning Repository (in the "Data Folder"), and import it in Jupyter
Notebook as a DataFrame.
Explore the dataset using .shape , .info() and .describe() , exactly as you did in Problem 1 above.
Do you spot anything interesting while exploring this dataset? Discuss amongst friends or talk to the
Instructor, if you did.

In [18]: census_income=pd.read_csv('adult.csv')
In [30]: print(census_income.shape)
print('\n') #spacing
print(census_income.info())
print('\n') #spacing
print(census_income.describe())

(32560, 15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 39 32560 non-null int64
1 State-gov 32560 non-null object
2 77516 32560 non-null int64
3 Bachelors 32560 non-null object
4 13 32560 non-null int64
5 Never-married 32560 non-null object
6 Adm-clerical 32560 non-null object
7 Not-in-family 32560 non-null object
8 White 32560 non-null object
9 Male 32560 non-null object
10 2174 32560 non-null int64
11 0 32560 non-null int64
12 40 32560 non-null int64
13 United-States 32560 non-null object
14 <=50K 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None

39 77516 13 2174 0 \
count 32560.000000 3.256000e+04 32560.000000 32560.000000 32560.000000
mean 38.581634 1.897818e+05 10.080590 1077.615172 87.306511
std 13.640642 1.055498e+05 2.572709 7385.402999 402.966116
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000
25% 28.000000 1.178315e+05 9.000000 0.000000 0.000000
50% 37.000000 1.783630e+05 10.000000 0.000000 0.000000
75% 48.000000 2.370545e+05 12.000000 0.000000 0.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000

40
count 32560.000000
mean 40.437469
std 12.347618
min 1.000000
25% 40.000000
50% 40.000000
75% 45.000000
max 99.000000

The method describe() returns only statistics for columns whose dtype is 'int64' and ignores 'object'
ones.
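Another thing worth spotting: the column labels above ( 39 , State-gov , 77516 , ...) look like data values rather than variable names. The raw UCI file has no header row, so read_csv promoted the first record to column names (which is also why there are 32560 rows instead of the expected 32561). A possible fix (a sketch; the column names below are taken from the UCI adult.names description and are an assumption, not part of this notebook):

    adult_columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                     'marital-status', 'occupation', 'relationship', 'race', 'sex',
                     'capital-gain', 'capital-loss', 'hours-per-week',
                     'native-country', 'income']

    # re-read the file, telling pandas there is no header row in the CSV
    census_income = pd.read_csv('adult.csv', header = None, names = adult_columns)
    print(census_income.shape)   # should now report 32561 rows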

Bonus Problems B
Note that the Summer Olympic medal tally on Wikipedia follows a really nice structure for the URL, where
you can simply change the year in https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table
to fetch any Summer Olympic page.
Try changing 2016 in the URL to 2012 or 2008 or 2004 to see for yourself. This allows us to fetch the
Olympics medal table from all these years (in fact, any year) quite easily.

Let’s try the following. Write a loop to extract the main tables, “20XX Summer Olympics medal table”, from
2000 to 2016, that is, for the five consecutive Olympics in 2000, 2004, 2008, 2012 and 2016. Store all five
tables in respective DataFrames. Now, extract the TOP 20 countries from each of these medal tables, and
store these rows as new DataFrames.
In [35]: first_year = 2000
last_year = 2016
medal_tables = []

for year in range(first_year, last_year + 1, 4):

    # fetch the html page with the url for this year
    wiki_url = "https://en.wikipedia.org/wiki/" + str(year) + "_Summer_Olympics_medal_table"

    # scrape only the table(s) whose text matches "Rank"
    medal_html = pd.read_html(wiki_url, match = "Rank")
    medalTable = medal_html[0]
    medal_tables.append(medalTable)

    # extract only the TOP 10 countries in the table (the TOP 20 are extracted in the next cell)
    medalData = medalTable.iloc[:10]

    # display the year and the top 10 countries
    print("___________________________________________")
    print(year)
    display(medalData)

___________________________________________
2000

Rank Nation Gold Silver Bronze Total

0 1 United States 37 24 32 93

1 2 Russia 32 28 29 89

2 3 China 28 16 14 58

3 4 Australia 16 25 17 58

4 5 Germany 13 17 26 56

5 6 France 13 14 11 38

6 7 Italy 13 8 13 34

7 8 Netherlands 12 9 4 25

8 9 Cuba 11 11 7 29

9 10 Great Britain 11 10 7 28

___________________________________________
2004

Rank Nation Gold Silver Bronze Total

0 1 United States 36 39 26 101

1 2 China 32 17 14 63

2 3 Russia 28 26 36 90

3 4 Australia 17 16 17 50

4 5 Japan 16 9 12 37

5 6 Germany 13 16 20 49

6 7 France 11 9 13 33

7 8 Italy 10 11 11 32

8 9 South Korea 9 12 9 30

9 10 Great Britain 9 9 12 30
___________________________________________
2008

Rank NOC Gold Silver Bronze Total

0 1 China* 48 22 30 100

1 2 United States 36 39 37 112

2 3 Russia 24 13 23 60

3 4 Great Britain 19 13 19 51

4 5 Germany 16 11 14 41

5 6 Australia 14 15 17 46

6 7 South Korea 13 11 8 32

7 8 Japan 9 8 8 25

8 9 Italy 8 9 10 27

9 10 France 7 16 20 43

___________________________________________
2012

Rank NOC Gold Silver Bronze Total

0 1 United States 47 27 30 104

1 2 China 39 31 22 92

2 3 Great Britain* 29 18 18 65

3 4 Russia 19 21 27 67

4 5 South Korea 13 9 8 30

5 6 Germany 11 20 13 44

6 7 France 11 11 13 35

7 8 Australia 8 15 12 35

8 9 Italy 8 9 11 28

9 10 Hungary 8 4 6 18

___________________________________________
2016

Rank NOC Gold Silver Bronze Total

0 1 United States 46 37 38 121

1 2 Great Britain 27 23 17 67

2 3 China 26 18 26 70

3 4 Russia 19 17 20 56

4 5 Germany 17 10 15 42

5 6 Japan 12 8 21 41

6 7 France 10 18 14 42

7 8 South Korea 9 3 9 21

8 9 Italy 8 12 8 28

9 10 Australia 8 11 10 29
In [33]: medalDataFrames = []

for medal_table in medal_tables:
    # extract only the TOP 20 countries in the table
    medalData = medal_table.iloc[:20]

    # rename the columns to make the tables uniform
    medalData.columns = ["Rank", "Nation", "Gold", "Silver", "Bronze", "Total"]

    # store the medal dataframe in the main list
    medalDataFrames.append(medalData)

medalDataAll = pd.concat(medalDataFrames, axis = 0)
medalDataAll

Out[33]:
Rank Nation Gold Silver Bronze Total

0 1 United States 37 24 32 93

1 2 Russia 32 28 29 89

2 3 China 28 16 14 58

3 4 Australia 16 25 17 58

4 5 Germany 13 17 26 56

... ... ... ... ... ... ...

15 16 Jamaica 6 3 2 11

16 17 Croatia 5 3 2 10

17 18 Cuba 5 2 4 11

18 19 New Zealand 4 9 5 18

19 20 Canada 4 3 15 22

100 rows × 6 columns


Exercise 2 : Basic Statistics

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

In [1]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Problem 1 : Data Preparation


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

a) Import the “train.csv” data you downloaded (either from NTU Learn or
Kaggle) in Jupyter Notebook.

In [2]: houseData = pd.read_csv('train.csv')


houseData.head()

Out[2]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns

b) What are the data types (“dtypes”) – int64/float64/object – of the variables


(columns) in the dataset?
You may get information about the data types using dtypes .

In [25]: houseData.dtypes

Out[25]: Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object

You may also get more information about the dataset using info() .
In [26]: houseData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

Note that there are 35 int64 and 3 float64 variables in the dataset.
Extract the 38 variables by filtering the variables using their dtypes .

c) Create a new Pandas DataFrame consisting of only the variables (columns)


of type Integer (int64).
Take only int64 in the dataset
In [5]: houseDataNum = houseData.loc[:, houseData.dtypes == np.int64]
print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64

Data dims : (1460, 35)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotArea 1460 non-null int64
3 OverallQual 1460 non-null int64
4 OverallCond 1460 non-null int64
5 YearBuilt 1460 non-null int64
6 YearRemodAdd 1460 non-null int64
7 BsmtFinSF1 1460 non-null int64
8 BsmtFinSF2 1460 non-null int64
9 BsmtUnfSF 1460 non-null int64
10 TotalBsmtSF 1460 non-null int64
11 1stFlrSF 1460 non-null int64
12 2ndFlrSF 1460 non-null int64
13 LowQualFinSF 1460 non-null int64
14 GrLivArea 1460 non-null int64
15 BsmtFullBath 1460 non-null int64
16 BsmtHalfBath 1460 non-null int64
17 FullBath 1460 non-null int64
18 HalfBath 1460 non-null int64
19 BedroomAbvGr 1460 non-null int64
20 KitchenAbvGr 1460 non-null int64
21 TotRmsAbvGrd 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 GarageArea 1460 non-null int64
25 WoodDeckSF 1460 non-null int64
26 OpenPorchSF 1460 non-null int64
27 EnclosedPorch 1460 non-null int64
28 3SsnPorch 1460 non-null int64
29 ScreenPorch 1460 non-null int64
30 PoolArea 1460 non-null int64
31 MiscVal 1460 non-null int64
32 MoSold 1460 non-null int64
33 YrSold 1460 non-null int64
34 SalePrice 1460 non-null int64
dtypes: int64(35)
memory usage: 399.3 KB

That was a very Pythonic way of implementing the dtypes filter.

There is a much cleaner way of doing it in Pandas, as follows.
In [6]: houseDataNum = houseData.select_dtypes(include = np.int64)
print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64

Data dims : (1460, 35)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotArea 1460 non-null int64
3 OverallQual 1460 non-null int64
4 OverallCond 1460 non-null int64
5 YearBuilt 1460 non-null int64
6 YearRemodAdd 1460 non-null int64
7 BsmtFinSF1 1460 non-null int64
8 BsmtFinSF2 1460 non-null int64
9 BsmtUnfSF 1460 non-null int64
10 TotalBsmtSF 1460 non-null int64
11 1stFlrSF 1460 non-null int64
12 2ndFlrSF 1460 non-null int64
13 LowQualFinSF 1460 non-null int64
14 GrLivArea 1460 non-null int64
15 BsmtFullBath 1460 non-null int64
16 BsmtHalfBath 1460 non-null int64
17 FullBath 1460 non-null int64
18 HalfBath 1460 non-null int64
19 BedroomAbvGr 1460 non-null int64
20 KitchenAbvGr 1460 non-null int64
21 TotRmsAbvGrd 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 GarageArea 1460 non-null int64
25 WoodDeckSF 1460 non-null int64
26 OpenPorchSF 1460 non-null int64
27 EnclosedPorch 1460 non-null int64
28 3SsnPorch 1460 non-null int64
29 ScreenPorch 1460 non-null int64
30 PoolArea 1460 non-null int64
31 MiscVal 1460 non-null int64
32 MoSold 1460 non-null int64
33 YrSold 1460 non-null int64
34 SalePrice 1460 non-null int64
dtypes: int64(35)
memory usage: 399.3 KB

d) Open the “data_description.txt” file you downloaded (either from NTU Learn or Kaggle) in
Wordpad.

Read the description for each variable carefully and try to identify the “actual” Numeric variables.
Categorical variables are often “encoded” as Numeric variables for easy representation. Spot them.

Observation : Note that in a given dataset, Categorical variables can be "encoded" in either of two ways, as
Characters (as in MSZoning ) or as Numbers (as in MSSubClass ). Even if a categorical variable is
"encoded" as numbers, interpreting it as a numeric variable is wrong. Thus, one should be careful in
reading the given data description file and identifying the "actual" numeric variables in the dataset to
perform statistical exploration.

Read data_description.txt (from the Kaggle data folder) to identify the actual Numeric variables.
Note that this table is created manually, and this is my interpretation. Feel free to choose your own.
Variable Observation

Id Numeric, but simply an index

MSSubClass Categorial, numeric encoding

LotArea Numeric Variable

OverallQual Categorial : Ordinal 1-to-10

OverallCond Categorial : Ordinal 1-to-10

YearBuilt Time Stamp, not just numeric

YearRemodAdd Time Stamp, not just numeric

BsmtFinSF1 Numeric Variable

BsmtFinSF2 Numeric Variable

BsmtUnfSF Numeric Variable

TotalBsmtSF Numeric Variable

1stFlrSF Numeric Variable

2ndFlrSF Numeric Variable

LowQualFinSF Numeric Variable

GrLivArea Numeric Variable

BsmtFullBath Numeric Variable

BsmtHalfBath Numeric Variable

FullBath Numeric Variable

HalfBath Numeric Variable

BedroomAbvGr Numeric Variable

KitchenAbvGr Numeric Variable

TotRmsAbvGrd Numeric Variable

Fireplaces Numeric Variable

GarageCars Numeric Variable

GarageArea Numeric Variable

WoodDeckSF Numeric Variable

OpenPorchSF Numeric Variable

EnclosedPorc Numeric Variable

3SsnPorch Numeric Variable

ScreenPorch Numeric Variable

PoolArea Numeric Variable

MiscVal Numeric Variable

MoSold Time Stamp, not just numeric

YrSold Time Stamp, not just numeric

SalePrice Numeric Variable
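One quick, heuristic cross-check for this manual table (an added sketch, not from the original notebook): count how many distinct values each int64 column takes. Columns with only a handful of levels (such as OverallQual or MSSubClass) are good candidates for numerically-encoded categoricals and are worth re-reading in data_description.txt:

    # Number of distinct values per int64 column, smallest first
    houseData.select_dtypes(include = np.int64).nunique().sort_values().head(15)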

e) Drop non-Numeric variables from the DataFrame to have a clean


DataFrame with Numeric variables.
Drop the non-Numeric variables using .drop(axis = 1) from the DataFrame to obtain a purely Numeric
DataFrame. We keep Id for reference.
In [7]: houseDataNum = houseDataNum.drop(['MSSubClass', 'OverallQual', 'OverallCond', 'YearBuilt',
                                          'YearRemodAdd', 'MoSold', 'YrSold'], axis = 1)

In [27]: houseDataNum.info() #to check

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 LotArea 1460 non-null int64
2 BsmtFinSF1 1460 non-null int64
3 BsmtFinSF2 1460 non-null int64
4 BsmtUnfSF 1460 non-null int64
5 TotalBsmtSF 1460 non-null int64
6 1stFlrSF 1460 non-null int64
7 2ndFlrSF 1460 non-null int64
8 LowQualFinSF 1460 non-null int64
9 GrLivArea 1460 non-null int64
10 BsmtFullBath 1460 non-null int64
11 BsmtHalfBath 1460 non-null int64
12 FullBath 1460 non-null int64
13 HalfBath 1460 non-null int64
14 BedroomAbvGr 1460 non-null int64
15 KitchenAbvGr 1460 non-null int64
16 TotRmsAbvGrd 1460 non-null int64
17 Fireplaces 1460 non-null int64
18 GarageCars 1460 non-null int64
19 GarageArea 1460 non-null int64
20 WoodDeckSF 1460 non-null int64
21 OpenPorchSF 1460 non-null int64
22 EnclosedPorch 1460 non-null int64
23 3SsnPorch 1460 non-null int64
24 ScreenPorch 1460 non-null int64
25 PoolArea 1460 non-null int64
26 MiscVal 1460 non-null int64
27 SalePrice 1460 non-null int64
dtypes: int64(28)
memory usage: 319.5 KB

Problem 2 : Statistical Summary


Extract just one variable, SalePrice , from the DataFrame.
In [24]: saleprice = pd.DataFrame(houseDataNum['SalePrice'])
print("Data type : ", type(saleprice))
print("Data dims : ", saleprice.size)
saleprice.head()

Data type : <class 'pandas.core.frame.DataFrame'>


Data dims : 1460

Out[24]: SalePrice

0 208500

1 181500

2 223500

3 140000

4 250000

a) Find the Summary Statistics (Mean, Median, Quartiles etc.) of SalePrice


from the Numeric DataFrame
Summary Statistics of saleprice , followed by Statistical Visualizations on the variable.

In [10]: saleprice.describe()

Out[10]: SalePrice

count 1460.000000

mean 180921.195890

std 79442.502883

min 34900.000000

25% 129975.000000

50% 163000.000000

75% 214000.000000

max 755000.000000
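The individual statistics behind describe() can also be computed one at a time (a short sketch on the same saleprice DataFrame):

    print("Mean      :", saleprice['SalePrice'].mean())
    print("Median    :", saleprice['SalePrice'].median())
    print("Quartiles :")
    print(saleprice['SalePrice'].quantile([0.25, 0.5, 0.75]))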

b) Visualize the summary statistics and distribution of SalePrice using


standard Box-Plot, Histogram, KDE.

In [32]: # BOXPLOT
f = plt.figure(figsize=(17, 3))
sb.boxplot(data = saleprice, orient = "h")

Out[32]: <AxesSubplot:>
In [50]: # HISTOGRAM
f = plt.figure(figsize=(15, 6))
sb.histplot(data = saleprice)

Out[50]: <AxesSubplot:ylabel='Count'>

In [45]: # KDE PLOT


f = plt.figure(figsize=(8, 4))
#sb.kdeplot(data = saleprice, hist = True, kde = True, color = "red")
sb.distplot(saleprice, hist = True, kde = True, color = "red")

C:\Users\timot\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarnin
g: `distplot` is a deprecated function and will be removed in a future version. Please
adapt your code to use either `displot` (a figure-level function with similar flexibil
ity) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)

Out[45]: <AxesSubplot:ylabel='Density'>
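As the FutureWarning above suggests, distplot is deprecated. An equivalent figure with the newer API (a sketch) overlays the KDE on a histogram via histplot , or draws the KDE alone via kdeplot :

    f = plt.figure(figsize=(8, 4))
    sb.histplot(data = saleprice, kde = True, color = "red")   # histogram with a KDE overlay

    f = plt.figure(figsize=(8, 4))
    sb.kdeplot(data = saleprice, color = "red")                # KDE curve only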
In [46]: # VIOLIN PLOT

f = plt.figure(figsize=(8, 4))
sb.violinplot(data = saleprice, orient = "h")

Out[46]: <AxesSubplot:>

In [49]: f = plt.figure(figsize=(2, 7))


sb.violinplot(data = saleprice, orient = "v", color = "deeppink")

Out[49]: <AxesSubplot:>

Summary Statistics of LotArea , followed by Statistical Visualizations on the variable.


In [14]: lotarea = pd.DataFrame(houseDataNum['LotArea'])
print("Data type : ", type(lotarea))
print("Data dims : ", lotarea.size)
lotarea.head()

Data type : <class 'pandas.core.frame.DataFrame'>


Data dims : 1460

Out[14]: LotArea

0 8450

1 9600

2 11250

3 9550

4 14260

c) Find the Summary Statistics (Mean, Median, Quartiles etc) of LotArea from
the Numeric DataFrame.

(same as part a)

In [15]: lotarea.describe()

Out[15]: LotArea

count 1460.000000

mean 10516.828082

std 9981.264932

min 1300.000000

25% 7553.500000

50% 9478.500000

75% 11601.500000

max 215245.000000

d) Visualize the summary statistics and distribution of LotArea using


standard Box-Plot, Histogram, KDE.

(same as part b)

In [55]: # BOXPLOT
f = plt.figure(figsize=(17, 3))
sb.boxplot(data = lotarea, orient = "h")

Out[55]: <AxesSubplot:>
In [60]: # HISTOGRAM
f = plt.figure(figsize=(6, 4))
sb.histplot(data = lotarea)

Out[60]: <AxesSubplot:ylabel='Count'>

In [62]: # VIOLIN PLOT



f = plt.figure(figsize=(6, 4))
sb.violinplot(data = lotarea, orient = "h")

Out[62]: <AxesSubplot:>

e) Plot SalePrice (y-axis) vs LotArea (x-axis) using jointplot and find the
Correlation between the two.
Extract two variables from the DataFrame -- SalePrice and LotArea -- and check their mutual
relationship.

In [65]: saleprice = pd.DataFrame(houseDataNum['SalePrice'])


lotarea = pd.DataFrame(houseDataNum['LotArea'])
### not necessary
In [66]: # Set up matplotlib figure with three subplots
f, axes = plt.subplots(2, 3, figsize=(12, 6))

# Plot the basic uni-variate figures for SalePrice
sb.boxplot(data = saleprice, orient = "h", ax = axes[0,0])
sb.histplot(data = saleprice, ax = axes[0,1])
sb.violinplot(data = saleprice, orient = "h", ax = axes[0,2])

# Plot the basic uni-variate figures for LotArea
sb.boxplot(data = lotarea, orient = "h", ax = axes[1,0])
sb.histplot(data = lotarea, ax = axes[1,1])
sb.violinplot(data = lotarea, orient = "h", ax = axes[1,2])

Out[66]: <AxesSubplot:>

Create a joint dataframe by concatenating the two variables

In [75]: #JOINT DATAFRAME


jointDF = pd.concat([lotarea, saleprice], axis = 1).reindex(lotarea.index)

Draw jointplot of the two variables in the joined dataframe


In [71]: sb.jointplot(data = jointDF, x = "LotArea", y = "SalePrice", height = 5)

Out[71]: <seaborn.axisgrid.JointGrid at 0x189c2cbae50>

Calculate the correlation between the two columns/variables

In [74]: jointDF.corr()

Out[74]: LotArea SalePrice

LotArea 1.000000 0.263843

SalePrice 0.263843 1.000000

Heatmap
In [23]: sb.heatmap(jointDF.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")

Out[23]: <AxesSubplot:>

Observation : Note that the correlation between LotArea and SalePrice is 0.26, as shown above. Do
you think LotArea will have any effect on SalePrice of a house, in case you want to estimate the
SalePrice using LotArea while exploring the real-estate market? Think about it.

Bonus

Create a new Pandas DataFrame consisting of all variables (columns) of type


Integer (int64) or Float (float64). Read the description for each variable
carefully and try to identify the “actual” Numeric variables in the data.

Drop non-Numeric variables from the DataFrame to have a clean DataFrame with only the Numeric
variables.

Plot SalePrice vs each of the Numeric variables you identified to understand their correlation or
dependence.

In [94]: df=pd.read_csv('train.csv')
In [95]: df_sub = df.loc[:, (df.dtypes == np.int64) | (df.dtypes == np.float64)]
df_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1201 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1452 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 1stFlrSF 1460 non-null int64
14 2ndFlrSF 1460 non-null int64
15 LowQualFinSF 1460 non-null int64
16 GrLivArea 1460 non-null int64
17 BsmtFullBath 1460 non-null int64
18 BsmtHalfBath 1460 non-null int64
19 FullBath 1460 non-null int64
20 HalfBath 1460 non-null int64
21 BedroomAbvGr 1460 non-null int64
22 KitchenAbvGr 1460 non-null int64
23 TotRmsAbvGrd 1460 non-null int64
24 Fireplaces 1460 non-null int64
25 GarageYrBlt 1379 non-null float64
26 GarageCars 1460 non-null int64
27 GarageArea 1460 non-null int64
28 WoodDeckSF 1460 non-null int64
29 OpenPorchSF 1460 non-null int64
30 EnclosedPorch 1460 non-null int64
31 3SsnPorch 1460 non-null int64
32 ScreenPorch 1460 non-null int64
33 PoolArea 1460 non-null int64
34 MiscVal 1460 non-null int64
35 MoSold 1460 non-null int64
36 YrSold 1460 non-null int64
37 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35)
memory usage: 433.6 KB
In [96]: column_categorical=['MSSubClass', 'OverallQual', 'OverallCond']
df_num = df_sub.drop(column_categorical, axis=1)
df_num.dtypes

Out[96]: Id int64
LotFrontage float64
LotArea int64
YearBuilt int64
YearRemodAdd int64
MasVnrArea float64
BsmtFinSF1 int64
BsmtFinSF2 int64
BsmtUnfSF int64
TotalBsmtSF int64
1stFlrSF int64
2ndFlrSF int64
LowQualFinSF int64
GrLivArea int64
BsmtFullBath int64
BsmtHalfBath int64
FullBath int64
HalfBath int64
BedroomAbvGr int64
KitchenAbvGr int64
TotRmsAbvGrd int64
Fireplaces int64
GarageYrBlt float64
GarageCars int64
GarageArea int64
WoodDeckSF int64
OpenPorchSF int64
EnclosedPorch int64
3SsnPorch int64
ScreenPorch int64
PoolArea int64
MiscVal int64
MoSold int64
YrSold int64
SalePrice int64
dtype: object
In [100]: for var in df_num:
    if var != 'SalePrice':
        jointDF = pd.concat([df_num[var], df_num['SalePrice']], axis = 1).reindex(df_num.index)

        sb.jointplot(data = jointDF, x = var, y = 'SalePrice', height = 3)
        plt.show()

        f, axes = plt.subplots(1, 1, figsize=(1.5, 1))
        sb.heatmap(jointDF.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")
        plt.show()
In [103]: f, axes = plt.subplots(1, 1, figsize=(22, 20))
sb.heatmap(df_num.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")
plt.show()
Exercise 3 : Exploratory Analysis

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

In [3]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

In [4]: houseData = pd.read_csv('train.csv')


houseData.head()

Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns


Problem 1 : Numeric Variables


Extract the following Numeric variables from the dataset, and store as a new Pandas
DataFrame.

houseNumData = pd.DataFrame(houseData[['LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea', 'SalePrice']])

Extract the required variables from the dataset, as mentioned in the problem.
LotArea , GrLivArea , TotalBsmtSF , GarageArea , SalePrice

In [5]: houseNumData = pd.DataFrame(houseData[['LotArea', 'GrLivArea', 'TotalBsmtSF', 'GarageArea', 'SalePrice']])

houseNumData.head()  # head() shows the first 5 rows by default

Out[5]:
LotArea GrLivArea TotalBsmtSF GarageArea SalePrice

0 8450 1710 856 548 208500

1 9600 1262 1262 460 181500

2 11250 1786 920 608 223500

3 9550 1717 756 642 140000

4 14260 2198 1145 836 250000

b) Check the individual statistical description and visualize the statistical


distributions of each of these variables.

Check the Variables Independently


Summary Statistics of houseNumData , followed by Statistical Visualizations on the variables.

In [6]: houseNumData.describe()

Out[6]:
LotArea GrLivArea TotalBsmtSF GarageArea SalePrice

count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000

mean 10516.828082 1515.463699 1057.429452 472.980137 180921.195890

std 9981.264932 525.480383 438.705324 213.804841 79442.502883

min 1300.000000 334.000000 0.000000 0.000000 34900.000000

25% 7553.500000 1129.500000 795.750000 334.500000 129975.000000

50% 9478.500000 1464.000000 991.500000 480.000000 163000.000000

75% 11601.500000 1776.750000 1298.250000 576.000000 214000.000000

max 215245.000000 5642.000000 6110.000000 1418.000000 755000.000000


In [7]: # Draw the distributions of all variables

f, axes = plt.subplots(5, 3, figsize=(15, 5))  # a 5x3 grid of subplots: one row per variable, three plot types

count = 0  # row counter
for var in houseNumData:
    sb.boxplot(data = houseNumData[var], orient = "h", ax = axes[count,0], color = "springgreen")
    sb.histplot(data = houseNumData[var], ax = axes[count,1], color = "mediumorchid")
    sb.violinplot(data = houseNumData[var], orient = "h", ax = axes[count,2], color = "r")
    # sb.violinplot(data = houseNumData[var], orient = "v", ax = axes[count,3], color = "r")
    # (unused; change the grid to (5, 4) if you want this fourth column)
    count += 1  # move to the next row

b) Comment if the distributions look like “Normal Distribution”, or different.


Use the .skew() method to find the “skewness” of each of the five
distributions. Which of the variables has the maximum number of outliers?

Let's count the number of outliers in each variable.

Recall the box-and-whiskers rule for the whisker end-points: points below Q1 - 1.5*IQR or above
Q3 + 1.5*IQR, where IQR = Q3 - Q1, are flagged as outliers.

In [8]: houseNumData.skew()

Out[8]: LotArea 12.207688


GrLivArea 1.366560
TotalBsmtSF 1.524255
GarageArea 0.179981
SalePrice 1.882876
dtype: float64
In [9]: # Calculate the quartiles
Q1 = houseNumData.quantile(0.25)
Q3 = houseNumData.quantile(0.75)

# Rule to identify outliers
rule = ((houseNumData < (Q1 - 1.5 * (Q3 - Q1))) | (houseNumData > (Q3 + 1.5 * (Q3 - Q1))))

# Count the number of outliers
rule.sum()

Out[9]: LotArea 69
GrLivArea 31
TotalBsmtSF 61
GarageArea 21
SalePrice 61
dtype: int64

Check the Relationship amongst Variables

c) Check the relationship amongst the variables using mutual correlation and
the correlation heatmap. Comment which of the variables has the strongest
correlation with “SalePrice”.
Is this useful in predicting "SalePrice"? Below, we compute the mutual correlation between the variables,
and visualize the correlation matrix as a heatmap.

In [10]: # Correlation Matrix


print(houseNumData.corr())

# Heatmap of the Correlation Matrix
f = plt.figure(figsize=(4, 4))
sb.heatmap(houseNumData.corr(), vmin = -1, vmax = 1, linewidths = 1,
annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")

LotArea GrLivArea TotalBsmtSF GarageArea SalePrice


LotArea 1.000000 0.263116 0.260833 0.180403 0.263843
GrLivArea 0.263116 1.000000 0.454868 0.468997 0.708624
TotalBsmtSF 0.260833 0.454868 1.000000 0.486665 0.613581
GarageArea 0.180403 0.468997 0.486665 1.000000 0.623431
SalePrice 0.263843 0.708624 0.613581 0.623431 1.000000

Out[10]: <AxesSubplot:>
d) Check the relationship amongst the variables using mutual jointplots and
an overall pairplot. Comment which of the variables has the strongest linear
relation with “SalePrice”. Is this useful in predicting “SalePrice”?

In [9]: # Draw pairs of variables against one another


sb.pairplot(data = houseNumData)

Out[9]: <seaborn.axisgrid.PairGrid at 0x212809dc610>

Observation : Which variables do you think will help us predict SalePrice in this dataset?

GrLivArea : Possibly the most important variable : Highest Correlation, Strong Linearity
GarageArea and TotalBsmtSF : Important variables : High Correlation, Strong Linearity
LotArea : Doesn't seem so important as a variable : Low Correlation, Weak Linear Relation

Bonus : Attempt a comprehensive analysis with all Numeric variables in the dataset.
Problem 2 : Categorical Variables
Extract the required variables from the dataset, as mentioned in the problem.
MSSubClass , Neighborhood , BldgType , OverallQual

In [34]: houseCatData = pd.DataFrame(houseData[['MSSubClass', 'Neighborhood', 'BldgType', 'OverallQual']])


houseCatData.head()

Out[34]:
MSSubClass Neighborhood BldgType OverallQual

0 60 CollgCr 1Fam 7

1 20 Veenker 1Fam 6

2 60 CollgCr 1Fam 7

3 70 Crawfor 1Fam 7

4 60 NoRidge 1Fam 8

In [31]: houseCatData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null int64
1 Neighborhood 1460 non-null object
2 BldgType 1460 non-null object
3 OverallQual 1460 non-null int64
dtypes: int64(2), object(2)
memory usage: 45.8+ KB

a) Convert each of these variables into “category” data type (note that some
are “int64”, and some are “object”).
Fix the data types of the first four variables to convert them to categorical.

In [32]: houseCatData['MSSubClass'] = houseCatData['MSSubClass'].astype('category')


houseCatData['Neighborhood'] = houseCatData['Neighborhood'].astype('category')
houseCatData['BldgType'] = houseCatData['BldgType'].astype('category')
houseCatData['OverallQual'] = houseCatData['OverallQual'].astype('category')
#Convert to categorical as Python default reads them as int or obj.

In [33]: houseCatData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null category
1 Neighborhood 1460 non-null category
2 BldgType 1460 non-null category
3 OverallQual 1460 non-null category
dtypes: category(4)
memory usage: 7.8 KB
Check the Variables Independently

b) Check the individual statistical description and visualize the distributions


(catplot) of each of these variables ( MSSubClass , Neighborhood , BldgType ,
OverallQual ).
Summary Statistics of houseCatData , followed by Statistical Visualizations on the variables.

In [13]: houseCatData.describe()

Out[13]:
MSSubClass Neighborhood BldgType OverallQual

count 1460 1460 1460 1460

unique 15 25 5 10

top 20 NAmes 1Fam 5

freq 536 225 1220 397

In [14]: ### CATPLOT


sb.catplot(y = 'MSSubClass', data = houseCatData, kind = "count", height = 5)

Out[14]: <seaborn.axisgrid.FacetGrid at 0x212808c4af0>


In [15]: sb.catplot(y = 'Neighborhood', data = houseCatData, kind = "count", height = 6)

Out[15]: <seaborn.axisgrid.FacetGrid at 0x21282f3c310>

In [36]: sb.catplot(y = 'BldgType', data = houseCatData, kind = "count", height = 3)

Out[36]: <seaborn.axisgrid.FacetGrid at 0x21284d43d90>


In [17]: sb.catplot(y = 'OverallQual', data = houseCatData, kind = "count", height = 3)

Out[17]: <seaborn.axisgrid.FacetGrid at 0x21282b9be20>

Check the Relationship amongst Variables

c) One may check the relation amongst two categorical variables through the
bi-variate joint heatmap of counts. Use groupby() command to generate joint
heatmap of counts for “OverallQual” against the other three variables.
Comment if this is useful in identifying the relation between “OverallQual”
with the other variables.
Joint heatmaps of some of the important bi-variate relationships in houseCatData .

In [56]: # Distribution of BldgType across MSSubClass


f = plt.figure(figsize=(8, 3))
sb.heatmap(houseCatData.groupby(['BldgType', 'MSSubClass']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 13}, cmap = "bwr")

Out[56]: <AxesSubplot:xlabel='MSSubClass', ylabel='BldgType'>


In [54]: # Distribution of OverallQual across MSSubClass
f = plt.figure(figsize=(8, 4))
sb.heatmap(houseCatData.groupby(['OverallQual', 'MSSubClass']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 12}, cmap = "Purples")

Out[54]: <AxesSubplot:xlabel='MSSubClass', ylabel='OverallQual'>


In [53]: # Distribution of OverallQual across Neighborhood
f = plt.figure(figsize=(10, 5))
sb.heatmap(houseCatData.groupby(['OverallQual', 'Neighborhood']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 12}, cmap = "Purples")

Out[53]: <AxesSubplot:xlabel='Neighborhood', ylabel='OverallQual'>

In [57]: # Distribution of OverallQual across BldgType


f = plt.figure(figsize=(4, 4))
sb.heatmap(houseCatData.groupby(['OverallQual', 'BldgType']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 13}, cmap = "Blues")

Out[57]: <AxesSubplot:xlabel='BldgType', ylabel='OverallQual'>


In [58]: # self added ###############################################################

housenewData = pd.DataFrame(houseData[['BldgType', 'Street']])
housenewData['BldgType'] = housenewData['BldgType'].astype('category')
housenewData['Street'] = housenewData['Street'].astype('category')
# Distribution of Street across BldgType
f = plt.figure(figsize=(6, 1))
sb.heatmap(housenewData.groupby(['Street', 'BldgType']).size().unstack(),
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 13}, cmap = "autumn")

Out[58]: <AxesSubplot:xlabel='BldgType', ylabel='Street'>

Check the effect of the Variables on SalePrice


Create a joint DataFrame by concatenating SalePrice to houseCatData .

In [60]: saleprice = pd.DataFrame(houseData['SalePrice'])


houseCatSale = pd.concat([houseCatData, saleprice], sort = False, axis = 1).reindex(index = houseCatData.index)
houseCatSale.head()

Out[60]:
MSSubClass Neighborhood BldgType OverallQual SalePrice

0 60 CollgCr 1Fam 7 208500

1 20 Veenker 1Fam 6 181500

2 60 CollgCr 1Fam 7 223500

3 70 Crawfor 1Fam 7 140000

4 60 NoRidge 1Fam 8 250000

d) Draw boxplots of “SalePrice” against each of these categorical variables.


Notice the patterns in these boxplots. Comment on which of these variables
has the most influence in predicting “SalePrice”.

Check the distribution of SalePrice across different MSSubClass .


In [23]: f = plt.figure(figsize=(8, 6))
sb.boxplot(x = 'MSSubClass', y = 'SalePrice', data = houseCatSale)

Out[23]: <AxesSubplot:xlabel='MSSubClass', ylabel='SalePrice'>

Check the distribution of SalePrice across different Neighborhood .


In [24]: f = plt.figure(figsize=(10, 6))
sb.boxplot(x = 'Neighborhood', y = 'SalePrice', data = houseCatSale)
plt.xticks(rotation=90);

Check the distribution of SalePrice across different BldgType .


In [62]: f = plt.figure(figsize=(6, 5))
sb.boxplot(x = 'BldgType', y = 'SalePrice', data = houseCatSale)

Out[62]: <AxesSubplot:xlabel='BldgType', ylabel='SalePrice'>

Check the distribution of SalePrice across different OverallQual .


In [64]: f = plt.figure(figsize=(8, 6))
sb.boxplot(x = 'OverallQual', y = 'SalePrice', data = houseCatSale)

Out[64]: <AxesSubplot:xlabel='OverallQual', ylabel='SalePrice'>

Observation : Which variables do you think will help us predict SalePrice in this dataset?

OverallQual : Definitely the most important variable : Highest variation in SalePrice across the levels
Neighborhood and MSSubClass : Moderately important variables : Medium variation in SalePrice across levels
BldgType : Not clear if important as a variable at all : Not much variation in SalePrice across the levels

Bonus : Attempt a comprehensive analysis with all Categorical variables in the dataset.

Bonus
Perform a similar analysis on every other variable in the dataset, against
“SalePrice”. This will let you gain more insight about the data, and find out
which variables are actually useful in predicting “SalePrice”.


Exercise 4 : Linear Regression

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

In [98]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

In [49]: houseData = pd.read_csv('train.csv')


houseData.head()

Out[49]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns

Problem 1 : Predicting SalePrice using GrLivArea


a) Plot SalePrice against GrLivArea using any appropriate bivariate plot to
note the strong linear relationship.
Plot SalePrice against GrLivArea using standard ScatterPlot/JointPlot.

In [50]: sb.jointplot(data = houseData, x = "GrLivArea", y = "SalePrice", height = 4)

Out[50]: <seaborn.axisgrid.JointGrid at 0x1ce8fddab20>

b) Print the correlation coefficient between these two variables to get
numerical evidence of the relationship.
Check the Correlation Coefficient to confirm the strong linear relationship you observe.

In [51]: houseData.SalePrice.corr(houseData.GrLivArea)

Out[51]: 0.7086244776126522

c) Import Linear Regression model from Scikit-Learn : from


sklearn.linear_model import LinearRegression

Import the LinearRegression model from sklearn.linear_model .

In [52]: # Import LinearRegression model from Scikit-Learn


from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

d) Partition the dataset houseData into two “random” portions : Train Data
(1100 rows) and Test Data (360 rows).
Split the dataset in Train and Test sets, uniformly at random.
Train Set with 1100 samples and Test Set with 360 samples.
In [99]: # Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData['GrLivArea'])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360) # keep 360 rows aside for the Test Set

# Check the sample sizes
print("X Train Set :", X_train.shape, y_train.shape)
print("Y Test Set :", X_test.shape, y_test.shape)

# Create a joint dataframe by concatenating the two variables
trainDF = pd.concat([X_train, y_train], axis = 1).reindex(X_train.index)

X Train Set : (1100, 1) (1100, 1)


Y Test Set : (360, 1) (360, 1)

e) Training : Fit a Linear Regression model on the Train Dataset to predict or estimate SalePrice using GrLivArea.
Fit Linear Regression model on the Training Dataset.

In [100]: linreg.fit(X_train, y_train)

Out[100]: LinearRegression()

f) Print the coefficients of the Linear Regression model you just fit, and plot
the regression line on a scatterplot

Visual Representation of the Linear Regression Model

Check the coefficients of the Linear Regression model you just fit.

In [101]: print('Intercept \t: b = ', linreg.intercept_)


print('Coefficients \t: a = ', linreg.coef_)

Intercept : b = [14000.8848827]
Coefficients : a = [[109.85619063]]

Plot the regression line based on the coefficients-intercept form.


In [102]: # Formula for the Regression line
# (use plain NumPy arrays for plotting, so matplotlib does not trip over the DataFrames)
regline_x = X_train['GrLivArea'].values
regline_y = linreg.intercept_[0] + linreg.coef_[0][0] * regline_x

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(X_train['GrLivArea'], y_train['SalePrice'])
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

In [103]: # SELF ADDED : alternative visualization using Seaborn's regplot,
# which fits and draws the regression line on the Train data directly
ax = sb.regplot(x = X_train, y = y_train, color = "red")

g) Print Explained Variance (R^2) and Mean Squared Error (MSE) on Train
Data to check Goodness of Fit of model.

Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Train Set.


Metric : Explained Variance or R^2 on the Train Set.
In [104]: # Explained Variance is simply the "Score"
print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

Explained Variance (R^2) : 0.5123006711203322

Metric : Mean Squared Error (MSE) on the Train Set.

In [105]: # Import the required metric from sklearn


from sklearn.metrics import mean_squared_error

# Predict the response on the train set
y_train_pred = linreg.predict(X_train)

# Compute MSE on the train set
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_train, y_train_pred))

Mean Squared Error (MSE) : 2935725152.5729613

h) Predict SalePrice in case of Test Data using the Linear Regression model
and the predictor variable GrLivArea

Prediction of Response based on the Predictor

Predict SalePrice given GrLivArea in the Test dataset.

In [106]: # Predict SalePrice values corresponding to GrLivArea


y_test_pred = linreg.predict(X_test)

i) Plot the predictions on a Scatterplot of GrLivArea and SalePrice in the Test Data to visualize model accuracy.

In [107]: # Plot the Predictions on a Scatterplot


f = plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()
j) Print the Mean Squared Error (MSE) on Test Data to check Goodness of Fit
of model, compared to the Training.

Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Test Set.


Metric : Mean Squared Error (MSE) on the Test Set.

In [108]: # Compute MSE on the test set


print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))

Mean Squared Error (MSE) : 3772627647.534909

In [109]: GrLivAreaPredictR2 = linreg.score(X_train, y_train)


GrLivAreaTrainMSE = mean_squared_error(y_train, y_train_pred)
GrLivAreaTestMSE= mean_squared_error(y_test, y_test_pred)
print ("GrLivArea")
print ("=========================================")
print ("Train R^2:", GrLivAreaPredictR2)
print ("Train MSE:", GrLivAreaTrainMSE)
print ("Test MSE:", GrLivAreaTestMSE)
print()
print()

GrLivArea
=========================================
Train R^2: 0.5123006711203322
Train MSE: 2935725152.5729613
Test MSE: 3772627647.534909

Problem 2 : Predicting SalePrice using LotArea

Perform all the above steps on “SalePrice” against each of the variables
“LotArea”, “TotalBsmtSF”, “GarageArea” one by one to perform individual
Linear Regressions and obtain individual univariate Linear Regression
Models in each case.
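The detailed walkthrough for each predictor follows below; as a compact alternative, the same pipeline can also be run in a loop (a sketch, assuming houseData and the sklearn imports from the cells above).

In [ ]: # Sketch : run the same univariate pipeline for each remaining predictor in a loop
for col in ['LotArea', 'TotalBsmtSF', 'GarageArea']:
    y = pd.DataFrame(houseData['SalePrice'])
    X = pd.DataFrame(houseData[col])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
    model = LinearRegression().fit(X_train, y_train)
    print(col)
    print("Train R^2 :", model.score(X_train, y_train))
    print("Train MSE :", mean_squared_error(y_train, model.predict(X_train)))
    print("Test MSE  :", mean_squared_error(y_test, model.predict(X_test)))
    print()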
Check the relationship between the two variables : SalePrice and the Predictor.
In [64]: sb.jointplot(data = houseData, x = "LotArea", y = "SalePrice", height = 4)

Out[64]: <seaborn.axisgrid.JointGrid at 0x1ce8efacd30>

In [65]: houseData.SalePrice.corr(houseData.LotArea)

Out[65]: 0.2638433538714057

Linear Regression on SalePrice vs Predictor

(Random test and train sets)

In [66]: # Import the required function from sklearn


from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData['LotArea'])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

Train Set : (1100, 1) (1100, 1)


Test Set : (360, 1) (360, 1)
In [67]: # Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(X_train, y_train)

Out[67]: LinearRegression()

Visual Representation of the Linear Regression Model

In [68]: import numpy as np


from numpy.polynomial.polynomial import polyfit
import matplotlib.pyplot as plt

In [70]: # SELF ADDED : alternative visualization using Seaborn's regplot,
# which fits and draws the regression line on the Train data directly
ax = sb.regplot(x = X_train, y = y_train, color = "b")
In [71]: print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg.intercept_ + linreg.coef_ * X_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(8, 3))
plt.scatter(X_train, y_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

Intercept : b = [153080.85711237]
Coefficients : a = [[2.49403785]]

Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Train Set.


Metric : Explained Variance or R^2 on the Train Set.
Metric : Mean Squared Error (MSE) on the Train Set.

In [72]: # Explained Variance is simply the "Score"


print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

# Import the required metric from sklearn
from sklearn.metrics import mean_squared_error

# Predict the response on the train set
y_train_pred = linreg.predict(X_train)

# Compute MSE on the train set
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_train, y_train_pred))

Explained Variance (R^2) : 0.07650117075042129


Mean Squared Error (MSE) : 5559037260.985449

Prediction of Response based on the Predictor


In [73]: # Predict SalePrice values corresponding to Predictor
y_test_pred = linreg.predict(X_test)

# Plot the Predictions on a Scatterplot
f = plt.figure(figsize=(16, 3))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Test Set.


Metric : Mean Squared Error (MSE) on the Test Set.

In [74]: # Compute MSE on the test set


print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))

Mean Squared Error (MSE) : 6884271779.626366

In [75]: LotAreaPredictR2 = linreg.score(X_train, y_train)


LotAreaTrainMSE = mean_squared_error(y_train, y_train_pred)
LotAreaTestMSE= mean_squared_error(y_test, y_test_pred)
print ("LotArea")
print ("=========================================")
print ("Train R^2:", LotAreaPredictR2)
print ("Train MSE:", LotAreaTrainMSE)
print ("Test MSE:", LotAreaTestMSE)
print()
print()

LotArea
=========================================
Train R^2: 0.07650117075042129
Train MSE: 5559037260.985449
Test MSE: 6884271779.626366


Problem 2 : Predicting SalePrice using TotalBsmtSF


Check the relationship between the two variables : SalePrice and the Predictor.
In [76]: sb.jointplot(data = houseData, x = "TotalBsmtSF", y = "SalePrice", height = 4)

Out[76]: <seaborn.axisgrid.JointGrid at 0x1ce906f6790>

In [77]: houseData.SalePrice.corr(houseData.TotalBsmtSF)

Out[77]: 0.6135805515591952


Linear Regression on SalePrice vs Predictor

In [78]: # Import the required function from sklearn


from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData['TotalBsmtSF'])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

Train Set : (1100, 1) (1100, 1)


Test Set : (360, 1) (360, 1)

In [79]: # Import LinearRegression model from Scikit-Learn


from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(X_train, y_train)

Out[79]: LinearRegression()

Visual Representation of the Linear Regression Model

In [81]: # SELF ADDED : alternative visualization using Seaborn's regplot,
# which fits and draws the regression line on the Train data directly
ax = sb.regplot(x = X_train, y = y_train, color = "b")

In [27]: print('Intercept \t: b = ', linreg.intercept_)


print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg.intercept_ + linreg.coef_ * X_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

Intercept : b = [67116.87650897]
Coefficients : a = [[107.54551488]]

Goodness of Fit of the Linear Regression Model


Check how good the predictions are on the Train Set.
Metric : Explained Variance or R^2 on the Train Set.
Metric : Mean Squared Error (MSE) on the Train Set.

In [82]: # Explained Variance is simply the "Score"


print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

# Import the required metric from sklearn
from sklearn.metrics import mean_squared_error

# Predict the response on the train set
y_train_pred = linreg.predict(X_train)

# Compute MSE on the train set
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_train, y_train_pred))

Explained Variance (R^2) : 0.396575160056191


Mean Squared Error (MSE) : 3632339384.9644613


Prediction of Response based on the Predictor

In [83]: # Predict SalePrice values corresponding to Predictor


y_test_pred = linreg.predict(X_test)

# Plot the Predictions on a Scatterplot
f = plt.figure(figsize=(15, 3))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()


Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Test Set.


Metric : Mean Squared Error (MSE) on the Test Set.

In [84]: # Compute MSE on the test set


print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))

Mean Squared Error (MSE) : 4889229988.232506


In [85]: TotalBsmtSFPredictR2 = linreg.score(X_train, y_train)
TotalBsmtSFTrainMSE = mean_squared_error(y_train, y_train_pred)
TotalBsmtSFTestMSE= mean_squared_error(y_test, y_test_pred)
print ("TotalBsmtSF")
print ("=========================================")
print ("Train R^2:", TotalBsmtSFPredictR2)
print ("Train MSE:", TotalBsmtSFTrainMSE)
print ("Test MSE:", TotalBsmtSFTestMSE)
print()
print()

TotalBsmtSF
=========================================
Train R^2: 0.396575160056191
Train MSE: 3632339384.9644613
Test MSE: 4889229988.232506


Problem 2 : Predicting SalePrice using GarageArea


Check the relationship between the two variables : SalePrice and the Predictor.

In [86]: sb.jointplot(data = houseData, x = "GarageArea", y = "SalePrice", height = 4)

Out[86]: <seaborn.axisgrid.JointGrid at 0x1ce9198a4f0>

In [87]: houseData.SalePrice.corr(houseData.GarageArea)

Out[87]: 0.6234314389183618


Linear Regression on SalePrice vs Predictor


In [88]: # Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData['GarageArea'])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

Train Set : (1100, 1) (1100, 1)


Test Set : (360, 1) (360, 1)

In [89]: # Import LinearRegression model from Scikit-Learn


from sklearn.linear_model import LinearRegression

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(X_train, y_train)

Out[89]: LinearRegression()


Visual Representation of the Linear Regression Model

In [97]: # SELF ADDED : alternative visualization using Seaborn's regplot,
# which fits and draws the regression line on the Train data directly
ax = sb.regplot(x = X_train, y = y_train, color = "b")
In [91]: print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

# Formula for the Regression line
regline_x = X_train
regline_y = linreg.intercept_ + linreg.coef_ * X_train

# Plot the Linear Regression line
f, axes = plt.subplots(1, 1, figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.plot(regline_x, regline_y, 'r-', linewidth = 3)
plt.show()

Intercept : b = [69930.06931615]
Coefficients : a = [[231.73496835]]


Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Train Set.


Metric : Explained Variance or R^2 on the Train Set.
Metric : Mean Squared Error (MSE) on the Train Set.

In [92]: # Explained Variance is simply the "Score"


print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

# Import the required metric from sklearn
from sklearn.metrics import mean_squared_error

# Predict the response on the train set
y_train_pred = linreg.predict(X_train)

# Compute MSE on the train set
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_train, y_train_pred))

Explained Variance (R^2) : 0.39437068162335176


Mean Squared Error (MSE) : 3645609328.965528


Prediction of Response based on the Predictor


In [93]: # Predict SalePrice values corresponding to Predictor
y_test_pred = linreg.predict(X_test)

# Plot the Predictions on a Scatterplot
f = plt.figure(figsize=(16, 4))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

Goodness of Fit of the Linear Regression Model

Check how good the predictions are on the Test Set.


Metric : Mean Squared Error (MSE) on the Test Set.

In [94]: # Compute MSE on the test set


print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))

Mean Squared Error (MSE) : 4504815332.85322

In [95]: GarageAreaPredictR2 = linreg.score(X_train, y_train)


GarageAreaTrainMSE = mean_squared_error(y_train, y_train_pred)
GarageAreaTestMSE= mean_squared_error(y_test, y_test_pred)
print ("GarageArea")
print ("=========================================")
print ("Train R^2:", GarageAreaPredictR2)
print ("Train MSE:", GarageAreaTrainMSE)
print ("Test MSE:", GarageAreaTestMSE)
print()

GarageArea
=========================================
Train R^2: 0.39437068162335176
Train MSE: 3645609328.965528
Test MSE: 4504815332.85322

Problem 3 : Comparing the Uni-Variate Linear Models

Compare and contrast the four models in terms of Explained Variance (R^2)
and Mean Squared Error (MSE) on Train Data, the accuracy of prediction on
Test Data, and comment on which model you think is the best to predict
“SalePrice”.

Original values:
GrLivArea R^2: 0.5340296119421228 Train MSE: 3059483255 Test MSE: 3440839968

LotArea R^2: 0.05694364321108347 Train MSE: 5746184372 Test MSE: 6312628036

TotalBsmtSF R^2: 0.35978297283766747 Train MSE: 4036132693 Test MSE: 3625382634

GarageArea R^2: 0.3856595104936924 Train MSE: 3682132085 Test MSE: 4407653993

Compare and contrast the four models in terms of R^2 and MSE on Train Data, as well as MSE on Test
Data.

SalePrice vs GrLivArea has the best Explained Variance (R^2) out of the four models.
SalePrice vs LotArea has the worst Explained Variance (R^2) out of the four models.
Naturally, the model with GrLivArea is the best one in terms of just the Training accuracy.

We also find SalePrice vs GrLivArea has the minimum MSE on both the Train and Test Sets
compared to other models.
We also find SalePrice vs LotArea has the maximum MSE on both the Train and Test Sets
compared to other models.
Naturally, the model with GrLivArea is the best one in terms of Test accuracy as evident from MSE
(error) on the Test Set.

So, overall, the predictor GrLivArea is the best amongst the four in predicting SalePrice .

Did you notice? : Go back and check again the R^2 and MSE values for the four models. I am pretty sure
you did not get the exact same values as I did. This is due to the random selection of Train-Test sets. In
fact, if you run the above cells again, you will get a different set of R^2 and MSE values. If that is so, can
we really be confident that GrLivArea will always be the best variable to predict SalePrice ? Think
about it. ;-)
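One way to check this is to repeat the random split several times and see whether the ranking of the four predictors is stable; a minimal sketch, assuming houseData and the imports from the cells above (the random_state values are arbitrary).

In [ ]: # Sketch : repeat the Train-Test split with several seeds and compare Test MSE
for seed in range(5):
    for col in ['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']:
        X_train, X_test, y_train, y_test = train_test_split(
            pd.DataFrame(houseData[col]), pd.DataFrame(houseData['SalePrice']),
            test_size = 360, random_state = seed)
        model = LinearRegression().fit(X_train, y_train)
        print("Seed", seed, "|", col, "Test MSE :", mean_squared_error(y_test, model.predict(X_test)))
    print()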
In [110]: print ("GrLivArea \n =========================================")
print ("Train R^2:", GrLivAreaPredictR2)
print ("Train MSE:", GrLivAreaTrainMSE)
print ("Test MSE:", GrLivAreaTestMSE)
print()

print ("LotArea \n =========================================")
print ("Train R^2:", LotAreaPredictR2)
print ("Train MSE:", LotAreaTrainMSE)
print ("Test MSE:", LotAreaTestMSE)
print()

print ("TotalBsmtSF \n =========================================")
print ("Train R^2:", TotalBsmtSFPredictR2)
print ("Train MSE:", TotalBsmtSFTrainMSE)
print ("Test MSE:", TotalBsmtSFTestMSE)
print()

print ("GarageArea \n =========================================")
print ("Train R^2:", GarageAreaPredictR2)
print ("Train MSE:", GarageAreaTrainMSE)
print ("Test MSE:", GarageAreaTestMSE)
print()
print ("Root of MSE is standard deviation")

GrLivArea
=========================================
Train R^2: 0.5123006711203322
Train MSE: 2935725152.5729613
Test MSE: 3772627647.534909

LotArea
=========================================
Train R^2: 0.07650117075042129
Train MSE: 5559037260.985449
Test MSE: 6884271779.626366

TotalBsmtSF
=========================================
Train R^2: 0.396575160056191
Train MSE: 3632339384.9644613
Test MSE: 4889229988.232506

GarageArea
=========================================
Train R^2: 0.39437068162335176
Train MSE: 3645609328.965528
Test MSE: 4504815332.85322

Root of MSE (RMSE) gives the typical prediction error in the same units as SalePrice
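Since the RMSE is in the same units as SalePrice, it is often easier to interpret than the raw MSE; a quick sketch using the Test MSE values stored above.

In [ ]: # Sketch : convert the stored Test MSE values to RMSE (same units as SalePrice)
for name, mse in [("GrLivArea", GrLivAreaTestMSE), ("LotArea", LotAreaTestMSE),
                  ("TotalBsmtSF", TotalBsmtSFTestMSE), ("GarageArea", GarageAreaTestMSE)]:
    print(name, "Test RMSE :", np.sqrt(mse))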

Extra : Predicting SalePrice using Multiple Variables

1. Note that LinearRegression() model can take more than one Predictor to model the Response
variable. Try using this feature to fit a Linear Regression model to predict “SalePrice” using all the
four variables “GrLivArea” , “LotArea” , “TotalBsmtSF” , and “GarageArea” . Print the Explained
Variance (R^2) of this multi-variate model on Train Data, and check the model’s accuracy of
prediction on the Test Data using Mean Squared Error (MSE).

Extract the required variables from the dataset, and then perform Multi-Variate Regression.
In [109]: # Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors ####################
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData[['GrLivArea','LotArea','TotalBsmtSF','GarageArea']])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(X_train, y_train)

Train Set : (1100, 4) (1100, 1)


Test Set : (360, 4) (360, 1)

Out[109]: LinearRegression()

Coefficients of the Linear Regression Model

Note that you CANNOT visualize the model as a line on a 2D plot, as it is a multi-dimensional surface.

In [110]: print('Intercept \t: b = ', linreg.intercept_)


print('Coefficients \t: a = ', linreg.coef_)

Intercept : b = [-35587.22778885]
Coefficients : a = [[70.53200529 0.26914264 56.88385943 98.63075464]]


Prediction of Response based on the Predictor


In [41]: # Predict SalePrice values corresponding to Predictors
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(24, 12))
axes[0].scatter(y_train, y_train_pred, color = "blue")
axes[0].plot(y_train, y_train, 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(y_test, y_test_pred, color = "green")
axes[1].plot(y_test, y_test, 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

Train vs Test : True values vs Predicted values of SalePrice (plots above).

Goodness of Fit of the Linear Regression Model


In [111]: print("Explained Variance (R^2) on Train Set \t:", linreg.score(X_train, y_train))
print("Mean Squared Error (MSE) on Train Set \t:", mean_squared_error(y_train, y_train_p
print("Mean Squared Error (MSE) on Test Set \t:", mean_squared_error(y_test, y_test_pred

Explained Variance (R^2) on Train Set : 0.6960698417397807


Mean Squared Error (MSE) on Train Set : 9479255105.962603
Mean Squared Error (MSE) on Test Set : 3772627647.534909

Observation : The model with SalePrice against all the four variables GrLivArea , LotArea ,
TotalBsmtSF , GarageArea is definitely better!

2. Fit a Linear Regression model to predict “SalePrice” using all the numeric variables in the given
dataset. You may use all the numeric variables from Exercise 2. Print the Explained Variance (R^2) of
this multi-variate model on Train Data, and check the model’s accuracy of prediction on the Test Data
using Mean Squared Error (MSE).

3. Is the Explained Variance (R^2) of a multi-variate model equal to the Sum of Explained Variances
(R^2) of the component univariate models? If R^2 for “SalePrice” vs “GrLivArea” is 0.53 and R^2 of
“SalePrice” vs “LotArea” is 0.22, will the R^2 for “SalePrice” vs [“GrLivArea”, “LotArea”] be 0.75?
Experiment a little and think about it.
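A quick experiment for question 3 (a sketch, assuming houseData is loaded) : compare the Train R^2 of the two univariate models with that of the bivariate model on one common split. Since GrLivArea and LotArea are themselves correlated, the combined R^2 is generally not the sum of the individual values.

In [ ]: # Sketch for question 3 : compare univariate R^2 values with the bivariate R^2
X_train, X_test, y_train, y_test = train_test_split(
    houseData[['GrLivArea', 'LotArea']], houseData['SalePrice'],
    test_size = 360, random_state = 42)   # fixed seed, so all three models see the same split

r2_grliv = LinearRegression().fit(X_train[['GrLivArea']], y_train).score(X_train[['GrLivArea']], y_train)
r2_lot   = LinearRegression().fit(X_train[['LotArea']], y_train).score(X_train[['LotArea']], y_train)
r2_both  = LinearRegression().fit(X_train, y_train).score(X_train, y_train)

print("R^2 GrLivArea alone \t:", r2_grliv)
print("R^2 LotArea alone \t:", r2_lot)
print("R^2 both together \t:", r2_both)   # compare with r2_grliv + r2_lot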

In [ ]: ######################################################################################

In [256]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [257]: houseData = pd.read_csv('train.csv')


houseData.head()

Out[257]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns
In [264]: houseDataNum = houseData.select_dtypes(include = np.int64)
print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64

Data dims : (1460, 35)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotArea 1460 non-null int64
3 OverallQual 1460 non-null int64
4 OverallCond 1460 non-null int64
5 YearBuilt 1460 non-null int64
6 YearRemodAdd 1460 non-null int64
7 BsmtFinSF1 1460 non-null int64
8 BsmtFinSF2 1460 non-null int64
9 BsmtUnfSF 1460 non-null int64
10 TotalBsmtSF 1460 non-null int64
11 1stFlrSF 1460 non-null int64
12 2ndFlrSF 1460 non-null int64
13 LowQualFinSF 1460 non-null int64
14 GrLivArea 1460 non-null int64
15 BsmtFullBath 1460 non-null int64
16 BsmtHalfBath 1460 non-null int64
17 FullBath 1460 non-null int64
18 HalfBath 1460 non-null int64
19 BedroomAbvGr 1460 non-null int64
20 KitchenAbvGr 1460 non-null int64
21 TotRmsAbvGrd 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 GarageArea 1460 non-null int64
25 WoodDeckSF 1460 non-null int64
26 OpenPorchSF 1460 non-null int64
27 EnclosedPorch 1460 non-null int64
28 3SsnPorch 1460 non-null int64
29 ScreenPorch 1460 non-null int64
30 PoolArea 1460 non-null int64
31 MiscVal 1460 non-null int64
32 MoSold 1460 non-null int64
33 YrSold 1460 non-null int64
34 SalePrice 1460 non-null int64
dtypes: int64(35)
memory usage: 399.3 KB
In [273]: # Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression

# Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors ####################
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData[['MSSubClass','LotArea', 'YearBuilt']])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

# Create a Linear Regression object
linreg = LinearRegression()

# Train the Linear Regression model
linreg.fit(X_train, y_train)

Train Set : (1100, 3) (1100, 1)


Test Set : (360, 3) (360, 1)

Out[273]: LinearRegression()
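The cell above uses only three predictors; to address the extra question with all the numeric variables, one option (a sketch, assuming houseDataNum from the cell above) is to take every column of houseDataNum except Id and the response SalePrice as the predictor matrix.

In [ ]: # Sketch : Linear Regression on SalePrice using all numeric predictors in houseDataNum
from sklearn.metrics import mean_squared_error

y = pd.DataFrame(houseDataNum['SalePrice'])
X = houseDataNum.drop(columns = ['Id', 'SalePrice'])   # every numeric column except Id and the response

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

print("Explained Variance (R^2) on Train Set \t:", linreg.score(X_train, y_train))
print("Mean Squared Error (MSE) on Test Set \t:", mean_squared_error(y_test, linreg.predict(X_test)))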

Exercise 5 : Classification Tree

Essential Libraries
Let us begin by importing the essential Python Libraries.

NumPy : Library for Numeric Computations in Python


Pandas : Library for Data Acquisition and Preparation
Matplotlib : Low-level library for Data Visualization
Seaborn : Higher-level library for Data Visualization

In [35]: # Basic Libraries


import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

from scipy import stats

sb.set() # set the default Seaborn style for graphics

Setup : Import the Dataset


Dataset from Kaggle : The "House Prices" competition
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
(https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.

In [2]: houseData = pd.read_csv('train.csv')


houseData.head()

Out[2]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P

0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ...

1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ...

2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ...

3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ...

4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ...

5 rows × 81 columns
Problem 1 : Predicting CentralAir using SalePrice

a) Plot the distribution of CentralAir to check the imbalance of Y against N.


Print the ratio of the classes Y : N.
Explore the variable CentralAir from the dataset, as mentioned in the problem.

In [179]: houseData['CentralAir'].describe()

Out[179]: count 1460


unique 2
top Y
freq 1365
Name: CentralAir, dtype: object

Check the catplot for CentralAir , to visually understand the distribution.

In [180]: sb.catplot(y = 'CentralAir', data = houseData, kind = "count")

Out[180]: <seaborn.axisgrid.FacetGrid at 0x1f60d1bc730>

Print the ratio Y : N for CentralAir to check the imbalance in the classes.

In [181]: countY, countN = houseData.CentralAir.value_counts()


print("Ratio of classes is Y : N = ", countY, ":", countN)

Ratio of classes is Y : N = 1365 : 95

b) Plot CentralAir against SalePrice using any appropriate bivariate plot to note the mutual relationship.

Plot CentralAir against SalePrice to visualize their mutual relationship.


In [182]: f = plt.figure(figsize=(16, 3))
sb.boxplot(x = 'SalePrice', y = 'CentralAir', data = houseData)

Out[182]: <AxesSubplot:xlabel='SalePrice', ylabel='CentralAir'>

Good to note that the two boxplots for SalePrice , for CentralAir = Y and CentralAir = N , are
different from one another in terms of their median value, as well as spread. This means that CentralAir
has an effect on SalePrice , and hence, SalePrice will probably be an important variable in predicting
CentralAir . Boxplots do not tell us where to make the cuts though -- it will be easier to visualize in the
following swarmplot .

SWARMPLOT

In [183]: f = plt.figure(figsize=(16, 4))


sb.swarmplot(x = 'SalePrice', y = 'CentralAir', data = houseData)

C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:
39.7% of the points cannot be placed; you may want to decrease the size of the markers
or use stripplot.
warnings.warn(msg, UserWarning)

Out[183]: <AxesSubplot:xlabel='SalePrice', ylabel='CentralAir'>

Hmm, it seems that swarmplot asks you to decrease the size of the markers or use stripplot . What is
this stripplot anyway? Let's check.
In [184]: f = plt.figure(figsize=(16, 4))
sb.stripplot(x = 'SalePrice', y = 'CentralAir', data = houseData)

Out[184]: <AxesSubplot:xlabel='SalePrice', ylabel='CentralAir'>

c) Import Classification Tree model from Scikit-Learn : from sklearn.tree import DecisionTreeClassifier

Now it's time to build the Decision Tree classifier. Import the DecisionTreeClassifier model from
sklearn.tree .

In [185]: # Import Decision Tree Classifier model from Scikit-Learn


from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree Classifier object
# you can change the max_depth as you wish
dectree = DecisionTreeClassifier(max_depth = 2)

d) Partition the dataset houseData into two “random” portions : Train Data
(1100 rows) and Test Data (360 rows)
Split the dataset in Train and Test sets, uniformly at random.
Train Set with 1100 samples and Test Set with 360 samples.
In [186]: # Import the required function from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['SalePrice'])

# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)

Train Set : (1100, 1) (1100, 1)


Test Set : (360, 1) (360, 1)

e) Training : Fit a Decision Tree model on the Train Dataset to predict the
class (Y/N) of CentralAir using SalePrice.
Fit Decision Tree Classifier model on the Train Dataset.

In [99]: dectree.fit(X_train, y_train) #(X, Y)

Out[99]: DecisionTreeClassifier(max_depth=2)

f) Visualize the Decision Tree model using the plot_tree function : from
sklearn.tree import plot_tree

Visual Representation of the Decision Tree Model


In [188]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[188]: [Text(0.5, 0.8333333333333334, 'SalePrice <= 107750.0\ngini = 0.124\nsamples = 1100\nv


alue = [73, 1027]\nclass = Y'),
Text(0.25, 0.5, 'SalePrice <= 61691.5\ngini = 0.473\nsamples = 120\nvalue = [46, 74]
\nclass = Y'),
Text(0.125, 0.16666666666666666, 'gini = 0.219\nsamples = 8\nvalue = [7, 1]\nclass =
N'),
Text(0.375, 0.16666666666666666, 'gini = 0.454\nsamples = 112\nvalue = [39, 73]\nclas
s = Y'),
Text(0.75, 0.5, 'SalePrice <= 145125.0\ngini = 0.054\nsamples = 980\nvalue = [27, 95
3]\nclass = Y'),
Text(0.625, 0.16666666666666666, 'gini = 0.134\nsamples = 305\nvalue = [22, 283]\ncla
ss = Y'),
Text(0.875, 0.16666666666666666, 'gini = 0.015\nsamples = 675\nvalue = [5, 670]\nclas
s = Y')]

g) Predict CentralAir for the train dataset using the Decision Tree model and
plot the Two-Way Confusion Matrix

Prediction on Train Data and Goodness of Fit

Check how good the predictions are on the Train Set.


Metrics : Classification Accuracy and Confusion Matrix .
In [189]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})

Out[189]: <AxesSubplot:>

h) Print accuracy measures of the Decision Tree model, including its Classification Accuracy,
True Positive Rate ( TPR ), True Negative Rate ( TNR ), False Positive Rate ( FPR ) and
False Negative Rate ( FNR ), based on the confusion matrix on train data.
Print the Classification Accuracy and all other Accuracy Measures from the Confusion Matrix.

Confusion Matrix

                      Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)   TN                       FP
Actual Positive (1)   FN                       TP

* TPR = TP / (TP + FN) : True Positive Rate = True Positives / All Positives
* TNR = TN / (TN + FP) : True Negative Rate = True Negatives / All Negatives
* FPR = FP / (TN + FP) : False Positive Rate = False Positives / All Negatives
* FNR = FN / (TP + FN) : False Negative Rate = False Negatives / All Positives
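The same four rates are computed by hand several times in the cells below; a small helper function (a sketch, relying on the tn, fp, fn, tp ordering returned by confusion_matrix(...).ravel()) keeps the arithmetic in one place.

In [ ]: # Sketch : helper to compute the four rates from the 2x2 confusion matrix
# rows = actual N, Y ; columns = predicted N, Y (as returned by confusion_matrix)
def classification_rates(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (tn + fp), "FNR": fn / (tp + fn)}

# e.g. classification_rates(y_train, y_train_pred)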
In [190]: # Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the CONFUSION MATRIX
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

Train Data
Accuracy : 0.9390909090909091

TPR Train : 0.9990262901655307


TNR Train : 0.0958904109589041

FPR Train : 0.9041095890410958


FNR Train : 0.0009737098344693282

In [191]: SalePriceAccuracyTrain = dectree.score(X_train, y_train)


SalePriceTPRTrain = tpTrain/(tpTrain + fnTrain)
SalePriceTNRTrain = tnTrain/(tnTrain + fpTrain)
SalePriceFPRTrain = fpTrain/(tnTrain + fpTrain)
SalePriceFNRTrain = fnTrain/(tpTrain + fnTrain)

print ("Sale price Train:\n ````````````````````````````````````````````````````````````
print("Accuracy :\t",SalePriceAccuracyTrain)
print("TPRTrain: \t", SalePriceTPRTrain)
print("TNRTrain: \t", SalePriceTNRTrain)
print("FPRTrain: \t", SalePriceFPRTrain)
print("FNRTrain: \t", SalePriceFNRTrain)

Sale price Train:


```````````````````````````````````````````````````````````````
Accuracy : 0.9390909090909091
TPRTrain: 0.9990262901655307
TNRTrain: 0.0958904109589041
FPRTrain: 0.9041095890410958
FNRTrain: 0.0009737098344693282

i) Predict CentralAir for the test dataset using the Decision Tree model and
plot the Two-Way Confusion Matrix.

Prediction on Test Data and Goodness of Fit

Check how good the predictions are on the Test Set. Metrics : Classification Accuracy and Confusion
Matrix.
In [192]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test) #X_test is used here

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), #y_test is used here
annot = True, fmt=".0f", annot_kws={"size": 58})

Out[192]: <AxesSubplot:>

Print the Classification Accuracy and all other Accuracy Measures from the Confusion Matrix.

Confusion Matrix

                      Predicted Negative (0)   Predicted Positive (1)
Actual Negative (0)   TN                       FP
Actual Positive (1)   FN                       TP

* TPR = TP / (TP + FN) : True Positive Rate = True Positives / All Positives
* TNR = TN / (TN + FP) : True Negative Rate = True Negatives / All Negatives
* FPR = FP / (TN + FP) : False Positive Rate = False Positives / All Negatives
* FNR = FN / (TP + FN) : False Negative Rate = False Negatives / All Positives
In [193]: # Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

Test Data
Accuracy : 0.9527777777777777

TPR Test : 0.9970414201183432


TNR Test : 0.2727272727272727

FPR Test : 0.7272727272727273


FNR Test : 0.0029585798816568047

Important : Note the huge imbalance in the False Positives and False Negatives in the confusion matrix.
False Positives are much higher in number than False Negatives in both Train and Test data. This is not
surprising -- actually, this is a direct effect of the huge Y vs N class imbalance in the response variable
CentralAir . As CentralAir = Y was more likely in the data, False Positives are more likely too. Think
about how you can fix it!
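One common way to address this imbalance (a sketch, not what this notebook does) is to re-weight the classes while training the tree, via the class_weight parameter of DecisionTreeClassifier, so that the minority class N contributes more to the splits.

In [ ]: # Sketch : re-weight the classes during training, so the minority class N is not ignored
dectree_bal = DecisionTreeClassifier(max_depth = 2, class_weight = 'balanced')
dectree_bal.fit(X_train, y_train)

# Confusion matrix on the Test Set for the re-weighted tree
y_test_pred_bal = dectree_bal.predict(X_test)
sb.heatmap(confusion_matrix(y_test, y_test_pred_bal),
           annot = True, fmt = ".0f", annot_kws = {"size": 18})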
In [194]: SalePriceAccuracyTrain = dectree.score(X_train, y_train)
SalePriceTPRTrain = tpTrain/(tpTrain + fnTrain)
SalePriceTNRTrain = tnTrain/(tnTrain + fpTrain)
SalePriceFPRTrain = fpTrain/(tnTrain + fpTrain)
SalePriceFNRTrain = fnTrain/(tpTrain + fnTrain)

print ("Sale price Train:\n ````````````````````````````````````````````````````````````
print("Accuracy :\t",SalePriceAccuracyTrain)
print("TPRTrain: \t", SalePriceTPRTrain)
print("TNRTrain: \t", SalePriceTNRTrain)
print("FPRTrain: \t", SalePriceFPRTrain)
print("FNRTrain: \t", SalePriceFNRTrain)

SalePriceAccuracyTest = dectree.score(X_test, y_test)
SalePriceTPRTest = tpTest/(tpTest + fnTest)
SalePriceTNRTest = tnTest/(tnTest + fpTest)
SalePriceFPRTest = fpTest/(tnTest + fpTest)
SalePriceFNRTest = fnTest/(tpTest + fnTest)

print ("Sale price Test:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",SalePriceAccuracyTest)
print("TPRTest: \t", SalePriceTPRTest)
print("TNRTest: \t", SalePriceTNRTest)
print("FPRTest: \t", SalePriceFPRTest)
print("FNRTest: \t", SalePriceFNRTest)

Sale price Train:


```````````````````````````````````````````````````````````````
Accuracy : 0.9390909090909091
TPRTrain: 0.9990262901655307
TNRTrain: 0.0958904109589041
FPRTrain: 0.9041095890410958
FNRTrain: 0.0009737098344693282
Sale price Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9527777777777777
TPRTest: 0.9970414201183432
TNRTest: 0.2727272727272727
FPRTest: 0.7272727272727273
FNRTest: 0.0029585798816568047

Problem 2 : Predicting CentralAir using Other Variables


Use the other variables from the dataset to predict CentralAir , as mentioned in the problem.
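Before the detailed walkthrough below, a quick overview (a sketch, assuming houseData and the sklearn imports from above) : fit the same depth-2 tree for each candidate predictor in a loop and compare the accuracies.

In [ ]: # Sketch : fit a depth-2 tree on CentralAir for each candidate predictor in a loop
for col in ['GrLivArea', 'OverallQual', 'YearBuilt']:
    y = pd.DataFrame(houseData['CentralAir'])
    X = pd.DataFrame(houseData[col])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
    tree = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)
    print(col, ": Train Accuracy =", tree.score(X_train, y_train),
          "| Test Accuracy =", tree.score(X_test, y_test))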

Predicting CentralAir using GrLivArea


In [247]: # Plot Response against Predictor to visualize their mutual relationship.
f = plt.figure(figsize=(16, 3))
sb.swarmplot(x = 'GrLivArea', y = 'CentralAir', data = houseData)

C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:
50.3% of the points cannot be placed; you may want to decrease the size of the markers
or use stripplot.
warnings.warn(msg, UserWarning)

Out[247]: <AxesSubplot:xlabel='GrLivArea', ylabel='CentralAir'>

In [248]: f = plt.figure(figsize=(16, 4))


sb.violinplot(x = 'GrLivArea', y = 'CentralAir', data = houseData)

Out[248]: <AxesSubplot:xlabel='GrLivArea', ylabel='CentralAir'>

In [249]: # Import essential models and functions from sklearn


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['GrLivArea'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model

Out[249]: DecisionTreeClassifier(max_depth=2)
In [250]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(7,7))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[250]: [Text(0.4, 0.8333333333333334, 'GrLivArea <= 562.5\ngini = 0.133\nsamples = 1100\nvalu


e = [79, 1021]\nclass = Y'),
Text(0.2, 0.5, 'gini = 0.0\nsamples = 3\nvalue = [3, 0]\nclass = N'),
Text(0.6, 0.5, 'GrLivArea <= 1045.5\ngini = 0.129\nsamples = 1097\nvalue = [76, 1021]
\nclass = Y'),
Text(0.4, 0.16666666666666666, 'gini = 0.251\nsamples = 204\nvalue = [30, 174]\nclass
= Y'),
Text(0.8, 0.16666666666666666, 'gini = 0.098\nsamples = 893\nvalue = [46, 847]\nclass
= Y')]

In [251]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})

Train Data
Accuracy : 0.9309090909090909

TPR Train : 1.0


TNR Train : 0.0379746835443038

FPR Train : 0.9620253164556962


FNR Train : 0.0

Out[251]: <AxesSubplot:>

In [253]: # Import the required metric from sklearn


from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})

Test Data
Accuracy : 0.9583333333333334

TPR Test : 1.0


TNR Test : 0.0625

FPR Test : 0.9375


FNR Test : 0.0

Out[253]: <AxesSubplot:>
In [254]: GrLivAreaAccuracyTrain = dectree.score(X_train, y_train)
GrLivAreaTPRTrain = tpTrain/(tpTrain + fnTrain)
GrLivAreaTNRTrain = tnTrain/(tnTrain + fpTrain)
GrLivAreaFPRTrain = fpTrain/(tnTrain + fpTrain)
GrLivAreaFNRTrain = fnTrain/(tpTrain + fnTrain)

print ("GrLivArea Train:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",GrLivAreaAccuracyTrain)
print("TPRTrain: \t", GrLivAreaTPRTrain)
print("TNRTrain: \t", GrLivAreaTNRTrain)
print("FPRTrain: \t", GrLivAreaFPRTrain)
print("FNRTrain: \t", GrLivAreaFNRTrain)
print ()
GrLivAreaAccuracyTest = dectree.score(X_test, y_test)
GrLivAreaTPRTest = tpTest/(tpTest + fnTest)
GrLivAreaTNRTest = tnTest/(tnTest + fpTest)
GrLivAreaFPRTest = fpTest/(tnTest + fpTest)
GrLivAreaFNRTest = fnTest/(tpTest + fnTest)

print ("GrLivArea Test:\n ``````````````````````````````````````````````````````````````
print("Accuracy :\t",GrLivAreaAccuracyTest)
print("TPRTest: \t", GrLivAreaTPRTest)
print("TNRTest: \t", GrLivAreaTNRTest)
print("FPRTest: \t", GrLivAreaFPRTest)
print("FNRTest: \t", GrLivAreaFNRTest)

GrLivArea Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9309090909090909
TPRTrain: 1.0
TNRTrain: 0.0379746835443038
FPRTrain: 0.9620253164556962
FNRTrain: 0.0

GrLivArea Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9583333333333334
TPRTest: 1.0
TNRTest: 0.0625
FPRTest: 0.9375
FNRTest: 0.0


Predicting CentralAir using OverallQual


In [258]: # Plot Response against Predictor to visualize their mutual relationship.
f = plt.figure(figsize=(16, 4))
sb.swarmplot(x = 'OverallQual', y = 'CentralAir', data = houseData)

C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:
90.6% of the points cannot be placed; you may want to decrease the size of the markers
or use stripplot.
warnings.warn(msg, UserWarning)
C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:
23.2% of the points cannot be placed; you may want to decrease the size of the markers
or use stripplot.
warnings.warn(msg, UserWarning)

Out[258]: <AxesSubplot:xlabel='OverallQual', ylabel='CentralAir'>

In [259]: f = plt.figure(figsize=(16, 4))


sb.violinplot(x = 'OverallQual', y = 'CentralAir', data = houseData)

Out[259]: <AxesSubplot:xlabel='OverallQual', ylabel='CentralAir'>

In [260]: # Import essential models and functions from sklearn


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['OverallQual'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model

Out[260]: DecisionTreeClassifier(max_depth=2)
In [261]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(8,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[261]: [Text(0.5, 0.8333333333333334, 'OverallQual <= 3.5\ngini = 0.118\nsamples = 1100\nvalu


e = [69, 1031]\nclass = Y'),
Text(0.25, 0.5, 'OverallQual <= 2.5\ngini = 0.48\nsamples = 15\nvalue = [9, 6]\nclass
= N'),
Text(0.125, 0.16666666666666666, 'gini = 0.0\nsamples = 2\nvalue = [2, 0]\nclass =
N'),
Text(0.375, 0.16666666666666666, 'gini = 0.497\nsamples = 13\nvalue = [7, 6]\nclass =
N'),
Text(0.75, 0.5, 'OverallQual <= 5.5\ngini = 0.104\nsamples = 1085\nvalue = [60, 1025]
\nclass = Y'),
Text(0.625, 0.16666666666666666, 'gini = 0.188\nsamples = 391\nvalue = [41, 350]\ncla
ss = Y'),
Text(0.875, 0.16666666666666666, 'gini = 0.053\nsamples = 694\nvalue = [19, 675]\ncla
ss = Y')]

In [262]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})

Train Data
Accuracy : 0.94

TPR Train : 0.9941804073714839


TNR Train : 0.13043478260869565

FPR Train : 0.8695652173913043


FNR Train : 0.005819592628516004

Out[262]: <AxesSubplot:>

In [263]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})

Test Data
Accuracy : 0.9388888888888889

TPR Test : 0.9910179640718563


TNR Test : 0.2692307692307692

FPR Test : 0.7307692307692307


FNR Test : 0.008982035928143712

Out[263]: <AxesSubplot:>
In [264]: OverallQualAccuracyTrain = dectree.score(X_train, y_train)
OverallQualTPRTrain = tpTrain/(tpTrain + fnTrain)
OverallQualTNRTrain = tnTrain/(tnTrain + fpTrain)
OverallQualFPRTrain = fpTrain/(tnTrain + fpTrain)
OverallQualFNRTrain = fnTrain/(tpTrain + fnTrain)

print ("OverallQual Train:\n ```````````````````````````````````````````````````````````
print("Accuracy :\t",OverallQualAccuracyTrain)
print("TPRTrain: \t", OverallQualTPRTrain)
print("TNRTrain: \t", OverallQualTNRTrain)
print("FPRTrain: \t", OverallQualFPRTrain)
print("FNRTrain: \t", OverallQualFNRTrain)
print ()
OverallQualAccuracyTest = dectree.score(X_test, y_test)
OverallQualTPRTest = tpTest/(tpTest + fnTest)
OverallQualTNRTest = tnTest/(tnTest + fpTest)
OverallQualFPRTest = fpTest/(tnTest + fpTest)
OverallQualFNRTest = fnTest/(tpTest + fnTest)

print ("OverallQualArea Test:\n ````````````````````````````````````````````````````````
print("Accuracy :\t",OverallQualAccuracyTest)
print("TPRTest: \t", OverallQualTPRTest)
print("TNRTest: \t", OverallQualTNRTest)
print("FPRTest: \t", OverallQualFPRTest)
print("FNRTest: \t", OverallQualFNRTest)

OverallQual Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.94
TPRTrain: 0.9941804073714839
TNRTrain: 0.13043478260869565
FPRTrain: 0.8695652173913043
FNRTrain: 0.005819592628516004

OverallQual Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9388888888888889
TPRTest: 0.9910179640718563
TNRTest: 0.2692307692307692
FPRTest: 0.7307692307692307
FNRTest: 0.008982035928143712


Predicting CentralAir using YearBuilt


In [265]: # Plot Response against Predictor to visualize their mutual relationship.
f = plt.figure(figsize=(16, 4))
sb.swarmplot(x = 'YearBuilt', y = 'CentralAir', data = houseData)

C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning:
24.0% of the points cannot be placed; you may want to decrease the size of the markers
or use stripplot.
warnings.warn(msg, UserWarning)

Out[265]: <AxesSubplot:xlabel='YearBuilt', ylabel='CentralAir'>

In [266]: f = plt.figure(figsize=(16, 4))


sb.violinplot(x = 'YearBuilt', y = 'CentralAir', data = houseData)

Out[266]: <AxesSubplot:xlabel='YearBuilt', ylabel='CentralAir'>

In [268]: # Import essential models and functions from sklearn


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['YearBuilt'])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model

Out[268]: DecisionTreeClassifier(max_depth=2)
In [269]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(8,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[269]: [Text(0.5, 0.8333333333333334, 'YearBuilt <= 1952.5\ngini = 0.122\nsamples = 1100\nval


ue = [72, 1028]\nclass = Y'),
Text(0.25, 0.5, 'YearBuilt <= 1917.5\ngini = 0.371\nsamples = 260\nvalue = [64, 196]
\nclass = Y'),
Text(0.125, 0.16666666666666666, 'gini = 0.496\nsamples = 59\nvalue = [27, 32]\nclass
= Y'),
Text(0.375, 0.16666666666666666, 'gini = 0.3\nsamples = 201\nvalue = [37, 164]\nclass
= Y'),
Text(0.75, 0.5, 'YearBuilt <= 1965.5\ngini = 0.019\nsamples = 840\nvalue = [8, 832]\n
class = Y'),
Text(0.625, 0.16666666666666666, 'gini = 0.078\nsamples = 173\nvalue = [7, 166]\nclas
s = Y'),
Text(0.875, 0.16666666666666666, 'gini = 0.003\nsamples = 667\nvalue = [1, 666]\nclas
s = Y')]

In [270]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})

Train Data
Accuracy : 0.9345454545454546

TPR Train : 1.0


TNR Train : 0.0

FPR Train : 1.0


FNR Train : 0.0

Out[270]: <AxesSubplot:>
In [ ]: ​

In [271]: # Import the required metric from sklearn


from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})

Test Data
Accuracy : 0.9361111111111111

TPR Test : 1.0


TNR Test : 0.0

FPR Test : 1.0


FNR Test : 0.0

Out[271]: <AxesSubplot:>
In [272]: YearBuiltAccuracyTrain = dectree.score(X_train, y_train)
YearBuiltTPRTrain = tpTrain/(tpTrain + fnTrain)
YearBuiltTNRTrain = tnTrain/(tnTrain + fpTrain)
YearBuiltFPRTrain = fpTrain/(tnTrain + fpTrain)
YearBuiltFNRTrain = fnTrain/(tpTrain + fnTrain)

print ("YearBuilt Train:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",YearBuiltAccuracyTrain)
print("TPRTrain: \t", YearBuiltTPRTrain)
print("TNRTrain: \t", YearBuiltTNRTrain)
print("FPRTrain: \t", YearBuiltFPRTrain)
print("FNRTrain: \t", YearBuiltFNRTrain)
print ()
YearBuiltAccuracyTest = dectree.score(X_test, y_test)
YearBuiltTPRTest = tpTest/(tpTest + fnTest)
YearBuiltTNRTest = tnTest/(tnTest + fpTest)
YearBuiltFPRTest = fpTest/(tnTest + fpTest)
YearBuiltFNRTest = fnTest/(tpTest + fnTest)

print ("YearBuilt Test:\n ``````````````````````````````````````````````````````````````
print("Accuracy :\t",YearBuiltAccuracyTest)
print("TPRTest: \t", YearBuiltTPRTest)
print("TNRTest: \t", YearBuiltTNRTest)
print("FPRTest: \t", YearBuiltFPRTest)
print("FNRTest: \t", YearBuiltFNRTest)

YearBuilt Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9345454545454546
TPRTrain: 1.0
TNRTrain: 0.0
FPRTrain: 1.0
FNRTrain: 0.0

YearBuilt Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9361111111111111
TPRTest: 1.0
TNRTest: 0.0
FPRTest: 1.0
FNRTest: 0.0

Problem 3 : Comparing the Uni-Variate Decision Tree


Models
Compare and contrast the four models in terms of Classification Accuracy, TPR and FPR on both Train and
Test Data.

CentralAir vs SalePrice has the highest Training Accuracy out of the four models.
CentralAir vs GrLivArea has the highest Test Accuracy out of the four models.
However, the train and test accuracy for all four models are pretty high and quite close.
So, it is not easy to justify which model is better just using their classification accuracy.

However, if we look at the True Positive Rate (TPR) and False Positive Rate (FPR) of the four models, we
find that

YearBuilt yields a TPR of 1 (best-case) but an FPR of 1 (worst-case) on both Train and Test data.
Really bad for prediction.
GrLivArea yields a TPR of close to 1 (best-case) but an FPR of close to 1 (worst-case) on Train and
Test set, not good either.
SalePrice and OverallQual yield the best TPR (high) vs FPR (not-as-high) trade-off in case of
both Train and Test data.

Overall, the predictor OverallQual is the best amongst the four in predicting CentralAir , while
SalePrice is a close second as per the models above. YearBuilt is definitely the worst predictor out
of these four variables, with GrLivArea not doing so well either, given the models above.

Did you notice? : Go back and check again all accuracy figures for the four models. I am pretty sure you
did not get the exact same values as I did. This is due to the random selection of Train-Test sets. In fact,
if you run the above cells again, you will get a different set of accuracy figures. If that is so, can we really
be confident that OverallQual will always be the best variable to predict CentralAir ? Think about it.
;-)
In [274]: print ("Sale price Train:\n ````````````````````````````````````````````````````````````
print("Accuracy :\t",SalePriceAccuracyTrain)
print("TPRTrain: \t", SalePriceTPRTrain)
print("TNRTrain: \t", SalePriceTNRTrain)
print("FPRTrain: \t", SalePriceFPRTrain)
print("FNRTrain: \t", SalePriceFNRTrain)
print ("Sale price Test:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",SalePriceAccuracyTest)
print("TPRTest: \t", SalePriceTPRTest)
print("TNRTest: \t", SalePriceTNRTest)
print("FPRTest: \t", SalePriceFPRTest)
print("FNRTest: \t", SalePriceFNRTest)
print()
print()
print ("GrLivArea Train:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",GrLivAreaAccuracyTrain)
print("TPRTrain: \t", GrLivAreaTPRTrain)
print("TNRTrain: \t", GrLivAreaTNRTrain)
print("FPRTrain: \t", GrLivAreaFPRTrain)
print("FNRTrain: \t", GrLivAreaFNRTrain)
print ()
print ("GrLivArea Test:\n ``````````````````````````````````````````````````````````````
print("Accuracy :\t",GrLivAreaAccuracyTest)
print("TPRTest: \t", GrLivAreaTPRTest)
print("TNRTest: \t", GrLivAreaTNRTest)
print("FPRTest: \t", GrLivAreaFPRTest)
print("FNRTest: \t", GrLivAreaFNRTest)
print ()
print ()
print ("OverallQual Train:\n ```````````````````````````````````````````````````````````
print("Accuracy :\t",OverallQualAccuracyTrain)
print("TPRTrain: \t", OverallQualTPRTrain)
print("TNRTrain: \t", OverallQualTNRTrain)
print("FPRTrain: \t", OverallQualFPRTrain)
print("FNRTrain: \t", OverallQualFNRTrain)
print ()
print ("OverallQual Test:\n ````````````````````````````````````````````````````````````
print("Accuracy :\t",OverallQualAccuracyTest)
print("TPRTest: \t", OverallQualTPRTest)
print("TNRTest: \t", OverallQualTNRTest)
print("FPRTest: \t", OverallQualFPRTest)
print("FNRTest: \t", OverallQualFNRTest)
print ()
print ()
print ("YearBuilt Train:\n `````````````````````````````````````````````````````````````
print("Accuracy :\t",YearBuiltAccuracyTrain)
print("TPRTrain: \t", YearBuiltTPRTrain)
print("TNRTrain: \t", YearBuiltTNRTrain)
print("FPRTrain: \t", YearBuiltFPRTrain)
print("FNRTrain: \t", YearBuiltFNRTrain)
print ()
print ("YearBuilt Test:\n ``````````````````````````````````````````````````````````````
print("Accuracy :\t",YearBuiltAccuracyTest)
print("TPRTest: \t", YearBuiltTPRTest)
print("TNRTest: \t", YearBuiltTNRTest)
print("FPRTest: \t", YearBuiltFPRTest)
print("FNRTest: \t", YearBuiltFNRTest)

Sale price Train:


```````````````````````````````````````````````````````````````
Accuracy : 0.9390909090909091
TPRTrain: 0.9990262901655307
TNRTrain: 0.0958904109589041
FPRTrain: 0.9041095890410958
FNRTrain: 0.0009737098344693282
Sale price Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9527777777777777
TPRTest: 0.9970414201183432
TNRTest: 0.2727272727272727
FPRTest: 0.7272727272727273
FNRTest: 0.0029585798816568047

GrLivArea Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9309090909090909
TPRTrain: 1.0
TNRTrain: 0.0379746835443038
FPRTrain: 0.9620253164556962
FNRTrain: 0.0

GrLivArea Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9583333333333334
TPRTest: 1.0
TNRTest: 0.0625
FPRTest: 0.9375
FNRTest: 0.0

OverallQual Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.94
TPRTrain: 0.9941804073714839
TNRTrain: 0.13043478260869565
FPRTrain: 0.8695652173913043
FNRTrain: 0.005819592628516004

OverallQual Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9388888888888889
TPRTest: 0.9910179640718563
TNRTest: 0.2692307692307692
FPRTest: 0.7307692307692307
FNRTest: 0.008982035928143712

YearBuilt Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9345454545454546
TPRTrain: 1.0
TNRTrain: 0.0
FPRTrain: 1.0
FNRTrain: 0.0

YearBuilt Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9361111111111111
TPRTest: 1.0
TNRTest: 0.0
FPRTest: 1.0
FNRTest: 0.0

In [ ]: ​
Extra : Predicting CentralAir using All Variables

1. Note that the DecisionTreeClassifier() model can take more than one
Predictor to model the Response variable. Try to fit a Decision Tree model
to predict “CentralAir” using all the four variables “SalePrice” ,
“GrLivArea” , “OverallQual” and “YearBuilt” . Print the Classification
Accuracy of this multi-variate model on Train and Test datasets, and check
the model’s reliability of prediction on Train and Test data using the
confusion matrices.

Use all the other 4 variables from the dataset to predict CentralAir , as mentioned in the problem.

In [80]: # Import essential models and functions from sklearn


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData[['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt']]) #

# Split the Dataset into Train and Test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model

Out[80]: DecisionTreeClassifier(max_depth=2)
In [81]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(9,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[81]: [Text(0.5, 0.8333333333333334, 'YearBuilt <= 1925.5\ngini = 0.119\nsamples = 1100\nval


ue = [70, 1030]\nclass = Y'),
Text(0.25, 0.5, 'SalePrice <= 107750.0\ngini = 0.46\nsamples = 117\nvalue = [42, 75]
\nclass = Y'),
Text(0.125, 0.16666666666666666, 'gini = 0.488\nsamples = 38\nvalue = [22, 16]\nclass
= N'),
Text(0.375, 0.16666666666666666, 'gini = 0.378\nsamples = 79\nvalue = [20, 59]\nclass
= Y'),
Text(0.75, 0.5, 'SalePrice <= 96750.0\ngini = 0.055\nsamples = 983\nvalue = [28, 955]
\nclass = Y'),
Text(0.625, 0.16666666666666666, 'gini = 0.392\nsamples = 56\nvalue = [15, 41]\nclass
= Y'),
Text(0.875, 0.16666666666666666, 'gini = 0.028\nsamples = 927\nvalue = [13, 914]\ncla
ss = Y')]
In [82]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 60})

Train Data
Accuracy : 0.9418181818181818

TPR Train : 0.9844660194174757


TNR Train : 0.3142857142857143

FPR Train : 0.6857142857142857


FNR Train : 0.015533980582524271

Out[82]: <AxesSubplot:>
In [83]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 60})

Test Data
Accuracy : 0.9305555555555556

TPR Test : 0.9761194029850746


TNR Test : 0.32

FPR Test : 0.68


FNR Test : 0.023880597014925373

Out[83]: <AxesSubplot:>

Observation : The model with CentralAir against all the four variables SalePrice , GrLivArea ,
OverallQual , YearBuilt is not necessarily better. That's strange!

However, there is also room to play with the max_depth of the Decision
Tree.
Try other values and check out for yourself. :-)

Experiment with max_depth of the Decision Tree to check the variations in accuracy and confusion
matrix for train and test. Think about it!
In [84]: # Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData[['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt']]) #or

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3) # CHANGE IT HERE AND EXPERIMENT
dectree.fit(X_train, y_train) # train the decision tree model

# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(26,10))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])

Out[84]: [Text(0.4583333333333333, 0.875, 'SalePrice <= 98150.0\ngini = 0.116\nsamples = 1100\n


value = [68, 1032]\nclass = Y'),
Text(0.25, 0.625, 'YearBuilt <= 1957.5\ngini = 0.472\nsamples = 89\nvalue = [34, 55]
\nclass = Y'),
Text(0.16666666666666666, 0.375, 'SalePrice <= 62250.0\ngini = 0.499\nsamples = 65\nv
alue = [34, 31]\nclass = N'),
Text(0.08333333333333333, 0.125, 'gini = 0.198\nsamples = 9\nvalue = [8, 1]\nclass =
N'),
Text(0.25, 0.125, 'gini = 0.497\nsamples = 56\nvalue = [26, 30]\nclass = Y'),
Text(0.3333333333333333, 0.375, 'gini = 0.0\nsamples = 24\nvalue = [0, 24]\nclass =
Y'),
Text(0.6666666666666666, 0.625, 'YearBuilt <= 1919.5\ngini = 0.065\nsamples = 1011\nv
alue = [34, 977]\nclass = Y'),
Text(0.5, 0.375, 'GrLivArea <= 1857.5\ngini = 0.456\nsamples = 54\nvalue = [19, 35]\n
class = Y'),
Text(0.4166666666666667, 0.125, 'gini = 0.291\nsamples = 34\nvalue = [6, 28]\nclass =
Y'),
Text(0.5833333333333334, 0.125, 'gini = 0.455\nsamples = 20\nvalue = [13, 7]\nclass =
N'),
Text(0.8333333333333334, 0.375, 'SalePrice <= 141250.0\ngini = 0.031\nsamples = 957\n
value = [15, 942]\nclass = Y'),
Text(0.75, 0.125, 'gini = 0.09\nsamples = 275\nvalue = [13, 262]\nclass = Y'),
Text(0.9166666666666666, 0.125, 'gini = 0.006\nsamples = 682\nvalue = [2, 680]\nclass
= Y')]
In [85]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 60})

Train Data
Accuracy : 0.95

TPR Train : 0.9922480620155039


TNR Train : 0.3088235294117647

FPR Train : 0.6911764705882353


FNR Train : 0.007751937984496124

Out[85]: <AxesSubplot:>
In [86]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 60})

Test Data
Accuracy : 0.9277777777777778

TPR Test : 0.978978978978979


TNR Test : 0.2962962962962963

FPR Test : 0.7037037037037037


FNR Test : 0.021021021021021023

Out[86]: <AxesSubplot:>
2. Fit a Decision Tree model to predict “CentralAir” using all numeric
variables in the given dataset. You may use all numeric variables from
Exercise 2. Print the Classification Accuracy of this multi-variate model on
Train and Test data, and check the model’s reliability of prediction on Train
and Test data using the confusion matrices.

In [87]: df = pd.read_csv('train.csv')

In [88]: def clean_dataset(df):


assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
df.dropna(inplace=True)
indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
return df[indices_to_keep].astype(np.float64)

In [89]: df_sub = df.loc[:, (df.dtypes == np.int64) | (df.dtypes == np.float64)]


column_categorical=['MSSubClass','OverallQual','OverallCond']

df_num = df_sub.drop(column_categorical, axis=1)
print(df_num.shape)
df_num=clean_dataset(df_num)
print(df_num.shape)

(1460, 35)
(1121, 35)

In [92]: y = pd.DataFrame(df['CentralAir'].iloc[df_num.index]) # Response


X = pd.DataFrame(df_num) # Predictor

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set :", y_test.shape, X_test.shape)

Train Set : (761, 1) (761, 35)


Test Set : (360, 1) (360, 35)
In [93]: # Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree object
dectree.fit(X_train, y_train) # train the decision tree model

f = plt.figure(figsize=(12,5))
plot_tree(dectree, filled=True, rounded=True,
feature_names=df_num.columns,
class_names=['N','Y'])
plt.show()

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

cm_train=confusion_matrix(y_train, y_train_pred)
cm_test=confusion_matrix(y_test, y_test_pred)

acc_train=(cm_train[0,0]+cm_train[1,1])/np.sum(cm_train)
In [95]: # Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Accuracy \t: {:.3f}".format(acc_train))
print("True Positive Rate \t: {:.3f}".format(cm_train[1,1]/np.sum(cm_train[1,:])))
print("True Negative Rate \t: {:.3f}".format(cm_train[0,0]/np.sum(cm_train[0,:])))
print("False Positive Rate \t: {:.3f}".format(cm_train[0,1]/np.sum(cm_train[0,:])))
print("False Negative Rate \t: {:.3f}".format(cm_train[1,0]/np.sum(cm_train[1,:])))
print()

acc_test=(cm_test[0,0]+cm_test[1,1])/np.sum(cm_test)
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Accuracy \t: {:.3f}".format(acc_test))
print("True Positive Rate \t: {:.3f}".format(cm_test[1,1]/np.sum(cm_test[1,:])))
print("True Negative Rate \t: {:.3f}".format(cm_test[0,0]/np.sum(cm_test[0,:])))
print("False Positive Rate \t: {:.3f}".format(cm_test[0,1]/np.sum(cm_test[0,:])))
print("False Negative Rate \t: {:.3f}".format(cm_test[1,0]/np.sum(cm_test[1,:])))
print()

Goodness of Fit of Model Train Dataset


Accuracy : 0.955
True Positive Rate : 0.989
True Negative Rate : 0.381
False Positive Rate : 0.619
False Negative Rate : 0.011

Goodness of Fit of Model Test Dataset


Accuracy : 0.956
True Positive Rate : 0.988
True Negative Rate : 0.429
False Positive Rate : 0.571
False Negative Rate : 0.012
In [96]: # Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(cm_train,
annot = True, fmt=".0f", annot_kws={"size": 50}, ax = axes[0])
axes[0].set_xlabel('Predicted', fontsize = 15)
axes[0].set_ylabel('Actual', fontsize = 15)

sb.heatmap(cm_test,
annot = True, fmt=".0f", annot_kws={"size": 50}, ax = axes[1])
axes[1].set_xlabel('Predicted', fontsize = 15)
axes[1].set_ylabel('Actual', fontsize = 15)

plt.show()

3. Are False Positive Rates of the various Decision Tree models


significantly higher/lower than False Negative Rates ? Does this have
anything to do with the unbalanced classes (Y : N) of 'CentralAir'?
Experiment and think about it.

False Positive Rates of the various Decision Tree models are (significantly) higher than False Negative
Rates. False Positive Rates is the complement of True Negative Rate (i.e., the sum of the two is 1), and
False Negative Rates is the complement of True Positive Rate (i.e., the sum of the two is 1). True Positive
Rate is higher than True Negative Rate as positive samples dominate, thus, False Positive Rates are
higher than False Negative Rates.

To test the significance, 30 random trials are run on each Decision Tree model. A significance test is
then performed on all the results.
In [97]: Index_Trials=[]
Index_Predictors=[]
FPR_train=[]
FNR_train=[]
FPR_test=[]
FNR_test=[]

predictors=['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt',
['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt'],
df_num.columns]
for j,X_names in enumerate(predictors):
for i in range(30):
print(j,'-',i)
X = clean_dataset(pd.DataFrame(df[X_names])) # Predictor
y = pd.DataFrame(df['CentralAir'].iloc[X.index]) # Response

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)

# Check the sample sizes
print("Train Set :", y_train.shape, X_train.shape)
print("Test Set :", y_test.shape, X_test.shape)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2) # create the decision tree obje
dectree.fit(X_train, y_train) # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)


cm_train=confusion_matrix(y_train, y_train_pred)
cm_test=confusion_matrix(y_test, y_test_pred)

fpr_train=cm_train[0,1]/np.sum(cm_train[0,:])
fnr_train=cm_train[1,0]/np.sum(cm_train[1,:])
# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("False Positive Rate \t: {:.3f}".format(fpr_train))
print("False Negative Rate \t: {:.3f}".format(fnr_train))
print()

fpr_test=cm_test[0,1]/np.sum(cm_test[0,:])
fnr_test=cm_test[1,0]/np.sum(cm_test[1,:])
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("False Positive Rate \t: {:.3f}".format(fpr_test))
print("False Negative Rate \t: {:.3f}".format(fnr_test))
print()

Index_Trials.append(i)
Index_Predictors.append(j)
FPR_train.append(fpr_train)
FNR_train.append(fnr_train)
FPR_test.append(fpr_test)
FNR_test.append(fnr_test)

0 - 0
Train Set : (1100, 1) (1100, 1)
Test Set : (360, 1) (360, 1)
Goodness of Fit of Model Train Dataset
False Positive Rate : 0.835
False Negative Rate : 0.002
Goodness of Fit of Model Test Dataset
False Positive Rate : 1.000
False Negative Rate : 0.000

0 - 1
Train Set : (1100, 1) (1100, 1)
Test Set : (360, 1) (360, 1)
Goodness of Fit of Model Train Dataset
False Positive Rate : 0.875
False Negative Rate : 0.002

Goodness of Fit of Model Test Dataset


In [98]: from scipy import stats

print('Critical p value is 0.05')

t_value_train, p_value_train = stats.ttest_ind(np.array(FPR_train), np.array(FNR_train))
print("Significance test \tTrain Dataset")
print('Test statistic is %f' % float("{:.6f}".format(t_value_train)))
print('p-value for two tailed test is %f' % p_value_train)

t_value_test, p_value_test = stats.ttest_ind(np.array(FPR_test), np.array(FNR_test))
print("Significance test \tTest Dataset")
print('Test statistic is %f' % float("{:.6f}".format(t_value_test)))
print('p-value for two tailed test is %f' % p_value_test)

Critical p value is 0.05


Significance test Train Dataset
Test statistic is 54.231349
p-value for two tailed test is 0.000000
Significance test Test Dataset
Test statistic is 65.663752
p-value for two tailed test is 0.000000

In [ ]: ​

You might also like