Ex 1
Essential Libraries
Let us begin by importing the essential Python Libraries.
Problem 1 : Kaggle
The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
If you do not pass a header value, it defaults to infer , which (generally) takes the first row of the CSV
file as the column names.
If you set header = None , it understands that there are no column names in the CSV file, and
every row contains just data.
If you set header = 0 , it takes the 0-th row (first row) of the CSV file to be
considered as column names.
Check any function definition in Jupyter Notebook by running function_name? , e.g., try running
pd.read_csv? in a cell.
In [2]: houseData = pd.read_csv('train.csv', header='infer') # can use header=2, 3, etc. if needed
houseData.head(5) # print the first 5 rows
Out[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
b) How many observations (rows) and variables (columns) are in the above
dataset?
Note that shape is an attribute/variable stored inside the DataFrame class of pandas.
Find out all the stored attributes by checking the documentation on Pandas DataFrame :
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
In [3]: houseData.shape
In [4]: houseData.dtypes
Out[4]: Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object
d) What does the .info() method do? Use the .info() method on the imported
dataset to check this out.
You may also get more information about the dataset using info() .
Note that info() is a method/function stored inside the DataFrame class of pandas.
Find out all the stored methods by checking the documentation on Pandas DataFrame :
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
In [5]: houseData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
e) What does the .describe() method do? Use the .describe() method on the
imported dataset to check.
describe() provides you with statistical information about the data. This is another method.
In [6]: houseData.describe()
Out[6]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt Year
8 rows × 38 columns
Observation : Why are there fewer variables in describe() than in info() ?
Describe provides the basic statistics, but only for the Numeric variables. You should also be careful that
a variable that looks numeric may actually be categorical, as levels of categorical variables are often
encoded as numbers. However, Pandas does not know that -- it follows duck-typing.
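By default, describe() summarises only the numeric columns; passing include='all' adds the object columns as well. A minimal sketch of the difference, using a tiny stand-in DataFrame (since train.csv is not re-loaded here):

```python
import pandas as pd

# a tiny stand-in for houseData -- one numeric and one object column
df = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                   "MSZoning": ["RL", "RL", "RM"]})

num_only = df.describe()               # numeric columns only
with_obj = df.describe(include="all")  # object columns included as well
print(num_only.columns.tolist())   # ['LotArea']
print(with_obj.columns.tolist())   # ['LotArea', 'MSZoning']
```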
In Exercise 2, you are explicitly going over each variable, reading its description, and understanding its true
meaning, to judge if a variable that looks like a number should be considered numeric. This will be an
important part of data preparation before you go ahead with exploratory data analysis in Exercise 3.
Problem 2 : Wikipedia
As the dataset is in a table format within an HTML website, we may use the read_html function from
Pandas.
As with read_csv , there are multiple optional input parameters to this function. Try running
pd.read_html? in a cell.
b) How many tables are in this Wikipedia page? Check the “len” of the
imported data/page to find this out.
Check that the imported data is a list , and note its len . This tells us how many tables there are.
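A minimal sketch of this check, parsing a small HTML string in place of the live Wikipedia page (note that read_html needs an HTML parser such as lxml installed):

```python
import io
import pandas as pd

# pd.read_html returns a plain Python list, with one DataFrame per <table>
# found in the document; this small string stands in for the real page
html = io.StringIO("""
<table><tr><th>A</th></tr><tr><td>1</td></tr></table>
<table><tr><th>B</th></tr><tr><td>2</td></tr></table>
""")
tables = pd.read_html(html)
print(type(tables).__name__, len(tables))  # list 2
```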
c) Which one is the actual “2016 Summer Olympics medal table”? Explore all
tables in the data to know.
Check each table in the dataset to identify the one that we want to extract. Note that this is a standard
python list where each element of the list is a pandas DataFrame . That is, every single table from the
HTML document (the webpage) has been parsed into an individual DataFrame .
In [9]: medal_html[2] # vary the index from 0 to 1, 2, 3 etc. to check each table parsed from the page
Out[9]:
Rank NOC Gold Silver Bronze Total
1 2 Great Britain 27 23 17 67
2 3 China 26 18 26 70
3 4 Russia 19 17 20 56
4 5 Germany 17 10 15 42
82 78 Nigeria 0 0 1 1
83 78 Portugal 0 0 1 1
86 Totals (86 entries) Totals (86 entries) 306 307 359 972
87 rows × 6 columns
Just by exploring each element in the list above, you will find that the actual data table corresponding to
the 2016 Summer Olympics medal table is located at index 2 of the list .
d) Extract the main table, “2016 Summer Olympics medal table”, and store it
as a new Pandas DataFrame.
Assign it to a new DataFrame , as follows, so that we can use it for further exploration. Check the basic
information too.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 87 non-null object
1 NOC 87 non-null object
2 Gold 87 non-null int64
3 Silver 87 non-null int64
4 Bronze 87 non-null int64
5 Total 87 non-null int64
dtypes: int64(4), object(2)
memory usage: 4.2+ KB
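The assignment cell itself did not survive in this export; a minimal sketch of the indexing step, with a small stand-in list in place of the real parsed page (the medal table sat at index 2 when this notebook was run):

```python
import pandas as pd

# medal_html stands in for the list that pd.read_html(url) returned;
# only the third element is the actual medal table
medal_html = [pd.DataFrame({"note": ["infobox"]}),
              pd.DataFrame({"note": ["navbox"]}),
              pd.DataFrame({"Rank": [1, 2],
                            "NOC": ["United States", "Great Britain"],
                            "Gold": [46, 27], "Silver": [37, 23],
                            "Bronze": [38, 17], "Total": [121, 67]})]
medalTable = medal_html[2]   # keep the medal table as its own DataFrame
print(medalTable.shape)      # (2, 6)
```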
e) Extract the TOP 20 countries from the medal table, as above, and store
these rows as a new DataFrame.
The DataFrame seems to have 87 rows/countries. Extract the top 20 rows of the DataFrame to capture
the TOP 20 countries in the medal tally. There are plenty of ways to do this. You may use the standard
.iloc[] method to index the specific rows you want to extract. You may also use .head(20) .
In [11]: medalData = medalTable.iloc[:20]
medalData
Out[11]:
Rank NOC Gold Silver Bronze Total
1 2 Great Britain 27 23 17 67
2 3 China 26 18 26 70
3 4 Russia 19 17 20 56
4 5 Germany 17 10 15 42
5 6 Japan 12 8 21 41
6 7 France 10 18 14 42
7 8 South Korea 9 3 9 21
8 9 Italy 8 12 8 28
9 10 Australia 8 11 10 29
10 11 Netherlands 8 7 4 19
11 12 Hungary 8 3 4 15
12 13 Brazil* 7 6 6 19
13 14 Spain 7 4 6 17
14 15 Kenya 6 6 1 13
15 16 Jamaica 6 3 2 11
16 17 Croatia 5 3 2 10
17 18 Cuba 5 2 4 11
18 19 New Zealand 4 9 5 18
19 20 Canada 4 3 15 22
Observation : If you can just change the 2016 part within the URL
'https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table' , you should be able to
fetch any other Summer Olympic dataset similarly. Try with 2012 , 2008 etc. Can this be done for any
year? What about 1980 ?
Interesting : If the 2016 part can be changed this way, you may also try to write a loop to iterate over a
list of years [2016, 2012, 2008, 2004, 2000] , and fetch all the tables within the loop. Try it out --
should be fun! This is a bonus problem, and will be discussed in the next Review Session.
More Interesting : Any structured website can be scraped for tables in the same way. However, what
would you do for data that are not in a table format? Can you extract the name of the movie, its rating and
genres from https://www.imdb.com/title/tt0441773/ (https://www.imdb.com/title/tt0441773/) using some
other library in python? Give it a shot! :-)
Bonus Problems A
To download the data: https://www.kaggle.com/datasets/uciml/adult-census-income
(https://www.kaggle.com/datasets/uciml/adult-census-income)
A. Download the “Census Income” dataset (source :
https://archive.ics.uci.edu/ml/datasets/Census+Income
(https://archive.ics.uci.edu/ml/datasets/Census+Income)) from the UCI
Machine Learning Repository (in the “Data Folder”), and import it in Jupyter
Notebook as a DataFrame.
Explore the dataset using .shape , .info() and .describe() , exactly as you did in Problem 1 above.
Do you spot anything interesting while exploring this dataset? Discuss amongst friends or talk to the
Instructor, if you did.
In [18]: census_income=pd.read_csv('adult.csv')
In [30]: print(census_income.shape)
print('\n') #spacing
print(census_income.info())
print('\n') #spacing
print(census_income.describe())
(32560, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 39 32560 non-null int64
1 State-gov 32560 non-null object
2 77516 32560 non-null int64
3 Bachelors 32560 non-null object
4 13 32560 non-null int64
5 Never-married 32560 non-null object
6 Adm-clerical 32560 non-null object
7 Not-in-family 32560 non-null object
8 White 32560 non-null object
9 Male 32560 non-null object
10 2174 32560 non-null int64
11 0 32560 non-null int64
12 40 32560 non-null int64
13 United-States 32560 non-null object
14 <=50K 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None
39 77516 13 2174 0 \
count 32560.000000 3.256000e+04 32560.000000 32560.000000 32560.000000
mean 38.581634 1.897818e+05 10.080590 1077.615172 87.306511
std 13.640642 1.055498e+05 2.572709 7385.402999 402.966116
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000
25% 28.000000 1.178315e+05 9.000000 0.000000 0.000000
50% 37.000000 1.783630e+05 10.000000 0.000000 0.000000
75% 48.000000 2.370545e+05 12.000000 0.000000 0.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000
40
count 32560.000000
mean 40.437469
std 12.347618
min 1.000000
25% 40.000000
50% 40.000000
75% 45.000000
max 99.000000
The method describe() returns statistics only for the columns whose dtype is int64 and ignores the
object ones. Also notice that the column names above ( 39 , State-gov , 77516 , ...) are actually data
values: the CSV file has no header row, so the first observation was consumed as the header.
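Since adult.data ships without a header row, the first observation gets consumed as the header unless you say otherwise. A sketch of importing it with explicit column names (the names below follow the UCI data description -- verify them against adult.names):

```python
import io
import pandas as pd

# header=None stops pandas from eating the first row; names supplies labels
cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]
# one sample line stands in for the full adult.data file here
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
census_income = pd.read_csv(sample, header=None, names=cols,
                            skipinitialspace=True)
print(census_income.shape)  # (1, 15) -- the first row stays as data
```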
Bonus Problems B
Note that the Summer Olympic medal tally on Wikipedia follows a really nice structure for the URL, where
you can simply change the year in https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table
(https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table) to fetch any Summer Olympic page.
Try changing 2016 in the URL to 2012 or 2008 or 2004 to see for yourself. This allows us to fetch the
Olympics medal table from all these years (in fact, any year) quite easily.
Let’s try the following. Write a loop to extract the main tables, “20XX Summer Olympics medal table”, from
2000 to 2016, that is, for the five consecutive Olympics in 2000, 2004, 2008, 2012 and 2016. Store all five
tables in respective DataFrames. Now, extract the TOP 20 countries from each of these medal tables, and
store these rows as new DataFrames.
In [35]: first_year = 2000
         last_year = 2016
         medal_tables = []
         for year in range(first_year, last_year + 1, 4):
             # fetch the html page with the url for year
             wiki_url = "https://en.wikipedia.org/wiki/" + str(year) + "_Summer_Olympics_medal_table"
             # parse all tables, keep the medal table (index 2 at the time of writing)
             medal_tables.append(pd.read_html(wiki_url)[2])
             print("___________________________________________")
             print(year)
             print(medal_tables[-1].head(10))
___________________________________________
2000
0 1 United States 37 24 32 93
1 2 Russia 32 28 29 89
2 3 China 28 16 14 58
3 4 Australia 16 25 17 58
4 5 Germany 13 17 26 56
5 6 France 13 14 11 38
6 7 Italy 13 8 13 34
7 8 Netherlands 12 9 4 25
8 9 Cuba 11 11 7 29
9 10 Great Britain 11 10 7 28
___________________________________________
2004
1 2 China 32 17 14 63
2 3 Russia 28 26 36 90
3 4 Australia 17 16 17 50
4 5 Japan 16 9 12 37
5 6 Germany 13 16 20 49
6 7 France 11 9 13 33
7 8 Italy 10 11 11 32
8 9 South Korea 9 12 9 30
9 10 Great Britain 9 9 12 30
___________________________________________
2008
0 1 China* 48 22 30 100
2 3 Russia 24 13 23 60
3 4 Great Britain 19 13 19 51
4 5 Germany 16 11 14 41
5 6 Australia 14 15 17 46
6 7 South Korea 13 11 8 32
7 8 Japan 9 8 8 25
8 9 Italy 8 9 10 27
9 10 France 7 16 20 43
___________________________________________
2012
1 2 China 39 31 22 92
2 3 Great Britain* 29 18 18 65
3 4 Russia 19 21 27 67
4 5 South Korea 13 9 8 30
5 6 Germany 11 20 13 44
6 7 France 11 11 13 35
7 8 Australia 8 15 12 35
8 9 Italy 8 9 11 28
9 10 Hungary 8 4 6 18
___________________________________________
2016
1 2 Great Britain 27 23 17 67
2 3 China 26 18 26 70
3 4 Russia 19 17 20 56
4 5 Germany 17 10 15 42
5 6 Japan 12 8 21 41
6 7 France 10 18 14 42
7 8 South Korea 9 3 9 21
8 9 Italy 8 12 8 28
9 10 Australia 8 11 10 29
In [33]: medalDataFrames = []
         for medal_table in medal_tables:
             # extract only TOP 20 countries in the table
             medalDataFrames.append(medal_table.iloc[:20])
         medalDataFrames[0] # display the first table (year 2000)
Out[33]:
Rank Nation Gold Silver Bronze Total
0 1 United States 37 24 32 93
1 2 Russia 32 28 29 89
2 3 China 28 16 14 58
3 4 Australia 16 25 17 58
4 5 Germany 13 17 26 56
15 16 Jamaica 6 3 2 11
16 17 Croatia 5 3 2 10
17 18 Cuba 5 2 4 11
18 19 New Zealand 4 9 5 18
19 20 Canada 4 3 15 22
Essential Libraries
Let us begin by importing the essential Python Libraries.
The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
a) Import the “train.csv” data you downloaded (either from NTU Learn or
Kaggle) in Jupyter Notebook.
Out[2]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
In [25]: houseData.dtypes
Out[25]: Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 81, dtype: object
You may also get more information about the dataset using info() .
In [26]: houseData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
Note that there are 35 int64 and 3 float64 variables in the dataset.
Extract these 38 variables by filtering the columns using their dtypes .
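One way to do this filtering is select_dtypes ; a sketch on a tiny stand-in DataFrame (the real houseData is not re-loaded here):

```python
import pandas as pd

# stand-in for houseData with one int64, one object and one float64 column;
# select_dtypes keeps only the columns whose dtype matches the include list
df = pd.DataFrame({"LotArea": [8450, 9600],
                   "MSZoning": ["RL", "RM"],
                   "LotFrontage": [65.0, 80.0]})
houseNumData = df.select_dtypes(include=["int64", "float64"])
print(houseNumData.columns.tolist())  # ['LotArea', 'LotFrontage']
```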
d) Open the “data_description.txt” file you downloaded (either from NTU Learn or Kaggle) in
Wordpad.
Read the description for each variable carefully and try to identify the “actual” Numeric variables.
Categorical variables are often “encoded” as Numeric variables for easy representation. Spot them.
Observation : Note that in a given dataset, Categorical variables can be "encoded" in either of two ways: as
Characters (as in MSZoning ) or as Numbers (as in MSSubClass ). Even if a categorical variable is
"encoded" as numbers, interpreting it as a numeric variable is wrong. Thus, one should carefully read
the given data description file and identify the "actual" numeric variables in the dataset before
performing statistical exploration.
Read data_description.txt (from the Kaggle data folder) to identify the actual Numeric variables.
Note that this table is created manually, and this is my interpretation. Feel free to choose your own.
(Variable / Observation table omitted from this export.)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 LotArea 1460 non-null int64
2 BsmtFinSF1 1460 non-null int64
3 BsmtFinSF2 1460 non-null int64
4 BsmtUnfSF 1460 non-null int64
5 TotalBsmtSF 1460 non-null int64
6 1stFlrSF 1460 non-null int64
7 2ndFlrSF 1460 non-null int64
8 LowQualFinSF 1460 non-null int64
9 GrLivArea 1460 non-null int64
10 BsmtFullBath 1460 non-null int64
11 BsmtHalfBath 1460 non-null int64
12 FullBath 1460 non-null int64
13 HalfBath 1460 non-null int64
14 BedroomAbvGr 1460 non-null int64
15 KitchenAbvGr 1460 non-null int64
16 TotRmsAbvGrd 1460 non-null int64
17 Fireplaces 1460 non-null int64
18 GarageCars 1460 non-null int64
19 GarageArea 1460 non-null int64
20 WoodDeckSF 1460 non-null int64
21 OpenPorchSF 1460 non-null int64
22 EnclosedPorch 1460 non-null int64
23 3SsnPorch 1460 non-null int64
24 ScreenPorch 1460 non-null int64
25 PoolArea 1460 non-null int64
26 MiscVal 1460 non-null int64
27 SalePrice 1460 non-null int64
dtypes: int64(28)
memory usage: 319.5 KB
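The cell that created saleprice is missing from this export; presumably it wrapped the SalePrice column in its own DataFrame, roughly like this (the stand-in values are the first few SalePrice entries shown in Out[24] below):

```python
import pandas as pd

# stand-in for houseData; wrapping the column in pd.DataFrame keeps it
# two-dimensional, which the boxplot/histplot calls below expect
houseData = pd.DataFrame({"SalePrice": [208500, 181500, 223500, 140000, 250000]})
saleprice = pd.DataFrame(houseData["SalePrice"])
print(type(saleprice).__name__)  # 'DataFrame', not 'Series'
print(saleprice.head())
```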
Out[24]: SalePrice
0 208500
1 181500
2 223500
3 140000
4 250000
In [10]: saleprice.describe()
Out[10]: SalePrice
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
In [32]: # BOXPLOT
f = plt.figure(figsize=(17, 3))
sb.boxplot(data = saleprice, orient = "h")
Out[32]: <AxesSubplot:>
In [50]: # HISTOGRAM
f = plt.figure(figsize=(15, 6))
sb.histplot(data = saleprice)
Out[50]: <AxesSubplot:ylabel='Count'>
Out[45]: <AxesSubplot:ylabel='Density'>
In [46]: # VIOLIN PLOT
f = plt.figure(figsize=(8, 4))
sb.violinplot(data = saleprice, orient = "h")
Out[46]: <AxesSubplot:>
Out[49]: <AxesSubplot:>
Out[14]: LotArea
0 8450
1 9600
2 11250
3 9550
4 14260
c) Find the Summary Statistics (Mean, Median, Quartiles etc) of LotArea from
the Numeric DataFrame.
(same as part a)
In [15]: lotarea.describe()
Out[15]: LotArea
count 1460.000000
mean 10516.828082
std 9981.264932
min 1300.000000
25% 7553.500000
50% 9478.500000
75% 11601.500000
max 215245.000000
(same as part b)
In [55]: # BOXPLOT
f = plt.figure(figsize=(17, 3))
sb.boxplot(data = lotarea, orient = "h")
Out[55]: <AxesSubplot:>
In [60]: # HISTOGRAM
f = plt.figure(figsize=(6, 4))
sb.histplot(data = lotarea)
Out[60]: <AxesSubplot:ylabel='Count'>
Out[62]: <AxesSubplot:>
e) Plot SalePrice (y-axis) vs LotArea (x-axis) using jointplot and find the
Correlation between the two.
Extract two variables from the DataFrame -- SalePrice and LotArea -- and check their mutual
relationship.
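A sketch of both steps, using the first few rows of the two columns as stand-in data (the notebook used the full houseData columns):

```python
import pandas as pd

# first five LotArea / SalePrice values from train.csv as stand-in data
jointDF = pd.DataFrame({"LotArea": [8450, 9600, 11250, 9550, 14260],
                        "SalePrice": [208500, 181500, 223500, 140000, 250000]})
print(jointDF.corr())  # mutual correlation matrix
# sb.jointplot(data=jointDF, x="LotArea", y="SalePrice", height=5)
# would draw the scatterplot with marginal distributions
```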
Out[66]: <AxesSubplot:>
In [74]: jointDF.corr()
Heatmap
In [23]: sb.heatmap(jointDF.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")
Out[23]: <AxesSubplot:>
Observation : Note that the correlation between LotArea and SalePrice is 0.26, as shown above. Do
you think LotArea will have any effect on SalePrice of a house, in case you want to estimate the
SalePrice using LotArea while exploring the real-estate market? Think about it.
Bonus
Drop non-Numeric variables from the DataFrame to have a clean DataFrame with only the Numeric
variables.
Plot SalePrice vs each of the Numeric variables you identified to understand their correlation or
dependence.
In [94]: df=pd.read_csv('train.csv')
In [95]: df_sub = df.loc[:, (df.dtypes == np.int64) | (df.dtypes == np.float64)]
df_sub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1201 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1452 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 1stFlrSF 1460 non-null int64
14 2ndFlrSF 1460 non-null int64
15 LowQualFinSF 1460 non-null int64
16 GrLivArea 1460 non-null int64
17 BsmtFullBath 1460 non-null int64
18 BsmtHalfBath 1460 non-null int64
19 FullBath 1460 non-null int64
20 HalfBath 1460 non-null int64
21 BedroomAbvGr 1460 non-null int64
22 KitchenAbvGr 1460 non-null int64
23 TotRmsAbvGrd 1460 non-null int64
24 Fireplaces 1460 non-null int64
25 GarageYrBlt 1379 non-null float64
26 GarageCars 1460 non-null int64
27 GarageArea 1460 non-null int64
28 WoodDeckSF 1460 non-null int64
29 OpenPorchSF 1460 non-null int64
30 EnclosedPorch 1460 non-null int64
31 3SsnPorch 1460 non-null int64
32 ScreenPorch 1460 non-null int64
33 PoolArea 1460 non-null int64
34 MiscVal 1460 non-null int64
35 MoSold 1460 non-null int64
36 YrSold 1460 non-null int64
37 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35)
memory usage: 433.6 KB
In [96]: column_categorical=['MSSubClass', 'OverallQual', 'OverallCond']
df_num = df_sub.drop(column_categorical, axis=1)
df_num.dtypes
Out[96]: Id int64
LotFrontage float64
LotArea int64
YearBuilt int64
YearRemodAdd int64
MasVnrArea float64
BsmtFinSF1 int64
BsmtFinSF2 int64
BsmtUnfSF int64
TotalBsmtSF int64
1stFlrSF int64
2ndFlrSF int64
LowQualFinSF int64
GrLivArea int64
BsmtFullBath int64
BsmtHalfBath int64
FullBath int64
HalfBath int64
BedroomAbvGr int64
KitchenAbvGr int64
TotRmsAbvGrd int64
Fireplaces int64
GarageYrBlt float64
GarageCars int64
GarageArea int64
WoodDeckSF int64
OpenPorchSF int64
EnclosedPorch int64
3SsnPorch int64
ScreenPorch int64
PoolArea int64
MiscVal int64
MoSold int64
YrSold int64
SalePrice int64
dtype: object
In [100]: for var in df_num:
              if var != 'SalePrice':
                  jointDF = pd.concat([df_num[var], df_num['SalePrice']], axis = 1).reindex(df_num.index)
                  sb.jointplot(data = jointDF, x = var, y = 'SalePrice', height = 3)
                  plt.show()
Essential Libraries
Let us begin by importing the essential Python Libraries.
The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
Extract the required variables from the dataset, as mentioned in the problem.
LotArea , GrLivArea , TotalBsmtSF , GarageArea , SalePrice
Out[5]:
LotArea GrLivArea TotalBsmtSF GarageArea SalePrice
In [6]: houseNumData.describe()
Out[6]:
LotArea GrLivArea TotalBsmtSF GarageArea SalePrice
In [8]: houseNumData.skew()
Out[9]: LotArea 69
GrLivArea 31
TotalBsmtSF 61
GarageArea 21
SalePrice 61
dtype: int64
c) Check the relationship amongst the variables using mutual correlation and
the correlation heatmap. Comment which of the variables has the strongest
correlation with “SalePrice”.
Is this useful in predicting "SalePrice" ? Compute the correlation between the variables, followed by all
bi-variate jointplots.
Out[10]: <AxesSubplot:>
d) Check the relationship amongst the variables using mutual jointplots and
an overall pairplot. Comment which of the variables has the strongest linear
relation with “SalePrice”. Is this useful in predicting “SalePrice”?
Observation : Which variables do you think will help us predict SalePrice in this dataset?
Bonus : Attempt a comprehensive analysis with all Numeric variables in the dataset.
Problem 2 : Categorical Variables
Extract the required variables from the dataset, as mentioned in the problem.
MSSubClass , Neighborhood , BldgType , OverallQual
Out[34]:
MSSubClass Neighborhood BldgType OverallQual
0 60 CollgCr 1Fam 7
1 20 Veenker 1Fam 6
2 60 CollgCr 1Fam 7
3 70 Crawfor 1Fam 7
4 60 NoRidge 1Fam 8
In [31]: houseCatData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null int64
1 Neighborhood 1460 non-null object
2 BldgType 1460 non-null object
3 OverallQual 1460 non-null int64
dtypes: int64(2), object(2)
memory usage: 45.8+ KB
a) Convert each of these variables into “category” data type (note that some
are “int64”, and some are “object”).
Fix the data types of the first four variables to convert them to categorical.
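The conversion cell itself is not shown above; astype('category') can convert the whole DataFrame in one call, whether a column starts as int64 or object. A sketch with stand-in values from the head shown earlier:

```python
import pandas as pd

# stand-in for houseCatData; astype("category") converts every column,
# int64 and object alike, to the category dtype
houseCatData = pd.DataFrame({"MSSubClass": [60, 20, 60],
                             "BldgType": ["1Fam", "1Fam", "1Fam"],
                             "OverallQual": [7, 6, 7]})
houseCatData = houseCatData.astype("category")
print(houseCatData.dtypes)  # every column is now 'category'
```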
In [33]: houseCatData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1460 non-null category
1 Neighborhood 1460 non-null category
2 BldgType 1460 non-null category
3 OverallQual 1460 non-null category
dtypes: category(4)
memory usage: 7.8 KB
Check the Variables Independently
In [13]: houseCatData.describe()
Out[13]:
MSSubClass Neighborhood BldgType OverallQual
unique 15 25 5 10
c) One may check the relation amongst two categorical variables through the
bi-variate joint heatmap of counts. Use groupby() command to generate joint
heatmap of counts for “OverallQual” against the other three variables.
Comment if this is useful in identifying the relation between “OverallQual”
with the other variables.
Joint heatmaps of some of the important bi-variate relationships in houseCatData .
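A sketch of the joint counts behind one such heatmap, using groupby() with stand-in data; passing the resulting table to sb.heatmap(counts, annot=True, fmt='d') would render it:

```python
import pandas as pd

# stand-in for houseCatData; groupby().size().unstack() builds the
# joint table of counts for OverallQual against BldgType
houseCatData = pd.DataFrame({"OverallQual": [7, 6, 7, 8, 7],
                             "BldgType": ["1Fam", "1Fam", "1Fam",
                                          "TwnhsE", "TwnhsE"]})
counts = (houseCatData.groupby(["OverallQual", "BldgType"])
          .size().unstack(fill_value=0))
print(counts)
```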
Out[60]:
MSSubClass Neighborhood BldgType OverallQual SalePrice
Observation : Which variables do you think will help us predict SalePrice in this dataset?
Bonus : Attempt a comprehensive analysis with all Categorical variables in the dataset.
Bonus
Perform a similar analysis on every other variable in the dataset, against
“SalePrice”. This will let you gain more insight about the data, and find out
which variables are actually useful in predicting “SalePrice”.
Exercise 4 : Linear Regression
Essential Libraries
Let us begin by importing the essential Python Libraries.
The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
Out[49]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
In [51]: houseData.SalePrice.corr(houseData.GrLivArea)
Out[51]: 0.7086244776126522
d) Partition the dataset houseData into two “random” portions : Train Data
(1100 rows) and Test Data (360 rows).
Split the dataset in Train and Test sets, uniformly at random.
Train Set with 1100 samples and Test Set with 360 samples.
In [99]: # Import the required function from sklearn
from sklearn.model_selection import train_test_split
# Extract Response and Predictors
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData['GrLivArea'])
# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360) # hold out 360 rows for Test
# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)
# Create a joint dataframe by concatenating the two variables
trainDF = pd.concat([X_train, y_train], axis = 1).reindex(X_train.index)
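The cell that produced Out[100] is missing from this export; a sketch of the fitting step, with synthetic stand-ins for the real X_train (GrLivArea) and y_train (SalePrice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-in data: a roughly linear SalePrice vs GrLivArea relation
rng = np.random.default_rng(42)
X_train = rng.uniform(500, 3000, size=(100, 1))
y_train = 110 * X_train + rng.normal(0, 20000, size=(100, 1))

linreg = LinearRegression()   # create the model object
linreg.fit(X_train, y_train)  # fit on the Train Set
print(linreg.intercept_, linreg.coef_)
```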
Out[100]: LinearRegression()
f) Print the coefficients of the Linear Regression model you just fit, and plot
the regression line on a scatterplot
Check the coefficients of the Linear Regression model you just fit.
Intercept : b = [14000.8848827]
Coefficients : a = [[109.85619063]]
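The cell that produced these coefficients is not shown above; a minimal sketch of the fit, with toy data standing in for the actual train split (variable names mirror the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the GrLivArea / SalePrice train split
rng = np.random.default_rng(42)
X_train = pd.DataFrame({'GrLivArea': rng.uniform(500, 3000, 200)})
y_train = pd.DataFrame({'SalePrice': 110 * X_train['GrLivArea'] + 14000
                                     + rng.normal(0, 20000, 200)})

# Fit the Linear Regression model on Train Data
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Intercept and slope, as printed in the notebook
print('Intercept : b =', linreg.intercept_)
print('Coefficients : a =', linreg.coef_)
```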
g) Print Explained Variance (R^2) and Mean Squared Error (MSE) on Train
Data to check Goodness of Fit of model.
h) Predict SalePrice in case of Test Data using the Linear Regression model
and the predictor variable GrLivArea
GrLivArea
=========================================
Train R^2: 0.5123006711203322
Train MSE: 2935725152.5729613
Test MSE: 3772627647.534909
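The R^2 and MSE figures above come from `LinearRegression.score` and `mean_squared_error`; a minimal sketch of the computation (toy data, not the actual split):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = rng.uniform(500, 3000, (200, 1))
y_train = 110 * X_train[:, 0] + rng.normal(0, 30000, 200)
X_test = rng.uniform(500, 3000, (60, 1))
y_test = 110 * X_test[:, 0] + rng.normal(0, 30000, 60)

linreg = LinearRegression().fit(X_train, y_train)

# Goodness of Fit on Train Data
print('Train R^2:', linreg.score(X_train, y_train))
print('Train MSE:', mean_squared_error(y_train, linreg.predict(X_train)))

# Accuracy of prediction on Test Data
print('Test MSE:', mean_squared_error(y_test, linreg.predict(X_test)))
```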
Perform all the above steps on “SalePrice” against each of the variables
“LotArea”, “TotalBsmtSF”, “GarageArea” one by one to perform individual
Linear Regressions and obtain individual univariate Linear Regression
Models in each case.
Check the relationship between the two variables : SalePrice and the Predictor.
In [64]: sb.jointplot(data = houseData, x = "LotArea", y = "SalePrice", height = 4)
In [65]: houseData.SalePrice.corr(houseData.LotArea)
Out[65]: 0.2638433538714057
Out[67]: LinearRegression()
Intercept : b = [153080.85711237]
Coefficients : a = [[2.49403785]]
LotArea
=========================================
Train R^2: 0.07650117075042129
Train MSE: 5559037260.985449
Test MSE: 6884271779.626366
In [77]: houseData.SalePrice.corr(houseData.TotalBsmtSF)
Out[77]: 0.6135805515591952
Out[79]: LinearRegression()
Visual Representation of the Linear Regression Model
Intercept : b = [67116.87650897]
Coefficients : a = [[107.54551488]]
TotalBsmtSF
=========================================
Train R^2: 0.396575160056191
Train MSE: 3632339384.9644613
Test MSE: 4889229988.232506
In [87]: houseData.SalePrice.corr(houseData.GarageArea)
Out[87]: 0.6234314389183618
Out[89]: LinearRegression()
Intercept : b = [69930.06931615]
Coefficients : a = [[231.73496835]]
GarageArea
=========================================
Train R^2: 0.39437068162335176
Train MSE: 3645609328.965528
Test MSE: 4504815332.85322
Compare and contrast the four models in terms of Explained Variance (R^2)
and Mean Squared Error (MSE) on Train Data, the accuracy of prediction on
Test Data, and comment on which model you think is the best to predict
“SalePrice”.
Values from an earlier random split (for comparison):
GrLivArea R^2: 0.5340296119421228 Train MSE: 3059483255 Test MSE: 3440839968
Compare and contrast the four models in terms of R^2 and MSE on Train Data, as well as MSE on Test
Data.
SalePrice vs GrLivArea has the best Explained Variance (R^2) out of the four models.
SalePrice vs LotArea has the worst Explained Variance (R^2) out of the four models.
Naturally, the model with GrLivArea is the best one in terms of just the Training accuracy.
We also find SalePrice vs GrLivArea has the minimum MSE on both the Train and Test Sets
compared to other models.
We also find SalePrice vs LotArea has the maximum MSE on both the Train and Test Sets
compared to other models.
Naturally, the model with GrLivArea is the best one in terms of Test accuracy as evident from MSE
(error) on the Test Set.
So, overall, the predictor GrLivArea is the best amongst the four in predicting SalePrice .
Did you notice? : Go back and check again the R^2 and MSE values for the four models. I am pretty sure
you did not get the exact same values as I did. This is due to the random selection of Train-Test sets. In
fact, if you run the above cells again, you will get a different set of R^2 and MSE values. If that is so, can
we really be confident that GrLivArea will always be the best variable to predict SalePrice ? Think
about it. ;-)
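One way to make such comparisons repeatable is to pass `random_state` to `train_test_split`, which fixes the split across runs (it does not settle which predictor is best in general; cross-validation would be a sounder tool for that). A minimal sketch with toy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1460).reshape(-1, 1)
y = np.arange(1460)

# Same random_state -> identical split on every run
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=360, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=360, random_state=42)

print((X_te1 == X_te2).all())
```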
In [110]: print ("GrLivArea \n =========================================")
print ("Train R^2:", GrLivAreaPredictR2)
print ("Train MSE:", GrLivAreaTrainMSE)
print ("Test MSE:", GrLivAreaTestMSE)
print()
print ("LotArea \n =========================================")
print ("Train R^2:", LotAreaPredictR2)
print ("Train MSE:", LotAreaTrainMSE)
print ("Test MSE:", LotAreaTestMSE)
print()
print ("TotalBsmtSF \n =========================================")
print ("Train R^2:", TotalBsmtSFPredictR2)
print ("Train MSE:", TotalBsmtSFTrainMSE)
print ("Test MSE:", TotalBsmtSFTestMSE)
print()
print ("GarageArea \n =========================================")
print ("Train R^2:", GarageAreaPredictR2)
print ("Train MSE:", GarageAreaTrainMSE)
print ("Test MSE:", GarageAreaTestMSE)
print()
print ("Note : the square root of MSE (RMSE) is roughly the standard deviation of the residuals")
GrLivArea
=========================================
Train R^2: 0.5123006711203322
Train MSE: 2935725152.5729613
Test MSE: 3772627647.534909
LotArea
=========================================
Train R^2: 0.07650117075042129
Train MSE: 5559037260.985449
Test MSE: 6884271779.626366
TotalBsmtSF
=========================================
Train R^2: 0.396575160056191
Train MSE: 3632339384.9644613
Test MSE: 4889229988.232506
GarageArea
=========================================
Train R^2: 0.39437068162335176
Train MSE: 3645609328.965528
Test MSE: 4504815332.85322
1. Note that LinearRegression() model can take more than one Predictor to model the Response
variable. Try using this feature to fit a Linear Regression model to predict “SalePrice” using all the
four variables “GrLivArea” , “LotArea” , “TotalBsmtSF” , and “GarageArea” . Print the Explained
Variance (R^2) of this multi-variate model on Train Data, and check the model’s accuracy of
prediction on the Test Data using Mean Squared Error (MSE).
Extract the required variables from the dataset, and then perform Multi-Variate Regression.
In [109]: # Import LinearRegression model from Scikit-Learn
from sklearn.linear_model import LinearRegression
# Import the required function from sklearn
from sklearn.model_selection import train_test_split
# Extract Response and Predictors ####################
y = pd.DataFrame(houseData['SalePrice'])
X = pd.DataFrame(houseData[['GrLivArea','LotArea','TotalBsmtSF','GarageArea']])
# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)
# Create a Linear Regression object
linreg = LinearRegression()
# Train the Linear Regression model
linreg.fit(X_train, y_train)
Out[109]: LinearRegression()
Note that you CANNOT visualize the model as a line on a 2D plot, as it is a multi-dimensional surface.
Intercept : b = [-35587.22778885]
Coefficients : a = [[70.53200529 0.26914264 56.88385943 98.63075464]]
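The R^2 / MSE evaluation asked for above works exactly as in the univariate case; a minimal sketch with toy data and hypothetical coefficients (not the notebook's actual values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1000, (400, 4))                    # 4 predictors, as in the notebook
y = X @ np.array([70.0, 0.3, 57.0, 99.0]) + rng.normal(0, 5000, 400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

linreg = LinearRegression().fit(X_train, y_train)

print('Train R^2:', linreg.score(X_train, y_train))
print('Test MSE :', mean_squared_error(y_test, linreg.predict(X_test)))
```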
Observation : The model with SalePrice against all the four variables GrLivArea , LotArea ,
TotalBsmtSF , GarageArea is definitely better!
2. Fit a Linear Regression model to predict “SalePrice” using all the numeric variables in the given
dataset. You may use all the numeric variables from Exercise 2. Print the Explained Variance (R^2) of
this multi-variate model on Train Data, and check the model’s accuracy of prediction on the Test Data
using Mean Squared Error (MSE).
3. Is the Explained Variance (R^2) of a multi-variate model equal to the Sum of Explained Variances
(R^2) of the component univariate models? If R^2 for “SalePrice” vs “GrLivArea” is 0.53 and R^2 of
“SalePrice” vs “LotArea” is 0.22, will the R^2 for “SalePrice” vs [“GrLivArea”, “LotArea”] be 0.75?
Experiment a little and think about it.
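A quick experiment suggests the answer is no in general: when predictors carry overlapping information, the sum of the univariate R^2 values overstates the joint R^2. A sketch with two correlated toy predictors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)   # x2 strongly correlated with x1
y = x1 + x2 + rng.normal(size=n)

r2_x1 = LinearRegression().fit(x1.reshape(-1, 1), y).score(x1.reshape(-1, 1), y)
r2_x2 = LinearRegression().fit(x2.reshape(-1, 1), y).score(x2.reshape(-1, 1), y)
X = np.column_stack([x1, x2])
r2_both = LinearRegression().fit(X, y).score(X, y)

print('R^2 x1 alone :', r2_x1)
print('R^2 x2 alone :', r2_x2)
print('R^2 both     :', r2_both)
print('Sum of parts :', r2_x1 + r2_x2)     # exceeds the joint R^2 here
```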
Out[257]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
In [264]: houseDataNum = houseData.select_dtypes(include = np.int64)
print("Data dims : ", houseDataNum.shape)
houseDataNum.info() # note that all variables are now int64
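Note that `include = np.int64` silently drops float columns such as LotFrontage. A broader alternative (a sketch, using a toy frame in place of `houseData`) selects every numeric dtype and handles missing values before regression:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotFrontage': [65.0, np.nan, 80.0],   # float64, with a missing value
    'MSZoning': ['RL', 'RM', 'RL'],        # object dtype, excluded
    'SalePrice': [208500, 181500, 223500],
})

numeric = df.select_dtypes(include='number')   # keeps int64 AND float64
print(numeric.columns.tolist())
print(numeric.dropna().shape)                  # drop rows with NaN before fitting
```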
Out[273]: LinearRegression()
Exercise 5 : Classification Tree
Essential Libraries
Let us begin by importing the essential Python Libraries.
The dataset is train.csv ; hence we use the read_csv function from Pandas.
Immediately after importing, take a quick look at the data using the head function.
Out[2]: Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... P
5 rows × 81 columns
Problem 1 : Predicting CentralAir using SalePrice
In [179]: houseData['CentralAir'].describe()
Print the ratio Y : N for CentralAir to check the imbalance in the classes.
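The ratio can be read off `value_counts`; a sketch, with counts chosen to mimic the heavy imbalance in the dataset (illustrative, not the exact figures):

```python
import pandas as pd

# Illustrative stand-in for houseData['CentralAir']
central = pd.Series(['Y'] * 1365 + ['N'] * 95, name='CentralAir')

counts = central.value_counts()
print(counts)
print('Ratio Y : N =', counts['Y'] / counts['N'])
```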
Good to note that the two boxplots for SalePrice , for CentralAir = Y and CentralAir = N , are
different from one another in terms of their median value, as well as spread. This means that CentralAir
has an effect on SalePrice , and hence, SalePrice will probably be an important variable in predicting
CentralAir . Boxplots do not tell us where to make the cuts though -- it will be easier to visualize in the
following swarmplot .
SWARMPLOT
C:\Users\timot\anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 39.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
Hmm, it seems that swarmplot asks you to decrease the size of the markers or use stripplot . What is
this stripplot anyway? Let's check.
In [184]: f = plt.figure(figsize=(16, 4))
sb.stripplot(x = 'SalePrice', y = 'CentralAir', data = houseData)
Now it's time to build the Decision Tree classifier. Import the DecisionTreeClassifier model from
sklearn.tree .
d) Partition the dataset houseData into two “random” portions : Train Data
(1100 rows) and Test Data (360 rows)
Split the dataset in Train and Test sets, uniformly at random.
Train Set with 1100 samples and Test Set with 360 samples.
In [186]: # Import the required function from sklearn
from sklearn.model_selection import train_test_split
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData['SalePrice'])
# Split the Dataset into random Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Check the sample sizes
print("Train Set :", X_train.shape, y_train.shape)
print("Test Set :", X_test.shape, y_test.shape)
e) Training : Fit a Decision Tree model on the Train Dataset to predict the
class (Y/N) of CentralAir using SalePrice.
Fit Decision Tree Classifier model on the Train Dataset.
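The fitting cell is not shown above; a minimal sketch (toy data standing in for the SalePrice train split, with a hypothetical labelling rule):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X_train = rng.uniform(30000, 500000, (1100, 1))        # stand-in for SalePrice
y_train = np.where(X_train[:, 0] > 100000, 'Y', 'N')   # toy labelling rule

dectree = DecisionTreeClassifier(max_depth=2)          # depth 2, as in the notebook
dectree.fit(X_train, y_train)

print(dectree.get_depth(), dectree.score(X_train, y_train))
```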
Out[99]: DecisionTreeClassifier(max_depth=2)
f) Visualize the Decision Tree model using the plot_tree function : from
sklearn.tree import plot_tree
g) Predict CentralAir for the train dataset using the Decision Tree model and
plot the Two-Way Confusion Matrix
Out[189]: <AxesSubplot:>
Confusion Matrix (rows : actual N/Y, columns : predicted N/Y)
* TPR = TP / (TP + FN) : True Positive Rate = True Positives / All Positives
* TNR = TN / (TN + FP) : True Negative Rate = True Negatives / All Negatives
* FPR = FP / (TN + FP) : False Positive Rate = False Positives / All Negatives
* FNR = FN / (TP + FN) : False Negative Rate = False Negatives / All Positives
In [190]: # Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()
# Print the Accuracy Measures from the CONFUSION MATRIX
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()
print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))
Train Data
Accuracy : 0.9390909090909091
i) Predict CentralAir for the test dataset using the Decision Tree model and
plot the Two-Way Confusion Matrix.
Check how good the predictions are on the Test Set. Metrics : Classification Accuracy and Confusion
Matrix.
In [192]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test) #X_test is used here
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), #y_test is used here
annot = True, fmt=".0f", annot_kws={"size": 58})
Out[192]: <AxesSubplot:>
Print the Classification Accuracy and all other Accuracy Measures from the Confusion Matrix.
Confusion Matrix (rows : actual N/Y, columns : predicted N/Y)
* TPR = TP / (TP + FN) : True Positive Rate = True Positives / All Positives
* TNR = TN / (TN + FP) : True Negative Rate = True Negatives / All Negatives
* FPR = FP / (TN + FP) : False Positive Rate = False Positives / All Negatives
* FNR = FN / (TP + FN) : False Negative Rate = False Negatives / All Positives
In [193]: # Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()
print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))
Test Data
Accuracy : 0.9527777777777777
Important : Note the huge imbalance in the False Positives and False Negatives in the confusion matrix.
False Positives are much higher in number than False Negatives in both Train and Test data. This is not
surprising -- actually, this is a direct effect of the huge Y vs N class imbalance in the response variable
CentralAir . As CentralAir = Y was more likely in the data, False Positives are more likely too. Think
about how you can fix it!
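One common remedy is to make mistakes on the minority class cost more via `class_weight='balanced'`. A sketch under toy assumptions (two overlapping Gaussian price distributions with roughly the Y : N imbalance; the means and spreads are invented):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
n_pos, n_neg = 1365, 95                      # roughly the Y : N imbalance
X = np.concatenate([rng.normal(200000, 60000, n_pos),
                    rng.normal(120000, 60000, n_neg)]).reshape(-1, 1)
y = np.array([1] * n_pos + [0] * n_neg)

plain    = DecisionTreeClassifier(max_depth=2).fit(X, y)
balanced = DecisionTreeClassifier(max_depth=2, class_weight='balanced').fit(X, y)

# The balanced tree should recover far more true negatives (class 0)
for name, clf in [('plain', plain), ('balanced', balanced)]:
    pred = clf.predict(X)
    tnr = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()
    print(name, 'TNR:', tnr)
```

The trade-off: the balanced tree gives up some overall accuracy to raise the True Negative Rate, which is usually what you want when the minority class matters.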
In [194]: SalePriceAccuracyTrain = dectree.score(X_train, y_train)
SalePriceTPRTrain = tpTrain/(tpTrain + fnTrain)
SalePriceTNRTrain = tnTrain/(tnTrain + fpTrain)
SalePriceFPRTrain = fpTrain/(tnTrain + fpTrain)
SalePriceFNRTrain = fnTrain/(tpTrain + fnTrain)
print ("Sale price Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",SalePriceAccuracyTrain)
print("TPRTrain: \t", SalePriceTPRTrain)
print("TNRTrain: \t", SalePriceTNRTrain)
print("FPRTrain: \t", SalePriceFPRTrain)
print("FNRTrain: \t", SalePriceFNRTrain)
SalePriceAccuracyTest = dectree.score(X_test, y_test)
SalePriceTPRTest = tpTest/(tpTest + fnTest)
SalePriceTNRTest = tnTest/(tnTest + fpTest)
SalePriceFPRTest = fpTest/(tnTest + fpTest)
SalePriceFNRTest = fnTest/(tpTest + fnTest)
print ("Sale price Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",SalePriceAccuracyTest)
print("TPRTest: \t", SalePriceTPRTest)
print("TNRTest: \t", SalePriceTNRTest)
print("FPRTest: \t", SalePriceFPRTest)
print("FNRTest: \t", SalePriceFNRTest)
Out[249]: DecisionTreeClassifier(max_depth=2)
In [250]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree
f = plt.figure(figsize=(7,7))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])
In [251]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()
print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})
Train Data
Accuracy : 0.9309090909090909
Out[251]: <AxesSubplot:>
Test Data
Accuracy : 0.9583333333333334
Out[253]: <AxesSubplot:>
In [254]: GrLivAreaAccuracyTrain = dectree.score(X_train, y_train)
GrLivAreaTPRTrain = tpTrain/(tpTrain + fnTrain)
GrLivAreaTNRTrain = tnTrain/(tnTrain + fpTrain)
GrLivAreaFPRTrain = fpTrain/(tnTrain + fpTrain)
GrLivAreaFNRTrain = fnTrain/(tpTrain + fnTrain)
print ("GrLivArea Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",GrLivAreaAccuracyTrain)
print("TPRTrain: \t", GrLivAreaTPRTrain)
print("TNRTrain: \t", GrLivAreaTNRTrain)
print("FPRTrain: \t", GrLivAreaFPRTrain)
print("FNRTrain: \t", GrLivAreaFNRTrain)
print ()
GrLivAreaAccuracyTest = dectree.score(X_test, y_test)
GrLivAreaTPRTest = tpTest/(tpTest + fnTest)
GrLivAreaTNRTest = tnTest/(tnTest + fpTest)
GrLivAreaFPRTest = fpTest/(tnTest + fpTest)
GrLivAreaFNRTest = fnTest/(tpTest + fnTest)
print ("GrLivArea Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",GrLivAreaAccuracyTest)
print("TPRTest: \t", GrLivAreaTPRTest)
print("TNRTest: \t", GrLivAreaTNRTest)
print("FPRTest: \t", GrLivAreaFPRTest)
print("FNRTest: \t", GrLivAreaFNRTest)
GrLivArea Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9309090909090909
TPRTrain: 1.0
TNRTrain: 0.0379746835443038
FPRTrain: 0.9620253164556962
FNRTrain: 0.0
GrLivArea Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9583333333333334
TPRTest: 1.0
TNRTest: 0.0625
FPRTest: 0.9375
FNRTest: 0.0
Out[260]: DecisionTreeClassifier(max_depth=2)
In [261]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree
f = plt.figure(figsize=(8,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])
In [262]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()
print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
Train Data
Accuracy : 0.94
Out[262]: <AxesSubplot:>
In [263]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)
# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()
print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
annot = True, fmt=".0f", annot_kws={"size": 50})
Test Data
Accuracy : 0.9388888888888889
Out[263]: <AxesSubplot:>
In [264]: OverallQualAccuracyTrain = dectree.score(X_train, y_train)
OverallQualTPRTrain = tpTrain/(tpTrain + fnTrain)
OverallQualTNRTrain = tnTrain/(tnTrain + fpTrain)
OverallQualFPRTrain = fpTrain/(tnTrain + fpTrain)
OverallQualFNRTrain = fnTrain/(tpTrain + fnTrain)
print ("OverallQual Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",OverallQualAccuracyTrain)
print("TPRTrain: \t", OverallQualTPRTrain)
print("TNRTrain: \t", OverallQualTNRTrain)
print("FPRTrain: \t", OverallQualFPRTrain)
print("FNRTrain: \t", OverallQualFNRTrain)
print ()
OverallQualAccuracyTest = dectree.score(X_test, y_test)
OverallQualTPRTest = tpTest/(tpTest + fnTest)
OverallQualTNRTest = tnTest/(tnTest + fpTest)
OverallQualFPRTest = fpTest/(tnTest + fpTest)
OverallQualFNRTest = fnTest/(tpTest + fnTest)
print ("OverallQual Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",OverallQualAccuracyTest)
print("TPRTest: \t", OverallQualTPRTest)
print("TNRTest: \t", OverallQualTNRTest)
print("FPRTest: \t", OverallQualFPRTest)
print("FNRTest: \t", OverallQualFNRTest)
OverallQual Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.94
TPRTrain: 0.9941804073714839
TNRTrain: 0.13043478260869565
FPRTrain: 0.8695652173913043
FNRTrain: 0.005819592628516004
OverallQual Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9388888888888889
TPRTest: 0.9910179640718563
TNRTest: 0.2692307692307692
FPRTest: 0.7307692307692307
FNRTest: 0.008982035928143712
Out[268]: DecisionTreeClassifier(max_depth=2)
In [269]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree
f = plt.figure(figsize=(8,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])
In [270]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
# Print the Classification Accuracy
print("Train Data")
print("Accuracy :\t", dectree.score(X_train, y_train))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Y (1) predicted Y (1)
fpTrain = cmTrain[0][1] # False Positives : N (0) predicted Y (1)
tnTrain = cmTrain[0][0] # True Negatives : N (0) predicted N (0)
fnTrain = cmTrain[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()
print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred),
annot = True, fmt=".0f", annot_kws={"size": 18})
Train Data
Accuracy : 0.9345454545454546
Out[270]: <AxesSubplot:>
Test Data
Accuracy : 0.9361111111111111
Out[271]: <AxesSubplot:>
In [272]: YearBuiltAccuracyTrain = dectree.score(X_train, y_train)
YearBuiltTPRTrain = tpTrain/(tpTrain + fnTrain)
YearBuiltTNRTrain = tnTrain/(tnTrain + fpTrain)
YearBuiltFPRTrain = fpTrain/(tnTrain + fpTrain)
YearBuiltFNRTrain = fnTrain/(tpTrain + fnTrain)
print ("YearBuilt Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",YearBuiltAccuracyTrain)
print("TPRTrain: \t", YearBuiltTPRTrain)
print("TNRTrain: \t", YearBuiltTNRTrain)
print("FPRTrain: \t", YearBuiltFPRTrain)
print("FNRTrain: \t", YearBuiltFNRTrain)
print ()
YearBuiltAccuracyTest = dectree.score(X_test, y_test)
YearBuiltTPRTest = tpTest/(tpTest + fnTest)
YearBuiltTNRTest = tnTest/(tnTest + fpTest)
YearBuiltFPRTest = fpTest/(tnTest + fpTest)
YearBuiltFNRTest = fnTest/(tpTest + fnTest)
print ("YearBuilt Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",YearBuiltAccuracyTest)
print("TPRTest: \t", YearBuiltTPRTest)
print("TNRTest: \t", YearBuiltTNRTest)
print("FPRTest: \t", YearBuiltFPRTest)
print("FNRTest: \t", YearBuiltFNRTest)
YearBuilt Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9345454545454546
TPRTrain: 1.0
TNRTrain: 0.0
FPRTrain: 1.0
FNRTrain: 0.0
YearBuilt Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9361111111111111
TPRTest: 1.0
TNRTest: 0.0
FPRTest: 1.0
FNRTest: 0.0
CentralAir vs SalePrice has the highest Training Accuracy out of the four models.
CentralAir vs GrLivArea has the highest Test Accuracy out of the four models.
However, the train and test accuracies for all four models are quite high and close to one another,
so it is hard to judge which model is better from classification accuracy alone.
However, if we look at the True Positive Rate (TPR) and False Positive Rate (FPR) of the four models, we
find that
YearBuilt yields a TPR of 1 (best-case) but an FPR of 1 (worst-case) on both Train and Test data.
Really bad for prediction.
GrLivArea yields a TPR of close to 1 (best-case) but an FPR of close to 1 (worst-case) on Train and
Test set, not good either.
SalePrice and OverallQual yield the best TPR (high) vs FPR (not-as-high) trade-off in case of
both Train and Test data.
Overall, the predictor OverallQual is the best amongst the four in predicting CentralAir , while
SalePrice is a close second as per the models above. YearBuilt is definitely the worst predictor out
of these four variables, with GrLivArea not doing so well either, given the models above.
Did you notice? : Go back and check again all accuracy figures for the four models. I am pretty sure you
did not get the exact same values as I did. This is due to the random selection of Train-Test sets. In fact,
if you run the above cells again, you will get a different set of accuracy figures. If that is so, can we really
be confident that OverallQual will always be the best variable to predict CentralAir ? Think about it.
;-)
In [274]: print ("Sale price Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",SalePriceAccuracyTrain)
print("TPRTrain: \t", SalePriceTPRTrain)
print("TNRTrain: \t", SalePriceTNRTrain)
print("FPRTrain: \t", SalePriceFPRTrain)
print("FNRTrain: \t", SalePriceFNRTrain)
print ("Sale price Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",SalePriceAccuracyTest)
print("TPRTest: \t", SalePriceTPRTest)
print("TNRTest: \t", SalePriceTNRTest)
print("FPRTest: \t", SalePriceFPRTest)
print("FNRTest: \t", SalePriceFNRTest)
print()
print()
print ("GrLivArea Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",GrLivAreaAccuracyTrain)
print("TPRTrain: \t", GrLivAreaTPRTrain)
print("TNRTrain: \t", GrLivAreaTNRTrain)
print("FPRTrain: \t", GrLivAreaFPRTrain)
print("FNRTrain: \t", GrLivAreaFNRTrain)
print ()
print ("GrLivArea Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",GrLivAreaAccuracyTest)
print("TPRTest: \t", GrLivAreaTPRTest)
print("TNRTest: \t", GrLivAreaTNRTest)
print("FPRTest: \t", GrLivAreaFPRTest)
print("FNRTest: \t", GrLivAreaFNRTest)
print ()
print ()
print ("OverallQual Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",OverallQualAccuracyTrain)
print("TPRTrain: \t", OverallQualTPRTrain)
print("TNRTrain: \t", OverallQualTNRTrain)
print("FPRTrain: \t", OverallQualFPRTrain)
print("FNRTrain: \t", OverallQualFNRTrain)
print ()
print ("OverallQual Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",OverallQualAccuracyTest)
print("TPRTest: \t", OverallQualTPRTest)
print("TNRTest: \t", OverallQualTNRTest)
print("FPRTest: \t", OverallQualFPRTest)
print("FNRTest: \t", OverallQualFNRTest)
print ()
print ()
print ("YearBuilt Train:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",YearBuiltAccuracyTrain)
print("TPRTrain: \t", YearBuiltTPRTrain)
print("TNRTrain: \t", YearBuiltTNRTrain)
print("FPRTrain: \t", YearBuiltFPRTrain)
print("FNRTrain: \t", YearBuiltFNRTrain)
print ()
print ("YearBuilt Test:\n ```````````````````````````````````````````````````````````````")
print("Accuracy :\t",YearBuiltAccuracyTest)
print("TPRTest: \t", YearBuiltTPRTest)
print("TNRTest: \t", YearBuiltTNRTest)
print("FPRTest: \t", YearBuiltFPRTest)
print("FNRTest: \t", YearBuiltFNRTest)
GrLivArea Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9309090909090909
TPRTrain: 1.0
TNRTrain: 0.0379746835443038
FPRTrain: 0.9620253164556962
FNRTrain: 0.0
GrLivArea Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9583333333333334
TPRTest: 1.0
TNRTest: 0.0625
FPRTest: 0.9375
FNRTest: 0.0
OverallQual Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.94
TPRTrain: 0.9941804073714839
TNRTrain: 0.13043478260869565
FPRTrain: 0.8695652173913043
FNRTrain: 0.005819592628516004
OverallQual Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9388888888888889
TPRTest: 0.9910179640718563
TNRTest: 0.2692307692307692
FPRTest: 0.7307692307692307
FNRTest: 0.008982035928143712
YearBuilt Train:
```````````````````````````````````````````````````````````````
Accuracy : 0.9345454545454546
TPRTrain: 1.0
TNRTrain: 0.0
FPRTrain: 1.0
FNRTrain: 0.0
YearBuilt Test:
```````````````````````````````````````````````````````````````
Accuracy : 0.9361111111111111
TPRTest: 1.0
TNRTest: 0.0
FPRTest: 1.0
FNRTest: 0.0
Extra : Predicting CentralAir using All Variables
1. Note that the DecisionTreeClassifier() model can take more than one
Predictor to model the Response variable. Try to fit a Decision Tree model
to predict “CentralAir” using all the four variables “SalePrice” ,
“GrLivArea” , “OverallQual” and “YearBuilt” . Print the Classification
Accuracy of this multi-variate model on Train and Test datasets, and check
the model’s reliability of prediction on Train and Test data using the
confusion matrices.
Use all the other 4 variables from the dataset to predict CentralAir , as mentioned in the problem.
Out[80]: DecisionTreeClassifier(max_depth=2)
In [81]: # Plot the trained Decision Tree
from sklearn.tree import plot_tree
f = plt.figure(figsize=(9,6))
plot_tree(dectree, filled=True, rounded=True,
feature_names=X_train.columns,
class_names=["N","Y"])
Train Data
Accuracy : 0.9418181818181818
Out[82]: <AxesSubplot:>
In [83]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)
# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()
print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt = ".0f", annot_kws = {"size": 60})
Test Data
Accuracy : 0.9305555555555556
Out[83]: <AxesSubplot:>
Observation : The model predicting CentralAir from all four variables SalePrice , GrLivArea ,
OverallQual and YearBuilt is not necessarily better than the single-variable models. That may seem strange!
However, there is also room to play with the max_depth of the Decision Tree.
Try other values and check for yourself. :-)
Experiment with the max_depth of the Decision Tree to see how the accuracy and confusion
matrices vary on the Train and Test sets. Think about it!
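One way to run that experiment systematically is to loop over several max_depth values and record the Train and Test accuracies for each. A sketch, using synthetic data as a stand-in for the four house predictors:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four predictors and CentralAir response
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    'SalePrice':   rng.normal(180000, 50000, n),
    'GrLivArea':   rng.normal(1500, 400, n),
    'OverallQual': rng.integers(1, 11, n),
    'YearBuilt':   rng.integers(1900, 2011, n),
})
# Make the label loosely depend on OverallQual, with noise
y = (X['OverallQual'] + rng.normal(0, 2, n) > 5).map({True: 'Y', False: 'N'})

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100, random_state=0)

# Fit one tree per depth and record (train accuracy, test accuracy)
results = {}
for depth in [1, 2, 3, 4, 6, 8, None]:   # None = grow the tree fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, results[depth])
```

Typically the Train accuracy keeps rising with depth while the Test accuracy levels off or drops, which is the overfitting trade-off the exercise is hinting at.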
In [84]: # Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Extract Response and Predictors
y = pd.DataFrame(houseData['CentralAir'])
X = pd.DataFrame(houseData[['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt']]) #or
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3) # CHANGE IT HERE AND EXPERIMENT
dectree.fit(X_train, y_train) # train the decision tree model
# Plot the trained Decision Tree
from sklearn.tree import plot_tree
f = plt.figure(figsize=(26,10))
plot_tree(dectree, filled=True, rounded=True,
          feature_names=X_train.columns,
          class_names=["N","Y"])
Train Data
Accuracy : 0.95
Out[85]: <AxesSubplot:>
In [86]: # Import the required metric from sklearn
from sklearn.metrics import confusion_matrix
# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)
# Print the Classification Accuracy
print("Test Data")
print("Accuracy :\t", dectree.score(X_test, y_test))
print()
# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Y (1) predicted Y (1)
fpTest = cmTest[0][1] # False Positives : N (0) predicted Y (1)
tnTest = cmTest[0][0] # True Negatives : N (0) predicted N (0)
fnTest = cmTest[1][0] # False Negatives : Y (1) predicted N (0)
print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()
print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))
# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred),
           annot = True, fmt = ".0f", annot_kws = {"size": 60})
Test Data
Accuracy : 0.9277777777777778
Out[86]: <AxesSubplot:>
2. Fit a Decision Tree model to predict “CentralAir” using all numeric
variables in the given dataset. You may use all numeric variables from
Exercise 2. Print the Classification Accuracy of this multi-variate model on
Train and Test data, and check the model’s reliability of prediction on Train
and Test data using the confusion matrices.
In [87]: df = pd.read_csv('train.csv')
(1460, 35)
(1121, 35)
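The In [87] cell evidently keeps the numeric columns and drops incomplete rows, shrinking 1460 rows to 1121. The clean_dataset helper used in the next cell is not defined in this excerpt; a minimal version consistent with how it is called there might be:

```python
import numpy as np
import pandas as pd

def clean_dataset(df):
    """Drop rows containing NaN or infinite values, keeping the original
    index so the response can be aligned with it afterwards."""
    df = df.replace([np.inf, -np.inf], np.nan)
    return df.dropna()

# Tiny demonstration frame (hypothetical values, not from train.csv)
demo = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0],
                     'LotArea':     [8450, 9600, 11250]})

# Selecting numeric columns, then cleaning, mirrors the (1460, 35) -> (1121, 35) step
df_num = demo.select_dtypes(include='number')
print(clean_dataset(df_num).shape)   # the NaN row is dropped: (2, 2)
```

Note that dropna preserves the surviving row labels, which is what allows the notebook's later `df['CentralAir'].iloc[X.index]` alignment to pick the matching response rows.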
False Positive Rates of the various Decision Tree models are (significantly) higher than the False Negative
Rates. The False Positive Rate is the complement of the True Negative Rate (the two sum to 1), and the
False Negative Rate is the complement of the True Positive Rate (likewise summing to 1). Since positive
samples dominate the dataset, the True Positive Rate is higher than the True Negative Rate, and hence
the False Positive Rate is higher than the False Negative Rate.
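The complement relations can be checked directly on any 2x2 confusion matrix; here is a small example with hypothetical counts chosen so that positives dominate, as in the house dataset:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual (N, Y), cols = predicted (N, Y)
cm = np.array([[ 5, 15],    # actual N : 5 TN, 15 FP
               [ 2, 98]])   # actual Y : 2 FN, 98 TP
tn, fp = cm[0]
fn, tp = cm[1]

tpr = tp / (tp + fn)   # True Positive Rate
tnr = tn / (tn + fp)   # True Negative Rate
fpr = fp / (fp + tn)   # False Positive Rate = 1 - TNR
fnr = fn / (fn + tp)   # False Negative Rate = 1 - TPR

print(fpr + tnr, fnr + tpr)   # both sums are exactly 1.0
```

With far more actual positives than negatives, even a few misclassified negatives push the FPR high while the FNR stays tiny, exactly the pattern seen in the trial outputs below.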
To test whether this difference is significant, 30 random train-test splits are run for each Decision Tree
model, and a significance test is then performed on the collected rates.
In [97]: Index_Trials = []
Index_Predictors = []
FPR_train = []
FNR_train = []
FPR_test = []
FNR_test = []
predictors = ['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt',
              ['SalePrice', 'GrLivArea', 'OverallQual', 'YearBuilt'],
              df_num.columns]
for j, X_names in enumerate(predictors):
    for i in range(30):
        print(j, '-', i)
        X = clean_dataset(pd.DataFrame(df[X_names]))       # Predictor
        y = pd.DataFrame(df['CentralAir'].iloc[X.index])   # Response
        # Split the Dataset into Train and Test
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 360)
        # Check the sample sizes
        print("Train Set :", y_train.shape, X_train.shape)
        print("Test Set :", y_test.shape, X_test.shape)
        # Decision Tree using Train Data
        dectree = DecisionTreeClassifier(max_depth = 2)    # create the decision tree object
        dectree.fit(X_train, y_train)                      # train the decision tree model
        y_train_pred = dectree.predict(X_train)
        y_test_pred = dectree.predict(X_test)
        cm_train = confusion_matrix(y_train, y_train_pred)
        cm_test = confusion_matrix(y_test, y_test_pred)
        fpr_train = cm_train[0,1] / np.sum(cm_train[0,:])
        fnr_train = cm_train[1,0] / np.sum(cm_train[1,:])
        # Check the Goodness of Fit (on Train Data)
        print("Goodness of Fit of Model \tTrain Dataset")
        print("False Positive Rate \t: {:.3f}".format(fpr_train))
        print("False Negative Rate \t: {:.3f}".format(fnr_train))
        print()
        fpr_test = cm_test[0,1] / np.sum(cm_test[0,:])
        fnr_test = cm_test[1,0] / np.sum(cm_test[1,:])
        # Check the Goodness of Fit (on Test Data)
        print("Goodness of Fit of Model \tTest Dataset")
        print("False Positive Rate \t: {:.3f}".format(fpr_test))
        print("False Negative Rate \t: {:.3f}".format(fnr_test))
        print()
        Index_Trials.append(i)
        Index_Predictors.append(j)
        FPR_train.append(fpr_train)
        FNR_train.append(fnr_train)
        FPR_test.append(fpr_test)
        FNR_test.append(fnr_test)
0 - 0
Train Set : (1100, 1) (1100, 1)
Test Set : (360, 1) (360, 1)
Goodness of Fit of Model Train Dataset
False Positive Rate : 0.835
False Negative Rate : 0.002
Goodness of Fit of Model Test Dataset
False Positive Rate : 1.000
False Negative Rate : 0.000
0 - 1
Train Set : (1100, 1) (1100, 1)
Test Set : (360, 1) (360, 1)
Goodness of Fit of Model Train Dataset
False Positive Rate : 0.875
False Negative Rate : 0.002
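The significance test itself is not shown in this excerpt. With 30 paired (FPR, FNR) values per model, one reasonable choice is a paired non-parametric test such as the Wilcoxon signed-rank test from scipy; the sketch below uses synthetic stand-ins for one model's 30 test-set rates (in the notebook they would come from FPR_test and FNR_test):

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic stand-ins for 30 trials of one model's test-set rates,
# mimicking the pattern above: FPR consistently high, FNR consistently low
rng = np.random.default_rng(0)
fpr_trials = rng.uniform(0.7, 1.0, 30)
fnr_trials = rng.uniform(0.0, 0.05, 30)

# One-sided paired test: is FPR stochastically greater than FNR?
stat, p = wilcoxon(fpr_trials, fnr_trials, alternative='greater')
print("p-value:", p)   # a tiny p-value indicates FPR significantly exceeds FNR
```

A paired t-test (scipy.stats.ttest_rel) would be an alternative if the rate differences looked approximately normal; the Wilcoxon test avoids that assumption.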