Eduonix - Model Development Production Case Study - Part 2 CODE - Jupyter Notebook

Uploaded by

member2 mtri
4/7/2021 Eduonix_Model Development Production Case Study - Part 2 CODE - Jupyter Notebook

In [1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [2]: df_penguins=sns.load_dataset("penguins")

In [3]: df_penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [4]: df_penguins.head(10)

Out[4]:
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
5   Adelie  Torgersen            39.3           20.6              190.0       3650.0    Male
6   Adelie  Torgersen            38.9           17.8              181.0       3625.0  Female
7   Adelie  Torgersen            39.2           19.6              195.0       4675.0    Male
8   Adelie  Torgersen            34.1           18.1              193.0       3475.0     NaN
9   Adelie  Torgersen            42.0           20.2              190.0       4250.0     NaN

In [5]: #looking at our quantitative data
df_penguins.describe()

Out[5]:
       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      342.000000     342.000000         342.000000   342.000000
mean        43.921930      17.151170         200.915205  4201.754386
std          5.459584       1.974793          14.061714   801.954536
min         32.100000      13.100000         172.000000  2700.000000
25%         39.225000      15.600000         190.000000  3550.000000
50%         44.450000      17.300000         197.000000  4050.000000
75%         48.500000      18.700000         213.000000  4750.000000
max         59.600000      21.500000         231.000000  6300.000000

In [6]: #looking at our categorical data
df_penguins.describe(include=['O'])

Out[6]:
        species  island   sex
count       344     344   333
unique        3       3     2
top      Adelie  Biscoe  Male
freq        152     168   168


In [7]: #looks like we are missing the sex for a few penguins
#since we still have a fairly large dataset, let's remove these rows to avoid
#odd visualizations
df_penguins = df_penguins[df_penguins["sex"].notnull()]
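
The boolean-mask filter above can also be written with `dropna`. The following is a minimal sketch on a hypothetical miniature frame (`toy` and its values are made up for illustration, not taken from the penguins data), showing that the two spellings give the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for df_penguins
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 3500.0],
    "sex": ["Male", np.nan, "Female", np.nan],
})

# Boolean-mask filter, as in the cell above
by_mask = toy[toy["sex"].notnull()]

# Equivalent spelling: drop rows only where the "sex" column is missing
by_dropna = toy.dropna(subset=["sex"])

print(by_mask.equals(by_dropna))
print(len(toy), "->", len(by_dropna))
```

Restricting `dropna` to `subset=["sex"]` keeps the intent explicit: only a missing sex value removes a row.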

In [8]: #looking for missing values
df_penguins.isnull().sum()

Out[8]: species 0
island 0
bill_length_mm 0
bill_depth_mm 0
flipper_length_mm 0
body_mass_g 0
sex 0
dtype: int64

In [10]: df_penguins.columns

Out[10]: Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')


In [13]: sns.set(color_codes=True)
#choosing a color palette from seaborn
#options: pastel, muted, bright, deep, colorblind, dark
#tons of other ways to do this as well!
colors = sns.color_palette("bright")

#first let's set up our plotting area; a 2x2 grid gives us 4 spots to plot
fig,axes = plt.subplots(2,2, figsize = (10,10))

sns.histplot(df_penguins["bill_length_mm"], color = colors[0], ax = axes[0,0])
sns.histplot(df_penguins["bill_depth_mm"], color = colors[1], ax = axes[0,1])
sns.histplot(df_penguins["flipper_length_mm"], color = colors[2], ax = axes[1,0])
sns.histplot(df_penguins["body_mass_g"], color = colors[3], ax = axes[1,1])

plt.suptitle("Distribution of Quantitative Variables", y= 1.01, size = 16)
plt.tight_layout()
plt.show()


In [14]: sns.set(color_codes=True)
#choosing a color palette from seaborn
#options: pastel, muted, bright, deep, colorblind, dark
#tons of other ways to do this as well!
colors = sns.color_palette("bright")

#first let's set up our plotting area; a 2x2 grid gives us 4 spots to plot
fig,axes = plt.subplots(2,2, figsize = (10,10))

sns.boxplot(y=df_penguins["bill_length_mm"], color = colors[0], ax = axes[0,0])
sns.boxplot(y=df_penguins["bill_depth_mm"], color = colors[1], ax = axes[0,1])
sns.boxplot(y=df_penguins["flipper_length_mm"], color = colors[2], ax = axes[1,0])
sns.boxplot(y=df_penguins["body_mass_g"], color = colors[3], ax = axes[1,1])

plt.suptitle("Distribution of Quantitative Variables", y= 1.01, size = 16)
plt.tight_layout()
plt.show()


In [16]: #examining categorical data with some count plots
sns.set(color_codes=True)
#choosing a color palette from seaborn
#options: pastel, muted, bright, deep, colorblind, dark
#tons of other ways to do this as well!
colors = sns.color_palette("bright")

#first let's set up our plotting area; a 1x3 grid gives us 3 spots to plot
fig,axes = plt.subplots(1,3, figsize = (10,6))

sns.countplot(x="species", data = df_penguins ,ax = axes[0])
sns.countplot(x="island", data = df_penguins ,ax = axes[1])
sns.countplot(x="sex", data = df_penguins ,ax = axes[2])

#quick for loop in order to access all of the subplots' axes
for ax in fig.axes:
    ax.tick_params(labelrotation=90)

plt.suptitle("Categorical Variables", y= 1.01, size = 16)
plt.tight_layout()
plt.show()


In [17]: #looks like our male/female split is good
#not too concerned about the islands for our purposes today
#however, in terms of species there are definitely varying sample sizes
#let's adjust our data by first removing the 'island' column
#then looking closer at the species column, to create a new sample data set that
#this will avoid bias as we build out classification models

In [18]: df_penguins = df_penguins.drop(columns = "island")

In [19]: df_penguins["species"].value_counts()

Out[19]: Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [20]: #looks like we have 68 chinstrap, so let's get 68 randomly selected from the other
adelie = df_penguins[df_penguins["species"] == "Adelie"].sample(n=68)
gentoo = df_penguins[df_penguins["species"] == "Gentoo"].sample(n=68)

#getting the entire Chinstrap sample (sample(n=68) returns all 68 rows, in shuffled order)
chinstrap = df_penguins[df_penguins["species"] == "Chinstrap"].sample(n=68)
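
The same downsampling idea can be sketched in one step with `groupby(...).sample(...)`. This is an illustrative alternative, not the notebook's method: the frame `toy` is hypothetical, and `random_state=0` is an added assumption (the cell above does not fix a seed, so its rows differ from run to run).

```python
import pandas as pd

# Hypothetical unbalanced frame: 5 / 4 / 2 rows per class
toy = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Gentoo"] * 4 + ["Chinstrap"] * 2,
    "body_mass_g": range(11),
})

# The smallest class sets the per-class sample size
n_min = toy["species"].value_counts().min()

# Draw n_min rows from every species; the seed is an assumption added
# here so that the draw is repeatable
balanced = toy.groupby("species").sample(n=n_min, random_state=0)

print(balanced["species"].value_counts())
```

Deriving `n_min` from the data avoids hard-coding 68, so the sketch would still balance the classes if the filtering steps upstream changed.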

In [21]: #this technique keeps the result as a DataFrame
type(gentoo)

Out[21]: pandas.core.frame.DataFrame

In [22]: #now we need to merge these together for our new dataframe
#axis = 0 implies a vertical concat
new_peng = pd.concat([adelie, gentoo, chinstrap], axis = 0)

In [23]: new_peng.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 204 entries, 37 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 204 non-null object
1 bill_length_mm 204 non-null float64
2 bill_depth_mm 204 non-null float64
3 flipper_length_mm 204 non-null float64
4 body_mass_g 204 non-null float64
5 sex 204 non-null object
dtypes: float64(4), object(2)
memory usage: 11.2+ KB


In [24]: #now the last thing we need to do is our one hot encoding of the variables "sex",
#that means we'll do this in portions, then merge into our final dataframe we wil
peng_sex = pd.get_dummies(new_peng["sex"])

#dropping sex column from new_peng now before putting the one hot encoded column
new_peng = new_peng.drop(columns= "sex")

#axis = 1 implies a horizontal concat
final_df = pd.concat([new_peng, peng_sex], axis = 1)
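
The encode, drop, and concat sequence above can also be expressed as a single call by passing `columns=` to `pd.get_dummies`. A minimal sketch on a hypothetical three-row frame follows; note that this form prefixes the new columns (`sex_Female`, `sex_Male`) rather than producing the bare `Female` / `Male` names shown in the notebook, and `dtype=int` is chosen here so the output is 0/1 integers.

```python
import pandas as pd

# Hypothetical miniature frame standing in for new_peng
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
    "sex": ["Male", "Female", "Male"],
})

# One-hot encode "sex" in place: the original column is removed and the
# indicator columns are appended after the remaining columns
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```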

In [25]: final_df

Out[25]:
       species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  Female  Male
 37     Adelie            42.2           18.5              180.0       3550.0       1     0
 98     Adelie            33.1           16.1              178.0       2900.0       1     0
 63     Adelie            41.1           18.2              192.0       4050.0       0     1
 61     Adelie            41.3           21.1              195.0       4400.0       0     1
 29     Adelie            40.5           18.9              180.0       3950.0       0     1
...        ...             ...            ...                ...          ...     ...   ...
158  Chinstrap            46.1           18.2              178.0       3250.0       1     0
162  Chinstrap            46.6           17.8              193.0       3800.0       1     0
199  Chinstrap            49.0           19.6              212.0       4300.0       0     1
155  Chinstrap            45.4           18.7              188.0       3525.0       1     0
192  Chinstrap            49.0           19.5              210.0       3950.0       0     1

204 rows × 7 columns

In [26]: #random state, pick a #, ensures that if we go back and re execute code that the
#this creates our two dataframes to pull from to build models, and then unbiasly
#we use a very high test size of 0.5, this data is actually pretty easy to classi
train_df, test_df = train_test_split(final_df, test_size=0.5, random_state=32)

In [27]: #separating our testing data into two data frames, separating the species column fo
#make sure to copy the column before dropping it in the second line here
#for now we'll also drop the categorical variables, those will require additional
Y_test = test_df["species"]
X_test = test_df.drop(columns= ["species"])

Y_train = train_df["species"]
X_train = train_df.drop(columns= ["species"])
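
A plain random split does not preserve the 68/68/68 balance built earlier; the classification report further down lists test supports of 40, 30 and 32. If equal class shares in both halves are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses a hypothetical balanced frame and is a possible variation, not what the notebook does:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame: 10 rows per species
toy = pd.DataFrame({
    "species": ["Adelie"] * 10 + ["Gentoo"] * 10 + ["Chinstrap"] * 10,
    "body_mass_g": range(30),
})

# stratify keeps each species' share the same in the train and test halves
train_df, test_df = train_test_split(
    toy, test_size=0.5, random_state=32, stratify=toy["species"]
)

print(test_df["species"].value_counts())
```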


In [28]: #assigning an integer id to each species (factorize numbers them in order of first appearance)
final_df['spec_id'] = final_df['species'].factorize()[0]

#building in needed dictionaries for later use
spec_id_df = final_df[['species', 'spec_id']].drop_duplicates().sort_values('spec

In [29]: #building our classifier for naive bayes


model_NB = MultinomialNB()

In [30]: model_NB.fit(X_train, Y_train)

Out[30]: MultinomialNB()

In [31]: Y_pred_NB = model_NB.predict(X_test)

In [32]: #building confusion matrix for this classifier


NB_conf_matrix = confusion_matrix(Y_test, Y_pred_NB)

In [33]: #heatmap & confusion matrix for NB


sns.set(color_codes=True)
sns.heatmap(NB_conf_matrix, cmap='Greens', annot = True, linewidths=.5,
xticklabels = spec_id_df["species"].values, yticklabels = spec_id_df[
plt.show()
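
One thing to watch with the heatmap labels: `confusion_matrix` orders its rows and columns by sorted label (here Adelie, Chinstrap, Gentoo) unless `labels=` is given, while `factorize` numbers species in order of first appearance. If the tick labels come from a frame ordered differently, they may not line up with the matrix. A minimal sketch of one way to keep them in sync, using the fitted model's `classes_` attribute on hypothetical toy data (the arrays `X` and `y` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative features and string labels
X = np.array([[5, 1], [6, 1], [1, 5], [1, 6], [3, 3], [4, 4]])
y = np.array(["Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap", "Chinstrap"])

model = MultinomialNB().fit(X, y)

# classes_ holds the labels in the sorted order the estimator uses internally
labels = model.classes_

# Passing labels= pins the row/column order of the matrix explicitly
matrix = confusion_matrix(y, model.predict(X), labels=labels)

print(labels)
print(matrix)
```

The same `labels` array can then be passed as both `xticklabels` and `yticklabels` to `sns.heatmap`.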


In [34]: #printing out the classification reports


print("Classification Report: Naive Bayes")
print(classification_report(Y_test, Y_pred_NB))

Classification Report: Naive Bayes
              precision    recall  f1-score   support

      Adelie       0.70      0.53      0.60        40
   Chinstrap       0.67      0.87      0.75        30
      Gentoo       0.82      0.84      0.83        32

    accuracy                           0.73       102
   macro avg       0.73      0.75      0.73       102
weighted avg       0.73      0.73      0.72       102

In [35]: #how to interpret this

#Precision is the ability of a classifier not to label an instance positive that is actually negative.
#For each class it is defined as the ratio of true positives to the sum of true and false positives.

#Recall is the ability of a classifier to find all positive instances.
#For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

#The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0
#and the worst is 0.0. Generally speaking, F1 scores
#are lower than accuracy measures as they embed precision and recall into their computation.
#As a rule of thumb, the weighted average of F1 should
#be used to compare classifier models, not global accuracy.

#support is the number of samples of each class in the test data
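
To make the definitions above concrete, the per-class values can be read directly off a confusion matrix: the diagonal holds the true positives, column sums give everything predicted as a class, and row sums give everything that truly belongs to it. The labels below are hypothetical toy values, and the result is checked against scikit-learn's own `precision_recall_fscore_support`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical true and predicted labels
y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "b", "b", "b", "c", "a", "c"]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted

tp = np.diag(cm)                       # true positives per class
precision = tp / cm.sum(axis=0)        # TP / (TP + FP): divide by column sums
recall = tp / cm.sum(axis=1)           # TP / (TP + FN): divide by row sums
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)               # number of true samples per class

# Reference values from scikit-learn for comparison
p_ref, r_ref, f_ref, s_ref = precision_recall_fscore_support(y_true, y_pred)

print(precision, recall, f1, support)
```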


In [36]: #let's rename some things here to make this model more understandable
features = X_train.copy()
targets = Y_train.copy()

#defining the models we want to compare
models = [
    MultinomialNB(),
    LogisticRegression(multi_class='multinomial',max_iter = 10000),
    KNeighborsClassifier(),
    SVC(),
    RandomForestClassifier()
]

#setting up the instructions for our comparison tool
#number of cross-validation folds to perform, 5 is a standard number
CV = 5
#creating our cross validation data frame
cv_df = pd.DataFrame(index=range(CV * len(models)))
#empty initial list for entries in our dataframe
entries = []
#outer for loop to execute our cross validations of the above models
for model in models:
    #accessing model information class
    model_name = model.__class__.__name__
    #scoring the model on each fold (accuracy, since this is classification)
    accuracies = cross_val_score(model, features, targets, scoring='accuracy', cv=C
    #inner for loop to fill the entries list to build the ending dataframe
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
#finalizing the dataframe with appended entries
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
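
The comparison loop above can be tried end to end without the penguins data. The sketch below is a self-contained version under stated assumptions: it substitutes scikit-learn's bundled iris data for `features` and `targets`, compares only two of the five models to keep it short, and leaves `LogisticRegression` at its default multi-class handling.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the penguin features and species labels
features, targets = load_iris(return_X_y=True)

models = [LogisticRegression(max_iter=10000), KNeighborsClassifier()]
CV = 5  # number of cross-validation folds

entries = []
for model in models:
    model_name = model.__class__.__name__
    # One accuracy value per fold
    accuracies = cross_val_score(model, features, targets, scoring="accuracy", cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

# Long-format frame: one row per (model, fold)
cv_df = pd.DataFrame(entries, columns=["model_name", "fold_idx", "accuracy"])

# Mean accuracy per model, best first
summary = cv_df.groupby("model_name")["accuracy"].mean().sort_values(ascending=False)
print(summary)
```

The long format (one row per model and fold) is what lets the box plot and the `groupby` summary in the following cells work from the same frame.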


In [37]: #visualizing the results


sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, size=8, jitter=True, edge
plt.title('Classification Models', fontsize=20)
plt.ylabel('Accuracy', fontsize=15)
plt.xlabel('')
plt.xticks(fontsize=15,rotation=90)
plt.yticks(fontsize=15,rotation=0)
plt.show()


In [38]: #looking at average accuracy per model


final_comp = cv_df.groupby('model_name').accuracy.mean().reset_index().sort_value
final_comp

Out[38]:
               model_name  accuracy
1      LogisticRegression  0.980000
3  RandomForestClassifier  0.960952
2           MultinomialNB  0.814286
0    KNeighborsClassifier  0.744762
4                     SVC  0.657619
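
One possible reason SVC and KNeighborsClassifier rank lowest is feature scale: both depend on distances, and `body_mass_g` is in the thousands while the bill measurements are in the tens. `StandardScaler` is imported at the top of the notebook but never applied. The sketch below illustrates the effect of scaling inside a pipeline; it uses scikit-learn's bundled wine data as a stand-in (an assumption, since it also has features on very different scales), so the numbers it prints say nothing about the penguins results.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: wine data stands in for a dataset with mixed feature scales
X, y = load_wine(return_X_y=True)

# SVC on the raw features
raw = cross_val_score(SVC(), X, y, scoring="accuracy", cv=5).mean()

# Same model, but each fold is standardized using only its own training part
scaled = cross_val_score(
    make_pipeline(StandardScaler(), SVC()), X, y, scoring="accuracy", cv=5
).mean()

print(f"raw: {raw:.3f}  scaled: {scaled:.3f}")
```

Putting the scaler inside the pipeline matters: `cross_val_score` then refits it on each training fold, so no information from the held-out fold leaks into the scaling.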

In [40]: #student project would be to attempt to classify island instead of species

In [ ]:
