0% found this document useful (0 votes)
58 views47 pages

Iris Flower Classification

Uploaded by

quiluong453
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
58 views47 pages

Iris Flower Classification

Uploaded by

quiluong453
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 47

Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Open in Colab

Project Name - Iris Flower Classification


Project Type - Classification

Industry - Oasis Infobyte

Contribution - Individual

Member Name - Arindam Paul

Task - 1

Project Summary -
Project Description:

The Iris Flower Classification project focuses on developing a machine learning model to
classify iris flowers into their respective species based on specific measurements. Iris flowers
are classified into three species: setosa, versicolor, and virginica, each of which exhibits
distinct characteristics in terms of measurements.

Objective:

The primary goal of this project is to leverage machine learning techniques to build a
classification model that can accurately identify the species of iris flowers based on their
measurements. The model aims to automate the classification process, offering a practical
solution for identifying iris species.

Key Project Details:

• Iris flowers have three species: setosa, versicolor, and virginica.


• These species can be distinguished based on measurements such as sepal length, sepal
width, petal length, and petal width.
• The project involves training a machine learning model on a dataset that contains iris
flower measurements associated with their respective species.
• The trained model will classify iris flowers into one of the three species based on their
measurements.

GitHub Link -
GitHub Link - https://github.com/Apaulgithub/oibsip_task1/tree/main

1 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Problem Statement
The iris flower, scientifically known as Iris, is a distinctive genus of flowering plants. Within
this genus, there are three primary species: Iris setosa, Iris versicolor, and Iris virginica. These
species exhibit variations in their physical characteristics, particularly in the measurements of
their sepal length, sepal width, petal length, and petal width.

Objective:

The objective of this project is to develop a machine learning model capable of learning
from the measurements of iris flowers and accurately classifying them into their respective
species. The model's primary goal is to automate the classification process based on the
distinct characteristics of each iris species.

Project Details:

• Iris Species: The dataset consists of iris flowers, specifically from the species setosa,
versicolor, and virginica.
• Key Measurements: The essential characteristics used for classification include sepal
length, sepal width, petal length, and petal width.
• Machine Learning Model: The project involves the creation and training of a machine
learning model to accurately classify iris flowers based on their measurements.

This project's significance lies in its potential to streamline and automate the classification of
iris species, which can have broader applications in botany, horticulture, and environmental
monitoring.

Let's Begin !

1. Know The Data

Import Libraries

2 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Import Libraries
# Importing Numpy & Pandas for data processing & data wrangling
import numpy as np
import pandas as pd

# Importing tools for visualization


import matplotlib.pyplot as plt
import seaborn as sns

# Import evaluation metric libraries


from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Library used for data preprocessing


from sklearn.preprocessing import LabelEncoder

# Import model selection libraries


from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Library used for ML Model implementation


from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb

# Library used for ignore warnings


import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Dataset Loading
In [ ]: # Load Dataset
df = pd.read_csv("https://raw.githubusercontent.com/Apaulgithub/oibsip_task1/main/Iris.csv"

Dataset First View


In [ ]: # Dataset First Look
# View top 5 rows of the dataset
df.head()

Out[ ]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

3 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Dataset Rows & Columns count


In [ ]: # Dataset Rows & Columns count
# Checking number of rows and columns of the dataset using shape
print("Number of rows are: ",df.shape[0])
print("Number of columns are: ",df.shape[1])

Number of rows are: 150


Number of columns are: 6

Dataset Information
In [ ]: # Dataset Info
# Checking information about the dataset using info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

Duplicate Values

In [ ]: # Dataset Duplicate Value Count


dup = df.duplicated().sum()
print(f'number of duplicated rows are {dup}')

number of duplicated rows are 0

Missing Values/Null Values

In [ ]: # Missing Values/Null Values Count


df.isnull().sum()

Out[ ]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

What did i know about the dataset?

4 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

• The Iris dataset consists of length and width mesurements of sepal and petal for
different species in centimeter.
• There are 150 rows and 6 columns provided in the data.
• No duplicate values exist.
• No Null values exist.

2. Understanding The Variables


In [ ]: # Dataset Columns
df.columns

Out[ ]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',


'Species'],
dtype='object')

In [ ]: # Dataset Describe (all columns included)


df.describe(include= 'all').round(2)

Out[ ]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

count 150.00 150.00 150.00 150.00 150.00 150

unique NaN NaN NaN NaN NaN 3

top NaN NaN NaN NaN NaN Iris-setosa

freq NaN NaN NaN NaN NaN 50

mean 75.50 5.84 3.05 3.76 1.20 NaN

std 43.45 0.83 0.43 1.76 0.76 NaN

min 1.00 4.30 2.00 1.00 0.10 NaN

25% 38.25 5.10 2.80 1.60 0.30 NaN

50% 75.50 5.80 3.00 4.35 1.30 NaN

75% 112.75 6.40 3.30 5.10 1.80 NaN

max 150.00 7.90 4.40 6.90 2.50 NaN

Check Unique Values for each variable.


In [ ]: # Check Unique Values for each variable.
for i in df.columns.tolist():
print("No. of unique values in",i,"is",df[i].nunique())

No. of unique values in Id is 150


No. of unique values in SepalLengthCm is 35
No. of unique values in SepalWidthCm is 23
No. of unique values in PetalLengthCm is 43
No. of unique values in PetalWidthCm is 22
No. of unique values in Species is 3

5 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

3. Data Wrangling

Data Wrangling Code


In [ ]: # We don't need the 1st column so let's drop that
data=df.iloc[:,1:]

In [ ]: # New updated dataset


data.head()

Out[ ]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

What all manipulations have i done?


Only drop the first column of the dataset.

4. Data Vizualization, Storytelling &


Experimenting with charts : Understand the
relationships between variables
Chart - 1 : Distribution of Numerical Variables

6 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Chart - 1 Histogram visualization code for distribution of numerical variables


# Create a figure with subplots
plt.figure(figsize=(8, 6))
plt.suptitle('Distribution of Iris Flower Measurements', fontsize=14)

# Create a 2x2 grid of subplots


plt.subplot(2, 2, 1) # Subplot 1 (Top-Left)
plt.hist(data['SepalLengthCm'])
plt.title('Sepal Length Distribution')

plt.subplot(2, 2, 2) # Subplot 2 (Top-Right)


plt.hist(data['SepalWidthCm'])
plt.title('Sepal Width Distribution')

plt.subplot(2, 2, 3) # Subplot 3 (Bottom-Left)


plt.hist(data['PetalLengthCm'])
plt.title('Petal Length Distribution')

plt.subplot(2, 2, 4) # Subplot 4 (Bottom-Right)


plt.hist(data['PetalWidthCm'])
plt.title('Petal Width Distribution')

# Display the subplots


plt.tight_layout() # Helps in adjusting the layout
plt.show()

Chart - 2 : Sepal Length vs Sepal Width

7 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Define colors for each species and the corresponding species labels.
colors = ['red', 'yellow', 'green']
species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

In [ ]: # Chart - 2 Scatter plot visualization code for Sepal Length vs Sepal Width.
# Create a scatter plot for Sepal Length vs Sepal Width for each species.
for i in range(3):
# Select data for the current species.
x = data[data['Species'] == species[i]]

# Create a scatter plot with the specified color and label for the current species.
plt.scatter(x['SepalLengthCm'], x['SepalWidthCm'], c=colors[i], label=species[i

# Add labels to the x and y axes.


plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

# Add a legend to identify species based on colors.


plt.legend()

# Display the scatter plot.


plt.show()

Chart - 3 : Petal Length vs Petal Width

8 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Chart - 3 Scatter plot visualization code for Petal Length vs Petal Width.
# Create a scatter plot for Petal Length vs Petal Width for each species.
for i in range(3):
# Select data for the current species.
x = data[data['Species'] == species[i]]

# Create a scatter plot with the specified color and label for the current species.
plt.scatter(x['PetalLengthCm'], x['PetalWidthCm'], c=colors[i], label=species[i

# Add labels to the x and y axes.


plt.xlabel('Petal Length')
plt.ylabel('Petal Width')

# Add a legend to identify species based on colors.


plt.legend()

# Display the scatter plot.


plt.show()

Chart - 4 : Sepal Length vs Petal Length

9 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Chart - 4 Scatter plot visualization code for Sepal Length vs Petal Length.
# Create a scatter plot for Sepal Length vs Petal Length for each species.
for i in range(3):
# Select data for the current species.
x = data[data['Species'] == species[i]]

# Create a scatter plot with the specified color and label for the current species.
plt.scatter(x['SepalLengthCm'], x['PetalLengthCm'], c=colors[i], label=species[

# Add labels to the x and y axes.


plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')

# Add a legend to identify species based on colors.


plt.legend()

# Display the scatter plot.


plt.show()

Chart - 5 : Sepal Width vs Petal Width

10 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Chart - 5 Scatter plot visualization code for Sepal Width vs Petal Width.
# Create a scatter plot for Sepal Width vs Petal Width for each species.
for i in range(3):
# Select data for the current species.
x = data[data['Species'] == species[i]]

# Create a scatter plot with the specified color and label for the current species.
plt.scatter(x['SepalWidthCm'], x['PetalWidthCm'], c=colors[i], label=species[i])

# Add labels to the x and y axes.


plt.xlabel('Sepal Width')
plt.ylabel('Petal Width')

# Add a legend to identify species based on colors.


plt.legend()

# Display the scatter plot.


plt.show()

Chart - 6 : Correlation Heatmap

11 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Correlation Heatmap Visualization Code


corr_matrix = data.corr()

# Plot Heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(corr_matrix, annot=True, cmap='Reds_r')

# Setting Labels
plt.title('Correlation Matrix heatmap')

# Display Chart
plt.show()

5. Feature Engineering & Data Pre-processing

1. Categorical Encoding
In [ ]: # Encode the categorical columns
# Create a LabelEncoder object
le = LabelEncoder()

# Encode the 'Species' column to convert the species names to numerical labels
data['Species'] = le.fit_transform(data['Species'])

# Check the unique values in the 'Species' column after encoding


unique_species = data['Species'].unique()

# Display the unique encoded values


print("Encoded Species Values:")
print(unique_species) # 'Iris-setosa' == 0, 'Iris-versicolor' == 1, 'Iris-virginica' == 2

Encoded Species Values:


[0 1 2]

12 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

2. Data Scaling
In [ ]: # Defining the X and y
x=data.drop(columns=['Species'], axis=1)
y=data['Species']

3. Data Splitting
In [ ]: # Splitting the data to train and test
x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.3)

In [ ]: # Checking the train distribution of dependent variable


y_train.value_counts()

Out[ ]: 1 37
2 35
0 33
Name: Species, dtype: int64

6. ML Model Implementation

13 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: def evaluate_model(model, x_train, x_test, y_train, y_test):


'''The function will take model, x train, x test, y train, y test
and then it will fit the model, then make predictions on the trained model,
it will then print roc-auc score of train and test, then plot the roc, auc curve,
print confusion matrix for train and test, then print classification report for train and
then plot the feature importances if the model has feature importances,
and finally it will return the following scores as a list:
recall_train, recall_test, acc_train, acc_test, F1_train, F1_test
'''

# Fit the model to the training data.


model.fit(x_train, y_train)

# make predictions on the test data


y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

# calculate confusion matrix


cm_train = confusion_matrix(y_train, y_pred_train)
cm_test = confusion_matrix(y_test, y_pred_test)

fig, ax = plt.subplots(1, 2, figsize=(11,4))

print("\nConfusion Matrix:")
sns.heatmap(cm_train, annot=True, xticklabels=['Negative', 'Positive'], yticklabels
ax[0].set_xlabel("Predicted Label")
ax[0].set_ylabel("True Label")
ax[0].set_title("Train Confusion Matrix")

sns.heatmap(cm_test, annot=True, xticklabels=['Negative', 'Positive'], yticklabels


ax[1].set_xlabel("Predicted Label")
ax[1].set_ylabel("True Label")
ax[1].set_title("Test Confusion Matrix")

plt.tight_layout()
plt.show()

# calculate classification report


cr_train = classification_report(y_train, y_pred_train, output_dict=True)
cr_test = classification_report(y_test, y_pred_test, output_dict=True)
print("\nTrain Classification Report:")
crt = pd.DataFrame(cr_train).T
print(crt.to_markdown())
# sns.heatmap(pd.DataFrame(cr_train).T.iloc[:, :-1], annot=True, cmap="Blues")
print("\nTest Classification Report:")
crt2 = pd.DataFrame(cr_test).T
print(crt2.to_markdown())
# sns.heatmap(pd.DataFrame(cr_test).T.iloc[:, :-1], annot=True, cmap="Blues")

precision_train = cr_train['weighted avg']['precision']


precision_test = cr_test['weighted avg']['precision']

recall_train = cr_train['weighted avg']['recall']


recall_test = cr_test['weighted avg']['recall']

acc_train = accuracy_score(y_true = y_train, y_pred = y_pred_train)


acc_test = accuracy_score(y_true = y_test, y_pred = y_pred_test)

14 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

F1_train = cr_train['weighted avg']['f1-score']


F1_test = cr_test['weighted avg']['f1-score']

model_score = [precision_train, precision_test, recall_train, recall_test, acc_train


return model_score

In [ ]: # Create a score dataframe


score = pd.DataFrame(index = ['Precision Train', 'Precision Test','Recall Train','Recall Test

ML Model - 1 : Logistic regression


In [ ]: # ML Model - 1 Implementation
lr_model = LogisticRegression(fit_intercept=True, max_iter=10000)

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

15 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Visualizing evaluation Metric Score chart


lr_score = evaluate_model(lr_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.972973 | 0.972973 | 0.972973 | 37 |
| 2 | 0.971429 | 0.971429 | 0.971429 | 35 |
| accuracy | 0.980952 | 0.980952 | 0.980952 | 0.980952 |
| macro avg | 0.981467 | 0.981467 | 0.981467 | 105 |
| weighted avg | 0.980952 | 0.980952 | 0.980952 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['Logistic regression'] = lr_score
score

Out[ ]: Logistic regression

Precision Train 0.980952

Precision Test 0.979167

Recall Train 0.980952

Recall Test 0.977778

Accuracy Train 0.980952

Accuracy Test 0.977778

F1 macro Train 0.980952

F1 macro Test 0.977692

16 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

2. Cross- Validation & Hyperparameter Tuning

In [ ]: # ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
param_grid = {'C': [100,10,1,0.1,0.01,0.001,0.0001],
'penalty': ['l1', 'l2'],
'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

# Initializing the logistic regression model


logreg = LogisticRegression(fit_intercept=True, max_iter=10000, random_state=0)

# Repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=4, random_state=0)

# Using GridSearchCV to tune the hyperparameters using cross-validation


grid = GridSearchCV(logreg, param_grid, cv=rskf)
grid.fit(x_train, y_train)

# Select the best hyperparameters found by GridSearchCV


best_params = grid.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'saga'}

In [ ]: # Initiate model with best parameters


lr_model2 = LogisticRegression(C=best_params['C'],
penalty=best_params['penalty'],
solver=best_params['solver'],
max_iter=10000, random_state=0)

In [ ]: # Visualizing evaluation Metric Score chart


lr_score2 = evaluate_model(lr_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

17 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 0.972973 | 0.986301 | 37 |
| 2 | 0.972222 | 1 | 0.985915 | 35 |
| accuracy | 0.990476 | 0.990476 | 0.990476 | 0.990476 |
| macro avg | 0.990741 | 0.990991 | 0.990739 | 105 |
| weighted avg | 0.990741 | 0.990476 | 0.990478 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: score['Logistic regression tuned'] = lr_score2

Which hyperparameter optimization technique have i used and why?

The hyperparameter optimization technique used is GridSearchCV. GridSearchCV is a


method that performs an exhaustive search over a specified parameter grid to find the best
hyperparameters for a model. It is a popular method for hyperparameter tuning because it is
simple to implement and can be effective in finding good hyperparameters for a model.

The choice of hyperparameter optimization technique depends on various factors such as


the size of the parameter space, the computational resources available, and the time
constraints. GridSearchCV can be a good choice when the parameter space is relatively small
and computational resources are not a major concern.

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

18 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic regression Logistic regression tuned

Precision Train 0.980952 0.990741

Precision Test 0.979167 0.979167

Recall Train 0.980952 0.990476

Recall Test 0.977778 0.977778

Accuracy Train 0.980952 0.990476

Accuracy Test 0.977778 0.977778

F1 macro Train 0.980952 0.990478

F1 macro Test 0.977692 0.977692

It appears that hyperparameter tuning did not improve the performance of the Logistic
Regression model on the test set. The precision, recall, accuracy and F1 scores on the test set
are same for both tuned and untuned Logistic Regression models.

ML Model - 2 : Decision Tree


In [ ]: # ML Model - 2 Implementation
dt_model = DecisionTreeClassifier(random_state=20)

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


dt_score = evaluate_model(dt_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

19 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 1 | 1 | 37 |
| 2 | 1 | 1 | 1 | 35 |
| accuracy | 1 | 1 | 1 | 1 |
| macro avg | 1 | 1 | 1 | 105 |
| weighted avg | 1 | 1 | 1 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['Decision Tree'] = dt_score
score

Out[ ]: Logistic regression Logistic regression tuned Decision Tree

Precision Train 0.980952 0.990741 1.000000

Precision Test 0.979167 0.979167 0.979167

Recall Train 0.980952 0.990476 1.000000

Recall Test 0.977778 0.977778 0.977778

Accuracy Train 0.980952 0.990476 1.000000

Accuracy Test 0.977778 0.977778 0.977778

F1 macro Train 0.980952 0.990478 1.000000

F1 macro Test 0.977692 0.977692 0.977692

2. Cross- Validation & Hyperparameter Tuning

20 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
grid = {'max_depth' : [3,4,5,6,7,8],
'min_samples_split' : np.arange(2,8),
'min_samples_leaf' : np.arange(10,20)}

# Initialize the model


model = DecisionTreeClassifier()

# repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize GridSearchCV
grid_search = GridSearchCV(model, grid, cv=rskf)

# Fit the GridSearchCV to the training data


grid_search.fit(x_train, y_train)

# Select the best hyperparameters


best_params = grid_search.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 10, 'min_samples_split'


: 5}

In [ ]: # Train a new model with the best hyperparameters


dt_model2 = DecisionTreeClassifier(max_depth=best_params['max_depth'],
min_samples_leaf=best_params['min_samples_leaf'],
min_samples_split=best_params['min_samples_split'],
random_state=20)

In [ ]: # Visualizing evaluation Metric Score chart


dt2_score = evaluate_model(dt_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

21 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.970588 | 0.891892 | 0.929577 | 37 |
| 2 | 0.894737 | 0.971429 | 0.931507 | 35 |
| accuracy | 0.952381 | 0.952381 | 0.952381 | 0.952381 |
| macro avg | 0.955108 | 0.95444 | 0.953695 | 105 |
| weighted avg | 0.954548 | 0.952381 | 0.952353 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.846154 | 0.916667 | 13 |
| 2 | 0.882353 | 1 | 0.9375 | 15 |
| accuracy | 0.955556 | 0.955556 | 0.955556 | 0.955556 |
| macro avg | 0.960784 | 0.948718 | 0.951389 | 45 |
| weighted avg | 0.960784 | 0.955556 | 0.955093 | 45 |

In [ ]: score['Decision Tree tuned'] = dt2_score

Which hyperparameter optimization technique have i used and why?

The hyperparameter optimization technique used is GridSearchCV. GridSearchCV is a


method that performs an exhaustive search over a specified parameter grid to find the best
hyperparameters for a model. It is a popular method for hyperparameter tuning because it is
simple to implement and can be effective in finding good hyperparameters for a model.

The choice of hyperparameter optimization technique depends on various factors such as


the size of the parameter space, the computational resources available, and the time
constraints. GridSearchCV can be a good choice when the parameter space is relatively small
and computational resources are not a major concern.

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

22 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Logistic regression Decision Decision Tree


regression tuned Tree tuned

Precision
0.980952 0.990741 1.000000 0.954548
Train

Precision Test 0.979167 0.979167 0.979167 0.960784

Recall Train 0.980952 0.990476 1.000000 0.952381

Recall Test 0.977778 0.977778 0.977778 0.955556

Accuracy
0.980952 0.990476 1.000000 0.952381
Train

Accuracy Test 0.977778 0.977778 0.977778 0.955556

F1 macro
0.980952 0.990478 1.000000 0.952353
Train

F1 macro Test 0.977692 0.977692 0.977692 0.955093

It appears that hyperparameter tuning didn't improved the performance of the Decision Tree
model on the test set. The precision, recall, accuracy and F1 scores on the test set are less for
the tuned Decision Tree model compare to the untuned Decision Tree model.

The tuned model is not overfitting like the untuned model.

ML Model - 3 : Random Forest


In [ ]: # ML Model - 3 Implementation
rf_model = RandomForestClassifier(random_state=0)

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


rf_score = evaluate_model(rf_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

23 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 1 | 1 | 37 |
| 2 | 1 | 1 | 1 | 35 |
| accuracy | 1 | 1 | 1 | 1 |
| macro avg | 1 | 1 | 1 | 105 |
| weighted avg | 1 | 1 | 1 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['Random Forest'] = rf_score
score

Out[ ]: Logistic Logistic regression Decision Decision Tree Random


regression tuned Tree tuned Forest

Precision
0.980952 0.990741 1.000000 0.954548 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167
Test

Recall Train 0.980952 0.990476 1.000000 0.952381 1.000000

Recall Test 0.977778 0.977778 0.977778 0.955556 0.977778

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778
Test

F1 macro
0.980952 0.990478 1.000000 0.952353 1.000000
Train

F1 macro
0.977692 0.977692 0.977692 0.955093 0.977692
Test

2. Cross- Validation & Hyperparameter Tuning

24 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
grid = {'n_estimators': [10, 50, 100, 200],
'max_depth': [8, 9, 10, 11, 12,13, 14, 15],
'min_samples_split': [2, 3, 4, 5]}

# Initialize the model


rf = RandomForestClassifier(random_state=0)

# Repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomSearchCV
random_search = RandomizedSearchCV(rf, grid,cv=rskf, n_iter=10, n_jobs=-1)

# Fit the RandomSearchCV to the training data


random_search.fit(x_train, y_train)

# Select the best hyperparameters


best_params = random_search.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'n_estimators': 100, 'min_samples_split': 4, 'max_depth': 1


2}

In [ ]: # Initialize model with best parameters


rf_model2 = RandomForestClassifier(n_estimators = best_params['n_estimators'],
min_samples_leaf= best_params['min_samples_split'],
max_depth = best_params['max_depth'],
random_state=0)

In [ ]: # Visualizing evaluation Metric Score chart


rf2_score = evaluate_model(rf_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

25 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.972222 | 0.945946 | 0.958904 | 37 |
| 2 | 0.944444 | 0.971429 | 0.957746 | 35 |
| accuracy | 0.971429 | 0.971429 | 0.971429 | 0.971429 |
| macro avg | 0.972222 | 0.972458 | 0.972217 | 105 |
| weighted avg | 0.971693 | 0.971429 | 0.971434 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: score['Random Forest tuned'] = rf2_score

Which hyperparameter optimization technique have i used and why?

The hyperparameter optimization technique i used is RandomizedSearchCV.


RandomizedSearchCV is a method that performs a random search over a specified
parameter grid to find the best hyperparameters for a model. It is a popular method for
hyperparameter tuning because it can be more efficient than exhaustive search methods like
GridSearchCV when the parameter space is large.

The choice of hyperparameter optimization technique depends on various factors such as


the size of the parameter space, the computational resources available, and the time
constraints. RandomizedSearchCV can be a good choice when the parameter space is large
and computational resources are limited.

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

26 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic
Logistic Decision Decision Random Random
regression
regression Tree Tree tuned Forest Forest tuned
tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429
Train

Recall Test 0.977778 0.977778 0.977778 0.955556 0.977778 0.977778

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778
Test

F1 macro
0.980952 0.990478 1.000000 0.952353 1.000000 0.971434
Train

F1 macro
0.977692 0.977692 0.977692 0.955093 0.977692 0.977692
Test

It appears that hyperparameter tuning improved the performance of the Random Forest
model on the train set. But the precision, recall, accuracy and F1 scores on the test set are
same for both tuned and untuned Random Forest models.

ML Model - 4 : SVM (Support Vector Machine)


In [ ]: # ML Model - 4 Implementation
svm_model = SVC(kernel='linear', random_state=0, probability=True)

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


svm_score = evaluate_model(svm_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

27 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.972973 | 0.972973 | 0.972973 | 37 |
| 2 | 0.971429 | 0.971429 | 0.971429 | 35 |
| accuracy | 0.980952 | 0.980952 | 0.980952 | 0.980952 |
| macro avg | 0.981467 | 0.981467 | 0.981467 | 105 |
| weighted avg | 0.980952 | 0.980952 | 0.980952 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['SVM'] = svm_score
score

28 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Decision Random


Logistic Decision Random
regression Tree Forest SVM
regression Tree Forest
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778
Test

F1 macro
0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952
Train

F1 macro
0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692
Test

2. Cross- Validation & Hyperparameter Tuning

In [ ]: # ML Model - 4 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
param_grid = {'C': np.arange(0.1, 10, 0.1),
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': np.arange(2, 6, 1)}

# Initialize the model


svm = SVC(random_state=0, probability=True)

# Repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV with kfold cross-validation


random_search = RandomizedSearchCV(svm, param_grid, n_iter=10, cv=rskf, n_jobs=-1)

# Fit the RandomizedSearchCV to the training data


random_search.fit(x_train, y_train)

# Select the best hyperparameters


best_params = random_search.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'kernel': 'rbf', 'degree': 5, 'C': 8.5}

29 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Initialize model with best parameters


svm_model2 = SVC(C = best_params['C'],
kernel = best_params['kernel'],
degree = best_params['degree'],
random_state=0, probability=True)

In [ ]: # Visualizing evaluation Metric Score chart


svm2_score = evaluate_model(svm_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.972973 | 0.972973 | 0.972973 | 37 |
| 2 | 0.971429 | 0.971429 | 0.971429 | 35 |
| accuracy | 0.980952 | 0.980952 | 0.980952 | 0.980952 |
| macro avg | 0.981467 | 0.981467 | 0.981467 | 105 |
| weighted avg | 0.980952 | 0.980952 | 0.980952 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: score['SVM tuned'] = svm2_score

Which hyperparameter optimization technique have i used and why?

Here Randomized search is used as a hyperparameter optimization technique. Randomized


search is a popular technique because it can be more efficient than exhaustive search
methods like grid search. Instead of trying all possible combinations of hyperparameters,
randomized search samples a random subset of the hyperparameter space. This can save
time and computational resources while still finding good hyperparameters for the model.

30 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692
Test

It appears that hyperparameter tuning did not improve the performance of the SVM model
on the test set. The precision, recall, accuracy and F1 scores on the test set are same for both
tuned and untuned SVM models.

ML Model - 5 : Xtreme Gradient Boosting


In [ ]: # ML Model - 5 Implementation
xgb_model = xgb.XGBClassifier()

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


xgb_score = evaluate_model(xgb_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

31 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 1 | 1 | 37 |
| 2 | 1 | 1 | 1 | 35 |
| accuracy | 1 | 1 | 1 | 1 |
| macro avg | 1 | 1 | 1 | 105 |
| weighted avg | 1 | 1 | 1 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['XGB'] = xgb_score
score

32 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

2. Cross- Validation & Hyperparameter Tuning

In [ ]: # ML Model - 5 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
param_grid = {'learning_rate': np.arange(0.01, 0.3, 0.01),
'max_depth': np.arange(3, 15, 1),
'n_estimators': np.arange(100, 200, 10)}

# Initialize the model


xgb2 = xgb.XGBClassifier(random_state=0)

# Repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(xgb2, param_grid, n_iter=10, cv=rskf)

# Fit the RandomizedSearchCV to the training data


random_search.fit(x_train, y_train)

# Select the best hyperparameters


best_params = random_search.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'n_estimators': 170, 'max_depth': 12, 'learning_rate': 0.2


5}

33 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Initialize model with best parameters


xgb_model2 = xgb.XGBClassifier(learning_rate = best_params['learning_rate'],
max_depth = best_params['max_depth'],
n_estimators = best_params['n_estimators'],
random_state=0)

In [ ]: # Visualizing evaluation Metric Score chart


xgb2_score = evaluate_model(xgb_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 1 | 1 | 37 |
| 2 | 1 | 1 | 1 | 35 |
| accuracy | 1 | 1 | 1 | 1 |
| macro avg | 1 | 1 | 1 | 105 |
| weighted avg | 1 | 1 | 1 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: score['XGB tuned'] = xgb2_score

Which hyperparameter optimization technique have i used and why?

Here we have used Randomized search to tune the XGB model.

Randomized search is a popular technique because it can be more efficient than exhaustive
search methods like grid search. Instead of trying all possible combinations of
hyperparameters, randomized search samples a random subset of the hyperparameter
space. This can save time and computational resources while still finding good
hyperparameters for the model.

34 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

It appears that hyperparameter tuning did not improve the performance of the XGBoost
model on the test set. The precision, recall, accuracy and F1 scores on the test set are same
for both the untuned and tuned XGBoost models.

ML Model - 6 : Naive Bayes


In [ ]: # ML Model - 6 Implementation
nb_model = GaussianNB()

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


nb_score = evaluate_model(nb_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

35 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.918919 | 0.918919 | 0.918919 | 37 |
| 2 | 0.914286 | 0.914286 | 0.914286 | 35 |
| accuracy | 0.942857 | 0.942857 | 0.942857 | 0.942857 |
| macro avg | 0.944402 | 0.944402 | 0.944402 | 105 |
| weighted avg | 0.942857 | 0.942857 | 0.942857 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 0.928571 | 1 | 0.962963 | 13 |
| 2 | 1 | 0.933333 | 0.965517 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.97619 | 0.977778 | 0.97616 | 45 |
| weighted avg | 0.979365 | 0.977778 | 0.977806 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['Naive Bayes'] = nb_score
score

36 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

2. Cross- Validation & Hyperparameter Tuning

In [ ]: # ML Model - 6 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}

# Initialize the model


naive = GaussianNB()

# repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=4, random_state=0)

# Initialize GridSearchCV
GridSearch = GridSearchCV(naive, param_grid, cv=rskf, n_jobs=-1)

# Fit the GridSearchCV to the training data


GridSearch.fit(x_train, y_train)

# Select the best hyperparameters


best_params = GridSearch.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'var_smoothing': 0.0001519911082952933}

In [ ]: # Initiate model with best parameters


nb_model2 = GaussianNB(var_smoothing = best_params['var_smoothing'])

37 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Visualizing evaluation Metric Score chart


nb2_score = evaluate_model(nb_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.918919 | 0.918919 | 0.918919 | 37 |
| 2 | 0.914286 | 0.914286 | 0.914286 | 35 |
| accuracy | 0.942857 | 0.942857 | 0.942857 | 0.942857 |
| macro avg | 0.944402 | 0.944402 | 0.944402 | 105 |
| weighted avg | 0.942857 | 0.942857 | 0.942857 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 0.928571 | 1 | 0.962963 | 13 |
| 2 | 1 | 0.933333 | 0.965517 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.97619 | 0.977778 | 0.97616 | 45 |
| weighted avg | 0.979365 | 0.977778 | 0.977806 | 45 |

In [ ]: score['Naive Bayes tuned']= nb2_score

Which hyperparameter optimization technique have i used and why?

Here we have used the GridSearchCV for optimization of the Naive Bayes model.

GridSearchCV is an exhaustive search method that tries all possible combinations of


hyperparameters specified in the hyperparameter grid. This technique can be useful when
the number of hyperparameters to tune is small and the range of possible values for each
hyperparameter is limited. GridSearchCV can find the best combination of hyperparameters,
but it can be computationally expensive for large hyperparameter grids.

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

38 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

It appears that hyperparameter tuning did not improved the performance of the Naive Bayes
model on the test set. The tuned Naive Bayes model has precision, recall, accuracy and F1
score on the test set as same as in the untuned Naive Bayes model.

ML Model - 7 : Neural Network


In [ ]: # ML Model - 7 Implementation
nn_model = MLPClassifier(random_state=0)

# Model is trained (fit) and predicted in the evaluate model

1. Explain the ML Model used and it's performance using Evaluation metric
Score Chart.

In [ ]: # Visualizing evaluation Metric Score chart


neural_score = evaluate_model(nn_model, x_train, x_test, y_train, y_test)

Confusion Matrix:

39 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 0.972973 | 0.972973 | 0.972973 | 37 |
| 2 | 0.971429 | 0.971429 | 0.971429 | 35 |
| accuracy | 0.980952 | 0.980952 | 0.980952 | 0.980952 |
| macro avg | 0.981467 | 0.981467 | 0.981467 | 105 |
| weighted avg | 0.980952 | 0.980952 | 0.980952 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.846154 | 0.916667 | 13 |
| 2 | 0.882353 | 1 | 0.9375 | 15 |
| accuracy | 0.955556 | 0.955556 | 0.955556 | 0.955556 |
| macro avg | 0.960784 | 0.948718 | 0.951389 | 45 |
| weighted avg | 0.960784 | 0.955556 | 0.955093 | 45 |

In [ ]: # Updated Evaluation metric Score Chart


score['Neural Network'] = neural_score
score

40 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

2. Cross- Validation & Hyperparameter Tuning

In [ ]: # ML Model - 7 Implementation with hyperparameter optimization techniques (i.e., GridSearch C


# Define the hyperparameter grid
param_grid = {'hidden_layer_sizes': np.arange(10, 100, 10),
'alpha': np.arange(0.0001, 0.01, 0.0001)}

# Initialize the model


neural = MLPClassifier(random_state=0)

# Repeated stratified kfold


rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=0)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(neural, param_grid, n_iter=10, cv=rskf, n_jobs=-

# Fit the RandomizedSearchCV to the training data


random_search.fit(x_train, y_train)

# Select the best hyperparameters


best_params = random_search.best_params_
print("Best hyperparameters: ", best_params)

Best hyperparameters: {'hidden_layer_sizes': 40, 'alpha': 0.0068000000000000005}

41 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Initiate model with best parameters


nn_model2 = MLPClassifier(hidden_layer_sizes = best_params['hidden_layer_sizes'],
alpha = best_params['alpha'],
random_state = 0)

In [ ]: # Visualizing evaluation Metric Score chart


neural2_score = evaluate_model(nn_model2, x_train, x_test, y_train, y_test)

Confusion Matrix:

Train Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0 | 1 | 1 | 1 | 33 |
| 1 | 1 | 0.972973 | 0.986301 | 37 |
| 2 | 0.972222 | 1 | 0.985915 | 35 |
| accuracy | 0.990476 | 0.990476 | 0.990476 | 0.990476 |
| macro avg | 0.990741 | 0.990991 | 0.990739 | 105 |
| weighted avg | 0.990741 | 0.990476 | 0.990478 | 105 |

Test Classification Report:


| | precision | recall | f1-score | support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0 | 1 | 1 | 1 | 17 |
| 1 | 1 | 0.923077 | 0.96 | 13 |
| 2 | 0.9375 | 1 | 0.967742 | 15 |
| accuracy | 0.977778 | 0.977778 | 0.977778 | 0.977778 |
| macro avg | 0.979167 | 0.974359 | 0.975914 | 45 |
| weighted avg | 0.979167 | 0.977778 | 0.977692 | 45 |

In [ ]: score['Neural Network tuned']= neural2_score

Which hyperparameter optimization technique have i used and why?

Here we have used Randomized search to tune the Neural Network model.

Randomized search is a popular technique because it can be more efficient than exhaustive
search methods like grid search. Instead of trying all possible combinations of
hyperparameters, randomized search samples a random subset of the hyperparameter
space. This can save time and computational resources while still finding good
hyperparameters for the model.

42 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

Have i seen any improvement? Note down the improvement with updates Evaluation metric
Score Chart.

In [ ]: # Updated Evaluation metric Score Chart


score

Out[ ]: Logistic Decision Random


Logistic Decision Random SVM
regression Tree Forest SVM XGB
regression Tree Forest tuned
tuned tuned tuned

Precision
0.980952 0.990741 1.000000 0.954548 1.000000 0.971693 0.980952 0.980952 1.000000
Train

Precision
0.979167 0.979167 0.979167 0.960784 0.979167 0.979167 0.979167 0.979167 0.979167
Test

Recall
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Recall
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

Accuracy
0.980952 0.990476 1.000000 0.952381 1.000000 0.971429 0.980952 0.980952 1.000000
Train

Accuracy
0.977778 0.977778 0.977778 0.955556 0.977778 0.977778 0.977778 0.977778 0.977778
Test

F1
macro 0.980952 0.990478 1.000000 0.952353 1.000000 0.971434 0.980952 0.980952 1.000000
Train

F1
macro 0.977692 0.977692 0.977692 0.955093 0.977692 0.977692 0.977692 0.977692 0.977692
Test

It appears that hyperparameter tuning improve the performance of the neural network
model on the test set. The precision, recall, accuracy and F1 scores on the test set are
increased for the tuned neural network model compare to untuned neural network model.

In [ ]: print(score.to_markdown())

| | Logistic regression | Logistic regression tuned | Decision Tree |


|:----------------|----------------------:|----------------------------:|----------------:|--
| Precision Train | 0.980952 | 0.990741 | 1 |
| Precision Test | 0.979167 | 0.979167 | 0.979167 |
| Recall Train | 0.980952 | 0.990476 | 1 |
| Recall Test | 0.977778 | 0.977778 | 0.977778 |
| Accuracy Train | 0.980952 | 0.990476 | 1 |
| Accuracy Test | 0.977778 | 0.977778 | 0.977778 |
| F1 macro Train | 0.980952 | 0.990478 | 1 |
| F1 macro Test | 0.977692 | 0.977692 | 0.977692 |

Selection of best model

43 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Removing the overfitted models which have precision, recall, f1 scores for train as 1
score_t = score.transpose() # taking transpose of the score dataframe to create ne
remove_models = score_t[score_t['Recall Train']>=0.98].index # creating a list of models whi
remove_models

adj = score_t.drop(remove_models) # creating a new dataframe with require


adj

Out[ ]: F1 F1
Precision Precision Recall Recall Accuracy Accuracy
macro macro
Train Test Train Test Train Test
Train Test

Decision
Tree 0.954548 0.960784 0.952381 0.955556 0.952381 0.955556 0.952353 0.955093
tuned

Random
Forest 0.971693 0.979167 0.971429 0.977778 0.971429 0.977778 0.971434 0.977692
tuned

Naive
0.942857 0.979365 0.942857 0.977778 0.942857 0.977778 0.942857 0.977806
Bayes

Naive
Bayes 0.942857 0.979365 0.942857 0.977778 0.942857 0.977778 0.942857 0.977806
tuned

In [ ]: def select_best_model(df, metrics):

best_models = {}
for metric in metrics:
max_test = df[metric + ' Test'].max()
best_model_test = df[df[metric + ' Test'] == max_test].index[0]
best_model = best_model_test
best_models[metric] = best_model
return best_models

In [ ]: metrics = ['Precision', 'Recall', 'Accuracy', 'F1 macro']

best_models = select_best_model(adj, metrics)


print("The best models are:")
for metric, best_model in best_models.items():
print(f"{metric}: {best_model} - {adj[metric+' Test'][best_model].round(4)}")

The best models are:


Precision: Naive Bayes - 0.9794
Recall: Random Forest tuned - 0.9778
Accuracy: Random Forest tuned - 0.9778
F1 macro: Naive Bayes - 0.9778

44 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Take recall as the primary evaluation metric


score_smpl = score.transpose()
remove_overfitting_models = score_smpl[score_smpl['Recall Train']>=0.98].index
remove_overfitting_models
new_score = score_smpl.drop(remove_overfitting_models)
new_score = new_score.drop(['Precision Train','Precision Test','Accuracy Train','Accuracy Tes
new_score.index.name = 'Classification Model'
print(new_score.to_markdown())

| Classification Model | Recall Train | Recall Test |


|:-----------------------|---------------:|--------------:|
| Decision Tree tuned | 0.952381 | 0.955556 |
| Random Forest tuned | 0.971429 | 0.977778 |
| Naive Bayes | 0.942857 | 0.977778 |
| Naive Bayes tuned | 0.942857 | 0.977778 |

1. Which Evaluation metrics did i consider for a positive business


impact and why?
After carefully considering the potential consequences of false positives and false negatives
in the context of our business objectives, I have selected recall as the primary evaluation
metric for our Iris flower classification model. This means that our goal is to maximize the
number of true positives (correctly identified the different iris flowers) while minimizing the
number of false negatives (incorrectly identified the flowers not a iris flower). By doing so, we
aim to ensure that we correctly identify as many different iris flowers, even if it means that
we may have some false positives.

2. Which ML model did i choose from the above created models


as our final prediction model and why?
After evaluating the performance of several machine learning models on the Iris dataset, I
have selected the tuned Random Forest as our final prediction model. This decision was
based on the model’s performance on our primary evaluation metric of recall, which
measures the ability of the model to correctly identify different iris flowers. In our analysis,
we found that the Random Forest (tuned) had the highest recall score among the models we
evaluated.

I choose recall as the primary evaluation metric because correctly identifying different iris
flowers are critical to achieving our business objectives. By selecting a model with a high
recall score, we aim to ensure that we correctly identify as many different iris flowers as
possible, even if it means that we may have some false positives. Overall, we believe that the
Random Forest (tuned) is the best choice for our needs and will help us achieve a positive
business impact.

3. Explain the model which i have used for the prediction

45 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In [ ]: # Define a list of category labels for reference.


Category_RF = ['Iris-Setosa', 'Iris-Versicolor', 'Iris-Virginica']

In [ ]: # In this example, it's a data point with Sepal Length, Sepal Width, Petal Length, and Petal
x_rf = np.array([[5.1, 3.5, 1.4, 0.2]])

# Use the tuned random forest model (rf_model2) to make a prediction.


x_rf_prediction = rf_model2.predict(x_rf)
x_rf_prediction[0]

# Display the predicted category label.


print(Category_RF[int(x_rf_prediction[0])])

Iris-Setosa

Conclusion

46 of 47 15/10/2024, 7:06 AM
Iris_Flower_Classification http://localhost:8888/nbconvert/html/Iris_Flower_Classification.ipynb...

In the Iris flower classification project, the tuned Random Forest model has been selected as
the final prediction model. The project aimed to classify Iris flowers into three distinct
species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. After extensive data exploration,
preprocessing, and model evaluation, the following conclusions can be drawn:

1. Data Exploration: Through a thorough examination of the dataset, we gained insights


into the characteristics and distributions of features. We found that Iris-Setosa exhibited
distinct features compared to the other two species.

2. Data Preprocessing: Data preprocessing steps, including handling missing values and
encoding categorical variables, were performed to prepare the dataset for modeling.

3. Model Selection: After experimenting with various machine learning models, tuned
Random Forest was chosen as the final model due to its simplicity, interpretability, and
good performance in classifying Iris species.

4. Model Training and Evaluation: The Random Forest (tuned) model was trained on the
training dataset and evaluated using appropriate metrics. The model demonstrated
satisfactory accuracy and precision in classifying Iris species.

5. Challenges and Future Work: The project encountered challenges related to feature
engineering and model fine-tuning. Future work may involve exploring more advanced
modeling techniques to improve classification accuracy further.

6. Practical Application: The Iris flower classification model can be applied in real-world
scenarios, such as botany and horticulture, to automate the identification of Iris species
based on physical characteristics.

In conclusion, the Iris flower classification project successfully employed Random Forest
(tuned) as the final prediction model to classify Iris species. The project's outcomes have
practical implications in the field of botany and offer valuable insights into feature
importance for species differentiation. Further refinements and enhancements may lead to
even more accurate and reliable classification models in the future.

47 of 47 15/10/2024, 7:06 AM

You might also like