Iris Flower Classification

The Iris Flower Classification project aims to classify iris flowers into three species using a dataset of 150 instances with four features. Three classification algorithms, Logistic Regression, K-Nearest Neighbors, and Random Forest, were implemented, all achieving an accuracy of 97%, with Random Forest selected as the final model. The trained Random Forest model is saved for future predictions, highlighting its robustness and effectiveness in handling non-linearity.

Iris Flower Classification

1. Introduction
The Iris dataset is a well-known dataset in the field of machine learning, commonly used for
classification tasks. The dataset consists of 150 instances with four features: sepal length, sepal width,
petal length, and petal width. The goal of this project is to develop machine learning models that can
accurately classify iris flowers into one of three species: Setosa, Versicolor, and Virginica.
This project employs three classification algorithms: Logistic Regression, K-Nearest Neighbors
(KNN), and Random Forest. The trained models are evaluated based on their accuracy, and the best-
performing model is saved for future predictions.

2. Data Preprocessing
2.1 Dataset Overview
The dataset consists of the following features:
• sepal_length (continuous variable)
• sepal_width (continuous variable)
• petal_length (continuous variable)
• petal_width (continuous variable)
• species (categorical target variable with three classes: Setosa, Versicolor, Virginica)

2.2 Loading the Data


The dataset is loaded into a Pandas DataFrame using the read_csv() function. The info() method is
used to inspect data types and check for missing values. Additionally, describe() is used to understand
statistical properties such as mean, standard deviation, and percentiles of each feature.
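The loading and inspection steps above can be sketched as follows. Since no CSV file accompanies this report, the sketch builds the same DataFrame from scikit-learn's bundled copy of the dataset instead of read_csv(); the column renaming mirrors the feature names listed in Section 2.1.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for pd.read_csv("iris.csv"): build the DataFrame from
# scikit-learn's bundled copy of the dataset.
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
df["species"] = iris.target_names[iris.target]
df = df.drop(columns="target")

df.info()              # data types and non-null counts (no missing values expected)
print(df.describe())   # mean, standard deviation, and percentiles per feature
```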

2.3 Exploratory Data Analysis (EDA)


Outlier Detection
To identify potential outliers in the dataset, a box plot is created for the sepal_length feature. Outliers
can affect model performance and may require handling through techniques such as removal,
transformation, or imputation.
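The whisker rule a box plot visualises can also be applied numerically. As a sketch (again using scikit-learn's bundled copy of the dataset), Tukey's rule flags values outside 1.5 IQR of the quartiles:

```python
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
col = df["sepal length (cm)"]

# Tukey's rule: the same 1.5 * IQR criterion a box plot's whiskers show.
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
print(f"{int(mask.sum())} potential outliers in sepal length")
```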
Feature Relationship Analysis
A scatter plot is generated to analyze the relationship between sepal_length and sepal_width, with
species color-coded. This visualization helps identify patterns or clusters in the data that may be
useful for classification.
Correlation Analysis
A correlation matrix is computed to understand the interdependence of the features. Strong correlations between features suggest that some variables are redundant, while features that vary strongly with the target classes carry significant predictive power.
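Computing the matrix is a one-liner with Pandas; the sketch below uses scikit-learn's bundled copy of the dataset. In the iris data, petal length and petal width are very strongly correlated.

```python
from sklearn.datasets import load_iris

# Pairwise Pearson correlations between the four measurements.
df = load_iris(as_frame=True).frame.drop(columns="target")
corr = df.corr()
print(corr.round(2))
```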

3. Data Splitting and Preprocessing


3.1 Splitting the Dataset
To ensure unbiased model evaluation, the dataset is split into training and testing sets using
train_test_split() from Scikit-learn. A 70-30 split is used, where 70% of the data is used for training,
and 30% is used for testing.
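A minimal sketch of the split; stratify and random_state are assumptions added here (the report does not specify them) to keep class proportions balanced and the split reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70-30 split: 105 training rows, 45 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```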

3.2 Feature Scaling


Scaling is crucial for models like Support Vector Machines (SVM) and KNN, which are sensitive to
feature magnitudes. The StandardScaler from Scikit-learn is used to normalize the training and
testing datasets.
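A sketch of the scaling step. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into the preprocessing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training data only; reuse the same transform on the test data.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```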

4. Model Implementation
4.1 Logistic Regression
Logistic Regression is a widely used classification algorithm that works well for linearly separable
data.
• Model Training: The model is trained with a maximum of 200 iterations.
• Prediction: The trained model predicts the species of the test dataset.
• Evaluation: The accuracy score is computed using accuracy_score().

Results:
The Logistic Regression model achieved an accuracy of 97%.
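The training and evaluation steps above can be sketched as follows; the split and random_state are assumptions, so the exact accuracy may differ slightly from the reported figure.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)

# max_iter=200 as in the report; all other settings are scikit-learn defaults.
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(scaler.transform(X_train), y_train)
acc = accuracy_score(y_test, log_reg.predict(scaler.transform(X_test)))
print("accuracy:", acc)
```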

4.2 K-Nearest Neighbors (KNN)


KNN is a distance-based algorithm that classifies new points based on their nearest neighbors.
• Model Training: A KNN classifier with 3 neighbors is used.
• Prediction: The trained model predicts species on the test data.
• Evaluation: The accuracy score is computed.
Results:
The KNN model achieved an accuracy of 97%.
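A sketch of the KNN step, with k=3 as stated above. Because KNN is distance-based, it is fitted on the scaled features; the split parameters are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)

# Distance-based model, so it uses the standardised features.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)
acc = accuracy_score(y_test, knn.predict(scaler.transform(X_test)))
print("accuracy:", acc)
```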

4.3 Random Forest Classifier


Random Forest is an ensemble learning method that builds multiple decision trees and merges their
results to improve accuracy and reduce overfitting.
• Model Training: A Random Forest classifier with 100 estimators is trained.
• Prediction: The model predicts test set species without requiring feature scaling.
• Evaluation: The accuracy score is calculated.
Results:
The Random Forest model achieved an accuracy of 97%.
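A sketch of the Random Forest step, with 100 estimators as stated above. Tree ensembles are insensitive to feature scale, so the raw features are used directly; the split and random_state are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 100 trees as in the report; no feature scaling is required.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))
print("accuracy:", acc)
```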
5. Model Comparison
Model                  Accuracy

Logistic Regression    97%

K-Nearest Neighbors    97%

Random Forest          97%

Since all three models achieved the same accuracy on the test set, the Random Forest model was selected as the final model for its ability to handle non-linearity and its robustness against overfitting.

6. Model Deployment
To ensure the model's reusability, the trained Random Forest model is saved using the pickle
module.
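The save-and-reload cycle can be sketched as follows; the filename is illustrative, and the model here is trained on the full dataset for brevity.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Serialise the fitted model to disk.
with open("iris_rf_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload it later to make predictions without retraining.
with open("iris_rf_model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:5]))
```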

7. Conclusion and Future Work


7.1 Conclusion
This project successfully implemented and evaluated three classification models for predicting iris
species. The Random Forest model provided the best accuracy, making it the most suitable model
for deployment.
