Mini Project With Output

Step 1: Data Acquisition and Understanding

1. Load required Libraries and Dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Dry_Bean_Dataset.csv')
df.head()

2. Perform initial Exploratory Data Analysis (EDA) to understand basic statistics:

df.info()
df.describe()

3. Check for missing values:

df.isnull().sum()

OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13611 entries, 0 to 13610
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Area             13611 non-null  int64
 1   Perimeter        13611 non-null  float64
 2   MajorAxisLength  13611 non-null  float64
 3   MinorAxisLength  13611 non-null  float64
 4   AspectRation     13611 non-null  float64
 5   Eccentricity     13611 non-null  float64
 6   ConvexArea       13611 non-null  int64
 7   EquivDiameter    13611 non-null  float64
 8   Extent           13611 non-null  float64
 9   Solidity         13611 non-null  float64
 10  roundness        13611 non-null  float64
 11  Compactness      13611 non-null  float64
 12  ShapeFactor1     13611 non-null  float64
 13  ShapeFactor2     13611 non-null  float64
 14  ShapeFactor3     13611 non-null  float64
 15  ShapeFactor4     13611 non-null  float64
 16  Class            13611 non-null  object
dtypes: float64(14), int64(2), object(1)
memory usage: 1.8+ MB

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

Step 2: Data Preprocessing and Transformation

1. Handling Missing Values: The check in Step 1 found no missing values, so this dataset
needs no imputation. If a dataset does have missing values, they can be handled by
imputing the mean for numerical columns or using forward-fill for categorical columns.
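
As a minimal sketch of the two strategies just described, assuming a small hypothetical frame (`toy`, with made-up values and gaps, since the Dry Bean data itself has none), mean imputation and forward-fill could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the values are illustrative only.
toy = pd.DataFrame({
    "Area": [100.0, np.nan, 300.0, 200.0],
    "Perimeter": [40.0, 50.0, np.nan, 60.0],
    "Class": ["SEKER", None, "BOMBAY", "BOMBAY"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
# Numerical columns: replace each gap with that column's mean.
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())
# Categorical columns: carry the previous row's value forward.
# (A gap in the very first row would remain, as there is nothing to carry.)
filled[cat_cols] = filled[cat_cols].ffill()

print(filled)
```

When a train/test split follows, computing the fill values on the training rows only avoids leaking information from the test rows.
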
2. Splitting the dataset using train_test_split:

X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
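
The call above splits the rows at random. If the classes are unevenly represented, one option (not used in the original code) is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. A small sketch on made-up labels, where `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))           # 100 made-up samples, 3 features
y_demo = np.array(["A"] * 80 + ["B"] * 20)   # imbalanced labels: 80% A, 20% B

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the 80/20 class ratio is preserved in the test part.
print("Share of class B in test set:", (y_te == "B").mean())
```
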
3. Feature scaling on data:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
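
The scaler is fitted on the training data only and then reused on the test data, so the test rows do not influence the scaling parameters. A minimal sketch of that behaviour on made-up numbers (`train` and `test` are hypothetical arrays, not the Dry Bean features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_s = scaler.fit_transform(train)  # learns mean and std from train
test_s = scaler.transform(test)        # reuses the train mean and std

# Training columns end up with mean 0 and standard deviation 1;
# test columns are only approximately standardized.
print("train mean:", train_s.mean(axis=0), "train std:", train_s.std(axis=0))
print("test mean:", test_s.mean(axis=0), "test std:", test_s.std(axis=0))
```
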

Step 3: Data Visualization

1. Correlation Matrix to see relationships between features:

import seaborn as sns
import matplotlib.pyplot as plt

# Class is a text column, so restrict the correlation to numeric features
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

OUTPUT:
<Axes: >
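
A heatmap with 16 features can be hard to read, so it may help to list the strongly correlated pairs as text. The sketch below assumes a hypothetical frame `toy` with made-up values and an arbitrary threshold of 0.9; it is not output from the Dry Bean data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(20000, 80000, size=200)

# Made-up data: ConvexArea is built to track Area closely, Extent is independent.
toy = pd.DataFrame({
    "Area": area,
    "ConvexArea": area * 1.01 + rng.normal(0, 50, size=200),
    "Extent": rng.uniform(0.6, 0.85, size=200),
    "Class": ["SEKER", "BOMBAY"] * 100,
})

corr = toy.corr(numeric_only=True)  # the text column Class is skipped

# Keep only the upper triangle so each pair appears once, then flatten.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
strong = pairs[pairs.abs() > 0.9]

print(strong)
```
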

2. Distribution Plots of key features to understand the spread of values:

sns.displot(df['Area'])
sns.displot(df['MajorAxisLength'])

OUTPUT:
<seaborn.axisgrid.FacetGrid at 0x77fc9ff6ad70>

3. Pair Plots to visualize relationships between input features:

# 'Class' must be among the selected columns for hue='Class' to work
sns.pairplot(df[['Area', 'Perimeter', 'MajorAxisLength',
                 'MinorAxisLength', 'AspectRation', 'Class']], hue='Class')

OUTPUT:
<seaborn.axisgrid.PairGrid at 0x77fce876bee0>

Step 4: Model Building

Implement multiple models for comparison:

1. Logistic Regression:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

2. Support Vector Machine (SVM):

from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)

3. K-Nearest Neighbors (KNN):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
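
The three models can also be trained in one loop, with scaling bundled into each model through a scikit-learn pipeline. This is only a sketch: it uses scikit-learn's bundled wine dataset as a stand-in because the CSV may not be at hand, and `max_iter=1000` and `stratify` are assumptions that the original code does not use.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; swap in the Dry Bean features and labels when available.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

fitted = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training rows only, then the model.
    pipe = make_pipeline(StandardScaler(), model)
    fitted[name] = pipe.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(fitted[name].score(X_te, y_te), 3))
```

Keeping the scaler inside the pipeline means the same preprocessing is applied automatically whenever `predict` or `score` is called.
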

Step 5: Model Evaluation

1. Accuracy:

Logistic Regression:

from sklearn.metrics import accuracy_score

y_pred_logreg = logreg.predict(X_test)
print("Accuracy for Logistic Regression:", accuracy_score(y_test,
y_pred_logreg))

OUTPUT:
Accuracy for Logistic Regression: 0.9265515975027543

Support Vector Machine (SVM):

from sklearn.metrics import accuracy_score

y_pred_svm = svm.predict(X_test)
print("Accuracy for SVM:", accuracy_score(y_test, y_pred_svm))

OUTPUT:
Accuracy for SVM: 0.9338964377524789

K-Nearest Neighbors (KNN):

from sklearn.metrics import accuracy_score

y_pred_knn = knn.predict(X_test)
print("Accuracy for KNN:", accuracy_score(y_test, y_pred_knn))

OUTPUT:

Accuracy for KNN: 0.9232464193903782
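
The three accuracies above come from a single 80/20 split and differ by about one percentage point, so the ranking could change with a different split. One way to check this, not part of the original steps, is k-fold cross-validation. The sketch below again uses scikit-learn's bundled wine dataset as a stand-in; the 5 folds and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; swap in the Dry Bean features and labels when available.
X_demo, y_demo = load_wine(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in candidates.items():
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    pipe = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f} std={cv_scores[name].std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the difference is probably not meaningful.
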

2. Compare accuracy across models to select the best one:

o Logistic Regression: 92.66% accuracy
o SVM: 93.39% accuracy
o K-Nearest Neighbors (KNN): 92.32% accuracy

Conclusion and Insights

The best-performing model is the Support Vector Machine (SVM), with a test accuracy of 93.39%,
ahead of Logistic Regression (92.66%) and KNN (92.32%).
The features reported as most important for predicting the dry bean class are Area,
MajorAxisLength, and Perimeter.
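
The steps above do not show how the feature ranking was obtained. One common way to produce such a ranking is the impurity-based importances of a `RandomForestClassifier`, which Step 1 imports but never uses. The sketch below is an assumption about how this could be done, not the original method, and it uses scikit-learn's bundled wine dataset as a stand-in for the Dry Bean data.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; swap in the Dry Bean features and labels when available.
data = load_wine(as_frame=True)
X_demo, y_demo = data.data, data.target

# Tree-based models do not need feature scaling.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_demo, y_demo)

# Importances sum to 1; larger values mean the feature was used more
# effectively for splitting across the trees.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances.head(3))
```

Impurity-based importances can be spread thinly across strongly correlated features, which matters here because several shape measurements are closely related.
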
