Mini Project With Output
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('Dry_Bean_Dataset.csv')
df.head()
df.info()
df.describe()
df.isnull().sum()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13611 entries, 0 to 13610
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Area 13611 non-null int64
1 Perimeter 13611 non-null float64
2 MajorAxisLength 13611 non-null float64
3 MinorAxisLength 13611 non-null float64
4 AspectRation 13611 non-null float64
5 Eccentricity 13611 non-null float64
6 ConvexArea 13611 non-null int64
7 EquivDiameter 13611 non-null float64
8 Extent 13611 non-null float64
9 Solidity 13611 non-null float64
10 roundness 13611 non-null float64
11 Compactness 13611 non-null float64
12 ShapeFactor1 13611 non-null float64
13 ShapeFactor2 13611 non-null float64
14 ShapeFactor3 13611 non-null float64
15 ShapeFactor4 13611 non-null float64
16 Class 13611 non-null object
dtypes: float64(14), int64(2), object(1)
memory usage: 1.8+ MB
0
Area 0
Perimeter 0
MajorAxisLength 0
MinorAxisLength 0
AspectRation 0
Eccentricity 0
ConvexArea 0
EquivDiameter 0
Extent 0
Solidity 0
roundness 0
Compactness 0
ShapeFactor1 0
ShapeFactor2 0
ShapeFactor3 0
ShapeFactor4 0
Class 0
dtype: int64
1. Handling Missing Values: The null counts above show that no column has missing values,
so no imputation is needed for this dataset. If the dataset did have missing values, we could
impute the mean for numerical columns and use forward-fill for categorical columns.
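The two imputation strategies mentioned above can be sketched on a small made-up frame. The toy values and the variable names (`toy`, `num_cols`, `cat_cols`) are assumptions for illustration only; the real dataset needs none of this because it has no gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a dataset with gaps (values are made up).
toy = pd.DataFrame({
    "Area": [100.0, np.nan, 300.0, 200.0],
    "Perimeter": [10.0, 20.0, np.nan, 30.0],
    "Class": ["SEKER", None, "BOMBAY", None],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Mean imputation: each numerical gap takes the mean of its own column.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Forward-fill: each categorical gap takes the value from the previous row.
toy[cat_cols] = toy[cat_cols].ffill()

print(toy.isnull().sum())
```

Forward-fill only makes sense when neighbouring rows are related, and it leaves a gap in the first row unfilled, so it should be treated as a fallback and not a default.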
2. Splitting the dataset using train_test_split:
X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
3. Feature scaling on data:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
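The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. A minimal sketch of what that does, using synthetic stand-in data (the `*_demo` names and the random values are assumptions, not the bean measurements):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two features on very different scales.
X_train_demo = rng.normal(loc=[50000.0, 0.8], scale=[20000.0, 0.05], size=(200, 2))
X_test_demo = rng.normal(loc=[50000.0, 0.8], scale=[20000.0, 0.05], size=(50, 2))

scaler_demo = StandardScaler()
# fit_transform learns the per-feature mean and standard deviation from the training rows.
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# transform reuses those training statistics; nothing is re-learned from the test rows.
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training features end up with mean close to 0 and standard deviation close to 1.
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The scaled test features will usually be close to, but not exactly, zero mean and unit variance, because they are standardized with the training statistics.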
OUTPUT:
<Axes: >
import seaborn as sns
sns.displot(df['Area'])
sns.displot(df['MajorAxisLength'])
OUTPUT:
<seaborn.axisgrid.FacetGrid at 0x77fc9ff6ad70>
Pair Plots to visualize relationships between input features:
OUTPUT:
<seaborn.axisgrid.PairGrid at 0x77fce876bee0>
1. Logistic Regression:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
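The accuracy section also evaluates `svm` and `knn` objects, but their training code does not appear in this write-up. A minimal sketch of how they could be defined, assuming scikit-learn's `SVC` and `KNeighborsClassifier` with default hyperparameters (the settings actually used are not stated) and a synthetic stand-in for the bean data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bean data (hypothetical: 16 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=16, n_informative=8, n_classes=3, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same preprocessing as in the text: fit the scaler on the training split only.
sc = StandardScaler()
Xtr = sc.fit_transform(Xtr)
Xte = sc.transform(Xte)

# Assumed defaults: RBF kernel for the SVM, 5 neighbours for KNN.
svm = SVC()
svm.fit(Xtr, ytr)

knn = KNeighborsClassifier()
knn.fit(Xtr, ytr)

svm_acc = accuracy_score(yte, svm.predict(Xte))
knn_acc = accuracy_score(yte, knn.predict(Xte))
print("SVM accuracy on the synthetic data:", svm_acc)
print("KNN accuracy on the synthetic data:", knn_acc)
```

Both models are distance-based, which is why the feature scaling step matters more for them than it would for a tree-based model.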
1. Accuracy:
Logistic Regression:
y_pred_logreg = logreg.predict(X_test)
print("Accuracy for Logistic Regression:", accuracy_score(y_test,
y_pred_logreg))
OUTPUT:
Accuracy for Logistic Regression: 0.9265515975027543
Support Vector Machine (SVM):
y_pred_svm = svm.predict(X_test)
print("Accuracy for SVM:", accuracy_score(y_test, y_pred_svm))
OUTPUT:
Accuracy for SVM: 0.9338964377524789
K-Nearest Neighbors (KNN):
y_pred_knn = knn.predict(X_test)
print("Accuracy for KNN:", accuracy_score(y_test, y_pred_knn))
OUTPUT:
The best-performing model is the Support Vector Machine (SVM), with an accuracy of about 93.39%.
The most important features contributing to the prediction of the dry bean class are Area,
MajorAxisLength, and Perimeter.
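The text does not say how the feature ranking was obtained. `RandomForestClassifier` is imported at the top but never used, so one plausible approach (an assumption, not the documented method) is to rank features by a random forest's impurity-based importances. A minimal sketch on synthetic data with placeholder feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names (hypothetical).
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_demo, y_demo)

# feature_importances_ sums to 1; larger values mean the feature was used for more useful splits.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head(3))
```

On the real data, the forest would be fitted on `X_train` and `y_train` and the index would be the column names of `X`. Impurity-based importances can be split between strongly correlated features, which matters here because several size-related columns such as Area and Perimeter are likely to be correlated.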