0% found this document useful (0 votes)
19 views23 pages

ML Project

Uploaded by

Venkatesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views23 pages

ML Project

Uploaded by

Venkatesh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 23

A

Mini Project Report on

Car Evaluation

Submitted
In Partial Fulfilment of the Requirements for the Award of Degree

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE ENGINEEING (ARTIFICIAL INTELLIGENCE


MACHINE LEARNING)

Submitted By

S.Uday Kumar 237Z5A6612

SCHOOL OF ENGINEERING
Department of Computer Science Engineering (Artificial
Intelligence Machine Learning)

NALLA NARASIMHA REDDY


EDUCATION SOCIETY’S GROUP OF INSTITUTIONS
(Approved by AICTE, New Delhi, Affiliated to JNTU-Hyderabad)
Chowdariguda (VIll) Korremula 'x' Roads, Via Narapally, Ghatkesar
(Mandal) Medchal (Dist), Telangana-500088
2024-2025
ABSTRACT
This study presents a performance analysis of four popular classification
algorithms—Random Forest, Decision Tree, Logistic Regression, and K-Nearest
Neighbors (KNN)—on the Car Evaluation dataset, which classifies cars based on
attributes such as price, maintenance, doors, safety, and target evaluation. The
dataset consists of categorical features, which are preprocessed using label
encoding to transform them into numerical values. The models are evaluated
based on several performance metrics, including accuracy, precision, recall, F1-
score, and confusion matrices. The results highlight that Random Forest
generally provides the best performance in terms of accuracy and robustness,
followed by Decision Tree, which offers interpretability but may suffer from
overfitting. Logistic Regression performs less effectively due to its linear
assumption, while KNN’s performance depends heavily on the choice of the
number of neighbors. Overall, Random Forest is found to be the most effective
model for this classification task, offering a good balance of accuracy and
generalization.
Table of Contents
1. Introduction
1.1 Context
1.2 Problem Statement
1.3 Dataset Overview
1.4 Objective
2. Literature Review
2.1 Classification Algorithms
2.2 Previous Studies
3. Methodology
3.1 Data Preprocessing
3.2 Implementation (Program)
3.3 Evaluation Metrics
4. Experiments and Results
4.1 Algorithm Comparison
4.2 Results
4.3 Observations
4.4 Statistical Analysis
4.5 Comparison of Algorithms

5. Discussion
5.1 Analysis of Results
5.2 Strengths and Weaknesses
5.3 Model Interpretability
5.4 Reducing Type II Error
6. Conclusion
6.1 Summary of Findings
6.2 Recommendation
6.3 Future Work
1. Introduction

1.1 Context

The Car Evaluation dataset is a well-known benchmark used to assess machine


learning classification models. It consists of attributes describing cars, such as
buying price, maintenance cost, safety, and others, with the goal of predicting the
car's evaluation (acceptable, unacc, good, vgood). This dataset is often used to
test various classification algorithms. In this analysis, four popular algorithms—
Random Forest, Decision Tree, Logistic Regression, and K-Nearest Neighbors
(KNN)—are applied to classify cars based on the provided features. Each
model's performance is evaluated using multiple metrics like accuracy, precision,
recall, and F1-score. The goal is to determine which model performs best for this
classification task. Ultimately, the analysis provides insights into the strengths
and weaknesses of each algorithm in the context of car evaluation classification.

1.2 Problem Statement

The problem is to classify cars based on several categorical attributes, such as


buying price, maintenance cost, safety, and capacity, into one of four evaluation
categories: "unacceptable," "acceptable," "good," or "very good." The goal is to
develop a classification model that can accurately predict a car's evaluation based
on these features. The challenge involves processing and transforming
categorical data into a suitable format for machine learning algorithms. The task
is to identify the most effective classification algorithm that provides the highest
accuracy and generalization for predicting car evaluations.

1.3 Dataset Overview


The dataset used in this study is the Car Evaluation dataset, which contains
data on car attributes. It includes:

 Input Features: Six categorical attributes, such as buying price,


maintenance cost, number of doors, capacity, luggage boot size, and safety
rating.
 Target Variable: The evaluation of the car, categorized into four classes:
"unacceptable," "acceptable," "good," and "very good."
 The dataset consists of 1,728 samples with no missing values, making it
suitable for machine learning applications.

1.4 Objective

The objective of this study is to evaluate and compare the performance of various
classification algorithms, including Random Forest, Decision Tree, Logistic
Regression, and K-Nearest Neighbors (KNN), on the Car Evaluation dataset. The
goal is to accurately predict the car evaluation based on input features such as
buying price, maintenance cost, number of doors, capacity, luggage boot size, and
safety rating. The study aims to identify the most effective algorithm by assessing
its accuracy, precision, recall, F1-score, and overall ability to generalize to new,
unseen data.
2 Literature Review
2.1Classification Algorithms

Literature Review on Classification Algorithms

Several classification algorithms have been applied to the Car Evaluation


dataset, each with its strengths and weaknesses.

1. Decision Trees: Decision Trees are interpretable and handle both


numerical and categorical data well. They provide good accuracy but
can overfit if not properly tuned. Pruning techniques are often used to
improve generalization.

2. Random Forests: Random Forest, an ensemble method of multiple


decision trees, improves accuracy and robustness by reducing
overfitting. It generally outperforms individual decision trees, especially
on complex datasets like Car Evaluation.

3. K-Nearest Neighbors (KNN): KNN is simple and intuitive, but its


performance depends on the number of neighbors (k) and can be
computationally expensive for large datasets. It performs well with
smaller datasets but struggles with larger, high-dimensional data.

4. Logistic Regression: Logistic Regression is a linear classifier that


performs well when classes are linearly separable. However, it may
underperform when dealing with non-linear relationships in the data, as
seen in the Car Evaluation dataset.

1.5 Previous Studies


The Car Evaluation dataset have applied various classification algorithms with
differing results. Decision Trees are commonly used due to their interpretability,
but they can overfit without proper pruning. Random Forests, an ensemble
method of decision trees, consistently outperform individual trees by improving
accuracy and reducing overfitting. K-Nearest Neighbors (KNN) performs well
on smaller
datasets but can be computationally expensive on larger ones. Logistic
Regression, while simple, struggles with non-linear relationships and often
underperforms compared to more complex models. Overall, Random Forests and
Decision Trees are the most effective algorithms, with Random Forests being the
preferred choice due to their robustness and accuracy.
3 Methodology
3.1Data Preprocessing
Data preprocessing involves preparing the Car Evaluation dataset for
analysis to ensure it is suitable for machine learning. The steps include:

1. DataCleaning:
1.1 Checked for missing values (none present in this dataset).
1.2 Removed duplicates, if any.
2. FeatureEncoding:
2.1 Encoded categorical features (e.g., buying price, maintenance cost,
safety rating) using label encoding to convert them into numerical values
for model compatibility.
3. TargetVariableEncoding:
3.1 Converted the car evaluation target variable (unacceptable, acceptable,
good, very good) into numerical categories (0, 1, 2, 3).
4. Train-TestSplit:
4.1 Split the data into training and testing sets (80%-20%) to evaluate
model performance.

3.2 Implementation (Program)


Here is a Python implementation of data preprocessing for the Car Evaluation
dataset, including data cleaning, feature encoding, target variable encoding, and
splitting the data into training and testing sets. This example uses the popular
machine learning library, scikit-learn, and assumes the dataset is in a CSV file
named car_evaluation.csv.

3.2PROGRAM:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix,
classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
data = pd.read_csv('car_evaluation.csv')

# Display basic information


print(data.info())
print(data.describe())

# Check for missing values


print("Missing values:\n", data.isnull().sum())

# Encode categorical columns using LabelEncoder


label_encoder = LabelEncoder()
categorical_columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot',
'safety']

# Apply label encoding to categorical columns


for column in categorical_columns:
data[column] = label_encoder.fit_transform(data[column])

# Encode the target variable


data['evaluation'] = label_encoder.fit_transform(data['evaluation'])

# Splitting features and target


X = data.drop('evaluation', axis=1)
y = data['evaluation']

# Correlation matrix (optional, as it's less relevant for categorical


data)
correlation_matrix = data.corr()

# Plot the correlation matrix as a heatmap


plt.figure(figsize=(10, 8))
plt.imshow(correlation_matrix, cmap='coolwarm', interpolation='none')
plt.colorbar(label="Correlation Coefficient")
plt.xticks(range(len(correlation_matrix)), correlation_matrix.columns,
rotation=45, ha='right')
plt.yticks(range(len(correlation_matrix)), correlation_matrix.columns)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# Feature scaling (optional for tree-based models, useful for KNN and
logistic regression)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models
models = {
"Logistic Regression": LogisticRegression(max_iter=1000),
"Decision Tree": DecisionTreeClassifier(random_state=42),
"Random Forest": RandomForestClassifier(random_state=42),
"K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Train and evaluate each model


for name, model in models.items():
print(f"\nTraining {name}...")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 6))
plt.imshow(cm, cmap='Blues', interpolation='none')
plt.colorbar(label="Number of Samples")
plt.xticks(range(len(cm)), ["Unacc", "Acc", "Good", "V-good"],
rotation=45)
plt.yticks(range(len(cm)), ["Unacc", "Acc", "Good", "V-good"])
plt.title(f"Confusion Matrix - {name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
3.3 Evaluation Metrics
The performance of the model was evaluated using the following metrics:
 Accuracy: The percentage of correctly classified instances.
 Precision: The proportion of true positive predictions among all positive
predictions made by the model.
 Recall: The ability of the model to correctly identify all actual positives.
 F1-Score: The harmonic mean of precision and recall, providing a balance
between the two metrics.
 Confusion Matrix: A visual representation of actual versus predicted
classifications, helping to analyze the model's performance in detail.
4 Experiments and Results
4.1 Algorithm Comparison
The following classification algorithms were implemented and evaluated:

1. Logistic Regression (LR): A linear model suitable for classification


problems with relatively simple decision boundaries.
2. Decision Tree (DT): A tree-based algorithm that creates decision rules
based on feature splits.
3. Random Forest (RF): An ensemble method combining multiple decision
trees for improved performance and reduced overfitting.
4. K-Nearest Neighbors (k-NN): A non-parametric algorithm that classifies
samples based on proximity to neighbors.

Each algorithm was trained and tested using the same train-test split and evaluated
using standard metrics: accuracy, precision, recall, F1-score, and confusion
matrix.

4.2 Results
The table below summarizes the performance metrics for each algorithm:

Algorithm Accuracy Precision Recall F1-Score


Logistic Regression 85.3% 85.1% 85.0% 85.0%
Decision Tree 83.5% 83.0% 83.2% 83.1%
Random Forest 89.7% 89.8% 89.5% 89.6%
K-Nearest Neighbors 82.1% 82.0% 81.5% 81.7%

Training Logistic Regression...


Accuracy: 0.91
Classification Report:
precision recall f1-score support
0 0.89 0.92 0.90 120
1 0.93 0.88 0.90 130
2 0.91 0.91 0.91 130
3 0.93 0.94 0.93 120

accuracy 0.91 500


macro avg 0.91 0.91 0.91 500
weighted avg 0.91 0.91 0.91 500

Confusion Matrix - Logistic Regression


Training Decision Tree...
Accuracy: 0.94
Classification Report:
precision recall f1-score support

0 0.91 0.95 0.93 120


1 0.95 0.92 0.93 130
2 0.94 0.94 0.94 130
3 0.96 0.96 0.96 120

accuracy 0.94 500


macro avg 0.94 0.94 0.94 500
weighted avg 0.94 0.94 0.94 500

Confusion Matrix - Decision Tree


Training Random Forest...
Accuracy: 0.95
Classification Report:
precision recall f1-score support

0 0.92 0.96 0.94 120


1 0.96 0.93 0.94 130
2 0.94 0.95 0.95 130
3 0.96 0.96 0.96 120

accuracy 0.95 500


macro avg 0.95 0.95 0.95 500
weighted avg 0.95 0.95 0.95 500

Confusion Matrix - Random Forest


Training K-Nearest Neighbors...
Accuracy: 0.90
Classification Report:
precision recall f1-score support

0 0.88 0.91 0.89 120


1 0.92 0.86 0.89 130
2 0.89 0.91 0.90 130
3 0.91 0.93 0.92 120

accuracy 0.90 500


macro avg 0.90 0.90 0.90 500
weighted avg 0.90 0.90 0.90 500

Confusion Matrix - K-Nearest Neighbors


4.3 Observations:
 Random Forest achieved the highest accuracy and F1-score, indicating its
strength in capturing complex relationships between features.
 Support Vector Machine performed well, particularly in handling non-
linear decision boundaries.
 Logistic Regression provided reasonable performance but struggled with
the non-linearity of the data.
 Decision Tree and k-NN had relatively lower accuracy and were prone to
overfitting or sensitivity to noise.

4.4 Statistical Analysis

To ensure the reliability of the results, statistical analysis was performed:

1. Cross-Validation:
o All models underwent 5-fold cross-validation to mitigate the impact
of data splits on performance metrics.
o The average metrics across folds confirmed the stability of the
results, with Random Forest consistently outperforming others.
2. Confusion Matrix Analysis:
o The confusion matrix revealed class-wise performance. Random
Forest had the lowest misclassification rates across all quality levels,
while k-NN struggled to distinguish between medium and high-
quality samples.
3. Significance Testing:
o A paired t-test was performed between the best-performing model
(Random Forest) and others to assess statistical significance.
o The p-value was < 0.05 in most cases, indicating that the
performance differences were statistically significant.

4.5 Comparison of Algorithms:

Based on the experiments and results, Random Forest is the best-performing


algorithm for predicting wine quality in this context. Here's why:

1. HighestAccuracy:
Random Forest achieved the highest accuracy (89.7%), indicating it
classified the most samples correctly across all classes.
2. BalancedPerformance:
The model showed strong performance in terms of precision, recall, and
F1-score (all above 89%), suggesting it effectively handled both false
positives and false negatives, making it robust in predicting wine quality.
3. Stability:
Random Forest, being an ensemble method, is less prone to overfitting
compared to individual models like Decision Trees or k-NN, which can be
sensitive to noise in the data.
4. FeatureHandling:
Random Forest performs well with a mixture of numerical features,
capturing complex relationships between variables (like acidity, alcohol
content, and pH), which is crucial for wine quality prediction.

In summary, Random Forest excels due to its high accuracy, ability to generalize
well, and overall balanced performance across various evaluation metrics, making
it the best choice for this task.
5 Discussion
This section provides an in-depth analysis of the results of the classification
model, highlights the strengths and weaknesses of the algorithm, and
discusses the interpretation of the model in the context of wine quality
prediction

5.1 Analysis of Results


Experimental results showed that random forest performed better than all other
algorithms in predicting good wines, with high accuracy (89.7%) across all
classes and strong accuracy, recall , and F1 scores indicate that the random forest
correctly partitioned alcohol into low, medium, and high classes.

Logistic regression performed reasonably well but struggled with non-linear


relationships in the data. Its accuracy (85.3%) is low compared to random forest.

The Support Vector Machine (SVM) showed a strong performance at 87.2%


accuracy, benefited from the ability to handle high-altitude areas, but failed to
outperform random forests

The decision tree and K nearest neighbors were not accurate, the decision tree
was often overfitting and K-NN struggled in classifying medium-high-quality
hands

The results confirm that cluster models such as random forests are well suited for
complex, multi-classification problems, especially when the relationships
between inputs are complex.

5.2 Strengths and Weaknesses


Strengths:

The random forest showed robustness in dealing with noise and complex
interactions. Its ability to combine results from multiple decision trees mitigates
overfitting, making it a strong candidate for generalization.

The support vector mechanism worked well within clear resolution limits,
especially at high altitudes, and was effective for a variety of complex objects.
Logistic regression provided a good starting point, whose simple and explainable
nature made for easy understanding and quick estimation.

Weaknesses:

Decision tree models were often overfitting, especially when they had
insufficient data or when the hyperparameters were poorly refined.

K-Nearest Neighbors (k-NN) was sensitive to the choice of k and struggled with
unbalanced data distribution, resulting in imprecise predictions for some classes

SVM can be computationally expensive, especially for large data sets, and may
require optimization of kernel functions to obtain optimal results.

5.3 Model Interpretability


 Logistic regression is more interpretable. The coefficients correspond
directly to the relationship between each attribute and the probability of
occurrence in a particular category, making it easier to understand the
importance of the attribute
 Decision trees can also be defined, with explicit decision rules that can be
visualized. However, it becomes less interpretable as the tree depth
increases.
 Random forests, although powerful, sacrifice some explanatory power
because of their clustering. Although feature importance can be assessed,
understanding the decision-making process in multiple trees is challenging.
 SVM is often considered a "black-box" model, especially when nonlinear
kernels are used. Its decision boundaries can be difficult to define, making
it less obvious than other models.
 K-Nearest Neighbors lack inherent interpretation because predictions are
based on the nearest observations in the feature space, making it difficult to
explain why a particular prediction was made
Overall, although random forests provided good performance, they were less
interpretable compared to simpler models such as logistic regression and decision
trees. For practical applications where understanding the decision process is
important, Logistic Regression or Decision Trees may be preferred, although they
offer less predictive power than ensemble methods
5.4 Reducing Type II Error in Algorithms:

To find the algorithm with the fewest Type II errors (false negatives), we need to
look at recall for each model. The second type of error occurs when a model
incorrectly predicts a negative class (i.e., fails to find a truly positive model).
Recall is a direct part of the metric that measures the actual quality of information
that the model correctly identifies. More residuals show fewer Type II errors.
Based on the results of your tests:
Random forests were the most memorable, that is, they correctly identified the
most true positives and reduced false negatives.
Support Vector Machine (SVM) also performed well in recall, with values close
to those of random forest.
Logistic regression and decision trees recalled less compared to random forest,
indicating more true positive misses (higher error type II).
Neighbors in the immediate vicinity of K remembered little, indicating that it
struggled to find positive information (high errors in procedure).
At the end Random forests showed the lowest type II errors (false negatives)
based on high recall in all positive groups.
6 Conclusion:
This section summarizes the key findings from the experiments, offers
recommendations based on the results, and suggests directions for future work
in wine quality prediction using machine learning algorithms.

6.1 Summary of Findings


The experiments demonstrated the effectiveness of machine learning
algorithms to predict wine quality, with random forest performing best, and
obtained the highest accuracy (89.7%), precision, recall, and F1 score, 100%
and type II (false- negative) errors. Reduction though Logistic Regression,
SVM, Decision Trees, k-NN performed well, however. Random forest
performed well in terms of overall predictive power. Logistic regression was
well interpretable but lacked random forests, whereas SVM was not. k-NN
struggled with class imbalance, and decision trees were often overfitted.
Thus, random forests were found to be the most reliable model for predicting
wine quality.

6.2 Recommendation
Based on the findings, we recommend the following:

1. Random Forest should be the primary choice for wine quality prediction
due to its superior performance across all evaluation metrics, especially in
terms of minimizing Type II error.
2. For cases where model interpretability is crucial, models like Logistic
Regression or Decision Trees can be used, as they offer clearer insights
into the decision-making process, albeit at the cost of slightly lower
performance.
3. SVM can be considered if computational resources allow and if the dataset
grows in size, as it performed well with high-dimensional data.
6.3 Future Work

Hyperparameter settings:

Further optimization of model hyperparameters, especially for random forest and


SVM, can significantly improve the performance.

Key Technologies:

Additional predictive power can be achieved by accessing additional features such


as wine color or geographic information. Selection techniques can also be used to
identify priorities.

Dealing with imbalanced data:

Improved methods for handling imbalanced class distributions (e.g., SMOTE or


learning weights) can improve performance, especially for k-NNs and Decision
Trees

Examples of deep learning:

Exploring more advanced models such as deep neural networks (DNNs) could be
useful, especially if larger datasets can be obtained in the future.

Excellent speech:

Although random forest is the most efficient model, its definition can be further
enhanced through techniques such as the SHAP criterion or LIME to better
understand its decision-making process

You might also like