
About The Dataset - Car Evaluation Dataset (UCI Machine Learning Repository)

The document summarizes a dataset on car evaluations from the UCI Machine Learning Repository. It contains 6 attributes for 1728 car instances: buying price, maintenance cost, number of doors, passenger capacity, luggage space, and safety. The data is preprocessed by converting string values to integers and splitting into training and test sets. Logistic regression and KNN classifiers are applied to predict the class attribute, with accuracy scores of around 70% and 66% respectively.

Uploaded by

frankh

8/21/2020 Course 2 - Final Project

About the dataset - Car Evaluation Dataset (UCI Machine Learning Repository)

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990). The model evaluates cars according to the following concept structure:

CAR                      car acceptability
. PRICE                  overall price
. . buying               buying price
. . maint                price of the maintenance
. TECH                   technical characteristics
. . COMFORT              comfort
. . . doors              number of doors
. . . persons            capacity in terms of persons to carry
. . . lug_boot           the size of luggage boot
. . safety               estimated safety of the car

1. Number of Instances: 1728 (instances completely cover the attribute space)
2. Number of Attributes: 6
3. Attribute Values:

buying -> v-high, high, med, low
maint -> v-high, high, med, low
doors -> 2, 3, 4, 5-more
persons -> 2, 4, more
lug_boot -> small, med, big
safety -> low, med, high

4. Missing Attribute Values: none
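
The remark that the 1728 instances completely cover the attribute space can be checked with a little arithmetic: the six attributes have 4, 4, 4, 3, 3 and 3 possible values, and 4 x 4 x 4 x 3 x 3 x 3 = 1728. A minimal sketch, assuming the value spellings that appear later in the notebook output (vhigh, 5more); the variable names are illustrative:

```python
from itertools import product

# Attribute values as they are spelled in the loaded data
attribute_values = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible combination of the six attributes
combinations = list(product(*attribute_values.values()))
print(len(combinations))  # 1728
```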

In [19]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The data has been modified and stored in an Excel sheet. Because scikit-learn works with numeric input, we will convert the string values to integer values.

In [20]: data = pd.read_excel('car_data1.xlsx')

In [21]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 8 columns):
Unnamed: 0 1728 non-null int64
buying 1728 non-null object
maint 1728 non-null object
doors 1728 non-null object
persons 1728 non-null object
lug_boot 1728 non-null object
safety 1728 non-null object
class 1728 non-null object
dtypes: int64(1), object(7)
memory usage: 108.1+ KB


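The data.info() call above is used to check for null values: every column reports 1728 non-null entries, so nothing is missing.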

In [22]: data.head()

Out[22]:
   Unnamed: 0 buying  maint doors persons lug_boot safety  class
0           0  vhigh  vhigh     2       2    small    low  unacc
1           1  vhigh  vhigh     2       2    small    med  unacc
2           2  vhigh  vhigh     2       2    small   high  unacc
3           3  vhigh  vhigh     2       2      med    low  unacc
4           4  vhigh  vhigh     2       2      med    med  unacc

In [23]: # Now checking for unique values

In [24]: for i in data.columns:
    # print the distinct values of each column, followed by how many there are
    print(data[i].unique(),"\t",data[i].nunique())

[ 0 1 2 ... 1725 1726 1727] 1728
['vhigh' 'high' 'med' 'low'] 4
['vhigh' 'high' 'med' 'low'] 4
['2' '3' '4' '5more'] 4
['2' '4' 'more'] 3
['small' 'med' 'big'] 3
['low' 'med' 'high'] 3
['unacc' 'acc' 'vgood' 'good'] 4

In [25]: from sklearn.preprocessing import LabelEncoder
# scikit-learn estimators work with numeric input, and the string values in the current dataset caused a few issues,
# so the string values are converted to integers

In [26]: le=LabelEncoder()
for i in data.columns:
    # refit the encoder on each column and replace its values with integer codes
    data[i]=le.fit_transform(data[i])

In [27]: data.head()

Out[27]:
   Unnamed: 0  buying  maint  doors  persons  lug_boot  safety  class
0           0       3      3      0        0         2       1      2
1           1       3      3      0        0         2       2      2
2           2       3      3      0        0         2       0      2
3           3       3      3      0        0         1       1      2
4           4       3      3      0        0         1       2      2
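
The codes above follow from how LabelEncoder works: it sorts the distinct labels and numbers them in that order, so the strings are ranked alphabetically rather than by meaning (for safety, high becomes 0, low becomes 1 and med becomes 2). The sketch below mimics that behaviour in plain Python and contrasts it with a hand-written ordinal mapping. The helper sorted_label_codes and the mapping safety_order are illustrative names, not part of the notebook.

```python
def sorted_label_codes(values):
    """Mimic LabelEncoder: number the distinct labels in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

safety = ["low", "med", "high"]
print(sorted_label_codes(safety))  # {'high': 0, 'low': 1, 'med': 2}

# Hypothetical alternative that keeps the natural order low < med < high
safety_order = {"low": 0, "med": 1, "high": 2}
encoded = [safety_order[v] for v in safety]
print(encoded)  # [0, 1, 2]
```

The same rule explains the class column: acc, good, unacc and vgood become 0, 1, 2 and 3, which is why the unacc rows show class 2.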


We also create a heatmap of the dataset's columns, annotated with the Pearson correlation coefficient for each pair of columns.

In [29]: fig=plt.figure(figsize=(10,6))
sns.heatmap(data.corr(),annot=True).set_title("Heatmap showing Pearson's correlation coefficient")

Out[29]: Text(0.5, 1, "Heatmap showing Pearson's correlation coefficient")
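
For reference, the Pearson coefficient that data.corr() reports for each pair of columns is the covariance of the two columns divided by the product of their standard deviations. A minimal pure-Python sketch of that formula (the function name pearson and the sample lists are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson([0, 1, 2, 3], [1, 3, 5, 7]))  # perfectly linear, about 1.0
print(pearson([0, 1, 2, 3], [3, 2, 1, 0]))  # perfectly inverse, about -1.0
```

Because the columns here hold alphabetical label codes, the coefficients depend on that arbitrary numbering and should be read with care.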

The X dataframe holds the input features (every column except class), and y is the series holding the class labels that we will try to predict.

In [30]: X = data[data.columns[:-1]]
y = data['class']
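
Note that data.columns[:-1] also keeps the Unnamed: 0 column, which is a row index rather than a car attribute. One possible variant, which is not what the notebook does, is to leave that column out when building X. A sketch on a tiny stand-in frame, using the first two encoded rows shown earlier; the names sample, feature_columns, X_features and y_labels are illustrative:

```python
import pandas as pd

# Tiny stand-in for the encoded data frame (first two rows only)
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "buying": [3, 3], "maint": [3, 3], "doors": [0, 0],
    "persons": [0, 0], "lug_boot": [2, 2], "safety": [1, 2],
    "class": [2, 2],
})

# Keep only the six car attributes as features
feature_columns = [c for c in sample.columns if c not in ("Unnamed: 0", "class")]
X_features = sample[feature_columns]
y_labels = sample["class"]
print(list(X_features.columns))
```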

In [31]: data.head()

Out[31]:
   Unnamed: 0  buying  maint  doors  persons  lug_boot  safety  class
0           0       3      3      0        0         2       1      2
1           1       3      3      0        0         2       2      2
2           2       3      3      0        0         2       0      2
3           3       3      3      0        0         1       1      2
4           4       3      3      0        0         1       2      2

In [32]: # We divide our data into train and test splits
from sklearn.model_selection import train_test_split


In [33]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
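
With test_size=0.3, train_test_split rounds the test share up, so the 1728 rows are divided into 519 test rows (the support total in the classification report further down) and 1209 training rows. The arithmetic, as a quick check with illustrative variable names:

```python
import math

n_samples = 1728
test_size = 0.3

n_test = math.ceil(test_size * n_samples)  # 518.4 rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # 1209 519
```

Since the classes are imbalanced, passing stratify=y to train_test_split would keep the class proportions similar in both parts; the notebook does not do this.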

In [35]: from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

Implementing Logistic Regression


In [36]: regression = LogisticRegression(solver='newton-cg',multi_class='multinomial')

In [37]: regression.fit(X_train,y_train)

/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/optimize.py:203: ConvergenceWarning: newton-cg failed to converge. Increase the number of iterations.
  "number of iterations.", ConvergenceWarning)

Out[37]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
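
The ConvergenceWarning means the solver reached its max_iter=100 limit before meeting the tolerance. Two common remedies are raising max_iter and standardizing the inputs. The sketch below shows both on synthetic data from make_classification, since the Excel file is not available here. The data, the pipeline and max_iter=1000 are illustrative assumptions rather than the notebook's settings, and multi_class is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class stand-in for the car data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Scale the features, then fit with a larger iteration budget
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="newton-cg", max_iter=1000),
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```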

In [38]: pred = regression.predict(X_test)

In [39]: regression.score(X_test,y_test)

Out[39]: 0.697495183044316

The logistic regression accuracy of about 0.70 leaves considerable room for improvement.
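
To put this score in context, it helps to compare it with a classifier that always predicts the most frequent class. Using the per-class support counts from the classification report further down (118, 19, 358 and 24 test instances), that baseline already reaches about 0.69. A quick check with illustrative variable names:

```python
# Test-set class counts, taken from the support column of the report
support = {0: 118, 1: 19, 2: 358, 3: 24}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(total, majority_class, round(baseline_accuracy, 3))  # 519 2 0.69
```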

Implementing KNN Classification


In [40]: knn = KNeighborsClassifier(n_jobs=-1)

In [41]: knn.fit(X_train,y_train)
pred = knn.predict(X_test)
knn.score(X_test,y_test)

Out[41]: 0.6608863198458574
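
cross_val_score is imported earlier in the notebook but never used. A single train/test split can give a noisy estimate, so one option is to average the accuracy over several folds. A sketch on synthetic stand-in data; make_classification, cv=5 and the variable names are assumptions for illustration, not the notebook's setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the encoded car features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

knn_demo = KNeighborsClassifier()  # default n_neighbors=5
scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5)  # accuracy per fold
print(scores.mean(), scores.std())
```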


In [43]: # The KNN classifier's accuracy (about 0.66) is somewhat lower than that of logistic regression (about 0.70)

In [44]: print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.42      0.32      0.37       118
           1       0.20      0.37      0.26        19
           2       0.76      0.83      0.79       358
           3       0.00      0.00      0.00        24

    accuracy                           0.66       519
   macro avg       0.35      0.38      0.35       519
weighted avg       0.63      0.66      0.64       519
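
The two summary rows can be reproduced from the per-class rows: the macro average is the plain mean over the four classes, while the weighted average weights each class by its support. Because the inputs below are the rounded values from the report, the results agree with the printed ones only up to rounding. Variable names are illustrative.

```python
# Per-class f1-score and support, copied from the report above
f1 = {0: 0.37, 1: 0.26, 2: 0.79, 3: 0.00}
support = {0: 118, 1: 19, 2: 358, 3: 24}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"macro f1: {macro_f1:.3f}, weighted f1: {weighted_f1:.3f}")
```

The gap between the two figures reflects the class imbalance: class 2, with 358 of the 519 test instances, dominates the weighted average.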

In [45]: # The report above gives precision, recall, f1-score and support for each class of the KNN classifier
# The f1-score is good only for class 2 (0.79); it is low for classes 0 and 1, and 0.00 for class 3

Conclusion
Logistic regression gave better accuracy (about 0.70) than the KNN classifier (about 0.66).
