ISyE7406 Homework3
Homework Assignment #3
Instructor: Yajun Mei    Name: Chen-Yang (Jim) Liu    GTID: 90345****
(a) Introduction
The purpose of this assignment is to analyze a real data set with classification methods. Before the analysis, the data source and the features need to be explained; the source matters because we do not want to be analyzing artificial data, so we briefly introduce it here. This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The original data contain nine features, one of which is the name of the manufacturer. That feature was removed, and the updated dataset used here has eight features. Among them, the origin feature is the hardest to interpret; looking back at the original dataset, the numbers in the origin column indicate where each car was made.
• mpg: Miles per gallon (Continuous variable)
• cylinders: Number of cylinders in the engine (Categorical variable)
• displacement: Combined swept volume of the pistons inside the cylinders of an engine (Continuous variable)
• horsepower: Power an engine produces (Continuous variable)
• weight: Mass of a vehicle (Continuous variable)
• acceleration: Rate of change of the velocity of a car (Continuous variable)
• year: Model year of the car (Discrete variable)
• origin: 1 is a car made in America, 2 in Europe, and 3 in Asia or other parts of the world (Categorical variable)
With the features explained, some transformation of the data is necessary before the analysis. Here the response, mpg, was transformed into a binary (0-1) variable by thresholding at its median. Classification methods, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes, Logistic Regression, and K-nearest neighbors (KNN), will be applied.
(b) Exploratory Data Analysis
The scatter plots suggest that displacement, horsepower, weight, and acceleration each have a possible influence on mpg: for these features one can see implicit lines that would roughly separate the two groups. By contrast, origin, year, and cylinders appear roughly uniform across the two mpg groups.
To examine these relationships further, box and violin plots are drawn. Box plots give insight into the quantiles, while violin plots compare the shape of the distribution between the two groups. In the figures below, weight, displacement, horsepower, and acceleration show a clear distinction between the two mpg groups. Note that box and violin plots are not very informative for categorical variables; indeed, the two figures show no clear difference for them. Therefore, in this assignment, weight, displacement, horsepower, and acceleration are selected as the independent variables.
(c) Methods
1. Linear Discriminant Analysis(LDA)
• Assumption: The two classes share a common covariance matrix. The discriminant function is $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)$, where the training data are used to estimate $\pi_k$, $\mu_k$, and $\Sigma$. (A worked sketch of the discriminant functions follows this list.)
• Validation: The held-out testing set is used to evaluate the trained model.
• Statistical Package: scikit-learn is used for this Linear Discriminant Analysis.
2. Quadratic Discriminant Analysis(QDA)
• Assumption: Each class has its own covariance matrix $\Sigma_k$. The discriminant function is $\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log(\pi_k)$, where the training data are used to estimate $\pi_k$, $\mu_k$, and $\Sigma_k$.
• Validation: The held-out testing set is used to evaluate the trained model.
• Statistical Package: scikit-learn is used for Quadratic Discriminant Analysis.
3. Naive Bayes
• Assumption: Naive Bayes assumes the predictors are conditionally independent of each other given the class. The predicted class is $\arg\max_k \left( \pi_k \prod_{j=1}^{p} f_{kj}(x_j) \right)$, where the training data are used to estimate $f_{kj}(\cdot)$.
• Validation: The held-out testing set is used to evaluate the trained model.
• Statistical Package: scikit-learn is used for Naive Bayes.
4. Logistic Regression
• Assumption: Logistic regression models the probability of each case. It uses the logit link $g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$, and the model is $P(Y_i = 1) = \pi_i$, $P(Y_i = 0) = 1 - \pi_i$, with $\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}$.
• Validation: The held-out testing set is used to evaluate the trained model.
• Statistical Package: scikit-learn is used for Logistic Regression.
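To make the LDA and QDA discriminant functions above concrete, here is a minimal NumPy sketch (illustrative only and not part of the original analysis: the function names are ours, and means, cov, covs, and priors stand for the plug-in training estimates of the class means, covariance(s), and priors). The predicted class for each row is the argmax over the score columns.

import numpy as np

def lda_scores(X, means, cov, priors):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log(pi_k)
    cov_inv = np.linalg.inv(cov)
    scores = [X @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return np.column_stack(scores)  # classify each row by the argmax column

def qda_scores(X, means, covs, priors):
    # delta_k(x) = -0.5*log|Sigma_k| - 0.5*(x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log(pi_k)
    scores = []
    for mu, S, pi in zip(means, covs, priors):
        S_inv = np.linalg.inv(S)
        diff = X - mu
        quad = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # row-wise quadratic form
        scores.append(-0.5 * np.log(np.linalg.det(S)) - 0.5 * quad + np.log(pi))
    return np.column_stack(scores)

scikit-learn's LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis, used in the appendix, carry out the equivalent plug-in estimation internally.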
(d) Results
The table below compares the training and testing errors of the different classification methods.
Methods               Training Error   Testing Error
LDA                   0.0839416        0.1610169
QDA                   0.0839416        0.1694915
Naive Bayes           0.098540145      0.1694915
Logistic Regression   0.098540145      0.13559322
From the table, LDA, QDA, and Naive Bayes have similar testing errors. All three are plug-in versions of the Bayes classifier: LDA assumes a common covariance matrix, QDA allows a class-specific covariance matrix, and Naive Bayes assumes the predictors are conditionally independent. For this dataset these differences matter little. Logistic Regression, however, performs better than the previous methods in terms of testing error. Last but not least, KNN is also an available choice if the right K is chosen. The error rate figure is provided as follows.
(e) Findings
After analyzing the dataset with several classification methods, we found that QDA, LDA, and Naive Bayes give similar results. A possible reason is that the selected features are informative, or simply that the dataset does not contain many data points. Logistic regression performs better than these methods, arguably because they make stronger assumptions: the other three are generative methods that model the class-conditional distributions, while logistic regression only assumes a linear model for the log-odds through a link function. Moreover, KNN is also a good candidate, since its accuracy reaches about 91% on this dataset. Although all of these methods achieve low testing errors here, with a larger sample size or high-dimensional data the results might differ significantly across methods.
0.1 Problem 1
In this problem, you are asked to write a report to summarize your analysis of the popular
“Auto MPG” data set in the literature. Much research has been done to analyze this data
set, and here the objective of our analysis is to predict whether a given car gets high or low
gas mileage based on 7 car attributes such as cylinders, displacement, horsepower, weight,
acceleration, model year and origin.
[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
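The data-loading cell (cell [2]) did not survive the export; a minimal sketch follows, where the filename is an assumption standing in for whatever local copy of the cleaned eight-column data was used:

# load the cleaned Auto MPG data; 'auto-mpg.csv' is an assumed filename
df = pd.read_csv('auto-mpg.csv')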
[3]: df.head()
(df.head() output truncated in the export: only the origin column of the first five rows is visible; all five values are 1)
[4]: df.shape
[4]: (392, 8)
[5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 8 columns):
mpg 392 non-null float64
cylinders 392 non-null int64
displacement 392 non-null float64
horsepower 392 non-null int64
weight 392 non-null int64
acceleration 392 non-null float64
year 392 non-null int64
origin 392 non-null int64
dtypes: float64(3), int64(5)
memory usage: 24.6 KB
Dataset Description
• mpg: Miles per gallon (Continuous variable)
• cylinders: Number of cylinders in the engine (Categorical variable)
• displacement: Combined swept volume of the pistons inside the cylinders of an engine (Continuous variable)
• horsepower: Power an engine produces (Continuous variable)
• weight: Mass of a vehicle (Continuous variable)
• acceleration: Rate of change of the velocity of a car (Continuous variable)
• year: Model year of the car (Discrete variable)
• origin: 1 is a car made in America, 2 in Europe, and 3 in Asia or other parts of the world (Categorical variable)
[6]: # the median of mpg is 22.75; set mpg01 = 1 if mpg is above the median, else 0
med = df['mpg'].median()
df_new = df.copy()
df_new['mpg01'] = (df_new['mpg'] > med).astype(int)
df_new.head()
(df_new.head() output truncated in the export: only the origin and mpg01 columns are visible; the first five rows all have origin 1 and mpg01 0)
# box plots of the seven features against mpg01 on a 3x3 grid
# (subplot grid reconstructed; figsize is an assumption)
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
cols = df_new.columns
row = 0
count = 0
for i in range(7):
    sns.boxplot(x="mpg01", y=cols[i + 1], data=df_new,
                palette="Set3", ax=axes[row, i % 3])
    count += 1
    if count == 3:
        count = 0
        row += 1
# violin plots of the same seven features against mpg01
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
cols = df_new.columns
row = 0
count = 0
for i in range(7):
    sns.violinplot(x="mpg01", y=cols[i + 1], data=df_new,
                   palette="Set1", ax=axes[row, i % 3])
    count += 1
    if count == 3:
        count = 0
        row += 1
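The train/test split cell did not survive the export. A minimal sketch, assuming the four predictors selected in part (b): test_size=0.3 reproduces the 274 training / 118 testing examples implied by the reported error rates and confusion matrices, but the random seed is an assumption.

from sklearn.model_selection import train_test_split

# the four predictors selected in the exploratory analysis
X = df_new[['weight', 'displacement', 'horsepower', 'acceleration']]
y = df_new['mpg01']

# 392 rows -> 274 train / 118 test; random_state is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)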
LDA
[59]: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# fit LDA on the training set and predict the test labels
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
y_pred_lda = clf.predict(X_test)
QDA
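The QDA cell was lost in the export; a minimal sketch mirroring the LDA cell above, with the error-rate computation that, applied in the same way to each method, produces the entries of the results table:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
y_pred_qda = qda.predict(X_test)

# training and testing error rates, as reported in part (d)
train_err_qda = np.mean(qda.predict(X_train) != y_train)
test_err_qda = np.mean(y_pred_qda != y_test)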
Naive Bayes
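The Naive Bayes cell was also lost; a sketch assuming the Gaussian variant (scikit-learn's GaussianNB), the usual choice when all predictors are continuous:

from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: each class-conditional density f_kj is a univariate normal
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)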
Logistic Regression
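Likewise for logistic regression; a minimal sketch using scikit-learn defaults (the solver settings originally used are not recoverable from the export):

from sklearn.linear_model import LogisticRegression

# logistic regression on the same four predictors
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)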
KNN
[63]: from sklearn.preprocessing import StandardScaler

# standardize the training predictors so that KNN's distance metric
# is not dominated by large-scale features such as weight
scaler = StandardScaler()
scaler.fit(X_train)
scaled_features = scaler.transform(X_train)
df_feat = pd.DataFrame(scaled_features, columns=X_train.columns)
df_feat.head()
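The cell that builds the scaled test features was lost; a sketch that reuses the scaler fitted on the training set, so no test-set information leaks into the scaling:

# apply the training-set scaler to the test predictors
df_feat_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)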
df_feat_test.head()
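The cell that fits the first KNN model (producing pred_knn and the confusion matrix below) is also missing; a sketch in which the choice n_neighbors=5 is an assumption, since the K used for this first fit is not recoverable:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# initial KNN fit; n_neighbors=5 is an assumption
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(df_feat, y_train)
pred_knn = knn.predict(df_feat_test)
print(confusion_matrix(y_test, pred_knn))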
[[51 14]
[ 0 53]]
[88]: print(classification_report(y_test,pred_knn))
[93]: # try K = 1, ..., 29 and record the testing error for each
error_rate = []
for i in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(df_feat, y_train)
    pred_i = knn.predict(df_feat_test)
    error_rate.append(np.mean(pred_i != y_test))

# plot testing error against K
plt.figure(figsize=(10, 6))
plt.plot(range(1, 30), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
# refit with K = 1
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(df_feat, y_train)
pred_k1 = knn1.predict(df_feat_test)
print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred_k1))
print('\n')
print(classification_report(y_test,pred_k1))
WITH K=1
[[49 16]
[ 1 52]]
# refit with K = 2
knn_2 = KNeighborsClassifier(n_neighbors=2)
knn_2.fit(df_feat, y_train)
pred_2 = knn_2.predict(df_feat_test)
print('WITH K=2')
print('\n')
print(confusion_matrix(y_test,pred_2))
print('\n')
print(classification_report(y_test,pred_2))
WITH K=2
[[54 11]
[ 0 53]]