Pattern and Classification
Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables from which the class is predicted.
How does classification work?
Suppose we have to predict whether a given patient has a certain disease, on the basis of three variables, called features. This means there are two possible outcomes (a minimal code sketch follows the list below):
1. The patient has the disease: a result labeled “Yes” or “True”.
2. The patient is disease-free: a result labeled “No” or “False”.
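As a minimal sketch of such a binary classifier: the patients, feature values, and the use of logistic regression here are all hypothetical, purely for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: each row is one patient, described by three features
# (say, age, resting blood pressure, cholesterol -- all made up here).
X = np.array([[63, 145, 233],
              [37, 130, 250],
              [41, 130, 204],
              [56, 120, 236]])
# Labels: 1 = "Yes" (has the disease), 0 = "No" (disease-free).
y = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[50, 135, 240]]))  # outputs 1 ("Yes") or 0 ("No")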
The fruits dataset was created by Dr. Iain Murray from the University of Edinburgh. He bought a few dozen oranges, lemons and apples of different varieties, and recorded their measurements in a table. Professors at the University of Michigan then formatted the fruits data slightly; it can be downloaded from here.
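A minimal sketch of loading the data with pandas; the file name fruit_data_with_colors.txt is an assumption about the downloaded file:

import pandas as pd

# Assumed file name for the tab-separated fruits data.
fruits = pd.read_table('fruit_data_with_colors.txt')
print(fruits.head())   # first rows, as shown in Figure 1
print(fruits.shape)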
Figure 1: the first rows of the fruits dataframe.
The shape of the dataframe is (59, 7): 59 fruits, each described by 7 columns.
Figure 2
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='fruit_name', data=fruits, label="Count")
plt.show()
Figure 3
Visualization
A box plot for each numeric variable will give us a clearer idea of the distribution of the input variables:
fruits.drop('fruit_label', axis=1).plot(kind='box', subplots=True,
                                        layout=(2, 2), sharex=False,
                                        sharey=False, figsize=(9, 9),
                                        title='Box Plot for each input variable')
plt.savefig('fruits_box')
plt.show()
Figure 4
Statistical Summary
Figure 7: statistical summary of the input variables.
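The summary shown in Figure 7 can presumably be reproduced with pandas' describe(); a minimal sketch:

# Count, mean, standard deviation, min/max and quartiles
# for each numeric column.
print(fruits.describe())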
We can see that the numerical values do not have the same scale. We will need to scale the features: fit the scaler on the training set, then apply the same transformation to the test set, as sketched below.
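A minimal sketch of that split-and-scale step; the choice of feature columns, MinMaxScaler, and random_state=0 are assumptions rather than the post's exact choices:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumed feature columns and target.
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training set only, then apply the same
# transformation to the test set, to avoid leaking test information.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)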
Build Models
Logistic Regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))

Accuracy of Logistic regression classifier on training set: 0.70
Accuracy of Logistic regression classifier on test set: 0.40
Decision Tree
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print('Accuracy of K-NN classifier on training set: {:.2f}'
      .format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
      .format(knn.score(X_test, y_test)))
The K-NN algorithm was the most accurate model that we tried. The confusion matrix shows that no errors were made on the test set; however, the test set was very small.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
Figure 7: confusion matrix for the K-NN classifier.
Figure 8: classification report for the K-NN classifier.
k_range = range(1, 20)
scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0, 5, 10, 15, 20])
Figure 9: accuracy of the K-NN classifier on the test set for each value of k.
For this particular dataset, we obtain the highest accuracy when k=5.
Summary
The source code used for this post can be found here. I would be pleased to receive feedback or questions on any of the above.