Collecting Data
Image processing is a difficult task for many types of machine learning algorithms. The
relationships linking patterns of pixels to higher concepts are extremely complex and hard to
define. For instance, it’s easy for a human being to recognize a face, a cat, or the letter “A”, but
defining these patterns in strict rules is difficult. Furthermore, image data is often noisy. There
can be many slight variations in how the image was captured, depending on the lighting,
orientation, and positioning of the subject.
The attributes measure such characteristics as the horizontal and vertical dimensions of the
glyph, the proportion of black (versus white) pixels, and the average horizontal and vertical
position of the pixels.
Presumably, differences in the concentration of black pixels across various areas of the box
should provide a way to differentiate among the 26 letters of the alphabet.
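One hedged way to probe this intuition is to compare per-letter averages of the attributes. The sketch below uses a tiny made-up frame (`toy`, with invented values) in place of the real data; only the column names `letter`, `onpix`, and `xbar` are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the letters frame; the values are invented for illustration.
toy = pd.DataFrame({
    "letter": ["I", "I", "M", "M", "O", "O"],
    "onpix":  [1, 2, 8, 9, 5, 6],
    "xbar":   [7, 8, 7, 8, 7, 7],
})

# Mean of each attribute per letter; well-separated means hint at a useful feature.
per_letter = toy.groupby("letter").mean()
print(per_letter)

# Range of the per-letter means: larger values suggest a more discriminating attribute.
spread = per_letter.max() - per_letter.min()
print(spread.sort_values(ascending=False))
```

On the real data, the same two calls could be applied to `letters` directly.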
import pandas as pd
letters = pd.read_csv("letterdata.csv")
# Column types: one label column followed by 16 integer attributes
print(letters.dtypes)
## letter object
## xbox int64
## ybox int64
## width int64
## height int64
## onpix int64
## xbar int64
## ybar int64
## x2bar int64
## y2bar int64
## xybar int64
## x2ybar int64
## xy2bar int64
## xedge int64
## xedgey int64
## yedge int64
## yedgex int64
## dtype: object
print(letters.shape)
## (20000, 17)
# Number of examples per letter
print(letters["letter"].value_counts())
## U 813
## D 805
## P 803
## T 796
## M 792
## A 789
## X 787
## Y 786
## N 783
## Q 783
## F 775
## G 773
## E 768
## B 766
## V 764
## L 761
## R 758
## I 755
## O 753
## W 752
## S 748
## J 747
## K 739
## C 736
## H 734
## Z 734
## Name: letter, dtype: int64
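import numpy as np
from sklearn.feature_selection import VarianceThreshold
# Flag features whose variance is zero (constant columns)
vt = VarianceThreshold(threshold=0)
print(vt.fit_transform(letters.iloc[:, 1:]).shape)
## (20000, 16)
# Number of features removed
print(np.sum(~vt.get_support()))
## 0
# Indices of the features kept
print(vt.get_support(indices=True))
## [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]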
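# Pairwise correlations among the 16 features (label column excluded)
cor = letters.iloc[:, 1:].corr().to_numpy(copy=True)
np.fill_diagonal(cor, 0)  # ignore each feature's correlation with itself
# Index pairs whose absolute correlation exceeds 0.8
threTF = np.abs(cor) > 0.8
print(np.argwhere(threTF))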
## [[0 2]
## [1 3]
## [2 0]
## [3 1]]
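Given the column order listed earlier (with the letter column dropped), indices 0 and 2 correspond to xbox and width, and indices 1 and 3 to ybox and height. A minimal sketch of reporting such pairs by name, once each, is given below; the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two strongly related columns; the values are invented.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "xbox": a,
    "ybox": rng.normal(size=200),
    "width": a + rng.normal(scale=0.1, size=200),
})

cor = toy.corr().to_numpy(copy=True)
# Keep only the upper triangle so each pair is reported once.
upper = np.triu(np.abs(cor), k=1)
pairs = [(toy.columns[i], toy.columns[j]) for i, j in np.argwhere(upper > 0.8)]
print(pairs)
```

Using the upper triangle avoids the mirrored entries ([0 2] and [2 0]) that a symmetric correlation matrix otherwise produces.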
Recall that SVM learners require all features to be numeric and, moreover, that each feature be
scaled to a fairly small interval. In this case, every feature is an integer, so we do not need to
convert any categorical variables into numbers. On the other hand, some of the ranges for these
integer variables appear fairly wide, which indicates that we need to normalize or standardize
the data.
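The difference between the two rescaling options can be sketched on a small array. The values in `X` are invented; `StandardScaler` and `MinMaxScaler` are the scikit-learn transformers for standardization and min-max normalization respectively.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy integer features on different ranges (invented values).
X = np.array([[0, 10], [5, 20], [10, 60], [15, 110]], dtype=float)

# Standardization: each column gets mean 0 and unit variance.
X_std = StandardScaler().fit_transform(X)
# Min-max normalization: each column is mapped onto [0, 1].
X_mm = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_mm.min(axis=0), X_mm.max(axis=0))
```

Either choice puts the features on a comparable scale, which matters for SVMs because the kernel is computed from distances or dot products between feature vectors.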
Given that the data preparation has been largely done for us, we can move directly to the
training and testing phases of the machine learning process. In the previous analyses, we
randomly divided the data between the training and testing sets.
Although we could do so here, Frey and Slate have already randomized the data, and therefore
suggest using the first 16,000 records (80 percent) to build the model and the next 4,000
records (20 percent) to test. Following their advice, we can create training and testing data
frames as follows:
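A minimal sketch of such a sequential split follows. The helper name `sequential_split` is hypothetical, and the frame `toy` is an invented stand-in with the same layout as the letter data (label column first, then features).

```python
import numpy as np
import pandas as pd

def sequential_split(df, n_train, target="letter"):
    """Use the first n_train rows for training and the remaining rows for testing."""
    train, test = df.iloc[:n_train], df.iloc[n_train:]
    return (train.drop(columns=target), train[target],
            test.drop(columns=target), test[target])

# Toy stand-in with the same layout as the letter data.
toy = pd.DataFrame({
    "letter": list("ABCDEFGHIJ"),
    "xbox": np.arange(10),
    "ybox": np.arange(10, 20),
})
X_train, y_train, X_test, y_test = sequential_split(toy, n_train=8)
print(X_train.shape, X_test.shape)
```

On the real data the call would presumably be `sequential_split(letters, 16000)`. If a random split is preferred instead, scikit-learn's `train_test_split` is the usual alternative.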
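# Standardize the features, fitting the scaler on the training set only
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
# Assumed: the source omits the model definition; a linear kernel is used as a first baseline
svm = SVC(kernel="linear").fit(X_train_std, y_train)
tr_pred = svm.predict(X_train_std)
y_pred = svm.predict(X_test_std)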
## 17815 I
## 18370 M
## 1379 Z
## 14763 D
## 7346 L
## Name: letter, dtype: object
It can be challenging, however, to choose from the many different kernel functions. A popular
convention is to begin with the Gaussian RBF kernel, which has been shown to perform well
for many types of data.
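For reference, the Gaussian RBF kernel between two feature vectors is exp(-gamma * ||x - y||^2). The sketch below computes it directly with NumPy and cross-checks the result against scikit-learn's `rbf_kernel`; the helper name `gaussian_rbf`, the sample points, and the value of `gamma` are chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_rbf(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_rbf(X, X, gamma=0.5)
print(np.round(K, 4))
# Cross-check against scikit-learn's implementation.
print(np.allclose(K, rbf_kernel(X, X, gamma=0.5)))
```

In scikit-learn, `SVC(kernel="rbf")` uses this kernel, with `gamma` controlling how quickly similarity decays with distance.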
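from sklearn.svm import SVC
svm = SVC(kernel="rbf").fit(X_train_std, y_train)  # assumed: the source does not show the RBF model being defined
tr_pred = svm.predict(X_train_std)
y_pred = svm.predict(X_test_std)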
## 17815 I
## 18370 M
## 1379 Z
## 14763 D
## 7346 L
## Name: letter, dtype: object
The confusion matrix and the statistics that follow require the pandas_ml package:
## Predicted    A    B    C    D    E    F    G    H    I    J    K    L
## Actual
## A          147    0    0    0    0    0    0    0    0    0    0    0
## B            0  153    0    0    0    0    0    0    0    0    0    0
## C            0    0  152    0    0    0    3    0    0    0    0    0
## D            1    1    0  166    0    0    0    2    0    0    0    0
## E            0    1    0    0  141    0    1    0    0    0    0    1
## F            0    1    0    1    0  163    0    0    0    0    0    0
## G            0    1    0    2    0    0  175    0    0    0    0    1
## H            0    1    0    2    0    0    0  111    0    0    2    0
## I            0    0    0    0    0    1    0    0  118    8    0    0
## J            0    0    0    0    0    1    0    0    1  156    0    0
## K            0    0    0    0    0    0    0    3    0    0  136    0
## L            0    0    1    0    1    0    0    0    0    0    1  156
perf_indx = cm.stats()
## FutureWarning: supplying multiple axes to axis is deprecated and will be removed in a future version.
##   num = df[df > 1].dropna(axis=[0, 1], thresh=1).applymap(lambda n: choose(n, 2)).sum().sum() - np.float64(nis2 * njs2) / n2
## /opt/anaconda3/lib/python3.7/site-packages/pandas_ml/confusion_matrix/bcm.p
##   return(np.float64(self.LRP) / self.LRN)
## /opt/anaconda3/lib/python3.7/site-packages/pandas_ml/confusion_matrix/bcm.p
##   return(np.float64(self.TPR) / self.FPR)
## <class 'collections.OrderedDict'>
# perf_indx['overall'].keys()
## <class 'collections.OrderedDict'>
print(" acc:{}".format(perf_indx['overall']
## acc:0.9725
print(" acc95%:\n{}".format(perf_indx['overall']['95% CI']))
## acc95%:
## (0.9669490685534711, 0.9773453558266993)
## Kappa:
## 0.9713890027910028
## <class 'pandas.core.frame.DataFrame'>
## (26, 26)
## Classes A ... Z
## Population 4000 ... 4000
## P: Condition positive 147 ... 143
## N: Condition negative 3853 ... 3857
## Test outcome positive 148 ... 142
## Test outcome negative 3852 ... 3858
## TP: True Positive 147 ... 142
## TN: True Negative 3852 ... 3857
## FP: False Positive 1 ... 0
## FN: False Negative 0 ... 1
## TPR: (Sensitivity, hit rate, recall) 1 ... 0.993007
## TNR=SPC: (Specificity) 0.99974 ... 1
## PPV: Pos Pred Value (Precision) 0.993243 ... 1
## NPV: Neg Pred Value 1 ... 0.999741
## FPR: False-out 0.000259538 ... 0
## FDR: False Discovery Rate 0.00675676 ... 0
## FNR: Miss Rate 0 ... 0.00699301
## ACC: Accuracy 0.99975 ... 0.99975
## F1 score 0.99661 ... 0.996491
## MCC: Matthews correlation coefficient 0.996487 ... 0.996368
## Informedness 0.99974 ... 0.993007
## Markedness 0.993243 ... 0.999741
## Prevalence 0.03675 ... 0.03575
## LR+: Positive likelihood ratio 3853 ... inf
## LR-: Negative likelihood ratio 0 ... 0.00699301
## DOR: Diagnostic odds ratio inf ... inf
## FOR: False omission rate 0 ... 0.000259202
## [26 rows x 26 columns]
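Should pandas_ml be unavailable in a given environment, the headline figures (confusion matrix, accuracy, and kappa) can also be obtained from scikit-learn alone. The sketch below is one such alternative, not the method used above; the label lists `y_true` and `y_hat` are invented stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Toy labels (invented) standing in for y_test and y_pred.
y_true = ["A", "A", "B", "B", "C", "C"]
y_hat = ["A", "A", "B", "C", "C", "C"]

labels = ["A", "B", "C"]
# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)
acc = accuracy_score(y_true, y_hat)
kappa = cohen_kappa_score(y_true, y_hat)
print(cm)
print(acc, kappa)
```

Per-class precision, recall, and F1 scores are available from `sklearn.metrics.classification_report` if a breakdown similar to the class statistics table is needed.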
