Statistical Machine Learning with Python Week #3
Ching-Shih Tsou (Ph.D.), Distinguished Prof. at the Department of
Mechanical Engineering / Director of the Center for Artificial Intelligence &
Data Science, Ming Chi University of Technology
September/2020 at MCUT

Collecting Data
Image processing is a difficult task for many types of machine learning algorithms. The
relationships linking patterns of pixels to higher concepts are extremely complex and hard to
define. For instance, it’s easy for a human being to recognize a face, a cat, or the letter “A”, but
defining these patterns in strict rules is difficult. Furthermore, image data is often noisy. There
can be many slight variations in how the image was captured, depending on the lighting,
orientation, and positioning of the subject.

According to the documentation provided by Frey and Slate (1991) (http://archive.ics.uci.edu/ml), when the glyphs are scanned into the computer, they are converted into pixels and 16 statistical attributes are recorded.

The attributes measure such characteristics as the horizontal and vertical dimensions of the
glyph, the proportion of black (versus white) pixels, and the average horizontal and vertical
position of the pixels.

Presumably, differences in the concentration of black pixels across various areas of the box
should provide a way to differentiate among the 26 letters of the alphabet.

import pandas as pd
letters = pd.read_csv("letterdata.csv")
print(letters.dtypes)

## letter object
## xbox int64
## ybox int64
## width int64
## height int64
## onpix int64
## xbar int64
## ybar int64
## x2bar int64
## y2bar int64
## xybar int64
## x2ybar int64
## xy2bar int64
## xedge int64
## xedgey int64
## yedge int64
## yedgex int64
## dtype: object

print(letters.shape)

## (20000, 17)

Exploring and Preparing the Data


print(letters.describe(include = 'all'))

## letter xbox ... yedge yedgex


## count 20000 20000.000000 ... 20000.000000 20000.00000
## unique 26 NaN ... NaN NaN
## top U NaN ... NaN NaN
## freq 813 NaN ... NaN NaN
## mean NaN 4.023550 ... 3.691750 7.80120
## std NaN 1.913212 ... 2.567073 1.61747
## min NaN 0.000000 ... 0.000000 0.00000
## 25% NaN 3.000000 ... 2.000000 7.00000
## 50% NaN 4.000000 ... 3.000000 8.00000
## 75% NaN 5.000000 ... 5.000000 9.00000
## max NaN 15.000000 ... 15.000000 15.00000
##
## [11 rows x 17 columns]

print(letters['letter'].value_counts())

## U 813
## D 805
## P 803
## T 796
## M 792
## A 789
## X 787
## Y 786
## N 783
## Q 783
## F 775
## G 773
## E 768
## B 766
## V 764
## L 761
## R 758
## I 755
## O 753
## W 752
## S 748
## J 747
## K 739
## C 736
## H 734
## Z 734
## Name: letter, dtype: int64

import numpy as np
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=0) # remove features whose variance is zero
print(vt.fit_transform(letters.iloc[:,1:]).shape)

## (20000, 16)

print(np.sum(vt.get_support() == False)) # Count the features that were not selected (i.e., dropped)

## 0

print(vt.get_support(indices=True)) # Integer indices of the selected features

## [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]

cor = letters.iloc[:,1:].corr().values
print(cor[:5,:5])

## [[1. 0.7577928 0.851514 0.67276367 0.61909688]


## [0.7577928 1. 0.67191188 0.82320706 0.55506655]
## [0.851514 0.67191188 1. 0.66021536 0.76571612]
## [0.67276367 0.82320706 0.66021536 1. 0.64436627]
## [0.61909688 0.55506655 0.76571612 0.64436627 1. ]]

import numpy as np
np.fill_diagonal(cor, 0)
threTF = abs(cor) > 0.8
print(threTF[:5,:5])

## [[False False True False False]


## [False False False True False]
## [ True False False False False]
## [False True False False False]
## [False False False False False]]

print(np.argwhere(threTF == True))

## [[0 2]
## [1 3]
## [2 0]
## [3 1]]

print(letters.columns[1:5])

## Index(['xbox', 'ybox', 'width', 'height'], dtype='object')
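
The row and column indices returned by np.argwhere refer to positions in the feature matrix, i.e. letters.columns[1:]. A minimal sketch, using the objects defined above, to translate the highly correlated pairs back into column names:

feat_names = letters.columns[1:]
# keep only i < j so each correlated pair is listed once
pairs = [(feat_names[i], feat_names[j]) for i, j in np.argwhere(threTF) if i < j]
print(pairs)  # given the output above: [('xbox', 'width'), ('ybox', 'height')]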

import matplotlib.pyplot as plt


ax1 = letters[['xbox', 'letter']].boxplot(by = 'letter')
fig1 = ax1.get_figure()
plt.show()
# fig1.savefig('./_img/xbox_boxplot.png')

ax2 = letters[['ybar', 'letter']].boxplot(by = 'letter')


fig2 = ax2.get_figure()
plt.show()
# fig2.savefig('./_img/ybar_boxplot.png')

Recall that SVM learners require all features to be numeric and, moreover, that each feature be scaled to a fairly small interval. Here every feature is an integer, so we do not need to convert any factors into numbers. On the other hand, the ranges of some of these integer variables are fairly wide, which indicates that we need to normalize or standardize the data.

Given that the data preparation has been largely done for us, we can move directly to the
training and testing phases of the machine learning process. In the previous analyses, we
randomly divided the data between the training and testing sets.

Frey and Slate have already randomized the data, and therefore suggest using the first 16,000
records (80 percent) to build the model and the next 4,000 records (20 percent) to test. The
code below instead draws a random 80/20 split with train_test_split; a sketch of the sequential
split appears after the scaling step.

# create training and testing data


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(


letters.iloc[:, 1:], letters['letter'], test_size=0.2,
random_state=0)

# StandardScaler
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

## StandardScaler(copy=True, with_mean=True, with_std=True)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
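
For reference, a minimal sketch of Frey and Slate's sequential split (first 16,000 rows for training, remaining 4,000 for testing); this is not the split used above, so the error rates reported below would differ slightly:

# Sequential 16,000/4,000 split suggested by Frey and Slate (1991);
# the rows in letterdata.csv are assumed to be pre-randomized.
X_train_seq = letters.iloc[:16000, 1:]
y_train_seq = letters['letter'][:16000]
X_test_seq = letters.iloc[16000:, 1:]
y_test_seq = letters['letter'][16000:]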

Training a Model on the Data


# SVC: Support Vector Classification
# SVR: Support Vector Regression
# OneClassSVM: Outlier Detection
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(X_train_std, y_train)

## SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


## decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
## kernel='linear', max_iter=-1, probability=False, random_state=None,
## shrinking=True, tol=0.001, verbose=False)

tr_pred = svm.predict(X_train_std)
y_pred = svm.predict(X_test_std)
print(tr_pred[:5])

## ['I' 'M' 'Z' 'D' 'G']

print(y_train[:5])

## 17815 I
## 18370 M
## 1379 Z
## 14763 D
## 7346 L
## Name: letter, dtype: object

print(y_pred[:5])

## ['Y' 'B' 'K' 'T' 'Q']

print(y_test[:5].tolist())

## ['Y', 'B', 'K', 'Y', 'Q']

err_tr = (y_train != tr_pred).sum()/len(y_train)


print("train set error: %f" % err_tr)

## train set error: 0.133250

err = (y_test != y_pred).sum()/len(y_test)


print("test set error: %f" % err)

## test set error: 0.134500

Improving Model Performance


Our previous SVM model used the simple linear kernel function. By using a more complex
kernel function, we can map the data into a higher-dimensional space and potentially obtain a
better model fit.

It can be challenging, however, to choose from the many different kernel functions. A popular
convention is to begin with the Gaussian RBF kernel, which has been shown to perform well
for many types of data.

svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0)


svm.fit(X_train_std, y_train)

## SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
##     decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf',
##     max_iter=-1, probability=False, random_state=0, shrinking=True,
##     tol=0.001, verbose=False)

tr_pred = svm.predict(X_train_std)
y_pred = svm.predict(X_test_std)
print(tr_pred[:5])

## ['I' 'M' 'Z' 'D' 'L']

print(y_train[:5])

## 17815 I
## 18370 M
## 1379 Z
## 14763 D
## 7346 L
## Name: letter, dtype: object

print(y_pred[:5])

## ['Y' 'B' 'K' 'X' 'Q']

print(y_test[:5].tolist())

## ['Y', 'B', 'K', 'Y', 'Q']

err_tr = (y_train.values != tr_pred).sum()/len(y_train)


print("train set error: %f" % err_tr)

## train set error: 0.011750

err = (y_test != y_pred).sum()/len(y_test)


print("test set error: %f" % err)

## test set error: 0.027500
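
The settings gamma=0.2 and C=1.0 above were fixed by hand. A minimal sketch of how the kernel and its hyperparameters could be chosen with GridSearchCV; the parameter grid below is illustrative, not from the original analysis, and a full search over 16,000 training rows can take a while:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# illustrative grid: compare a linear kernel against RBF kernels with several C/gamma values
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1.0, 10.0]},
    {'kernel': ['rbf'], 'C': [0.1, 1.0, 10.0], 'gamma': [0.05, 0.2, 1.0]},
]
gs = GridSearchCV(SVC(random_state=0), param_grid, cv=3, n_jobs=-1)
gs.fit(X_train_std, y_train)
print(gs.best_params_)
print(gs.best_score_)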

To examine the per-class results, we build a confusion matrix with pandas_ml, which requires specific (older) package versions:

pip install scikit-learn==0.21.1
pip install pandas==0.24.2
pip install pandas_ml
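
If those pinned versions are unavailable (pandas_ml does not work with newer pandas/scikit-learn), a comparable confusion matrix and per-class summary can be produced with scikit-learn alone; a minimal sketch using the predictions computed above:

from sklearn.metrics import confusion_matrix, classification_report

labels = sorted(y_test.unique())  # the 26 letters A-Z
cm_sk = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=labels),
                     index=labels, columns=labels)
print(cm_sk.iloc[:12, :12])       # same layout as the pandas_ml table below
print(classification_report(y_test, y_pred))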

import pandas_ml as pdml


cm = pdml.ConfusionMatrix(y_test.values, y_pred)
cm_df = cm.to_dataframe(normalized=False, calc_sum=True,
sum_label='all')
print(cm_df.iloc[:12, :12])

## Predicted A B C D E F G H I J K L
## Actual
## A 147 0 0 0 0 0 0 0 0 0 0 0
## B 0 153 0 0 0 0 0 0 0 0 0 0
## C 0 0 152 0 0 0 3 0 0 0 0 0
## D 1 1 0 166 0 0 0 2 0 0 0 0
## E 0 1 0 0 141 0 1 0 0 0 0 1
## F 0 1 0 1 0 163 0 0 0 0 0 0
## G 0 1 0 2 0 0 175 0 0 0 0 1
## H 0 1 0 2 0 0 0 111 0 0 2 0
## I 0 0 0 0 0 1 0 0 118 8 0 0
## J 0 0 0 0 0 1 0 0 1 156 0 0
## K 0 0 0 0 0 0 0 3 0 0 136 0
## L 0 0 1 0 1 0 0 0 0 0 1 156

perf_indx = cm.stats()

## /opt/anaconda3/lib/python3.7/site-packages/pandas_ml/confusion_matrix/stats.py:60: FutureWarning: supplying multiple axes to axis is deprecated and will be removed in a future version.
##   num = df[df > 1].dropna(axis=[0, 1], thresh=1).applymap(lambda n: choose(n, 2)).sum().sum() - np.float64(nis2 * njs2) / n2
## /opt/anaconda3/lib/python3.7/site-packages/pandas_ml/confusion_matrix/bcm.py:344: RuntimeWarning: divide by zero encountered in double_scalars
##   return(np.float64(self.LRP) / self.LRN)
## /opt/anaconda3/lib/python3.7/site-packages/pandas_ml/confusion_matrix/bcm.py:330: RuntimeWarning: divide by zero encountered in double_scalars
##   return(np.float64(self.TPR) / self.FPR)

print(type(perf_indx))

## <class 'collections.OrderedDict'>

print(perf_indx.keys())

## odict_keys(['cm', 'overall', 'class'])

print(type(perf_indx['overall']))
# perf_indx['overall'].keys()

## <class 'collections.OrderedDict'>

print(" acc:{}".format(perf_indx['overall']
['Accuracy']))

## acc:0.9725

print(" acc95%:\n{}".format(perf_indx
['overall']['95% CI']))

## acc95%:
## (0.9669490685534711, 0.9773453558266993)
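
That interval is presumably an exact binomial (Clopper-Pearson) confidence interval for the test-set accuracy; a minimal sketch with statsmodels that should reproduce something close to it:

from statsmodels.stats.proportion import proportion_confint

n_correct = int((y_test.values == y_pred).sum())
# method='beta' gives the Clopper-Pearson (exact) binomial interval
print(proportion_confint(n_correct, len(y_test), alpha=0.05, method='beta'))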

print("Kappa:\n{}".format(perf_indx['overall']
['Kappa']))

## Kappa:
## 0.9713890027910028
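
The kappa statistic can be cross-checked with scikit-learn; a one-line sketch:

from sklearn.metrics import cohen_kappa_score
print(cohen_kappa_score(y_test, y_pred))  # should agree with the value above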

print(type(perf_indx['class']))

## <class 'pandas.core.frame.DataFrame'>

print(perf_indx['class'].shape)

## (26, 26)

print(perf_indx['class'])

## Classes A ... Z
## Population 4000 ... 4000
## P: Condition positive 147 ... 143
## N: Condition negative 3853 ... 3857
## Test outcome positive 148 ... 142
## Test outcome negative 3852 ... 3858
## TP: True Positive 147 ... 142
## TN: True Negative 3852 ... 3857
## FP: False Positive 1 ... 0
## FN: False Negative 0 ... 1
## TPR: (Sensitivity, hit rate, recall) 1 ... 0.993007
## TNR=SPC: (Specificity) 0.99974 ... 1
## PPV: Pos Pred Value (Precision) 0.993243 ... 1
## NPV: Neg Pred Value 1 ... 0.999741
## FPR: False-out 0.000259538 ... 0
## FDR: False Discovery Rate 0.00675676 ... 0
## FNR: Miss Rate 0 ... 0.00699301
## ACC: Accuracy 0.99975 ... 0.99975
## F1 score 0.99661 ... 0.996491
## MCC: Matthews correlation coefficient 0.996487 ... 0.996368
## Informedness 0.99974 ... 0.993007
## Markedness 0.993243 ... 0.999741
## Prevalence 0.03675 ... 0.03575
## LR+: Positive likelihood ratio 3853 ... inf
## LR-: Negative likelihood ratio 0 ... 0.00699301
## DOR: Diagnostic odds ratio inf ... inf
## FOR: False omission rate 0 ... 0.000259202
##
## [26 rows x 26 columns]

import matplotlib.pyplot as plt


ax = cm.plot()
fig = ax.get_figure()
plt.show()
# fig.savefig('./_img/svc_rbf.png')
