Support Vector Machines with Python
We can grab information and arrays out of this dictionary to set up our dataframe and understand
the features:
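The cell that loads the dataset did not survive the conversion; a minimal sketch of the assumed setup (the variable name cancer is inferred from how it is used below):

    import pandas as pd
    from sklearn.datasets import load_breast_cancer

    # Bunch object: behaves like a dictionary, with keys such as
    # 'data', 'target', 'feature_names' and 'DESCR'.
    cancer = load_breast_cancer()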
[4]: print(cancer['DESCR'])
Notes
-----
Data Set Characteristics:
:Number of Instances: 569
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ======= ========
                                          Min      Max
===================================== ======= ========
perimeter (worst):                      50.41    251.2
area (worst):                           185.2   4254.0
smoothness (worst):                     0.071    0.223
compactness (worst):                    0.027    1.058
concavity (worst):                        0.0    1.252
concave points (worst):                   0.0    0.291
symmetry (worst):                       0.156    0.664
fractal dimension (worst):              0.055    0.208
===================================== ======= ========
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
References
----------
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
  for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
  Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
  San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
  prognosis via linear programming. Operations Research, 43(4), pages 570-577,
  July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
  to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77
  (1994) 163-171.
[56]: df_feat = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
      df_feat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius 569 non-null float64
mean texture 569 non-null float64
mean perimeter 569 non-null float64
mean area 569 non-null float64
mean smoothness 569 non-null float64
mean compactness 569 non-null float64
mean concavity 569 non-null float64
mean concave points 569 non-null float64
mean symmetry 569 non-null float64
mean fractal dimension 569 non-null float64
radius error 569 non-null float64
texture error 569 non-null float64
perimeter error 569 non-null float64
area error 569 non-null float64
smoothness error 569 non-null float64
compactness error 569 non-null float64
concavity error 569 non-null float64
concave points error 569 non-null float64
symmetry error 569 non-null float64
fractal dimension error 569 non-null float64
worst radius 569 non-null float64
worst texture 569 non-null float64
worst perimeter 569 non-null float64
worst area 569 non-null float64
worst smoothness 569 non-null float64
worst compactness 569 non-null float64
worst concavity 569 non-null float64
worst concave points 569 non-null float64
worst symmetry 569 non-null float64
worst fractal dimension 569 non-null float64
dtypes: float64(30)
memory usage: 133.4 KB
[14]: cancer['target']
[14]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
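In this dataset 0 encodes malignant and 1 benign, matching the WDBC-Malignant / WDBC-Benign classes listed in DESCR; a quick check (not an original cell):

    # Positions in target_names correspond to the integer labels above.
    print(cancer['target_names'])   # ['malignant' 'benign']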
[8]: df_feat.head()
[8]:    mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0 17.99 10.38 122.80 1001.0 0.11840
1 20.57 17.77 132.90 1326.0 0.08474
2 19.69 21.25 130.00 1203.0 0.10960
3 11.42 20.38 77.58 386.1 0.14250
4 20.29 14.34 135.10 1297.0 0.10030
worst fractal dimension
0 0.11890
1 0.08902
2 0.08758
3 0.17300
4 0.07678
[5 rows x 30 columns]
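The cells that split the data and instantiate the classifier were also lost in conversion; a minimal sketch of what the fit call below assumes (test_size=0.30 matches the 171 test samples visible in the confusion matrices; random_state is an assumption):

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report, confusion_matrix

    # 70/30 split over the feature DataFrame and the 0/1 target labels.
    X_train, X_test, y_train, y_test = train_test_split(
        df_feat, cancer['target'], test_size=0.30, random_state=101)

    # Support vector classifier with default parameters (RBF kernel).
    model = SVC()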
[61]: model.fit(X_train,y_train)
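The predictions evaluated below come from the fitted model; the assumed cell:

    predictions = model.predict(X_test)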
[46]: print(confusion_matrix(y_test,predictions))
[[  0  66]
 [  0 105]]
[62]: print(classification_report(y_test,predictions))
/Users/marci/anaconda/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Notice that we are classifying everything into a single class! This means our model needs its
parameters adjusted (it may also help to normalize the data, as sketched below).
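The normalization mentioned above could be done with scikit-learn's StandardScaler; a sketch under that assumption (not an original cell):

    from sklearn.preprocessing import StandardScaler

    # Scale each feature to zero mean and unit variance. Fit only on the
    # training data, then reuse those statistics to transform the test data.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)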
We can search for parameters using a GridSearch!
4 Gridsearch
Finding the right parameters (such as which C or gamma values to use) is a tricky task! But
fortunately, we can be a little lazy, just try a bunch of combinations, and see what works best.
This idea of creating a "grid" of parameters and simply trying out all the possible combinations
is called a grid search; the method is common enough that scikit-learn has this functionality
built in as GridSearchCV!
GridSearchCV takes a dictionary that describes the parameters to be tried and a model to train.
The parameter grid is defined as a dictionary, where the keys are the parameter names and the
values are the settings to be tested.
[63]: param_grid = {'C': [0.1, 1, 10, 100, 1000],
                    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                    'kernel': ['rbf']}
One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator
such as SVC and creates a new estimator that behaves exactly the same way, in this case like a
classifier. You should add refit=True and choose a number for the verbose parameter: the higher
the number, the more detail we get.
[65]: grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
What fit does here is a bit more involved than usual. First, it runs the same loop with
cross-validation to find the best parameter combination. Once it has the best combination, it
runs fit again on all the data (this time without cross-validation) to build a single new model
using the best parameter setting.
[40]: # This may take a while
grid.fit(X_train,y_train)
[CV] … gamma=0.1, C=1, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=1, kernel=rbf …
[CV] … gamma=0.1, C=1, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.01, C=1, kernel=rbf …
[CV] … gamma=0.01, C=1, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=1, kernel=rbf …
[CV] … gamma=0.01, C=1, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=1, kernel=rbf …
[CV] … gamma=0.01, C=1, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.001, C=1, kernel=rbf …
[CV] … gamma=0.001, C=1, kernel=rbf, score=0.902256 - 0.0s
[CV] gamma=0.001, C=1, kernel=rbf …
[CV] … gamma=0.001, C=1, kernel=rbf, score=0.939850 - 0.0s
[CV] gamma=0.001, C=1, kernel=rbf …
[CV] … gamma=0.001, C=1, kernel=rbf, score=0.954545 - 0.0s
[CV] gamma=0.0001, C=1, kernel=rbf …
[CV] … gamma=0.0001, C=1, kernel=rbf, score=0.939850 - 0.0s
[CV] gamma=0.0001, C=1, kernel=rbf …
[CV] … gamma=0.0001, C=1, kernel=rbf, score=0.969925 - 0.0s
[CV] gamma=0.0001, C=1, kernel=rbf …
[CV] … gamma=0.0001, C=1, kernel=rbf, score=0.946970 - 0.0s
[CV] gamma=1, C=10, kernel=rbf …
[CV] … gamma=1, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=10, kernel=rbf …
[CV] … gamma=1, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=10, kernel=rbf …
[CV] … gamma=1, C=10, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.1, C=10, kernel=rbf …
[CV] … gamma=0.1, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=10, kernel=rbf …
[CV] … gamma=0.1, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=10, kernel=rbf …
[CV] … gamma=0.1, C=10, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.01, C=10, kernel=rbf …
[CV] … gamma=0.01, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=10, kernel=rbf …
[CV] … gamma=0.01, C=10, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=10, kernel=rbf …
[CV] … gamma=0.01, C=10, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.001, C=10, kernel=rbf …
[CV] … gamma=0.001, C=10, kernel=rbf, score=0.894737 - 0.0s
[CV] gamma=0.001, C=10, kernel=rbf …
[CV] … gamma=0.001, C=10, kernel=rbf, score=0.932331 - 0.0s
[CV] gamma=0.001, C=10, kernel=rbf …
[CV] … gamma=0.001, C=10, kernel=rbf, score=0.916667 - 0.0s
[CV] gamma=0.0001, C=10, kernel=rbf …
[CV] … gamma=0.0001, C=10, kernel=rbf, score=0.932331 - 0.0s
[CV] gamma=0.0001, C=10, kernel=rbf …
[CV] … gamma=0.0001, C=10, kernel=rbf, score=0.969925 - 0.0s
[CV] gamma=0.0001, C=10, kernel=rbf …
[CV] … gamma=0.0001, C=10, kernel=rbf, score=0.962121 - 0.0s
[CV] gamma=1, C=100, kernel=rbf …
[CV] … gamma=1, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=100, kernel=rbf …
[CV] … gamma=1, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=100, kernel=rbf …
[CV] … gamma=1, C=100, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.1, C=100, kernel=rbf …
[CV] … gamma=0.1, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=100, kernel=rbf …
[CV] … gamma=0.1, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=100, kernel=rbf …
[CV] … gamma=0.1, C=100, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.01, C=100, kernel=rbf …
[CV] … gamma=0.01, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=100, kernel=rbf …
[CV] … gamma=0.01, C=100, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=100, kernel=rbf …
[CV] … gamma=0.01, C=100, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.001, C=100, kernel=rbf …
[CV] … gamma=0.001, C=100, kernel=rbf, score=0.894737 - 0.0s
[CV] gamma=0.001, C=100, kernel=rbf …
[CV] … gamma=0.001, C=100, kernel=rbf, score=0.932331 - 0.0s
[CV] gamma=0.001, C=100, kernel=rbf …
[CV] … gamma=0.001, C=100, kernel=rbf, score=0.916667 - 0.0s
[CV] gamma=0.0001, C=100, kernel=rbf …
[CV] … gamma=0.0001, C=100, kernel=rbf, score=0.917293 - 0.0s
[CV] gamma=0.0001, C=100, kernel=rbf …
[CV] … gamma=0.0001, C=100, kernel=rbf, score=0.977444 - 0.0s
[CV] gamma=0.0001, C=100, kernel=rbf …
[CV] … gamma=0.0001, C=100, kernel=rbf, score=0.939394 - 0.0s
[CV] gamma=1, C=1000, kernel=rbf …
[CV] … gamma=1, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=1000, kernel=rbf …
[CV] … gamma=1, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=1, C=1000, kernel=rbf …
[CV] … gamma=1, C=1000, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.1, C=1000, kernel=rbf …
[CV] … gamma=0.1, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=1000, kernel=rbf …
[CV] … gamma=0.1, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.1, C=1000, kernel=rbf …
[CV] … gamma=0.1, C=1000, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.01, C=1000, kernel=rbf …
[CV] … gamma=0.01, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=1000, kernel=rbf …
[CV] … gamma=0.01, C=1000, kernel=rbf, score=0.631579 - 0.0s
[CV] gamma=0.01, C=1000, kernel=rbf …
[CV] … gamma=0.01, C=1000, kernel=rbf, score=0.636364 - 0.0s
[CV] gamma=0.001, C=1000, kernel=rbf …
[CV] … gamma=0.001, C=1000, kernel=rbf, score=0.894737 - 0.0s
[CV] gamma=0.001, C=1000, kernel=rbf …
[CV] … gamma=0.001, C=1000, kernel=rbf, score=0.932331 - 0.0s
[CV] gamma=0.001, C=1000, kernel=rbf …
[CV] … gamma=0.001, C=1000, kernel=rbf, score=0.916667 - 0.0s
[Parallel(n_jobs=1)]: Done 31 tasks | elapsed: 0.3s
[Parallel(n_jobs=1)]: Done 75 out of 75 | elapsed: 0.8s finished
[ ]: grid.best_estimator_
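The grid object also exposes the winning combination directly; a quick usage sketch (not an original cell):

    # Best parameter combination found by the cross-validated search,
    # and its mean cross-validated score.
    print(grid.best_params_)
    print(grid.best_score_)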
Then you can re-run predictions on this grid object just as you would with a normal model:
[48]: grid_predictions = grid.predict(X_test)
[49]: print(confusion_matrix(y_test,grid_predictions))
[[ 60   6]
 [  3 102]]
[50]: print(classification_report(y_test,grid_predictions))