
Practical session: Introduction to SVM in R

Jean-Philippe Vert

November 23, 2015

In this session you will

• Learn how to manipulate a SVM in R with the package kernlab


• Observe the effect of changing the C parameter and the kernel

• Test a SVM classifier for cancer diagnosis from gene expression data

1 Linear SVM
Here we generate a toy dataset in 2D, and learn how to train and test a SVM.

1.1 Generate toy data


First generate a set of positive and negative examples from 2 Gaussians.

n <- 150 #number of data points


p <- 2 # dimension
sigma <- 1 # standard deviation of the distributions
meanpos <- 0 # centre of the distribution of positive examples
meanneg <- 3 # centre of the distribution of negative examples
npos <- round(n / 2) # number of positive examples
nneg <- n - npos # number of negative examples

# Generate the positive and negative examples


xpos <- matrix(rnorm(npos * p, mean = meanpos, sd = sigma), npos, p)
xneg <- matrix(rnorm(nneg * p, mean = meanneg, sd = sigma), nneg, p)
x <- rbind(xpos, xneg)

# Generate the labels


y <- matrix(c(rep(1, npos), rep(-1, nneg)))

# Visualize the data


plot(x, col = ifelse(y > 0, 1, 2))
legend("topleft", c("Positive", "Negative"), col = seq(2), pch = 1, text.col = seq(2))


[Plot of the toy data: x[,1] versus x[,2], positive examples in black, negative examples in red.]

Now we split the data into a training set (80%) and a test set (20%).

# Prepare a training and a test set


ntrain <- round(n * 0.8) # number of training examples
tindex <- sample(n, ntrain) # indices of training samples
xtrain <- x[tindex, ]
xtest <- x[-tindex, ]
ytrain <- y[tindex]
ytest <- y[-tindex]
istrain <- rep(0, n)
istrain[tindex] <- 1

# Visualize
plot(x, col = ifelse(y > 0, 1, 2), pch = ifelse(istrain == 1,1,2))
legend("topleft", c("Positive Train", "Positive Test", "Negative Train", "Negative Test"), col = c(1, 1,


[Plot of the toy data: training points shown as circles and test points as triangles, positive examples in black, negative examples in red.]

1.2 Train a SVM


Now we train a linear SVM with parameter C=100 on the training set.

# load the kernlab package


# install.packages("kernlab")
library(kernlab)

# train the SVM


svp <- ksvm(xtrain, ytrain, type = "C-svc", kernel = "vanilladot", C=100, scaled=c())

#Look and understand what svp contains


# General summary
svp

# Attributes that you can access


attributes(svp)

# For example, the support vectors


alpha(svp)
alphaindex(svp)
b(svp)


# Use the built-in function to pretty-plot the classifier


plot(svp, data = xtrain)

QUESTION1 - Write a function plotlinearsvm=function(svp,xtrain) to plot the points and the decision boundaries of a linear SVM, as in Figure 1. To add a straight line to a plot, you may use the function abline.
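
A possible sketch, assuming the training labels ytrain from above are still in the workspace and using kernlab's coef(), xmatrix() and b() accessors (the SVM must have been trained with scaled = c(), as above, so no rescaling is needed):

plotlinearsvm <- function(svp, xtrain) {
  plot(xtrain, col = ifelse(ytrain > 0, 1, 2))
  # recover w = sum_i alpha_i y_i x_i from the support vectors
  w <- colSums(coef(svp)[[1]] * xmatrix(svp)[[1]])
  b <- b(svp)
  # decision boundary w.x - b = 0, and the two margins w.x - b = +/- 1
  abline(b / w[2], -w[1] / w[2])
  abline((b + 1) / w[2], -w[1] / w[2], lty = 2)
  abline((b - 1) / w[2], -w[1] / w[2], lty = 2)
}

plotlinearsvm(svp, xtrain)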

1.3 Predict with a SVM


Now we can use the trained SVM to predict the label of points in the test set, and analyze the results using various metrics.

# Predict labels on test


ypred <- predict(svp, xtest)
table(ytest, ypred)


# Compute accuracy
sum(ypred == ytest) / length(ytest)

# Compute the prediction scores


ypredscore <- predict(svp, xtest, type = "decision")

# Check that the predicted labels are the signs of the scores
table(ypredscore > 0, ypred)

# Package to compute ROC curve, precision-recall etc...


# install.packages("ROCR")
library(ROCR)

## Loading required package: gplots


##
## Attaching package: ’gplots’
##
## The following object is masked from ’package:stats’:
##
## lowess

pred <- prediction(ypredscore, ytest)

# Plot ROC curve


perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)
# Plot precision/recall curve
perf <- performance(pred, measure = "prec", x.measure = "rec")
plot(perf)
# Plot accuracy as function of threshold
perf <- performance(pred, measure = "acc")
plot(perf)

1.4 Cross-validation
Instead of fixing a training set and a test set, we can improve the quality of these estimates by running k-fold cross-validation. We split the training set into k groups of approximately equal size, then iteratively train a SVM using k - 1 groups and make predictions on the group that was left aside. When k is equal to the number of training points, we talk of leave-one-out (LOO) cross-validation. To generate a random split of n points in k folds, we can for example create the following function:

cv.folds <- function(y, folds = 3){


## randomly split the n samples into folds
split(sample(length(y)), rep(1:folds, length = length(y)))
}

QUESTION2 - Write a function cv.ksvm = function(x, y, folds = 3,...) which returns a vector ypred of predicted decision scores for all points by k-fold cross-validation.
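
One possible sketch, reusing the cv.folds helper defined above; extra arguments (kernel, C, ...) are passed on to ksvm:

cv.ksvm <- function(x, y, folds = 3, ...) {
  ypred <- numeric(length(y))
  for (test.idx in cv.folds(y, folds)) {
    # train on all folds except the current one, score the held-out fold
    svp <- ksvm(x[-test.idx, ], y[-test.idx], type = "C-svc", scaled = c(), ...)
    ypred[test.idx] <- predict(svp, x[test.idx, ], type = "decision")
  }
  ypred
}

# Example: 5-fold CV decision scores with a linear SVM
ypredscore.cv <- cv.ksvm(x, y, folds = 5, kernel = "vanilladot", C = 100)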

QUESTION3 - Compute the various performance metrics of the SVM by 5-fold cross-validation. Alternatively, the ksvm function can automatically compute the k-fold cross-validation error:


svp <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = 100, scaled=c(), cross = 5)
print(cross(svp))
print(error(svp))

QUESTION4 - Compare the 5-fold CV error estimated by your function and by ksvm.

1.5 Effect of C
The C parameter balances the trade-off between having a large margin and separating the positive and negative examples on the training set. It is important to choose it well to have good generalization.

QUESTION5 - Plot the decision functions of SVM trained on the toy examples for different values of C in the range 2^seq(-10, 14). To look at the different plots you can use the function par(ask=T), which will ask you to press a key between successive plots. Alternatively, you can use par(mfrow = c(5,5)) to see all the plots in the same window.
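
A possible starting point, assuming the toy xtrain and ytrain generated above:

cost <- 2^seq(-10, 14)
par(ask = TRUE) # press a key to move to the next plot
for (C in cost) {
  svp <- ksvm(xtrain, ytrain, type = "C-svc", kernel = "vanilladot",
              C = C, scaled = c())
  plot(svp, data = xtrain)
}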

QUESTION6 - Plot the 5-fold cross-validation error as a function of C.
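
One way to do this with ksvm's built-in cross-validation (a sketch, using the full toy data x and y):

cost <- 2^seq(-10, 14)
cverror <- sapply(cost, function(C) {
  cross(ksvm(x, y, type = "C-svc", kernel = "vanilladot",
             C = C, scaled = c(), cross = 5))
})
plot(cost, cverror, type = "b", log = "x",
     xlab = "C", ylab = "5-fold CV error")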

QUESTION7 - Do the same on data with more overlap between the two classes, e.g., regenerate toy data with meanneg being 1.


2 Nonlinear SVM
Sometimes linear SVMs are not enough. For example, generate a toy dataset where the positive and negative examples are each a mixture of two Gaussians, so that the classes are not linearly separable.

QUESTION8 - Make a toy example that looks like Figure 2, and test a linear SVM with
different values of C.
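
One possible construction, an XOR-like layout where each class is drawn from two Gaussians, so that no straight line can separate them:

npts <- 50 # points per Gaussian
xpos <- rbind(matrix(rnorm(2 * npts, mean = 0), npts, 2),
              matrix(rnorm(2 * npts, mean = 4), npts, 2))
xneg <- rbind(cbind(rnorm(npts, mean = 0), rnorm(npts, mean = 4)),
              cbind(rnorm(npts, mean = 4), rnorm(npts, mean = 0)))
x <- rbind(xpos, xneg)
y <- c(rep(1, 2 * npts), rep(-1, 2 * npts))
plot(x, col = ifelse(y > 0, 1, 2))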

To solve this problem, we should instead use a nonlinear SVM. This is obtained by simply changing the
kernel parameter. For example, to use a Gaussian RBF kernel with σ = 1 and C = 1:

# Train a nonlinear SVM


svp <- ksvm(x, y, type = "C-svc", kernel="rbf", kpar = list(sigma = 1), C = 1)

# Visualize it
plot(svp, data = x)

You should obtain something that looks like Figure 3. Much better than the linear SVM, no? The nonlinear SVM now has two parameters: σ and C. Both play a role in the generalization capacity of the SVM.

QUESTION9 - Visualize and compute the 5-fold cross-validation error for different values of
C and σ. Observe their influence.
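
A possible sketch of the grid search (the grids of values below are only an example):

cost <- 2^seq(-2, 6, by = 2)
sigmas <- 2^seq(-2, 2)
# 5-fold CV error; rows indexed by C, columns by sigma
cverror <- sapply(sigmas, function(s) {
  sapply(cost, function(C) {
    cross(ksvm(x, y, type = "C-svc", kernel = "rbfdot",
               kpar = list(sigma = s), C = C, cross = 5))
  })
})
print(cverror)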


A useful heuristic to choose σ is implemented in kernlab. It is based on the quantiles of the distances between the training points.

# Train a nonlinear SVM with automatic selection of sigma by heuristic


svp <- ksvm(x, y, type = "C-svc", kernel = "rbf", C = 1)

# Visualize it
plot(svp, data = x)

QUESTION10 - Train a nonlinear SVM with various values of C, with automatic determination of σ.
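
For example (a sketch; leaving kpar at its default lets ksvm apply the σ heuristic for each value of C):

for (C in 2^seq(-4, 8, by = 2)) {
  svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = C)
  plot(svp, data = x)
}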


In fact, many other nonlinear kernels are implemented. Check the documentation of kernlab
to see them: ?kernels

QUESTION11 - Test the polynomial, hyperbolic tangent, Laplacian, Bessel and ANOVA kernels on the toy examples.
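
A possible loop over these kernels, using the kernel names listed in ?kernels and their default parameters:

kernels <- c("polydot", "tanhdot", "laplacedot", "besseldot", "anovadot")
for (k in kernels) {
  svp <- ksvm(x, y, type = "C-svc", kernel = k, C = 1)
  plot(svp, data = x)
}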


3 Application: cancer diagnosis from gene expression data


As a real-world application, let us test the ability of SVM to predict the class of a tumour from gene expression data. We use a publicly available dataset of gene expression data for 128 different individuals with acute lymphoblastic leukemia (ALL).

# Load the ALL dataset


library(ALL)

## Loading required package: Biobase


## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: ’BiocGenerics’
##
## The following objects are masked from ’package:parallel’:
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from ’package:stats’:
##
## xtabs
##
## The following objects are masked from ’package:base’:
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## ’browseVignettes()’. To cite Bioconductor, see
## ’citation("Biobase")’, and for packages ’citation("pkgname")’.

data(ALL)

# Inspect them
?ALL
show(ALL)
print(summary(pData(ALL)))

Here we focus on predicting the type of the disease (B-cell or T-cell). We get the expression data and disease type as follows:


x <- t(exprs(ALL))
y <- substr(ALL$BT,1,1)

QUESTION12 - Test the ability of a SVM to predict the class of the disease from gene expression. Check the influence of the parameters.
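
A possible sketch for a first try (random 80/20 split; the linear kernel and C = 100 are just one choice to start from):

ntrain <- round(0.8 * nrow(x))
tindex <- sample(nrow(x), ntrain)
svp <- ksvm(x[tindex, ], as.factor(y[tindex]), type = "C-svc",
            kernel = "vanilladot", C = 100, scaled = c())
ypred <- predict(svp, x[-tindex, ])
mean(ypred == y[-tindex]) # test accuracy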

Finally, we may want to predict the type and stage of the disease. We are then confronted with a multi-class classification problem, since the variable to predict can take more than two values:

y <- ALL$BT
print(y)

## [1] B2 B2 B4 B1 B2 B1 B1 B1 B2 B2 B3 B3 B3 B2 B3 B B2 B3 B2 B3 B2 B2 B2
## [24] B1 B1 B2 B1 B2 B1 B2 B B B2 B2 B2 B1 B2 B2 B2 B2 B2 B4 B4 B2 B2 B2
## [47] B4 B2 B1 B2 B2 B3 B4 B3 B3 B3 B4 B3 B3 B1 B1 B1 B1 B3 B3 B3 B3 B3 B3
## [70] B3 B3 B1 B3 B1 B4 B2 B2 B1 B3 B4 B4 B2 B2 B3 B4 B4 B4 B1 B2 B2 B2 B1
## [93] B2 B B T T3 T2 T2 T3 T2 T T4 T2 T3 T3 T T2 T3 T2 T2 T2 T1 T4 T
## [116] T2 T3 T2 T2 T2 T2 T3 T3 T3 T2 T3 T2 T
## Levels: B B1 B2 B3 B4 T T1 T2 T3 T4

Fortunately, kernlab automatically implements multi-class SVM by an all-versus-all strategy that combines several binary SVMs.

QUESTION13 - Test the ability of a SVM to predict the class and the stage of the disease
from gene expression.
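
A minimal sketch, again with a linear kernel; the multi-class combination is handled internally by ksvm, and cross = 5 gives an estimate of the error:

svp <- ksvm(x, y, type = "C-svc", kernel = "vanilladot",
            C = 100, scaled = c(), cross = 5)
print(cross(svp))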

