Class3 AdvancedDataMiningWithWeka 2016 PDF
Class 3 – Lesson 1
LibSVM and LibLINEAR
Ian Witten
Department of Computer Science
University of Waikato
New Zealand
weka.waikato.ac.nz
Lesson 3.1: LibSVM and LibLINEAR
Class 3: Interfacing to R and other data mining packages
LibLINEAR
Speed test
Data generator: 10,000 instances of LED24 data, percentage split evaluation
Time to build the model:
– LibLINEAR: 2 secs
– LibSVM, default parameters (RBF kernel): 18 secs
– LibSVM, linear kernel: 10 secs
– SMO, default parameters (linear kernel): 21 secs
LibSVM and LibLINEAR
Linear boundary, small margin
– 0 errors on training data
– 4 errors on test data
Linear boundary, large margin
– 1 error on training data
– 0 errors on test data
Linear boundary: LibLINEAR, or LibSVM with linear kernel (or SMO)
– 21 errors on the training set
Nonlinear boundary
LibSVM, RBF kernel, default parameters (cost=1, gamma=0)
– 9 errors on training set
Do it! with BoundaryVisualizer in Explorer
LibSVM, OK parameters (cost=10, gamma=0)
– 0 errors on training set
– Poor generalization
LibSVM, optimized parameters (cost=1000, gamma=10)
– 0 errors on training set
– Good generalization
Optimizing LibSVM parameters with gridSearch
LibSVM: cost, gamma
– cost: from 10^3 down to 10^-3, exponent in steps of 1
– gamma: 10^3, 10^2, 10, 1, 10^-1, 10^-2, 10^-3 (from 10^3 down to 10^-3, exponent in steps of 1)
– use LibSVM (classification), evaluate using Accuracy
– result: cost = 1000, gamma = 10
SMO (RBFKernel): c, kernel.gamma
– c: from 10^3 down to 10^-3, exponent in steps of 1
– kernel.gamma: 10^3, 10^2, 10, 1, 10^-1, 10^-2, 10^-3 (from 10^3 down to 10^-3, exponent in steps of 1)
– use SMO (classification), evaluate using Accuracy
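The parameter grid described above can be written out explicitly. The following is a minimal R sketch, not Weka's gridSearch implementation: it only enumerates the candidate values, assuming base 10 and exponents from 3 down to -3 in steps of 1. The names exponents, values and grid are illustrative.

```r
# Exponents from 3 down to -3 in steps of 1, as on the slide
exponents <- seq(3, -3, by = -1)

# Candidate values for each parameter: 10^3, 10^2, ..., 10^-3
values <- 10^exponents

# Every (cost, gamma) pair on the full grid
grid <- expand.grid(cost = values, gamma = values)

print(values)
print(nrow(grid))  # number of parameter combinations on the grid
```

With seven candidate values per parameter, the full grid has 7 x 7 = 49 combinations.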
Eibe Frank
Department of Computer Science
University of Waikato
New Zealand
weka.waikato.ac.nz
Lesson 3.2: Setting up R with Weka
The instructions are based on using 64-bit Windows, 64-bit Java, and 64-bit
R, and assume admin rights
– Mixing 32-bit versions with 64-bit ones will produce problems, e.g., the installation
process for Weka’s RPlugin may halt for no apparent reason
– If you have 32-bit Windows, use 32-bit Java and 32-bit R
– Support for R in Weka can also be installed on OS X and Linux: refer to the installation
instructions that come with Weka’s RPlugin
There are four main steps to the installation process:
– Downloading and installing R
– Installing the rJava package in R
– Setting up some Windows environment variables
– Downloading and installing the RPlugin package for Weka
Downloading and installing R
https://cran.r-project.org/mirrors.html
Installing the rJava package in R
Start the R console, e.g., by double-clicking on the shortcut that the installer
has put on your desktop
In the R console, type install.packages("rJava") and press the return key on
your keyboard
Note that this will only work if you have direct web access, i.e., if your web
access is not provided by a proxy computer
(see the next slide on what to do if you are behind a proxy)
In the pop-up menu, choose a mirror to download from
Accept defaults when asked for install options
Close R once the package has been installed, by typing q(), without saving the
workspace
For users with web connections provided by a proxy
Downloading and installing the RPlugin package for Weka
Start Weka, and from the Tools menu in the GUIChooser, select the Weka
package manager
Choose RPlugin from the list of packages and press Install
– If your internet access is through a proxy server, see Using a HTTP proxy at
http://weka.wikispaces.com/How+do+I+use+the+package+manager
Lesson 3.3: Using R to plot data
We need some data to work with, so first load the Iris data into the
Preprocess panel of the Explorer
To plot data with R, go to the RConsole
Before we can use ggplot2, we need to download and install the
corresponding R package:
– To install the package, type install.packages("ggplot2") and press return
– To load the package, type library(ggplot2) and press return
Try entering the following to see if the package works:
ggplot(rdata, aes(x = petallength)) + geom_density()
This should give you a kernel density estimate for the petal length of the Iris
flowers
The data layer
The first layer is the data layer, which specifies the data we want to plot
The data layer is specified using the ggplot() function
The first argument of this function is the data we want to plot
We use rdata here, because this is the name of the data that has been
transferred into R from the Preprocess panel
The second argument is often a call to the aes() function, which maps data
to a plot’s visual aspects and components
We use it to define which attributes are plotted, and how parts of the plot
are colored and filled
The geometry layer(s)
Once the data layer has been defined, we can define geometry layers to
specify how the data is plotted
In the previous example, we specified a kernel density plot by using the
geom_density() function
– The kernel density estimate generated this way is too wide, but we can use the xlim()
function to change the range of the x axis:
ggplot(rdata, aes(x = petallength)) + geom_density() + xlim(0,8)
We can use the adjust parameter to scale the kernel bandwidth that is used
by the estimate
ggplot(rdata, aes(x = petallength)) + geom_density(adjust = 0.5) + xlim(0,8)
Using values smaller than 1 makes the density estimate fit the data more
closely and we get more peaks and valleys
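To see what adjust does numerically, the following sketch uses base R's density() function on R's built-in iris data instead of ggplot2 and rdata. That substitution is an assumption made for illustration: the built-in data frame names the attribute Petal.Length, whereas the data transferred from Weka uses petallength. In density(), the bandwidth actually used is the default bandwidth multiplied by adjust.

```r
x <- iris$Petal.Length

d_default <- density(x)                # adjust = 1 is the default
d_half    <- density(x, adjust = 0.5)  # half the default bandwidth

# The bandwidth that each estimate actually used
print(d_default$bw)
print(d_half$bw)
```

A smaller bandwidth means each data point influences a narrower region, which is why the curve follows the data more closely.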
Plotting classification data
Use the pdf() function to redirect plot output from the screen to a PDF file, e.g.:
pdf("/Users/eibe/Documents/Test.pdf")
Reissue the plotting command so that the plot is written to the file:
ggplot(ndata, aes(x = value, color = class, fill = class)) + geom_density(adjust = 0.5, alpha = 0.5) + xlim(0,8) + facet_grid(. ~ variable)
Once the plot has been written to the file, redirect output back to the screen:
dev.off()
The resulting PDF can be viewed with any PDF reader and integrated into
other documents
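The same open, plot, close pattern can be tried outside Weka. This is a minimal sketch under stated assumptions: it uses a base R plot of the built-in iris data in place of the ggplot2 command, and writes to a temporary file (the name out_file is illustrative) so that it runs on any machine.

```r
# Write to a temporary file so the sketch does not depend on a user directory
out_file <- tempfile(fileext = ".pdf")

pdf(out_file)                     # open the PDF device; plots now go to the file
plot(density(iris$Petal.Length))  # any plotting command can go here
dev.off()                         # close the device so the file is complete

print(file.exists(out_file))
```

When a ggplot2 command is run inside a function or a loop rather than typed at the console, wrap it in print() so that the plot is actually drawn.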
One more example of a geometry layer: box plots
http://ggplot2.org/
This site also enables you to subscribe to a mailing list where you can get
help
There are several books dedicated to ggplot2, including two that are
mentioned at the above location
Advanced Data Mining with Weka
Class 3 – Lesson 4
Using R to run a classifier
Lesson 3.4: Using R to run a classifier
To try MLRClassifier, load some data into the Preprocess panel of the Explorer,
e.g., the diabetes data
Then, switch to the Classify panel and select the Choose button to pop up the
menu with available classifiers
It will take a little while for the menu to pop up because Weka will download
and install the mlr package in R
– This only happens once, when the package is first required
Now, expand the new mlr item in the menu and select MLRClassifier
Pressing the Start button will run MLRClassifier on the data
Considering the output
http://www.rdocumentation.org
– Select the link for the learning method from this page
Growing different ferns
The MLR package has many other facilities for machine learning in R, e.g.,
running experiments in the R environment
There is an extensive tutorial on how to use MLR from R at
https://mlr-org.github.io/mlr-tutorial/release/html/
The MLR package is constantly being expanded and every release adds new
algorithms
– When new releases come out, the RPlugin package needs to be updated so that these
algorithms become available through MLRClassifier in Weka
Every official R package has a dedicated web page with a link to a PDF
reference manual for this package
– For example, the URL for the rFerns package is
https://cran.r-project.org/web/packages/rFerns/index.html
Advanced Data Mining with Weka
Class 3 – Lesson 5
Using R to preprocess data
Lesson 3.5: Using R to preprocess data
Assume we want to remove the last attribute from the Iris data
First, we configure an ArffLoader component to load the data
Then, we place the RScriptExecutor component on the canvas
Now, we can connect the two using a dataSet connection
To visualize the processed data we can use a ScatterPlotMatrix
component, which we can put on the canvas but not yet connect
To process the data, we need to configure the RScriptExecutor by
entering an appropriate script
The single-line script rdata[1:4] (square brackets!) creates a data frame
from the first four attributes of the incoming data (rdata)
Now, we can make a dataSet connection to the ScatterPlotMatrix
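The effect of the single-line script can be checked in a plain R session. In this sketch, R's built-in iris data frame stands in for the rdata that Weka passes to the RScriptExecutor; that stand-in, and the name reduced, are assumptions for illustration.

```r
rdata <- iris          # stand-in for the data frame supplied by Weka

# Single square brackets select columns and keep the result a data frame
reduced <- rdata[1:4]

print(class(reduced))
print(names(reduced))

# By contrast, double square brackets extract one column as a plain vector
first_column <- rdata[[1]]
print(is.data.frame(first_column))
```

The result has to stay a data frame so that Weka can import it, which is why single square brackets are used in the script.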
Installing an R package using the R Console perspective
install.packages("fastICA")
Another example script: using ICA
Now that we have installed the package, we can use fastICA in our R script
First, we need to load the library into R using library(fastICA)
Then, we may want to set up a variable specifying the number of components
we would like to extract using ICA
– Assume we want to use as many components as there are predictor attributes in the
input, so we can use num = ncol(rdata) - 1 for this, where ncol() gives the number of
attributes in rdata
To apply fastICA to the reduced Iris data and extract num components, we can
use fastICA(rdata[1:num], num)
This function returns a list of results; we want the element S, so we use fastICA(...)$S
– Check the R documentation for fastICA to see what values are returned by this function
This will produce an R matrix, which we need to turn into an R data frame
using the data.frame() function, so that Weka can import the data
The complete script
library(fastICA)
num = ncol(rdata) - 1
data.frame(fastICA(rdata[1:num], num)$S)
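To try the script outside the Knowledge Flow, the sketch below runs the same steps in a plain R session. It makes two assumptions that are not part of the original script: the built-in iris data stands in for rdata, and the fastICA call is guarded so that the code still runs when the package has not been installed. The names result and ica are illustrative.

```r
rdata <- iris               # stand-in for the data frame supplied by Weka
num <- ncol(rdata) - 1      # number of predictor attributes (the class is last)
print(num)

if (requireNamespace("fastICA", quietly = TRUE)) {
  set.seed(1)               # fastICA starts from a random initial matrix
  result <- fastICA::fastICA(rdata[1:num], num)
  ica <- data.frame(result$S)   # S holds the estimated source signals
  print(dim(ica))
} else {
  message("fastICA is not installed; run install.packages(\"fastICA\") first")
}
```

Note that the subtraction must be typed with a plain hyphen-minus; a typographic dash copied from a slide is not valid R syntax.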
Pamela Douglas
Department of Psychiatry and Biobehavioral Sciences
David Geffen School of Medicine
University of California, Los Angeles
USA
weka.waikato.ac.nz
Lesson 3.6: Application: Functional MRI Neuroimaging data
Independent Components
Activity: Learn how to do Nested Cross Validation for Parameter Tuning …. Test out Multiple Classifiers, and
Test the Importance of Using Feature Selection
Data from Haxby et al. (2001) “Distributed and overlapping representations of faces and objects in
ventral temporal cortex,” Science, Vol. 293
creativecommons.org/licenses/by/3.0/
weka.waikato.ac.nz