


Tutorial 1.

Introduction to MOA (Massive Online Analysis)

Albert Bifet and Richard Kirkby March 2012

1 Getting Started

This tutorial is a basic introduction to MOA. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. We assume that MOA is installed on your system. Start the graphical user interface for configuring and running tasks with the command:

    java -cp moa.jar -javaagent:sizeofag.jar moa.gui.GUI

Figure 1: MOA Graphical User Interface

Click Configure to set up a task; when you are ready to launch it, click Run. Several tasks can be run concurrently. Click on different tasks in the list and control them using the buttons below. If textual output of a task is available, it is displayed in the middle of the GUI and can be saved to disk. Note that the command line text box at the top of the window shows the textual commands that can be used to run tasks on the command line. The text can be selected and then copied to the clipboard. At the bottom of the GUI there is a graphical display of the results. It is possible to compare the results of two different tasks: the current task is displayed in red, and the previously selected task in blue.

Figure 2: The data stream classification cycle

2 The Classification Graphical User Interface

We start by comparing the accuracy of two classifiers. First, we briefly explain two different data stream evaluations.

2.1 Data Stream Evaluation

The most significant requirements for a data stream setting are the following:

  Requirement 1: Process an example at a time, and inspect it only once (at most)
  Requirement 2: Use a limited amount of memory
  Requirement 3: Work in a limited amount of time
  Requirement 4: Be ready to predict at any time

Figure 2 illustrates the typical use of a data stream classification algorithm, and how the requirements fit in a repeating cycle:

  1. The algorithm is passed the next available example from the stream (requirement 1).
  2. The algorithm processes the example, updating its data structures. It does so without exceeding the memory bounds set on it (requirement 2), and as quickly as possible (requirement 3).
  3. The algorithm is ready to accept the next example. On request it is able to predict the class of unseen examples (requirement 4).

In traditional batch learning the problem of limited data is overcome by analyzing and averaging multiple models produced with different random arrangements of training and test data. In the stream setting the problem of (effectively) unlimited data poses different challenges. One solution involves taking snapshots at different times during the induction of a model to see how much the model improves.

When considering what procedure to use in the data stream setting, one of the unique concerns is how to build a picture of accuracy over time. Two main approaches arise:

Holdout: When traditional batch learning reaches a scale where cross-validation is too time consuming, it is often accepted to instead measure performance on a single holdout set. This is most useful when the division between train and test sets has been predefined, so that results from different studies can be directly compared.

Interleaved Test-Then-Train or Prequential: Each individual example can be used to test the model before it is used for training, and from this the accuracy can be incrementally updated. When intentionally performed in this order, the model is always being tested on examples it has not seen. This scheme has the advantage that no holdout set is needed for testing, making maximum use of the available data. It also ensures a smooth plot of accuracy over time, as each individual example becomes increasingly less significant to the overall average.

Holdout evaluation gives a more accurate estimation of the accuracy of the classifier on more recent data. However, it requires recent test data that is difficult to obtain for real datasets. Gama et al. propose to use a forgetting mechanism for estimating holdout accuracy using prequential accuracy: a sliding window of size $w$ with the most recent observations, or fading factors that weigh observations using a decay factor $\alpha$. The output of the two mechanisms is very similar (every window of size $w_0$ may be approximated by some decay factor $\alpha_0$).

As data stream classification is a relatively new field, such evaluation practices are not nearly as well researched and established as they are in the traditional batch setting.
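The test-then-train order and the two forgetting mechanisms can be sketched in a few lines of Python. This is a minimal illustration and not MOA code: the MajorityClass learner, the function name prequential, and the parameters window and alpha are hypothetical names chosen for this sketch. The fading estimate keeps the running sums S = hit + alpha * S and N = 1 + alpha * N and reports S / N.

```python
from collections import deque


class MajorityClass:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet, so the first prediction is a miss
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential(stream, learner, window=1000, alpha=0.999):
    """Test each example before training on it and return three accuracy estimates."""
    seen = correct = 0
    recent = deque(maxlen=window)      # 0/1 outcomes of the most recent examples
    faded_correct = faded_seen = 0.0   # fading-factor sums
    for x, y in stream:
        hit = 1 if learner.predict(x) == y else 0  # test first ...
        learner.learn(x, y)                        # ... then train
        seen += 1
        correct += hit
        recent.append(hit)
        faded_correct = hit + alpha * faded_correct
        faded_seen = 1.0 + alpha * faded_seen
    if seen == 0:
        raise ValueError("empty stream")
    return {
        "overall": correct / seen,             # plain interleaved test-then-train
        "window": sum(recent) / len(recent),   # sliding window of size `window`
        "fading": faded_correct / faded_seen,  # decay factor `alpha`
    }


if __name__ == "__main__":
    # Ten examples with the same label: only the very first prediction is wrong.
    print(prequential([(i, "a") for i in range(10)], MajorityClass(), window=5, alpha=0.5))
```

The overall figure averages over the whole stream, while the window and fading estimates track recent behaviour, which is why they respond faster when the stream changes.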

2.2 Exercises

To familiarize yourself with the functions discussed so far, please do the following two exercises. The solutions to these and other exercises in this tutorial are given at the end.

Exercise 1: Compare the accuracy of the Hoeffding Tree with the Naive Bayes classifier, for a RandomTreeGenerator stream of 1,000,000 instances using Interleaved Test-Then-Train evaluation. Use a sample frequency of 10,000 for all exercises.

Exercise 2: Compare and discuss the accuracy for the same stream of the previous exercise using three different evaluations with a Hoeffding Tree:

  - Periodic Held Out with 1,000 instances for testing
  - Interleaved Test Then Train
  - Prequential with a sliding window of 1,000 instances

2.3 Drift Stream Generators

MOA streams are built using generators, reading ARFF files, joining several streams, or filtering streams. MOA stream generators can simulate a potentially infinite sequence of data. Two streams that evolve over time are:

  - Rotating Hyperplane
  - Random RBF Generator

To model concept drift we only have to set the drift parameter of the stream. We can also model concept drift by joining several streams. MOA models a concept drift event as a weighted combination of two pure distributions that characterize the target concepts before and after the drift. MOA uses the sigmoid function as an elegant and practical solution to define the probability that every new instance of the stream belongs to the new concept after the drift.

Figure 3: A sigmoid function $f(t) = 1/(1 + e^{-s(t - t_0)})$.

We see from Figure 3 that the sigmoid function $f(t) = 1/(1 + e^{-s(t - t_0)})$ has a derivative at the point $t_0$ equal to $f'(t_0) = s/4$. The tangent of angle $\alpha$ is equal to this derivative, $\tan\alpha = s/4$. We observe that $\tan\alpha = 1/W$, and as $s = 4\tan\alpha$ then $s = 4/W$. So the parameter $s$ in the sigmoid gives the length of $W$ and the angle $\alpha$. In this sigmoid model we only need to specify two parameters: $t_0$, the point of change, and $W$, the length of change. Note that for any positive real number $\beta$,

$$f(t_0 + \beta W) = 1 - f(t_0 - \beta W),$$

and that $f(t_0 + \beta W)$ and $f(t_0 - \beta W)$ are constant values that don't depend on $t_0$ and $W$:

$$f(t_0 + W/2) = 1 - f(t_0 - W/2) = 1/(1 + e^{-2}) \approx 88.08\%$$
$$f(t_0 + W) = 1 - f(t_0 - W) = 1/(1 + e^{-4}) \approx 98.20\%$$
$$f(t_0 + 2W) = 1 - f(t_0 - 2W) = 1/(1 + e^{-8}) \approx 99.97\%$$

Definition 1. Given two data streams $a$, $b$, we define $c = a \oplus^{W}_{t_0} b$ as the data stream built by joining the two data streams $a$ and $b$, where $t_0$ is the point of change, $W$ is the length of change, and

$$\Pr[c(t) = a(t)] = e^{-4(t - t_0)/W} / (1 + e^{-4(t - t_0)/W})$$
$$\Pr[c(t) = b(t)] = 1/(1 + e^{-4(t - t_0)/W}).$$

Example:
ConceptDriftStream -s (generators.AgrawalGenerator -f 7) -d (generators.AgrawalGenerator -f 2) -w 1000000 -p 900000

ConceptDriftStream parameters:

    -s : Stream
    -d : Concept drift Stream
    -p : Central position of concept drift change
    -w : Width of concept drift change
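The sigmoid join of Definition 1 can be illustrated with a short Python sketch. It is an illustration under stated assumptions and not MOA's ConceptDriftStream implementation: the function names drift_probability and join_streams are hypothetical, both input streams are advanced at every step so that c(t) is either a(t) or b(t), and a fixed seed is used only to make the example repeatable.

```python
import math
import random


def drift_probability(t, t0, width):
    """Probability that the instance at time t belongs to the new concept b."""
    z = -4.0 * (t - t0) / width
    if z > 700:  # far before the change point; avoids overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def join_streams(a, b, t0, width, seed=1):
    """Yield a(t) or b(t) at each step t, choosing b(t) with the sigmoid probability."""
    rng = random.Random(seed)
    for t, (xa, xb) in enumerate(zip(a, b)):
        yield xb if rng.random() < drift_probability(t, t0, width) else xa


if __name__ == "__main__":
    t0, width = 900000, 1000000
    # The constants from the text: they do not depend on t0 or W.
    for beta in (0.5, 1.0, 2.0):
        print(beta, round(100 * drift_probability(t0 + beta * width, t0, width), 2))
    # A sharp drift around t = 500 on two constant toy streams.
    mixed = list(join_streams(["a"] * 1000, ["b"] * 1000, t0=500, width=10))
    print(mixed[:3], mixed[-3:])
```

With a small width the output switches almost at once from the first stream to the second; a large width, as in the -w 1000000 example, mixes the two concepts over a long stretch of the stream.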

2.4 Exercises

Exercise 3: Compare the accuracy of the Hoeffding Tree with the Naive Bayes classifier, for a RandomRBFGenerator stream of 1,000,000 instances with speed of change 0.001 using Interleaved Test-Then-Train evaluation.

Exercise 4: Compare the accuracy for the same stream of the previous exercise using three different classifiers:

  - Hoeffding Tree with Majority Class at the leaves
  - Hoeffding Adaptive Tree
  - OzaBagAdwin with 10 HoeffdingTree

3 Using the Command Line

An easy way to use the command line is to copy and paste the text in the Configuration line of the Graphical User Interface. For example, suppose we want to process the task

    EvaluatePrequential -l trees.HoeffdingTree -i 1000000 -w 10000

using the command line. We simply write

    java -cp moa.jar -javaagent:sizeofag.jar moa.DoTask \
        "EvaluatePrequential -l trees.HoeffdingTree -i 1000000 -w 10000"

Note that some parameters are missing, since they use default values.
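When the same task has to be launched from a script, one option is to build the argument list explicitly. The Python sketch below assumes that java is on the PATH and that moa.jar and sizeofag.jar are in the working directory; the helper names moa_command and run_task are hypothetical and not part of MOA. The task description is passed as one argument, which plays the same role as the quotes in the shell command.

```python
import subprocess


def moa_command(task, moa_jar="moa.jar", agent_jar="sizeofag.jar"):
    """Build the argument list for running one MOA task through moa.DoTask."""
    return [
        "java",
        "-cp", moa_jar,
        f"-javaagent:{agent_jar}",
        "moa.DoTask",
        task,  # the whole task description travels as a single argument
    ]


def run_task(task, **jars):
    """Run the task and return its standard output (needs Java and the MOA jars)."""
    result = subprocess.run(
        moa_command(task, **jars),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if java exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_task(...) to actually execute it.
    print(moa_command("EvaluatePrequential -l trees.HoeffdingTree -i 1000000 -w 10000"))
```

Because no shell is involved, characters in the task text are not reinterpreted before they reach MOA.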

3.1 Learning and Evaluating Models

The moa.DoTask class is the main class for running tasks on the command line. It will accept the name of a task followed by any appropriate parameters. The first task used is the LearnModel task. The -l parameter specifies the learner, in this case the HoeffdingTree class. The -s parameter specifies the stream to learn from; in this case generators.WaveformGenerator is specified, a data stream generator that produces a three-class learning problem of identifying three types of waveform. The -m option specifies the maximum number of examples to train the learner with, in this case one million examples. The -O option specifies a file to output the resulting model to:

    java -cp moa.jar -javaagent:sizeofag.jar moa.DoTask \
        LearnModel -l trees.HoeffdingTree \
        -s generators.WaveformGenerator -m 1000000 -O model1.moa

This will create a file named model1.moa that contains the Hoeffding tree model that was induced during training.

The next example will evaluate the model to see how accurate it is on a set of examples that are generated using a different random seed. The EvaluateModel task is given the parameters needed to load the model produced in the previous step, generate a new waveform stream with a random seed of 2, and test on one million examples:

    java -cp moa.jar -javaagent:sizeofag.jar moa.DoTask \
        "EvaluateModel -m file:model1.moa \
        -s (generators.WaveformGenerator -i 2) -i 1000000"

This is the first example of nesting parameters using brackets. Quotes have been added around the description of the task, otherwise the operating system may be confused about the meaning of the brackets.

After evaluation the following statistics are output:

    classified instances = 1,000,000
    classifications correct (percent) = 84.474
    Kappa Statistic (percent) = 76.711

Note that the above two steps can be achieved by rolling them into one, avoiding the need to create an external file, as follows:

    java -cp moa.jar -javaagent:sizeofag.jar moa.DoTask \
        "EvaluateModel -m (LearnModel -l trees.HoeffdingTree \
        -s generators.WaveformGenerator -m 1000000) \
        -s (generators.WaveformGenerator -i 2) -i 1000000"

The task EvaluatePeriodicHeldOutTest will train a model while taking snapshots of performance using a held-out test set at periodic intervals. The following command creates a comma separated values file, training the HoeffdingTree classifier on the WaveformGenerator data, using the first 100 thousand examples for testing, training on a total of 10 million examples, and testing every one million examples:

    java -cp moa.jar -javaagent:sizeofag.jar moa.DoTask \
        "EvaluatePeriodicHeldOutTest -l trees.HoeffdingTree \
        -s generators.WaveformGenerator \
        -n 100000 -i 10000000 -f 1000000" > dsresult.csv
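The Kappa statistic reported by EvaluateModel can be related to the accuracy with a short calculation. Kappa compares the observed accuracy p0 with the agreement pc expected by chance: kappa = (p0 - pc) / (1 - pc). The Python sketch below assumes that the three waveform classes are equally frequent, so that pc = 1/3; this balance is an assumption made for the illustration, and the function name kappa is hypothetical. Under that assumption the formula reproduces the reported value to three decimals.

```python
def kappa(p0, pc):
    """Kappa statistic from observed accuracy p0 and chance agreement pc."""
    return (p0 - pc) / (1.0 - pc)


if __name__ == "__main__":
    accuracy = 0.84474       # "classifications correct (percent) = 84.474"
    chance = 1.0 / 3.0       # assumption: three equally frequent classes
    print(round(100 * kappa(accuracy, chance), 3))
```

A Kappa of 0 means the classifier does no better than chance, which makes it more informative than raw accuracy when the class distribution is skewed.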

3.2 Exercises

Exercise 5: Repeat the experiments of exercises 1 and 2 using the command line.

Exercise 6: Compare accuracy and RAM-Hours needed using a prequential evaluation (sliding window of 1,000 instances) of 1,000,000 instances for a Random Radial Basis Function stream with speed of change 0.001 using the following methods:

  - OzaBag with 10 HoeffdingTree
  - OzaBagAdwin with 10 HoeffdingTree
  - LeveragingBag with 10 HoeffdingTree

4 Answers To Exercises

  1. Naive Bayes: 73.63%
     Hoeffding Tree: 94.45%

  2. Periodic Held Out with 1,000 instances for testing: 96.5%
     Interleaved Test Then Train: 94.45%
     Prequential with a sliding window of 1,000 instances: 96.7%

  3. Naive Bayes: 53.14%
     Hoeffding Tree: 57.60%

  4. Hoeffding Tree with Majority Class at Leaves: 51.71%
     Hoeffding Adaptive Tree: 65.28%
     OzaBagAdwin with 10 HoeffdingTree: 67.23%

  5. EvaluateInterleavedTestThenTrain -i 1000000
     EvaluateInterleavedTestThenTrain -l trees.HoeffdingTree -i 1000000
     EvaluatePeriodicHeldOutTest -n 1000 -i 1000000
     EvaluateInterleavedTestThenTrain -l trees.HoeffdingTree -i 1000000
     EvaluatePrequential -l trees.HoeffdingTree -i 1000000

  6. OzaBag with 10 HoeffdingTree: 57.4% Accuracy, $4 \times 10^{-4}$ RAM-Hours
     OzaBagAdwin with 10 HoeffdingTree: 71.5% Accuracy, $2.93 \times 10^{-6}$ RAM-Hours
     LeveragingBag with 10 HoeffdingTree: 82.9% Accuracy, $1.25 \times 10^{-4}$ RAM-Hours
