Introduction to Weka
Statistical Learning
Michel Galley, Artificial Intelligence class, November 2, 2006
Machine Learning with Weka
Comprehensive set of tools:
Pre-processing and data analysis
Learning algorithms (for classification, clustering, etc.)
Evaluation metrics
Three modes of operation:
GUI
command-line (not discussed today)
Java API (not discussed today)
Weka Resources
Web page
http://www.cs.waikato.ac.nz/ml/weka/
Extensive documentation (tutorials, trouble-shooting guide, wiki, etc.)
At Columbia
Installed locally at:
~mg2016/weka (CUNIX network)
~galley/weka (CS network)
Downloads for Windows or UNIX: http://www1.cs.columbia.edu/~galley/weka/downloads
Attribute-Relation File Format (ARFF)
Weka reads ARFF files:
Header:
  @relation adult
  @attribute age numeric
  @attribute name string
  @attribute education {College, Masters, Doctorate}
  @attribute class {>50K,<=50K}
Data, as comma-separated values (CSV):
  @data
  50,Leslie,Masters,>50K
  ?,Morgan,College,<=50K
Supported attribute types:
numeric, nominal, string, date
Details at:
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
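For those who later use the Java API, an ARFF file can be loaded in a few lines. This is a minimal sketch (the file name adult.train.arff and the class-is-the-last-attribute assumption are ours, not from the slides):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Read an ARFF file (header + data rows) into a Weka dataset.
        Instances data = new Instances(
                new BufferedReader(new FileReader("adult.train.arff")));
        // ARFF does not mark the class attribute; here we assume it is the last one.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes.");
    }
}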
Sample database: the census data (adult)
Binary classification:
Task: predict whether a person earns > $50K a year
Attributes: age, education level, race, gender, etc.
Attribute types: nominal and numeric
Training/test instances: 32,000 / 16,300
Original UCI data available at:
ftp.ics.uci.edu/pub/machine-learning-databases/adult
Data already converted to ARFF:
http://www1.cs.columbia.edu/~galley/weka/datasets/
Starting the GUI
CS accounts
> java -Xmx128M -jar ~galley/weka/weka.jar
> java -Xmx512M -jar ~galley/weka/weka.jar   (with more memory)
CUNIX accounts
> java -Xmx128M -jar ~mg2016/weka/weka.jar
Start Explorer
Weka Explorer
What we will use today in Weka:
I. Pre-process:
load, analyze, and filter data
II. Visualize:
compare pairs of attributes; plot matrices
III. Classify:
all algorithms seen in class (Naive Bayes, etc.)
IV. Feature selection:
forward feature subset selection, etc.
[Explorer screenshot: load / filter / analyze data, visualize attributes]
Demo #1: J48 decision trees (=C4.5)
Steps:
load data from URL:
http://www1.cs.columbia.edu/~galley/weka/datasets/adult.train.arff
select only three attributes: age, education-num, class
weka.filters.unsupervised.attribute.Remove -V -R 1,5,last
visualize the age/education-num matrix: find this in the Visualize pane
classify with decision trees, percent split of 66%:
weka.classifiers.trees.J48
visualize the decision tree: (right-)click on the entry in the result list, select "Visualize tree"
compare the matrix with the decision tree: does it make sense to you?
(A rough Java-API sketch of these steps appears below.)
Try it for yourself after the class!
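If you prefer code to the GUI, the demo steps above can also be reproduced with the Java API. The sketch below is an untested outline, assuming a local copy of adult.train.arff with the class as the last attribute; it applies the Remove filter and evaluates J48 on a 66% split:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Demo1 {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("adult.train.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Keep only age, education-num, and class (attributes 1, 5, last).
        Remove remove = new Remove();
        remove.setOptions(new String[]{"-V", "-R", "1,5,last"});
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        // 66% / 34% percentage split, as in the Explorer.
        reduced.randomize(new Random(1));
        int trainSize = (int) Math.round(reduced.numInstances() * 0.66);
        Instances train = new Instances(reduced, 0, trainSize);
        Instances test = new Instances(reduced, trainSize,
                reduced.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println(tree);  // prints the learned decision tree in text form
    }
}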
Demo #1: J48 decision trees
[Figures, three slides: scatter plot of AGE (x-axis) vs. EDUCATION-NUM (y-axis), instances colored by class (>50K vs. <=50K); the predicted +/- regions and the axis-parallel J48 decision boundaries are overlaid, with split values visible near AGE = 31, 34, 36, 60 and EDUCATION-NUM = 13.]
Demo #1: J48 result analysis
Comparing classifiers
Classifiers allowed in assignment:
decision trees (seen)
naive Bayes (seen)
linear classifiers (next week)
Repeating many experiments in Weka:
the previous experiment is easy to reproduce with other classifiers and parameters (e.g., inside the Weka Experimenter)
less time coding and experimenting means more time for analyzing the intrinsic differences between classifiers
Linear classifiers
Prediction is a linear function of the input
in the case of binary predictions, a linear classifier splits a high-dimensional input space with a hyperplane (i.e., a plane in 3D, or a straight line in 2D).
Many popular, effective classifiers are linear: perceptron, linear SVM, logistic regression (a.k.a. maximum entropy, exponential model).
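As a toy illustration (not Weka code, and the weights below are made up), a binary linear classifier just checks which side of the hyperplane w . x + b = 0 an input falls on:

// Hypothetical example: prediction from the sign of a weighted sum of the inputs.
public class LinearRule {
    static boolean predictsPositive(double[] w, double b, double[] x) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * x[i];
        }
        return score > 0;   // e.g., true -> ">50K", false -> "<=50K"
    }

    public static void main(String[] args) {
        double[] w = {0.04, 0.3};   // weights for (age, education-num), made up
        double b = -5.0;            // bias term, made up
        System.out.println(predictsPositive(w, b, new double[]{50, 13}));
    }
}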
Comparing classifiers
Results on adult data
Majority-class baseline (always predict <=50K): weka.classifiers.rules.ZeroR, 76.51%
Naive Bayes: weka.classifiers.bayes.NaiveBayes, 79.91%
Linear classifier: weka.classifiers.functions.Logistic, 78.88%
Decision trees: weka.classifiers.trees.J48, 79.97%
Why this difference?
A linear classifier in a 2D space:
it can classify correctly (shatter) any set of 3 points;
not true for 4 points (e.g., four points in an XOR configuration, with opposite corners sharing the same label, cannot be separated by any single line);
we then say that 2D linear classifiers have capacity 3.
A decision tree in a 2D space:
can shatter as many points as there are leaves in the tree;
potentially unbounded capacity! (e.g., if there is no tree pruning)
Demo #2: Logistic Regression
Can we improve upon logistic regression results?
Steps:
use the same data as before (3 attributes)
discretize and binarize the data (numeric → binary):
weka.filters.unsupervised.attribute.Discretize -D -F -B 10
classify with logistic regression, percent split of 66%:
weka.classifiers.functions.Logistic
compare the result with the decision tree: your conclusion?
repeat the classification experiment with all features, comparing the three classifiers (J48, Logistic, and Logistic with binarization): your conclusion?
(A rough Java-API sketch of the discretize-and-classify step appears below.)
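The discretize-then-classify step can also be expressed with the API. This is a hedged, untested sketch using weka.classifiers.meta.FilteredClassifier, assuming train/test Instances prepared as in the Demo #1 sketch:

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Discretize;

public class Demo2 {
    // Assumes train/test Instances prepared as in the Demo #1 sketch.
    static Evaluation discretizeAndClassify(Instances train, Instances test)
            throws Exception {
        // -B 10: ten bins, -F: equal-frequency binning, -D: output binary attributes.
        Discretize disc = new Discretize();
        disc.setOptions(new String[]{"-D", "-F", "-B", "10"});

        // Apply the filter to training data, then train Logistic on the result.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(disc);
        fc.setClassifier(new Logistic());
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        return eval;
    }
}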
Demo #2: Results
two features (age, education-num):
decision tree: 79.97%
logistic regression: 78.88%
logistic regression with feature binarization: 79.97%
all features:
decision tree: 84.38%
logistic regression: 85.03%
logistic regression with feature binarization: 85.82%
Feature Selection
Feature selection:
find a feature subset that is a good substitute for all the features
good for knowing which features are actually useful
often gives better accuracy (especially on new data)
Forward feature selection (FFS): [John et al., 1994]
wrapper feature selection: uses a classifier to determine the goodness of feature sets
greedy search: fast, but prone to search errors
Feature Selection in Weka
Forward feature selection:
attribute evaluator: WrapperSubsetEval
select a classifier (e.g., NaiveBayes)
number of folds in cross-validation (default: 5)
search method: GreedyStepwise
generateRanking: true
numToSelect (default: maximum)
startSet: good features you previously identified
attribute selection mode: full training data or cross-validation
Notes:
double cross-validation because of GreedyStepwise
change the number of folds to achieve the desired trade-off between selection accuracy and running time
(A rough Java-API sketch of this setup appears below.)
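Roughly the same configuration can be driven from the Java API. The sketch below is untested and simply mirrors the GUI settings listed above (NaiveBayes as the wrapped classifier, 5 folds):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class ForwardSelection {
    // Assumes the data has been loaded and its class index set, as earlier.
    static int[] selectFeatures(Instances data) throws Exception {
        // Wrapper evaluation: score feature subsets by cross-validated accuracy
        // of a chosen classifier (here NaiveBayes, 5 folds).
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new NaiveBayes());
        evaluator.setFolds(5);

        // Greedy stepwise (forward) search over feature subsets.
        GreedyStepwise search = new GreedyStepwise();
        search.setGenerateRanking(true);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(evaluator);
        selection.setSearch(search);
        selection.SelectAttributes(data);
        return selection.selectedAttributes();  // indices of the kept attributes (+ class)
    }
}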
Weka Experimenter
If you need to perform many experiments:
the Experimenter makes it easy to compare the performance of different learning schemes
results can be written to a file or a database
evaluation options: cross-validation, learning curve, etc.
can also iterate over different parameter settings
significance testing built in
Beyond the GUI
How to reproduce experiments with the command-line/API
GUI, API, and command-line all rely on the same set of Java classes
Generally easy to determine what classes and parameters were used in the GUI
Tree displays in Weka reflect its Java class hierarchy
> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 2 -t <train_arff> -T <test_arff>
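Because the command line and the GUI use the same classes, the call above has a direct Java equivalent. A hedged sketch (train.arff and test.arff are placeholder file names) that sets the same J48 options and evaluates on a separate test set:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class TrainAndTest {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("train.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("test.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Same options as on the command line: -C 0.25 (pruning confidence), -M 2 (min leaf size).
        J48 tree = new J48();
        tree.setOptions(new String[]{"-C", "0.25", "-M", "2"});
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}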
Important command-line parameters
> java -cp ~galley/weka/weka.jar weka.classifiers.<classifier_name> [classifier_options] [options]
where options are:
Create/load/save a classification model:
-t <file> : training set
-l <file> : load model file
-d <file> : save model file
Testing:
-x <N> : N-fold cross-validation
-T <file> : test set
-p <S> : print predictions + attribute selection S
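Putting a few of these options together (file names below are only placeholders), one possible workflow is to train and save a model, then reload it to score a test set, or to run cross-validation directly:

> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -t train.arff -d j48.model
> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -l j48.model -T test.arff
> java -cp ~galley/weka/weka.jar weka.classifiers.trees.J48 -t train.arff -x 10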