Data Mining With Weka - Demo

The document provides an introduction to using the Weka data mining tool. It covers downloading and installing Weka, exploring datasets, using classification and clustering algorithms on sample datasets, evaluating models with training and testing as well as cross validation, and finding association rules. Visualization techniques are also discussed.

Data Mining with Weka

Instructor: Solomon (Ph.D)


Getting started with Weka
Introduction
• Download from:
– https://waikato.github.io/weka-wiki/downloading_weka/ (for Windows, Mac, Linux)
• Weka 3.8.6 (the latest stable version of Weka; includes sample datasets)
Exploring the Explorer
• Install Weka
• Get datasets
– Convert .xls to .csv (in Excel: Save As -> CSV (MS-DOS))
– Convert .csv to .arff (in the Explorer: Open file -> select the .csv -> Edit to check it -> Ok -> Save with a .arff extension)
• Open Explorer
• Open a dataset (weather.nominal.arff)
• Look at attributes and their values
• Edit the dataset
• Save it?
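The .csv-to-.arff conversion above can also be scripted. A minimal sketch in Python (not part of Weka; file names and the `csv_to_arff` helper are illustrative) that treats every column as a nominal attribute:

```python
import csv

def csv_to_arff(csv_path, arff_path, relation="converted"):
    """Convert a CSV file with a header row into a nominal-only ARFF file."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write(f"@relation {relation}\n\n")
        for i, name in enumerate(header):
            # The distinct values of each column become the nominal value set
            values = sorted({row[i] for row in data})
            out.write(f"@attribute {name} {{{','.join(values)}}}\n")
        out.write("\n@data\n")
        for row in data:
            out.write(",".join(row) + "\n")
```

Real data may need numeric attributes or quoting; Weka's own loaders handle those cases, so this is only a sketch of the file format.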
Exploring datasets
• The classification problem (weather.numeric.arff)
• weather.nominal.arff, weather.numeric.arff
• Nominal vs. numeric attributes
Training and testing
• Use J48 to analyse the segment dataset
– Open file segment-challenge.arff
– Choose J48 decision tree learner
– Supplied test set segment-test.arff
– Run it: 96% accuracy
– Evaluate on training set: 99% accuracy
– Evaluate on percentage split: 95% accuracy
– Do it again: get exactly the same result!
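The "exactly the same result" comes from the fixed random seed used to shuffle the data before the percentage split. A small Python sketch (independent of Weka; `percentage_split` is a made-up helper) showing that the same seed reproduces the same split while a different seed does not:

```python
import random

def percentage_split(n_instances, train_pct=66, seed=1):
    """Shuffle instance indices with a fixed seed, then cut into train/test."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    cut = n_instances * train_pct // 100
    return indices[:cut], indices[cut:]

train_a, test_a = percentage_split(1500, train_pct=66, seed=1)
train_b, test_b = percentage_split(1500, train_pct=66, seed=1)  # same seed
train_c, test_c = percentage_split(1500, train_pct=66, seed=2)  # new seed
```

Runs a and b produce identical splits, so the classifier sees identical data and reports identical accuracy; run c shuffles differently, which is why changing the seed (next slide) changes the result.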
Repeated training and testing
• Evaluate J48 on segment-challenge
– With segment-challenge.arff …
– and J48
– Set percentage split to 90%
– Run it: 96.7% accuracy (seed = 1)
– Repeat with seed 2, 3, 4, 5, 6, 7, 8, 9, 10 -> 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
• Sample mean: x̄ = Σxᵢ / n = 0.949
• Standard deviation: σ = √( Σ(xᵢ − x̄)² / (n − 1) ) = 0.018
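These two statistics can be checked with a few lines of Python over the ten accuracies listed above (expect small rounding differences from the slide figures):

```python
from statistics import mean, stdev

# Accuracies from the ten percentage-split runs (seeds 1 to 10)
accuracies = [0.967, 0.940, 0.940, 0.967, 0.953,
              0.967, 0.920, 0.947, 0.933, 0.947]

x_bar = mean(accuracies)   # sample mean
s = stdev(accuracies)      # sample standard deviation (n - 1 denominator)
```

`stdev` uses the same n − 1 denominator as the formula above, which is the right choice when estimating spread from a sample of runs rather than the full population.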
Cross-validation
• 10-fold cross-validation
– Divide dataset into 10 parts (folds)
– Hold out each part in turn
– Average the results
– Each data point used once for testing, 9 times for training

• Stratified cross-validation: ensure that each fold has the right proportion of each class value
• Practical rule of thumb:
– Lots of data? – use percentage split
– Else stratified 10-fold cross-validation
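The fold-building step of stratified cross-validation can be sketched as follows (a simplified illustration, not Weka's actual implementation): group instances by class, then deal each group round-robin into the folds so every fold gets roughly the right class proportions.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign instance indices to k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    counter = 0
    for indices in by_class.values():
        for idx in indices:
            folds[counter % k].append(idx)  # deal each class round-robin
            counter += 1
    return folds
```

Each instance lands in exactly one fold; holding out each fold in turn then gives the "tested once, trained on 9 times" property described above.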
Clustering

• With clustering, there is no “class” attribute
• Try to divide the instances into natural groups, or “clusters”
• Example:
– Examine iris.arff in the Explorer
– Imagine deleting the class attribute
– Could you recover the classes by clustering the data?

[Images: Iris Setosa, Iris Versicolor, Iris Virginica]
Visualizing clusters

• Iris data (iris.arff), SimpleKMeans, specify 3 clusters
– 3 clusters with 50 instances each
• Visualize cluster assignments (right-click menu)
– Plot Cluster against Instance_number to see what the errors are
• Perfect? – surely not!
– Ignore the class attribute: 3 clusters, with 61, 50, 39 instances
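A bare-bones version of what SimpleKMeans does, as a Python sketch (illustrative only; Weka's implementation additionally normalizes attributes and handles nominal values):

```python
import random

def kmeans(points, k, iterations=20, seed=1):
    """Minimal k-means: random initial centroids, then alternate
    assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster)
                                     for c in zip(*cluster))
    return clusters
```

Like Weka's seed option, the random initial centroids mean different seeds can give different clusterings, which is one reason the 61/50/39 split above is not the "perfect" 50/50/50.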

• Which instances does a cluster contain?
– Use the AddCluster unsupervised attribute filter
– Try it with SimpleKMeans; Apply, then click Edit
• Hard to evaluate clustering
– It should really be evaluated with respect to an application
Association rules
• Weather data (weather.nominal.arff) has 336 rules with confidence 100%
– But only 8 have support >= 3; only 58 have support >= 2
• In Weka, specify the minimum confidence level (minMetric, default 90%) and the number of rules sought (numRules, default 10)
• Support is expressed as a proportion of the number of instances
• Weka runs the Apriori algorithm several times:
– starts at upperBoundMinSupport (usually left at 100%)
– decreases by delta at each iteration (default 5%)
– stops when numRules is reached …
– … or at lowerBoundMinSupport (default 10%)
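Support and confidence for a candidate rule can be computed directly. A Python sketch over the standard 14-instance weather data (the `support_confidence` helper is illustrative), using one of the 100%-confidence rules mentioned above, humidity=normal and windy=FALSE ⇒ play=yes:

```python
# The 14 instances of weather.nominal as (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ATTRS = ("outlook", "temperature", "humidity", "windy", "play")

def support_confidence(antecedent, consequent):
    """antecedent and consequent are dicts of attribute -> value."""
    def matches(row, conditions):
        return all(row[ATTRS.index(a)] == v for a, v in conditions.items())
    n_antecedent = sum(matches(r, antecedent) for r in DATA)
    n_both = sum(matches(r, {**antecedent, **consequent}) for r in DATA)
    # Support: instances matching the whole rule; confidence: n_both / n_antecedent
    return n_both, (n_both / n_antecedent if n_antecedent else 0.0)

sup, conf = support_confidence({"humidity": "normal", "windy": "FALSE"},
                               {"play": "yes"})
```

Four instances match the antecedent and all four have play=yes, so this rule has support 4 and confidence 100%, which is why it survives the support >= 3 cut above.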
Thank You!
