MCSL-223 Section 2 Data Mining Lab


SECTION 2 DATA MINING LAB
Structure
2.0 Introduction
2.1 Objectives
2.2 Introduction to WEKA
2.3 Latest Version and Downloads
2.4 Data Sets
2.5 Installation of WEKA
2.6 Features of WEKA Explorer
2.7 Data Preprocessing
2.8 Association Rule Mining
2.9 Classification
2.10 Clustering
2.11 Practical Sessions
2.12 Summary
2.13 Further Readings
2.14 Website References
2.15 Online Lab Resources

2.0 INTRODUCTION

This is the lab course in which you will gain hands-on experience. You have
studied the course material (MCS-221 Data Warehousing and Data Mining).
Along with the examples discussed in this section, a list of lab sessions to be
performed session-wise is given separately towards the end. Please go through
the general guidelines and the program documentation guidelines carefully.

2.1 OBJECTIVES
After going through this practical course, you will be able to:
 Understand how to handle data mining tasks using a data mining
toolkit (such as the open-source WEKA)
 Understand the various kinds of options available in WEKA
 Understand the data sets and data preprocessing.
 Demonstrate classification, clustering, etc. on large data sets.
 Demonstrate the working of algorithms for data mining tasks such as
association rule mining, classification, clustering and regression.
 Exercise the data mining techniques with varied input values for
different parameters.
 Emphasize hands-on experience working with real data sets.
 Be able to add mining algorithms as a component to existing tools.
 Be able to apply mining techniques to realistic data.

2.2 INTRODUCTION TO WEKA

WEKA is an open-source data mining/machine learning tool designed and
developed by the scientists/researchers at the University of Waikato, New
Zealand. WEKA stands for Waikato Environment for Knowledge Analysis. It is
developed with contributions from the international scientific community and
distributed under the free GNU GPL license.

WEKA is fully developed in Java. It provides integration with SQL databases
using Java Database Connectivity (JDBC). It provides many machine learning
algorithms to implement data mining tasks. These algorithms can either be
used directly through the WEKA tool or be called from other applications
using the Java programming language.

It provides a lot of tools for data preprocessing, classification, clustering,
regression analysis, association rule creation, feature extraction, and data
visualization. It is a powerful tool that supports the development of new
algorithms in machine learning.

With WEKA, the machine learning algorithms are readily available to the
users. The ML specialists can use these methods to extract useful information
from high volumes of data. Here, the specialists can create an environment to
develop new machine learning methods and implement them on real data.

WEKA is used by machine learning and applied sciences researchers for
learning purposes. It is an efficient tool for carrying out many data mining
tasks.

Advantages of WEKA include:

 Free availability under the GNU General Public License.


 Portability, since it is fully implemented in the Java programming
language and thus runs on almost any modern computing
platform.
 A comprehensive collection of data preprocessing and modeling
techniques.
 Ease of use due to its graphical user interfaces.

2.3 LATEST VERSION AND DOWNLOADS

There are two versions of WEKA: WEKA 3.8 is the latest stable version and
WEKA 3.9 is the development version. The stable version receives only bug
fixes and feature upgrades that do not break compatibility with its earlier
releases, while the development version may receive new features that break
compatibility with its earlier releases.

WEKA 3.8 and 3.9 feature a package management system that makes it easy
for the WEKA community to add new functionality to WEKA. The package
management system requires an internet connection in order to download and
install packages.

Stable version

WEKA 3.8 is the latest stable version of WEKA. This branch of WEKA only
receives bug fixes and upgrades that do not break compatibility with earlier 3.8
releases, although major new features may become available in packages.
There are different options for downloading and installing it on your system:

Windows

Use https://sourceforge.net/projects/weka/files/weka-3-8/3.8.5/weka-3-8-5-azul-zulu-windows.exe/download?use_mirror=nchc
to download a self-extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK
Java VM 11 (weka-3-8-5-azul-zulu-windows.exe; 124.6 MB).
This executable will install WEKA in your Program Menu. Launching via the
Program Menu or shortcuts will automatically use the included JVM to run
WEKA.

Linux

Use https://sourceforge.net/projects/weka/files/weka-3-8/3.8.5/weka-3-8-5-azul-zulu-linux.zip/download?use_mirror=nchc
to download a zip archive for Linux that includes Azul's 64-bit OpenJDK Java VM 11
(weka-3-8-5-azul-zulu-linux.zip; 137.4 MB).

First unzip the zip file. This will create a new directory called weka-3-8-5. To
run Weka, change into that directory and type:
./weka.sh

WEKA packages

There is a list of packages for WEKA that can be installed using the built-in
package manager. Javadoc for a package is available at
https://weka.sourceforge.io/doc.packages/ followed by the name of the
package.

Requirements

The latest official releases of WEKA require Java 8 or later. Note that if you
are using Windows and your computer has a display with high pixel density
(HiDPI), you may need to use Java 9 or later to avoid problems with
inappropriate scaling of WEKA's graphical user interfaces.

2.4 DATASETS

Below are some sample WEKA data sets available in .arff format.
 airline.arff
 breast-cancer.arff
 contact-lens.arff
 cpu.arff
 cpu.with-vendor.arff
 credit-g.arff
 diabetes.arff
 glass.arff
 hypothyroid.arff
 ionosphere.arff
 iris.2D.arff
 iris.arff
 labor.arff
 ReutersCorn-train.arff
 ReutersCorn-test.arff
 ReutersGrain-train.arff
 ReutersGrain-test.arff
 segment-challenge.arff
 segment-test.arff
 soybean.arff
 supermarket.arff
 unbalanced.arff
 vote.arff
 weather.numeric.arff
 weather.nominal.arff

The WEKA datasets can be explored from the “C:\Program Files\Weka-3-8\data”
folder. The datasets are in .arff format.

Miscellaneous Collections of Datasets can be Downloaded

 A jarfile containing 37 classification problems originally obtained from
the UCI repository of machine learning datasets is available at
http://archive.ics.uci.edu/ml/index.php .

 A jarfile containing 37 regression problems obtained from various sources
(datasets-numeric.jar) available at
https://sourceforge.net/projects/weka/files/datasets/datasets-numeric/datasets-numeric.jar/download?use_mirror=netix .

 A jarfile containing 6 agricultural datasets obtained from agricultural


researchers in New Zealand (agridatasets.jar, 31,200 Bytes).

 A jarfile containing 30 regression datasets collected by Professor Luis


Torgo (regression-datasets.jar, 10,090,266 Bytes).

 A gzip'ed tar containing UCI ML datasets at
http://archive.ics.uci.edu/ml/index.php and UCI KDD datasets at
https://kdd.ics.uci.edu/ .

 A gzip'ed tar containing StatLib datasets at
https://sourceforge.net/projects/weka/files/datasets/UCI%20and%20StatLib/statlib-20050214.tar.gz/download?use_mirror=kumisystems .

 A gzip'ed tar containing ordinal, real-world datasets donated by Professor
Arie Ben David at
https://sourceforge.net/projects/weka/files/datasets/regression-datasets/datasets-arie_ben_david.tar.gz/download?use_mirror=pilotfiber .

 A zip file containing 19 multi-class (1-of-n) text datasets donated by Dr
George Forman, available at
https://sourceforge.net/projects/weka/files/datasets/text-datasets/19MclassTextWc.zip/download?use_mirror=netactuate
(19MclassTextWc.zip, 14,084,828 Bytes).

 A bzip'ed tar file containing the Reuters21578 dataset split into separate
files according to the ModApte split reuters21578-ModApte.tar.bz2,
81,745,032 Bytes.

 A zip file containing 41 drug design datasets formed using
the Adriana.Code software donated by Dr Mehmet Fatih Amasyali at
https://sourceforge.net/projects/weka/files/datasets/Drug%20design%20datasets/Drug-datasets.zip/download?use_mirror=netix .

 A zip file containing 80 artificial datasets generated from the Friedman
function donated by Dr. M. Fatih Amasyali (Yildiz Technical University)
(Friedman-datasets.zip, 5,802,204 Bytes).

 A zip file containing a new, image-based version of the classic iris data,
with 50 images for each of the three species of iris. The images have size
600x600. Please see the ARFF file for further information (iris_reloaded.zip,
92,267,000 Bytes). After expanding into a directory using your jar utility (or
an archive program that handles tar-archives/zip files in case of the gzip'ed
tars/zip files), these datasets may be used with WEKA.

2.5 INSTALLATION OF WEKA

1. Download the software from
https://sourceforge.net/projects/weka/files/weka-3-8/3.8.5/weka-3-8-5-azul-zulu-windows.exe/download?use_mirror=nchc
for the WINDOWS operating system. Check the configuration of the computer
system and download the stable version of WEKA (currently 3.8).

2. After successful download, open the file location and double click on
the downloaded file. The Setup wizard will appear. Click on Next.

3. The License Agreement terms will open. Read it thoroughly and click
on “I Agree”.

4. According to your requirements, select the components to be installed.
Full component installation is recommended. Click on Next.

5. Select the destination folder and click on Next.

6. Then, Installation will start.

7. After installation is complete, the WEKA tool and Explorer window
open as shown in Figure 1.1.

Figure 1.1 WEKA GUI Interface

The GUI of WEKA gives five options: Explorer, Experimenter, KnowledgeFlow,
Workbench, and Simple CLI. Let us understand each of these individually.

1. Simple CLI is WEKA Shell with command line and output. With
“help”, the overview of all the commands can be seen. Simple CLI
offers access to all classes such as classifiers, clusters, and filters, etc.
2. Explorer is an environment to explore the data. The WEKA Explorer
window shows different tabs, starting with Preprocess. Initially, the
Preprocess tab is active, as the data set is first preprocessed before
algorithms are applied to it and the dataset is explored.
3. Experimenter is an environment to make experiments and statistical
tests between learning schemes. The WEKA experimenter button allows
the users to create, run, and modify different schemes in one experiment
on a dataset. The experimenter has 2 types of configuration: Simple and
Advanced. Both configurations allow users to run experiments locally
and on remote computers.
4. KnowledgeFlow is a Java-Beans based interface for setting up and
running machine learning experiments. KnowledgeFlow shows a graphical
representation of WEKA algorithms. The user can select the
components and create a workflow to analyze the datasets. The data can
be handled batch-wise or incrementally. Parallel workflows can be
designed and each will run in a separate thread. The different
components available are Datasources, Datasavers, Filters, Classifiers,
Clusterers, Evaluation, and Visualization.
5. Workbench is a module that combines all of the GUIs in a single
window.

2.6 FEATURES OF WEKA EXPLORER

In this section we will learn about the features of the WEKA Explorer.

2.6.1 Dataset

A dataset is made up of items, each representing an object; for example, in a
marketing database, an item will represent customers and products. The dataset
is described by attributes, which can be nominal, numeric, or string, and it
contains the data tuples of a database. In WEKA, the dataset is represented by
the weka.core.Instances class. Representation of a dataset with 5 examples:
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
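
As mentioned above, WEKA represents a dataset with the weka.core.Instances class, which can also be used programmatically. The following is a minimal sketch (assuming weka.jar is on the classpath; the file path is illustrative, taken from the installation directory described earlier):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file into an Instances object (path is illustrative)
        Instances data = DataSource.read("C:/Program Files/Weka-3-8/data/weather.numeric.arff");

        // The last attribute is commonly used as the class attribute
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}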

Attribute and its Types

An attribute is a data field representing the characteristic of a data object. For
example, in a customer database, the attributes will be customer_id,
customer_email, customer_address, etc. Attributes have different types:

(i) Nominal Attributes: Attribute which relates to a name and has predefined
values such as color, weather. These attributes are called categorical attributes.
These attributes do not have any order and their values are also called
enumerations.
@attribute outlook {sunny, overcast, rainy}: declaration of the nominal
attribute.
(ii) Binary Attributes: These attributes represent only values 0 and 1. These
are the type of nominal attributes with only 2 categories. These attributes are
also called Boolean.
(iii) Ordinal Attributes: The attributes which preserve some order or ranking
amongst them are ordinal attributes. Successive values cannot be predicted but
only order is maintained. Example: size, grade, etc.
(iv) Numeric Attributes: Attributes representing measurable quantities are
numeric attributes. These are represented by real numbers or
integers. Example: temperature, humidity.
@attribute humidity real: declaration of a numeric attribute
(v) String Attributes: These attributes represent a list of characters represented
in double-quotes.
2.6.2 ARFF Data Format

WEKA works on the ARFF file for data analysis. ARFF stands for Attribute
Relation File Format. It has 3 sections: relation, attributes, and data. Every
section starts with “@”.

ARFF files take Nominal, Numeric, String, Date, and Relational data attributes.
Some of the well-known machine learning datasets are present in WEKA as
ARFF files.

The format for ARFF is:

@relation <relation name>
@attribute <attribute name> <data type>
@data

An example of an ARFF file is:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}   % class attribute: represents the output

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes

CSV - Comma Separated Values

CSV is a very old, very simple and very common “standard” for (tabular) data.
We say “standard” in quotes because there was never a formal standard for
CSV, though in 2005 someone did put together an RFC for it.

CSV is supported by a huge number of tools from spreadsheets like Excel,
Open Office and Google Docs to complex databases to almost all
programming languages. As such it is probably the most widely supported
structured data format in the world.

CSV Format

Key points are:
 CSV is probably the simplest possible structured format for data
 CSV strikes a delicate balance, remaining readable by both machines & humans
 CSV is a two dimensional structure consisting of rows of data, each row
containing multiple cells. Rows are (usually) separated by line terminators so
each row corresponds to one line. Cells within a row are separated by
commas (hence the C(ommas) part)
 Note that strictly we’re really talking about DSV files, in that we can allow
‘delimiters’ between cells other than a comma. However, many people and
many programs still call such data CSV (since the comma is so common as the
delimiter)
 CSV is a “text-based” format, i.e. a CSV file is a text file. This makes it
amenable for processing with all kinds of text-oriented tools (from text editors
to unix tools like sed, grep, etc.)

What a CSV looks like?

If you open up a CSV file in a text editor it would look something like:
A,B,C
1,2,3
4,"5,3",6

Here there are 3 rows each of 3 columns. Notice how the second column in the
last line is “quoted” because the content of that value actually contains a “,”
character.

Without the quotes this character would be interpreted as a column separator.


To avoid this confusion we put quotes around the whole value. The result is
that we have 3 rows each of 3 columns (Note a CSV file does not have to have
the same number of columns in each row).
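
WEKA can read such CSV files directly. The following is a minimal sketch (the file names are hypothetical) that uses WEKA's CSVLoader to read a CSV file and ArffSaver to write it back out in ARFF format:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read a CSV file (the first row is taken as the header)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv"));
        Instances data = loader.getDataSet();

        // Write the same data out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("data.arff"));
        saver.writeBatch();
    }
}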

2.6.3 XRFF Data Format

XRFF stands for the XML attribute Relation File Format. It represents data in a
format that can store comments, attributes, and instance weights. It uses the
.xrff extension and the .xrff.gz (compressed format) file extension. XRFF files
represent data in XML format.

2.6.4 Database Connectivity

With WEKA, it is easy to connect to a database using a JDBC driver. A JDBC
driver is necessary to connect to the database, for example:
MS SQL Server (com.microsoft.jdbc.sqlserver.SQLServerDriver)
Oracle (oracle.jdbc.driver.OracleDriver)
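
As a hedged sketch of what loading data over JDBC can look like with WEKA's InstanceQuery class (the connection URL, credentials and query below are placeholders; the JDBC driver must also be registered in WEKA's DatabaseUtils.props file):

import weka.core.Instances;
import weka.experiment.InstanceQuery;

public class DatabaseLoad {
    public static void main(String[] args) throws Exception {
        // Pull a result set from a relational database into an Instances object
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:oracle:thin:@localhost:1521:orcl"); // placeholder URL
        query.setUsername("scott");                                    // placeholder credentials
        query.setPassword("tiger");
        query.setQuery("SELECT * FROM customers");                     // placeholder query
        Instances data = query.retrieveInstances();
        System.out.println(data.numInstances() + " rows retrieved");
    }
}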

2.6.5 Classifiers

To predict the output data, WEKA contains classifiers. The classification
algorithms available for learning include decision trees, support vector machines,
instance-based classifiers, logistic regression, and Bayesian networks.
Depending upon the requirement, using trial and test, the user can find a
suitable algorithm for the analysis of the data. Classifiers are used to classify the
data sets based on the characteristics of the attributes.

2.6.6 Clustering

WEKA uses the Cluster tab to predict the similarities in the dataset. Based on
clustering, the user can find out the attributes useful for analysis and ignore
other attributes. The available algorithms for clustering in WEKA are k-means,
EM, Cobweb, X-means, and FarthestFirst.

22
DATA MINING
LAB
2.6.7 Association

The only algorithm available in WEKA for finding out association rules is
Apriori.

2.6.8 Attribute Selection Measures

WEKA uses two approaches to select the best attributes:
 Search method algorithms: best-first, forward selection, random,
exhaustive, genetic algorithm, and ranking algorithm.
 Evaluation method algorithms: correlation-based, wrapper,
information gain, chi-squared.

2.6.9 Visualization

WEKA supports the 2D representation of data, 3D visualizations with rotation,
and 1D representation of a single attribute. It has the “Jitter” option for nominal
attributes and “hidden” data points.

2.7 DATA PREPROCESSING

2.7.1 Discretization

Discretization of numerical data is one of the most influential data preprocessing
tasks in knowledge discovery and data mining.

Discretization Process – In supervised learning, and specifically in classification,
we discretize the data in the columns to enable the algorithms to produce
a mining model. Discretization is the process of putting values into buckets so
that there are a limited number of possible states. The buckets themselves are
treated as ordered and discrete values.

 Start weka select the explorerselect the dataset (weather.nominal).


 Choose discretize filter.


Synopsis

An instance filter that discretizes a range of numeric attributes in the dataset
into nominal attributes. Discretization is by Fayyad & Irani's MDL method (the
default).

Options

attributeIndices -- Specify range of attributes to act on. This is a comma
separated list of attribute indices, with "first" and "last" valid values. Specify an
inclusive range with "-". Example: "first-3,5,6-10,last".
invertSelection -- Set attribute selection mode. If false, only selected (numeric)
attributes in the range will be discretized; if true, only non-selected attributes
will be discretized.
makeBinary -- Make resulting attributes binary.
useBetterEncoding -- Uses a more efficient split point encoding.
useKononenko -- Use Kononenko's MDL criterion. If set to false uses the
Fayyad & Irani criterion.
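
The same supervised Discretize filter can also be applied from Java. A minimal sketch is shown below (the dataset path is hypothetical; a class attribute must be set because the filter is supervised):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);   // supervised filter needs a class

        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("first-last");   // act on all (numeric) attributes
        discretize.setUseBetterEncoding(true);          // more efficient split point encoding
        discretize.setInputFormat(data);

        Instances discretized = Filter.useFilter(data, discretize);
        System.out.println(discretized);
    }
}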

Select the outlook attribute based on class temperature to visualize below:

Select the temperature attribute based on class temperature to visualize below:

Select the humidity attribute based on class temperature to visualize below:

Select the windy attribute based on class temperature to visualize below:

Select the play attribute based on class temperature to visualize below:

Comparison of all visualize results:

2.7.2 Random Sampling

This filter produces a random subsample of a dataset using either sampling with
replacement or without replacement. Alternatively, a random subsample of a
dataset can be produced using the reservoir sampling Algorithm “R” by Vitter.

The original dataset must fit entirely in memory. The number of instances in the
generated dataset may be specified. The dataset must have a nominal class
attribute. If not, use the unsupervised version. The filter can be made to
maintain the class distribution in the subsample, or to bias the class distribution
toward a uniform distribution. When used in batch mode subsequent batches are
NOT re-sampled.

Start weka select the explorerselect the dataset (weather.nominal).


Choose resample filter.


Synopsis
Produces a random subsample of a dataset using either sampling with
replacement or without replacement. The original dataset must fit entirely in
memory. The number of instances in the generated dataset may be specified.
The dataset must have a nominal class attribute. If not, use the unsupervised
version. The filter can be made to maintain the class distribution in the
subsample, or to bias the class distribution toward a uniform distribution. When
used in batch mode (i.e. in the Filtered Classifier), subsequent batches are NOT
resampled.

Options

randomSeed -- Sets the random number seed for subsampling.
biasToUniformClass -- Whether to use bias towards a uniform class. A value of
0 leaves the class distribution as-is, a value of 1 ensures the class distribution is
uniform in the output data.

debug -- If set to true, filter may output additional info to the console.
noReplacement -- Disables the replacement of instances.
doNotCheckCapabilities -- If set, filters capabilities are not checked before
filter is built (Use with caution to reduce runtime).
sampleSizePercent -- The subsample size as a percentage of the original set.
invertSelection -- Inverts the selection (only if instances are drawn WITHOUT
replacement).
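
A corresponding Java sketch for the supervised Resample filter is given below (the dataset path is hypothetical; a nominal class attribute must be set):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ResampleExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // nominal class required

        Resample resample = new Resample();
        resample.setRandomSeed(1);
        resample.setSampleSizePercent(50.0);            // keep 50% of the instances
        resample.setNoReplacement(true);                // sample without replacement
        resample.setBiasToUniformClass(0.0);            // 0 = keep the original class distribution
        resample.setInputFormat(data);

        Instances subsample = Filter.useFilter(data, resample);
        System.out.println("Original: " + data.numInstances()
                + ", subsample: " + subsample.numInstances());
    }
}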

Select the outlook attribute based on class outlook (Nom) to visualize below:


Select the temperature attribute based on class outlook (Nom) to visualize below:

Select the humidity attribute based on class outlook (Nom) to visualize below:

Select the windy attribute based on class outlook (Nom) to visualize below:


Select the play attribute based on class outlook (Nom) to visualize below:

Comparison of all visualize results:

2.7.3 Attribute Selection in WEKA

Attribute selection, or variable subset selection, is the process of selecting a
subset of relevant features. For this, there are four algorithms, as follows:
 CFS Subset eval algorithm.
 Information gain algorithm.
 Correlation attribute eval algorithm.
 Gain ratio attribute eval algorithm.

Start-->weka-->select Explorer-->select dataset (Buys Computers)

2.7.3.1 CFS Subset Evaluated Algorithm

It evaluates the worth of a subset of attributes by considering the individual
predictive ability of each feature along with the degree of redundancy
between them.

Subsets of features that are highly correlated with the class while having low
inter-correlation are preferred.
Options

numThreads: The number of threads to use, which should be >= size of the
thread pool.

Debug: Output debugging info.

MissingSeparate: Treat missing as a separate value. Otherwise, counts for
missing values are distributed across other values in proportion to their
frequency.

PoolSize: The size of the thread pool, for example, the number of cores in the
CPU.

DoNotCheckCapabilities: If set, evaluator capabilities are not checked before
evaluator is built (Use with caution to reduce runtime).

PreComputeCorrelationMatrix: Precompute the full correlation matrix at
the outset, rather than computing correlations lazily (as needed) during the
search. Use this in conjunction with parallel processing in order to speed up a
backward search.

LocallyPredictive: Identify locally predictive attributes. Iteratively adds
attributes with the highest correlation with the class as long as there is not
already an attribute in the subset that has a higher correlation with the attribute
in question.

weka.attributeSelection.CfsSubsetEval

Attribute selection CfsSubsetEval Output

=== Run information ===


Evaluator: weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search: weka.attributeSelection.BestFirst -D 1 -N 5
Relation: dessision.arff
Instances: 14
Attributes: 6
Rid
Age
Income
student
Credit-Rating
Class: Buys-computers
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 19
Merit of best subset found: 0.247
Attribute Subset Evaluator (supervised, Class (nominal): 6 Class: Buys-
computers):
CFS Subset Evaluator
Including locally predictive attributes
Selected attributes: 2, 4 : 2
Age
student
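
The same evaluation can be run from Java with WEKA's attribute selection classes. The following is a minimal sketch (the dataset file name follows the run above; the class is assumed to be the last attribute):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dessision.arff");
        data.setClassIndex(data.numAttributes() - 1);   // Buys-computers as the class

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());     // subset evaluator
        selector.setSearch(new BestFirst());            // forward best-first search
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString()); // same report as the Explorer output
    }
}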

2.7.3.2 Information Gain Algorithm

weka.attributeSelection.InfoGainAttributeEval

Algorithm: Information Gain Algorithm

InfoGainAttributeEval:
Evaluates the worth of an attribute by measuring the information gain with
respect to the class.
InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute).

Options
missingMerge -- Distribute counts for missing values. Counts are distributed
across other values in proportion to their frequency. Otherwise, missing is
treated as a separate value.
binarizeNumericAttributes -- Just binarize numeric attributes instead of
properly discretizing them.
doNotCheckCapabilities -- If set, evaluator capabilities are not checked before
evaluator is built (Use with caution to reduce runtime).

Attribute selection output


=== Run information ===
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N
-1
Relation: dessision.arff
Instances: 14
Attributes: 6
Evaluation mode: evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
Information Gain Ranking Filter
Ranked attributes:
0.2467 2 Age
0.1518 4 student
0.0481 5 Credit-Rating
0.0292 3 Income
0 1 Rid
Selected attributes: 2, 4, 5, 3, 1 : 5
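
Ranking-based evaluators such as InfoGainAttributeEval are paired with the Ranker search instead of BestFirst. A minimal Java sketch is shown below (same hypothetical dataset as above; swapping in CorrelationAttributeEval or GainRatioAttributeEval works the same way):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dessision.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // single-attribute evaluator
        selector.setSearch(new Ranker());                   // ranks all attributes by merit
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
    }
}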

2.7.3.3 Correlation Attribute Evaluated Algorithm



weka.attributeSelection.CorrelationAttributeEval

CorrelationAttributeEval

Evaluates the worth of an attribute by measuring the correlation (Pearson's)


between it and the class.
Nominal attributes are considered on a value by value basis by treating each
value as an indicator. An overall correlation for a nominal attribute is arrived
at via a weighted average.

Options

OutputDetailedInfo: Output per value correlation for nominal attributes.

DoNotCheckCapabilities: If set, evaluator capabilities are not checked before
evaluator is built (Use with caution to reduce runtime).

Attribute selection correlation attribute eval


Output:

=== Run information ===


Evaluator: weka.attributeSelection.CorrelationAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: dessision.arff
Instances: 14
Attributes: 6
Evaluation mode:
evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
Correlation Ranking Filter

Ranked attributes:
0.447 4 student
0.293 2 Age
0.258 5 Credit-Rating
0.24 1 Rid
0.113 3 Income
Selected attributes: 4, 2, 5, 1, 3 : 5

2.7.3.4 Gain Ratio Attribute Evaluated Algorithm:

weka.attributeSelection.GainRatioAttributeEval

GainRatioAttributeEval

Evaluates the worth of an attribute by measuring the gain ratio with respect to
the class.

GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).

Options
MissingMerge: Distribute counts for missing values. Counts are distributed
across other values in proportion to their frequency. Otherwise, missing is
treated as a separate value.

DoNotCheckCapabilities: If set, evaluator capabilities are not checked
before evaluator is built (Use with caution to reduce runtime).

Attribute selection gain ratio attribute algorithm output

=== Run information ===


Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N
-1
Relation: dessision.arff
Instances: 14
Attributes: 6

Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===


Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 6 Class: Buys-computers):
Gain Ratio feature evaluator

Ranked attributes:
0.1564 2 Age
0.1518 4 student
0.0488 5 Credit-Rating
0.0188 3 Income
0 1 Rid

Selected attributes: 2, 4, 5, 3, 1 : 5

2.8 ASSOCIATION RULE MINING

Association rule mining is a procedure which is meant to find frequent
patterns, correlations, associations, or causal structures from data sets found in
various kinds of databases such as relational databases, transactional databases,
and other forms of data repositories.

2.8.1 Apriori Algorithm

Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases. It proceeds by identifying the frequent
individual items in the database and extending them to larger and larger item
sets as long as those item sets appear sufficiently often in the database. The
frequent item sets determined by Apriori can be used to derive association
rules which highlight general trends in the database.

The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori is
designed to operate on databases containing transactions. Each transaction is
seen as a set of items. Given a support threshold C, the Apriori algorithm
identifies the item sets which are subsets of at least C transactions in the database.

Apriori uses breadth-first search and a Hash tree structure to count candidate
item sets efficiently. It generates candidate item sets of length k from item sets
of length k-1. Then it prunes the candidates which have an infrequent sub
pattern.

Pseudo code:
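As a rough illustration of the level-wise procedure described above, the following is a simplified Java sketch of frequent item set generation. It is an illustrative simplification, not WEKA's implementation: the classic prune step that checks all (k-1)-subsets is omitted, and support counting alone filters the candidates.

import java.util.*;

/** Simplified, illustrative Apriori sketch (not WEKA's implementation). */
public class AprioriSketch {

    /** Returns all item sets appearing in at least minSupportCount transactions. */
    static Map<Set<String>, Integer> apriori(List<Set<String>> transactions, int minSupportCount) {
        Map<Set<String>, Integer> frequent = new HashMap<>();

        // Level 1: count individual items and keep the frequent ones
        Map<Set<String>, Integer> current = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                current.merge(new HashSet<>(Collections.singleton(item)), 1, Integer::sum);
        current.values().removeIf(count -> count < minSupportCount);

        while (!current.isEmpty()) {
            frequent.putAll(current);

            // Candidate generation: join frequent (k-1)-item sets to form k-item sets
            List<Set<String>> prev = new ArrayList<>(current.keySet());
            int k = prev.get(0).size() + 1;
            Set<Set<String>> candidates = new HashSet<>();
            for (int i = 0; i < prev.size(); i++)
                for (int j = i + 1; j < prev.size(); j++) {
                    Set<String> union = new HashSet<>(prev.get(i));
                    union.addAll(prev.get(j));
                    if (union.size() == k) candidates.add(union);
                }

            // Count candidate support with one pass over the database, then filter
            Map<Set<String>, Integer> next = new HashMap<>();
            for (Set<String> t : transactions)
                for (Set<String> c : candidates)
                    if (t.containsAll(c)) next.merge(c, 1, Integer::sum);
            next.values().removeIf(count -> count < minSupportCount);

            current = next;   // move to the next level (larger item sets)
        }
        return frequent;
    }
}

For the 14-instance weather data used in the runs below, a minSupportCount of 2 corresponds to the "Minimum support: 0.15 (2 instances)" reported by WEKA.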

In metric type there are four metrics:


 Confidence
 Lift
 Leverage
 Conviction

Useful Concepts
To select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest can be used. The best-known
constraints are minimum thresholds on support and confidence.

(i) Support
The support supp(X) of an item set X is defined as the proportion of
transactions in the data set which contain the item set:
supp(X) = (number of transactions which contain the item set X) / (total number of transactions)

(ii) Confidence
The confidence of a rule is defined as:
conf(X -> Y) = supp(X ∪ Y) / supp(X)

(iii) Lift
The lift of a rule is defined as:
lift(X -> Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

(iv) Conviction
The conviction of a rule is defined as:
conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))
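
As a quick worked example using the 14-instance weather data mined below: for the rule outlook=overcast ==> play=yes, supp = 4/14 ≈ 0.29, conf = 4/4 = 1, and lift = conf / supp(play=yes) = 1 / (9/14) ≈ 1.56, which matches the lift WEKA reports for this rule in the runs that follow.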

Options

car: If enabled class association rules are mined instead of (general) association
rules.
classIndex: Index of the class attribute. If set to -1, the last attribute is taken as
class attribute.
delta: Iteratively decrease support by this factor. Reduces support until min
support is reached or required number of rules has been generated.
LowerBoundMinSupport: Lower bound for minimum support.
MetricType: Set the type of metric by which to rank rules. Confidence is the
proportion of the examples covered by the premise that are also covered by the
consequence (Class association rules can only be mined using confidence). Lift
is confidence divided by the proportion of all examples that are covered by the
consequence. This is a measure of the importance of the association that is
independent of support. Leverage is the proportion of additional examples
covered by both the premise and consequence above those expected if the
premise and consequence were independent of each other. The total number of
examples that this represents is presented in brackets following the leverage.
Conviction is another measure of departure from independence.
MinMetric: Minimum metric score. Consider only rules with scores higher
than this value.
NumRules: Number of rules to find.
OutputItemSets: If enabled the item sets are output as well.
RemoveAllMissingCols: Remove columns with all missing values.
SignificanceLevel: Significance level. Significance test (confidence metric
only).
UpperBoundMinSupport: Upper bound for minimum support. Start
iteratively decreasing minimum support from this value.
Verbose: If enabled the algorithm will be run in verbose mode.
How to open the Apriori Algorithm in WEKA

Start WEKA --> select the Explorer --> open the file (weather.nominal) -->
select the Associate tab --> choose the algorithm (Apriori) --> click on Start.
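
The same run can also be scripted with WEKA's Java API. The sketch below simply reuses the option string reported in the run information that follows (the dataset path is hypothetical):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRun {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.nominal.arff");

        Apriori apriori = new Apriori();
        // Same options as the Explorer run: 10 rules, ranked by confidence >= 0.9
        apriori.setOptions(Utils.splitOptions("-N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1"));
        apriori.buildAssociations(data);

        System.out.println(apriori);   // prints the "Best rules found" report
    }
}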

(a) Using 10 numrules by using confidence

=== Run information ===


Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large item sets:
Size of set of large itemsetsL(1): 12
Size of set of large itemsetsL(2): 47
Size of set of large itemsetsL(3): 39
Size of set of large itemsetsL(4): 6
Best rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)

(b) Using 10 numrules by using Lift


=== Run information ===


Scheme: weka.associations.Apriori -N 10 -T 1 -C 1.1 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
=== Associator model (full training set) ===
Minimum support: 0.3 (4 instances)
Minimum metric <lift>: 1.1
Number of cycles performed: 14
Generated sets of large item sets:
Size of set of large itemsetsL(1): 12
Size of set of large itemsetsL(2): 9
Size of set of large itemsetsL(3): 1
Best rules found:
1. temperature=cool 4 ==> humidity=normal 4 conf:(1) < lift:(2)>lev:(0.14)
[2] conv:(2)
2. humidity=normal 7 ==> temperature=cool 4 conf:(0.57) <
lift:(2)>lev:(0.14) [2] conv:(1.25)
3. humidity=high 7 ==> play=no 4 conf:(0.57) < lift:(1.6)>lev:(0.11) [1]
conv:(1.13)
4. play=no 5 ==> humidity=high 4 conf:(0.8) < lift:(1.6)>lev:(0.11) [1]
conv:(1.25)
5. outlook=overcast 4 ==> play=yes 4 conf:(1) < lift:(1.56)>lev:(0.1) [1]
conv:(1.43)
6. play=yes 9 ==> outlook=overcast 4 conf:(0.44) < lift:(1.56)>lev:(0.1) [1]
conv:(1.07)
7. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1) <
lift:(1.56)>lev:(0.1) [1] conv:(1.43)
8. play=yes 9 ==> humidity=normal windy=FALSE 4 conf:(0.44) <
lift:(1.56)>lev:(0.1) [1] conv:(1.07)
9. humidity=normal 7 ==> play=yes 6 conf:(0.86) < lift:(1.33)>lev:(0.11) [1]
conv:(1.25)

10. play=yes 9 ==> humidity=normal 6 conf:(0.67) < lift:(1.33)>lev:(0.11) [1] conv:(1.13)

(c) Using 10 numrules by using Leverage

=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 2 -C 0.1 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5

=== Associator model (full training set) ===

Minimum support: 0.3 (4 instances)


Minimum metric <leverage>: 0.1
Number of cycles performed: 14
Generated sets of large item sets:
Size of set of large itemsetsL(1): 12
Size of set of large itemsetsL(2): 9
Size of set of large itemsetsL(3): 1

Best rules found:

1. temperature=cool 4 ==> humidity=normal 4 conf:(1) lift:(2) <lev:(0.14) [2]>conv:(2)
2. humidity=normal 7 ==> temperature=cool 4 conf:(0.57) lift:(2) <lev:(0.14)
[2]>conv:(1.25)
3. humidity=normal 7 ==> play=yes 6 conf:(0.86) lift:(1.33) <lev:(0.11)
[1]>conv:(1.25)
4. play=yes 9 ==> humidity=normal 6 conf:(0.67) lift:(1.33) <lev:(0.11)
[1]>conv:(1.13)
5. humidity=high 7 ==> play=no 4 conf:(0.57) lift:(1.6) <lev:(0.11)
[1]>conv:(1.13)
6. play=no 5 ==> humidity=high 4 conf:(0.8) lift:(1.6) <lev:(0.11)
[1]>conv:(1.25)
7. outlook=overcast 4 ==> play=yes 4 conf:(1) lift:(1.56) <lev:(0.1)
[1]>conv:(1.43)
8. play=yes 9 ==> outlook=overcast 4 conf:(0.44) lift:(1.56) <lev:(0.1)
[1]>conv:(1.07)
9. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1) lift:(1.56)
<lev:(0.1) [1]>conv:(1.43)
10. play=yes 9 ==> humidity=normal windy=FALSE 4 conf:(0.44) lift:(1.56)
<lev:(0.1) [1]>conv:(1.07)


(d) Using 10 numrules by using Conviction

=== Run information ===


Scheme: weka.associations.Apriori -N 10 -T 3 -C 1.1 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
=== Associator model (full training set) ===
Minimum support: 0.25 (3 instances)
Minimum metric <conviction>: 1.1
Number of cycles performed: 15
Generated sets of large item sets:
Size of set of large itemsetsL(1): 12
Size of set of large itemsetsL(2): 26
Size of set of large itemsetsL(3): 4
Best rules found:
1. temperature=cool 4 ==> humidity=normal 4 conf:(1) lift:(2) lev:(0.14) [2]
<conv:(2)>
2. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1) lift:(2.8)
lev:(0.14) [1] <conv:(1.93)>
3. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1) lift:(2) lev:(0.11)
[1] <conv:(1.5)>
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1) lift:(2)
lev:(0.11) [1] <conv:(1.5)>
5. outlook=overcast 4 ==> play=yes 4 conf:(1) lift:(1.56) lev:(0.1) [1]
<conv:(1.43)>
6. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1) lift:(1.56)
lev:(0.1) [1] <conv:(1.43)>
7. play=no 5 ==> outlook=sunny humidity=high 3 conf:(0.6) lift:(2.8)
lev:(0.14) [1] <conv:(1.31)>
8. humidity=high play=no 4 ==> outlook=sunny 3 conf:(0.75) lift:(2.1)
lev:(0.11) [1] <conv:(1.29)>
9. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1) lift:(1.75)
lev:(0.09) [1] <conv:(1.29)>
10. humidity=normal 7 ==> play=yes 6 conf:(0.86) lift:(1.33) lev:(0.11) [1]
<conv:(1.25)>

2.9 CLASSIFICATION

The concept of classification is basically distributing data among the various
classes defined on a data set. Classification algorithms learn this form of
distribution from a given training set and then try to classify test data correctly
when the class is not specified. The values that specify these classes on the
dataset are given a label name and are used to determine the class of data
during the test.

2.9.1 Classification Algorithms in WEKA

In this section let us discuss five classification algorithms in WEKA. For each
algorithm we briefly cover how it works, its key parameters, and a
demonstration using the WEKA Explorer interface.
The five classification algorithms are:

 Logistic Regression
 Naive Bayes
 Decision Tree
 k-Nearest Neighbors
 Support Vector Machines

A standard classification problem will be used to demonstrate each algorithm,
specifically, the Ionosphere binary classification problem. This is a good dataset
to demonstrate classification algorithms because the input variables are numeric
and all have the same scale, and the problem has only two classes to discriminate.

Each instance describes the properties of radar returns from the atmosphere and
the task is to predict whether or not there is structure in the ionosphere.
There are 34 numerical input variables of generally the same scale. You can
learn more about this dataset on the UCI Machine Learning Repository. Top
results are in the order of 98% accuracy.

Start the WEKA Explorer:


1. Open the Weka GUI Chooser.
2. Click the “Explorer” button to open the WEKA Explorer.
3. Load the Ionosphere dataset from the data/ionosphere.arff file.
4. Click “Classify” to open the Classify tab.

2.9.1.1 Logistic Regression

Logistic regression is a binary classification algorithm. It assumes the input
variables are numeric and have a Gaussian (bell curve) distribution. This last
point does not have to be true, as logistic regression can still achieve good
results if your data is not Gaussian. In the case of the Ionosphere dataset, some
input attributes have a Gaussian-like distribution, but many do not.

The algorithm learns a coefficient for each input value, which are linearly
combined into a regression function and transformed using a logistic (s-shaped)
function. Logistic regression is a fast and simple technique, but can be very
effective on some problems.
The logistic regression only supports binary classification problems, although
the WEKA implementation has been adapted to support multi-class
classification problems.
Choose the logistic regression algorithm:

1. Click the “Choose” button and select “Logistic” under the “functions”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.

WEKA Configuration for the Logistic Regression Algorithm

The algorithm can run for a fixed number of iterations (maxIts), but by default
will run until it is estimated that the algorithm has converged. The
implementation uses a ridge estimator which is a type of regularization. This
method seeks to simplify the model during training by minimizing the
coefficients learned by the model. The ridge parameter defines how much
pressure to put on the algorithm to reduce the size of the coefficients. Setting
this to 0 will turn off this regularization.

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration logistic regression achieves
an accuracy of 88%.
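
The Explorer run can also be reproduced programmatically. The following is a minimal sketch using 10-fold cross-validation (the dataset path is hypothetical; the ridge value is shown only to illustrate the parameter discussed above):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Logistic logistic = new Logistic();
        logistic.setRidge(1.0E-8);   // regularization strength; 0 turns ridge off

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(logistic, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}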

WEKA Classification Results for the Logistic Regression Algorithm

2.9.1.2 Naive Bayes

Naive Bayes is a classification algorithm. Traditionally it assumes that the input
values are nominal, although numerical inputs are supported by assuming a
distribution.

Naive Bayes uses a simple implementation of Bayes Theorem (hence naive)
where the prior probability for each class is calculated from the training data,
and the attributes are assumed to be independent of each other given the class
(technically called conditionally independent).

This is an unrealistic assumption because we expect the variables to interact and


be dependent, although this assumption makes the probabilities fast and easy to
calculate. Even under this unrealistic assumption, Naive Bayes has been shown
to be a very effective classification algorithm. Naive Bayes calculates the
posterior probability for each class and makes a prediction for the class with the
highest probability. As such, it supports both binary classification and multi-
class classification problems.

Choose the Naive Bayes algorithm:
1. Click the “Choose” button and select “NaiveBayes” under the “bayes”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.

WEKA Configuration for the Naive Bayes Algorithm

By default a Gaussian distribution is assumed for each numerical attribute.
You can change the algorithm to use a kernel estimator with the
useKernelEstimator argument that may better match the actual distribution of
the attributes in your dataset. Alternately, you can automatically convert
numerical attributes to nominal attributes with the useSupervisedDiscretization
parameter.

1. Click “OK” to close the algorithm configuration.
2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration Naive Bayes achieves an
accuracy of 82%.
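
A matching Java sketch for Naive Bayes, showing the two configuration options discussed above (the dataset path is hypothetical):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(true);            // kernel density instead of a Gaussian
        // nb.setUseSupervisedDiscretization(true); // alternative: discretize numeric attributes

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.printf("Naive Bayes accuracy: %.1f%%%n", eval.pctCorrect());
    }
}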


WEKA Classification Results for the Naive Bayes Algorithm

There are a number of other flavors of naive bayes algorithms that you could
work with.

2.9.1.3 Decision Tree

Decision trees can support classification and regression problems. Decision
trees are more recently referred to as Classification And Regression Trees
(CART).

They work by creating a tree to evaluate an instance of data, starting at the root
of the tree and moving down to the leaves until a prediction can be made.
The process of creating a decision tree works by greedily selecting the best split
point in order to make predictions and repeating the process until the tree
reaches a fixed depth.

After the tree is constructed, it is pruned in order to improve the model’s ability
to generalize to new data.

Choose the decision tree algorithm:


1. Click the “Choose” button and select “REPTree” under the “trees”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.


WEKA Configuration for the Decision Tree Algorithm

The depth of the tree is defined automatically, but a depth can be specified in
the maxDepth attribute.

You can also choose to turn off pruning by setting the noPruning parameter to
True, although this may result in worse performance. The minNum parameter
defines the minimum number of instances supported by the tree in a leaf node
when constructing the tree from the training data.

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the decision tree algorithm
achieves an accuracy of 89%.


WEKA Classification Results for the Decision Tree Algorithm

Another more advanced decision tree algorithm that you can use is the C4.5
algorithm, called J48 in WEKA. You can review a visualization of a decision
tree prepared on the entire training data set by right clicking on the “Result list”
and clicking “Visualize Tree”.
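
Both tree learners can be compared from Java with the same cross-validation harness. A minimal sketch (the dataset path is hypothetical; the maxDepth value simply illustrates the parameter discussed above):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        REPTree repTree = new REPTree();
        repTree.setMaxDepth(-1);          // -1 lets the depth be chosen automatically

        J48 j48 = new J48();              // WEKA's C4.5 implementation

        for (Classifier tree : new Classifier[] { repTree, j48 }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("%s accuracy: %.1f%%%n",
                    tree.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}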

WEKA Visualization of a Decision Tree

2.9.1.4 k-Nearest Neighbors

The k-nearest neighbors algorithm supports both classification and regression.
It is also called kNN for short. It works by storing the entire training dataset and
querying it to locate the k most similar training patterns when making a
prediction. As such, there is no model other than the raw training dataset and
the only computation performed is the querying of the training dataset when a
prediction is requested.

It is a simple algorithm, but one that does not assume very much about the
problem other than that the distance between data instances is meaningful in
making predictions. As such, it often achieves very good performance.

When making predictions on classification problems, KNN will take the mode
(most common class) of the k most similar instances in the training dataset.

Choose the k-Nearest Neighbors algorithm:


1. Click the “Choose” button and select “IBk” under the “lazy” group.
2. Click on the name of the algorithm to review the algorithm
configuration.

WEKA Configuration for the k-Nearest Neighbors Algorithm

The size of the neighborhood is controlled by the k parameter. For example, if k


is set to 1, then predictions are made using the single most similar training
instance to a given new pattern for which a prediction is requested. Common
values for k are 3, 7, 11 and 21, larger for larger dataset sizes. WEKA can
automatically discover a good value for k using cross validation inside the
algorithm by setting the crossValidate parameter to True. Another important
parameter is the distance measure used. This is configured in the
nearestNeighbourSearchAlgorithm which controls the way in which the training
data is stored and searched.
The default is a LinearNNSearch. Clicking the name of this search algorithm
will provide another configuration window where you can choose a
distanceFunction parameter. By default, Euclidean distance is used to calculate
the distance between instances, which is good for numerical data with the same
scale. Manhattan distance is good to use if your attributes differ in measures or
type.

WEKA Configuration for the Search Algorithm in the k-Nearest Neighbors Algorithm

It is a good idea to try a suite of different k values and distance measures on


your problem and see what works best.

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the kNN algorithm achieves
an accuracy of 86%.
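
A corresponding Java sketch for kNN, trying the k values suggested above (the dataset path is hypothetical):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[] { 1, 3, 7, 11, 21 }) {
            IBk knn = new IBk();
            knn.setKNN(k);                 // size of the neighborhood
            // knn.setCrossValidate(true); // alternatively let WEKA pick a good k

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k=%d accuracy: %.1f%%%n", k, eval.pctCorrect());
        }
    }
}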

WEKA Classification Results for k-Nearest Neighbors


2.9.1.5 Support Vector Machines

Support Vector Machines were developed for binary classification problems,
although extensions to the technique have been made to support multi-class
classification and regression problems. The algorithm is often referred to as
SVM for short.

SVM was developed for numerical input variables, although it will automatically
convert nominal values to numerical values. Input data is also normalized
before being used.

SVM works by finding a line that best separates the data into the two groups.
This is done using an optimization process that only considers those data
instances in the training dataset that are closest to the line that best separates the
classes. The instances are called support vectors, hence the name of the
technique.

In almost all problems of interest, a line cannot be drawn to neatly separate the
classes, therefore a margin is added around the line to relax the constraint,
allowing some instances to be misclassified but allowing a better result overall.

Finally, few datasets can be separated with just a straight line. Sometimes a line
with curves or even polygonal regions need to be marked out. This is achieved
with SVM by projecting the data into a higher dimensional space in order to
draw the lines and make predictions. Different kernels can be used to control
the projection and the amount of flexibility in separating the classes.
Choose the SVM algorithm:

1. Click the “Choose” button and select “SMO” under the “function”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.
SMO refers to the specific efficient optimization algorithm used inside the
SVM implementation, which stands for Sequential Minimal Optimization.


WEKA Configuration for the Support Vector Machines Algorithm

The C parameter, called the complexity parameter in WEKA, controls how
flexible the process for drawing the line to separate the classes can be. A value
of 0 allows no violations of the margin, whereas the default is 1.

A key parameter in SVM is the type of Kernel to use. The simplest kernel is a
Linear kernel that separates data with a straight line or hyperplane. The default
in WEKA is a Polynomial Kernel that will separate the classes using a curved
or wiggly line, the higher the polynomial, the more wiggly (the exponent value).

A popular and powerful kernel is the RBF Kernel or Radial Basis Function
Kernel that is capable of learning closed polygons and complex shapes to
separate the classes. It is a good idea to try a suite of different kernels and C
(complexity) values on your problem and see what works best.

1. Click “OK” to close the algorithm configuration.


2. Click the “Start” button to run the algorithm on the Ionosphere dataset.

You can see that with the default configuration the SVM algorithm
achieves an accuracy of 88%.
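
A matching Java sketch for SMO that swaps between the default polynomial kernel and an RBF kernel (the dataset path is hypothetical; the kernel and complexity settings are illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.Kernel;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (Kernel kernel : new Kernel[] { new PolyKernel(), new RBFKernel() }) {
            SMO smo = new SMO();
            smo.setC(1.0);           // complexity parameter C (default)
            smo.setKernel(kernel);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(smo, data, 10, new Random(1));
            System.out.printf("%s accuracy: %.1f%%%n",
                    kernel.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}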


WEKA Classification Results for the Support Vector Machine Algorithm

2.10 CLUSTERING
K-Means Algorithm Using WEKA Explorer
Let us see how to implement the K-means algorithm for clustering using
WEKA Explorer.

Cluster Analysis

Clustering algorithms are unsupervised learning algorithms used to create
groups of data with similar characteristics. Clustering aggregates objects with
similarities into groups and subgroups, thus leading to the partitioning of
datasets. Cluster analysis is the process of partitioning datasets into subsets.

These subsets are called clusters and the set of clusters is called clustering.
Cluster Analysis is used in many applications such as image recognition,
pattern recognition, web search, and security, in business intelligence such as
the grouping of customers with similar likings.

K-Means Clustering

K-means clustering is the simplest clustering algorithm. In the K-means
algorithm, the dataset is partitioned into K clusters. An objective function is
used to find the quality of the partitions so that similar objects are in one cluster
and dissimilar objects in other groups.

In this method, the centroid of a cluster is found to represent the cluster. The
centroid is taken as the center of the cluster, calculated as the mean value of
the points within the cluster. The quality of clustering is then found by
measuring the Euclidean distance between each point and its cluster center.
This distance should be minimum.

How Does the K-Means Clustering Algorithm Work?
Step #1: Choose a value of K, where K is the number of clusters.
Step #2: Iterate over each point and assign it to the cluster whose center is
nearest to it. When every element has been assigned, compute the centroid of
each cluster.
Step #3: Iterate over every element of the dataset and calculate the Euclidean
distance between the point and the centroid of every cluster. If any point is
present in a cluster whose centroid is not the nearest to it, then reassign that
point to the nearest cluster; after doing this for all the points in the dataset,
calculate the centroid of each cluster again.
Step #4: Repeat Step #3 until no new assignment takes place between two
consecutive iterations.

K-means Clustering Implementation Using WEKA

The steps for implementation using WEKA are as follows:

1) Open WEKA Explorer and click on Open File in the Preprocess


tab. Choose dataset “vote.arff”.

2) Go to the “Cluster” tab and click on the “Choose” button. Select


the clustering method as “SimpleKMeans”.

3) Choose Settings and then set the following fields:

 Distance function as Euclidean

 The number of clusters as 6. With a larger number of clusters, the sum of
squared errors will reduce.

58 Click on Ok and start the algorithm.



4) Click on Start in the left panel. The algorithm displays its results in the output
window. Let us analyze the run information:
 Scheme, Relation, Instances, and Attributes describe the properties of the
dataset and the clustering method used. In this case, the vote.arff dataset
has 435 instances and 13 attributes.

 With the K-means clusterer, the number of iterations is 5.

 The sum of the squared error is 1098.0. This error will reduce with an
increase in the number of clusters.

 The 6 final clusters with their centroids are represented in the form of a table.
In our case, the clusters contain 168, 47, 37, 122, 33 and 28 instances
respectively.

 Clustered instances represent the number and percentage of the total
instances falling in each cluster.


5) Choose “Classes to clusters evaluation” and click on Start.


The algorithm will assign a class label to each cluster. Cluster 0 represents
republican and Cluster 3 represents democrat. The proportion of incorrectly clustered
instances is 39.77%, which can be reduced by ignoring the unimportant
attributes.


6) To ignore the unimportant attributes, click on the “Ignore attributes” button
and select the attributes to be removed.
7) Use the “Visualize” tab to visualize the Clustering algorithm result. Go to
the tab and click on any box. Move the Jitter to the max.
 The X-axis and Y-axis represent the selected attributes.

 The blue color represents class label democrat and the red color
represents class label republican.

 Jitter is used to view Clusters.

 Click the box on the right-hand side of the window to change the x
coordinate attribute and view clustering with respect to other attributes.

Output
K-means clustering is a simple cluster analysis method. The number of clusters
can be set using the settings tab. The centroid of each cluster is calculated as the
mean of all points within that cluster. As the number of clusters increases, the sum
of squared errors is reduced. The objects within a cluster exhibit similar
characteristics and properties, and here the clusters correspond to the class
labels.
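
The same Explorer workflow can be reproduced with the WEKA Java API. The following is a minimal sketch; the path data/vote.arff is an assumption about where your WEKA distribution keeps its sample datasets, and the class attribute is assumed to be the last one. It drops the class attribute, builds SimpleKMeans with six clusters and prints the within-cluster sum of squared errors and the centroid table.

import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class VoteKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/vote.arff").getDataSet();

        // Clustering is unsupervised, so remove the class attribute first.
        Remove remove = new Remove();
        remove.setAttributeIndices("" + data.numAttributes());  // 1-based index of the class
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(6);                      // same setting as in the Explorer
        kMeans.setDistanceFunction(new EuclideanDistance());
        kMeans.buildClusterer(noClass);

        System.out.println("Within-cluster sum of squared errors: " + kMeans.getSquaredError());
        System.out.println(kMeans);                    // centroid table and cluster sizes
    }
}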

GENERAL GUIDELINES

Following are some of the general guidelines:

 Observation book and Lab record are compulsory.


 You should attempt all problems/assignments given in the list session wise.
 For the tasks related to the working with the WEKA, describe the procedure and
also present screenshots wherever applicable.

 You may seek assistance in doing the lab exercises from the concerned lab
instructor. Since the assignments carry credits, the lab instructor is obviously not
expected to tell you how to solve these, but you may ask questions concerning
WEKA and the data mining techniques.

 Add comments wherever necessary.

 The program should be interactive, general and properly documented with real
Input/ Output data.

 If two or more submissions from different students appear to be of the same origin
(i.e. are variants of essentially the same program), none of them will be counted.
You are strongly advised not to copy somebody else's work.

 It is your responsibility to create a separate directory to store all the programs, so
that nobody else can read or copy them.

 The list of programs (given session-wise at the end) is available to you in this lab
manual. For each session, you must come prepared with the algorithms and the
programs written in the Observation Book. You should utilize the lab hours for
executing the programs, testing them for the various desired outputs and enhancing
the programs.

 As soon as you have finished a lab exercise, contact one of the lab instructors /
in-charge in order to get the exercise evaluated and obtain his/her signature in the
Observation Book.

 Completed lab assignments should be submitted in the form of a Lab Record, in
which you have to write the algorithm and program code along with comments and
the output for the various inputs given.

 There are 10 lab sessions (3 hours each), and the list of assignments is provided
session-wise. It is important to observe the deadline given for each assignment.

Get started with WEKA and explore the utilities sessionwise.

2.11 PRACTICAL SESSIONS

Sessionwise practical problems are given as follows:

Session-1

1. Download and install WEKA. Navigate the various options available in
WEKA. Explore the available datasets in WEKA. Load various datasets
and observe the following:
a. List the attribute names and their types
b. No. of records in each dataset
c. Identify the class attribute(if any)
d. Plot Histogram
e. Determine the no. of records for each class.
f. Visualize the data in different dimensions.
2. Create your own EXCEL file. Convert the EXCEL file to .csv format and
prepare it as an .arff file (a sketch of this conversion using the WEKA API is
given after this list).
3. Try to create your own datasets.
4. Preprocess and classify Customer, Agriculture, Weather, Wholesale
Customers or datasets of your own choice from
https://archive.ics.uci.edu/ml/datasets.php
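
For problem 2 above, the CSV-to-ARFF conversion can be done programmatically. This is a minimal sketch; the file names mydata.csv and mydata.arff are hypothetical placeholders for your own files (save the Excel sheet as CSV first).

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file exported from Excel.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("mydata.csv"));
        Instances data = loader.getDataSet();

        // Write the same data out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("mydata.arff"));
        saver.writeBatch();
    }
}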

Session-2

5. Perform the basic pre-processing operations on a data relation, such as
removing an attribute and filtering attributes, using the bank data (a sketch
using the WEKA API follows this list).
6. Demonstrate the preprocessing mechanism on the following datasets:
a. student.arff
b. labor.arff
c. contactlenses.arff
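
For problem 5 above, removing an attribute can also be done with WEKA's Remove filter from Java. This is a minimal sketch under the assumption that your copy of the bank data is stored as bank.arff and that the attribute to drop is the first one (for example an ID column); adjust the path and index to your dataset.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bank.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("1");        // 1-based index of the attribute to remove
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);

        System.out.println("Before: " + data.numAttributes() + " attributes");
        System.out.println("After:  " + filtered.numAttributes() + " attributes");
    }
}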

Session-3

7. Perform the following:


 Explore various options available in WEKA for preprocessing data
and apply unsupervised filters like the Discretize and Resample filters
on various datasets.
 Load the weather.nominal, Iris and Glass datasets into WEKA and run the
Apriori algorithm with different support and confidence values.
 Study the rules generated.
 Apply different discretization filters on numerical attributes and run
the Apriori association rule algorithm. Study the rules generated.
 Derive interesting insights and observe the effect of discretization in
the rule generation process.

8. Implement the Apriori algorithm to find the association rules in the
contactlenses.arff dataset (a sketch using the WEKA API follows).
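
The following is a minimal sketch of running Apriori from Java; the path data/contact-lenses.arff is an assumption (that is the file name in the standard WEKA data folder), and the support/confidence values are illustrative.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/contact-lenses.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.2);   // minimum support 20%
        apriori.setMinMetric(0.8);              // minimum confidence 80%
        apriori.setNumRules(5);                 // report the 5 best rules
        apriori.buildAssociations(data);

        System.out.println(apriori);            // prints itemset counts and the generated rules
    }
}

Changing setLowerBoundMinSupport and setMinMetric mirrors varying support and confidence in the Associate panel.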

Session-4

9. Find the frequent patterns using the FP-Growth algorithm on the
contactlenses.arff and test.arff datasets (a sketch using the WEKA API
follows this list).
10. Generate association rules using the Apriori algorithm on the Bank.arff relation:
a. Set the minimum support range as 20%-100% with an incremental decrease
factor of 5% and a confidence factor of 80%, and generate 5 rules.
b. Set the minimum support as 10%, delta as 5% and the minimum lift as
1.5 (150%), and generate 4 rules.
11. Generate association rules for the credit card promotion dataset using the
Apriori algorithm with a support range of 40% to 100%, confidence of
10% and an incremental decrease of 5%, and generate 6 rules.
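
For problem 9, FP-Growth can also be run from Java. Note that WEKA's FPGrowth expects market-basket style (binary) attributes, so this minimal sketch uses the supermarket.arff dataset from the WEKA data folder as an assumed stand-in; the parameter values are illustrative.

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/supermarket.arff").getDataSet();

        FPGrowth fp = new FPGrowth();
        fp.setLowerBoundMinSupport(0.1);   // minimum support 10%
        fp.setMinMetric(0.9);              // minimum confidence 90%
        fp.setNumRulesToFind(4);
        fp.buildAssociations(data);

        System.out.println(fp);            // prints the frequent-pattern rules
    }
}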

Session-5

12. Perform the following:


 Use contactlenses.arff and load it into WEKA. Check that all attributes
are nominal (categorical).
 Change to the Associate Panel. Select “Apriori” as associator. After
pressing the start button, Apriori starts to build its model and writes its
output into the output field. The first part of the output (“Run
information”) describes the options that have been set and the data set
used. Make sure you understand all the data reported.
 The rules that have been generated are listed at the end of the output.
By default, only the 10 most valuable rules according to their
confidence level are shown. Each rule consists of some attribute values
on a left hand side of the arrow, the arrow sign and the right hand side
list of attribute values. Right of the arrow sign are the predicted
attribute values. Rules have certain support and confidence values. The
number before the arrow sign is the number of instances the rule
applies to. The number after the arrow sign is the number of instances
predicted correctly. The number in brackets after ‘conf:’ is the
confidence of the rule. Analyse the rules mined from the data set. What
are their confidence and support values? Examine the number of large
itemsets – make sure you understand how this data has been calculated
(check that the values you would get ‘manually’ are correct).

Session-6

13. Perform the following:


 Use zoo.arff dataset and load it into WEKA. Examine the attributes and
make sure you understand their meaning. Are all attributes nominal?
 In the Preprocess area, deselect the animal and legs attributes. The
animal attribute is the name of the animal, and is not useful for mining.
The legs attribute is numeric and cannot be used directly with Apriori.
Alternatively, you can try to use the Discretize Filter to discretize the
legs attribute.
 After deselecting the attributes, use the Apply Filters button to generate
a working relation that removes those attributes. Notice how the
working relation changes, and has fewer attributes than the base
relation.
 First, try using the Apriori algorithm with the default parameters.
Record the generated rules.
 Vary the number of rules generated (click on the command that you are
running). Try 20, 30, ... Record how many rules you have to generate
before generating a rule containing type=mammal.
 Vary the maximum support until a rule containing type=mammal is the
top rule generated. Record the maximum support needed.
 Select one generated rule that was interesting to you. Why was it interesting?
What does it mean? Check its confidence and support – are they high
enough?
 Suggest one improvement to the Apriori implementation in WEKA that
would have made this data mining lab easier to accomplish.
14. Demonstrate how to predict numerical values in a given dataset using
regression methods (a sketch using WEKA's LinearRegression follows).
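
A minimal sketch of numeric prediction with linear regression is shown below; data/cpu.arff (a WEKA sample dataset with a numeric class) is an assumed example, and the last attribute is assumed to be the target.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/cpu.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);    // predict the last (numeric) attribute

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);                          // the fitted regression equation

        // 10-fold cross-validation on a fresh copy of the learner.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println("Correlation coefficient: " + eval.correlationCoefficient());
        System.out.println("Mean absolute error:     " + eval.meanAbsoluteError());
    }
}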

Session-7

15. Demonstrate the classification rule process on the student.arff,
employee.arff and labor.arff datasets using the following algorithms:
a. Logistic Regression
b. Decision Tree
c. Naïve Bayes

16. Demonstrate the classification rule process on the student.arff,
employee.arff and labor.arff datasets using the following algorithms
(a comparison sketch using the WEKA API follows this list):
a. K-Nearest Neighbour
b. SVM
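
The classifiers from problems 15 and 16 can be compared in one program. This is a minimal sketch; data/labor.arff is used as an assumed example (student.arff and employee.arff are course-supplied files, so substitute your own paths), and k = 3 for the nearest-neighbour learner is an illustrative choice.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/labor.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {
            new Logistic(), new J48(), new NaiveBayes(), new IBk(3), new SMO()
        };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s accuracy = %.2f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}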

Session-8

17. Perform the following:


 Demonstrate performing classification on various data sets.
 Load each dataset into WEKA and run the ID3 and J48 classification
algorithms.
 Study the classifier output. Compute entropy values, Kappa statistic.
 Extract if-then rules from the decision tree generated by the classifier.
 Observe the confusion matrix.
18. Perform the following:
 Load each dataset into WEKA and perform Naïve Bayes classification
and k-Nearest Neighbour classification.
 Interpret the results obtained.
 Plot ROC curves.
 Compare the classification results of the ID3, J48, Naïve Bayes and k-NN
classifiers for each dataset, deduce which classifier performs best and which
performs worst for each dataset, and justify.

Session-9

19. Demonstrate Clustering features in Large Databases with noise.


20. Implement simple K-Means Algorithm to demonstrate the clustering rule
on the following datasets:
a. iris.arff
b. student.arff
21. Perform the following:
 Load each dataset into WEKA and run the simple k-means clustering
algorithm with different values of k (the number of desired clusters); a
sketch of this using the WEKA API is given after this list.
 Study the clusters formed.
 Observe the sum of squared errors and centroids, and derive insights.
 Explore other clustering techniques available in WEKA.
 Explore visualization features of WEKA to visualize the clusters.
 Derive interesting insights and explain.
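
The following is a minimal sketch of sweeping k and observing the sum of squared errors; data/iris.arff is an assumed path to the WEKA sample dataset, and the class attribute (the last one) is dropped before clustering.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansSweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data/iris.arff").getDataSet();

        // Remove the class attribute before clustering.
        Remove remove = new Remove();
        remove.setAttributeIndices("" + data.numAttributes());
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        for (int k = 2; k <= 6; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.buildClusterer(noClass);
            System.out.printf("k=%d  within-cluster SSE=%.2f%n", k, km.getSquaredError());
        }
    }
}

As the printed values show, the sum of squared errors generally decreases as k grows, which is the behaviour you should also observe in the Explorer.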

Session-10

22. Implement the Hierarchical Clustering algorithm to demonstrate the
clustering rule process on the following datasets:
a. employee.arff
b. student.arff
23. Implement a density-based clustering algorithm to demonstrate the
clustering rule process on the employee.arff dataset.

2.12 SUMMARY

Data mining (also known as knowledge discovery from databases) is the
process of extracting hidden, previously unknown and potentially
useful information from databases. The extracted information
can be analyzed from future planning and development perspectives.

Data mining steps in the knowledge discovery process are as follows:

 Data Cleaning - The removal of noise and inconsistent data.
 Data Integration - The combination of multiple sources of data.
 Data Selection - The data relevant for analysis is retrieved from
the database.
 Data Transformation - The consolidation and transformation of
data into forms appropriate for mining.
 Data Mining - The use of intelligent methods to extract patterns
from data.
 Pattern Evaluation - Identification of patterns that are interesting.
 Knowledge Presentation - Visualization and knowledge
representation techniques are used to present the extracted or
mined knowledge to the end user.

In this Data Mining lab course you were given exposure to working with WEKA.
WEKA is a collection of machine learning algorithms for data mining tasks. It
contains tools for data preparation, classification, regression, clustering,
association rule mining, and visualization. The software is named after the weka,
a flightless bird found only on the islands of New Zealand. WEKA is open source
software issued under the GNU General Public License. The video links for the
courses are available under Online Lab Resources.

2.13 FURTHER READINGS

1. Eibe Frank, Mark A. Hall, and Ian H. Witten (2016). The WEKA
Workbench. Online Appendix for "Data Mining: Practical Machine
Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition,
2016.
2. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, and Ian H. Witten (2009). The WEKA Data Mining
Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
3. Jason Bell (2020) Machine Learning: Hands-On for Developers and
Technical Professionals, Second Edition, Wiley.
4. Richard J. Roiger (2020) Just Enough R! An Interactive Approach to
Machine Learning and Analytics, CRC Press.
5. Parteek Bhatia (2019) Data Mining and Data Warehousing Principles
and Practical Techniques, Cambridge University Press.
6. Mark Wickham (2018) Practical Java Machine Learning Projects with
Google Cloud Platform and Amazon Web Services, APress.
7. AshishSingh Bhatia, Bostjan Kaluza (2018) Machine Learning in Java -
Second Edition, Packt Publishing.
8. Richard J. Roiger (2016) Data Mining: A Tutorial-Based Primer, CRC
Press.
9. Mei Yu Yuan (2016) Data Mining and Machine Learning: WEKA
Technology and Practice, Tsinghua University Press (in Chinese).
10. Jürgen Cleve, Uwe Lämmel (2016) Data Mining, De Gruyter (in
German).
11. Eric Rochester (2015) Clojure Data Analysis Cookbook - Second
Edition, Packt Publishing.
12. Boštjan Kaluža (2013) Instant Weka How-to, Packt Publishing.
13. Hongbo Du (2010) Data Mining Techniques and Applications,
Cengage Learning.

2.14 WEBSITE REFERENCES

1. https://www.cs.waikato.ac.nz/ml/weka
2. https://weka.wikispaces.com

2.15 ONLINE LAB RESOURCES

1. The WEKA team has put together several free online courses, available at
https://www.cs.waikato.ac.nz/ml/weka/courses.html, that teach
machine learning and data mining using WEKA.
2. The videos for the courses are available on YouTube at
https://www.youtube.com/user/WekaMOOC.
3. FAQs on WEKA: https://waikato.github.io/weka-wiki/faq/
4. WEKA Downloads: https://waikato.github.io/weka-wiki/downloading_weka/
5. Datasets: https://waikato.github.io/weka-wiki/datasets/
