Weka Data Mining
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modelling, together with graphical user interfaces for easy access to these functions. The original non-
Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling algorithms implemented
in other programming languages, plus data preprocessing utilities in C and a makefile-based system for
running machine learning experiments.
This original version was primarily designed as a tool for analyzing data from agricultural domains. However, the more recent fully Java-based version (Weka 3), developed in 1997, is now used in many different application areas, particularly for education and research. Weka offers the following advantages:
Weka supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format (ARFF) and stored in files with the .arff extension.
All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal, although some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity (JDBC) and can process the result returned by a database query. Weka also provides access to deep learning through Deeplearning4j.
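Programmatically, the result of a database query can be turned into a Weka dataset with the InstanceQuery class. The sketch below assumes weka.jar and a matching JDBC driver on the classpath; the URL, credentials, and SQL are hypothetical placeholders.

```java
// Sketch: loading a Weka dataset from a SQL database over JDBC.
// The URL, credentials, and query below are hypothetical placeholders.
import weka.core.Instances;
import weka.experiment.InstanceQuery;

public class DbLoadSketch {
    public static Instances load() throws Exception {
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:mysql://localhost:3306/shop");     // hypothetical
        query.setUsername("user");                                    // hypothetical
        query.setPassword("secret");                                  // hypothetical
        query.setQuery("SELECT age, income, churned FROM customers"); // hypothetical
        Instances data = query.retrieveInstances(); // runs the query, builds Instances
        query.disconnectFromDatabase();
        return data;
    }
}
```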
Weka is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table suitable for processing with Weka. Another important area currently not covered by the algorithms included in the Weka distribution is sequence modelling.
History of Weka
o In 1993, the University of Waikato in New Zealand began the development of the original version
of Weka, which became a mix of Tcl/Tk, C, and makefiles.
o In 1997, the decision was made to redevelop Weka from scratch in Java, including implementing
modelling algorithms.
o In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award.
o In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence.
It forms the data mining and predictive analytics component of the Pentaho business intelligence
suite. Hitachi Vantara has since acquired Pentaho, and Weka now underpins the PMI (Plugin for
Machine Intelligence) open-source component.
Features of Weka
Weka has the following features:
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most data is raw, it may contain empty or duplicate values, garbage values, outliers, extra columns, or inconsistent naming conventions. All of these degrade the results.
To make data cleaner and more consistent, WEKA provides a comprehensive set of options under the filter category, covering both supervised and unsupervised operations. Typical preprocessing filters include ReplaceMissingValues, Normalize, Discretize, and Remove.
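A filter can also be applied programmatically. As a minimal sketch (assuming weka.jar on the classpath and a made-up numeric attribute), the unsupervised Normalize filter rescales each numeric attribute to the range [0, 1]:

```java
// Sketch: applying an unsupervised Normalize filter to a tiny made-up dataset.
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class PreprocessSketch {
    public static Instances normalized() throws Exception {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("sepallength"));          // numeric attribute
        Instances data = new Instances("demo", atts, 0);
        for (double v : new double[]{4.0, 5.0, 6.0, 8.0}) {
            data.add(new DenseInstance(1.0, new double[]{v}));
        }
        Normalize norm = new Normalize();                // scales each numeric attribute to [0, 1]
        norm.setInputFormat(data);                       // must be called before filtering
        return Filter.useFilter(data, norm);
    }
}
```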
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or categories to items. Classic examples of classification are declaring a brain tumour "malignant" or "benign", or assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select a test option for evaluation. Some of the options are:
o Use training set: the classifier will be tested on the same training set.
o A supplied test set: evaluates the classifier based on a separate test set.
o Cross-validation Folds: assessment of the classifier based on cross-validation using the number
of provided folds.
o Percentage split: the classifier will be judged on a specific percentage of data.
Other than these, we can also use more test options such as Preserve order for % split, Output source
code, etc.
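The same workflow can be sketched in code. The example below (assuming weka.jar on the classpath and a made-up spam dataset) builds a ZeroR baseline, which always predicts the majority class, and evaluates it with the "Use training set" option:

```java
// Sketch: a ZeroR baseline evaluated on its own training set.
import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class ClassifySketch {
    static Instances mail() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("hasLink", Arrays.asList("yes", "no")));
        atts.add(new Attribute("label", Arrays.asList("spam", "not_spam")));
        Instances data = new Instances("mail", atts, 0);
        String[][] rows = {{"yes","spam"},{"yes","spam"},{"no","not_spam"},
                           {"yes","spam"},{"no","not_spam"},{"no","spam"},
                           {"yes","not_spam"},{"no","spam"},{"yes","spam"},{"no","not_spam"}};
        for (String[] r : rows) {
            double[] v = {data.attribute(0).indexOfValue(r[0]),
                          data.attribute(1).indexOfValue(r[1])};
            data.add(new DenseInstance(1.0, v));
        }
        data.setClassIndex(1);                  // "label" is the class attribute
        return data;
    }

    // "Use training set": the classifier is tested on the same data it was built on.
    public static double trainingAccuracy() throws Exception {
        Instances data = mail();
        ZeroR zr = new ZeroR();                 // always predicts the majority class
        zr.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(zr, data);
        // For the "Cross-validation" option instead:
        // eval.crossValidateModel(zr, data, 10, new java.util.Random(1));
        return eval.pctCorrect();               // percentage of correct predictions
    }
}
```

Since 6 of the 10 instances are "spam", ZeroR predicts "spam" everywhere and scores 60% on the training set.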
3. Cluster
In clustering, a dataset is arranged into different groups/clusters based on similarities. Items within the same cluster are similar to each other but different from items in other clusters. Examples of clustering include identifying customers with similar behaviours and organizing regions according to homogenous land use.
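As a minimal sketch (assuming weka.jar on the classpath and made-up one-dimensional values), SimpleKMeans can split well-separated points into two clusters:

```java
// Sketch: grouping a tiny made-up dataset into two clusters with SimpleKMeans.
import java.util.ArrayList;
import weka.clusterers.SimpleKMeans;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class ClusterSketch {
    public static int[] assignments() throws Exception {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("spend"));                 // numeric attribute
        Instances data = new Instances("customers", atts, 0);
        for (double v : new double[]{1.0, 1.2, 9.8, 10.0}) {
            data.add(new DenseInstance(1.0, new double[]{v}));
        }
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);                             // ask for two clusters
        km.buildClusterer(data);                          // clusterers take no class attribute
        int[] out = new int[data.numInstances()];
        for (int i = 0; i < out.length; i++) {
            out[i] = km.clusterInstance(data.instance(i));
        }
        return out;
    }
}
```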
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In short, it is
an if-then statement that depicts the probability of relationships between data items. A classic example
of association refers to a connection between the sale of milk and bread.
The tool provides Apriori, FilteredAssociator, and FPGrowth algorithms for association rules mining
in this category.
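The milk-and-bread example can be sketched programmatically (assuming weka.jar on the classpath; the basket data is made up) by running Apriori with its default support and confidence thresholds:

```java
// Sketch: mining association rules with Apriori on a tiny made-up basket dataset.
import java.util.ArrayList;
import java.util.Arrays;
import weka.associations.Apriori;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class AssociateSketch {
    public static String rules() throws Exception {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("milk", Arrays.asList("t", "f")));
        atts.add(new Attribute("bread", Arrays.asList("t", "f")));
        Instances data = new Instances("baskets", atts, 0);
        String[][] rows = {{"t","t"},{"t","t"},{"t","t"},{"f","f"},
                           {"t","t"},{"f","f"},{"t","t"},{"f","f"}};
        for (String[] r : rows) {
            double[] v = {data.attribute(0).indexOfValue(r[0]),
                          data.attribute(1).indexOfValue(r[1])};
            data.add(new DenseInstance(1.0, v));
        }
        Apriori apriori = new Apriori();   // default support/confidence thresholds
        apriori.buildAssociations(data);   // mines "if milk then bread"-style rules
        return apriori.toString();         // human-readable summary of the best rules
    }
}
```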
5. Select Attributes
Every dataset contains many attributes, but several of them may not be significantly valuable. Removing the unnecessary attributes and keeping the relevant ones is very important for building a good model.
The tool offers many attribute evaluators and search methods, including BestFirst, GreedyStepwise, and Ranker.
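The same selection can be sketched in code (assuming weka.jar on the classpath; the dataset is made up, with one attribute that mirrors the class and one constant noise attribute). InfoGainAttributeEval scores each attribute and Ranker orders them:

```java
// Sketch: ranking attributes by information gain, the programmatic analogue
// of the Select Attributes tab.
import java.util.ArrayList;
import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class SelectSketch {
    public static int[] ranked() throws Exception {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("useful", Arrays.asList("a", "b")));  // mirrors the class
        atts.add(new Attribute("useless", Arrays.asList("x", "y"))); // constant noise
        atts.add(new Attribute("cls", Arrays.asList("a", "b")));
        Instances data = new Instances("demo", atts, 0);
        String[][] rows = {{"a","x","a"},{"b","x","b"},{"a","x","a"},{"b","x","b"}};
        for (String[] r : rows) {
            double[] v = new double[3];
            for (int i = 0; i < 3; i++) v[i] = data.attribute(i).indexOfValue(r[i]);
            data.add(new DenseInstance(1.0, v));
        }
        data.setClassIndex(2);
        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new InfoGainAttributeEval()); // scores each attribute
        sel.setSearch(new Ranker());                   // orders attributes by score
        sel.SelectAttributes(data);
        return sel.selectedAttributes();               // ranked indices, class appended
    }
}
```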
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors
identified by the model.
Five options are available in the Applications category of the WEKA GUI Chooser.
o The Explorer is the central panel where most data mining tasks are performed. We will further explore this panel in upcoming sections.
o The Experimenter panel lets us design and run experiments.
o The KnowledgeFlow panel provides an interface to drag and drop components, connect them to form a knowledge flow, and analyze the data and results.
o The Simple CLI panel provides command-line access to WEKA. For example, to run the ZeroR classifier on an ARFF file (the file name here is assumed), we can enter:
java weka.classifiers.rules.ZeroR -t weather.arff
It is important to note that the header declarations (@attribute) must come before the declaration of the data (@data):
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,yes
overcast,hot,high,TRUE,yes
overcast,cool,normal,TRUE,yes
rainy,cool,normal,FALSE,no
rainy,cool,normal,TRUE,no
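For reference, the rows above resemble the classic weather dataset, so a matching header might look like the following (the attribute names and value sets are assumptions inferred from the columns, since the original header is not shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
```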
Besides ARFF, the tool supports different file formats such as CSV, JSON, and XRFF.
Once data is loaded from different sources, the next step is to preprocess it. For this purpose, we can choose any suitable filter technique. All the filters come with default settings that can be configured by clicking on the filter's name.
If one of the attributes, such as sepallength, contains errors or outliers, we can remove or update it from the Attributes section.
Types of Algorithms by Weka
WEKA provides many algorithms for machine learning tasks. Based on their nature, the algorithms are divided into several groups, which are available under the Explorer tab of WEKA.
Each algorithm has configuration parameters, such as batchSize and debug. Some configuration parameters are common across all the algorithms, while others are specific. These configurations become editable once an algorithm is selected.
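The same parameters can be set in code. As a sketch (assuming weka.jar on the classpath), setting J48's options string has the same effect as editing confidenceFactor and minNumObj in the GUI dialog:

```java
// Sketch: editing an algorithm's configuration programmatically,
// the counterpart of clicking its name in the GUI.
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class OptionsSketch {
    public static J48 configured() throws Exception {
        J48 tree = new J48();
        // -C sets the confidence factor, -M the minimum instances per leaf.
        tree.setOptions(Utils.splitOptions("-C 0.1 -M 5"));
        return tree;
    }
}
```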