ML Assignment 2
WEKA offers a wide range of sample datasets for applying machine learning algorithms. Users
can perform machine learning tasks such as classification, regression, attribute selection, and
association on these sample datasets, and can also use them to learn the tool.
WEKA Explorer is used for performing several functions, starting with preprocessing.
Preprocessing takes a .arff file as input, processes it, and produces output that can be used
by other components. In WEKA, the preprocessing step lists the attributes present in the
dataset, which can then be used for statistical analysis and comparison with class labels.
WEKA also offers many classification algorithms for decision trees. J48 is one of the popular
classification algorithms, and it outputs a decision tree. Using the Classify tab, the user can
visualize the decision tree. If the decision tree is too dense, it can be simplified from
the Preprocess tab by removing the attributes that are not required and starting the
classification process again.
Jython#
If you're starting from scratch, you might want to consider Jython, an implementation of Python
written in Java that integrates seamlessly with the JVM. The drawback is that you can only use
pure-Python libraries and the libraries that Jython implements, not C-based ones like NumPy or
SciPy. The article Using WEKA from Jython explains how to use WEKA classes from Jython and how
to implement a new classifier in Jython, with an example of ZeroR implemented in Jython.
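The Jython code itself is not reproduced here, but the idea behind ZeroR is simple enough to sketch in a few lines of plain Python (a hypothetical stand-in, not Weka's implementation):

```python
from collections import Counter

class ZeroR:
    """Baseline classifier: ignores all attributes and always
    predicts the most frequent class seen during training."""

    def fit(self, labels):
        # Counter.most_common(1) returns [(label, count)] for the majority label
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instance=None):
        # The instance is ignored entirely -- that is the whole point of ZeroR.
        return self.prediction

# Class counts from weather.nominal.arff: 9 "yes", 5 "no"
model = ZeroR().fit(["yes"] * 9 + ["no"] * 5)
print(model.predict())  # prints "yes"
```

Any classifier that cannot beat ZeroR's accuracy (here 9/14) is not learning anything from the attributes.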
Jepp#
An approach making use of the javax.script package (new in Java 6) is Jepp, Java embedded
Python. Jepp seems to share Jython's limitation of not being able to import SciPy or
NumPy, but pure Python libraries can be imported. The article Using WEKA via Jepp contains
more information and examples.
JPype#
JPype is another bridge library: it starts a JVM inside CPython, making Java classes such as Weka's callable from regular Python.
python-weka-wrapper3#
You can use the python-weka-wrapper3 library (for Python 3) to access most of the non-GUI
functionality of Weka (3.9.x):
● pypi
● github
● examples
sklearn-weka-plugin#
With the sklearn-weka-plugin library, you can use Weka from within the scikit-learn
framework. The library itself uses python-weka-wrapper3 under the hood to make use of the
Weka algorithms.
● pypi
● github
● examples
The WEKA machine learning tool provides a directory of sample datasets. These datasets
can be loaded directly into WEKA so users can start developing models immediately.
The WEKA datasets can be explored in the installation's data folder, e.g.
“C:\Program Files\Weka-3-8\data”. The datasets are in .arff format.
Sample WEKA Datasets
Some sample datasets present in WEKA are listed in the table below:
1. airline.arff
2. breast-cancer.arff
3. contact-lens.arff
4. cpu.arff
5. cpu.with-vendor.arff
6. credit-g.arff
7. diabetes.arff
8. glass.arff
9. hypothyroid.arff
10. ionosphere.arff
11. iris.2D.arff
12. iris.arff
13. labor.arff
14. ReutersCorn-train.arff
15. ReutersCorn-test.arff
16. ReutersGrain-train.arff
17. ReutersGrain-test.arff
18. segment-challenge.arff
19. segment-test.arff
20. soybean.arff
21. supermarket.arff
22. unbalanced.arff
23. vote.arff
24. weather.numeric.arff
25. weather.nominal.arff
The contact-lens.arff dataset is a database for fitting contact lenses. It was donated by
Benoit Julien in 1990.
Database: The database is complete and noise-free. It has 24 instances and 4 attributes.
Attributes: All four attributes are nominal. There are no missing attribute values. The four
attributes and their values are as follows:
#1) Age of the patient:
● young
● pre-presbyopic
● presbyopic
#2) Spectacle prescription:
● myope
● hypermetrope
#3) Astigmatism:
● no
● yes
#4) Tear production rate:
● reduced
● normal
Class Distribution: The instances are classified into three class labels:
● hard contact lenses: 4 instances
● soft contact lenses: 5 instances
● no contact lenses: 15 instances
iris.arff
The iris.arff dataset is the Iris Plants database, based on R.A. Fisher's classic 1936 data
and donated by Michael Marshall in 1988.
Database: This database is used for pattern recognition. The dataset contains 3 classes of 50
instances each (150 in total). Each class represents a type of iris plant. One class is linearly
separable from the other 2, but the latter are not linearly separable from each other. The task
is to predict to which of the 3 iris species an observation belongs, which makes this a
multi-class classification dataset.
Attributes: It has 4 numeric, predictive attributes and the class. There are no missing attribute values.
● sepal length in cm
● sepal width in cm
● petal length in cm
● petal width in cm
● class:
○ Iris Setosa
○ Iris Versicolour
○ Iris Virginica
Summary Statistics (from the UCI documentation; values in cm):
● sepal length: min 4.3, max 7.9, mean 5.84
● sepal width: min 2.0, max 4.4, mean 3.05
● petal length: min 1.0, max 6.9, mean 3.76
● petal width: min 0.1, max 2.5, mean 1.20
diabetes.arff
This dataset comes from the Pima Indians Diabetes database. It predicts whether a patient will
develop diabetes within the next 5 years. The patients in this dataset are all females of Pima
Indian heritage who are at least 21 years of age. It has 768 instances and 8 numeric attributes
plus a class. This is a binary classification dataset: the predicted output variable is nominal,
comprising two classes.
ionosphere.arff
This is a popular dataset for binary classification. Each instance describes the properties of
radar returns from the ionosphere, and the task is to predict whether the ionosphere shows
some structure or not. It has 34 numeric attributes and a class.
The class attribute is “good” or “bad”, predicted from the 34 attribute observations. The
received signals were processed by an autocorrelation function taking the time of a pulse and
the pulse number as arguments.
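Weka's exact signal preprocessing is not shown here, but the standard sample autocorrelation the description refers to can be sketched as follows (illustrative only; the pulse values are made up):

```python
def autocorrelation(signal, lag):
    """Sample autocorrelation of a discrete signal at the given lag:
    covariance between the signal and a lag-shifted copy, scaled by
    the variance so that lag 0 always yields 1.0."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    cov = sum((signal[t] - mean) * (signal[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

# A hypothetical received pulse
pulse = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
print(autocorrelation(pulse, 0))  # 1.0 by construction
```

Values near 1.0 at nonzero lags indicate a regular, structured return; values near 0 indicate noise.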
Regression Datasets
The regression datasets can be downloaded from the WEKA webpage “Collections of datasets”. The
collection holds 37 regression problems obtained from different sources. The downloaded archive
unpacks into a numeric/ directory containing the regression datasets in .arff format.
The popular datasets present in the directory are Longley economic dataset (longley.arff),
Boston house price dataset (housing.arff), and sleep in mammals data set (sleep.arff).
Let us now see how to identify real-valued and nominal attributes in the dataset using WEKA
explorer.
Real-valued attributes are numeric attributes containing only real values. These are measurable
quantities. Such attributes can be interval-scaled, such as temperature, or ratio-scaled, such
as height or weight.
Nominal attributes represent names or some other labeling of things. There is no order among
their values; they represent categories. Color is an example.
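The same distinction can be read straight off a dataset's ARFF header: numeric attributes are declared as `numeric` (or `real`/`integer`), while nominal attributes list their categories in braces. A hypothetical helper, using the weather.nominal header as the example:

```python
import re

def attribute_types(arff_header):
    """Map each @attribute in an ARFF header to 'numeric', 'nominal',
    or 'other' (string, date, ...)."""
    types = {}
    for line in arff_header.splitlines():
        m = re.match(r"@attribute\s+(\S+)\s+(.+)", line.strip(), re.IGNORECASE)
        if not m:
            continue  # skip @relation, @data, comments, data rows
        name, spec = m.group(1), m.group(2).strip()
        if spec.lower() in ("numeric", "real", "integer"):
            types[name] = "numeric"
        elif spec.startswith("{"):          # {value1, value2, ...}
            types[name] = "nominal"
        else:
            types[name] = "other"
    return types

header = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}"""
print(attribute_types(header))
# {'outlook': 'nominal', 'temperature': 'numeric', 'humidity': 'numeric',
#  'windy': 'nominal', 'play': 'nominal'}
```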
Follow the steps listed below to use WEKA for identifying real-valued and nominal
attributes in a dataset.
#3) Select the input file from the Weka-3-8 folder on the local system. Choose the
predefined .arff file “credit-g.arff” and click “Open”.
#4) The attribute list appears in the left panel, and statistics for the selected attribute
appear in the right panel along with a histogram. The panel below “Current relation” shows
the names of the attributes. Select the attribute “checking_status” to see its name, type,
and counts of missing, distinct, and unique values.
#5) To keep only the numeric attributes, click the “Choose” button in the Filter panel and
navigate to weka -> filters -> unsupervised -> attribute -> RemoveType.
WEKA filters offer many functions for transforming the attribute values of a dataset to make
it suitable for an algorithm, for example numeric transformations of attributes.
Filtering the nominal and real-valued attributes out of a dataset is another example of using
WEKA filters.
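The effect of a numeric transformation filter, such as Weka's unsupervised Normalize filter, which linearly rescales an attribute's values into [0, 1], can be illustrated in plain Python (a sketch, not Weka's code):

```python
def normalize(values):
    """Rescale numeric attribute values linearly into [0, 1],
    in the spirit of Weka's unsupervised Normalize filter."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant attribute: map everything to 0.0
    return [(v - lo) / (hi - lo) for v in values]

# Values from a hypothetical numeric "age" attribute
print(normalize([20, 35, 50, 65]))
```

After such a transformation, attributes measured on very different scales contribute comparably to distance-based algorithms.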
#6) Click on the RemoveType text in the filter box. An object editor window will open. Set
attributeType to “Delete nominal attributes” and click OK.
#7) Apply the filter. Only numeric attributes will be shown.
The class attribute is of nominal type, but since it classifies the output it is not deleted,
so it remains visible alongside the numeric attributes.
Output:
The real-valued and nominal attributes in the dataset are identified, and their visualization
against the class label is shown in the form of histograms.
Now, we will see how to implement decision tree classification on weather.nominal.arff dataset
using the J48 classifier.
weather.nominal.arff
It is a sample dataset present in the data directory of WEKA. The dataset predicts whether the
weather is suitable for playing cricket. It has 5 attributes and 14 instances. The class label
“play” classifies the output as “yes” or “no”.
A decision tree is a classification technique that consists of three components: the root node,
branches (edges or links), and leaf nodes. The root node represents a test condition on an
attribute, each branch represents one possible outcome of the test, and each leaf node holds
the label of the class to which an instance belongs. The root node sits at the start of the
tree, which is also called the top of the tree.
J48 Classifier
J48 is WEKA's implementation of the C4.5 algorithm (an extension of ID3) for generating
decision trees. It is also known as a statistical classifier. For decision tree classification,
we need a dataset.
Steps include:
#2) Select the weather.nominal.arff file via “Open file” under the Preprocess tab.
#3) Go to the “Classify” tab for classifying the data. Click on the “Choose” button and
select “trees -> J48”.
The output is a decision tree whose root attribute is “outlook”.
If the outlook is sunny, the tree further tests the humidity: if humidity is high, the class
label play = “no”; if humidity is normal, play = “yes”.
If the outlook is overcast, play = “yes”. The number of instances that obey this classification
is 4.
If the outlook is rainy, the tree tests the attribute “windy”: if windy = true, play = “no”;
if windy = false, play = “yes”. The number of instances that obey the classification for
outlook = rainy and windy = true is 2.
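The tree described above can be written out as a plain function and checked against the 14 instances of weather.nominal.arff (a hand-coded sketch of J48's output, not J48 itself; the unused temperature attribute is omitted):

```python
def predict_play(outlook, humidity, windy):
    """Hand-coded version of the J48 tree learned from weather.nominal:
    outlook is the root; sunny branches on humidity, rainy on windy."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    # outlook == "rainy"
    return "no" if windy else "yes"

# The 14 instances of weather.nominal.arff: (outlook, humidity, windy, play)
data = [
    ("sunny", "high", False, "no"),      ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),  ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),   ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"), ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),   ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),    ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),("rainy", "high", True, "no"),
]
correct = sum(predict_play(o, h, w) == play for o, h, w, play in data)
print(f"{correct}/14 training instances classified correctly")  # 14/14
```

Perfect accuracy on the training data is expected here; on unseen data the tree would not necessarily be this good.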
Conclusion
WEKA ships with a wide range of sample datasets on which users can practice machine learning
tasks such as classification, regression, attribute selection, and association while learning
the tool. The Explorer's Preprocess tab takes a .arff file as input and exposes the dataset's
attributes for statistical analysis and comparison with class labels, and filters such as
RemoveType can separate the numeric attributes from the nominal ones. For classification, the
J48 algorithm outputs a decision tree that can be visualized from the Classify tab. If the
tree is too dense, it can be simplified by removing the attributes that are not required in
the Preprocess tab and running the classification again.