Weka Activity Report
ASSIGNMENT REPORT
Submitted by
Shilpa V 1JS18IS081
Spoorthi Kulkarni 1JS18IS091
of 7th B
2021-2022
J S S Mahavidyapeetha
J S S Academy of Technical Education, Bengaluru-60
ACTIVITY-No: 1,2,3
Self-Learning: Explore WEKA tool: A Machine Learning Software
Learning Objectives:
Sl. No.  Objective
1        Students will be able to learn the WEKA tool and make use of the tool to understand the knowledge discovery process.

Course Outcomes:
C301.1  Apply different search techniques to find solutions for various artificial intelligence problems. (L3)
C301.2  Identify the various knowledge representation techniques. (L3)
C301.3  Utilize the decision tree learning algorithms and ANN methods for approximating real, discrete and vector valued target functions. (L3)
C301.4  Identify the Bayesian perspective on machine learning. (L3)
C301.5  Develop solutions using instance based and reinforcement learning techniques. (L3)
Mode of Submission:
1. Detailed information should be written in the Assignment Blue Book for evaluation.
2. Screenshots should be neatly arranged with proper captions wherever necessary.
3. Each student has to submit a separate blue book.
Evaluation
Activity No.  Work to be carried out                                                          Marks
1             Write about Weka; installation of Weka with a brief explanation of the          10
              steps along with the necessary screenshots.
2             Implement any one algorithm from classification, clustering, or                 10
              association rule mining.
3             My observations: write in your own words and demonstrate.                       10
References:
(Seamless integration of NPTEL lectures or other learning materials from the Web into classroom teaching.)
1. Demonstration on WEKA
https://www.youtube.com/watch?v=UDGI3R7wyG0
2. Weka demo and how to read the results
https://www.youtube.com/watch?v=XlbM9ibjUuM
3. Machine Learning #24 Weka Tutorial for Beginners
https://www.youtube.com/watch?v=yseeBbEluik
ACTIVITY 1
1. ABOUT WEKA
Waikato Environment for Knowledge Analysis (Weka), developed at the University of
Waikato, New Zealand, is free software licensed under the GNU General Public License.
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to these functions. The original
version was primarily designed as a tool for analyzing data from agricultural domains, but the
more recent fully Java-based version (Weka 3), is now used in many different application areas,
in particular for educational purposes and research. Advantages of Weka include its free availability under the GNU General Public License, its portability (being fully implemented in Java, it runs on almost any modern computing platform), a comprehensive collection of data preprocessing and modelling techniques, and its ease of use thanks to the graphical user interfaces.
2. INSTALLATION OF WEKA WITH BRIEF EXPLANATION OF STEPS
Step 1: Search for the Weka tool free download and click on the first link, i.e. sourceforge.net. Click on the Downloads button to start the download.
Step 2: Click on the .exe file in your Downloads folder and run the installer. Click on Next when presented with the installation window.
Step 3: Click on I Agree when presented with the Licence Agreement to go to the next step.
Step 5: Choose your installation location
Step 6: Choose where you want the application to appear on the Start Menu
Step 7: Wait for the application to finish installing
ACTIVITY 2: Algorithm Execution
and
ACTIVITY 3: Observation of Results
INTRODUCTION
What is Machine Learning?
Machine learning is the study of computer algorithms that can improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence. It is a method of data
analysis that automates analytical model building. A branch of artificial intelligence, which is
based on the idea that systems can learn from data, identify patterns and make decisions with
minimal human intervention.
Techniques used:
Machine learning uses two types of techniques: supervised learning, which trains a model on
known input and output data so that it can predict future outputs, and unsupervised learning, which
finds hidden patterns or intrinsic structures in input data.
Supervised Learning
Supervised machine learning builds a model that makes predictions based on evidence in the
presence of uncertainty. A supervised learning algorithm takes a known set of input data and
known responses to the data (output) and trains a model to generate reasonable predictions for the
response to new data.
Classification techniques predict discrete responses. Classification models classify input data
into categories. Typical applications include medical imaging, speech recognition, and credit
scoring. For example, applications for hand-writing recognition use classification to recognize
letters and numbers.
Common algorithms for performing classification include k-nearest neighbor, Naïve
Bayes, discriminant analysis, logistic regression, and neural networks.
Regression techniques predict continuous responses—for example, changes in temperature or
fluctuations in power demand.
Common regression algorithms include linear models, nonlinear models, and regularization methods.
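As a quick illustration of the two supervised-learning families described above, here is a minimal sketch in Python using scikit-learn rather than Weka (the library and dataset choices are ours, not part of the original report):

```python
# Minimal sketch of supervised learning, using scikit-learn instead of
# Weka (an assumption of this example, not the report's own tool).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Classification: predict a discrete response (the iris species).
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(X[:1]))      # class index of the first flower

# Regression: predict a continuous response. Here we (artificially)
# predict petal length from the remaining three measurements.
features = np.delete(X, 2, axis=1)   # drop the petal-length column
target = X[:, 2]                     # petal length as the response
reg = LinearRegression().fit(features, target)
print(round(reg.score(features, target), 2))   # R^2 on training data
```

The same split holds in Weka: classifiers and regression schemes both live under the Classify tab, and the type of the class attribute decides which applies.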
Unsupervised Learning
Unsupervised learning finds hidden patterns or intrinsic structures in data. It is used to draw
inferences from datasets consisting of input data without labeled responses.
Clustering is the most common unsupervised learning technique. It is used for exploratory data
analysis to find hidden patterns or groupings in data. Applications for cluster analysis include gene
sequence analysis, market research, and object recognition.
Machine learning algorithms are not easy to understand and implement; they can be quite difficult and complex, especially for first-time learners, and the subject demands considerable skill and effort from students. It is therefore important that students are encouraged to start studying the subject appropriately. The development of algorithmic and problem-solving thinking matters not only in the school environment but also for a large number of activities in real life.
Using a visual software tool to teach a machine learning algorithm raises first-time learners' interest more than using code alone. However, seeing the algorithm together with its code enables participants to understand better what is being taught, and such participants report understanding the material better than tool-only users. To strengthen work on machine learning, it would therefore be most appropriate for trainers first to raise students' interest with visual tools and then explain the algorithms with code. Software tools help increase students' interest and prevent prejudice about the difficulty of machine learning algorithms, but the code should still be explained so that students understand the algorithms better and believe they could implement them themselves.
ALGORITHMS:
1. Classification → ID3, J48
2. Neural Network → Multilayer Perceptron
3. Clustering → EM, K-Means
4. Association Rule Mining → Apriori
Open the Weka Application and click on the Explorer option on the right-hand bar.
1. CLASSIFICATION
Under Classification, we have chosen 2 algorithms: ID3 and J48
Step 2: Click on the J48 option in the classify tab in the choose option and click on Start.
(Using Training Set)
(Using Cross-validation)
Step 3: Right click on the trees.J48 entry in the result list and click on Visualize tree to view the decision tree.
A decision tree is a classification structure with three components: a root node, branches (edges or links), and leaf nodes. The root represents the test condition on an attribute, each branch represents one possible outcome of that test, and each leaf node carries the label of the class to which an instance belongs. The root node is at the start of the tree, also called the top of the tree. The J48 classifier is Weka's implementation of C4.5 (an extension of ID3) for generating decision trees; it is also known as a statistical classifier. Decision tree classification requires a dataset.
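The same idea can be sketched outside Weka. The snippet below is a hypothetical Python example that fits a decision tree to a hand-encoded copy of the weather.nominal data; scikit-learn's tree is CART-based, so its exact splits may differ from J48's C4.5 tree.

```python
# Decision-tree classification on a hand-encoded copy of the
# weather.nominal data (CART via scikit-learn, so the tree may differ
# from Weka's J48/C4.5 in its exact splits).
from sklearn.tree import DecisionTreeClassifier

# Encoding: outlook 0=sunny 1=overcast 2=rainy; temperature 0=hot
# 1=mild 2=cool; humidity 0=high 1=normal; windy 0=false 1=true.
X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0],
     [2, 2, 1, 0], [2, 2, 1, 1], [1, 2, 1, 1], [0, 1, 0, 0],
     [0, 2, 1, 0], [2, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1],
     [1, 0, 1, 0], [2, 1, 0, 1]]
y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes',
     'yes', 'yes', 'yes', 'yes', 'no']   # the "play" class label

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))   # training-set accuracy: 1.0, as in the report
```

As in the Weka run, evaluating on the training set yields 100% accuracy, which is exactly why cross-validation gives a more honest estimate.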
The dataset used to run the J48 classifier is weather.nominal, which records whether the weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances, and the class label "play" takes the values "yes" or "no". Using the training set for evaluation, 'Correctly Classified Instances' measures the accuracy over the training data: the model here gives 100% accuracy. Using cross-validation, the same measure gives 50% accuracy.
The output is in the form of a decision tree. The main attribute is “outlook”.
• If the outlook is sunny, the tree further tests humidity: if humidity is high then play = "no", otherwise play = "yes".
• If the outlook is overcast, the class label, play is “yes”. The number of instances which
obey the classification is 4.
• If the outlook is rainy, the tree further tests the attribute "windy": if windy = true, then play = "no". The number of instances which obey the classification for outlook = rainy and windy = true is 2.
The row indicates the true class and the column the classifier output; each entry gives the number of instances of <row> that were classified as <column>. In the training-set output, no b's were classified as 'a' and no a's were classified as 'b'. The correct classifications are 9 and 5, summing to 14 and indicating 100% accuracy.
a b <-- classified as
5 4 | a = yes
3 2 | b = no
In the cross-validation output above, 3 b's were classified as 'a' and 4 a's were classified as 'b'. The correct classifications are 5 and 2, summing to 7 out of 14 and indicating 50% accuracy.
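Reading such a matrix can be sketched in a few lines of Python (the matrix below is the cross-validation matrix shown above):

```python
# Reading a Weka-style confusion matrix: rows are true classes,
# columns are classifier outputs, so the diagonal holds the
# correctly classified instances.
matrix = [[5, 4],   # true class a = yes
          [3, 2]]   # true class b = no

correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
print(correct, total, correct / total)   # 7 14 0.5
```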
ID3 Algorithm (dataset used: weather.nominal)
Here, we can install the simpleEducationalLearningSchemes package through the Weka package manager to get the ID3 algorithm. Now we can choose the ID3 option under the trees folder in the Classify tab, after clicking on Choose. The following output can be seen:
(Using Training Set)
(Using Cross-Validation)
The J48 model is generally the more accurate of the two because C4.5, as an extension of ID3, accounts for missing attribute values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. The result in any particular case, however, only reflects the kind of dataset used. ID3 can be used when a faster, simpler result is needed without taking into account all the additional factors that J48 considers.
2. CLUSTERING
Under Clustering , we have chosen 2 algorithms: EM Algorithm and K Means
Step 2: Choose SimpleKMeans from the Choose option in the Cluster tab. Click on Start and observe the output.
Step 3: Right click on the SimpleKMeans entry in the result list and select Visualize cluster assignments.
K-Means Clustering Algorithm Observation
Clustering is one of the most common exploratory data analysis techniques used to get an intuition
about the structure of the data. It can be defined as the task of identifying subgroups in the data
such that data points in the same subgroup (cluster) are very similar while data points in different
clusters are very different. The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
The dataset has 6 attributes (Id, SepalLength, SepalWidth, PetalLength, PetalWidth and Species) and a total of 149 instances. We need to group this dataset into clusters. We first compute the statistical variation in the data: the mean, the minimum and the standard deviation. The data points are then plotted and clustered based on their similarity, and the distance between the clusters is calculated.
If it is found that the distance between one node to another is less, then the cluster’s centroid shifts
and the cluster updates itself. The number of iterations required to make the cluster is analysed and
the output is projected i.e., the number of iterations required to calculate the number of clusters=7.
The within-cluster sum of squared errors is 62.1436882815. The initial starting points, with their species, are computed and shown iteration-wise; the distances between the points are calculated and the clusters are formed. The clusters are then visualised in different illustrative colours (red, blue and purple). The number of clusters for the given instances is 3, with a mean distance of 74.5 and a difference of 6.1 on the Y-axis, as depicted in the scatter plot.
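The same procedure can be sketched in Python with scikit-learn's KMeans on the built-in iris measurements; since the tool and preprocessing differ from the Weka run above, the exact SSE and iteration count here will not match the numbers reported.

```python
# K-means in brief: repeatedly assign each point to its nearest
# centroid, then move each centroid to the mean of its points, until
# the assignments stop changing. Sketched with scikit-learn on the
# built-in iris data, so the exact SSE and iteration count will not
# match the Weka run described above.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.n_iter_)                  # iterations needed to converge
print(round(km.inertia_, 2))       # within-cluster sum of squared errors
print(sorted(set(km.labels_)))     # the three cluster ids: [0, 1, 2]
```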
Step 2: Navigate to where the datasets are stored. We have chosen the weather numeric dataset for
this.
Step 3: Under the Cluster tab, click on Choose to select the algorithm to be applied. Then select
the EM option
Step 4: By right-clicking on the result list, we get the option Visualize cluster assignments. Click on that to see the cluster map.
K Means Algorithm vs EM Algorithm
Expectation Maximization (EM) is a clustering algorithm that finds the statistical parameters of the underlying sub-populations in the dataset by maximizing the likelihood (we will not go into the probabilistic theory behind EM here). The algorithm alternates between two steps. In the E-step it finds a lower-bound function on the original likelihood using the current estimate of the statistical parameters; in the M-step it finds new estimates of those parameters by maximizing the lower-bound function (i.e. it determines the MLE of the statistical parameters).
Neither of the algorithms has the ability to detect outliers and hence we must pre-process the data
to mitigate the effect of outliers from the detected clusters. EM, in particular, tends to be sensitive
to the presence of outliers as there are no constraints on the covariance matrix.
Unlike K-means, in EM, the clusters are not limited to spherical shapes. In EM we can constrain
the algorithm to provide different covariance matrices (spherical, diagonal and generic). These
different covariance matrices in return allow us to control the shape of our clusters and hence we
can detect sub-populations in our data with different characteristics.
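The contrast can be sketched in scikit-learn, where GaussianMixture implements EM-based clustering and the covariance_type parameter controls the cluster shapes discussed above (a hypothetical side-by-side comparison, not the Weka run itself):

```python
# EM (Gaussian mixture) vs. K-means on the same data. The
# covariance_type argument is what lets EM fit non-spherical clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=3, covariance_type='full',
                     random_state=0).fit(X)

print(km.cluster_centers_.shape)       # (3, 4): one hard centroid per cluster
print(gm.means_.shape)                 # (3, 4): one mean per Gaussian component
print(gm.predict_proba(X[:1]).shape)   # (1, 3): soft cluster memberships
```

Note the last line: unlike K-means' hard assignments, EM yields a probability of membership in each cluster, which is what makes its clusters flexible in shape.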
3. ANN
For a Neural Network, we have chosen the Multilayer Perceptron algorithm
Step 2: Click on MultilayerPerceptron in the Classify tab and click on Start.
To get the neural network as a graphical representation, we have to click on the classifier option of MultilayerPerceptron, which opens up the following dialogue box.
Here, the GUI option must be set to True; then click on OK.
The multilayer perceptron is among the most complex architectures of artificial neural networks, being formed from multiple layers of perceptrons. A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set of outputs from a set of inputs; it is characterized by several layers of nodes connected as a directed graph between the input and output layers. An MLP uses backpropagation to train the network and is a deep learning method.
The dataset used for the multilayer perceptron is iris.arff, which contains 150 instances and 5 attributes.
Backpropagation repeatedly adjusts the weights so as to minimize the difference between the actual output and the desired output. Hidden layers of neuron nodes are stacked between the inputs and outputs, allowing the network to learn more complicated features.
'Correctly Classified Instances' measures the accuracy from the correctly classified instances; the model here gives 98.667% accuracy, with 148 correctly classified instances. The kappa value of 0.98 indicates excellent agreement between the observed and expected agreement.
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 1 49 | c = Iris-virginica
The row indicates the true class and the column the classifier output; each entry gives the number of instances of <row> that were classified as <column>. In our output, 1 'c' was classified as 'b' and 1 'b' was classified as 'c'.
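A comparable MLP can be sketched with scikit-learn's MLPClassifier, which also trains by backpropagation (the hidden-layer size and iteration count are our own arbitrary choices, not Weka's defaults):

```python
# A multilayer perceptron on iris, trained by backpropagation: the
# weights are repeatedly adjusted to shrink the gap between actual and
# desired outputs. Hidden-layer size and max_iter are arbitrary
# choices for this sketch.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=3000,
                    random_state=0).fit(X, y)
print(round(mlp.score(X, y), 3))   # training accuracy, close to 1.0
```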
4. ASSOCIATION RULE MINING
For Association rule mining, we have chosen Apriori algorithm
Step 2: Under the Associate tab, we can choose the Apriori option, after the usual data pre-processing.
We have chosen nominal weather data for this as well.
Step 3: After selecting the algorithm, we can run it to get the following result.
Apriori Algorithm Observation
Association rule learning is a basic rule-based machine learning technique used for finding interesting relations between variables in large databases.
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. The algorithm uses a breadth-first search and a hash tree to count itemset occurrences efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.
The supermarket.arff database is used to implement Apriori Algorithm. Apriori algorithm expects
data that is purely nominal without numeric attributes.
The database contains 4627 instances and 217 attributes. It can be seen how difficult it would be
to detect the association between such a large number of attributes. This task is therefore automated
with the help of Apriori algorithm.
The algorithm finds the 10 most probable association rules. The rule biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 means that out of 4627 instances, 788 contain biscuits, frozen foods and fruit with a high total, and 723 of those also contain bread and cake. This gives a strong association, with a confidence level of 723/788 ≈ 0.92.
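The support and confidence arithmetic behind such a rule can be sketched in plain Python on a toy transaction list (the item names come from the rule above, but the four transactions are invented for illustration):

```python
# Support/confidence arithmetic behind an association rule, on a toy
# invented transaction list (the real supermarket.arff has 4627).
transactions = [
    {'biscuits', 'frozen foods', 'fruit', 'bread and cake'},
    {'biscuits', 'fruit', 'bread and cake'},
    {'frozen foods', 'fruit'},
    {'biscuits', 'frozen foods', 'fruit', 'bread and cake'},
]

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

antecedent = {'biscuits', 'fruit'}
consequent = {'bread and cake'}
conf = support_count(antecedent | consequent) / support_count(antecedent)
print(conf)   # 3/3 = 1.0 on this toy data; 723/788 ≈ 0.92 in the report
```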
EVALUATION
1.
2.
3.