Weka Activity Report

This document describes an assignment for students to explore the WEKA machine learning software tool. It provides instructions for students to download and install WEKA, preprocess data, and use classification, clustering, and association rule mining algorithms. Students are asked to interpret the outputs and write observations. The assignment aims to help students learn about the knowledge discovery process using a machine learning tool. It will be evaluated based on students' documentation of the WEKA installation process, implementation of an algorithm, and written observations.

JSS ACADEMY OF TECHNICAL EDUCATION

JSS Campus, Dr.Vishnuvardhan Road, Bengaluru-560060

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

ASSIGNMENT REPORT
ON

Explore WEKA tool: A Machine Learning Software


For the course:
ARTIFICIAL INTELLIGENCE AND
MACHINE LEARNING (18CS71)

Submitted by

Shilpa V 1JS18IS081
Spoorthi Kulkarni 1JS18IS091

of 7th Semester, Section B

Under the Guidance of

Dr. Malini M Patil


Assoc. Prof., Dept. of ISE, JSSATE, Bengaluru

2021-2022

J S S Mahavidyapeetha
J S S Academy of Technical Education, Bengaluru-60

2018 SCHEME (60:40)

NAME OF THE FACULTY: Dr Malini M Patil TERM: OCT-JAN 2021


SUB: Artificial Intelligence and Machine Learning SEM/SEC: 7/B
SUB CODE: 18CS71 MODULES: 1 TO 5
MAXIMUM MARKS: 30 DURATION: 3 hours
DATE: 27-11-2021 TIME: 1:45 to 4:45 PM

ACTIVITY-No: 1,2,3
Self-Learning: Explore WEKA tool: A Machine Learning Software

Learning Objectives:
Sl.No Objective
1 Students will be able to learn the WEKA tool and make use of the tool to
understand the Knowledge discovery process.

Relevance of Course Outcomes:


CO. NO.   COURSE OUTCOMES STATEMENT                                             BLL

C301.1    Apply different search techniques to find solutions for various
          artificial intelligence problems                                      L3
C301.2    Identify the various knowledge representation techniques              L3
C301.3    Utilize the decision tree learning algorithms and ANN methods for
          approximating real, discrete and vector valued target functions.      L3
C301.4    Identify the Bayesian perspective on machine learning                 L3
C301.5    Develop solutions using instance based and reinforcement learning
          techniques.                                                           L3

Relevance of COxPO and COxPSO Mapping


CO’s         NAME OF ACTIVITY                     RELEVANCE OF PO AND PSO MAPPING

CO1 to CO5   Explore WEKA tool: A Machine         PO1, PO2, PO3, PO4, PO5, PO6, PO7,
             Learning Software                    PO8, PO9, PO10, PO12
                                                  PSO1, PSO2, PSO3

Description: Artificial intelligence (AI) is a technology that enables a machine to simulate
human behavior. Machine learning is a subset of AI that allows a machine to learn automatically
from past data without being programmed explicitly. The activity focuses on covering important
features of AI and ML by learning, exploring, and understanding the machine learning software
WEKA. Weka is a collection of machine learning algorithms for data mining tasks. It contains
tools for data preparation, classification, regression, clustering, association rule mining, and
visualization.
Found only on the islands of New Zealand, the Weka is a flightless bird with an inquisitive nature.
Weka is open-source software issued under the GNU General Public License. The student has to
download the weka software from the website https://www.cs.waikato.ac.nz/ml/weka/tool. Install
the same. Explore and understand behavior of Machine learning algorithms and Interpret the
Knowledge discovery process.
Steps to be followed for Activity-1, 2, 3
Note: Sample demo will be shown in the class.
1. Activity should be conducted in a batch of two students.
2. Download the weka tool from https://www.cs.waikato.ac.nz/ml/weka/tool
3. Install it.
4. Explain the steps of installation with proper screenshots.
5. Conduct the data pre-processing.
6. Identify the algorithms from Association Rule Mining, Classification and Clustering.
7. Identify the data set. Reference https://archive.ics.uci.edu/ml/index.php Or use the inbuilt
data set from the weka tool.
8. Interpret the Output as per the selected algorithm.
9. Write your Observations.
10. Date for submission: 31-12-2021

Mode of Submission:
1. Detailed information should be written in the Assignment Blue Book for evaluation.
2. Screenshots should be neatly arranged with proper captions wherever necessary.
3. Each student has to submit a separate blue book.

Evaluation

Activity No.   Work to be carried out                                             Marks
1              Write about Weka; installation of Weka with a brief explanation
               of the steps, along with necessary screenshots.                    10
2              Implement any one algorithm from classification, clustering,
               association rule mining.                                           10
3              My observations: write in your own words and demonstrate.          10

References:
(Seamless integration of NPTEL lectures and other learning materials from the Web into
classroom teaching.)
1. Demonstration on WEKA
https://www.youtube.com/watch?v=UDGI3R7wyG0
2. Weka demo and how to read the results
https://www.youtube.com/watch?v=XlbM9ibjUuM
3. Machine Learning #24 Weka Tutorial for Beginners
https://www.youtube.com/watch?v=yseeBbEluik

ACTIVITY 1

1. ABOUT WEKA
Waikato Environment for Knowledge Analysis (Weka), developed at the University of
Waikato, New Zealand, is free software licensed under the GNU General Public License.
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to these functions. The original
version was primarily designed as a tool for analyzing data from agricultural domains, but the
more recent, fully Java-based version (Weka 3) is now used in many different application areas,
particularly for education and research. Advantages of Weka include:

• Free availability under the GNU General Public License.


• Portability, since it is fully implemented in the Java programming language and thus
runs on almost any modern computing platform.
• A comprehensive collection of data pre-processing and modelling techniques.
• Ease of use due to its graphical user interfaces.
Weka supports several standard data mining tasks, more specifically data
preprocessing, clustering, classification, regression, visualization, and feature selection.
Input to Weka is expected to be formatted according to the Attribute-Relation File Format
(ARFF), with the filename bearing the .arff extension. Weka is not capable of multi-relational
data mining, but there is separate software for converting a collection of linked database tables
into a single table suitable for processing with Weka. Another important area currently not
covered by the algorithms included in the Weka distribution is sequence modeling.
The version of Weka used in this report is 3.8.5.
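As an illustration of the ARFF format, a minimal file in the style of Weka's bundled weather.nominal dataset looks like the sketch below (a shortened example with only three data rows; the bundled file has 14):

```text
% Comment lines begin with a percent sign
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
```

The header declares the relation name and each attribute with its nominal values, and every line after @data is one instance with comma-separated values in attribute order.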

2. INSTALLATION OF WEKA WITH BRIEF EXPLANATION OF STEPS

Step 1: Search for "Weka tool free download" and click on the first link, i.e. sourceforge.net.
Click on the Download button to start the download.

Step 2: Click on the .exe file in your downloads and run the installer. Click on Next when
presented with the installation window.

Step 3: Click on I Agree when presented with the Licence Agreement to go to the next step.

Step 4: Choose the components to install; the installation can be Full, Custom, or Minimal.

Step 5: Choose your installation location

Step 6: Choose where you want the application to appear on the Start Menu

Step 7: Wait for the application to finish installing

Step 8: Finish the installation

ACTIVITY 2 : Algorithm Execution
and
ACTIVITY 3: Observation of Results

Implement any one algorithm from classification, clustering, association rule mining, or
neural networks. Interpret the results and write your observations. The algorithms are chosen
under the following traditional machine learning techniques.

INTRODUCTION
What is Machine Learning?

Machine learning is the study of computer algorithms that can improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence. It is a method of
data analysis that automates analytical model building, based on the idea that systems can learn
from data, identify patterns, and make decisions with minimal human intervention.

How Do Algorithms Influence Machine Learning?

The main benefit of machine learning processes is cost-effective solutions that are executed
without human intervention. Algorithms allow machines to follow sets of instructions to perform
tasks; algorithms also help machines choose which set of instructions can yield better results.
This leads to powerful insights that can be used to predict future outcomes. Machine learning
algorithms do all of that and more, using statistics to find patterns in vast amounts of data
encompassing images, numbers, words, and more.

Techniques used:
Machine learning uses two types of techniques: supervised learning, which trains a model on
known input and output data so that it can predict future outputs, and unsupervised learning, which
finds hidden patterns or intrinsic structures in input data.

Supervised Learning
Supervised machine learning builds a model that makes predictions based on evidence in the
presence of uncertainty. A supervised learning algorithm takes a known set of input data and
known responses to the data (output) and trains a model to generate reasonable predictions for the
response to new data.

Classification techniques predict discrete responses. Classification models classify input data
into categories. Typical applications include medical imaging, speech recognition, and credit
scoring. For example, applications for hand-writing recognition use classification to recognize
letters and numbers.
Common algorithms for performing classification include k-nearest neighbor, Naïve
Bayes, discriminant analysis, logistic regression, and neural networks.
Regression techniques predict continuous responses—for example, changes in temperature or
fluctuations in power demand.
Common regression algorithms include linear models, nonlinear models, and regularization.
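As a toy illustration of the classification idea above (a plain-Python sketch, not tied to Weka's implementation; the training points and function name are invented for this example), a minimal nearest-neighbour classifier fits in a few lines:

```python
def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`."""
    def sq_dist(p, q):
        # squared Euclidean distance between two feature vectors
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # pick the (features, label) pair with the smallest distance to the query
    return min(train, key=lambda item: sq_dist(item[0], query))[1]

# two labeled groups of points; the query lands near the "no" group
train = [((1.0, 1.0), "no"), ((1.2, 0.8), "no"), ((5.0, 5.0), "yes")]
print(nearest_neighbor(train, (1.1, 1.0)))  # prints "no"
```

Real k-nearest-neighbour classifiers generalize this by voting over the k closest points rather than taking only the single nearest one.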

Unsupervised Learning
Unsupervised learning finds hidden patterns or intrinsic structures in data. It is used to draw
inferences from datasets consisting of input data without labeled responses.

Clustering is the most common unsupervised learning technique. It is used for exploratory data
analysis to find hidden patterns or groupings in data. Applications for cluster analysis include gene
sequence analysis, market research, and object recognition.

Difference between Tool-based and Programming-based ML Algorithms

Understanding and implementing machine learning algorithms is not easy; it can be quite
difficult and complex, especially for first-time learners. The subject demands considerable skill
and effort from students, so it is important that they are encouraged to study it appropriately.
The development of algorithmic and problem-solving thinking matters not only in the school
environment but also in many real-life activities.

Using a visual software tool to teach machine learning algorithms raises the interest of
first-time learners more than code alone does. However, seeing an algorithm alongside its code
helps learners better understand what is being taught; such learners report understanding the
material better than tool-only users. It would therefore be appropriate for trainers to first raise
students' interest with visual tools and then explain the algorithms with code. Visual tools help
prevent prejudice about the difficulty of machine learning algorithms, while walking through the
code strengthens students' belief that they could implement the algorithms themselves.

ALGORITHMS:
1. Classification → ID3, J48
2. Neural Network → Multilayer Perceptron
3. Clustering → EM, K-Means
4. Association Rule Mining → Apriori

Open the Weka Application and click on the Explorer option on the right-hand bar.

1. CLASSIFICATION
Under Classification, we have chosen 2 algorithms: ID3 and J48

J48 Algorithm DB USED- WeatherNominal


Step 1: Select the required database.

Step 2: Under the Classify tab, click on Choose, select the J48 option, and click on Start.
(Using Training Set)

(Using Cross-validation)

Step 3: Right click on the trees- J48 and click on visualise tree to view the decision tree.

J48 Algorithm Observation

A decision tree is a classification structure with three components: a root node, branches (edges
or links), and leaf nodes. The root represents the test condition on an attribute, a branch
represents a possible outcome of that test, and a leaf node holds the label of the class to which
an instance belongs. The root node is at the top of the tree. The J48 classifier is Weka's
implementation of C4.5 (an extension of ID3), which generates a decision tree; it is also known
as a statistical classifier. Decision tree classification requires a dataset.

The dataset used to run the J48 classifier is weather.nominal, which predicts whether the
weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class
label "play" classifies the output as "yes" or "no". When evaluating on the training set,
'Correctly Classified Instances' measures the accuracy from the correctly classified instances;
the model here gives 100% accuracy. Under cross-validation, the same measure gives 50%
accuracy.

The output is a decision tree whose root attribute is "outlook".
• If the outlook is sunny, the tree further tests humidity: if humidity is high, play = "no";
  if humidity is normal, play = "yes".
• If the outlook is overcast, play = "yes". The number of instances that follow this branch is 4.
• If the outlook is rainy, the tree further tests the attribute "windy": if windy = true,
  play = "no". The number of instances that follow the branch outlook = rainy and
  windy = true is 2.

While using the training set, the confusion matrix obtained is

a b <-- classified as
9 0 | a = yes
0 5 | b = no

The row indicates the true class and the column indicates the classifier output. Each entry gives
the number of instances of <row> that were classified as <column>. In our output no b's were
classified as 'a' and no a's were classified as 'b'. The 9 and 5 on the diagonal are the correct
classifications, summing to 14 and indicating 100% accuracy.
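The accuracy figure Weka reports can be reproduced from any confusion matrix by summing the diagonal and dividing by the total, as in this small Python sketch (written for this report; the matrix values are the training-set results above):

```python
def accuracy(confusion):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

# training-set confusion matrix: rows = true class, columns = predicted class
training = [[9, 0],
            [0, 5]]
print(accuracy(training))  # 1.0, i.e. 100% accuracy
```

Running the same function on the cross-validation matrix [[5, 4], [3, 2]] gives (5 + 2) / 14 = 0.5, matching the 50% figure.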

While using cross-validation, the confusion matrix obtained is

a b <-- classified as
5 4 | a = yes
3 2 | b = no

The row indicates the true class and the column indicates the classifier output. Each entry gives
the number of instances of <row> that were classified as <column>. In our output 3 b's were
classified as 'a' and 4 a's were classified as 'b'. The 5 and 2 on the diagonal are the correct
classifications, summing to 7 out of 14 and indicating 50% accuracy.

ID3 Algorithm DB USED- WeatherNominal
Here, we can download the simpleEducationalLearningSchemes package to get the ID3 algorithm.

Now, after clicking on Choose, we can select the ID3 algorithm under the trees folder in the
Classify tab. The following output can be seen:
(Using Training Set)

(Using Cross-Validation)

ID3 Algorithm vs J48 Algorithm


The ID3 algorithm (Iterative Dichotomiser 3) is a classification algorithm that follows a greedy
approach, building a decision tree by selecting at each step the attribute that yields maximum
Information Gain (IG), i.e. minimum entropy (H).

J48, based on C4.5, is an extension of ID3 that additionally handles missing values, continuous
attribute value ranges, pruning of decision trees, rule derivation, and so on, and is generally the
more refined model. The result in this case only reflects the particular dataset used: ID3 can be
preferred when a faster, simpler result is needed without the additional factors that J48 considers.

Characteristics                      J48 Algorithm    ID3 Algorithm

Correctly Classified Instances       7                12
Accuracy                             50%              85.7143%
Incorrectly Classified Instances     7                2
Kappa statistic                      -0.0426          0.6889
Mean absolute error                  0.4167           0.1429
Root mean squared error              0.5984           0.378
Relative absolute error              87.5%            30%
Root relative squared error          121.2987%        76.6097%

Confusion matrix (J48):              Confusion matrix (ID3):
a b <-- classified as                a b <-- classified as
5 4 | a = yes                        8 1 | a = yes
3 2 | b = no                         1 4 | b = no
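The attribute selection that both ID3 and C4.5 perform can be made concrete numerically. The Python sketch below (written for this report, not taken from Weka) computes the entropy of the play class and the information gain of splitting weather.nominal on outlook:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# weather.nominal: the outlook value and play label of each of the 14 instances
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["no", "no", "no", "yes", "yes"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]

base = entropy(play)  # H(play) for 9 yes / 5 no, about 0.940 bits
# information gain of splitting on outlook: H(play) minus the weighted
# entropy of play within each outlook value
gain = base - sum(
    (outlook.count(v) / len(play)) * entropy([p for o, p in zip(outlook, play) if o == v])
    for v in set(outlook)
)
print(round(base, 3), round(gain, 3))  # 0.94 0.247
```

Outlook has the highest gain of the four attributes, which is why both trees place it at the root.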

2. CLUSTERING
Under Clustering , we have chosen 2 algorithms: EM Algorithm and K Means

K MEANS Algorithm DB USED – Iris


Step 1: Choose a dataset that comes preinstalled with Weka by selecting the ‘Open File’ option
under the ‘Preprocess’ tab. We have chosen the Iris dataset for this.

Step 2: Under the Cluster tab, click on Choose and select the SimpleKMeans option. Click on
Start and observe the output.

Step 3: Right click on the SimpleKMeans button on the result tab and select View Cluster
Assignments.

K-Means Clustering Algorithm Observation

Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. K-means is an iterative algorithm that tries to partition the
dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point
belongs to exactly one group.

The dataset has 6 attributes (Id, SepalLength, SepalWidth, PetalLength, PetalWidth, and
Species) and a total of 149 instances. We need to partition this dataset into clusters. Weka first
computes the statistical summary of the data: the mean, minimum, and standard deviation.
Instances are then assigned to clusters based on their similarity, and the distance between the
clusters is calculated.

If a point is found to be closer to another cluster's centroid, the centroid shifts and the cluster
updates itself. The number of iterations required to form the clusters is reported in the output;
here 7 iterations were needed. The within-cluster sum of squared errors is 62.1436882815. The
initial starting points per species are computed and shown iteration-wise, the distances are
recalculated, and the clusters are formed. The clusters are then visualised in different illustrative
colours (red, blue, and purple) per instance. The number of clusters for the given instances is 3,
with mean distance 74.5 and a difference of 6.1 on the Y-axis, as depicted in the scatter plot.
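The assignment/update loop described above can be sketched in plain Python (a simplified Lloyd's iteration, not Weka's SimpleKMeans; the toy points and starting centroids are invented for the example):

```python
def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    k = len(centroids)
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# two well-separated blobs of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(pts, centroids=[(0, 0), (10, 10)])
print([len(g) for g in groups])  # [3, 3]
```

With one starting centroid near each blob, the loop converges in a single pass; real implementations also handle initialization (e.g. picking random seeds) and stop early once assignments no longer change.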

EM Algorithm DB USED – Weathernominal


Step 1: Choose a dataset by selecting the ‘Open File’ option under the ‘Preprocess’ tab

Step 2: Navigate to where the datasets are stored. We have chosen the weather.nominal dataset
for this.

Step 3: Under the Cluster tab, click on Choose to select the algorithm to be applied. Then select
the EM option

Step 4: By right-clicking on the Result list, we can get the option to Visualize Cluster
Arrangements. Click on that to see the cluster map.

K Means Algorithm vs EM Algorithm

Expectation Maximization (EM) is a clustering algorithm that finds the statistical parameters of
the underlying sub-populations in the dataset by maximizing the likelihood. Without going into
the probabilistic theory behind it, the EM algorithm alternates between two steps (the E-step and
the M-step). In the E-step, the algorithm builds a lower-bound function on the original likelihood
using the current estimate of the statistical parameters. In the M-step, it finds new estimates of
those parameters by maximizing the lower-bound function (i.e., determining the MLE of the
statistical parameters).

Neither algorithm can detect outliers, so we must pre-process the data to mitigate their effect on
the detected clusters. EM, in particular, tends to be sensitive to outliers because there are no
constraints on the covariance matrix. Unlike K-means, EM clusters are not limited to spherical
shapes: we can constrain the algorithm to use different covariance matrices (spherical, diagonal,
and generic). These covariance matrices in turn let us control the shape of the clusters, so we can
detect sub-populations in the data with different characteristics.
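To make the E-step/M-step alternation concrete, here is a minimal two-component 1-D Gaussian mixture EM in Python (a didactic sketch under simplifying assumptions, not Weka's EM implementation; the data values are invented):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_1d(data, mu, var, w, iters=30):
    """Fit a 2-component 1-D Gaussian mixture; mu, var, w are length-2 lists."""
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        r = []
        for x in data:
            p0 = w[0] * normal_pdf(x, mu[0], var[0])
            p1 = w[1] * normal_pdf(x, mu[1], var[1])
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate weights, means, and variances from responsibilities
        n0 = sum(r)
        n1 = len(data) - n0
        w = [n0 / len(data), n1 / len(data)]
        mu = [sum(ri * x for ri, x in zip(r, data)) / n0,
              sum((1 - ri) * x for ri, x in zip(r, data)) / n1]
        var = [max(sum(ri * (x - mu[0]) ** 2 for ri, x in zip(r, data)) / n0, 1e-6),
               max(sum((1 - ri) * (x - mu[1]) ** 2 for ri, x in zip(r, data)) / n1, 1e-6)]
    return mu, var, w

# two clearly separated sub-populations around 1 and 11
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
mu, var, w = em_1d(data, mu=[0.0, 10.0], var=[1.0, 1.0], w=[0.5, 0.5])
print([round(m, 2) for m in mu])  # means settle near the two sub-populations
```

The 1e-6 floor on the variances is a crude guard against the variance collapsing to zero; Weka's multivariate EM handles this, and the choice of covariance structure, far more carefully.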

Characteristics                                   K-Means Algorithm    EM Algorithm

Number of iterations                              4                    1
Time taken to build model                         0 seconds            0.06 seconds
Number of clusters selected by cross-validation   2                    1
Clustered Instances                               0: 10 (71%)          0: 14 (100%)
                                                  1:  4 (29%)

3. ANN
For a Neural Network, we have chosen the Multilayer Perceptron algorithm

Multilayer Perceptron DB USED- Iris


Step 1 : Choose the required database.

Step 2 : Click on Multilayer perceptron in the classify tab and click on Start.

To get the neural network as a graphical representation, we have to click on the
MultilayerPerceptron classifier option, which opens up the following dialogue box.

Here, the GUI option must be set to True; then click on OK.

The multilayer perceptron is one of the most widely used architectures of artificial neural
networks, formed from multiple layers of perceptrons. A multilayer perceptron (MLP) is a
feed-forward artificial neural network that generates a set of outputs from a set of inputs; it is
characterized by several layers of nodes connected as a directed graph between the input and
output layers. An MLP uses backpropagation for training the network and is a basic deep
learning method.

The dataset used to run the multilayer perceptron is iris.arff. The database contains 150
instances and 5 attributes.

Backpropagation repeatedly adjusts the weights so as to minimize the difference between the
actual output and the desired output. Hidden layers of neuron nodes are stacked between the
inputs and outputs, allowing the network to learn more complicated features.

'Correctly Classified Instances' measures the accuracy from the correctly classified instances.
The model here gives 98.667% accuracy, with 148 correctly classified instances.

The kappa value of 0.98 indicates excellent agreement between observed agreement and
expected agreement.

a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 1 49 | c = Iris-virginica

The row indicates the true class and the column indicates the classifier output. Each entry gives
the number of instances of <row> that were classified as <column>. In our output 1 'c' was
classified as 'b' and 1 'b' was classified as 'c'.
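The feed-forward computation described above can be sketched in a few lines of Python (a hand-rolled sigmoid network with illustrative weights chosen for this report, not the network Weka trains):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums passed through the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = layer(x, w_hidden, b_hidden)   # input layer -> hidden layer
    return layer(hidden, w_out, b_out)      # hidden layer -> output layer

# 2 inputs -> 2 hidden units -> 1 output, with illustrative weights
out = forward([1.0, 0.0],
              w_hidden=[[0.5, -0.5], [-0.5, 0.5]], b_hidden=[0.0, 0.0],
              w_out=[[1.0, -1.0]], b_out=[0.0])
print(round(out[0], 4))
```

Backpropagation then works backwards through exactly these layers, adjusting each weight in proportion to its contribution to the output error.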

4. ASSOCIATION RULE MINING
For Association rule mining, we have chosen Apriori algorithm

Apriori Algorithm DB USED- Supermarket


Step 1 : Choose the required database.

Step 2: Under the Associate tab, after the usual data pre-processing, we can choose the Apriori
option. We have chosen the supermarket dataset for this.

Step 3 : After selecting the algorithm, we can run it to get the following result

Changing the number of Rules:

Apriori Algorithm Observation

Association rule learning is a rule-based machine learning technique used for discovering
interesting relations between variables in large databases.

The Apriori algorithm uses frequent item sets to generate association rules, and it is designed to
work on the databases that contain transactions. With the help of these association rules, it
determines how strongly or how weakly two objects are connected. This algorithm uses a breadth-
first search and Hash Tree to calculate the itemset associations efficiently. It is the iterative process
for finding the frequent item sets from the large dataset.

The supermarket.arff database is used to implement Apriori Algorithm. Apriori algorithm expects
data that is purely nominal without numeric attributes.

The database contains 4627 instances and 217 attributes. It is easy to see how difficult it would
be to detect associations between such a large number of attributes by hand; this task is therefore
automated with the help of the Apriori algorithm.

Minimum support is a parameter supplied to the Apriori algorithm to prune candidate rules, by
specifying a lower bound on the support measure of the resulting association rules. There is a
corresponding minimum-confidence pruning parameter as well. Each rule produced by the
algorithm has its own support and confidence measures.

The minimum support is not fixed: "lowerBoundMinSupport" and "upperBoundMinSupport"
define the support interval in which the algorithm works, and delta is the step by which the
support is changed.

The algorithm starts with an upperBoundMinSupport of 100%, which is iteratively decreased by
a delta of 5%; it stops at a minimum support of 15% after running 17 times.

The algorithm finds the 10 probable best association rules. The rule "biscuits=t frozen foods=t
fruit=t total=high 788 ==> bread and cake=t 723" means that out of 4627 instances, 788 have
biscuits=t, frozen foods=t, fruit=t, and total=high, and of those, 723 also have bread and cake=t.
This gives a strong association, with a confidence level of 0.92 (723/788).
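The support and confidence figures in a rule like the one above follow directly from their definitions, sketched here in Python on invented toy transactions (not the supermarket data):

```python
def support_count(transactions, itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Confidence of lhs ==> rhs: support(lhs with rhs) / support(lhs)."""
    return support_count(transactions, lhs | rhs) / support_count(transactions, lhs)

# toy basket data: each set is one transaction
baskets = [
    {"biscuits", "fruit", "bread"},
    {"biscuits", "fruit", "bread"},
    {"biscuits", "fruit"},
    {"bread", "milk"},
]
print(confidence(baskets, {"biscuits", "fruit"}, {"bread"}))  # 2/3
```

Applied to the Weka rule above, the same formula gives 723/788 ≈ 0.92. Apriori avoids computing this for every possible rule by first pruning itemsets below the minimum support, since any superset of an infrequent itemset must also be infrequent.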

EVALUATION

Activity No. Marks Scored Sign of Faculty

1.

2.

3.

